Stream Processing Using Kafka

So far we have predominantly dealt with processing data at rest; i.e., data that is collected and saved in HDFS and then processed at a later point in time.

Sometimes, you are interested in processing data as it is being collected; this is called streaming data or data-in-motion. Readings spewed from sensors are a classic example of data-in-motion. You typically want to process such data as it arrives so that you can take corrective action immediately.

Here are some more use-cases for streaming data:

  • Stock trading systems
  • On-demand, self-service claim management in insurance companies
  • Connected car systems
  • Real-time game play
  • Real-time fraud detection systems

In this lesson, you will learn about Apache Kafka, a big data tool used for processing such streaming data. Kafka was built by engineers at LinkedIn, later released as open source under the Apache umbrella, and quickly became a de facto standard for processing data-in-motion. Today over a third of Fortune 500 companies use Kafka.

Publish-Subscribe Paradigm

Kafka is built on the publish-subscribe messaging paradigm: you publish and subscribe to streams of messages. In this respect it is similar to other messaging products on the market, such as RabbitMQ and ActiveMQ. However, Kafka scales to handle Big Data and can also process and store data, so it is more than just a message broker. While Hadoop computes and stores Big Data at rest, Kafka computes and stores Big Data in motion. In simple terms, Kafka is a distributed streaming platform.
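To make the paradigm concrete, here is a toy in-memory publish-subscribe bus in Python. This is purely illustrative and is not Kafka's API; the `PubSubBus` class and its method names are invented for this sketch. It shows the core idea: publishers send messages to a named topic without knowing who receives them, and every subscriber of that topic gets a copy.

```python
from collections import defaultdict


class PubSubBus:
    """A toy in-memory publish-subscribe bus (illustration only, not Kafka's API)."""

    def __init__(self):
        # topic name -> list of subscriber callbacks
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        # register interest in a topic
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        # deliver the message to every subscriber of the topic;
        # the publisher never addresses subscribers directly
        for callback in self.subscribers[topic]:
            callback(message)


received = []
bus = PubSubBus()
bus.subscribe("sensor-readings", received.append)
bus.publish("sensor-readings", {"sensor": "s1", "temp": 21.5})
```

The decoupling shown here, where producers and consumers only share a topic name, is what lets a system like Kafka add or remove consumers without touching the producers.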

Architecturally the key concepts on which Kafka is built are:

  • It is event driven using message queues
  • It computes with low latency
  • The unit of data is a message, similar to a row or record in an RDBMS
  • A message can carry optional metadata called a key

Mechanics of a Stream

Messages are published to, and categorized by, topics. A single topic of data is typically referred to as a stream. A topic can be further broken down into partitions. A partition is an ordered, immutable sequence of messages that is continually appended to. Partitions are replicated according to the configured replication factor and are distributed across the cluster. Each message has an ID (a.k.a. offset) that uniquely identifies it within its partition. Messages are retained whether or not they have been consumed, and the retention period is configurable.
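The partition described above can be sketched as a simple append-only log. This is a minimal model, not Kafka's implementation: the `Partition` class is hypothetical, and real partitions are durable, replicated files on disk. It shows the two properties that matter here: a message's offset is just its position in the log, and reading a message does not remove it.

```python
class Partition:
    """Toy model of a Kafka partition: an ordered, append-only message log."""

    def __init__(self):
        self.log = []  # messages kept in arrival order

    def append(self, message):
        # the offset is simply the message's position in the log
        offset = len(self.log)
        self.log.append(message)
        return offset

    def read(self, offset):
        # messages stay in the log whether or not they are consumed;
        # any consumer can re-read from any retained offset
        return self.log[offset]


p = Partition()
first = p.append("temp=21.5")
second = p.append("temp=21.7")
```

Because the log is immutable and ordered, an offset alone is enough to resume reading exactly where a consumer left off.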

Producers, Consumers and Brokers

Producers and consumers are the two kinds of Kafka clients. Producers (called publishers in other messaging systems) write messages to specific topics. By default, messages are balanced across all partitions of a topic; however, if a message has a key, all messages with the same key are directed to the same partition.
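The key-to-partition routing can be sketched as hashing the key modulo the partition count. Note the assumptions: this uses MD5 purely for illustration, whereas Kafka's real default partitioner uses murmur2 hashing, and `choose_partition` is a made-up helper, not a Kafka API. The point is only that a deterministic hash sends equal keys to the same partition.

```python
import hashlib


def choose_partition(key, num_partitions):
    """Map a message key to a partition so equal keys always land on
    the same partition. Sketch only: Kafka's default partitioner uses
    murmur2, not MD5, and balances keyless messages across partitions."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # interpret the first 4 bytes as an integer, then wrap into range
    return int.from_bytes(digest[:4], "big") % num_partitions
```

Because the mapping is deterministic, all messages for, say, one user or one sensor arrive in the same partition and are therefore read back in the order they were produced.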

Consumers (called subscribers in other messaging systems) consume messages by subscribing to one or more topics and reading the messages in the sequential order maintained by each partition. Consumers keep track of which messages they have consumed by tracking the offset (ID). Consumers work in groups to process a topic, and each partition is read by exactly one consumer in the group.
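The two ideas above, dividing partitions among a group and tracking position by offset, can be sketched together. Both `assign_partitions` and this `Consumer` class are hypothetical simplifications: Kafka performs group assignment via a coordinator broker and commits offsets durably, neither of which is modeled here.

```python
def assign_partitions(partitions, consumers):
    """Spread a topic's partitions over a consumer group so that each
    partition is read by exactly one consumer (round-robin sketch)."""
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


class Consumer:
    """Remembers its position in each partition via a per-partition offset."""

    def __init__(self):
        self.offsets = {}  # partition id -> next offset to read

    def poll(self, partition_id, log):
        # read everything from the last committed offset onward,
        # then advance the offset past what was just read
        offset = self.offsets.get(partition_id, 0)
        batch = log[offset:]
        self.offsets[partition_id] = len(log)
        return batch


assignment = assign_partitions([0, 1, 2, 3], ["c1", "c2"])
consumer = Consumer()
log = ["m0", "m1"]
batch = consumer.poll("topic-0", log)
```

Since each partition has exactly one reader in the group, the group as a whole processes every message once while still scaling out across consumers.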

A single Kafka server is called a broker.
