Apache Kafka


Exploring the Role of Kafka in High Throughput Data Processing Architectures

In today's fast-paced digital landscape, delivering real-time data efficiently is paramount for the success of various online services. From food delivery to messaging apps and ride-hailing platforms, the ability to process high volumes of data in real-time is crucial. However, the limitations of traditional databases often hinder this process, leading to data loss and delays. This is where Kafka comes into play as a powerful solution to bridge the gap.

What is "ops"? The number of operations, such as reads or inserts, that a system can perform, usually measured per second (operations per second).

Challenges with Traditional Databases

Traditional databases, while proficient in storage and data retrieval, often struggle with handling high-throughput operations and real-time data processing. Consider the example of a food delivery service such as Swiggy, where tracking the live location of delivery drivers is crucial. Storing this data directly in a traditional database might lead to delays and potential data loss during system downtimes.

Real-time Data Processing Challenges

Messaging platforms like WhatsApp, serving millions of users simultaneously, face the challenge of storing real-time conversations efficiently. With thousands of concurrent conversations, traditional databases often find it challenging to manage the data flow in real-time.

The Role of Kafka in Data Processing Architectures

Kafka, with its high throughput capabilities, serves as an effective solution to the limitations posed by traditional databases. By allowing producers to distribute data to multiple consumers, Kafka ensures efficient data distribution and processing. Its dual role as both a messaging system and a distributed streaming platform makes it a versatile solution for managing high data volumes in real-time scenarios.

Integration of Kafka and Traditional Databases

Integrating Kafka with traditional databases is crucial to leverage the strengths of both systems. Producers can send data to Kafka, which efficiently distributes it to various consumers and services. Bulk insert operations that might overwhelm the database can be effectively managed through Kafka, preventing data loss and ensuring efficient data processing.
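One common shape for this pattern, sketched here in plain Python with a hypothetical `flush_to_db` callback standing in for a real bulk-insert call, is a consumer that buffers incoming records and writes them to the database in batches rather than one row at a time:

```python
from typing import Callable

# Illustrative sketch: buffer Kafka records and flush them to the database
# in bulk, so the database never sees a flood of single-row inserts.
# `flush_to_db` is a placeholder for a real bulk-insert operation.
class BatchingConsumer:
    def __init__(self, flush_to_db: Callable[[list], None], batch_size: int = 100):
        self.flush_to_db = flush_to_db
        self.batch_size = batch_size
        self.buffer = []

    def on_record(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_to_db(self.buffer)
            self.buffer = []

# Usage: 250 records arrive; the database receives 3 bulk writes, not 250.
batches = []
consumer = BatchingConsumer(batches.append, batch_size=100)
for i in range(250):
    consumer.on_record({"id": i})
consumer.flush()
print([len(b) for b in batches])  # → [100, 100, 50]
```

The database sees three bulk writes instead of 250 individual inserts, which is exactly the pressure relief the paragraph above describes.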

Comparing Kafka with Other Messaging Systems

While systems like RabbitMQ and Redis operate primarily as queues in the background, Kafka offers a unique combination of queue and pub/sub functionalities. This versatility enables Kafka to effectively manage different data processing requirements, making it a preferred choice for high-throughput data processing architectures.

Architecture of Kafka

Producer: A producer in Kafka is responsible for publishing messages to a Kafka topic. It is essentially a source of data streams that pushes records to Kafka topics.

Consumer: A consumer in Kafka is a part of an application that reads data from Kafka topics. Consumers subscribe to one or more topics and process the feed of published messages.

Kafka server (broker): The Kafka server, also called a broker, refers to the infrastructure that manages the storage, processing, and distribution of data. It handles the data flow from producers to consumers, ensuring reliable and scalable data streaming.

Topic: In Kafka, a topic is a category or feed name to which records are published. Topics are divided into partitions, allowing data to be distributed across multiple nodes within the Kafka cluster. They are the fundamental unit for data organization and distribution in Kafka.

We can further divide a topic into partitions, just like sharding in a database, where we divide the data into parts and store them according to our requirements so that operations are cheaper and more efficient.
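A minimal sketch of that division: a stable hash of the message key selects the partition. (This is illustrative, not the Kafka client API — zlib.crc32 is used here because Python's built-in `hash()` is randomized per process; Kafka's Java client uses a murmur2 hash for the same purpose.)

```python
import zlib

# Illustrative partitioner: a stable hash of the key, modulo the partition
# count, decides which partition a message lands in — much like picking a
# shard in a sharded database.
def partition_for(key: str, num_partitions: int) -> int:
    return zlib.crc32(key.encode()) % num_partitions

NUM_PARTITIONS = 4
keys = ["driver-17", "driver-42", "driver-17", "driver-99"]
print([partition_for(k, NUM_PARTITIONS) for k in keys])
```

Note that the same key ("driver-17") always maps to the same partition, which matters for the ordering guarantees discussed later.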

💡 In a typical deployment of Kafka with many topics and partitions, scaling the Kafka consumer efficiently is one of the important tasks in maintaining overall smooth Kafka operations.

Consider a topic with 4 partitions and 4 consumers: each partition is assigned to exactly one consumer. But what if there are 5 consumers? One would be left out — since all 4 partitions are already occupied by the rest, the 5th consumer sits in an idle state.

Understanding Kafka Consumer Groups

Kafka's concept of consumer groups addresses the limitation that only one consumer can process data from a single partition. By implementing consumer groups, Kafka functions as both a queue and a pub/sub system. This approach ensures efficient data distribution, allowing multiple consumers to process data from different partitions simultaneously.

Now we can see the role of the consumer group: the partitions of a topic can be distributed across multiple consumers in the group.
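A minimal sketch of how a group could spread partitions across its members — illustrative only; Kafka's real assignors (range, round-robin, sticky) are more involved — which also shows the idle-consumer case from above:

```python
# Toy partition assignor: deal partitions out to consumers one by one.
# Any consumer beyond the partition count ends up with nothing to do.
def assign_partitions(partitions: list, consumers: list) -> dict:
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 4 partitions, 5 consumers: consumer "c5" ends up idle.
result = assign_partitions([0, 1, 2, 3], ["c1", "c2", "c3", "c4", "c5"])
print(result)  # → {'c1': [0], 'c2': [1], 'c3': [2], 'c4': [3], 'c5': []}
```

With 4 consumers and 8 partitions, each consumer would instead receive two partitions — the group scales up to, but not beyond, the partition count.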

Why make it so complicated with consumer groups?

Kafka, being a distributed streaming platform, can function both as a queue and as a pub/sub (publish/subscribe) system, depending on the context and the use case. The behavior of Kafka as a queue and as a pub/sub system is determined by how data is produced and consumed within the Kafka architecture.

Kafka as a Queue:

Kafka behaves as a queue when data is consumed in a traditional point-to-point manner. In this scenario, each message within a topic is consumed by only one consumer. This process is similar to a queue, where messages are delivered in a first-in-first-out (FIFO) manner. The key characteristics of Kafka acting as a queue are:

  1. Point-to-Point Consumption: Each message is consumed by only one consumer within a consumer group.

  2. Processing Order: Messages are processed in the order in which they are received.

  3. Partitioning: Each partition within a topic is consumed by only one consumer within a consumer group.
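The queue behaviour above can be sketched as a toy simulation (pure Python, not the Kafka client API): within a group, one consumer owns a partition at a time, so each message is delivered to exactly one consumer, in FIFO order:

```python
from collections import deque

# A partition's messages, in the order they were produced (FIFO).
partition = deque(["m1", "m2", "m3", "m4"])
consumed_by = {"c1": [], "c2": []}

# Within a consumer group, only one consumer owns a given partition at a
# time; here "c1" owns it, so "c2" receives nothing from this partition.
owner = "c1"
while partition:
    consumed_by[owner].append(partition.popleft())

print(consumed_by)  # → {'c1': ['m1', 'm2', 'm3', 'm4'], 'c2': []}
```

Each message was consumed exactly once, in order — point-to-point queue semantics.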

Kafka as a Pub/Sub System:

Kafka behaves as a pub/sub system when data needs to be distributed to multiple consumers or services. In this mode, data is published to a topic, and multiple consumers can subscribe to this topic and receive data independently. This process is similar to the pub/sub model, where publishers distribute messages to multiple subscribers. The key characteristics of Kafka acting as a pub/sub system are:

  1. Multiple Subscribers: Multiple consumers can subscribe to the same topic and receive data independently.

  2. Scalability: The system can scale easily as more subscribers can be added without affecting the data distribution process.

  3. Broadcasting: Data published to a topic is broadcasted to all subscribers.
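The broadcast behaviour can be sketched the same way (again a toy model, not the client API): each subscribed consumer group keeps its own offset into the topic's log, so every group independently receives the full stream:

```python
# The topic is an immutable log; consuming never removes messages from it.
topic_log = ["m1", "m2", "m3"]

# Each group tracks its own read offset, so groups don't steal from each other.
offsets = {"analytics-group": 0, "billing-group": 0}

def poll(group: str) -> list:
    # Return everything this group hasn't seen yet, then advance its offset.
    start = offsets[group]
    offsets[group] = len(topic_log)
    return topic_log[start:]

print(poll("analytics-group"))  # → ['m1', 'm2', 'm3']
print(poll("billing-group"))    # → ['m1', 'm2', 'm3']
```

Both groups read all three messages independently — pub/sub semantics — because consuming advances a per-group offset rather than deleting anything from the log.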

Hybrid Functionality:

Kafka's ability to act as both a queue and a pub/sub system is what makes it a powerful tool for data processing in various contexts. It allows for the flexibility to implement a variety of data processing architectures, catering to the specific requirements of different applications. The choice of how Kafka behaves, whether as a queue or a pub/sub system, depends on the way data is produced, consumed, and distributed within the system, providing versatility and scalability for complex data processing scenarios.

How is Kafka Internally built?

It's a log. A log is a file where you write one line, and when a new event arrives you write another line. Lines are separated by a delimiter, and every new line is appended after the last one — the log is append-only.
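A toy version of such an append-only log, using the newline as the delimiter:

```python
import os
import tempfile

# Minimal append-only log: one event per line, newline as the delimiter,
# and new events are only ever appended at the end of the file.
path = os.path.join(tempfile.mkdtemp(), "topic.log")

def append_event(event: str):
    with open(path, "a") as f:   # "a" mode: writes always go to the end
        f.write(event + "\n")

for e in ["user_signup", "order_placed", "order_shipped"]:
    append_event(e)

with open(path) as f:
    print(f.read().splitlines())  # → ['user_signup', 'order_placed', 'order_shipped']
```

Sequential appends like this are what make Kafka's writes so fast: the disk is only ever written at the tail, never seeked into.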

Why doesn't Kafka support priority queues?

Priority queues are typically implemented using linked lists or arrays. Kafka's mechanism assigns no priority: by design it is a fast, scalable, distributed, partitioned, and replicated commit-log service. So there is no priority on a topic or a message.

All events belonging to a topic (more precisely, to a partition within it) need to be ordered.

For example,

Let's say one topic carries messages A, B, C and another carries P, Q, R.

The merged stream may be jumbled, for instance P Q A B C R, and that is still valid.

Rules:

A comes before B, and B comes before C

P comes before Q, and Q comes before R

The original global order need not be maintained. We only care about the ordering inside each topic, not between topics. So, for example, C will be consumed by the consumer only after B gets consumed.

What is Causal ordering?

As long as related messages are sent to the same partition, message (causal) ordering is still guaranteed. In Kafka, this is achieved using keyed messages: the partitioner hashes the key and ensures a given key always goes to the same partition, provided the number of partitions does not change.
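A toy illustration of that guarantee (illustrative names, not the Kafka client API; zlib.crc32 stands in for Kafka's key hash): every message for one ride keys to one partition, so its events keep their send order:

```python
import zlib

# Illustrative keyed partitioner: a stable hash of the key picks the
# partition, so all messages for one key land in one partition.
def partition_for(key: str, num_partitions: int) -> int:
    return zlib.crc32(key.encode()) % num_partitions

NUM = 3
partitions = {i: [] for i in range(NUM)}
for key, msg in [("ride-1", "requested"), ("ride-2", "requested"),
                 ("ride-1", "accepted"), ("ride-1", "completed")]:
    partitions[partition_for(key, NUM)].append((key, msg))

# All "ride-1" events sit in a single partition, in the order they were sent.
ride1 = [m for p in partitions.values() for k, m in p if k == "ride-1"]
print(ride1)  # → ['requested', 'accepted', 'completed']
```

Events for different rides may interleave across partitions, but each ride's own lifecycle — requested, accepted, completed — is never reordered.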

How do you search in a log?

A log entry has a key and a value, and the log is sorted by key. Queries are usually based on a timestamp or a machine id, so instead of taking a random key, we use a structured one — for example, a 128-bit key whose first 64 bits store the timestamp.

For Example:

18th August_123

Where '_' is used as a delimiter.
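One way to sketch this idea in Python (a toy model of the scheme, using small integer timestamps): zero-pad the timestamp so lexicographic order matches numeric order, then binary-search the sorted log with the stdlib `bisect` module:

```python
import bisect

# Build a key with the timestamp at the front (zero-padded so that string
# order matches numeric order) and '_' as the delimiter before the machine id.
def make_key(timestamp: int, machine_id: str) -> str:
    return f"{timestamp:020d}_{machine_id}"

log = sorted(make_key(t, m) for t, m in
             [(1000, "m1"), (1005, "m2"), (1010, "m1"), (1020, "m3")])

def entries_at_or_after(timestamp: int) -> list:
    # Binary search: first key whose timestamp prefix is >= the query.
    i = bisect.bisect_left(log, f"{timestamp:020d}")
    return log[i:]

print(entries_at_or_after(1010))
# → ['00000000000000001010_m1', '00000000000000001020_m3']
```

Because the log is sorted by key and the timestamp leads the key, a range query by time becomes a single O(log n) binary search instead of a full scan.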

The basic idea is that of a NoSQL database: it stores key-value pairs, is good at text search and aggregations, and lets you add tags. An Elasticsearch index, for example, is a collection of documents that are related to each other.

Benefits of using Kafka

  • Configurable, time-limited persistence (retention)

  • Fast writes (if you want fast reads, use another database alongside it)

  • Useful for debugging, since events can be replayed

Finally, if a very popular search text exists, how do you find the document it exists in and store it efficiently?

Conclusion

In modern high-throughput data processing architectures, integrating Kafka alongside traditional databases has become imperative. Leveraging Kafka's high throughput capabilities and combining them with the storage and querying advantages of traditional databases, businesses can ensure efficient real-time data processing and management, meeting the demands of dynamic and data-intensive applications.

Did you find this article valuable?

Support Tautik Agrahari by becoming a sponsor. Any amount is appreciated!
