HLD or High Level System Design of Apache Kafka Startup

Last Updated : 28 Mar, 2026

Apache Kafka is a distributed data store optimized for ingesting and processing streaming data in real time with low latency. It can handle a constant inflow of data that is generated sequentially and incrementally by thousands of data sources.

Importance

Let’s look at the problem at LinkedIn that inspired Kafka in the first place. The problem is simple: LinkedIn was receiving a lot of logging data, such as log messages, metrics, events, and other monitoring/observability data, from multiple services. They wanted to use this data in two ways:

Have an online near-real-time system that can process and analyze this data.
Have an offline system that can process this data over a longer period.

Most of the processing was done for analysis, for example, analyzing user behavior, how users use LinkedIn, etc.

Requirement Gathering

This section outlines the key requirements for designing a system like Kafka for handling large-scale data streaming.

The problem may seem simple, but the solution becomes complex due to high scale, performance needs, and flexibility in message handling. Below are the important requirements:

**High Scalability:** The system should handle massive volumes of data (events, logs, metrics) that can reach tens or hundreds of TBs daily, requiring a highly scalable distributed architecture.
**High Throughput:** It should support extremely high traffic, processing hundreds of thousands (or even millions) of messages per second efficiently.
**Producer-Consumer Model:** The system must allow producers to send messages and multiple consumers to subscribe and process them independently.
**Flexible Consumption:** Consumers should have control over how and when they consume messages (real-time or batch processing).
**Asynchronous Processing:** Messages should be processed asynchronously to decouple producers and consumers, improving system performance and reliability.
**Message Immutability:** Messages are generally immutable (append-only), meaning once written, they are not modified or deleted.
**Loose Delivery Guarantees:** Strong transactional guarantees are not always required; instead, the system focuses on high availability and performance.
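To get a feel for what these volumes mean per second, a quick back-of-the-envelope calculation helps. The sketch below assumes decimal units (1 TB = 10^12 bytes) and an average message size of 1 KB. The message size and the function name `required_throughput` are illustrative assumptions, not figures from LinkedIn.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def required_throughput(tb_per_day: float, avg_message_bytes: int = 1_000):
    """Return (bytes_per_second, messages_per_second) for a daily data volume.

    avg_message_bytes is an assumed average; real workloads vary widely.
    """
    bytes_per_day = tb_per_day * 10**12
    bytes_per_second = bytes_per_day / SECONDS_PER_DAY
    messages_per_second = bytes_per_second / avg_message_bytes
    return bytes_per_second, messages_per_second


for volume in (10, 100):
    bps, mps = required_throughput(volume)
    print(f"{volume} TB/day is about {bps / 10**6:,.0f} MB/s "
          f"or {mps:,.0f} messages/s")
```

Under these assumptions, 100 TB per day is a sustained rate of a little over 1 GB/s, or more than a million 1 KB messages per second, which is consistent with the throughput requirement above.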

Message Brokers vs Kafka

It might seem that traditional message brokers such as RabbitMQ and ActiveMQ could solve the above problem, but they fall short. Let's see why:

**Message Batching:** Since consumers pull a large number of messages, it doesn’t make sense to pull them one by one. Most of the time, you’d want to batch messages; otherwise, most of your time would be wasted on network calls.
Since message brokers aren’t really meant to support such high throughput, they generally don’t provide good ways to batch messages.
**Different consumers with different consumption requirements:** We discussed having two types of consumers: an online system that processes messages in real time, and an offline system that might want to read messages received in the past twelve or twenty-four hours.
This pattern doesn’t work with most message brokers or queues. This is because some message brokers, like RabbitMQ, use a push-based model, pushing messages from the broker to the consumer. This gives the consumer less flexibility, since it cannot decide how and when to consume messages.
**Small and simple messages:** Message sizes are generally larger in most message brokers. This isn’t a bug; it’s by design. Message brokers often support many features, like different options for routing messages, message guarantees, being able to acknowledge every message individually, etc., which leads to large individual message headers.
Large messages are fine as long as you don’t have a lot of them and you don’t have to store them, but storing a large number of messages is precisely what we want to do in our system.
**Distributed high-throughput system:** One of the most important requirements is very high throughput. We want to support hundreds of thousands of messages per second, even going up to millions per second. Running this system on a single node is infeasible.
We need a distributed system that can support this throughput, which many message brokers don’t.
**Large queues:** Message brokers have varying support for large queue sizes. This depends on the message broker you are using and your configuration, but problems with very large queues are commonly reported.
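The batching argument above can be made concrete with a small simulation. The sketch below is a toy model, not Kafka's client API: `InMemoryBroker` and `consume_all` are hypothetical names, and each call to `fetch` stands in for one network round trip. It compares pulling messages one by one against pulling them in batches.

```python
class InMemoryBroker:
    """Toy stand-in for a broker holding one append-only list of messages."""

    def __init__(self, messages):
        self._log = list(messages)
        self.fetch_calls = 0  # counts simulated network round trips

    def fetch(self, offset: int, max_messages: int):
        """Return up to max_messages starting at offset (one round trip)."""
        self.fetch_calls += 1
        return self._log[offset:offset + max_messages]


def consume_all(broker: InMemoryBroker, batch_size: int) -> int:
    """Pull every message with the given batch size; return the count read."""
    offset = 0
    while True:
        batch = broker.fetch(offset, batch_size)
        if not batch:  # an empty batch means we reached the end of the log
            return offset
        offset += len(batch)


messages = [f"event-{i}" for i in range(10_000)]

one_by_one = InMemoryBroker(messages)
consume_all(one_by_one, batch_size=1)

batched = InMemoryBroker(messages)
consume_all(batched, batch_size=500)

print("round trips, one by one:", one_by_one.fetch_calls)
print("round trips, batched:   ", batched.fetch_calls)
```

Because the consumer chooses the offset and the batch size on every call, the same pull-based interface also covers the second point: a real-time consumer can poll often with small batches, while an offline consumer can read hours of data in a few large fetches.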

Let's now look at what the architecture of a Kafka-like system should be, given the requirements above.

High-Level Design

This section provides an overview of the system architecture, showing how different components interact at a high level.

High-level design of Apache Kafka

Components of the Above Design

This section explains the core components involved in a Kafka-based messaging system.

**Topics:** Topics represent a stream of messages where data is stored and organized. Producers send messages to topics, and consumers read (poll) messages from them.

**Producers:** Producers are applications that generate and send messages to topics. They specify the topic, message, key, and optional metadata before publishing to the broker.

**Consumers:** Consumers are applications that read messages from topics. They continuously poll the broker and track their progress using offsets (last read message).

**Consumer Groups:** Consumers are grouped together to process messages in parallel. Instead of a single consumer handling all messages, multiple consumers in a group share the workload, improving throughput and scalability.

Messages from a topic are distributed among consumers within the same group.
Each message is processed by only one consumer in the group.
Helps handle high-volume data efficiently.

**Example**
Suppose multiple services publish user activity events (like searches or job postings) to a topic.

A Recommendation Service consumes these events in real-time to update suggestions.
Another Analytics Service processes the same data in batches (e.g., every 24 hours).
Since real-time processing has high load, multiple consumers are added in a consumer group to divide the workload and handle messages efficiently.
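The example above relies on each consumer group tracking its own position in the topic. The following is a minimal sketch of that idea, assuming a single-partition topic held in memory; the `Topic` class and its methods are hypothetical and are not Kafka's client API. In real Kafka, offsets are tracked per group and per partition.

```python
from collections import defaultdict


class Topic:
    """Toy single-partition topic: an append-only list plus per-group offsets."""

    def __init__(self):
        self._log = []
        self._offsets = defaultdict(int)  # group name -> next offset to read

    def publish(self, message):
        self._log.append(message)

    def poll(self, group: str, max_messages: int):
        """Return the next messages for `group` and advance only its offset."""
        start = self._offsets[group]
        batch = self._log[start:start + max_messages]
        self._offsets[group] = start + len(batch)
        return batch


topic = Topic()
for event in ["search:python", "job_post:123", "search:kafka"]:
    topic.publish(event)

# The recommendation group reads eagerly, in small polls.
realtime_first = topic.poll("recommendation", max_messages=1)

topic.publish("search:golang")

# The analytics group reads later in one large batch and still sees everything,
# because reading by one group does not remove messages for another.
analytics_batch = topic.poll("analytics", max_messages=1000)
realtime_rest = topic.poll("recommendation", max_messages=1000)

print(realtime_first)
print(analytics_batch)
print(realtime_rest)
```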

Partitions in topics for better scale

Taking a closer look at topics, we see that every topic is divided into a configurable number of 'partitions'. Every single message in a topic is sent to exactly one partition.

Partitions

Depending on the configuration and the message, the partition is chosen either based on the message's key or in a round-robin fashion. Regardless, what’s important is that a message sent to a topic always ends up in a single partition.
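The two selection strategies can be sketched as follows. This is an illustration under stated assumptions rather than Kafka's actual partitioner: the `Partitioner` class is hypothetical, and CRC32 from Python's standard library is used as a stand-in hash function only because it is deterministic across runs.

```python
import itertools
import zlib
from typing import Optional


class Partitioner:
    """Pick a partition by key hash, or round-robin when there is no key."""

    def __init__(self, num_partitions: int):
        self.num_partitions = num_partitions
        self._round_robin = itertools.cycle(range(num_partitions))

    def partition_for(self, key: Optional[bytes]) -> int:
        if key is None:
            # No key: spread messages evenly across all partitions.
            return next(self._round_robin)
        # Keyed message: the same key always maps to the same partition.
        # CRC32 is an assumed stand-in hash for this sketch.
        return zlib.crc32(key) % self.num_partitions


partitioner = Partitioner(num_partitions=4)

keyless = [partitioner.partition_for(None) for _ in range(5)]
same_key = {partitioner.partition_for(b"user-42") for _ in range(3)}

print("keyless messages go to partitions:", keyless)
print("distinct partitions used for key user-42:", len(same_key))
```

Routing by key is what keeps all messages for one key (for example, one user) in a single partition, so they stay in the order in which they were appended.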

Partitions themselves aren’t very complex. Each one is an append-only store for messages. Think of a partition as a log file and of each message as a line in that file.

Consumers from a consumer group aren’t directly listening to topics. Instead, they listen to zero, one, or more partitions of the topic. Every consumer gets messages only from the partitions it listens to.

Since every consumer is assigned its own partitions on startup, consumers in a group don’t need to coordinate about which messages have already been consumed. This also helps Kafka scale linearly, since adding more partitions/nodes doesn’t increase the work or communication between existing partitions/nodes. These partitions often live on different brokers running on different machines.
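A minimal sketch of how partitions might be divided among the consumers of one group is shown below. It uses a simple round-robin assignment; the function name `assign_partitions` and the consumer names are hypothetical. Real Kafka offers several assignment strategies and reassigns partitions when consumers join or leave, which this sketch does not model.

```python
def assign_partitions(partitions, consumers):
    """Assign each partition to exactly one consumer, round-robin.

    Returns a dict mapping consumer -> list of partitions. A consumer may
    end up with zero partitions when there are more consumers than partitions.
    """
    if not consumers:
        return {}
    ordered = sorted(consumers)
    assignment = {consumer: [] for consumer in ordered}
    for index, partition in enumerate(sorted(partitions)):
        owner = ordered[index % len(ordered)]
        assignment[owner].append(partition)
    return assignment


# Four partitions shared by three consumers: one consumer gets two partitions.
print(assign_partitions([0, 1, 2, 3], ["c1", "c2", "c3"]))

# Two partitions, three consumers: one consumer stays idle.
print(assign_partitions([0, 1], ["c1", "c2", "c3"]))
```

Since no partition has more than one owner within the group, the number of partitions is an upper bound on how many consumers in a group can do useful work at once.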

Kafka storage layout

Kafka has a very simple storage layout. Each partition of a topic corresponds to a logical log. Physically, a log is implemented as a set of segment files of approximately the same size (e.g., 1GB). Every time a producer publishes a message to a partition, the broker simply appends the message to the last segment file.
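The layout described above can be sketched as a list of segments where only the last one accepts writes. This is a simplified in-memory model, not Kafka's on-disk format: the `Segment` and `PartitionLog` classes are hypothetical, segments are Python lists instead of files, and messages are addressed by sequential offsets as a simplifying assumption.

```python
class Segment:
    """One segment of a partition log (stands in for a segment file)."""

    def __init__(self, base_offset: int):
        self.base_offset = base_offset  # offset of the first message stored here
        self.messages = []
        self.size_bytes = 0


class PartitionLog:
    """Append-only log made of segments of roughly equal size."""

    def __init__(self, max_segment_bytes: int):
        self.max_segment_bytes = max_segment_bytes
        self.segments = [Segment(base_offset=0)]
        self.next_offset = 0

    def append(self, message: bytes) -> int:
        """Append to the last segment, rolling to a new one when it is full."""
        active = self.segments[-1]
        would_overflow = active.size_bytes + len(message) > self.max_segment_bytes
        if active.messages and would_overflow:
            active = Segment(base_offset=self.next_offset)
            self.segments.append(active)
        active.messages.append(message)
        active.size_bytes += len(message)
        offset = self.next_offset
        self.next_offset += 1
        return offset

    def read(self, offset: int) -> bytes:
        """Find the segment that covers `offset` and return its message."""
        for segment in reversed(self.segments):
            if segment.base_offset <= offset:
                return segment.messages[offset - segment.base_offset]
        raise IndexError(offset)


log = PartitionLog(max_segment_bytes=10)  # tiny limit to force a roll
offsets = [log.append(m) for m in (b"aaaa", b"bbbb", b"cccc", b"dddd")]

print("offsets:", offsets)
print("segment base offsets:", [s.base_offset for s in log.segments])
print("message at offset 2:", log.read(2))
```

Keeping older segments closed to writes is what makes it cheap to drop old data: a whole segment can be removed at once without touching the segment currently being written.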