HLD or High Level System Design of Apache Kafka Startup

Last Updated : 28 Mar, 2026

**Apache Kafka** is a distributed data store optimized for ingesting and processing streaming data in real time with low latency. It can handle a constant inflow of data generated sequentially and incrementally by thousands of data sources.

Importance

Let’s look at the problem that inspired Kafka in the first place at LinkedIn. The problem is simple: LinkedIn was receiving a lot of logging data, such as log messages, metrics, events, and other monitoring/observability data, from multiple services. They wanted to utilize this data in two ways:

Most of the processing was done for analysis, for example, analyzing user behavior, how users use LinkedIn, etc.

Requirement Gathering

This section outlines the key requirements for designing a system like Kafka for handling large-scale data streaming.

The problem may seem simple, but the solution becomes complex due to high scale, performance needs, and flexibility in message handling. Below are the important requirements:

Message Brokers vs Kafka

It might seem that message brokers such as RabbitMQ and ActiveMQ could solve the above problem, but they cannot. Let's see why:

So, let's now understand what the architecture of the Kafka system should be, given the above-mentioned requirements.

High-Level Design

This section provides an overview of the system architecture, showing how different components interact at a high level.

High-level design of Apache Kafka

Components of the Above Design

This section explains the core components involved in a Kafka-based messaging system.

**Topics:** Topics represent a stream of messages where data is stored and organized. Producers send messages to topics, and consumers read (poll) messages from them.

**Producers:** Producers are applications that generate and send messages to topics. They specify the topic, message, key, and optional metadata before publishing to the broker.

**Consumers:** Consumers are applications that read messages from topics. They continuously poll the broker and track their progress using offsets (last read message).

**Consumer Groups:** Consumers are grouped together to process messages in parallel. Instead of a single consumer handling all messages, multiple consumers in a group share the workload, improving throughput and scalability.

**Example:**
Suppose multiple services publish user activity events (like searches or job postings) to a topic.
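
To make the roles of topics, producers, consumers, and offsets more concrete, here is a minimal in-memory sketch in Python. It is not Kafka's client API: `ToyBroker`, `ToyConsumer`, and the topic name `user-activity` are hypothetical names chosen for illustration, and the offset here is assumed to mean the position of the next message to read.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a broker: each topic is a list of messages."""

    def __init__(self):
        self.topics = defaultdict(list)

    def send(self, topic, key, value):
        # Producer side: append the (key, value) message to the topic.
        self.topics[topic].append((key, value))

    def poll(self, topic, offset, max_messages=10):
        # Consumer side: return up to max_messages starting at `offset`.
        return self.topics[topic][offset:offset + max_messages]


class ToyConsumer:
    """Polls one topic and remembers how far it has read."""

    def __init__(self, broker, topic):
        self.broker = broker
        self.topic = topic
        self.offset = 0  # assumption: position of the next unread message

    def poll(self):
        batch = self.broker.poll(self.topic, self.offset)
        self.offset += len(batch)  # advance past what was just read
        return batch


broker = ToyBroker()
broker.send("user-activity", "user-1", "search: kafka")
broker.send("user-activity", "user-2", "job-posting: data engineer")

consumer = ToyConsumer(broker, "user-activity")
first = consumer.poll()   # both messages published so far
second = consumer.poll()  # empty: nothing new since the last poll
```

Because the consumer keeps its own offset, the broker in this sketch never needs to remember which messages a given consumer has already seen.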

Partitions in topics for better scale

Having a close look at topics, we see that every topic is divided into a configurable number of 'partitions'. Every single message in a topic is sent to exactly one partition.

Partitions

Depending on the configuration and the message, the partition is chosen either based on the message's key or in a round-robin fashion. Regardless, what’s important is that a message sent to a topic eventually goes into a single partition.
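
The two selection strategies can be sketched as follows. This is an illustrative sketch, not Kafka's partitioner: `ToyPartitioner` is a hypothetical class, and CRC32 is used only because it ships with Python's standard library, not because it is the hash Kafka uses.

```python
import itertools
import zlib


class ToyPartitioner:
    """Picks a partition by key hash, or round-robin when there is no key."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._round_robin = itertools.cycle(range(num_partitions))

    def partition_for(self, key=None):
        if key is None:
            # No key: spread messages evenly across partitions in turn.
            return next(self._round_robin)
        # Keyed: a stable hash sends the same key to the same partition.
        # CRC32 is an assumption made for this sketch.
        return zlib.crc32(key.encode("utf-8")) % self.num_partitions


partitioner = ToyPartitioner(num_partitions=3)
same_key = {partitioner.partition_for("user-1") for _ in range(5)}
keyless = [partitioner.partition_for() for _ in range(4)]
```

Here `same_key` ends up holding a single partition number, since every message with key `user-1` hashes to the same place, while the keyless messages cycle through partitions 0, 1, 2 and back to 0.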

And partitions aren’t very complex. Each one is essentially an append-only structure for storing messages. Think of a partition as a log file and each message as a line in that file.

Consumers from a consumer group aren’t directly listening to topics. Instead, they listen to zero, one, or more partitions of the topic. Every consumer gets messages only from the partitions it listens to.

Since every consumer is assigned its own partitions on startup, consumers don’t need to coordinate with each other about which messages have already been consumed. This also helps Kafka scale linearly, since adding more partitions/nodes doesn’t increase the work or communication between existing partitions/nodes. These partitions are often on different brokers running on different machines.
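
A simple way to picture how partitions are divided among the consumers of a group is a round-robin assignment. The function below is a sketch under that assumption; `assign_partitions` and the consumer names are hypothetical, and Kafka's actual assignment strategies are not claimed to work exactly this way.

```python
def assign_partitions(partitions, consumers):
    """Give each partition to exactly one consumer, in round-robin order."""
    if not consumers:
        raise ValueError("a consumer group needs at least one consumer")
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Four partitions shared by three consumers: one consumer gets two.
balanced = assign_partitions([0, 1, 2, 3], ["c1", "c2", "c3"])

# More consumers than partitions: the extra consumer gets zero partitions.
idle = assign_partitions([0, 1], ["c1", "c2", "c3"])
```

Each partition has exactly one owner within the group, which is why consumers never have to compare notes about which messages were processed. The second call shows the "zero partitions" case: a consumer beyond the partition count simply sits idle.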

Kafka storage layout

Kafka has a very simple storage layout. Each partition of a topic corresponds to a logical log. Physically, a log is implemented as a set of segment files of approximately the same size (e.g., 1GB). Every time a producer publishes a message to a partition, the broker simply appends the message to the last segment file.
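
The append-and-roll behaviour described above can be sketched in a few lines. This is an in-memory illustration only: `SegmentedLog` is a hypothetical class, segments are byte buffers rather than files on disk, and the tiny 10-byte limit stands in for a realistic size such as 1 GB.

```python
class SegmentedLog:
    """A logical log stored as a list of size-limited segments."""

    def __init__(self, max_segment_bytes):
        self.max_segment_bytes = max_segment_bytes
        self.segments = [bytearray()]  # the last segment is the active one

    def append(self, message: bytes) -> int:
        """Append to the last segment, rolling to a new one when it is full.

        Returns the index of the segment that received the message.
        """
        active = self.segments[-1]
        # Roll only if the active segment already holds data, so a message
        # larger than the limit still lands in a segment of its own.
        if active and len(active) + len(message) > self.max_segment_bytes:
            active = bytearray()
            self.segments.append(active)
        active.extend(message)
        return len(self.segments) - 1


log = SegmentedLog(max_segment_bytes=10)
log.append(b"aaaa")  # goes to segment 0
log.append(b"bbbb")  # still fits in segment 0 (8 bytes total)
log.append(b"cccc")  # would exceed 10 bytes, so segment 1 is created
```

Writes only ever touch the end of the last segment, which is what keeps the broker's append path simple.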