What is Apache Kafka and How Does it Work?

Last Updated : 23 Jul, 2025

**Apache Kafka** is a distributed, high-throughput, low-latency platform for streaming data in real time. It is built to move large volumes of data between systems without requiring hundreds of intricate point-to-point integrations. Rather than integrating every system with every other system, you connect each one to Kafka and let Kafka handle the data movement, acting as a high-speed, fault-tolerant message bus.

Originally developed at **LinkedIn** and now maintained under the Apache Software Foundation, Kafka is relied on by companies such as Netflix, Uber, Walmart, and LinkedIn for real-time data ingestion, streaming analytics, and event processing. Whether the workload is user behavior tracking, log aggregation, fraud detection, or feeding recommendation engines, Kafka scales horizontally, can deliver latencies in the millisecond range, and protects against data loss by persisting messages to disk and replicating them across brokers.

What is Apache Kafka?

Apache Kafka allows you to decouple your data streams from your systems. Source systems are responsible only for sending their data into Kafka, and any target system that wants access to that data reads the stream from Kafka. With this decoupling, Kafka takes on the responsibility of receiving, storing, and serving the data, so sources and targets never need to know about each other.

Apache Kafka Working

This pattern is not new; it is called **pub-sub** (publish-subscribe). What sets Apache Kafka apart is that it scales very well and can handle a very large number of messages per second. What could the source and target systems be? A source system could be website events, pricing data, financial transactions, or user interactions, and a target system could be a database, an analytics system, an email system, or an audit system.
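To make the benefit of decoupling concrete, here is a small arithmetic sketch. It assumes the worst case in which every source would otherwise be wired directly to every target; the function names are made up for this illustration.

```python
def point_to_point_links(sources: int, targets: int) -> int:
    """Integrations needed when every source talks directly to every target."""
    return sources * targets


def hub_links(sources: int, targets: int) -> int:
    """Integrations needed when every system connects once to a central hub."""
    return sources + targets


# Four sources and four targets, matching the examples listed above.
print(point_to_point_links(4, 4))  # 16
print(hub_links(4, 4))             # 8
```

The gap widens as systems are added: the direct approach grows with the product of the two counts, while the hub approach grows with their sum.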

How Does Apache Kafka Work?

Apache Kafka is a distributed, high-performance platform for real-time data streaming and message processing. If you're just starting out with it, you may be wondering how it works.

Think of Kafka as a large, really fast post office that receives messages (data) from various sources and sends them to their respective destinations.

Producers send the messages (similar to people sending letters).
Kafka brokers behave like postal staff – sorting, storing, and controlling the flow.
Consumers get the messages (such as individuals opening their mailbox).
Topics are similar to labeled bins or folders that categorize where messages go.
Partitions assist in dividing the load so Kafka can process massive amounts of data in a timely manner.

Kafka Architecture

Kafka's architecture is based on the publish-subscribe model. It is distributed and runs as a cluster of one or more servers.

1. Kafka Producers

These are the applications or systems that produce data into Kafka.
For example, a mobile application producing user clicks, or an online store producing order information.
Producers publish messages to a Kafka topic.

2. Kafka Topics

A topic is a named stream of data.
It groups messages by category. For example: orders, payments, or user-activity.
Topics can be divided into partitions to enhance performance.

3. Kafka Partitions

Partitions are similar to lanes in a highway. Every topic is assigned one or more partitions to deal with high-traffic volume.
Messages within a partition are persisted in the order in which they arrived; ordering is not guaranteed across partitions.
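The way messages are spread across partitions can be sketched in a few lines. This is a toy model, not Kafka's client code: it assumes records carry a key and that the partition is chosen by hashing that key, with `zlib.crc32` used as a stand-in hash. The names `choose_partition` and `route` are hypothetical.

```python
import zlib
from typing import Dict, List, Tuple


def choose_partition(key: str, num_partitions: int) -> int:
    """Map a record key to a partition index.

    zlib.crc32 is a stand-in hash chosen for this sketch; it is deterministic,
    so the same key always lands in the same partition.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(events: List[Tuple[str, str]], num_partitions: int) -> Dict[int, List[Tuple[str, str]]]:
    """Append each (key, value) event to the partition chosen for its key."""
    partitions: Dict[int, List[Tuple[str, str]]] = {p: [] for p in range(num_partitions)}
    for key, value in events:
        partitions[choose_partition(key, num_partitions)].append((key, value))
    return partitions


# Hypothetical events: two users interleaved on one topic.
events = [("user-1", "login"), ("user-2", "login"), ("user-1", "click"), ("user-1", "logout")]
partitions = route(events, num_partitions=3)
for index, records in partitions.items():
    print(index, records)
```

Because all events for one key go to the same partition, and each partition keeps arrival order, the events for `user-1` stay in the order they were sent even though the topic as a whole is split up.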

4. Kafka Brokers

A Kafka broker is a server that persists the data and serves client requests.
Multiple brokers make up a Kafka cluster in big systems to share the load and ensure high availability.

5. Kafka Consumers

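Consumers are applications or services that read data from a Kafka topic.
They subscribe to a topic and pull messages by polling the brokers.
Consumers can be organized into consumer groups: the partitions of a topic are divided among the members of a group, which shares the load while ensuring every partition is read.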
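Load sharing within a consumer group can be illustrated with a simplified round-robin assignment. Kafka ships several assignment strategies of its own; the function below is only a sketch of the general idea, and `assign_partitions` and the consumer names are hypothetical.

```python
from typing import Dict, List


def assign_partitions(partitions: List[int], members: List[str]) -> Dict[str, List[int]]:
    """Deal partitions out to group members one at a time, round-robin."""
    ordered_members = sorted(members)
    assignment: Dict[str, List[int]] = {member: [] for member in ordered_members}
    for position, partition in enumerate(sorted(partitions)):
        owner = ordered_members[position % len(ordered_members)]
        assignment[owner].append(partition)
    return assignment


print(assign_partitions([0, 1, 2, 3], ["consumer-a", "consumer-b"]))
# {'consumer-a': [0, 2], 'consumer-b': [1, 3]}
print(assign_partitions([0, 1], ["consumer-a", "consumer-b", "consumer-c"]))
# {'consumer-a': [0], 'consumer-b': [1], 'consumer-c': []}
```

The second call shows why adding consumers beyond the number of partitions does not increase parallelism: each partition has exactly one owner in the group, so the extra member sits idle.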

For more details, refer to Kafka Architecture.

How Kafka Transfers Data

Here is how Apache Kafka transfers data, step by step:

A producer application sends a message to Kafka.
Kafka stores the message in a specific partition inside a topic.
The message is held in Kafka's disk-based log storage; it is not deleted immediately.
A consumer application reads the message from that partition.
Reading does not remove the message. Kafka deletes it only when the configured retention policy says so, which allows multiple consumers to read the same data independently.
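The steps above can be sketched as an in-memory toy model. It is not a Kafka client: `PartitionLog` and `Consumer` are hypothetical classes that only show how an append-only log plus a per-consumer read position lets several readers consume the same data independently.

```python
from typing import List


class PartitionLog:
    """Append-only list of messages; a message's index is its offset."""

    def __init__(self) -> None:
        self._records: List[str] = []

    def append(self, value: str) -> int:
        """Store a message and return the offset it was written at."""
        self._records.append(value)
        return len(self._records) - 1

    def read(self, offset: int, max_records: int) -> List[str]:
        """Return up to max_records messages starting at offset, without removing them."""
        return self._records[offset:offset + max_records]


class Consumer:
    """Remembers its own position in the log, independent of other consumers."""

    def __init__(self, log: PartitionLog) -> None:
        self.log = log
        self.offset = 0

    def poll(self, max_records: int = 10) -> List[str]:
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)  # advance only this consumer's position
        return batch


log = PartitionLog()
for message in ["order-1", "order-2", "order-3"]:
    log.append(message)

analytics = Consumer(log)
billing = Consumer(log)
print(analytics.poll())   # ['order-1', 'order-2', 'order-3']
print(billing.poll(2))    # ['order-1', 'order-2']
print(billing.poll(2))    # ['order-3']
```

Nothing is removed from the log when `poll` is called, so `billing` still sees every message even though `analytics` has already read them all.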

Why Apache Kafka?

Kafka originated as an internal project at LinkedIn, where it was very successful. It was then open-sourced and found its home under the Apache Software Foundation (ASF), which is why it is called Apache Kafka.
Although Kafka is open source, several companies contribute to and support it, including Confluent, IBM, and Cloudera, among many others. The most prominent is Confluent, a commercial company whose business model is built around Apache Kafka and which offers its own enterprise software on top of the project.
Apache Kafka is popular because it is distributed, has a resilient architecture, and is fault-tolerant.
It is horizontally scalable: to add capacity, you add more servers. In Kafka, a server is called a broker. Clusters can scale to hundreds of brokers and to millions or tens of millions of messages per second, and Twitter has reportedly handled hundreds of millions of messages per second.
It offers high performance, with latency that can be under 10 milliseconds, which makes it suitable for real-time systems.
It is used by thousands of companies, reportedly including 60% of the Fortune 100. Well-known users include LinkedIn, Airbnb, Netflix, Uber, and Walmart.

Kafka Data Retention and Storage

Kafka does not delete messages once they have been consumed. Instead, it holds them for a configurable amount of time (such as 24 hours or 7 days). This is known as **Kafka message retention**. Within that window, multiple consumers can read the same data independently, which makes Kafka well suited to fault-tolerant applications, event reprocessing, and reliable data delivery.

Kafka storage is designed for high throughput and durability. Messages are written to disk in sequential, append-only logs, which allow fast reads and writes even for enormous amounts of data.
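Time-based retention over such a log can be sketched as a filter on message timestamps. This is a simplification under stated assumptions: the sketch drops individual records, whereas a real broker manages its log in larger chunks, and `apply_retention` is a hypothetical helper rather than a Kafka API. On real topics the window is controlled by the `retention.ms` setting.

```python
from typing import List, Tuple

Record = Tuple[int, str]  # (timestamp in milliseconds, value)


def apply_retention(log: List[Record], now_ms: int, retention_ms: int) -> List[Record]:
    """Keep only records whose timestamp falls inside the retention window."""
    cutoff = now_ms - retention_ms
    return [(timestamp, value) for timestamp, value in log if timestamp >= cutoff]


SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000  # 604,800,000 ms

log = [(0, "old"), (SEVEN_DAYS_MS, "recent")]
now = SEVEN_DAYS_MS + 1000
print(apply_retention(log, now, SEVEN_DAYS_MS))  # [(604800000, 'recent')]
```

Note that whether a record has been read plays no part in the decision; only its age does.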

Kafka also has built-in support for log compaction, which keeps only the most recent value for each unique key. This is useful for tracking the last known state of a record, such as a user profile, an account balance, or an inventory level. Log compaction ensures that even when older values are discarded, the latest value for every key is still present.
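The effect of compaction can be shown with a short sketch. It assumes the log is a list of (key, value) pairs and compacts it in one pass; a real broker compacts in the background, so this shows only the end result. `compact` is a hypothetical function name.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, str]  # (key, value)


def compact(log: List[Record]) -> List[Record]:
    """Keep, for each key, only the record written last; preserve log order."""
    latest_offset: Dict[str, int] = {}
    for offset, (key, _value) in enumerate(log):
        latest_offset[key] = offset  # later records overwrite earlier ones
    return [record for offset, record in enumerate(log)
            if latest_offset[record[0]] == offset]


log = [
    ("user-1", "balance=10"),
    ("user-2", "balance=50"),
    ("user-1", "balance=25"),
]
print(compact(log))  # [('user-2', 'balance=50'), ('user-1', 'balance=25')]
```

After compaction the log is shorter, but a consumer replaying it from the start still ends up with the current state of every key.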

By combining time-based retention and key-based compaction, Kafka provides storage that is both compact and flexible, an important reason it is used for real-time analytics, streaming data pipelines, and event-driven systems.


Use Cases of Apache Kafka

It can be used as a messaging system.
It can be used for activity tracking.
It can be used to gather metrics from many different locations.
It can be used to gather application logs at scale. Metrics and logs were among the first use cases of Apache Kafka at LinkedIn.
It can be used for stream processing, for example with the Kafka Streams API.
It can be used to decouple system dependencies in microservice architectures.
It integrates with big data technologies such as Spark, Flink, Storm, and Hadoop for large-scale data processing.

Real-World Usage of Apache Kafka

**Netflix** uses Kafka to apply recommendations in real time while you're watching TV shows.
**Uber** uses Kafka to gather user, taxi, and trip data in real time, which it uses to compute and forecast demand and to calculate surge pricing, that is, how much to charge for a ride when demand is high.
**LinkedIn** uses Kafka to prevent spam and to collect user interactions in order to make better connection recommendations in real time.
Here is an example of how a food delivery app might work using Kafka:
- You place an order → this event is sent to Kafka (producer).
- Kafka stores this in the orders topic, partitioned by order ID.
- The delivery system (consumer) reads the message and assigns a driver.
- The payment system (another consumer) reads the same message to process your payment.
- Kafka keeps the data for a set time (e.g., 7 days) for audit or analytics.
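The food delivery flow above can be sketched as a toy model in which two independent readers consume the same orders. Everything here is hypothetical and in-memory: `OrdersTopic`, `GroupReader`, the partition count, and the order payloads are invented for the illustration, and `zlib.crc32` stands in for whatever hash a real client would use.

```python
import zlib
from typing import Dict, List, Tuple

Order = Tuple[str, Dict[str, str]]  # (order ID, payload)


class OrdersTopic:
    """An 'orders' topic split into partitions, keyed by order ID."""

    def __init__(self, num_partitions: int) -> None:
        self.partitions: List[List[Order]] = [[] for _ in range(num_partitions)]

    def publish(self, order_id: str, payload: Dict[str, str]) -> None:
        index = zlib.crc32(order_id.encode("utf-8")) % len(self.partitions)
        self.partitions[index].append((order_id, payload))


class GroupReader:
    """Tracks one read position per partition for a single consuming system."""

    def __init__(self, topic: OrdersTopic) -> None:
        self.topic = topic
        self.offsets = [0] * len(topic.partitions)

    def poll(self) -> List[Order]:
        """Return every order not yet seen by this reader and advance its offsets."""
        batch: List[Order] = []
        for index, log in enumerate(self.topic.partitions):
            batch.extend(log[self.offsets[index]:])
            self.offsets[index] = len(log)
        return batch


topic = OrdersTopic(num_partitions=2)
topic.publish("order-101", {"item": "pizza"})
topic.publish("order-102", {"item": "noodles"})

delivery = GroupReader(topic)  # assigns drivers
payment = GroupReader(topic)   # charges customers

for order_id, payload in delivery.poll():
    print("assign driver for", order_id)
for order_id, payload in payment.poll():
    print("process payment for", order_id)
```

Both systems see both orders because each keeps its own offsets, and the orders remain in the topic afterwards for audit or analytics until retention removes them.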

Conclusion

Apache Kafka is like a nervous system for your data infrastructure. It wires up your source systems (such as apps, sites, databases) to your target systems (such as analytics platforms, storage layers, and microservices) — all in real-time, with high reliability and low latency.

Kafka addresses a long-standing IT problem: transferring data between systems at scale without buckling under load. With its distributed architecture, built-in fault tolerance, horizontal scalability, and high message throughput, Kafka can handle millions of messages per second, making it a good fit for businesses that depend on real-time insights.

From processing customer orders for food delivery apps to anti-spam on social networks and fueling AI suggestions on streaming platforms — Kafka demonstrates its worth across sectors on a daily basis.