Event sourcing with Kafka Streams - DNA Technology - Medium (original) (raw)

Event sourcing with Kafka Streams

The main goal of this article is to share lessons learnt during creation of a product using Apache Kafka Streams library and based on the Event Sourcing architectural pattern. To give you a proper background, I intend to:

describe both the business and technical context of our project,
introduce you quickly to Kafka (and help you understand how it differs from other message brokers),
characterise Kafka Streams and its defining features,
present our solution using an example,
and finally, discuss the advantages of using Kafka Streams for this purpose and explain what kind of pitfalls are waiting for you, so you can decide if it’s a good fit for your project.

Let’s go!

Some background (and a disclaimer)

Before this project, none of our team members had any practical experience with DDD, Event Sourcing or Kafka. Thinking in terms of events about everything is a challenge at the beginning but tends to solve a lot of problems in certain domains. The complexity of Kafka itself doesn’t help too.

For DDD fans: we certainly used various aspects of a DDD approach (and CQRS), but it’s not a purely DDD-based solution in its bones. Naming may be different but you’ll notice the resemblance.

Business context

I’m not allowed to share too many details in this section, but I can tell you it’s a greenfield FinTech product constituting a part of a larger system. Its main purpose is to be a ledger service tracking users’ money flow. It also handles bank transfers and other financial information.

The service is part of a regulated environment and therefore the audit trail is utterly important. In most user scenarios there were no requirements about real-time processing and some delays or even downtimes were acceptable.

Technical context

As numerous services in the system were created by the time we joined, a few system-wide decisions were already made. The whole system was supposed to utilise event sourcing and use Kafka as the Source Of Truth. This limited possible solutions.

Our small team (5 ppl) felt most comfortable with Java and we read that Java Kafka client libraries have the broadest support for Kafka features. Moreover, as we already had lots of new stuff waiting for us, the technology choice was a quick one. The stack is a standard one:

Spring Boot
Spring Kafka
Spring Web, Data, Security
PostgreSQL (just for query-only projections meant for the back-office)
Spring Cloud Sleuth
Cucumber for BDD-like integration tests
Modular Monolith architecture (unknown domain + quick start needed)
AWS
Protobuf (Kafka message format to be compatible with other technologies and have structured data)

As it all started a while back, we could find limited information about Kafka Streams. Various articles said it’s perfect for event sourcing, others that it’s a no-go.

We were particularly tempted by one feature offered by Kafka Streams:
out-of-the-box exactly-once semantics which work together with Kafka Streams state stores (lots of jargon in a single sentence, we’ll get to it 😉 ). This seemed ideal for our use case and easy enough to use compared to other possible solutions. It was indeed but brought other challenges on the way.

If you are already familiar with Kafka and Kafka Streams, you may want to skip directly to our event sourcing solution or lessons learnt.

Apache Kafka recap

There’re enough articles already about Kafka and how it operates, so just a quick recap:

created and open-sourced by LinkedIn in 2011
it’s a distributed streaming platform, not only a message broker
underneath it’s an abstraction of a distributed immutable commit log
we have Kafka brokers storing the data; and producers and consumers in the application
data is stored in topics which in turn are divided into a number of partitions (both defined by a developer)
data is stored as long as you wish, even indefinitely
the ordering of messages is guaranteed only on the partition level
messages from a single partition are processed by a single consumer in a given consumer group (partitions enable scalability and prevent race conditions)
consumers commit offsets to the broker and thanks to that know which messages they have already processed. Offsets are stored by the broker but it’s up to the consumer to commit (send) them
Kafka is scalable, fault-tolerant, highly available and really fast (sounds great, right?)

If you need more basics before reading further, this is a good explanation with nice images: https://www.cloudkarafka.com/blog/part1-kafka-for-beginners-what-is-apache-kafka.html

Kafka Streams crash course

Even though Kafka Streams library has been available for more than 5 years, it’s not as popular as Kafka itself. Let me do a quick introduction.

Kafka Streams main features

Client library available for Java for processing and analyzing data stored in Kafka
Built on top of standard consumers and producers, and as such dependent only on Apache Kafka. Because it uses (mostly) standard Kafka features, Kafka Streams are also elastic, highly scalable and fault-tolerant (still sounds great, right?)
Supports fault-tolerant local state stores enabling aggregations and windowed joins (more details soon)
Supports exactly-once processing semantics to guarantee that each record will be processed once and only once even in case of a failure in the middle of processing.
Provides stream processing programming API
Used by really big players (e.g. Zalando, The New York Times, TransferWise) for various purposes

Processor topology and some code

Kafka Streams use a notion of a topology for modelling computational logic:

https://kafka.apache.org/28/documentation/streams/tutorial

Going from a concept to code, a simple example borrowed from Apache Kafka Streams tutorial shows an application counting the occurrence of the words split from the source text stream (streams-plaintext-input topic). It is then continuously producing the current values to the output topic (streams-wordcount-output):

We as developers define such a topology in the code which is then analysed by Kafka Streams and turned into a running and already scalable application.

Kafka Streams State Stores

For us, this is the defining part of Kafka Streams — State Stores which enable stateful operations. You’ve already seen this in the example: the total count of particular words needs to be stored on the go but should also survive restarts/crashes of the application.

Kafka Streams State Stores are an enabler for all stateful operations: joins, aggregates, grouping, etc.

They are part of Kafka Streams and adhere to similar principles, but provide a few extra features:

can be in-memory or fully persistent key-value stores
offer fault-tolerance and automatic recovery
allow direct read-only queries of the state stores (via Interactive Queries)
support exactly-once semantics

They achieve persistence and fault-tolerance by simply using custom Kafka topics underneath, called changelog topics, as storage. The data is loaded during the startup from the topic to memory (or disk). Thanks to that all neat tricks applied to topics apply also to State Stores. One of these features is…

Processing guarantees

Kafka Streams guarantee that for any record read from the source Kafka topics, its processing results will be reflected exactly once in the output topic as well as in the state stores for stateful operations.

This is usually quite hard to achieve in distributed systems when things can break and migrate at any time. But as Kafka offers transactions, Kafka Streams make use of them and perform the following operations atomically in a single transaction:

commit on the input topic offset
update on the state store
writes to the output topic

All you need to do is to set a single property in Kafka Streams configuration:

processing.guarantee = “exactly_once”

Our event sourcing solution

The concept of event sourcing seems pretty simple on the surface:

Commands, which can request a change in the system, are issued. They are validated and accepted or rejected. Once accepted, they are handled and as a result, an event is produced. An event means a fact, something that happened, and cannot be changed or rejected. Because we store the whole history of events, we can calculate the current state of an entity at any point in time.

How can this concept be implemented using Kafka Streams?

A command is sent to the command topic. There is a Kafka Streams Application with a defined topology that starts to handle the command. The first step is validation and to do that we sometimes need to peek into the current state of an entity, so we read the state from the State Store.

Once the validation is completed, we send a reply to the command_reply topic, so that the service requesting a change knows whether it was accepted or rejected.

The processing goes on. Based on the command, an event is created. Sometimes the command doesn’t contain all the information needed to build an event, and we need to look into the State Store again to retrieve extra data about the entity. The event is then sent to the event topic. This topic has infinite retention which means the messages are stored forever.

Finally, based on the event, we update the state held in the State Store. One can say we apply the event on top of the current state.

Additionally, we push the state to the state topic if it’s needed by other domains. The structure of data in this topic differs from the State Store structure — we don’t want internal implementation details to leak outside.

All of the processing happens in a single Kafka transaction. This means the offset commit for the command topic, 3 produced messages and the state update will happen exactly once.

At this point, some of you may be wondering why it’s called event sourcing if the command is the trigger for the whole flow. What would happen if one needed to change the structure of the internal State Store and replay all the events from the beginning? After all, the state in event sourcing is derived from the events which are the Source of Truth. It may turn out that a different internal representation of the internal state is needed at some point.

There’s an alternative path for it:

We have an extra processor consuming from the event topic (which we just produced an event to) and updating the state. We version both the events and the state. In the every day flow (when the state is calculated in real-time, based on a single new event) these updates are skipped — the state is already up-to-date.

In a case where we need to recreate the state, we clear the state store and reset offsets to zero for the event topic (e.g. using Kafka Streams Application Reset Tool). This is currently not ideal, because we depend on the consumer algorithm which tries to process messages from the topic with older timestamps. However, there’s no ordering guarantee when subscribing to 2 different topics (command and event in this case). A better solution would involve a flag not to consume any commands when recreating the state. This procedure implies also some downtime because the Kafka Streams Application needs to be down to be able to perform manual work on the offsets/topics.

Simple implementation

This snippet represents a topology for a single Kafka Streams Application. It implements the flow we discussed earlier. The validation part as well as the internal implementation of the processor nodes are skipped.

Using output logs from Kafka Streams and this great visualization tool we can see how the actual topology will look like:

Lessons learnt

Advantages

Exactly-once processing is great and solves a lot of problems with inconsistencies in different scenarios. If you process multiple commands in the row, you can be sure the state is updated based on the previous events.
State stores, although sourced from changelog topics on startup, are kept in memory (or on a disk if taking too much memory) and are very fast. They provide snapshots of the state out-of-the-box as the newest state is always persisted, so the warmup time is not dependent on the number of events in the topic, but rather on the number of different entities (e.g. a number of users).
You don’t need to worry about fault-tolerance and automatic recovery. Autoscaling is easy too, just add more nodes and Kafka will handle the rest (if you have a correct number of partitions and proper standby replicas configuration).
Kafka Streams API is developer-friendly and similar to other streaming APIs.
Easy testing thanks to TopologyTestDriver — write fast unit tests for your topologies, no need for dockerized or embedded Kafka. Check out an example in the docs.

Pitfalls

It’s not a ready framework for event sourcing. You need to write the implementation. It took us a couple of iterations to come up with the current design — and it’s still not perfect.
Kafka Streams Architecture. Read this once, then another 3 times. A lot is happening out-of-the-box but you should know how your code is transformed into topologies, then into tasks. State stores are local to the tasks and this has consequences. Remember it’s a single thread handling each partition. We managed to define a topology that consumed messages from another domain to create a state for validation, and at the same time it consumed commands which were validated using this state. Imagine what happened if the command arrives 10ms after the message needed to update the state for the positive validation when you have a single thread for these two topics…
Even though exactly-once processing is guaranteed, bugs exist everywhere, even in popular libraries. We discovered a scenario that broke these guarantees and our command got processed a few times during a crash (KAFKA-9891). This has already been fixed.
Only starting from Kafka Streams 2.6 there’s a solution for zero-downtime deployment (we haven’t tested it yet). Anyway, you still need maintenance windows when you need to replay the events to change the internal state store (unless you come up with a smart migration mechanism). Moreover, resetting the stream application requires using scripts with a direct access to the cluster.
Interactive queries is sometimes a necessity, as data is distributed among different nodes. It’s easy to use, but you need to actually implement the communication between nodes to retrieve the data.
Ordering is important in event sourcing. It is guaranteed only on the partition level. When producing a message, 2 factors are considered: a key of the message and a number of partitions in the topic (in the default partitioning algorithm). Therefore, it’s better to start with a high-enough number of partitions, and not change it later (btw you can’t decrease the number). If you increase the number of partitions on the fly, new messages with the same key will be produced to a different partition. Of course, you can migrate data to a new topic, but it’s yet another process you need to think of.
Kafka learning curve is quite steep. It’s very flexible but at the same time has a lot of configuration options for the broker, producers, consumers, Kafka Streams…
Operations spanning multiple partitions are hard to implement as data is distributed and processing is done on the topic-partition level. Think about data partitioned by user_id, how to do some processing on all users’ data? We had to use a relational DB in some cases.
Backup. There’s replication between Kafka brokers and retention of the data is set to infinity, so the data should be safe. Imagine now a scenario in which someone makes a mistake and changes the retention of the data by accident… All old events will be deleted and the change will propagate to other brokers too. You can backup messages in DB or some other storage, but it’s hard to keep the consumer offsets in sync all the time (an offset is stored per consumer group, topic and partition so it’s a lot to store). You may run into problems during the disaster recovery if it turns out that the last offsets sync was 8 minutes ago and your consumers don’t know which commands have already been processed.

Summary

Some people ask if Kafka Streams have been a good choice. Unfortunately, lacking experience in other frameworks for event sourcing, it would be unfair to answer. It for sure brings a lot of great value, but not for free. We nevertheless see Kafka Streams benefits for other streaming uses, so you may consider applying this library in your project.

I had to omit some details not to make the article too long, and I encourage you to ask questions in the comments.

You may want to read another story about implementing Kafka consumers health check in Spring Boot Actuator or check out more DNA Technology articles.