Types of ETLs :: Red Planet Labs Documentation
Stream topologies process data as it comes in. Records from a depot partition are processed in the order in which they were appended, but otherwise the processing of records is independent. Let’s take a look at our SimpleWordCountModule definition again:
public void define(Setup setup, Topologies topologies) {
setup.declareDepot("*depot", Depot.random());
StreamTopology s = topologies.stream("s");
s.pstate("$$wordCounts", PState.mapSchema(Object.class, Object.class));
s.source("*depot").out("*token")
.hashPartition("*token")
.compoundAgg("$$wordCounts", CompoundAgg.map("*token", Agg.count()));
}
Suppose this module is running as two tasks, and suppose the depot partitions receive the following appends:
- Task 0: "a", "b"
- Task 1: "c", "d"
In this case, "a" will definitely begin processing before "b", but data on different depot partitions is processed independently. So "c" could be processed before "a" or after "b".
On the other hand, there’s no guarantee that the processing for "a" will finish before the processing for "b", because of the partitioner in this topology. If "a" and "b" partition to different tasks, it’s possible "b" will finish first even though it was appended after "a". Once they go to different tasks, their processing is independent and concurrent.
Rama keeps track of which records have successfully finished processing for each stream topology. Even though the processing of a record can trigger a dynamic amount of downstream computation on many other tasks, Rama efficiently detects when processing has succeeded, failed (e.g. due to an exception), or timed out. Rama will retry failed records according to the retry policy configured for the topology. For more details on how retries work and other features of streaming, see the page on stream topologies.
Stream topologies integrate with depot appends, allowing you to coordinate client-side code with the completion of the processing those appends trigger. When you do a depot append like the following code, the append blocks until all stream topologies colocated with the depot have finished processing the data.
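The following is a minimal sketch of such an append from client code. The use of InProcessCluster and LaunchConfig from Rama’s test API, the launch configuration, and the appended string are illustrative assumptions rather than part of the original example:
try (InProcessCluster cluster = InProcessCluster.create()) {
  RamaModule module = new SimpleWordCountModule();
  // Launch the module with two tasks on two threads (illustrative config).
  cluster.launchModule(module, new LaunchConfig(2, 2));
  Depot depot = cluster.clusterDepot(module.getClass().getName(), "*depot");

  // Blocks until the colocated stream topology "s" has finished processing "hello".
  depot.append("hello");
}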
There’s also a non-blocking version of append, appendAsync, that returns a CompletableFuture which asynchronously notifies you of completion:
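Continuing with the same depot client, a sketch of the asynchronous variant might look like this (the printed messages are illustrative):
depot.appendAsync("hello")
     // Runs once all colocated stream topologies have finished processing the record.
     .thenAccept(result -> System.out.println("Stream processing finished"))
     // Runs if processing failed or timed out.
     .exceptionally(e -> {
       System.out.println("Stream processing failed: " + e);
       return null;
     });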
The stream topology implementation uses this tracking of downstream computation to efficiently notify waiting depot clients of the success or failure of processing.
Note that depot appends don’t have to wait for colocated stream topologies to process. There are other variants of append and appendAsync that let you specify different behavior. This is discussed in more detail in the page about depots.
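For example, passing an ack level changes what the client waits for. The following is a sketch, assuming the AckLevel overloads of append and appendAsync described there:
// Waits only until the record is durably appended to the depot, not until
// colocated stream topologies finish processing it (assumed AckLevel overload).
depot.append("hello", AckLevel.APPEND_ACK);

// Fire-and-forget: doesn't wait for durability or downstream processing.
depot.appendAsync("hello", AckLevel.NONE);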