Stream Processing
Batch vs streaming, event time vs processing time, windowing, watermarks, exactly-once semantics, stateful processing, backpressure, and the tools (Kafka, Flink, Spark Structured Streaming, Kafka Streams).
Batch processing answers "what happened yesterday?" Stream processing answers "what is happening right now?" The shift from batch to streaming is not just about speed - it forces you to rethink time, state, and correctness, because you are computing over data that never stops arriving and never arrives perfectly in order.
This page covers the fundamentals every streaming system makes you confront: time semantics, windowing, watermarks, delivery guarantees, state, and backpressure. For the batch counterpart, see Apache Spark; for where streams land, see Lakehouse.
Batch vs Streaming
| Dimension | Batch | Streaming |
|---|---|---|
| Data boundary | Bounded (a finite dataset) | Unbounded (never ends) |
| Latency | Minutes to hours | Milliseconds to seconds |
| Trigger | Schedule (e.g. nightly) | Continuous, event-driven |
| State | Recomputed each run | Maintained across events |
| Correctness model | Reprocess to fix | Must handle late/out-of-order data |
| Mental model | "Query a table" | "Standing query over a flow" |
The deep insight: a batch is just a bounded stream. Modern engines (Flink, Spark) increasingly unify the two - the same logic runs on a finite file or an infinite topic. But streaming surfaces problems batch lets you ignore, chiefly: events do not arrive in the order they happened.
Event Time vs Processing Time
This is the single most important concept in streaming.
- Event time: when the event actually occurred (a timestamp in the data).
- Processing time: when your system observed the event (wall clock at ingestion).
These diverge constantly. A mobile app buffers events offline and uploads them an hour later; a network hiccup delays a batch; a partition rebalances. If you aggregate by processing time, your "9 AM sales" bucket silently includes events that happened at 8 AM but arrived at 9.
# The same event, two different timestamps
event = {
"user_id": 42,
"action": "purchase",
"event_time": "2026-05-30T09:00:00Z", # when the user clicked buy
"processing_time": "2026-05-30T09:43:12Z" # when our pipeline saw it (43 min late)
}
# Aggregating "purchases per hour" by processing_time would be WRONG.
# This 09:00 purchase lands in the correct 09:00-10:00 bucket only by luck;
# a purchase at 09:55 that arrives at 10:05 lands in the wrong hour.

Correct streaming almost always means event-time processing: assign each event to a window based on when it happened, not when you saw it. That, in turn, requires a mechanism to decide when a window is "done" despite stragglers - watermarks (below).
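A small sketch of the difference, in plain Python with no streaming engine involved. The three purchases and the `hour_bucket` helper are made up for illustration; the point is only that the same events produce different hourly counts depending on which timestamp you bucket by.

```python
from collections import Counter
from datetime import datetime


def hour_bucket(ts: str) -> str:
    """Truncate an ISO-8601 UTC timestamp to the start of its hour."""
    dt = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    return dt.strftime("%Y-%m-%dT%H:00Z")


# Hypothetical purchases as (event_time, processing_time) pairs
purchases = [
    ("2026-05-30T09:00:00Z", "2026-05-30T09:43:12Z"),  # late, but same hour
    ("2026-05-30T09:55:00Z", "2026-05-30T10:05:00Z"),  # late, crosses the hour
    ("2026-05-30T10:10:00Z", "2026-05-30T10:10:01Z"),  # on time
]

by_event_time = Counter(hour_bucket(e) for e, _ in purchases)
by_processing_time = Counter(hour_bucket(p) for _, p in purchases)

print(dict(by_event_time))       # 09:00 hour -> 2, 10:00 hour -> 1
print(dict(by_processing_time))  # 09:00 hour -> 1, 10:00 hour -> 2
```

Only the event-time counts match what actually happened; the processing-time counts shift the 09:55 purchase into the 10:00 hour.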
Windowing
You cannot compute "the average" of an infinite stream. You compute it over windows - finite slices of the stream.
| Window type | Definition | Use case |
|---|---|---|
| Tumbling | Fixed-size, non-overlapping | "Sales per 5-minute bucket" |
| Sliding | Fixed-size, overlapping by a slide interval | "5-min sales, updated every 1 min" |
| Session | Gap-based, closes after inactivity | "User activity bursts, 30-min idle gap" |
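The tumbling and sliding rows above come down to a window-assignment rule: given an event timestamp, which window or windows does it belong to? The sketch below is one way to write that rule in plain Python, assuming integer timestamps and half-open `[start, end)` windows aligned to zero. The function names are illustrative and not taken from any engine.

```python
from typing import List, Tuple


def tumbling_window(ts: int, size: int) -> Tuple[int, int]:
    """Return the single [start, end) tumbling window that contains ts."""
    start = ts - (ts % size)
    return (start, start + size)


def sliding_windows(ts: int, size: int, slide: int) -> List[Tuple[int, int]]:
    """Return every [start, end) sliding window that contains ts."""
    windows = []
    start = ts - (ts % slide)      # latest window start at or before ts
    while start > ts - size:       # window still covers ts
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


print(tumbling_window(12, size=10))           # (10, 20)
print(sliding_windows(12, size=10, slide=5))  # [(5, 15), (10, 20)]
```

A tumbling window is the special case where the slide equals the size, so each event lands in exactly one window; with a smaller slide, each event contributes to `size / slide` windows.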
Tumbling (size 10):          [0────10)[10───20)[20───30)
Sliding (size 10, slide 5):  [0────10)
                                  [5────15)
                                       [10───20)
Session (gap 5):             [evt evt evt]·····[evt evt]   (gap > 5 closes session)

# Spark Structured Streaming: tumbling and sliding event-time windows
from pyspark.sql.functions import window, col

# Tumbling: non-overlapping 5-minute buckets
tumbling = (events
    .withWatermark("event_time", "10 minutes")
    .groupBy(window(col("event_time"), "5 minutes"), col("product_id"))
    .count())

# Sliding: 10-minute window, recomputed every 5 minutes
sliding = (events
    .withWatermark("event_time", "10 minutes")
    .groupBy(window(col("event_time"), "10 minutes", "5 minutes"))
    .agg({"amount": "sum"}))

Session windows are the trickiest: there is no fixed boundary. The window stays open as long as events keep arriving within the gap threshold, and closes once a gap of inactivity passes. They model user sessions, IoT device bursts, and any "activity then quiet" pattern.
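To make the gap rule concrete, here is a minimal sessionization sketch in plain Python. It assumes all timestamps for one key are already available and can be sorted, which a real engine cannot assume (it has to build and merge sessions incrementally as out-of-order events arrive). The `sessionize` name is illustrative.

```python
from typing import List


def sessionize(timestamps: List[int], gap: int) -> List[List[int]]:
    """Group event times into sessions; a pause longer than `gap` starts a new one."""
    sessions: List[List[int]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)   # still within the gap: extend current session
        else:
            sessions.append([ts])     # gap exceeded (or first event): new session
    return sessions


print(sessionize([1, 3, 4, 12, 14, 30], gap=5))
# [[1, 3, 4], [12, 14], [30]]
```

Note that session boundaries depend on the data itself, so two keys in the same stream generally have windows of different lengths and start times.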
Watermarks
A watermark is the engine's assertion: "I do not expect any more events with an event time earlier than T." It is how streaming systems reason about completeness over unbounded, out-of-order data.
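The watermark trails the maximum observed event time by a configured delay, called the allowed lateness on this page (Flink's name for the same bound is "bounded out-of-orderness"):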
watermark = max(event_time seen so far) - allowed_lateness

When the watermark passes the end of a window, the engine finalizes that window's result and (optionally) drops events that arrive even later.
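# Flink: event-time watermark with 10s of allowed out-of-orderness
from pyflink.common import Duration
from pyflink.common.watermark_strategy import TimestampAssigner, WatermarkStrategy

class EventTimeAssigner(TimestampAssigner):
    # PyFlink expects a TimestampAssigner instance rather than a bare lambda
    def extract_timestamp(self, value, record_timestamp):
        return value.event_time_ms  # event time in epoch milliseconds

strategy = (WatermarkStrategy
    .for_bounded_out_of_orderness(Duration.of_seconds(10))
    .with_timestamp_assigner(EventTimeAssigner()))
stream = source.assign_timestamps_and_watermarks(strategy)

The watermark is a latency vs completeness tradeoff: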
- Short allowed lateness (low watermark delay) -> windows close fast, low latency, but more late events are dropped or mishandled.
- Long allowed lateness -> windows wait longer, higher latency, but more stragglers are captured.
There is no free lunch. You pick where to sit on that curve based on whether your use case tolerates dropping late data (real-time dashboards: yes) or must capture all of it (billing: no - use a longer watermark plus a late-data side output).
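The tradeoff can be seen in a toy simulation. The sketch below is plain Python, not any engine's actual algorithm: it counts events in tumbling windows, advances a watermark as `max(event_time) - allowed_lateness`, finalizes a window once the watermark reaches its end, and drops events whose window has already closed. The function name and the sample event times are made up for illustration.

```python
from collections import defaultdict


def run_tumbling_counts(event_times, window_size, allowed_lateness):
    """Process event times in arrival order.

    Returns (finalized counts, dropped late events, still-open counts).
    """
    open_windows = defaultdict(int)  # window start -> running count
    finalized = {}                   # window start -> final count
    dropped = []                     # events that arrived after their window closed
    max_event_time = float("-inf")
    watermark = float("-inf")

    for event_time in event_times:
        start = event_time - (event_time % window_size)
        if start + window_size <= watermark:
            dropped.append(event_time)  # window already finalized: too late
            continue
        open_windows[start] += 1
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - allowed_lateness
        for s in sorted(open_windows):  # sorted() copies keys, so pop is safe
            if s + window_size <= watermark:
                finalized[s] = open_windows.pop(s)
    return finalized, dropped, dict(open_windows)


arrivals = [1, 4, 12, 7, 14, 26, 9]  # event times, in arrival order

print(run_tumbling_counts(arrivals, window_size=10, allowed_lateness=0))
# ({0: 2, 10: 2}, [7, 9], {20: 1})
print(run_tumbling_counts(arrivals, window_size=10, allowed_lateness=5))
# ({0: 3, 10: 2}, [9], {20: 1})
```

With no lateness, window `[0, 10)` closes as soon as event 12 arrives, so the straggler 7 is lost. With 5 units of lateness the window stays open long enough to capture 7, at the cost of emitting its result later. A production job would typically route the dropped events to a side output rather than discard them.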
Delivery Semantics
What happens when a node crashes mid-stream? The three guarantees:
| Guarantee | Meaning | Risk |
|---|---|---|
| At-most-once | Every event processed 0 or 1 times | Data loss on failure |
| At-least-once | Every event processed 1+ times | Duplicates on failure |
| Exactly-once | Every event affects the result exactly once | Most expensive to achieve |
"Exactly-once" is widely misunderstood. It does not mean an event is physically delivered once - it means its effect on the output state is applied once, even if the event is physically reprocessed after a failure. This is achieved through coordinated checkpointing plus transactional/idempotent sinks.
# Exactly-once in Kafka Streams: transactional processing
# (Kafka Streams is a JVM library; its configuration keys are shown here as a dict for brevity)
props = {
    "processing.guarantee": "exactly_once_v2",  # enables transactional processing
    "application.id": "order-aggregator",
    # Consumer offsets and output writes commit in one atomic transaction.
    # On crash, the whole transaction rolls back and replays - no double counting.
}

The mechanism behind exactly-once: the engine periodically checkpoints both the read positions (offsets) and the computed state. On recovery, it restores the last checkpoint and replays from there. The sink must be transactional (commit offsets and output together) or idempotent (re-applying the same write is a no-op). Without a cooperating sink, "exactly-once processing" degrades to "at-least-once delivery" at the output - which is why idempotent writes (see the merge pattern in ETL & ELT) matter even in streaming.
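The role of the sink can be illustrated with a small simulation, written in plain Python under simplifying assumptions: offsets are checkpointed every two events, the job crashes once, and recovery replays from the last checkpoint. `NaiveSink`, `IdempotentSink`, and `run_with_crash` are hypothetical names for this sketch, not part of any library.

```python
events = [("e1", 10), ("e2", 20), ("e3", 30), ("e4", 40)]  # (event_id, amount)


class NaiveSink:
    """Applies every write, so a replayed event is counted twice."""
    def __init__(self):
        self.total = 0

    def write(self, event_id, amount):
        self.total += amount


class IdempotentSink:
    """Remembers event ids, so re-applying the same write is a no-op."""
    def __init__(self):
        self.total = 0
        self.seen = set()

    def write(self, event_id, amount):
        if event_id in self.seen:
            return
        self.seen.add(event_id)
        self.total += amount


def run_with_crash(sink, checkpoint_every=2, crash_at=3):
    checkpoint = 0  # last durably saved read offset
    for offset in range(len(events)):
        if offset == crash_at:
            break  # simulated crash: offsets since the last checkpoint are lost
        sink.write(*events[offset])
        if (offset + 1) % checkpoint_every == 0:
            checkpoint = offset + 1
    # Recovery: replay everything after the last checkpoint
    for offset in range(checkpoint, len(events)):
        sink.write(*events[offset])
    return sink.total


print(run_with_crash(NaiveSink()))       # 130: e3 was written before and after the crash
print(run_with_crash(IdempotentSink()))  # 100: the replayed e3 is ignored
```

Both runs physically process `e3` twice; only the idempotent sink keeps its effect on the output to exactly once.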
Stateful Processing
Stateless operations (filter, map) treat each event independently. Stateful operations - aggregations, joins, deduplication, pattern detection - must remember information across events. Managing that state correctly is what makes streaming hard.
# Flink: a stateful function that detects 3+ failed logins in a row per user
from pyflink.common.typeinfo import Types
from pyflink.datastream.functions import KeyedProcessFunction, RuntimeContext
from pyflink.datastream.state import ValueStateDescriptor

class FailedLoginDetector(KeyedProcessFunction):
    def open(self, runtime_context: RuntimeContext):
        # Per-key (per-user) state, managed and checkpointed by Flink
        self.count = runtime_context.get_state(
            ValueStateDescriptor("fail_count", Types.INT()))

    def process_element(self, login, ctx):
        if login.success:
            self.count.clear()  # reset on success
            return
        current = (self.count.value() or 0) + 1
        self.count.update(current)
        if current >= 3:
            # Alert is a placeholder record type assumed to be defined elsewhere
            yield Alert(user=ctx.get_current_key(), reason="3 failed logins")
            self.count.clear()

Key concerns with streaming state:
- It must be checkpointed so it survives failures (this is what enables exactly-once).
- It grows unbounded unless you expire it. Use TTLs or window boundaries to evict old state, or your job runs out of memory.
- It must be keyed and partitioned so the same key's state always lands on the same worker.
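Two of these concerns, expiry and key-based partitioning, can be sketched in a few lines of plain Python. This is an illustration only: `TtlKeyedState` and `worker_for` are hypothetical names, time is passed in explicitly instead of being read from a clock, and real engines persist and checkpoint this state rather than holding it in a dict.

```python
import zlib


class TtlKeyedState:
    """Per-key state that expires `ttl` time units after its last update."""

    def __init__(self, ttl):
        self.ttl = ttl
        self._values = {}  # key -> (value, last_update_time)

    def put(self, key, value, now):
        self._values[key] = (value, now)

    def get(self, key, now):
        entry = self._values.get(key)
        if entry is None:
            return None
        value, updated = entry
        if now - updated >= self.ttl:
            del self._values[key]  # lazily evict on read
            return None
        return value

    def evict_expired(self, now):
        """Drop all expired keys; returns how many were removed."""
        expired = [k for k, (_, t) in self._values.items() if now - t >= self.ttl]
        for k in expired:
            del self._values[k]
        return len(expired)

    def __len__(self):
        return len(self._values)


def worker_for(key: str, num_workers: int) -> int:
    """Stable hash partitioning: the same key always maps to the same worker."""
    return zlib.crc32(key.encode("utf-8")) % num_workers


state = TtlKeyedState(ttl=60)
state.put("alice", 2, now=0)
state.put("bob", 1, now=50)
print(state.get("alice", now=30))   # 2
print(state.evict_expired(now=70))  # 1 (alice expired, bob remains)
print(len(state))                   # 1
```

`zlib.crc32` is used instead of the built-in `hash` because Python randomizes string hashes per process, which would send the same key to different workers after a restart.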
Backpressure
A streaming job is a chain of operators. If a downstream operator (or sink) cannot keep up with the rate of incoming data, pressure builds. Backpressure is the mechanism that propagates that slowness upstream so the source slows down rather than the system running out of memory.
Source ──fast──> Map ──fast──> Window ──SLOW SINK──> Database
                                          │
   backpressure signal propagates ◄───────┘
   (source throttles to match the slow sink)

Healthy systems handle backpressure gracefully: Flink propagates it through its network buffers automatically; Kafka consumers naturally apply it by polling slower (offsets just advance more slowly). The danger sign is unbounded buffering - a queue with no limit absorbs backpressure by consuming memory until the job dies.
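The contrast between bounded and unbounded buffering can be shown with Python's standard `queue.Queue`. This is a toy two-thread pipeline, not how Flink or Kafka implement it: a fast source and a slow sink share a bounded buffer, and `put` blocks when the buffer is full, which is what throttles the source. The function name and parameter values are arbitrary.

```python
import queue
import threading
import time


def run_pipeline(num_events=50, buffer_size=5, sink_delay=0.001):
    """Fast source, slow sink, bounded buffer. Returns (written items, max depth seen)."""
    buffer = queue.Queue(maxsize=buffer_size)
    written = []
    max_depth = 0

    def source():
        nonlocal max_depth
        for i in range(num_events):
            buffer.put(i)  # blocks while the buffer is full: backpressure
            max_depth = max(max_depth, buffer.qsize())
        buffer.put(None)   # sentinel: no more events

    def sink():
        while True:
            item = buffer.get()
            if item is None:
                break
            time.sleep(sink_delay)  # simulate a slow database write
            written.append(item)

    threads = [threading.Thread(target=source), threading.Thread(target=sink)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return written, max_depth


written, max_depth = run_pipeline()
print(len(written), max_depth <= 5)  # 50 True
```

Every event is still delivered in order, but the buffer never holds more than `buffer_size` items. With `maxsize=0` (unbounded) the source would never block and the queue would grow for as long as the sink stays slower than the source.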
Diagnosing backpressure:
- Consumer lag (Kafka): the gap between the latest offset and the consumer's committed offset. Growing lag means you are falling behind.
- Busy/backpressured time (Flink): the UI reports the percentage of time each operator spends blocked.
- The fix is usually one of: scale out the bottleneck operator, optimize the slow sink (batch writes, async I/O), or - if the spike is transient - let lag absorb it and catch up.
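The consumer-lag check above reduces to simple arithmetic over offsets. The sketch below assumes you have already fetched per-partition end offsets and committed offsets from your client or monitoring system; the function names and sample numbers are illustrative.

```python
from typing import Dict, List


def consumer_lag(end_offsets: Dict[int, int], committed: Dict[int, int]) -> Dict[int, int]:
    """Per-partition lag: latest offset minus the consumer's committed offset."""
    return {p: end_offsets[p] - committed.get(p, 0) for p in end_offsets}


def is_falling_behind(total_lag_samples: List[int]) -> bool:
    """True if total lag grew at every successive sample (needs at least two)."""
    pairs = list(zip(total_lag_samples, total_lag_samples[1:]))
    return bool(pairs) and all(later > earlier for earlier, later in pairs)


lag = consumer_lag({0: 100, 1: 250}, {0: 90, 1: 250})
print(lag)                                # {0: 10, 1: 0}
print(is_falling_behind([10, 40, 90]))    # True: lag keeps growing
print(is_falling_behind([10, 40, 30]))    # False: the consumer is catching up
```

A single large lag value is not necessarily a problem; the trend is what distinguishes a transient spike from a consumer that cannot keep up.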
The Tool Landscape
| Tool | What it is | Best for |
|---|---|---|
| Kafka | Distributed log / message broker | The transport layer - durable, replayable event storage |
| Flink | True streaming compute engine | Low-latency, complex stateful processing, event-time correctness |
| Spark Structured Streaming | Micro-batch (and continuous) engine | Teams already on Spark; unified batch + stream code |
| Kafka Streams | JVM library, embedded in your app | Stream processing without a separate cluster |
Kafka: the substrate
Kafka is not a processing engine - it is the durable, partitioned, replayable log that streaming engines read from and write to. Its key property is replayability: because events are retained, a consumer can reset its offset and reprocess history. This is what makes exactly-once recovery and reprocessing possible.
from kafka import KafkaConsumer
import json

consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="kafka:9092",
    group_id="order-processor",
    enable_auto_commit=False,  # commit manually for at-least-once control
    auto_offset_reset="earliest",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

for message in consumer:
    process(message.value)
    consumer.commit()  # commit only after successful processing

Flink vs Spark Structured Streaming
This is the central architecture choice.
- Flink is a true event-at-a-time streaming engine. Sub-second latency, sophisticated event-time and state handling, native exactly-once. Choose it when latency and stateful correctness are paramount.
- Spark Structured Streaming processes in micro-batches (small batches every few hundred ms to seconds). Latency is higher, but you get the entire Spark ecosystem and identical code for batch and streaming. Choose it when you are already on Spark and second-scale latency is acceptable.
A rule of thumb: if "real-time" means seconds, Spark Structured Streaming is fine and simpler to operate. If it means milliseconds, or you need complex session/pattern logic with large state, reach for Flink.
When NOT to Stream
Streaming is operationally expensive: always-on clusters, state management, watermark tuning, harder debugging. Before committing, ask whether you actually need it.
- Do you need it? Most "real-time" requirements are satisfied by micro-batches every few minutes. Streaming infrastructure for a dashboard refreshed hourly is waste. (The SLO conversation in Data Quality usually settles this.)
- Can you tolerate the complexity? Streaming is harder to test, deploy, and reason about than batch. The team must be ready for it.
- Is the source even streamable? A nightly file dump is bounded; forcing it through a streaming engine adds complexity with no benefit.
Stream when the business value of low latency exceeds the operational cost - fraud detection, real-time personalization, monitoring and alerting, and operational dashboards where minutes matter. Otherwise, a well-built incremental batch pipeline is cheaper, simpler, and easier to trust.