Windowing, watermarks, joins, state, exactly-once, change data capture, common pipelines

Patterns

The vocabulary every stream processor uses, plus the building blocks of real pipelines.

Event Time vs Processing Time

Two clocks matter:

Clock	What it is	When it matters
Event time	When the event happened (in the real world)	Almost always; correctness
Processing time	When the system observed the event	Latency, not correctness
Ingestion time	When the broker received it	Compromise; rarely the right choice

Stream processors mostly use event time because it gives the correct answer even when events arrive late or out of order.

Event: { user_id: 7, ts: 14:00:00, ... }   ← event_time
                                            ← processing_time = 14:00:03 (3s late)

Watermarks

A watermark is the system's best guess at "we've seen all events up to time T." It controls when window computations become final.

Events arrive (out of order):
  ts=14:00:01
  ts=14:00:05
  ts=14:00:03   ← late
  ts=14:00:10
  ts=14:00:07   ← late
  ts=14:00:15

Watermark at 14:00:12 means: "all events ≤ 14:00:12 have arrived"
→ windows ending at or before 14:00:12 can be closed.

Late events past the watermark are either dropped, sent to a side output, or rewrite the closed window — your choice via configuration. Watermark configuration is one of the trickier parts of getting streaming right.

Windowing

Aggregations need windows because the stream is unbounded:

Window type	Behavior	Example
Tumbling	Non-overlapping fixed buckets	Clicks per 5-min bucket: 14:00-14:05, 14:05-14:10, ...
Sliding	Overlapping fixed-size windows	5-min rolling click count, updated every 1 min
Session	Window per "session" of activity	All events from a user with < 30 min gap
Global	One big window	Use sparingly; bounded state matters

-- Flink SQL: tumbling 5-min windows
SELECT
  user_id,
  TUMBLE_START(ts, INTERVAL '5' MINUTES) AS window_start,
  COUNT(*) AS clicks
FROM clicks
GROUP BY user_id, TUMBLE(ts, INTERVAL '5' MINUTES);

-- Sliding: 5-min count updated every 1 min
HOP(ts, INTERVAL '1' MINUTE, INTERVAL '5' MINUTES)

-- Session: gap of 30 min between activities
SESSION(ts, INTERVAL '30' MINUTES)

Joins

Stream joins are unique in their complications:

Stream-Stream Joins

Both sides are streams; needs windowing:

-- Join clicks with purchases within 15 minutes
SELECT
  c.user_id,
  c.url,
  p.amount
FROM clicks c
JOIN purchases p
  ON c.user_id = p.user_id
  AND p.ts BETWEEN c.ts AND c.ts + INTERVAL '15' MINUTES;

State grows with the window size — bound it, or you'll OOM.

Stream-Table Joins (Enrichment)

Stream joined against a "table" (CDC of a slow-changing dimension):

-- Enrich each click with user attributes from CDC
SELECT
  c.user_id,
  u.name,
  u.tier,
  c.url
FROM clicks c
LEFT JOIN users FOR SYSTEM_TIME AS OF c.ts AS u
  ON c.user_id = u.id;

"Temporal join" — Flink and Materialize support this natively. State holds the user table, kept fresh via the source.

Interval Joins

Like stream-stream but bounded by a time interval. Most common in fraud / matching scenarios.

State

Stream processors keep state — often a lot of it.

State type	Examples
Aggregations	Running count, sum, percentile
Joins	Right-side rows waiting for matches
Dedup tables	Seen-this-event-already markers
Materialized views	Pre-computed query results

State lives:

Storage	Examples
In-memory + RocksDB on local disk	Flink (default), Kafka Streams
Distributed across the cluster	Flink with HA
In a SQL store on each compute node	Materialize, RisingWave
External (Redis, Cassandra)	Less common; latency

Big state is expensive — disk, network, recovery time. Bound your aggregations to windows or TTLs.

Exactly-Once

The holy grail. Three delivery semantics:

Semantics	What it means
At-most-once	Every event processed at most once; some may be lost
At-least-once	Every event processed at least once; some may be duplicated
Exactly-once	Every event processed effectively once (logically)

True exactly-once requires the entire pipeline (source → processor → sink) to participate:

Source must support replay (Kafka offsets).
Processor must checkpoint state atomically.
Sink must support transactional writes.

Flink + Kafka does this end-to-end. Kafka Streams within a single Kafka cluster does this. Materialize does this for views over Kafka. Output to non-transactional sinks (most HTTP APIs, plain databases without idempotency) — you fall back to at-least-once + idempotent consumers.

Change Data Capture (CDC)

Streaming changes from a database into a stream processor:

PostgreSQL ──► Debezium ──► Kafka topic ──► Stream processor
            (logical            (with each
             replication)        INSERT/UPDATE/DELETE)

CDC is the foundation of:

Real-time analytical replicas.
Search index sync (Elasticsearch, Algolia, Meilisearch).
Audit trails.
Microservice event sourcing without dual-writes.

Materialize, Flink, and Kafka Streams all consume CDC sources natively. The pattern of choice when you want stream output but your sources are databases.

Common Pipeline Shapes

Real-Time Dashboard

Events → Kafka → Materialize/Flink (aggregate per window) → live dashboard

The dashboard subscribes to view updates, no polling.

Fraud Detection

Events → Kafka → Flink (pattern match + ML scoring) → if risky → alerts topic → block / notify

State holds the user's recent activity; pattern match fires alerts.

Enrichment + Sink

Events → Kafka → Flink (join with reference data) → enriched events → object storage / data warehouse

Decouple OLTP from analytics. The processor enriches and sinks to a data lake / warehouse.

Materialized Views for an API

DB → CDC → Kafka → Materialize (SQL views) → app queries Materialize directly

Replaces complex caching layers — Materialize is the cache, always fresh.

Microservice Event Sourcing

Service writes events → Kafka → other services consume → maintain their own derived state

Each service has its own view of the world; no shared DB. Heavy commitment; powerful when the team can sustain it.

Backpressure

When a downstream is slower than the source:

Strategy	Effect
Buffer in memory	Limits scale; OOM risk
Slow down the source	Backpressure flows back through the pipeline
Drop messages	Sometimes OK for low-value events
Spill to disk	Expensive but bounded

Flink, Kafka Streams handle backpressure built-in by slowing the source consumer. Materialize / RisingWave do similar via control flow.

Monitor backpressure metrics — it's the early indicator of capacity issues.

Idempotency Everywhere

Even with exactly-once at the framework level, build idempotent consumers:

Sink writes use upsert with a key.
HTTP calls use idempotency tokens (request ID).
DB writes use ON CONFLICT or INSERT IGNORE.

When the framework fails over, retries may double-write. Idempotency makes this safe.

What's Next

You know the vocabulary and the patterns. Best Practices covers operations — state sizing, observability, recovery, latency budgets, when not to use this at all.

Patterns

On this page