Patterns
Windowing, watermarks, joins, state, exactly-once, change data capture, common pipelines
Patterns
The vocabulary every stream processor uses, plus the building blocks of real pipelines.
Event Time vs Processing Time
Two clocks matter:
| Clock | What it is | When it matters |
|---|---|---|
| Event time | When the event happened (in the real world) | Almost always; correctness |
| Processing time | When the system observed the event | Latency, not correctness |
| Ingestion time | When the broker received it | Compromise; rarely the right choice |
Stream processors mostly use event time because it gives the correct answer even when events arrive late or out of order.
Event: { user_id: 7, ts: 14:00:00, ... } ← event_time
← processing_time = 14:00:03 (3s late)Watermarks
A watermark is the system's best guess at "we've seen all events up to time T." It controls when window computations become final.
Events arrive (out of order):
ts=14:00:01
ts=14:00:05
ts=14:00:03 ← late
ts=14:00:10
ts=14:00:07 ← late
ts=14:00:15
Watermark at 14:00:12 means: "all events ≤ 14:00:12 have arrived"
→ windows ending at or before 14:00:12 can be closed.Late events past the watermark are either dropped, sent to a side output, or rewrite the closed window — your choice via configuration. Watermark configuration is one of the trickier parts of getting streaming right.
Windowing
Aggregations need windows because the stream is unbounded:
| Window type | Behavior | Example |
|---|---|---|
| Tumbling | Non-overlapping fixed buckets | Clicks per 5-min bucket: 14:00-14:05, 14:05-14:10, ... |
| Sliding | Overlapping fixed-size windows | 5-min rolling click count, updated every 1 min |
| Session | Window per "session" of activity | All events from a user with < 30 min gap |
| Global | One big window | Use sparingly; bounded state matters |
-- Flink SQL: tumbling 5-min windows
SELECT
user_id,
TUMBLE_START(ts, INTERVAL '5' MINUTES) AS window_start,
COUNT(*) AS clicks
FROM clicks
GROUP BY user_id, TUMBLE(ts, INTERVAL '5' MINUTES);-- Sliding: 5-min count updated every 1 min
HOP(ts, INTERVAL '1' MINUTE, INTERVAL '5' MINUTES)-- Session: gap of 30 min between activities
SESSION(ts, INTERVAL '30' MINUTES)Joins
Stream joins are unique in their complications:
Stream-Stream Joins
Both sides are streams; needs windowing:
-- Join clicks with purchases within 15 minutes
SELECT
c.user_id,
c.url,
p.amount
FROM clicks c
JOIN purchases p
ON c.user_id = p.user_id
AND p.ts BETWEEN c.ts AND c.ts + INTERVAL '15' MINUTES;State grows with the window size — bound it, or you'll OOM.
Stream-Table Joins (Enrichment)
Stream joined against a "table" (CDC of a slow-changing dimension):
-- Enrich each click with user attributes from CDC
SELECT
c.user_id,
u.name,
u.tier,
c.url
FROM clicks c
LEFT JOIN users FOR SYSTEM_TIME AS OF c.ts AS u
ON c.user_id = u.id;"Temporal join" — Flink and Materialize support this natively. State holds the user table, kept fresh via the source.
Interval Joins
Like stream-stream but bounded by a time interval. Most common in fraud / matching scenarios.
State
Stream processors keep state — often a lot of it.
| State type | Examples |
|---|---|
| Aggregations | Running count, sum, percentile |
| Joins | Right-side rows waiting for matches |
| Dedup tables | Seen-this-event-already markers |
| Materialized views | Pre-computed query results |
State lives:
| Storage | Examples |
|---|---|
| In-memory + RocksDB on local disk | Flink (default), Kafka Streams |
| Distributed across the cluster | Flink with HA |
| In a SQL store on each compute node | Materialize, RisingWave |
| External (Redis, Cassandra) | Less common; latency |
Big state is expensive — disk, network, recovery time. Bound your aggregations to windows or TTLs.
Exactly-Once
The holy grail. Three delivery semantics:
| Semantics | What it means |
|---|---|
| At-most-once | Every event processed at most once; some may be lost |
| At-least-once | Every event processed at least once; some may be duplicated |
| Exactly-once | Every event processed effectively once (logically) |
True exactly-once requires the entire pipeline (source → processor → sink) to participate:
- Source must support replay (Kafka offsets).
- Processor must checkpoint state atomically.
- Sink must support transactional writes.
Flink + Kafka does this end-to-end. Kafka Streams within a single Kafka cluster does this. Materialize does this for views over Kafka. Output to non-transactional sinks (most HTTP APIs, plain databases without idempotency) — you fall back to at-least-once + idempotent consumers.
Change Data Capture (CDC)
Streaming changes from a database into a stream processor:
PostgreSQL ──► Debezium ──► Kafka topic ──► Stream processor
(logical (with each
replication) INSERT/UPDATE/DELETE)CDC is the foundation of:
- Real-time analytical replicas.
- Search index sync (Elasticsearch, Algolia, Meilisearch).
- Audit trails.
- Microservice event sourcing without dual-writes.
Materialize, Flink, and Kafka Streams all consume CDC sources natively. The pattern of choice when you want stream output but your sources are databases.
Common Pipeline Shapes
Real-Time Dashboard
Events → Kafka → Materialize/Flink (aggregate per window) → live dashboardThe dashboard subscribes to view updates, no polling.
Fraud Detection
Events → Kafka → Flink (pattern match + ML scoring) → if risky → alerts topic → block / notifyState holds the user's recent activity; pattern match fires alerts.
Enrichment + Sink
Events → Kafka → Flink (join with reference data) → enriched events → object storage / data warehouseDecouple OLTP from analytics. The processor enriches and sinks to a data lake / warehouse.
Materialized Views for an API
DB → CDC → Kafka → Materialize (SQL views) → app queries Materialize directlyReplaces complex caching layers — Materialize is the cache, always fresh.
Microservice Event Sourcing
Service writes events → Kafka → other services consume → maintain their own derived stateEach service has its own view of the world; no shared DB. Heavy commitment; powerful when the team can sustain it.
Backpressure
When a downstream is slower than the source:
| Strategy | Effect |
|---|---|
| Buffer in memory | Limits scale; OOM risk |
| Slow down the source | Backpressure flows back through the pipeline |
| Drop messages | Sometimes OK for low-value events |
| Spill to disk | Expensive but bounded |
Flink, Kafka Streams handle backpressure built-in by slowing the source consumer. Materialize / RisingWave do similar via control flow.
Monitor backpressure metrics — it's the early indicator of capacity issues.
Idempotency Everywhere
Even with exactly-once at the framework level, build idempotent consumers:
- Sink writes use upsert with a key.
- HTTP calls use idempotency tokens (request ID).
- DB writes use ON CONFLICT or
INSERT IGNORE.
When the framework fails over, retries may double-write. Idempotency makes this safe.
What's Next
You know the vocabulary and the patterns. Best Practices covers operations — state sizing, observability, recovery, latency budgets, when not to use this at all.