Steven's Knowledge

Patterns

Windowing, watermarks, joins, state, exactly-once, change data capture, common pipelines

Patterns

The vocabulary every stream processor uses, plus the building blocks of real pipelines.

Event Time vs Processing Time

Two clocks matter:

ClockWhat it isWhen it matters
Event timeWhen the event happened (in the real world)Almost always; correctness
Processing timeWhen the system observed the eventLatency, not correctness
Ingestion timeWhen the broker received itCompromise; rarely the right choice

Stream processors mostly use event time because it gives the correct answer even when events arrive late or out of order.

Event: { user_id: 7, ts: 14:00:00, ... }   ← event_time
                                            ← processing_time = 14:00:03 (3s late)

Watermarks

A watermark is the system's best guess at "we've seen all events up to time T." It controls when window computations become final.

Events arrive (out of order):
  ts=14:00:01
  ts=14:00:05
  ts=14:00:03   ← late
  ts=14:00:10
  ts=14:00:07   ← late
  ts=14:00:15

Watermark at 14:00:12 means: "all events ≤ 14:00:12 have arrived"
→ windows ending at or before 14:00:12 can be closed.

Late events past the watermark are either dropped, sent to a side output, or rewrite the closed window — your choice via configuration. Watermark configuration is one of the trickier parts of getting streaming right.

Windowing

Aggregations need windows because the stream is unbounded:

Window typeBehaviorExample
TumblingNon-overlapping fixed bucketsClicks per 5-min bucket: 14:00-14:05, 14:05-14:10, ...
SlidingOverlapping fixed-size windows5-min rolling click count, updated every 1 min
SessionWindow per "session" of activityAll events from a user with < 30 min gap
GlobalOne big windowUse sparingly; bounded state matters
-- Flink SQL: tumbling 5-min windows
SELECT
  user_id,
  TUMBLE_START(ts, INTERVAL '5' MINUTES) AS window_start,
  COUNT(*) AS clicks
FROM clicks
GROUP BY user_id, TUMBLE(ts, INTERVAL '5' MINUTES);
-- Sliding: 5-min count updated every 1 min
HOP(ts, INTERVAL '1' MINUTE, INTERVAL '5' MINUTES)
-- Session: gap of 30 min between activities
SESSION(ts, INTERVAL '30' MINUTES)

Joins

Stream joins are unique in their complications:

Stream-Stream Joins

Both sides are streams; needs windowing:

-- Join clicks with purchases within 15 minutes
SELECT
  c.user_id,
  c.url,
  p.amount
FROM clicks c
JOIN purchases p
  ON c.user_id = p.user_id
  AND p.ts BETWEEN c.ts AND c.ts + INTERVAL '15' MINUTES;

State grows with the window size — bound it, or you'll OOM.

Stream-Table Joins (Enrichment)

Stream joined against a "table" (CDC of a slow-changing dimension):

-- Enrich each click with user attributes from CDC
SELECT
  c.user_id,
  u.name,
  u.tier,
  c.url
FROM clicks c
LEFT JOIN users FOR SYSTEM_TIME AS OF c.ts AS u
  ON c.user_id = u.id;

"Temporal join" — Flink and Materialize support this natively. State holds the user table, kept fresh via the source.

Interval Joins

Like stream-stream but bounded by a time interval. Most common in fraud / matching scenarios.

State

Stream processors keep state — often a lot of it.

State typeExamples
AggregationsRunning count, sum, percentile
JoinsRight-side rows waiting for matches
Dedup tablesSeen-this-event-already markers
Materialized viewsPre-computed query results

State lives:

StorageExamples
In-memory + RocksDB on local diskFlink (default), Kafka Streams
Distributed across the clusterFlink with HA
In a SQL store on each compute nodeMaterialize, RisingWave
External (Redis, Cassandra)Less common; latency

Big state is expensive — disk, network, recovery time. Bound your aggregations to windows or TTLs.

Exactly-Once

The holy grail. Three delivery semantics:

SemanticsWhat it means
At-most-onceEvery event processed at most once; some may be lost
At-least-onceEvery event processed at least once; some may be duplicated
Exactly-onceEvery event processed effectively once (logically)

True exactly-once requires the entire pipeline (source → processor → sink) to participate:

  • Source must support replay (Kafka offsets).
  • Processor must checkpoint state atomically.
  • Sink must support transactional writes.

Flink + Kafka does this end-to-end. Kafka Streams within a single Kafka cluster does this. Materialize does this for views over Kafka. Output to non-transactional sinks (most HTTP APIs, plain databases without idempotency) — you fall back to at-least-once + idempotent consumers.

Change Data Capture (CDC)

Streaming changes from a database into a stream processor:

PostgreSQL ──► Debezium ──► Kafka topic ──► Stream processor
            (logical            (with each
             replication)        INSERT/UPDATE/DELETE)

CDC is the foundation of:

  • Real-time analytical replicas.
  • Search index sync (Elasticsearch, Algolia, Meilisearch).
  • Audit trails.
  • Microservice event sourcing without dual-writes.

Materialize, Flink, and Kafka Streams all consume CDC sources natively. The pattern of choice when you want stream output but your sources are databases.

Common Pipeline Shapes

Real-Time Dashboard

Events → Kafka → Materialize/Flink (aggregate per window) → live dashboard

The dashboard subscribes to view updates, no polling.

Fraud Detection

Events → Kafka → Flink (pattern match + ML scoring) → if risky → alerts topic → block / notify

State holds the user's recent activity; pattern match fires alerts.

Enrichment + Sink

Events → Kafka → Flink (join with reference data) → enriched events → object storage / data warehouse

Decouple OLTP from analytics. The processor enriches and sinks to a data lake / warehouse.

Materialized Views for an API

DB → CDC → Kafka → Materialize (SQL views) → app queries Materialize directly

Replaces complex caching layers — Materialize is the cache, always fresh.

Microservice Event Sourcing

Service writes events → Kafka → other services consume → maintain their own derived state

Each service has its own view of the world; no shared DB. Heavy commitment; powerful when the team can sustain it.

Backpressure

When a downstream is slower than the source:

StrategyEffect
Buffer in memoryLimits scale; OOM risk
Slow down the sourceBackpressure flows back through the pipeline
Drop messagesSometimes OK for low-value events
Spill to diskExpensive but bounded

Flink, Kafka Streams handle backpressure built-in by slowing the source consumer. Materialize / RisingWave do similar via control flow.

Monitor backpressure metrics — it's the early indicator of capacity issues.

Idempotency Everywhere

Even with exactly-once at the framework level, build idempotent consumers:

  • Sink writes use upsert with a key.
  • HTTP calls use idempotency tokens (request ID).
  • DB writes use ON CONFLICT or INSERT IGNORE.

When the framework fails over, retries may double-write. Idempotency makes this safe.

What's Next

You know the vocabulary and the patterns. Best Practices covers operations — state sizing, observability, recovery, latency budgets, when not to use this at all.

On this page