Best Practices

Production stream processing - state, observability, recovery, latency, when not to use this

Best Practices

Stream processors look declarative but operate stateful, latency-sensitive, hard-to-recover systems. These habits keep them well-behaved.

Bound Your State

The single most important rule. State grows monotonically unless you bound it.

Bound state via:

Pattern	Effect
Window with explicit close	State drops after window expires
TTL on keyed state	"Drop entries older than 30 days"
Compacted topics as state input	Only latest per key
Bounded join intervals	Stream-stream joins with `INTERVAL`
Periodic state cleanup	Reaper jobs

A pipeline without bounded state will eventually OOM or fall over recovering. Set explicit limits at design time.

Watermark Tuning

Watermarks are the trickiest knob:

Too aggressive → late events dropped (incorrect results).
Too conservative → high latency (slow downstream updates).

Start by measuring lateness in production data:

P50 lateness: 100 ms
P95 lateness: 2 s
P99 lateness: 30 s
P99.9 lateness: 5 min

Set the watermark slack at P99 to P99.9 — most events handled correctly, with acceptable lateness budget for stragglers. Configure a side-output for events past the watermark; you can re-process or alert on volume.

Checkpointing and Recovery

Every stateful stream processor must checkpoint to recover from failure:

Aspect	Tuning
Checkpoint interval	Tradeoff: shorter = faster recovery, more I/O cost
Checkpoint storage	Local disk is fast but doesn't survive node loss; cloud blob storage survives
State backend	RocksDB on disk (default Flink) scales further than heap
Incremental checkpoints	Big state — only changes since last checkpoint
Externalized checkpoints	Survive job restart, not just node restart

Test recovery before going to production. Kill a worker; verify the pipeline catches up correctly.

Latency Budget

Define latency targets explicitly:

Use case	Target
Real-time dashboard	< 10s end-to-end
Fraud detection	< 1s
Materialized view for an API	< 500ms
Operational analytics	< 1 min

The latency budget shapes:

Watermark slack (faster = less slack for late events).
Checkpoint interval (longer = more recovery latency).
Source poll interval (Kafka consumer wait).
Sink write strategy (batch vs streaming).

Monitor end-to-end latency, not just per-stage. Event time at source → event observed in sink is what users see.

Observability

What to monitor:

Metric	Why
End-to-end latency	Did we meet SLO?
Throughput (events/s in, out)	Bottleneck per stage
Backpressure indicator	Source consumer slowing → downstream is behind
Checkpoint duration	Long = state too big
Checkpoint failure rate	Things breaking
State size	Capacity planning
Watermark lag	How far behind real time
Restart count / job uptime	Stability

Flink ships Prometheus metrics. Materialize has built-in observability views. Kafka Streams exposes JMX metrics. Pipe everything to Prometheus & Grafana.

Schema Evolution

Streams live forever; schemas don't:

Use a schema registry (Confluent Schema Registry, Apicurio).
Avro / Protobuf / JSON Schema with documented evolution rules.
Backward compatibility: new schema can read old events.
Forward compatibility: old schema can read new events.
Both = "full compatibility" — strongest, what you want for long-lived pipelines.

A breaking schema change in production stops your pipeline. Test schema changes against historical data before deploy.

Deployment

Stream pipelines are infrastructure, not just applications:

Concern	Recommendation
Version control pipelines	SQL files / Flink jobs in git
CI tests on representative data	Catches state migration bugs
Staging environment with prod-like load	Performance / state bugs visible
Blue-green deployment	Run new version in parallel; cut over reads
State migration plan	Either replay from scratch or migrate state files
Pipeline failures alert on-call	Like databases, not "just a service"

Some pipelines can replay from Kafka's beginning to rebuild state — others have too much. Know which yours is before you need to recover.

Scale Considerations

Scale	Typical choice
< 10K events/s, small state	Materialize / RisingWave / single-instance Kafka Streams
10K-100K events/s	Distributed Kafka Streams or single-instance Flink
100K-10M events/s	Flink cluster
> 10M events/s	Flink, with serious operations team

For state size: in-memory works up to a few GB; RocksDB on local disk handles hundreds of GB to TBs. Beyond that, partition aggressively.

Cost

Stream processors are continuously running compute. Costs:

Driver	Mitigation
Always-on workers	Right-size; aggressive when no traffic
State storage	Bound state; cheaper backends; lifecycle
Checkpoint storage I/O	Tune interval; incremental
Kafka brokers feeding	Topic retention; compression
Cross-AZ traffic	Co-locate when possible

Managed services (Confluent Flink, Materialize Cloud, AWS Kinesis Data Analytics) cost more per unit but include operations. The cheapest option is rarely the cheapest total cost of ownership.

When NOT to Use Stream Processing

Honest cases:

You can run a batch job hourly and that's fine for the use case. Don't reach for streaming if you don't need the latency.
You have one consumer that needs to react to events — a Background Job consumer with idempotency is simpler.
You have a few aggregations that can live in your OLTP DB — Postgres can maintain a counter via triggers.
You don't have someone to own the pipeline. Stream processors are stateful systems that need care; underestimating that has burned every team that tried.

If you have 1-2 high-value pipelines (real-time analytics, materialized views, fraud), stream processing is worth it. If you're rebuilding your entire application around streaming because it's the new shiny, stop.

Common Pitfalls

Pitfall	Symptom	Fix
Unbounded state	OOM, slow recovery	Window or TTL everything
Aggressive watermark	Missing late events	Measure real lateness; tune
Processing-time aggregations	Wrong answers on out-of-order data	Use event time
Schema change without versioning	Pipeline stops	Schema registry + compatibility rules
Sink failures swallowed	Silent data loss	Alert on sink errors
One pipeline, many concerns	Hard to evolve	Decompose into per-purpose pipelines
No staging that matches prod load	Surprises in production	Realistic load tests
Replaying from scratch is the only recovery	Hours of downtime	Externalize checkpoints; bounded replay
Long-running joins without TTL	Memory creep	Always TTL or window joins
No on-call ownership	Pipeline silently dies	Treat as Tier-1 system

Best Practices

Best Practices

Bound Your State

Watermark Tuning

Checkpointing and Recovery

Latency Budget

Observability

Schema Evolution

Deployment

Scale Considerations

Cost

When NOT to Use Stream Processing

Common Pitfalls

Checklist

On this page