Best Practices
Production stream processing - state, observability, recovery, latency, when not to use this
Best Practices
Stream processors look declarative but operate stateful, latency-sensitive, hard-to-recover systems. These habits keep them well-behaved.
Bound Your State
The single most important rule. State grows monotonically unless you bound it.
Bound state via:
| Pattern | Effect |
|---|---|
| Window with explicit close | State drops after window expires |
| TTL on keyed state | "Drop entries older than 30 days" |
| Compacted topics as state input | Only latest per key |
| Bounded join intervals | Stream-stream joins with INTERVAL |
| Periodic state cleanup | Reaper jobs |
A pipeline without bounded state will eventually OOM or fall over recovering. Set explicit limits at design time.
Watermark Tuning
Watermarks are the trickiest knob:
- Too aggressive → late events dropped (incorrect results).
- Too conservative → high latency (slow downstream updates).
Start by measuring lateness in production data:
P50 lateness: 100 ms
P95 lateness: 2 s
P99 lateness: 30 s
P99.9 lateness: 5 minSet the watermark slack at P99 to P99.9 — most events handled correctly, with acceptable lateness budget for stragglers. Configure a side-output for events past the watermark; you can re-process or alert on volume.
Checkpointing and Recovery
Every stateful stream processor must checkpoint to recover from failure:
| Aspect | Tuning |
|---|---|
| Checkpoint interval | Tradeoff: shorter = faster recovery, more I/O cost |
| Checkpoint storage | Local disk is fast but doesn't survive node loss; cloud blob storage survives |
| State backend | RocksDB on disk (default Flink) scales further than heap |
| Incremental checkpoints | Big state — only changes since last checkpoint |
| Externalized checkpoints | Survive job restart, not just node restart |
Test recovery before going to production. Kill a worker; verify the pipeline catches up correctly.
Latency Budget
Define latency targets explicitly:
| Use case | Target |
|---|---|
| Real-time dashboard | < 10s end-to-end |
| Fraud detection | < 1s |
| Materialized view for an API | < 500ms |
| Operational analytics | < 1 min |
The latency budget shapes:
- Watermark slack (faster = less slack for late events).
- Checkpoint interval (longer = more recovery latency).
- Source poll interval (Kafka consumer wait).
- Sink write strategy (batch vs streaming).
Monitor end-to-end latency, not just per-stage. Event time at source → event observed in sink is what users see.
Observability
What to monitor:
| Metric | Why |
|---|---|
| End-to-end latency | Did we meet SLO? |
| Throughput (events/s in, out) | Bottleneck per stage |
| Backpressure indicator | Source consumer slowing → downstream is behind |
| Checkpoint duration | Long = state too big |
| Checkpoint failure rate | Things breaking |
| State size | Capacity planning |
| Watermark lag | How far behind real time |
| Restart count / job uptime | Stability |
Flink ships Prometheus metrics. Materialize has built-in observability views. Kafka Streams exposes JMX metrics. Pipe everything to Prometheus & Grafana.
Schema Evolution
Streams live forever; schemas don't:
- Use a schema registry (Confluent Schema Registry, Apicurio).
- Avro / Protobuf / JSON Schema with documented evolution rules.
- Backward compatibility: new schema can read old events.
- Forward compatibility: old schema can read new events.
- Both = "full compatibility" — strongest, what you want for long-lived pipelines.
A breaking schema change in production stops your pipeline. Test schema changes against historical data before deploy.
Deployment
Stream pipelines are infrastructure, not just applications:
| Concern | Recommendation |
|---|---|
| Version control pipelines | SQL files / Flink jobs in git |
| CI tests on representative data | Catches state migration bugs |
| Staging environment with prod-like load | Performance / state bugs visible |
| Blue-green deployment | Run new version in parallel; cut over reads |
| State migration plan | Either replay from scratch or migrate state files |
| Pipeline failures alert on-call | Like databases, not "just a service" |
Some pipelines can replay from Kafka's beginning to rebuild state — others have too much. Know which yours is before you need to recover.
Scale Considerations
| Scale | Typical choice |
|---|---|
| < 10K events/s, small state | Materialize / RisingWave / single-instance Kafka Streams |
| 10K-100K events/s | Distributed Kafka Streams or single-instance Flink |
| 100K-10M events/s | Flink cluster |
| > 10M events/s | Flink, with serious operations team |
For state size: in-memory works up to a few GB; RocksDB on local disk handles hundreds of GB to TBs. Beyond that, partition aggressively.
Cost
Stream processors are continuously running compute. Costs:
| Driver | Mitigation |
|---|---|
| Always-on workers | Right-size; aggressive when no traffic |
| State storage | Bound state; cheaper backends; lifecycle |
| Checkpoint storage I/O | Tune interval; incremental |
| Kafka brokers feeding | Topic retention; compression |
| Cross-AZ traffic | Co-locate when possible |
Managed services (Confluent Flink, Materialize Cloud, AWS Kinesis Data Analytics) cost more per unit but include operations. The cheapest option is rarely the cheapest total cost of ownership.
When NOT to Use Stream Processing
Honest cases:
- You can run a batch job hourly and that's fine for the use case. Don't reach for streaming if you don't need the latency.
- You have one consumer that needs to react to events — a Background Job consumer with idempotency is simpler.
- You have a few aggregations that can live in your OLTP DB — Postgres can maintain a counter via triggers.
- You don't have someone to own the pipeline. Stream processors are stateful systems that need care; underestimating that has burned every team that tried.
If you have 1-2 high-value pipelines (real-time analytics, materialized views, fraud), stream processing is worth it. If you're rebuilding your entire application around streaming because it's the new shiny, stop.
Common Pitfalls
| Pitfall | Symptom | Fix |
|---|---|---|
| Unbounded state | OOM, slow recovery | Window or TTL everything |
| Aggressive watermark | Missing late events | Measure real lateness; tune |
| Processing-time aggregations | Wrong answers on out-of-order data | Use event time |
| Schema change without versioning | Pipeline stops | Schema registry + compatibility rules |
| Sink failures swallowed | Silent data loss | Alert on sink errors |
| One pipeline, many concerns | Hard to evolve | Decompose into per-purpose pipelines |
| No staging that matches prod load | Surprises in production | Realistic load tests |
| Replaying from scratch is the only recovery | Hours of downtime | Externalize checkpoints; bounded replay |
| Long-running joins without TTL | Memory creep | Always TTL or window joins |
| No on-call ownership | Pipeline silently dies | Treat as Tier-1 system |
Checklist
Production stream processing checklist
- State bounded explicitly (windows, TTLs, compaction)
- Watermark configured based on measured lateness
- Late-arrival policy chosen (drop / side-output / re-process)
- Checkpoints externalized; can recover from a node loss
- End-to-end latency SLO defined; monitored
- Backpressure alerting
- Schema registry with compatibility rules
- Pipeline code in git; deployments versioned
- Staging with prod-like data volume
- State migration plan for pipeline upgrades
- On-call ownership; runbook for "pipeline stopped"
- Sinks idempotent (in case of replay)
- Source uses replayable storage (Kafka, not at-most-once queues)
- Cost dashboard; alert on unexpected resource use
- Recovery time tested