Steven's Knowledge

Best Practices

Production stream processing - state, observability, recovery, latency, when not to use this

Best Practices

Stream processors look declarative but operate stateful, latency-sensitive, hard-to-recover systems. These habits keep them well-behaved.

Bound Your State

The single most important rule. State grows monotonically unless you bound it.

Bound state via:

PatternEffect
Window with explicit closeState drops after window expires
TTL on keyed state"Drop entries older than 30 days"
Compacted topics as state inputOnly latest per key
Bounded join intervalsStream-stream joins with INTERVAL
Periodic state cleanupReaper jobs

A pipeline without bounded state will eventually OOM or fall over recovering. Set explicit limits at design time.

Watermark Tuning

Watermarks are the trickiest knob:

  • Too aggressive → late events dropped (incorrect results).
  • Too conservative → high latency (slow downstream updates).

Start by measuring lateness in production data:

P50 lateness: 100 ms
P95 lateness: 2 s
P99 lateness: 30 s
P99.9 lateness: 5 min

Set the watermark slack at P99 to P99.9 — most events handled correctly, with acceptable lateness budget for stragglers. Configure a side-output for events past the watermark; you can re-process or alert on volume.

Checkpointing and Recovery

Every stateful stream processor must checkpoint to recover from failure:

AspectTuning
Checkpoint intervalTradeoff: shorter = faster recovery, more I/O cost
Checkpoint storageLocal disk is fast but doesn't survive node loss; cloud blob storage survives
State backendRocksDB on disk (default Flink) scales further than heap
Incremental checkpointsBig state — only changes since last checkpoint
Externalized checkpointsSurvive job restart, not just node restart

Test recovery before going to production. Kill a worker; verify the pipeline catches up correctly.

Latency Budget

Define latency targets explicitly:

Use caseTarget
Real-time dashboard< 10s end-to-end
Fraud detection< 1s
Materialized view for an API< 500ms
Operational analytics< 1 min

The latency budget shapes:

  • Watermark slack (faster = less slack for late events).
  • Checkpoint interval (longer = more recovery latency).
  • Source poll interval (Kafka consumer wait).
  • Sink write strategy (batch vs streaming).

Monitor end-to-end latency, not just per-stage. Event time at source → event observed in sink is what users see.

Observability

What to monitor:

MetricWhy
End-to-end latencyDid we meet SLO?
Throughput (events/s in, out)Bottleneck per stage
Backpressure indicatorSource consumer slowing → downstream is behind
Checkpoint durationLong = state too big
Checkpoint failure rateThings breaking
State sizeCapacity planning
Watermark lagHow far behind real time
Restart count / job uptimeStability

Flink ships Prometheus metrics. Materialize has built-in observability views. Kafka Streams exposes JMX metrics. Pipe everything to Prometheus & Grafana.

Schema Evolution

Streams live forever; schemas don't:

  • Use a schema registry (Confluent Schema Registry, Apicurio).
  • Avro / Protobuf / JSON Schema with documented evolution rules.
  • Backward compatibility: new schema can read old events.
  • Forward compatibility: old schema can read new events.
  • Both = "full compatibility" — strongest, what you want for long-lived pipelines.

A breaking schema change in production stops your pipeline. Test schema changes against historical data before deploy.

Deployment

Stream pipelines are infrastructure, not just applications:

ConcernRecommendation
Version control pipelinesSQL files / Flink jobs in git
CI tests on representative dataCatches state migration bugs
Staging environment with prod-like loadPerformance / state bugs visible
Blue-green deploymentRun new version in parallel; cut over reads
State migration planEither replay from scratch or migrate state files
Pipeline failures alert on-callLike databases, not "just a service"

Some pipelines can replay from Kafka's beginning to rebuild state — others have too much. Know which yours is before you need to recover.

Scale Considerations

ScaleTypical choice
< 10K events/s, small stateMaterialize / RisingWave / single-instance Kafka Streams
10K-100K events/sDistributed Kafka Streams or single-instance Flink
100K-10M events/sFlink cluster
> 10M events/sFlink, with serious operations team

For state size: in-memory works up to a few GB; RocksDB on local disk handles hundreds of GB to TBs. Beyond that, partition aggressively.

Cost

Stream processors are continuously running compute. Costs:

DriverMitigation
Always-on workersRight-size; aggressive when no traffic
State storageBound state; cheaper backends; lifecycle
Checkpoint storage I/OTune interval; incremental
Kafka brokers feedingTopic retention; compression
Cross-AZ trafficCo-locate when possible

Managed services (Confluent Flink, Materialize Cloud, AWS Kinesis Data Analytics) cost more per unit but include operations. The cheapest option is rarely the cheapest total cost of ownership.

When NOT to Use Stream Processing

Honest cases:

  • You can run a batch job hourly and that's fine for the use case. Don't reach for streaming if you don't need the latency.
  • You have one consumer that needs to react to events — a Background Job consumer with idempotency is simpler.
  • You have a few aggregations that can live in your OLTP DB — Postgres can maintain a counter via triggers.
  • You don't have someone to own the pipeline. Stream processors are stateful systems that need care; underestimating that has burned every team that tried.

If you have 1-2 high-value pipelines (real-time analytics, materialized views, fraud), stream processing is worth it. If you're rebuilding your entire application around streaming because it's the new shiny, stop.

Common Pitfalls

PitfallSymptomFix
Unbounded stateOOM, slow recoveryWindow or TTL everything
Aggressive watermarkMissing late eventsMeasure real lateness; tune
Processing-time aggregationsWrong answers on out-of-order dataUse event time
Schema change without versioningPipeline stopsSchema registry + compatibility rules
Sink failures swallowedSilent data lossAlert on sink errors
One pipeline, many concernsHard to evolveDecompose into per-purpose pipelines
No staging that matches prod loadSurprises in productionRealistic load tests
Replaying from scratch is the only recoveryHours of downtimeExternalize checkpoints; bounded replay
Long-running joins without TTLMemory creepAlways TTL or window joins
No on-call ownershipPipeline silently diesTreat as Tier-1 system

Checklist

Production stream processing checklist

  • State bounded explicitly (windows, TTLs, compaction)
  • Watermark configured based on measured lateness
  • Late-arrival policy chosen (drop / side-output / re-process)
  • Checkpoints externalized; can recover from a node loss
  • End-to-end latency SLO defined; monitored
  • Backpressure alerting
  • Schema registry with compatibility rules
  • Pipeline code in git; deployments versioned
  • Staging with prod-like data volume
  • State migration plan for pipeline upgrades
  • On-call ownership; runbook for "pipeline stopped"
  • Sinks idempotent (in case of replay)
  • Source uses replayable storage (Kafka, not at-most-once queues)
  • Cost dashboard; alert on unexpected resource use
  • Recovery time tested

On this page