Steven's Knowledge

Stream Processing

Flink, Kafka Streams, Materialize, Spark Streaming - continuous computation over data in motion

Stream Processing

A stream processor runs continuous computation over data that arrives over time — joining, aggregating, transforming, alerting — rather than running batch jobs over data at rest. The output keeps up with the input: when a new event arrives, the relevant computation updates within seconds, not hours.

This is the layer above Message Queues — Kafka holds the events; the stream processor does something useful with them.

Why Stream Processing

WithoutWith
Nightly batch reports lag a dayReal-time dashboards update in seconds
Each new aggregation = a new ETL jobExpress it as SQL, it just runs
State management bolted onto consumersBuilt-in state stores, exactly-once
Joining two streams = custom codeFirst-class joins with windowing
Detecting patterns requires scanning historical dataContinuous pattern matching
Fraud detection runs hours lateFires within milliseconds

The use cases:

  • Real-time analytics — dashboards that update as events flow in.
  • Fraud / anomaly detection — pattern match in flight.
  • Materialized views — keep a denormalized view fresh without full recomputation.
  • Enrichment pipelines — join events with reference data, write to a downstream sink.
  • Alerting on event patterns — "5 failed logins in 1 minute from the same IP."
  • Operational metrics — derive p95 latency from individual request events.
  • Change data capture (CDC) — turn database changes into a stream and react.

The Players

ToolTypeNotes
Apache FlinkDistributed, statefulThe most powerful; complex; the standard for serious streaming
Kafka StreamsJVM library (no separate cluster)If you're already on Kafka and on JVM, often the right call
ksqlDBSQL-on-KafkaEasy entry; built on Kafka Streams
Materialize"Streaming database"SQL-defined materialized views over streams; PostgreSQL-compatible wire protocol
RisingWave"Streaming database" (open-source)Materialize alternative; SQL-first
Apache Spark Structured StreamingBatch + streamingIf you're already on Spark for batch
Apache BeamAbstraction layerRun the same pipeline on Flink, Spark, Dataflow
Google DataflowManaged BeamAuto-scaling, GCP-native
Amazon Kinesis Data AnalyticsManaged Flink on AWSTight integration with Kinesis
BytewaxPython-nativeNewer; nice for ML pipelines
Estuary FlowSaaSModern, low-code

For new projects:

  • You want SQL and operational simplicityMaterialize or RisingWave.
  • JVM team, already on KafkaKafka Streams.
  • You need maximum power / scaleFlink (self-host or managed).
  • You're on AWS / GCP and want managedKinesis Data Analytics / Dataflow.
  • Python team doing ML / data engBytewax or Spark Structured Streaming.

Stream Processing vs Adjacent Things

Stream ProcessingBatch (Spark, dbt)Background JobsMessage Queues
Time orientationContinuous, low-latencyPeriodic, high-latencyEvent-driven, per-taskTransport
OutputContinuously updated state / sinkFresh table per runSide effectsMove bytes
ReasoningTime, windows, watermarksBig sets at a point in timeSingle units of workNone
LatencySecondsHoursSeconds-minutesSub-second
StateLong-lived, largePer-jobPer-taskNone

You'll often use several together: events flow through Kafka, get processed by Flink, write to Materialize for live queries, and trigger background jobs for side effects.

Learning Path

Two Mental Models

The space splits into two philosophies:

"Stream as a stream"

Flink, Kafka Streams, Spark Streaming. You write code (or SQL) that defines how each event transforms; the framework manages state, windowing, exactly-once. Powerful and flexible; steep learning curve.

// Kafka Streams: count clicks per user per 5-minute window
builder.stream<String, ClickEvent>("clicks")
  .groupBy { _, click -> click.userId }
  .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
  .count()
  .toStream()
  .to("clicks-per-user-5m")

"Stream as a database"

Materialize, RisingWave. You declare what you want with SQL; the database maintains it incrementally as data flows in.

CREATE MATERIALIZED VIEW clicks_per_user_5m AS
SELECT user_id, COUNT(*) AS click_count
FROM clicks
WHERE created_at > now() - INTERVAL '5 minutes'
GROUP BY user_id;

-- Query like any table; always fresh
SELECT * FROM clicks_per_user_5m ORDER BY click_count DESC LIMIT 10;

For 2026 in non-trivial workloads: start with the streaming-database model unless you have specific needs (huge state, custom logic, integration with Flink ecosystem). It's dramatically easier to reason about.

Stream processing has been "the next big thing" for 15 years. The real-world reality: most companies that adopt it use it for a few specific high-value pipelines (fraud, real-time dashboards, materialized views), not as a general application architecture. Pick concrete use cases; don't try to make everything streaming.

On this page