Stream Processing
Flink, Kafka Streams, Materialize, Spark Streaming - continuous computation over data in motion
Stream Processing
A stream processor runs continuous computation over data that arrives over time — joining, aggregating, transforming, alerting — rather than running batch jobs over data at rest. The output keeps up with the input: when a new event arrives, the relevant computation updates within seconds, not hours.
This is the layer above Message Queues — Kafka holds the events; the stream processor does something useful with them.
Why Stream Processing
| Without | With |
|---|---|
| Nightly batch reports lag a day | Real-time dashboards update in seconds |
| Each new aggregation = a new ETL job | Express it as SQL, it just runs |
| State management bolted onto consumers | Built-in state stores, exactly-once |
| Joining two streams = custom code | First-class joins with windowing |
| Detecting patterns requires scanning historical data | Continuous pattern matching |
| Fraud detection runs hours late | Fires within milliseconds |
The use cases:
- Real-time analytics — dashboards that update as events flow in.
- Fraud / anomaly detection — pattern match in flight.
- Materialized views — keep a denormalized view fresh without full recomputation.
- Enrichment pipelines — join events with reference data, write to a downstream sink.
- Alerting on event patterns — "5 failed logins in 1 minute from the same IP."
- Operational metrics — derive p95 latency from individual request events.
- Change data capture (CDC) — turn database changes into a stream and react.
The Players
| Tool | Type | Notes |
|---|---|---|
| Apache Flink | Distributed, stateful | The most powerful; complex; the standard for serious streaming |
| Kafka Streams | JVM library (no separate cluster) | If you're already on Kafka and on JVM, often the right call |
| ksqlDB | SQL-on-Kafka | Easy entry; built on Kafka Streams |
| Materialize | "Streaming database" | SQL-defined materialized views over streams; PostgreSQL-compatible wire protocol |
| RisingWave | "Streaming database" (open-source) | Materialize alternative; SQL-first |
| Apache Spark Structured Streaming | Batch + streaming | If you're already on Spark for batch |
| Apache Beam | Abstraction layer | Run the same pipeline on Flink, Spark, Dataflow |
| Google Dataflow | Managed Beam | Auto-scaling, GCP-native |
| Amazon Kinesis Data Analytics | Managed Flink on AWS | Tight integration with Kinesis |
| Bytewax | Python-native | Newer; nice for ML pipelines |
| Estuary Flow | SaaS | Modern, low-code |
For new projects:
- You want SQL and operational simplicity → Materialize or RisingWave.
- JVM team, already on Kafka → Kafka Streams.
- You need maximum power / scale → Flink (self-host or managed).
- You're on AWS / GCP and want managed → Kinesis Data Analytics / Dataflow.
- Python team doing ML / data eng → Bytewax or Spark Structured Streaming.
Stream Processing vs Adjacent Things
| Stream Processing | Batch (Spark, dbt) | Background Jobs | Message Queues | |
|---|---|---|---|---|
| Time orientation | Continuous, low-latency | Periodic, high-latency | Event-driven, per-task | Transport |
| Output | Continuously updated state / sink | Fresh table per run | Side effects | Move bytes |
| Reasoning | Time, windows, watermarks | Big sets at a point in time | Single units of work | None |
| Latency | Seconds | Hours | Seconds-minutes | Sub-second |
| State | Long-lived, large | Per-job | Per-task | None |
You'll often use several together: events flow through Kafka, get processed by Flink, write to Materialize for live queries, and trigger background jobs for side effects.
Learning Path
1. Getting Started
Run Materialize locally; create a SQL-defined materialized view over a Kafka topic that stays fresh
2. Patterns
Windowing, joins, watermarks, state, exactly-once, CDC, change detection
3. Best Practices
State sizing, observability, recovery, latency budgets, when not to use
Two Mental Models
The space splits into two philosophies:
"Stream as a stream"
Flink, Kafka Streams, Spark Streaming. You write code (or SQL) that defines how each event transforms; the framework manages state, windowing, exactly-once. Powerful and flexible; steep learning curve.
// Kafka Streams: count clicks per user per 5-minute window
builder.stream<String, ClickEvent>("clicks")
.groupBy { _, click -> click.userId }
.windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
.count()
.toStream()
.to("clicks-per-user-5m")"Stream as a database"
Materialize, RisingWave. You declare what you want with SQL; the database maintains it incrementally as data flows in.
CREATE MATERIALIZED VIEW clicks_per_user_5m AS
SELECT user_id, COUNT(*) AS click_count
FROM clicks
WHERE created_at > now() - INTERVAL '5 minutes'
GROUP BY user_id;
-- Query like any table; always fresh
SELECT * FROM clicks_per_user_5m ORDER BY click_count DESC LIMIT 10;For 2026 in non-trivial workloads: start with the streaming-database model unless you have specific needs (huge state, custom logic, integration with Flink ecosystem). It's dramatically easier to reason about.
Stream processing has been "the next big thing" for 15 years. The real-world reality: most companies that adopt it use it for a few specific high-value pipelines (fraud, real-time dashboards, materialized views), not as a general application architecture. Pick concrete use cases; don't try to make everything streaming.