Flink, Kafka Streams, Materialize, Spark Streaming - continuous computation over data in motion

Stream Processing

A stream processor runs continuous computation over data that arrives over time — joining, aggregating, transforming, alerting — rather than running batch jobs over data at rest. The output keeps up with the input: when a new event arrives, the relevant computation updates within seconds, not hours.

This is the layer above Message Queues — Kafka holds the events; the stream processor does something useful with them.

Why Stream Processing

Without	With
Nightly batch reports lag a day	Real-time dashboards update in seconds
Each new aggregation = a new ETL job	Express it as SQL and it runs continuously
State management bolted onto consumers	Built-in state stores with exactly-once processing
Joining two streams = custom code	First-class joins with windowing
Detecting patterns requires scanning historical data	Continuous pattern matching
Fraud detection runs hours late	Fires within milliseconds

The use cases:

Real-time analytics — dashboards that update as events flow in.
Fraud / anomaly detection — pattern match in flight.
Materialized views — keep a denormalized view fresh without full recomputation.
Enrichment pipelines — join events with reference data, write to a downstream sink.
Alerting on event patterns — "5 failed logins in 1 minute from the same IP."
Operational metrics — derive p95 latency from individual request events.
Change data capture (CDC) — turn database changes into a stream and react.
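To make the alerting use case concrete, here is a minimal sketch of the "5 failed logins in 1 minute from the same IP" rule in plain Python. It is not how any particular engine implements pattern matching. The class name, the event shape (ip, timestamp in seconds, success flag), and the assumption that events arrive in timestamp order are all illustrative. A real stream processor would keep this state in a fault-tolerant store and handle out-of-order events.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60   # "in 1 minute"
THRESHOLD = 5         # "5 failed logins"


class FailedLoginDetector:
    """Hypothetical detector: per-IP deque of recent failure timestamps."""

    def __init__(self, window_seconds=WINDOW_SECONDS, threshold=THRESHOLD):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.failures = defaultdict(deque)  # ip -> failure timestamps in window

    def on_event(self, ip, timestamp, success):
        """Process one login event; return True if it triggers an alert."""
        if success:
            return False
        recent = self.failures[ip]
        recent.append(timestamp)
        # Evict failures that fell out of the window ending at this event.
        while recent and recent[0] <= timestamp - self.window_seconds:
            recent.popleft()
        return len(recent) >= self.threshold


detector = FailedLoginDetector()
events = [("10.0.0.1", t, False) for t in (0, 10, 20, 30, 40)]
alerts = [detector.on_event(ip, ts, ok) for ip, ts, ok in events]
print(alerts)  # [False, False, False, False, True]
```

The state per IP is bounded by the window, which is the property that lets this kind of rule run continuously instead of rescanning history.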

The Players

Tool	Type	Notes
Apache Flink	Distributed, stateful	The most powerful; complex; the standard for serious streaming
Kafka Streams	JVM library (no separate cluster)	If you're already on Kafka and on JVM, often the right call
ksqlDB	SQL-on-Kafka	Easy entry; built on Kafka Streams
Materialize	"Streaming database"	SQL-defined materialized views over streams; PostgreSQL-compatible wire protocol
RisingWave	"Streaming database" (open-source)	Materialize alternative; SQL-first
Apache Spark Structured Streaming	Batch + streaming	If you're already on Spark for batch
Apache Beam	Abstraction layer	Run the same pipeline on Flink, Spark, Dataflow
Google Dataflow	Managed Beam	Auto-scaling, GCP-native
Amazon Managed Service for Apache Flink	Managed Flink on AWS	Formerly Kinesis Data Analytics; tight integration with Kinesis
Bytewax	Python-native	Newer; nice for ML pipelines
Estuary Flow	SaaS	Modern, low-code

For new projects:

You want SQL and operational simplicity → Materialize or RisingWave.
JVM team, already on Kafka → Kafka Streams.
You need maximum power / scale → Flink (self-host or managed).
You're on AWS / GCP and want managed → Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics) / Dataflow.
Python team doing ML / data eng → Bytewax or Spark Structured Streaming.

Stream Processing vs Adjacent Things

	Stream Processing	Batch (Spark, dbt)	Background Jobs	Message Queues
Time orientation	Continuous, low-latency	Periodic, high-latency	Event-driven, per-task	Transport
Output	Continuously updated state / sink	Fresh table per run	Side effects	Move bytes
Reasoning	Time, windows, watermarks	Big sets at a point in time	Single units of work	None
Latency	Milliseconds to seconds	Hours	Seconds-minutes	Sub-second
State	Long-lived, large	Per-job	Per-task	None

You'll often use several together: events flow through Kafka, get processed by Flink, land in Materialize for live queries, and trigger background jobs for side effects.

Learning Path

1. Getting Started

Run Materialize locally; create a SQL-defined materialized view over a Kafka topic that stays fresh

2. Patterns

Windowing, joins, watermarks, state, exactly-once, CDC, change detection

3. Best Practices

State sizing, observability, recovery, latency budgets, when not to use

Two Mental Models

The space splits into two philosophies:

"Stream as a stream"

Flink, Kafka Streams, Spark Structured Streaming. You write code (or SQL) that defines how each event is transformed; the framework manages state, windowing, and exactly-once processing. Powerful and flexible, but with a steep learning curve.

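// Kafka Streams (Kotlin): count clicks per user per 5-minute tumbling window.
// Assumes an existing StreamsBuilder named `builder` and default serdes for ClickEvent.
builder.stream<String, ClickEvent>("clicks")
  .groupBy { _, click -> click.userId }
  // TimeWindows.of() is deprecated since Kafka 3.0; this is its replacement.
  .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
  .count()
  .toStream()
  // Flatten the Windowed<String> key so plain String/Long serdes can write it.
  .map { windowed, count -> KeyValue("${windowed.key()}@${windowed.window().start()}", count) }
  .to("clicks-per-user-5m", Produced.with(Serdes.String(), Serdes.Long()))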
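For readers unfamiliar with tumbling windows, the following plain-Python sketch shows roughly what the windowed count above computes: each event is assigned to the fixed 5-minute bucket containing its timestamp, and counts are kept per (user, bucket). The function names and the (user_id, timestamp_ms) event shape are assumptions for illustration. A real framework does this incrementally against a fault-tolerant state store rather than an in-memory Counter.

```python
from collections import Counter

WINDOW_MS = 5 * 60 * 1000  # 5-minute tumbling windows


def window_start(timestamp_ms, size_ms=WINDOW_MS):
    """Align an event timestamp to the start of its tumbling window."""
    return timestamp_ms - (timestamp_ms % size_ms)


def count_clicks(events, size_ms=WINDOW_MS):
    """Count events per (user_id, window_start).

    `events` is an iterable of (user_id, timestamp_ms) pairs.
    """
    counts = Counter()
    for user_id, timestamp_ms in events:
        counts[(user_id, window_start(timestamp_ms, size_ms))] += 1
    return counts


clicks = [("alice", 0), ("alice", 299_999), ("alice", 300_000), ("bob", 10_000)]
result = count_clicks(clicks)
print(result[("alice", 0)], result[("alice", 300_000)], result[("bob", 0)])  # 2 1 1
```

Note that the click at 299,999 ms and the click at 300,000 ms land in different windows even though they are one millisecond apart; that boundary behavior is inherent to tumbling windows.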

"Stream as a database"

Materialize, RisingWave. You declare what you want with SQL; the database maintains it incrementally as data flows in.

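-- Materialize: temporal filters use mz_now(), which must stand alone on one
-- side of the comparison; now() is not allowed in a materialized view.
CREATE MATERIALIZED VIEW clicks_per_user_5m AS
SELECT user_id, COUNT(*) AS click_count
FROM clicks
WHERE mz_now() <= created_at + INTERVAL '5 minutes'
GROUP BY user_id;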

-- Query like any table; always fresh
SELECT * FROM clicks_per_user_5m ORDER BY click_count DESC LIMIT 10;
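The phrase "maintains it incrementally" can be illustrated with a toy model: instead of recomputing the GROUP BY on every query, the view applies a +1 delta when a click arrives and a -1 delta when a click ages out of the 5-minute filter. The sketch below is only a mental model in plain Python, not how Materialize or RisingWave are implemented; the class and method names are hypothetical, and it assumes clicks arrive in timestamp order.

```python
from collections import Counter, deque


class ClicksPerUserView:
    """Toy incrementally maintained view: clicks per user, sliding window."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self.events = deque()    # (timestamp, user_id), oldest first
        self.counts = Counter()  # the "materialized" result

    def insert(self, timestamp, user_id):
        """Apply one new click as a +1 delta to a single row."""
        self.events.append((timestamp, user_id))
        self.counts[user_id] += 1

    def advance(self, now):
        """Retract clicks that are now older than the window (-1 deltas)."""
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, user_id = self.events.popleft()
            self.counts[user_id] -= 1
            if self.counts[user_id] == 0:
                del self.counts[user_id]

    def top(self, n=10):
        """Read the current result, like SELECT ... ORDER BY ... LIMIT n."""
        return self.counts.most_common(n)


view = ClicksPerUserView()
view.insert(0, "alice")
view.insert(100, "bob")
view.insert(200, "alice")
view.advance(now=250)
print(view.top())         # [('alice', 2), ('bob', 1)]
view.advance(now=350)     # the click at t=0 ages out
print(dict(view.counts))  # {'alice': 1, 'bob': 1}
```

Each update touches only the affected rows, so the cost of keeping the view fresh scales with the rate of change rather than with the size of the input.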

For non-trivial workloads in 2026, start with the streaming-database model unless you have specific needs (very large state, custom logic, integration with the Flink ecosystem). It is usually much easier to reason about.

Stream processing has been "the next big thing" for 15 years. In practice, most companies that adopt it use it for a few specific high-value pipelines (fraud, real-time dashboards, materialized views), not as a general application architecture. Pick concrete use cases; don't try to make everything streaming.
