Steven's Knowledge

Kafka

Apache Kafka in depth - partitions, consumer groups, producer/consumer code, topic configuration, operational notes

Kafka

Kafka is a distributed event streaming platform built around a partitioned, replicated, append-only log. Producers write to topics; consumers read from offsets they control; messages stick around for a retention period (or forever).

Core Vocabulary

TermWhat it is
TopicA named log; the unit of categorisation
PartitionA sub-log within a topic; ordering is per-partition
BrokerA Kafka server; a cluster has many
ProducerWrites records to topics
ConsumerReads records by partition + offset
Consumer groupA set of consumers cooperating; each partition assigned to one consumer in the group
OffsetA consumer's position in a partition
Replication factorHow many copies of each partition exist across brokers
ISR (in-sync replicas)Replicas keeping up with the leader
ControllerThe broker coordinating the cluster (KRaft-mode; ZooKeeper-free)

Partitions and Ordering

Every topic has N partitions. The producer chooses one per message:

  • No key → round-robin across partitions
  • Key set → hash(key) % partitions → same key always lands in the same partition

Ordering is guaranteed within a partition, not across. If you need "all events for user X in order," set the user ID as the key — every event for that user goes to the same partition.

key=user-7      ┐
key=user-7      │── partition 2 (ordered)
key=user-7      ┘

key=user-9      ┐── partition 0
key=user-9      ┘

Consumer Groups

A consumer group is the unit of parallelism:

Topic: orders (6 partitions)        Consumer Group A (3 consumers)
                                    Consumer A1 ── partitions 0,1
                                    Consumer A2 ── partitions 2,3
                                    Consumer A3 ── partitions 4,5

Rules to internalise:

  • One partition is read by exactly one consumer in a group. If you add more consumers than partitions, the extras sit idle.
  • Two groups are independent. "Order processor" and "audit logger" can both read every message at their own pace.
  • Rebalancing happens when consumers join, leave, or crash — partitions are reassigned. Briefly disruptive; tune with partition.assignment.strategy.

To scale processing, add partitions (up-front) and add consumers to the group (any time, up to the partition count).

Producer

// Node.js — kafkajs
const { Kafka, CompressionTypes } = require('kafkajs');

const kafka = new Kafka({
  clientId: 'order-service',
  brokers: ['kafka-1:9092', 'kafka-2:9092', 'kafka-3:9092'],
});

const producer = kafka.producer({
  idempotent: true,              // exactly-once on a single producer session
  maxInFlightRequests: 5,
});

await producer.connect();

await producer.send({
  topic: 'orders',
  compression: CompressionTypes.LZ4,
  messages: [
    {
      key: order.userId,                                  // partition by user
      value: JSON.stringify(order),
      headers: { 'content-type': 'application/json' },
    },
  ],
  acks: -1,                       // wait for all in-sync replicas
});

Producer Knobs Worth Knowing

SettingWhat it controlsSensible default
acksDurability — 0 fire-and-forget, 1 leader only, -1/all all ISRsall for production
idempotent: trueProducer dedupes retries within a sessiontrue
maxInFlightRequestsParallel unacked requests per broker5 (with idempotence)
compressionnone, gzip, snappy, lz4, zstdlz4 or zstd
linger.msWait this long for more messages to batch5-50ms for throughput
batch.sizeBytes per batch16-64 KB
retriesRetry attempts on transient failureMany; rely on delivery.timeout.ms for overall deadline

Consumer

// Node.js — kafkajs
const consumer = kafka.consumer({
  groupId: 'order-processor',
  sessionTimeout: 30000,
  heartbeatInterval: 3000,
});

await consumer.connect();
await consumer.subscribe({ topic: 'orders', fromBeginning: false });

await consumer.run({
  autoCommit: false,                              // commit manually for safety
  eachBatchAutoResolve: false,
  partitionsConsumedConcurrently: 3,

  eachBatch: async ({ batch, resolveOffset, heartbeat, commitOffsetsIfNecessary }) => {
    for (const message of batch.messages) {
      await processOrder(JSON.parse(message.value.toString()));
      resolveOffset(message.offset);
      await heartbeat();
    }
    await commitOffsetsIfNecessary();
  },
});

Commit Patterns

PatternBehaviorTrade-off
autoCommit: true (default)Periodic offset commitAt-least-once; you'll re-process on crash
Commit after processingManual commit per batchAt-least-once, fewer duplicates
Commit before processingRareAt-most-once; messages lost on crash
External offset store (DB) + transactional outboxCommit alongside DB writesTruly exactly-once with care

Design every consumer to be idempotent. "Exactly once" is a property of the system, not a checkbox.

Topic Configuration

kafka-topics --create \
  --bootstrap-server kafka:9092 \
  --topic orders \
  --partitions 12 \
  --replication-factor 3 \
  --config retention.ms=604800000 \           # 7 days
  --config cleanup.policy=delete \
  --config min.insync.replicas=2 \
  --config compression.type=producer

Numbers that matter:

ConfigEffect
partitionsCaps your consumer parallelism. Hard to increase later (breaks key-based ordering for existing data). Start with more than you think you need.
replication.factorDurability. 3 is the production minimum; tolerates one broker loss.
min.insync.replicasProducer acks=all requires this many in-sync replicas. Set to replication.factor - 1 (e.g. 2 with RF 3).
retention.ms / retention.bytesWhen to delete old segments.
cleanup.policy=deleteTime-based deletion (the default).
cleanup.policy=compactKeep the latest value per key forever — perfect for state snapshots (current-balance, user-profile topics).

Schemas

Producers and consumers drift. Use a schema registry (Confluent Schema Registry, Apicurio) with Avro / Protobuf / JSON-Schema:

const { SchemaRegistry } = require('@kafkajs/confluent-schema-registry');

const registry = new SchemaRegistry({ host: 'http://schema-registry:8081' });
const schemaId = await registry.getLatestSchemaId('orders-value');

await producer.send({
  topic: 'orders',
  messages: [{
    key: order.userId,
    value: await registry.encode(schemaId, order),
  }],
});

The registry rejects publishes that break compatibility (forward / backward / full) — schema drift becomes a build-time error instead of a 3 a.m. parse error.

Kafka Streams and ksqlDB

For computation on top of Kafka topics — joins, windowed aggregations, materialised views — Kafka has its own stream-processing layer:

  • Kafka Streams — a Java library; runs in your app's JVM
  • ksqlDB — SQL on Kafka topics; runs as a separate cluster
  • Flink — the heavy-duty alternative for complex streaming, exactly-once semantics, large state

Decision: small in-app aggregations? Streams. Bigger, cross-team? Flink.

Operational Notes

Cluster Sizing

LeverGuidance
BrokersStart with 3 (tolerates one loss); scale by partitions × throughput
Partitions per brokerKeep under ~4000 total partitions per broker for healthy ops
DiskFast SSD; sequential write workload; size for retention × throughput
Network10 GbE+ for clusters above ~100 MB/s aggregate
JVM heap6 GB is plenty; the OS page cache does the heavy lifting

KRaft vs ZooKeeper

Kafka has moved off ZooKeeper (in KRaft mode) — don't deploy a new cluster with ZooKeeper. KRaft is GA and the future; the controller role is built into Kafka brokers.

Tools You'll Want

ToolUse
kafka-topics, kafka-console-producer, kafka-console-consumerFirst-line debugging
kafka-consumer-groups --describeSee lag per partition
Kafka UI (provectuslabs/kafka-ui)Web UI for topics, messages, consumer groups
Cruise ControlPartition rebalancing
BurrowConsumer lag monitoring

Lag Is the Number to Watch

A consumer's lag is "how many records behind the latest." Growing lag = your consumer can't keep up. Alert on it.

docker compose exec kafka kafka-consumer-groups \
  --bootstrap-server kafka:9092 \
  --group order-processor --describe
GROUP            TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
order-processor  orders  0          12453           12500            47
order-processor  orders  1          12200           12500           300

Best Practices Checklist

Production-ready Kafka checklist

  • Replication factor ≥ 3, min.insync.replicas ≥ 2
  • Producer acks=all and enable.idempotence=true
  • Consumers idempotent at the application level
  • Manual commits after successful processing (not before)
  • Schema registry in front of every topic with real consumers
  • Compression set (lz4 or zstd) on every producer
  • Partition counts sized for max expected consumer parallelism (over-provision)
  • Consumer lag monitored and alerted on
  • Retention policy explicit (time and/or bytes)
  • Compacted topics for "current state" data
  • KRaft mode (no ZooKeeper) for any new cluster
  • TLS + SASL/SCRAM (or mTLS) between clients and brokers in production
  • Per-topic ACLs; no shared admin credentials

On this page