The fintech startup adopted Kafka early — we were processing market events at rates that would have swamped a traditional message queue. Two years in, with a five-broker cluster handling 200k messages/sec at peak, the operational experience was significantly different from what I’d expected based on the documentation and conference talks.

Why Kafka and Not Something Simpler

The honest answer: market event processing has properties that make Kafka a natural fit, and we had engineers who knew it.

The properties that matter for this workload:

Ordered delivery within a partition. Market events have causal ordering: a price update for EUR/USD at 10:00:00.100 must be processed after the one at 10:00:00.050. Kafka guarantees order within a partition, so keying events by instrument makes this easy. SQS standard queues and RabbitMQ with competing consumers don’t give you this; SQS FIFO queues do provide ordering per message group, but with throughput limits that ruled them out at our rates.
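Kafka’s Java client routes records by hashing the message key (murmur2 by default). A simplified sketch of the idea, using crc32 as an illustrative stand-in for the real hash, shows why keying by instrument preserves causal order:

```python
import zlib

NUM_PARTITIONS = 32

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Kafka's Java client uses murmur2 on the key bytes; crc32 is an
    # illustrative stand-in with the property that matters here:
    # the same key always hashes to the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# All EUR/USD updates share a key, so they land on one partition and
# are consumed in the order they were produced.
events = [("EUR/USD", "10:00:00.050"), ("EUR/USD", "10:00:00.100")]
partitions = {partition_for(key) for key, _ in events}
assert len(partitions) == 1  # same key -> same partition -> ordered
```

Ordering holds only per key: two different instruments may land on different partitions and interleave arbitrarily, which is exactly the trade that buys parallelism.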

Replay. The data pipeline had bugs. Being able to reprocess the last 24 hours of market events from Kafka rather than from a backup was invaluable. Retention was set to 48 hours; most replays completed from hot storage.

Consumer groups with independent offsets. Three different consumers needed the same event stream: the position aggregator, the analytics pipeline, and the compliance recorder. Kafka’s consumer group model let them advance at different rates without interfering. A traditional queue would require either duplicating messages or routing, both ugly.
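A toy model of consumer-group offsets (just the bookkeeping, not the real protocol) shows why the slow compliance recorder never holds back the fast aggregator:

```python
from typing import Dict, List

class PartitionLog:
    """Minimal append-only log; each consumer group tracks its own offset."""

    def __init__(self) -> None:
        self.records: List[str] = []
        self.offsets: Dict[str, int] = {}  # group id -> next offset to read

    def produce(self, record: str) -> None:
        self.records.append(record)

    def poll(self, group: str, max_records: int) -> List[str]:
        start = self.offsets.get(group, 0)
        batch = self.records[start:start + max_records]
        self.offsets[group] = start + len(batch)  # commit after the batch
        return batch

log = PartitionLog()
for i in range(5):
    log.produce(f"event-{i}")

# The groups advance independently over the same retained records.
fast = log.poll("position-aggregator", max_records=5)
slow = log.poll("compliance-recorder", max_records=2)
assert len(fast) == 5 and len(slow) == 2
```

In a traditional queue a message is gone once consumed; here the record stays until retention expires, and “consumed” is per-group state.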

Throughput. 200k messages/sec on a 5-broker cluster is trivially within Kafka’s capability. It would have required careful engineering on most alternatives.

The Defaults That Will Hurt You

The default Kafka configuration is tuned for getting started on a single broker, not for production durability. On a small startup cluster, several defaults are actively dangerous.

default.replication.factor is 1. Topic creation without an explicit replication factor creates a topic with no redundancy: one broker failure loses the topic. Always set the replication factor to 3 for production topics. We enforced this with a startup check that listed all topics and failed if any had a replication factor below 3.
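The startup check reduces to a pure function over a topic-to-replication-factor map. In a real deployment the map would come from cluster metadata via an admin client; this sketch fakes it with a hard-coded dict:

```python
from typing import Dict, List

MIN_REPLICATION = 3

def under_replicated(topics: Dict[str, int],
                     minimum: int = MIN_REPLICATION) -> List[str]:
    """Return topic names whose replication factor is below the minimum."""
    return sorted(name for name, rf in topics.items() if rf < minimum)

# In production this dict is built from cluster metadata
# (topic name -> replica count); hard-coded here for illustration.
observed = {"market-events": 3, "loadtest-scratch": 1, "audit-log": 3}
violations = under_replicated(observed)
assert violations == ["loadtest-scratch"]  # fail startup if non-empty
```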

min.insync.replicas defaults to 1. Even with 3 replicas and acks=all on the producer, min.insync.replicas=1 means the leader acknowledges a write as soon as one replica (itself) has it. Set min.insync.replicas=2 for production: 2 of 3 replicas must have the write before the producer gets confirmation. This is the setting that actually prevents data loss on broker failure.
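The producer and topic settings work as a pair. Expressed with the Java-client property names, as plain dicts rather than a live client:

```python
# Producer side: wait for the full in-sync replica set to acknowledge.
producer_config = {
    "acks": "all",                 # leader waits for the ISR set
    "enable.idempotence": "true",  # avoid duplicates on producer retry
}

# Topic side: require at least 2 replicas in the ISR for a write to
# succeed. With fewer, producers get a NotEnoughReplicas error instead
# of the broker silently accepting an under-replicated write.
topic_config = {
    "min.insync.replicas": "2",
}
# Replication factor 3 is passed at topic creation time; it is not a
# topic config key.
```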

auto.offset.reset defaults to latest. When a new consumer group starts, it begins consuming from the latest offset, so it never sees events that arrived before its first run. For a service that needs to process historical events on first startup, this is wrong. Use earliest and plan for idempotent processing of events that may already have been processed.
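A sketch of the idempotent-processing pattern this implies, with an in-memory set standing in for a durable dedup store (the helper name and event shape are illustrative):

```python
from typing import Callable, Iterable, List, Set, Tuple

def process_idempotently(
    events: Iterable[Tuple[str, str]],   # (event_id, payload)
    handle: Callable[[str], None],
    seen: Set[str],
) -> int:
    """Apply handle() once per event id, skipping already-processed ids.

    With auto.offset.reset=earliest, a new consumer group replays the
    whole retained log, so handlers must tolerate events a previous
    incarnation already processed. `seen` stands in for durable state.
    """
    applied = 0
    for event_id, payload in events:
        if event_id in seen:
            continue
        handle(payload)
        seen.add(event_id)
        applied += 1
    return applied

out: List[str] = []
seen: Set[str] = {"e1"}  # e1 was processed before the restart
n = process_idempotently([("e1", "a"), ("e2", "b")], out.append, seen)
assert n == 1 and out == ["b"]
```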

Retention defaults to 7 days (log.retention.hours=168). Correct for most use cases, but worth being explicit. We set retention.ms per-topic: 48 hours for high-volume market data (disk cost), 30 days for audit-critical events (compliance).

Partition Count Planning

Kafka’s consumer parallelism is capped by partition count: within a consumer group, each partition is consumed by at most one thread, so more partitions means more potential parallelism. But partition count has costs:

  • More partitions = more metadata overhead on brokers
  • More partitions = more file handles per broker
  • Changing partition count is disruptive (consumers must rebalance, and the key-to-partition mapping changes, breaking per-key ordering across the repartition boundary)

Our rule: start with a partition count that gives you 3–5× the parallelism you currently need. For a consumer that needs 4 consumer threads, start with 12–20 partitions.
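The rule as a helper function. Rounding up to a multiple of the broker count spreads partition leadership evenly; that rounding is our convention, not a Kafka requirement:

```python
import math

BROKERS = 5

def plan_partitions(current_threads: int, headroom: float = 4.0,
                    brokers: int = BROKERS) -> int:
    """Partition count = current parallelism x headroom (the 3-5x rule),
    rounded up to a multiple of the broker count so each broker leads a
    similar number of partitions."""
    raw = current_threads * headroom
    return math.ceil(raw / brokers) * brokers

assert plan_partitions(4) == 20   # 4 threads x 4 headroom = 16, round to 20
```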

The market event topics: 32 partitions. This gave us up to 32 concurrent consumer threads, well ahead of our actual parallelism (8 threads). We never needed to repartition.

Avoid auto.create.topics.enable=true in production. Topics created automatically get default settings (replication factor 1, 7-day retention). Create topics explicitly with correct configuration before deploying consumers.
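A sketch of explicit topic creation: the specs are plain data so every setting is visible in one place, and the admin-client call (kafka-python’s KafkaAdminClient/NewTopic, shown only in a comment) is assumed rather than executed. Topic names and the second topic’s partition count are illustrative:

```python
# Topic specs as plain data; replication factor and retention are set
# explicitly rather than inherited from broker defaults.
TOPICS = [
    {"name": "market-events", "partitions": 32, "replication_factor": 3,
     "configs": {"retention.ms": str(48 * 60 * 60 * 1000),       # 48 h
                 "min.insync.replicas": "2"}},
    {"name": "audit-events", "partitions": 8, "replication_factor": 3,
     "configs": {"retention.ms": str(30 * 24 * 60 * 60 * 1000),  # 30 d
                 "min.insync.replicas": "2"}},
]

def create_topics(specs):
    # Sketch: with kafka-python this would be roughly
    #   admin = KafkaAdminClient(bootstrap_servers=...)
    #   admin.create_topics([NewTopic(s["name"], s["partitions"],
    #                                 s["replication_factor"],
    #                                 topic_configs=s["configs"])
    #                        for s in specs])
    # Left unexecuted here; no broker in this sketch.
    return [s["name"] for s in specs]

assert create_topics(TOPICS) == ["market-events", "audit-events"]
```

Running this kind of declarative creation in CI or at deploy time is what makes the "create topics explicitly" rule enforceable rather than aspirational.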

Consumer Lag as a First-Class Metric

Consumer lag (the number of messages a consumer group is behind the latest offset) is the most important operational metric for Kafka-based systems.

We exported consumer lag to Prometheus via kafka-consumer-groups.sh output scraped every 30 seconds, and alerted when:

  • Lag exceeded 10,000 messages (5 seconds of normal throughput) on the market data topic
  • Lag was increasing for more than 60 seconds
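The alert logic is small enough to show in full. The thresholds are the ones above and the sampling interval is the 30-second scrape; the function names are ours:

```python
from typing import List, Tuple

LAG_THRESHOLD = 10_000        # ~5 s of normal throughput on this topic
GROWTH_WINDOW_SECS = 60

def total_lag(partitions: List[Tuple[int, int]]) -> int:
    """Sum of (log-end-offset - current-offset) across partitions,
    i.e. the LAG column of kafka-consumer-groups.sh, recomputed."""
    return sum(end - current for current, end in partitions)

def should_alert(lag_samples: List[int], interval_secs: int = 30) -> bool:
    """Alert if the latest lag exceeds the threshold, or if lag has been
    strictly increasing for longer than the growth window."""
    if lag_samples and lag_samples[-1] > LAG_THRESHOLD:
        return True
    needed = GROWTH_WINDOW_SECS // interval_secs + 1
    recent = lag_samples[-needed:]
    return len(recent) == needed and all(
        a < b for a, b in zip(recent, recent[1:]))

assert total_lag([(100, 150), (200, 260)]) == 110
assert should_alert([500, 12_000])          # over threshold
assert should_alert([100, 200, 300])        # rising for > 60 s
assert not should_alert([300, 200, 100])    # draining: no alert
```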

High consumer lag means one of:

  • Consumer is too slow (CPU bottleneck, database bottleneck)
  • Producers are publishing faster than usual (market event burst)
  • Consumer is down (or replacement pods are still catching up after a restart)

The alert gave us early warning before lag accumulated to the point where the system was processing stale data.

The Rebalancing Problem

Consumer group rebalancing — when a consumer joins or leaves, Kafka redistributes partition assignments — pauses all consumption for the group during the rebalance. In a cluster where consumers restart frequently (rolling deployments, autoscaling), this creates gaps.

The fix: cooperative incremental rebalancing (partition.assignment.strategy=CooperativeStickyAssignor). Instead of revoking all partitions and reassigning from scratch, cooperative rebalancing only moves the partitions that need to move. Pauses during rebalancing drop from seconds to milliseconds for most consumers.
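The change is one consumer property, shown here with Java-client names (the class names are the real assignors from the Java client; the dicts are just illustration):

```python
# Consumer properties before and after the switch (Java client names).
eager = {
    # The eager default: revoke everything, reassign from scratch.
    "partition.assignment.strategy":
        "org.apache.kafka.clients.consumer.RangeAssignor",
}
cooperative = {
    # Incremental: only partitions that must move are revoked.
    "partition.assignment.strategy":
        "org.apache.kafka.clients.consumer.CooperativeStickyAssignor",
}
```

One operational caveat from the upgrade notes: moving a live group from an eager assignor to the cooperative one requires two rolling restarts, first adding the cooperative assignor alongside the eager one in the strategy list, then removing the eager one.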

We switched to cooperative rebalancing after a deployment triggered an 8-second consumption gap on each of 3 rolling pod restarts; the P99 latency spike on the analytics pipeline was visible in Grafana. After the switch, the same deployment caused no visible latency impact.

What We Got Wrong

Not monitoring broker disk. Kafka writes to disk, and a broker whose disk fills up stops accepting writes. We had a broker fill up because of a topic with no explicit retention policy: a load-testing topic we’d forgotten, still on the 7-day default. It had accumulated 800GB over 3 days. The broker stopped accepting writes, the topic’s partitions went offline, and two dependent services degraded. Fixed with a disk usage alert at 70% and an audit of topic retention policies.
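The retention audit that incident prompted can be a small pure function over topic configs (the topic names are from the incident; the function itself is a sketch):

```python
from typing import Dict, List, Optional

DEFAULT_RETENTION_MS = 7 * 24 * 60 * 60 * 1000  # 7-day broker default

def audit_retention(topics: Dict[str, Optional[int]]) -> List[str]:
    """Flag topics with no explicit retention.ms (None, i.e. inheriting
    the broker default) or retention at/above the 7-day default. These
    are candidates for an explicit policy or deletion."""
    return sorted(name for name, ms in topics.items()
                  if ms is None or ms >= DEFAULT_RETENTION_MS)

observed = {
    "market-events": 48 * 60 * 60 * 1000,  # 48 h, set explicitly
    "loadtest-scratch": None,              # forgotten, inherits 7 days
}
assert audit_retention(observed) == ["loadtest-scratch"]
```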

Under-sized consumer groups for burst. Our analytics consumer had 4 threads for a 32-partition topic. During a burst where message volume was 4× normal for 20 minutes, lag accumulated. The fix was elastic consumer scaling, but we’d designed the consumer pool as fixed-size.
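Back-of-envelope arithmetic for the burst: the 4× multiple and 20-minute duration are from the incident above, but the normal rate and the pool’s maximum drain rate below are hypothetical:

```python
def lag_after_burst(normal_rate: float, burst_multiple: float,
                    max_consume_rate: float, duration_secs: float) -> float:
    """Messages of lag accumulated while the produce rate exceeds what
    a fixed-size consumer pool can drain."""
    produce = normal_rate * burst_multiple
    return max(0.0, produce - max_consume_rate) * duration_secs

# Hypothetical rates: 2,000 msg/s normal, pool maxes out at 3,000 msg/s.
# A 4x burst (8,000 msg/s) for 20 minutes leaves 6 million messages of
# lag, which the pool then drains at only 1,000 msg/s of headroom.
lag = lag_after_burst(2_000, 4, 3_000, 20 * 60)
assert lag == 6_000_000
```

The asymmetry is the lesson: lag accumulates at the full overflow rate but drains only at the pool’s spare capacity, which is why elastic scaling beats a fixed-size pool here.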

Not testing broker failure. We believed our replication setup was correct (it was), but had never tested broker failure in production. We ran a chaos exercise: killed a broker and watched leader election happen (it took ~30 seconds, longer than expected, because unclean.leader.election.enable was false and we waited for an in-sync replica to take over). The exercise also revealed that one application had a connection pool that didn’t reconnect on broker failure; we fixed it before a real failure.


Kafka is worth the operational investment for the right workloads. The right workloads have at least two of: ordering requirement, replay requirement, multiple independent consumers, or very high throughput. If your workload has none of these, a simpler queue is probably better. If it has all of them, Kafka will serve you well — but not without reading the configuration documentation carefully.