In my first performance review at the trading firm, I described a component I’d optimised as “high throughput.” My manager asked what the p99 latency was. I didn’t know. He asked what happened to latency during peak throughput. I didn’t know that either. The conversation went downhill from there.
That exchange forced me to be precise about what I was actually optimising for — and why throughput and latency, while related, are fundamentally different properties.
Definitions That Actually Matter
Throughput: the number of operations completed per unit of time. Measured in messages/second, requests/second, transactions/second.
Latency: the time elapsed between a request being submitted and a response being received. Measured as a distribution — p50, p99, p999 — not as an average.
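Since the distribution is what matters, it helps to see how a tail percentile is actually pulled from a sample. A minimal sketch using only the standard library and the nearest-rank method; production systems typically use HDR histograms or t-digests so the tail survives aggregation, but sorting the raw sample is fine for small runs. The numbers below are illustrative.

```python
def percentile(samples_us, p):
    """Nearest-rank percentile of a list of latencies in microseconds."""
    ranked = sorted(samples_us)
    # Nearest rank: smallest value with at least p% of samples at or below it,
    # i.e. index ceil(n * p / 100) - 1.
    idx = max(0, -(-len(ranked) * p // 100) - 1)
    return ranked[idx]

latencies = [120, 130, 135, 140, 150, 400, 900, 125, 128, 132]
print(percentile(latencies, 50))  # 132 -- the median looks healthy
print(percentile(latencies, 99))  # 900 -- the tail tells a different story
```

Note how one 900µs outlier leaves the median untouched but dominates the p99, which is exactly why averages hide the problem.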
They’re related but not equivalent:
```
High-throughput system:         Low-latency system:
┌────────────────────────┐      ┌────────────────────────┐
│ 1M messages / second   │      │ p99 < 500µs            │
│ p99 = 40ms             │      │ throughput = 50k/s     │
└────────────────────────┘      └────────────────────────┘
High throughput does not        Low latency does not
imply low latency.              imply high throughput.
```
A system that processes 1 million messages per second with a 40ms p99 is excellent for batch processing and useless for real-time trading. A system that responds in under 500µs at the 99th percentile but saturates at 50,000 messages/second may be fine for order routing but inadequate for market data distribution.
Why They Pull in Different Directions
The techniques that maximise throughput often hurt latency, and vice versa.
Batching improves throughput by amortising per-operation overhead across many operations. A database that writes 10,000 rows in one transaction is faster per row than 10,000 individual inserts. But each row now waits for the batch to fill before being processed — adding latency proportional to batch size.
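The batching trade-off can be made concrete with a toy model: per-operation cost, a fixed per-batch overhead that gets amortised, and the time the first item spends waiting for the batch to fill at a given arrival rate. All the rates and costs here are illustrative placeholders, not measurements from any particular system.

```python
def batch_tradeoff(batch_size, per_op_us=2.0, per_batch_us=50.0,
                   arrival_rate_per_s=10_000):
    # Amortised cost per operation falls as the batch grows...
    cost_per_op_us = per_op_us + per_batch_us / batch_size
    throughput_per_s = 1e6 / cost_per_op_us
    # ...but the first item waits for the remaining (batch_size - 1) arrivals.
    fill_wait_us = (batch_size - 1) / arrival_rate_per_s * 1e6
    return throughput_per_s, fill_wait_us

for b in (1, 10, 100):
    tput, wait = batch_tradeoff(b)
    print(f"batch={b:3d}  throughput={tput:9.0f}/s  fill-wait={wait:6.0f}µs")
```

Throughput climbs with batch size while the fill wait climbs linearly alongside it, which is the trade stated above in miniature.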
Buffering smooths throughput by absorbing bursts. A bounded queue between producer and consumer lets the producer run ahead of the consumer during peaks. But every item added to the queue adds wait time — latency goes up by the queue depth times the service time per item.
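A toy discrete-time simulation shows the mechanism: a producer bursts above the consumer's service rate, the buffer absorbs the burst, and each item's wait grows with the depth it arrived behind. The rates are arbitrary round numbers chosen for the sketch.

```python
from collections import deque

def simulate(arrivals_per_tick, service_per_tick):
    """Return the wait (in ticks) experienced by each serviced item."""
    q = deque()
    waits = []
    for tick, arrivals in enumerate(arrivals_per_tick):
        for _ in range(arrivals):
            q.append(tick)                    # record arrival time
        for _ in range(min(service_per_tick, len(q))):
            waits.append(tick - q.popleft())  # ticks spent queued
    return waits

# Five ticks of burst (4/tick) against a consumer that serves 2/tick:
waits = simulate([4, 4, 4, 4, 4, 0, 0, 0, 0, 0], service_per_tick=2)
print(max(waits))  # 5 -- the worst wait grows with the accumulated depth
```

Throughput over the whole run is fine because nothing was dropped; the latency cost shows up only if you look at per-item waits.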
Compression improves throughput by reducing bytes transferred. But it adds CPU time on both ends — latency goes up by the compression/decompression cost.
Thread pools improve throughput by parallelising work. But a request that could be handled immediately now waits in a queue for an available thread — adding scheduling latency.
None of this means these techniques are wrong. It means they come with a latency cost that you have to decide is acceptable.
Little’s Law: The Relationship Formalised
Little’s Law connects the three quantities:
```
L = λ × W

L = average number of requests in the system
λ = throughput (requests per second)
W = average latency per request
```
Rearranged: W = L / λ. For a given queue depth (L), higher throughput (λ) means lower latency (W). This is the one case where improving throughput also improves latency — by processing the queue faster, each item waits less.
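The rearranged form is just a division, but it is worth plugging in numbers once. A sketch with illustrative figures:

```python
def avg_latency_s(avg_in_system, throughput_per_s):
    """Little's Law rearranged: W = L / lambda."""
    return avg_in_system / throughput_per_s

# 500 requests in flight at 50,000 req/s -> 10ms average latency.
print(avg_latency_s(500, 50_000))   # 0.01
# Same in-flight count, double the throughput -> latency halves.
print(avg_latency_s(500, 100_000))  # 0.005
```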
In practice: if your latency is high, it’s either because your throughput is low relative to arrival rate (queue is building up) or because the per-request processing time is high. Little’s Law tells you which is the bottleneck: measure queue depth alongside latency. If queue depth is low and latency is high, the processing is slow. If queue depth is growing, your throughput is insufficient for the load.
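That diagnostic can be written down as a crude decision rule. The structure follows the paragraph above; the inputs and thresholds are placeholders you would set from your own SLA and baseline measurements, not values from any real monitoring system.

```python
def diagnose(queue_depth, depth_trend, p99_latency_us, sla_us):
    """Crude triage from Little's Law: where is the latency coming from?"""
    if p99_latency_us <= sla_us:
        return "within SLA"
    if depth_trend > 0:
        return "throughput insufficient for load (queue growing)"
    if queue_depth <= 1:
        return "per-request processing is slow (no queueing)"
    return "steady queue: near saturation, latency dominated by waiting"

# High latency with an empty, stable queue points at slow processing:
print(diagnose(queue_depth=0, depth_trend=0, p99_latency_us=900, sla_us=500))
```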
The Mistake I Made
The component I’d optimised was a feed aggregator. I’d increased its throughput from 40,000 messages/second to 90,000 by batching reads from the input queue: instead of processing one message at a time, I processed 50 at a time in a tight loop.
Throughput: doubled. ✓
What I hadn’t measured: latency distribution. The batching meant the first message in a batch of 50 waited for the other 49 to arrive before being processed. At 90,000 messages/second, 50 messages take ~555µs to accumulate. I’d added 555µs of minimum latency to every message.
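The arithmetic behind that figure is a one-liner, worth having on hand when sizing any batch:

```python
def batch_fill_time_us(batch_size, rate_per_s):
    """Time for a batch to accumulate at a given arrival rate."""
    return batch_size / rate_per_s * 1e6

print(batch_fill_time_us(50, 90_000))  # ~555µs: the latency floor the batching added
```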
Our tick-to-quote SLA was 1ms p99. I’d consumed more than half the budget before we’d done any useful work.
The fix: process messages one at a time, but yield the CPU back to the producer between bursts rather than spinning. Throughput dropped to 70,000 messages/second — still faster than the original — and p99 latency was 140µs instead of 600µs.
The Right Question
Before optimising, ask: what does this system need to be good at?
| System type | Primary metric | Secondary |
|---|---|---|
| Real-time price distribution | p99 latency | Throughput sufficient to handle peaks |
| Trade reporting pipeline | Throughput | Latency (batch SLA is hours, not ms) |
| Order routing | p99.9 latency | Throughput (order rates are low) |
| Market data archival | Throughput | Latency (archival is offline) |
| Risk calculation (EOD) | Total time to completion | N/A (batch) |
| Client-facing API | p99 latency | Throughput |
The appropriate design — batching sizes, queue depths, threading model, GC tuning — flows from this decision. A system designed for throughput and a system designed for latency look different at the architecture level, not just at the micro-optimisation level.
Conflating the two metrics produces systems that are neither fast enough in the worst case nor efficient enough in the average case. Measuring both — and knowing which one your SLA actually cares about — is where the work starts.