In my first performance review at the trading firm, I described a component I’d optimised as “high throughput.” My manager asked what the p99 latency was. I didn’t know. He asked what happened to latency during peak throughput. I didn’t know that either. The conversation went downhill from there.
That exchange forced me to be precise about what I was actually optimising for — and why throughput and latency, while related, are fundamentally different properties.
Definitions That Actually Matter
Throughput: the number of operations completed per unit of time. Measured in messages/second, requests/second, transactions/second.
Latency: the time elapsed between a request being submitted and a response being received. Measured as a distribution — p50, p99, p999 — not as an average.
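Since the distribution is what matters, it helps to see how a tail percentile is actually pulled from a sample. A minimal sketch using only the standard library and the nearest-rank method; production systems typically use HDR histograms or t-digests so the tail survives aggregation, but sorting the raw sample is fine for small runs. The numbers below are illustrative.

```python
def percentile(samples_us, p):
    """Nearest-rank percentile of a list of latencies in microseconds."""
    ranked = sorted(samples_us)
    # Nearest rank: smallest value with at least p% of samples at or below it,
    # i.e. index ceil(n * p / 100) - 1.
    idx = max(0, -(-len(ranked) * p // 100) - 1)
    return ranked[idx]

latencies = [120, 130, 135, 140, 150, 400, 900, 125, 128, 132]
print(percentile(latencies, 50))  # 132 -- the median looks healthy
print(percentile(latencies, 99))  # 900 -- the tail tells a different story
```

Note how one 900µs outlier leaves the median untouched but dominates the p99, which is exactly why averages hide the problem.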
They’re related but not equivalent:
```
High-throughput system:         Low-latency system:
┌────────────────────────┐      ┌────────────────────────┐
│ 1M messages / second   │      │ p99 < 500µs            │
│ p99 = 40ms             │      │ throughput = 50k/s     │
└────────────────────────┘      └────────────────────────┘
High throughput does not        Low latency does not
imply low latency.              imply high throughput.
```
A system that processes 1 million messages per second with a 40ms p99 is excellent for batch processing and useless for real-time trading. A system that responds in under 500µs at the 99th percentile but saturates at 50,000 messages/second may be fine for order routing but inadequate for market data distribution.
Why They Pull in Different Directions
The techniques that maximise throughput often hurt latency, and vice versa.
Batching improves throughput by amortising per-operation overhead across many operations. A database that writes 10,000 rows in one transaction is faster per row than 10,000 individual inserts. But each row now waits for the batch to fill before being processed — adding latency proportional to batch size.
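The batching trade-off can be made concrete with a toy model: per-operation cost, a fixed per-batch overhead that gets amortised, and the time the first item spends waiting for the batch to fill at a given arrival rate. All the rates and costs here are illustrative placeholders, not measurements from any particular system.

```python
def batch_tradeoff(batch_size, per_op_us=2.0, per_batch_us=50.0,
                   arrival_rate_per_s=10_000):
    # Amortised cost per operation falls as the batch grows...
    cost_per_op_us = per_op_us + per_batch_us / batch_size
    throughput_per_s = 1e6 / cost_per_op_us
    # ...but the first item waits for the remaining (batch_size - 1) arrivals.
    fill_wait_us = (batch_size - 1) / arrival_rate_per_s * 1e6
    return throughput_per_s, fill_wait_us

for b in (1, 10, 100):
    tput, wait = batch_tradeoff(b)
    print(f"batch={b:3d}  throughput={tput:9.0f}/s  fill-wait={wait:6.0f}µs")
```

Throughput climbs with batch size while the fill wait climbs linearly alongside it, which is the trade stated above in miniature.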
Buffering smooths throughput by absorbing bursts. A bounded queue between producer and consumer lets the producer run ahead of the consumer during peaks. But every item added to the queue adds wait time — latency goes up by the queue depth times the service time per item.
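A toy discrete-time simulation shows the mechanism: a producer bursts above the consumer's service rate, the buffer absorbs the burst, and each item's wait grows with the depth it arrived behind. The rates are arbitrary round numbers chosen for the sketch.

```python
from collections import deque

def simulate(arrivals_per_tick, service_per_tick):
    """Return the wait (in ticks) experienced by each serviced item."""
    q = deque()
    waits = []
    for tick, arrivals in enumerate(arrivals_per_tick):
        for _ in range(arrivals):
            q.append(tick)                    # record arrival time
        for _ in range(min(service_per_tick, len(q))):
            waits.append(tick - q.popleft())  # ticks spent queued
    return waits

# Five ticks of burst (4/tick) against a consumer that serves 2/tick:
waits = simulate([4, 4, 4, 4, 4, 0, 0, 0, 0, 0], service_per_tick=2)
print(max(waits))  # 5 -- the worst wait grows with the accumulated depth
```

Throughput over the whole run is fine because nothing was dropped; the latency cost shows up only if you look at per-item waits.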
Compression improves throughput by reducing bytes transferred. But it adds CPU time on both ends — latency goes up by the compression/decompression cost.
Thread pools improve throughput by parallelising work. But a request that could be handled immediately now waits in a queue for an available thread — adding scheduling latency.
None of this means these techniques are wrong. It means they come with a latency cost that you have to decide is acceptable.
Little’s Law: The Relationship Formalised
Little’s Law connects the three quantities:
```
L = λ × W

L = average number of requests in the system
λ = throughput (requests per second)
W = average latency per request
```
Rearranged: W = L / λ. For a given queue depth (L), higher throughput (λ) means lower latency (W). This is the one case where improving throughput also improves latency — by processing the queue faster, each item waits less.
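The rearranged form is just a division, but it is worth plugging in numbers once. A sketch with illustrative figures:

```python
def avg_latency_s(avg_in_system, throughput_per_s):
    """Little's Law rearranged: W = L / lambda."""
    return avg_in_system / throughput_per_s

# 500 requests in flight at 50,000 req/s -> 10ms average latency.
print(avg_latency_s(500, 50_000))   # 0.01
# Same in-flight count, double the throughput -> latency halves.
print(avg_latency_s(500, 100_000))  # 0.005
```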
In practice: if your latency is high, it’s either because your throughput is low relative to arrival rate (queue is building up) or because the per-request processing time is high. Little’s Law tells you which is the bottleneck: measure queue depth alongside latency. If queue depth is low and latency is high, the processing is slow. If queue depth is growing, your throughput is insufficient for the load.
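That diagnostic can be written down as a crude decision rule. The structure follows the paragraph above; the inputs and thresholds are placeholders you would set from your own SLA and baseline measurements, not values from any real monitoring system.

```python
def diagnose(queue_depth, depth_trend, p99_latency_us, sla_us):
    """Crude triage from Little's Law: where is the latency coming from?"""
    if p99_latency_us <= sla_us:
        return "within SLA"
    if depth_trend > 0:
        return "throughput insufficient for load (queue growing)"
    if queue_depth <= 1:
        return "per-request processing is slow (no queueing)"
    return "steady queue: near saturation, latency dominated by waiting"

# High latency with an empty, stable queue points at slow processing:
print(diagnose(queue_depth=0, depth_trend=0, p99_latency_us=900, sla_us=500))
```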
The Mistake I Made
The component I’d optimised was a feed aggregator. I’d increased its throughput from 40,000 messages/second to 90,000 by batching reads from the input queue: instead of processing one message at a time, I processed 50 at a time in a tight loop.
Throughput: doubled. ✓
What I hadn’t measured: latency distribution. The batching meant the first message in a batch of 50 waited for the other 49 to arrive before being processed. At 90,000 messages/second, 50 messages take ~555µs to accumulate. I’d added 555µs of minimum latency to every message.
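The arithmetic behind that figure is a one-liner, worth having on hand when sizing any batch:

```python
def batch_fill_time_us(batch_size, rate_per_s):
    """Time for a batch to accumulate at a given arrival rate."""
    return batch_size / rate_per_s * 1e6

print(batch_fill_time_us(50, 90_000))  # ~555µs: the latency floor the batching added
```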
Our tick-to-quote SLA was 1ms p99. I’d consumed more than half the budget before we’d done any useful work.
The fix: process messages one at a time, but yield the CPU back to the producer between bursts rather than spinning. Throughput dropped to 70,000 messages/second — still faster than the original — and p99 latency was 140µs instead of 600µs.
The Right Question
Before optimising, ask: what does this system need to be good at?
| System type | Primary metric | Secondary |
|---|---|---|
| Real-time price distribution | p99 latency | Throughput sufficient to handle peaks |
| Trade reporting pipeline | Throughput | Latency (batch SLA is hours, not ms) |
| Order routing | p99.9 latency | Throughput (order rates are low) |
| Market data archival | Throughput | Latency (archival is offline) |
| Risk calculation (EOD) | Total time to completion | N/A (batch) |
| Client-facing API | p99 latency | Throughput |
The appropriate design — batching sizes, queue depths, threading model, GC tuning — flows from this decision. A system designed for throughput and a system designed for latency look different at the architecture level, not just at the micro-optimisation level.
Conflating the two metrics produces systems that are neither fast enough in the worst case nor efficient enough in the average case. Measuring both — and knowing which one your SLA actually cares about — is where the work starts.