I spent a morning once very proud of a benchmark showing our new order-matching path had p99 latency of 180µs, down from 340µs. It was a 47% improvement. I presented it in a team meeting. An engineer asked one question: “Is that closed-loop or open-loop?”
I didn’t know what that meant. The benchmark was worthless.
## The JVM Measurement Problem (The Easier One)
Before you can trust any Java microbenchmark, you need to handle the JVM’s measurement traps. JMH (Java Microbenchmark Harness) solves most of them:
- **JIT warm-up:** The JVM interprets bytecode initially, then JIT-compiles hot methods. A benchmark that doesn’t warm up is measuring the interpreter, not the compiled code. JMH has configurable warm-up iterations.
- **Dead code elimination:** The JIT is smart enough to recognise when a computation’s result is never used and eliminate it. Your benchmark of a “fast algorithm” may be benchmarking nothing at all. JMH requires you to consume results via `Blackhole` or by returning them.
- **Constant folding:** If your benchmark inputs are constants, the JIT may precompute the result, and your benchmark measures a memory read. Use `@State` with `@Setup` to initialise inputs that look non-constant.
- **GC interference:** Allocations in a benchmark pollute the results with GC pauses. JMH doesn’t solve this automatically — you have to design your benchmark to minimise allocation, or accept that you’re measuring allocation + GC too.
> Run with `java -jar benchmarks.jar -prof gc` to see allocation rates alongside throughput.
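The traps above map directly onto JMH annotations. Here is a minimal sketch — the class, field, and workload are illustrative, not our actual matching-path benchmark — showing warm-up configuration, non-constant input via `@State`/`@Setup`, and the two ways to defeat dead code elimination:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5, time = 1)       // measure compiled code, not the interpreter
@Measurement(iterations = 5, time = 1)
@Fork(1)
@State(Scope.Thread)
public class ExampleBenchmark {
    long input;                          // hypothetical input field

    @Setup(Level.Iteration)
    public void setup() {
        // Refreshed per iteration, so the JIT can't constant-fold it
        input = ThreadLocalRandom.current().nextLong();
    }

    @Benchmark
    public long returned() {
        return Long.rotateLeft(input, 13);      // returning defeats dead code elimination
    }

    @Benchmark
    public void consumed(Blackhole bh) {
        bh.consume(Long.rotateLeft(input, 13)); // Blackhole does the same job
    }
}
```

This sketch assumes JMH is on the classpath (the `jmh-core` artifact); it runs via the generated `benchmarks.jar`, not a plain `main`.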
## Coordinated Omission: The Harder Problem
This is what the engineer’s question was pointing at. Gil Tene named and formalised the problem; the wrk2 and HdrHistogram documentation cover it well.
A closed-loop benchmark works like this:
1. Send a request.
2. Wait for the response.
3. Record the latency.
4. Go to step 1.
The problem: if the service takes 1 second to respond, the benchmark generates 1 request/second. It never generates the load that reveals the tail latency under sustained throughput. Slow responses reduce offered load, which reduces the chance of seeing more slow responses. The benchmark coordinates with the system under test to hide the problem.
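The coordination is easy to see in code. This is a sketch, not our harness — `service` is a stand-in for the system under test — but the structure is the whole problem: the loop cannot issue the next request until the current one returns.

```java
import java.util.concurrent.locks.LockSupport;
import java.util.function.LongUnaryOperator;

public class ClosedLoop {
    // Issues requests one at a time: each send waits for the previous
    // response, so a slow service automatically throttles the offered load.
    static long[] run(LongUnaryOperator service, int requests) {
        long[] latencies = new long[requests];
        for (int i = 0; i < requests; i++) {
            long start = System.nanoTime();
            service.applyAsLong(i);                    // wait for the response...
            latencies[i] = System.nanoTime() - start;  // ...record, then loop
        }
        return latencies;
    }

    public static void main(String[] args) {
        // Hypothetical service that stalls ~1 ms per request: this loop can
        // only offer ~1,000 requests/second, whatever rate we meant to test.
        long[] l = run(req -> { LockSupport.parkNanos(1_000_000); return req; }, 50);
        System.out.println("recorded " + l.length + " latencies");
    }
}
```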
An open-loop benchmark drives requests at a target rate independent of response time:
1. At time T, send a request.
2. At time T + interval, send the next request (regardless of whether the first has returned).
3. Record the latency of each response when it arrives.
Under sustained load, open-loop benchmarks reveal the actual tail latency distribution — including queuing latency that builds when the service can’t keep up.
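An open-loop sender can be sketched with a `ScheduledExecutorService` — names and structure here are illustrative, assuming an in-process `Runnable` as the system under test. The key detail: latency is measured from each request's *intended* send time, so queuing delay lands in the tail instead of being hidden.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class OpenLoop {
    // Fires requests at a fixed target rate, without waiting for earlier
    // responses to come back.
    static long[] run(Runnable service, int ratePerSecond, int totalRequests) {
        long[] latencies = new long[totalRequests];
        AtomicInteger next = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(totalRequests);
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        ExecutorService workers = Executors.newCachedThreadPool(); // unbounded: fine for a sketch
        long periodNanos = 1_000_000_000L / ratePerSecond;
        long startNanos = System.nanoTime();
        timer.scheduleAtFixedRate(() -> {
            int i = next.getAndIncrement();
            if (i >= totalRequests) return;
            long intendedSend = startNanos + i * periodNanos;
            workers.execute(() -> {                  // dispatch without waiting
                service.run();
                // Latency from the intended send time, not the actual one:
                // if the timer or service fell behind, that delay is counted.
                latencies[i] = System.nanoTime() - intendedSend;
                done.countDown();
            });
        }, 0, periodNanos, TimeUnit.NANOSECONDS);
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            timer.shutdownNow();
            workers.shutdown();
        }
        return latencies;
    }
}
```

Measuring from `intendedSend` rather than the actual dispatch time is what keeps the sender honest when it can't keep up with its own schedule.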
## The Fix: HdrHistogram + Proper Load Generation
HdrHistogram records latency distributions without losing tail information. It stores values in buckets with configurable precision (2 or 3 significant digits is usually enough) across a range from 1µs to hours.
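A minimal recording sketch, assuming the `org.HdrHistogram` library is on the classpath (the latency values are made up for illustration). `recordValueWithExpectedInterval` is HdrHistogram's built-in coordinated-omission correction: given the interval you *intended* between requests, it back-fills the samples a stall would have suppressed.

```java
import java.util.concurrent.TimeUnit;
import org.HdrHistogram.Histogram;

public class LatencyRecorder {
    public static void main(String[] args) {
        // Track 1 ns .. 1 hour, with 3 significant digits of precision.
        Histogram histogram = new Histogram(TimeUnit.HOURS.toNanos(1), 3);

        long expectedIntervalNanos = 100_000;   // 10k req/s target rate
        long observedLatencyNanos = 1_800_000;  // one slow 1.8 ms response

        // Plain recording would log this stall as a single sample.
        // With the expected interval, HdrHistogram also records the
        // requests that should have been sent during the stall.
        histogram.recordValueWithExpectedInterval(observedLatencyNanos, expectedIntervalNanos);

        System.out.printf("p99.9 = %d ns%n", histogram.getValueAtPercentile(99.9));
        histogram.outputPercentileDistribution(System.out, 1000.0); // scale output to µs
    }
}
```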
For generating open-loop load against an HTTP service, wrk2 does this out of the box; for in-process benchmarks, write a simple ScheduledExecutorService-based sender that fires at the target rate regardless of outstanding requests.
## What Our Numbers Actually Showed
After fixing the benchmark:
| Percentile | Closed-loop | Open-loop (target: 10k/s) |
|---|---|---|
| p50 | 185 µs | 210 µs |
| p99 | 194 µs | 1.8 ms |
| p99.9 | 203 µs | 12 ms |
| p99.99 | 411 µs | 89 ms |
The p50 was roughly right. The p99.9 was 60x worse than we thought. The “47% improvement” I’d been celebrating was real at median, but the tail behaviour — what traders actually experience during volatility bursts — was invisible to our original benchmark.
The honest p99.9 led to a real fix: the queuing behaviour under sustained load was caused by a lock in the order normalisation path. Finding and removing it dropped the p99.9 to 380µs at the same throughput target.
That fix was only findable because the benchmark was finally honest.