I spent a morning once very proud of a benchmark showing our new order-matching path had p99 latency of 180µs, down from 340µs. It was a 47% improvement. I presented it in a team meeting. An engineer asked one question: “Is that closed-loop or open-loop?”
I didn’t know what that meant. The benchmark was worthless.
## The JVM Measurement Problem (The Easier One)
Before you can trust any Java microbenchmark, you need to handle the JVM’s measurement traps. JMH (Java Microbenchmark Harness) solves most of them:
- **JIT warm-up:** The JVM interprets bytecode initially, then JIT-compiles hot methods. A benchmark that doesn’t warm up is measuring the interpreter, not the compiled code. JMH has configurable warm-up iterations.
- **Dead code elimination:** The JIT is smart enough to recognise when a computation’s result is never used and eliminate it. Your benchmark of a “fast algorithm” may be benchmarking nothing at all. JMH requires you to consume results via `Blackhole` or by returning them.
- **Constant folding:** If your benchmark inputs are constants, the JIT may precompute the result, and your benchmark measures a memory read. Use `@State` with `@Setup` to initialise inputs that look non-constant.
- **GC interference:** Allocations in a benchmark pollute the results with GC pauses. JMH doesn’t solve this automatically — you have to design your benchmark to minimise allocation, or accept that you’re measuring allocation + GC too.
> Run with `java -jar benchmarks.jar -prof gc` to see allocation rates alongside throughput.
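The traps above map directly onto JMH annotations. Here is a minimal sketch — the class, field, and workload are illustrative, not our actual matching-path benchmark — showing warm-up configuration, non-constant input via `@State`/`@Setup`, and the two ways to defeat dead code elimination:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5, time = 1)       // measure compiled code, not the interpreter
@Measurement(iterations = 5, time = 1)
@Fork(1)
@State(Scope.Thread)
public class ExampleBenchmark {
    long input;                          // hypothetical input field

    @Setup(Level.Iteration)
    public void setup() {
        // Refreshed per iteration, so the JIT can't constant-fold it
        input = ThreadLocalRandom.current().nextLong();
    }

    @Benchmark
    public long returned() {
        return Long.rotateLeft(input, 13);      // returning defeats dead code elimination
    }

    @Benchmark
    public void consumed(Blackhole bh) {
        bh.consume(Long.rotateLeft(input, 13)); // Blackhole does the same job
    }
}
```

This sketch assumes JMH is on the classpath (the `jmh-core` artifact); it runs via the generated `benchmarks.jar`, not a plain `main`.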
## Coordinated Omission: The Harder Problem
This is what the engineer’s question was pointing at. Gil Tene named and formalised the problem; the wrk2 and HdrHistogram documentation cover it well.
A closed-loop benchmark works like this:
1. Send a request.
2. Wait for the response.
3. Record the latency.
4. Go to step 1.
The problem: if the service takes 1 second to respond, the benchmark generates 1 request/second. It never generates the load that reveals the tail latency under sustained throughput. Slow responses reduce offered load, which reduces the chance of seeing more slow responses. The benchmark coordinates with the system under test to hide the problem.
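The coordination is easy to see in code. This is a sketch, not our harness — `service` is a stand-in for the system under test — but the structure is the whole problem: the loop cannot issue the next request until the current one returns.

```java
import java.util.concurrent.locks.LockSupport;
import java.util.function.LongUnaryOperator;

public class ClosedLoop {
    // Issues requests one at a time: each send waits for the previous
    // response, so a slow service automatically throttles the offered load.
    static long[] run(LongUnaryOperator service, int requests) {
        long[] latencies = new long[requests];
        for (int i = 0; i < requests; i++) {
            long start = System.nanoTime();
            service.applyAsLong(i);                    // wait for the response...
            latencies[i] = System.nanoTime() - start;  // ...record, then loop
        }
        return latencies;
    }

    public static void main(String[] args) {
        // Hypothetical service that stalls ~1 ms per request: this loop can
        // only offer ~1,000 requests/second, whatever rate we meant to test.
        long[] l = run(req -> { LockSupport.parkNanos(1_000_000); return req; }, 50);
        System.out.println("recorded " + l.length + " latencies");
    }
}
```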
An open-loop benchmark drives requests at a target rate independent of response time:
1. At time T, send a request.
2. At time T + interval, send the next request (regardless of whether the first has returned).
3. Record the latency of each response when it arrives.
Under sustained load, open-loop benchmarks reveal the actual tail latency distribution — including queuing latency that builds when the service can’t keep up.
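An open-loop sender can be sketched with a `ScheduledExecutorService` — names and structure here are illustrative, assuming an in-process `Runnable` as the system under test. The key detail: latency is measured from each request's *intended* send time, so queuing delay lands in the tail instead of being hidden.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class OpenLoop {
    // Fires requests at a fixed target rate, without waiting for earlier
    // responses to come back.
    static long[] run(Runnable service, int ratePerSecond, int totalRequests) {
        long[] latencies = new long[totalRequests];
        AtomicInteger next = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(totalRequests);
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        ExecutorService workers = Executors.newCachedThreadPool(); // unbounded: fine for a sketch
        long periodNanos = 1_000_000_000L / ratePerSecond;
        long startNanos = System.nanoTime();
        timer.scheduleAtFixedRate(() -> {
            int i = next.getAndIncrement();
            if (i >= totalRequests) return;
            long intendedSend = startNanos + i * periodNanos;
            workers.execute(() -> {                  // dispatch without waiting
                service.run();
                // Latency from the intended send time, not the actual one:
                // if the timer or service fell behind, that delay is counted.
                latencies[i] = System.nanoTime() - intendedSend;
                done.countDown();
            });
        }, 0, periodNanos, TimeUnit.NANOSECONDS);
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            timer.shutdownNow();
            workers.shutdown();
        }
        return latencies;
    }
}
```

Measuring from `intendedSend` rather than the actual dispatch time is what keeps the sender honest when it can't keep up with its own schedule.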
## The Fix: HdrHistogram + Proper Load Generation
HdrHistogram records latency distributions without losing tail information. It stores values in buckets with configurable precision (2 or 3 significant digits is usually enough) across a range from 1µs to hours.
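A minimal recording sketch, assuming the `org.HdrHistogram` library is on the classpath (the latency values are made up for illustration). `recordValueWithExpectedInterval` is HdrHistogram's built-in coordinated-omission correction: given the interval you *intended* between requests, it back-fills the samples a stall would have suppressed.

```java
import java.util.concurrent.TimeUnit;
import org.HdrHistogram.Histogram;

public class LatencyRecorder {
    public static void main(String[] args) {
        // Track 1 ns .. 1 hour, with 3 significant digits of precision.
        Histogram histogram = new Histogram(TimeUnit.HOURS.toNanos(1), 3);

        long expectedIntervalNanos = 100_000;   // 10k req/s target rate
        long observedLatencyNanos = 1_800_000;  // one slow 1.8 ms response

        // Plain recording would log this stall as a single sample.
        // With the expected interval, HdrHistogram also records the
        // requests that should have been sent during the stall.
        histogram.recordValueWithExpectedInterval(observedLatencyNanos, expectedIntervalNanos);

        System.out.printf("p99.9 = %d ns%n", histogram.getValueAtPercentile(99.9));
        histogram.outputPercentileDistribution(System.out, 1000.0); // scale output to µs
    }
}
```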
For generating open-loop load against an HTTP service, wrk2 does this out of the box; for in-process benchmarks, write a simple ScheduledExecutorService-based sender that fires at the target rate regardless of outstanding requests.
## What Our Numbers Actually Showed
After fixing the benchmark:
| Percentile | Closed-loop | Open-loop (target: 10k/s) |
|---|---|---|
| p50 | 185 µs | 210 µs |
| p99 | 194 µs | 1.8 ms |
| p99.9 | 203 µs | 12 ms |
| p99.99 | 411 µs | 89 ms |
The p50 was roughly right. The p99.9 was 60x worse than we thought. The “47% improvement” I’d been celebrating was real at median, but the tail behaviour — what traders actually experience during volatility bursts — was invisible to our original benchmark.
The honest p99.9 led to a real fix: the queuing behaviour under sustained load was caused by a lock in the order normalisation path. Finding and removing it dropped the p99.9 to 380µs at the same throughput target.
That fix was only findable because the benchmark was finally honest.