If someone tells you their system has 2ms average latency, they’ve told you almost nothing useful. A system that delivers 1ms 99% of the time and 100ms 1% of the time has 2ms average latency. So does a system that delivers 2ms every single time. These behave completely differently in production.
The problem isn’t measurement frequency — it’s that averages destroy the distribution.
The Coordinated Omission Problem
Before HdrHistogram, the standard approach to latency measurement was: record start time, record end time, subtract, add to running average and maybe a few fixed percentile buckets.
This is wrong in a subtle way that Gil Tene calls “coordinated omission.”
Imagine you’re measuring service latency by issuing one request per millisecond:
t=0ms: request → 1ms response ✓
t=1ms: request → 1ms response ✓
t=2ms: request → 1ms response ✓
t=3ms: request → 1ms response ✓
...
t=999ms: request → 1ms response ✓
t=1000ms: request → 1000ms response (system stalled)
t=2000ms: (next request issued after stall resolves)
Naïve measurement records 1001 samples: 1000 × 1ms and 1 × 1000ms. Reported p99 = 1ms. But the stall at t=1000ms didn’t just affect one request — it affected every request that would have been issued during that second. 999 requests were effectively delayed by up to 1000ms, but they were never issued and never recorded.
HdrHistogram’s solution: record not just the measured latency, but the expected inter-arrival time. If you were issuing requests at 1kHz and a response takes 1000ms, you record 1000 samples of latency ranging from 1ms to 1000ms — reflecting the actual experienced latency of every request that was waiting.
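This correction is exposed in the library as recordValueWithExpectedInterval. A minimal sketch mirroring the scenario above (assumes the org.hdrhistogram:HdrHistogram artifact on the classpath; values are in milliseconds):

```java
import org.HdrHistogram.Histogram;

public class CoordinatedOmissionDemo {
    public static void main(String[] args) {
        long expectedIntervalMs = 1;  // issuing one request per millisecond
        Histogram naive = new Histogram(3_600_000L, 3);      // up to 1h, 3 sig digits
        Histogram corrected = new Histogram(3_600_000L, 3);

        // 1000 fast responses...
        for (int i = 0; i < 1000; i++) {
            naive.recordValue(1);
            corrected.recordValueWithExpectedInterval(1, expectedIntervalMs);
        }
        // ...then one 1000 ms stall.
        naive.recordValue(1000);
        // Records 1000 ms, then back-fills synthetic samples at 999 ms,
        // 998 ms, ..., 1 ms for the requests that would have been issued
        // during the stall.
        corrected.recordValueWithExpectedInterval(1000, expectedIntervalMs);

        System.out.println("naive p99     = " + naive.getValueAtPercentile(99.0) + " ms");
        System.out.println("corrected p99 = " + corrected.getValueAtPercentile(99.0) + " ms");
    }
}
```

The naïve histogram reports p99 = 1 ms; the corrected one reports a p99 close to the stall duration, which is what the waiting requests actually experienced.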
The HDR in HdrHistogram
HDR stands for High Dynamic Range. The data structure is a fixed-size array of buckets with two-level addressing: a power-of-two bucket index (coarse) and a linear sub-bucket index (fine). This gives you:
Latency range covered: 1 nanosecond to several hours
Memory footprint: ~1.8MB for 1ns resolution up to 1 hour
Precision: configurable — typically 3 significant digits
Add operation: O(1), no lock needed (with atomic variant)
The trick: latency naturally spans multiple orders of magnitude, and you need proportional precision at each scale. 10µs ± 100ns is meaningful. 10ms ± 100ns is noise — ± 10µs is what you care about. HDR gives you relative precision rather than absolute precision.
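The two-level addressing can be sketched in a few lines. This is illustrative only, not the library's exact internals (the real index math also handles a configurable unit magnitude and a half-count optimisation), but it shows why relative precision is constant: with 2048 sub-buckets, the bucket width at any value is at most value/1024.

```java
public class HdrIndexSketch {
    static final int SUB_BUCKET_BITS = 11;              // 2^11 = 2048 sub-buckets
    static final int SUB_BUCKET_COUNT = 1 << SUB_BUCKET_BITS;

    // Coarse index: which power-of-two range the value falls in.
    // Values below 2048 all land in bucket 0 and are stored exactly.
    static int bucketIndex(long value) {
        return Math.max(0, 64 - Long.numberOfLeadingZeros(value) - SUB_BUCKET_BITS);
    }

    // Fine index: linear position within that range. Shifting by the
    // bucket index discards low-order bits, so the bucket width doubles
    // with each bucket, keeping relative error below 1/1024 (~0.1%).
    static int subBucketIndex(long value, int bucket) {
        return (int) (value >>> bucket);
    }
}
```

For example, 1000 ns is stored exactly (bucket 0, sub-bucket 1000), while 5,000,000 ns lands in a bucket 2^12 = 4096 ns wide, still under 0.1% of the value.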
Compare this to a naïve approach of evenly-spaced buckets:
Naïve approach (1000 evenly-spaced buckets, 0–10ms range):
✓ 10µs resolution everywhere
✗ Cannot record anything > 10ms
✗ Fixed 10µs error means 0.1% relative precision at 10ms but 50% at 20µs; absolute precision is spent at the wrong end of the scale
HDR approach (same memory):
✓ Covers 1ns to 3600s
✓ ~0.1% relative precision at every scale
✓ p99.99 is meaningful, not just noise
Dropping It Into a Java Benchmark
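The benchmark shape behind numbers like the ones below is roughly the following. This is a sketch: doWork is a hypothetical stand-in for the operation under test, and the org.hdrhistogram:HdrHistogram artifact is assumed to be on the classpath.

```java
import org.HdrHistogram.Histogram;
import java.util.concurrent.TimeUnit;

public class LatencyBench {
    // Track 1 µs to 1 hour at 3 significant digits (~0.1% relative precision).
    static final Histogram histogram = new Histogram(TimeUnit.HOURS.toMicros(1), 3);

    static long sink;  // consumed result, so the JIT cannot elide the work

    static void doWork() {  // hypothetical operation under test
        long s = 0;
        for (int i = 0; i < 10_000; i++) s += (long) i * i;
        sink += s;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 100_000; i++) {
            long start = System.nanoTime();
            doWork();
            histogram.recordValue((System.nanoTime() - start) / 1_000);  // µs
        }
        System.out.printf("p50:    %d µs%n", histogram.getValueAtPercentile(50.0));
        System.out.printf("p90:    %d µs%n", histogram.getValueAtPercentile(90.0));
        System.out.printf("p99:    %d µs%n", histogram.getValueAtPercentile(99.0));
        System.out.printf("p99.9:  %d µs%n", histogram.getValueAtPercentile(99.9));
        System.out.printf("p99.99: %d µs%n", histogram.getValueAtPercentile(99.99));
        System.out.printf("max:    %d µs%n", histogram.getMaxValue());
    }
}
```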
The output that changes your understanding of a system:
p50: 47 µs
p90: 89 µs
p99: 312 µs
p99.9: 4,891 µs
p99.99: 23,440 µs
max: 67,108 µs
The p50 looks fine. The p99 is borderline. The p99.99 is 500× the median. If one request in ten thousand hits 23ms and your trading system issues 100,000 messages per second, you hit that tail 10 times per second. That is not a rare event; it is continuous background noise of latency spikes.
What’s Causing the Tail
In JVM systems, the usual suspects:
Latency spike source      Typical magnitude   Detection
─────────────────────────────────────────────────────────────────────
GC pause (young gen)      1–20ms              -XX:+PrintGCDetails
GC pause (full/old)       50ms–seconds        -XX:+PrintGCDetails
Safepoint operation       0.1–10ms            -XX:+PrintSafepointStatistics
JIT deoptimisation        0.1–5ms             -XX:+TraceDeoptimization
OS scheduling jitter      0.05–2ms            hard to isolate
NUMA memory access        10–100µs            perf stat
CPU frequency scaling     variable            /proc/cpuinfo, cpupower
Network stack             10–100µs            tcpdump timestamps
The process for using HdrHistogram to diagnose tail latency:
- Run the workload with HdrHistogram recording
- Identify the magnitude of the p99.9/p99.99 spikes
- Match the magnitude to likely sources from the table above
- Enable the relevant JVM flags and correlate timestamps
If p99.9 spikes correlate with GC log entries, that’s your answer. If they don’t, look at safepoints. If safepoints are clean, look at OS-level events.
Continuous Histogram Reporting in Production
HdrHistogram’s Recorder class supports thread-safe concurrent recording with snapshotting:
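A sketch of that pattern (the method names are from the HdrHistogram API; the reporting cadence and print format are illustrative):

```java
import org.HdrHistogram.Histogram;
import org.HdrHistogram.Recorder;

public class IntervalReporter {
    static final Recorder recorder = new Recorder(3);  // 3 significant digits

    // Called concurrently from any number of request-handling threads.
    static void onRequestComplete(long latencyMicros) {
        recorder.recordValue(latencyMicros);
    }

    // Called periodically from a single reporting thread. Passing the
    // previous interval histogram back in recycles its allocation;
    // pass null on the first call.
    static Histogram report(Histogram toRecycle) {
        Histogram interval = recorder.getIntervalHistogram(toRecycle);
        System.out.printf("interval p99 = %d µs (n=%d)%n",
                interval.getValueAtPercentile(99.0), interval.getTotalCount());
        return interval;
    }
}
```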
The snapshotting is designed for this pattern: one or more threads record values concurrently; a separate thread periodically takes a snapshot and resets the interval. The snapshot is self-consistent — no values span interval boundaries.
In production, I emit the full HDR histogram in compressed base64 encoding to our time-series store, which lets us reconstruct the distribution later rather than just storing the percentile values we thought to query at the time. A common regret in latency analysis: “I was storing p99 but I need to know if p99.9 was above 5ms at 14:32 last Tuesday.” With compressed histograms, you can answer that question retrospectively.
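One way to do the compressed encoding, using the library's built-in compressed byte-buffer support plus java.util.Base64 (a sketch; the write path to any particular time-series store is elided):

```java
import org.HdrHistogram.Histogram;
import java.nio.ByteBuffer;
import java.util.Base64;

public class HistogramCodec {
    // Histogram -> compressed base64 string, suitable for a metrics field.
    static String encode(Histogram h) {
        ByteBuffer buf = ByteBuffer.allocate(h.getNeededByteBufferCapacity());
        int len = h.encodeIntoCompressedByteBuffer(buf);
        byte[] bytes = new byte[len];
        buf.rewind();
        buf.get(bytes, 0, len);
        return Base64.getEncoder().encodeToString(bytes);
    }

    // Reverse: reconstruct the full distribution for retrospective queries.
    static Histogram decode(String s) throws Exception {
        ByteBuffer buf = ByteBuffer.wrap(Base64.getDecoder().decode(s));
        return Histogram.decodeFromCompressedByteBuffer(buf, 0);
    }
}
```

A decoded histogram answers any percentile query, so "was p99.9 above 5ms at 14:32?" becomes a lookup plus getValueAtPercentile, not a question you needed to anticipate at write time.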
The Benchmark Cycle
The workflow that’s become standard:
1. Implement change
2. Run benchmark with HdrHistogram, record p50/p99/p99.9/p99.99/max
3. Compare distributions, not just p50
4. If tail improved but median regressed, understand why
5. Check for coordinated omission — were you issuing at fixed rate?
6. Check for JIT warmup — are first N samples polluting the histogram?
(Use histogram.reset() after warmup period)
The last point is important: JVM benchmarks that don’t separate warmup from measurement will have the interpreter’s latency (1–50× higher than compiled) mixed into the recorded histogram. Always measure after steady-state compilation.
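A sketch of the warmup split (iteration counts are arbitrary placeholders, and workload stands in for the code under test):

```java
import org.HdrHistogram.Histogram;

public class WarmupAwareBench {
    static final Histogram histogram = new Histogram(3_600_000_000L, 3);
    static long sink;  // consumed result, so the JIT cannot elide the work

    static void workload() {  // hypothetical hot path
        long s = 0;
        for (int i = 0; i < 1_000; i++) s += (long) i * i;
        sink += s;
    }

    public static void main(String[] args) {
        // Warmup: drive the hot path long enough for the JIT to compile it.
        for (int i = 0; i < 20_000; i++) {
            long start = System.nanoTime();
            workload();
            histogram.recordValue((System.nanoTime() - start) / 1_000);
        }
        histogram.reset();  // discard interpreter/C1-dominated samples

        // Measurement: only steady-state samples reach the histogram.
        for (int i = 0; i < 100_000; i++) {
            long start = System.nanoTime();
            workload();
            histogram.recordValue((System.nanoTime() - start) / 1_000);
        }
        System.out.println("steady-state p99.9 = "
                + histogram.getValueAtPercentile(99.9) + " µs");
    }
}
```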
Measuring averages is like averaging the pixels of a photo: you get a uniform grey that resembles nothing in the original. The distribution is the signal; everything else is noise.