If someone tells you their system has 2ms average latency, they’ve told you almost nothing useful. A system that delivers 1ms 99% of the time and 100ms 1% of the time has 2ms average latency. So does a system that delivers 2ms every single time. These behave completely differently in production.
The problem isn’t measurement frequency — it’s that averages destroy the distribution.
The Coordinated Omission Problem
Before HdrHistogram, the standard approach to latency measurement was: record start time, record end time, subtract, add to running average and maybe a few fixed percentile buckets.
This is wrong in a subtle way that Gil Tene calls “coordinated omission.”
Imagine you’re measuring service latency by issuing one request per millisecond:
t=0ms: request → 1ms response ✓
t=1ms: request → 1ms response ✓
t=2ms: request → 1ms response ✓
t=3ms: request → 1ms response ✓
...
t=999ms: request → 1ms response ✓
t=1000ms: request → 1000ms response (system stalled)
t=2000ms: (next request issued after stall resolves)
Naïve measurement records 1001 samples: 1000 × 1ms and 1 × 1000ms. Reported p99 = 1ms. But the stall at t=1000ms didn’t just affect one request — it affected every request that would have been issued during that second. 999 requests were effectively delayed by up to 1000ms, but they were never issued and never recorded.
HdrHistogram’s solution: record not just the measured latency, but the expected inter-arrival time. If you were issuing requests at 1kHz and a response takes 1000ms, you record 1000 samples of latency ranging from 1ms to 1000ms — reflecting the actual experienced latency of every request that was waiting.
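This correction is exposed in the library as recordValueWithExpectedInterval. A minimal sketch mirroring the scenario above (assumes the org.hdrhistogram:HdrHistogram artifact on the classpath; values are in milliseconds):

```java
import org.HdrHistogram.Histogram;

public class CoordinatedOmissionDemo {
    public static void main(String[] args) {
        long expectedIntervalMs = 1;  // issuing one request per millisecond
        Histogram naive = new Histogram(3_600_000L, 3);      // up to 1h, 3 sig digits
        Histogram corrected = new Histogram(3_600_000L, 3);

        // 1000 fast responses...
        for (int i = 0; i < 1000; i++) {
            naive.recordValue(1);
            corrected.recordValueWithExpectedInterval(1, expectedIntervalMs);
        }
        // ...then one 1000 ms stall.
        naive.recordValue(1000);
        // Records 1000 ms, then back-fills synthetic samples at 999 ms,
        // 998 ms, ..., 1 ms for the requests that would have been issued
        // during the stall.
        corrected.recordValueWithExpectedInterval(1000, expectedIntervalMs);

        System.out.println("naive p99     = " + naive.getValueAtPercentile(99.0) + " ms");
        System.out.println("corrected p99 = " + corrected.getValueAtPercentile(99.0) + " ms");
    }
}
```

The naïve histogram reports p99 = 1 ms; the corrected one reports a p99 close to the stall duration, which is what the waiting requests actually experienced.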
The HDR in HdrHistogram
HDR stands for High Dynamic Range. The data structure is a fixed-size array of buckets with two-level addressing: a power-of-two bucket index (coarse) and a linear sub-bucket index (fine). This gives you:
Latency range covered: 1 nanosecond to several hours
Memory footprint: ~1.8MB for 1ns resolution up to 1 hour
Precision: configurable — typically 3 significant digits
Add operation: O(1), no lock needed (with atomic variant)
The trick: latency naturally spans multiple orders of magnitude, and you need proportional precision at each scale. 10µs ± 100ns is meaningful. 10ms ± 100ns is noise — ± 10µs is what you care about. HDR gives you relative precision rather than absolute precision.
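The two-level addressing can be sketched in a few lines. This is illustrative only, not the library's exact internals (the real index math also handles a configurable unit magnitude and a half-count optimisation), but it shows why relative precision is constant: with 2048 sub-buckets, the bucket width at any value is at most value/1024.

```java
public class HdrIndexSketch {
    static final int SUB_BUCKET_BITS = 11;              // 2^11 = 2048 sub-buckets
    static final int SUB_BUCKET_COUNT = 1 << SUB_BUCKET_BITS;

    // Coarse index: which power-of-two range the value falls in.
    // Values below 2048 all land in bucket 0 and are stored exactly.
    static int bucketIndex(long value) {
        return Math.max(0, 64 - Long.numberOfLeadingZeros(value) - SUB_BUCKET_BITS);
    }

    // Fine index: linear position within that range. Shifting by the
    // bucket index discards low-order bits, so the bucket width doubles
    // with each bucket, keeping relative error below 1/1024 (~0.1%).
    static int subBucketIndex(long value, int bucket) {
        return (int) (value >>> bucket);
    }
}
```

For example, 1000 ns is stored exactly (bucket 0, sub-bucket 1000), while 5,000,000 ns lands in a bucket 2^12 = 4096 ns wide, still under 0.1% of the value.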
Compare this to a naïve approach of evenly-spaced buckets:
Naïve approach (1000 evenly-spaced buckets, 0–10ms range):
✓ 10µs resolution everywhere
✗ Cannot record anything > 10ms
✗ Fixed 10µs error means 0.1% relative precision at 10ms but 50% at 20µs; absolute precision is spent at the wrong end of the scale
HDR approach (same memory):
✓ Covers 1ns to 3600s
✓ ~0.1% relative precision at every scale
✓ p99.99 is meaningful, not just noise
Dropping It Into a Java Benchmark
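The benchmark shape behind numbers like the ones below is roughly the following. This is a sketch: doWork is a hypothetical stand-in for the operation under test, and the org.hdrhistogram:HdrHistogram artifact is assumed to be on the classpath.

```java
import org.HdrHistogram.Histogram;
import java.util.concurrent.TimeUnit;

public class LatencyBench {
    // Track 1 µs to 1 hour at 3 significant digits (~0.1% relative precision).
    static final Histogram histogram = new Histogram(TimeUnit.HOURS.toMicros(1), 3);

    static long sink;  // consumed result, so the JIT cannot elide the work

    static void doWork() {  // hypothetical operation under test
        long s = 0;
        for (int i = 0; i < 10_000; i++) s += (long) i * i;
        sink += s;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 100_000; i++) {
            long start = System.nanoTime();
            doWork();
            histogram.recordValue((System.nanoTime() - start) / 1_000);  // µs
        }
        System.out.printf("p50:    %d µs%n", histogram.getValueAtPercentile(50.0));
        System.out.printf("p90:    %d µs%n", histogram.getValueAtPercentile(90.0));
        System.out.printf("p99:    %d µs%n", histogram.getValueAtPercentile(99.0));
        System.out.printf("p99.9:  %d µs%n", histogram.getValueAtPercentile(99.9));
        System.out.printf("p99.99: %d µs%n", histogram.getValueAtPercentile(99.99));
        System.out.printf("max:    %d µs%n", histogram.getMaxValue());
    }
}
```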
The output that changes your understanding of a system:
p50: 47 µs
p90: 89 µs
p99: 312 µs
p99.9: 4,891 µs
p99.99: 23,440 µs
max: 67,108 µs
The p50 looks fine. The p99 is borderline. The p99.99 is 500× the median. If one request in ten thousand hits 23ms and your trading system issues 100,000 messages per second, you hit that tail 10 times per second. That is not a rare event; it is continuous background noise of latency spikes.
What’s Causing the Tail
In JVM systems, the usual suspects:
Latency spike source      Typical magnitude   Detection
─────────────────────────────────────────────────────────────────────
GC pause (young gen)      1–20ms              -XX:+PrintGCDetails
GC pause (full/old)       50ms–seconds        -XX:+PrintGCDetails
Safepoint operation       0.1–10ms            -XX:+PrintSafepointStatistics
JIT deoptimisation        0.1–5ms             -XX:+TraceDeoptimization
OS scheduling jitter      0.05–2ms            hard to isolate
NUMA memory access        10–100µs            perf stat
CPU frequency scaling     variable            /proc/cpuinfo, cpupower
Network stack             10–100µs            tcpdump timestamps
The process for using HdrHistogram to diagnose tail latency:
- Run the workload with HdrHistogram recording
- Identify the magnitude of the p99.9/p99.99 spikes
- Match the magnitude to likely sources from the table above
- Enable the relevant JVM flags and correlate timestamps
If p99.9 spikes correlate with GC log entries, that’s your answer. If they don’t, look at safepoints. If safepoints are clean, look at OS-level events.
Continuous Histogram Reporting in Production
HdrHistogram’s Recorder class supports thread-safe concurrent recording with snapshotting:
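A sketch of that pattern (the method names are from the HdrHistogram API; the reporting cadence and print format are illustrative):

```java
import org.HdrHistogram.Histogram;
import org.HdrHistogram.Recorder;

public class IntervalReporter {
    static final Recorder recorder = new Recorder(3);  // 3 significant digits

    // Called concurrently from any number of request-handling threads.
    static void onRequestComplete(long latencyMicros) {
        recorder.recordValue(latencyMicros);
    }

    // Called periodically from a single reporting thread. Passing the
    // previous interval histogram back in recycles its allocation;
    // pass null on the first call.
    static Histogram report(Histogram toRecycle) {
        Histogram interval = recorder.getIntervalHistogram(toRecycle);
        System.out.printf("interval p99 = %d µs (n=%d)%n",
                interval.getValueAtPercentile(99.0), interval.getTotalCount());
        return interval;
    }
}
```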
The snapshotting is designed for this pattern: one or more threads record values concurrently; a separate thread periodically takes a snapshot and resets the interval. The snapshot is self-consistent — no values span interval boundaries.
In production, I emit the full HDR histogram in compressed base64 encoding to our time-series store, which lets us reconstruct the distribution later rather than just storing the percentile values we thought to query at the time. A common regret in latency analysis: “I was storing p99 but I need to know if p99.9 was above 5ms at 14:32 last Tuesday.” With compressed histograms, you can answer that question retrospectively.
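One way to do the compressed encoding, using the library's built-in compressed byte-buffer support plus java.util.Base64 (a sketch; the write path to any particular time-series store is elided):

```java
import org.HdrHistogram.Histogram;
import java.nio.ByteBuffer;
import java.util.Base64;

public class HistogramCodec {
    // Histogram -> compressed base64 string, suitable for a metrics field.
    static String encode(Histogram h) {
        ByteBuffer buf = ByteBuffer.allocate(h.getNeededByteBufferCapacity());
        int len = h.encodeIntoCompressedByteBuffer(buf);
        byte[] bytes = new byte[len];
        buf.rewind();
        buf.get(bytes, 0, len);
        return Base64.getEncoder().encodeToString(bytes);
    }

    // Reverse: reconstruct the full distribution for retrospective queries.
    static Histogram decode(String s) throws Exception {
        ByteBuffer buf = ByteBuffer.wrap(Base64.getDecoder().decode(s));
        return Histogram.decodeFromCompressedByteBuffer(buf, 0);
    }
}
```

A decoded histogram answers any percentile query, so "was p99.9 above 5ms at 14:32?" becomes a lookup plus getValueAtPercentile, not a question you needed to anticipate at write time.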
The Benchmark Cycle
The workflow that’s become standard:
1. Implement change
2. Run benchmark with HdrHistogram, record p50/p99/p99.9/p99.99/max
3. Compare distributions, not just p50
4. If tail improved but median regressed, understand why
5. Check for coordinated omission — were you issuing at fixed rate?
6. Check for JIT warmup — are first N samples polluting the histogram?
(Use histogram.reset() after warmup period)
The last point is important: JVM benchmarks that don’t separate warmup from measurement will have the interpreter’s latency (1–50× higher than compiled) mixed into the recorded histogram. Always measure after steady-state compilation.
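A sketch of the warmup split (iteration counts are arbitrary placeholders, and workload stands in for the code under test):

```java
import org.HdrHistogram.Histogram;

public class WarmupAwareBench {
    static final Histogram histogram = new Histogram(3_600_000_000L, 3);
    static long sink;  // consumed result, so the JIT cannot elide the work

    static void workload() {  // hypothetical hot path
        long s = 0;
        for (int i = 0; i < 1_000; i++) s += (long) i * i;
        sink += s;
    }

    public static void main(String[] args) {
        // Warmup: drive the hot path long enough for the JIT to compile it.
        for (int i = 0; i < 20_000; i++) {
            long start = System.nanoTime();
            workload();
            histogram.recordValue((System.nanoTime() - start) / 1_000);
        }
        histogram.reset();  // discard interpreter/C1-dominated samples

        // Measurement: only steady-state samples reach the histogram.
        for (int i = 0; i < 100_000; i++) {
            long start = System.nanoTime();
            workload();
            histogram.recordValue((System.nanoTime() - start) / 1_000);
        }
        System.out.println("steady-state p99.9 = "
                + histogram.getValueAtPercentile(99.9) + " µs");
    }
}
```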
Measuring averages is like averaging the pixels of a photo: you get a uniform grey that resembles nothing in the original. The distribution is the signal; everything else is noise.