ZGC and Shenandoah: What Low-Pause GC Means for Trading Systems

In 2015, the state of GC for latency-sensitive Java was: use G1GC, tune it carefully, accept occasional 50–200ms pauses on large heaps, and work around them with off-heap storage and careful allocation management. The conventional wisdom was that sub-10ms GC pauses required small heaps (< 4GB) or near-zero allocation on the hot path. For trading systems with large position caches, this meant either expensive off-heap engineering or living with GC latency spikes. Then the early previews of ZGC (Oracle) and Shenandoah (Red Hat) started circulating. Both claimed sub-millisecond pause times regardless of heap size. The mechanisms were different, but the implications of both were significant. ...

October 1, 2015 · 5 min · MW
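For readers who want to try the two collectors the post discusses, these are the flags as they eventually shipped (a sketch; ZGC landed as experimental in JDK 11 and Shenandoah in JDK 12 builds, both requiring the unlock flag until they left experimental status — `MyApp` is a placeholder class name):

```shell
# ZGC (experimental from JDK 11; unlock flag no longer needed from JDK 15)
java -XX:+UnlockExperimentalVMOptions -XX:+UseZGC -Xmx32g MyApp

# Shenandoah (Red Hat builds first, then JDK 12+; also experimental at first)
java -XX:+UnlockExperimentalVMOptions -XX:+UseShenandoahGC -Xmx32g MyApp
```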

Understanding Safepoints: The JVM Pauses Nobody Talks About

We’d tuned GC to near-perfection. Pause times were sub-millisecond. The p99.9 latency was still spiking to 8ms several times a day, with no GC events anywhere near those spikes. It took three weeks to find the cause: safepoints, specifically a revoke-bias operation triggered by lock patterns in a third-party library. ...

May 27, 2015 · 4 min · MW
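The diagnosis path the post hints at can be reproduced with JDK-8-era flags (a sketch; these flag names were replaced by unified logging in JDK 9, and `MyApp` is a placeholder):

```shell
# Log every stop-the-world window, not just GC pauses,
# and print per-safepoint statistics (operation name, spin/block/sync times)
java -XX:+PrintGCApplicationStoppedTime \
     -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 \
     MyApp

# If RevokeBias safepoints dominate the log, disabling biased locking
# removes that class of pause entirely
java -XX:-UseBiasedLocking MyApp
```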

Benchmarking Without Lying: JMH, Coordinated Omission, and Honest Numbers

I once spent a morning very proud of a benchmark showing our new order-matching path had p99 latency of 180µs, down from 340µs. It was a 47% improvement. I presented it in a team meeting. An engineer asked one question: “Is that closed-loop or open-loop?” I didn’t know what that meant. The benchmark was worthless. ...

October 29, 2014 · 4 min · MW
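The closed-loop trap the post alludes to is coordinated omission: when the load generator waits for each response before sending the next request, a stall hides all the requests that *should* have been issued during it. A deterministic sketch (numbers invented; the back-fill mirrors what HdrHistogram's `recordValueWithExpectedInterval` does):

```java
import java.util.ArrayList;
import java.util.List;

public class CoordinatedOmissionDemo {
    static final long INTERVAL_MS = 1; // intended request interval

    // Back-fill the latencies a stalled closed-loop generator never records:
    // the request issued at the start of the stall saw the full delay, the
    // next one would have seen 1 ms less, and so on.
    static List<Long> correct(long observedMs) {
        List<Long> samples = new ArrayList<>();
        for (long v = observedMs; v > 0; v -= INTERVAL_MS) {
            samples.add(v);
        }
        return samples;
    }

    public static void main(String[] args) {
        long stall = 100; // one 100 ms stall at a 1 ms send interval
        // Closed-loop records ONE 100 ms sample; the corrected view restores
        // the ~100 requests that should have been in flight during the stall.
        System.out.println("corrected sample count: " + correct(stall).size());
    }
}
```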

Choosing a GC Collector for Low-Latency Java: A Practical Comparison

By mid-2014 we had run CMS, G1, and Parallel GC in production on the same workload, and had evaluated Azul Zing. Here’s what we found and the decision framework that came out of it. ...

August 6, 2014 · 5 min · MW
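For reference, these are the JDK 7/8-era flags that select the three HotSpot collectors compared in the post (`MyApp` is a placeholder; Azul Zing is a separate JVM, not a flag):

```shell
java -XX:+UseConcMarkSweepGC MyApp   # CMS: concurrent, low pause, fragments
java -XX:+UseG1GC MyApp              # G1: region-based, pause-time target
java -XX:+UseParallelGC MyApp        # Parallel: max throughput, long pauses
```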

Busy Spinning vs Blocking: Thread Strategies for Ultra-Low Latency

When a thread is waiting for work — a new event, a lock to release, a signal — it has two options. It can block (tell the OS “wake me up when there’s work”) or busy-spin (loop checking a condition, never yielding the CPU). Both are correct. They have very different performance profiles. ...

May 14, 2014 · 5 min · MW
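The two waiting strategies the post contrasts can be sketched side by side (a minimal illustration, not the post's code; `Thread.onSpinWait()` is a Java 9+ CPU pause hint):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicReference;

public class WaitStrategies {
    // Blocking: hand the wait to the OS scheduler. Wake-up latency is in the
    // microseconds, but the core is free for other threads while we park.
    static <T> T takeBlocking(BlockingQueue<T> q) {
        try {
            return q.take(); // parks the thread until an element is available
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
    }

    // Busy-spin: poll in a tight loop. Reaction time is in the nanoseconds,
    // at the price of pinning one core at 100% even when nothing happens.
    static <T> T takeSpinning(AtomicReference<T> slot) {
        T v;
        while ((v = slot.getAndSet(null)) == null) {
            Thread.onSpinWait(); // hint to the CPU that we are spinning
        }
        return v;
    }
}
```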

FX Pricing Engine Architecture: From Feeds to Executable Quotes

By 2014 we had rebuilt the pricing engine twice. The first rebuild moved from a blocking queue architecture to the Disruptor. The second addressed the aggregation logic and quote distribution. This post covers the architecture that emerged — not as a blueprint, but as an account of the decisions and why we made them. ...

February 19, 2014 · 5 min · MW

Why Average Latency Is a Lie: HdrHistogram and Measuring What Matters

If someone tells you their system has 2ms average latency, they’ve told you almost nothing useful. A system that delivers 1ms 99% of the time and 100ms 1% of the time has 2ms average latency. So does a system that delivers 2ms every single time. These behave completely differently in production. The problem isn’t measurement frequency — it’s that averages destroy the distribution. ...

November 27, 2013 · 5 min · MW
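The post's two-systems example can be made concrete with a few lines (sample values invented to mirror the teaser: identical ~2ms means, completely different tails; the percentile is a simple nearest-rank calculation, not HdrHistogram itself):

```java
import java.util.Arrays;

public class AverageIsALie {
    static double mean(double[] xs) {
        return Arrays.stream(xs).average().orElse(0);
    }

    // Nearest-rank percentile computed on a sorted copy of the samples
    static double percentile(double[] xs, double p) {
        double[] sorted = xs.clone();
        Arrays.sort(sorted);
        int idx = (int) Math.ceil(p / 100.0 * sorted.length) - 1;
        return sorted[Math.min(Math.max(idx, 0), sorted.length - 1)];
    }

    public static void main(String[] args) {
        double[] spiky = new double[1000];
        Arrays.fill(spiky, 1.0);                            // 1 ms, 99% of the time...
        for (int i = 990; i < 1000; i++) spiky[i] = 100.0;  // ...100 ms, 1% of the time
        double[] steady = new double[1000];
        Arrays.fill(steady, 1.99);                          // ~2 ms every single time

        // Same mean, completely different tail behaviour
        System.out.printf("spiky  mean=%.2f p99.9=%.2f%n",
                mean(spiky), percentile(spiky, 99.9));
        System.out.printf("steady mean=%.2f p99.9=%.2f%n",
                mean(steady), percentile(steady, 99.9));
    }
}
```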

Stop-the-World GC Pauses Killed Our SLA — And What We Did About It

The incident happened at 08:31 on a Tuesday — Frankfurt open, high volatility session. Our tick-to-quote latency spiked to 340ms for about 2 seconds. The SLA was 1ms at p99. The trading desk noticed before our monitoring did. The culprit: a full GC triggered by a promotion failure. We had a 12GB heap, the CMS collector, and no one had looked at GC logs since the initial deployment. ...

November 13, 2012 · 3 min · MW
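The "no one had looked at GC logs" part is fixable with a few JDK 7/8-era flags, and promotion failure under CMS has a standard first-line mitigation (a sketch, not the fix from the post; `MyApp` is a placeholder):

```shell
# GC logging that would have caught this long before the trading desk did
java -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
     -XX:+PrintGCApplicationStoppedTime MyApp

# Common mitigation for CMS promotion failure: start the concurrent cycle
# earlier so the old generation has headroom when promotions arrive
java -XX:+UseConcMarkSweepGC \
     -XX:CMSInitiatingOccupancyFraction=65 -XX:+UseCMSInitiatingOccupancyOnly \
     MyApp
```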

Latency vs Throughput: The False Dichotomy I Learned the Hard Way

In my first performance review at the trading firm, I described a component I’d optimised as “high throughput.” My manager asked what the p99 latency was. I didn’t know. He asked what happened to latency during peak throughput. I didn’t know that either. The conversation went downhill from there. That exchange forced me to be precise about what I was actually optimising for — and why throughput and latency, while related, are fundamentally different properties. ...

September 25, 2012 · 5 min · MW

Building a Price Feed Aggregator in Java: First Attempt

Three months into the job, I was given my first substantial project: build a component that subscribes to price feeds from five external venues, aggregates them into a single best-bid-offer (BBO) view per currency pair, and distributes that view to internal consumers. The spec was one page. The first implementation took two weeks. The rewrite after I measured it took another two weeks and was 40× faster. ...

May 9, 2012 · 6 min · MW