Aeron: Reliable UDP Multicast for Market Data Distribution

Our market data distribution problem was straightforward to state and hard to solve: deliver price updates to a dozen internal consumers with sub-500µs latency at 400,000 messages/second, with no head-of-line blocking between consumers. TCP broadcasting is serial — slow consumers stall fast ones. ZeroMQ was promising, but its Java bindings showed GC pressure from buffer management. Kafka was built for durability, not microsecond latency. When Martin Thompson and Todd Montgomery open-sourced Aeron in 2014, it solved almost exactly this problem. ...

September 17, 2014 · 6 min · MW

Choosing a GC Collector for Low-Latency Java: A Practical Comparison

By mid-2014 we had run CMS, G1, and Parallel GC in production on the same workload, and had evaluated Azul Zing. Here’s what we found and the decision framework that came out of it. ...

August 6, 2014 · 5 min · MW

Slippage, Spread, and Rejection: Engineering Around Market Microstructure

A trading system that quotes a price is not the same as a trading system that executes at that price. Between the quote and the execution lies a gap — measured in time, measured in pips, measured in rejected orders — that engineering choices directly affect. Understanding slippage and rejection from the systems perspective is not just finance knowledge. It shapes how you design price feeds, order routing, and latency measurement. ...

June 25, 2014 · 5 min · MW

Busy Spinning vs Blocking: Thread Strategies for Ultra-Low Latency

When a thread is waiting for work — a new event, a lock to release, a signal — it has two options. It can block (tell the OS “wake me up when there’s work”) or busy-spin (loop checking a condition, never yielding the CPU). Both are correct. They have very different performance profiles. ...
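The two strategies fit in a few lines each. A minimal sketch (hypothetical names, not from the post; `Thread.onSpinWait()` is a JDK 9+ hint and can simply be omitted on older JVMs):

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.LockSupport;

public class WaitStrategies {
    // Busy-spin: never yield the core; wake-up is nanoseconds after the write lands,
    // at the cost of 100% CPU on that core while idle.
    static long spinUntil(AtomicLong sequence, long expected) {
        long v;
        while ((v = sequence.get()) < expected) {
            Thread.onSpinWait(); // JDK 9+ CPU hint (x86 PAUSE); harmless to remove
        }
        return v;
    }

    // Park: hand the core back to the OS; idle CPU is near zero, but wake-up
    // latency is microseconds to milliseconds, at the scheduler's discretion.
    static long parkUntil(AtomicLong sequence, long expected) {
        long v;
        while ((v = sequence.get()) < expected) {
            LockSupport.parkNanos(1_000);
        }
        return v;
    }

    // Tiny demo: a producer thread advances the sequence while a consumer spins on it.
    static long demo() {
        AtomicLong sequence = new AtomicLong(0);
        new Thread(() -> sequence.set(5)).start();
        return spinUntil(sequence, 5);
    }

    public static void main(String[] args) {
        System.out.println("consumer saw sequence " + demo());
    }
}
```

Both loops are correct; which one you pick depends on whether the core is dedicated to this thread.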

May 14, 2014 · 5 min · MW

Off-Heap Memory in Java: sun.misc.Unsafe and Chronicle Map

One of the FX trading desks kept a reference data structure in memory: all live position limits and risk parameters for every currency pair, updated in real time from a risk management system. The structure held about 40GB of data and was read by the trading engine on every price update. Putting 40GB on the Java heap was not an option. GC pauses on a 40GB heap are seconds, not milliseconds. The solution was off-heap allocation — memory that exists outside the GC’s visibility, managed explicitly by the application. ...
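The teaser names `sun.misc.Unsafe`, and the core mechanics fit in a few lines. A minimal round-trip sketch — not how the 40GB structure was actually laid out:

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class OffHeapScratch {
    private static final Unsafe UNSAFE;
    static {
        try {
            // Unsafe's constructor is private; the standard trick is reflection.
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Write then read a long in raw memory the GC never sees.
    static long roundTrip(long value) {
        long addr = UNSAFE.allocateMemory(8); // like malloc: no zeroing, no GC tracking
        try {
            UNSAFE.putLong(addr, value);
            return UNSAFE.getLong(addr);
        } finally {
            UNSAFE.freeMemory(addr); // manual lifecycle: forget this and you leak
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(42L));
    }
}
```

The price of GC invisibility is that every allocation needs a matching free, and a bad address is a segfault rather than an exception.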

April 2, 2014 · 6 min · MW

Scala on the Hot Path: Where the Abstraction Cost Goes

We were using Scala in a few non-critical components at the trading firm — utility code, configuration, some tooling. Then someone proposed moving a market data normalisation component to Scala. The component processed 800,000 messages/second and had a 500µs latency budget per message at p99. The discussion that followed taught me more about the JVM than a year of reading. ...

January 8, 2014 · 6 min · MW

Why Average Latency Is a Lie: HdrHistogram and Measuring What Matters

If someone tells you their system has 2ms average latency, they’ve told you almost nothing useful. A system that delivers 1ms 99% of the time and 100ms 1% of the time has 2ms average latency. So does a system that delivers 2ms every single time. These behave completely differently in production. The problem isn’t measurement frequency — it’s that averages destroy the distribution. ...
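The teaser's arithmetic can be checked directly. A toy 100-request sample (plain arrays, not HdrHistogram's API) shows the average hiding the tail the percentiles expose:

```java
import java.util.Arrays;

public class AverageLie {
    // 99 requests at 1 ms plus 1 request at 100 ms, already sorted.
    static long[] bimodalSampleMs() {
        long[] ms = new long[100];
        Arrays.fill(ms, 0, 99, 1);
        ms[99] = 100;
        return ms;
    }

    static double averageMs(long[] ms) {
        return Arrays.stream(ms).average().orElse(0);
    }

    // Nearest-rank percentile over a sorted sample.
    static long percentileMs(long[] sortedMs, double p) {
        int rank = (int) Math.ceil(p / 100.0 * sortedMs.length) - 1;
        return sortedMs[Math.max(rank, 0)];
    }

    public static void main(String[] args) {
        long[] ms = bimodalSampleMs();
        // Average: 0.99 * 1 + 0.01 * 100 = 1.99 ms — looks like "2 ms every time".
        System.out.println("avg  = " + averageMs(ms));
        System.out.println("p99  = " + percentileMs(ms, 99));
        System.out.println("p100 = " + percentileMs(ms, 100));
    }
}
```

The average is 1.99ms for both systems; only the max (here, the 100th percentile) reveals the 100ms outlier.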

November 27, 2013 · 5 min · MW

Order Book Implementation: Data Structures for Price-Level Aggregation

An order book is a data structure that maintains the set of outstanding buy and sell orders at each price level. For an FX market maker, maintaining a locally consistent view of the book at each LP (liquidity provider) is the foundation everything else sits on. At 100,000 book updates per second across 50 currency pairs, the implementation choices matter. ...
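One common shape for the price-level view is a sorted map keyed by integer price ticks per side. A sketch of that idea — not the post's actual implementation; a production book at this rate would avoid `TreeMap`'s boxing and node allocation on the hot path:

```java
import java.util.Comparator;
import java.util.TreeMap;

public class PriceLevelBook {
    // Bids sorted best (highest) price first; asks best (lowest) first.
    // Prices are integer ticks to avoid floating-point keys.
    final TreeMap<Long, Long> bids = new TreeMap<>(Comparator.reverseOrder());
    final TreeMap<Long, Long> asks = new TreeMap<>();

    // Apply an absolute-size update for one level; size 0 removes the level.
    void update(boolean bid, long priceTicks, long size) {
        TreeMap<Long, Long> side = bid ? bids : asks;
        if (size == 0) {
            side.remove(priceTicks);
        } else {
            side.put(priceTicks, size);
        }
    }

    long bestBid() { return bids.firstKey(); }
    long bestAsk() { return asks.firstKey(); }
}
```

With both sides sorted best-first, top-of-book is an O(1) `firstKey()` and a level delete is a single `remove()` — the operations a market data handler performs on every update.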

October 16, 2013 · 5 min · MW

Java Chronicle: Off-Heap Persistence Without Serialisation Overhead

Every trade needs to be journaled. You need a durable, ordered record of every order, fill, and state change — for risk, reconciliation, and regulatory purposes. The naive solution is a database write on the hot path. That’s a roundtrip to an external process, a network call, and often a disk fsync. It’s also hundreds of microseconds to milliseconds of latency per event. Chronicle Queue gave us persistent journaling at sub-microsecond overhead. Here’s how. ...
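Chronicle's core trick is appending fixed-layout binary records into a memory-mapped file, so a "write" is just field stores into the page cache with the kernel flushing asynchronously. A minimal sketch of that idea in plain `java.nio` — not Chronicle's actual API:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapJournal {
    final MappedByteBuffer buf;

    MmapJournal(Path file, int capacityBytes) {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // The mapping stays valid after the channel is closed.
            buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, capacityBytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Append one fill record: three raw longs, no serialisation framework in sight.
    void appendFill(long orderId, long priceTicks, long qty) {
        buf.putLong(orderId).putLong(priceTicks).putLong(qty);
    }

    // Demo: write one record to a temp file and read its price field back.
    static long demoReadBack() {
        try {
            Path p = Files.createTempFile("journal", ".dat");
            p.toFile().deleteOnExit();
            MmapJournal j = new MmapJournal(p, 4096);
            j.appendFill(1L, 10950L, 500L);
            return j.buf.getLong(8); // price field starts 8 bytes in
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Real Chronicle adds what this sketch lacks — file rolling, reader cursors, and memory-ordering guarantees for concurrent readers — but the latency story comes from this mmap-append structure.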

September 5, 2013 · 4 min · MW

JVM JIT Compilation: What the C2 Compiler Does to Your Loops

Java’s “write once, run anywhere” promise is kept by the JVM. Its performance is kept by the JIT compiler. The gap between “Java is slow” (the 1998 opinion) and “Java is competitive with C++ for many workloads” (the 2013 reality, and more so now) is almost entirely the C2 compiler. Understanding what C2 does — and when it stops doing it — matters if you’re writing performance-sensitive Java. ...

July 30, 2013 · 6 min · MW