Off-Heap Memory in Java: sun.misc.Unsafe and Chronicle Map

One of the FX trading desks kept a reference data structure in memory: all live position limits and risk parameters for every currency pair, updated in real time from a risk management system. The structure held about 40GB of data and was read by the trading engine on every price update. Putting 40GB on the Java heap was not an option. GC pauses on a 40GB heap are seconds, not milliseconds. The solution was off-heap allocation — memory that exists outside the GC’s visibility, managed explicitly by the application. ...
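The raw mechanism is only a few calls. A minimal sketch, not production code — `sun.misc.Unsafe` is unsupported API, and reading the `theUnsafe` field reflectively is the standard workaround for `Unsafe.getUnsafe()` rejecting application classes:

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Off-heap allocation via sun.misc.Unsafe: the memory lives outside the
// GC's visibility and must be freed explicitly by the application.
public class OffHeapDemo {
    static Unsafe unsafe() throws Exception {
        // Unsafe.getUnsafe() throws SecurityException for application
        // classes; read the singleton field instead.
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        return (Unsafe) f.get(null);
    }

    public static void main(String[] args) throws Exception {
        Unsafe u = unsafe();
        long addr = u.allocateMemory(1024);   // raw address, not an Object
        try {
            u.putLong(addr, 42L);             // write 8 bytes at addr
            System.out.println(u.getLong(addr)); // prints 42
        } finally {
            u.freeMemory(addr);  // no GC here; forget this and you leak
        }
    }
}
```

Libraries like Chronicle Map wrap this in a `Map` interface so application code never touches raw addresses.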

April 2, 2014 · 6 min · MW

FX Pricing Engine Architecture: From Feeds to Executable Quotes

By 2014 we had rebuilt the pricing engine twice. The first rebuild moved from a blocking queue architecture to the Disruptor. The second addressed the aggregation logic and quote distribution. This post covers the architecture that emerged — not as a blueprint, but as an account of the decisions and why we made them. ...

February 19, 2014 · 5 min · MW

Scala on the Hot Path: Where the Abstraction Cost Goes

We were using Scala in a few non-critical components at the trading firm — utility code, configuration, some tooling. Then someone proposed moving a market data normalisation component to Scala. The component processed 800,000 messages/second and had a 500µs latency budget per message at p99. The discussion that followed taught me more about the JVM than a year of reading. ...

January 8, 2014 · 6 min · MW

Why Average Latency Is a Lie: HdrHistogram and Measuring What Matters

If someone tells you their system has 2ms average latency, they’ve told you almost nothing useful. A system that delivers 1ms 99% of the time and 100ms 1% of the time has 2ms average latency. So does a system that delivers 2ms every single time. These behave completely differently in production. The problem isn’t measurement frequency — it’s that averages destroy the distribution. ...
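The arithmetic is easy to check yourself. A toy sketch (illustrative data, not a measurement):

```java
import java.util.Arrays;

// Two latency distributions with the same mean and wildly different tails.
public class AverageLie {
    static double mean(double[] samples) {
        double sum = 0;
        for (double s : samples) sum += s;
        return sum / samples.length;
    }

    public static void main(String[] args) {
        // System A: 1ms for 99% of requests, 100ms for the remaining 1%.
        double[] spiky = new double[100];
        Arrays.fill(spiky, 1.0);
        spiky[99] = 100.0;

        // System B: 2ms every single time.
        double[] flat = new double[100];
        Arrays.fill(flat, 2.0);

        System.out.printf("spiky mean = %.2fms, flat mean = %.2fms%n",
                mean(spiky), mean(flat));
        // Both means come out at ~2ms. Only the full distribution
        // (e.g. an HdrHistogram) reveals the 100ms tail in system A.
    }
}
```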

November 27, 2013 · 5 min · MW

Order Book Implementation: Data Structures for Price-Level Aggregation

An order book is a data structure that maintains the set of outstanding buy and sell orders at each price level. For an FX market maker, maintaining a locally consistent view of the book at each LP (liquidity provider) is the foundation everything else sits on. At 100,000 book updates per second across 50 currency pairs, the implementation choices matter. ...
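One common baseline (not necessarily where the post ends up) is a sorted map from price to aggregate size. A minimal sketch of one side of a book, assuming prices arrive as fixed-point long ticks:

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// Minimal price-level aggregation: one side of a book, price -> total size.
// Prices are longs (fixed-point ticks) to avoid floating-point map keys.
public class BookSide {
    // Descending order so firstKey() is the best bid; the ask side would
    // use natural (ascending) order instead.
    private final NavigableMap<Long, Long> levels =
            new TreeMap<>(Comparator.reverseOrder());

    // Apply an update: delta > 0 adds size, delta < 0 removes it.
    // A level whose aggregate size reaches zero is deleted from the map.
    public void update(long price, long delta) {
        levels.merge(price, delta, (a, b) -> {
            long next = a + b;
            return next == 0 ? null : next;
        });
    }

    public long bestPrice()        { return levels.firstKey(); }
    public long sizeAt(long price) { return levels.getOrDefault(price, 0L); }
}
```

A `TreeMap` gives O(log n) updates and allocates a node per level; at 100,000 updates/second you would likely want something flatter (an array indexed by tick offset, for instance), which is exactly the kind of trade-off the post is about.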

October 16, 2013 · 5 min · MW

Java Chronicle: Off-Heap Persistence Without Serialisation Overhead

Every trade needs to be journaled. You need a durable, ordered record of every order, fill, and state change — for risk, reconciliation, and regulatory purposes. The naive solution is a database write on the hot path. That’s a roundtrip to an external process, a network call, and often a disk fsync: milliseconds of latency per event, against a microsecond budget. Chronicle Queue gave us persistent journaling at sub-microsecond overhead. Here’s how. ...
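The underlying trick (this is the mechanism, not Chronicle's actual API, which is much richer) is appending to a memory-mapped file: a "persist" is then a plain memory store into the OS page cache, and the kernel writes pages back asynchronously. A minimal sketch:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Journaling via a memory-mapped file: appends are ordinary memory writes,
// so the hot path never blocks on I/O. Call force() only when you need a
// hard durability point.
public class MappedJournal {
    private final MappedByteBuffer buf;

    public MappedJournal(Path file, int capacity) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            // The mapping stays valid after the channel is closed.
            buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, capacity);
        }
    }

    // Length-prefixed append; returns the record's offset in the journal.
    public int append(byte[] record) {
        int offset = buf.position();
        buf.putInt(record.length);
        buf.put(record);
        return offset;
    }
}
```

The hard parts Chronicle actually solves — rolling files, concurrent readers, recovery after a crash mid-record — are all absent from this sketch.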

September 5, 2013 · 4 min · MW

JVM JIT Compilation: What the C2 Compiler Does to Your Loops

Java’s “write once, run anywhere” promise is kept by the JVM. Its performance is kept by the JIT compiler. The gap between “Java is slow” (the 1998 opinion) and “Java is competitive with C++ for many workloads” (the 2013 reality, and more so now) is almost entirely the C2 compiler. Understanding what C2 does — and when it stops doing it — matters if you’re writing performance-sensitive Java. ...

July 30, 2013 · 6 min · MW

Market Connectivity: Building a Low-Latency Feed Handler

The feed handler is where the external world becomes internal data. A FIX or binary protocol stream arrives over the network, gets parsed into typed events, and gets handed to the internal processing pipeline. Nothing downstream can be faster than the feed handler’s latency. This is the design I evolved over three iterations at the trading firm. ...

June 11, 2013 · 6 min · MW

Comparing ArrayBlockingQueue to the Disruptor: Numbers Don’t Lie

After writing about the Disruptor’s design, the obvious question is: how much faster is it, really? “Faster” is not a useful answer. Let’s look at actual numbers under controlled conditions. This is a benchmarking exercise, not a recommendation. The right data structure depends on your use case. The goal here is to understand the performance characteristics of each under different contention patterns. ...

May 22, 2013 · 4 min · MW

Disruptor Deep Dive: Memory Layout, Cache Lines, and False Sharing

The Disruptor’s performance isn’t magic. It’s the consequence of a set of deliberate memory layout decisions, each targeting a specific cache coherency problem. This post goes through those decisions one by one. ...
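The canonical example of one such decision: isolating a hot counter on its own cache line by surrounding it with padding, so writers on different cores stop invalidating each other's cached copies. A sketch assuming 64-byte cache lines (the Disruptor used padding fields like this before `@Contended` existed):

```java
// A counter padded so no other hot field shares its 64-byte cache line.
// Without padding, two counters allocated side by side can land on one
// line, and every write by one core invalidates the other core's copy
// of that line ("false sharing"), even though the data is independent.
public class PaddedCounter {
    // 7 longs = 56 bytes of padding before the value...
    protected long p1, p2, p3, p4, p5, p6, p7;
    protected volatile long value = 0;
    // ...and 7 after, so `value` sits alone regardless of its neighbours.
    protected long p9, p10, p11, p12, p13, p14, p15;

    public void increment() { value++; }  // single-writer only: not atomic
    public long get()       { return value; }
}
```

One caveat: the JVM is free to reorder fields within a class, which can defeat this naive version; the Disruptor forced the layout by splitting the padding and the value across an inheritance hierarchy.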

April 9, 2013 · 5 min · MW