Market Connectivity: Building a Low-Latency Feed Handler

The feed handler is where the external world becomes internal data. A FIX or binary protocol stream arrives over the network, is parsed into typed events, and is handed to the internal processing pipeline. Nothing downstream can be faster than the feed handler's latency. This is the design I evolved over three iterations at the trading firm. ...

June 11, 2013 · 6 min · MW

Comparing ArrayBlockingQueue to the Disruptor: Numbers Don't Lie

After writing about the Disruptor’s design, the obvious question is: how much faster is it, really? “Faster” is not a useful answer. Let’s look at actual numbers under controlled conditions. This is a benchmarking exercise, not a recommendation. The right data structure depends on your use case. The goal here is to understand the performance characteristics of each under different contention patterns. ...
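The single-threaded harness below shows the shape of such a measurement for the `ArrayBlockingQueue` side only. It is illustrative, not the post's actual benchmark: a rigorous comparison would use JMH for warmup and dead-code control, and would run the same workload through a Disruptor ring buffer for the other column of the table. Class and method names here are my own.

```java
import java.util.concurrent.ArrayBlockingQueue;

// Minimal single-producer/single-consumer throughput probe for
// ArrayBlockingQueue: push `messages` longs through a bounded queue
// between two threads and report messages per second end-to-end.
public class AbqProbe {
    public static double measure(int capacity, int messages) throws InterruptedException {
        ArrayBlockingQueue<Long> q = new ArrayBlockingQueue<>(capacity);
        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < messages; i++) q.take();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        long start = System.nanoTime();
        consumer.start();
        for (long i = 0; i < messages; i++) q.put(i);  // blocks when full
        consumer.join();
        double seconds = (System.nanoTime() - start) / 1e9;
        return messages / seconds;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.printf("ABQ: %.0f msgs/sec%n", measure(1024, 1_000_000));
    }
}
```

Even this toy version exposes the variable that matters most in the post's framing: contention pattern. Shrinking `capacity` forces producer and consumer onto the same lock far more often.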

May 22, 2013 · 4 min · MW

Disruptor Deep Dive: Memory Layout, Cache Lines, and False Sharing

The Disruptor’s performance isn’t magic. It’s the consequence of a set of deliberate memory layout decisions, each targeting a specific cache coherency problem. This post goes through those decisions one by one. ...
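To make one of those decisions concrete: manual cache-line padding in the style the Disruptor uses, where the hot field is sandwiched between padding fields declared across a class hierarchy, since HotSpot is free to reorder fields within a single class but not across superclass boundaries. A sketch that mirrors the shape of the Disruptor's `Sequence` class — the names here are mine, not the library's:

```java
// Left-hand padding: seven longs (~56 bytes) before the hot field.
class LhsPadding { protected long p1, p2, p3, p4, p5, p6, p7; }

// The hot field lives alone between the two padding blocks.
class Value extends LhsPadding { protected volatile long value; }

// Right-hand padding pushes any neighbouring object's fields onto a
// different 64-byte cache line, so two threads hammering two adjacent
// counters no longer invalidate each other's cache line (false sharing).
class PaddedCounter extends Value {
    protected long p9, p10, p11, p12, p13, p14, p15;

    long increment() { return ++value; }  // single-writer only; ++ on a volatile is not atomic
    long get()       { return value; }
}
```

The cost is ~112 bytes of padding per counter; the payoff is that a write by one core never forces a cache-line invalidation on a core working with a logically unrelated field.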

April 9, 2013 · 5 min · MW

The LMAX Disruptor: How a Ring Buffer Changed My Mental Model of Queues

We recently replaced our internal LinkedBlockingQueue-based event bus with the LMAX Disruptor. Median latency dropped by 30%. The 99th percentile dropped by more than half. The change touched about 400 lines of code. This post is about the conceptual model you need to understand why the Disruptor is fast — not just “it uses a ring buffer,” but what that actually means for your hardware. ...
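The core of that mental model fits in a few lines: a fixed array indexed by ever-increasing sequence numbers, wrapped with a bitmask instead of modulo. The toy single-producer/single-consumer ring below is a sketch of the concept, not the Disruptor's API — the real library adds batching, wait strategies, and the padding discussed elsewhere on this blog.

```java
import java.util.concurrent.atomic.AtomicLong;

// Toy SPSC ring buffer. Capacity must be a power of two so that
// (sequence & mask) maps a monotonically increasing sequence number
// onto an array slot without a division.
class SpscRing {
    private final long[] buffer;
    private final int mask;
    private final AtomicLong head = new AtomicLong(); // next slot to read
    private final AtomicLong tail = new AtomicLong(); // next slot to write

    SpscRing(int capacityPowerOfTwo) {
        buffer = new long[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    boolean offer(long v) {
        long t = tail.get();
        if (t - head.get() == buffer.length) return false; // full
        buffer[(int) (t & mask)] = v;
        tail.set(t + 1);  // volatile store publishes the slot to the consumer
        return true;
    }

    Long poll() {
        long h = head.get();
        if (h == tail.get()) return null;  // empty
        long v = buffer[(int) (h & mask)];
        head.set(h + 1);  // frees the slot for the producer
        return v;
    }
}
```

Note what is absent: no locks, no node allocation per message (unlike a linked queue), and the array is touched in strictly sequential order — exactly the access pattern hardware prefetchers reward.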

February 28, 2013 · 4 min · MW

Introduction to Lock-Free Programming in Java

Locks work. synchronized in Java is correct, well-understood, and wrong for our use case. A lock that’s contested causes a thread to block — the OS parks it, context-switches to something else, and eventually context-switches back. Each of those transitions costs microseconds. When your SLA is sub-millisecond and your hot path is called 200,000 times per second, locks are not an option. Lock-free programming replaces locks with atomic CPU instructions. The CPU handles the synchronisation at the hardware level, without OS involvement. ...
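The basic building block is compare-and-swap, exposed in Java through the `java.util.concurrent.atomic` classes. The retry loop below is the canonical lock-free shape: a thread that loses the race never blocks, it simply re-reads and tries again.

```java
import java.util.concurrent.atomic.AtomicLong;

// Canonical CAS retry loop. AtomicLong.incrementAndGet() already does
// this internally; it is spelled out here to show the mechanism.
class CasCounter {
    private final AtomicLong value = new AtomicLong();

    long incrementAndGet() {
        while (true) {
            long current = value.get();
            long next = current + 1;
            // compareAndSet compiles down to a single atomic CPU
            // instruction (LOCK CMPXCHG on x86). It succeeds only if
            // no other thread changed `value` since we read it.
            if (value.compareAndSet(current, next)) {
                return next;
            }
        }
    }
}
```

No OS park, no context switch: a failed CAS costs a cache-line round trip and one more loop iteration, which is the trade the rest of this post examines.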

January 17, 2013 · 5 min · MW

Mechanical Sympathy: Writing Java That Respects the Hardware

Martin Thompson coined the term “mechanical sympathy” — the idea that to write fast software you need to understand the machine it runs on. Not at the assembly level necessarily, but well enough to reason about what the CPU, memory hierarchy, and OS are actually doing with your code. This post is what that looks like in practice, writing Java for a system where microseconds matter. ...

December 4, 2012 · 4 min · MW

Stop-the-World GC Pauses Killed Our SLA — And What We Did About It

The incident happened at 08:31 on a Tuesday — Frankfurt open, high-volatility session. Our tick-to-quote latency spiked to 340ms for about two seconds. The SLA was 1ms at p99. The trading desk noticed before our monitoring did. The culprit: a full GC triggered by a promotion failure. We had a 12GB heap, the CMS collector, and no one had looked at the GC logs since the initial deployment. ...
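The first corrective step in any story like this is making the GC visible at all. A launch-flag sketch for a CMS-era JVM (Java 6/7 flag names; paths and heap size here are illustrative, not our production config):

```shell
java -Xms12g -Xmx12g \
     -XX:+UseConcMarkSweepGC \
     -XX:+PrintGCDetails \
     -XX:+PrintGCDateStamps \
     -XX:+PrintPromotionFailure \
     -Xloggc:/var/log/app/gc.log \
     -jar app.jar
```

With `-XX:+PrintPromotionFailure` in place, the log shows the size of the allocation that could not be promoted — which is the thread to pull on when diagnosing old-generation fragmentation under CMS.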

November 13, 2012 · 3 min · MW

Latency vs Throughput: The False Dichotomy I Learned the Hard Way

In my first performance review at the trading firm, I described a component I’d optimised as “high throughput.” My manager asked what the p99 latency was. I didn’t know. He asked what happened to latency during peak throughput. I didn’t know that either. The conversation went downhill from there. That exchange forced me to be precise about what I was actually optimising for — and why throughput and latency, while related, are fundamentally different properties. ...

September 25, 2012 · 5 min · MW

Why Your Java App Is Slow Before It Even Starts: Classloading Deep Dive

The service started in 3 seconds in development, but the first live trade after deployment took 800ms instead of the expected sub-10ms. The second trade was fine. We couldn’t reproduce it in load tests. The culprit was classloading. The trade execution path touched 47 classes that had never been loaded before. Loading them, verifying the bytecode, and running static initialisers took 800ms — once, at first use. Understanding exactly how that happens is worth the time. ...
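One mitigation is to force the hot path's classes to load — and their static initialisers to run — during startup rather than on the first live trade. A sketch, assuming a hand-maintained list of class names; the JDK classes below are stand-ins, not the real trade-path classes from the post:

```java
// Eagerly load (and initialise) a known list of classes at startup so
// the one-off classloading cost is paid before the first live request.
public class Warmup {
    static final String[] HOT_PATH_CLASSES = {
        "java.math.BigDecimal",                      // placeholder names;
        "java.util.concurrent.ConcurrentHashMap",    // substitute the real hot path
    };

    // Returns the number of classes loaded, for startup logging.
    public static int preload() {
        int loaded = 0;
        for (String name : HOT_PATH_CLASSES) {
            try {
                // initialize=true also runs static initialisers, which is
                // where much of the one-off cost hides.
                Class.forName(name, true, Warmup.class.getClassLoader());
                loaded++;
            } catch (ClassNotFoundException e) {
                throw new IllegalStateException("warmup failed for " + name, e);
            }
        }
        return loaded;
    }
}
```

This moves the cost, it does not remove it — and it does nothing for JIT compilation of those code paths, which is a separate warmup problem.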

June 18, 2012 · 5 min · MW

Building a Price Feed Aggregator in Java: First Attempt

Three months into the job, I was given my first substantial project: build a component that subscribes to price feeds from five external venues, aggregates them into a single best-bid-offer (BBO) view per currency pair, and distributes that view to internal consumers. The spec was one page. The first implementation took two weeks. The rewrite after I measured it took another two weeks and was 40× faster. ...
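The aggregation step itself is small; the hard parts the post describes are everything around it. A sketch of the core, assuming one quote per venue per pair — names are mine, and staleness handling, per-pair keying, and distribution to consumers are all omitted:

```java
import java.util.HashMap;
import java.util.Map;

// Track each venue's latest bid/offer for one currency pair and
// recompute the combined best-bid-offer (highest bid, lowest offer)
// on every update.
class BboAggregator {
    static final class Quote {
        final double bid, offer;
        Quote(double bid, double offer) { this.bid = bid; this.offer = offer; }
    }

    private final Map<String, Quote> byVenue = new HashMap<>();

    Quote onQuote(String venue, double bid, double offer) {
        byVenue.put(venue, new Quote(bid, offer));
        double bestBid = Double.NEGATIVE_INFINITY;
        double bestOffer = Double.POSITIVE_INFINITY;
        for (Quote q : byVenue.values()) {
            if (q.bid > bestBid) bestBid = q.bid;
            if (q.offer < bestOffer) bestOffer = q.offer;
        }
        return new Quote(bestBid, bestOffer);
    }
}
```

A linear scan over five venues is fine; what the first implementation got wrong (and the 40× rewrite fixed) is the kind of thing you only find by measuring — which is the post's actual subject.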

May 9, 2012 · 6 min · MW
Available for consulting: Distributed systems · Low-latency architecture · Go · LLM integration & RAG · Technical leadership
hello@turboawesome.win