Bits, Trades & Systems

Scala Akka Actors for Trading Workflows: Promises and Pitfalls

The case for Akka actors in a trading system sounds compelling: isolated mutable state (no shared memory, no locks), message-driven concurrency, built-in supervision hierarchies for fault tolerance, and location transparency for distributed deployments. We used Akka for the order lifecycle workflow layer — the component that orchestrated the state machine from order received to fill confirmed. Here’s what we learned. ...

Risk Aggregation in Real Time: Design Constraints from the Dealing Desk

Risk in a batch system is a solved problem: collect all positions, apply valuation models, sum the results, write a report. The dealing desk doesn’t want a report. They want a number that’s correct right now, updates in under a second when a trade comes in, and doesn’t go stale when a price moves. That’s a different problem. ...

Understanding Safepoints: The JVM Pauses Nobody Talks About

We’d tuned GC to near-perfection. Pause times were sub-millisecond. The p99.9 latency was still spiking to 8ms several times a day, with no GC events anywhere near those spikes. It took three weeks to find the cause: safepoints, specifically a revoke-bias operation triggered by lock patterns in a third-party library. ...

Memory-Mapped Files in Java: Chronicle and the Art of Zero-Copy I/O

Disk I/O and latency-sensitive systems don’t mix well. Or so the conventional wisdom goes. Memory-mapped files challenge that assumption: when the OS maps a file into virtual memory, writes and reads go through the OS page cache. If the working set fits in RAM (which it usually does for recent data), access times are in the hundreds of nanoseconds — comparable to normal memory access, not disk I/O. Chronicle Queue is built entirely on this foundation. Understanding what’s happening underneath explains both why it’s fast and what its failure modes are. ...

Building a Trade Blotter That Doesn't Lie Under Load

The trade blotter is the dashboard every trader stares at during the day: live positions, recent fills, P&L, risk utilisation. It’s the interface between the trading system and the humans who run it. The blotter has an interesting engineering property: it’s not in the critical path (orders execute without waiting for the blotter to update), but it has to be consistently correct and fast-updating, because traders make decisions based on what it shows. A blotter that shows stale positions leads to over-trading; one that misses fills leads to missed hedges. ...

Chronicle Queue vs Kafka: Choosing a Persistent Journal at Nanosecond Scale

By early 2015 we had both Chronicle Queue and Kafka in production — Chronicle for intra-day trade journaling, Kafka for end-of-day data pipelines. The question came up repeatedly: why not use one for both? The answer is that they solve different problems with incompatible design priorities. ...

End-of-Year Architecture Review: What Held, What Failed, What Changed

Three years into building trading systems, the end of 2014 felt like a good moment to stop and audit what we’d built. Not a full rewrite assessment — more a structured reflection on which bets paid off, which didn’t, and what the data was telling us about where the gaps were. This kind of review is undervalued in fast-moving engineering organisations. You learn a lot from production behaviour over years that you can’t learn from design docs. ...

Benchmarking Without Lying: JMH, Coordinated Omission, and Honest Numbers

I spent a morning once very proud of a benchmark showing our new order-matching path had p99 latency of 180µs, down from 340µs. It was a 47% improvement. I presented it in a team meeting. An engineer asked one question: “Is that closed-loop or open-loop?” I didn’t know what that meant. The benchmark was worthless. ...

Aeron: Reliable UDP Multicast for Market Data Distribution

Our market data distribution problem was straightforward to state and hard to solve: deliver price updates to a dozen internal consumers with sub-500µs latency at 400,000 messages/second, with no head-of-line blocking between consumers. TCP broadcasting is serial — slow consumers stall fast ones. ZeroMQ was promising but showed GC pressure from its buffer management. Kafka was built for durability, not microsecond latency. When Martin Thompson and Todd Montgomery open-sourced Aeron in 2014, it solved almost exactly this problem. ...

Choosing a GC Collector for Low-Latency Java: A Practical Comparison

By mid-2014 we had run CMS, G1, and Parallel GC in production on the same workload, and had evaluated Azul Zing. Here’s what we found and the decision framework that came out of it. ...