Three years into building trading systems, the end of 2014 felt like a good moment to stop and audit what we’d built. Not a full rewrite assessment — more a structured reflection on which bets paid off, which didn’t, and what the data was telling us about where the gaps were.

This kind of review is undervalued in fast-moving engineering organisations. You learn a lot from production behaviour over years that you can’t learn from design docs.

What Held Up

The ring-buffer messaging core. The Disruptor-based pipeline between the feed handler, normalisation, risk, and execution stages never required a significant redesign. The throughput headroom we’d built in (10× over initial requirements) absorbed two rounds of business growth without architectural change. The investment in understanding its performance characteristics paid compounding dividends.
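The handoff mechanism behind such a pipeline can be illustrated without the library itself. Below is a minimal single-producer/single-consumer ring buffer sketch — not the actual Disruptor API, just the pre-allocated-slots, monotonic-sequence idea it builds on; `SpscRing` and its methods are hypothetical names:

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal SPSC ring-buffer sketch: slots are pre-allocated (no
// per-message garbage), capacity is a power of two so index
// wrapping is a cheap bitmask, and sequences only ever increase.
final class SpscRing {
    private final long[] slots;                        // pre-allocated storage
    private final int mask;                            // capacity - 1
    private final AtomicLong head = new AtomicLong(0); // next slot to read
    private final AtomicLong tail = new AtomicLong(0); // next slot to write

    SpscRing(int capacityPow2) {
        slots = new long[capacityPow2];
        mask = capacityPow2 - 1;
    }

    /** Single writer. Returns false instead of blocking when full. */
    boolean offer(long value) {
        long t = tail.get();
        if (t - head.get() == slots.length) return false; // buffer full
        slots[(int) (t & mask)] = value;
        tail.lazySet(t + 1);                              // publish to reader
        return true;
    }

    /** Single reader. Returns null when empty. */
    Long poll() {
        long h = head.get();
        if (h == tail.get()) return null;                 // nothing published
        long v = slots[(int) (h & mask)];
        head.lazySet(h + 1);                              // free the slot
        return v;
    }
}
```

The real Disruptor adds batching, multiple consumers, and wait strategies on top of this core, but the sizing logic is the same: headroom is just extra capacity between `tail` and `head`.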

Off-heap storage for reference data. Keeping risk parameters in a Chronicle Map off-heap was initially contentious — it added complexity, and manual memory management is not idiomatic Java. Three years later it had required zero GC tuning for that data layer and had never caused a latency spike attributable to GC pressure. The complexity was one-time; the benefit was ongoing.

Separation of the feed handler from the processing pipeline. Isolating the venue connectivity (feed handler, FIX parser) from the business logic (normalisation, aggregation, routing) made each independently evolvable. We changed the normalisation logic six times in three years; the feed handler changed twice. If they’d been coupled, each business change would have risked the network layer.
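The seam between the two halves was a narrow data contract. As a sketch (the `TickSink` interface and its fields are hypothetical, not the actual internal API), the feed handler only ever sees something like:

```java
/**
 * Hypothetical seam between venue connectivity and business logic.
 * The feed handler / FIX parser publishes through this narrow
 * contract; normalisation, aggregation, and routing sit behind it,
 * so neither side depends on the other's internals.
 */
@FunctionalInterface
interface TickSink {
    void onTick(long instrumentId, double bid, double ask, long wireNanos);
}
```

With a seam this narrow, six rewrites of the normalisation logic never forced a change to the network layer, and vice versa.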

What Failed or Was Redesigned

The order routing state machine. The original design was a straightforward state machine: PENDING → SENT → ACKNOWLEDGED → FILLED or REJECTED. By 2013, the real states were PENDING → SENT → (PARTIALLY_FILLED | ACKNOWLEDGED | REJECTED | TIMED_OUT | STALE_REQUOTE) → … with about 40 valid transitions. The state machine still worked, but it had become a maintenance burden that nobody fully understood.

The redesign in mid-2014 moved to an event-sourced model: append-only event log, current state reconstructed from replay. This was more code initially but made auditing trivial and eliminated the class of bugs where the state machine got into an impossible state.
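A minimal sketch of the event-sourced idea, with hypothetical names: current state is a pure fold over an append-only log, so any question ("how did this order end up FILLED?") is answered by reading the events rather than debugging a mutable state field.

```java
import java.util.List;

// Sketch of event sourcing for order state: events are appended,
// never mutated; every derived value is reconstructed by replay.
final class OrderEvents {
    enum Status { PENDING, SENT, ACKNOWLEDGED, PARTIALLY_FILLED,
                  FILLED, REJECTED, TIMED_OUT, STALE_REQUOTE }

    /** One immutable log entry: the transition and any fill quantity. */
    record Event(Status to, long qty) {}

    /** Current status is a fold over the log — no stored mutable state. */
    static Status currentStatus(List<Event> log) {
        Status s = Status.PENDING;
        for (Event e : log) s = e.to();
        return s;
    }

    /** Derived view: total filled quantity, reconstructed by replay. */
    static long filledQty(List<Event> log) {
        long filled = 0;
        for (Event e : log)
            if (e.to() == Status.PARTIALLY_FILLED || e.to() == Status.FILLED)
                filled += e.qty();
        return filled;
    }
}
```

Because nothing is ever overwritten, an "impossible state" cannot silently persist — the log either contains a legal sequence of events or it visibly does not.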

Price staleness detection. The original implementation had three staleness rules: time-based expiry (>500ms), spread-width check (>10 pips), and spike detection (>20 pip deviation). Each was added reactively after an incident. By 2014 we had eight rules, implemented as a chain of if/else blocks, with no single engineer understanding all of them.

Replaced with a configurable validator with named rules, per-rule thresholds in config, and an explicit “why was this price rejected” output. Far less code, far more visibility.
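The shape of that validator can be sketched as follows. The class name, the fluent API, and the thresholds shown are illustrative assumptions, not the production rules:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Sketch of a rule-based price validator: each staleness rule has a
// name, rules are evaluated in insertion order, and rejection comes
// with the list of rule names that failed.
final class PriceValidator {
    /** Illustrative quote shape for the sketch. */
    record Quote(double bid, double ask, long ageMillis) {}

    private final Map<String, Predicate<Quote>> rules = new LinkedHashMap<>();

    /** Register a named rule; the predicate returns true to accept. */
    PriceValidator addRule(String name, Predicate<Quote> accepts) {
        rules.put(name, accepts);
        return this;
    }

    /** The explicit "why was this price rejected" output:
        names of every rule the quote failed (empty = accepted). */
    List<String> rejectionReasons(Quote q) {
        List<String> failed = new ArrayList<>();
        for (var e : rules.entrySet())
            if (!e.getValue().test(q)) failed.add(e.getKey());
        return failed;
    }
}
```

Per-rule thresholds then live in configuration that builds the rule chain at startup, so adding a ninth rule is one named entry rather than another branch in an if/else ladder.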

Synchronous risk checks on the hot path. Early on, the risk check (does this order breach position limits?) was synchronous: the order couldn’t proceed until risk responded. The risk service was fast (sub-millisecond 99% of the time) but during heavy load it occasionally took 5–10ms, adding directly to order latency.

Redesigned as a pre-computed limit table in shared memory (Chronicle Map), checked with a single off-heap read on the hot path. The risk service updates the table asynchronously. Order latency became independent of risk service latency.
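The hot-path read can be illustrated with a plain direct `ByteBuffer` standing in for Chronicle Map (which provides a much richer keyed, persisted, cross-process API). `OffHeapLimits` and its slot-per-instrument layout are hypothetical:

```java
import java.nio.ByteBuffer;

// Sketch of a pre-computed limit table held off the Java heap:
// the GC never scans or moves it, and the hot-path check is a
// single indexed read with no call into the risk service.
final class OffHeapLimits {
    private final ByteBuffer buf; // direct (off-heap) storage

    OffHeapLimits(int instruments) {
        buf = ByteBuffer.allocateDirect(instruments * Double.BYTES);
    }

    /** Called asynchronously by the risk service to refresh a limit. */
    void put(int instrumentId, double limit) {
        buf.putDouble(instrumentId * Double.BYTES, limit);
    }

    /** Hot path: one off-heap read, independent of risk-service latency. */
    double get(int instrumentId) {
        return buf.getDouble(instrumentId * Double.BYTES);
    }
}
```

The trade-off is staleness: the hot path sees the limit as of the last asynchronous update, which is acceptable for position limits that change on a much slower timescale than order flow.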

The Numbers That Drove Decisions

Metric                    2012 baseline   2014 current   Target
───────────────────────────────────────────────────────────────
Wire-to-consumer p50      800µs           44µs           <100µs
Wire-to-consumer p99      3,400µs         180µs          <500µs
Order rejection rate      8.1%            2.3%           <3%
Failed order rate         0.9%            0.1%           <0.2%
GC pauses >10ms/hour      4.2             0.1            <1

The rejection rate improvement was the most significant business outcome. Fewer rejections meant more trades executed at intended prices, directly improving P&L. The architectural change that drove it: real-time price freshness tracking that prevented orders from being sent on quotes more than 200ms old.

What I’d Do Differently

More explicit data contracts between components earlier. We had implicit contracts (the feed handler produces a certain format, the normaliser expects it) that were fine until they weren’t. The first time two teams touched a shared data structure, we discovered the contract only existed in our heads. Avro schemas or Protocol Buffers from the start would have prevented three integration incidents.
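For illustration, a hypothetical Protocol Buffers contract for the normalised tick — the field names and types shown are invented, but the point stands: the contract lives in a schema file both teams compile against, not in anyone's head.

```proto
// Hypothetical schema: the feed-handler → normaliser contract
// made explicit and versioned, instead of implicit in two codebases.
syntax = "proto3";

message NormalisedTick {
  int64  instrument_id = 1;
  double bid           = 2;
  double ask           = 3;
  int64  wire_nanos    = 4; // receive timestamp at the feed handler
}
```

A change to the structure then shows up as a schema diff in review, rather than as an integration incident in production.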

Instrument more aggressively from the start. Every latency investigation began with “can we add metrics to X?” We should have started with comprehensive instrumentation and tuned it down, rather than starting minimal and adding reactively. The overhead of more metrics is small; the cost of not having them during an incident is high.

Less code. Not less functionality — less code. By 2014 the codebase had accumulated a lot of abstractions that seemed useful at creation time but were used exactly once. Each added complexity without adding capability. The simplest version of each component — written with a full understanding of the requirements — would have been smaller and easier to maintain.


This kind of annual review became a team ritual. Not for planning (we had separate roadmap processes) but for calibration — checking whether the picture we had in our heads of the system matched what production data showed. The gap between the two is always instructive.