Three months into the job, I was given my first substantial project: build a component that subscribes to price feeds from five external venues, aggregates them into a single best-bid-offer (BBO) view per currency pair, and distributes that view to internal consumers.
The spec was one page. The first implementation took two weeks. The rewrite after I measured it took another two weeks and was 40× faster.
The Problem in Plain Terms
Each venue sends a stream of price updates:
EBS: EUR/USD 1.28443/1.28453 (bid/ask)
Reuters: EUR/USD 1.28441/1.28451
Currenex: EUR/USD 1.28445/1.28455
LMAX: EUR/USD 1.28442/1.28452
HotSpot: EUR/USD 1.28440/1.28450
The aggregator maintains the current best prices:
- Best bid: 1.28445 (Currenex)
- Best ask: 1.28450 (HotSpot)
When a venue updates its price, recalculate the BBO and publish if it changed.
Input rate: ~5,000 updates/second across all pairs and venues. Latency target: update the BBO view within 500µs of receiving an input price.
Version 1: The Obvious Implementation
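The original listing didn't survive, so here is a minimal sketch consistent with the description that follows: ConcurrentHashMap for venue prices, CopyOnWriteArrayList for listeners, streams for the BBO. The class and method names (PriceAggregator, onPriceUpdate, computeBBO) are my reconstruction, not the original code.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

class Price {
    final double bid, ask;
    Price(double bid, double ask) { this.bid = bid; this.ask = ask; }
}

class BBO {
    final double bestBid, bestAsk;
    BBO(double bestBid, double bestAsk) { this.bestBid = bestBid; this.bestAsk = bestAsk; }
}

interface BBOListener {
    void onBBO(String pair, BBO bbo);
}

class PriceAggregator {
    // pair -> (venue -> latest price)
    private final Map<String, Map<String, Price>> prices = new ConcurrentHashMap<>();
    private final Map<String, BBO> lastBBO = new ConcurrentHashMap<>();
    private final List<BBOListener> listeners = new CopyOnWriteArrayList<>();

    void addListener(BBOListener l) { listeners.add(l); }

    void onPriceUpdate(String pair, String venue, double bid, double ask) {
        Map<String, Price> venuePrices =
                prices.computeIfAbsent(pair, p -> new ConcurrentHashMap<>());
        venuePrices.put(venue, new Price(bid, ask)); // one allocation per update
        BBO bbo = computeBBO(venuePrices);           // another allocation
        lastBBO.put(pair, bbo);
        listeners.forEach(l -> l.onBBO(pair, bbo));  // publish unconditionally
    }

    private BBO computeBBO(Map<String, Price> venuePrices) {
        // two stream passes over the same values
        double bestBid = venuePrices.values().stream()
                .mapToDouble(p -> p.bid).max().orElse(Double.NaN);
        double bestAsk = venuePrices.values().stream()
                .mapToDouble(p -> p.ask).min().orElse(Double.NaN);
        return new BBO(bestBid, bestAsk);
    }

    BBO bbo(String pair) { return lastBBO.get(pair); }
}
```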
This works. It’s thread-safe. ConcurrentHashMap handles concurrent venue updates. CopyOnWriteArrayList handles concurrent listener registration. Streams compute the BBO.
And it’s slow.
What Was Wrong With It
Running this under a JMH benchmark at 5,000 messages/second with 10 pairs and 5 venues showed:
p50: 180µs
p99: 2,100µs
p99.9: 15,000µs
500µs target, 2,100µs p99. Not close.
The profiler showed the problems clearly:
1. Object allocation on every price update
new Price(bid, ask) — one allocation per update. new BBO(bestBid, bestAsk) — another allocation. At 5,000/second, that’s 10,000 objects/second going straight to the young generation. GC pressure started showing in the histogram above 1ms.
2. CopyOnWriteArrayList for listeners
CopyOnWriteArrayList copies the underlying array on every write (registration/deregistration). For reads it’s fine — no locking. But I was calling forEach on it for every price update. The read itself was safe and fast, but the class creates a Spliterator for the forEach, adding an allocation.
3. Stream allocations in computeBBO
Each .stream() call creates a Stream object and a Spliterator. .mapToDouble() creates another. At 5,000 calls/second these are tiny allocations but they add up. More importantly, the stream approach iterates the values twice (once for max bid, once for min ask) when a single pass would do.
4. ConcurrentHashMap lock contention
With five venue threads all writing to the same pair’s inner map, there was visible lock contention in the profiler’s synchronisation view.
Version 2: The Performance-Aware Rewrite
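This listing is also lost, so here is a sketch along the lines the bullet points below describe: flat double arrays, index arithmetic, a single pass, change detection, and a fixed listener array. Names are again my invention, and concurrency handling (e.g. a single writer thread per pair) is omitted for brevity.

```java
import java.util.Arrays;

class FastAggregator {
    static final int NUM_VENUES = 5;

    interface Listener {
        void onBBO(int pairId, double bestBid, double bestAsk);
    }

    // flat arrays indexed by pairId * NUM_VENUES + venueId
    private final double[] bids, asks;
    // current BBO per pair
    private final double[] bestBids, bestAsks;
    private final Listener[] listeners; // fixed at construction

    FastAggregator(int numPairs, Listener[] listeners) {
        bids = new double[numPairs * NUM_VENUES];
        asks = new double[numPairs * NUM_VENUES];
        Arrays.fill(bids, Double.NaN); // NaN marks venues with no price yet
        Arrays.fill(asks, Double.NaN);
        bestBids = new double[numPairs];
        bestAsks = new double[numPairs];
        this.listeners = listeners;
    }

    void onPriceUpdate(int pairId, int venueId, double bid, double ask) {
        int idx = pairId * NUM_VENUES + venueId; // index arithmetic, no map lookup
        bids[idx] = bid;
        asks[idx] = ask;

        // single pass over this pair's venues
        double bb = Double.NEGATIVE_INFINITY, ba = Double.POSITIVE_INFINITY;
        int base = pairId * NUM_VENUES;
        for (int v = 0; v < NUM_VENUES; v++) {
            double b = bids[base + v], a = asks[base + v];
            if (b > bb) bb = b; // NaN comparisons are false, so empty slots are skipped
            if (a < ba) ba = a;
        }

        // change detection: only notify if the BBO actually moved
        if (bb == bestBids[pairId] && ba == bestAsks[pairId]) return;
        bestBids[pairId] = bb;
        bestAsks[pairId] = ba;
        for (Listener l : listeners) l.onBBO(pairId, bb, ba); // plain array, no iterator
    }
}
```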
The changes:
- Primitive double arrays instead of objects — no allocation per price update, no GC pressure, better cache locality
- Pre-computed index arithmetic — pairId * NUM_VENUES + venueId instead of map lookups for the hot path
- Single-pass BBO computation — one loop, not two streams
- Change detection — only notify listeners if BBO actually changed (~30% of updates change the BBO in practice)
- Fixed listener array — no allocations for iteration
The string-to-int mapping (pairIndex, venueIndex) happens at the venue connection boundary, outside the hot path.
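That boundary mapping can be as simple as interning each symbol into a dense int index at subscription time — a hypothetical sketch (the name SymbolRegistry and its shape are illustrative, not the original code):

```java
import java.util.HashMap;
import java.util.Map;

// Resolves pair/venue names to dense int indices once, at the venue
// connection boundary. The hot path then works purely with ints.
class SymbolRegistry {
    private final Map<String, Integer> ids = new HashMap<>();

    // Not called on the hot path; allocation here is fine.
    int idFor(String name) {
        return ids.computeIfAbsent(name, n -> ids.size());
    }
}
```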
Results
Version 1 (stream/map):
p50: 180µs p99: 2,100µs p99.9: 15,000µs
Allocation rate: 480 MB/s
Version 2 (primitive arrays):
p50: 4µs p99: 42µs p99.9: 180µs
Allocation rate: 0 MB/s (zero hot-path allocation)
40× improvement at p50. 50× at p99. The p99.9 improvement is even larger because eliminating GC pressure removed the GC-induced spikes.
What This Taught Me
The version 1 code is the kind of code a capable Java developer writes naturally. It’s readable, idiomatic, and correct. The version 2 code requires knowing specific things:
- That objects in the hot path cause GC pressure at high rates
- That Java streams allocate objects per call
- That double[] arrays are faster to iterate than List&lt;Double&gt; because of cache layout
- That early exit on “no change” is worth the branch
None of this is obvious from the language. It comes from measuring, understanding how the JVM allocates and collects memory, and knowing how the hardware works.
The broader lesson: performance-sensitive systems require different design thinking from the start. Version 2 isn’t a refactored version 1 — it’s a different design that happens to solve the same problem. The right time to choose the design is before writing version 1.