Three months into the job, I was given my first substantial project: build a component that subscribes to price feeds from five external venues, aggregates them into a single best-bid-offer (BBO) view per currency pair, and distributes that view to internal consumers.

The spec was one page. The first implementation took two weeks. The rewrite after I measured it took another two weeks and was 40× faster.

The Problem in Plain Terms

Each venue sends a stream of price updates:

EBS:       EUR/USD  1.28443/1.28453  (bid/ask)
Reuters:   EUR/USD  1.28441/1.28451
Currenex:  EUR/USD  1.28445/1.28455
LMAX:      EUR/USD  1.28442/1.28452
HotSpot:   EUR/USD  1.28440/1.28450

The aggregator maintains the current best prices:

  • Best bid: 1.28445 (Currenex)
  • Best ask: 1.28450 (HotSpot)

When a venue updates its price, recalculate the BBO and publish if it changed.

Input rate: ~5,000 updates/second across all pairs and venues. Latency target: update the BBO view within 500µs of receiving an input price.
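Concretely, the aggregation rule over the quote table above can be sketched like this (class and method names are mine; the quote values come straight from the table):

```java
// Minimal illustration of the BBO rule: best bid is the max of the venue
// bids, best ask is the min of the venue asks.
public class BboSketch {
    static double[] computeBbo(double[] bids, double[] asks) {
        double bestBid = Double.NEGATIVE_INFINITY;
        double bestAsk = Double.POSITIVE_INFINITY;
        for (int i = 0; i < bids.length; i++) {
            if (bids[i] > bestBid) bestBid = bids[i];
            if (asks[i] < bestAsk) bestAsk = asks[i];
        }
        return new double[] { bestBid, bestAsk };
    }

    public static void main(String[] args) {
        // Quotes from the table above: EBS, Reuters, Currenex, LMAX, HotSpot
        double[] bids = {1.28443, 1.28441, 1.28445, 1.28442, 1.28440};
        double[] asks = {1.28453, 1.28451, 1.28455, 1.28452, 1.28450};
        double[] bbo = computeBbo(bids, asks);
        System.out.println(bbo[0] + "/" + bbo[1]);
    }
}
```

The best bid comes from Currenex and the best ask from HotSpot; everything else in the system is plumbing around this one rule.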

Version 1: The Obvious Implementation

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

public class PriceAggregator {
    // pair → (venue → latest price)
    private final Map<String, Map<String, Price>> prices =
        new ConcurrentHashMap<>();

    private final List<PriceListener> listeners = new CopyOnWriteArrayList<>();

    public void addListener(PriceListener listener) {
        listeners.add(listener);
    }

    public void onPrice(String venue, String pair, double bid, double ask) {
        prices.computeIfAbsent(pair, k -> new ConcurrentHashMap<>())
              .put(venue, new Price(bid, ask));

        BBO bbo = computeBBO(pair);
        listeners.forEach(l -> l.onBBO(pair, bbo));
    }

    private BBO computeBBO(String pair) {
        Map<String, Price> byVenue = prices.get(pair);
        double bestBid = byVenue.values().stream()
            .mapToDouble(p -> p.bid)
            .max().orElse(0);
        double bestAsk = byVenue.values().stream()
            .mapToDouble(p -> p.ask)
            .min().orElse(Double.MAX_VALUE);
        return new BBO(bestBid, bestAsk);
    }
}
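The spec didn't define the value types; version 1 assumed small immutable classes along these lines (my reconstruction, matching the fields and listener signature the code above uses):

```java
// Reconstructed support types for version 1. Each onPrice call allocates
// one Price and one BBO instance of this shape.
class Price {
    final double bid, ask;
    Price(double bid, double ask) { this.bid = bid; this.ask = ask; }
}

class BBO {
    final double bid, ask;
    BBO(double bid, double ask) { this.bid = bid; this.ask = ask; }
}

interface PriceListener {
    void onBBO(String pair, BBO bbo);
}
```

Note that version 2 later changes the listener signature to primitives (pairId, bid, ask) precisely to avoid the BBO allocation.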

This works. It’s thread-safe. ConcurrentHashMap handles concurrent venue updates. CopyOnWriteArrayList handles concurrent listener registration. Streams compute the BBO.

And it’s slow.

What Was Wrong With It

Running this under a JMH benchmark at 5,000 messages/second with 10 pairs and 5 venues showed:

p50:    180µs
p99:    2,100µs
p99.9: 15,000µs

500µs target, 2,100µs p99. Not close.

The profiler showed the problems clearly:

1. Object allocation on every price update

new Price(bid, ask) — one allocation per update. new BBO(bestBid, bestAsk) — another allocation. At 5,000/second, that’s 10,000 objects/second going straight to the young generation. GC pressure started showing in the histogram above 1ms.

2. CopyOnWriteArrayList for listeners

CopyOnWriteArrayList copies the underlying array on every write (registration/deregistration). For reads it’s fine — no locking. But I was calling forEach on it for every price update. The iteration itself was safe and fast; the allocation came from the lambda passed to forEach, which captures pair and bbo and therefore creates a fresh object on every call.

3. Stream allocations in computeBBO

Each .stream() call creates a Stream object and a Spliterator. .mapToDouble() creates another. At 5,000 calls/second these are tiny allocations but they add up. More importantly, the stream approach iterates the values twice (once for max bid, once for min ask) when a single pass would do.
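To make the single-pass point concrete, here are the two shapes side by side over plain arrays (a simplified sketch, not the aggregator itself):

```java
import java.util.Arrays;

class SinglePassDemo {
    // Two-pass stream version, in the style of version 1: each call creates
    // a Stream, a Spliterator, and an OptionalDouble per pass.
    static double[] bboStreams(double[] bids, double[] asks) {
        double bestBid = Arrays.stream(bids).max().orElse(0);
        double bestAsk = Arrays.stream(asks).min().orElse(Double.MAX_VALUE);
        return new double[] { bestBid, bestAsk };
    }

    // One loop, one pass over the data, no per-call stream machinery.
    static double[] bboLoop(double[] bids, double[] asks) {
        double bestBid = 0, bestAsk = Double.MAX_VALUE;
        for (int i = 0; i < bids.length; i++) {
            if (bids[i] > bestBid) bestBid = bids[i];
            if (asks[i] < bestAsk) bestAsk = asks[i];
        }
        return new double[] { bestBid, bestAsk };
    }
}
```

Both produce the same answer; only the allocation profile and the number of passes differ.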

4. ConcurrentHashMap lock contention

With five venue threads all writing to the same pair’s inner map, there was visible lock contention in the profiler’s synchronisation view.

Version 2: The Performance-Aware Rewrite

import java.util.HashMap;
import java.util.Map;

public class PriceAggregator {
    private static final int NUM_PAIRS = 64;
    private static final int NUM_VENUES = 8;

    // Price data stored as parallel primitive arrays — no boxing, no objects
    // Index: pairId * NUM_VENUES + venueId
    private final double[] bids = new double[NUM_PAIRS * NUM_VENUES];
    private final double[] asks = new double[NUM_PAIRS * NUM_VENUES];

    // Pre-allocated BBO per pair — mutated in place, not reconstructed
    private final double[] bboBid = new double[NUM_PAIRS];
    private final double[] bboAsk = new double[NUM_PAIRS];

    // Mapping tables (populated at startup)
    private final Map<String, Integer> pairIndex = new HashMap<>();
    private final Map<String, Integer> venueIndex = new HashMap<>();

    // Listener array — fixed at construction, no dynamic list
    private final PriceListener[] listeners;
    private final int listenerCount;

    public PriceAggregator(PriceListener[] listeners) {
        this.listeners = listeners;
        this.listenerCount = listeners.length;
    }

    public void onPrice(int venueId, int pairId, double bid, double ask) {
        int idx = pairId * NUM_VENUES + venueId;
        bids[idx] = bid;
        asks[idx] = ask;

        // Recompute BBO for this pair — single pass, no allocation
        double bestBid = 0;
        double bestAsk = Double.MAX_VALUE;
        int base = pairId * NUM_VENUES;
        for (int i = 0; i < NUM_VENUES; i++) {
            double b = bids[base + i];
            double a = asks[base + i];
            if (b > bestBid) bestBid = b;
            if (a > 0 && a < bestAsk) bestAsk = a;
        }

        // Only publish if BBO changed
        if (bestBid != bboBid[pairId] || bestAsk != bboAsk[pairId]) {
            bboBid[pairId] = bestBid;
            bboAsk[pairId] = bestAsk;
            for (int i = 0; i < listenerCount; i++) {
                listeners[i].onBBO(pairId, bestBid, bestAsk);
            }
        }
    }
}

The changes:

  • Primitive double arrays instead of objects — no allocation per price update, no GC pressure, better cache locality
  • Pre-computed index arithmetic — pairId * NUM_VENUES + venueId instead of map lookups for the hot path
  • Single-pass BBO computation — one loop, not two streams
  • Change detection — only notify listeners if BBO actually changed (~30% of updates change the BBO in practice)
  • Fixed listener array — no allocations for iteration
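The change-detection behaviour is easy to verify in isolation. A toy single-pair version (illustrative only — a counter stands in for the listener plumbing):

```java
// Single-pair sketch of the "publish only on change" logic from version 2.
class ChangeDetectDemo {
    static final int NUM_VENUES = 8;
    final double[] bids = new double[NUM_VENUES];
    final double[] asks = new double[NUM_VENUES];
    double bboBid = 0, bboAsk = Double.MAX_VALUE;
    int publishes = 0; // stands in for notifying listeners

    void onPrice(int venueId, double bid, double ask) {
        bids[venueId] = bid;
        asks[venueId] = ask;
        double bestBid = 0, bestAsk = Double.MAX_VALUE;
        for (int i = 0; i < NUM_VENUES; i++) {
            if (bids[i] > bestBid) bestBid = bids[i];
            if (asks[i] > 0 && asks[i] < bestAsk) bestAsk = asks[i];
        }
        if (bestBid != bboBid || bestAsk != bboAsk) {
            bboBid = bestBid;
            bboAsk = bestAsk;
            publishes++; // would fan out to listeners here
        }
    }
}
```

A second quote that sits inside the current BBO leaves both best prices unchanged, so it never reaches the listeners.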

The string-to-int mapping (pairIndex, venueIndex) happens at the venue connection boundary, outside the hot path.
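That boundary translation might look something like this (class and method names are my invention, not from the codebase):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical connection-boundary adapter: resolves each symbol string to a
// stable int id once, so the hot path (onPrice) only ever sees integers.
class SymbolTable {
    private final Map<String, Integer> index = new HashMap<>();
    private int nextId = 0;

    int idFor(String symbol) {
        // Allocation and map lookup happen here, at connect/subscribe time,
        // never per price update.
        return index.computeIfAbsent(symbol, k -> nextId++);
    }
}
```

One table for pairs and one for venues is enough; each venue session resolves its ids at subscription time and then calls onPrice with plain ints.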

Results

Version 1 (stream/map):
  p50:    180µs  p99:  2,100µs  p99.9: 15,000µs
  Allocation rate: 480 MB/s

Version 2 (primitive arrays):
  p50:      4µs  p99:     42µs  p99.9:    180µs
  Allocation rate: 0 MB/s (zero hot-path allocation)

40× improvement at p50. 50× at p99. The p99.9 improvement is even larger because eliminating GC pressure removed the GC-induced spikes.

What This Taught Me

The version 1 code is the kind of code a capable Java developer writes naturally. It’s readable, idiomatic, and correct. The version 2 code requires knowing specific things:

  • That objects in the hot path cause GC pressure at high rates
  • That Java streams allocate objects per call
  • That double[] arrays are faster to iterate than List<Double> because of cache layout
  • That early exit on “no change” is worth the branch

None of this is obvious from the language. It comes from measuring, understanding the JVM memory model, and knowing how the hardware works.

The broader lesson: performance-sensitive systems require different design thinking from the start. Version 2 isn’t a refactored version 1 — it’s a different design that happens to solve the same problem. The right time to choose the design is before writing version 1.