Risk in a batch system is a solved problem: collect all positions, apply valuation models, sum the results, write a report. The dealing desk doesn’t want a report. They want a number that’s correct right now, updates in under a second when a trade comes in, and doesn’t go stale when a price moves.

That’s a different problem.

The Requirement Statement

The electronic trading firm ran a dealing desk alongside the algorithmic flow. Dealers needed to see their live risk exposure: net position per currency pair, Greeks for options books, notional P&L per desk and per dealer. The requirement from the desk was precise:

  • Update within 500ms of any trade or price event
  • Available continuously during trading hours (no scheduled recalculation)
  • Correct: no double-counting, no missed trades, no stale Greeks

The 500ms number came from a dealer who had watched the existing end-of-day batch system and timed how often his mental model of his position diverged from the number on screen. “It’s always wrong” was the complaint. He’d accept being 500ms behind reality. He wouldn’t accept being 10 minutes behind.

Why Batch Architectures Fail Here

The natural approach — recalculate all positions on every trade — fails at scale. At peak, the trading system processed several thousand trades per minute. Recalculating full P&L attribution across all positions on every trade would have required either:

  1. A very fast recalculation (achievable but with significant engineering investment)
  2. Rate-limiting updates (defeats the 500ms requirement)
  3. A different architectural model

The model we chose: incremental aggregation.

Incremental Aggregation

Instead of recalculating from scratch on each event, maintain the aggregate state and apply deltas.

Position state:
  EUR/USD:
    net_quantity: -12,500,000  (short 12.5M)
    avg_rate: 1.08432
    unrealised_pnl: +48,250 USD

When a new trade arrives (buy 2M EUR/USD at 1.08420):

Delta:
  quantity: +2,000,000
  cost: 2,000,000 × 1.08420 = 2,168,400

New state:
  net_quantity: -10,500,000
  avg_rate: recalculated from total cost basis
  unrealised_pnl: recalculated vs current mid

This reduces the per-trade computation from O(positions) to O(1) — update the affected position, recalculate aggregates up the hierarchy.
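The update step above can be sketched in a few lines of Java. This is a minimal sketch, not the original system's code — field and method names are mine, and the realised/unrealised split is elided (this cost-basis model folds realised P&L from buy-backs into the average):

```java
// Incremental position update — O(1) per trade, no rescan of other positions.
// Illustrative names; negative netQuantity means short.
class FxPosition {
    double netQuantity;
    double totalCost;      // signed cost basis in quote currency
    double unrealisedPnl;

    void applyTrade(double quantity, double rate, double currentMid) {
        netQuantity += quantity;
        totalCost   += quantity * rate;
        // Average rate recovered from the running cost basis.
        double avgRate = netQuantity != 0 ? totalCost / netQuantity : 0.0;
        unrealisedPnl  = (currentMid - avgRate) * netQuantity;
    }
}
```

Running the example from the text (short 12.5M at 1.08432, then buy 2M at 1.08420) leaves a net of −10.5M with the average recomputed from the combined cost basis.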

The hierarchy was: position → desk → book → total. A trade in EUR/USD on desk FX1 updates:

  1. The EUR/USD position record for desk FX1
  2. The FX1 desk aggregate
  3. The total aggregate

Three updates, regardless of how many other positions exist.

The Price Feed Problem

Position P&L depends on current market price, not just trade cost. A position bought at 1.0840 with current mid at 1.0860 has unrealised P&L of +0.0020 per unit. When the price moves, every position in that instrument needs its P&L recalculated.
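The revaluation itself is one line of arithmetic — a sketch, with names of my choosing rather than the system's:

```java
// Unrealised P&L = (current mid − average rate) × net quantity.
// For the example above: (1.0860 − 1.0840) × 1,000,000 = +2,000.
class Revaluation {
    static double unrealisedPnl(double netQuantity, double avgRate, double mid) {
        return (mid - avgRate) * netQuantity;
    }
}
```

The sign works out for shorts too: negative quantity times a favourable (downward) mid move yields positive P&L.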

At peak, EUR/USD was receiving several hundred price updates per second. Recalculating all EUR/USD positions on every tick would have created a CPU-bound loop.

The solution: coalesce price updates. Don’t process every tick. Process the latest price at the end of each 100ms window.

// Price coalescer — one thread per instrument
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.LockSupport;

class PriceCoalescer implements Runnable {
    private final AtomicReference<Double> latestMid = new AtomicReference<>();
    private final String instrument;
    private final RiskEngine engine;

    PriceCoalescer(String instrument, RiskEngine engine) {
        this.instrument = instrument;
        this.engine = engine;
    }

    // Feed-thread side: overwrite — we only care about the latest tick.
    public void onTick(double mid) {
        latestMid.set(mid);
    }

    // Coalescer-thread side: drain at most once per 100ms window.
    @Override
    public void run() {
        while (!Thread.interrupted()) {
            Double mid = latestMid.getAndSet(null);
            if (mid != null) {
                engine.revaluePositions(instrument, mid);
            }
            LockSupport.parkNanos(100_000_000L);  // 100ms
        }
    }
}

This gave us: process every trade immediately (low latency for position changes), but batch price-driven revaluations to 10Hz (manageable CPU load).

The dealers’ screen refreshed at 2Hz. 10Hz revaluation meant the displayed number was at most 150ms stale relative to the market. Within tolerance.

Consistency: The Hard Part

The easy part is updating a number. The hard part is updating the right number correctly, exactly once, even when things fail.

Double-counting: If a trade confirmation is processed twice (network retransmission, retry on failure), the position must not be incremented twice. Each trade had a unique ID; the risk engine maintained a processed-set of trade IDs and rejected duplicates.
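A minimal sketch of that dedup check, assuming an unbounded in-memory set (the real system would have bounded or persisted it; the class name is mine):

```java
import java.util.HashSet;
import java.util.Set;

// Idempotency guard: each trade ID is applied at most once.
class TradeDeduplicator {
    private final Set<String> processed = new HashSet<>();

    // Returns true if the trade should be applied, false if it is a duplicate.
    boolean firstTime(String tradeId) {
        return processed.add(tradeId);  // Set.add() is false if already present
    }
}
```

The check and the insert are a single operation, so there is no window in which a retransmitted confirmation can slip through between them.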

Lost trades: The position must be consistent with the trade store. We ran a periodic reconciliation (every 15 minutes during trading hours) that compared the risk engine’s in-memory position state against the trade database. Discrepancies triggered an alarm and a forced reload of the affected positions.

Recovery: When the risk engine restarted (deployment, crash), it needed to rebuild position state from scratch. The rebuild procedure: load all open trades from the database in timestamp order, replay them through the same position-update logic. This was tested daily — the risk engine was deployed every morning before trading opened, rebuilding state from overnight trades.
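The replay idea can be sketched as follows — StoredTrade and PositionRebuilder are hypothetical names, and only net quantity is rebuilt here for brevity; the real rebuild went through the full position-update logic:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative trade record loaded from the trade store.
class StoredTrade {
    final String instrument;
    final double quantity;
    final long timestamp;
    StoredTrade(String instrument, double quantity, long timestamp) {
        this.instrument = instrument;
        this.quantity = quantity;
        this.timestamp = timestamp;
    }
}

class PositionRebuilder {
    // Replay open trades in timestamp order into a fresh net-position map.
    static Map<String, Double> rebuild(List<StoredTrade> openTrades) {
        List<StoredTrade> sorted = new ArrayList<>(openTrades);
        sorted.sort(Comparator.comparingLong(t -> t.timestamp));
        Map<String, Double> net = new HashMap<>();
        for (StoredTrade t : sorted) {
            net.merge(t.instrument, t.quantity, Double::sum);
        }
        return net;
    }
}
```

The key property is that recovery uses the same code path as live updates, so a rebuilt position cannot diverge from what live processing would have produced.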

Clock ordering: Trades sometimes arrived slightly out of order due to messaging infrastructure. The incremental aggregation model is not order-sensitive for net quantity (addition is commutative), but is order-sensitive for average price calculation. We used trade timestamp (not arrival time) for ordering, with a small out-of-order tolerance window.
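One way such a tolerance window can be implemented — a sketch under my own naming, not the original code — is a timestamp-ordered buffer that releases events only once the high-water mark has moved past them by the tolerance:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Out-of-order tolerance window: events are held briefly and released
// in timestamp order once the newest-seen timestamp has moved past
// them by toleranceMs. Each event is a [timestamp, payload] pair.
class ReorderBuffer {
    private final long toleranceMs;
    private final PriorityQueue<long[]> pending =
        new PriorityQueue<>((a, b) -> Long.compare(a[0], b[0]));
    private long highWater = Long.MIN_VALUE;

    ReorderBuffer(long toleranceMs) { this.toleranceMs = toleranceMs; }

    // Offer one event; returns any events now safe to process, in order.
    List<long[]> offer(long ts, long payload) {
        pending.add(new long[]{ts, payload});
        highWater = Math.max(highWater, ts);
        List<long[]> ready = new ArrayList<>();
        while (!pending.isEmpty() && pending.peek()[0] <= highWater - toleranceMs) {
            ready.add(pending.poll());
        }
        return ready;
    }
}
```

The trade-off is latency: every event waits up to the tolerance before being applied, which is why the window has to stay small relative to the 500ms budget.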

The Hierarchy Aggregation

Each desk rolled up its positions; each book rolled up its desks; the total rolled up all books. The hierarchy was maintained as a tree of nodes, each holding its own subtotal.

When a position changed, we propagated the delta upward:

// Single-threaded update path: apply the delta at the leaf position,
// then refresh each ancestor aggregate from its children.
void onPositionChange(PositionDelta delta) {
    Position pos = positions.get(delta.instrument, delta.deskId);
    pos.apply(delta);

    DeskAggregate desk = desks.get(delta.deskId);
    desk.recomputeFromPositions();

    BookAggregate book = books.get(desk.bookId);
    book.recomputeFromDesks();

    totalAggregate.recomputeFromBooks();
}

This was single-threaded for correctness — one update thread, no concurrent modification of position state. The read path (dealer screens reading aggregates) used a copy-on-update pattern: the write thread published a new snapshot of the aggregate tree after each update, and readers held a reference to the latest snapshot.

Read operations were O(1) — get reference to latest snapshot, read from it. Write operations were serialised through a single thread. This eliminated the need for locks on read paths, which was important for the UI latency requirement.
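The copy-on-update pattern can be sketched with an AtomicReference holding an immutable snapshot. The flat map-of-aggregates shape here is a simplification of the real tree:

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Copy-on-update: the single write thread builds a new immutable
// snapshot and publishes it atomically; readers never take a lock.
class AggregateSnapshots {
    private final AtomicReference<Map<String, Double>> current =
        new AtomicReference<>(Map.of());

    // Write path (single-threaded): replace the snapshot wholesale.
    void publish(Map<String, Double> newSnapshot) {
        current.set(Map.copyOf(newSnapshot));  // defensive immutable copy
    }

    // Read path: grab the latest reference and read from it.
    Map<String, Double> latest() {
        return current.get();
    }
}
```

A reader that holds a snapshot reference sees an internally consistent view even while the write thread publishes newer ones; old snapshots are simply garbage-collected once no reader holds them.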

What the Dealers Actually Used

The final system had three views:

Position ladder: per instrument, net quantity and P&L for the current dealer’s book, updated in real time.

Desk blotter: all desks on one screen, net exposure and P&L per desk, sortable. Management used this.

Risk summary: Greeks for the options book (delta, gamma, vega), computed by the risk model and pushed to the same aggregation infrastructure.

The update latency in practice: 200–300ms from trade to screen refresh, well within the 500ms requirement. The price-driven revaluation lag added another 0–100ms depending on where we were in the 100ms coalescing window.

The dealers stopped complaining about wrong numbers. That was the success metric.


Real-time risk aggregation is an incremental maintenance problem, not a recalculation problem. The constraints — latency, correctness, availability — push toward event-driven delta computation rather than batch refresh. The complexity is in the edge cases: out-of-order events, failure recovery, reconciliation against the source of truth. Getting the happy path right is a few days of work; getting the failure modes right takes the rest of the project.