The feed handler is where the external world becomes internal data. A FIX or binary protocol stream arrives over the network, gets parsed into typed events, and gets handed to the internal processing pipeline. Nothing downstream can be faster than the feed handler’s latency.
This is the design I evolved over three iterations at the trading firm.
The Latency Budget
Network wire arrival → kernel network stack → process
           │
      feed handler
           │
   parse FIX message
           │
    validate fields
           │
 enqueue for downstream
           │
  downstream consumers
Our target: total wire-to-consumer latency under 200µs at p99. The feed handler’s budget was 50µs or less per message, leaving 150µs for everything downstream.
At our peak of 10,000 messages/second, a 50µs-per-message budget corresponds to a throughput ceiling of 20,000 messages/second, i.e. 2× headroom over peak. A single-threaded handler can do this trivially. The challenge is doing it consistently — p99, not p50.
Version 1: Blocking I/O, One Thread Per Connection
The natural Java approach: one thread per connection, blocking reads:
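A minimal sketch of this design, with illustrative names (BlockingFeedReader, MessageSink; production code would pass socket.getInputStream()):

```java
import java.io.IOException;
import java.io.InputStream;

// Version 1 sketch: one blocking reader thread per venue connection.
class BlockingFeedReader implements Runnable {
    interface MessageSink { void onBytes(byte[] data, int length); }

    private final InputStream in;   // socket.getInputStream() in production
    private final MessageSink sink;

    BlockingFeedReader(InputStream in, MessageSink sink) {
        this.in = in;
        this.sink = sink;
    }

    @Override
    public void run() {
        byte[] buffer = new byte[64 * 1024];  // reused across reads to limit GC pressure
        try {
            int n;
            while ((n = in.read(buffer)) >= 0) {   // blocks in a read() syscall
                // NOTE: read() may return a partial FIX message; real code
                // needs framing logic layered on top of this loop.
                sink.onBytes(buffer, n);
            }
        } catch (IOException e) {
            // connection dropped; reconnection is handled elsewhere
        }
    }
}
```

One such thread is spawned per venue; the OS wakes it whenever the socket has data.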
With 5 venue connections, this means 5 threads blocked on I/O. The OS wakes each thread when data arrives, the thread reads and processes, then blocks again.
Problems:
- Context-switch overhead when the OS wakes the thread
- GC pressure from new byte[] per connection (fixable with reuse)
- in.read() may return partial messages — buffering and message framing logic gets complex
- The thread is blocked in a system call; it can’t do anything else while waiting
p99 with this approach: ~600µs. The dominant cost was thread wake-up latency: the time between data arriving in the kernel buffer and the application thread being scheduled to process it.
Version 2: NIO Selector (Non-Blocking)
Java NIO’s Selector allows one thread to multiplex many connections:
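A sketch of the selector loop (SelectorFeedLoop is an illustrative name; pollOnce is one pass of what production runs inside while (true)):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.nio.channels.spi.AbstractSelectableChannel;
import java.util.Iterator;

// Version 2 sketch: one thread multiplexing all venue connections.
class SelectorFeedLoop {
    private final Selector selector;
    private final ByteBuffer buffer = ByteBuffer.allocateDirect(64 * 1024);

    SelectorFeedLoop() throws IOException {
        this.selector = Selector.open();
    }

    // Register any readable channel (a SocketChannel per venue in production).
    void register(AbstractSelectableChannel ch) throws IOException {
        ch.configureBlocking(false);
        ch.register(selector, SelectionKey.OP_READ);
    }

    void connect(InetSocketAddress venue) throws IOException {
        register(SocketChannel.open(venue));
    }

    // One pass of the event loop; returns bytes drained across ready channels.
    int pollOnce() throws IOException {
        selector.select();               // one epoll_wait per wake-up on Linux
        int total = 0;
        Iterator<SelectionKey> it = selector.selectedKeys().iterator();
        while (it.hasNext()) {
            SelectionKey key = it.next();
            it.remove();
            ReadableByteChannel ch = (ReadableByteChannel) key.channel();
            buffer.clear();
            int n = ch.read(buffer);     // lands straight in the direct buffer
            if (n > 0) {
                buffer.flip();
                total += n;
                process(buffer);         // parse + validate + enqueue
            }
        }
        return total;
    }

    protected void process(ByteBuffer message) { /* downstream hand-off */ }
}
```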
One thread handles all five connections. No context switching between connections. ByteBuffer.allocateDirect lets each read land directly in native memory, avoiding the extra copy the JDK otherwise makes when reading into a heap buffer.
p99 with NIO: ~180µs. Meaningful improvement, but the selector.select() call still involves a system call (epoll_wait on Linux), adding ~5–10µs of overhead per wake-up.
Version 3: Busy-Poll + CPU Affinity
For sub-100µs p99, you need to eliminate the system call on the receive path:
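A sketch of the spin loop (BusyPollReader is an illustrative name; in production the channel is a non-blocking SocketChannel and the thread is pinned to an isolated core before run() is called):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;

// Version 3 sketch: busy-poll a non-blocking channel instead of sleeping in select().
class BusyPollReader {
    private long bytesSeen;

    void run(ReadableByteChannel ch) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocateDirect(64 * 1024);
        while (true) {
            buffer.clear();
            int n = ch.read(buffer);    // 0 = nothing yet: keep spinning, never sleep
            if (n > 0) {
                buffer.flip();
                bytesSeen += n;
                process(buffer);        // parse + validate + enqueue
            } else if (n < 0) {
                break;                  // venue disconnected
            }
            // No select(), no sleep: the core burns, but new data is picked up
            // on the very next iteration with no scheduler wake-up.
        }
    }

    long bytesSeen() { return bytesSeen; }

    protected void process(ByteBuffer message) { /* downstream hand-off */ }
}
```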
This thread runs at 100% CPU, constantly checking for data. The moment data arrives in the kernel buffer, the next iteration picks it up — no scheduler wake-up, no blocking in epoll_wait. (A non-blocking read is still a system call; eliminating that too requires a kernel-bypass stack.) The trade-off is a full CPU core dedicated to this thread.
Combined with:
- CPU pinning via
tasksetorpthread_setaffinity_np(through JNA/JNI) isolcpuskernel parameter to prevent OS from scheduling other tasks on that coreSCHED_FIFOreal-time scheduling priority for the feed handler thread
p99 with busy-poll + affinity: ~45µs. The remaining cost was FIX message parsing.
FIX Parsing: The Hidden Cost
A FIX message looks like:
8=FIX.4.4|9=123|35=W|55=EUR/USD|268=2|270=1.28445|271=5000000|...
Naïve parsing: String.split on the SOH delimiter (rendered as | above), then Integer.parseInt() / Double.parseDouble() for each field.
The problem: String.split() allocates a String[]. Each field becomes a String. Double.parseDouble() on a String involves bounds checking, character iteration, and creates intermediate objects.
At 10,000 messages/second with 20 fields each, that’s 200,000 String allocations per second plus the array — several hundred MB/s of allocation rate. GC becomes the dominant latency source.
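For contrast, the naïve version amounts to something like this (NaiveFixParser is an illustrative name; tag 270 is the standard FIX MDEntryPx field):

```java
// Naive FIX parsing: allocates a String[], one String per field, and goes
// through Double.parseDouble. Fine for tooling, fatal on a hot path.
final class NaiveFixParser {
    static double parsePrice(String rawMessage) {
        String[] fields = rawMessage.split("\u0001"); // one array + N strings per message
        for (String field : fields) {
            if (field.startsWith("270=")) {           // 270 = MDEntryPx
                // substring() makes yet another String before parsing
                return Double.parseDouble(field.substring(4));
            }
        }
        return Double.NaN;
    }
}
```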
The fix: parse in-place on the byte buffer without creating String objects:
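A sketch of the idea (FixParse and the fixed-point scale of 1e5 are illustrative choices; a real engine tracks field offsets while scanning for SOH delimiters):

```java
// In-place parsing on the raw bytes: no String, no String[], zero allocation.
final class FixParse {
    private static final byte SOH = 0x01;  // FIX field delimiter

    // Parse an unsigned decimal integer from buf[start, end).
    static long parseLong(byte[] buf, int start, int end) {
        long v = 0;
        for (int i = start; i < end; i++) {
            v = v * 10 + (buf[i] - '0');
        }
        return v;
    }

    // Parse a decimal like "1.28445" into a long scaled by 1e5, avoiding
    // Double.parseDouble and its intermediate objects entirely.
    static long parsePrice5(byte[] buf, int start, int end) {
        long v = 0;
        int scale = 5;
        boolean seenDot = false;
        for (int i = start; i < end; i++) {
            byte b = buf[i];
            if (b == '.') { seenDot = true; continue; }
            v = v * 10 + (b - '0');
            if (seenDot) scale--;
        }
        while (scale-- > 0) v *= 10;  // pad values with fewer decimal places
        return v;
    }

    // Index of the SOH terminating the value that starts at 'from'.
    static int valueEnd(byte[] buf, int from) {
        int i = from;
        while (buf[i] != SOH) i++;
        return i;
    }
}
```

Prices become fixed-point longs, which downstream code compares and arithmetics on without ever touching floating point or the heap.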
Zero allocations. Parses directly from the raw bytes. For the field types used in FIX (integers, doubles, short strings), this approach covers 95% of the parsing work.
After replacing string-based parsing with in-place byte parsing:
- Allocation rate: 0 MB/s on the hot path
- Parsing time per message: 8µs → 1.5µs
- p99 total feed handler latency: 45µs → 22µs
The Reconnection Problem
A feed handler that processes messages fast is worthless if it doesn’t handle disconnections gracefully. Exchange connections drop — maintenance windows, network glitches, timeouts. The reconnection logic needs to be:
- Automatic — no manual intervention for transient drops
- Non-blocking — the handler continues processing other venues while one reconnects
- Sequenced — after reconnection, handle any message gap (some venues support replay)
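A sketch of the per-venue state machine (ReconnectPolicy, the state names, and the backoff constants are illustrative; the replay step is venue-specific):

```java
// Reconnection state machine with exponential backoff and gap detection.
class ReconnectPolicy {
    enum State { CONNECTED, BACKING_OFF, RESYNCING }

    private State state = State.CONNECTED;
    private int attempt = 0;

    private static final long BASE_MS = 100;
    private static final long MAX_MS = 30_000;

    void onDisconnect() {
        state = State.BACKING_OFF;
        attempt++;
    }

    // Delay before the next connect attempt: 100ms, 200ms, 400ms, ... capped at 30s.
    long nextDelayMs() {
        int a = Math.min(Math.max(attempt - 1, 0), 20); // clamp to avoid shift overflow
        return Math.min(BASE_MS << a, MAX_MS);
    }

    // lastSeqSeen: last sequence number processed before the drop.
    // firstSeqAfterReconnect: first sequence number on the new session.
    void onConnected(long lastSeqSeen, long firstSeqAfterReconnect) {
        attempt = 0;
        state = (firstSeqAfterReconnect > lastSeqSeen + 1)
                ? State.RESYNCING    // gap: request replay if the venue supports it
                : State.CONNECTED;   // contiguous: resume normal processing
    }

    void onReplayComplete() { state = State.CONNECTED; }

    // The dispatch loop checks this before consuming from the channel, so a
    // partially-reconnected venue never feeds messages downstream.
    boolean mayProcessMessages() { return state == State.CONNECTED; }
}
```

Because the policy is pure state, it runs on the handler thread without blocking the other venues: a venue in BACKING_OFF or RESYNCING is simply skipped by the dispatch loop.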
The exponential backoff prevents hammering a venue that’s down. The state machine ensures the handler doesn’t process messages from a partially-reconnected channel.
The feed handler is where hardware meets finance. Every µs saved here compounds throughout the processing pipeline. The design I landed on — busy-poll, CPU affinity, zero-allocation parsing — was overkill for many use cases but exactly right for the sub-millisecond pipelines we were building. Starting with version 1 and measuring before each rewrite was the only way to know which optimisations actually mattered.