The feed handler is where the external world becomes internal data. A FIX or binary protocol stream arrives over the network, gets parsed into typed events, and gets handed to the internal processing pipeline. Nothing downstream can be faster than the feed handler’s latency.
This is the design I evolved over three iterations at the trading firm.
The Latency Budget
Network wire arrival → kernel network stack → process
           │
      feed handler
           │
   parse FIX message
           │
    validate fields
           │
 enqueue for downstream
           │
  downstream consumers
Our target: total wire-to-consumer latency under 200µs at p99. The feed handler’s budget was 50µs or less per message, leaving 150µs for everything downstream.
At our peak of 10,000 messages/second, a 50µs-per-message budget corresponds to a throughput ceiling of 20,000 messages/second, i.e. 2× headroom over peak. A single-threaded handler can do this trivially. The challenge is doing it consistently — p99, not p50.
Version 1: Blocking I/O, One Thread Per Connection
The natural Java approach: one thread per connection, blocking reads:
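A minimal sketch of this design, with illustrative names (BlockingFeedReader, MessageSink; production code would pass socket.getInputStream()):

```java
import java.io.IOException;
import java.io.InputStream;

// Version 1 sketch: one blocking reader thread per venue connection.
class BlockingFeedReader implements Runnable {
    interface MessageSink { void onBytes(byte[] data, int length); }

    private final InputStream in;   // socket.getInputStream() in production
    private final MessageSink sink;

    BlockingFeedReader(InputStream in, MessageSink sink) {
        this.in = in;
        this.sink = sink;
    }

    @Override
    public void run() {
        byte[] buffer = new byte[64 * 1024];  // reused across reads to limit GC pressure
        try {
            int n;
            while ((n = in.read(buffer)) >= 0) {   // blocks in a read() syscall
                // NOTE: read() may return a partial FIX message; real code
                // needs framing logic layered on top of this loop.
                sink.onBytes(buffer, n);
            }
        } catch (IOException e) {
            // connection dropped; reconnection is handled elsewhere
        }
    }
}
```

One such thread is spawned per venue; the OS wakes it whenever the socket has data.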
With 5 venue connections, this means 5 threads blocked on I/O. The OS wakes each thread when data arrives, the thread reads and processes, then blocks again.
Problems:
- Context-switch overhead when the OS wakes the thread
- GC pressure from new byte[] per connection (fixable with reuse)
- in.read() may return partial messages — buffering and message framing logic gets complex
- The thread is blocked in a system call; it can’t do anything else while waiting
p99 with this approach: ~600µs. The dominant cost was thread wake-up latency: the time between data arriving in the kernel buffer and the application thread being scheduled to process it.
Version 2: NIO Selector (Non-Blocking)
Java NIO’s Selector allows one thread to multiplex many connections:
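A sketch of the selector loop (SelectorFeedLoop is an illustrative name; pollOnce is one pass of what production runs inside while (true)):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.nio.channels.spi.AbstractSelectableChannel;
import java.util.Iterator;

// Version 2 sketch: one thread multiplexing all venue connections.
class SelectorFeedLoop {
    private final Selector selector;
    private final ByteBuffer buffer = ByteBuffer.allocateDirect(64 * 1024);

    SelectorFeedLoop() throws IOException {
        this.selector = Selector.open();
    }

    // Register any readable channel (a SocketChannel per venue in production).
    void register(AbstractSelectableChannel ch) throws IOException {
        ch.configureBlocking(false);
        ch.register(selector, SelectionKey.OP_READ);
    }

    void connect(InetSocketAddress venue) throws IOException {
        register(SocketChannel.open(venue));
    }

    // One pass of the event loop; returns bytes drained across ready channels.
    int pollOnce() throws IOException {
        selector.select();               // one epoll_wait per wake-up on Linux
        int total = 0;
        Iterator<SelectionKey> it = selector.selectedKeys().iterator();
        while (it.hasNext()) {
            SelectionKey key = it.next();
            it.remove();
            ReadableByteChannel ch = (ReadableByteChannel) key.channel();
            buffer.clear();
            int n = ch.read(buffer);     // lands straight in the direct buffer
            if (n > 0) {
                buffer.flip();
                total += n;
                process(buffer);         // parse + validate + enqueue
            }
        }
        return total;
    }

    protected void process(ByteBuffer message) { /* downstream hand-off */ }
}
```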
One thread handles all five connections. No context switching between connections. ByteBuffer.allocateDirect lets each read land directly in native memory, avoiding the extra copy the JDK otherwise makes when reading into a heap buffer.
p99 with NIO: ~180µs. Meaningful improvement, but the selector.select() call still involves a system call (epoll_wait on Linux), adding ~5–10µs of overhead per wake-up.
Version 3: Busy-Poll + CPU Affinity
For sub-100µs p99, you need to eliminate the system call on the receive path:
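A sketch of the spin loop (BusyPollReader is an illustrative name; in production the channel is a non-blocking SocketChannel and the thread is pinned to an isolated core before run() is called):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;

// Version 3 sketch: busy-poll a non-blocking channel instead of sleeping in select().
class BusyPollReader {
    private long bytesSeen;

    void run(ReadableByteChannel ch) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocateDirect(64 * 1024);
        while (true) {
            buffer.clear();
            int n = ch.read(buffer);    // 0 = nothing yet: keep spinning, never sleep
            if (n > 0) {
                buffer.flip();
                bytesSeen += n;
                process(buffer);        // parse + validate + enqueue
            } else if (n < 0) {
                break;                  // venue disconnected
            }
            // No select(), no sleep: the core burns, but new data is picked up
            // on the very next iteration with no scheduler wake-up.
        }
    }

    long bytesSeen() { return bytesSeen; }

    protected void process(ByteBuffer message) { /* downstream hand-off */ }
}
```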
This thread runs at 100% CPU, constantly checking for data. The moment data arrives in the kernel buffer, the next iteration picks it up — no scheduler wake-up, no blocking in epoll_wait. (A non-blocking read is still a system call; eliminating that too requires a kernel-bypass stack.) The trade-off is a full CPU core dedicated to this thread.
Combined with:
- CPU pinning via
tasksetorpthread_setaffinity_np(through JNA/JNI) isolcpuskernel parameter to prevent OS from scheduling other tasks on that coreSCHED_FIFOreal-time scheduling priority for the feed handler thread
p99 with busy-poll + affinity: ~45µs. The remaining cost was FIX message parsing.
FIX Parsing: The Hidden Cost
A FIX message looks like:
8=FIX.4.4|9=123|35=W|55=EUR/USD|268=2|270=1.28445|271=5000000|...
Naïve parsing: String.split on the SOH delimiter (rendered as | above), then Integer.parseInt() / Double.parseDouble() for each field.
The problem: String.split() allocates a String[]. Each field becomes a String. Double.parseDouble() on a String involves bounds checking, character iteration, and creates intermediate objects.
At 10,000 messages/second with 20 fields each, that’s 200,000 String allocations per second plus the array — several hundred MB/s of allocation rate. GC becomes the dominant latency source.
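For contrast, the naïve version amounts to something like this (NaiveFixParser is an illustrative name; tag 270 is the standard FIX MDEntryPx field):

```java
// Naive FIX parsing: allocates a String[], one String per field, and goes
// through Double.parseDouble. Fine for tooling, fatal on a hot path.
final class NaiveFixParser {
    static double parsePrice(String rawMessage) {
        String[] fields = rawMessage.split("\u0001"); // one array + N strings per message
        for (String field : fields) {
            if (field.startsWith("270=")) {           // 270 = MDEntryPx
                // substring() makes yet another String before parsing
                return Double.parseDouble(field.substring(4));
            }
        }
        return Double.NaN;
    }
}
```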
The fix: parse in-place on the byte buffer without creating String objects:
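A sketch of the idea (FixParse and the fixed-point scale of 1e5 are illustrative choices; a real engine tracks field offsets while scanning for SOH delimiters):

```java
// In-place parsing on the raw bytes: no String, no String[], zero allocation.
final class FixParse {
    private static final byte SOH = 0x01;  // FIX field delimiter

    // Parse an unsigned decimal integer from buf[start, end).
    static long parseLong(byte[] buf, int start, int end) {
        long v = 0;
        for (int i = start; i < end; i++) {
            v = v * 10 + (buf[i] - '0');
        }
        return v;
    }

    // Parse a decimal like "1.28445" into a long scaled by 1e5, avoiding
    // Double.parseDouble and its intermediate objects entirely.
    static long parsePrice5(byte[] buf, int start, int end) {
        long v = 0;
        int scale = 5;
        boolean seenDot = false;
        for (int i = start; i < end; i++) {
            byte b = buf[i];
            if (b == '.') { seenDot = true; continue; }
            v = v * 10 + (b - '0');
            if (seenDot) scale--;
        }
        while (scale-- > 0) v *= 10;  // pad values with fewer decimal places
        return v;
    }

    // Index of the SOH terminating the value that starts at 'from'.
    static int valueEnd(byte[] buf, int from) {
        int i = from;
        while (buf[i] != SOH) i++;
        return i;
    }
}
```

Prices become fixed-point longs, which downstream code compares and arithmetics on without ever touching floating point or the heap.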
Zero allocations. Parses directly from the raw bytes. For the field types used in FIX (integers, doubles, short strings), this approach covers 95% of the parsing work.
After replacing string-based parsing with in-place byte parsing:
- Allocation rate: 0 MB/s on the hot path
- Parsing time per message: 8µs → 1.5µs
- p99 total feed handler latency: 45µs → 22µs
The Reconnection Problem
A feed handler that processes messages fast is worthless if it doesn’t handle disconnections gracefully. Exchange connections drop — maintenance windows, network glitches, timeouts. The reconnection logic needs to be:
- Automatic — no manual intervention for transient drops
- Non-blocking — the handler continues processing other venues while one reconnects
- Sequenced — after reconnection, handle any message gap (some venues support replay)
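A sketch of the per-venue state machine (ReconnectPolicy, the state names, and the backoff constants are illustrative; the replay step is venue-specific):

```java
// Reconnection state machine with exponential backoff and gap detection.
class ReconnectPolicy {
    enum State { CONNECTED, BACKING_OFF, RESYNCING }

    private State state = State.CONNECTED;
    private int attempt = 0;

    private static final long BASE_MS = 100;
    private static final long MAX_MS = 30_000;

    void onDisconnect() {
        state = State.BACKING_OFF;
        attempt++;
    }

    // Delay before the next connect attempt: 100ms, 200ms, 400ms, ... capped at 30s.
    long nextDelayMs() {
        int a = Math.min(Math.max(attempt - 1, 0), 20); // clamp to avoid shift overflow
        return Math.min(BASE_MS << a, MAX_MS);
    }

    // lastSeqSeen: last sequence number processed before the drop.
    // firstSeqAfterReconnect: first sequence number on the new session.
    void onConnected(long lastSeqSeen, long firstSeqAfterReconnect) {
        attempt = 0;
        state = (firstSeqAfterReconnect > lastSeqSeen + 1)
                ? State.RESYNCING    // gap: request replay if the venue supports it
                : State.CONNECTED;   // contiguous: resume normal processing
    }

    void onReplayComplete() { state = State.CONNECTED; }

    // The dispatch loop checks this before consuming from the channel, so a
    // partially-reconnected venue never feeds messages downstream.
    boolean mayProcessMessages() { return state == State.CONNECTED; }
}
```

Because the policy is pure state, it runs on the handler thread without blocking the other venues: a venue in BACKING_OFF or RESYNCING is simply skipped by the dispatch loop.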
The exponential backoff prevents hammering a venue that’s down. The state machine ensures the handler doesn’t process messages from a partially-reconnected channel.
The feed handler is where hardware meets finance. Every µs saved here compounds throughout the processing pipeline. The design I landed on — busy-poll, CPU affinity, zero-allocation parsing — was overkill for many use cases but exactly right for the sub-millisecond pipelines we were building. Starting with version 1 and measuring before each rewrite was the only way to know which optimisations actually mattered.