Our market data distribution problem was straightforward to state and hard to solve: deliver price updates to a dozen internal consumers with sub-500µs latency at 400,000 messages/second, with no head-of-line blocking between consumers.

TCP broadcasting is serial — slow consumers stall fast ones. ZeroMQ was promising but showed GC pressure from its buffer management. Kafka was built for durability, not microsecond latency. When Martin Thompson and Todd Montgomery open-sourced Aeron in 2014, it solved almost exactly this problem.

What Aeron Is

Aeron is a message transport library for Java (with C++ and .NET bindings) that provides:

  • Reliable multicast and unicast over UDP — the reliability layer is in userspace, not the kernel
  • Log-based message ordering — messages are written to a circular log buffer, consumers read at their own pace
  • Zero-copy delivery — the subscriber callback receives a DirectBuffer pointing into the shared memory-mapped file, no copying
  • Busy-spin and back-off polling models — you choose the latency/CPU trade-off
  • Built-in flow control — publishers slow or stop if subscribers fall behind (configurable)

The architecture that makes it fast:

Publisher thread                     Subscriber thread(s)
     │                                      │
     ▼                                      ▼
 Publication                           Subscription
     │                                      │
     ▼                                      ▼
 Log Buffer  ←── memory-mapped file ──→  Log Buffer
     │                                      │
     └──── Media Driver (single process) ───┘
              │               │
         UDP send          UDP recv
              │               │
         ─────────── Network ──────────

The Media Driver is a separate process (or embedded in the application) that owns the network I/O. Publication and subscription happen through shared memory — the driver reads from the publication log and writes to the subscription log using sun.misc.Unsafe operations. No system calls on the critical path.
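
Nothing Aeron-specific is needed to see the mechanism. Below is a minimal JDK-only sketch (not Aeron's actual log-buffer code): map a file, and a write into the mapped region becomes visible to any other process that maps the same file — no system call per message.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SharedMemoryDemo {
    public static void main(String[] args) throws IOException {
        // Map a 4KB file; on Linux, put it under /dev/shm for tmpfs-backed pages.
        Path path = Path.of(System.getProperty("java.io.tmpdir"), "demo-log-buffer");
        try (FileChannel ch = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            // "Publishing" is just a store into the mapped region — no syscall here.
            buf.putLong(0, 42L);
            // A second process mapping the same file would observe this value.
            System.out.println(buf.getLong(0));
        }
    }
}
```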

The Subscriber Pattern

final Aeron.Context ctx = new Aeron.Context()
    .aeronDirectoryName("/dev/shm/aeron");  // tmpfs for low latency

try (Aeron aeron = Aeron.connect(ctx);
     Subscription sub = aeron.addSubscription(
         "aeron:udp?control=224.0.1.1:40456|interface=10.0.1.0/24",
         STREAM_ID)) {

    final FragmentHandler handler = (buffer, offset, length, header) -> {
        // buffer is a flyweight over shared memory — read directly, don't copy
        final long sequence = buffer.getLong(offset);
        final double bid    = buffer.getDouble(offset + 8);
        final double ask    = buffer.getDouble(offset + 16);
        processQuote(sequence, bid, ask);
    };

    // Busy-spin for minimum latency:
    final IdleStrategy idle = new BusySpinIdleStrategy();
    while (!Thread.currentThread().isInterrupted()) {
        final int fragments = sub.poll(handler, FRAGMENT_LIMIT);
        idle.idle(fragments);
    }
}

The URI aeron:udp?control=224.0.1.1:40456|interface=10.0.1.0/24 tells the driver to join multicast group 224.0.1.1 via an interface on the 10.0.1.0/24 subnet. The control address is the endpoint Aeron uses for its NAK-based reliability and flow-control traffic.
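
The 24-byte quote layout used by the handler can be exercised with a plain ByteBuffer. A self-contained sketch of the encode/decode round trip (Aeron's UnsafeBuffer uses the same offset arithmetic and defaults to native byte order, which this mirrors):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class QuoteCodecDemo {
    public static void main(String[] args) {
        // Same offsets as the handler above: sequence @0, bid @8, ask @16.
        ByteBuffer buf = ByteBuffer.allocateDirect(24).order(ByteOrder.nativeOrder());

        // Encode (publisher side):
        buf.putLong(0, 1001L);
        buf.putDouble(8, 99.25);
        buf.putDouble(16, 99.27);

        // Decode (subscriber side):
        long sequence = buf.getLong(0);
        double bid = buf.getDouble(8);
        double ask = buf.getDouble(16);
        System.out.println(sequence + " " + bid + " " + ask); // 1001 99.25 99.27
    }
}
```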

BusySpinIdleStrategy keeps the thread running at 100% CPU, yielding nothing, for minimum latency. Alternatives are SleepingIdleStrategy (trades latency for CPU) and YieldingIdleStrategy (a compromise). In production, for the market data path, we used busy-spin on a dedicated CPU core, isolated with isolcpus and thread affinity.
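
The trade-off between these strategies can be sketched with a simplified back-off loop. This is an illustrative reimplementation, not Agrona's BackoffIdleStrategy: spin while hot, then yield, then park for increasing intervals, resetting whenever work arrives.

```java
import java.util.concurrent.locks.LockSupport;

public class SimpleBackoffIdle {
    private static final int MAX_SPINS = 100;
    private static final int MAX_YIELDS = 10;
    private static final long MAX_PARK_NANOS = 1_000_000L; // cap parking at 1ms

    private int spins, yields;
    private long parkNanos = 1_000L;

    /** Call with the number of fragments just processed. */
    public void idle(int workCount) {
        if (workCount > 0) {              // made progress: reset to hot spinning
            spins = 0; yields = 0; parkNanos = 1_000L;
        } else if (spins < MAX_SPINS) {   // hottest: burn CPU for minimum latency
            spins++;
            Thread.onSpinWait();
        } else if (yields < MAX_YIELDS) { // warm: give up the core briefly
            yields++;
            Thread.yield();
        } else {                          // cold: sleep with exponential back-off
            LockSupport.parkNanos(parkNanos);
            parkNanos = Math.min(parkNanos * 2, MAX_PARK_NANOS);
        }
    }

    public static void main(String[] args) {
        SimpleBackoffIdle idle = new SimpleBackoffIdle();
        for (int i = 0; i < 105; i++) idle.idle(0); // 100 spins, then 5 yields
        System.out.println(idle.spins + " " + idle.yields); // 100 5
    }
}
```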

The Publisher Pattern

try (Aeron aeron = Aeron.connect(ctx);
     Publication pub = aeron.addPublication(
         "aeron:udp?control=224.0.1.1:40456|control-mode=dynamic|interface=10.0.1.0/24",
         STREAM_ID)) {

    final UnsafeBuffer buffer = new UnsafeBuffer(
        ByteBuffer.allocateDirect(64));

    // Publication loop:
    while (true) {
        final QuoteUpdate update = queue.poll();
        if (update == null) continue;

        buffer.putLong(0,    update.sequence);
        buffer.putDouble(8,  update.bid);
        buffer.putDouble(16, update.ask);

        long result;
        do {
            result = pub.offer(buffer, 0, 24);
        } while (result == Publication.BACK_PRESSURED ||
                 result == Publication.ADMIN_ACTION);

        if (result < 0) {
            // NOT_CONNECTED or MAX_POSITION_EXCEEDED — handle separately
        }
    }
}

pub.offer() returns the position in the log buffer on success, or a negative status code. BACK_PRESSURED means the fastest subscriber’s window is full — we’re publishing faster than it can consume. The spin on BACK_PRESSURED implements our flow control: we don’t drop messages, we slow the publisher.
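
The retry logic above generalises into a small predicate. A self-contained sketch; the constant names mirror those on Aeron's Publication, but the numeric values here are illustrative only:

```java
public class OfferResultDemo {
    // Status codes mirroring Aeron's Publication constants (values illustrative):
    static final long NOT_CONNECTED = -1, BACK_PRESSURED = -2,
                      ADMIN_ACTION = -3, CLOSED = -4, MAX_POSITION_EXCEEDED = -5;

    /** Returns true if the caller should spin and retry the offer. */
    static boolean shouldRetry(long result) {
        // >= 0 is the new stream position: success, nothing to retry.
        if (result >= 0) return false;
        // Transient conditions (flow control, log rotation): retry.
        if (result == BACK_PRESSURED || result == ADMIN_ACTION) return true;
        // NOT_CONNECTED, CLOSED, MAX_POSITION_EXCEEDED: escalate instead.
        return false;
    }

    public static void main(String[] args) {
        System.out.println(shouldRetry(BACK_PRESSURED)); // transient: retry
        System.out.println(shouldRetry(1024L));          // success: no retry
        System.out.println(shouldRetry(NOT_CONNECTED));  // terminal: escalate
    }
}
```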

Latency Profile

Our benchmark setup: two processes on the same host, connected over loopback, measuring round-trip latency (publisher sends, subscriber receives and publishes a response, original publisher measures).

Aeron loopback round-trip latency (busy-spin, tmpfs):

  p50:     800 ns
  p90:   1,100 ns
  p99:   1,800 ns
  p99.9: 4,200 ns
  max:   31,000 ns  (occasional JIT or GC spike)

vs. comparable ZeroMQ measurement:

  p50:   3,200 ns
  p90:   5,800 ns
  p99:  14,000 ns
  p99.9: 89,000 ns

The Aeron numbers are dominated by memory access patterns, not protocol overhead. The occasional max spike was traced to JIT compilation warm-up — after full steady state, the max dropped to ~8µs.

Over real 10GbE network between two servers:

Network round-trip (Aeron UDP unicast):

  p50:   8,200 ns
  p90:  10,100 ns
  p99:  14,300 ns
  p99.9: 22,000 ns

The floor is the wire round-trip (~5µs for 10GbE at short distances). Aeron adds ~2–4µs for serialisation, kernel network stack, and shared memory transfer.

The Reliability Layer

Aeron’s reliability over UDP uses NAK (Negative Acknowledgement): subscribers detect gaps in the log position sequence and send a NAK to the publisher, which retransmits. Unlike TCP, this is per-subscriber — one slow subscriber’s retransmit requests don’t affect others.

The log buffer size controls how much history is available for retransmit:

# In aeron.properties (comments use # and must be on their own line):
# 16MB per stream
aeron.publication.term.buffer.length=16777216

A 16MB term buffer at 400k messages/second × 24 bytes/message = ~9.6MB/s, so ~1.7 seconds of history. Any subscriber that falls more than 1.7 seconds behind will lose messages — intentionally. This is a feature: you get to define how much tail latency you’ll tolerate before declaring a subscriber too slow to keep up.
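
The same sizing arithmetic as a self-contained calculation (message rate and size from the text; per-message framing overhead is ignored, as in the estimate above):

```java
import java.util.Locale;

public class TermBufferSizing {
    public static void main(String[] args) {
        long termBufferBytes = 16L * 1024 * 1024; // 16MB term buffer
        long msgsPerSecond = 400_000;             // market data rate
        long bytesPerMessage = 24;                // sequence + bid + ask

        double bytesPerSecond = msgsPerSecond * bytesPerMessage;      // 9.6 MB/s
        double secondsOfHistory = termBufferBytes / bytesPerSecond;   // ~1.7 s
        System.out.printf(Locale.ROOT, "%.1f MB/s, %.1f s of history%n",
                bytesPerSecond / 1e6, secondsOfHistory);
    }
}
```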

What We Shipped

The production system had:

  • 1 publisher process (normalised prices from the exchange gateway)
  • 12 subscriber processes (risk engine, order management, 10 algo strategies)
  • Aeron Media Driver running as a separate privileged process with real-time scheduling priority
  • CPU affinity: publisher pinned to core 2, driver to core 3, each subscriber to its own isolated core
  • Log buffers on tmpfs (/dev/shm) to avoid page faults
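
The core-isolation setup in the list above can be sketched as follows; core numbers, jar name, and main-class names are illustrative, not our actual deployment scripts:

```shell
# Kernel boot parameters (e.g. appended to GRUB_CMDLINE_LINUX), reserving
# cores 2-15 so the scheduler places nothing else on them:
#   isolcpus=2-15 nohz_full=2-15
#
# Pin each process to its own isolated core at launch:
#   taskset -c 2 java -cp app.jar PublisherMain
#   taskset -c 3 java -cp app.jar MediaDriverMain
#   taskset -c 4 java -cp app.jar RiskEngineMain
```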

After 3 months in production: zero messages lost, p99 measured at the subscriber consistently under 200µs from the point the publisher called offer(). Total CPU usage: ~130% (1.3 full cores for the entire multicast fan-out to 12 consumers).

The alternative — 12 separate TCP connections from the publisher to the subscribers — would have required serialising 12 write operations, each with its own buffer copy and kernel call. The CPU cost would have been several times higher, and a slow subscriber would have introduced head-of-line blocking.

Aeron is not general-purpose messaging infrastructure. It’s a tool for a specific problem: high-throughput, low-latency broadcast to multiple consumers where message drop on a slow consumer is acceptable. That problem description fits market data distribution exactly. For everything else, use Kafka.