In mid-2013 we replaced our internal LinkedBlockingQueue-based event bus with the LMAX Disruptor. Median latency dropped by 30%. The 99th percentile dropped by more than half. The change touched about 400 lines of code.

This post is about the conceptual model you need to understand why the Disruptor is fast — not just “it uses a ring buffer,” but what that actually means for your hardware.

What’s Wrong with BlockingQueue

LinkedBlockingQueue uses two ReentrantLock instances — one for head (consumers), one for tail (producers). For single-producer/single-consumer this is reasonable. For our use case — one producer, multiple consumers at different stages of a pipeline — it meant lock contention, and worse, it meant allocation.

LinkedBlockingQueue allocates a Node per enqueued item. At 200,000 events/second, that’s 200,000 objects/second into eden space. You know where this ends: GC.

ArrayBlockingQueue avoids the allocation but uses a single lock for both head and tail — worse contention.

The Ring Buffer Mental Model

A ring buffer is a fixed-size array: the producer writes at position sequence % size, and the consumer reads using the same calculation. When you write entry 1024 into a 1024-slot ring buffer, you wrap around and overwrite slot 0.
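The wrap-around is cheap to compute. Here is a minimal sketch (class and method names are mine, not the Disruptor's) relying on the Disruptor's real power-of-two size requirement, which exists precisely so that the modulo reduces to a bitmask:

```java
// Sequence-to-slot mapping. With a power-of-two size, seq % SIZE
// reduces to a single bitwise AND instead of an integer division.
public class RingIndex {
    static final int SIZE = 1024;        // must be a power of two
    static final int MASK = SIZE - 1;

    static int slot(long sequence) {
        return (int) (sequence & MASK);  // equivalent to sequence % SIZE
    }

    public static void main(String[] args) {
        System.out.println(slot(1023));  // last slot: 1023
        System.out.println(slot(1024));  // wraps back to slot 0
    }
}
```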

The Disruptor’s ring buffer is pre-allocated at startup — all slots contain objects that get mutated, never replaced. No allocation on the hot path, ever.
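A sketch of that pre-allocation idea — PriceEvent and its fields are hypothetical stand-ins, not the Disruptor's API. All allocation happens once in the constructor; the hot path only mutates existing objects:

```java
// Pre-allocated ring: `new` runs at startup only, never on the hot path.
public class PreallocatedRing {
    static class PriceEvent { long instrumentId; double price; }

    final PriceEvent[] ring = new PriceEvent[1024];

    PreallocatedRing() {
        for (int i = 0; i < ring.length; i++) {
            ring[i] = new PriceEvent();   // startup only: never again
        }
    }

    void write(int slot, long id, double px) {
        PriceEvent e = ring[slot];        // hot path: mutate in place, zero garbage
        e.instrumentId = id;
        e.price = px;
    }
}
```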

Position in the ring is tracked by a sequence counter — a single long, incremented atomically. A CAS on a long is far cheaper than taking a lock: no kernel involvement, no thread suspension, just a compare-and-swap instruction (and with a single producer, even the CAS disappears, since only one thread ever increments).

Producer claims sequence N → writes into slot N % size → publishes N
Consumer sees sequence N published → reads slot N % size → processes
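The two lines above can be sketched as a single-producer/single-consumer mini ring. Names are mine; the real Disruptor splits this across Sequencer, RingBuffer, and Sequence classes, but the claim → write → publish ordering is the same:

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal claim/write/publish protocol, single producer, single consumer.
public class MiniRing {
    static final int SIZE = 8, MASK = SIZE - 1;
    final long[] slots = new long[SIZE];             // stands in for pre-allocated events
    final AtomicLong published = new AtomicLong(-1); // highest sequence visible to readers
    long next = 0;                                   // producer-local claim counter

    void publish(long value) {
        long seq = next++;                  // 1. claim sequence N
        slots[(int) (seq & MASK)] = value;  // 2. write into slot N % size
        published.set(seq);                 // 3. publish N; the volatile store makes
    }                                       //    the slot write visible to the consumer

    long read(long seq) {
        while (published.get() < seq) {     // wait until N is published
            Thread.onSpinWait();
        }
        return slots[(int) (seq & MASK)];
    }
}
```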

Multiple consumers can read the same slot independently — they each track their own sequence. No coordination is needed between consumers unless you explicitly model a dependency (which the Disruptor’s EventHandlerGroup handles).

Why the Layout Matters

The ring buffer is a contiguous array. When a consumer reads slot N, the hardware prefetcher predicts it will need slot N+1 next and fetches it into cache. That is a sequential access pattern — the most cache-friendly there is. Compare that to a linked list, where each node can be anywhere in memory.

Each sequence counter is padded out to occupy a full cache line (the original Disruptor does this with manual long-field padding; @Contended is the modern equivalent). Producers and consumers never share a cache line, so there is no false sharing on the hot path.
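A simplified sketch of that manual padding, in the style of the Disruptor's Sequence class (field names mine; the real implementation also splits the padding across superclasses to defeat JVM field reordering):

```java
// Seven longs either side of `value` push any neighbouring hot field
// out of the same 64-byte cache line, preventing false sharing.
public class PaddedSequence {
    protected long p1, p2, p3, p4, p5, p6, p7;      // left padding
    volatile long value = -1;                       // the actual counter
    protected long p8, p9, p10, p11, p12, p13, p14; // right padding

    long incrementAndGet() {
        return ++value;  // single-writer increment; the volatile write publishes it
    }
}
```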

The combination — sequential array access + cache-line padded sequences + no allocation + CAS instead of locks — is why the numbers look the way they do.

The Pipeline Model

What made the Disruptor particularly useful for our pricing pipeline was the dependency graph model.

Our pipeline had stages:

  1. FIX message decode
  2. Price normalisation
  3. Risk check
  4. Quote generation
  5. Client distribution

In the BlockingQueue world, each stage was a separate queue. You enqueued into stage N’s queue, the stage N thread consumed it, and enqueued into stage N+1’s queue. Five queues, five sets of locks, five sets of allocations.

With the Disruptor: one ring buffer, five event processors. The dependency declaration tells the Disruptor that quote generation can’t start processing event N until risk check has finished processing event N. The sequence barrier handles this with a spin-wait on the upstream sequence — no locks.

EventHandlerGroup<PriceEvent> decoded    = disruptor.handleEventsWith(new FIXDecoder());
EventHandlerGroup<PriceEvent> normalised = decoded.then(new PriceNormaliser());
EventHandlerGroup<PriceEvent> checked    = normalised.then(new RiskChecker());
EventHandlerGroup<PriceEvent> quoted     = checked.then(new QuoteGenerator());
quoted.then(new ClientDistributor());

Five lines replacing five queues, five threads managing their own blocking, and the latency profile that came with it.

What It Doesn’t Fix

The Disruptor is not a silver bullet:

  • Back-pressure: if consumers are slower than producers, the ring buffer fills up. The producer then blocks on the next claim (or, if you claim via tryNext, fails fast with InsufficientCapacityException). You need to size the buffer for your burst profile.
  • Single-producer assumption: SingleProducerSequencer is faster but you must guarantee one producer. Multiple producers require MultiProducerSequencer which uses CAS — still fast, but more overhead.
  • WaitStrategy matters: BusySpinWaitStrategy gives lowest latency but burns a CPU core. BlockingWaitStrategy is more CPU-friendly but adds latency. We used YieldingWaitStrategy as a compromise.
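The yielding compromise is easy to see in miniature. This is a simplified sketch of the trade-off YieldingWaitStrategy makes, not the real class: spin for a bounded number of tries for the lowest latency, then fall back to Thread.yield() so the core isn’t pinned forever:

```java
import java.util.concurrent.atomic.AtomicLong;

// Two-phase wait: bounded busy-spin, then yield.
public class YieldingWait {
    static long waitFor(long sequence, AtomicLong cursor) {
        int retries = 100;
        long available;
        while ((available = cursor.get()) < sequence) {
            if (retries > 0) {
                retries--;        // busy-spin phase: lowest latency, burns CPU
            } else {
                Thread.yield();   // yield phase: cheaper on CPU, slightly slower
            }
        }
        return available;
    }
}
```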

Numbers

Before/after on our pricing pipeline (p99, microseconds):

Stage               BlockingQueue   Disruptor
FIX decode          45 µs           12 µs
Normalise + risk    78 µs           21 µs
End-to-end (p99)    312 µs          134 µs

The gains compound through the pipeline because you’re removing queue overhead at every stage, not just one.