The Disruptor’s performance isn’t magic. It’s the consequence of a set of deliberate memory layout decisions, each targeting a specific cache coherency problem. This post goes through those decisions one by one.
The Hardware Context
Modern CPUs don’t read individual bytes from RAM. They read cache lines — 64 contiguous bytes at a time on x86. When two CPU cores write to different variables that happen to sit in the same cache line, they’re constantly invalidating each other’s cached copy. This is false sharing, and it can be catastrophically expensive.
Cache line (64 bytes):
┌────────┬────────┬─────┬─────────┬─────┬─────────┐
│ byte 0 │ byte 1 │ ... │ byte 31 │ ... │ byte 63 │
└────────┴────────┴─────┴─────────┴─────┴─────────┘
    ↑                        ↑
 Thread A                 Thread B
 writes here              writes here
→ both threads invalidate the entire cache line on every write
Under the MESI cache coherency protocol, every write by Thread A invalidates Thread B’s cached copy of the line. Thread B then has to re-fetch it from L3 (or worse, RAM) on its next access. At nanosecond timescales this is the difference between hitting L1 (~1ns) and going to L3 (~40ns) or RAM (~80ns).
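A minimal sketch of the effect, assuming a simple two-thread increment workload (class and field names here are illustrative, not from the Disruptor source). Note one caveat the real Disruptor works around: HotSpot may reorder fields within a single class, so flat in-class padding like `PaddedPair` is not guaranteed to hold — the Disruptor splits its padding across a class hierarchy instead.

```java
// SharedPair: both counters likely land on the same cache line -> false sharing.
final class SharedPair {
    volatile long a;
    volatile long b;
}

// PaddedPair: 56 bytes of dead longs intended to keep b off a's cache line.
// (HotSpot may reorder fields, so this flat layout is illustrative only.)
final class PaddedPair {
    volatile long a;
    long p1, p2, p3, p4, p5, p6, p7;
    volatile long b;
}

public final class FalseSharingDemo {
    static final long ITERATIONS = 10_000_000L;

    static long timeNanos(Runnable w1, Runnable w2) throws InterruptedException {
        Thread t1 = new Thread(w1), t2 = new Thread(w2);
        long start = System.nanoTime();
        t1.start(); t2.start();
        t1.join(); t2.join();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws InterruptedException {
        SharedPair shared = new SharedPair();
        long sharedNs = timeNanos(
            () -> { for (long i = 0; i < ITERATIONS; i++) shared.a++; },
            () -> { for (long i = 0; i < ITERATIONS; i++) shared.b++; });

        PaddedPair padded = new PaddedPair();
        long paddedNs = timeNanos(
            () -> { for (long i = 0; i < ITERATIONS; i++) padded.a++; },
            () -> { for (long i = 0; i < ITERATIONS; i++) padded.b++; });

        System.out.printf("shared: %d ms, padded: %d ms%n",
                sharedNs / 1_000_000, paddedNs / 1_000_000);
    }
}
```

On most x86 boxes the padded variant runs noticeably faster, though the exact ratio depends on CPU topology and which cores the threads land on.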
The Disruptor Ring Buffer Layout
The ring buffer is a pre-allocated, fixed-size array. All slots are allocated at startup — no allocation on the hot path. The ring has several critical memory regions, each padded to avoid false sharing with the others.
RingBuffer memory layout:
┌─────────────────────────────────────────────────────────┐
│ 56 bytes padding │ ← push BUFFER_PAD away from
├─────────────────────────────────────────────────────────┤ anything before it
│ long p1,p2,p3,p4,p5,p6,p7 (56 bytes pre-padding) │
├─────────────────────────────────────────────────────────┤
│ Object[] entries (reference, 8 bytes) │ ← the actual array
├─────────────────────────────────────────────────────────┤
│ long p8,p9,p10,p11,p12,p13,p14 (56 bytes post-padding) │ ← push entries away from
└─────────────────────────────────────────────────────────┘ anything after it
The entries reference lives alone on its cache line. Reads from different threads don’t interfere because reads don’t cause invalidation — only writes do. But if entries shared a cache line with a frequently-written field (like a sequence counter), every sequence update would force a cache miss on the next array access.
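A sketch of the inheritance-based version of this layout, simplified from the Disruptor’s RingBufferPad/RingBufferFields pattern (class names here are abridged and illustrative). Superclass fields are laid out before subclass fields, so the two pad blocks bracket `entries`, and every slot is filled at construction time so the hot path never allocates:

```java
// 56 bytes laid out before the entries reference.
abstract class RingPad {
    protected long p1, p2, p3, p4, p5, p6, p7;
}

abstract class RingFields<E> extends RingPad {
    protected final Object[] entries;   // the reference this padding protects
    protected final int indexMask;

    RingFields(int size, java.util.function.Supplier<E> factory) {
        if (Integer.bitCount(size) != 1)
            throw new IllegalArgumentException("size must be a power of two");
        indexMask = size - 1;
        entries = new Object[size];
        for (int i = 0; i < size; i++)
            entries[i] = factory.get(); // pre-allocate every slot at startup
    }
}

final class Ring<E> extends RingFields<E> {
    // 56 bytes laid out after the entries reference.
    protected long p8, p9, p10, p11, p12, p13, p14;

    Ring(int size, java.util.function.Supplier<E> factory) {
        super(size, factory);
    }

    @SuppressWarnings("unchecked")
    E get(long sequence) {
        // power-of-two size: bitwise AND instead of modulo
        return (E) entries[(int) sequence & indexMask];
    }
}
```

Splitting the pads across the class hierarchy matters because HotSpot is free to reorder fields within one class, but not to interleave a subclass’s fields among its superclass’s.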
Sequence Counter Padding
The sequence counter is the most written field in the system — updated by producers on every publish, read by consumers on every poll. It’s the most dangerous source of false sharing.
The Disruptor pads every sequence so that it occupies a full cache line on its own.
The value field is surrounded by 56 bytes of padding on each side. It occupies its own cache line, completely isolated from any adjacent fields or objects. Producer threads can increment it without ever interfering with consumer thread reads of surrounding data.
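A sketch of this padded-sequence pattern, modeled on the Disruptor’s Sequence class (which likewise splits the padding across a small class hierarchy so the pads cannot be reordered next to each other). The VarHandle-based accessors are this sketch’s stand-in for the Unsafe-based ones in the original:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// 56 bytes before the value.
class LhsPadding {
    protected long p1, p2, p3, p4, p5, p6, p7;
}

class Value extends LhsPadding {
    protected volatile long value;
}

class PaddedSequence extends Value {
    // 56 bytes after the value: it now owns its cache line.
    protected long p9, p10, p11, p12, p13, p14, p15;

    private static final VarHandle VALUE;
    static {
        try {
            VALUE = MethodHandles.lookup()
                    .findVarHandle(Value.class, "value", long.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    PaddedSequence(long initial) { value = initial; }

    long get() { return value; }                       // volatile read
    void set(long v) { VALUE.setRelease(this, v); }    // ordered (release) write
    boolean compareAndSet(long expect, long update) {
        return VALUE.compareAndSet(this, expect, update);
    }
}
```

The release-ordered `set` is the common case on the publish path: it orders prior writes without the full fence a plain volatile store would pay.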
Event Slot Padding
Individual event slots in the ring buffer are typically not padded (the events are your domain objects). However, the ring buffer size should be chosen as a power of two, which:
- Enables the modulo operation to be replaced by a bitwise AND: `index = sequence & (size - 1)` instead of `sequence % size`, roughly 5x faster
- Ensures the ring fits cleanly into cache lines (no partial-line reads at the boundary)
The Gating Sequence Design
Consumers expose their current sequence to producers via a “gating sequence.” The producer uses this to detect when the ring is full — it can’t write to slot N if consumer C hasn’t processed slot N - ringSize yet.
Without padding, the producer thread reading gating sequences and the consumer thread writing them would false-share:
Without padding:
Cache line: [ consumer_seq | producer_cursor | ... ]
                  ↑ consumer writes here   ↑ producer reads here
→ false sharing on every consumer advance
The fix: each sequence gets its own cache line, as shown above. The producer can read gating sequences freely without blocking on consumer writes to the same cache line.
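The wrap check itself can be sketched as follows, for the single-producer case (a simplified version of the claim logic; the caching of the last observed gating sequence mirrors what the Disruptor does to avoid re-reading the consumer’s volatile sequence on every claim — class and method names here are hypothetical):

```java
final class SingleProducerClaim {
    private final int ringSize;
    private long nextSequence = -1;  // last slot claimed by this producer
    private long cachedGating = -1;  // last observed consumer position

    SingleProducerClaim(int ringSize) { this.ringSize = ringSize; }

    // consumerSequence would normally be read from the consumer's
    // padded gating Sequence, not passed in as a parameter.
    boolean tryClaim(long consumerSequence) {
        long next = nextSequence + 1;
        long wrapPoint = next - ringSize;   // this slot must be consumed first
        if (wrapPoint > cachedGating) {
            cachedGating = consumerSequence; // refresh from the volatile gate
            if (wrapPoint > cachedGating)
                return false;                // ring is full; caller must wait
        }
        nextSequence = next;
        return true;
    }

    long claimed() { return nextSequence; }
}
```

The cached copy means the producer only touches the consumer’s cache line when it actually has to, which keeps the fast path read-free with respect to the gate.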
Measured Impact
| Configuration | Throughput (M ops/s) | p99 latency |
|---|---|---|
| No padding anywhere | 18 | 890 µs |
| Sequence padding only | 51 | 310 µs |
| Full Disruptor layout | 93 | 134 µs |
| Full layout + CPU affinity | 107 | 89 µs |
These numbers are from our internal benchmark (single producer, single consumer, Intel Xeon E5-2687W, JDK 7u51). Your numbers will vary with CPU generation and topology, but the relative improvement from padding is consistent.
The Lesson
The Disruptor doesn’t do anything the JVM can’t do. `volatile long` fields and array reads are ordinary Java. The performance comes entirely from where things are in memory relative to each other, and from ensuring that concurrent actors never share a cache line they don’t need to share.
This is mechanical sympathy applied to data structure design: understand what the hardware cares about (cache line isolation), then design your memory layout to provide it. The padding looks wasteful — 56 bytes of dead fields for every 8-byte counter — but the performance multiple it buys is not available by any other means.