The Disruptor’s performance isn’t magic. It’s the consequence of a set of deliberate memory layout decisions, each targeting a specific cache coherency problem. This post goes through those decisions one by one.
The Hardware Context
Modern CPUs don’t read individual bytes from RAM. They read cache lines — 64 contiguous bytes at a time on x86. When two CPU cores write to different variables that happen to sit in the same cache line, they’re constantly invalidating each other’s cached copy. This is false sharing, and it can be catastrophically expensive.
Cache line (64 bytes):
┌────────┬────────┬─────┬─────────┬─────┬─────────┐
│ byte 0 │ byte 1 │ ... │ byte 31 │ ... │ byte 63 │
└────────┴────────┴─────┴─────────┴─────┴─────────┘
    ↑                        ↑
 Thread A                 Thread B
 writes here              writes here
→ both threads invalidate the entire cache line on every write
Under the MESI cache coherency protocol, every write by Thread A invalidates Thread B’s cached copy of the line. Thread B then has to re-fetch it from L3 (or worse, RAM) on its next access. At nanosecond timescales this is the difference between hitting L1 (~1ns) and going to L3 (~40ns) or RAM (~80ns).
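A minimal sketch of the effect, assuming a simple two-thread increment workload (class and field names here are illustrative, not from the Disruptor source). Note one caveat the real Disruptor works around: HotSpot may reorder fields within a single class, so flat in-class padding like `PaddedPair` is not guaranteed to hold — the Disruptor splits its padding across a class hierarchy instead.

```java
// SharedPair: both counters likely land on the same cache line -> false sharing.
final class SharedPair {
    volatile long a;
    volatile long b;
}

// PaddedPair: 56 bytes of dead longs intended to keep b off a's cache line.
// (HotSpot may reorder fields, so this flat layout is illustrative only.)
final class PaddedPair {
    volatile long a;
    long p1, p2, p3, p4, p5, p6, p7;
    volatile long b;
}

public final class FalseSharingDemo {
    static final long ITERATIONS = 10_000_000L;

    static long timeNanos(Runnable w1, Runnable w2) throws InterruptedException {
        Thread t1 = new Thread(w1), t2 = new Thread(w2);
        long start = System.nanoTime();
        t1.start(); t2.start();
        t1.join(); t2.join();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws InterruptedException {
        SharedPair shared = new SharedPair();
        long sharedNs = timeNanos(
            () -> { for (long i = 0; i < ITERATIONS; i++) shared.a++; },
            () -> { for (long i = 0; i < ITERATIONS; i++) shared.b++; });

        PaddedPair padded = new PaddedPair();
        long paddedNs = timeNanos(
            () -> { for (long i = 0; i < ITERATIONS; i++) padded.a++; },
            () -> { for (long i = 0; i < ITERATIONS; i++) padded.b++; });

        System.out.printf("shared: %d ms, padded: %d ms%n",
                sharedNs / 1_000_000, paddedNs / 1_000_000);
    }
}
```

On most x86 boxes the padded variant runs noticeably faster, though the exact ratio depends on CPU topology and which cores the threads land on.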
The Disruptor Ring Buffer Layout
The ring buffer is a pre-allocated, fixed-size array. All slots are allocated at startup — no allocation on the hot path. The ring has several critical memory regions, each padded to avoid false sharing with the others.
RingBuffer memory layout:
┌─────────────────────────────────────────────────────────┐
│ 56 bytes padding │ ← push BUFFER_PAD away from
├─────────────────────────────────────────────────────────┤ anything before it
│ long p1,p2,p3,p4,p5,p6,p7 (56 bytes pre-padding) │
├─────────────────────────────────────────────────────────┤
│ Object[] entries (reference, 8 bytes) │ ← the actual array
├─────────────────────────────────────────────────────────┤
│ long p8,p9,p10,p11,p12,p13,p14 (56 bytes post-padding) │ ← push entries away from
└─────────────────────────────────────────────────────────┘ anything after it
The entries reference lives alone on its cache line. Reads from different threads don’t interfere because reads don’t cause invalidation — only writes do. But if entries shared a cache line with a frequently-written field (like a sequence counter), every sequence update would force a cache miss on the next array access.
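A sketch of the inheritance-based version of this layout, simplified from the Disruptor’s RingBufferPad/RingBufferFields pattern (class names here are abridged and illustrative). Superclass fields are laid out before subclass fields, so the two pad blocks bracket `entries`, and every slot is filled at construction time so the hot path never allocates:

```java
// 56 bytes laid out before the entries reference.
abstract class RingPad {
    protected long p1, p2, p3, p4, p5, p6, p7;
}

abstract class RingFields<E> extends RingPad {
    protected final Object[] entries;   // the reference this padding protects
    protected final int indexMask;

    RingFields(int size, java.util.function.Supplier<E> factory) {
        if (Integer.bitCount(size) != 1)
            throw new IllegalArgumentException("size must be a power of two");
        indexMask = size - 1;
        entries = new Object[size];
        for (int i = 0; i < size; i++)
            entries[i] = factory.get(); // pre-allocate every slot at startup
    }
}

final class Ring<E> extends RingFields<E> {
    // 56 bytes laid out after the entries reference.
    protected long p8, p9, p10, p11, p12, p13, p14;

    Ring(int size, java.util.function.Supplier<E> factory) {
        super(size, factory);
    }

    @SuppressWarnings("unchecked")
    E get(long sequence) {
        // power-of-two size: bitwise AND instead of modulo
        return (E) entries[(int) sequence & indexMask];
    }
}
```

Splitting the pads across the class hierarchy matters because HotSpot is free to reorder fields within one class, but not to interleave a subclass’s fields among its superclass’s.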
Sequence Counter Padding
The sequence counter is the most written field in the system — updated by producers on every publish, read by consumers on every poll. It’s the most dangerous source of false sharing.
The Disruptor pads every sequence so that it occupies a full cache line on its own.
The value field is surrounded by 56 bytes of padding on each side. It occupies its own cache line, completely isolated from any adjacent fields or objects. Producer threads can increment it without ever interfering with consumer thread reads of surrounding data.
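A sketch of this padded-sequence pattern, modeled on the Disruptor’s Sequence class (which likewise splits the padding across a small class hierarchy so the pads cannot be reordered next to each other). The VarHandle-based accessors are this sketch’s stand-in for the Unsafe-based ones in the original:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// 56 bytes before the value.
class LhsPadding {
    protected long p1, p2, p3, p4, p5, p6, p7;
}

class Value extends LhsPadding {
    protected volatile long value;
}

class PaddedSequence extends Value {
    // 56 bytes after the value: it now owns its cache line.
    protected long p9, p10, p11, p12, p13, p14, p15;

    private static final VarHandle VALUE;
    static {
        try {
            VALUE = MethodHandles.lookup()
                    .findVarHandle(Value.class, "value", long.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    PaddedSequence(long initial) { value = initial; }

    long get() { return value; }                       // volatile read
    void set(long v) { VALUE.setRelease(this, v); }    // ordered (release) write
    boolean compareAndSet(long expect, long update) {
        return VALUE.compareAndSet(this, expect, update);
    }
}
```

The release-ordered `set` is the common case on the publish path: it orders prior writes without the full fence a plain volatile store would pay.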
Event Slot Padding
Individual event slots in the ring buffer are typically not padded (the events are your domain objects). However, the ring buffer size should be chosen as a power of two, which:
- Enables the modulo operation to be replaced by a bitwise AND: `index = sequence & (size - 1)` instead of `sequence % size`, roughly 5x faster
- Ensures the ring fits cleanly into cache lines (no partial-line reads at the boundary)
The Gating Sequence Design
Consumers expose their current sequence to producers via a “gating sequence.” The producer uses this to detect when the ring is full — it can’t write to slot N if consumer C hasn’t processed slot N - ringSize yet.
Without padding, the producer thread reading gating sequences and the consumer thread writing them would false-share:
Without padding:
Cache line: [ consumer_seq | producer_cursor | ... ]
                  ↑ consumer writes here   ↑ producer reads here
→ false sharing on every consumer advance
The fix: each sequence gets its own cache line, as shown above. The producer can read gating sequences freely without blocking on consumer writes to the same cache line.
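The wrap check itself can be sketched as follows, for the single-producer case (a simplified version of the claim logic; the caching of the last observed gating sequence mirrors what the Disruptor does to avoid re-reading the consumer’s volatile sequence on every claim — class and method names here are hypothetical):

```java
final class SingleProducerClaim {
    private final int ringSize;
    private long nextSequence = -1;  // last slot claimed by this producer
    private long cachedGating = -1;  // last observed consumer position

    SingleProducerClaim(int ringSize) { this.ringSize = ringSize; }

    // consumerSequence would normally be read from the consumer's
    // padded gating Sequence, not passed in as a parameter.
    boolean tryClaim(long consumerSequence) {
        long next = nextSequence + 1;
        long wrapPoint = next - ringSize;   // this slot must be consumed first
        if (wrapPoint > cachedGating) {
            cachedGating = consumerSequence; // refresh from the volatile gate
            if (wrapPoint > cachedGating)
                return false;                // ring is full; caller must wait
        }
        nextSequence = next;
        return true;
    }

    long claimed() { return nextSequence; }
}
```

The cached copy means the producer only touches the consumer’s cache line when it actually has to, which keeps the fast path read-free with respect to the gate.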
Measured Impact
| Configuration | Throughput (M ops/s) | p99 latency |
|---|---|---|
| No padding anywhere | 18 | 890 µs |
| Sequence padding only | 51 | 310 µs |
| Full Disruptor layout | 93 | 134 µs |
| Full layout + CPU affinity | 107 | 89 µs |
These numbers are from our internal benchmark (single producer, single consumer, Intel Xeon E5-2687W, JDK 7u51). Your numbers will vary with CPU generation and topology, but the relative improvement from padding is consistent.
The Lesson
The Disruptor doesn’t do anything the JVM can’t do. `volatile long` fields and array reads are ordinary Java. The performance comes entirely from where things are in memory relative to each other, and from ensuring that concurrent actors never share a cache line they don’t need to share.
This is mechanical sympathy applied to data structure design: understand what the hardware cares about (cache line isolation), then design your memory layout to provide it. The padding looks wasteful — 56 bytes of dead fields for every 8-byte counter — but the performance multiple it buys is not available by any other means.