When a thread is waiting for work — a new event, a lock to be released, a signal — it has two options. It can block (tell the OS “wake me up when there’s work”) or busy-spin (loop checking a condition without ever yielding the CPU). Both are correct; they have very different performance profiles.
What Happens When a Thread Blocks
Blocking in Java happens via Object.wait(), LockSupport.park(), BlockingQueue.take(), or any blocking I/O call. The OS kernel:
- Takes the thread off the run queue
- Context-switches to another thread (cost: ~1–5µs, varies by CPU and kernel)
- When the wake condition is met, puts the thread back on the run queue
- Schedules it to run (additional delay: up to one scheduler tick, typically 1ms on Linux with HZ=1000)
The critical problem: step 4. The thread doesn’t run the instant it’s unparked — it enters the run queue and waits for its turn. On a loaded system, this delay is unpredictable and can be 100µs or more.
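The park/unpark path can be observed directly with LockSupport. A minimal sketch (the class and field names are illustrative, not from any library):

```java
import java.util.concurrent.locks.LockSupport;

// Measure the gap between unpark() and the parked thread actually running.
// On an idle machine this is small; on a loaded one it exposes the
// run-queue and scheduler delays described above.
public class ParkLatency {
    static volatile long unparkedAt;

    public static void main(String[] args) throws InterruptedException {
        Thread waiter = new Thread(() -> {
            LockSupport.park();                       // block until unparked
            long wokeAt = System.nanoTime();
            System.out.printf("wake-up latency: %d ns%n", wokeAt - unparkedAt);
        });
        waiter.start();
        Thread.sleep(50);                             // let the waiter park
        unparkedAt = System.nanoTime();
        LockSupport.unpark(waiter);                   // thread re-enters run queue
        waiter.join();                                // ...and eventually runs
    }
}
```

Note that even if `unpark` races ahead of `park`, the permit is retained and `park` returns immediately, so the sketch never deadlocks.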
Timeline (blocking wake-up):

Event occurs        Thread unparked       Scheduler picks thread      Thread runs
     │                     │                         │                     │
─────┼─────────────────────┼─────────────────────────┼─────────────────────┼─────▶ time
     │                     │                         │                     │
     └────event latency────┴────run-queue latency────┴───scheduler delay───┘
           (1–5µs)              (0–1000µs)                (0–100µs)
The total wake-up latency is non-deterministic and heavy-tailed. For sub-millisecond SLAs, this is often unacceptable.
What Happens When a Thread Busy-Spins
A busy-spinning thread never yields the CPU. It continuously polls a condition in a tight loop, typically calling Thread.onSpinWait() on each iteration.
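A minimal sketch of such a polling loop (the flag and method names are illustrative):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Busy-spin wait: poll a flag until a producer sets it. The waiter never
// parks, so there is no OS wake-up path — only the loop itself.
public class BusySpin {
    private static final AtomicBoolean ready = new AtomicBoolean(false);

    static void spinUntilReady() {
        while (!ready.get()) {
            Thread.onSpinWait();    // hint to the CPU: this is a spin-wait loop
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread waiter = new Thread(BusySpin::spinUntilReady);
        waiter.start();
        Thread.sleep(10);           // simulate time before the event
        ready.set(true);            // "event occurs" — waiter proceeds at once
        waiter.join();
    }
}
```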
Thread.onSpinWait() emits the x86 PAUSE instruction, which:
- Signals to the CPU that this is a spin-wait loop (allows hyper-thread to get more cycles)
- Prevents memory order violations that can cause pipeline flushes in tight CAS loops
- Reduces power consumption slightly vs a pure spin
The wake-up latency: near zero. The thread is already running; as soon as the condition is true, it proceeds. No OS involvement, no scheduler delay.
The cost: the CPU core is fully consumed by this thread even while waiting. You are paying for a core just to sit in a loop.
The Comparison
| Strategy | Wake-up latency | CPU cost while waiting | Jitter |
|---|---|---|---|
| BlockingWaitStrategy | 50µs–1ms+ | ~0% | High |
| SleepingWaitStrategy | 1µs–100µs | Low (yields intermittently) | Medium |
| YieldingWaitStrategy | 0–10µs | Medium (calls Thread.yield()) | Low–medium |
| BusySpinWaitStrategy | <1µs | 100% of one core | Very low |
The Disruptor exposes these as pluggable wait strategies precisely because the right choice depends on your latency/resource tradeoff.
The Thread.yield() Middle Ground
Thread.yield() hints to the scheduler that the current thread is willing to give up the CPU. If no other runnable thread exists, it returns almost immediately. If other threads are waiting, the current thread may not run again until the next scheduling decision — possibly a full scheduler tick later.
YieldingWaitStrategy in the Disruptor: spin for 100 iterations, then yield. The spin covers the sub-microsecond case; the yield lets other threads run when there’s nothing to do.
This works well when:
- You have multiple Disruptor event processors on the same machine
- The processors’ loads are complementary (at any given moment some are busy and some are idle, so yielded cycles go to useful work)
- You can tolerate occasional 10µs latency spikes
It works poorly when:
- Other processes on the same machine are competing for CPU
- The latency spikes from yield are unacceptable
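The spin-then-yield pattern can be sketched as follows (SPIN_TRIES mirrors the Disruptor’s default of 100, but the surrounding flag and method names are illustrative — the real YieldingWaitStrategy waits on ring-buffer sequences, not a boolean):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Spin briefly to cover the sub-microsecond case, then Thread.yield()
// on every further iteration so other runnable threads get the core.
public class SpinThenYield {
    private static final int SPIN_TRIES = 100;
    private static final AtomicBoolean ready = new AtomicBoolean(false);

    static void waitForReady() {
        int counter = SPIN_TRIES;
        while (!ready.get()) {
            if (counter > 0) {
                --counter;          // fast path: pure spin
            } else {
                Thread.yield();     // slow path: let other threads run
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread waiter = new Thread(SpinThenYield::waitForReady);
        waiter.start();
        Thread.sleep(5);
        ready.set(true);
        waiter.join();
    }
}
```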
Thread Affinity and Isolation
Busy-spinning’s value is maximised when combined with:
CPU affinity: pin the spinning thread to a specific core with taskset (Linux) or Thread Affinity library (Java). The thread is never migrated; the data it accesses is always warm in that core’s cache.
CPU isolation: remove the core from the OS scheduler with isolcpus=N in the kernel boot parameters. The OS scheduler will not run other processes on core N; only explicitly pinned threads run there.
# /etc/default/grub — add isolcpus to GRUB_CMDLINE_LINUX
GRUB_CMDLINE_LINUX="isolcpus=2,3,4,5"
# Regenerate the grub config and reboot (Debian/Ubuntu: update-grub)
# Verify after reboot — the kernel reports the isolated cores:
$ cat /sys/devices/system/cpu/isolated
2-5
With isolated cores, a busy-spinning thread on core 2 will not be preempted by other user processes or by kernel threads the scheduler places there. (isolcpus alone does not eliminate the periodic timer interrupt; adding nohz_full=2-5 removes most of that too.) The latency profile becomes extremely consistent.
We measured the effect: with no isolation, our aggregator thread’s p999 latency was 420µs. With CPU isolation and affinity on the same hardware with the same code, it dropped to 78µs. The difference was entirely OS scheduler interference.
When to Block, When to Spin
A decision guide:
Latency target < 100µs?
    Yes → busy spin + CPU affinity
    No  → blocking or yielding

Spare CPU cores available for dedicated threads?
    Yes → busy spin is viable
    No  → blocking (can't afford to waste cores)

Predictable latency more important than average latency?
    Yes → busy spin (eliminates scheduler jitter)
    No  → blocking or yielding (better average, worse tail)

Is this a short-lived wait (event expected in < 1µs)?
    Yes → spin (context switch would cost more than the wait)
    No  → depends on target latency
For our feed handler and aggregator threads: busy-spin on isolated cores. For client distribution threads: yielding strategy (there were too many to isolate individually, and distribution latency tolerance was higher). For operations and monitoring threads: blocking (they were never on the critical path).
The mix is intentional. Not every thread in the system needs the same latency properties. Identifying which threads are on the critical path and applying the right wait strategy per thread is better than applying the lowest-latency strategy uniformly and wasting hardware resources.