Locks work. synchronized in Java is correct, well-understood, and wrong for our use case. A contended lock causes a thread to block — the OS parks it, context-switches to something else, and eventually context-switches back. Each of those transitions costs microseconds. When your SLA is sub-millisecond and your hot path is called 200,000 times per second, locks are not an option.
Lock-free programming replaces locks with atomic CPU instructions. The CPU handles the synchronisation at the hardware level, without OS involvement.
The Foundation: Compare-And-Swap
Every lock-free algorithm is built on compare-and-swap (CAS). The CPU instruction CMPXCHG (on x86) does three things atomically:
1. Read the current value at a memory location
2. Compare it to an expected value
3. If equal, write a new value; if not, do nothing
```
CAS(address, expected, new_value):
    if *address == expected:
        *address = new_value
        return true   // success — we "won"
    else:
        return false  // failure — someone else changed it first
```
The atomicity guarantee: no other thread can observe the memory location in a state between steps 2 and 3. The comparison and write are indivisible.
In Java, java.util.concurrent.atomic.* exposes CAS through compareAndSet:
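A minimal sketch of that retry loop, here claiming the next value from an AtomicLong-backed sequence (the class and method names are illustrative, not from any particular library):

```java
import java.util.concurrent.atomic.AtomicLong;

public class SequenceClaimer {
    private final AtomicLong sequence = new AtomicLong(-1);

    // Claim the next sequence number: read, compute, attempt CAS, retry on failure.
    public long next() {
        long current, next;
        do {
            current = sequence.get();   // read the current value
            next = current + 1;         // compute the new value locally
        } while (!sequence.compareAndSet(current, next)); // retry if another thread won
        return next;
    }
}
```

Each failed compareAndSet means another thread successfully claimed a number in the meantime, so the loop simply re-reads and tries again.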
This loop is the fundamental lock-free pattern: read, compute, attempt CAS, retry on failure. Because the retry is cheap (just a loop iteration), and because CAS failure means progress was made by another thread, the algorithm is lock-free: the system as a whole always makes progress, even if individual threads retry.
Compare this to a mutex: if one thread holds the lock, all others wait, and no one makes progress until the holder releases it. Lock-free algorithms rule out that failure mode: a stalled or preempted thread can never block the rest of the system.
The ABA Problem
The subtle correctness trap in lock-free programming.
Suppose thread A reads value 42 from a memory location and is about to CAS it to 43. Before A does the CAS:
- Thread B reads 42
- Thread B writes 99
- Thread C reads 99
- Thread C writes 42 (back to the original value)
Now thread A’s CAS succeeds — it sees 42 and replaces it with 43. But the value at that location went 42 → 99 → 42; A’s assumption (that nothing changed) is wrong. Depending on what the value represents, this can be a serious bug.
In practice for counters and sequences: ABA usually doesn’t matter — you’re replacing a number with another number, and the semantic effect is the same regardless of intermediate values.
In practice for pointers in lock-free linked structures: ABA is dangerous. If the pointer at position X changed from pointing to node A to pointing to some other node and then back to pointing to node A (possibly because A was freed and reallocated at the same address), your CAS succeeds on a stale pointer.
Solutions:
- Versioned CAS / tagged pointers: attach a version counter to the value; increment on every change. CAS on (value, version). ABA becomes (42, version1) → (99, version2) → (42, version3); the version mismatch prevents the false success.
- Hazard pointers: mark pointers before dereferencing; garbage collection of nodes waits until no thread has a hazard pointer on it.
- Avoid lock-free linked structures: they’re hard to get right. Use lock-free arrays (ring buffers, fixed-size queues) instead.
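The versioned-CAS solution ships in the JDK as java.util.concurrent.atomic.AtomicStampedReference, which pairs a reference with an integer stamp and requires a CAS to match both. A minimal sketch of an ABA attempt being rejected:

```java
import java.util.concurrent.atomic.AtomicStampedReference;

public class AbaDemo {
    // Returns whether the stale CAS succeeded (it should not).
    public static boolean demo() {
        AtomicStampedReference<String> ref = new AtomicStampedReference<>("A", 0);

        // Thread 1 reads (value, stamp) = ("A", 0) and pauses...
        int[] stamp = new int[1];
        String seen = ref.get(stamp);   // reads value and stamp together

        // ...meanwhile other threads take it A -> B -> A, bumping the stamp each time.
        ref.compareAndSet("A", "B", 0, 1);
        ref.compareAndSet("B", "A", 1, 2);

        // Thread 1's CAS now fails: the value matches, but the stamp does not.
        return ref.compareAndSet(seen, "C", stamp[0], stamp[0] + 1);
    }
}
```

Without the stamp, the final CAS would succeed on the recycled value; with it, the version mismatch exposes the intervening writes.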
Java’s Atomic Classes
The java.util.concurrent.atomic package provides lock-free building blocks:
| Class | Use case | Key operations |
|---|---|---|
| AtomicLong | Counters, sequence numbers | get, set, incrementAndGet, compareAndSet |
| AtomicReference&lt;T&gt; | Single pointer/reference updates | get, set, compareAndSet |
| AtomicReferenceArray&lt;T&gt; | Arrays with atomic element updates | get(i), compareAndSet(i, e, u) |
| LongAdder | High-contention counters (striped) | increment, sum |
| AtomicLongFieldUpdater | Lock-free field updates without boxing | compareAndSet(obj, e, u) |
LongAdder is important: under high contention, a single AtomicLong becomes a bottleneck because every thread competes on one CAS. LongAdder stripes the counter across multiple cells (grown on contention, roughly one per contending thread), accumulates increments locally, and aggregates them on sum(). For metrics counters updated from many threads, LongAdder is significantly faster than AtomicLong.
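A minimal sketch of a striped metrics counter (the class is illustrative):

```java
import java.util.concurrent.atomic.LongAdder;

public class RequestMetrics {
    // Striped counter: increments land in per-thread cells, sum() aggregates them.
    private final LongAdder requests = new LongAdder();

    public void onRequest() {
        requests.increment();   // cheap even when many threads call this at once
    }

    public long total() {
        return requests.sum();  // aggregates all cells; not an atomic snapshot under writes
    }
}
```

Note the trade-off: sum() is not a point-in-time snapshot if increments are in flight, which is fine for metrics but rules LongAdder out where an exact read-modify-write on the total is needed.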
AtomicLongFieldUpdater avoids the object overhead of wrapping a long in AtomicLong. If you have a class with a volatile long sequence field and you want CAS on it without allocating an AtomicLong wrapper:
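A sketch of the pattern, assuming a Sequence class with a volatile long field (names illustrative):

```java
import java.util.concurrent.atomic.AtomicLongFieldUpdater;

public class Sequence {
    // The long stays inline in the object layout; no AtomicLong wrapper is allocated.
    private volatile long value = -1;

    // One static updater shared by all instances; resolves the named field once.
    private static final AtomicLongFieldUpdater<Sequence> UPDATER =
            AtomicLongFieldUpdater.newUpdater(Sequence.class, "value");

    public long get() {
        return value;
    }

    public boolean compareAndSet(long expected, long next) {
        return UPDATER.compareAndSet(this, expected, next);
    }
}
```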
This is exactly what the Disruptor does for its sequence fields — minimising object count and keeping the fields inline in the class layout.
Memory Ordering: volatile and Happens-Before
CAS alone isn’t sufficient — you need memory visibility guarantees. The Java memory model (JMM) says:
- A volatile write happens-before any subsequent volatile read of the same variable
- AtomicLong.compareAndSet and similar operations have the same memory semantics as a volatile read + write
This means: if thread A writes data and then does a CAS on sequence, and thread B reads sequence via CAS and then reads data, B is guaranteed to see A’s write to data. The sequence acts as a memory barrier.
This is why the Disruptor’s producer writes the event data before publishing the sequence, and consumers read the sequence before reading the event data. The ordering guarantees flow from the volatile semantics on the sequence fields.
```
Producer:                                  Consumer:
event.price = 1.0842;                      long seq = sequence.get();  // volatile read
event.qty = 1_000_000;                     // happens-after: price and qty visible
sequence.set(nextSeq);  // volatile write
```
Without the volatile write on sequence, the consumer might read a stale price or qty even after seeing the updated sequence. The happens-before relationship is what makes the whole construct correct, not just the CAS instruction.
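To make the ordering concrete, a sketch (field names illustrative) where plain, non-volatile data fields are published through a volatile sequence write:

```java
import java.util.concurrent.atomic.AtomicLong;

public class Publication {
    // Plain fields: their visibility comes entirely from the sequence barrier.
    double price;
    long qty;
    final AtomicLong sequence = new AtomicLong(-1);

    void publish(double p, long q, long seq) {
        price = p;             // 1. write the data first...
        qty = q;
        sequence.set(seq);     // 2. ...then the volatile write publishes it
    }

    // Spin until the expected sequence is visible, then read the data.
    double awaitPrice(long seq) {
        while (sequence.get() < seq) {   // volatile read on each iteration
            Thread.onSpinWait();
        }
        return price;  // happens-before: guaranteed to see publish()'s write to price
    }
}
```

Reversing the order inside publish() — publishing the sequence before writing the fields — would reintroduce exactly the stale-read bug described above.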
When Not to Use Lock-Free
Lock-free code is harder to write, harder to test, and harder to reason about than lock-based code. The cases where it’s worth it:
| Situation | Recommendation |
|---|---|
| High-contention counter | LongAdder or AtomicLong |
| Single-producer/single-consumer queue | Disruptor or LinkedTransferQueue |
| Hot path, measured lock contention | Consider lock-free |
| Cold path, no contention measured | synchronized — simpler, equally fast |
| Complex multi-step operation | Lock (lock-free multi-step is very hard) |
| Correctness uncertain | Lock — a correct slow program beats a fast incorrect one |
The rule: measure first. Lock contention shows up in thread dumps (threads parked on a lock) and in profilers (time spent in park/unpark). If you don’t have measured contention, don’t reach for lock-free code.