Locks work. synchronized in Java is correct, well-understood, and wrong for our use case. A lock that’s contested causes a thread to block — the OS parks it, context-switches to something else, and eventually context-switches back. Each of those transitions costs microseconds. When your SLA is sub-millisecond and your hot path is called 200,000 times per second, locks are not an option.

Lock-free programming replaces locks with atomic CPU instructions. The CPU handles the synchronisation at the hardware level, without OS involvement.

The Foundation: Compare-And-Swap

Every lock-free algorithm is built on compare-and-swap (CAS). The CPU instruction CMPXCHG (on x86) does three things atomically:

  1. Read the current value at a memory location
  2. Compare it to an expected value
  3. If equal, write a new value; if not, do nothing

CAS(address, expected, new_value):
    if *address == expected:
        *address = new_value
        return true   // success — we "won"
    else:
        return false  // failure — someone else changed it first

The atomicity guarantee: no other thread can observe the memory location in a state between steps 2 and 3. The comparison and write are indivisible.

In Java, java.util.concurrent.atomic.* exposes CAS through compareAndSet:

AtomicLong counter = new AtomicLong(0);

// Lock-free increment
long current, next;
do {
    current = counter.get();
    next    = current + 1;
} while (!counter.compareAndSet(current, next));
// If CAS fails, another thread incremented first. We re-read and retry.

This loop is the fundamental lock-free pattern: read, compute, attempt CAS, retry on failure. Because the retry is cheap (just a loop iteration), and because CAS failure means progress was made by another thread, the algorithm is lock-free: the system as a whole always makes progress, even if individual threads retry.

Compare to a mutex: if one thread holds the lock, all others wait; no one makes progress until the holder releases. Lock-free algorithms don't have this failure mode: a stalled or descheduled thread cannot prevent the others from making progress.
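The same read-compute-CAS-retry loop works on references, not just counters. A minimal sketch of a Treiber stack over AtomicReference (the class is illustrative; note that in Java, garbage collection rules out the pointer-reuse hazard discussed under the ABA problem below, because a node another thread still references cannot be reclaimed):

```java
import java.util.concurrent.atomic.AtomicReference;

class TreiberStack<T> {
    private static final class Node<T> {
        final T value;
        Node<T> next;
        Node(T value) { this.value = value; }
    }

    private final AtomicReference<Node<T>> head = new AtomicReference<>();

    public void push(T value) {
        Node<T> node = new Node<>(value);
        Node<T> current;
        do {
            current = head.get();   // read
            node.next = current;    // compute
        } while (!head.compareAndSet(current, node));  // attempt CAS; retry on failure
    }

    public T pop() {
        Node<T> current;
        do {
            current = head.get();   // read
            if (current == null) {
                return null;        // empty stack: nothing to CAS
            }
        } while (!head.compareAndSet(current, current.next));  // unlink the head
        return current.value;
    }
}
```

If the CAS fails, another thread pushed or popped in the meantime; the loop simply re-reads the new head and tries again.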

The ABA Problem

The subtle correctness trap in lock-free programming.

Suppose thread A reads value 42 from a memory location and is about to CAS it to 43. Before A does the CAS:

  1. Thread B reads 42
  2. Thread B writes 99
  3. Thread C reads 99
  4. Thread C writes 42 (back to the original value)

Now thread A’s CAS succeeds — it sees 42 and replaces it with 43. But the value at that location went 42 → 99 → 42; A’s assumption (that nothing changed) is wrong. Depending on what the value represents, this can be a serious bug.

In practice for counters and sequences: ABA usually doesn’t matter — you’re replacing a number with another number, and the semantic effect is the same regardless of intermediate values.

In practice for pointers in lock-free linked structures: ABA is dangerous. If the pointer at position X changed from pointing to node A to pointing to some other node and then back to pointing to node A (possibly because A was freed and reallocated at the same address), your CAS succeeds on a stale pointer.

Solutions:

  • Versioned CAS / tagged pointers: attach a version counter to the value; increment on every change. CAS on (value, version). ABA becomes (42, version1) → (99, version2) → (42, version3); the version mismatch prevents the false success.
  • Hazard pointers: mark pointers before dereferencing; garbage collection of nodes waits until no thread has a hazard pointer on it.
  • Avoid lock-free linked structures: they’re hard to get right. Use lock-free arrays (ring buffers, fixed-size queues) instead.
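The first solution ships with the JDK: java.util.concurrent.atomic.AtomicStampedReference pairs a reference with an int version stamp and CASes on both. A small demo of the 42 → 99 → 42 scenario above (class name and stamp values are illustrative):

```java
import java.util.concurrent.atomic.AtomicStampedReference;

public class AbaDemo {
    public static void main(String[] args) {
        // Value 42 with stamp (version) 0. Small boxed Integers are cached,
        // so the literal 42 below is reference-equal to the stored one.
        AtomicStampedReference<Integer> ref = new AtomicStampedReference<>(42, 0);

        // Thread A's snapshot: value 42, stamp 0
        int[] stampHolder = new int[1];
        Integer seen = ref.get(stampHolder);
        int seenStamp = stampHolder[0];

        // Meanwhile, other threads take the value 42 -> 99 -> 42,
        // bumping the stamp on every change
        ref.compareAndSet(42, 99, 0, 1);
        ref.compareAndSet(99, 42, 1, 2);

        // A CAS on the value alone would succeed here;
        // the stamp check (0 vs 2) makes it fail
        boolean succeeded = ref.compareAndSet(seen, 43, seenStamp, seenStamp + 1);
        System.out.println(succeeded);  // prints "false": ABA detected
    }
}
```

The stamp is an int, so it can wrap after 2^32 changes; for most uses that window is irrelevant, but it is worth knowing.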

Java’s Atomic Classes

The java.util.concurrent.atomic package provides lock-free building blocks:

Class                     Use case                                Key operations
AtomicLong                Counters, sequence numbers              get, set, incrementAndGet, compareAndSet
AtomicReference<T>        Single pointer/reference updates        get, set, compareAndSet
AtomicReferenceArray<T>   Arrays with atomic element updates      get(i), compareAndSet(i, e, u)
LongAdder                 High-contention counters (striped)      increment, sum
AtomicLongFieldUpdater    Lock-free field update without boxing   compareAndSet(obj, e, u)

LongAdder is important: under high contention, a single AtomicLong becomes a bottleneck because many threads are competing on one CAS. LongAdder maintains a cell per CPU, accumulates increments locally, and aggregates on sum(). For metrics counters updated from many threads, LongAdder is significantly faster than AtomicLong.
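A rough sketch of LongAdder under concurrent writers (thread and iteration counts are arbitrary): each increment lands on a striped cell rather than one contended word, yet sum() still yields an exact total once all writers finish.

```java
import java.util.concurrent.atomic.LongAdder;

public class AdderDemo {
    public static void main(String[] args) throws InterruptedException {
        LongAdder hits = new LongAdder();
        int threads = 8, perThread = 100_000;

        Thread[] pool = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            pool[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    hits.increment();  // CAS on a per-thread striped cell
                }
            });
            pool[t].start();
        }
        for (Thread t : pool) t.join();

        // sum() aggregates the cells; exact because all writers have joined
        System.out.println(hits.sum());  // prints 800000
    }
}
```

The trade-off: sum() is not atomic with respect to concurrent increments, so LongAdder suits statistics and metrics, not sequence numbers that must be read-modify-written atomically.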

AtomicLongFieldUpdater avoids the object overhead of wrapping a long in AtomicLong. If you have a class with a volatile long sequence field and you want CAS on it without allocating an AtomicLong wrapper:

class Sequencer {
    private volatile long sequence = -1;

    private static final AtomicLongFieldUpdater<Sequencer> UPDATER =
        AtomicLongFieldUpdater.newUpdater(Sequencer.class, "sequence");

    public boolean claim(long expected, long next) {
        return UPDATER.compareAndSet(this, expected, next);
    }
}

This is exactly what the Disruptor does for its sequence fields — minimising object count and keeping the fields inline in the class layout.

Memory Ordering: volatile and Happens-Before

CAS alone isn’t sufficient — you need memory visibility guarantees. The Java memory model (JMM) says:

  • A volatile write happens-before any subsequent volatile read of the same variable
  • AtomicLong.compareAndSet and similar operations have the same memory semantics as volatile read + write

This means: if thread A writes data and then does a CAS on sequence, and thread B reads sequence via CAS and then reads data, B is guaranteed to see A’s write to data. The sequence acts as a memory barrier.

This is why the Disruptor’s producer writes the event data before publishing the sequence, and consumers read the sequence before reading the event data. The ordering guarantees flow from the volatile semantics on the sequence fields.

Producer:

    event.price = 1.0842;
    event.qty   = 1_000_000;
    sequence.set(nextSeq);       // volatile write

Consumer:

    long seq = sequence.get();   // volatile read
    // happens-after: price and qty visible

Without the volatile write on sequence, the consumer might read a stale price or qty even after seeing the updated sequence. The happens-before relationship is what makes the whole construct correct, not just the CAS instruction.
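The protocol above can be sketched as a single-slot channel. TradeEvent, OneSlotChannel, and the single-producer assumption are illustrative, not Disruptor API:

```java
import java.util.concurrent.atomic.AtomicLong;

class TradeEvent {
    double price;  // plain fields: their visibility piggybacks on the sequence
    long   qty;
}

class OneSlotChannel {
    private final TradeEvent event    = new TradeEvent();
    private final AtomicLong sequence = new AtomicLong(-1);

    // Producer (single-threaded by assumption): write the data first, publish last
    void publish(long seq, double price, long qty) {
        event.price = price;
        event.qty   = qty;
        sequence.set(seq);  // volatile write: everything above becomes visible
    }

    // Consumer: volatile read of the sequence first, data second
    TradeEvent poll(long expectedSeq) {
        if (sequence.get() < expectedSeq) {
            return null;  // not published yet
        }
        return event;  // happens-before: reads of price and qty see the producer's writes
    }
}

public class ChannelDemo {
    public static void main(String[] args) {
        OneSlotChannel ch = new OneSlotChannel();
        ch.publish(0, 1.0842, 1_000_000);
        TradeEvent e = ch.poll(0);
        System.out.println(e.price + " " + e.qty);  // prints "1.0842 1000000"
    }
}
```

Reversing either side (publishing the sequence before writing the data, or reading the data before the sequence) breaks the happens-before chain and reintroduces the stale-read bug described above.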

When Not to Use Lock-Free

Lock-free code is harder to write, harder to test, and harder to reason about than lock-based code. The cases where it’s worth it:

Situation                               Recommendation
High-contention counter                 LongAdder or AtomicLong
Single-producer/single-consumer queue   Disruptor or LinkedTransferQueue
Hot path, measured lock contention      Consider lock-free
Cold path, no contention measured       synchronized — simpler, equally fast
Complex multi-step operation            Lock (lock-free multi-step is very hard)
Correctness uncertain                   Lock — a correct slow program beats a fast incorrect one

The rule: measure first. Lock contention shows up in thread dumps (threads parked on a lock) and in profilers (time spent in park/unpark). If you don’t have measured contention, don’t reach for lock-free code.