Martin Thompson coined the term “mechanical sympathy” — the idea that to write fast software you need to understand the machine it runs on. Not at the assembly level necessarily, but well enough to reason about what the CPU, memory hierarchy, and OS are actually doing with your code.

This post is what that looks like in practice, writing Java for a system where microseconds matter.

The Memory Hierarchy Is Not Flat

The single most important hardware fact for low-latency Java:

Level       Latency        Size
L1 cache    ~1 ns          32–64 KB
L2 cache    ~4 ns          256 KB–1 MB
L3 cache    ~10–40 ns      4–32 MB
DRAM        ~60–100 ns     GBs

A cache miss to DRAM is 60–100x slower than an L1 hit. At 50,000 messages/second, if each message processing path causes one unexpected DRAM access, you’ve spent more time on memory fetches than on computation.

Java makes this easy to forget because you rarely think about memory layout. But the JVM does have a layout, and you can reason about it.
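You can see the hierarchy directly from Java. A minimal sketch (class and method names are mine, not from any library): traverse the same array sequentially and then in large strides. Both loops touch every element exactly once; only the access pattern differs, yet the strided walk is typically several times slower because each access lands on a fresh cache line and defeats the hardware prefetcher. This is an illustration, not a rigorous benchmark; use JMH for real measurements.

```java
// Illustrative sketch: same work, different memory access patterns.
public class CacheWalk {
    // Visit every element of the array exactly once, in stride-sized hops.
    static long walk(long[] a, int stride) {
        long sum = 0;
        for (int start = 0; start < stride; start++) {
            for (int i = start; i < a.length; i += stride) {
                sum += a[i];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] a = new long[1 << 22];           // 4M longs = 32 MB, bigger than most L3 caches
        for (int i = 0; i < a.length; i++) a[i] = i;

        long t0 = System.nanoTime();
        long s1 = walk(a, 1);                   // sequential: prefetcher-friendly
        long t1 = System.nanoTime();
        long s2 = walk(a, 1024);                // strided: roughly one cache line per access
        long t2 = System.nanoTime();

        System.out.printf("sequential %d ms, strided %d ms (sums %d / %d)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, s1, s2);
    }
}
```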

False Sharing

One of the more counterintuitive performance problems I encountered early on.

A CPU cache operates at cache-line granularity — typically 64 bytes. If two threads are writing to different variables that happen to live in the same cache line, they’ll invalidate each other’s cache entries on every write. This is false sharing, and it can bring a multi-threaded system to its knees.

In Java, fields in an object are laid out contiguously. A long counter and a boolean flag in the same object are likely sharing a cache line.

The fix: padding. Before @Contended existed (it was introduced in JDK 8), you’d add 7 long fields between hot fields to push them onto separate cache lines. Ugly, but effective.

// Before @Contended
public class PaddedLong {
    public volatile long value;
    // 7 longs of padding = 56 bytes, pushing next field to a new cache line
    public long p1, p2, p3, p4, p5, p6, p7;
}

With @Contended (requires -XX:-RestrictContended):

// JDK 8; the annotation moved to jdk.internal.vm.annotation.Contended in JDK 9+
public class ContendedLong {
    @sun.misc.Contended
    public volatile long value;
}

The Disruptor uses this extensively — it’s one of the reasons it’s so fast.
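To see the effect for yourself, here is a minimal sketch (all names are mine): two threads each hammering their own counter, first with the counters adjacent in one object, then with 56 bytes of padding between them. On multi-core hardware the padded version usually wins by a large factor; the exact ratio varies by CPU, and this is a demo rather than a rigorous benchmark.

```java
// Illustrative demo of false sharing: same work, different field layout.
public class FalseSharingDemo {
    static class Unpadded {                  // a and b almost certainly share a cache line
        volatile long a, b;
    }
    static class Padded {                    // 7 longs = 56 bytes between a and b
        volatile long a;
        long p1, p2, p3, p4, p5, p6, p7;
        volatile long b;
    }

    // Run two writers concurrently and return elapsed wall time in ms.
    static long race(Runnable w1, Runnable w2) {
        Thread t1 = new Thread(w1), t2 = new Thread(w2);
        long t0 = System.nanoTime();
        t1.start(); t2.start();
        try {
            t1.join(); t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return (System.nanoTime() - t0) / 1_000_000;
    }

    public static void main(String[] args) {
        final int ITERS = 20_000_000;
        Unpadded u = new Unpadded();
        Padded p = new Padded();
        long shared = race(() -> { for (int i = 0; i < ITERS; i++) u.a++; },
                           () -> { for (int i = 0; i < ITERS; i++) u.b++; });
        long padded = race(() -> { for (int i = 0; i < ITERS; i++) p.a++; },
                           () -> { for (int i = 0; i < ITERS; i++) p.b++; });
        System.out.printf("shared line: %d ms, padded: %d ms%n", shared, padded);
    }
}
```

Each field has a single writer, so the counts are exact; the only difference between the two runs is cache-line contention.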

Branch Prediction and Predictable Code Paths

Modern CPUs speculatively execute ahead of branches. If the branch predictor gets it wrong, the pipeline has to flush and retry — a penalty of 10–20 cycles.

In a tight loop processing market data, unpredictable branches are expensive. The practical implication:

  • Keep hot paths branch-free where possible
  • If you must branch, make the common case the one the predictor will get right (i.e., the common case should be consistently true or consistently false — not random)
  • Avoid polymorphic dispatch in hot paths; the JIT will inline monomorphic call sites but not megamorphic ones
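One concrete way to remove a data-dependent branch is to turn the comparison into arithmetic. A sketch (my own example, assuming values stay far enough from Integer.MIN_VALUE/MAX_VALUE that the subtraction cannot overflow): summing values above a threshold, first with a branch, then with a sign-extension mask. On random input the branchy version mispredicts roughly half the time; the masked version trades the conditional jump for a little extra arithmetic.

```java
// Branchy vs. branchless conditional accumulation.
public class BranchlessSum {
    static long branchy(int[] data, int threshold) {
        long sum = 0;
        for (int v : data) {
            if (v >= threshold) sum += v;    // unpredictable on random data
        }
        return sum;
    }

    static long branchless(int[] data, int threshold) {
        long sum = 0;
        for (int v : data) {
            // (threshold - 1 - v) is negative exactly when v >= threshold,
            // so the arithmetic shift yields -1 (all bits set) or 0.
            int mask = (threshold - 1 - v) >> 31;
            sum += v & mask;                 // adds v or 0, no conditional jump
        }
        return sum;
    }
}
```

Note that the JIT can sometimes do this for you (compiling a simple ternary to a conditional move), so measure before and after.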

What This Means for Object Allocation

The JVM allocates objects in eden space sequentially (bump-pointer allocation). Allocation itself is fast. The problem is GC.

When you allocate objects in a hot path:

  1. They accumulate in eden
  2. Minor GC runs and either promotes them or collects them
  3. Each minor GC is a stop-the-world pause in most collectors, and every pause eats into your latency budget

For latency-sensitive code, the rule is: zero allocation on the hot path. This means object pooling, pre-allocation, or redesigning the data flow to avoid temporaries.
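Pre-allocation in its simplest form looks like this: a fixed ring of mutable message objects, allocated once at startup, with the hot path claiming and overwriting slots instead of constructing new objects. This is a single-threaded sketch with names of my own invention (it is not the Disruptor's API, though the shape is similar in spirit):

```java
// Sketch of zero-allocation message handling via a pre-allocated ring.
public class MessageRing {
    public static final class Message {      // mutable, reused in place
        public long price;
        public long quantity;
    }

    private final Message[] ring;
    private final int mask;                  // capacity must be a power of two
    private long sequence;

    public MessageRing(int capacityPow2) {
        ring = new Message[capacityPow2];
        mask = capacityPow2 - 1;
        for (int i = 0; i < capacityPow2; i++) {
            ring[i] = new Message();         // all allocation happens here, up front
        }
    }

    // Claim the next slot; the caller overwrites its fields.
    // The hot path allocates nothing.
    public Message claim() {
        return ring[(int) (sequence++ & mask)];
    }
}
```

The trade-off is that a slot's contents are only valid until the ring wraps, so consumers must finish with a message before the producer laps them; real implementations (the Disruptor among them) track consumer progress to enforce this.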

This is un-idiomatic Java. It’s also the only way to hit sub-millisecond tail latencies reliably.

The Thread-to-Core Binding Problem

One thing the JVM doesn’t give you by default: control over which CPU core a thread runs on. Thread scheduling is the OS’s job, and the OS will move threads around based on its own priorities.

For ultra-low latency, you want a critical thread pinned to a dedicated CPU core — no migration, no sharing with other threads. On Linux this is taskset or the sched_setaffinity syscall. From Java, you need JNI or a library like Java-Thread-Affinity (from OpenHFT).

Combined with isolating that CPU from the OS scheduler (isolcpus kernel parameter), you can get remarkably consistent latencies. We shaved our 99.9th percentile latency by ~30% just from CPU affinity changes, with no code changes at all.

The Takeaway

None of this is Java-specific. The hardware realities apply to any language. Java just adds extra layers (JVM, GC, JIT) that can obscure what’s happening. The goal isn’t to fight those layers — the JIT is genuinely impressive — but to understand them well enough to work with them.

Reading list that shaped my thinking in this era:

  • Martin Thompson’s Mechanical Sympathy blog
  • “Computer Architecture: A Quantitative Approach” — Hennessy & Patterson (even skim-reading chapter 1 is useful)
  • The Disruptor technical paper (available on the LMAX GitHub)