Five years ago I joined an electronic trading firm not knowing what a cache line was. I thought garbage collection was something that happened to other people’s code. I had never looked at assembly output from a Java program. I’d heard of the LMAX Disruptor but had no idea why it existed.

By the time I left, I had opinions about CPU prefetchers. I had read the Intel 64 and IA-32 Architectures Software Developer’s Manual for fun. I could look at a flame graph and immediately see the GC pressure. I had shipped components processing a million messages per second with sub-millisecond p99 guarantees.

Here’s what that environment actually teaches you.

The Measurement Mindset Is Everything

The single most transferable skill from HFT is not knowing about cache lines or lock-free data structures. It’s the habit of measuring before claiming.

In most software environments, “this should be faster” is an acceptable basis for a change. In HFT, “this is faster, here’s the benchmark” is the only acceptable basis. The difference isn’t just rigour — it’s that performance intuition at the nanosecond level is genuinely unreliable. I lost count of how many “obvious optimisations” made things slower.

Optimisation I was sure would help    Actual result
──────────────────────────────────────────────────────
Use int instead of long (smaller)     Slower (alignment)
Pre-compute lookup table              Slower (cache pressure)
Remove branch with conditional move   Slower (dependency chain)
Add padding to avoid false sharing    Faster ✓
Pin thread to isolated CPU            Faster ✓
Replace ArrayList with array          Faster ✓ (but not by much)
Eliminate virtual dispatch            Faster ✓ (significantly)

Three of those seven “optimisations” made things worse. Without measurement I would have shipped all of them and concluded the system had been improved.
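None of those rows were settled by argument; each was settled by a harness. JMH is the right tool for serious JVM benchmarks, but the shape of the discipline fits in a page. A minimal sketch — the two lambdas below are hypothetical stand-ins, not the real optimisations from the table:

```java
import java.util.function.LongUnaryOperator;

public class MeasureFirst {
    static volatile long blackhole;  // sink so the JIT can't dead-code-eliminate the loops

    // Average nanoseconds per call, after an untimed warm-up pass
    // that lets the JIT compile the hot loop before anything is measured.
    static double nanosPerOp(LongUnaryOperator op, int iters) {
        long sink = 0;
        for (int i = 0; i < iters; i++) sink += op.applyAsLong(i);  // warm-up, untimed
        long start = System.nanoTime();
        for (int i = 0; i < iters; i++) sink += op.applyAsLong(i);
        long elapsed = System.nanoTime() - start;
        blackhole = sink;  // keep the result live
        return elapsed / (double) iters;
    }

    public static void main(String[] args) {
        // Hypothetical baseline vs. "obvious optimisation": measure both, then claim.
        double mul   = nanosPerOp(x -> x * 31 + 7, 5_000_000);
        double shift = nanosPerOp(x -> (x << 5) - x + 7, 5_000_000);
        System.out.printf("multiply: %.2fns/op, shift-and-subtract: %.2fns/op%n", mul, shift);
    }
}
```

Even this crude harness catches the case where the clever rewrite is slower. JMH adds what it lacks: forked JVMs, proper dead-code guards, and statistical treatment of the results.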

Hardware Awareness Becomes Automatic

After long enough in this environment, you stop thinking of code as abstract instructions and start thinking about where data is when the instruction executes.

The mental model that becomes second nature:

Register:    ~0.3ns access (within the CPU)
L1 cache:    ~1ns   (32–64KB per core)
L2 cache:    ~4ns   (256KB–1MB per core)
L3 cache:    ~10ns  (shared, 8–32MB)
DRAM:        ~100ns (main memory)
NVMe SSD:    ~100µs (100,000ns — 1,000× slower than DRAM)

A "fast" trading system lives in L1/L2.
A "slow" trading system makes DRAM accesses in the hot path.
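The ladder is easy to feel from plain Java. A sketch of my own (not from the original system): walk the same array twice, once in index order and once in a shuffled order, so the second walk misses cache far more often. Absolute timings are machine-dependent; the ratio is the point.

```java
import java.util.Random;

public class CacheWalk {
    // Sum an array following a given visit order. Same data, same
    // arithmetic; only the memory access pattern differs.
    static long sum(long[] data, int[] order) {
        long total = 0;
        for (int i : order) total += data[i];
        return total;
    }

    public static void main(String[] args) {
        int n = 1 << 22;                        // 4M longs = 32MB, larger than most L3 caches
        long[] data = new long[n];
        int[] seq = new int[n], rnd = new int[n];
        for (int i = 0; i < n; i++) { data[i] = i; seq[i] = i; rnd[i] = i; }
        Random r = new Random(1);
        for (int i = n - 1; i > 0; i--) {       // Fisher-Yates shuffle of the visit order
            int j = r.nextInt(i + 1);
            int t = rnd[i]; rnd[i] = rnd[j]; rnd[j] = t;
        }
        long t0 = System.nanoTime(); long a = sum(data, seq);
        long t1 = System.nanoTime(); long b = sum(data, rnd);
        long t2 = System.nanoTime();
        System.out.printf("sequential %dms, shuffled %dms (sums equal: %b)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, a == b);
    }
}
```

The shuffled walk does identical work and typically runs several times slower, because the prefetcher can follow the sequential pattern but not the random one.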

This changes how you think about data layout. Arrays of structs vs. structs of arrays. Which fields are accessed together and should live in the same cache line. Whether your queue implementation causes the producer’s write and the consumer’s read to fight over the same cache line (false sharing).

These aren’t things you read once and remember. They’re things you measure repeatedly until the model is calibrated.
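False sharing is the classic case: it never shows up in source code, only in measurements. The Disruptor-era fix is manual padding so a hot field owns its own cache line. A sketch — field names are mine, and note the JVM is free to reorder or strip unused fields, which is why the JDK's internal @Contended annotation exists as the supported alternative:

```java
// A counter updated by one thread while a neighbouring field is
// read by another. Unpadded, both can land on the same 64-byte
// cache line and the two cores ping-pong ownership of it.
class PaddedCounter {
    // Seven longs before and after push neighbouring fields onto
    // different cache lines (7 × 8 bytes + the field itself ≥ 64 bytes).
    long p1, p2, p3, p4, p5, p6, p7;
    volatile long value;
    long q1, q2, q3, q4, q5, q6, q7;

    void increment() { value++; }  // single-writer only; ++ on a volatile is not atomic
}
```

Whether the padding survives the JVM's field layout is itself something to verify (JOL, the Java Object Layout tool, will show you), which is rather the theme of this section.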

The GC War Is Real

Five years of fighting JVM garbage collection taught me something that I think is underappreciated: GC is not primarily a throughput problem, it’s a tail latency problem.

Your average throughput can be excellent while GC pauses are killing your p99.9. A 10ms stop-the-world pause every 30 seconds sounds harmless. At 500k messages/second, 5,000 messages arrive during that 10ms pause. Those messages queue up. After the pause, the queue is processed at some rate above 500k/s. The latency spike you see is not 10ms — it’s the 10ms pause plus the queuing time to drain the backlog.
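The arithmetic is worth doing explicitly. Given an arrival rate, a drain rate, and a pause length (the numbers below are illustrative, not production figures), the worst-case spike is the pause plus the drain time:

```java
public class PauseBacklog {
    // Worst-case latency spike after a stop-the-world pause: the pause
    // itself, plus the time to drain the backlog at the surplus rate
    // (drain capacity minus the arrival rate, which never stops).
    static double spikeMillis(double arrivalPerSec, double drainPerSec, double pauseMillis) {
        double backlog = arrivalPerSec * pauseMillis / 1000.0;            // messages queued
        double drainMillis = backlog * 1000.0 / (drainPerSec - arrivalPerSec);
        return pauseMillis + drainMillis;
    }

    public static void main(String[] args) {
        // 500k msgs/s arriving, 600k msgs/s drain capacity, 10ms pause:
        // 5,000 messages queue up and drain at a 100k/s surplus = 50ms,
        // so the observed spike is 60ms, six times the pause itself.
        System.out.println(spikeMillis(500_000, 600_000, 10));  // prints 60.0
    }
}
```

The closer your drain capacity sits to your arrival rate, the worse the multiplier — at 550k/s capacity the same pause produces a 110ms spike.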

The techniques that actually worked, roughly in order of impact:

Technique                                   Latency reduction   Complexity added
────────────────────────────────────────────────────────────────────────────────
Reduce allocation rate on hot path          High                Medium
Object pooling for predictable objects      High                High
Off-heap storage (Chronicle)                High                Very high
-Xmx = -Xms (pre-size heap)                 Medium              Low
G1GC with explicit region sizing            Medium              Medium
Explicit System.gc() at known quiet times   Low                 Low
ZGC/Shenandoah (available 2019+)            High                Low

The irony: the most effective single technique is the simplest — stop allocating in the hot path. No GC pauses if there’s nothing to collect. This is why the codebase at the trading firm was full of pre-allocated object pools, off-heap buffers, and careful management of what got allocated where.
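An object pool is the bluntest form of "stop allocating": allocate everything up front, recycle forever. A single-threaded sketch under my own naming — the real hot-path pools were lock-free structures, which this is not:

```java
import java.util.ArrayDeque;
import java.util.function.Supplier;

// Minimal single-threaded object pool: every instance is allocated
// at start-up, so the hot path never creates garbage for the
// collector to find.
public class Pool<T> {
    private final ArrayDeque<T> free;

    public Pool(Supplier<T> factory, int size) {
        free = new ArrayDeque<>(size);
        for (int i = 0; i < size; i++) free.push(factory.get());  // pre-allocate
    }

    public T acquire() {
        T t = free.poll();
        if (t == null) throw new IllegalStateException("pool exhausted");
        return t;
    }

    public void release(T t) { free.push(t); }  // caller must reset the object's state first
}
```

The fixed size is deliberate: exhaustion becomes a loud, immediate error rather than a silent fallback to allocation, and the pool's capacity becomes a measured, provisioned number like everything else.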

Debugging Requires Deeper Tools

Standard debugging — print statements, breakpoints, stack traces — breaks down at this level. A breakpoint stops the world. A print statement adds microseconds of latency that masks the thing you’re trying to measure. Exceptions (even caught ones) are expensive to construct and can defeat JIT optimisations in the surrounding code.

The toolbox that replaces them:

  • async-profiler — samples the JVM without safepoint bias, gives you accurate CPU profiles with negligible observer effect
  • HdrHistogram — records full latency distribution, not averages
  • -XX:+PrintGCDetails -XX:+PrintGCDateStamps — GC log (superseded by -Xlog:gc* from JDK 9)
  • -XX:+PrintCompilation -XX:+PrintInlining — JIT decisions
  • Flight Recorder + JMC — comprehensive low-overhead recording
  • Linux perf — CPU performance counters, cache misses, branch mispredictions
  • strace -c — syscall count/timing when you suspect kernel transitions
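HdrHistogram earns its place on that list because recording must itself be cheap and must keep the whole distribution. The idea it serves can be sketched naively — this version stores every sample and sorts on query, which is exactly what HdrHistogram's constant-space buckets avoid; class and method names here are mine:

```java
import java.util.Arrays;

// Naive latency recorder: keep every sample, sort on demand.
// Good enough to illustrate why p99/p99.9 matter more than the mean;
// a real system would use HdrHistogram instead.
public class LatencyRecorder {
    private long[] samples = new long[1024];
    private int count = 0;

    public void record(long nanos) {
        if (count == samples.length) samples = Arrays.copyOf(samples, count * 2);
        samples[count++] = nanos;
    }

    // Nearest-rank percentile; assumes at least one recorded sample.
    public long percentile(double p) {
        long[] sorted = Arrays.copyOf(samples, count);
        Arrays.sort(sorted);
        int idx = (int) Math.ceil(p / 100.0 * count) - 1;
        return sorted[Math.max(0, idx)];
    }
}
```

Run a few million samples through something like this and the gap between `percentile(50)` and `percentile(99.9)` tells you more about your system than any average ever will.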

Learning these tools properly — not just “run it and look at the output” but understanding what the output means — takes longer than learning most programming languages.

The Things That Don’t Transfer Cleanly

HFT habits are powerful and sometimes counterproductive outside the domain:

Premature optimisation instinct. After years where every microsecond matters, you develop reflexes around performance that can slow down feature development in contexts where 100ms is completely acceptable. A database query that returns in 50ms is not a problem worth spending a day on when your users experience the response over a 200ms API call.

Allocation anxiety. Writing Go or Python code after years of Java HFT, I initially felt guilty about every allocation. Go’s GC is not a stop-the-world event in the same way. Python’s object model doesn’t have the same cache-line implications. Different languages have different cost models.

Distrust of abstractions. HFT trains you to understand every layer. This is appropriate when the layer costs you microseconds. It can become pathological when you spend hours understanding a library’s internals for a use case where the library’s default behaviour is perfectly adequate.

Measuring things that don’t matter. The discipline of measurement is invaluable. Measuring the wrong thing — microsecond performance of code that runs once at startup — is just expensive bikeshedding.

What Does Transfer

  • The habit of measuring before concluding
  • Understanding the hardware underneath the software (even imperfectly)
  • Reading bytecode/assembly when the abstraction layer is hiding something important
  • Thinking about the full distribution of latency, not just the average case
  • Designing data structures around how they’ll be accessed, not just what they contain
  • Writing systems that degrade gracefully under load rather than catastrophically

These skills are useful in any performance-sensitive domain. Which, as systems continue to grow in scale and user expectations for responsiveness increase, turns out to be most domains.


I left to work on different problems at a much larger organisation. The tools changed, the scale changed, the performance targets changed. The measurement mindset didn’t change, and it remained the most useful thing I brought with me.