In 2015, the state of GC for latency-sensitive Java was: use G1GC, tune it carefully, accept occasional 50–200ms pauses on large heaps, and work around them with off-heap storage and careful allocation management.
The conventional wisdom was that sub-10ms GC pauses required small heaps (< 4GB) or near-zero allocation on the hot path. For trading systems with large position caches, this meant either expensive off-heap engineering or living with GC latency spikes.
Then the early previews of ZGC (Oracle) and Shenandoah (Red Hat) started circulating. Both claimed sub-millisecond pause times regardless of heap size. The mechanisms were different, but the implications were significant.
Why Traditional GC Pauses Are Proportional to Heap Size
G1GC’s stop-the-world pauses are unavoidable for certain operations. Young collection pauses scale with the amount of live data in the young generation (typically 1–5ms for well-tuned systems). Full GC pauses scale with the total live set — a 100GB heap with 60GB live can take 30+ seconds.
The root cause: the GC needs to scan all live object references to determine what to keep. During this scan, the application cannot run (the “stop the world”). The larger the heap, the longer the scan.
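The pause accounting is observable through the standard management API. A minimal sketch that reports cumulative collection count and time per collector (class name is illustrative; bean names depend on which GC is selected):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcPauseWatch {
    public static void main(String[] args) {
        // Churn some allocations so at least one young collection is likely.
        for (int i = 0; i < 200_000; i++) {
            byte[] garbage = new byte[1024];
        }
        // One MXBean per collector, e.g. "G1 Young Generation" /
        // "G1 Old Generation" when running under G1GC.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();  // collections so far
            long totalMs = gc.getCollectionTime(); // cumulative collection time, ms
            double avgMs = count > 0 ? (double) totalMs / count : 0.0;
            System.out.printf("%-24s %5d collections, avg %.2f ms%n",
                    gc.getName(), count, avgMs);
        }
    }
}
```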
Both ZGC and Shenandoah attack this with concurrent marking and relocation — doing the expensive work while the application runs, using barriers to track changes made during the concurrent phase.
ZGC’s Approach: Coloured Pointers
ZGC embeds GC metadata in the pointer bits themselves. On x86-64, pointers are 64 bits wide but only 48 bits are used for hardware addressing, and ZGC's initial design addresses at most a 4TB heap (42 bits). That leaves spare high bits to track GC state per pointer:
64-bit pointer:
[unused 18 bits][finalizable][remapped][marked1][marked0][object address 42 bits]
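The bit arithmetic can be illustrated in plain Java, assuming for illustration a 42-bit address with the four metadata bits directly above it (the real masks live inside HotSpot and have changed across releases):

```java
public class ColoredPointer {
    // Illustrative layout only: a 42-bit object address with four metadata
    // bits above it. Not the actual HotSpot/ZGC constants.
    static final int  ADDRESS_BITS = 42;
    static final long ADDRESS_MASK = (1L << ADDRESS_BITS) - 1;
    static final long MARKED0      = 1L << ADDRESS_BITS;
    static final long MARKED1      = 1L << (ADDRESS_BITS + 1);
    static final long REMAPPED     = 1L << (ADDRESS_BITS + 2);
    static final long FINALIZABLE  = 1L << (ADDRESS_BITS + 3);

    static long address(long pointer)       { return pointer & ADDRESS_MASK; }
    static boolean isRemapped(long pointer) { return (pointer & REMAPPED) != 0; }

    public static void main(String[] args) {
        long addr = 0x1_2345_6000L;   // fake object address, fits in 42 bits
        long p = addr | REMAPPED;     // "good" pointer: remapped bit set
        System.out.printf("address=0x%x remapped=%b%n", address(p), isRemapped(p));
        // prints: address=0x123456000 remapped=true
    }
}
```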
When ZGC relocates an object (moves it to defragment the heap), it updates the pointer metadata rather than immediately chasing all references. A load barrier intercepts any reference load and follows a forwarding pointer if needed:
Load barrier (conceptually):
    pointer = load(address)
    if pointer has 'remapped' bit clear:
        pointer = follow_forwarding_pointer(pointer)
        store(address, pointer)  // heal the pointer
    return pointer
The barrier adds ~2–4ns per object reference load. The heap can be relocated concurrently while the application runs. Stop-the-world pauses are only needed for short initial and final phases — typically <1ms for any heap size.
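The self-healing behaviour can be mimicked in plain Java, with a side table of old-to-new locations standing in for the pointer metadata bits (everything here, class names and the forwarding map alike, is illustrative rather than ZGC's real mechanism):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Toy model of a self-healing load barrier: a forwarding table records
// relocated objects, and the barrier fixes up the referencing slot on
// the next load so later loads take the fast path.
public class LoadBarrierSketch {
    static final Map<Object, Object> forwarding = new HashMap<>();
    static final AtomicReferenceArray<Object> slots = new AtomicReferenceArray<>(8);

    static Object loadBarrier(int slot) {
        Object ref = slots.get(slot);
        Object moved = forwarding.get(ref);
        if (moved != null) {                       // object was relocated
            slots.compareAndSet(slot, ref, moved); // heal the slot in place
            return moved;
        }
        return ref;                                // fast path: no relocation
    }

    public static void main(String[] args) {
        Object oldCopy = new Object();
        Object newCopy = new Object();
        slots.set(0, oldCopy);
        forwarding.put(oldCopy, newCopy);              // the "GC" moved the object
        System.out.println(loadBarrier(0) == newCopy); // true: barrier followed it
        System.out.println(slots.get(0) == newCopy);   // true: slot was healed
    }
}
```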
Preview results (2015 research builds):
Heap size   ZGC pause p99   G1GC pause p99
──────────────────────────────────────────
4 GB        <1ms            5–15ms
32 GB       <1ms            30–80ms
128 GB      <1ms            120–300ms
The ZGC pause time was essentially independent of heap size. This was the breakthrough.
Shenandoah’s Approach: Brooks Pointers
Shenandoah uses a different technique: an additional forwarding-pointer field per object (a Brooks pointer, after Rodney Brooks, who proposed the scheme in 1984). During relocation, the Brooks pointer is updated to point to the new location; loads go through it.
Object layout with Brooks pointer:
[mark word][class pointer][Brooks pointer → new location][fields...]
The Brooks pointer adds one word of overhead per object (8 bytes on 64-bit JVMs). For object-heavy workloads this is meaningful; for large objects (arrays, strings) it’s negligible.
Shenandoah’s load barrier is simpler than ZGC’s coloured pointer approach, making it easier to implement but with slightly higher per-barrier overhead on some workloads.
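The indirection can be sketched in plain Java by giving each object an explicit forwarding field (a toy model; class and field names are illustrative):

```java
// Toy model of Brooks-style forwarding: every object carries one extra
// reference field, and all access dereferences it.
public class BrooksSketch {
    static final class Cell {
        Cell forward = this;   // points to itself until the object is relocated
        int value;
        Cell(int value) { this.value = value; }
    }

    static int read(Cell c) { return c.forward.value; } // one extra hop per access

    public static void main(String[] args) {
        Cell original = new Cell(42);
        Cell relocated = new Cell(42);
        original.forward = relocated;       // the "GC" moved the object
        relocated.value = 99;               // mutations land in the new copy
        System.out.println(read(original)); // prints 99 via the old reference
    }
}
```

The `forward` field is exactly the one word of per-object overhead described above.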
What This Meant for Trading Systems
In 2015, neither collector was production-ready (ZGC shipped as an experimental feature in Java 11 in 2018; Shenandoah followed in OpenJDK 12 in 2019). But the implications for system design were already clear:
The off-heap workaround becomes less necessary. The main reason we kept large data structures off-heap was to avoid GC pauses. With sub-millisecond pauses regardless of heap size, keeping 40GB of risk parameters off-heap is operational complexity without a clear benefit.
Allocation rate still matters. ZGC and Shenandoah reduce GC pause time but not GC frequency. If you allocate 500MB/second, the GC still runs frequently — it just doesn’t pause the application visibly. Allocation pressure can still affect throughput, just not latency.
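Allocation rate is observable from inside the process. A sketch using the HotSpot-specific com.sun.management.ThreadMXBean (the cast below assumes a HotSpot-based JVM; this interface is not part of the standard java.lang.management API):

```java
import java.lang.management.ManagementFactory;

// Sketch of per-thread allocation tracking via com.sun.management.ThreadMXBean.
public class AllocRate {
    public static void main(String[] args) {
        com.sun.management.ThreadMXBean threads =
                (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
        long tid = Thread.currentThread().getId();

        long before = threads.getThreadAllocatedBytes(tid);
        byte[][] junk = new byte[1024][];
        for (int i = 0; i < junk.length; i++) {
            junk[i] = new byte[1024];          // allocate roughly 1 MB
        }
        long after = threads.getThreadAllocatedBytes(tid);

        System.out.printf("thread allocated ~%d KB%n", (after - before) / 1024);
    }
}
```

Sampling this counter periodically per thread gives the MB/second figure to alert on.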
The tuning model changes. G1GC requires careful tuning of eden size, survivor ratios, region sizes, and pause time targets. ZGC requires almost no tuning — set -Xmx, choose the number of GC threads, and it works. The operational simplification is significant.
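The eventual command lines bear this out. A hedged comparison (flags are from JDK 11 and later, values are illustrative, and `app.jar` is a placeholder):

```shell
# G1's tuning surface (illustrative values):
java -XX:+UseG1GC -Xmx32g \
     -XX:MaxGCPauseMillis=10 \
     -XX:G1HeapRegionSize=16m \
     -XX:G1NewSizePercent=20 \
     -jar app.jar

# ZGC: essentially just a heap size. The unlock flag was required on
# JDK 11-14, while ZGC was still experimental.
java -XX:+UnlockExperimentalVMOptions -XX:+UseZGC -Xmx32g -jar app.jar
```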
The barrier cost is real. ZGC’s ~2–4ns load barrier overhead is noticeable in pointer-chasing workloads (traversing linked structures). For array-based workloads (our ring buffers, price tables), the overhead is minimal. Profiling showed a 3–5% throughput reduction on our core pipeline — acceptable given the latency improvement.
The Transition Plan We Sketched
By the time ZGC ships as production-ready:
- Replace the off-heap (Chronicle Map) risk parameter store with a plain on-heap equivalent
- Reduce -Xmx tuning complexity — standard 8GB heap, let ZGC manage it
- Remove GC-related latency spike monitoring (less relevant)
- Add allocation rate monitoring (still relevant)
This was a long-term plan. The actual migration happened years later, after ZGC matured through Java 11–15. But sketching it in 2015 was useful: it changed the design conversation from “how do we minimise GC impact?” to “how much longer do we need to treat GC as a hard constraint?”
The low-pause GC collectors changed the Java performance landscape more fundamentally than most JVM improvements. They didn’t make Java as fast as a carefully tuned C++ system, but they removed the class of latency spikes that most distinguished “Java with GC” from “native code without.” For financial systems built on the JVM, that matters.