By mid-2014 we had run CMS, G1, and Parallel GC in production on the same workload, and had evaluated Azul Zing. Here’s what we found and the decision framework that came out of it.
The Workload
Before the comparison means anything, the workload matters:
- Allocation rate: ~300 MB/s sustained, bursting to ~800 MB/s during market opens
- Object tenure: most objects short-lived (FIX message processing), some medium-lived (order state), few long-lived (reference data)
- Heap: 16 GB, JDK 7u67
- SLA: p99 < 1ms tick-to-quote, p999 < 5ms
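An allocation-rate figure like the one above can be sampled in-process on HotSpot via the `com.sun.management.ThreadMXBean` extension. A minimal sketch — class and method names here are mine, not from the original system, and the cast only works on HotSpot-derived JVMs:

```java
import java.lang.management.ManagementFactory;

public class AllocRate {
    /** Bytes allocated so far by the current thread (HotSpot only; -1 if unsupported). */
    public static long allocatedBytes() {
        com.sun.management.ThreadMXBean mx =
            (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
        return mx.getThreadAllocatedBytes(Thread.currentThread().getId());
    }

    public static void main(String[] args) {
        long before = allocatedBytes();
        byte[][] garbage = new byte[1000][];
        for (int i = 0; i < 1000; i++) garbage[i] = new byte[1024]; // ~1 MB of churn
        long after = allocatedBytes();
        System.out.println("allocated ~" + (after - before) / 1024 + " KB");
    }
}
```

Sample the delta per thread per second and sum across threads to get a number comparable to the MB/s figures above; GC logs give the same answer with less ceremony, but this works without restarting the JVM.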
Collector Comparison
| Collector | Pause type | Typical minor (ms) | Typical major (ms) | Throughput overhead |
|---|---|---|---|---|
| Parallel GC | STW all | 50–200 | 2,000–15,000 | Lowest |
| CMS | Mostly concurrent | 10–40 | 500–2,000 (failure) | Medium |
| G1 | STW + concurrent | 20–80 | 200–800 | Medium |
| Azul C4 (Zing) | Concurrent | <5 | <10 | Higher |
STW = stop-the-world. Numbers from our specific workload on Xeon E5-2687W.
Parallel GC: Eliminated Immediately
Parallel GC (the default in JDK ≤8) uses multiple GC threads and achieves high throughput — but every collection is stop-the-world. Minor collections on 16GB with our allocation rate were 50–200ms. Not a consideration for sub-millisecond SLA work.
The one scenario where Parallel GC is correct: batch processing where you want maximum throughput and don’t care about pause duration. For our trading system: no.
CMS: The Standard Choice, With Caveats
CMS runs its marking and sweeping phases concurrently with application threads, producing short stop-the-world pauses only for initial mark and final remark. When it works, it’s good. When it doesn’t work, it falls back to a serial full GC — the worst outcome.
CMS cycle (normal):
─────────────────────────────────────────────────────────────────────▶ time
[STW: initial mark] [concurrent mark ·····] [STW: remark] [concurrent sweep ·····]
       ~5ms                 ~500ms              ~20ms            ~800ms
         ↑                                        ↑
             Only these two are stop-the-world
The failure conditions to avoid:
- Concurrent mode failure: old gen fills before CMS completes its concurrent cycle → serial full GC
- Promotion failure: minor GC can’t promote to old gen (fragmentation or exhaustion) → full GC
Mitigations that worked for us:
- -XX:CMSInitiatingOccupancyFraction=55 (start early, give CMS room)
- -XX:+CMSScavengeBeforeRemark (minor GC just before remark reduces remark work)
- Aggressive allocation reduction on the hot path (see earlier post on GC pauses)
With these in place: p99 ~340µs, p999 ~2.1ms. Still occasional spikes from remark.
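The "allocation reduction on the hot path" bullet looked roughly like this in spirit — the sketch below is illustrative, not the production parser: scan numeric FIX tag=value pairs into a reused scratch array instead of calling `String.split()`, which allocates a `String` per field.

```java
// Illustrative sketch of zero-allocation FIX field scanning.
// Assumes well-formed input: numeric tags and values, SOH (0x01) delimiters.
public class FixScanner {
    // Reused scratch, indexed by tag number; no per-message allocation.
    private final long[] tagValues = new long[1024];

    public void parse(byte[] msg, int len) {
        int i = 0;
        while (i < len) {
            int tag = 0;
            while (msg[i] != '=') tag = tag * 10 + (msg[i++] - '0');
            i++; // skip '='
            long value = 0;
            while (i < len && msg[i] != 1) value = value * 10 + (msg[i++] - '0');
            i++; // skip SOH delimiter
            if (tag < tagValues.length) tagValues[tag] = value;
        }
    }

    public long value(int tag) { return tagValues[tag]; }
}
```

Real FIX fields are not all numeric, so a production version holds offsets into the message buffer instead of parsed longs — but the principle is the same: no objects per message.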
G1: The Upgrade Path
G1 (Garbage First) divides the heap into equal-sized regions and collects the regions with the most garbage first (hence the name). It eliminates the CMS fragmentation problem and has more predictable pause times.
Heap layout (G1):
┌────┬────┬────┬────┬────┬────┬────┬────┐
│ E │ E │ S │ O │ O │ H │ E │ O │ ← regions, dynamically typed
└────┴────┴────┴────┴────┴────┴────┴────┘
E=Eden S=Survivor O=Old H=Humongous
G1 collects whichever regions have highest garbage density first
→ predictable pause time via region count limit per collection
Key tuning lever: MaxGCPauseMillis (default 200ms). G1 tries to keep pauses under this target by limiting how many regions it collects per cycle. Setting it to 20ms on our workload:
| Metric | CMS (tuned) | G1 (MaxGCPauseMillis=20) |
|---|---|---|
| p50 latency | 185 µs | 201 µs |
| p99 latency | 340 µs | 390 µs |
| p999 latency | 2,100 µs | 680 µs |
| Full GC freq | 2–3/week | 0 (in 3 months) |
G1 had slightly worse p50 and p99 but dramatically better p999 and eliminated full GCs entirely. For a system where occasional 2-second full GCs were unacceptable, G1 was the correct move.
G1 gotcha: humongous objects. Objects at least half a G1 region in size (regions default to 1–32MB depending on heap size; see G1HeapRegionSize) are allocated in dedicated humongous regions, and on the JDK 7-era G1 we ran they were reclaimed only at the end of a concurrent marking cycle or by a full GC (eager reclaim during young collections arrived later, in JDK 8u40). If your workload creates large objects (>512KB at the minimum 1MB region size), check for humongous allocation warnings in GC logs.
Azul Zing C4: When You Need Genuinely Pauseless
C4 (Continuously Concurrent Compacting Collector) is the Azul proprietary GC available in Zing. It’s the only production GC at the time of writing (2014) that achieves sub-millisecond pause times at multi-gigabyte heap sizes by doing all GC work concurrently.
We evaluated it:
- p999 dropped from 680µs (G1) to 85µs
- No outliers above 500µs in a 72-hour soak test
- Cost: Zing is commercial, not cheap
For the strictest latency budgets, the cost was justified. For our use case at the time, G1’s p999 of 680µs met the 5ms SLA with margin, and the cost of Zing wasn’t approved. At a previous trading firm with a tighter latency budget, Zing was in use — the numbers above are directionally correct.
Decision Framework
Is p999 > 1ms acceptable?
│
├─ Yes → G1 with MaxGCPauseMillis=50-100
│ Eliminate humongous allocations
│ Monitor with -Xlog:gc* (JDK 9+) or -XX:+PrintGCDetails
│
└─ No → Is p999 > 500µs acceptable?
│
├─ Yes → G1 with MaxGCPauseMillis=20-30
│ Aggressive allocation reduction on hot path
│
└─ No → Azul Zing / ZGC (JDK 15+) / Shenandoah
OR redesign to avoid GC pressure entirely
(off-heap, object pooling, Aeron)
ZGC and Shenandoah weren’t production-ready in 2014, but they’re the modern equivalent of Azul C4 and are worth evaluating before paying for a commercial JVM.
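The off-heap option in the last branch can be as simple as a direct `ByteBuffer`: data living there is invisible to the GC entirely. A minimal sketch — the field layout and names are mine, purely illustrative:

```java
import java.nio.ByteBuffer;

// Illustrative off-heap store: fixed-size slots in a direct buffer,
// so entries never become GC-managed objects.
public class OffHeapOrders {
    private static final int SLOT = 16; // price (8 bytes) + qty (8 bytes)
    private final ByteBuffer buf;

    public OffHeapOrders(int capacity) {
        buf = ByteBuffer.allocateDirect(capacity * SLOT); // outside the Java heap
    }

    public void put(int slot, long price, long qty) {
        buf.putLong(slot * SLOT, price);
        buf.putLong(slot * SLOT + 8, qty);
    }

    public long price(int slot) { return buf.getLong(slot * SLOT); }
    public long qty(int slot)   { return buf.getLong(slot * SLOT + 8); }
}
```

The trade-off is that you take over lifetime management and lose the type system's help; libraries like Aeron and Chronicle exist largely to make this pattern safe at scale.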
What Actually Matters More Than Collector Choice
For the record: allocation rate reduction matters more than collector choice. Switching from CMS to G1 improved our p999 from 2.1ms to 680µs. Reducing our allocation rate from 300MB/s to 80MB/s (through object pooling and off-heap structures) improved the G1 p999 from 680µs to 290µs. Same collector, lower allocation rate — bigger improvement.
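A pooling sketch in the spirit of what the paragraph describes — simplified and single-threaded (a real hot-path pool would be concurrent and bounded), with names of my own choosing:

```java
import java.util.ArrayDeque;

// Illustrative buffer pool: reuse byte[] buffers across messages instead of
// allocating one per message, cutting steady-state allocation to ~zero.
public class BufferPool {
    private final ArrayDeque<byte[]> free = new ArrayDeque<>();
    private final int bufferSize;

    public BufferPool(int bufferSize, int preallocate) {
        this.bufferSize = bufferSize;
        for (int i = 0; i < preallocate; i++) free.push(new byte[bufferSize]);
    }

    public byte[] acquire() {
        byte[] b = free.poll();
        return (b != null) ? b : new byte[bufferSize]; // allocate only under pressure
    }

    public void release(byte[] b) {
        if (b.length == bufferSize) free.push(b); // reject foreign sizes
    }
}
```

Preallocating at startup means the steady state allocates nothing, which is exactly what drives the allocation rate (and hence GC frequency) down.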
Measure allocation rate first. Choose a collector second.