By mid-2014 we had run CMS, G1, and Parallel GC in production on the same workload, and had evaluated Azul Zing. Here’s what we found and the decision framework that came out of it.
The Workload
Before the comparison means anything, the workload matters:
- Allocation rate: ~300 MB/s sustained, bursting to ~800 MB/s during market opens
- Object tenure: most objects short-lived (FIX message processing), some medium-lived (order state), few long-lived (reference data)
- Heap: 16 GB, JDK 7u67
- SLA: p99 < 1ms tick-to-quote, p999 < 5ms
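An allocation-rate figure like the one above can be sampled in-process on HotSpot via the `com.sun.management.ThreadMXBean` extension. A minimal sketch — class and method names here are mine, not from the original system, and the cast only works on HotSpot-derived JVMs:

```java
import java.lang.management.ManagementFactory;

public class AllocRate {
    /** Bytes allocated so far by the current thread (HotSpot only; -1 if unsupported). */
    public static long allocatedBytes() {
        com.sun.management.ThreadMXBean mx =
            (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
        return mx.getThreadAllocatedBytes(Thread.currentThread().getId());
    }

    public static void main(String[] args) {
        long before = allocatedBytes();
        byte[][] garbage = new byte[1000][];
        for (int i = 0; i < 1000; i++) garbage[i] = new byte[1024]; // ~1 MB of churn
        long after = allocatedBytes();
        System.out.println("allocated ~" + (after - before) / 1024 + " KB");
    }
}
```

Sample the delta per thread per second and sum across threads to get a number comparable to the MB/s figures above; GC logs give the same answer with less ceremony, but this works without restarting the JVM.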
Collector Comparison
| Collector | Pause type | Typical minor (ms) | Typical major (ms) | Throughput overhead |
|---|---|---|---|---|
| Parallel GC | STW all | 50–200 | 2,000–15,000 | Lowest |
| CMS | Mostly concurrent | 10–40 | 500–2,000 (failure) | Medium |
| G1 | STW + concurrent | 20–80 | 200–800 | Medium |
| Azul C4 (Zing) | Concurrent | <5 | <10 | Higher |
STW = stop-the-world. Numbers from our specific workload on Xeon E5-2687W.
Parallel GC: Eliminated Immediately
Parallel GC (the default in JDK ≤8) uses multiple GC threads and achieves high throughput — but every collection is stop-the-world. Minor collections on 16GB with our allocation rate were 50–200ms. Not a consideration for sub-millisecond SLA work.
The one scenario where Parallel GC is correct: batch processing where you want maximum throughput and don’t care about pause duration. For our trading system: no.
CMS: The Standard Choice, With Caveats
CMS runs its marking and sweeping phases concurrently with application threads, producing short stop-the-world pauses only for initial mark and final remark. When it works, it’s good. When it doesn’t work, it falls back to a serial full GC — the worst outcome.
CMS cycle (normal):
─────────────────────────────────────────────────────────────────────▶ time
[STW: initial mark] [concurrent mark ·····] [STW: remark] [concurrent sweep ·····]
       ~5ms                 ~500ms              ~20ms            ~800ms
         ↑                                        ↑
             Only these two are stop-the-world
The failure conditions to avoid:
- Concurrent mode failure: old gen fills before CMS completes its concurrent cycle → serial full GC
- Promotion failure: minor GC can’t promote to old gen (fragmentation or exhaustion) → full GC
Mitigations that worked for us:
- -XX:CMSInitiatingOccupancyFraction=55 (start early, give CMS room)
- -XX:+CMSScavengeBeforeRemark (minor GC just before remark reduces remark work)
- Aggressive allocation reduction on the hot path (see earlier post on GC pauses)
With these in place: p99 ~340µs, p999 ~2.1ms. Still occasional spikes from remark.
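The "allocation reduction on the hot path" bullet looked roughly like this in spirit — the sketch below is illustrative, not the production parser: scan numeric FIX tag=value pairs into a reused scratch array instead of calling `String.split()`, which allocates a `String` per field.

```java
// Illustrative sketch of zero-allocation FIX field scanning.
// Assumes well-formed input: numeric tags and values, SOH (0x01) delimiters.
public class FixScanner {
    // Reused scratch, indexed by tag number; no per-message allocation.
    private final long[] tagValues = new long[1024];

    public void parse(byte[] msg, int len) {
        int i = 0;
        while (i < len) {
            int tag = 0;
            while (msg[i] != '=') tag = tag * 10 + (msg[i++] - '0');
            i++; // skip '='
            long value = 0;
            while (i < len && msg[i] != 1) value = value * 10 + (msg[i++] - '0');
            i++; // skip SOH delimiter
            if (tag < tagValues.length) tagValues[tag] = value;
        }
    }

    public long value(int tag) { return tagValues[tag]; }
}
```

Real FIX fields are not all numeric, so a production version holds offsets into the message buffer instead of parsed longs — but the principle is the same: no objects per message.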
G1: The Upgrade Path
G1 (Garbage First) divides the heap into equal-sized regions and collects the regions with the most garbage first (hence the name). It eliminates the CMS fragmentation problem and has more predictable pause times.
Heap layout (G1):
┌────┬────┬────┬────┬────┬────┬────┬────┐
│ E │ E │ S │ O │ O │ H │ E │ O │ ← regions, dynamically typed
└────┴────┴────┴────┴────┴────┴────┴────┘
E=Eden S=Survivor O=Old H=Humongous
G1 collects whichever regions have highest garbage density first
→ predictable pause time via region count limit per collection
Key tuning lever: MaxGCPauseMillis (default 200ms). G1 tries to keep pauses under this target by limiting how many regions it collects per cycle. Setting it to 20ms on our workload:
| Metric | CMS (tuned) | G1 (MaxGCPauseMillis=20) |
|---|---|---|
| p50 latency | 185 µs | 201 µs |
| p99 latency | 340 µs | 390 µs |
| p999 latency | 2,100 µs | 680 µs |
| Full GC freq | 2–3/week | 0 (in 3 months) |
G1 had slightly worse p50 and p99 but dramatically better p999 and eliminated full GCs entirely. For a system where occasional 2-second full GCs were unacceptable, G1 was the correct move.
G1 gotcha: humongous objects. Objects at least half a G1 region in size (regions default to 1–32MB depending on heap size; see G1HeapRegionSize) are allocated in dedicated humongous regions, and on the JDK 7-era G1 we ran they were reclaimed only at the end of a concurrent marking cycle or by a full GC (eager reclaim during young collections arrived later, in JDK 8u40). If your workload creates large objects (>512KB at the minimum 1MB region size), check for humongous allocation warnings in GC logs.
Azul Zing C4: When You Need Genuinely Pauseless
C4 (Continuously Concurrent Compacting Collector) is the Azul proprietary GC available in Zing. It’s the only production GC at the time of writing (2014) that achieves sub-millisecond pause times at multi-gigabyte heap sizes by doing all GC work concurrently.
We evaluated it:
- p999 dropped from 680µs (G1) to 85µs
- No outliers above 500µs in a 72-hour soak test
- Cost: Zing is commercial, not cheap
For the strictest latency budgets, the cost was justified. For our use case at the time, G1’s p999 of 680µs met the 5ms SLA with margin, and the cost of Zing wasn’t approved. At a previous trading firm with a tighter latency budget, Zing was in use — the numbers above are directionally correct.
Decision Framework
Is p999 > 1ms acceptable?
│
├─ Yes → G1 with MaxGCPauseMillis=50-100
│ Eliminate humongous allocations
│ Monitor with -Xlog:gc* (JDK 9+) or -XX:+PrintGCDetails
│
└─ No → Is p999 > 500µs acceptable?
│
├─ Yes → G1 with MaxGCPauseMillis=20-30
│ Aggressive allocation reduction on hot path
│
└─ No → Azul Zing / ZGC (JDK 15+) / Shenandoah
OR redesign to avoid GC pressure entirely
(off-heap, object pooling, Aeron)
ZGC and Shenandoah weren’t production-ready in 2014, but they’re the modern equivalent of Azul C4 and are worth evaluating before paying for a commercial JVM.
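The off-heap option in the last branch can be as simple as a direct `ByteBuffer`: data living there is invisible to the GC entirely. A minimal sketch — the field layout and names are mine, purely illustrative:

```java
import java.nio.ByteBuffer;

// Illustrative off-heap store: fixed-size slots in a direct buffer,
// so entries never become GC-managed objects.
public class OffHeapOrders {
    private static final int SLOT = 16; // price (8 bytes) + qty (8 bytes)
    private final ByteBuffer buf;

    public OffHeapOrders(int capacity) {
        buf = ByteBuffer.allocateDirect(capacity * SLOT); // outside the Java heap
    }

    public void put(int slot, long price, long qty) {
        buf.putLong(slot * SLOT, price);
        buf.putLong(slot * SLOT + 8, qty);
    }

    public long price(int slot) { return buf.getLong(slot * SLOT); }
    public long qty(int slot)   { return buf.getLong(slot * SLOT + 8); }
}
```

The trade-off is that you take over lifetime management and lose the type system's help; libraries like Aeron and Chronicle exist largely to make this pattern safe at scale.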
What Actually Matters More Than Collector Choice
For the record: allocation rate reduction matters more than collector choice. Switching from CMS to G1 improved our p999 from 2.1ms to 680µs. Reducing our allocation rate from 300MB/s to 80MB/s (through object pooling and off-heap structures) improved the G1 p999 from 680µs to 290µs. Same collector, lower allocation rate — bigger improvement.
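A pooling sketch in the spirit of what the paragraph describes — simplified and single-threaded (a real hot-path pool would be concurrent and bounded), with names of my own choosing:

```java
import java.util.ArrayDeque;

// Illustrative buffer pool: reuse byte[] buffers across messages instead of
// allocating one per message, cutting steady-state allocation to ~zero.
public class BufferPool {
    private final ArrayDeque<byte[]> free = new ArrayDeque<>();
    private final int bufferSize;

    public BufferPool(int bufferSize, int preallocate) {
        this.bufferSize = bufferSize;
        for (int i = 0; i < preallocate; i++) free.push(new byte[bufferSize]);
    }

    public byte[] acquire() {
        byte[] b = free.poll();
        return (b != null) ? b : new byte[bufferSize]; // allocate only under pressure
    }

    public void release(byte[] b) {
        if (b.length == bufferSize) free.push(b); // reject foreign sizes
    }
}
```

Preallocating at startup means the steady state allocates nothing, which is exactly what drives the allocation rate (and hence GC frequency) down.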
Measure allocation rate first. Choose a collector second.