The incident happened at 08:31 on a Tuesday — Frankfurt open, high-volatility session. Our tick-to-quote latency spiked to 340ms for about 2 seconds. The SLA was 1ms at p99. The trading desk noticed before our monitoring did.

The culprit: a full GC triggered by a promotion failure. We had a 12GB heap, the CMS collector, and no one had looked at GC logs since the initial deployment.

Understanding What Actually Happened

CMS (Concurrent Mark Sweep) is designed to avoid full GCs by running most of its collection work concurrently with application threads. It fails — and triggers a full GC — when:

  1. Promotion failure: Old gen fills up before CMS can finish a collection cycle. A minor GC tries to promote surviving objects into old gen, can’t find a large enough contiguous free block (CMS doesn’t compact, so old gen fragments over time), and falls back to a stop-the-world full GC.
  2. Concurrent mode failure: CMS can’t finish its concurrent phase before old gen exhausts.

Both produce stop-the-world pauses. With a 12GB heap and CMS, that pause was 8–12 seconds in the worst case. We got lucky with 340ms that day — it was a partial collection.

The GC log told the whole story. We just hadn’t been reading it.

2012-11-13T08:31:02.441+0000: [Full GC [CMS
 [CMS-concurrent-mark: 2.341/4.102 secs]
 (concurrent mode failure): 9823421K->4231823K(10485760K), 1.8823 secs]

concurrent mode failure — there it is.
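None of this is visible unless GC logging is switched on in the first place. For reference, on JVMs of that era (pre-unified logging, i.e. before JDK 9) the flags below produce logs in the format shown above; the log path is illustrative.

```
-verbose:gc
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCApplicationStoppedTime   # total stop-the-world time, not just GC work
-Xloggc:/var/log/app/gc.log
```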

Phase 1: Stop the Bleeding (Heap Tuning)

The immediate fix was heap configuration. CMS needs headroom — if old gen is running too hot, it can’t complete concurrent phases in time.

Changes:

-Xms12g -Xmx12g              # already set
-XX:NewRatio=2               # was 8 — eden was tiny relative to old gen
-XX:CMSInitiatingOccupancyFraction=60   # was default 92 — way too late
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+CMSParallelRemarkEnabled
-XX:+CMSScavengeBeforeRemark

Setting CMSInitiatingOccupancyFraction=60 means CMS starts its concurrent cycle when old gen hits 60% occupancy — leaving a 40% buffer for the cycle to complete before space runs out. UseCMSInitiatingOccupancyOnly is what makes this stick: without it, the JVM uses the configured value only for the first cycle and then reverts to its own adaptive heuristic.

This eliminated the concurrent mode failures. But we still had minor GC pauses at 20–40ms during peak. That’s still too slow.

Phase 2: Reduce Allocation Rate

Minor GC frequency is driven by how fast eden fills, and eden fills because code allocates objects. The only durable fix for minor GC frequency is to allocate less.

We used java.lang.management.GarbageCollectorMXBean to measure GC rate in production, then profiled with async-profiler to find the allocation hotspots.
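The sampling itself is a few lines of JMX. A minimal sketch (the class and its layout are illustrative, not our production telemetry):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Samples cumulative GC counts and times via JMX. Collector bean names
// depend on the collector in use (for CMS they are "ParNew" and
// "ConcurrentMarkSweep"). Counts and times are cumulative since JVM
// start, so a monitor diffs successive snapshots to get a rate.
public class GcRateSampler {
    public static String snapshot() {
        StringBuilder sb = new StringBuilder();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            sb.append(gc.getName())
              .append(" count=").append(gc.getCollectionCount())
              .append(" timeMs=").append(gc.getCollectionTime())
              .append('\n');
        }
        return sb.toString();
    }
}
```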

The top three offenders:

  1. FIX message parsing — creating String objects for every field value
  2. Price normalisation — allocating a new NormalisedPrice per tick
  3. Internal event routing — boxing long timestamps into Long for generic collections

The fix in each case was pre-allocated pools or primitive collections (Trove and Eclipse Collections for the generic ones). The FIX parser was rewritten to parse directly into a flyweight object backed by a pre-allocated byte[].
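To give a flavour of the flyweight approach, here is a simplified sketch — not our production parser — of reading a price field straight out of the receive buffer into a primitive long, with no String or boxed objects on the hot path. The 4-decimal fixed-point scale is an assumption for illustration.

```java
// Parses a decimal like "101.25" from buf[off, off+len) into fixed-point
// with 4 implied decimal places ("101.25" -> 1012500L). No intermediate
// String, char[], or boxed Long is allocated.
public class FixPriceParser {
    public static long parsePrice(byte[] buf, int off, int len) {
        long value = 0;
        int decimals = -1;                 // -1 until we see the '.'
        for (int i = off; i < off + len; i++) {
            byte b = buf[i];
            if (b == '.') { decimals = 0; continue; }
            value = value * 10 + (b - '0');
            if (decimals >= 0) decimals++;
        }
        // Scale up to exactly 4 implied decimal places.
        for (int d = (decimals < 0 ? 0 : decimals); d < 4; d++) {
            value *= 10;
        }
        return value;
    }
}
```

The same shape works for any numeric FIX field; the caller reuses one mutable flyweight per session rather than materialising objects per message.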

Allocation rate dropped from ~800MB/s to ~120MB/s under peak load. Minor GC frequency fell by 6x.

Phase 3: Measure the Right Things

The missing piece throughout had been measurement. We were watching application-level metrics (tick-to-quote latency, error rates) but not GC metrics specifically. By the time the app metric spiked, it was already an incident.

What we added to monitoring:

  • GC pause duration (minor and major) via JMX
  • GC frequency
  • Heap occupancy trend (old gen especially)
  • Allocation rate (inferred from young gen collection frequency × eden size)
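The allocation-rate inference in the last bullet is simple arithmetic: each minor GC implies roughly one eden's worth of allocation. A sketch, with illustrative numbers:

```java
// Back-of-envelope allocation rate from GC telemetry:
//   rate ≈ eden size × minor-GC collections per second.
public class AllocRate {
    public static double bytesPerSecond(long edenBytes, double minorGcPerSec) {
        return edenBytes * minorGcPerSec;
    }
}
```

With a 512MiB eden collecting 1.5 times per second, that works out to roughly 768MiB/s — the same order as the ~800MB/s we measured before the rewrite.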

Alerting on old gen > 70% with CMS gave us 10–30 minutes of warning before a potential concurrent mode failure. That’s enough to either trigger a manual GC cycle or fail over to a standby instance.
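The occupancy check behind that alert can be done in-process via MemoryPoolMXBean. A sketch, assuming nothing beyond the standard JMX pools (the pool-name match covers the common HotSpot collectors; the 0.70 threshold is our alerting policy, not a JVM constant):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Returns true if the old-generation pool is above the given occupancy
// fraction, e.g. overThreshold(0.70) for the 70% alert. Pool names vary
// by collector ("CMS Old Gen", "Tenured Gen", ...), hence the contains().
public class OldGenAlert {
    public static boolean overThreshold(double fraction) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            if (name.contains("Old Gen") || name.contains("Tenured")) {
                MemoryUsage u = pool.getUsage();
                // getMax() can be -1 if the pool is unbounded; skip then.
                if (u.getMax() > 0 && (double) u.getUsed() / u.getMax() > fraction) {
                    return true;
                }
            }
        }
        return false;
    }
}
```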

Where This Leaves Us

CMS is better than parallel GC for our use case, but it’s still not the answer for sub-millisecond tail latency. We started evaluating Azul’s Zing (pauseless GC via C4) and looking at what Shenandoah and ZGC might offer when they mature.

The deeper lesson: GC behaviour is a first-class system characteristic, not an afterthought. If you’re running Java in a latency-sensitive environment and you’re not reading GC logs regularly, you’re flying blind.