GC logs are low-overhead diagnostic data the JVM will produce for you, cheap enough to leave enabled permanently in production. They tell you the timing, cause, duration, and effect of every collection — if you know how to read them.
Most Java engineers can tell you what GC does. Far fewer can look at a GC log and immediately see why the p99 latency spiked at 14:37 last Tuesday.
Enabling Useful GC Logging
The modern (Java 9+) unified logging syntax:
-Xlog:gc*:file=gc.log:time,uptime:filecount=10,filesize=100m
This gives you:
- All GC events (gc*) — not just the summary
- Timestamped output (wall clock + uptime)
- Rolling files (10 × 100MB = 1GB of history)
For Java 8 (the relevant era at the institution):
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps
-XX:+PrintTenuringDistribution -Xloggc:gc.log
-XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M
Anatomy of a G1GC Log Entry
2017-04-18T09:12:34.891+0000: 3847.234: [GC pause (G1 Evacuation Pause) (young)
Desired survivor size 134217728 bytes, new threshold 15 (max threshold 15)
[Parallel Time: 44.3 ms, GC Workers: 8]
[GC Worker Start (ms): Min: 3847234.1, Avg: 3847234.3, Max: 3847234.8, Diff: 0.7]
[Ext Root Scanning (ms): Min: 0.5, Avg: 0.8, Max: 1.4, Diff: 0.9, Sum: 6.6]
[Update RS (ms): Min: 2.1, Avg: 2.9, Max: 4.2, Diff: 2.1, Sum: 23.3]
[Processed Buffers: Min: 4, Avg: 12.3, Max: 24, Diff: 20, Sum: 98]
[Scan RS (ms): Min: 1.2, Avg: 1.8, Max: 2.7, Diff: 1.5, Sum: 14.6]
[Code Root Scanning (ms): Min: 0.0, Avg: 0.1, Max: 0.4, Diff: 0.4, Sum: 0.8]
[Object Copy (ms): Min: 28.1, Avg: 36.7, Max: 41.2, Diff: 13.1, Sum: 293.6]
[Termination (ms): Min: 0.0, Avg: 0.3, Max: 1.2, Diff: 1.2, Sum: 2.3]
[Code Root Fixup: 0.6 ms]
[Code Root Purge: 0.0 ms]
[Clear CT: 0.8 ms]
[Other: 3.1 ms]
[Eden: 4096.0M(4096.0M)->0.0B(3584.0M) Survivors: 512.0M->512.0M Heap: 12288.0M(24576.0M)->8705.8M(24576.0M)]
[Times: user=0.33 sys=0.02, real=0.05 secs]
Reading this:
GC pause (G1 Evacuation Pause) (young) — This is a normal young-gen collection. Evacuation = live objects moved from eden/survivor regions to new survivor/old regions. Not a full collection.
Parallel Time: 44.3 ms — the parallel phase of the pause, spread across 8 worker threads. Add the serial phases listed after it (Code Root Fixup, Clear CT, Other) and you arrive at the ~50ms real time: the full stop-the-world duration from the application’s perspective.
Object Copy: 36.7ms average — 83% of the parallel phase is spent copying live objects out of eden and survivor regions. This is normal. If this number grows, the survival rate is up — too many objects living long enough to be copied — or eden is large enough that each collection has more live data to move.
Eden: 4096.0M->0.0B(3584.0M) — Eden was 4GB, drained to 0, then resized to 3.5GB. G1 adaptively resizes eden based on pause time targets.
Heap: 12288.0M->8705.8M(24576.0M) — Before: 12GB used. After: 8.7GB used. Heap max 24GB. Freed roughly 3.5GB of garbage.
real=0.05 secs — 50ms wall clock time, consistent with Parallel Time plus the serial phases. If real is much larger than that sum, you have OS scheduling issues (the GC threads were preempted).
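The heap summary line lends itself to mechanical checking. A minimal sketch in Python — the regex targets the Java 8 `-XX:+PrintGCDetails` format shown above; the function name is illustrative:

```python
import re

# Matches the heap summary from a Java 8 G1 log line,
# e.g. "Heap: 12288.0M(24576.0M)->8705.8M(24576.0M)"
HEAP_RE = re.compile(r"Heap:\s*([\d.]+)M\(([\d.]+)M\)->([\d.]+)M\(([\d.]+)M\)")

def heap_delta_mb(line):
    """Return before/after occupancy and the amount freed, in MB."""
    m = HEAP_RE.search(line)
    if m is None:
        raise ValueError("no heap summary found")
    before, _cap_before, after, cap_after = map(float, m.groups())
    return {
        "before_mb": before,
        "after_mb": after,
        "capacity_mb": cap_after,
        "freed_mb": before - after,
    }

line = ("[Eden: 4096.0M(4096.0M)->0.0B(3584.0M) Survivors: 512.0M->512.0M "
        "Heap: 12288.0M(24576.0M)->8705.8M(24576.0M)]")
d = heap_delta_mb(line)
print(f"freed {d['freed_mb']:.1f} MB")  # freed 3582.2 MB, i.e. roughly 3.5GB
```

Scripting this over a whole log file gives you freed-per-collection and post-GC occupancy series, which is exactly what the pattern analysis below needs.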
The GC Pause Patterns and What They Mean
Pattern 1: Normal young collections, stable
Young: 12GB → 8GB (pause 45ms) — OK
Young: 12GB → 8GB (pause 47ms) — OK
Young: 12GB → 8GB (pause 44ms) — OK
Allocation rate is predictable, live set is stable. No action needed.
Pattern 2: Growing live set
Young + concurrent: 8GB retained after GC
Young + concurrent: 8.3GB retained after GC
Young + concurrent: 8.7GB retained after GC
Old gen live set is growing. Either a memory leak, or the working set has genuinely grown beyond what old gen can hold. Take a heap dump and check what is being retained.
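One way to spot this pattern programmatically: take the post-GC occupancy from successive collections and check whether the floor is rising rather than sawtoothing. A minimal sketch — the threshold is illustrative, not a tuned value:

```python
def retained_trend(retained_gb, min_growth_gb=0.2):
    """Flag a rising post-GC floor: the series must be (weakly)
    increasing throughout and the total growth must be material.
    min_growth_gb filters out normal sawtooth noise."""
    if len(retained_gb) < 3:
        return False
    rising = all(b >= a for a, b in zip(retained_gb, retained_gb[1:]))
    return rising and (retained_gb[-1] - retained_gb[0]) >= min_growth_gb

print(retained_trend([8.0, 8.3, 8.7]))  # True: the live set is growing
print(retained_trend([8.0, 7.9, 8.1]))  # False: normal fluctuation
```

Run against post-GC occupancies sampled over hours, not minutes — a healthy heap can drift upward briefly after a deploy or a traffic spike.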
Pattern 3: Increasing pause time, stable heap
Young: 45ms
Young: 52ms
Young: 61ms
Young: 78ms
Possible causes: allocation rate increased (more to copy), more cross-region references increasing RS scan time, or humongous object allocation fragmenting heap.
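The first cause can be confirmed from the log itself: allocation rate is roughly the eden consumed between collections divided by the interval between them, both of which appear in the format above (uptime stamps and eden occupancy at each pause). A rough sketch — the sample numbers are illustrative, and it ignores G1's adaptive eden resizing between collections:

```python
def allocation_rate_mb_s(events):
    """events: list of (uptime_seconds, eden_used_at_gc_mb), one per
    young GC. Eden refills from 0 between pauses, so eden occupancy at
    the next pause divided by the interval approximates the allocation
    rate over that interval. (Ignores eden resizing between GCs.)"""
    return [eden / (t1 - t0)
            for (t0, _), (t1, eden) in zip(events, events[1:])]

# (uptime, eden occupancy at pause) for three consecutive young GCs
events = [(3800.0, 4096.0), (3823.0, 4096.0), (3847.2, 4096.0)]
print([round(r) for r in allocation_rate_mb_s(events)])  # [178, 169] MB/s
```

If this number climbs over time while pause durations climb with it, you have your answer; if it is flat, look at remembered-set scan times and humongous allocations instead.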
Pattern 4: Mixed collections appearing frequently
[GC pause (G1 Evacuation Pause) (mixed)]
G1 is doing mixed collections (collecting old gen regions along with young gen). This is normal as part of the concurrent marking cycle. If mixed collections are very frequent, old gen is filling faster than concurrent marking can keep up — lowering -XX:InitiatingHeapOccupancyPercent starts marking cycles earlier, at the cost of more concurrent work.
Pattern 5: Full GC
[Full GC (Allocation Failure) ...]
G1 failed to evacuate — ran out of free regions. This is the emergency brake. If you see this in production, something went wrong. Usually: humongous allocations, concurrent marking couldn’t keep up, or old gen is simply too full.
Diagnosing Humongous Allocations
In G1, objects larger than half a region are “humongous” (8MB in a heap with 16MB regions; G1 picks a region size between 1MB and 32MB based on heap size). They are allocated directly into dedicated contiguous regions and, on older JDKs, reclaimed only at the end of marking cycles rather than at young GCs (later Java 8 updates added eager reclaim for some cases). They can cause fragmentation and pause spikes.
[GC pause (G1 Evacuation Pause) (young) (to-space exhausted)]
“To-space exhausted” means G1 couldn’t find enough free regions to evacuate into — often because humongous objects are consuming regions that can’t be reused.
Diagnosis:
-XX:+PrintAdaptiveSizePolicy
Then grep the log for “humongous”: with this flag, G1’s ergonomics output records humongous allocation requests and the concurrent cycles they trigger (GC cause: G1 Humongous Allocation).
The fix is usually to increase region size:
-XX:G1HeapRegionSize=32m
Or to find and fix the large allocations themselves (typically: large arrays in parsing or serialisation code, unbounded collection builders, cached data structures that grew unexpectedly).
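The threshold arithmetic is easy to sketch. G1 aims for on the order of 2048 regions, rounds the region size down to a power of two between 1MB and 32MB, and treats anything over half a region as humongous. A simplified sketch — the real JVM heuristic also weighs initial vs. maximum heap size:

```python
def default_region_size_mb(heap_mb):
    """Approximate G1's default region sizing: heap / 2048 target
    regions, rounded down to a power of two, clamped to [1, 32] MB.
    (Simplified relative to the actual HotSpot heuristic.)"""
    size = 1
    while size * 2 <= heap_mb // 2048 and size * 2 <= 32:
        size *= 2
    return size

def humongous_threshold_mb(region_mb):
    # Objects larger than half a region are humongous
    return region_mb / 2

heap = 24 * 1024  # the 24GB heap from the log above
r = default_region_size_mb(heap)
print(r, humongous_threshold_mb(r))  # 8MB regions -> 4MB threshold
```

This is why bumping -XX:G1HeapRegionSize works: doubling the region size doubles the humongous threshold, so an 8MB parse buffer that was humongous with 16MB regions becomes an ordinary allocation with 32MB regions.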
The Pause Time Target and Why You Can’t Always Hit It
-XX:MaxGCPauseMillis=200
G1 tries to stay under this target by adaptively sizing eden, adjusting mixed collection frequency, and choosing which old gen regions to collect. It cannot guarantee it — if there’s too much work to do, the pause will exceed the target.
Setting this too low is counterproductive: G1 will keep eden small (frequent collections), spend more time on overhead per byte collected, and potentially make throughput worse without improving p99 latency.
The relationship between the target and actual pauses:
Target 200ms → Actual p99 ~150ms, occasional 400ms spikes
Target 100ms → Actual p99 ~80ms, frequent 200ms spikes, throughput drops
Target 50ms → Actual p99 ~40ms, constant overhead, throughput degrades significantly
For workloads that need sub-50ms GC pauses, you need either: very small heaps (< 4GB), ZGC or Shenandoah (experimental from Java 11 and 12 respectively, production-ready by Java 15), or off-heap data structures to reduce the live set that GC has to manage.
The GC Log Analysis Workflow
When investigating a latency incident:
1. Find the timestamp of the latency spike in your monitoring
2. Cross-reference with GC log: was there a GC pause at that time?
3. If yes:
a. What type of pause? Young / mixed / full?
b. How long? Expected for this heap size?
c. What caused it? Check GC cause field
d. Was old gen growing in the period before?
4. If no GC pause:
a. Check safepoint log: -XX:+PrintSafepointStatistics (Java 9+: -Xlog:safepoint)
b. Check for lock contention: thread dump during spike
c. Check OS: was the process preempted? (check sys time in GC logs)
GC is often the first suspect in JVM latency spikes — and often the correct one. But not always. The discipline of checking rather than assuming prevents you from “fixing” GC in a service where the real problem is a slow database query.
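Step 2 of the workflow is mechanical enough to script. A sketch that scans a Java 8-format log for pauses near the spike — the window size is illustrative, and the regex only covers the line prefix shown in the examples above:

```python
import re
from datetime import datetime

# Java 8 G1 prefix: "2017-04-18T09:12:34.891+0000: 3847.234: [GC pause ..."
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})\S*: [\d.]+: "
    r"\[(GC pause[^,\]]*|Full GC[^,\]]*)"
)

def pauses_near(lines, spike, window_s=5):
    """Return (timestamp, pause description) for GC events within
    +/- window_s seconds of the spike (timezone ignored for brevity)."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S.%f")
        if abs((ts - spike).total_seconds()) <= window_s:
            hits.append((ts, m.group(2)))
    return hits

log = ["2017-04-18T09:12:34.891+0000: 3847.234: "
       "[GC pause (G1 Evacuation Pause) (young), 0.0512 secs]"]
print(pauses_near(log, datetime(2017, 4, 18, 9, 12, 36)))
```

An empty result is the interesting case: it sends you down the safepoint, lock-contention, and OS branches of the workflow instead of a GC tuning rabbit hole.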