At the large financial institution where I worked from 2016, the JVM services were larger and longer-running than anything I’d dealt with in the previous role. Old generation sizes in the hundreds of gigabytes. Services running for months between restarts. Memory problems that took days or weeks to manifest.

The debugging approach that worked in trading — small heaps, frequent restarts, aggressive allocation control — didn’t apply here. You had to diagnose production JVM state without stopping it.

The Memory Problem Taxonomy

Before reaching for tools, it helps to know what kind of problem you’re dealing with:

Symptom                            Likely cause
────────────────────────────────────────────────────────────────
OOM: Java heap space               Heap exhausted — leak or undersized
OOM: GC overhead limit exceeded    GC taking >98% of time, recovering <2% of heap
OOM: Metaspace                     Class-loading growth — dynamic class generation
OOM: Direct buffer memory          Native memory exhausted — NIO buffers
Growing old gen, stable young      Slow leak accumulating in old gen
Full GC every N minutes            Allocation rate exceeding old gen capacity
Process RSS growing (no heap OOM)  Native memory leak — off-heap, thread stacks
High CPU, small heap               GC thrashing — heap too small for workload
Sudden CPU spike + latency         Large object allocation → immediate old gen

The first step is matching your symptom to a category — each requires different tools.
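The categories are easiest to internalise against a symptom you provoked yourself. A minimal sketch that reproduces the first row — heap exhaustion — for exercising the tools below; run it with a deliberately small heap such as -Xmx64m, never against a shared JVM (the class name is mine, for illustration):

```java
import java.util.ArrayList;
import java.util.List;

public class HeapExhauster {
    // Allocate 16 MB chunks until the heap gives out, then release them
    // so the JVM can recover and report what happened.
    public static OutOfMemoryError provoke() {
        List<byte[]> sink = new ArrayList<>();
        try {
            while (true) {
                sink.add(new byte[16 << 20]); // 16 MB per iteration
            }
        } catch (OutOfMemoryError e) {
            sink.clear(); // drop the references so this JVM stays usable
            return e;
        }
    }

    public static void main(String[] args) {
        System.out.println(provoke().getMessage());
    }
}
```

With -XX:+HeapDumpOnOutOfMemoryError set (see below), this also gives you a known-bad heap dump to practise on.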

Heap Dumps: What They Are and When to Capture

A heap dump is a snapshot of all live objects on the Java heap at a point in time, serialised in HPROF format. It contains every object, its type, its field values, and every reference relationship.

Capturing one in a running service:

# Find the process
jps -l
# → 12345 com.example.TradeService

# Capture heap dump (live flag = only include reachable objects)
jmap -dump:live,format=b,file=/tmp/heapdump.hprof 12345

Warning: jmap -dump:live triggers a full GC before dumping (to mark live objects). On a 32GB heap this can take several seconds — during which your service stops responding. Use with care in production. On services with strict SLAs, use -dump:format=b (without live) to dump all objects including unreachable ones — faster, but noisier analysis.

Alternative: -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/dumps/ — the JVM automatically dumps when it throws OOM. This is zero-cost when not triggered and should be on by default in every production service.
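A third option is to have the service dump itself — for example, on an internal health signal — via the HotSpotDiagnosticMXBean, which produces the same HPROF output as jmap. A sketch (the class name is mine):

```java
import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

public class SelfDump {
    // live=true behaves like jmap -dump:live — it forces a full GC
    // first, so the same production warning applies.
    public static void dumpHeap(String path, boolean live) throws java.io.IOException {
        HotSpotDiagnosticMXBean bean =
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(path, live);
    }
}
```

Note that dumpHeap fails if the target file already exists, so generate a fresh path per dump.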

Analysing a Heap Dump with Eclipse MAT

Eclipse Memory Analyser (MAT) is the best free tool for heap analysis. Load the HPROF file, let it build the object index (~2× the heap dump size in memory), then:

Retained heap is the key metric: the amount of memory that would be freed if this object (and everything it exclusively references) were garbage collected. A HashMap with 1M entries has small shallow size but large retained size.

The “Leak Suspects” report is usually the first thing to check. MAT runs a set of heuristics looking for objects with large retained heap that look suspicious:

Problem Suspect 1
  68 objects of type "java.util.HashMap$Entry[]" retain 4.1 GB

  Keywords
  com.example.service.TradeCache
  com.example.model.Trade
  java.util.HashMap$Entry[]

This tells you a TradeCache is holding 4GB of Trade objects via HashMap entries. The question then is: why aren’t those entries being evicted?

Dominator tree shows the retention hierarchy: objects that retain the most memory are at the top. Click through to see what’s holding what:

Class Name                         Retained Heap
─────────────────────────────────────────────────
TradeCache                         4.1 GB (68.3%)
  ├── HashMap$Entry[]              4.0 GB
  │     ├── Trade[]               3.8 GB
  │     └── ...
  └── ...

OQL (Object Query Language) lets you query the heap like a database:

-- Find all HashMap instances with more than 100K entries
SELECT h FROM java.util.HashMap h
WHERE h.size > 100000

-- Find Trade objects created after a timestamp
SELECT t FROM com.example.model.Trade t
WHERE t.createdAt > 1472000000000L

Java Flight Recorder: Continuous Profiling Without the Snapshot

Where heap dumps are forensic (you capture them reactively when something is wrong), Flight Recorder is continuous — it records a rolling window of JVM events with sub-2% overhead.

Enable it:

# Start recording at launch (5-minute window; on Oracle JDK 8 this
# additionally requires -XX:+UnlockCommercialFeatures):
java -XX:+FlightRecorder \
     -XX:StartFlightRecording=delay=20s,duration=5m,filename=/var/jfr/baseline.jfr \
     -jar service.jar

# Or attach to a running process (Java 11+):
jcmd 12345 JFR.start name=diag duration=120s settings=profile
jcmd 12345 JFR.dump name=diag filename=/tmp/recording.jfr

What JFR records (selectable per event type):

Category              Events
─────────────────────────────────────────────────────────────
GC                    GC start/end, pause duration, cause
Memory                TLAB allocation, object allocation outside TLAB
JIT                   Method compilation, deoptimisations
Class loading         Class load/unload, ClassLoader events
Thread                Thread start/stop, lock contention, blocking
I/O                   File read/write, socket read/write
Native                JNI calls, safepoint wait time
Application           Custom application events (via JFR API)
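The last row deserves a sketch: the jdk.jfr API (Java 11+) lets the application emit its own events into the same recording stream, and jdk.jfr.consumer lets you read recordings back programmatically. The event name and field below are hypothetical, not from the service in question:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import jdk.jfr.Event;
import jdk.jfr.Label;
import jdk.jfr.Name;
import jdk.jfr.Recording;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

// A custom application event — enabled by default once a recording runs.
@Name("com.example.TradeProcessed")
@Label("Trade Processed")
class TradeProcessedEvent extends Event {
    @Label("Trade ID") long tradeId;
}

public class JfrDemo {
    // Emit n custom events into an in-process recording, dump it to a
    // temp file, then read it back and count our events.
    public static long emitAndCount(int n) throws Exception {
        Path out = Files.createTempFile("jfr-demo", ".jfr");
        try (Recording r = new Recording()) {
            r.start();
            for (int i = 0; i < n; i++) {
                TradeProcessedEvent e = new TradeProcessedEvent();
                e.tradeId = i;
                e.commit();
            }
            r.stop();
            r.dump(out);
        }
        long count = 0;
        for (RecordedEvent e : RecordingFile.readAllEvents(out)) {
            if (e.getEventType().getName().equals("com.example.TradeProcessed")) {
                count++;
            }
        }
        Files.delete(out);
        return count;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(emitAndCount(10));
    }
}
```

In production you would emit events from the hot path (commit is cheap when no recording is active) and view them in Mission Control alongside the GC and allocation data.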

Open the recording in JDK Mission Control. The Memory view gives you:

  • Heap usage over time (should be sawtooth, not linear increase)
  • TLAB allocation rate (rate of short-lived object creation)
  • Old gen live set size after each full GC (should be stable; growing live set = leak)
  • GC cause distribution (is this G1 Humongous Allocation? Metadata GC Threshold?)

Reading GC Logs

For production services, always enable GC logging (Java 9+ unified logging shown; on Java 8 use -XX:+PrintGCDetails -Xloggc:/var/log/gc.log):

-Xlog:gc*:file=/var/log/gc.log:time,uptime,level,tags:filecount=5,filesize=50m

The output that tells you what’s happening:

[2016-08-24T14:32:01.234+0000] GC(1234) Pause Young (Normal) (G1 Evacuation Pause)
  18432M->14201M(32768M) 45.312ms

[2016-08-24T14:32:44.891+0000] GC(1235) Pause Full (G1 Compaction Pause)
  28901M->8234M(32768M) 12847.301ms

That 12-second full GC is the problem. It was triggered by a G1 Compaction Pause — meaning G1 ran out of region space and had to compact the entire heap. The 28GB before, 8GB after tells you the live set is about 8GB. The question is why usage grew to 28GB before GC could keep up.
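Eyeballing works for a single line; for trend analysis you want the numbers out of the log. A small parser for the heap-transition part of the format above (the regex assumes exactly the layout shown):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GcLineParser {
    // Matches "28901M->8234M(32768M) 12847.301ms": heap occupancy
    // before and after the pause, committed heap size, pause duration.
    private static final Pattern PAUSE =
        Pattern.compile("(\\d+)M->(\\d+)M\\((\\d+)M\\)\\s+([0-9.]+)ms");

    // Returns {before, after, capacity} in MB, or null if no match.
    public static long[] heapMb(String line) {
        Matcher m = PAUSE.matcher(line);
        if (!m.find()) return null;
        return new long[] { Long.parseLong(m.group(1)),
                            Long.parseLong(m.group(2)),
                            Long.parseLong(m.group(3)) };
    }
}
```

Plotting the "after" value of every full GC over days is exactly the live-set trend used in the next section.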

The Memory Leak We Found

The specific problem that prompted this investigation: a service with a 32GB heap that would OOM after ~72 hours of running. Flight Recorder showed the live set after full GC growing by ~100MB per hour — slow enough to take days to manifest.

Live set after full GC over time:
  Day 0:  8.1 GB
  Day 1:  8.4 GB
  Day 2:  8.8 GB
  Day 3: 10.1 GB  ← rate increasing
  Day 4: OOM

Heap dump on day 3 showed the TradeCache example from above — 4GB of Trade objects held by a cache that was supposed to evict entries older than 24 hours, but had a bug where the eviction check compared millisecond Unix timestamps as int rather than long. The truncated comparison happened to come out right for some timestamp values and silently failed for others. The trades that failed the eviction check accumulated forever.
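Reduced to its essence, the bug looked like this (the names are my reconstruction, not the actual code). Millisecond Unix timestamps need more than 32 bits, so casting to int wraps, and the comparison then depends on the low 32 bits of each value rather than on which timestamp is earlier:

```java
public class EvictionBug {
    static final long DAY_MS = 24L * 60 * 60 * 1000;

    // Buggy: the int casts wrap the 41-bit millisecond timestamps, so
    // the result is effectively arbitrary — right for some value pairs,
    // wrong for others.
    static boolean shouldEvictBuggy(long createdAtMs, long nowMs) {
        return (int) createdAtMs < (int) (nowMs - DAY_MS);
    }

    // Correct: compare the full 64-bit values.
    static boolean shouldEvict(long createdAtMs, long nowMs) {
        return createdAtMs < nowMs - DAY_MS;
    }
}
```

A concrete failing pair: with a cutoff of 1472000000000 (late August 2016), an entry created at 1471000000000 is past the cutoff and should be evicted, but the wrapped int values compare the other way — exactly the "works for some values, silently fails for others" behaviour that made the leak so slow.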

The fix was one line of code. Finding it took two days. This is why you enable HeapDumpOnOutOfMemoryError, keep JFR recordings for at least a week, and invest time in understanding MAT before you need it in an incident.

The JVM Diagnostic Checklist

Every production JVM service should have:

At startup:
  -XX:+HeapDumpOnOutOfMemoryError
  -XX:HeapDumpPath=/var/dumps/
  -Xlog:gc*:file=/var/log/gc.log:time,uptime:filecount=10,filesize=100m
  -XX:+FlightRecorder
  -XX:StartFlightRecording=delay=30s,maxage=6h,name=continuous,settings=default

For investigation:
  jcmd <pid> VM.command_line          — what flags is it running with?
  jcmd <pid> GC.heap_info             — current heap state
  jcmd <pid> JFR.start ...            — start a detailed recording
  jstat -gcutil <pid> 1000            — GC stats every second
  jmap -histo:live <pid>              — live object histogram (triggers full GC!)

The heap histogram (jmap -histo:live) is often sufficient for quick diagnosis without needing a full heap dump — it tells you the count and total size of every class in the heap, which is usually enough to identify the offending data structure.