We’d tuned GC to near-perfection. Pause times were sub-millisecond. The p99.9 latency was still spiking to 8ms several times a day, with no GC events anywhere near those spikes. It took three weeks to find the cause: safepoints, specifically a revoke-bias operation triggered by lock patterns in a third-party library.

What Safepoints Are

A safepoint is a point in execution where the JVM can stop all threads to perform some operation — GC being the most common, but not the only one. All threads must reach a safepoint before the operation can proceed. Threads that are executing bytecode check for a safepoint request at certain points in their execution; threads that are blocked waiting for I/O or a lock are already at a safepoint.

The list of operations that require a safepoint is longer than most people realise:

  • Garbage collection (minor and major)
  • Biased lock revocation
  • Thread dump (kill -3 or JVMTI)
  • Class redefinition (e.g., hot code replace in debugger)
  • JIT deoptimisation

Despite its name, -XX:+PrintGCApplicationStoppedTime reports the total stopped time across all of these operations, not just GC.

The key point: a safepoint pause lasts as long as it takes the slowest thread to reach its next safepoint poll. A thread executing a long-running loop with no polls holds every other thread up.

Diagnosing Safepoint Pauses

The first step is visibility. Add to JVM flags:

-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCApplicationConcurrentTime
-XX:+PrintSafepointStatistics
-XX:PrintSafepointStatisticsCount=1
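
These are the JDK 8-era flag names. On JDK 9+ they were replaced by unified logging; an approximate equivalent (a mapping worth verifying against your exact JDK version with -Xlog:help) is:

```
-Xlog:safepoint
```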

PrintSafepointStatistics output shows each safepoint event, its type, and the time to reach it. The “vmop” column tells you why the safepoint was requested. When we ran this, the mystery pauses showed up as RevokeBias — biased locking revocation.

         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
0.256: RevokeBias                       [      48          1              0    ]      [     0     0     0       0     8]  1

The sync time (0ms here, but can be much higher) is how long it took for all threads to reach the safepoint. The vmop time (8ms) is the operation itself.

The Biased Locking Problem

Biased locking is a JVM optimisation: if a lock is only ever acquired by one thread, the JVM “biases” it to that thread, making subsequent acquisitions nearly free. When a different thread tries to acquire the lock, the JVM must revoke the bias — which requires a safepoint.
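A minimal sketch of the triggering pattern (class name and iteration counts are illustrative): one thread locks repeatedly and acquires the bias, then a second thread locks once, forcing a revocation. Note that JDK 8 delays biasing at startup (-XX:BiasedLockingStartupDelay, default 4000ms), so to actually see RevokeBias in the safepoint log you may need to set that delay to 0.

```java
// Sketch of a bias-revocation trigger: thread A monopolizes a lock,
// then thread B acquires it once. Under biased locking (the JDK 8
// default), B's first acquisition revokes A's bias at a safepoint.
public class BiasRevocationDemo {
    private static final Object lock = new Object();
    static int counter = 0;

    public static void main(String[] args) throws InterruptedException {
        // Thread A: repeated uncontended locking — the lock gets biased to A.
        Thread a = new Thread(() -> {
            for (int i = 0; i < 100_000; i++) {
                synchronized (lock) { counter++; }
            }
        });
        a.start();
        a.join();

        // Thread B: a single acquisition from a different thread.
        // This is the step that shows up as RevokeBias in the safepoint log.
        Thread b = new Thread(() -> {
            synchronized (lock) { counter++; }
        });
        b.start();
        b.join();

        System.out.println(counter); // 100001
    }
}
```

Run with -XX:+PrintSafepointStatistics -XX:BiasedLockingStartupDelay=0 and look for RevokeBias rows; the program's result is identical either way — the cost appears only in the safepoint log.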

Our third-party library used synchronized blocks internally, and its object pool was accessed by multiple threads in a pattern that caused frequent bias revocations. The library author had assumed single-threaded access; we were using it from a thread pool.

Fixes (in order of invasiveness):

  1. Disable biased locking: -XX:-UseBiasedLocking. Costs you the biased-locking optimisation for all locks, but eliminates revocation pauses. Often the right call for multi-threaded server applications. (On JDK 15+ this is effectively the default: JEP 374 deprecated biased locking and disabled it.)
  2. Fix the access pattern: make the pool thread-local so each thread only ever accesses its own instances. This maintains the single-thread assumption biased locking relies on.
  3. Use a library that doesn’t use synchronized: java.util.concurrent locks don’t use biased locking.

We went with option 1 first (immediate relief), then option 2 over the following sprint (correct fix).
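
Option 2 can be sketched like this (the pool type and API are hypothetical, standing in for the library's internal pool): each thread gets its own backing deque, so no monitor is ever entered from more than one thread and the single-thread assumption biased locking relies on holds.

```java
import java.util.ArrayDeque;

// Sketch of a thread-confined pool. Each thread sees only its own
// ArrayDeque, so any internal locking stays single-threaded and a
// bias, if taken, is never revoked. Names and API are illustrative.
public class ThreadLocalPool<T> {
    private final ThreadLocal<ArrayDeque<T>> pool =
        ThreadLocal.withInitial(ArrayDeque::new);

    public T acquire() {
        // null means the caller should create a fresh instance
        return pool.get().pollFirst();
    }

    public void release(T item) {
        pool.get().addFirst(item);
    }

    public static void main(String[] args) {
        ThreadLocalPool<StringBuilder> p = new ThreadLocalPool<>();
        p.release(new StringBuilder("reused"));
        System.out.println(p.acquire()); // reused
    }
}
```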

Long-Running Loops and Safepoint Delays

A different safepoint pathology: code that takes a long time to reach a safepoint.

The JVM inserts safepoint polls at method returns and loop back-edges. But counted loops (those with an integer counter from 0 to N) historically didn’t get safepoint polls in the JIT-compiled code — the JIT assumed they were short. If your loop processes a large array and N is big, all other threads wait until the loop completes before the safepoint can proceed.

This was partially fixed in later JVM versions (JDK 9+ with -XX:+UseCountedLoopSafepoints, default on in JDK 11+). On older versions, the fix is to either avoid very long counted loops in hot code or break them into chunks.
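
Chunking works, but a related and commonly cited trick is worth knowing: C2 only elides polls from int-counted loops, so switching the induction variable to long keeps the safepoint poll on the back-edge. A sketch (class and method names illustrative):

```java
// Sketch: on older JVMs (before -XX:+UseCountedLoopSafepoints is on),
// an int counted loop like
//
//   for (int i = 0; i < data.length; i++) sum += data[i];
//
// may run to completion with no safepoint poll, stalling every other
// thread's time-to-safepoint. A long induction variable keeps the JIT
// from treating it as a counted loop, so the back-edge poll survives.
public class SafepointFriendlySum {
    static long sum(int[] data) {
        long total = 0;
        for (long i = 0; i < data.length; i++) { // long counter: poll retained
            total += data[(int) i];
        }
        return total;
    }

    public static void main(String[] args) {
        int[] data = new int[1_000_000];
        java.util.Arrays.fill(data, 1);
        System.out.println(sum(data)); // 1000000
    }
}
```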

The Monitoring Gap This Creates

Most monitoring focuses on GC metrics; JVM safepoint time is harder to expose. Our Prometheus JMX exporter didn't expose safepoint statistics by default, so we had to either add a custom JMX bean reading HotSpot's internal safepoint counters (exposed through the non-public sun.management.HotspotRuntimeMBean) or parse the GC log output.
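
If you take the log-parsing route, the JDK 8 PrintSafepointStatistics rows shown earlier are easy to scrape. A minimal sketch (class name mine; the column layout is assumed from the sample output above):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Parses one JDK 8-style -XX:+PrintSafepointStatistics data line and
// reports the vm operation name plus the sync and vmop times in ms.
// Assumed layout: "<ts>: <vmop> [ threads... ] [ spin block sync cleanup vmop ] traps"
public class SafepointLineParser {
    private static final Pattern LINE = Pattern.compile(
        "([\\d.]+):\\s+(\\w+)\\s+\\[[^\\]]*\\]\\s+" +
        "\\[\\s*(\\d+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)\\s*\\]");

    public static String summarize(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) return null;
        String vmop = m.group(2);
        int sync = Integer.parseInt(m.group(5)); // time for all threads to reach the safepoint
        int op   = Integer.parseInt(m.group(7)); // time spent in the vm operation itself
        return vmop + " sync=" + sync + "ms vmop=" + op + "ms";
    }

    public static void main(String[] args) {
        String sample = "0.256: RevokeBias [ 48 1 0 ] [ 0 0 0 0 8] 1";
        System.out.println(summarize(sample)); // RevokeBias sync=0ms vmop=8ms
    }
}
```

Feeding summaries like these into a metrics pipeline gives you per-cause safepoint visibility rather than a single aggregate pause number.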

Once we had visibility, we added an alert on total safepoint pause time exceeding 2ms. It fired three times in the first month, each time pointing at a different cause — biased revocations, a thread dump from our ops tooling, and a JIT deoptimisation triggered by a reflective call we hadn’t noticed.

None of these would have been findable from GC logs alone. Safepoints are the pauses behind the pauses.