Java’s “write once, run anywhere” promise is kept by the JVM. Its performance is kept by the JIT compiler. The gap between “Java is slow” (the 1998 opinion) and “Java is competitive with C++ for many workloads” (the 2013 reality, and more so now) is almost entirely the C2 compiler.

Understanding what C2 does — and when it stops doing it — matters if you’re writing performance-sensitive Java.

Interpretation → C1 → C2: The Tiered Compilation Pipeline

Your code's execution path:

Bytecode interpretation (tier 0)
    ↓ after ~2,000 invocations
C1 compilation (tier 3) — fast compile, basic optimisations
    ↓ after ~15,000 invocations (method) or ~10,000 (loop back-edge)
C2 compilation (tier 4) — slow compile, aggressive optimisations
    ↓ continuous profiling → deoptimise → recompile if needed

The thresholds are tunable (-XX:Tier3InvocationThreshold, -XX:Tier4InvocationThreshold; the older -XX:CompileThreshold only applies when tiered compilation is disabled), but the pattern is fixed: the JVM interprets first, collects type profiles, then compiles progressively more aggressively. This is why JVM benchmarks that don’t warm up are measuring the interpreter, not the JIT.
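A minimal runnable sketch of the warm-up point (method and iteration counts are illustrative, not a real benchmark harness — for serious measurement use JMH, which handles warm-up for you):

```java
public class WarmupDemo {
    // A small hot method: sums an array of doubles.
    static double sum(double[] a) {
        double total = 0;
        for (double v : a) total += v;
        return total;
    }

    public static void main(String[] args) {
        double[] data = new double[1_000];
        for (int i = 0; i < data.length; i++) data[i] = i;

        // Warm-up: push the method well past the C2 invocation
        // threshold so the timed run measures compiled code,
        // not the interpreter or C1 output.
        for (int i = 0; i < 20_000; i++) sum(data);

        long start = System.nanoTime();
        double total = sum(data);
        long elapsed = System.nanoTime() - start;
        System.out.println("sum=" + total + " ns=" + elapsed);
    }
}
```

Timing a single call like this is still naive (no statistical treatment, no dead-code protection) — the point is only that the warm-up loop comes first.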

What C2 Actually Does

Inlining — the most important optimisation. C2 inlines small, frequently-called methods directly at the call site, eliminating the method call overhead and enabling further optimisations on the merged code.

// Written as:
double total = 0;
for (Trade t : trades) {
    total += t.notional();  // virtual call
}

// After inlining (conceptually):
double total = 0;
for (Trade t : trades) {
    total += t.price * t.quantity;  // field access, no dispatch
}

The inline threshold for hot call sites is ~325 bytecodes by default (-XX:FreqInlineSize); rarely-invoked call sites cut off at just 35 (-XX:MaxInlineSize). Methods larger than these limits won’t be inlined. This is why hot inner methods should be kept small — not just for readability, but because size directly determines inlineability.

Loop unrolling — C2 unrolls small loops, processing multiple iterations per cycle to reduce loop overhead and enable vectorisation:

// Written:
for (int i = 0; i < n; i++) { sum += a[i]; }

// After unrolling (x4):
for (int i = 0; i < n-3; i += 4) {
    sum += a[i] + a[i+1] + a[i+2] + a[i+3];
}
// handle remainder

Auto-vectorisation — with loop unrolling, C2 can emit SIMD instructions (SSE, AVX) to process multiple array elements per CPU instruction. This only works with simple loops over primitive arrays with no data dependencies between iterations.
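A sketch of the distinction (loop bodies are illustrative; whether C2 actually emits SIMD for the first loop depends on CPU and JVM version — verify with -XX:+PrintAssembly or JMH’s perfasm profiler):

```java
public class VectorisationDemo {
    // Vectorisable: independent iterations over a primitive array,
    // so C2 can process several elements per SIMD instruction.
    static long sumSquares(int[] a) {
        long total = 0;
        for (int i = 0; i < a.length; i++) {
            total += (long) a[i] * a[i];  // no dependency between iterations
        }
        return total;
    }

    // Not vectorisable: each iteration reads the result of the
    // previous one (a loop-carried dependency).
    static double foldChain(double[] a) {
        double p = 1.0;
        for (int i = 0; i < a.length; i++) {
            p = p * a[i] + 1.0;  // depends on p from the last iteration
        }
        return p;
    }

    public static void main(String[] args) {
        System.out.println(sumSquares(new int[]{1, 2, 3, 4}));  // prints 30
    }
}
```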

Escape analysis — determines whether an object “escapes” the current method (is returned, stored in a field, or passed to a method that might store it). Objects that provably don’t escape need not be heap-allocated at all: C2 applies scalar replacement, decomposing the object into its fields, which then live in registers or local stack slots. (HotSpot doesn’t do literal stack allocation of objects; scalar replacement achieves the same effect.)

// This object may be eliminated by escape analysis:
Point midpoint(Point a, Point b) {
    Point mid = new Point((a.x + b.x) / 2, (a.y + b.y) / 2);  // doesn't escape
    return mid.x > 0 ? mid : a;  // mid might escape here
}

If C2 can prove mid doesn’t escape, it never allocates the object — the fields exist only in registers. This is why micro-benchmarks of object-heavy code that’s been compiled by C2 often show zero allocation — the objects have been optimised away.
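Conceptually, scalar replacement rewrites the allocating version into a field-only version — here is a hand-written sketch of the transformation, using a simple Point class assumed for illustration (C2 does this on its intermediate representation, not on source):

```java
public class ScalarReplacementDemo {
    static final class Point {
        final double x, y;
        Point(double x, double y) { this.x = x; this.y = y; }
    }

    // As written: allocates a Point on every call.
    static double midX(Point a, Point b) {
        Point mid = new Point((a.x + b.x) / 2, (a.y + b.y) / 2);
        return mid.x;  // mid never escapes: a scalar-replacement candidate
    }

    // What scalar replacement amounts to: no object header, no
    // allocation — just the field values as locals/registers.
    static double midXScalarised(Point a, Point b) {
        double midXField = (a.x + b.x) / 2;  // 'mid.x' becomes a plain local
        return midXField;                    // 'mid.y' is dead and vanishes
    }

    public static void main(String[] args) {
        Point a = new Point(2, 0), b = new Point(4, 6);
        System.out.println(midX(a, b) == midXScalarised(a, b));  // prints true
    }
}
```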

The Monomorphic Call Site

C2’s type-based inlining requires the call site to be monomorphic — only one concrete type ever appears at that call site. When the JVM profiles your code and sees that t.notional() is always called on SpotTrade instances, it compiles with an inline guard:

if (t.getClass() == SpotTrade.class) {
    // inlined SpotTrade.notional() — fast path
} else {
    // fallback: slow dispatch
}

If a second type (ForwardTrade) later appears at the same call site, the JVM deoptimises — throws away the compiled code and falls back to interpretation, recompiling later with a bimorphic or megamorphic profile. Megamorphic sites (3+ types) are not inlined at all.

Practical implication: if you have a hot loop over a heterogeneous collection of different concrete types implementing the same interface, you’re likely getting megamorphic dispatch. Segregating by type (process all SpotTrade then all ForwardTrade) allows each loop to be monomorphic and fully inlined.
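A sketch of the segregation idea (class names follow the article’s running example; the notional formulas are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;

public class SegregationDemo {
    interface Trade { double notional(); }

    static final class SpotTrade implements Trade {
        final double price, quantity;
        SpotTrade(double p, double q) { price = p; quantity = q; }
        public double notional() { return price * quantity; }
    }

    static final class ForwardTrade implements Trade {
        final double price, quantity, discount;
        ForwardTrade(double p, double q, double d) { price = p; quantity = q; discount = d; }
        public double notional() { return price * quantity * discount; }
    }

    // One call site sees both concrete types: at best bimorphic dispatch.
    static double totalMixed(List<Trade> trades) {
        double total = 0;
        for (Trade t : trades) total += t.notional();  // 2+ receiver types here
        return total;
    }

    // Each loop's call site sees exactly one concrete type,
    // so each can be profiled monomorphic and fully inlined.
    static double totalSegregated(List<SpotTrade> spots, List<ForwardTrade> forwards) {
        double total = 0;
        for (SpotTrade t : spots) total += t.notional();
        for (ForwardTrade t : forwards) total += t.notional();
        return total;
    }

    public static void main(String[] args) {
        List<Trade> mixed = new ArrayList<>();
        List<SpotTrade> spots = new ArrayList<>();
        List<ForwardTrade> forwards = new ArrayList<>();
        SpotTrade s = new SpotTrade(100, 2);
        ForwardTrade f = new ForwardTrade(100, 2, 0.9);
        mixed.add(s); mixed.add(f);
        spots.add(s); forwards.add(f);
        System.out.println(totalMixed(mixed) == totalSegregated(spots, forwards));  // prints true
    }
}
```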

Seeing the JIT in Action

-XX:+PrintCompilation shows what’s being compiled and when:

   2847 3371 % 4       com.example.PriceAggregator::aggregate @ 12 (87 bytes)
   2849 3372   4       com.example.FIXParser::parseDouble (43 bytes)
   2851 3373   4       com.example.OrderBook::updateLevel (61 bytes)

Columns: timestamp (ms since JVM start), compile ID, a flags column (% = OSR — On-Stack Replacement, compiling a loop that is already running), tier (4 = C2), method, OSR bytecode offset (@ 12), and bytecode size.

-XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining shows what’s being inlined at each call site:

@ 23   com.example.Trade::notional (12 bytes)   inline (hot)
@ 31   com.example.RiskLimits::check (89 bytes) too big to inline
@ 45   com.example.Logger::debug (4 bytes)      inline (hot)

too big to inline is the common culprit when you have a hot method that isn’t getting the expected performance — break it into smaller methods.
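One common shape of that fix, sketched (the method and its error path are hypothetical): hoist the bulky rare path into its own method so the hot path fits under the inline limit.

```java
public class InlineSplitDemo {
    // Hot path kept deliberately small: the common case fits well
    // under the inline threshold and can be merged into callers.
    static double parsePrice(String s) {
        if (s == null || s.isEmpty()) {
            return handleBadInput(s);  // rare path, delegated out-of-line
        }
        return Double.parseDouble(s);
    }

    // Deliberately separate: cold, bulky code (logging, diagnostics,
    // fallback handling) no longer counts against the hot method's
    // bytecode size.
    static double handleBadInput(String s) {
        // ... logging, diagnostics, fallback parsing ...
        return Double.NaN;
    }

    public static void main(String[] args) {
        System.out.println(parsePrice("3.5"));  // prints 3.5
    }
}
```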

The Deoptimisation Trap

Deoptimisation happens silently and can cause significant performance regressions:

  • A new concrete type appears at a monomorphic call site → deoptimise, recompile bimorphic
  • A field that C2 assumed was constant changes → deoptimise
  • An exception is thrown in a path C2 treated as never-taken → deoptimise
  • Class.forName(), reflection, or debugger attachment → deoptimise

-XX:+TraceDeoptimization or async-profiler’s deoptimisation events let you catch these in production.

The most common production deoptimisation I’ve seen: a logging framework that uses reflection to format log messages. Each call is a potential deoptimisation boundary, and if the logger sits inside a hot loop it can completely defeat JIT optimisation of the surrounding code.
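A common mitigation, sketched with a hypothetical logger stand-in: check the level once, outside the loop, so that when debug logging is off the loop body contains no call into the formatting machinery at all.

```java
public class HotLoopLoggingDemo {
    // Hypothetical stand-in for a logging facade.
    static final class Logger {
        boolean debugEnabled = false;
        boolean isDebugEnabled() { return debugEnabled; }
        void debug(String fmt, Object arg) { /* reflective formatting lives here */ }
    }

    static final Logger log = new Logger();

    static double process(double[] prices) {
        double total = 0;
        // Guard hoisted out of the loop: in the common (debug-off) case
        // the hot loop never touches the logger's formatting path.
        boolean debug = log.isDebugEnabled();
        for (double p : prices) {
            total += p;
            if (debug) {
                log.debug("price={}", p);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(process(new double[]{1, 2, 3}));  // prints 6.0
    }
}
```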

Compiler Directives: Controlling JIT Decisions

For specific hot methods where you want to override C2’s default decisions, compiler directives — a JSON file loaded with -XX:CompilerDirectivesFile=directives.json — let you force or prevent inlining:

[{
  "match": ["com/example/PriceAggregator::aggregate"],
  "inline": ["+com/example/Trade::*"]
}]

Use this sparingly — it’s a last resort for when profiling shows that C2 is making wrong inlining decisions on a specific hot path. In the vast majority of cases, the right fix is to keep methods small enough that C2 inlines them naturally.

What This Means in Practice

The useful mental model: C2 is an extremely good compiler that operates on profiles, not just code. The same bytecode can compile to fast native code or slow native code depending on what types and values it has seen.

Implications:

  • Warm up properly before benchmarking (or before making production latency decisions)
  • Keep hot methods small enough to inline (under ~325 bytecodes — the FreqInlineSize limit)
  • Avoid heterogeneous collections at hot call sites — megamorphic dispatch is not inlined
  • Exception handling is expensive in hot paths — C2 treats thrown exceptions as unusual
  • final on methods and classes helps C2 devirtualise calls (though class hierarchy analysis often achieves this even without final)

The investment in understanding the JIT pays back every time you’re trying to explain a performance cliff, an allocation rate that shouldn’t exist, or a loop that’s inexplicably slower than expected.