We were using Scala in a few non-critical components at the trading firm — utility code, configuration, some tooling. Then someone proposed moving a market data normalisation component to Scala. The component processed 800,000 messages/second and had a 500µs latency budget per message at p99.

The discussion that followed taught me more about the JVM than a year of reading.

The Case For

Scala offered pattern matching for message dispatch (clean, readable), case classes for the normalised events (immutable, value-semantics), and collection transformations that were easier to reason about than raw Java loops.

The architecture looked like:

sealed trait MarketEvent
case class Trade(symbol: String, price: Double, quantity: Long) extends MarketEvent
case class Quote(symbol: String, bid: Double, ask: Double)      extends MarketEvent

def normalise(raw: RawFeedMessage): MarketEvent = raw.msgType match {
  case 'D' => Trade(raw.symbol, raw.fields(4).toDouble, raw.fields(5).toLong)
  case 'V' => Quote(raw.symbol, raw.fields(4).toDouble, raw.fields(6).toDouble)
  case _   => throw new UnknownMessageType(raw.msgType)
}

Clean. Now let’s look at what the JVM is actually doing.

The Boxing Problem

Scala’s Double and Long compile to JVM primitives wherever the compiler can work with the concrete type directly. Where it can’t — generics, the standard collections, trait members — they become boxed java.lang.Double and java.lang.Long.

// This looks harmless:
val prices: List[Double] = List(100.0, 100.1, 100.2)
val total = prices.sum

List[Double] on the JVM is List[java.lang.Double] — every element is a heap-allocated boxed object. sum iterates, unboxes each element, adds, re-boxes for the accumulator. For a list of 3 elements this is irrelevant. For a hot path processing millions of messages this is significant.

Java equivalent of what's happening:
Double d1 = Double.valueOf(100.0);  // heap allocation (what autoboxing calls)
Double d2 = Double.valueOf(100.1);  // heap allocation
Double d3 = Double.valueOf(100.2);  // heap allocation
List<Double> list = Arrays.asList(d1, d2, d3);
double total = 0;
for (Double d : list) {
    total += d.doubleValue();   // unbox each element
}

The fix is to use Array[Double] (which compiles to double[]) or the specialised collections from libraries like Spire.
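A minimal sketch of that fix (the object and method names are mine, for illustration): an explicit index loop over Array[Double] stays on primitives end to end, because the array compiles to double[] and no generic collection or iterator is involved.

```scala
object SumExample {
  // List[Double] boxes every element; sum unboxes and re-boxes per step
  def sumBoxed(prices: List[Double]): Double = prices.sum

  // Explicit index loop over a primitive array: no boxing, no iterator allocation
  def sumPrimitive(prices: Array[Double]): Double = {
    var total = 0.0
    var i = 0
    while (i < prices.length) {
      total += prices(i)
      i += 1
    }
    total
  }
}
```

Both produce the same value; only the allocation behaviour differs, and only the second survives a hot path unscathed.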

Scala Specialisation

Scala has the @specialized annotation for generic type parameters, which instructs the compiler to generate primitive-specialised bytecode alongside the generic version:

class Accumulator[@specialized(Double, Long) T](initial: T) {
  private var value: T = initial
  def add(v: T): Unit = ???   // body elided: summing Ts needs a (specialised) numeric typeclass
}

With @specialized(Double), Scala generates an Accumulator$mcD$sp class that works with double primitives directly, with no boxing. The compiler selects the specialised version when the type parameter is statically known to be Double.

But @specialized has sharp edges:

  • It multiplies bytecode size (one class per specialised type)
  • If any method in the inheritance chain isn’t specialised, boxing creeps back in
  • Third-party libraries rarely use it
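The second sharp edge can be sketched like this (the class name Accumulator$mcD$sp is conventional scalac output — verify with javap on your own build; the plus-function parameter is my own device to keep the example self-contained without a numeric typeclass): specialisation only survives while every link in the chain is specialised.

```scala
class Accumulator[@specialized(Double) T](initial: T) {
  private var value: T = initial
  // Function2 is itself specialised for (Double, Double) => Double,
  // so on the specialised path this stays on primitives
  def add(v: T, plus: (T, T) => T): Unit = { value = plus(value, v) }
  def result: T = value
}

object SpecializationDemo {
  // T is statically Double: scalac selects the Accumulator$mcD$sp variant
  def direct(): Double = {
    val acc = new Accumulator[Double](0.0)
    acc.add(1.5, _ + _)
    acc.result
  }

  // Behind an UNspecialised generic method, T is erased to Object:
  // the generic (boxing) version of Accumulator is used instead,
  // even when the caller passes Double
  def viaGeneric[T](initial: T, v: T, plus: (T, T) => T): T = {
    val acc = new Accumulator[T](initial)
    acc.add(v, plus)
    acc.result
  }
}
```

Both paths compute the same answer; the difference only shows up in the bytecode and the allocation profile.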

Closures and Anonymous Functions

In Scala 2.11 and earlier, every lambda becomes an anonymous class on the JVM; since 2.12, lambdas compile via invokedynamic and LambdaMetafactory, as Java 8 lambdas do. Either way the shape of the cost is the same — a Function object and a virtual apply:

val threshold = 100.0
val aboveThreshold = prices.filter(_ > threshold)

_ > threshold compiles to something like:

// Anonymous class generated:
class anonfun$1 extends AbstractFunction1<Double, Boolean> {
    private final double threshold;

    anonfun$1(double threshold) { this.threshold = threshold; }

    public Boolean apply(Double x) {
        return x.doubleValue() > this.threshold;  // unboxes
    }
}

Two costs: the object allocation for the closure (once, amortised) and the virtual dispatch on apply (every invocation). C2 can inline the dispatch if the call site is monomorphic or bimorphic — but pass three or more different lambdas through the same higher-order function and the site goes megamorphic, and inlining stops.
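A sketch of how a shared call site accumulates receiver types (whether C2 actually gives up on inlining depends on tiering and its heuristics — check with -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining; the names here are illustrative):

```scala
object DispatchShape {
  // One higher-order function: the f.apply call inside map is a
  // single call site shared by every lambda we pass in
  def applyAll(xs: List[Double], f: Double => Double): List[Double] = xs.map(f)

  def run(): List[Double] = {
    val xs = List(1.0, 2.0, 3.0)
    // Three distinct lambda classes through the same site. Once the
    // JIT's type profile has seen more than two receivers, the site
    // is megamorphic: vtable/itable dispatch, no inlining
    val a = applyAll(xs, _ * 2.0)
    val b = applyAll(xs, _ + 1.0)
    val c = applyAll(xs, math.sqrt)
    a ++ b ++ c
  }
}
```

The results are identical either way; what changes is whether the loop body gets inlined and optimised as a unit.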

Pattern Matching

Scala’s match on sealed types compiles to a sequence of instanceof checks (or, for small integer types, a tableswitch/lookupswitch):

event match {
  case Trade(sym, price, qty) => processTrade(sym, price, qty)
  case Quote(sym, bid, ask)   => processQuote(sym, bid, ask)
}

Compiles to approximately:

if (event instanceof Trade) {
    Trade t = (Trade) event;
    processTrade(t.symbol(), t.price(), t.quantity());
} else if (event instanceof Quote) {
    Quote q = (Quote) event;
    processQuote(q.symbol(), q.bid(), q.ask());
}

This is fine — two instanceof checks, no boxing, predictable branch. The C2 compiler handles this well. The issue is the wrapping: getting to this point requires the message to be a heap-allocated case class object.

Case Classes and Allocation

case class Trade(symbol: String, price: Double, quantity: Long)

Every Trade(...) is a heap allocation. In the normalisation pipeline processing 800k messages/second, that’s 800,000 allocations per second. Young-gen GC handles this at low frequencies, but at high volumes you’re generating significant GC pressure.

The alternatives that were considered:

Approach                                        Allocation       Ergonomics     Safety
Case classes                                    1 per message    Excellent      Immutable
Mutable objects + object pool                   Near zero        Verbose        Error-prone
Off-heap via Chronicle/Unsafe                   Zero on-heap     Very verbose   Dangerous
Value types (Java records, Valhalla preview)    Stack/scalar     Good           Immutable

For this component, we ended up with a hybrid: case classes for the normalised event representation (passed downstream), but a shared mutable parse buffer for the intermediate extraction step. The allocation rate was cut by ~60%.
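The hybrid can be sketched roughly like this (ParseBuffer and its field layout are invented for illustration; the real component's buffer was shaped by the feed format): one mutable buffer is reused across messages for extraction, and the only per-message allocation is the immutable event handed downstream.

```scala
final case class Trade(symbol: String, price: Double, quantity: Long)

// Reused across messages: mutation is confined to the parse step,
// so nothing is allocated until the downstream event is built
final class ParseBuffer {
  var symbol: String = ""
  var price: Double  = 0.0
  var quantity: Long = 0L

  def readFrom(fields: Array[String]): Unit = {
    symbol   = fields(0)
    price    = fields(1).toDouble
    quantity = fields(2).toLong
  }

  // The one per-message allocation: the immutable, shareable event
  def toTrade: Trade = Trade(symbol, price, quantity)
}

object Hybrid {
  private val buf = new ParseBuffer   // one buffer per pipeline thread

  def normalise(fields: Array[String]): Trade = {
    buf.readFrom(fields)
    buf.toTrade
  }
}
```

Note the buffer is only safe because each pipeline thread owns its own instance; sharing one across threads would need confinement via a ThreadLocal or similar.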

Measuring It

The discipline that resolved the debate: measure first, then decide.

javap -verbose -c NormalisationComponent.class | grep invoke

Looking at the bytecode tells you exactly what virtual calls are happening. invokevirtual is fine if C2 inlines it. invokeinterface carries the cost of interface dispatch. Multiple different invokevirtual targets at the same site → megamorphic → no inlining.

Then JMH with -prof gc to measure allocation rate:

Benchmark                          Mode  Cnt    Score     Error   Units
NormComponent.normaliseBaseline   thrpt   10  847234 ± 12441  ops/s
NormComponent.normaliseCaseClass  thrpt   10  623891 ±  8312  ops/s
NormComponent.normalisePooled     thrpt   10  809341 ±  9876  ops/s

GC allocation rate:
NormComponent.normaliseBaseline    0 MB/s
NormComponent.normaliseCaseClass   312 MB/s
NormComponent.normalisePooled      18 MB/s

312 MB/s of allocation at 800k messages/second means roughly 400 bytes per message. On a young-gen of 512MB, that’s ~1.6 seconds between young GC collections. At sub-millisecond stop-the-world pauses, this is tolerable. But the trend as throughput grows is not.
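That back-of-envelope arithmetic can be checked in a few lines (the 312 MB/s and 800k/s figures are from the JMH run above; MB is taken as MiB here, which is why the per-message figure lands near 409 rather than exactly 400):

```scala
object GcMath {
  val allocationRateMBps = 312.0      // from JMH -prof gc
  val messagesPerSecond  = 800000.0
  val youngGenMB         = 512.0

  // ~409 bytes of allocation per message ("roughly 400")
  val bytesPerMessage: Double =
    allocationRateMBps * 1024 * 1024 / messagesPerSecond

  // ~1.64 s between young collections at this allocation rate
  val secondsBetweenYoungGCs: Double =
    youngGenMB / allocationRateMBps
}
```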

The Decision

For that specific component — 800k/s, 500µs p99 budget — we stayed with Java. Not because Scala is bad, but because:

  1. The boxing in generic collections required constant vigilance
  2. The closure overhead in collection pipelines required careful profiling
  3. The allocation from case classes required object pooling, which negated the ergonomic advantage of case classes

The upstream tooling that consumed the normalised events stayed in Scala and worked perfectly. The hot path where every byte counted stayed in Java with off-heap Chronicle Queue buffers.

The lesson: Scala is excellent on the JVM. But “works on the JVM” doesn’t mean “same cost as Java.” The abstractions have a price, and in performance-critical code, you need to measure whether that price is worth paying. Sometimes it is. Sometimes it isn’t. The mistake is assuming either way without measuring.