We were using Scala in a few non-critical components at the trading firm — utility code, configuration, some tooling. Then someone proposed moving a market data normalisation component to Scala. The component processed 800,000 messages/second and had a 500µs latency budget per message at p99.
The discussion that followed taught me more about the JVM than a year of reading.
The Case For
Scala offered pattern matching for message dispatch (clean, readable), case classes for the normalised events (immutable, value-semantics), and collection transformations that were easier to reason about than raw Java loops.
The architecture looked like:
```
[feed handler] → [decoder] → [normaliser (the Scala candidate)] → [downstream publisher]
```
Clean. Now let’s look at what the JVM is actually doing.
The Boxing Problem
Double and Long in Scala compile to JVM primitives where the compiler can prove no object is required. Where it can't — generics, the standard collections, trait members — they become boxed java.lang.Double and java.lang.Long.
```scala
val prices: List[Double] = List(100.0, 100.1, 100.2)
val total = prices.sum   // boxed elements; sum unboxes each one
```
List[Double] on the JVM is List[java.lang.Double] — every element is a heap-allocated boxed object. sum iterates, unboxes each element, adds, re-boxes for the accumulator. For a list of 3 elements this is irrelevant. For a hot path processing millions of messages this is significant.
Java equivalent of what's happening:
```java
import java.util.Arrays;
import java.util.List;

Double d1 = new Double(100.0);   // heap allocation
Double d2 = new Double(100.1);   // heap allocation
Double d3 = new Double(100.2);   // heap allocation
List<Double> list = Arrays.asList(d1, d2, d3);

double total = 0;
for (Double d : list) {
    total += d.doubleValue();    // unbox each element
}
```
The fix is to use Array[Double] (which compiles to double[]) or the specialised collections from libraries like Spire.
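For comparison, the unboxed path in the same Java terms — a sketch with illustrative names, showing what Array[Double] buys you:

```java
public class PrimitiveSum {
    // Sums a primitive double[]: no element is ever boxed,
    // and the accumulator stays in a register.
    static double sum(double[] prices) {
        double total = 0;
        for (double p : prices) {
            total += p;   // primitive add, no unboxing
        }
        return total;
    }

    public static void main(String[] args) {
        double[] prices = {100.0, 100.1, 100.2};
        System.out.println(sum(prices));
    }
}
```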
Scala Specialisation
Scala has @specialized annotation for generic types, which instructs the compiler to generate primitive-specialised bytecode alongside the generic version:
```scala
trait Accumulator[@specialized(Double) T] {
  def add(acc: T, x: T): T
}
```
With @specialized(Double), Scala generates an Accumulator$mcD$sp class that works with double primitives directly, no boxing. The compiler uses the specialised version when the type parameter is statically known to be Double.
But @specialized has sharp edges:
- It multiplies bytecode size (one class per specialised type)
- If any method in the inheritance chain isn’t specialised, boxing creeps back in
- Third-party libraries rarely use it
Closures and Anonymous Functions
Every lambda in Scala becomes a FunctionN object on the JVM (an anonymous class before Scala 2.12, a LambdaMetafactory-generated class since), and its apply is a virtual call:
```scala
val large = prices.filter(_ > threshold)
```
_ > threshold compiles to something like:
```java
// approximate shape of the generated function object
class anonfun$1 implements scala.Function1<Double, Boolean> {
    public Boolean apply(Double x) {   // virtual dispatch per invocation
        return x > threshold;          // unbox, compare, box the result
    }
}
```
Two costs: the object allocation for the closure (once, amortised) and the virtual dispatch on apply (every invocation). The virtual dispatch may be inlined by C2 if the call site is monomorphic — but if you’re passing different lambdas through the same higher-order function, you’ll get megamorphic dispatch.
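A Java sketch of the dispatch problem (class and method names are illustrative): one hot call site, fed first a single lambda class and then several. HotSpot's inline caches tolerate one or two distinct receiver classes at a site; beyond that the site goes megamorphic and C2 stops inlining the call.

```java
import java.util.function.DoublePredicate;

public class Dispatch {
    // The single hot call site: predicate.test(x) inside the loop.
    // Its receiver-type profile decides whether C2 can inline it.
    static int countMatching(double[] xs, DoublePredicate predicate) {
        int n = 0;
        for (double x : xs) {
            if (predicate.test(x)) n++;   // mono/bi-morphic: inlinable; 3+ classes: not
        }
        return n;
    }

    public static void main(String[] args) {
        double[] xs = {1.0, 5.0, 10.0};
        // One lambda class through the site: monomorphic.
        System.out.println(countMatching(xs, x -> x > 2.0));
        // Three distinct lambda classes through the same site: megamorphic.
        System.out.println(countMatching(xs, x -> x < 2.0));
        System.out.println(countMatching(xs, x -> x != 5.0));
    }
}
```

Each distinct lambda expression compiles to its own class, so routing different predicates through one shared helper is exactly how megamorphic sites arise in collection pipelines.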
Pattern Matching
Scala’s match on sealed types compiles to a sequence of instanceof checks (or, for small integer types, a tableswitch/lookupswitch):
```scala
sealed trait MarketEvent
case class Trade(symbol: String, price: Double, qty: Long) extends MarketEvent
case class Quote(symbol: String, bid: Double, ask: Double) extends MarketEvent

msg match {
  case Trade(sym, price, qty) => normaliseTrade(sym, price, qty)
  case Quote(sym, bid, ask)   => normaliseQuote(sym, bid, ask)
}
```
Compiles to approximately:
```java
if (msg instanceof Trade) {
    Trade t = (Trade) msg;
    normaliseTrade(t.symbol(), t.price(), t.qty());
} else if (msg instanceof Quote) {
    Quote q = (Quote) msg;
    normaliseQuote(q.symbol(), q.bid(), q.ask());
} else {
    throw new MatchError(msg);
}
```
This is fine — two instanceof checks, no boxing, predictable branch. The C2 compiler handles this well. The issue is the wrapping: getting to this point requires the message to be a heap-allocated case class object.
Case Classes and Allocation
```scala
case class Trade(symbol: String, price: Double, qty: Long)

val event = Trade("VOD.L", 102.35, 500L)   // one heap allocation per message
```
Every Trade(...) is a heap allocation. In a normalisation pipeline processing 800k messages/second, that's 800,000 allocations per second. Young-gen GC absorbs this comfortably at low message rates, but at this volume it adds up to significant GC pressure.
The alternatives that were considered:
| Approach | Allocation | Ergonomics | Safety |
|---|---|---|---|
| Case classes | 1 per message | Excellent | Immutable |
| Mutable objects + object pool | Near zero | Verbose | Error-prone |
| Off-heap via Chronicle/Unsafe | Zero on-heap | Very verbose | Dangerous |
| Value types (Java records, Valhalla preview) | Stack/scalar | Good | Immutable |
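The last row of the table, sketched as a Java record (illustrative only; a record today is still a heap object with case-class-like ergonomics — full scalarisation is what Project Valhalla's value classes target, and is not guaranteed yet):

```java
public class Records {
    // Compact, shallowly immutable carrier, analogous to a Scala case class.
    // Valhalla's value classes aim to let the JIT flatten instances like
    // this into registers instead of allocating them.
    record Trade(String symbol, double price, long qty) {}

    public static void main(String[] args) {
        Trade t = new Trade("VOD.L", 102.35, 500);
        System.out.println(t.symbol() + " @ " + t.price());
    }
}
```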
For this component, we ended up with a hybrid: case classes for the normalised event representation (passed downstream), but a shared mutable parse buffer for the intermediate extraction step. The allocation rate was cut by ~60%.
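A minimal Java sketch of that hybrid shape (all names are illustrative, not the production code): extraction writes into one reused mutable buffer, and the only per-message allocation left is the immutable event handed downstream.

```java
public class Hybrid {
    // Reused scratch space for the intermediate extraction step:
    // zero allocation per message once constructed.
    static final class ParseBuffer {
        String symbol;
        double price;
        long qty;

        void reset() { symbol = null; price = 0.0; qty = 0; }
    }

    // Immutable event passed downstream: the one allocation we keep.
    record Trade(String symbol, double price, long qty) {}

    static final ParseBuffer scratch = new ParseBuffer();  // one per pipeline thread

    static Trade normalise(String rawSymbol, double rawPrice, long rawQty) {
        scratch.reset();
        scratch.symbol = rawSymbol;   // extraction mutates the shared buffer...
        scratch.price = rawPrice;
        scratch.qty = rawQty;
        // ...and only the final, validated event is allocated.
        return new Trade(scratch.symbol, scratch.price, scratch.qty);
    }

    public static void main(String[] args) {
        System.out.println(normalise("VOD.L", 102.35, 500));
    }
}
```

The buffer is shared per thread, so this trades thread-safety for allocation rate — which is why the approach stayed confined to the extraction step.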
Measuring It
The discipline that resolved the debate: measure first, then decide.
javap -verbose -c NormalisationComponent.class | grep invoke
Looking at the bytecode tells you exactly what virtual calls are happening. invokevirtual is fine if C2 inlines it. invokeinterface carries the cost of interface dispatch. Multiple different invokevirtual targets at the same site → megamorphic → no inlining.
Then JMH with -prof gc to measure allocation rate:
```
Benchmark                          Mode  Cnt   Score      Error  Units
NormComponent.normaliseBaseline   thrpt   10  847234  ±  12441   ops/s
NormComponent.normaliseCaseClass  thrpt   10  623891  ±   8312   ops/s
NormComponent.normalisePooled     thrpt   10  809341  ±   9876   ops/s
```
GC allocation rate:
```
NormComponent.normaliseBaseline     0 MB/s
NormComponent.normaliseCaseClass  312 MB/s
NormComponent.normalisePooled      18 MB/s
```
312 MB/s of allocation at 800k messages/second means roughly 400 bytes per message. On a young-gen of 512MB, that’s ~1.6 seconds between young GC collections. At sub-millisecond stop-the-world pauses, this is tolerable. But the trend as throughput grows is not.
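The arithmetic behind those two figures, spelled out as a sanity check on the numbers above:

```java
public class GcMath {
    public static void main(String[] args) {
        double allocMBperSec = 312;     // measured allocation rate
        double msgsPerSec = 800_000;    // message rate
        double youngGenMB = 512;        // young-gen size

        // 312 MB/s spread over 800k msg/s
        double bytesPerMsg = allocMBperSec * 1024 * 1024 / msgsPerSec;
        // time to fill the young gen at that rate
        double secsBetweenYoungGCs = youngGenMB / allocMBperSec;

        System.out.printf("%.0f bytes/msg, %.2f s between young GCs%n",
                bytesPerMsg, secsBetweenYoungGCs);
        // ≈ 409 bytes/msg and ≈ 1.64 s: the "roughly 400 bytes" and
        // "~1.6 seconds" quoted above.
    }
}
```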
The Decision
For that specific component — 800k/s, 500µs p99 budget — we stayed with Java. Not because Scala is bad, but because:
- The boxing in generic collections required constant vigilance
- The closure overhead in collection pipelines required careful profiling
- The allocation from case classes required object pooling, which negated the ergonomic advantage of case classes
The downstream tooling that consumed the normalised events stayed in Scala and worked perfectly. The hot path where every byte counted stayed in Java with off-heap Chronicle Queue buffers.
The lesson: Scala is excellent on the JVM. But “works on the JVM” doesn’t mean “same cost as Java.” The abstractions have a price, and in performance-critical code, you need to measure whether that price is worth paying. Sometimes it is. Sometimes it isn’t. The mistake is assuming either way without measuring.