Systems

Backpressure in Practice: Keeping Fast Producers from Killing Slow Consumers

The system that prompted this post was a trade enrichment pipeline. The input was a Kafka topic receiving ~50,000 trade events per minute during market hours. The enrichment step required a database lookup — pulling counterparty and instrument data — that averaged 2ms per trade. 50,000 trades/minute = ~833 trades/second. At 2ms per lookup, a single thread can handle 500 lookups/second. To keep up, we needed at least two threads and ideally a small pool. We had six threads and a queue in front of them. During a market event that pushed the rate to 200,000 trades/minute, the queue grew without bound, memory climbed, and the service eventually OOM’d. Classic backpressure failure. ...
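The capacity arithmetic above can be written as a quick back-of-the-envelope sketch. It uses only the figures quoted in the excerpt (2ms average lookup, six threads, 50,000 and 200,000 trades per minute). The function names are hypothetical, and it assumes each thread does one lookup at a time at a constant average service time, ignoring variance and any database contention.

```python
import math


def per_second(events_per_minute: float) -> float:
    """Convert a per-minute event rate to a per-second rate."""
    return events_per_minute / 60.0


def pool_capacity(threads: int, lookup_ms: float) -> float:
    """Lookups per second a pool can sustain if each thread runs one
    blocking lookup at a time (assumption: constant average latency)."""
    return threads * (1000.0 / lookup_ms)


def min_threads(events_per_minute: float, lookup_ms: float) -> int:
    """Smallest thread count whose capacity meets the arrival rate."""
    return math.ceil(per_second(events_per_minute) * lookup_ms / 1000.0)


def backlog_growth(events_per_minute: float, threads: int, lookup_ms: float) -> float:
    """Events per second piling up in an unbounded queue (0 if the pool keeps up)."""
    return max(0.0, per_second(events_per_minute) - pool_capacity(threads, lookup_ms))


if __name__ == "__main__":
    for rate in (50_000, 200_000):
        print(
            f"{rate}/min: arrival {per_second(rate):.0f}/s, "
            f"needs {min_threads(rate, 2.0)} threads, "
            f"six-thread backlog grows {backlog_growth(rate, 6, 2.0):.0f}/s"
        )
```

Under these assumptions six threads cover the normal rate with room to spare but fall slightly short of the spike, so an unbounded queue only ever grows for as long as the spike lasts.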

Project Loom Preview: Virtual Threads and What They Mean for Server Code

Java’s threading model has a fundamental scalability problem: OS threads are expensive. Creating thousands of them consumes gigabytes of stack memory and causes significant scheduling overhead. This is why reactive programming (Netty, Project Reactor, RxJava) became popular — it avoids the thread-per-request model by using event loops and async callbacks. Project Loom, announced in 2017 with early previews arriving in 2018, proposed a different solution: make threads cheap. Virtual threads — JVM-managed threads that are not 1:1 with OS threads — could make the thread-per-request model scalable again. ...

Distributed Transactions Are a Lie (And What to Do Instead)

Every discussion of distributed systems eventually reaches the question: “can we just wrap this in a transaction?” The answer is technically yes and practically no. Understanding why — and what to do instead — is one of the more important shifts in distributed systems thinking. ...

From Java 8 to Java 11 in a Regulated Environment: What Actually Broke

Java 11 was the first long-term support release after Java 8. Oracle’s announcement that commercial Java 8 support would end pushed the bank’s architecture committee to approve a migration. In theory: update the JDK, update the build files, done. In practice: six months of discovery. This is a frank account of what broke. ...

Building MiFID II Trade Reporting Infrastructure: An Engineer’s View

MiFID II went live on January 3, 2018. The preparation started in 2016. Two years for a set of regulatory requirements that, from the outside, looked straightforward: report each trade to a trade repository within 15 minutes of execution. From the inside, “report each trade” requires answering: which trades? From which systems? In what format? To which trade repository? What constitutes a trade for the purposes of reporting vs. booking vs. settlement? What do you do when the reporting service is unavailable? What happens when the trade repository rejects a report? This is the engineering story of building a system to answer those questions. ...

Stream Processing with Kafka Streams vs Flink: A Real Comparison

By mid-2017, the institution had two competing proposals on the table for the next generation of real-time analytics infrastructure: one team advocating Kafka Streams, another advocating Apache Flink. Both solve the same problem. Both use Kafka as input and output. Both provide stateful stream processing with windowing and exactly-once semantics. The evaluation took eight weeks. Here’s what we found. ...

Persistent Data Structures Are Not Just for Functional Purists

When I joined the bank’s risk team, Clojure was already in production for risk calculation. The code I inherited used Clojure’s persistent maps and vectors everywhere — not as a philosophical statement but because the team had found them practically useful in a specific way. The specific way: concurrent reads and occasional writes to a shared state snapshot, with no locks. ...

Reading GC Logs Like a Detective

GC logs are low-overhead diagnostic data that the JVM will produce for you, cheap enough to leave enabled all the time. They tell you the timing, cause, duration, and effect of every collection, if you know how to read them. Most Java engineers can tell you what GC does. Far fewer can look at a GC log and immediately see why the p99 latency spiked at 14:37 last Tuesday. ...

Column Stores for Analytics: Why Row-Based Is Wrong for This Problem

The analytics team’s query: “Give me total notional, average spread, and fill rate for every instrument over the last 90 days, broken down by hour.” On our Postgres trade history table with ~2 billion rows: 4 hours, 23 minutes. After the columnar rewrite: 8 seconds. This post is about why, not how to install Parquet. ...

Spec-Driven Development in Clojure: Validating Financial Data at the Edge

Before clojure.spec, our FIX message parser had a test suite with 40 hand-written test cases. We’d been running the parser in production for 18 months without incident. After we added spec and ran the property-based tests overnight, they found 7 edge cases we hadn’t written tests for, including one where a negative zero value (-0.0) in a price field caused the downstream risk calculation to produce NaN, which propagated silently through the pipeline and ended up in the regulatory report as a blank field. That was the end of hand-written validation tests for external data. ...