Kafka in Finance: What 'Exactly Once' Actually Costs You

Kafka 0.11 landed with exactly-once semantics (EOS) and a lot of marketing. We were running trade event pipelines in a regulated environment, and the promise was appealing: no duplicate trades in the downstream risk system, no idempotency logic sprinkled through consuming services. After three months with it in production, the honest summary is: EOS works as advertised within its scope, and that scope is narrower than it sounds. ...

January 10, 2017 · 4 min · MW

Clojure Data Pipelines: Transducers in Production Risk Processing

The risk calculation pipeline processed end-of-day positions: take all the day’s trades, aggregate them to net positions, apply mark-to-market prices, and compute risk metrics. The input was ~800,000 trade records; the output was ~12,000 position records with P&L and Greeks. The initial implementation used standard Clojure sequence operations:

```clojure
(->> trades
     (filter open-trade?)
     (map enrich-with-market-data)
     (group-by :currency-pair)
     (map-vals aggregate-position)
     (map-vals compute-risk-metrics))
```

Clean. Readable. And it created four intermediate collections of 800,000 elements each before producing the final output. That’s 3.2M intermediate objects for a 12K result. Transducers changed this. ...
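For context on where that excerpt is heading: the per-trade stages that each produce an 800K-element intermediate can be fused into a single transducer, so the trades are traversed once. A minimal sketch, assuming the `open-trade?`, `enrich-with-market-data`, `aggregate-position`, and `compute-risk-metrics` functions from the snippet above; `map-vals` is a hypothetical helper, not a core function:

```clojure
;; Hypothetical helper: map f over the values of a map.
(defn map-vals [f m]
  (into {} (map (fn [[k v]] [k (f v)])) m))

;; The filter and map stages compose into one transducer.
(def trade-xf
  (comp (filter open-trade?)
        (map enrich-with-market-data)))

(defn eod-positions [trades]
  (->> (into [] trade-xf trades)      ; single pass, no intermediate lazy seqs
       (group-by :currency-pair)
       (map-vals aggregate-position)
       (map-vals compute-risk-metrics)))
```

The `group-by` still materialises one collection, but the two fused stages allocate nothing between them, which is where the bulk of the 3.2M intermediate objects came from.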

November 23, 2016 · 5 min · MW

Threading Models in Java: Which One Does Your System Actually Need?

The move from a small trading firm to a large financial institution meant working with codebases an order of magnitude larger, maintained by dozens of engineers across multiple teams. It also meant encountering the full spectrum of Java threading models in production — some appropriate, some inherited from a different era, and some that were actively causing problems. This is a survey of what those models look like, what they’re good at, and how you tell which one a system needs. ...

November 9, 2016 · 5 min · MW

KDB+/Q for Java Developers: Reading the Matrix

KDB+ is used in risk analytics, trade surveillance, and market data storage across most tier-1 financial institutions. If you work in finance long enough, you will encounter it. Nothing in your Java background prepares you for it. ...

October 11, 2016 · 6 min · MW

Heap Dumps and Flight Recorder: Diagnosing JVM Memory Problems in Production

At the large financial institution where I worked from 2016, the JVM services were larger and longer-running than anything I’d dealt with in the previous role. Old generation sizes in the hundreds of gigabytes. Services running for months between restarts. Memory problems that took days or weeks to manifest. The debugging approach that worked in trading — small heaps, frequent restarts, aggressive allocation control — didn’t apply here. You had to diagnose production JVM state without stopping it. ...
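The kind of live diagnosis that excerpt refers to is done with stock JDK tooling against a running process. A sketch of typical invocations (the PID and file paths are hypothetical):

```
# Class histogram: brief safepoint, no dump file, no restart
jcmd 12345 GC.class_histogram

# Heap dump of live objects only (note: forces a full GC first)
jmap -dump:live,format=b,file=/tmp/risk-svc.hprof 12345

# Two-minute Flight Recorder capture; on pre-JDK-11 builds the JVM
# must have been started with -XX:+UnlockCommercialFeatures
jcmd 12345 JFR.start duration=120s filename=/tmp/risk-svc.jfr
```

On heaps in the hundreds of gigabytes, the histogram is usually the first step, since a full dump of that size can pause the service for minutes and needs matching disk space.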

August 24, 2016 · 6 min · MW

Time-Series Data at a Bank: Why Relational Databases Break and What Comes Next

When I moved to the large financial institution, the team I joined managed the market data and trade data storage layer. The engineering problem was deceptively simple to state: store every price tick, every trade execution, and every risk calculation — billions of records per day — and answer analytical queries over them quickly. The existing system was PostgreSQL. It worked, technically. Queries that needed to run in seconds for trading decisions took minutes. Operational costs for storage were climbing. The database team was spending more time running VACUUM than building features. Understanding why required understanding what time-series data actually is and why it’s different. ...

July 6, 2016 · 5 min · MW

Why the Risk Team Chose Clojure (And Why It Made Sense)

The first time someone told me the risk calculation system was written in Clojure, I assumed it was a prototype or a skunkworks project. It wasn’t. It processed end-of-day risk for a significant portion of the firm’s trading book and had been in production for two years. Here’s why that decision made sense, and what it was actually like to work in. ...

April 5, 2016 · 4 min · MW

When the Scale Changes: Moving into Institutional Finance

The job spec said “trading systems.” I assumed it would be similar to what I’d been doing — latency-focused, technically aggressive, small team, fast decisions. I was wrong on most counts, and right for reasons I didn’t expect. ...

January 6, 2016 · 4 min · MW

Five Years in High-Frequency Trading: What I Actually Learned

Five years ago I joined the electronic trading firm not knowing what a cache line was. I thought garbage collection was something that happened to other people’s code. I had never looked at assembly output from a Java program. I’d heard of the LMAX Disruptor but had no idea why it existed. By the time I left, I had opinions about CPU prefetchers. I had read the Intel 64 and IA-32 Architectures Software Developer’s Manual for fun. I could look at a flame graph and immediately see the GC pressure. I had shipped components processing a million messages per second with sub-millisecond p99 guarantees. Here’s what that environment actually teaches you. ...

November 12, 2015 · 6 min · MW

ZGC and Shenandoah: What Low-Pause GC Means for Trading Systems

In 2015, the state of GC for latency-sensitive Java was: use G1GC, tune it carefully, accept occasional 50–200ms pauses on large heaps, and work around them with off-heap storage and careful allocation management. The conventional wisdom was that sub-10ms GC pauses required small heaps (< 4GB) or near-zero allocation on the hot path. For trading systems with large position caches, this meant either expensive off-heap engineering or living with GC latency spikes. Then the early previews of ZGC (Oracle) and Shenandoah (Red Hat) started circulating. Both claimed sub-millisecond pause times regardless of heap size. The mechanisms were different; the implications were significant. ...
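The "tune G1 carefully" baseline the excerpt describes typically meant a flag set along these lines (a hypothetical illustration of the era's conventional wisdom, not any firm's actual configuration):

```
-Xms24g -Xmx24g                         # fixed heap size: no resize-triggered pauses
-XX:+UseG1GC
-XX:MaxGCPauseMillis=50                 # a goal G1 aims for, not a guarantee
-XX:InitiatingHeapOccupancyPercent=35   # start concurrent marking early
-XX:+AlwaysPreTouch                     # fault pages in at startup, not mid-session
```

The pause-time goal is the tell: G1 tries to meet it by sizing collection work, but mixed collections on large heaps could still blow through it, which is exactly the gap ZGC and Shenandoah targeted.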

October 1, 2015 · 5 min · MW