Architecture

Distributed Consistency Models: What Your Service Actually Guarantees

At the large US tech company, the hardest design review conversations were almost never about which database to use. They were about what consistency guarantee the system needed, and whether the proposed design actually provided it. “Eventually consistent” is not a useful answer to “what does this service guarantee?” It describes the best case for a wide range of behaviours, some of which are harmless and some of which can cause correctness bugs in production. ...

Cross-Team Technical Alignment at Scale

At the large US technology company, no single team controls the entire system. A feature that touches payments, identity, and platform teams requires coordination across three codebases, three on-call rotations, and three sets of priorities. Getting technical alignment across teams without creating bureaucratic overhead is an active engineering problem. ...

Cache Design as a Reliability Practice, Not an Optimisation

At the large US tech company, I inherited a service that had a cache. The cache was fast — it served 98% of requests with <1ms latency. The 2% cache misses hit the database, which took 50–200ms. Then the cache cluster had a rolling restart during a traffic spike. For three minutes, the cache hit rate dropped to 30%. The 70% misses all hit the database simultaneously. The database became saturated, latency spiked to 10s, and the service effectively went down — not because the cache was unavailable, but because the system wasn’t designed for cache misses at that rate. This is a cache reliability failure, not a cache performance failure. ...

Kafka at Startup Scale

The fintech startup adopted Kafka early — we were processing market events at rates that would have overwhelmed any request-response queue. Two years in, with a five-broker cluster handling 200k messages/sec at peak, the operational experience was significantly different from what I’d expected based on the documentation and conference talks. ...

Technical Debt Is a Balance Sheet Item, Not a Moral Failing

The term “technical debt” was coined by Ward Cunningham specifically as a financial metaphor. Debt isn’t inherently bad — it’s a tradeoff: you get something now in exchange for a future obligation. When the metaphor is understood, the management of technical debt becomes clearer. When the moral framing replaces the financial one, it becomes a guilt-driven cycle that helps nobody. ...

The Platform vs Product Tension in a Growing Startup

The fintech startup hit the platform question about eighteen months in. We had product-market fit, we were growing, and the engineering team was doubling every six months. The systems that had worked at ten engineers were showing strain at twenty-five. The question became: dedicate engineering time to platform work, or keep all capacity on product features? This is a hard question. The people who get it right aren’t smarter — they’re clearer about what they’re actually trading off. ...

Building the First Production Service at a Startup: Decisions Under Uncertainty

Three months into the startup, the prototype was working and investors were asking for a production timeline. We had a Postgres database, a Python script doing the core business logic, and no infrastructure to speak of. The decision: rewrite in Go, build proper infrastructure, or ship the Python and iterate? And if we rewrite, what does “proper infrastructure” mean when you have six engineers and four months of runway? ...

What Big-Bank Engineering Taught Me About System Design

I joined the large financial institution expecting to find bureaucracy that slowed down engineering. I did find that. I also found something I didn’t expect: certain constraints imposed by regulation, scale, and risk aversion produced genuinely better engineering decisions than I’d been making at the smaller trading firm. This is about the non-obvious lessons. ...

Event Sourcing in Financial Systems: Real Benefits, Real Costs

Financial systems are natural candidates for event sourcing. Regulators want to know the state of positions at any point in time. Audit trails are not optional. The need to replay a day’s events to debug a pricing anomaly comes up regularly. These requirements — which other domains treat as optional — map directly onto event sourcing’s core properties. That said, event sourcing in production has costs that the enthusiast literature systematically underplays. Here’s an honest accounting. ...

Backpressure in Practice: Keeping Fast Producers from Killing Slow Consumers

The system that prompted this post was a trade enrichment pipeline. The input was a Kafka topic receiving ~50,000 trade events per minute during market hours. The enrichment step required a database lookup — pulling counterparty and instrument data — that averaged 2ms per trade. 50,000 trades/minute = ~833 trades/second. At 2ms per lookup, a single thread can handle 500 lookups/second. To keep up, we needed at least two threads and ideally a small pool. We had six threads and a queue in front of them. During a market event that pushed the rate to 200,000 trades/minute, the queue grew without bound, memory climbed, and the service eventually OOM’d. Classic backpressure failure. ...