Distributed Consistency Models: What Your Service Actually Guarantees

At the large US tech company, the hardest design review conversations were almost never about which database to use. They were about what consistency guarantee the system needed, and whether the proposed design actually provided it. “Eventually consistent” is not a useful answer to “what does this service guarantee?” It is a blanket term for a wide range of behaviours, some of which are harmless and some of which cause correctness bugs in production. ...
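
To make the stakes concrete, here is a minimal sketch of the kind of anomaly “eventually consistent” permits, written against a hypothetical key-value client (the `Store` interface is invented for illustration): a read that races replication and observes stale state.

```go
package consistency

import "context"

// Store is a hypothetical key-value client: Put is acknowledged by a
// primary; Get may be served by an asynchronously replicated follower.
type Store interface {
	Put(ctx context.Context, key, value string) error
	Get(ctx context.Context, key string) (string, error)
}

// ReadYourWrites demonstrates the anomaly. Under eventual consistency,
// the Get may return the value from *before* the Put.
func ReadYourWrites(ctx context.Context, s Store) (string, error) {
	if err := s.Put(ctx, "user:42:name", "new name"); err != nil {
		return "", err
	}
	// Harmless if this renders a profile page a few seconds stale;
	// a correctness bug if the result feeds the next write.
	return s.Get(ctx, "user:42:name")
}
```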

April 16, 2025 · 6 min · MW

Observability at Scale: What 'Good' Looks Like When You Have Too Much Data

At a startup with a dozen services, the observability problem is getting enough signal. You don’t have enough logging, your traces are incomplete, and your metrics dashboards have gaps. You know when something is wrong because a user tells you. At scale, the problem inverts. You have petabytes of logs, hundreds of millions of traces per day, and metrics cardinality so high that naive approaches cause your time-series database to OOM. The engineering challenge is filtering signal from noise, not generating signal. Both problems are real. They require different solutions. ...
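
One lever on the “too much data” side, as a hedged sketch: head-based trace sampling with the OpenTelemetry Go SDK. The 1% ratio below is illustrative, not a recommendation.

```go
package tracing

import (
	"go.opentelemetry.io/otel"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// InitSampledTracing keeps ~1% of new root traces and otherwise follows
// the parent span's decision, so a trace is never half-sampled across
// service boundaries.
func InitSampledTracing() *sdktrace.TracerProvider {
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.01))),
	)
	otel.SetTracerProvider(tp)
	return tp
}
```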

May 29, 2024 · 5 min · MW

Cache Design as a Reliability Practice, Not an Optimisation

At the large US tech company, I inherited a service that had a cache. The cache was fast — it served 98% of requests with <1ms latency. The 2% cache misses hit the database, which took 50–200ms. Then the cache cluster had a rolling restart during a traffic spike. For three minutes, the cache hit rate dropped to 30%. The 70% misses all hit the database simultaneously. The database became saturated, latency spiked to 10s, and the service effectively went down — not because the cache was unavailable, but because the system wasn’t designed for cache misses at that rate. This is a cache reliability failure, not a cache performance failure. ...
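
The standard mitigation for that stampede is request coalescing. Below is a sketch using golang.org/x/sync/singleflight, with a hypothetical Cache interface and Loader function standing in for the cache cluster and the database: concurrent misses for one key collapse into a single database load, so a cold cache costs one query per key rather than one per request.

```go
package cache

import (
	"context"

	"golang.org/x/sync/singleflight"
)

// Loader fetches a value from the database; Cache is a hypothetical
// get/set interface over the cache cluster.
type Loader func(ctx context.Context, key string) (string, error)

type Cache interface {
	Get(key string) (string, bool)
	Set(key, value string)
}

type Coalescer struct {
	cache Cache
	load  Loader
	group singleflight.Group
}

// Get serves hits from the cache and collapses concurrent misses for
// the same key into one database load. Note that the winning caller's
// ctx governs the shared load.
func (c *Coalescer) Get(ctx context.Context, key string) (string, error) {
	if v, ok := c.cache.Get(key); ok {
		return v, nil
	}
	v, err, _ := c.group.Do(key, func() (interface{}, error) {
		val, err := c.load(ctx, key)
		if err != nil {
			return nil, err
		}
		c.cache.Set(key, val)
		return val, nil
	})
	if err != nil {
		return "", err
	}
	return v.(string), nil
}
```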

March 27, 2024 · 6 min · MW

Engineering at Enterprise Scale: What Changes When the System Is Actually Big

I’d worked at organisations ranging from twelve people to four hundred. The new role is at a company with tens of thousands of engineers. The systems are bigger, the coordination surface is larger, and some things I assumed were universal engineering truths turned out to be scale-specific. ...

February 14, 2024 · 4 min · MW

OpenTelemetry in Go: Distributed Tracing That Doesn't Get in the Way

Before OpenTelemetry, distributed tracing was a vendor-specific integration. You chose Jaeger or Zipkin or Datadog, added their SDK, and traced against their API. Switching vendors meant rewriting instrumentation. Adding a library that used a different tracing SDK meant two tracing systems running in parallel. OpenTelemetry solved this with a vendor-neutral API and a pluggable exporter model. The API stays the same; you swap exporters. Most major observability vendors now accept OTel format natively. Here’s what a solid Go integration looks like. ...
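
As a taste of what the post walks through, a minimal sketch of that pluggable model, with the stdout exporter standing in for whatever backend you actually ship to:

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// The exporter is the pluggable part: swap this for an OTLP exporter
	// pointed at your collector and nothing below changes.
	exp, err := stdouttrace.New(stdouttrace.WithPrettyPrint())
	if err != nil {
		log.Fatal(err)
	}

	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
	defer func() { _ = tp.Shutdown(context.Background()) }()
	otel.SetTracerProvider(tp)

	// Instrumentation codes against the vendor-neutral API only.
	tr := otel.Tracer("example")
	_, span := tr.Start(ctx, "handle-request")
	defer span.End()
}
```

Swapping vendors means changing the exporter constructor; the otel.Tracer call sites never change.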

May 31, 2022 · 5 min · MW

gRPC in Production: Lessons After Two Years

We moved our internal service communication from REST+JSON to gRPC when the data pipeline scaled past a point where JSON parsing overhead was measurable in profiles. Two years later, the performance win was real and smaller than expected; the developer experience wins were larger; and the operational complexity was a genuine tax that we underpriced initially. ...

January 13, 2021 · 4 min · MW

Context Propagation in Go: Deadlines, Cancellation, and Tracing

context.Context appears in the signature of almost every non-trivial Go function. It’s the idiomatic way to carry deadlines, cancellation signals, and request-scoped values across API boundaries and goroutines. After years of reading Go codebases, the misuse patterns are as common as the correct uses. ...
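
As a baseline for the correct pattern before the misuses: accept a ctx, derive a deadline from it rather than inventing a lifetime, and let cancellation flow into I/O (the URL and 2-second timeout below are illustrative).

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"time"
)

// fetch derives a deadline from the caller's context; cancelling ctx
// cancels the in-flight HTTP request.
func fetch(ctx context.Context, url string) ([]byte, error) {
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel() // release the timer even on the happy path

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err // includes context.DeadlineExceeded
	}
	defer resp.Body.Close()
	return io.ReadAll(resp.Body)
}

func main() {
	body, err := fetch(context.Background(), "https://example.com")
	fmt.Println(len(body), err)
}
```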

February 19, 2020 · 5 min · MW

Schema Evolution in Avro: The Hard Lessons from Production

Three months after deploying Kafka with Avro schemas and the Confluent Schema Registry, we had a production incident where a schema change caused a downstream consumer to silently produce incorrect output — wrong field values, no error thrown, no monitoring alert triggered. That incident rewired how the team thought about schema evolution. The tools don’t protect you from all the failure modes. Understanding the rules and building organisational processes around them is what does. ...
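
The flavour of rule involved, shown with a hypothetical Trade record: adding a field is only backward compatible if it carries a default, because a reader using the new schema fills in the default when decoding data written before the change.

```json
{
  "type": "record",
  "name": "Trade",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "quantity", "type": "int"},
    {"name": "venue", "type": ["null", "string"], "default": null}
  ]
}
```

Drop the `default` and the same change becomes incompatible: readers on the new schema can no longer decode old records.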

October 4, 2018 · 5 min · MW

Event Sourcing in Financial Systems: Real Benefits, Real Costs

Financial systems are natural candidates for event sourcing. Regulators want to know the state of positions at any point in time. Audit trails are not optional. The need to replay a day’s events to debug a pricing anomaly comes up regularly. These requirements — which other domains treat as optional — map directly onto event sourcing’s core properties. That said, event sourcing in production has costs that the enthusiast literature systematically underplays. Here’s an honest accounting. ...
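
The core property, as a deliberately tiny sketch (the types are hypothetical): state is a fold over an immutable event log, so “position as of any point in time” is a replay, not a feature bolted on later.

```go
package ledger

import "time"

// Event is an immutable fact. Current state is never stored directly,
// only derived by folding events in order.
type Event struct {
	At     time.Time
	Symbol string
	Qty    int64 // signed: buys positive, sells negative
}

// PositionAt replays the log up to a point in time, the property that
// regulators and pricing post-mortems both lean on.
func PositionAt(events []Event, symbol string, asOf time.Time) int64 {
	var pos int64
	for _, e := range events {
		if e.Symbol == symbol && !e.At.After(asOf) {
			pos += e.Qty
		}
	}
	return pos
}
```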

July 11, 2018 · 6 min · MW

Backpressure in Practice: Keeping Fast Producers from Killing Slow Consumers

The system that prompted this post was a trade enrichment pipeline. The input was a Kafka topic receiving ~50,000 trade events per minute during market hours. The enrichment step required a database lookup — pulling counterparty and instrument data — that averaged 2ms per trade. 50,000 trades/minute is ~833 trades/second. At 2ms per lookup, a single thread can handle 500 lookups/second, so keeping up needed at least two threads and ideally a small pool. We had six threads and a queue in front of them. Then a market event pushed the rate to 200,000 trades/minute (roughly 3,333 trades/second, beyond the ~3,000/second six threads could service), the queue grew without bound, memory climbed, and the service eventually OOM’d. Classic backpressure failure. ...
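
The shape of the fix, sketched with a bounded Go channel (the Trade type and function names are hypothetical): make the queue finite and push the wait upstream to Kafka, which is built to buffer, rather than into the service's heap.

```go
package pipeline

import "context"

// Trade is a hypothetical enrichment work item.
type Trade struct{ ID string }

// NewBoundedQueue returns the simplest backpressure primitive in Go:
// a bounded channel. When consumers fall behind, sends block instead
// of growing an unbounded in-memory queue until OOM.
func NewBoundedQueue(size int) chan Trade { return make(chan Trade, size) }

// Submit applies backpressure to the producer: it blocks until the
// queue has room or the caller's context is cancelled. For a Kafka
// source, blocking here pauses consumption and lets lag accumulate
// in the broker, which is designed to hold it.
func Submit(ctx context.Context, q chan Trade, t Trade) error {
	select {
	case q <- t:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
```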

June 14, 2018 · 6 min · MW
Available for consulting
Distributed systems · Low-latency architecture · Go · LLM integration & RAG · Technical leadership
hello@turboawesome.win