Distributed Consistency Models: What Your Service Actually Guarantees

At the large US tech company, the hardest design review conversations were almost never about which database to use. They were about what consistency guarantee the system needed, and whether the proposed design actually provided it. “Eventually consistent” is not a useful answer to “what does this service guarantee?” It is a label covering a wide range of behaviours, some of which are harmless and some of which cause correctness bugs in production. ...

April 16, 2025 · 6 min · MW

Tail-Based Trace Sampling: Why Head Sampling Is Usually Wrong

The large US technology company runs at a scale where tracing every request is financially and computationally impractical. You have to sample. How you sample determines whether your traces are useful. Most teams implement head-based sampling — deciding whether to trace a request at the moment it starts. That is the easy implementation, and it produces traces that are useless for most debugging purposes. ...
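The difference comes down to when the decision is made. A minimal sketch of the contrast (not from the post itself; the field names and thresholds are illustrative assumptions):

```python
import random

def head_sample(rate: float) -> bool:
    """Head-based: decide at request start, before the outcome is known.
    A 1% rate keeps 1% of errors and 1% of slow requests -- i.e. almost
    none of the traces you actually want when debugging."""
    return random.random() < rate

def tail_sample(trace: dict, rate: float) -> bool:
    """Tail-based: decide after the request completes, when the outcome
    is known, so the interesting traces are always kept."""
    if trace["status"] >= 500:        # always keep errors
        return True
    if trace["duration_ms"] > 1000:   # always keep slow requests
        return True
    return random.random() < rate     # downsample the healthy majority
```

The cost, of course, is that tail sampling requires buffering every span until the trace completes — which is exactly the operational trade-off the post goes on to discuss.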

October 9, 2024 · 5 min · MW

Observability at Scale: What 'Good' Looks Like When You Have Too Much Data

At a startup with a dozen services, the observability problem is getting enough signal. You don’t have enough logging, your traces are incomplete, and your metrics dashboards have gaps. You know when something is wrong because a user tells you. At scale, the problem inverts. You have petabytes of logs, hundreds of millions of traces per day, and metrics cardinality so high that naive approaches cause your time-series database to OOM. The engineering challenge is filtering signal from noise, not generating signal. Both problems are real. They require different solutions. ...

May 29, 2024 · 5 min · MW

Cache Design as a Reliability Practice, Not an Optimisation

At the large US tech company, I inherited a service that had a cache. The cache was fast — it served 98% of requests with <1ms latency. The 2% cache misses hit the database, which took 50–200ms. Then the cache cluster had a rolling restart during a traffic spike. For three minutes, the cache hit rate dropped to 30%. The 70% misses all hit the database simultaneously. The database became saturated, latency spiked to 10s, and the service effectively went down — not because the cache was unavailable, but because the system wasn’t designed for cache misses at that rate. This is a cache reliability failure, not a cache performance failure. ...
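One standard mitigation for that failure mode is request coalescing: concurrent misses for the same key collapse into a single backing lookup, so a cold cache costs one database query per key rather than one per request. A toy Python sketch of the idea (the production system was not Python; class and method names are mine, and error handling is omitted):

```python
import threading

class CoalescingCache:
    """Collapse concurrent misses for the same key into one backing
    lookup. The first caller for a key becomes the "leader" and loads;
    everyone else waits on an Event and reads the cached result."""

    def __init__(self, load):
        self._load = load        # backing lookup, e.g. a DB query
        self._cache = {}
        self._inflight = {}      # key -> Event held by the leader
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            if key in self._cache:
                return self._cache[key]
            ev = self._inflight.get(key)
            if ev is None:
                ev = self._inflight[key] = threading.Event()
                leader = True
            else:
                leader = False
        if leader:
            value = self._load(key)          # exactly one DB hit per key
            with self._lock:
                self._cache[key] = value
                del self._inflight[key]
            ev.set()
            return value
        ev.wait()                            # followers block, not the DB
        with self._lock:
            return self._cache[key]
```

Coalescing bounds database load at one query per distinct key regardless of request rate — which is the property the rolling-restart incident above was missing.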

March 27, 2024 · 6 min · MW

ClickHouse for Application Analytics: Fast Aggregations Without Spark

The requirement: an internal analytics dashboard showing trading activity metrics — volume, trade count, latency distributions, error rates — sliced by instrument, venue, time window, and a dozen other dimensions. Data volume: about 4 billion events per day, 90-day retention. Query pattern: ad-hoc OLAP — arbitrary group-bys, time ranges, filters. We evaluated TimescaleDB (Postgres extension), Apache Druid, ClickHouse, and “just use BigQuery.” We chose ClickHouse. After a year in production, I’d make the same choice. ...

May 17, 2023 · 5 min · MW

Kafka at Startup Scale

The fintech startup adopted Kafka early — we were processing market events at rates that would have overwhelmed any request/response design. Two years in, with a five-broker cluster handling 200k messages/sec at peak, the operational experience was significantly different from what I’d expected based on the documentation and conference talks. ...

May 18, 2022 · 6 min · MW

Choosing a Time-Series Database in 2020

The fintech startup had three different time-series storage problems at the same time. After evaluating the options available in 2020, we ended up running two different systems. Here’s the decision framework and why the landscape fragmented the way it did. ...

November 18, 2020 · 5 min · MW

Schema Evolution in Avro: The Hard Lessons from Production

Three months after deploying Kafka with Avro schemas and the Confluent Schema Registry, we had a production incident where a schema change caused a downstream consumer to silently produce incorrect output — wrong field values, no error thrown, no monitoring alert triggered. That incident rewired how the team thought about schema evolution. The tools don’t protect you from all the failure modes. Understanding the rules and building organisational processes around them is what does. ...

October 4, 2018 · 5 min · MW

Event Sourcing in Financial Systems: Real Benefits, Real Costs

Financial systems are natural candidates for event sourcing. Regulators want to know the state of positions at any point in time. Audit trails are not optional. The need to replay a day’s events to debug a pricing anomaly comes up regularly. These requirements — which other domains treat as optional — map directly onto event sourcing’s core properties. That said, event sourcing in production has costs that the enthusiast literature systematically underplays. Here’s an honest accounting. ...

July 11, 2018 · 6 min · MW

Backpressure in Practice: Keeping Fast Producers from Killing Slow Consumers

The system that prompted this post was a trade enrichment pipeline. The input was a Kafka topic receiving ~50,000 trade events per minute during market hours. The enrichment step required a database lookup — pulling counterparty and instrument data — that averaged 2ms per trade. 50,000 trades/minute = ~833 trades/second. At 2ms per lookup, a single thread can handle 500 lookups/second. To keep up, we needed at least two threads and ideally a small pool. We had six threads and a queue in front of them. During a market event that pushed the rate to 200,000 trades/minute, the queue grew without bound, memory climbed, and the service eventually OOM’d. Classic backpressure failure. ...
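The arithmetic above, plus the fix implied by the failure, can be sketched in a few lines of Python (the production system was not Python; helper names and the queue bound are my own illustrative choices):

```python
import math
import queue

def required_threads(events_per_sec: float, lookup_ms: float) -> int:
    """Capacity math from the post: a 2 ms lookup caps one thread
    at 500 lookups/second."""
    per_thread = 1000.0 / lookup_ms
    return math.ceil(events_per_sec / per_thread)

# The actual backpressure mechanism is a *bounded* queue: when consumers
# fall behind, submission fails fast instead of growing memory until OOM.
work: "queue.Queue[dict]" = queue.Queue(maxsize=10_000)

def submit(event: dict) -> bool:
    try:
        work.put_nowait(event)
        return True       # accepted
    except queue.Full:
        return False      # shed, or pause the upstream Kafka consumer
```

At the steady ~833 trades/second, two threads suffice; at the spike rate of 200,000 trades/minute (~3,333/second), seven are needed — and with six threads, the unbounded queue in front of them could only ever grow.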

June 14, 2018 · 6 min · MW
Available for consulting: Distributed systems · Low-latency architecture · Go · LLM integration & RAG · Technical leadership
hello@turboawesome.win