Tail-Based Trace Sampling: Why Head Sampling Is Usually Wrong

Large technology companies run at a scale where tracing every request is financially and computationally impractical. You have to sample, and how you sample determines whether your traces are useful. Most teams implement head-based sampling — decide whether to trace a request when it starts. It's the easy implementation, and it produces largely useless traces for most debugging purposes. ...

October 9, 2024 · 5 min · MW

Go's net/http: Building Production HTTP Servers Without a Framework

Go’s net/http is frequently underrated. The ecosystem has frameworks — Chi, Gin, Echo, Fiber — and they’re fine choices, but the standard library gets you remarkably far without additional dependencies. After building several production APIs that stayed on raw net/http, here’s the honest assessment of what you can and can’t do without a framework, and the patterns that make it work. ...

July 5, 2023 · 6 min · MW

Memory Layout in Go: Structs, Alignment, and Cache Performance

This is the JVM false-sharing problem in a different language. The rules differ slightly, the tooling differs, but the underlying hardware constraint — cache lines are 64 bytes and sharing one across goroutines is expensive — is identical. ...

August 17, 2022 · 5 min · MW

Go's Scheduler: GOMAXPROCS, Work Stealing, and Why It Matters

Go’s goroutine scheduler sits between your code and the OS. Understanding it is useful not because you’ll tune it (you rarely should) but because its behaviour explains a class of performance surprises and concurrency patterns that look odd until you see why they exist. ...

August 12, 2020 · 5 min · MW

Building a High-Throughput Event Pipeline in Go Without Losing Your Mind

A fintech startup's pipeline processed market events at sustained rates of 50,000–200,000 per second through normalisation and enrichment stages. Go channels and goroutines are the natural tool, but the naive implementation falls apart at scale. Here’s what works. ...
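The basic shape is stages connected by bounded channels, where each stage closes its output when its input drains so shutdown propagates cleanly. A minimal sketch — `Event`, `normalise`, and `enrich` are illustrative names, not the post's actual types:

```go
package main

import (
	"fmt"
	"strings"
)

// Event is a simplified market event for illustration.
type Event struct {
	Symbol   string
	Price    float64
	Enriched bool
}

func normalise(in <-chan Event) <-chan Event {
	out := make(chan Event, 1024) // bounded buffer: backpressure, not unbounded growth
	go func() {
		defer close(out) // closing propagates shutdown to the next stage
		for e := range in {
			e.Symbol = strings.ToUpper(e.Symbol)
			out <- e
		}
	}()
	return out
}

func enrich(in <-chan Event) <-chan Event {
	out := make(chan Event, 1024)
	go func() {
		defer close(out)
		for e := range in {
			e.Enriched = true // stand-in for the real reference-data lookup
			out <- e
		}
	}()
	return out
}

func main() {
	src := make(chan Event)
	go func() {
		defer close(src)
		for i := 0; i < 3; i++ {
			src <- Event{Symbol: "aapl", Price: 100}
		}
	}()
	for e := range enrich(normalise(src)) {
		fmt.Println(e.Symbol, e.Enriched) // AAPL true
	}
}
```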

July 1, 2020 · 6 min · MW

Profiling Go Services in Production with pprof

Every Go service should have profiling endpoints enabled by default. The overhead of having net/http/pprof imported and listening is negligible — a few goroutines, no continuous sampling. The payoff when you need it is enormous. This post is about the workflow I use for diagnosing real performance problems in production Go services. ...

April 1, 2020 · 5 min · MW

Go Benchmarks: Writing Ones That Actually Tell You Something

Go has first-class benchmarking built into the testing package. go test -bench=. is enough to get started. The hard part isn’t running benchmarks — it’s writing ones that measure what you intend to measure. These are the patterns I’ve found essential and the mistakes I’ve made repeatedly enough to write down. ...
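One of those recurring mistakes is letting the compiler optimise away the work under test. A minimal sketch of the sink-variable pattern, runnable outside `go test` via `testing.Benchmark` (the benchmark's subject, `strings.Join`, is just an example):

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// sink defeats dead-code elimination: assigning the result to a
// package-level variable stops the compiler removing the call entirely.
var sink string

func BenchmarkJoin(b *testing.B) {
	parts := []string{"a", "b", "c", "d"}
	b.ResetTimer() // exclude setup from the measurement
	for i := 0; i < b.N; i++ {
		sink = strings.Join(parts, ",")
	}
}

func main() {
	// testing.Benchmark runs a benchmark function standalone.
	res := testing.Benchmark(BenchmarkJoin)
	fmt.Println(res)
}
```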

March 17, 2020 · 6 min · MW

Goroutines Are Not Threads: The Mental Model Shift

The first thing every Java developer learns about goroutines is that they’re cheap. Start a million of them, no problem. The mental model that follows from this — “goroutines are threads but lighter” — is close enough to be useful and wrong enough to cause confusion. Here’s the refined model. ...

April 22, 2019 · 6 min · MW

Backpressure in Practice: Keeping Fast Producers from Killing Slow Consumers

The system that prompted this post was a trade enrichment pipeline. The input was a Kafka topic receiving ~50,000 trade events per minute during market hours. The enrichment step required a database lookup — pulling counterparty and instrument data — that averaged 2ms per trade. 50,000 trades/minute = ~833 trades/second. At 2ms per lookup, a single thread can handle 500 lookups/second. To keep up, we needed at least two threads and ideally a small pool. We had six threads and a queue in front of them. During a market event that pushed the rate to 200,000 trades/minute, the queue grew without bound, memory climbed, and the service eventually OOM’d. Classic backpressure failure. ...
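The fix is a bounded queue: when it fills, the producer blocks instead of the process growing without bound. In Go that is a buffered channel's default behaviour — a minimal sketch under assumed names (`run`, `process`), not the original Java system:

```go
package main

import (
	"fmt"
	"sync"
)

// process stands in for the 2ms enrichment lookup.
func process(trade int) int { return trade * 2 }

// run pushes trades through a fixed-capacity queue feeding a worker
// pool. A producer that outruns the consumers blocks on send — the
// queue cannot grow without bound, so backpressure is the default.
func run(trades []int, workers, queueDepth int) int {
	queue := make(chan int, queueDepth)
	var wg sync.WaitGroup
	var mu sync.Mutex
	processed := 0

	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for t := range queue {
				_ = process(t)
				mu.Lock()
				processed++
				mu.Unlock()
			}
		}()
	}

	for _, t := range trades {
		queue <- t // blocks when full: the producer slows to match the consumers
	}
	close(queue)
	wg.Wait()
	return processed
}

func main() {
	trades := make([]int, 1000)
	fmt.Println(run(trades, 6, 64)) // 1000
}
```

Blocking the producer pushes the problem upstream to Kafka, which is built to buffer it — far better than buffering in your own heap.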

June 14, 2018 · 6 min · MW

Project Loom Preview: Virtual Threads and What They Mean for Server Code

Java’s threading model has a fundamental scalability problem: OS threads are expensive. Creating thousands of them consumes gigabytes of stack memory and causes significant scheduling overhead. This is why reactive programming (Netty, Project Reactor, RxJava) became popular — it avoids the thread-per-request model by using event loops and async callbacks. Project Loom, announced in 2017 with early previews arriving in 2018, proposed a different solution: make threads cheap. Virtual threads — JVM-managed threads that are not 1:1 with OS threads — could make the thread-per-request model scalable again. ...

May 24, 2018 · 5 min · MW
Available for consulting
Distributed systems · Low-latency architecture · Go · LLM integration & RAG · Technical leadership
hello@turboawesome.win