This is the JVM false-sharing problem in a different language. The rules differ slightly, the tooling differs, but the underlying hardware constraint — cache lines are 64 bytes and sharing one across goroutines is expensive — is identical.
How Go Aligns Struct Fields
Go aligns each field to its natural alignment boundary: a bool is 1-byte aligned, an int32 is 4-byte aligned, an int64 is 8-byte aligned, and a pointer is pointer-size aligned (8 bytes on 64-bit systems).
This means struct layout is not necessarily the same as field declaration order:
For Bad, Go inserts 7 bytes of padding after a to align b on an 8-byte boundary, and 7 more bytes after c to align d. The Good struct packs the large fields first and the small ones last, eliminating the internal padding.
Use unsafe.Sizeof to check actual sizes, or run the standalone fieldalignment analyser from golang.org/x/tools (it is not part of the default go vet suite; install it with go install golang.org/x/tools/go/analysis/passes/fieldalignment/cmd/fieldalignment@latest).
When This Actually Matters
Memory consumption in large slices. If you have a slice of millions of structs, layout efficiency directly affects RSS and GC pressure (more memory → more GC work). In a market data system holding tick data for thousands of instruments, struct layout across millions of entries is measurable.
Cache performance in hot loops. When iterating a slice of structs, the CPU loads cache lines sequentially. A struct that fits in 32 bytes means two fit in a cache line; pad it out to 48 bytes and only one and a third fit. For tight loops over large slices, this is the difference between a workload that fits in L2 and one that doesn’t.
False sharing in concurrent code. If two goroutines are modifying different fields of the same struct, and those fields share a cache line, you have false sharing — each write invalidates the other goroutine’s cache entry. The solution is padding to push the fields onto separate cache lines.
Detecting False Sharing in Go
There’s no @Contended annotation in Go (unlike Java’s sun.misc.Contended). You pad manually:
The _ field is a blank-identifier field — it occupies space but can never be read or written — sized to push the struct to a cache line boundary. It’s ugly, but it’s explicit, and it survives struct size changes only if you update it; deriving the pad size from a constant calculation makes it at least partially self-documenting.
A Concrete Example: Lock-Free Counter
We had a hot-path metrics counter updated from multiple goroutines. The naive implementation:
atomic.Int64 is 8 bytes. Three of them packed together: all three in the same cache line. Goroutine A updating requests and goroutine B updating errors are writing to the same cache line → false sharing → cache coherency traffic between CPUs.
The fix:
Each Counter is now 64 bytes — exactly one cache line. Different goroutines updating different counters hit different cache lines.
Benchmark result in our case: under contention from 8 goroutines, the padded version was ~4x faster. Under no contention, identical.
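A rough way to reproduce the comparison (illustrative harness, not the original benchmark; exact speedups vary by machine and core topology):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

const (
	goroutines = 8
	iters      = 1_000_000
)

type padded struct {
	n atomic.Int64
	_ [56]byte
}

type unpadded struct {
	n atomic.Int64
}

// run has each goroutine hammer its own counter and reports wall time.
func run(add func(g int)) time.Duration {
	start := time.Now()
	var wg sync.WaitGroup
	for g := 0; g < goroutines; g++ {
		wg.Add(1)
		go func(g int) {
			defer wg.Done()
			for i := 0; i < iters; i++ {
				add(g)
			}
		}(g)
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	var u [goroutines]unpadded // adjacent 8-byte counters share lines
	var p [goroutines]padded   // one counter per cache line
	fmt.Println("unpadded:", run(func(g int) { u[g].n.Add(1) }))
	fmt.Println("padded:  ", run(func(g int) { p[g].n.Add(1) }))
}
```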
The Arrays-of-Structs vs Structs-of-Arrays Question
For numerical workloads, array-of-structs (AoS) vs struct-of-arrays (SoA) is worth thinking about.
AoS (the natural Go layout):
SoA:
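The columnar counterpart (same assumed field names; `TickColumns` is an illustrative type name):

```go
package main

import "fmt"

// Struct-of-arrays: one slice per column, so a scan over Volume
// pulls only volumes into cache.
type TickColumns struct {
	Price     []float64
	Volume    []int64
	Timestamp []int64
}

func main() {
	cols := TickColumns{
		Price:     []float64{101.5, 101.6},
		Volume:    []int64{200, 300},
		Timestamp: []int64{1, 2},
	}
	var total int64
	for _, v := range cols.Volume { // touches only the Volume column
		total += v
	}
	fmt.Println(total) // 500
}
```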
If you’re processing all three fields together for each tick, AoS is better — all fields for one tick fit in a few cache lines, and iterating forward is sequential. If you’re summing all volumes across thousands of ticks (a common analytics query), SoA is better — you only load the Volume slice into cache, not the prices and timestamps you don’t need.
In a market data analytics context, we had both access patterns. The solution: keep ticks in AoS format for the real-time path (process each tick fully before moving to the next), and columnar SoA for the analytics path (materialise column slices from the AoS when needed). Two representations of the same data, each optimal for its access pattern. The conversion cost was acceptable because analytics runs were infrequent compared to tick ingestion.
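The materialisation step can be sketched like this — a hypothetical `volumeColumn` helper, a one-off O(n) copy paid only when an analytics run starts:

```go
package main

import "fmt"

type Tick struct {
	Price     float64
	Volume    int64
	Timestamp int64
}

// volumeColumn extracts just the Volume column from the AoS slice,
// producing a cache-friendly contiguous []int64 for analytics scans.
func volumeColumn(ticks []Tick) []int64 {
	out := make([]int64, len(ticks))
	for i := range ticks {
		out[i] = ticks[i].Volume
	}
	return out
}

func main() {
	ticks := []Tick{{Volume: 200}, {Volume: 300}}
	vols := volumeColumn(ticks)
	var total int64
	for _, v := range vols {
		total += v
	}
	fmt.Println(total) // 500
}
```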