This is the JVM false-sharing problem in a different language. The rules differ slightly, the tooling differs, but the underlying hardware constraint — cache lines are 64 bytes and sharing one across goroutines is expensive — is identical.

How Go Aligns Struct Fields

Go aligns each field to its natural alignment boundary: a bool is 1-byte aligned, an int32 is 4-byte aligned, an int64 is 8-byte aligned, a pointer is pointer-size aligned (8 bytes on 64-bit systems).

This means struct layout is not necessarily the same as field declaration order:

type Bad struct {
    a bool    // 1 byte
    // 7 bytes padding to align b
    b int64   // 8 bytes
    c bool    // 1 byte
    // 7 bytes padding to align d
    d int64   // 8 bytes
}
// sizeof(Bad) = 32 bytes

type Good struct {
    b int64   // 8 bytes
    d int64   // 8 bytes
    a bool    // 1 byte
    c bool    // 1 byte
    // 6 bytes padding to align next struct (if in array)
}
// sizeof(Good) = 24 bytes

For Bad, Go inserts 7 bytes of padding after a to align b on an 8-byte boundary, and 7 more bytes after c to align d. The Good struct packs the large fields first and the small ones last, eliminating the internal padding.

Use unsafe.Sizeof to check actual sizes, or run the fieldalignment analyser from golang.org/x/tools (installable as a standalone tool, or wired into go vet via -vettool; plain go vet does not include it).

When This Actually Matters

Memory consumption in large slices. If you have a slice of millions of structs, layout efficiency directly affects RSS and GC pressure (more memory → more GC work). In a market data system holding tick data for thousands of instruments, struct layout across millions of entries is measurable.

Cache performance in hot loops. When iterating a slice of structs, the CPU loads cache lines sequentially. A struct that fits in 32 bytes means two fit in a cache line; a struct that wastes 16 bytes of padding fits only one and a half. For tight loops over large slices, this is the difference between a workload that fits in L2 and one that doesn’t.

False sharing in concurrent code. If two goroutines are modifying different fields of the same struct, and those fields share a cache line, you have false sharing — each write invalidates the other goroutine’s cache entry. The solution is padding to push the fields onto separate cache lines.

Detecting False Sharing in Go

There’s no @Contended annotation in Go (Java has sun.misc.Contended, since JDK 9 jdk.internal.vm.annotation.Contended). You pad manually:

const cacheLineSize = 64

type CacheLinePadded struct {
    value int64
    _     [cacheLineSize - unsafe.Sizeof(int64(0))]byte
}

// Or more commonly in practice:
type ShardedCounter struct {
    mu    sync.Mutex
    count int64
    _     [cacheLineSize - unsafe.Sizeof(sync.Mutex{}) - unsafe.Sizeof(int64(0))]byte
}

The _ field is a blank-identifier padding field with enough bytes to push the struct to a cache line boundary. It’s ugly but explicit; it only survives struct changes if you remember to update it, though computing the length with unsafe.Sizeof rather than a magic number makes it at least partially self-documenting.

A Concrete Example: Lock-Free Counter

We had a hot-path metrics counter updated from multiple goroutines. The naive implementation:

type Counters struct {
    requests  atomic.Int64
    errors    atomic.Int64
    latencyNs atomic.Int64
}

atomic.Int64 is 8 bytes. Three of them packed together: all three in the same cache line. Goroutine A updating requests and goroutine B updating errors are writing to the same cache line → false sharing → cache coherency traffic between CPUs.

The fix:

type Counter struct {
    v atomic.Int64
    _ [56]byte // pad to 64 bytes
}

type Counters struct {
    requests  Counter
    errors    Counter
    latencyNs Counter
}

Each Counter is now 64 bytes — exactly one cache line. Different goroutines updating different counters hit different cache lines.

Benchmark result in our case: under contention from 8 goroutines, the padded version was ~4x faster. Under no contention, identical.

The Arrays-of-Structs vs Structs-of-Arrays Question

For numerical workloads, array-of-structs (AoS) vs struct-of-arrays (SoA) is worth thinking about.

AoS (the natural Go layout):

type Tick struct { Price float64; Volume int64; Timestamp int64 }
ticks []Tick

SoA:

type Ticks struct {
    Price     []float64
    Volume    []int64
    Timestamp []int64
}

If you’re processing all three fields together for each tick, AoS is better — all fields for one tick fit in a few cache lines, and iterating forward is sequential. If you’re summing all volumes across thousands of ticks (a common analytics query), SoA is better — you only load the Volume slice into cache, not the prices and timestamps you don’t need.

In a market data analytics context, we had both access patterns. The solution: keep ticks in AoS format for the real-time path (process each tick fully before moving to the next), and columnar SoA for the analytics path (materialise column slices from the AoS when needed). Two representations of the same data, each optimal for its access pattern. The conversion cost was acceptable because analytics runs were infrequent compared to tick ingestion.