Go has first-class benchmarking built into the testing package. go test -bench=. is enough to get started. The hard part isn’t running benchmarks — it’s writing ones that measure what you intend to measure.

These are the patterns I’ve found essential and the mistakes I’ve made repeatedly enough to write down.

The Basic Structure

func BenchmarkParseOrderMessage(b *testing.B) {
    msg := []byte(`{"id":"ord_123","symbol":"AAPL","qty":100,"price":145.23}`)

    b.ResetTimer()  // Don't count setup time
    for i := 0; i < b.N; i++ {
        _, err := parseOrderMessage(msg)
        if err != nil {
            b.Fatal(err)
        }
    }
}

b.N is set by the framework — it runs the loop enough times to get a stable measurement. b.ResetTimer() after setup ensures setup time doesn’t pollute the measurement.

Running it:

go test -bench=BenchmarkParseOrderMessage -benchmem -count=5 ./...

BenchmarkParseOrderMessage-8    2847193    421.3 ns/op    256 B/op    4 allocs/op
BenchmarkParseOrderMessage-8    2891034    418.9 ns/op    256 B/op    4 allocs/op
BenchmarkParseOrderMessage-8    2904821    419.7 ns/op    256 B/op    4 allocs/op
BenchmarkParseOrderMessage-8    2886710    420.1 ns/op    256 B/op    4 allocs/op
BenchmarkParseOrderMessage-8    2878923    419.4 ns/op    256 B/op    4 allocs/op

-benchmem adds allocation stats. -count=5 runs each benchmark 5 times so you can see variance.

The Dead Code Elimination Problem

The most common benchmark mistake: the compiler eliminates the code you’re benchmarking.

// WRONG: compiler may optimise away hash(s) since result is unused
func BenchmarkHash(b *testing.B) {
    s := "hello, world"
    for i := 0; i < b.N; i++ {
        hash(s)  // result unused — compiler is allowed to remove this call
    }
}

// CORRECT: use the result to prevent elimination
func BenchmarkHash(b *testing.B) {
    s := "hello, world"
    var result uint64
    for i := 0; i < b.N; i++ {
        result = hash(s)
    }
    _ = result  // prevent result from being optimised away
}

The _ = result outside the loop tells the compiler the result matters, preventing elimination of the call and of the loop itself. It's a hint rather than a hard guarantee against a sufficiently aggressive compiler, though.

For more complex cases, use the sink pattern from the standard library:

var Sink interface{}  // package-level variable

func BenchmarkComplexOp(b *testing.B) {
    var result SomeStruct
    for i := 0; i < b.N; i++ {
        result = complexOperation()
    }
    Sink = result  // a package-level write the compiler must assume is observed
}

Allocation Benchmarks: Pre-allocated vs. Fresh

When benchmarking allocation patterns, be clear about whether you’re measuring allocation or logic:

// Measures allocation + processing:
func BenchmarkProcessWithAlloc(b *testing.B) {
    for i := 0; i < b.N; i++ {
        result := make([]Order, 0, 100)
        result = processOrders(input, result)
        _ = result
    }
}

// Measures processing only, with pre-allocated slice:
func BenchmarkProcessPrealloc(b *testing.B) {
    result := make([]Order, 0, 100)
    for i := 0; i < b.N; i++ {
        result = result[:0]  // reset length, keep capacity
        result = processOrders(input, result)
    }
    _ = result
}

If your production code pre-allocates, benchmark the pre-allocated version. If you’re trying to measure whether pre-allocation helps, benchmark both and compare.

Sub-Benchmarks and Table-Driven Benchmarks

func BenchmarkEncode(b *testing.B) {
    cases := []struct {
        name string
        size int
    }{
        {"small", 10},
        {"medium", 1000},
        {"large", 100000},
    }

    for _, tc := range cases {
        b.Run(tc.name, func(b *testing.B) {
            data := generateData(tc.size)
            b.ResetTimer()
            b.SetBytes(int64(len(data)))  // enables MB/s reporting
            for i := 0; i < b.N; i++ {
                encoded := encode(data)
                _ = encoded
            }
        })
    }
}

b.SetBytes(n) adds throughput in MB/s to the output:

BenchmarkEncode/small-8      9823401    121 ns/op    826 MB/s    0 B/op
BenchmarkEncode/medium-8      112341   10823 ns/op    924 MB/s    0 B/op
BenchmarkEncode/large-8        10234  1082341 ns/op    924 MB/s    0 B/op

MB/s reporting is useful for serialisation, compression, and I/O benchmarks where throughput is the meaningful metric.

Benchmarking Concurrent Code

func BenchmarkConcurrentMap(b *testing.B) {
    var m sync.Map

    b.RunParallel(func(pb *testing.PB) {
        i := 0
        for pb.Next() {
            key := strconv.Itoa(i % 1000)
            m.Store(key, i)
            i++
        }
    })
}

b.RunParallel runs the inner function in GOMAXPROCS goroutines simultaneously. pb.Next() is the parallel equivalent of the b.N loop. Use this to measure concurrent data structure performance under realistic contention.

Run with -cpu 1,2,4,8 to see scaling:

go test -bench=BenchmarkConcurrentMap -cpu=1,2,4,8

BenchmarkConcurrentMap-1     5234891    229 ns/op
BenchmarkConcurrentMap-2     8234123    145 ns/op
BenchmarkConcurrentMap-4    14234891     84 ns/op
BenchmarkConcurrentMap-8    14891234     80 ns/op  ← scaling stops at 4

Throughput stops improving between 4 and 8 goroutines, which suggests contention on the shared map dominates beyond four cores.

Profiling Benchmarks

Benchmarks are the ideal environment for CPU and memory profiling — controlled, repeatable, no test overhead:

# CPU profile:
go test -bench=BenchmarkParseOrderMessage -cpuprofile cpu.prof
go tool pprof -http=:8080 cpu.prof

# Memory profile:
go test -bench=BenchmarkParseOrderMessage -memprofile mem.prof
go tool pprof -http=:8080 mem.prof

# Both:
go test -bench=BenchmarkParseOrderMessage \
    -cpuprofile cpu.prof -memprofile mem.prof

The web UI from go tool pprof -http gives you flame graphs, top functions by CPU time, allocation sites — everything you need to find the hot path.

benchstat: Comparing Results

The benchstat tool from golang.org/x/perf compares benchmark runs statistically:

go test -bench=. -count=10 > before.txt
# Make your change
go test -bench=. -count=10 > after.txt
benchstat before.txt after.txt

name                     old time/op  new time/op  delta
ParseOrderMessage-8       420ns ± 1%   187ns ± 2%  -55.5%  (p=0.000 n=10+10)
ParseOrderMessage/json-8  419ns ± 1%   186ns ± 2%  -55.6%  (p=0.000 n=10+10)

name                     old alloc/op  new alloc/op  delta
ParseOrderMessage-8        256B ± 0%      0B ± 0%   -100.0%  (p=0.000 n=10+10)

p=0.000 means the difference is statistically significant (Mann-Whitney test). Small n values make it easy to get false significance — use -count=10 or more.

The Cache Warming Trap

Benchmarks run in-process with warm caches. Production code starts with cold caches after a deployment or on low-traffic instances. If your code is cache-sensitive, the benchmark can paint an over-optimistic picture that won't hold in production.

For cache-sensitive benchmarks, add explicit cache thrashing:

11
func BenchmarkLookup(b *testing.B) {
    table := buildLookupTable(100000)
    keys := shuffledKeys(table) // the table's keys in random order (helper not shown)

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        // Rotate through many keys so each lookup touches a different
        // part of the table rather than a warm cache line:
        key := keys[i%len(keys)]
        result := table.Lookup(key)
        _ = result
    }
}

Or use a working set large enough that the data doesn't fit in cache:

data := make([]byte, 64<<20)  // 64 MiB working set, larger than a typical L3 cache

Good benchmarks are the difference between “I think this is faster” and “this is 23% faster at p99 under realistic load.” Given how easy Go makes writing and running benchmarks, there’s no excuse for the former when you can have the latter.