Go has first-class benchmarking built into the `testing` package. `go test -bench=.` is enough to get started. The hard part isn’t running benchmarks — it’s writing ones that measure what you intend to measure.
These are the patterns I’ve found essential and the mistakes I’ve made repeatedly enough to write down.
## The Basic Structure
```go
func BenchmarkParseOrderMessage(b *testing.B) {
    msg := []byte(`{"id":"ord_123","symbol":"AAPL","qty":100,"price":145.23}`)
    b.ResetTimer() // don't count setup time
    for i := 0; i < b.N; i++ {
        _, err := parseOrderMessage(msg)
        if err != nil {
            b.Fatal(err)
        }
    }
}
```
`b.N` is set by the framework — it runs the loop enough times to get a stable measurement. `b.ResetTimer()` after setup ensures setup time doesn’t pollute the measurement.
Running it:
```
go test -bench=BenchmarkParseOrderMessage -benchmem -count=5 ./...
BenchmarkParseOrderMessage-8    2847193    421.3 ns/op    256 B/op    4 allocs/op
BenchmarkParseOrderMessage-8    2891034    418.9 ns/op    256 B/op    4 allocs/op
BenchmarkParseOrderMessage-8    2904821    419.7 ns/op    256 B/op    4 allocs/op
BenchmarkParseOrderMessage-8    2886710    420.1 ns/op    256 B/op    4 allocs/op
BenchmarkParseOrderMessage-8    2878923    419.4 ns/op    256 B/op    4 allocs/op
```
`-benchmem` adds allocation stats. `-count=5` runs each benchmark 5 times so you can see variance.
## The Dead Code Elimination Problem
The most common benchmark mistake: the compiler eliminates the code you’re benchmarking.
```go
// WRONG: the compiler may optimise away hash(s) since the result is unused.
func BenchmarkHashWrong(b *testing.B) {
    s := "hello, world"
    for i := 0; i < b.N; i++ {
        hash(s) // result unused — compiler is allowed to remove this call
    }
}

// CORRECT: use the result to prevent elimination.
func BenchmarkHash(b *testing.B) {
    s := "hello, world"
    var result uint64
    for i := 0; i < b.N; i++ {
        result = hash(s)
    }
    _ = result // prevent result from being optimised away
}
```
The `_ = result` after the loop tells the compiler the result is used, which keeps both the call and the loop from being eliminated. Reading the variable outside the loop is what matters: a local that is only assigned inside the loop and never read could still be discarded. (Since Go 1.24, `for b.Loop()` replaces the `b.N` loop and keeps loop-body results alive automatically.)
For more complex cases, use the sink pattern from the standard library:
```go
// Package-level sink: a write to it is observable outside the function,
// so the compiler cannot discard the values stored in it.
var Sink interface{}

func BenchmarkComplexOp(b *testing.B) {
    var result SomeStruct
    for i := 0; i < b.N; i++ {
        result = complexOperation()
    }
    Sink = result // escapes to heap, defeating optimisation of result
}
```
## Allocation Benchmarks: Pre-allocated vs. Fresh
When benchmarking allocation patterns, be clear about whether you’re measuring allocation or logic:
```go
// Measures allocation + processing:
func BenchmarkProcessWithAlloc(b *testing.B) {
    for i := 0; i < b.N; i++ {
        result := make([]Order, 0, 100)
        result = processOrders(input, result)
        _ = result
    }
}

// Measures processing only, with a pre-allocated slice:
func BenchmarkProcessPrealloc(b *testing.B) {
    result := make([]Order, 0, 100)
    for i := 0; i < b.N; i++ {
        result = result[:0] // reset length, keep capacity
        result = processOrders(input, result)
    }
    _ = result
}
```
If your production code pre-allocates, benchmark the pre-allocated version. If you’re trying to measure whether pre-allocation helps, benchmark both and compare.
## Sub-Benchmarks and Table-Driven Benchmarks
```go
func BenchmarkEncode(b *testing.B) {
    cases := []struct {
        name string
        size int
    }{
        {"small", 10},
        {"medium", 1000},
        {"large", 100000},
    }
    for _, tc := range cases {
        b.Run(tc.name, func(b *testing.B) {
            data := generateData(tc.size)
            b.ResetTimer()
            b.SetBytes(int64(len(data))) // enables MB/s reporting
            for i := 0; i < b.N; i++ {
                encoded := encode(data)
                _ = encoded
            }
        })
    }
}
```
`b.SetBytes(n)` adds throughput in MB/s to the output:
```
BenchmarkEncode/small-8     9823401        121 ns/op    826 MB/s    0 B/op
BenchmarkEncode/medium-8     112341      10823 ns/op    924 MB/s    0 B/op
BenchmarkEncode/large-8       10234    1082341 ns/op    924 MB/s    0 B/op
```
MB/s reporting is useful for serialisation, compression, and I/O benchmarks where throughput is the meaningful metric.
## Benchmarking Concurrent Code
```go
func BenchmarkConcurrentMap(b *testing.B) {
    var m sync.Map
    b.RunParallel(func(pb *testing.PB) {
        i := 0
        for pb.Next() {
            key := strconv.Itoa(i % 1000)
            m.Store(key, i)
            i++
        }
    })
}
```
`b.RunParallel` runs the inner function in GOMAXPROCS goroutines simultaneously. `pb.Next()` is the parallel equivalent of the `b.N` loop. Use this to measure concurrent data structure performance under realistic contention.
Run with `-cpu=1,2,4,8` to see scaling:
```
go test -bench=BenchmarkConcurrentMap -cpu=1,2,4,8
BenchmarkConcurrentMap-1     5234891    229 ns/op
BenchmarkConcurrentMap-2     8234123    145 ns/op
BenchmarkConcurrentMap-4    14234891     84 ns/op
BenchmarkConcurrentMap-8    14891234     80 ns/op   ← scaling stops at 4
```
The plateau between 4 and 8 cores suggests lock contention: past four concurrent writers, additional goroutines mostly wait rather than work.
## Profiling Benchmarks
Benchmarks are the ideal environment for CPU and memory profiling — controlled, repeatable, no test overhead:
```sh
# CPU profile:
go test -bench=BenchmarkParseOrderMessage -cpuprofile cpu.prof
go tool pprof -http=:8080 cpu.prof

# Memory profile:
go test -bench=BenchmarkParseOrderMessage -memprofile mem.prof
go tool pprof -http=:8080 mem.prof

# Both:
go test -bench=BenchmarkParseOrderMessage \
    -cpuprofile cpu.prof -memprofile mem.prof
```
The web UI from `go tool pprof -http` gives you flame graphs, top functions by CPU time, allocation sites — everything you need to find the hot path.
## benchstat: Comparing Results
The `benchstat` tool from `golang.org/x/perf` compares benchmark runs statistically:
```
go test -bench=. -count=10 > before.txt
# Make your change
go test -bench=. -count=10 > after.txt
benchstat before.txt after.txt

name                      old time/op    new time/op    delta
ParseOrderMessage-8          420ns ± 1%     187ns ± 2%   -55.5%  (p=0.000 n=10+10)
ParseOrderMessage/json-8     419ns ± 1%     186ns ± 2%   -55.6%  (p=0.000 n=10+10)

name                      old alloc/op   new alloc/op   delta
ParseOrderMessage-8           256B ± 0%       0B ± 0%  -100.0%  (p=0.000 n=10+10)
```
`p=0.000` means the difference is statistically significant (benchstat uses the Mann-Whitney U test). Small sample counts make false significance easy — use `-count=10` or more.
## The Cache Warming Trap
Benchmarks run in-process with warm caches. Production code starts with cold caches after a deployment or on low-traffic instances. If your code is cache-sensitive, the benchmark can paint an over-optimistic picture that production never matches.
For cache-sensitive benchmarks, add explicit cache thrashing (here `shuffledKeys` is a stand-in for collecting the table's keys in random order):

```go
func BenchmarkLookup(b *testing.B) {
    table := buildLookupTable(100000)
    keys := shuffledKeys(table) // every key, in random order (helper not shown)
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        // Access a different key each iteration so the benchmark doesn't
        // just measure a permanently hot handful of cache lines:
        key := keys[i%len(keys)]
        result := table.Lookup(key)
        _ = result
    }
}
```
Or use large enough working sets that the data doesn’t fit in cache:
```go
data := make([]byte, 64<<20) // 64 MiB working set, well beyond a typical L3 cache
```
Good benchmarks are the difference between “I think this is faster” and “this is 23% faster at p99 under realistic load.” Given how easy Go makes writing and running benchmarks, there’s no excuse for the former when you can have the latter.