Every Go service should have profiling endpoints enabled by default. The overhead of having net/http/pprof imported and listening is negligible — a few goroutines, no continuous sampling. The payoff when you need it is enormous.
This post is about the workflow I use for diagnosing real performance problems in production Go services.
Setup: Enabling pprof
For HTTP services:
```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // side-effect import: registers /debug/pprof/ on http.DefaultServeMux
)

func main() {
	// Service handlers register on DefaultServeMux as usual; the pprof
	// endpoints ride along under /debug/pprof/.
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```
Important: don’t expose pprof on a public-facing port. Either:
- Run a separate internal HTTP server on `localhost:6060`
- Protect the `/debug/pprof/` path with authentication
- Use a dedicated management interface
For gRPC or custom protocol services:
```go
import (
	"log"
	"net/http"
	_ "net/http/pprof"
)

func main() {
	go func() {
		// pprof-only HTTP server on localhost; the service itself
		// speaks gRPC on its own listener.
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	// ... start the gRPC server as usual ...
}
```
A separate goroutine listening on localhost — accessible via kubectl port-forward or SSH tunnel, not from the internet.
CPU Profiling
Capture a 30-second CPU profile:
```shell
go tool pprof 'http://localhost:6060/debug/pprof/profile?seconds=30'
```
This runs pprof in interactive mode. The most useful commands:
```
(pprof) top10
Showing nodes accounting for 8.23s, 82.30% of 10.00s total
      flat  flat%   sum%        cum   cum%
     3.12s 31.20% 31.20%      3.12s 31.20%  runtime.mallocgc
     1.84s 18.40% 49.60%      2.23s 22.30%  encoding/json.Marshal
     0.91s  9.10% 58.70%      0.91s  9.10%  runtime.memmove
     0.78s  7.80% 66.50%      0.78s  7.80%  bytes.(*Buffer).Write
...
```
flat is time spent in the function itself. cum is time in the function plus everything it calls.
This output says: 31% of CPU time is in mallocgc (memory allocation) and 18% is in json.Marshal. Both are symptoms of excessive allocation in the request path — most likely from JSON serialisation of response objects.
The flame graph is usually more useful than top10:
```shell
go tool pprof -http=:8080 'http://localhost:6060/debug/pprof/profile?seconds=30'
```
This opens a browser UI with an interactive flame graph. The width of each bar is proportional to CPU time. Click a bar to zoom in. Look for wide bars that are deeper than expected — that’s where the time is going.
Memory Profiling
Three distinct memory profiles:
Heap profile — current live allocations:
```shell
go tool pprof http://localhost:6060/debug/pprof/heap
```
Allocation profile — all allocations since start (or since last reset):
```shell
go tool pprof http://localhost:6060/debug/pprof/allocs
```
Goroutine profile — what every goroutine is doing right now:
```shell
curl 'http://localhost:6060/debug/pprof/goroutine?debug=2'
```
For memory leak investigation, the heap profile is most useful:
```
(pprof) top10 -cum
Showing nodes accounting for 1234.5MB, 92.10% of 1340.5MB total
      flat  flat%   sum%        cum   cum%
   843.2MB 62.90% 62.90%    843.2MB 62.90%  github.com/company/cache.(*LRU).Set
   231.4MB 17.26% 80.16%    231.4MB 17.26%  encoding/json.(*Decoder).more
```
843MB held by cache.(*LRU).Set. That's the leak — something is filling the LRU cache and it isn't evicting. The -cum flag sorts top10 by cumulative allocation, surfacing the holders rather than the leaf allocators.
The Goroutine Leak Pattern
Goroutine profiles are the fastest way to diagnose goroutine leaks:
```shell
curl 'http://localhost:6060/debug/pprof/goroutine?debug=1'
```
Normal output for a healthy service: a small number of goroutines, all in expected states (blocked on channel recv, in select, in network I/O).
Leaking goroutine signature:
```
1000 @ 0x432fd9 0x44cf41 0x512b23 0x4ca...
#	0x512b23	github.com/company/processor.processRequest+0x123
#	github.com/company/processor.(*Worker).run+0x89

goroutine 1001 [select, 48 minutes]:
github.com/company/processor.processRequest(...)
```
1,000 goroutines parked in select for 48 minutes — each request launched a goroutine and never cleaned it up. This is a context leak: the goroutines are waiting on a channel that will never receive a value, because the originating request context was already cancelled.
The usual fix is to make the worker's blocking receive also select on the request context (a sketch — `resultCh` and `handle` stand in for the real code):

```go
func wait(ctx context.Context, resultCh <-chan Result) error {
	select {
	case r := <-resultCh:
		handle(r)
		return nil
	case <-ctx.Done():
		// The request is gone; stop waiting instead of leaking.
		return ctx.Err()
	}
}
```
Continuous Profiling in Production
Ad-hoc profiling is for investigation. Continuous profiling captures profiles on a schedule for retrospective analysis:
A minimal home-grown scheduler looks something like this (a sketch — `upload` is a placeholder for whatever storage you ship profiles to):

```go
import (
	"bytes"
	"runtime/pprof"
	"time"
)

// profileLoop captures a 30-second CPU profile every 5 minutes.
func profileLoop(upload func([]byte)) {
	for {
		var buf bytes.Buffer
		if err := pprof.StartCPUProfile(&buf); err == nil {
			time.Sleep(30 * time.Second)
			pprof.StopCPUProfile()
			upload(buf.Bytes())
		}
		time.Sleep(5 * time.Minute)
	}
}
```
In production, I use Grafana Pyroscope or Datadog Continuous Profiler for this — both speak pprof natively and provide the UI for querying and comparing profiles over time.
The scenario this enables: “CPU was high yesterday between 14:00 and 15:00, what was different?” Pull the profiles from that window, compare with a normal period, find the difference.
The pprof Workflow for a Real Incident
The service was using 85% CPU during peak hours. Expected was 40%.
```shell
# Tunnel to the pod first (names are placeholders), then profile through it:
kubectl port-forward deploy/market-data 6060:6060 &
go tool pprof -http=:8080 'http://localhost:6060/debug/pprof/profile?seconds=30'
```
The flame graph showed that json.Marshal was being called inside a tight loop that processed market data updates. The original code marshalled each update to JSON for logging:
In outline (`updates`, `process`, and the logger are stand-ins for the real code):

```go
for update := range updates {
	process(update)
	// jsonMust marshals to JSON or panics — and it runs on every
	// iteration, whether or not debug logging is enabled.
	log.Debugf("processed update: %s", jsonMust(update))
}
```
The log.Debugf call evaluated its arguments even when debug logging was disabled. jsonMust(update) marshalled the update to JSON on every iteration — 5,000 times per second.
Fix: lazy evaluation:
Guard the expensive argument behind a level check (`IsDebugEnabled` is a stand-in for whatever check your logger exposes):

```go
for update := range updates {
	process(update)
	if log.IsDebugEnabled() {
		// Marshal only when debug output will actually be written.
		log.Debugf("processed update: %s", jsonMust(update))
	}
}
```
CPU dropped from 85% to 41%.
Total investigation time: 15 minutes. The tooling worked because it was already enabled, the profile was captured quickly, and the flame graph pointed directly at the issue.