Every Go service should have profiling endpoints enabled by default. The overhead of having net/http/pprof imported and listening is negligible — a few goroutines, no continuous sampling. The payoff when you need it is enormous.

This post is about the workflow I use for diagnosing real performance problems in production Go services.

Setup: Enabling pprof

For HTTP services:

import _ "net/http/pprof"  // registers handlers on DefaultServeMux

// Or, with a custom mux, import "net/http/pprof" by name and register the handlers:
mux.HandleFunc("/debug/pprof/", pprof.Index)
mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
mux.HandleFunc("/debug/pprof/trace", pprof.Trace)

Important: don’t expose pprof on a public-facing port. Either:

  • Run a separate internal HTTP server on localhost:6060
  • Protect the /debug/pprof/ path with authentication
  • Use a dedicated management interface

For gRPC or custom protocol services:

go func() {
    log.Println(http.ListenAndServe("localhost:6060", nil))
}()

A separate HTTP server in its own goroutine, bound to localhost — reachable via kubectl port-forward or an SSH tunnel, not from the internet.

CPU Profiling

Capture a 30-second CPU profile:

# From within the cluster (port-forward first):
kubectl port-forward pod/order-service-abc123 6060:6060

# Capture:
go tool pprof 'http://localhost:6060/debug/pprof/profile?seconds=30'

This runs pprof in interactive mode. The most useful commands:

(pprof) top10
Showing nodes accounting for 8.23s, 82.30% of 10.00s total
  flat  flat%   sum%        cum   cum%
 3.12s 31.20% 31.20%      3.12s 31.20%  runtime.mallocgc
 1.84s 18.40% 49.60%      2.23s 22.30%  encoding/json.Marshal
 0.91s  9.10% 58.70%      0.91s  9.10%  runtime.memmove
 0.78s  7.80% 66.50%      0.78s  7.80%  bytes.(*Buffer).Write
 ...

flat is time spent in the function itself. cum is time in the function plus everything it calls.

This output says: 31% of CPU time is in mallocgc (memory allocation) and 18% is in json.Marshal. Both are symptoms of excessive allocation in the request path — most likely from JSON serialisation of response objects.
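When the profile points at allocation in the serialisation path, one common mitigation is to reuse encoders and buffers instead of allocating per request. A minimal sketch (the Order type, marshalPooled helper, and pool are illustrative, not from the profiled service):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"sync"
)

// bufPool reuses buffers across requests, so each response doesn't
// allocate a fresh backing array (the runtime.mallocgc pressure above).
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

type Order struct {
	ID    string  `json:"id"`
	Total float64 `json:"total"`
}

// marshalPooled encodes v into a pooled buffer and returns its bytes.
// The caller must finish with the bytes before returning the buffer.
func marshalPooled(v any) ([]byte, *bytes.Buffer) {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	json.NewEncoder(buf).Encode(v) // error handling elided for brevity
	return buf.Bytes(), buf
}

func main() {
	b, buf := marshalPooled(Order{ID: "o-1", Total: 9.5})
	fmt.Print(string(b)) // {"id":"o-1","total":9.5}  (Encoder adds a newline)
	bufPool.Put(buf)
}
```

Whether this is worth the complexity depends on the allocation profile: if mallocgc time drops in a follow-up profile, the change paid off.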

The flame graph is usually more useful than top10:

go tool pprof -http=:8080 'http://localhost:6060/debug/pprof/profile?seconds=30'

This opens a browser UI with an interactive flame graph. The width of each bar is proportional to CPU time. Click a bar to zoom in. Look for wide bars that are deeper than expected — that’s where the time is going.

Memory Profiling

Three distinct memory profiles:

Heap profile — current live allocations:

go tool pprof http://localhost:6060/debug/pprof/heap

Allocation profile — all allocations since start (or since last reset):

go tool pprof http://localhost:6060/debug/pprof/allocs

Goroutine profile — what every goroutine is doing right now:

go tool pprof http://localhost:6060/debug/pprof/goroutine

For memory leak investigation, the heap profile is most useful:

(pprof) top10 -cum
Showing nodes accounting for 1234.5MB, 92.10% of 1340.5MB total
      flat  flat%   sum%        cum   cum%
   843.2MB 62.90% 62.90%    843.2MB 62.90%  github.com/company/cache.(*LRU).Set
   231.4MB 17.26% 80.16%    231.4MB 17.26%  encoding/json.(*Decoder).more

843MB of live memory allocated in cache.(*LRU).Set. That's the leak — something is filling the LRU cache and it isn't evicting. The -cum option sorts by cumulative allocation, which surfaces the functions holding memory rather than the leaf allocators.
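The fix for this class of leak is to bound the cache. A sketch of the missing eviction using container/list (the real cache.(*LRU) is internal to the service; boundedLRU here only illustrates the eviction it lacked):

```go
package main

import (
	"container/list"
	"fmt"
)

type entry struct {
	key string
	val any
}

// boundedLRU evicts its oldest entry once it reaches maxEntries,
// so Set can never grow the heap without bound.
type boundedLRU struct {
	maxEntries int
	order      *list.List               // front = most recently used
	items      map[string]*list.Element // key -> element in order
}

func newBoundedLRU(max int) *boundedLRU {
	return &boundedLRU{
		maxEntries: max,
		order:      list.New(),
		items:      map[string]*list.Element{},
	}
}

func (c *boundedLRU) Set(key string, val any) {
	if el, ok := c.items[key]; ok {
		c.order.MoveToFront(el)
		el.Value.(*entry).val = val
		return
	}
	c.items[key] = c.order.PushFront(&entry{key, val})
	if c.order.Len() > c.maxEntries {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}

func main() {
	c := newBoundedLRU(2)
	c.Set("a", 1)
	c.Set("b", 2)
	c.Set("c", 3)             // evicts "a", the least recently used
	fmt.Println(len(c.items)) // 2
}
```

After deploying a bound, a second heap profile should show cache.(*LRU).Set flat-lining instead of growing with uptime.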

The Goroutine Leak Pattern

Goroutine profiles are the fastest way to diagnose goroutine leaks:

curl -s 'http://localhost:6060/debug/pprof/goroutine?debug=1' | head -100

Normal output for a healthy service: a small number of goroutines, all in expected states (blocked on channel recv, in select, in network I/O).

Leaking goroutine signature — with debug=1, identical stacks are aggregated under a count:

1000 @ 0x432fd9 0x44cf41 0x512b23 0x4ca...
# 0x512b23  github.com/company/processor.processRequest+0x123
#           github.com/company/processor.(*Worker).run+0x89

With debug=2, each goroutine is listed individually with its state and wait time:

goroutine 1001 [select, 48 minutes]:
  github.com/company/processor.processRequest(...)

1000 goroutines stuck in the same select for 48 minutes — each one was launched by a request handler and never cleaned up. This is a context leak: the goroutines are waiting on a channel that will never send, because the originating request context was cancelled long ago.

// The bug:
go func() {
    result := <-resultCh  // blocks forever if context was cancelled
    process(result)
}()

// The fix:
go func() {
    select {
    case result := <-resultCh:
        process(result)
    case <-ctx.Done():
        return  // clean up when context is cancelled
    }
}()

Continuous Profiling in Production

Ad-hoc profiling is for investigation. Continuous profiling captures profiles on a schedule for retrospective analysis:

import (
    "bytes"
    "runtime/pprof"
    "time"
)

// Sample the heap every 5 minutes; profileStore (your own storage
// layer) retains the last hour, e.g. in a ring buffer:
go func() {
    ticker := time.NewTicker(5 * time.Minute)
    defer ticker.Stop()
    for range ticker.C {
        var buf bytes.Buffer
        if err := pprof.WriteHeapProfile(&buf); err != nil {
            continue // don't let a transient failure kill the sampler
        }
        profileStore.Save("heap", time.Now(), buf.Bytes())
    }
}()

In production, I use Grafana Pyroscope (formerly Phlare) or Datadog Continuous Profiler for this — they integrate with pprof natively and provide the UI for querying and comparing profiles over time.

The scenario this enables: “CPU was high yesterday between 14:00 and 15:00, what was different?” Pull the profiles from that window, compare with a normal period, find the difference.

The pprof Workflow for a Real Incident

The service was using 85% CPU during peak hours. Expected was 40%.

# Step 1: Capture CPU profile during the high-CPU window
go tool pprof 'http://service:6060/debug/pprof/profile?seconds=60'

# Step 2: top10 — what's consuming the CPU?
(pprof) top10
  2.8s  encoding/json.Marshal
  1.9s  runtime.mallocgc
  1.2s  bytes.(*Buffer).Grow
  ...

# Step 3: Drill into json.Marshal
(pprof) list encoding/json.Marshal
# Annotated source: line-by-line time inside Marshal itself
# ("peek encoding/json.Marshal" shows which callers account for the time)

# Step 4: Identify source
(pprof) web  # opens flame graph in browser

The flame graph showed that json.Marshal was being called inside a tight loop that processed market data updates. The original code marshalled each update to JSON for logging:

for _, update := range updates {
    log.Debugf("processing update: %s", jsonMust(update))  // marshals even when debug is off
    process(update)
}

The log.Debugf call evaluated its arguments even when debug logging was disabled. jsonMust(update) marshalled the update to JSON on every iteration — 5,000 times per second.

Fix: lazy evaluation:

if log.IsDebug() {
    log.Debugf("processing update: %s", jsonMust(update))
}
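An alternative to guarding every call site is to pass a value that implements fmt.Stringer, so the marshal runs only if the logger actually formats its arguments. A sketch (lazyJSON is a hypothetical helper, not from the service's codebase):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// lazyJSON defers marshalling until the value is actually formatted.
// If a logger never formats its args (debug disabled), Marshal never runs.
type lazyJSON struct{ v any }

func (l lazyJSON) String() string {
	b, err := json.Marshal(l.v)
	if err != nil {
		return err.Error()
	}
	return string(b)
}

func main() {
	u := map[string]int{"qty": 3}
	// %s invokes String() only when the message is actually built:
	fmt.Printf("processing update: %s\n", lazyJSON{u}) // processing update: {"qty":3}
}
```

This only works with loggers that skip formatting when the level is disabled; the wrapper still allocates when boxed into the interface argument, but the expensive JSON encoding is deferred.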

CPU dropped from 85% to 41%.

Total investigation time: 15 minutes. The tooling worked because it was already enabled, the profile was captured quickly, and the flame graph pointed directly at the issue.