Go’s goroutine scheduler sits between your code and the OS. Understanding it is useful not because you’ll tune it (you rarely should) but because its behaviour explains a class of performance surprises and concurrency patterns that look odd until you see why they exist.
The Three-Layer Model: G, M, P
The Go scheduler operates with three entities:
- G (goroutine): the unit of concurrency. Starts with a small stack (2KB in current Go) that grows as needed. Cheap to create — you can have hundreds of thousands.
- M (machine): an OS thread. Expensive. The scheduler creates and parks these as needed.
- P (processor): the scheduling context. Holds a run queue of goroutines. The number of Ps is GOMAXPROCS (default: number of logical CPU cores).
The relationship: M runs goroutines by acquiring a P. If there are no Ps available, M parks. If all Ps are busy, goroutines queue up in run queues or the global queue.
This M:N mapping (many goroutines onto fewer OS threads) is why goroutines are cheap — you’re not paying the ~512KB OS thread stack or the OS scheduler overhead per goroutine.
GOMAXPROCS and When to Change It
GOMAXPROCS controls how many OS threads can execute Go code simultaneously. Default is number of logical CPU cores.
You’d change it if:
- Running in a container with a CPU quota, where runtime.NumCPU() returns the host's core count (not the quota). Set GOMAXPROCS to match the quota, or use automaxprocs (the go.uber.org/automaxprocs library reads the cgroup quota and sets it correctly on startup).
- You're I/O-bound and want more threads to overlap blocking work. But adding goroutines is usually a better solution than increasing GOMAXPROCS above core count.
- Profiling shows excessive context switching, suggesting you have more runnable goroutines than Ps. Increasing GOMAXPROCS doesn’t help here — the problem is goroutine count, not P count.
A common mistake: running Go in Kubernetes with CPU limits without setting GOMAXPROCS to match. A container limited to 2 CPUs but running with GOMAXPROCS=64 (the host’s core count) creates unnecessary parallelism, lock contention, and GC pressure. We saw this in staging — a service performing fine locally that degraded badly under Kubernetes CPU limits until we added automaxprocs.
Work Stealing
Each P has a local run queue (capacity 256). When a P’s local queue is empty, it steals goroutines from other Ps’ run queues, taking half of the victim’s queue. This is work stealing, and it keeps CPUs busy without a central scheduler bottleneck.
The practical implication: goroutines are not necessarily executed on the P that created them. If you’re doing goroutine-local caching using runtime-specific state, you’ll be surprised when the goroutine migrates. This is unusual but worth knowing.
Blocking Operations and the Handoff Mechanism
When a goroutine blocks on a syscall (disk read, network I/O), the M executing it also blocks — it’s in kernel space and can’t be preempted by the Go scheduler. To avoid all Ps being stuck waiting for syscalls:
The scheduler hands off the blocked M’s P to another M (either an idle M or a newly created one). The blocked M continues waiting for the syscall; when it completes, it tries to re-acquire a P. If none are available, the goroutine goes into the global queue and the M parks.
This is why Go can have high goroutine concurrency without all of them mapping to OS threads — most are waiting, and the scheduler efficiently multiplexes the ones that are runnable.
For network I/O specifically, Go uses a non-blocking I/O model with a network poller (epoll/kqueue/IOCP). Network goroutines don’t block the M at all — they park the goroutine and register with the poller. When the network event fires, the goroutine is made runnable again. This is why Go’s network servers can handle thousands of concurrent connections with minimal OS thread count.
The Preemption Evolution
Early Go (pre-1.14) had cooperative preemption: goroutines yielded only at function calls or blocking operations. A goroutine in a tight loop with no function calls or blocking ops could monopolise an M indefinitely, preventing other goroutines from running on that P.
Go 1.14 added asynchronous preemption via signals (SIGURG on Unix). The runtime signals the M, interrupting the goroutine at a safe point. Tight CPU loops are now preemptible.
This matters for: CPU-bound goroutines with tight loops, and for GC (which needs all goroutines to reach safepoints). Pre-1.14, a tight loop could delay GC, which delayed all other goroutines. Post-1.14, this class of problem largely disappears.
A Practical Debugging Example
We had a service where some requests had multi-second latency spikes under load, with no obvious cause in traces. GODEBUG=schedtrace=1000 (prints scheduler state every 1000ms) showed:
SCHED 1001ms: gomaxprocs=8 idleprocs=0 threads=9 spinningthreads=0 idlethreads=0 runqueue=847 ...
runqueue=847 — 847 goroutines in the global run queue, waiting for a P. All 8 Ps busy. The request handler goroutines were being created faster than they could be scheduled.
The root cause: we were spawning a goroutine per incoming message without any backpressure. Under load, the goroutine backlog grew unbounded, and new request goroutines queued for seconds before being scheduled. Fixing the producer side (rate limiting + a worker pool) dropped the runqueue to near zero and eliminated the spikes.
GODEBUG=schedtrace is the first tool to reach for when you have mysterious latency under load and the traces look clean.