The first thing every Java developer learns about goroutines is that they’re cheap. Start a million of them, no problem. The mental model that follows from this — “goroutines are threads but lighter” — is close enough to be useful and wrong enough to cause confusion.
Here’s the refined model.
What a Thread Actually Is
An OS thread is a unit of CPU scheduling managed by the kernel. Each thread has:
- A kernel-managed execution context (register state, program counter)
- A stack (typically 1–8MB of address space, reserved up front)
- A spot in the kernel’s scheduler queues
Context-switching between threads requires a kernel transition, saving and restoring all CPU registers, and updating kernel data structures. On a modern CPU this takes ~1–10 microseconds. With thousands of threads, context-switching overhead becomes significant.
What a Goroutine Is
A goroutine is a unit of execution managed by the Go runtime scheduler, not the kernel. Each goroutine has:
- A Go runtime execution context (goroutine stack, program counter, etc.)
- A stack that starts at 2KB and grows as needed (contiguous, grown by copying; Go dropped segmented stacks in 1.4)
- A spot in the Go scheduler’s run queues
The Go scheduler (M:N scheduling) multiplexes goroutines onto OS threads. With GOMAXPROCS=4, the runtime creates 4 P’s (logical processors); each P runs on an OS thread (an M, for “machine”) when it has work, and the scheduler distributes goroutines (G’s) across them:
┌─────────────────────────────────────────────────────────────┐
│ Go Runtime Scheduler │
│ │
│ P0 (processor) P1 P2 P3 │
│ ┌────────────┐ ┌────────────┐ ┌──────────┐ ┌────────┐ │
│ │ runqueue │ │ runqueue │ │ runqueue │ │runqueue│ │
│ │ G1,G2,G3 │ │ G4,G5 │ │ G6 │ │ G7,G8 │ │
│ └─────┬──────┘ └────┬───────┘ └────┬─────┘ └───┬────┘ │
│ │ │ │ │ │
│ M0 M1 M2 M3 │
└────────┼───────────────┼───────────────┼─────────────┼──────┘
│ │ │ │
OS Thread 0 OS Thread 1 OS Thread 2 OS Thread 3
Each P (processor) has a local run queue. G’s in the local queue run on that P’s M without any kernel involvement. Context-switching between goroutines is cooperative (goroutines yield at function calls, channel operations, syscalls) and happens in userspace — no kernel transition, no register dump, much cheaper.
Where the Model Breaks Down
Blocking syscalls park the OS thread. Note that network reads and time.Sleep don’t actually block a thread: the netpoller and the runtime’s timers park the goroutine in userspace. But when a goroutine makes a genuinely blocking syscall (an ordinary file read, for example), the Go runtime:
- Detaches the M from its P
- Lets the goroutine block on the syscall (in the OS)
- Creates or wakes a spare M to take over the P and keep running other goroutines
This is transparent to the Go programmer — from the goroutine’s perspective, the blocking call just blocked. From the runtime’s perspective, the OS thread was parked and a new one took its place.
This is why goroutines can “block” on I/O without blocking other goroutines. The runtime handles the multiplexing.
But CGO and certain syscalls are different. When a goroutine calls into C code via CGO, the Go runtime can’t do the park-and-replace trick because it doesn’t control the C call stack. The OS thread is truly blocked. With many concurrent CGO calls, you can exhaust GOMAXPROCS and starve the Go scheduler. This is one reason CGO is discouraged in hot paths.
CPU-bound goroutines don’t yield voluntarily. A goroutine in a tight compute loop with no function calls (or only inlined function calls) won’t yield to other goroutines on the same P. Go 1.14 added asynchronous preemption — the runtime sends a signal that interrupts a running goroutine at a safe point — but this is still coarser than OS thread preemption.
The Scheduler’s Work-Stealing
When a P’s local run queue is empty, it steals goroutines from other P’s queues, taking half of the victim’s queue in one go (it also checks the global run queue and the netpoller).
Work-stealing means load distributes automatically. If you spawn 1,000 goroutines from code running on P0, the other P’s will quickly steal work out of its queue. You don’t need to distribute goroutines across processors by hand.
Practical Consequences
You can spawn goroutines freely for concurrent I/O. Waiting on network calls, database queries, file reads — each can be its own goroutine. The runtime handles the thread multiplexing efficiently.
CPU-bound work needs manual parallelism control. Spawning 10,000 goroutines for CPU-bound tasks doesn’t give you 10,000× parallelism — you have GOMAXPROCS cores’ worth. Excessive goroutines mean scheduling overhead and churn. For CPU-bound parallel work, keep concurrency at or below GOMAXPROCS.
Goroutine leaks are real. Unlike threads (which the OS cleans up), a goroutine blocked on a channel or mutex that’s never signalled leaks forever. A service that creates goroutines in request handlers can accumulate thousands of leaked goroutines over time.
The fix: tie goroutine lifetimes to a context.Context with a timeout, ensure every channel a goroutine blocks on is eventually closed or sent to, and select on ctx.Done() alongside any blocking channel operation.
Stack growth is automatic but not free. Goroutine stacks start small (2KB) and grow by doubling when exhausted. Growth copies the entire stack to a new, larger allocation and adjusts pointers into it. This is fast but not free — code that repeatedly triggers stack growth in a tight loop has measurable overhead. The //go:nosplit directive marks functions that must not trigger a stack check, but it is a rare, low-level tool used mostly inside the runtime.
The Comparison Table
| Property | OS Thread | Goroutine |
|---|---|---|
| Creation cost | ~10µs, 1MB+ stack | ~0.3µs, 2KB stack |
| Context-switch | ~1–10µs (kernel) | ~0.1µs (userspace) |
| Blocking I/O | Blocks OS thread | Yields to other goroutines |
| Blocking CPU | Preempted by OS | Preempted by Go 1.14+ signals |
| Stack size | Fixed (large) | Dynamic (starts small) |
| Scheduling | OS kernel | Go runtime |
| GOMAXPROCS limit | No | Yes (for simultaneous execution) |
The model to internalise: goroutines are cheap to create and switch between, but they still run on OS threads. The scheduler is good at hiding this — right up until you do something that makes OS threads expensive again (CGO, long blocking syscalls without the runtime’s I/O multiplexer, excessive CPU-bound work without yielding).