The first thing every Java developer learns about goroutines is that they’re cheap. Start a million of them, no problem. The mental model that follows from this — “goroutines are threads but lighter” — is close enough to be useful and wrong enough to cause confusion.
Here’s the refined model.
What a Thread Actually Is
An OS thread is a unit of CPU scheduling managed by the kernel. Each thread has:
- A kernel-managed execution context (register state, program counter)
- A stack (typically 1–8MB of address space, reserved up front)
- A spot in the kernel’s scheduler queues
Context-switching between threads requires a kernel transition, saving and restoring all CPU registers, and updating kernel data structures. On a modern CPU this takes ~1–10 microseconds. With thousands of threads, context-switching overhead becomes significant.
What a Goroutine Is
A goroutine is a unit of execution managed by the Go runtime scheduler, not the kernel. Each goroutine has:
- A Go runtime execution context (goroutine stack, program counter, etc.)
- A stack that starts at 2KB and grows as needed (contiguous, grown by copying; Go dropped segmented stacks in 1.4)
- A spot in the Go scheduler’s run queues
The Go scheduler (M:N scheduling) multiplexes goroutines onto OS threads. With GOMAXPROCS=4, the runtime creates 4 P’s (logical processors); each P runs on an OS thread (an M, for “machine”) when it has work, and the scheduler distributes goroutines (G’s) across them:
┌─────────────────────────────────────────────────────────────┐
│ Go Runtime Scheduler │
│ │
│ P0 (processor) P1 P2 P3 │
│ ┌────────────┐ ┌────────────┐ ┌──────────┐ ┌────────┐ │
│ │ runqueue │ │ runqueue │ │ runqueue │ │runqueue│ │
│ │ G1,G2,G3 │ │ G4,G5 │ │ G6 │ │ G7,G8 │ │
│ └─────┬──────┘ └────┬───────┘ └────┬─────┘ └───┬────┘ │
│ │ │ │ │ │
│ M0 M1 M2 M3 │
└────────┼───────────────┼───────────────┼─────────────┼──────┘
│ │ │ │
OS Thread 0 OS Thread 1 OS Thread 2 OS Thread 3
Each P (processor) has a local run queue. G’s in the local queue run on that P’s M without any kernel involvement. Context-switching between goroutines is cooperative (goroutines yield at function calls, channel operations, syscalls) and happens in userspace — no kernel transition, no register dump, much cheaper.
Where the Model Breaks Down
Blocking syscalls park the OS thread. Note that network reads and time.Sleep don’t actually block a thread: the netpoller and the runtime’s timers park the goroutine in userspace. But when a goroutine makes a genuinely blocking syscall (an ordinary file read, for example), the Go runtime:
- Detaches the M from its P
- Lets the goroutine block on the syscall (in the OS)
- Creates or wakes a spare M to take over the P and keep running other goroutines
This is transparent to the Go programmer — from the goroutine’s perspective, the blocking call just blocked. From the runtime’s perspective, the OS thread was parked and a new one took its place.
This is why goroutines can “block” on I/O without blocking other goroutines. The runtime handles the multiplexing.
But CGO and certain syscalls are different. When a goroutine calls into C code via CGO, the Go runtime can’t do the park-and-replace trick because it doesn’t control the C call stack. The OS thread is truly blocked. With many concurrent CGO calls, you can exhaust GOMAXPROCS and starve the Go scheduler. This is one reason CGO is discouraged in hot paths.
CPU-bound goroutines don’t yield voluntarily. A goroutine in a tight compute loop with no function calls (or only inlined function calls) won’t yield to other goroutines on the same P. Go 1.14 added asynchronous preemption — the runtime sends a signal that interrupts a running goroutine at a safe point — but this is still coarser than OS thread preemption.
The Scheduler’s Work-Stealing
When a P’s local run queue is empty, it steals goroutines from other P’s queues, taking half of the victim’s queue in one go (it also checks the global run queue and the netpoller).
Work-stealing means load distributes automatically. If you spawn 1,000 goroutines from code running on P0, the other P’s will quickly steal work out of its queue. You don’t need to distribute goroutines across processors by hand.
Practical Consequences
You can spawn goroutines freely for concurrent I/O. Waiting on network calls, database queries, file reads — each can be its own goroutine. The runtime handles the thread multiplexing efficiently.
CPU-bound work needs manual parallelism control. Spawning 10,000 goroutines for CPU-bound tasks doesn’t give you 10,000× parallelism — you have GOMAXPROCS cores’ worth. Excessive goroutines mean scheduling overhead and churn. For CPU-bound parallel work, keep concurrency at or below GOMAXPROCS.
Goroutine leaks are real. Unlike threads (which the OS cleans up), a goroutine blocked on a channel or mutex that’s never signalled leaks forever. A service that creates goroutines in request handlers can accumulate thousands of leaked goroutines over time.
The fix: tie goroutine lifetimes to a context.Context with a timeout, ensure every channel a goroutine blocks on is eventually closed or sent to, and select on ctx.Done() alongside any blocking channel operation.
Stack growth is automatic but not free. Goroutine stacks start small (2KB) and grow by doubling when exhausted. Growth copies the entire stack to a new, larger allocation and adjusts pointers into it. This is fast but not free — code that repeatedly triggers stack growth in a tight loop has measurable overhead. The //go:nosplit directive marks functions that must not trigger a stack check, but it is a rare, low-level tool used mostly inside the runtime.
The Comparison Table
| Property | OS Thread | Goroutine |
|---|---|---|
| Creation cost | ~10µs, 1MB+ stack | ~0.3µs, 2KB stack |
| Context-switch | ~1–10µs (kernel) | ~0.1µs (userspace) |
| Blocking I/O | Blocks OS thread | Yields to other goroutines |
| Blocking CPU | Preempted by OS | Preempted by Go 1.14+ signals |
| Stack size | Fixed (large) | Dynamic (starts small) |
| Scheduling | OS kernel | Go runtime |
| GOMAXPROCS limit | No | Yes (for simultaneous execution) |
The model to internalise: goroutines are cheap to create and switch between, but they still run on OS threads. The scheduler is good at hiding this — right up until you do something that makes OS threads expensive again (CGO, long blocking syscalls without the runtime’s I/O multiplexer, excessive CPU-bound work without yielding).