A service that calls a database, calls another service, or writes to a message queue will eventually encounter a failure. The question is not whether it will fail, but how: does your service handle the failure gracefully, or cascade it into a larger outage?

These patterns are well-documented in the resilience literature. What this post focuses on is the specific Go implementation and the traps that make naive implementations incorrect.

Retry with Exponential Backoff and Jitter

The simplest resilience pattern. When an operation fails, wait a bit and try again.

The naive version:

// WRONG: fixed delay, no jitter, no limit
for {
    err := callService()
    if err == nil { break }
    time.Sleep(1 * time.Second)
}

Problems:

  1. No maximum retry count — loops forever if the service is permanently down
  2. Fixed delay — all retrying callers wake simultaneously, creating a thundering herd
  3. No distinction between retriable and non-retriable errors (don’t retry a 400 Bad Request)

The correct version:

type RetryConfig struct {
    MaxAttempts  int
    InitialDelay time.Duration
    MaxDelay     time.Duration
    Multiplier   float64
}

var DefaultRetry = RetryConfig{
    MaxAttempts:  5,
    InitialDelay: 100 * time.Millisecond,
    MaxDelay:     30 * time.Second,
    Multiplier:   2.0,
}

func Retry(ctx context.Context, cfg RetryConfig, op func() error) error {
    delay := cfg.InitialDelay

    for attempt := 1; attempt <= cfg.MaxAttempts; attempt++ {
        err := op()
        if err == nil {
            return nil
        }

        // Don't retry non-retriable errors:
        if !isRetriable(err) {
            return err
        }

        if attempt == cfg.MaxAttempts {
            return fmt.Errorf("after %d attempts: %w", attempt, err)
        }

        // Jitter: randomize the delay by up to ±50% so retrying
        // callers don't wake in lockstep:
        jitter := time.Duration(rand.Int63n(int64(delay / 2)))
        if rand.Intn(2) == 0 {
            jitter = -jitter
        }
        sleepDuration := delay + jitter

        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(sleepDuration):
        }

        delay = time.Duration(float64(delay) * cfg.Multiplier)
        if delay > cfg.MaxDelay {
            delay = cfg.MaxDelay
        }
    }
    return nil
}

func isRetriable(err error) bool {
    // Don't retry context cancellation or permanent errors:
    if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
        return false
    }
    var httpErr *HTTPError
    if errors.As(err, &httpErr) {
        return httpErr.StatusCode >= 500  // retry 5xx, not 4xx
    }
    return true  // default: retry
}
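
The HTTPError type used by isRetriable is assumed rather than defined above. A minimal version that cooperates with errors.As might look like this (the field names are illustrative):

type HTTPError struct {
    StatusCode int
    Status     string
}

func (e *HTTPError) Error() string {
    return fmt.Sprintf("http %d: %s", e.StatusCode, e.Status)
}

Because isRetriable uses errors.As rather than a type assertion, the client is free to wrap the error — fmt.Errorf("pricing call: %w", &HTTPError{...}) is still classified correctly.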

The ctx.Done() check in the sleep loop is critical: if the request context is cancelled while the retry is sleeping, the retry exits immediately rather than sleeping for the full duration.
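
The ±50% scheme keeps the average sleep equal to the base delay. A common alternative is "full jitter" (described in AWS's writing on backoff), which sleeps a uniformly random duration in [0, delay] — lower average delay, maximal spread between callers. A minimal sketch:

// fullJitter returns a uniformly random duration in [0, d].
func fullJitter(d time.Duration) time.Duration {
    if d <= 0 {
        return 0
    }
    return time.Duration(rand.Int63n(int64(d) + 1))
}

To use it, replace the jitter computation in Retry with sleepDuration := fullJitter(delay).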

Circuit Breaker

A circuit breaker prevents a failing dependency from absorbing all your retry attempts (and your goroutines and connections) when it’s clearly unhealthy.

States:
  CLOSED → normal operation, requests pass through
  OPEN   → dependency failing, requests fail fast without trying
  HALF-OPEN → testing recovery, trial requests allowed through
type CircuitBreaker struct {
    mu            sync.Mutex
    state         State
    failures      int
    successes     int
    lastFailure   time.Time
    threshold     int           // failures before opening
    timeout       time.Duration // how long to stay open
    halfOpenMax   int           // successes needed to close
}

func (cb *CircuitBreaker) Execute(op func() error) error {
    cb.mu.Lock()
    state := cb.currentState()

    if state == StateOpen {
        cb.mu.Unlock()
        return ErrCircuitOpen
    }
    cb.mu.Unlock()

    err := op()

    cb.mu.Lock()
    defer cb.mu.Unlock()

    if err != nil {
        cb.failures++
        cb.successes = 0
        cb.lastFailure = time.Now()
        // A failure while half-open reopens immediately; while closed,
        // open once the failure threshold is reached:
        if cb.state == StateHalfOpen || cb.failures >= cb.threshold {
            cb.state = StateOpen
        }
    } else {
        cb.successes++
        if cb.state == StateHalfOpen && cb.successes >= cb.halfOpenMax {
            cb.state = StateClosed
            cb.failures = 0
        }
    }

    return err
}

// currentState reports the breaker's state, moving open → half-open once
// the timeout has elapsed. Callers must hold cb.mu.
func (cb *CircuitBreaker) currentState() State {
    if cb.state == StateOpen {
        if time.Since(cb.lastFailure) > cb.timeout {
            cb.state = StateHalfOpen
            cb.successes = 0
        }
    }
    return cb.state
}
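
The State type and ErrCircuitOpen sentinel referenced above aren't shown; minimal definitions might be:

// State is the circuit breaker's mode.
type State int

const (
    StateClosed   State = iota // normal operation
    StateOpen                  // failing fast
    StateHalfOpen              // probing for recovery
)

// ErrCircuitOpen is returned when the breaker refuses a call.
var ErrCircuitOpen = errors.New("circuit breaker is open")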

Usage:

var breaker = &CircuitBreaker{
    threshold:   5,              // open after 5 consecutive failures
    timeout:     30 * time.Second,
    halfOpenMax: 2,
}

func callPricingService(ctx context.Context, req PriceRequest) (Price, error) {
    var price Price
    err := breaker.Execute(func() error {
        var err error
        price, err = pricingClient.Get(ctx, req)
        return err
    })
    if errors.Is(err, ErrCircuitOpen) {
        return Price{}, fmt.Errorf("pricing service unavailable (circuit open): %w", err)
    }
    return price, err
}

When the pricing service fails 5 times, the circuit opens. Subsequent calls fail immediately with ErrCircuitOpen, without touching the service. After 30 seconds the breaker moves to half-open and lets calls through to probe recovery; two consecutive successes close the circuit again. (Note that this sketch does not cap how many calls pass through while half-open; production implementations usually allow only one probe at a time.)

Combining Retry and Circuit Breaker

The order matters: retry wraps the circuit breaker, not the other way around.

err := Retry(ctx, DefaultRetry, func() error {
    return breaker.Execute(func() error {
        return callDependency(ctx, req)
    })
})

When the circuit is open, breaker.Execute returns ErrCircuitOpen immediately. But ErrCircuitOpen is not retriable (the dependency is known-down), so the retry stops. This is the intended behaviour — don’t retry when the circuit is open.

Mark ErrCircuitOpen as non-retriable in isRetriable:

func isRetriable(err error) bool {
    if errors.Is(err, ErrCircuitOpen) {
        return false
    }
    // ... other non-retriable errors
    return true
}

Timeout Hierarchy

Every outbound call should have a timeout shorter than the caller’s context:

func processRequest(ctx context.Context, req Request) (Response, error) {
    // This context has the user's request deadline (e.g. 5s)

    // Call pricing service with a tighter deadline:
    priceCtx, cancel := context.WithTimeout(ctx, 500*time.Millisecond)
    defer cancel()

    price, err := pricingService.Get(priceCtx, req.Symbol)
    if err != nil {
        // Pricing failed but we can fall back:
        price = fallbackPrice(req.Symbol)
    }

    // Call inventory with another tight deadline:
    invCtx, cancel := context.WithTimeout(ctx, 200*time.Millisecond)
    defer cancel()

    inventory, err := inventoryService.Check(invCtx, req.ItemID)
    if err != nil {
        return Response{}, fmt.Errorf("inventory check: %w", err)
    }

    return buildResponse(price, inventory), nil
}

If the pricing service takes 600ms and the timeout is 500ms, we get a timeout error after 500ms, trigger the fallback, and still respond to the user within the 5s deadline. Without per-call timeouts, a slow pricing service would hold the request open until the full 5s deadline.

The rule: set individual call timeouts to a fraction (10–25%) of the calling context’s remaining deadline.
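
One way to apply that rule programmatically is a helper that reads the remaining deadline from the context. The name callTimeout and the clamping policy are illustrative, not a standard API:

// callTimeout returns fraction of the context's remaining deadline,
// clamped to [floor, ceil]. If the context has no deadline, ceil is used.
func callTimeout(ctx context.Context, fraction float64, floor, ceil time.Duration) time.Duration {
    deadline, ok := ctx.Deadline()
    if !ok {
        return ceil
    }
    d := time.Duration(float64(time.Until(deadline)) * fraction)
    if d < floor {
        return floor
    }
    if d > ceil {
        return ceil
    }
    return d
}

The pricing call above would then become priceCtx, cancel := context.WithTimeout(ctx, callTimeout(ctx, 0.25, 50*time.Millisecond, 500*time.Millisecond)), which shrinks automatically as the parent deadline runs down.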


These patterns are not optional for production services. Without retries, every transient blip in a dependency becomes a user-visible error. Without circuit breakers, a dependency’s incident cascades into your own outage. The implementation cost is a few hundred lines of tested, reusable code; the operational benefit is repaid every time a dependency has an incident.