At a large US tech company, I inherited a service fronted by a cache. The cache was fast — it served 98% of requests with <1ms latency. The 2% cache misses hit the database, which took 50–200ms.

Then the cache cluster had a rolling restart during a traffic spike. For three minutes, the cache hit rate dropped to 30%. The 70% misses all hit the database simultaneously. The database became saturated, latency spiked to 10s, and the service effectively went down — not because the cache was unavailable, but because the system wasn’t designed for cache misses at that rate.

This is a cache reliability failure, not a cache performance failure.

The Two Purposes of Caching

Performance caching: reduce latency for the common case. The system is correct without the cache — just slower.

Reliability caching: absorb load spikes that would otherwise overwhelm dependencies. The system may be technically correct without the cache but operationally incorrect (too slow or unavailable under load).

Most cache designs conflate these. The design patterns for each are different, and confusing them leads to systems that rely on the cache being available while treating it as optional.

The Thundering Herd

The core reliability failure: a cache that normally holds back traffic becomes unavailable, the held-back traffic hits the origin simultaneously, and the origin is overwhelmed.

Normal state (cache healthy):
  1000 req/s → cache (98% hit) → 20 req/s to DB
                                 DB fine

Cache failure:
  1000 req/s → cache (0% hit) → 1000 req/s to DB
                                 DB saturated
                                 Latency → 10s
                                 Service degraded/down

The fixes operate at different levels:

1. Request coalescing (de-dup inflight requests)

Multiple simultaneous requests for the same key should produce one origin request, not N:

type Cache struct {
    store  map[string]*cacheEntry
    mu     sync.RWMutex
    loader *singleflight.Group // golang.org/x/sync/singleflight
    origin Origin              // backing data source (DB or upstream service)
}

func (c *Cache) Get(ctx context.Context, key string) (Value, error) {
    // Fast path: cache hit
    c.mu.RLock()
    if entry, ok := c.store[key]; ok && !entry.expired() {
        c.mu.RUnlock()
        return entry.value, nil
    }
    c.mu.RUnlock()

    // Slow path: coalesce concurrent requests for the same key
    val, err, _ := c.loader.Do(key, func() (interface{}, error) {
        return c.origin.Fetch(ctx, key)
    })
    if err != nil {
        return Value{}, err
    }
    v := val.(Value)
    c.set(key, v)
    return v, nil
}

singleflight.Group.Do ensures only one origin request is in flight per key at any time. If 100 goroutines simultaneously miss on the same key, one origin request is made and 100 goroutines receive the result.

2. Staggered TTLs (jitter)

Uniform TTLs cause synchronized cache expiry — all entries for a batch-loaded dataset expire at the same time, producing a thundering herd every N seconds:

// Bad: all entries expire at exactly the same time
expiry := time.Now().Add(5 * time.Minute)

// Good: add random jitter to spread expiry across a window
jitter := time.Duration(rand.Int63n(int64(30 * time.Second)))
expiry := time.Now().Add(5*time.Minute + jitter)

A random 0–30 second jitter on a 5-minute TTL spreads the cache reload across a 30-second window instead of a single instant.

3. Background refresh (stale-while-revalidate)

Rather than waiting for a TTL expiry to refresh, refresh proactively in the background before expiry:

type cacheEntry struct {
    value     Value
    expiresAt time.Time // hard expiry
    refreshAt time.Time // soft expiry: refresh before hard expiry
}

func (c *Cache) Get(ctx context.Context, key string) (Value, error) {
    entry, ok := c.store[key] // locking elided for brevity
    if !ok {
        return c.loadAndStore(ctx, key)
    }

    if time.Now().After(entry.expiresAt) {
        // Hard expiry — must fetch synchronously
        return c.loadAndStore(ctx, key)
    }

    if time.Now().After(entry.refreshAt) {
        // Soft expiry — serve stale, refresh in background
        go c.backgroundRefresh(key)
    }

    return entry.value, nil
}

With background refresh, a cache miss only happens when the entry is completely absent or past its hard expiry. Soft expiry causes a background refresh while still serving the cached value. This eliminates the latency spike from synchronous cache misses in the common case.

4. Circuit breaker on cache (protect origin)

When the origin is degraded, a circuit breaker prevents the cascade of cache misses from overwhelming it further:

type CacheWithBreaker struct {
    cache   *Cache
    breaker *gobreaker.CircuitBreaker
}

func (c *CacheWithBreaker) Get(ctx context.Context, key string) (Value, error) {
    cached, err := c.cache.Get(ctx, key)
    if err == nil {
        return cached, nil  // cache hit, no origin call needed
    }

    // Cache miss — try origin through circuit breaker
    val, err := c.breaker.Execute(func() (interface{}, error) {
        return c.origin.Fetch(ctx, key)
    })
    if err != nil {
        // Circuit open or origin error — return stale value if available
        if stale, ok := c.cache.GetStale(key); ok {
            return stale, nil  // serve stale rather than error
        }
        return Value{}, err
    }

    return val.(Value), nil
}

When the circuit breaker is open (origin is unhealthy), the cache falls back to serving stale values rather than returning errors. For reference data (counterparty names, instrument definitions) that changes slowly, a stale value is usually better than no value.

Cache Invalidation Design

“Cache invalidation is hard” is a cliché because it’s true. The patterns that make it manageable:

TTL-based invalidation (simplest, works for eventually-consistent data): Set a TTL. Accept that data may be stale for up to TTL seconds. Appropriate for reference data that changes infrequently.

Event-driven invalidation (for data that changes with known events): When the source changes, publish an event. The cache subscribes and invalidates/refreshes the affected key.

func (c *Cache) consumeInvalidations(ch <-chan InvalidationEvent) {
    for event := range ch {
        c.mu.Lock()
        delete(c.store, event.Key)  // force next access to reload
        c.mu.Unlock()
    }
}

Version-tagged keys (for cache coherence across service instances): Include a version or hash in the cache key so that a data update naturally results in a new key, and old entries expire on their TTL:

Key: instrument:ISIN:US0378331005:v42

When the instrument data is updated to version 43, the key changes. Old entries are never explicitly invalidated — they just expire. New reads get the v43 entry.
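A key builder for this scheme is a one-liner; versionedKey is an illustrative helper matching the key format above, not an API from the text:

```go
import "fmt"

// versionedKey embeds the data version in the cache key, so an update
// produces a new key rather than requiring explicit invalidation.
func versionedKey(isin string, version int) string {
	return fmt.Sprintf("instrument:ISIN:%s:v%d", isin, version)
}
```

The version must come from the source of truth (e.g. a row version or content hash) so that all service instances compute the same key.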

The Metrics That Matter

For cache reliability (not just performance):

hit_rate:           % of requests served from cache
miss_rate:          % that fall through to origin
origin_latency_p99: when misses happen, how slow is origin?
stale_serve_rate:   % served from stale cache (background refresh model)
eviction_rate:      items evicted before TTL (indicates cache too small)

The alert that would have caught the incident: origin_request_rate > 50/s AND origin_latency_p99 > 500ms. That’s the signature of a thundering herd in progress. With that alert, the rolling restart would have been caught before the database saturated.
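Expressed as a Prometheus-style alerting rule, that signature might look like the following (metric names are hypothetical placeholders — adapt to whatever your service actually exports):

```yaml
# Thundering-herd signature: high origin request rate AND slow origin.
# origin_requests_total and origin_latency_seconds_bucket are assumed names.
- alert: OriginThunderingHerd
  expr: >
    rate(origin_requests_total[1m]) > 50
    and histogram_quantile(0.99, rate(origin_latency_seconds_bucket[5m])) > 0.5
  for: 1m
  labels:
    severity: page
```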


Caches are load-bearing infrastructure, not performance frosting. Design them with failure modes in mind: what happens when the cache is cold? When the origin is slow? When TTLs all expire simultaneously? The answers to those questions determine whether your service is reliable or just fast when everything is working.