The first feature flag I implemented was a boolean in a config file. Toggle it to true, deploy, feature is on. Toggle to false, deploy, feature is off. Simple.
The fifteenth feature flag was a percentage rollout with user cohort targeting, metric-based automatic rollback, and a kill-switch that worked without a deployment. The gap between those two implementations is what this post is about.
Why Feature Flags Exist
Feature flags solve a specific problem: deploying code and releasing features are different events, and you want to decouple them.
Without flags, deploying code = releasing the feature. Code can’t be deployed until the feature it contains is ready to release. This creates pressure to batch releases, which increases deployment size, which increases risk.
With flags:
- Deploy code frequently (small, low-risk deployments)
- Release features independently (on your schedule, with rollback capability)
- Roll out gradually (percentage of users, beta cohort, internal only)
- A/B test (different variants to different user groups)
- Kill-switch without redeployment (disable a feature in seconds if it misbehaves)
These properties become more valuable as the system grows and the cost of production incidents increases.
The Flag Types
Not all feature flags are the same. Treating them as one thing leads to cluttered flag management and slow evaluation:
Type              Lifespan     Evaluation    Example
─────────────────────────────────────────────────────────────────────
Release flag      Weeks        Per-request   "new-order-confirmation-ui"
Experiment flag   Weeks        Per-user      "checkout-flow-variant"
Ops flag          Months       Per-request   "enable-rate-limiting"
Permission flag   Months–∞     Per-user      "beta-api-access"
Kill-switch       Indefinite   Per-request   "disable-payment-processor"
Release flags gate new features until they’re fully rolled out, then get deleted. They should live days to weeks, not months.
Experiment flags assign users to variants for A/B tests. They need stable assignment (same user always gets the same variant) and metrics integration.
Ops flags control operational behaviour — rate limits, circuit breakers, retry policies. They’re changed rarely but need to propagate quickly.

Permission flags grant specific user groups access to a capability — beta users, internal staff, paying tiers. They’re effectively long-lived entitlements rather than temporary toggles.
Kill-switches disable specific features immediately when problems are detected. They must evaluate to false (off) if the flag service is unavailable.
The Anatomy of a Flag Evaluation
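The call shape matters more than any particular SDK. A minimal sketch, assuming a hypothetical BoolFlag helper backed by a local store (the function and store names are illustrative, not a real SDK):

```go
package main

import (
	"errors"
	"fmt"
)

// flagStore stands in for the flag service's local cache.
// A real SDK would populate this from the flag service.
var flagStore = map[string]bool{
	"new-order-confirmation-ui": true,
}

var errFlagUnavailable = errors.New("flag unknown or service unavailable")

// BoolFlag evaluates a flag, returning the configured value when it
// can, and defaultValue when the flag is missing or the service is
// unreachable. This fallback is the whole point of the signature.
func BoolFlag(key string, defaultValue bool) bool {
	value, err := lookup(key)
	if err != nil {
		return defaultValue // safe fallback
	}
	return value
}

func lookup(key string) (bool, error) {
	v, ok := flagStore[key]
	if !ok {
		return false, errFlagUnavailable
	}
	return v, nil
}

func main() {
	// Known flag: returns its configured value.
	fmt.Println(BoolFlag("new-order-confirmation-ui", false)) // true
	// Unknown flag (or service down): returns the default.
	fmt.Println(BoolFlag("disable-payment-processor", false)) // false
}
```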
The defaultValue parameter is critical — it’s what the client returns if the flag service is unavailable or the flag doesn’t exist. Kill-switches should have defaultValue = false (disabled). Features under development should have defaultValue = false (not yet active). High-availability features should have defaultValue = true (assume enabled if flag service is down).
Consistent User Assignment
Percentage rollouts need to assign users consistently — the same user should always be in the same bucket, across calls, across service instances, across time (until the rollout configuration changes).
The standard approach: hash the flag key and user ID, map to a bucket:
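One way to implement this, sketched with Go’s stdlib FNV hash — any stable hash works, and the 10,000-bucket count here is an arbitrary choice that gives 0.01% rollout granularity:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bucket maps (flagKey, userID) to a number in [0, 10000).
// Hashing the flag key together with the user ID gives each
// flag its own independent shuffle of users.
func bucket(flagKey, userID string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(flagKey + ":" + userID))
	return h.Sum32() % 10000
}

// inRollout reports whether userID falls inside the first `percent`
// of buckets for flagKey. The assignment is stable: the same inputs
// always produce the same answer, on any service instance.
func inRollout(flagKey, userID string, percent float64) bool {
	return float64(bucket(flagKey, userID)) < percent*100
}

func main() {
	user := "user-42"
	fmt.Println(inRollout("checkout-flow-variant", user, 10))  // stable across calls
	fmt.Println(inRollout("checkout-flow-variant", user, 100)) // true: 100% includes everyone
}
```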
Including the flag key in the hash means a given user lands in a different bucket for each flag. Without it, the same 10% of users would get every feature first and the last 10% would never get early access to anything.
Flag Evaluation Performance
Flags are evaluated in hot paths — often on every request. The evaluation must be fast:
Implementation         Evaluation latency   Notes
────────────────────────────────────────────────────────────────
In-process (memory)    ~100 ns              Best; requires sync from source
HTTP to flag service   1–10 ms              Too slow for per-request use
SDK with local cache   ~1 µs                Good; cache invalidation needed
The standard pattern: the flag service SDK maintains an in-process cache of all flag configurations, updated by a background goroutine polling or receiving SSE/streaming updates from the flag service. Evaluations hit the local cache; the service is only contacted for cache updates.
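A hand-rolled sketch of that pattern, assuming a simple boolean flag model (real SDKs carry full targeting rules, not just booleans, but the structure is the same):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// FlagCache holds the full flag configuration in process memory.
// Evaluation is a lock-guarded map lookup: no I/O on the hot path.
type FlagCache struct {
	mu    sync.RWMutex
	flags map[string]bool
}

func NewFlagCache() *FlagCache {
	return &FlagCache{flags: map[string]bool{}}
}

// Bool evaluates a flag against the cached configuration,
// falling back to defaultValue if the flag is unknown.
func (c *FlagCache) Bool(key string, defaultValue bool) bool {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if v, ok := c.flags[key]; ok {
		return v
	}
	return defaultValue
}

// replace swaps in a freshly fetched configuration.
func (c *FlagCache) replace(flags map[string]bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.flags = flags
}

// Poll refreshes the cache on an interval; it would run as a
// background goroutine. fetch stands in for the call to the flag
// service (an HTTP poll here; real SDKs often use SSE/streaming).
func (c *FlagCache) Poll(interval time.Duration, fetch func() map[string]bool, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			c.replace(fetch())
		case <-stop:
			return
		}
	}
}

func main() {
	cache := NewFlagCache()
	fmt.Println(cache.Bool("enable-rate-limiting", false)) // false: not yet synced
	cache.replace(map[string]bool{"enable-rate-limiting": true})
	fmt.Println(cache.Bool("enable-rate-limiting", false)) // true: cache updated
}
```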
The Technical Debt Trap
Feature flags accumulate. Each flag added is code that must be maintained, evaluated, and eventually removed. A codebase with 200 active flags is:
- Harder to reason about (which flags are active? what’s the default behaviour?)
- Harder to test (combinatorial explosion of flag states)
- Slower to evaluate (more rules to process)
- Harder to onboard into (new engineers don’t know flag history)
The discipline required:
1. Every flag has an owner and an expiry date in the flag registry
2. Release flags must be deleted within N days of full rollout
3. Monthly flag review: what's still active? what's been retired?
4. Flags older than 90 days without a defined expiry are escalated
The removal is the hard part. When a flag has been enabled at 100% for 6 months, the surrounding if statement feels safe. Removing it requires testing, and the test cases feel artificial. But the alternative — accumulating permanent if featureEnabled("xyz") branches — produces a codebase where it’s impossible to understand the intended execution path.
Kill-Switch Implementation
Kill-switches have stricter requirements than other flags:
- Must evaluate to false (safe default) if the flag service is unavailable
- Must propagate within seconds of being toggled (not minutes)
- Must not add latency to the happy path when enabled
The streaming/SSE pattern handles requirements 1 and 2: the SDK maintains a persistent connection to the flag service and receives push updates. When the connection drops, the cached value (last known state) is used. Kill-switches are defined with defaultValue = false so that a cold-start service with no cached state behaves safely.
For requirement 3: flag evaluation must be on the critical path but not slow. The in-process cache approach achieves this — the check is a hash map lookup with zero I/O.
What OpenFeature Standardises
OpenFeature (CNCF project) is doing for feature flags what OpenTelemetry did for observability: a vendor-neutral API with pluggable providers.
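The real Go SDK lives at github.com/open-feature/go-sdk; the hand-rolled sketch below mirrors its shape — a Provider interface behind a stable client API — to show why a provider swap leaves call sites untouched:

```go
package main

import "fmt"

// Provider is the vendor-facing interface: each flag vendor
// (LaunchDarkly, Flagsmith, an in-house service) implements it once.
// This hand-rolled interface mirrors the shape of OpenFeature's
// provider abstraction; it is not the real SDK.
type Provider interface {
	BoolEvaluation(flag string, defaultValue bool, targetingKey string) bool
}

// Client is the application-facing API. Call sites depend on it,
// never on a concrete provider.
type Client struct {
	provider Provider
}

func (c *Client) BooleanValue(flag string, defaultValue bool, targetingKey string) bool {
	return c.provider.BoolEvaluation(flag, defaultValue, targetingKey)
}

// InMemoryProvider is a trivial provider backed by a map,
// standing in for a real vendor SDK.
type InMemoryProvider struct {
	flags map[string]bool
}

func (p InMemoryProvider) BoolEvaluation(flag string, defaultValue bool, _ string) bool {
	if v, ok := p.flags[flag]; ok {
		return v
	}
	return defaultValue
}

func main() {
	client := &Client{provider: InMemoryProvider{flags: map[string]bool{"beta-api-access": true}}}
	// Swapping vendors means constructing a different Provider here;
	// this call site does not change.
	fmt.Println(client.BooleanValue("beta-api-access", false, "user-42")) // true
}
```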
Swapping from LaunchDarkly to a self-hosted Flagsmith is a provider swap, not an instrumentation rewrite. This is the same value proposition as OTel’s exporter abstraction.
Feature flags done well are invisible infrastructure — you toggle them, things change, you move on. Feature flags done poorly are a maze of conditions that nobody understands and everyone is afraid to delete. The investment in naming conventions, expiry tracking, and regular cleanup is smaller than it looks and pays back continuously.