Any system at serious scale reaches the point where tracing every request is financially and computationally impractical. You have to sample. How you sample determines whether your traces are useful.

Most teams implement head-based sampling — decide whether to trace a request when it starts. It is the easiest approach to implement, and it produces largely useless traces for most debugging purposes.

The Head Sampling Problem

Head-based sampling (also called head-of-line sampling) makes the keep/drop decision at the start of a request, before any work has been done.

A typical configuration: sample 1% of requests.

On a service handling 10,000 requests/second:

  • 100 traces/second are kept
  • 9,900 are dropped

The kept traces are drawn uniformly from the request population. An outage where 2% of requests fail at high latency will show up in approximately 2% of your sampled traces. At 1% sampling, you’d expect to see roughly 2 of those failure traces per second. Enough to detect the problem, maybe not enough to debug it.
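
The arithmetic is worth making explicit, because it is the whole case against uniform sampling:

```python
# Back-of-the-envelope numbers from the scenario above (all assumed):
RPS = 10_000          # requests per second
SAMPLE_RATE = 0.01    # head-based sampling: keep 1%
FAILURE_RATE = 0.02   # 2% of requests fail during the outage

kept_per_sec = RPS * SAMPLE_RATE                  # traces kept per second
failing_per_sec = RPS * FAILURE_RATE              # failing requests per second
sampled_failures = failing_per_sec * SAMPLE_RATE  # failure traces you keep

print(kept_per_sec, failing_per_sec, sampled_failures)  # 100.0 200.0 2.0
```

Two failure traces per second is detection, not debugging.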

But the real problem: the specific requests you most want to examine — the slowest requests, the requests that errored, the requests that hit unusual code paths — are kept at the same uniform rate as everything else, because the sampling decision was made before you knew they were interesting. Rare, interesting requests end up a vanishingly small slice of an already-small sample.

Tail-Based Sampling

Tail-based sampling defers the keep/drop decision until after the trace is complete. This lets you sample based on trace properties:

  • Always keep traces that contain errors (status code 5xx, exceptions, error spans)
  • Always keep traces above a latency threshold (p99 + 2σ or a fixed threshold like “anything over 2 seconds”)
  • Sample the rest probabilistically (1-5% of “boring” traces for baseline coverage)

The result: your trace store contains disproportionately many interesting traces (errors, outliers) and a representative sample of normal traces. This is the distribution you want for debugging.
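
The policy list above can be sketched as a decision function over a completed trace. This is an illustrative model, not the collector’s implementation (the Span and Trace types and the thresholds are invented for the example):

```python
import random
from dataclasses import dataclass

@dataclass
class Span:
    status_code: int    # status recorded on the span (5xx = error)
    duration_ms: float

@dataclass
class Trace:
    spans: list         # only complete once all spans have arrived

LATENCY_THRESHOLD_MS = 2_000   # "anything over 2 seconds"
BASELINE_RATE = 0.02           # keep 2% of boring traces

def keep(trace: Trace) -> bool:
    """Tail-based decision: runs only after the whole trace is buffered."""
    if any(s.status_code >= 500 for s in trace.spans):
        return True                                 # always keep errors
    if max(s.duration_ms for s in trace.spans) > LATENCY_THRESHOLD_MS:
        return True                                 # always keep slow traces
    return random.random() < BASELINE_RATE          # probabilistic baseline
```

The key property is the input: keep() sees the finished trace, so it can condition on outcomes that head sampling cannot observe.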

Implementation: The Tail Sampling Processor

In OpenTelemetry, tail sampling is implemented via the tail_sampling processor in the OTel Collector:

# otelcol-config.yaml
processors:
  tail_sampling:
    decision_wait: 10s          # wait up to 10s for all spans to arrive
    num_traces: 100000          # in-memory trace buffer size
    expected_new_traces_per_sec: 10000
    policies:
      # Policy 1: always keep errors
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}

      # Policy 2: always keep slow traces
      - name: slow-traces
        type: latency
        latency: {threshold_ms: 1000}

      # Policy 3: keep a 2% baseline of all other traffic
      # (policies are OR'd: a trace matching any policy is kept)
      - name: probabilistic-sample
        type: probabilistic
        probabilistic: {sampling_percentage: 2}

The decision_wait is important: tail sampling can only make a decision when the trace is complete (all spans have arrived). For requests that span many services, the collector must buffer incomplete traces until all spans arrive or the wait time expires.

The Collector Deployment Model

The tail sampling processor must run in a stateful collector deployment — all spans for a given trace must route to the same collector instance for the sampling decision to be made correctly.

This means:

  1. Services export spans to a load balancer
  2. The load balancer routes spans to a collector using a consistent hash on the trace ID
  3. All spans with the same trace ID land on the same collector instance
  4. That collector makes the sampling decision for the whole trace

Services ──→ OTLP Load Balancer (hash by trace ID)
                    ├──→ Collector 1 (handles trace IDs 0x0000-0x3FFF)
                    ├──→ Collector 2 (handles trace IDs 0x4000-0x7FFF)
                    ├──→ Collector 3 (handles trace IDs 0x8000-0xBFFF)
                    └──→ Collector 4 (handles trace IDs 0xC000-0xFFFF)
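
The routing invariant can be sketched in a few lines (hostnames are placeholders, and a simple modulo stands in for the exporter’s consistent hash ring):

```python
# All spans of a trace share one trace ID, so hashing on the trace ID
# guarantees they all land on the same collector instance.
COLLECTORS = [
    "collector-1.internal:4317",
    "collector-2.internal:4317",
    "collector-3.internal:4317",
    "collector-4.internal:4317",
]

def route(trace_id_hex: str) -> str:
    """Pick a collector as a deterministic function of the trace ID."""
    return COLLECTORS[int(trace_id_hex, 16) % len(COLLECTORS)]
```

Because route() is deterministic, two spans with the same trace ID can never disagree about their destination, which is exactly the property the tail sampler needs.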

The OTel Collector has a loadbalancing exporter that handles this routing:

exporters:
  loadbalancing:
    routing_key: "traceID"
    protocol:
      otlp:
        timeout: 1s
    resolver:
      static:
        hostnames:
          - collector-1.internal:4317
          - collector-2.internal:4317
          - collector-3.internal:4317
          - collector-4.internal:4317

Sizing the Collector Tier

The collector buffer (num_traces) must be large enough to hold all in-flight traces during the decision_wait period.

Estimate: traces per second × decision_wait seconds × (1 + safety margin)

At 10,000 TPS with a 10s wait:

  • In-flight traces: 10,000 × 10 = 100,000 minimum
  • With 2× safety margin: 200,000

Each trace entry includes all its spans. At an average of 20 spans per trace and ~1KB per span, that’s 4GB of memory for 200,000 traces. Plan collector memory accordingly.
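
The sizing formula, with the assumed figures from this section plugged in:

```python
# Collector buffer and memory estimate (all inputs are assumptions
# carried over from the text):
TPS = 10_000            # new traces per second
DECISION_WAIT_S = 10    # tail_sampling decision_wait
SAFETY_MARGIN = 2.0     # 2x headroom
SPANS_PER_TRACE = 20    # average spans per trace
BYTES_PER_SPAN = 1_000  # ~1KB per buffered span

num_traces = int(TPS * DECISION_WAIT_S * SAFETY_MARGIN)
memory_gb = num_traces * SPANS_PER_TRACE * BYTES_PER_SPAN / 1e9

print(num_traces, memory_gb)  # 200000 4.0
```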

Memory pressure causes the collector to drop traces before a decision is made. Monitor otelcol_processor_tail_sampling_global_count_traces_sampled and the buffer saturation metrics.

What Gets Lost and What Doesn’t

Tail sampling with decision_wait=10s means traces that take longer than 10 seconds to complete may be truncated: spans from the slow operations arrive after the decision has already been made for the early spans. For request-response services this isn’t usually a problem (requests time out well before 10 seconds). For async processing pipelines where a single “trace” spans minutes, tail sampling at the request level may not be meaningful.

For long-running traces, consider sampling at the service level (always sample the root span that initiates the trace) rather than at the trace level.

Head Sampling Is Not Always Wrong

Head sampling is the right choice when:

  • Overhead budget is zero: with head sampling, the SDK skips span recording entirely for dropped requests, so they cost almost nothing. With tail sampling, every request pays the full cost of span creation and export, even if the collector ultimately drops the trace.
  • Regulatory requirements: some compliance regimes require sampling every Nth transaction for audit purposes. Head sampling with a deterministic sampling rate is easier to audit than tail sampling.
  • The trace volume is manageable: if your service handles 100 requests/second and you have a comfortable budget for trace storage, trace everything (100% sampling). Sampling optimisation is for services at scale.
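
For the deterministic case, a head sampler in the spirit of OpenTelemetry’s TraceIdRatioBased sampler can be sketched like this (a simplified model, not the SDK API):

```python
# Head-based decision: a pure function of the trace ID, computed in the
# SDK before any work is done. Every service in the request path reaches
# the same keep/drop answer without coordination.
SAMPLE_RATE = 0.01                  # keep 1%
BOUND = int(SAMPLE_RATE * 2**64)    # threshold in 64-bit trace-ID space

def head_sampled(trace_id: int) -> bool:
    return (trace_id & 0xFFFFFFFFFFFFFFFF) < BOUND
```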

The central insight: head sampling is a rate limiter. Tail sampling is a filter. If you’re debugging production issues, you want a filter that keeps the interesting traces, not a rate limiter that keeps a random sample. The implementation complexity of tail sampling (stateful collectors, load-balanced routing, memory sizing) is real but manageable. The debugging capability it enables is not available any other way.