At a startup with a dozen services, the observability problem is getting enough signal. You don’t have enough logging, your traces are incomplete, and your metrics dashboards have gaps. You know when something is wrong because a user tells you.
At scale, the problem inverts. You have petabytes of logs, hundreds of millions of traces per day, and metrics cardinality so high that naive approaches cause your time-series database to OOM. The engineering challenge is filtering signal from noise, not generating signal.
Both problems are real. They require different solutions.
The Cardinality Problem
Metrics systems (Prometheus, VictoriaMetrics, Datadog, whatever you’re using) store time series indexed by label combinations. Cardinality is the number of unique label combinations. Low-cardinality labels are fine: {service="payments", env="prod"} — two labels, handful of values, few series.
High-cardinality labels explode the series count: {user_id="...", request_id="..."} — millions of values per label, billions of series. Most metrics systems either reject high-cardinality metrics or OOM trying to index them.
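The arithmetic behind the explosion is just multiplication. A minimal sketch (numbers are illustrative):

```python
def series_count(label_values: dict[str, int]) -> int:
    """Worst-case time-series count: the product of distinct values per label."""
    n = 1
    for distinct in label_values.values():
        n *= distinct
    return n

# Low cardinality: a handful of services and environments.
print(series_count({"service": 12, "env": 3}))  # -> 36

# Add one unique-ID label and the same metric explodes.
print(series_count({"service": 12, "env": 3, "user_id": 1_000_000}))  # -> 36000000
```

One unbounded label multiplies every existing series by its value count, which is why a single `user_id` label can dominate an entire metrics installation.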
The failure mode is insidious: developers instrument code naturally, using whatever context is available, including unique identifiers. The series count grows slowly, then suddenly. By the time the time-series DB is in trouble, there are hundreds of services contributing high-cardinality labels and no single obvious fix.
The principle: metrics are for aggregates, traces are for individuals. A user_id label belongs in a trace span, not a metric label. The metric label should be {error_type="auth_failed"} — a low-cardinality attribute of the event. The trace carries the specific user context.
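As a sketch of that split — `emit_metric` and `set_span_attribute` are hypothetical stand-ins for your metrics and tracing clients, backed by in-memory stores so the example is self-contained:

```python
METRICS: list[tuple[str, dict]] = []
SPAN_ATTRS: dict[str, str] = {}

def emit_metric(name: str, labels: dict) -> None:
    # Stand-in for your metrics client (e.g. a Prometheus counter).
    METRICS.append((name, labels))

def set_span_attribute(key: str, value: str) -> None:
    # Stand-in for attaching an attribute to the current trace span.
    SPAN_ATTRS[key] = value

def on_auth_failure(user_id: str, request_id: str) -> None:
    # The low-cardinality attribute goes on the metric...
    emit_metric("errors_total", {"error_type": "auth_failed"})
    # ...while the unique identifiers ride on the trace.
    set_span_attribute("user.id", user_id)
    set_span_attribute("request.id", request_id)
```

The metric stays aggregable across all users; the trace still lets you find the one user who hit the failure.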
Enforcing this requires code review education, metric naming conventions, and (ideally) automated cardinality alerting on metrics intake. We added a metric that tracks series count per service; when a new service pushes its series count above a threshold, it triggers a review.
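One way to sketch that automated check — `record_series` and the per-service budget are illustrative, not a real intake API:

```python
from collections import defaultdict

# service -> set of unique (metric, sorted-labels) series keys seen so far
_series: dict[str, set] = defaultdict(set)

def record_series(service: str, metric: str, labels: dict,
                  budget: int = 10_000) -> bool:
    """Track unique series per service at intake.

    Returns True when the service has exceeded its series budget and
    should be flagged for review.
    """
    _series[service].add((metric, tuple(sorted(labels.items()))))
    return len(_series[service]) > budget
```

In practice you would run this (or the equivalent query against your TSDB's own series-count metrics) on the intake path, not in application code.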
Trace Sampling at Volume
Sampling traces is unavoidable at scale. Storing 100% of traces for a system processing millions of requests per second is not economically viable. But naive head-based sampling (randomly sample 1% of traces at ingress) discards exactly the traces you need — the rare high-latency requests and errors are statistically likely to be in the 99% you dropped.
Tail-based sampling — making the sampling decision after the trace is complete, based on the outcome — is the solution. Keep 100% of error traces. Keep 100% of high-latency traces (above p99 threshold). Sample successes at 1-5%. The result: comprehensive coverage of interesting cases at a fraction of the storage cost.
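The decision logic can be sketched as follows; the thresholds and rates are illustrative, and a real sampler operates on a completed trace rather than two scalar fields:

```python
import random

def keep_trace(has_error: bool, duration_ms: float,
               p99_ms: float = 500.0, success_rate: float = 0.02) -> bool:
    """Tail-based sampling decision, made after the trace completes."""
    if has_error:
        return True                        # keep 100% of error traces
    if duration_ms >= p99_ms:
        return True                        # keep 100% of slow traces
    return random.random() < success_rate  # sample ordinary successes at ~2%
```

Errors and slow traces are kept deterministically; only healthy, fast traffic is downsampled, which is exactly the inversion of what head-based sampling gives you.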
Implementing tail-based sampling requires a trace collector that buffers complete traces before making the sampling decision. The OpenTelemetry Collector’s tail sampling processor does this. The operational cost: the collector needs to buffer in-flight traces, which requires memory proportional to trace volume × average trace duration.
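A sketch of that policy for the OpenTelemetry Collector's tail sampling processor — field names follow recent opentelemetry-collector-contrib releases, and the values are illustrative; verify against your collector version:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s      # how long to buffer a trace before deciding
    num_traces: 50000       # max in-flight traces held in memory
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow
        type: latency
        latency: {threshold_ms: 500}
      - name: sample-successes
        type: probabilistic
        probabilistic: {sampling_percentage: 2}
```

`decision_wait` and `num_traces` are the memory knobs: together they bound the buffer that grows with trace volume × trace duration.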
Structured Logging at Scale
Unstructured logs at scale are grep archaeology. A log line that says “ERROR processing request for user abc123: timeout after 30s” is useful for debugging if you’re looking for that specific user or that specific error. It’s useless for understanding how many users are experiencing this error and which services are causing it.
Structured logs — JSON with consistent field names — become queryable.
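As a sketch, the timeout event from above rendered as a structured log — field names and values are illustrative, not a standard schema:

```json
{
  "timestamp": "2024-05-01T12:00:00Z",
  "level": "error",
  "service": "payments",
  "message": "request timeout",
  "error_type": "timeout",
  "dependency": "fraud-check",
  "timeout_ms": 30000,
  "user_id": "abc123",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"
}
```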
Now you can answer: how many requests are timing out against fraud-check in the last hour, broken down by service? Which trace IDs had this error (to pull the full trace)? What’s the distribution of timeout durations?
The trace_id field is the critical link: it connects a log line to the distributed trace, letting you jump from “I see errors in the logs” to “here is the full request context” without correlation guesswork.
The Alerting Philosophy That Actually Works
Alert on symptoms, not causes. An alert that fires when CPU > 80% is a cause alert — high CPU is often a symptom of something interesting, but it’s also the normal state under load. An alert that fires when error rate > 1% is a symptom alert — users are experiencing errors, regardless of why.
The corollary: alert on what users experience, not what systems do. Error rate and latency (from the user’s perspective, not the internal service’s) are the two metrics worth alerting on. Everything else — CPU, memory, queue depth, GC pause time — is diagnostic, not alerting.
This principle sounds obvious. In practice, most alert sets I’ve seen are 80% cause alerts and 20% symptom alerts, because cause alerts are easier to define and calibrate. “CPU > 80% for 5 minutes” is easy to write. “User-visible error rate > 0.1%” requires end-to-end instrumentation, a clear definition of “error” from the user’s perspective, and enough baseline data to set a meaningful threshold.
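A minimal sketch of a symptom alert as code — the class and defaults are hypothetical, and a real system would evaluate this in the metrics backend rather than in-process:

```python
from collections import deque

class ErrorRateAlert:
    """Fire when the user-visible error rate over a sliding window
    exceeds a threshold (0.1% over 5 minutes, as in the text)."""

    def __init__(self, window_s: float = 300.0, threshold: float = 0.001):
        self.window_s = window_s
        self.threshold = threshold
        self._events = deque()  # (timestamp, is_error) pairs

    def record(self, now: float, is_error: bool) -> None:
        self._events.append((now, is_error))
        # Drop events that have aged out of the window.
        while self._events and now - self._events[0][0] > self.window_s:
            self._events.popleft()

    def should_alert(self) -> bool:
        if not self._events:
            return False
        errors = sum(1 for _, is_err in self._events if is_err)
        return errors / len(self._events) > self.threshold
```

Note what the hard part is: deciding which events count as user-visible errors and feeding them in, not the threshold arithmetic itself.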
The investment in symptom-based alerting pays off in reduced alert fatigue (fewer alerts that don’t require action) and faster incident response (alerts tell you where the user pain is, not where a system metric looks odd).
The Signal-to-Noise Ratio Is an Engineering Investment
At the startup, I treated observability as infrastructure — set it up, maintain it, use it when things break. At scale, observability quality is a first-class engineering investment that compounds. A team with excellent instrumentation resolves incidents faster, builds features with more confidence, and spends less time on reactive debugging.
The teams I’ve seen with the best observability practices treat it as a product concern, not an infrastructure concern: who are the users of this data (on-call engineers, capacity planners, product managers)? What questions do they need to answer? What’s the fastest path from “something is wrong” to “I understand why”?
That framing leads to better instrumentation design than “add metrics and logs to everything” — which is the alternative I’ve also seen, and which produces the cardinality explosion and noise problems I described above.