Before OpenTelemetry, distributed tracing meant a vendor-specific integration. You chose Jaeger or Zipkin or Datadog, added their SDK, and traced against their API. Switching vendors meant rewriting instrumentation. Adding a library that used a different tracing SDK meant two tracing systems running in parallel.
OpenTelemetry solved this with a vendor-neutral API and a pluggable exporter model. The API stays the same; you swap exporters. Most major observability vendors now accept OTel format natively.
Here’s what a solid Go integration looks like.
Bootstrapping the Tracer Provider
The tracer provider is the factory for tracers. It gets configured once at startup and shut down gracefully on exit:
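A minimal bootstrap sketch, assuming an OTLP/gRPC exporter pointed at a local collector on :4317 and a service name of "order-service" (both placeholders for your own setup):

```go
package main

import (
	"context"
	"log"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.17.0"
)

// initTracer configures the global tracer provider and returns a
// shutdown func that flushes any buffered spans.
func initTracer(ctx context.Context) (func(context.Context) error, error) {
	// OTLP over gRPC to a local collector; adjust the endpoint for your setup.
	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("localhost:4317"),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}

	res, err := resource.Merge(resource.Default(), resource.NewWithAttributes(
		semconv.SchemaURL,
		semconv.ServiceName("order-service"),
	))
	if err != nil {
		return nil, err
	}

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter), // batches spans before export
		sdktrace.WithResource(res),
	)
	otel.SetTracerProvider(tp)
	// W3C trace context + baggage propagation across service boundaries.
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{}, propagation.Baggage{},
	))
	return tp.Shutdown, nil
}

func main() {
	ctx := context.Background()
	shutdown, err := initTracer(ctx)
	if err != nil {
		log.Fatalf("init tracer: %v", err)
	}
	defer func() {
		// Flush remaining spans; bound the wait so exit isn't blocked forever.
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		if err := shutdown(ctx); err != nil {
			log.Printf("tracer shutdown: %v", err)
		}
	}()
	// ... run the service ...
}
```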
The defer shutdown is important — it flushes buffered spans before the process exits. Without it, the last batch of spans from a short-lived process may be lost.
Instrumenting HTTP Servers and Clients
The otelhttp package handles trace context propagation for HTTP automatically:
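A sketch of both sides, with "order-service" and the /orders route as placeholder names:

```go
package main

import (
	"io"
	"log"
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/orders", func(w http.ResponseWriter, r *http.Request) {
		// r.Context() now carries the span extracted from traceparent.
		io.WriteString(w, "ok")
	})

	// Server side: the middleware starts a server span per request.
	handler := otelhttp.NewHandler(mux, "order-service")

	// Client side: the transport injects traceparent/tracestate into
	// outgoing requests, so downstream services join the same trace.
	client := &http.Client{Transport: otelhttp.NewTransport(http.DefaultTransport)}
	_ = client

	log.Fatal(http.ListenAndServe(":8080", handler))
}
```

Note that outgoing requests only join the trace when they are built with http.NewRequestWithContext and the request context carries the current span.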
The middleware reads traceparent and tracestate headers from incoming requests and creates a child span. The transport writes those headers into outgoing requests. Without this, distributed traces break at service boundaries — each service creates a new root span.
Manual Instrumentation
For business logic spans, use the tracer directly:
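A sketch of a manually instrumented function; Order and validate are stand-ins for your own types and logic:

```go
package orders

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
)

// One tracer per package, named after the module path.
var tracer = otel.Tracer("github.com/company/order-service")

type Order struct{ ID string }

func ProcessOrder(ctx context.Context, o Order) error {
	ctx, span := tracer.Start(ctx, "ProcessOrder")
	defer span.End()
	span.SetAttributes(attribute.String("order.id", o.ID))

	if err := validate(ctx, o); err != nil {
		span.RecordError(err)                          // the detail, as an event
		span.SetStatus(codes.Error, "validate failed") // the filterable flag
		return err
	}
	return nil
}

func validate(ctx context.Context, o Order) error { return nil }
```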
The tracer name ("github.com/company/order-service") appears in the trace UI and helps disambiguate spans from different packages. Use your module path.
span.RecordError(err) adds the error to the span as an event (pass trace.WithStackTrace(true) to include a stack trace). span.SetStatus(codes.Error, ...) marks the span as failed. Both are important: RecordError records the detail, SetStatus makes it filterable in the trace UI.
Database Instrumentation
For database calls, use the appropriate contrib library or wrap the driver:
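One widely used wrapper is github.com/XSAM/otelsql; the sketch below assumes Postgres via the pgx stdlib driver, both of which are choices rather than requirements:

```go
package main

import (
	"database/sql"

	"github.com/XSAM/otelsql"
	_ "github.com/jackc/pgx/v5/stdlib" // registers the "pgx" driver
	semconv "go.opentelemetry.io/otel/semconv/v1.17.0"
)

func openDB(dsn string) (*sql.DB, error) {
	// otelsql.Open wraps the named driver; every query then gets a
	// child span carrying the parameterised statement text.
	return otelsql.Open("pgx", dsn,
		otelsql.WithAttributes(semconv.DBSystemPostgreSQL),
	)
}
```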
This adds a child span for every database query, including the SQL text (parameterised, not with values). The trace shows you which queries are slow and how much time in a request is spent waiting on the database.
For Redis:
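With go-redis v9, the redisotel contrib package hooks into the client; a minimal sketch with a placeholder address:

```go
package main

import (
	"github.com/redis/go-redis/extra/redisotel/v9"
	"github.com/redis/go-redis/v9"
)

func newRedis() (*redis.Client, error) {
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	// Adds a child span per Redis command, named after the command.
	if err := redisotel.InstrumentTracing(rdb); err != nil {
		return nil, err
	}
	return rdb, nil
}
```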
Sampling Strategy
Tracing everything at 100% is expensive in production at high request rates. The sampling options:
Strategy                  Use when
────────────────────────────────────────────────────────────────
AlwaysSample              Dev/test environments
NeverSample               Measuring tracing overhead; rarely needed
TraceIDRatioBased(0.1)    10% of traces, a good default for prod
ParentBased(...)          Respect the upstream sampling decision
ParentBased is critical for correct distributed tracing: if an upstream service decides to sample a request, all downstream services should also sample it. If each service independently decides, you get incomplete traces — the root span exists but some child spans are missing.
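The two compose: ParentBased wraps a root sampler that applies only when there is no upstream decision. A sketch:

```go
package main

import (
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func newSampledProvider() *sdktrace.TracerProvider {
	return sdktrace.NewTracerProvider(
		// Honour the upstream decision when a parent span exists;
		// otherwise sample 10% of new root traces by trace ID.
		sdktrace.WithSampler(
			sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.1)),
		),
	)
}
```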
Error status and latency are only known when a span ends, so a head-based override can only bias on attributes available at span start, for example routes known to be error-prone or slow:
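The SDK has no built-in override for this, but the Sampler interface is small enough to implement directly. A sketch of a custom sampler that always samples a hypothetical list of route prefixes and delegates everything else to a base sampler:

```go
package main

import (
	"strings"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	"go.opentelemetry.io/otel/trace"
)

// routeSampler always samples spans whose http.route matches one of
// the listed prefixes, and defers to the base sampler otherwise.
type routeSampler struct {
	base     sdktrace.Sampler
	prefixes []string
}

func (s routeSampler) ShouldSample(p sdktrace.SamplingParameters) sdktrace.SamplingResult {
	for _, attr := range p.Attributes {
		if attr.Key != "http.route" {
			continue
		}
		for _, pre := range s.prefixes {
			if strings.HasPrefix(attr.Value.AsString(), pre) {
				return sdktrace.SamplingResult{
					Decision:   sdktrace.RecordAndSample,
					Tracestate: trace.SpanContextFromContext(p.ParentContext).TraceState(),
				}
			}
		}
	}
	return s.base.ShouldSample(p)
}

func (s routeSampler) Description() string {
	return "routeSampler(" + s.base.Description() + ")"
}
```

Wrap it in ParentBased when installing it, so upstream decisions still win for non-root spans.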
True tail-based sampling (decide after seeing the complete trace) requires a collector that supports it — Grafana Tempo and the OpenTelemetry Collector’s tail sampling processor both support this.
What the Traces Actually Tell You
After instrumenting HTTP handlers, clients, and database calls, a trace for a slow order request looks like:
ProcessOrder [450ms]
├── validate [1ms]
├── enrichOrder [380ms]
│ ├── SELECT counterparty FROM ... [2ms]
│ ├── GET cache:instrument:AAPL [45ms] ← cache miss, slow
│ └── POST http://reference-data/v1/instruments/AAPL [330ms] ← slow downstream
└── executeOrder [60ms]
├── BEGIN [1ms]
├── INSERT INTO orders ... [55ms] ← slow write
└── COMMIT [4ms]
This trace immediately tells you: the 450ms is dominated by a slow reference data service call (330ms) and a slow database write (55ms). The cache miss for instrument:AAPL that preceded the reference data call is visible. Without distributed tracing, finding this in logs would take much longer.
The investment: ~2 days to instrument a Go service completely. The payoff: production incidents that used to take hours to diagnose take minutes.