Before OpenTelemetry, distributed tracing meant a vendor-specific integration. You chose Jaeger or Zipkin or Datadog, added their SDK, and traced against their API. Switching vendors meant rewriting instrumentation. Adding a library that used a different tracing SDK meant two tracing systems running in parallel.
OpenTelemetry solved this with a vendor-neutral API and a pluggable exporter model. The API stays the same; you swap exporters. Most major observability vendors now accept OTel format natively.
Here’s what a solid Go integration looks like.
Bootstrapping the Tracer Provider
The tracer provider is the factory for tracers. It gets configured once at startup and shut down gracefully on exit:
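A minimal bootstrap sketch, assuming an OTLP/gRPC exporter pointed at a local collector on :4317 and a service name of "order-service" (both placeholders for your own setup):

```go
package main

import (
	"context"
	"log"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.17.0"
)

// initTracer configures the global tracer provider and returns a
// shutdown func that flushes any buffered spans.
func initTracer(ctx context.Context) (func(context.Context) error, error) {
	// OTLP over gRPC to a local collector; adjust the endpoint for your setup.
	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("localhost:4317"),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}

	res, err := resource.Merge(resource.Default(), resource.NewWithAttributes(
		semconv.SchemaURL,
		semconv.ServiceName("order-service"),
	))
	if err != nil {
		return nil, err
	}

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter), // batches spans before export
		sdktrace.WithResource(res),
	)
	otel.SetTracerProvider(tp)
	// W3C trace context + baggage propagation across service boundaries.
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{}, propagation.Baggage{},
	))
	return tp.Shutdown, nil
}

func main() {
	ctx := context.Background()
	shutdown, err := initTracer(ctx)
	if err != nil {
		log.Fatalf("init tracer: %v", err)
	}
	defer func() {
		// Flush remaining spans; bound the wait so exit isn't blocked forever.
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		if err := shutdown(ctx); err != nil {
			log.Printf("tracer shutdown: %v", err)
		}
	}()
	// ... run the service ...
}
```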
The defer shutdown is important — it flushes buffered spans before the process exits. Without it, the last batch of spans from a short-lived process may be lost.
Instrumenting HTTP Servers and Clients
The otelhttp package handles trace context propagation for HTTP automatically:
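A sketch of both sides, with "order-service" and the /orders route as placeholder names:

```go
package main

import (
	"io"
	"log"
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/orders", func(w http.ResponseWriter, r *http.Request) {
		// r.Context() now carries the span extracted from traceparent.
		io.WriteString(w, "ok")
	})

	// Server side: the middleware starts a server span per request.
	handler := otelhttp.NewHandler(mux, "order-service")

	// Client side: the transport injects traceparent/tracestate into
	// outgoing requests, so downstream services join the same trace.
	client := &http.Client{Transport: otelhttp.NewTransport(http.DefaultTransport)}
	_ = client

	log.Fatal(http.ListenAndServe(":8080", handler))
}
```

Note that outgoing requests only join the trace when they are built with http.NewRequestWithContext and the request context carries the current span.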
The middleware reads traceparent and tracestate headers from incoming requests and creates a child span. The transport writes those headers into outgoing requests. Without this, distributed traces break at service boundaries — each service creates a new root span.
Manual Instrumentation
For business logic spans, use the tracer directly:
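A sketch of a manually instrumented function; Order and validate are stand-ins for your own types and logic:

```go
package orders

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
)

// One tracer per package, named after the module path.
var tracer = otel.Tracer("github.com/company/order-service")

type Order struct{ ID string }

func ProcessOrder(ctx context.Context, o Order) error {
	ctx, span := tracer.Start(ctx, "ProcessOrder")
	defer span.End()
	span.SetAttributes(attribute.String("order.id", o.ID))

	if err := validate(ctx, o); err != nil {
		span.RecordError(err)                          // the detail, as an event
		span.SetStatus(codes.Error, "validate failed") // the filterable flag
		return err
	}
	return nil
}

func validate(ctx context.Context, o Order) error { return nil }
```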
The tracer name ("github.com/company/order-service") appears in the trace UI and helps disambiguate spans from different packages. Use your module path.
span.RecordError(err) adds the error to the span as an event (pass trace.WithStackTrace(true) to include a stack trace). span.SetStatus(codes.Error, ...) marks the span as failed. Both are important: RecordError records the detail, SetStatus makes it filterable in the trace UI.
Database Instrumentation
For database calls, use the appropriate contrib library or wrap the driver:
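One widely used wrapper is github.com/XSAM/otelsql; the sketch below assumes Postgres via the pgx stdlib driver, both of which are choices rather than requirements:

```go
package main

import (
	"database/sql"

	"github.com/XSAM/otelsql"
	_ "github.com/jackc/pgx/v5/stdlib" // registers the "pgx" driver
	semconv "go.opentelemetry.io/otel/semconv/v1.17.0"
)

func openDB(dsn string) (*sql.DB, error) {
	// otelsql.Open wraps the named driver; every query then gets a
	// child span carrying the parameterised statement text.
	return otelsql.Open("pgx", dsn,
		otelsql.WithAttributes(semconv.DBSystemPostgreSQL),
	)
}
```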
This adds a child span for every database query, including the SQL text (parameterised, not with values). The trace shows you which queries are slow and how much time in a request is spent waiting on the database.
For Redis:
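With go-redis v9, the redisotel contrib package hooks into the client; a minimal sketch with a placeholder address:

```go
package main

import (
	"github.com/redis/go-redis/extra/redisotel/v9"
	"github.com/redis/go-redis/v9"
)

func newRedis() (*redis.Client, error) {
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	// Adds a child span per Redis command, named after the command.
	if err := redisotel.InstrumentTracing(rdb); err != nil {
		return nil, err
	}
	return rdb, nil
}
```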
Sampling Strategy
Tracing everything at 100% is expensive in production at high request rates. The sampling options:
Strategy                  Use when
────────────────────────────────────────────────────────────────
AlwaysSample              Dev/test environments
NeverSample               Measuring tracing overhead; rarely needed
TraceIDRatioBased(0.1)    10% of traces, a good default for prod
ParentBased(...)          Respect the upstream sampling decision
ParentBased is critical for correct distributed tracing: if an upstream service decides to sample a request, all downstream services should also sample it. If each service independently decides, you get incomplete traces — the root span exists but some child spans are missing.
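The two compose: ParentBased wraps a root sampler that applies only when there is no upstream decision. A sketch:

```go
package main

import (
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func newSampledProvider() *sdktrace.TracerProvider {
	return sdktrace.NewTracerProvider(
		// Honour the upstream decision when a parent span exists;
		// otherwise sample 10% of new root traces by trace ID.
		sdktrace.WithSampler(
			sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.1)),
		),
	)
}
```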
Error status and latency are only known when a span ends, so a head-based override can only bias on attributes available at span start, for example routes known to be error-prone or slow:
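The SDK has no built-in override for this, but the Sampler interface is small enough to implement directly. A sketch of a custom sampler that always samples a hypothetical list of route prefixes and delegates everything else to a base sampler:

```go
package main

import (
	"strings"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	"go.opentelemetry.io/otel/trace"
)

// routeSampler always samples spans whose http.route matches one of
// the listed prefixes, and defers to the base sampler otherwise.
type routeSampler struct {
	base     sdktrace.Sampler
	prefixes []string
}

func (s routeSampler) ShouldSample(p sdktrace.SamplingParameters) sdktrace.SamplingResult {
	for _, attr := range p.Attributes {
		if attr.Key != "http.route" {
			continue
		}
		for _, pre := range s.prefixes {
			if strings.HasPrefix(attr.Value.AsString(), pre) {
				return sdktrace.SamplingResult{
					Decision:   sdktrace.RecordAndSample,
					Tracestate: trace.SpanContextFromContext(p.ParentContext).TraceState(),
				}
			}
		}
	}
	return s.base.ShouldSample(p)
}

func (s routeSampler) Description() string {
	return "routeSampler(" + s.base.Description() + ")"
}
```

Wrap it in ParentBased when installing it, so upstream decisions still win for non-root spans.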
True tail-based sampling (decide after seeing the complete trace) requires a collector that supports it — Grafana Tempo and the OpenTelemetry Collector’s tail sampling processor both support this.
What the Traces Actually Tell You
After instrumenting HTTP handlers, clients, and database calls, a trace for a slow order request looks like:
ProcessOrder [450ms]
├── validate [1ms]
├── enrichOrder [380ms]
│ ├── SELECT counterparty FROM ... [2ms]
│ ├── GET cache:instrument:AAPL [45ms] ← cache miss, slow
│ └── POST http://reference-data/v1/instruments/AAPL [330ms] ← slow downstream
└── executeOrder [60ms]
├── BEGIN [1ms]
├── INSERT INTO orders ... [55ms] ← slow write
└── COMMIT [4ms]
This trace immediately tells you: the 450ms is dominated by a slow reference data service call (330ms) and a slow database write (55ms). The cache miss for instrument:AAPL that preceded the reference data call is visible. Without distributed tracing, finding this in logs would take much longer.
The investment: ~2 days to instrument a Go service completely. The payoff: production incidents that used to take hours to diagnose take minutes.