Why distributed tracing

When a slow request spans five services, per-service logs do not show you the critical path. A trace ties every span in the request together under a shared trace ID, showing the order and duration of each operation across the whole call graph.

Trace context propagation

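The W3C Trace Context standard defines two headers: traceparent (version, trace ID, parent span ID, and trace flags) and tracestate (vendor-specific key-value data). Every service must extract these headers from incoming requests and inject them into all outgoing calls; a single service that drops them splits the trace in two.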

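// OpenTelemetry auto-instrumentation handles this for HTTP and gRPC.
// Manual propagation for custom transports (carrier holds the incoming headers/metadata):
import { context, propagation, trace } from "@opentelemetry/api";
const tracer = trace.getTracer("order-service"); // tracer name is illustrative
const ctx = propagation.extract(context.active(), carrier);
const span = tracer.startSpan("process-order", undefined, ctx);
// ... handle the message, then always end the span so it is exported
span.end();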
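
To make the header layout concrete, here is a minimal, dependency-free sketch of what extraction and injection amount to for traceparent. It assumes the version 00 layout only (four dash-separated lowercase hex fields) and ignores tracestate. The names TraceParent, parseTraceparent, newSpanId, and childTraceparent are hypothetical helpers for illustration, not part of any OpenTelemetry API.

```typescript
// Sketch only: handles the W3C Trace Context version 00 field layout.
interface TraceParent {
  version: string;  // 2 hex chars, "00" today
  traceId: string;  // 32 hex chars, shared by every span in the trace
  parentId: string; // 16 hex chars, the caller's span ID
  sampled: boolean; // bit 0 of the trace-flags byte
}

const TRACEPARENT_RE =
  /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// "Extract": turn an incoming header value into structured context.
function parseTraceparent(header: string): TraceParent | null {
  const m = TRACEPARENT_RE.exec(header.trim());
  if (m === null) return null;
  const [, version, traceId, parentId, flags] = m;
  // Version "ff" and all-zero IDs are invalid.
  if (version === "ff" || /^0+$/.test(traceId) || /^0+$/.test(parentId)) {
    return null;
  }
  return {
    version,
    traceId,
    parentId,
    sampled: (parseInt(flags, 16) & 0x01) === 1,
  };
}

// Hypothetical helper: 16 random hex chars. Math.random is enough for a
// sketch; a real SDK would use a stronger random source.
function newSpanId(): string {
  let id = "";
  for (let i = 0; i < 16; i++) {
    id += Math.floor(Math.random() * 16).toString(16);
  }
  return /^0+$/.test(id) ? newSpanId() : id;
}

// "Inject": build the header for an outgoing call. The trace ID and the
// sampled flag are carried over; only the span ID changes.
function childTraceparent(
  parent: TraceParent,
  spanId: string = newSpanId()
): string {
  const flags = parent.sampled ? "01" : "00";
  return `00-${parent.traceId}-${spanId}-${flags}`;
}
```

The point of the sketch is that the trace ID survives every hop unchanged while the span ID is replaced at each one, which is what lets a backend reassemble the call graph. In practice the OpenTelemetry propagator does this for you.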

Sampling

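  • Head-based: decide at the root span, before the outcome is known, and propagate the decision in the trace flags. Simple and cheap, but can miss rare errors.
  • Tail-based: decide after the full trace is assembled, so slow or failed traces can be kept. More complex and costly, because spans must be buffered until the trace completes. Jaeger and Grafana Tempo deployments can use tail-based sampling, typically through a collector placed in front of the backend.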
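
The two strategies above can be sketched as a pair of decision functions. This is an illustration under stated assumptions, not how any particular backend implements sampling: headSample, tailSample, the FinishedSpan shape, and the threshold parameter are all hypothetical names, and the head-based rule shown (bucketing on the first 32 bits of the trace ID) is just one simple deterministic scheme.

```typescript
// Head-based: decide up front from the trace ID alone. Because the input is
// the trace ID, any service that sees the same ID reaches the same answer.
function headSample(traceId: string, ratio: number): boolean {
  // Treat the first 8 hex chars as an unsigned 32-bit number in [0, 2^32).
  const bucket = parseInt(traceId.slice(0, 8), 16);
  return bucket < ratio * 0x100000000;
}

// Hypothetical shape of a span once it has finished.
interface FinishedSpan {
  startMs: number;
  endMs: number;
  error: boolean;
}

// Tail-based: decide once every span of the trace has been collected.
// Keeps the trace if any span failed or the whole trace was slow.
function tailSample(spans: FinishedSpan[], slowThresholdMs: number): boolean {
  if (spans.length === 0) return false;
  const hasError = spans.some((s) => s.error);
  const start = Math.min(...spans.map((s) => s.startMs));
  const end = Math.max(...spans.map((s) => s.endMs));
  return hasError || end - start >= slowThresholdMs;
}
```

Note the difference in inputs: headSample needs only an ID, so it can run in the first service with no buffering, while tailSample needs the complete span list, which is where the extra memory and infrastructure cost comes from.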

What to annotate spans with

Add attributes that help debugging: user ID, order ID, HTTP status code, DB query text (sanitised), external API called, cache hit/miss. The more relevant context each span carries, the faster you can debug from the trace alone.
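
As a rough sketch of the attribute list above, the following builds a flat attribute map for one span and sanitises the query text first. Everything here is an assumption for illustration: OrderContext, sanitiseSql, and orderSpanAttributes are hypothetical names, the "app.*" keys are made up, and the other keys are only loosely modelled on OpenTelemetry semantic conventions. The sanitiser is deliberately naive (it would also rewrite numbered placeholders such as $1) and is no substitute for a real SQL parser.

```typescript
type AttributeValue = string | number | boolean;

// Replace literals so the query shape survives but the values do not.
function sanitiseSql(query: string): string {
  return query
    .replace(/'(?:[^']|'')*'/g, "?") // quoted strings, including '' escapes
    .replace(/\b\d+(?:\.\d+)?\b/g, "?"); // bare numeric literals
}

// Hypothetical request context gathered while handling an order.
interface OrderContext {
  userId: string;
  orderId: string;
  httpStatus: number;
  query: string;
  externalApi: string;
  cacheHit: boolean;
}

// Build the attribute map for one span. Keys are illustrative.
function orderSpanAttributes(c: OrderContext): Record<string, AttributeValue> {
  return {
    "app.user.id": c.userId,
    "app.order.id": c.orderId,
    "http.response.status_code": c.httpStatus,
    "db.query.text": sanitiseSql(c.query),
    "app.external_api": c.externalApi,
    "app.cache.hit": c.cacheHit,
  };
}
```

With the OpenTelemetry JS API, a map like this could be passed to span.setAttributes(...). Keeping the construction in one function makes it easier to review what leaves the service, which matters for fields such as user ID and query text.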