Why distributed tracing
When a slow request crosses five services, per-service logs do not show you the critical path. A trace ties every span together under a shared trace ID, showing the order and duration of each operation across the whole call graph.
Trace context propagation
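The W3C Trace Context standard defines two headers: traceparent (version, trace ID, parent span ID, and trace flags) and tracestate (vendor-specific key/value pairs). Every service must extract these headers from incoming requests and inject them into all outgoing calls; a single service that drops them splits the trace in two.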
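// OpenTelemetry auto-instrumentation handles this for HTTP and gRPC.
// Manual propagation for custom transports:
import { context, propagation, trace } from "@opentelemetry/api";
const tracer = trace.getTracer("order-service"); // tracer name is illustrative
// `carrier` is the incoming message's header map (e.g. queue message metadata)
const ctx = propagation.extract(context.active(), carrier);
const span = tracer.startSpan("process-order", undefined, ctx);
// ... do the work, call span.end(), and propagation.inject(...) on outgoing calls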
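To make the traceparent layout concrete, here is a minimal, dependency-free sketch of parsing and rebuilding the header. The function and type names (parseTraceParent, formatTraceParent, TraceParent) are made up for illustration and are not part of OpenTelemetry; in real services the SDK's propagator does this for you. The sketch only handles the four-field layout used by version 00.

```typescript
// Parsed form of a W3C traceparent header value.
interface TraceParent {
  version: string;  // 2 hex chars, "00" in the current spec
  traceId: string;  // 32 hex chars, shared by every span in the trace
  parentId: string; // 16 hex chars, the caller's span ID
  sampled: boolean; // bit 0 of the trace-flags byte
}

// version-traceid-parentid-flags, all lowercase hex
const TRACEPARENT_RE =
  /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns null for malformed headers so the caller can start a new trace instead.
function parseTraceParent(header: string): TraceParent | null {
  const m = TRACEPARENT_RE.exec(header.trim());
  if (m === null) return null;
  const [, version, traceId, parentId, flags] = m;
  // Version "ff" and all-zero IDs are invalid per the spec.
  if (version === "ff" || /^0+$/.test(traceId) || /^0+$/.test(parentId)) {
    return null;
  }
  return {
    version,
    traceId,
    parentId,
    sampled: (parseInt(flags, 16) & 0x01) === 1,
  };
}

// Header for an outgoing call: same trace ID, this service's span ID as the new parent.
function formatTraceParent(
  traceId: string,
  spanId: string,
  sampled: boolean
): string {
  return `00-${traceId}-${spanId}-${sampled ? "01" : "00"}`;
}
```

A service would call parseTraceParent on the incoming header, create its own span ID, and pass the unchanged trace ID plus that new span ID to formatTraceParent for each downstream request.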
Sampling
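- Head-based: the decision is made at the root span, before the outcome is known, and carried downstream in the traceparent flags. Simple and cheap, but it can miss rare errors.
- Tail-based: the decision is made after the full trace is assembled, so slow or failed traces can always be kept. More complex and costly, because every span must be buffered until the trace completes. In practice this usually runs in a collector tier, for example the OpenTelemetry Collector's tail sampling processor, in front of a backend such as Jaeger or Grafana Tempo.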
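The two strategies can be contrasted with a small sketch. This is an illustration under stated assumptions, not how any particular SDK or collector implements sampling: the names headSample, tailSample and SpanSummary are hypothetical, the head sampler derives its decision from the last 8 hex characters of the trace ID, and the tail sampler treats a trace as "slow" if any span meets the threshold.

```typescript
// Minimal span summary; field names are illustrative, not a real SDK type.
interface SpanSummary {
  name: string;
  durationMs: number;
  error: boolean;
}

// Head-based: decide once from the trace ID alone, so every service that
// sees the same trace ID reaches the same decision without coordination.
function headSample(traceId: string, ratio: number): boolean {
  // Interpret the last 8 hex chars as an unsigned 32-bit number in [0, 2^32).
  const bucket = parseInt(traceId.slice(-8), 16);
  return bucket < ratio * 0x100000000;
}

// Tail-based: decide after all spans have arrived; keep traces that
// contain an error or a span at least as slow as the threshold.
function tailSample(spans: SpanSummary[], slowThresholdMs: number): boolean {
  return spans.some((s) => s.error || s.durationMs >= slowThresholdMs);
}
```

The head sampler needs nothing but the trace ID, which is why it is cheap. The tail sampler needs the whole span list, which is where the buffering cost comes from.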
What to annotate spans with
Add attributes that help debugging: user ID, order ID, HTTP status code, DB query text (sanitised), external API called, cache hit/miss. The more context per span, the faster the debugging.
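As a rough sketch of what such annotations might look like, the helper below builds an attribute map for an order-processing span and strips literals from the query text first. Everything here is an assumption for illustration: OrderContext, orderSpanAttributes and sanitizeQuery are hypothetical names, the attribute keys are loosely modelled on OpenTelemetry semantic conventions (the app.* keys are invented), and the regex-based sanitiser is deliberately crude rather than a real SQL parser.

```typescript
type AttributeValue = string | number | boolean;

// Hypothetical per-request context gathered while handling an order.
interface OrderContext {
  userId: string;
  orderId: string;
  httpStatus: number;
  dbQuery: string;
  externalApi: string;
  cacheHit: boolean;
}

// Replace string and numeric literals with "?" so the attribute keeps the
// query shape but not user data. A rough regex sketch, not a SQL parser.
function sanitizeQuery(sql: string): string {
  return sql
    .replace(/'(?:[^']|'')*'/g, "?") // quoted string literals
    .replace(/\b\d+(?:\.\d+)?\b/g, "?"); // standalone numeric literals
}

// Flat key/value map, the shape most tracing APIs accept for span attributes.
function orderSpanAttributes(c: OrderContext): Record<string, AttributeValue> {
  return {
    "user.id": c.userId,
    "app.order.id": c.orderId, // invented key
    "http.response.status_code": c.httpStatus,
    "db.query.text": sanitizeQuery(c.dbQuery),
    "app.external_api": c.externalApi, // invented key
    "app.cache.hit": c.cacheHit, // invented key
  };
}
```

With the OpenTelemetry JavaScript API, a map like this could be passed to span.setAttributes(...) on the active span.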