
From Chaos to Clarity: Implementing OpenTelemetry Tracing Across Your Microservices Stack


Your on-call engineer is staring at a 500 error that only happens when three specific services interact during peak load. Logs show nothing useful—each service claims it did its job correctly. Without distributed tracing, you’re debugging with a blindfold, piecing together timestamps from five different log streams at 2 AM. The user-facing error says “something went wrong,” and your internal monitoring agrees: something did go wrong, somewhere, at some point.

This scenario plays out in organizations every week. The shift from monoliths to microservices traded one set of problems for another. You gained deployment flexibility and team autonomy. You lost the ability to follow a single request through your system with a debugger. OpenTelemetry exists to give that visibility back—without locking you into a specific vendor or observability platform.

The promise of microservices architecture was compelling: independent deployments, technology flexibility, and teams that could move fast without stepping on each other. The reality includes a new category of failure modes that traditional debugging tools simply cannot address. When a request touches eight services and fails somewhere in the middle, you need instrumentation that understands distributed systems at a fundamental level.


Why Traditional Logging Fails in Distributed Systems

A monolithic application writes logs to a single file. When something breaks, you grep for the error, find the stack trace, and trace execution backward. The mental model is linear: request comes in, code executes top to bottom, response goes out. You can attach a debugger, set breakpoints, and step through code paths. The entire application state exists in one process, one memory space, one logical unit.

Microservices shatter this linearity. A single user request might touch an API gateway, authentication service, product catalog, inventory checker, pricing engine, and payment processor. Each service logs independently. Each has its own notion of what happened. None of them know about the others. The simple act of checking out a shopping cart becomes a distributed transaction spanning multiple databases, message queues, and external APIs.

The correlation problem is fundamental and deeply challenging. Service A logs “received request at 14:32:07.123” and Service B logs “processed order at 14:32:07.456.” Are these the same request? Maybe. You’re left matching timestamps with millisecond precision across machines with clock drift, hoping the log messages contain enough context to reconstruct the journey. In practice, even with NTP synchronization, clocks can drift by tens of milliseconds—enough to make ordering ambiguous during high-throughput scenarios.

Consider what happens when you have ten services each logging hundreds of messages per second. The combinatorial explosion of possible request paths makes manual correlation impossible. Even with unique identifiers in your logs, you’re essentially performing a distributed join across log streams without the guarantees that a database would provide.

Log aggregation tools like the ELK stack or Loki help collect logs in one place. They’re necessary but insufficient. You can search across all services, but you’re still looking at individual trees when you need to see the forest. The query “show me all logs from the last hour where level=ERROR” returns hundreds of results across dozens of services. Which errors are related? Which caused which? Did the payment service fail because the inventory service returned stale data, or did both fail independently due to a database issue?

The three pillars of observability—logs, metrics, and traces—each answer different questions:

  • Logs tell you what happened at a specific point in time within a single service. They capture detailed information: error messages, stack traces, variable values, debug output. But they exist in isolation, each log line unaware of its relationship to logs in other services.
  • Metrics tell you aggregate behavior: request rates, error percentages, latency distributions. They answer questions like “how many requests per second is this service handling?” and “what’s our 99th percentile latency?” But they don’t tell you about individual requests.
  • Traces tell you how a single request flowed through your entire system. They provide the causal chain, the before-and-after relationship between operations across service boundaries.

Traces are the missing link. They provide the causal chain that connects a user clicking “checkout” to a database timeout three services deep. They show you not just that Service C failed, but that Service A called Service B which called Service C, and exactly where the 2-second latency came from. A trace transforms “something is slow” into “the database query in the inventory service took 1.8 seconds because it’s missing an index on the product_id column.”

Without traces, debugging distributed systems is archaeology—you’re sifting through fragments trying to reconstruct events that happened in the past. With them, it’s forensics—you have a complete chain of evidence showing exactly what happened and when.


OpenTelemetry Architecture: Collectors, Exporters, and the SDK

OpenTelemetry is a CNCF project that provides a vendor-neutral standard for collecting telemetry data. Before OpenTelemetry, you’d instrument your code with Jaeger’s client library or Zipkin’s or Datadog’s, locking yourself into that vendor’s ecosystem. Want to switch from Jaeger to Datadog? Rewrite your instrumentation. OpenTelemetry abstracts the instrumentation from the backend—instrument once, export anywhere. This decoupling represents a fundamental shift in how organizations approach observability infrastructure.

The architecture has three main components: the API, the SDK, and the Collector. Understanding the separation between these components is essential for both using OpenTelemetry effectively and troubleshooting issues when they arise.

The API defines the interfaces for creating spans, recording metrics, and emitting logs. Library authors use the API to instrument their code without taking a dependency on any specific implementation. If you’re writing an HTTP client library, you’d use the OpenTelemetry API to create spans for outgoing requests. Users of your library can then configure whether and how that telemetry gets exported. This separation is intentional: the API is stable and lightweight, while the SDK can evolve independently.

The SDK implements the API and handles the actual work: creating span objects, sampling decisions, batching, and exporting. Application developers configure the SDK at startup to determine where telemetry goes. The SDK makes the real decisions—should this trace be sampled? How should spans be batched for export? What happens when the export fails? The SDK is where configuration lives: sampling rates, export endpoints, resource attributes, and processing pipelines.

The distinction between API and SDK might seem academic, but it solves a real problem. Imagine a scenario where your application uses three libraries, each instrumented with OpenTelemetry. Without the API/SDK separation, each library would bring its own tracing implementation, potentially conflicting with each other. With the separation, libraries use only the API (which has no behavior without an SDK), and your application configures a single SDK that handles all telemetry from all libraries uniformly.
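
To make the separation concrete, here is a minimal sketch of what a library author writes (the module name and URL handling are illustrative): only the API is imported, and without an SDK configured the tracer is effectively a no-op.

library_http_client.py
from opentelemetry import trace

# Libraries depend only on opentelemetry-api. If the application never
# configures an SDK, this returns a no-op tracer and the spans below go
# nowhere; once the app calls trace.set_tracer_provider(), the same code
# starts emitting real telemetry.
tracer = trace.get_tracer("example.http.client")

def fetch(url: str) -> None:
    with tracer.start_as_current_span("fetch") as span:
        span.set_attribute("http.url", url)
        # ... perform the request ...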

The Collector is a standalone service that receives telemetry data, processes it, and exports it to backends. It’s the central nervous system of your observability pipeline, and its importance cannot be overstated.

┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
│  Your Service   │─────▶│    Collector    │─────▶│  Jaeger/Tempo/  │
│   (SDK + API)   │ OTLP │ (recv/proc/exp) │ OTLP │     Datadog     │
└─────────────────┘      └─────────────────┘      └─────────────────┘

The Collector runs in three deployment patterns, each with distinct tradeoffs:

Agent mode: Collector runs as a sidecar or DaemonSet alongside your application. Low latency, but consumes resources on every node. This pattern works well when you need to reduce network hops for latency-sensitive telemetry or when you want to apply node-specific processing.

Gateway mode: Collector runs as a centralized service. Applications send telemetry over the network. Simpler to manage, easier to scale, but adds network hops. Configuration changes happen in one place rather than across every node.

No Collector: Applications export directly to backends. Simpler architecture, but you lose the processing pipeline and can’t switch backends without code changes. This pattern works for small deployments or when you’re locked into a single vendor.

For Kubernetes deployments, gateway mode typically wins. You get centralized configuration, can apply sampling and filtering in one place, and reduce resource consumption compared to sidecars on every pod. When you need to change exporters or adjust sampling policies, you update one Collector configuration rather than redeploying every service.
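
As a sketch of what gateway mode looks like from the application side, the OTLP exporter can take its endpoint from the standard OTEL_EXPORTER_OTLP_ENDPOINT environment variable, so switching Collectors is a deployment-time change. The Service name below assumes the Kubernetes setup shown later in this article.

exporter_endpoint.py
import os

from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Point every pod at the gateway Collector via configuration, not code.
# OTEL_EXPORTER_OTLP_ENDPOINT is the standard variable the OTLP exporters honor.
endpoint = os.getenv(
    "OTEL_EXPORTER_OTLP_ENDPOINT",
    "http://otel-collector.observability.svc:4317",  # assumed gateway Service
)
exporter = OTLPSpanExporter(endpoint=endpoint, insecure=True)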

Sampling is where cost meets completeness, and the decision has significant implications for both your budget and your debugging capabilities. Head-based sampling decides at the start of a trace whether to record it—flip a coin, keep 10% of traces. Simple but blind: you’ll sample boring successful requests at the same rate as interesting failures. The advantage is simplicity and low resource usage—you don’t need to buffer data.

Tail-based sampling waits until a trace completes, then decides based on its contents. You can keep 100% of error traces, 100% of slow traces, and 1% of everything else. The tradeoff: you need to buffer complete traces somewhere, which requires the Collector to hold state and coordinate across instances if you’re running multiple replicas. Tail-based sampling is more complex operationally but dramatically more effective at capturing the traces you actually care about.
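
For head-based sampling in the Python SDK, a common pattern (sketched below; it is not part of the later configuration examples) is to wrap the ratio sampler in ParentBased so downstream services honor the decision already carried in the incoming trace flags instead of re-rolling the dice at every hop.

head_sampling.py
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep ~10% of new root traces; for requests that arrive with a traceparent,
# follow the caller's sampled flag so a trace is either kept in every service
# or dropped in every service.
sampler = ParentBased(root=TraceIdRatioBased(0.10))
provider = TracerProvider(sampler=sampler)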


Instrumenting a Python Service with Auto and Manual Tracing

OpenTelemetry’s Python SDK provides automatic instrumentation for popular frameworks. With a few lines of configuration, you get spans for every incoming HTTP request, outgoing HTTP call, and database query. Manual instrumentation lets you add business-specific context that auto-instrumentation cannot capture. The combination of both gives you comprehensive visibility without excessive effort.

Start with the dependencies. The OpenTelemetry ecosystem is modular—you install only what you need:

Terminal window
pip install opentelemetry-api \
  opentelemetry-sdk \
  opentelemetry-exporter-otlp \
  opentelemetry-instrumentation-fastapi \
  opentelemetry-instrumentation-httpx \
  opentelemetry-instrumentation-sqlalchemy

Configure the SDK at application startup. This code sets up OTLP export to a Collector running locally. The configuration happens once, typically in your application’s entrypoint:

tracing.py
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource, SERVICE_NAME, SERVICE_VERSION

def configure_tracing(service_name: str, version: str = "1.0.0") -> None:
    """Configure OpenTelemetry tracing with OTLP export.

    This function should be called once at application startup,
    before any other code that might create spans.
    """
    # Resources identify your service in the telemetry data
    resource = Resource.create({
        SERVICE_NAME: service_name,
        SERVICE_VERSION: version,
    })

    provider = TracerProvider(resource=resource)

    # Export to collector via OTLP/gRPC
    # The collector endpoint should be configurable via environment variable in production
    otlp_exporter = OTLPSpanExporter(
        endpoint="localhost:4317",  # Collector's gRPC endpoint
        insecure=True,  # Use TLS in production
    )

    # BatchSpanProcessor batches spans before export for efficiency
    # This reduces network overhead and handles backpressure gracefully
    provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
    trace.set_tracer_provider(provider)

The BatchSpanProcessor is important for production use. Rather than exporting each span immediately, it batches spans and exports them periodically. This reduces network overhead significantly and handles temporary network issues gracefully by buffering spans in memory.
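
The processor's queue and flush behavior are tunable. The values below are the commonly documented defaults, written out explicitly as a sketch of which knobs exist; it reuses the otlp_exporter and provider from the configure_tracing() snippet above.

batch_processor_tuning.py
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# otlp_exporter and provider come from configure_tracing() above.
provider.add_span_processor(
    BatchSpanProcessor(
        otlp_exporter,
        max_queue_size=2048,         # spans buffered; new spans are dropped when full
        schedule_delay_millis=5000,  # how often the queue is flushed
        max_export_batch_size=512,   # spans per export request
    )
)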

Auto-instrumentation wraps your framework to create spans automatically. The key insight here is that you’re getting visibility into HTTP handling, database queries, and external calls without modifying your business logic:

main.py
from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from tracing import configure_tracing
# Configure before creating the app - order matters here
configure_tracing("order-service")
app = FastAPI()
# Instrument FastAPI - creates spans for all incoming requests
# Each request gets a span with method, route, status code, and timing
FastAPIInstrumentor.instrument_app(app)
# Instrument httpx - creates spans for all outgoing HTTP calls
# These spans are automatically linked as children of the current request span
HTTPXClientInstrumentor().instrument()

With this setup, every request to your FastAPI service creates a span with HTTP method, route, status code, and timing. Every outgoing httpx call creates a child span linked to the parent. You can see the complete request flow without writing any additional code.
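
The SQLAlchemy instrumentation installed earlier follows the same pattern; a rough sketch (the connection string is a placeholder) looks like this:

db.py
from sqlalchemy import create_engine
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

# Placeholder DSN - instrument whatever engine your application already creates
engine = create_engine("postgresql://user:password@db:5432/orders")

# Queries issued through this engine produce spans parented to the active
# request span, so slow SQL shows up inside the request trace
SQLAlchemyInstrumentor().instrument(engine=engine)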

Auto-instrumentation covers the plumbing. Manual instrumentation adds business context—the domain-specific information that makes traces actually useful for debugging business logic issues. For a payment processing endpoint, you want spans that reflect your domain:

routes/payments.py
from fastapi import APIRouter, HTTPException
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode
import httpx

router = APIRouter()
tracer = trace.get_tracer(__name__)

@router.post("/payments")
async def process_payment(payment: PaymentRequest):  # PaymentRequest: the app's request model, defined elsewhere
    # Create a span for the business operation
    # This span captures the entire payment processing workflow
    with tracer.start_as_current_span("process_payment") as span:
        # Add attributes for debugging and analysis
        # These attributes let you filter and search traces by business criteria
        span.set_attribute("payment.amount", payment.amount)
        span.set_attribute("payment.currency", payment.currency)
        span.set_attribute("payment.method", payment.method)
        span.set_attribute("customer.id", payment.customer_id)

        # Validate payment - separate span for this step
        # Breaking down the operation into sub-spans shows where time is spent
        with tracer.start_as_current_span("validate_payment"):
            if payment.amount <= 0:
                span.set_status(Status(StatusCode.ERROR, "Invalid amount"))
                raise HTTPException(400, "Invalid payment amount")

        # Call external payment processor
        with tracer.start_as_current_span("call_payment_gateway") as gateway_span:
            gateway_span.set_attribute("gateway.provider", "stripe")
            async with httpx.AsyncClient() as client:
                response = await client.post(
                    "https://api.stripe.com/v1/charges",
                    json=payment.dict(),
                )
            gateway_span.set_attribute("gateway.response_code", response.status_code)

            if response.status_code != 200:
                # Record the error with context
                # The status and event together provide full debugging information
                gateway_span.set_status(
                    Status(StatusCode.ERROR, f"Gateway returned {response.status_code}")
                )
                span.add_event("payment_failed", {
                    "error.type": "gateway_error",
                    "response.body": response.text[:500],  # Truncate for safety
                })
                raise HTTPException(502, "Payment gateway error")

        span.add_event("payment_completed", {
            "transaction.id": response.json()["transaction_id"],
        })
        return {"status": "success"}

💡 Pro Tip: Use span.add_event() for point-in-time occurrences within a span, like “payment authorized” or “retry attempted.” Use span.set_attribute() for data that describes the entire operation. Events have timestamps; attributes don’t.

The difference between events and attributes matters for analysis. Attributes are indexed and searchable—you can query for all traces where payment.currency = "EUR". Events are less structured but can capture the sequence of things that happened within a span, which is invaluable for debugging retry logic or multi-step operations.
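
Exceptions are a special case of events: span.record_exception() attaches a timestamped exception event carrying the type, message, and stack trace. The sketch below is illustrative—submit_charge and TransientGatewayError are hypothetical stand-ins for your gateway client and its retryable error.

retry_with_events.py
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

def charge_with_retry(charge):
    with tracer.start_as_current_span("charge_with_retry") as span:
        last_exc = None
        for attempt in range(3):
            try:
                return submit_charge(charge)  # hypothetical gateway call
            except TransientGatewayError as exc:  # hypothetical retryable error
                last_exc = exc
                # Timestamped event with exception type, message, and stack trace
                span.record_exception(exc)
                span.add_event("retry_attempted", {"retry.attempt": attempt + 1})
        span.set_status(Status(StatusCode.ERROR, "retries exhausted"))
        raise last_exc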

The resulting trace shows the complete payment flow: the incoming HTTP request, your business validation, the external gateway call, and any errors along the way. When that 3 AM page fires, you’ll see exactly where the payment failed and why—not “something went wrong with payments” but “the Stripe API returned a 429 because we exceeded rate limits during the flash sale.”


Context Propagation: The Make-or-Break of Distributed Tracing

A trace is only useful if it’s complete. A broken trace—where Service A’s span and Service B’s span exist but aren’t linked—is worse than no trace at all. It gives you false confidence that you have observability while hiding the actual request flow. You’ll look at your trace backend, see what appears to be complete coverage, and not realize that critical relationships between services are invisible.

Context propagation is the mechanism that links spans across service boundaries. When Service A calls Service B, it must pass the trace context: the trace ID, parent span ID, and any sampling decisions. Service B extracts this context and uses it to create child spans that are properly linked to their parent. Without this, you have isolated islands of spans that can’t be connected into a coherent story.

The W3C Trace Context specification defines the standard headers, providing interoperability across different tracing implementations:

  • traceparent: Contains version, trace ID, parent span ID, and trace flags
  • tracestate: Vendor-specific data that rides along with the trace for extended functionality

A traceparent header looks like: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

This header encodes: the version (00), a 32-hex-character trace ID, a 16-hex-character parent span ID, and trace flags (01 means sampled). The trace ID stays constant across the entire distributed trace, while the parent span ID changes at each hop to maintain the parent-child relationship.
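
You can see the same fields from inside a service: the IDs in the header are just hex-encoded values from the current span context. This sketch assumes the SDK from the earlier configure_tracing() is active; otherwise the context is all zeros.

inspect_context.py
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("demo") as span:
    ctx = span.get_span_context()
    print(format(ctx.trace_id, "032x"))    # 32 hex chars -> the trace-id field
    print(format(ctx.span_id, "016x"))     # 16 hex chars -> the parent-id field
    print(format(ctx.trace_flags, "02x"))  # "01" when the trace is sampled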

OpenTelemetry handles propagation automatically for HTTP calls when you use instrumented clients. The httpx instrumentation injects headers on outgoing requests. The FastAPI instrumentation extracts headers from incoming requests. This works seamlessly when all your services use OpenTelemetry-instrumented HTTP clients.

Problems arise in three scenarios, and knowing these patterns helps you diagnose broken traces:

Scenario 1: Non-instrumented code in the path. A proxy, load balancer, or legacy service strips unknown headers. Your traces break at that boundary. This is surprisingly common—many nginx configurations, API gateways, and legacy middleware filter or drop headers they don’t recognize.

Solution: Ensure all network hops preserve traceparent and tracestate headers. For nginx, explicitly forward these headers:

proxy_set_header traceparent $http_traceparent;
proxy_set_header tracestate $http_tracestate;

For other proxies and gateways, consult their documentation on header preservation. Many cloud load balancers handle standard headers correctly, but custom deployments often need explicit configuration.

Scenario 2: Async messaging systems. Kafka, RabbitMQ, and SQS don’t use HTTP headers. Context must be embedded in the message itself. This is where many organizations’ traces break—they have excellent HTTP tracing but lose visibility when requests transition to async processing.

For Kafka, inject context into message headers. The pattern is the same across messaging systems: serialize the context, attach it to the message, then extract and restore it on the consumer side:

kafka_producer.py
import json

from opentelemetry import trace
from opentelemetry.propagate import inject
from confluent_kafka import Producer

tracer = trace.get_tracer(__name__)

def publish_order_event(order: dict) -> None:
    with tracer.start_as_current_span("publish_order_event") as span:
        span.set_attribute("messaging.system", "kafka")
        span.set_attribute("messaging.destination", "orders")

        # Inject trace context into Kafka headers
        # The inject function writes traceparent/tracestate into the provided dict
        headers = {}
        inject(headers)  # Populates headers with traceparent/tracestate

        # Convert to Kafka header format: list of tuples with byte values
        kafka_headers = [(k, v.encode()) for k, v in headers.items()]

        producer = Producer({"bootstrap.servers": "localhost:9092"})
        producer.produce(
            topic="orders",
            value=json.dumps(order).encode(),
            headers=kafka_headers,
        )
        producer.flush()

On the consumer side, extract the context before processing. The consumer’s span becomes a child of the producer’s span, maintaining the trace across the async boundary:

kafka_consumer.py
import json

from opentelemetry import trace
from opentelemetry.propagate import extract
from opentelemetry.trace import SpanKind

tracer = trace.get_tracer(__name__)

def process_message(message) -> None:
    # Extract trace context from Kafka headers
    # Decode bytes back to strings for the extract function
    headers = {k: v.decode() for k, v in message.headers() or []}
    ctx = extract(headers)

    # Create a span linked to the producer's span
    # SpanKind.CONSUMER indicates this is the receiving side of a messaging operation
    with tracer.start_as_current_span(
        "process_order_event",
        context=ctx,
        kind=SpanKind.CONSUMER,
    ) as span:
        span.set_attribute("messaging.system", "kafka")
        span.set_attribute("messaging.operation", "receive")
        order = json.loads(message.value())
        handle_order(order)  # application-specific handler

Scenario 3: Thread or async context loss. In Python, if you spawn threads or use certain async patterns incorrectly, the context doesn’t propagate. The context is stored in a context variable that doesn’t automatically flow across thread boundaries.

context_aware_threads.py
from opentelemetry import trace, context
from concurrent.futures import ThreadPoolExecutor

tracer = trace.get_tracer(__name__)

def process_items_parallel(items: list) -> list:
    with tracer.start_as_current_span("parallel_processing"):
        # Capture current context before spawning threads
        ctx = context.get_current()

        def process_with_context(item):
            # Attach context in the worker thread
            # This restores the trace context in the new thread
            token = context.attach(ctx)
            try:
                with tracer.start_as_current_span("process_item"):
                    return do_work(item)  # application-specific processing
            finally:
                # Always detach to avoid context leaks
                context.detach(token)

        with ThreadPoolExecutor(max_workers=4) as executor:
            return list(executor.map(process_with_context, items))

⚠️ Warning: Always test context propagation across every service boundary. A single misconfigured proxy can silently break your entire tracing pipeline. Build integration tests that verify trace IDs flow correctly through your system.
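
One way to test propagation without standing up real services is the SDK's in-memory exporter: inject into a plain dict that stands in for HTTP headers, extract on the "other side," and assert the spans share a trace ID. A minimal sketch:

test_propagation.py
from opentelemetry import trace
from opentelemetry.propagate import inject, extract
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

def test_trace_id_survives_a_hop():
    exporter = InMemorySpanExporter()
    provider = TracerProvider()
    provider.add_span_processor(SimpleSpanProcessor(exporter))
    tracer = trace.get_tracer(__name__, tracer_provider=provider)

    carrier = {}  # stands in for outgoing HTTP headers
    with tracer.start_as_current_span("client_call"):
        inject(carrier)  # what the instrumented client would send

    # "Server side": restore context from the carrier and start a child span
    with tracer.start_as_current_span("server_handler", context=extract(carrier)):
        pass

    spans = {s.name: s for s in exporter.get_finished_spans()}
    assert spans["client_call"].context.trace_id == spans["server_handler"].context.trace_id
    assert spans["server_handler"].parent.span_id == spans["client_call"].context.span_id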


Deploying the OpenTelemetry Collector on Kubernetes

The Collector is a stateless service that receives, processes, and exports telemetry. In Kubernetes, deploy it as a Deployment with a Service for discovery. For high-throughput environments, consider multiple replicas behind a load balancer. The Collector is designed to scale horizontally—you can run as many replicas as needed to handle your telemetry volume.

The Collector configuration defines a pipeline with three stages: receivers (input), processors (transformation), and exporters (output). This pipeline model is powerful because it lets you transform, filter, and route telemetry data without modifying your application code.

collector-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: observability
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318

    processors:
      # Batch processor improves export efficiency
      batch:
        timeout: 5s
        send_batch_size: 1000
      # Memory limiter prevents OOM by applying backpressure
      memory_limiter:
        check_interval: 1s
        limit_mib: 1800
        spike_limit_mib: 500
      # Add environment metadata to all telemetry
      resource:
        attributes:
          - key: environment
            value: production
            action: upsert

    exporters:
      # Export to Tempo for trace storage
      otlp/tempo:
        endpoint: tempo.observability.svc:4317
        tls:
          insecure: true
      # Also log for debugging - disable in production
      logging:
        loglevel: info

    extensions:
      # Serves the health endpoint on :13133 used by the Deployment's probes
      health_check:
        endpoint: 0.0.0.0:13133

    service:
      extensions: [health_check]
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch, resource]
          exporters: [otlp/tempo, logging]

The processor order matters. The memory_limiter should come first so backpressure is applied before other processors buffer data. The batch processor groups spans into larger export requests, reducing network overhead between the Collector and the backend. The resource processor adds metadata that helps with filtering and aggregation in your trace backend. The health_check extension serves the endpoint the Kubernetes probes below rely on.

Deploy the Collector itself as a Kubernetes Deployment. The configuration uses resource requests and limits appropriate for moderate throughput—adjust based on your actual telemetry volume:

collector-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
  namespace: observability
spec:
  replicas: 2
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: collector
          image: otel/opentelemetry-collector-contrib:0.96.0
          args:
            - --config=/conf/config.yaml
          ports:
            - containerPort: 4317 # OTLP gRPC
            - containerPort: 4318 # OTLP HTTP
            - containerPort: 8888 # Metrics endpoint for monitoring the collector
          volumeMounts:
            - name: config
              mountPath: /conf
          resources:
            requests:
              memory: 512Mi
              cpu: 200m
            limits:
              memory: 2Gi
              cpu: 1000m
          livenessProbe:
            httpGet:
              path: /
              port: 13133
          readinessProbe:
            httpGet:
              path: /
              port: 13133
      volumes:
        - name: config
          configMap:
            name: otel-collector-config
---
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
  namespace: observability
spec:
  selector:
    app: otel-collector
  ports:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
    - name: otlp-http
      port: 4318
      targetPort: 4318

Applications send telemetry to otel-collector.observability.svc:4317. The Collector batches spans, applies the memory limiter to prevent OOM, adds environment metadata, and forwards to Tempo. This centralized approach means you can change export destinations, add sampling, or modify processing without redeploying applications.

For production deployments, monitor the Collector itself. The Collector exposes Prometheus metrics on port 8888 that show queue depths, dropped spans, and export latencies. A queue that’s consistently full indicates you need more Collector replicas or need to increase sampling.

For the OpenTelemetry Operator approach, install via Helm. The Operator provides Kubernetes-native management of Collectors and can automatically inject instrumentation into pods:

Terminal window
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm install opentelemetry-operator open-telemetry/opentelemetry-operator \
--namespace observability \
--create-namespace

The Operator lets you define Collectors as custom resources and automatically injects instrumentation into pods using annotations. That’s powerful but adds complexity—start with the manual deployment, graduate to the Operator when you need auto-instrumentation injection or want Kubernetes-native management of multiple Collector configurations.

📝 Note: The contrib image includes more receivers and exporters than the core image. Use it unless you’re optimizing for minimal image size. The contrib image supports dozens of receivers (Kafka, AWS X-Ray, Zipkin) and exporters (Datadog, New Relic, Splunk) out of the box.


Correlating Traces with Logs and Metrics

Traces show request flow. Logs show detailed events. Metrics show aggregate behavior. The power comes from linking them. Each pillar answers different questions, and the real value emerges when you can navigate seamlessly between them. A spike in error rate (metrics) leads you to an error trace, which leads you to the specific log lines that explain the root cause.

When your p99 latency spikes, metrics tell you something is wrong. Traces tell you which requests are slow. Logs tell you exactly what happened inside those slow requests. Without correlation, you’re switching between three tools, manually matching timestamps and hoping you’re looking at the same request.

Inject trace IDs into structured logs so you can query logs by trace. This is one of the highest-value instrumentation patterns—it costs almost nothing and dramatically improves debugging workflows:

logging_config.py
import logging
import json
from opentelemetry import trace

class TraceInjectingFormatter(logging.Formatter):
    """Formatter that adds trace context to every log record.

    This enables clicking from a trace to see all related logs,
    or searching logs by trace_id to see the complete picture.
    """

    def format(self, record: logging.LogRecord) -> str:
        span = trace.get_current_span()
        ctx = span.get_span_context()

        log_data = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
            "trace_id": format(ctx.trace_id, "032x") if ctx.is_valid else None,
            "span_id": format(ctx.span_id, "016x") if ctx.is_valid else None,
        }

        if record.exc_info:
            log_data["exception"] = self.formatException(record.exc_info)

        return json.dumps(log_data)

def configure_logging():
    handler = logging.StreamHandler()
    handler.setFormatter(TraceInjectingFormatter())
    root = logging.getLogger()
    root.setLevel(logging.INFO)
    root.addHandler(handler)

Now every log line includes trace_id. In Grafana, you can click a trace and see all logs from that request across all services. Or start from a log error and jump to the trace that produced it. This bidirectional navigation transforms debugging from a scavenger hunt into a guided investigation.
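
A small usage sketch, assuming configure_tracing() and configure_logging() from the snippets above ran at startup; the example IDs reuse the traceparent values shown earlier:

logging_usage.py
import logging

from opentelemetry import trace

logger = logging.getLogger("payments")
tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("process_payment"):
    logger.info("charge submitted to gateway")
    # Emits a JSON line along the lines of:
    # {"timestamp": "...", "level": "INFO", "message": "charge submitted to gateway",
    #  "logger": "payments", "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    #  "span_id": "00f067aa0ba902b7"}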

For metrics, add trace exemplars. An exemplar is a link from a metric data point to a specific trace that contributed to it. When you see latency spike in a histogram, you can jump directly to an example slow trace rather than searching for one:

metrics_with_exemplars.py
from opentelemetry import metrics, trace
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

# Configure metrics export alongside trace export
exporter = OTLPMetricExporter(endpoint="localhost:4317", insecure=True)
reader = PeriodicExportingMetricReader(exporter)
provider = MeterProvider(metric_readers=[reader])
metrics.set_meter_provider(provider)

meter = metrics.get_meter(__name__)
request_duration = meter.create_histogram(
    "http.server.duration",
    description="HTTP request duration in seconds",
    unit="s",
)

def record_request_duration(duration: float, method: str, route: str):
    """Record request duration with trace exemplar.

    The exemplar links this metric data point to the current trace,
    enabling navigation from metric spikes to specific slow traces.
    """
    span = trace.get_current_span()
    ctx = span.get_span_context()

    # The exemplar links this data point to the current trace
    # In Prometheus/Grafana, this enables "exemplar" queries that show
    # specific traces that contributed to aggregate metrics
    request_duration.record(
        duration,
        attributes={
            "http.method": method,
            "http.route": route,
        },
    )

In Grafana, configure Tempo as a trace data source and link it to your Prometheus metrics. The “Trace to logs” and “Trace to metrics” features let you navigate seamlessly between the three pillars. Configure your data sources with the appropriate correlation settings, and you can click from a metric graph to exemplar traces to detailed logs without leaving the Grafana interface.

The Collector’s resource detection processor enriches all telemetry with Kubernetes metadata—pod name, namespace, node, deployment. This context is invaluable when debugging: you can filter traces by deployment version to compare behavior before and after a release. If latency increased after deploying version 2.3.1, you can quickly verify by filtering traces to that specific deployment.

Building effective dashboards requires thinking about the questions you’ll ask during incidents. A good starting point includes: a service map showing request flow, latency histograms with exemplars, error rate by service, and a trace search panel. The goal is to support a workflow: notice an anomaly in metrics, find example traces, then dive into logs for details.


Operational Concerns: Performance, Sampling, and Cost Control

Instrumentation adds overhead. Every span creation, attribute set, and export consumes CPU and memory. For most services, this overhead is negligible—under 1% of request latency. For latency-critical paths processing thousands of requests per second, it matters and should be measured.

Measure before optimizing. The OpenTelemetry SDK exposes internal metrics about span creation rate, export queue depth, and dropped spans. If you’re dropping spans, your BatchSpanProcessor queue is full—increase the queue size or reduce throughput. Key metrics to monitor include otelcol_processor_batch_batch_send_size and otelcol_exporter_send_failed_spans.

The biggest cost lever is sampling. A service handling 10,000 requests per second generates 864 million spans per day. At $0.30 per million spans (typical managed service pricing), that’s $259/day from one service. Sampling at 1% reduces cost to $2.59/day with minimal debugging impact for successful requests—you’ll still see enough traces to understand normal behavior, and with tail-based sampling, you’ll capture every error.
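
The arithmetic behind those figures, as a quick sketch (the per-million price is the illustrative number above, not a quote from any vendor):

span_cost_estimate.py
requests_per_second = 10_000
spans_per_day = requests_per_second * 86_400      # 864,000,000 spans
cost_per_day = spans_per_day / 1_000_000 * 0.30   # ≈ $259.20 at $0.30/M spans
cost_at_1_percent = cost_per_day * 0.01           # ≈ $2.59 with 1% sampling

print(f"${cost_per_day:.2f}/day unsampled, ${cost_at_1_percent:.2f}/day at 1%")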

Configure probabilistic sampling in the SDK for simple head-based sampling:

sampled_tracing.py
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

# Sample 1% of traces - decision is made at trace start
# All services must use the same sampling rate for consistent behavior
sampler = TraceIdRatioBased(0.01)

provider = TracerProvider(
    resource=resource,  # the same Resource created in configure_tracing()
    sampler=sampler,
)

The problem: probabilistic sampling discards error traces at the same rate as successful ones. You’ll miss the failures you actually need to debug. A 1% sampling rate means you’ll miss 99% of errors.

Tail-based sampling at the Collector level solves this. The Collector buffers complete traces, then decides which to keep based on their content. This ensures you capture what matters:

tail-sampling-config.yaml
processors:
  tail_sampling:
    decision_wait: 10s # How long to wait for a trace to complete
    num_traces: 100000 # How many traces to buffer (affects memory usage)
    policies:
      # Keep all traces with errors - 100% capture of failures
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      # Keep all traces slower than 2 seconds
      - name: slow-traces
        type: latency
        latency: {threshold_ms: 2000}
      # Sample traces that hit specific high-value routes
      - name: payment-traces
        type: string_attribute
        string_attribute: {key: http.route, values: ["/payments", "/checkout"]}
      # Keep 5% of everything else for baseline visibility
      - name: probabilistic
        type: probabilistic
        probabilistic: {sampling_percentage: 5}

This configuration guarantees 100% capture of errors and slow requests while sampling routine traffic. The tradeoff: decision_wait adds 10 seconds of latency before traces appear in your backend, and num_traces determines memory usage. For high-throughput systems, the Collector needs significant memory to buffer traces during the decision window.

Storage costs depend on your backend. Self-hosted Jaeger with Elasticsearch or Cassandra requires capacity planning—trace data can grow rapidly. Managed services like Grafana Cloud Tempo charge per trace or per GB ingested. Model your expected trace volume, apply your sampling rate, and calculate monthly costs before deploying to production. A spreadsheet exercise here can prevent budget surprises.

Retention is another lever. Keep detailed traces for 7 days, then aggregate to metrics. Most debugging happens within hours of an incident—you rarely need 90 days of trace history. Configure your trace backend’s retention policies accordingly, and consider tiered storage for longer-term analysis if needed.


Key Takeaways

  • Start with auto-instrumentation to get immediate visibility, then add manual spans only for business-critical paths that need custom attributes
  • Deploy the OpenTelemetry Collector as a gateway in Kubernetes rather than sidecars to reduce resource overhead and centralize sampling decisions
  • Always inject trace IDs into your structured logs from day one—retrofitting correlation later is painful and error-prone
  • Implement tail-based sampling at the collector level to guarantee you capture 100% of error traces while sampling routine requests
  • Test context propagation across every service boundary, especially async messaging systems where headers are commonly dropped

Resources