Last month, requests to our checkout service started timing out intermittently. The service itself looked healthy: CPU and memory were fine, and error rates were low. But 2% of requests were taking 30 seconds instead of 300 milliseconds. Without distributed tracing, we would have spent days bisecting the problem. With it, we found the root cause in 20 minutes.
The Three Pillars
Observability rests on three pillars, and you need all three:
Logs tell you what happened. They're the most familiar tool, but they're also the hardest to scale. Structured logging (JSON, not plaintext) is non-negotiable in a distributed system; a minimal example follows this list.
Metrics tell you how things are behaving in aggregate. They're cheap to collect and query, and they're your first line of defense for alerting. The RED method (Rate, Errors, Duration) is a good starting framework; a sketch of RED-style instrumentation also follows the list.
Traces tell you why something happened by following a single request across service boundaries. This is the pillar that most teams adopt last, but it's arguably the most valuable.
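Here is a minimal sketch of what "structured" means in practice: one JSON object per event, written to stdout. The field names (service, order_id, duration_ms) are illustrative rather than a standard schema, and in a real service you would likely use a logging library instead of console.log.

// Minimal structured-logging sketch: one JSON object per event on stdout.
// Field names are illustrative, not a standard schema.
function logEvent(level, message, fields = {}) {
  console.log(JSON.stringify({
    timestamp: new Date().toISOString(),
    level,
    message,
    service: 'checkout-service',
    ...fields,
  }));
}

logEvent('info', 'order processed', { order_id: 'ord_123', duration_ms: 287 });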
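And a sketch of RED-style instrumentation using the OpenTelemetry metrics API. The metric and attribute names here are my own illustrations, and a meter provider would need to be registered elsewhere for these measurements to actually be exported.

// RED sketch: a counter for rate, a counter for errors, a histogram for duration.
const { metrics } = require('@opentelemetry/api');

const meter = metrics.getMeter('checkout-service');
const requestCount = meter.createCounter('checkout.requests');          // Rate
const errorCount = meter.createCounter('checkout.errors');              // Errors
const requestDuration = meter.createHistogram('checkout.duration_ms');  // Duration

function recordRequest(route, durationMs, failed) {
  requestCount.add(1, { route });
  if (failed) errorCount.add(1, { route });
  requestDuration.record(durationMs, { route });
}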
OpenTelemetry: The Standard
OpenTelemetry has won. It's the CNCF project that unified OpenTracing and OpenCensus, and it provides a vendor-neutral SDK for instrumenting your code. Manual instrumentation in Node.js looks like this:
// SpanStatusCode comes from the same package as trace.
const { trace, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('checkout-service');

async function processOrder(orderId) {
  // startActiveSpan makes this span the current context for everything awaited inside.
  return tracer.startActiveSpan('processOrder', async (span) => {
    span.setAttribute('order.id', orderId);
    try {
      await validateInventory(orderId);
      await chargePayment(orderId);
      await sendConfirmation(orderId);
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (error) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      throw error;
    } finally {
      // Always end the span, even on failure, so it is exported.
      span.end();
    }
  });
}
Context Propagation
The magic of distributed tracing is context propagation. When Service A calls Service B, the trace ID and span ID are passed in HTTP headers (typically using the W3C Trace Context format). This is what allows you to see the entire request path across dozens of services in a single waterfall view.
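To make that concrete, here is a sketch of manual header injection using @opentelemetry/api, assuming the SDK's default W3C Trace Context propagator is registered. The inventory URL and route are placeholders, and in practice HTTP auto-instrumentation usually injects these headers for you.

const { context, propagation } = require('@opentelemetry/api');

async function callInventoryService(orderId) {
  const headers = {};
  // Writes a traceparent header such as:
  // 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
  // (version, trace ID, parent span ID, trace flags)
  propagation.inject(context.active(), headers);

  return fetch(`http://inventory.internal/reserve/${orderId}`, {
    method: 'POST',
    headers,
  });
}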
The Cost Question
Observability is expensive. Datadog, New Relic, and Honeycomb all charge based on data volume, and a busy microservices architecture can generate terabytes of telemetry per day.
My approach: sample traces aggressively (1-5% in production), but always capture traces for errors and slow requests. Metrics are cheap — collect everything. Logs should be structured and filtered at the source.
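For the trace side, a head-based sampler at 5% is a one-line SDK configuration. This is a sketch assuming the @opentelemetry/sdk-trace-node and @opentelemetry/sdk-trace-base packages; note that head-based sampling decides before the request finishes, so reliably keeping errors and slow requests generally means tail-based sampling in the OpenTelemetry Collector instead.

const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { ParentBasedSampler, TraceIdRatioBasedSampler } = require('@opentelemetry/sdk-trace-base');

const provider = new NodeTracerProvider({
  // Sample 5% of new root traces; honor the parent's decision downstream so a
  // trace is never collected in fragments.
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.05),
  }),
});
provider.register();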
What Changed Our Debugging Culture
The biggest impact wasn't technical — it was cultural. Once engineers could see a request flow through the entire system in a single view, they started thinking about distributed systems differently. Instead of "my service is fine, the problem must be somewhere else," the conversation became "let me trace this request and find exactly where the latency is coming from."
That shift in mindset is worth more than any tool.