Three Pillars of Observability | Metrics, Logs, and Traces Explained

Discover the three pillars of observability: metrics, logs, and traces. Understand how these telemetry signals help diagnose problems in distributed systems.

When a request takes 5 seconds to respond, you have three fundamental questions: when the problem started, where the bottleneck is, and what caused the failure. Metrics detect when something is broken through timeline anomalies. Traces show where and how the request traveled across distributed services. Logs diagnose what happened in detail. Together, they form the foundation of modern observability.

Distributed systems generate data across multiple layers, each answering a different debugging question. Relying on a single pillar severely limits what an investigation can uncover. The correlation between metrics, logs, and traces is the key to effective diagnosis in modern architectures.

What Are the Three Pillars of Observability?

The three pillars of observability are metrics, logs, and traces. Metrics are numerical values aggregated over time. Logs are timestamped records of discrete events. Traces track the path of requests through distributed systems. Together, these pillars enable investigation of complex problems by correlating different data types.

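The concept of the three pillars was popularized in the observability community around 2017 and 2018, notably through Peter Bourgon’s post “Metrics, tracing, and logging” and Cindy Sridharan’s report Distributed Systems Observability, as a way to structure the telemetry needed to understand distributed systems. Engineers like Charity Majors have since pushed the discussion toward treating the signals as correlated data rather than separate silos. Each pillar captures a different dimension of the system’s operational reality, and the correlation between them creates a complete view.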

Why the Three Pillars Are Complementary

No single pillar alone is sufficient for problem investigation in modern systems. The correlation between metrics, logs, and traces creates true observability.

Typical SRE investigation flow:

  1. Metrics detect the anomaly: “Latency P95 exceeded SLO at 2:32 PM”
  2. Traces locate the bottleneck: “Payment service took 1.8s on database validation”
  3. Logs diagnose the cause: “Connection timeout stack trace to the database”

In isolation, each pillar offers a partial view. Correlated, they form a complete investigation system.

Metrics

Metrics are numerical values aggregated over time, representing system state as time series data. They answer quantitative questions about system behavior: “what is the current error rate?”, “is latency within SLO?”, “how many requests per second are we processing?”.

Key characteristics:

  • Aggregation: Data is summarized into averages, sums, percentiles
  • Timestamped: Each point has a precise timestamp
  • Low cardinality: Few unique values per label dimension
  • Compressible: Take up little storage space
  • Fast query: Real-time mathematical operations

Metric Types

Counters are values that only increase over time, counting cumulative events.

  • Total HTTP requests processed
  • Total errors occurred
  • Accumulated bytes transferred

Use cases: Calculate rates (requests per second), event frequency, throughput.
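As a rough illustration of how a rate is derived from a counter, the sketch below computes an average per-second rate from raw samples. The function name `counter_rate` and the sample values are assumptions made for this example; production systems such as Prometheus apply additional extrapolation that is not reproduced here.

```python
from typing import List, Tuple


def counter_rate(samples: List[Tuple[float, float]]) -> float:
    """Average per-second increase over (timestamp_seconds, counter_value) samples.

    A drop in value is treated as a counter reset (for example a process
    restart), so the value after the reset is counted as new increase.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        # Normal case: add the delta. Reset case: the counter restarted from 0.
        increase += (current - previous) if current >= previous else current
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# Hypothetical scrape results: the counter resets between t=15 and t=30.
samples = [(0.0, 100.0), (15.0, 160.0), (30.0, 20.0), (45.0, 80.0)]
requests_per_second = counter_rate(samples)  # (60 + 20 + 60) / 45
```

Because the counter only ever increases between resets, the rate stays meaningful even when individual scrapes are missed.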

Gauges are values that can go up or down, representing a momentary state.

  • CPU usage percentage
  • Available memory in MB
  • Current active connections
  • Current queue size

Use cases: Monitor capacity, resource utilization, current system state.

Histograms group values into predefined buckets, allowing the distribution to be inspected and percentiles to be estimated.

  • Request latency (P50, P95, P99)
  • Payload size
  • Processing time

Use cases: Performance SLIs, understand value distribution, detect outliers.
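To make the bucket-to-percentile step concrete, here is a minimal sketch that estimates a quantile from cumulative buckets using linear interpolation inside the matching bucket. It is similar in spirit to what histogram-based systems do, but `estimate_quantile`, the bucket boundaries, and the counts are assumptions for illustration, and the sketch assumes values are spread evenly within each bucket and that the first bucket starts at 0.

```python
import math
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, int]]) -> float:
    """Estimate the q-quantile from [(upper_bound, cumulative_count), ...].

    Buckets must be sorted by upper bound and end with an +inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, cumulative in buckets:
        if cumulative >= rank:
            if math.isinf(upper_bound):
                # Cannot interpolate inside the open-ended bucket.
                return lower_bound
            in_bucket = cumulative - lower_count
            if in_bucket == 0:
                return upper_bound
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, cumulative
    return lower_bound


# Hypothetical latency histogram in seconds (cumulative counts).
latency_buckets = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (math.inf, 100)]
p50 = estimate_quantile(0.50, latency_buckets)
p90 = estimate_quantile(0.90, latency_buckets)
```

The estimate is only as precise as the bucket layout: a percentile that falls in a wide bucket can be off by up to the width of that bucket.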

Summaries are similar to histograms but calculate percentiles on the client side before sending, which means the results cannot be re-aggregated across instances afterward.

  • Request duration percentiles calculated locally
  • Response size percentiles

Use cases: High-precision percentiles when you don’t want to depend on predefined buckets.

Golden Signals

The Google SRE “golden signals” are the four metrics that Google’s SRE practice recommends monitoring first for any user-facing system:

  1. Latency: Time to respond to requests
  2. Traffic: Volume of requests (requests per second)
  3. Errors: Rate of failed requests
  4. Saturation: How “full” the system is (CPU, memory, connections)

These metrics form the basis for health dashboards and SLO alerts.
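The four signals can be derived from very little raw data. The sketch below computes them for one time window; `RequestRecord`, `golden_signals`, the nearest-rank P95 method, and the use of in-flight requests over capacity as the saturation measure are all assumptions chosen for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RequestRecord:
    duration_ms: float
    status: int


def golden_signals(requests: List[RequestRecord], window_seconds: float,
                   in_flight: int, capacity: int) -> Dict[str, float]:
    """Compute latency, traffic, errors, and saturation for one window."""
    durations = sorted(r.duration_ms for r in requests)
    count = len(durations)
    if count == 0:
        raise ValueError("need at least one request in the window")
    # Nearest-rank P95 using integer ceiling division: ceil(0.95 * count).
    p95_rank = (95 * count + 99) // 100
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_ms": durations[p95_rank - 1],
        "traffic_rps": count / window_seconds,
        "error_rate": errors / count,
        "saturation": in_flight / capacity,
    }


# Hypothetical 10-second window: 20 requests, the two slowest ones failed.
window = [RequestRecord(duration_ms=10.0 * i, status=500 if i > 18 else 200)
          for i in range(1, 21)]
signals = golden_signals(window, window_seconds=10.0, in_flight=40, capacity=50)
```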

Cardinality Explosion in Metrics

The biggest technical challenge in metrics systems is cardinality explosion — the growth in the number of time series when dimensions with many unique values are added.

In systems like Prometheus, the number of active time series for a metric (S_total) is bounded by the Cartesian product of its metadata dimensions (labels):

S_total ≤ ∏(i=1 to n) |C_i|

Where |C_i| is the number of unique values for label dimension i. The bound is reached only when every combination of label values actually occurs; combinations that are never observed create no series.

Practical example: If you have 10 endpoints (|C_1| = 10), 5 HTTP methods (|C_2| = 5), and 3 environments (|C_3| = 3), the metric can produce up to 10 × 5 × 3 = 150 series.

Important: Request volume (V) does not affect the number of series. If a single user makes one million identical requests with the same labels, the system generates only one time series. Volume merely updates the values of existing counters and gauges.
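The arithmetic above can be checked with a short sketch. The label names, the `max_series` helper, and the in-memory `Counter` standing in for a metrics store are assumptions for illustration only.

```python
from collections import Counter
from math import prod
from typing import Dict


def max_series(label_cardinalities: Dict[str, int]) -> int:
    """Upper bound on series count: product of unique values per label."""
    return prod(label_cardinalities.values())


labels = {"endpoint": 10, "method": 5, "environment": 3}
upper_bound = max_series(labels)

# Volume does not create series: many identical requests update one entry.
http_requests_total = Counter()
label_set = ("/checkout", "POST", "prod")
for _ in range(100_000):
    http_requests_total[label_set] += 1

# Adding one high-cardinality label multiplies the bound.
with_user_id = max_series({**labels, "user_id": 1_000_000})
```

Here `len(http_requests_total)` stays at 1 no matter how many requests arrive, while the bound with `user_id` grows to 150 million.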

High cardinality mitigations:

  • Limit the number of labels per metric (e.g., maximum 10)
  • Use top-K aggregation to keep only the K most frequent values
  • Pre-aggregate high-cardinality data before exporting
  • Move high-cardinality data (user_id, request_id) to logs and traces

Consequence: Adding a label like user_id with 1 million unique values can multiply the series count of that metric by up to 1 million, which is generally infeasible.
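One of the mitigations listed above, top-K aggregation, can be sketched as follows. The helper name `fold_to_top_k`, the `"other"` bucket label, and the sample paths are hypothetical; the idea is simply to keep the K most frequent label values and fold the long tail into one series.

```python
from collections import Counter


def fold_to_top_k(counts: Counter, k: int, other_label: str = "other") -> Counter:
    """Keep the k most frequent label values and merge the rest into one."""
    top = {label for label, _ in counts.most_common(k)}
    folded = Counter()
    for label, value in counts.items():
        folded[label if label in top else other_label] += value
    return folded


# Hypothetical request counts per path; the /user/<id> paths form the long tail.
requests_by_path = Counter({
    "/checkout": 900, "/cart": 700, "/search": 400,
    "/user/1": 3, "/user/2": 2, "/user/3": 1,
})
folded = fold_to_top_k(requests_by_path, k=3)
```

The total count is preserved, so rates and error ratios computed across all series are unchanged; only the per-value detail of the tail is lost.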

Logs

Logs are immutable, timestamped records of discrete events that occurred in the system. Each log entry captures a specific moment with detailed context, answering questions like “what exactly happened at 2:32 PM?” and “what was the error’s stack trace?”.

Key characteristics:

  • Immutable: Once written, cannot be changed
  • Timestamped: Each entry has precise timestamp (milliseconds)
  • High cardinality: Can contain unique IDs, specific messages
  • Rich context: Include metadata, stack traces, event data
  • Auditable: Form trails for compliance and investigation

Log Structure

Unstructured log (hard to parse):

[2026-06-01 14:32:15] ERROR Payment failed for user 789: timeout

Structured log (easy to search and correlate):

{
  "timestamp": "2026-06-01T14:32:15.123Z",
  "level": "ERROR",
  "service": "payment-api",
  "trace_id": "abc123def456",
  "span_id": "span456",
  "user_id": "user_789",
  "message": "Payment processing failed",
  "error": "connection timeout",
  "stack_trace": "java.net.ConnectException...",
  "duration_ms": 5432
}
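A structured entry like the one above can be produced with a small formatter on top of Python’s standard `logging` module. This is a minimal sketch: the `JsonFormatter` class, the `EXTRA_FIELDS` tuple, and the in-memory stream are assumptions for this example, not part of any particular logging product.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    # Hypothetical list of extra fields copied from the record when present.
    EXTRA_FIELDS = ("service", "trace_id", "span_id", "user_id", "duration_ms")

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.isoformat(timespec="milliseconds"),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        for field in self.EXTRA_FIELDS:
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        if record.exc_info:
            entry["stack_trace"] = self.formatException(record.exc_info)
        return json.dumps(entry)


stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

logger.error(
    "Payment processing failed",
    extra={"service": "payment-api", "trace_id": "abc123def456",
           "user_id": "user_789", "duration_ms": 5432},
)
line = stream.getvalue().strip()
```

Each call emits one JSON object per line, which is the shape most log pipelines expect.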

Log Levels (Severity Levels)

  • DEBUG: Detailed information for development debugging
  • INFO: Informative business events (request received, order created)
  • WARN: Potentially problematic conditions, but not errors
  • ERROR: Errors that don’t interrupt system operation
  • FATAL: Critical errors that interrupt operation

Best practice: Use levels consistently. Enable DEBUG in development and keep production at INFO and above.

Structured Logging

Structured logs use a structured data format (JSON, protobuf) instead of free text. This enables:

  • Precise queries: Search by specific field (level: ERROR AND service: payment)
  • Automatic correlation: Cross-reference trace_id between logs and traces
  • Efficient parsing: Tools process automatically
  • SIEM integration: Platforms such as Splunk, Datadog, and Elasticsearch ingest structured fields natively

Practical benefits:

  • Can reduce debugging time from hours to minutes, because field queries replace manual text searches
  • Enables dynamic dashboards based on any field
  • Facilitates compliance (PCI-DSS, GDPR require audit trails)

Correlation IDs

Including correlation IDs in logs allows tracing requests across multiple services and log entries:

{
  "trace_id": "abc123def456",
  "span_id": "span456",
  "parent_span_id": "span123",
  "user_id": "user_789",
  "request_id": "req999",
  ...
}

These IDs connect logs, traces, and metrics, creating a unified view of the event.
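One way to get a correlation ID onto every log line without passing it through each function call is to store it in a context variable and attach it with a logging filter. The sketch below uses only the standard library; `trace_id_var`, `TraceIdFilter`, and `handle_request` are hypothetical names for this example.

```python
import contextvars
import io
import logging
import uuid
from typing import Optional

# Holds the trace ID of the request currently being processed.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the current trace ID onto every record passing the handler."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = trace_id_var.get()
        return True


stream = io.StringIO()  # stands in for the real log destination
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
handler.addFilter(TraceIdFilter())
logger = logging.getLogger("correlation-demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False


def handle_request(incoming_trace_id: Optional[str] = None) -> None:
    """Reuse the caller's trace ID if present, otherwise start a new one."""
    token = trace_id_var.set(incoming_trace_id or uuid.uuid4().hex)
    try:
        logger.info("request received")
        logger.info("payment authorized")
    finally:
        trace_id_var.reset(token)


handle_request("abc123def456")
lines = stream.getvalue().splitlines()
```

Both log lines carry the same `trace_id`, so a single query on that field returns the full story of the request.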

Context Enrichment

Adding context enriches logs for more efficient debugging:

  • User context: user_id, session_id, tenant_id, account_type
  • Request context: request_id, trace_id, http_method, path, query_params
  • System context: host, service, version, environment, region
  • Business context: order_id, cart_id, transaction_id, payment_method

The more context, the faster the investigation.

Traces (Distributed Tracing)

Traces represent the complete path of a request through multiple services in a distributed system. They connect events from different systems into a unified journey, answering questions like “why did this request take 2 seconds?” and “which services were called?”.

Key characteristics:

  • End-to-end: From start to finish of a request
  • Cross-service: Traverses multiple services, databases, caches
  • Causal: Shows cause and effect relationships between operations
  • Temporal: Each step has precise timestamp and duration

Trace ID and Span ID

Trace ID is the unique identifier for a complete request, shared across all involved services. It allows grouping all events from a single journey.

Span ID is the unique identifier for each individual operation within the trace. Each span represents a unit of work.

Parent Span ID connects spans in a dependency tree, showing the call hierarchy.

Trace structure:

Trace: abc123def456 [duration: 50ms]
└── Span: span001 (API Gateway) [0ms - 50ms]
    ├── Span: span002 (Auth Service) [10ms - 25ms]
    └── Span: span003 (Payment Service) [30ms - 45ms]
        ├── Span: span004 (Database Query) [35ms - 38ms]
        └── Span: span005 (Cache Lookup) [40ms - 42ms]
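The trace structure above is just a list of spans linked by parent IDs. The following sketch rebuilds the tree from a flat list and computes each span’s self time, meaning the time not covered by its direct children. The `Span` dataclass and helper names are assumptions for this example, and the self-time calculation assumes sibling spans do not overlap.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    span_id: str
    parent_span_id: Optional[str]
    name: str
    start_ms: int
    end_ms: int


def children_by_parent(spans: List[Span]) -> Dict[Optional[str], List[Span]]:
    """Index spans by their parent ID; the root span has parent None."""
    index: Dict[Optional[str], List[Span]] = {}
    for span in spans:
        index.setdefault(span.parent_span_id, []).append(span)
    return index


def render(spans: List[Span]) -> List[str]:
    """Return one indented line per span, ordered by start time."""
    index = children_by_parent(spans)
    lines: List[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for span in sorted(index.get(parent_id, []), key=lambda s: s.start_ms):
            lines.append(f"{'  ' * depth}{span.name} "
                         f"[{span.start_ms}ms - {span.end_ms}ms]")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


def self_time_ms(span: Span, index: Dict[Optional[str], List[Span]]) -> int:
    """Span duration minus the duration of its direct children."""
    child_time = sum(c.end_ms - c.start_ms for c in index.get(span.span_id, []))
    return (span.end_ms - span.start_ms) - child_time


spans = [
    Span("span001", None, "API Gateway", 0, 50),
    Span("span002", "span001", "Auth Service", 10, 25),
    Span("span003", "span001", "Payment Service", 30, 45),
    Span("span004", "span003", "Database Query", 35, 38),
    Span("span005", "span003", "Cache Lookup", 40, 42),
]
index = children_by_parent(spans)
lines = render(spans)
payment_self_time = self_time_ms(spans[2], index)
```

Self time is often more useful than raw duration when looking for a bottleneck, because a parent span’s duration includes everything its children did.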

Context Propagation

Context propagation is the mechanism of passing trace ID, span ID, and other metadata between services to connect spans in a unified journey.

Propagation mechanisms:

  • HTTP Headers: traceparent, tracestate (W3C Trace Context standard)
  • gRPC Metadata: Key-value in RPC metadata
  • Message Headers: Headers in messaging systems (Kafka, RabbitMQ, SQS)

W3C Trace Context example:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
tracestate: vendor=value

The W3C Trace Context standard is designed to provide interoperability between systems and tools from different vendors.
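The `traceparent` header has four dash-separated fields: version, trace ID, parent span ID, and flags. A minimal parser for the version `00` layout might look like the sketch below. The names `TraceParent` and `parse_traceparent` are assumptions, and the sketch deliberately rejects anything that does not match the four-field layout rather than handling future versions.

```python
import re
from typing import NamedTuple, Optional

# version(2 hex) - trace-id(32 hex) - parent-id(16 hex) - flags(2 hex)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Parse a traceparent header value, returning None if it is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid values.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    # The lowest bit of the flags byte is the "sampled" flag.
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


context = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
```

A service that receives this header keeps the trace ID, creates a new span ID for its own work, and forwards an updated header downstream.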

Sampling

Collecting 100% of traces is often impractical for high-volume systems. Sampling reduces storage and processing cost.

Sampling types:

  • Head-based sampling: Decision made at the start of the trace (more efficient, may miss error traces)
  • Tail-based sampling: Decision at the end of the trace (smarter, captures errors and high latency)
  • Adaptive sampling: Adjusts rate dynamically based on volume and budget

Typical sampling rates:

  • 0.1% - 1% for high-volume systems (10k+ RPS)
  • 10% - 50% for medium systems
  • 100% for critical or low-volume systems

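Practical tip: Keep 100% of error and high-latency traces, even in high-volume systems. This requires tail-based sampling, because errors and total latency are only known once the trace has completed.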
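The difference between head-based and tail-based decisions can be shown in a few lines. This is a simplified sketch, not the algorithm of any specific tracing SDK: `head_sample`, `tail_sample`, the trace dictionaries, and the thresholds are all assumptions for illustration.

```python
from typing import Dict, List


def head_sample(trace_id: str, rate: float) -> bool:
    """Decide at trace start, using only the trace ID.

    The decision is deterministic, so every service that sees the same
    trace ID reaches the same answer without coordination.
    """
    threshold = int(rate * 2**64)
    return int(trace_id[:16], 16) < threshold


def tail_sample(trace: Dict, latency_threshold_ms: float,
                baseline_rate: float) -> bool:
    """Decide after the trace completes: keep errors and slow traces,
    and fall back to the baseline rate for everything else."""
    if trace["error"] or trace["duration_ms"] >= latency_threshold_ms:
        return True
    return head_sample(trace["trace_id"], baseline_rate)


# Hypothetical completed traces.
completed: List[Dict] = [
    {"trace_id": "ff" * 16, "duration_ms": 40, "error": False},
    {"trace_id": "fe" * 16, "duration_ms": 2300, "error": False},
    {"trace_id": "fd" * 16, "duration_ms": 35, "error": True},
]
kept = [t["trace_id"] for t in completed
        if tail_sample(t, latency_threshold_ms=1000, baseline_rate=0.01)]
```

The trade-off is that the tail-based path has to buffer every span of a trace until it completes, which costs memory and requires all spans of a trace to reach the same decision point.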

Use Cases for Traces

  • Identify latency bottlenecks (which service is slow?)
  • Visualize dependencies between services
  • Debug cascading failures (service A failed because service B is down)
  • Optimize critical performance paths
  • Understand data flow in microservices architectures

The Synergy of the Three Pillars

The correlation between metrics, logs, and traces through shared IDs is what transforms isolated data into true observability.

Correlated Investigation Flow

Scenario: P95 latency increased from 100ms to 2 seconds.

  1. Metrics detect anomaly: Dashboard shows P95 latency > SLO at 2:32 PM
  2. Traces locate service: Trace ID abc123 shows Payment Service took 1.8s
  3. Logs diagnose cause: Logs with trace_id abc123 reveal database connection timeout

Correlation via trace_id:

// Aggregated metric
{ "metric": "latency_p95", "value": 2000, "service": "payment", "timestamp": "..." }
// Correlated log
{ "trace_id": "abc123", "level": "ERROR", "message": "timeout", "service": "payment" }
// Correlated trace
{ "trace_id": "abc123", "spans": [...], "duration_ms": 2000, "service": "payment" }
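The investigation flow above amounts to a join on `trace_id`. The sketch below walks from slow traces to their error logs; the function name `error_logs_for_slow_traces`, the record shapes, and the sample data are assumptions that mirror the simplified records shown above.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def error_logs_for_slow_traces(traces: List[Dict], logs: List[Dict],
                               slo_ms: float) -> List[Tuple[str, List[Dict]]]:
    """For each trace slower than the SLO, return its ERROR/FATAL log entries."""
    logs_by_trace: Dict[str, List[Dict]] = defaultdict(list)
    for entry in logs:
        logs_by_trace[entry["trace_id"]].append(entry)
    slow = sorted((t for t in traces if t["duration_ms"] > slo_ms),
                  key=lambda t: t["duration_ms"], reverse=True)
    return [
        (t["trace_id"],
         [e for e in logs_by_trace[t["trace_id"]]
          if e["level"] in ("ERROR", "FATAL")])
        for t in slow
    ]


traces = [
    {"trace_id": "abc123", "service": "payment", "duration_ms": 2000},
    {"trace_id": "def456", "service": "payment", "duration_ms": 95},
]
logs = [
    {"trace_id": "abc123", "level": "INFO", "message": "request received", "service": "gateway"},
    {"trace_id": "abc123", "level": "ERROR", "message": "timeout", "service": "payment"},
    {"trace_id": "def456", "level": "INFO", "message": "payment ok", "service": "payment"},
]
findings = error_logs_for_slow_traces(traces, logs, slo_ms=500)
```

In practice the metric alert supplies the time window and service, the trace backend supplies the slow trace IDs, and the log backend is queried by those IDs.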

When to Use Each Pillar

| Scenario | Primary Pillar | Complemented by |
| --- | --- | --- |
| Anomaly alert | Metrics | Logs + Traces (to investigate) |
| Error debugging | Logs | Traces (for context) |
| Latency investigation | Traces | Metrics (for baseline) |
| Compliance/auditing | Logs | Metrics (for summary) |
| Executive dashboard | Metrics | - |
| Root cause analysis | All correlated | - |

Beyond the Three Pillars: OpenTelemetry

The three pillars model is evolving toward correlated signals through OpenTelemetry, the CNCF open standard for unified telemetry.

Problems with Isolated Pillars

  • Logs, metrics, and traces in separate tools
  • Manual correlation between systems
  • Duplicated code instrumentation
  • Vendor lock-in with proprietary tools

OpenTelemetry Solution

OpenTelemetry unifies the collection of all three pillars in a single API and SDK:

  • Traces API: Collect traces and spans with context propagation
  • Metrics API: Collect metrics with automatic instrumentation
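  • Logs API: Collect structured logs, typically by bridging existing logging libraries (maturity varies by language SDK)
  • Baggage: Key-value context propagated across service boundaries and available to all signals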

Benefits:

  • One instrumentation for all signals: No code duplication
  • Native correlation: Logs, metrics, and traces automatically connected
  • Vendor-neutral: Works with any backend (Prometheus, Jaeger, Datadog, etc.)
  • Eliminates vendor lock-in: Same instrumentation runs locally, in the cloud, or on WinterCG-compatible distributed runtimes

OpenTelemetry status:

  • Tracing: Generally Available (GA)
  • Metrics: Generally Available (GA)
  • Logs: Stable in the specification; support in individual language SDKs varies

W3C Trace Context Standard

The W3C Trace Context standard guarantees interoperability between systems:

  • Standardized format for traceparent and tracestate
  • Supported by most major frameworks and tracing tools
  • Allows context propagation across heterogeneous systems

This means you can use OpenTelemetry to instrument your code and choose any backend later without rewriting instrumentation.

Telemetry Collection in Distributed Architectures

In globally distributed systems, collecting metrics, logs, and traces from multiple regions introduces latency that can hinder rapid incident response.

Distributed Telemetry Processing

Processing telemetry data directly on distributed infrastructure offers significant advantages:

  • Ultra-low latency collection: Events available in under 60 seconds
  • Unified streaming: Logs, metrics, and events in a single flow
  • Automatic correlation: Trace IDs and request IDs natively connected
  • Multiple destinations: Splunk, Datadog, BigQuery, S3, Azure Monitor

Modern Transport Protocols

Telemetry ingestion and streaming critically depend on transport protocols:

TCP Limitations:

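  • Head-of-line blocking: A lost packet stalls every multiplexed stream on the connection until it is retransmitted
  • Handshake overhead: The TCP three-way handshake, plus a separate TLS handshake, adds round trips before data flows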
  • Aggressive congestion control: Excessive backoff on lossy networks

Advantages of QUIC/HTTP3:

The QUIC protocol (originally short for Quick UDP Internet Connections), the foundation of HTTP/3, addresses these limitations:

  • No transport-level head-of-line blocking: Packet loss on one stream does not stall the others
  • 0-RTT connection resumption: Resume connections instantly
  • Native multiplexing: Multiple streams over a single connection
  • Seamless network migration: IP migration without connection breakage

Practical impact: Streaming logs and events via QUIC/HTTP3 avoids head-of-line blocking between streams, which helps real-time metrics and cybersecurity logs reach analytical destinations (SIEM) in under 60 seconds, even under unstable network conditions.

WebSockets for Sub-second Latency

Native WebSocket support allows monitoring dashboards and interactive telemetry systems to update data in real time with sub-second latency, without HTTP polling.

Success Stories: Three Pillars in Practice

Netshoes: 385 TB of Correlated Logs and Events

Netshoes is the largest sports lifestyle e-commerce platform in Latin America, with 54 million unique visitors per month.

Use of the three pillars:

Verified results:

Magalu: 20 TB/month Correlated in Real Time

Magazine Luiza is one of the most innovative retail companies in Latin America, with R$ 10 billion in digital sales in 2021.

Use of the three pillars:

  • Metrics: Availability and performance dashboards for hundreds of applications
  • Logs: 20 TB/month via Data Streaming sent to SIEM platforms
  • Traces: Cross-service incident investigation during critical events

Verified results:

  • Millions of threats automatically blocked
  • High availability guaranteed during Black Friday peak events
  • Real-time correlation of WAF events with business metrics

Comparison: Metrics vs Logs vs Traces

| Dimension | Metrics | Logs | Traces |
| --- | --- | --- | --- |
| Data type | Aggregated numerical | Structured text | Connected spans |
| Granularity | Low (aggregate) | High (individual) | Medium (journey) |
| Cardinality | Limited | High | Medium |
| Storage cost | Low | High | Medium |
| Context | Minimal | Rich | Full journey |
| Best for | Alerts, trends | Debugging, auditing | Latency, dependencies |
| Typical query | “What is P95 latency?” | “What happened at 2 PM?” | “Which service was slow?” |
| Response | Numerical value | Detailed events | Visualized journey |
| Tools | Prometheus, InfluxDB | Elasticsearch, Loki | Jaeger, Zipkin |

Frequently Asked Questions about the Three Pillars

What are the three pillars of observability?

The three pillars are metrics, logs, and traces. Metrics are numerical values aggregated over time (like latency, error rate). Logs are records of discrete events with timestamps and detailed context. Traces track request paths through distributed systems, connecting multiple services into a unified journey.

What is the difference between metrics, logs, and traces?

Metrics aggregate numerical values (e.g., average latency) without individual event context. Logs capture specific events with rich details (e.g., stack trace, user_id). Traces connect events across multiple services, showing the complete request journey. Use metrics for trends and alerts, logs for detailed debugging, and traces for understanding flow in distributed systems.

When to use metrics, logs, or traces?

Use metrics for dashboards, alerts, and trend analysis (e.g., P95 latency > SLO). Use logs for detailed debugging, auditing, and compliance (e.g., who executed this action, what was the specific error). Use traces for investigating latency, dependencies, and cascading failures (e.g., which service is slow, how did the failure propagate). Ideally, correlate all three pillars via trace IDs.

What is distributed tracing?

Distributed tracing tracks requests across multiple services in distributed architectures. Each request receives a unique trace ID shared across all services, and each operation within it is a span. This allows visualizing the complete journey, identifying latency bottlenecks, and understanding how failures propagate between services.

What are structured logs?

Structured logs use a structured data format (like JSON) instead of free text. Each field has a defined name and value (e.g., {"level": "ERROR", "user_id": "123", "message": "timeout"}). This enables faster and more precise queries, automatic field correlation, parsing by analysis tools, and native SIEM integration.

How to correlate metrics, logs, and traces?

Use shared correlation IDs across all three pillars. Include trace_id, span_id, and request_id in logs and traces. Use trace IDs to group spans into a journey. Metrics can be filtered by service name and correlated with traces and logs from the same period. Tools like OpenTelemetry facilitate automatic correlation.

Which pillar is most important: metrics, logs, or traces?

None is more important — they are complementary. Metrics detect that there’s a problem, logs diagnose what happened, and traces show where and how it happened. Mature systems use all three pillars correlated via trace IDs. Start with metrics (golden signals), add structured logs, and implement tracing for distributed systems.

Conclusion

The three pillars of observability — metrics, logs, and traces — form the foundation for investigating problems in modern distributed systems.

Key concepts to remember:

  • Metrics detect: Answer “when” through timeline anomalies
  • Traces locate: Show “where” and “how” through request journeys
  • Logs diagnose: Explain “what” happened in detail
  • Correlation is essential: Trace IDs connect the three pillars
  • OpenTelemetry unifies: Open standard eliminates vendor lock-in

Recommended next steps:

For beginners:

  1. Implement golden signals: latency, traffic, errors, saturation
  2. Use structured logging (JSON) with correlation IDs
  3. Start with metrics, then add logs and traces

For intermediate teams:

  1. Add distributed tracing for critical services
  2. Correlate logs and traces via trace IDs
  3. Define SLOs based on metrics

For advanced teams:

  1. Adopt OpenTelemetry for unified instrumentation
  2. Implement automatic correlation across pillars
  3. Use Data Streaming for real-time analysis

Want to correlate metrics, logs, and traces in real time with ultra-low latency? Discover how Data Stream, Real-Time Events, and Real-Time Metrics can transform your operational visibility in a global distributed architecture.
