Three Pillars of Observability | Metrics, Logs, and Traces

When a request takes 5 seconds to respond, you have three fundamental questions: when the problem started, where the bottleneck is, and what caused the failure. Metrics detect when something is broken through timeline anomalies. Traces show where and how the request traveled across distributed services. Logs diagnose what happened in detail. Together, they form the foundation of modern observability.

Distributed systems generate data across multiple layers, each answering a different debugging question. Relying on a single pillar severely limits what an investigation can uncover. Correlating metrics, logs, and traces is the key to effective diagnosis in modern architectures.

What Are the Three Pillars of Observability?

The three pillars of observability are metrics, logs, and traces. Metrics are numerical values aggregated over time. Logs are timestamped records of discrete events. Traces track the path of requests through distributed systems. Together, these pillars enable investigation of complex problems by correlating different data types.

The three-pillars framing was popularized around 2017 and 2018 through writing by engineers such as Peter Bourgon and Cindy Sridharan, as a way to structure the telemetry needed to understand distributed systems. Other voices in the observability community, including Charity Majors, have argued that the pillars matter less than the ability to correlate and query the data. Each pillar captures a different dimension of the system’s operational reality, and the correlation between them creates a more complete view.

Why the Three Pillars Are Complementary

No single pillar alone is sufficient for problem investigation in modern systems. The correlation between metrics, logs, and traces creates true observability.

Typical SRE investigation flow:

Metrics detect the anomaly: “Latency P95 exceeded SLO at 2:32 PM”
Traces locate the bottleneck: “Payment service took 1.8s on database validation”
Logs diagnose the cause: “Connection timeout stack trace to the database”

In isolation, each pillar offers a partial view. Correlated, they form a complete investigation system.

Metrics

Metrics are numerical values aggregated over time, representing system state as time series data. They answer quantitative questions about system behavior: “what is the current error rate?”, “is latency within SLO?”, “how many requests per second are we processing?”.

Key characteristics:

Aggregation: Data is summarized into averages, sums, percentiles
Timestamped: Each point has a precise timestamp
Low cardinality: Few unique values per label dimension
Compressible: Time series compress well and take up little storage space
Fast queries: Aggregations and other math run in near real time

Metric Types

Counters are values that only increase over time, counting cumulative events.

Total HTTP requests processed
Total errors occurred
Accumulated bytes transferred

Use cases: Calculate rates (requests per second), event frequency, throughput.
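As a rough illustration of how a rate is derived from a counter, the sketch below computes the average per-second increase over a series of samples. The sample data and the `counter_rate` helper are hypothetical. The reset handling (treating a drop in value as a restart from zero) is a simplifying assumption in the spirit of what metrics backends do, not a copy of any one implementation.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix_timestamp_seconds, counter_value)


def counter_rate(samples: List[Sample]) -> float:
    """Average per-second increase across the samples, tolerating counter resets."""
    if len(samples) < 2:
        return 0.0
    total_increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the process restarted and the counter began again at zero,
        # so the whole current value counts as new increase.
        total_increase += (curr - prev) if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return total_increase / elapsed if elapsed > 0 else 0.0


# Hypothetical scrape data: one sample every 15 seconds, with a restart before t=45.
http_requests_total = [(0, 1000), (15, 1150), (30, 1300), (45, 40), (60, 190)]
print(f"{counter_rate(http_requests_total):.2f} requests/second")  # 8.17 requests/second
```

The raw counter value says little on its own. The useful signal is how fast it grows, which is why counters are almost always queried as rates.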

Gauges are values that can go up or down, representing a momentary state.

CPU usage percentage
Available memory in MB
Current active connections
Current queue size

Use cases: Monitor capacity, resource utilization, current system state.

Histograms group values into predefined buckets, allowing distribution and percentile calculation.

Request latency (P50, P95, P99)
Payload size
Processing time

Use cases: Performance SLIs, understand value distribution, detect outliers.
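To make the bucket idea concrete, here is a minimal sketch of a histogram with fixed bucket bounds and a percentile estimate obtained by linear interpolation inside the matching bucket. The bounds, the latencies, and the helper names are assumptions chosen for illustration. Real backends differ in details such as cumulative bucket encoding.

```python
from bisect import bisect_left
from typing import List

# Hypothetical bucket upper bounds in milliseconds; the last bucket is open-ended.
BOUNDS_MS = [50.0, 100.0, 250.0, 500.0, 1000.0, float("inf")]


def observe(counts: List[int], value_ms: float) -> None:
    """Increment the first bucket whose upper bound is >= the observed value."""
    counts[bisect_left(BOUNDS_MS, value_ms)] += 1


def estimate_quantile(counts: List[int], q: float) -> float:
    """Estimate the q-quantile by linear interpolation inside the target bucket."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no observations")
    rank = q * total
    cumulative = 0
    for i, count in enumerate(counts):
        if count > 0 and cumulative + count >= rank:
            lower = BOUNDS_MS[i - 1] if i > 0 else 0.0
            upper = BOUNDS_MS[i]
            if upper == float("inf"):
                return lower  # cannot interpolate in the open-ended bucket
            return lower + (upper - lower) * (rank - cumulative) / count
        cumulative += count
    return BOUNDS_MS[-2]


counts = [0] * len(BOUNDS_MS)
for latency in [40.0] * 90 + [200.0] * 8 + [800.0] * 2:
    observe(counts, latency)

p95 = estimate_quantile(counts, 0.95)
print(counts, round(p95, 2))  # [90, 0, 8, 0, 2, 0] 193.75
```

The 95th-ranked observation here is 200 ms, while the estimate is 193.75 ms. The gap comes from the bucket width, which is why bucket boundaries are usually placed close to the SLO thresholds that matter.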

Summaries are similar to histograms but calculate percentiles on the client side before sending.

Request duration percentiles calculated locally
Response size percentiles

Use cases: High-precision percentiles when you don’t want to depend on predefined buckets. The trade-off is that client-side percentiles generally cannot be aggregated across instances.

Golden Signals

Google’s SRE book defines four “golden signals” as the most important metrics to monitor for a user-facing system:

Latency: Time to respond to requests
Traffic: Volume of requests (requests per second)
Errors: Rate of failed requests
Saturation: How “full” the system is (CPU, memory, connections)

These metrics form the basis for health dashboards and SLO alerts.
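The sketch below shows one way the four signals could be derived from a window of request records. Everything in it is an assumption made for illustration: the `RequestRecord` shape, counting only HTTP 5xx responses as errors, using in-flight requests against a fixed capacity as the saturation measure, and the nearest-rank percentile method.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RequestRecord:
    duration_ms: float
    status: int


def nearest_rank(values: List[float], percent: int) -> float:
    """Smallest value that has at least `percent`% of the data at or below it."""
    ordered = sorted(values)
    rank = -(-percent * len(ordered) // 100)  # ceiling of percent * n / 100
    return ordered[max(rank, 1) - 1]


def golden_signals(requests: List[RequestRecord], window_seconds: float,
                   in_flight: int, capacity: int) -> Dict[str, float]:
    if not requests:
        raise ValueError("no requests in window")
    return {
        "latency_p95_ms": nearest_rank([r.duration_ms for r in requests], 95),
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": sum(r.status >= 500 for r in requests) / len(requests),
        "saturation": in_flight / capacity,
    }


# Hypothetical 10-second window: 18 fast successes, one slow success, one failure.
window = [RequestRecord(100, 200)] * 18 + [RequestRecord(400, 200), RequestRecord(2000, 500)]
signals = golden_signals(window, window_seconds=10, in_flight=40, capacity=100)
print(signals)
```

In practice these values are computed by the metrics backend from counters, gauges, and histograms rather than from raw request records. The raw form is used here only to keep the arithmetic visible.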

Cardinality Explosion in Metrics

The biggest technical challenge in metrics systems is cardinality explosion — the growth in the number of time series when dimensions with many unique values are added.

In systems like Prometheus, each unique combination of label values creates its own time series. The total number of active time series (S_total) for a metric is therefore bounded by the Cartesian product of its label dimensions:

S_total ≤ ∏(i=1 to n) |C_i|

Where |C_i| is the number of unique values for label dimension i. The bound is reached when every combination actually occurs.

Practical example: If you have 10 endpoints (|C_1| = 10), 5 HTTP methods (|C_2| = 5), and 3 environments (|C_3| = 3), the metric can produce up to 10 × 5 × 3 = 150 series.

Important: Request volume (V) does not affect the number of series. If a single user makes one million identical requests with the same labels, the system generates only one time series. Volume merely updates the values of existing counters and gauges.
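The arithmetic above can be checked with a few lines of Python. This is a toy model, not how any particular time series database stores data: the `series` dictionary, the label names, and the `record_request` helper are all hypothetical.

```python
from math import prod
from typing import Dict, Sequence

label_values: Dict[str, Sequence] = {
    "endpoint": [f"/endpoint-{i}" for i in range(10)],
    "method": ["GET", "POST", "PUT", "PATCH", "DELETE"],
    "environment": ["dev", "staging", "prod"],
}


def max_series(values_per_label: Dict[str, Sequence]) -> int:
    """Upper bound on series count: the product of unique values per label."""
    return prod(len(values) for values in values_per_label.values())


# Toy series store: one counter per unique combination of label values.
series: Dict[tuple, int] = {}


def record_request(**labels: str) -> None:
    key = tuple(sorted(labels.items()))
    series[key] = series.get(key, 0) + 1


# Many identical requests only increment one existing series.
for _ in range(100_000):
    record_request(endpoint="/endpoint-0", method="GET", environment="prod")

print(max_series(label_values))  # 150
print(len(series))               # 1

# Adding a high-cardinality label multiplies the bound.
label_values["user_id"] = range(1_000_000)
print(max_series(label_values))  # 150000000
```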

High cardinality mitigations:

Limit the number of labels per metric (e.g., maximum 10)
Use top-K aggregation to keep only the K most frequent values
Pre-aggregate high-cardinality data before exporting
Move high-cardinality data (user_id, request_id) to logs and traces

Consequence: Adding a label like user_id with 1 million unique values multiplies the series count by up to 1 million. The 150 series from the example above could grow to 150 million, which is generally unfeasible.
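The top-K mitigation listed above can be sketched as follows. The idea is to keep the K most frequent label values and fold everything else into a single catch-all value, so the label contributes at most K + 1 series. The function name, the `other` label, and the customer data are assumptions for illustration.

```python
from collections import Counter
from typing import Dict


def top_k_with_other(counts: Dict[str, int], k: int, other_label: str = "other") -> Dict[str, int]:
    """Keep the k largest entries and merge the rest under one label."""
    kept = dict(Counter(counts).most_common(k))
    remainder = sum(counts.values()) - sum(kept.values())
    if remainder > 0:
        kept[other_label] = remainder
    return kept


# Hypothetical per-customer request counts before export.
requests_by_customer = {"acme": 500, "globex": 300, "initech": 20, "umbrella": 5, "hooli": 1}
print(top_k_with_other(requests_by_customer, k=2))
# {'acme': 500, 'globex': 300, 'other': 26}
```

The totals are preserved, so dashboards built on sums stay correct. What is lost is the ability to break down the long tail, which is where logs and traces take over.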

Logs

Logs are immutable, timestamped records of discrete events that occurred in the system. Each log entry captures a specific moment with detailed context, answering questions like “what exactly happened at 2:32 PM?” and “what was the error’s stack trace?”.

Key characteristics:

Immutable: Once written, cannot be changed
Timestamped: Each entry has precise timestamp (milliseconds)
High cardinality: Can contain unique IDs, specific messages
Rich context: Include metadata, stack traces, event data
Auditable: Form trails for compliance and investigation

Log Structure

Unstructured log (hard to parse):

[2026-06-01 14:32:15] ERROR Payment failed for user 789: timeout

Structured log (easy to search and correlate):

{
  "timestamp": "2026-06-01T14:32:15.123Z",
  "level": "ERROR",
  "service": "payment-api",
  "trace_id": "abc123def456",
  "span_id": "span456",
  "user_id": "user_789",
  "message": "Payment processing failed",
  "error": "connection timeout",
  "stack_trace": "java.net.ConnectException...",
  "duration_ms": 5432
}

Log Levels (Severity Levels)

DEBUG: Detailed information for development debugging
INFO: Informative business events (request received, order created)
WARN: Potentially problematic conditions, but not errors
ERROR: Errors that don’t interrupt system operation
FATAL: Critical errors that interrupt operation

Best practice: Use levels consistently. DEBUG in development, INFO/WARN/ERROR in production.

Structured Logging

Structured logs use a structured data format (JSON, protobuf) instead of free text. This enables:

Precise queries: Search by specific field (level: ERROR AND service: payment)
Automatic correlation: Cross-reference trace_id between logs and traces
Efficient parsing: Tools process automatically
SIEM and log platform integration: Tools such as Splunk, Datadog, and Elasticsearch ingest JSON natively

Practical benefits:

Can reduce debugging time from hours to minutes
Enables dynamic dashboards based on any field
Facilitates compliance by supporting the audit trails that regimes such as PCI-DSS and GDPR call for
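A structured entry like the payment example shown earlier could be produced with Python's standard `logging` module and a small custom formatter. This is a minimal sketch. The `JsonFormatter` class and its list of extra fields are hypothetical, and the timestamp is rendered with a `+00:00` offset rather than the `Z` suffix used in the example.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    # Hypothetical set of extra fields this sketch copies from the record.
    EXTRA_FIELDS = ("service", "trace_id", "span_id", "user_id", "duration_ms")

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": timestamp.isoformat(timespec="milliseconds"),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        for field in self.EXTRA_FIELDS:
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        if record.exc_info:
            entry["stack_trace"] = self.formatException(record.exc_info)
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Fields passed through `extra` become attributes on the log record.
logger.error(
    "Payment processing failed",
    extra={"service": "payment-api", "trace_id": "abc123def456",
           "user_id": "user_789", "duration_ms": 5432},
)
```

Because every entry is a flat JSON object, a query such as level: ERROR AND service: payment becomes a field match instead of a regular expression over free text.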

Correlation IDs

Including correlation IDs in logs allows tracing requests across multiple services and log entries:

{
  "trace_id": "abc123def456",
  "span_id": "span456",
  "parent_span_id": "span123",
  "user_id": "user_789",
  "request_id": "req999",
  ...
}

These IDs connect logs, traces, and metrics, creating a unified view of the event.
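One way to get a correlation ID onto every log line without passing it through each function call is to keep it in a context variable and attach it with a logging filter. The sketch below uses only the Python standard library. The `handle_request` function, the logger name, and the `-` placeholder for "no active request" are assumptions for illustration.

```python
import contextvars
import logging
import uuid
from typing import Optional

# Holds the trace ID of the request currently being handled ("-" when none).
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every record passing through the handler."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True


def handle_request(logger: logging.Logger, incoming_trace_id: Optional[str] = None) -> str:
    """Hypothetical request handler: reuse the caller's trace ID or start a new one."""
    trace_id = incoming_trace_id or uuid.uuid4().hex
    token = current_trace_id.set(trace_id)
    try:
        logger.info("request received")
        logger.info("order created")
    finally:
        current_trace_id.reset(token)  # do not leak the ID into the next request
    return trace_id


handler = logging.StreamHandler()
handler.addFilter(TraceIdFilter())
handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
logger = logging.getLogger("order-api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

handle_request(logger, incoming_trace_id="abc123def456")
```

Both log lines carry the same trace_id even though neither call mentions it, which is what makes later cross-referencing with traces possible.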

Context Enrichment

Adding context enriches logs for more efficient debugging:

User context: user_id, session_id, tenant_id, account_type
Request context: request_id, trace_id, http_method, path, query_params
System context: host, service, version, environment, region
Business context: order_id, cart_id, transaction_id, payment_method

The more context, the faster the investigation.

Traces (Distributed Tracing)

Traces represent the complete path of a request through multiple services in a distributed system. They connect events from different systems into a unified journey, answering questions like “why did this request take 2 seconds?” and “which services were called?”.

Key characteristics:

End-to-end: From start to finish of a request
Cross-service: Traverses multiple services, databases, caches
Causal: Shows cause and effect relationships between operations
Temporal: Each step has precise timestamp and duration

Trace ID and Span ID

Trace ID is the unique identifier for a complete request, shared across all involved services. It allows grouping all events from a single journey.

Span ID is the unique identifier for each individual operation within the trace. Each span represents a unit of work.

Parent Span ID connects spans in a dependency tree, showing the call hierarchy.

Trace structure:

Trace: abc123def456 [duration: 50ms]
└── Span: span001 (API Gateway) [0ms - 50ms]
    ├── Span: span002 (Auth Service) [10ms - 25ms]
    └── Span: span003 (Payment Service) [30ms - 45ms]
        ├── Span: span004 (Database Query) [35ms - 38ms]
        └── Span: span005 (Cache Lookup) [40ms - 42ms]
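The tree above is not stored as a tree. Each span only records its own ID and its parent's ID, and the hierarchy is rebuilt at query time. The sketch below rebuilds it from the same five spans and also computes each span's self time (its duration minus the time spent in its children). The `Span` dataclass and helper names are hypothetical, and the self-time calculation assumes child spans never overlap.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_span_id: Optional[str]
    name: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


SPANS = [
    Span("span001", None, "API Gateway", 0, 50),
    Span("span002", "span001", "Auth Service", 10, 25),
    Span("span003", "span001", "Payment Service", 30, 45),
    Span("span004", "span003", "Database Query", 35, 38),
    Span("span005", "span003", "Cache Lookup", 40, 42),
]


def group_children(spans: List[Span]) -> Dict[Optional[str], List[Span]]:
    """Index spans by parent ID, ordered by start time."""
    children: Dict[Optional[str], List[Span]] = defaultdict(list)
    for span in sorted(spans, key=lambda s: s.start_ms):
        children[span.parent_span_id].append(span)
    return children


def self_times(spans: List[Span]) -> Dict[str, int]:
    """Time spent in each span itself, assuming its children never overlap."""
    children = group_children(spans)
    return {
        s.span_id: s.duration_ms - sum(c.duration_ms for c in children[s.span_id])
        for s in spans
    }


def render(spans: List[Span]) -> List[str]:
    """Depth-first walk from the root span, indenting two spaces per level."""
    children = group_children(spans)
    lines: List[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for span in children[parent_id]:
            lines.append(f"{'  ' * depth}{span.name} [{span.start_ms}ms - {span.end_ms}ms]")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


print("\n".join(render(SPANS)))
print(self_times(SPANS))
# {'span001': 20, 'span002': 15, 'span003': 10, 'span004': 3, 'span005': 2}
```

Self time is often more useful than raw duration when hunting a bottleneck: the gateway span lasts 50 ms, but only 20 ms of that is spent in the gateway itself.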

Context Propagation

Context propagation is the mechanism of passing trace ID, span ID, and other metadata between services to connect spans in a unified journey.

Propagation mechanisms:

HTTP Headers: traceparent, tracestate (W3C Trace Context standard)
gRPC Metadata: Key-value pairs in RPC metadata
Message Headers: Headers in messaging systems (Kafka, RabbitMQ, SQS)

W3C Trace Context example:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
tracestate: vendor=value

The W3C Trace Context standard enables interoperability between systems and tools from different vendors, provided each of them implements it.
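The traceparent header has four dash-separated fields: version, trace ID (32 hex characters), parent span ID (16 hex characters), and flags, where the lowest flag bit marks the trace as sampled. The sketch below parses the header and builds the value a service would send downstream. It is a simplified reading of the version 00 format. The helper names are hypothetical, and the handling of future versions described in the specification is not implemented.

```python
import re
import secrets
from typing import NamedTuple, Optional

_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_traceparent(parent: TraceParent) -> str:
    """Header for an outgoing call: same trace ID, fresh span ID, same sampling flag."""
    new_span_id = secrets.token_hex(8)  # 8 random bytes = 16 hex characters
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{new_span_id}-{flags}"


incoming = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(incoming)
print(child_traceparent(incoming))
```

The trace ID stays constant across every hop while the span ID changes, which is what lets a backend stitch spans from different services into one trace.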

Sampling

Collecting 100% of traces is usually too expensive for high-volume systems. Sampling reduces storage and processing costs by keeping only a fraction of traces.

Sampling types:

Head-based sampling: Decision made at the start of the trace (more efficient, may miss error traces)
Tail-based sampling: Decision at the end of the trace (smarter, captures errors and high latency)
Adaptive sampling: Adjusts rate dynamically based on volume and budget

Typical sampling rates:

0.1% - 1% for high-volume systems (10k+ RPS)
10% - 50% for medium systems
100% for critical or low-volume systems

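Practical tip: Aim to capture 100% of error and high-latency traces, even in high-volume systems. This requires tail-based sampling, because a head-based sampler decides before the outcome of the request is known.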
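The difference between head-based and tail-based decisions can be sketched in a few lines. The head-based function derives its decision from the trace ID alone, so every service reaches the same answer without coordination. The tail-based function sees the finished trace and always keeps errors and slow requests. Function names, the 1000 ms threshold, and the use of the low 64 bits of the trace ID are assumptions for illustration, and the approach relies on trace IDs being random.

```python
import random


def head_sample(trace_id: str, rate: float) -> bool:
    """Head-based: decide up front, using only the trace ID."""
    # Interpret the last 16 hex characters as an integer in [0, 2**64).
    # Assumes trace IDs are random, so these values are spread evenly.
    bucket = int(trace_id[-16:], 16)
    return bucket < rate * 2**64


def tail_sample(trace_id: str, duration_ms: float, has_error: bool,
                base_rate: float, slow_threshold_ms: float = 1000.0) -> bool:
    """Tail-based: decide after the trace completes, when its outcome is known."""
    if has_error or duration_ms >= slow_threshold_ms:
        return True  # always keep failures and slow requests
    return head_sample(trace_id, base_rate)


# Hypothetical traffic: 10,000 random 128-bit trace IDs.
rng = random.Random(42)
trace_ids = [f"{rng.getrandbits(128):032x}" for _ in range(10_000)]

kept = sum(head_sample(t, 0.01) for t in trace_ids)
print(f"head-based at 1%: kept {kept} of {len(trace_ids)}")

# A slow trace is kept by the tail-based sampler even with a base rate of zero.
print(tail_sample(trace_ids[0], duration_ms=1800, has_error=False, base_rate=0.0))  # True
```

The cost of the tail-based approach is that all spans of a trace have to be buffered until the trace finishes, which is why it is typically run in a collector tier rather than inside each service.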

Use Cases for Traces

Identify latency bottlenecks (which service is slow?)
Visualize dependencies between services
Debug cascading failures (service A failed because service B is down)
Optimize critical performance paths
Understand data flow in microservices architectures

The Synergy of the Three Pillars

The correlation between metrics, logs, and traces through shared IDs is what transforms isolated data into true observability.

Correlated Investigation Flow

Scenario: P95 latency increased from 100ms to 2 seconds.

Metrics detect anomaly: Dashboard shows P95 latency > SLO at 2:32 PM
Traces locate service: Trace ID abc123 shows Payment Service took 1.8s
Logs diagnose cause: Logs with trace_id abc123 reveal database connection timeout

Correlation via trace_id:

// Aggregated metric
{ "metric": "latency_p95", "value": 2000, "service": "payment", "timestamp": "..." }

// Correlated log
{ "trace_id": "abc123", "level": "ERROR", "message": "timeout", "service": "payment" }

// Correlated trace
{ "trace_id": "abc123", "spans": [...], "duration_ms": 2000, "service": "payment" }
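The join across signals can be sketched as a lookup keyed by trace_id. Starting from traces that exceed a latency threshold, the code below pulls the log entries that share their trace ID. The records mirror the examples above. The helper names, the extra sample records, and the 1000 ms threshold are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List

Record = Dict[str, Any]


def find_slow_traces(traces: List[Record], threshold_ms: int) -> List[Record]:
    """Traces whose total duration meets or exceeds the threshold."""
    return [t for t in traces if t["duration_ms"] >= threshold_ms]


def logs_for_traces(traces: List[Record], logs: List[Record]) -> Dict[str, List[Record]]:
    """Map each trace ID to the log entries that carry the same trace_id."""
    by_trace: Dict[str, List[Record]] = defaultdict(list)
    for entry in logs:
        if "trace_id" in entry:  # entries without a trace_id cannot be correlated
            by_trace[entry["trace_id"]].append(entry)
    return {t["trace_id"]: by_trace.get(t["trace_id"], []) for t in traces}


traces = [
    {"trace_id": "abc123", "duration_ms": 2000, "service": "payment"},
    {"trace_id": "def456", "duration_ms": 90, "service": "payment"},
]
logs = [
    {"trace_id": "abc123", "level": "ERROR", "message": "timeout", "service": "payment"},
    {"trace_id": "def456", "level": "INFO", "message": "payment approved", "service": "payment"},
    {"level": "INFO", "message": "health check ok", "service": "payment"},
]

slow = find_slow_traces(traces, threshold_ms=1000)
for trace_id, entries in logs_for_traces(slow, logs).items():
    errors = [e["message"] for e in entries if e["level"] == "ERROR"]
    print(trace_id, errors)  # abc123 ['timeout']
```

Note that the health check entry has no trace_id and drops out of the join entirely, which is the practical argument for putting correlation IDs on every log line.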

When to Use Each Pillar

Scenario	Primary Pillar	Complemented by
Anomaly alert	Metrics	Logs + Traces (to investigate)
Error debugging	Logs	Traces (for context)
Latency investigation	Traces	Metrics (for baseline)
Compliance/auditing	Logs	Metrics (for summary)
Executive dashboard	Metrics	-
Root cause analysis	All correlated	-

Beyond the Three Pillars: OpenTelemetry

The three pillars model is evolving toward correlated signals through OpenTelemetry, the CNCF open standard for unified telemetry.

Problems with Isolated Pillars

Logs, metrics, and traces in separate tools
Manual correlation between systems
Duplicated code instrumentation
Vendor lock-in with proprietary tools

OpenTelemetry Solution

OpenTelemetry unifies the collection of all three pillars in a single API and SDK:

Traces API: Collect traces and spans with context propagation
Metrics API: Collect metrics with automatic instrumentation
Logs API: Collect structured logs and bridge existing logging libraries
Baggage: Shared context across all signals

Benefits:

One instrumentation for all signals: No code duplication
Native correlation: Logs, metrics, and traces automatically connected
Vendor-neutral: Works with any backend (Prometheus, Jaeger, Datadog, etc.)
Eliminates vendor lock-in: Same instrumentation runs locally, in the cloud, or on WinterCG-compatible distributed runtimes

OpenTelemetry status:

Tracing: Generally Available (GA)
Metrics: Generally Available (GA)
Logs: Specification stable, with SDK maturity varying by language

W3C Trace Context Standard

The W3C Trace Context standard enables interoperability between systems:

Standardized format for traceparent and tracestate
Supported by most major frameworks and tools
Allows context propagation across heterogeneous systems

This means you can use OpenTelemetry to instrument your code and choose any backend later without rewriting instrumentation.

Telemetry Collection in Distributed Architectures

In globally distributed systems, collecting metrics, logs, and traces from multiple regions introduces latency that can hinder rapid incident response.

Distributed Telemetry Processing

Processing telemetry data directly on distributed infrastructure offers significant advantages:

Ultra-low latency collection: Events available in under 60 seconds
Unified streaming: Logs, metrics, and events in a single flow
Automatic correlation: Trace IDs and request IDs natively connected
Multiple destinations: Splunk, Datadog, BigQuery, S3, Azure Monitor

Modern Transport Protocols

Telemetry ingestion and streaming critically depend on transport protocols:

TCP Limitations:

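Head-of-line blocking: A lost packet stalls every multiplexed stream on the connection until it is retransmitted
Handshake overhead: The three-way handshake, plus a separate TLS handshake, adds round trips before data flows
Congestion control: Loss-based backoff can reduce throughput sharply on lossy networks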

Advantages of QUIC/HTTP3:

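QUIC, the transport protocol underneath HTTP/3, addresses these limitations. The name began as an acronym for Quick UDP Internet Connections, but the IETF standard treats QUIC as a name rather than an acronym.

No head-of-line blocking between streams: Loss on one stream does not stall the others
0-RTT connection resumption: A client that has talked to the server before can send data in its first flight
Native multiplexing: Multiple streams over a single connection
Seamless network migration: Connection IDs let a connection survive a change of client IP address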

Practical impact: Streaming logs and events via QUIC/HTTP3 avoids head-of-line blocking between streams, which helps real-time metrics and cybersecurity logs reach analytical destinations (SIEM) in under 60 seconds, even under unstable network conditions.

WebSockets for Sub-second Latency

Native WebSocket support allows monitoring dashboards and interactive telemetry systems to update data in real time with sub-second latency, without HTTP polling.

Success Stories: Three Pillars in Practice

Netshoes: 385 TB of Correlated Logs and Events

Netshoes is the largest sports lifestyle e-commerce platform in Latin America, with 54 million unique visitors per month.

Use of the three pillars:

Metrics: Real-time monitoring of latency, error rates, throughput
Logs: 385 TB of events collected via Data Streaming in 6 months
Traces: End-to-end request correlation for debugging

Verified results:

4 million threats automatically blocked by WAF in the first half of 2020
84% of processing migrated to distributed infrastructure, with 200 billion requests processed
Correlation of logs with WAF metrics for security intelligence

Magalu: 20 TB/month Correlated in Real Time

Magazine Luiza is one of the most innovative retail companies in Latin America, with R$ 10 billion in digital sales in 2021.

Use of the three pillars:

Metrics: Availability and performance dashboards for hundreds of applications
Logs: 20 TB/month via Data Streaming sent to SIEM platforms
Traces: Cross-service incident investigation during critical events

Verified results:

Millions of threats automatically blocked
High availability guaranteed during Black Friday peak events
Real-time correlation of WAF events with business metrics

Comparison: Metrics vs Logs vs Traces

Dimension	Metrics	Logs	Traces
Data type	Aggregated numerical	Structured text	Connected spans
Granularity	Low (aggregate)	High (individual)	Medium (journey)
Cardinality	Limited	High	Medium
Storage cost	Low	High	Medium
Context	Minimal	Rich	Full journey
Best for	Alerts, trends	Debugging, auditing	Latency, dependencies
Typical query	“What is P95 latency?”	“What happened at 2 PM?”	“Which service was slow?”
Response	Numerical value	Detailed events	Visualized journey
Tools	Prometheus, InfluxDB	Elasticsearch, Loki	Jaeger, Zipkin

Frequently Asked Questions about the Three Pillars

What are the three pillars of observability?

The three pillars are metrics, logs, and traces. Metrics are numerical values aggregated over time (like latency, error rate). Logs are records of discrete events with timestamps and detailed context. Traces track request paths through distributed systems, connecting multiple services into a unified journey.

What is the difference between metrics, logs, and traces?

Metrics aggregate numerical values (e.g., average latency) without individual event context. Logs capture specific events with rich details (e.g., stack trace, user_id). Traces connect events across multiple services, showing the complete request journey. Use metrics for trends and alerts, logs for detailed debugging, and traces for understanding flow in distributed systems.

When to use metrics, logs, or traces?

Use metrics for dashboards, alerts, and trend analysis (e.g., P95 latency > SLO). Use logs for detailed debugging, auditing, and compliance (e.g., who executed this action, what was the specific error). Use traces for investigating latency, dependencies, and cascading failures (e.g., which service is slow, how did the failure propagate). Ideally, correlate all three pillars via trace IDs.

What is distributed tracing?

Distributed tracing tracks requests across multiple services in distributed architectures. Each request receives a unique trace ID shared across all services, and each operation within it is a span. This allows visualizing the complete journey, identifying latency bottlenecks, and understanding how failures propagate between services.

What are structured logs?

Structured logs use a structured data format (like JSON) instead of free text. Each field has a defined name and value (e.g., {"level": "ERROR", "user_id": "123", "message": "timeout"}). This enables faster and more precise queries, automatic field correlation, parsing by analysis tools, and native SIEM integration.

How to correlate metrics, logs, and traces?

Use shared correlation IDs across all three pillars. Include trace_id, span_id, and request_id in logs and traces. Use trace IDs to group spans into a journey. Metrics can be filtered by service name and correlated with traces and logs from the same period. Tools like OpenTelemetry facilitate automatic correlation.

Which pillar is most important: metrics, logs, or traces?

None is more important — they are complementary. Metrics detect that there’s a problem, logs diagnose what happened, and traces show where and how it happened. Mature systems use all three pillars correlated via trace IDs. Start with metrics (golden signals), add structured logs, and implement tracing for distributed systems.

Conclusion

The three pillars of observability — metrics, logs, and traces — form the foundation for investigating problems in modern distributed systems.

Key concepts to remember:

Metrics detect: Answer “when” through timeline anomalies
Traces locate: Show “where” and “how” through request journeys
Logs diagnose: Explain “what” happened in detail
Correlation is essential: Trace IDs connect the three pillars
OpenTelemetry unifies: Open standard eliminates vendor lock-in

Recommended next steps:

For beginners:

Implement golden signals: latency, traffic, errors, saturation
Use structured logging (JSON) with correlation IDs
Start with metrics, then add logs and traces

For intermediate teams:

Add distributed tracing for critical services
Correlate logs and traces via trace IDs
Define SLOs based on metrics

For advanced teams:

Adopt OpenTelemetry for unified instrumentation
Implement automatic correlation across pillars
Use Data Streaming for real-time analysis

Want to correlate metrics, logs, and traces in real time with ultra-low latency? Discover how Data Stream, Real-Time Events, and Real-Time Metrics can transform your operational visibility in a global distributed architecture.
