When a request takes 5 seconds to respond, you have three fundamental questions: when the problem started, where the bottleneck is, and what caused the failure. Metrics detect when something is broken through timeline anomalies. Traces show where and how the request traveled across distributed services. Logs diagnose what happened in detail. Together, they form the foundation of modern observability.
Distributed systems generate data across multiple layers, each answering a different debugging question. Relying on a single pillar severely limits what an investigation can uncover. Correlating metrics, logs, and traces is the key to effective diagnosis in modern architectures.
What Are the Three Pillars of Observability?
The three pillars of observability are metrics, logs, and traces. Metrics are numerical values aggregated over time. Logs are timestamped records of discrete events. Traces track the path of requests through distributed systems. Together, these pillars enable investigation of complex problems by correlating different data types.
The three-pillars framing became popular in the observability community around 2017 as a way to structure the telemetry needed to understand distributed systems, in the same period that engineers like Charity Majors were bringing observability into mainstream engineering practice. Each pillar captures a different dimension of the system’s operational reality, and the correlation between them creates a complete view.
Why the Three Pillars Are Complementary
No single pillar alone is sufficient for problem investigation in modern systems. The correlation between metrics, logs, and traces creates true observability.
Typical SRE investigation flow:
- Metrics detect the anomaly: “Latency P95 exceeded SLO at 2:32 PM”
- Traces locate the bottleneck: “Payment service took 1.8s on database validation”
- Logs diagnose the cause: “Connection timeout stack trace to the database”
In isolation, each pillar offers a partial view. Correlated, they form a complete investigation system.
Metrics
Metrics are numerical values aggregated over time, representing system state as time series data. They answer quantitative questions about system behavior: “what is the current error rate?”, “is latency within SLO?”, “how many requests per second are we processing?”.
Key characteristics:
- Aggregation: Data is summarized into averages, sums, percentiles
- Timestamped: Each point has a precise timestamp
- Low cardinality: Few unique values per label dimension
- Compressible: Take up little storage space
- Fast query: Real-time mathematical operations
Metric Types
Counters are values that only increase over time, counting cumulative events; they return to zero only when the process restarts.
- Total HTTP requests processed
- Total errors occurred
- Accumulated bytes transferred
Use cases: Calculate rates (requests per second), event frequency, throughput.
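As a rough illustration of how a rate is derived from a counter, the sketch below averages the per-second increase across a series of scraped samples. The function name `counter_rate` and the sample values are hypothetical, and it assumes a drop in value means the counter restarted from zero, which is a simplification of what a real metrics backend does.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix_timestamp_seconds, counter_value)


def counter_rate(samples: List[Sample]) -> float:
    """Average per-second increase across the samples, tolerating resets."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to compute a rate")
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero, so the whole
        # current value counts as new increase.
        increase += curr - prev if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must span a positive time range")
    return increase / elapsed


# Hypothetical scrapes of a request counter, 15 seconds apart.
# The third sample is lower than the second: the process restarted.
scrapes = [(0.0, 100.0), (15.0, 160.0), (30.0, 20.0), (45.0, 80.0)]
requests_per_second = counter_rate(scrapes)
print(f"{requests_per_second:.2f} requests/second")
```

The raw counter value is rarely useful on its own; the rate over a window is what dashboards and alerts usually display.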
Gauges are values that can go up or down, representing a momentary state.
- CPU usage percentage
- Available memory in MB
- Current active connections
- Current queue size
Use cases: Monitor capacity, resource utilization, current system state.
Histograms group values into predefined buckets, allowing distribution and percentile calculation.
- Request latency (P50, P95, P99)
- Payload size
- Processing time
Use cases: Performance SLIs, understand value distribution, detect outliers.
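To make the bucket-to-percentile step concrete, here is a minimal sketch that estimates a percentile from cumulative buckets using linear interpolation inside the matching bucket. The function name, the bucket boundaries, and the counts are assumptions for illustration, and the sketch ignores the open-ended overflow bucket that real systems usually include.

```python
from typing import List, Tuple

Bucket = Tuple[float, int]  # (upper_bound_seconds, cumulative_count)


def estimate_quantile(q: float, buckets: List[Bucket]) -> float:
    """Estimate the q-quantile (0 < q <= 1) from cumulative buckets."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        raise ValueError("histogram is empty")
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, cumulative in buckets:
        if cumulative >= rank:
            in_bucket = cumulative - lower_count
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, cumulative
    return buckets[-1][0]


# Hypothetical latency histogram: 100 requests in four buckets.
latency_buckets = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 100)]
p50 = estimate_quantile(0.50, latency_buckets)
p99 = estimate_quantile(0.99, latency_buckets)
print(f"P50 ~ {p50:.3f}s, P99 ~ {p99:.3f}s")
```

The estimate can never be more precise than the bucket that contains it, which is why bucket boundaries are usually placed around the SLO threshold.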
Summaries are similar to histograms but calculate percentiles on the client side before sending.
- Request duration percentiles calculated locally
- Response size percentiles
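Use cases: High-precision percentiles for a single instance when you don’t want to depend on predefined buckets. Because they are computed client-side, these percentiles cannot be meaningfully aggregated across instances.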
Golden Signals
The “four golden signals” from Google’s SRE book are the metrics to prioritize when monitoring a user-facing system:
- Latency: Time to respond to requests
- Traffic: Volume of requests (requests per second)
- Errors: Rate of failed requests
- Saturation: How “full” the system is (CPU, memory, connections)
These metrics form the basis for health dashboards and SLO alerts.
Cardinality Explosion in Metrics
The biggest technical challenge in metrics systems is cardinality explosion — the growth in the number of time series when dimensions with many unique values are added.
In systems like Prometheus, the maximum number of active time series for a metric (S_total) is the product of the number of unique values of each label dimension:
S_total = ∏(i=1 to n) |C_i|
where |C_i| is the number of unique values for label dimension i. The actual count can be lower, because a series exists only for label combinations that have actually been observed.
Practical example: If you have 10 endpoints (|C_1| = 10), 5 HTTP methods (|C_2| = 5), and 3 environments (|C_3| = 3), the metric can produce up to 10 × 5 × 3 = 150 series.
Important: Request volume (V) does not affect the number of series. If a single user makes one million identical requests with the same labels, the system generates only one time series. Volume merely updates the values of existing counters and gauges.
High cardinality mitigations:
- Limit the number of labels per metric (e.g., maximum 10)
- Use top-K aggregation to keep only the K most frequent values
- Pre-aggregate high-cardinality data before exporting
- Move high-cardinality data (user_id, request_id) to logs and traces
Consequence: Adding a label like user_id with 1 million unique values multiplies the series count of every metric that carries it by up to 1 million, which is generally infeasible.
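The arithmetic behind cardinality explosion can be checked with a few lines. This sketch computes the upper bound on series for one metric from its label cardinalities; the function name `max_series` and the label names are illustrative, and the numbers are the ones used in the example above.

```python
from math import prod
from typing import Dict


def max_series(label_cardinalities: Dict[str, int]) -> int:
    """Upper bound on time series: product of unique values per label."""
    return prod(label_cardinalities.values())


labels = {"endpoint": 10, "method": 5, "environment": 3}
baseline = max_series(labels)

# Adding one high-cardinality label multiplies the bound, it does not add to it.
with_user_id = max_series({**labels, "user_id": 1_000_000})

print(baseline)      # 150
print(with_user_id)  # 150000000
```

Running the same calculation before adding a label to a metric is a cheap way to catch an explosion at review time.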
Logs
Logs are immutable, timestamped records of discrete events that occurred in the system. Each log entry captures a specific moment with detailed context, answering questions like “what exactly happened at 2:32 PM?” and “what was the error’s stack trace?”.
Key characteristics:
- Immutable: Once written, cannot be changed
- Timestamped: Each entry has a precise timestamp (milliseconds)
- High cardinality: Can contain unique IDs, specific messages
- Rich context: Include metadata, stack traces, event data
- Auditable: Form trails for compliance and investigation
Log Structure
Unstructured log (hard to parse):
[2026-06-01 14:32:15] ERROR Payment failed for user 789: timeout
Structured log (easy to search and correlate):
{
  "timestamp": "2026-06-01T14:32:15.123Z",
  "level": "ERROR",
  "service": "payment-api",
  "trace_id": "abc123def456",
  "span_id": "span456",
  "user_id": "user_789",
  "message": "Payment processing failed",
  "error": "connection timeout",
  "stack_trace": "java.net.ConnectException...",
  "duration_ms": 5432
}
Log Levels (Severity Levels)
- DEBUG: Detailed information for development debugging
- INFO: Informative business events (request received, order created)
- WARN: Potentially problematic conditions, but not errors
- ERROR: Errors that don’t interrupt system operation
- FATAL: Critical errors that interrupt operation
Best practice: Use levels consistently across services. Enable DEBUG in development and keep production at INFO and above (INFO, WARN, ERROR, FATAL).
Structured Logging
Structured logs use a structured data format (JSON, protobuf) instead of free text. This enables:
- Precise queries: Search by specific field (level: ERROR AND service: payment)
- Automatic correlation: Cross-reference trace_id between logs and traces
- Efficient parsing: Tools read named fields directly, with no custom regex per log format
- SIEM integration: Platforms such as Splunk, Datadog, and Elasticsearch ingest JSON natively
Practical benefits:
- Can reduce debugging time from hours to minutes, because queries target fields instead of free text
- Enables dynamic dashboards based on any field
- Facilitates compliance (audit trails support requirements in standards such as PCI DSS and GDPR)
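One way to produce structured logs without extra dependencies is a custom formatter on top of Python's standard `logging` module. The sketch below is a minimal example under that assumption; the class name `JsonFormatter`, the helper `build_logger`, and the chosen fields are illustrative rather than a prescribed schema.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": timestamp.isoformat(timespec="milliseconds"),
            "level": record.levelname,
            # Fields passed via `extra=` become attributes on the record.
            "service": getattr(record, "service", None),
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(entry)


def build_logger(name: str, stream=sys.stdout) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger


log = build_logger("payment-api")
log.error(
    "Payment processing failed",
    extra={"service": "payment-api", "trace_id": "abc123def456"},
)
```

Each call emits one JSON object per line, which is the shape most log shippers expect.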
Correlation IDs
Including correlation IDs in logs allows tracing requests across multiple services and log entries:
{ "trace_id": "abc123def456", "span_id": "span456", "parent_span_id": "span123", "user_id": "user_789", "request_id": "req999", ...}
These IDs connect logs, traces, and metrics, creating a unified view of the event.
Context Enrichment
Adding context enriches logs for more efficient debugging:
- User context: user_id, session_id, tenant_id, account_type
- Request context: request_id, trace_id, http_method, path, query_params
- System context: host, service, version, environment, region
- Business context: order_id, cart_id, transaction_id, payment_method
The more context, the faster the investigation.
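Context is easier to keep consistent when it is attached in one place instead of being passed to every log call. The following sketch assumes Python's standard `logging` and `contextvars` modules; the variable `request_context`, the class `ContextFilter`, and the field values are hypothetical.

```python
import contextvars
import logging

# Hypothetical per-request context store; None means "no request active".
request_context = contextvars.ContextVar("request_context", default=None)


class ContextFilter(logging.Filter):
    """Copy the current request context onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        for key, value in (request_context.get() or {}).items():
            setattr(record, key, value)
        return True  # never drop the record, only enrich it


logger = logging.getLogger("checkout")
logger.addFilter(ContextFilter())
handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter(
        "%(levelname)s trace_id=%(trace_id)s order_id=%(order_id)s %(message)s"
    )
)
logger.addHandler(handler)

# Set once where the request enters the service (middleware, for example).
request_context.set({"trace_id": "abc123def456", "order_id": "order_42"})
logger.error("Payment processing failed")
```

Because the context lives in a `ContextVar`, concurrent requests handled in separate threads or async tasks each see their own values.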
Traces (Distributed Tracing)
Traces represent the complete path of a request through multiple services in a distributed system. They connect events from different systems into a unified journey, answering questions like “why did this request take 2 seconds?” and “which services were called?”.
Key characteristics:
- End-to-end: From start to finish of a request
- Cross-service: Traverses multiple services, databases, caches
- Causal: Shows cause and effect relationships between operations
- Temporal: Each step has a precise timestamp and duration
Trace ID and Span ID
Trace ID is the unique identifier for a complete request, shared across all involved services. It allows grouping all events from a single journey.
Span ID is the unique identifier for each individual operation within the trace. Each span represents a unit of work.
Parent Span ID connects spans in a dependency tree, showing the call hierarchy.
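The relationship between these three identifiers can be sketched as a small data structure. In the example below, the `Span` dataclass and helper functions are hypothetical, and the span values are illustrative; spans are grouped by parent span ID and rendered as an indented call hierarchy.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    span_id: str
    parent_span_id: Optional[str]  # None marks the root span
    name: str
    start_ms: int
    end_ms: int


def children_by_parent(spans: List[Span]) -> Dict[Optional[str], List[Span]]:
    """Group spans under their parent, ordered by start time."""
    tree: Dict[Optional[str], List[Span]] = {}
    for span in sorted(spans, key=lambda s: s.start_ms):
        tree.setdefault(span.parent_span_id, []).append(span)
    return tree


def render(tree, parent_id: Optional[str] = None, depth: int = 0) -> List[str]:
    """Walk the tree depth-first and indent each level by two spaces."""
    lines: List[str] = []
    for span in tree.get(parent_id, []):
        lines.append(f"{'  ' * depth}{span.name} [{span.start_ms}ms - {span.end_ms}ms]")
        lines.extend(render(tree, span.span_id, depth + 1))
    return lines


# All spans share one trace ID; only span and parent IDs differ.
spans = [
    Span("span004", "span003", "Database Query", 35, 38),
    Span("span001", None, "API Gateway", 0, 50),
    Span("span003", "span001", "Payment Service", 30, 45),
    Span("span002", "span001", "Auth Service", 10, 25),
    Span("span005", "span003", "Cache Lookup", 40, 42),
]
lines = render(children_by_parent(spans))
print("\n".join(lines))
```

Spans can arrive at the backend in any order, which is why the hierarchy is rebuilt from parent IDs rather than from arrival order.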
Trace structure:
Trace: abc123def456 [duration: 50ms]
├── Span: span001 (API Gateway) [0ms - 50ms]
│   ├── Span: span002 (Auth Service) [10ms - 25ms]
│   └── Span: span003 (Payment Service) [30ms - 45ms]
│       ├── Span: span004 (Database Query) [35ms - 38ms]
│       └── Span: span005 (Cache Lookup) [40ms - 42ms]
Context Propagation
Context propagation is the mechanism of passing trace ID, span ID, and other metadata between services to connect spans in a unified journey.
Propagation mechanisms:
- HTTP Headers: traceparent, tracestate (W3C Trace Context standard)
- gRPC Metadata: Key-value in RPC metadata
- Message Headers: Headers in messaging systems (Kafka, RabbitMQ, SQS)
W3C Trace Context example:
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
tracestate: vendor=value
The W3C Trace Context standard guarantees interoperability between systems and tools from different vendors.
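To show what a service does with the traceparent header, here is a minimal sketch that parses an incoming value and builds the header for an outgoing call. It assumes the four-field version 00 layout (version, trace ID, parent span ID, flags); the names `TraceContext`, `parse_traceparent`, and `child_traceparent` are hypothetical, and production code would normally rely on a tracing library instead.

```python
import re
import secrets
from typing import NamedTuple, Optional

_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None if the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff is invalid, and so are all-zero trace or parent IDs.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_traceparent(ctx: TraceContext) -> str:
    """Header for an outgoing call: same trace ID, fresh span ID."""
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(incoming)
outgoing = child_traceparent(ctx)
print(outgoing)
```

The trace ID stays constant across every hop while the span ID changes at each one, which is what lets a backend stitch spans into a single journey.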
Sampling
Collecting 100% of traces is unfeasible for high-volume systems. Sampling reduces storage and processing cost.
Sampling types:
- Head-based sampling: Decision made at the start of the trace (more efficient, may miss error traces)
- Tail-based sampling: Decision at the end of the trace (smarter, captures errors and high latency)
- Adaptive sampling: Adjusts rate dynamically based on volume and budget
Typical sampling rates:
- 0.1% - 1% for high-volume systems (10k+ RPS)
- 10% - 50% for medium systems
- 100% for critical or low-volume systems
Practical tip: Capture 100% of error and high-latency traces, even in high-volume systems. This requires tail-based sampling, because errors and latency are only known once the trace has finished.
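The difference between head-based and tail-based decisions can be sketched in a few lines. The functions, thresholds, and the way the trace ID is mapped to a ratio below are assumptions for illustration, not the algorithm of any particular tracing system.

```python
def head_sample(trace_id: str, ratio: float) -> bool:
    """Decide at trace start, using only the trace ID (deterministic)."""
    bucket = int(trace_id[-16:], 16)  # low 64 bits of the 128-bit ID
    return bucket < ratio * 2**64


def tail_sample(
    trace_id: str,
    had_error: bool,
    duration_ms: float,
    slow_ms: float = 1000.0,
    ratio: float = 0.01,
) -> bool:
    """Decide after the trace finishes: keep every error and slow trace."""
    if had_error or duration_ms >= slow_ms:
        return True
    # Healthy traces fall back to the same ratio-based decision.
    return head_sample(trace_id, ratio)


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
print(head_sample(trace_id, ratio=0.01))
print(tail_sample(trace_id, had_error=True, duration_ms=40.0))
print(tail_sample(trace_id, had_error=False, duration_ms=1800.0))
```

Deriving the head decision from the trace ID means every service reaches the same verdict for a given trace without coordinating, while the tail decision needs the complete trace buffered somewhere before it can run.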
Use Cases for Traces
- Identify latency bottlenecks (which service is slow?)
- Visualize dependencies between services
- Debug cascading failures (service A failed because service B is down)
- Optimize critical performance paths
- Understand data flow in microservices architectures
The Synergy of the Three Pillars
The correlation between metrics, logs, and traces through shared IDs is what transforms isolated data into true observability.
Correlated Investigation Flow
Scenario: P95 latency increased from 100ms to 2 seconds.
- Metrics detect anomaly: Dashboard shows P95 latency > SLO at 2:32 PM
- Traces locate service: Trace ID abc123 shows Payment Service took 1.8s
- Logs diagnose cause: Logs with trace_id abc123 reveal database connection timeout
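The steps above amount to a chain of lookups keyed by trace_id. The sketch below walks that chain over in-memory records; the data, field names, and helper functions are hypothetical stand-ins for queries that would normally go to a tracing backend and a log store.

```python
from typing import Dict, List

# Hypothetical telemetry for two requests.
traces = [
    {"trace_id": "abc123", "duration_ms": 2000, "spans": [
        {"service": "auth", "duration_ms": 15},
        {"service": "payment", "duration_ms": 1800},
    ]},
    {"trace_id": "def456", "duration_ms": 90, "spans": [
        {"service": "auth", "duration_ms": 12},
        {"service": "payment", "duration_ms": 60},
    ]},
]
logs = [
    {"trace_id": "abc123", "level": "ERROR", "service": "payment",
     "message": "database connection timeout"},
    {"trace_id": "abc123", "level": "INFO", "service": "auth",
     "message": "token validated"},
    {"trace_id": "def456", "level": "INFO", "service": "payment",
     "message": "payment authorized"},
]


def slow_traces(traces: List[Dict], slo_ms: float) -> List[Dict]:
    """Traces whose total duration breaches the latency SLO."""
    return [t for t in traces if t["duration_ms"] > slo_ms]


def slowest_span(trace: Dict) -> Dict:
    """The span that contributed the most time to the trace."""
    return max(trace["spans"], key=lambda s: s["duration_ms"])


def logs_for_trace(logs: List[Dict], trace_id: str, level: str = "ERROR") -> List[Dict]:
    """Log entries that share the trace ID, filtered by level."""
    return [e for e in logs if e.get("trace_id") == trace_id and e.get("level") == level]


for trace in slow_traces(traces, slo_ms=500):
    span = slowest_span(trace)
    errors = logs_for_trace(logs, trace["trace_id"])
    print(trace["trace_id"], span["service"], [e["message"] for e in errors])
```

The join only works if every log entry carries the trace ID, which is why correlation IDs belong in the logging setup rather than in individual log calls.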
Correlation via trace_id:
// Aggregated metric
{ "metric": "latency_p95", "value": 2000, "service": "payment", "timestamp": "..." }
// Correlated log
{ "trace_id": "abc123", "level": "ERROR", "message": "timeout", "service": "payment" }
// Correlated trace
{ "trace_id": "abc123", "spans": [...], "duration_ms": 2000, "service": "payment" }
When to Use Each Pillar
| Scenario | Primary Pillar | Complemented by |
|---|---|---|
| Anomaly alert | Metrics | Logs + Traces (to investigate) |
| Error debugging | Logs | Traces (for context) |
| Latency investigation | Traces | Metrics (for baseline) |
| Compliance/auditing | Logs | Metrics (for summary) |
| Executive dashboard | Metrics | - |
| Root cause analysis | All correlated | - |
Beyond the Three Pillars: OpenTelemetry
The three pillars model is evolving toward correlated signals through OpenTelemetry, the CNCF open standard for unified telemetry.
Problems with Isolated Pillars
- Logs, metrics, and traces in separate tools
- Manual correlation between systems
- Duplicated code instrumentation
- Vendor lock-in with proprietary tools
OpenTelemetry Solution
OpenTelemetry unifies the collection of all three pillars in a single API and SDK:
- Traces API: Collect traces and spans with context propagation
- Metrics API: Collect metrics with automatic instrumentation
- Logs API: Bridge structured logs from existing logging libraries (SDK maturity varies by language)
- Baggage: Shared context across all signals
Benefits:
- One instrumentation for all signals: No code duplication
- Native correlation: Logs, metrics, and traces automatically connected
- Vendor-neutral: Works with any backend (Prometheus, Jaeger, Datadog, etc.)
- Eliminates vendor lock-in: Same instrumentation runs locally, in the cloud, or on WinterCG-compatible distributed runtimes
OpenTelemetry status:
- Tracing: Stable
- Metrics: Stable
- Logs: Specification stable; SDK maturity still varies by language
W3C Trace Context Standard
The W3C Trace Context standard guarantees interoperability between systems:
- Standardized format for traceparent and tracestate
- Supported by all major frameworks and tools
- Allows context propagation across heterogeneous systems
This means you can use OpenTelemetry to instrument your code and choose any backend later without rewriting instrumentation.
Telemetry Collection in Distributed Architectures
In globally distributed systems, collecting metrics, logs, and traces from multiple regions introduces latency that can hinder rapid incident response.
Distributed Telemetry Processing
Processing telemetry data directly on distributed infrastructure offers significant advantages:
- Ultra-low latency collection: Events available in under 60 seconds
- Unified streaming: Logs, metrics, and events in a single flow
- Automatic correlation: Trace IDs and request IDs natively connected
- Multiple destinations: Splunk, Datadog, BigQuery, S3, Azure Monitor
Modern Transport Protocols
Telemetry ingestion and streaming critically depend on transport protocols:
TCP Limitations:
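- Head-of-line blocking: A single lost packet stalls every multiplexed stream on the connection until it is retransmitted
- Handshake overhead: The TCP three-way handshake, followed by a separate TLS handshake, adds round trips before any data flows
- Aggressive congestion control: Loss-based algorithms can back off sharply on lossy networks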
Advantages of QUIC/HTTP3:
The QUIC protocol (originally short for Quick UDP Internet Connections), the foundation of HTTP/3, addresses these limitations:
- No head-of-line blocking between streams: A lost packet delays only the stream it belongs to
- 0-RTT connection resumption: Returning clients can send data without waiting for a handshake round trip
- Native multiplexing: Multiple streams over a single connection
- Seamless network migration: Connection IDs let a connection survive a client IP address change
Practical impact: Streaming logs and events via QUIC/HTTP3 avoids head-of-line blocking between streams, which helps real-time metrics and cybersecurity logs reach analytical destinations (SIEM) in under 60 seconds, even under unstable network conditions.
WebSockets for Sub-second Latency
Native WebSocket support allows monitoring dashboards and interactive telemetry systems to update data in real time with sub-second latency, without HTTP polling.
Success Stories: Three Pillars in Practice
Netshoes: 385 TB of Correlated Logs and Events
Netshoes is the largest sports lifestyle e-commerce platform in Latin America, with 54 million unique visitors per month.
Use of the three pillars:
- Metrics: Real-time monitoring of latency, error rates, throughput
- Logs: 385 TB of events collected via Data Streaming in 6 months
- Traces: End-to-end request correlation for debugging
Verified results:
- 4 million threats automatically blocked by WAF in the first half of 2020
- 84% of processing migrated to distributed infrastructure, with 200 billion requests processed
- Correlation of logs with WAF metrics for security intelligence
Magalu: 20 TB/month Correlated in Real Time
Magazine Luiza is one of the most innovative retail companies in Latin America, with R$ 10 billion in digital sales in 2021.
Use of the three pillars:
- Metrics: Availability and performance dashboards for hundreds of applications
- Logs: 20 TB/month via Data Streaming sent to SIEM platforms
- Traces: Cross-service incident investigation during critical events
Verified results:
- Millions of threats automatically blocked
- High availability guaranteed during Black Friday peak events
- Real-time correlation of WAF events with business metrics
Comparison: Metrics vs Logs vs Traces
| Dimension | Metrics | Logs | Traces |
|---|---|---|---|
| Data type | Aggregated numerical | Structured text | Connected spans |
| Granularity | Low (aggregate) | High (individual) | Medium (journey) |
| Cardinality | Limited | High | Medium |
| Storage cost | Low | High | Medium |
| Context | Minimal | Rich | Full journey |
| Best for | Alerts, trends | Debugging, auditing | Latency, dependencies |
| Typical query | “What is P95 latency?” | “What happened at 2 PM?” | “Which service was slow?” |
| Response | Numerical value | Detailed events | Visualized journey |
| Tools | Prometheus, InfluxDB | Elasticsearch, Loki | Jaeger, Zipkin |
Frequently Asked Questions about the Three Pillars
What are the three pillars of observability?
The three pillars are metrics, logs, and traces. Metrics are numerical values aggregated over time (like latency, error rate). Logs are records of discrete events with timestamps and detailed context. Traces track request paths through distributed systems, connecting multiple services into a unified journey.
What is the difference between metrics, logs, and traces?
Metrics aggregate numerical values (e.g., average latency) without individual event context. Logs capture specific events with rich details (e.g., stack trace, user_id). Traces connect events across multiple services, showing the complete request journey. Use metrics for trends and alerts, logs for detailed debugging, and traces for understanding flow in distributed systems.
When to use metrics, logs, or traces?
Use metrics for dashboards, alerts, and trend analysis (e.g., P95 latency > SLO). Use logs for detailed debugging, auditing, and compliance (e.g., who executed this action, what was the specific error). Use traces for investigating latency, dependencies, and cascading failures (e.g., which service is slow, how did the failure propagate). Ideally, correlate all three pillars via trace IDs.
What is distributed tracing?
Distributed tracing tracks requests across multiple services in distributed architectures. Each request receives a unique trace ID shared across all services, and each operation within it is a span. This allows visualizing the complete journey, identifying latency bottlenecks, and understanding how failures propagate between services.
What are structured logs?
Structured logs use a structured data format (like JSON) instead of free text. Each field has a defined name and value (e.g., {"level": "ERROR", "user_id": "123", "message": "timeout"}). This enables faster and more precise queries, automatic field correlation, parsing by analysis tools, and native SIEM integration.
How to correlate metrics, logs, and traces?
Use shared correlation IDs across all three pillars. Include trace_id, span_id, and request_id in logs and traces. Use trace IDs to group spans into a journey. Metrics can be filtered by service name and correlated with traces and logs from the same period. Tools like OpenTelemetry facilitate automatic correlation.
Which tool is most important: metrics, logs, or traces?
None is more important — they are complementary. Metrics detect that there’s a problem, logs diagnose what happened, and traces show where and how it happened. Mature systems use all three pillars correlated via trace IDs. Start with metrics (golden signals), add structured logs, and implement tracing for distributed systems.
Conclusion
The three pillars of observability — metrics, logs, and traces — form the foundation for investigating problems in modern distributed systems.
Key concepts to remember:
- Metrics detect: Answer “when” through timeline anomalies
- Traces locate: Show “where” and “how” through request journeys
- Logs diagnose: Explain “what” happened in detail
- Correlation is essential: Trace IDs connect the three pillars
- OpenTelemetry unifies: Open standard eliminates vendor lock-in
Recommended next steps:
For beginners:
- Implement golden signals: latency, traffic, errors, saturation
- Use structured logging (JSON) with correlation IDs
- Start with metrics, then add logs and traces
For intermediate teams:
- Add distributed tracing for critical services
- Correlate logs and traces via trace IDs
- Define SLOs based on metrics
For advanced teams:
- Adopt OpenTelemetry for unified instrumentation
- Implement automatic correlation across pillars
- Use Data Streaming for real-time analysis
Want to correlate metrics, logs, and traces in real time with ultra-low latency? Discover how Data Stream, Real-Time Events, and Real-Time Metrics can transform your operational visibility in a global distributed architecture.