Modern systems are distributed, complex, and dynamic. Microservices communicate across networks, containers are created and destroyed in seconds, and failures can occur at any point in the chain. Traditional monitoring alerts you about known problems: “CPU above 90%”, “disk full”, “server down”. But what about a problem you never imagined?
This is where observability becomes essential. Unlike simply knowing something is wrong, observability lets you understand why it’s wrong by correlating data from multiple sources to quickly identify root causes.
What is Observability?
Observability is the ability to understand the internal state of a system by examining only its external outputs: logs, metrics, and traces. Unlike traditional monitoring, which answers predefined questions, it lets you investigate previously unknown problems in real time.
The concept of observability originated in control theory, introduced by mathematician Rudolf E. Kalman in 1960. In control engineering, a system is observable if you can determine its internal state by measuring its outputs. Applied to software systems, this means you should be able to understand what’s happening inside your application just by observing the data it generates: logs, metrics, and traces.
The term was popularized in software engineering by Charity Majors, co-founder of Honeycomb, around 2017. The key distinction she established is that monitoring is reactive, while observability is proactive. Monitoring answers questions you already know to ask; observability lets you ask questions you didn’t know you needed to ask.
Why Observability Matters Now
The shift from monolithic systems to distributed architectures has sharply increased operational complexity. In a monolith, when something fails, you investigate a single place. In a microservices architecture, a single request can pass through dozens of services, each with its own database, cache, and external dependencies.
This complexity brings new challenges:
- Cascading failures: A problem in one service can propagate to others unpredictably
- Variable latency: Network, database, and external APIs introduce variability
- Blind spots: Parts of the system may lack adequate visibility
- Production debugging: Reproducing errors in complex environments is extremely difficult
Without observability, engineering teams spend hours or days investigating incidents, often resorting to trial and error. With observability, the same investigation can take minutes, with concrete evidence pointing to the root cause.
The Three Pillars of Observability
Observability relies on three complementary data types: metrics, logs, and traces. Each answers a different type of question, and together they form a complete foundation for problem investigation.
Metrics
Metrics are numerical values aggregated over time, representing system state as time series data. They answer questions like “how many requests per second are we receiving?”, “what is the average latency?”, and “what is the current error rate?”.
Characteristics of metrics:
- Low storage cost: Because they are aggregated, metrics take up little space
- Ideal for alerts and dashboards: Easy to visualize and configure thresholds
- No individual context: They show the aggregate, not the specific event
- High cardinality is problematic: Tags with many unique values multiply the number of time series, and with it the storage cost
Common metric types:
- Counters: Values that only increase (e.g., total requests)
- Gauges: Values that can go up or down (e.g., CPU usage, memory)
- Histograms: Distribution of values (e.g., request latency)
- Summaries: Similar to histograms, but calculate percentiles on the client side
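The first three metric types can be illustrated with a minimal, standard-library-only sketch. The class names and bucket bounds here are hypothetical and are not the API of any particular metrics library; the histogram stores plain per-bucket counts, whereas some backends store cumulative buckets instead.

```python
import bisect


class Counter:
    """Value that only increases, e.g. total requests served."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """Value that can go up or down, e.g. in-flight requests."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Counts observations per bucket; bounds are sorted upper limits."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        # One extra slot for observations above the largest bound.
        self.counts = [0] * (len(self.bounds) + 1)
        self.total = 0.0

    def observe(self, value):
        # bisect_left returns the first bucket whose upper bound is >= value.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value


requests_total = Counter()
latency_seconds = Histogram([0.1, 0.5, 1.0])  # illustrative bounds
for seconds in (0.05, 0.3, 2.0):
    requests_total.inc()
    latency_seconds.observe(seconds)
```

Because the histogram keeps only bucket counts, its memory use stays constant regardless of how many requests are observed, which is why aggregated metrics are cheap to store.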
Practical metric examples:
- Requests per second (RPS)
- Latency P50, P95, P99
- Error rate
- CPU and memory usage
- Cache hit/miss ratio
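As a rough illustration of why latency is usually reported as percentiles rather than an average, the sketch below computes nearest-rank percentiles over a small, made-up sample. The `percentile` helper and the sample values are assumptions for illustration; production systems typically estimate percentiles from histograms instead of raw samples.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds, with two slow outliers.
latencies_ms = [12, 15, 11, 14, 13, 250, 16, 12, 14, 900]

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
mean = sum(latencies_ms) / len(latencies_ms)
print(f"P50={p50}ms P95={p95}ms mean={mean:.1f}ms")
```

In this sample the median stays low while P95 exposes the slow requests, and the mean sits in between without describing either group well.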
The cardinality explosion challenge:
The biggest financial bottleneck in observability systems is cardinality explosion: the multiplicative growth in time series volume when metadata dimensions with many unique values are added to metrics. Mathematically, the total time series volume ($S_{\text{total}}$) grows multiplicatively:
$$S_{\text{total}} \propto V \times \prod_{i=1}^{n} |C_i|$$
Where $V$ is the base request volume, $n$ is the number of label dimensions, and $|C_i|$ is the number of unique values for each label dimension (such as user_id, endpoint_url, device_type). If you have 10 endpoints ($|C_1| = 10$) and 1,000 unique users ($|C_2| = 1000$), the number of time series explodes to 10 × 1,000 = 10,000 possible combinations.
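The product of label cardinalities can be checked with a few lines of Python. This is a minimal sketch: the function name `max_series` and the extra `device_type` label with 5 values are assumptions added for illustration, and the result is an upper bound, since not every combination necessarily occurs in practice.

```python
import math


def max_series(label_cardinalities):
    """Upper bound on time series for one metric: the product of unique values per label."""
    return math.prod(label_cardinalities.values())


base = {"endpoint": 10, "user_id": 1000}
print(max_series(base))  # 10 endpoints x 1,000 users

# Adding one more label multiplies the bound again (5 device types is a made-up figure).
with_device = dict(base, device_type=5)
print(max_series(with_device))
```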
Cardinality mitigations:
- Local edge filtering: Aggregate data before sending to the backend
- Intelligent sampling: Collect only a fraction of high-cardinality data
- Limit dimensions: Restrict the number of tags per metric
- Top-K aggregation: Keep only the K most frequent values
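Top-K aggregation, the last mitigation in the list, might look like the following sketch. It works on a batch of label values already held in memory; the function name and the `"other"` bucket label are assumptions, and a streaming pipeline would likely need an approximate heavy-hitters algorithm instead.

```python
from collections import Counter


def top_k_relabel(values, k, other="other"):
    """Keep the k most frequent label values and fold the rest into a single bucket."""
    keep = {value for value, _ in Counter(values).most_common(k)}
    return [value if value in keep else other for value in values]


# Hypothetical endpoint labels observed in one aggregation window.
endpoints = ["/cart", "/cart", "/cart", "/pay", "/pay", "/promo/123", "/promo/456"]
relabeled = top_k_relabel(endpoints, k=2)
print(sorted(set(relabeled)))
```

The number of distinct label values is now bounded by K + 1 regardless of how many rare values appear, at the cost of losing detail about the long tail.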
Metrics are the first level of visibility. They tell you that something is happening, but not why.
Logs
Logs are immutable, timestamped records of discrete events that occurred in the system. They answer questions like “what happened at 2:32 PM?” and “what was the specific error that occurred?”.
Characteristics of logs:
- High storage cost: Each event is stored individually
- Rich context: They contain detailed information about each event
- Ideal for detailed debugging: Allow investigation of specific problems
- Search can be slow: Finding relevant logs in large volumes requires indexing
Types of logs:
- Application logs: Business events and application errors
- System logs: Operating system events
- Access logs: HTTP requests received
- Audit logs: User actions for compliance
Best practices for logs:
- Structured logging: Use structured JSON format instead of free text
- Log levels: DEBUG, INFO, WARN, ERROR, FATAL — use them consistently
- Correlation IDs: Include IDs that allow tracing requests across services
- Context enrichment: Add relevant metadata such as user ID, session ID, host
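A minimal sketch of structured logging with a correlation ID, using only the Python standard library. The field names (`service`, `correlation_id`) and the `checkout` service are illustrative assumptions, not a required schema.

```python
import json
import logging
import uuid
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            # These attributes are attached through the `extra` argument below.
            "service": getattr(record, "service", "unknown"),
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(entry)


def build_logger(name):
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger("checkout")
correlation_id = str(uuid.uuid4())  # normally taken from the incoming request
logger.info(
    "payment authorized",
    extra={"service": "checkout", "correlation_id": correlation_id},
)
```

Every service that handles the same request would log the same `correlation_id`, so a single search on that value retrieves the request's events across services.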
Logs are the second level of visibility. They explain what happened in detail, but may not show the complete journey.
Traces (Distributed Tracing)
Traces represent the complete path of a request through multiple services in a distributed system. They answer questions like “why did this request take 2 seconds?” and “which services were called?”.
Characteristics of traces:
- Medium storage cost: Sampling is common to control volume
- Connect multiple services: Show the complete journey of a request
- Ideal for understanding latency and dependencies: Visualize bottlenecks
- Requires instrumentation: Each service must propagate trace context
Key distributed tracing concepts:
- Trace: Complete journey of a request, from start to finish
- Span: Individual unit of work within a trace (e.g., a database call)
- Context propagation: Passing the trace ID between services to connect spans
- Sampling: Collecting only a fraction of traces to control cost
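Context propagation can be sketched with the W3C Trace Context `traceparent` header, which has the form `version-traceid-parentid-flags`. Using that header is an assumption here (the text above does not name a propagation format), and the helper names are hypothetical. In practice a tracing SDK would generate and parse this header for you.

```python
import secrets


def new_traceparent():
    """Start a new trace: 32-hex-char trace ID, 16-hex-char span ID, sampled flag set."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-01"


def child_traceparent(parent):
    """Create the header a service sends downstream: same trace ID, new span ID."""
    version, trace_id, _parent_span_id, flags = parent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = new_traceparent()             # created by the first service
outgoing = child_traceparent(incoming)   # attached to the next outbound call
print(incoming)
print(outgoing)
```

The shared trace ID is what lets a tracing backend stitch spans from different services into one trace, while each span ID identifies a single unit of work.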
Use cases for traces:
- Identifying latency bottlenecks (which service is slow?)
- Visualizing dependencies between services
- Debugging cascading failures
- Optimizing critical performance paths
Traces are the third level of visibility. They show how components interact and where the problem lies.
The Synergy of the Three Pillars
No single pillar is sufficient alone. Metrics show there’s a problem, logs explain the details of what happened, and traces reveal how components interacted. Together, they form a complete investigation system.
Practical example:
- Metric alerts: P95 latency increased from 100ms to 2 seconds
- Trace investigates: Shows the payment service is taking 1.8 seconds
- Log details: Timeout error in the database connection of the payment service
This correlation is the heart of observability.
Observability vs. Monitoring: What’s the Difference?
Although often used interchangeably, observability and monitoring are different concepts with complementary purposes.
| Dimension | Monitoring | Observability |
|---|---|---|
| Definition | Collects and alerts on known metrics | Ability to ask arbitrary questions about the system |
| Focus | “Is the system healthy?” | “Why is the system not healthy?” |
| Data | Aggregated metrics | Metrics + Logs + Traces |
| Questions | Predefined (“CPU > 90%?”) | Ad-hoc, unpredictable (“Why did latency spike at 2 PM?”) |
| Root cause | Trial and error | Evidence correlation |
| Complexity | Simple systems | Complex distributed systems |
| Tools | Nagios, Zabbix, Prometheus | Honeycomb, Datadog, Jaeger |
Monitoring alerts you when a known failure mode occurs. Observability lets you investigate failure modes you did not anticipate.
When to Use Each?
Use monitoring for:
- Basic infrastructure alerts (CPU, memory, disk)
- System health dashboards
- SLA and SLO tracking
- Availability checks
Use observability for:
- Complex debugging in distributed systems
- Production incident investigation
- Performance optimization
- Understanding system behavior
They are not mutually exclusive. Observability complements and extends traditional monitoring. You still need basic alerts, but you need more to investigate complex problems.
Why Observability Matters
Beyond solving problems faster, observability brings concrete, measurable benefits to organizations.
1. Reduced MTTR (Mean Time to Resolve)
Google SRE studies indicate that mature observability practices can significantly reduce the average time to resolve incidents. Automatic event correlation across services accelerates diagnosis and reduces the need to hunt for logs across multiple systems.
2. Proactive Problem Detection
With observability, you can identify anomalies before they become incidents. Trend analysis lets you predict capacity problems, while intelligent alerts detect anomalous patterns that indicate imminent issues.
3. Improved User Experience
Observability allows you to correlate technical metrics with real user experience. Core Web Vitals like LCP (Largest Contentful Paint), INP (Interaction to Next Paint), and CLS (Cumulative Layout Shift) can be monitored and correlated with business metrics like conversion and retention.
INP replaced FID (First Input Delay) as Google’s official responsiveness metric in March 2024. It measures the time from a user interaction to the next visual paint and considers all interactions during the page’s lifetime, not just the first one. A mature observability system correlates INP with distributed traces, helping you identify which backend calls or client-side operations delay the response to an interaction and degrade responsiveness.
4. Cost Optimization
With complete system visibility, you can identify underutilized resources, performance bottlenecks that waste compute, and right-sizing opportunities. Detailed metrics enable data-driven decisions.
5. Compliance and Auditing
Structured logs and appropriate retention policies ensure you have complete audit trails for regulatory requirements such as PCI-DSS, GDPR, HIPAA, and SOC 2.
Market statistics:
- The global observability market is growing rapidly, driven by the adoption of distributed architectures, microservices, and cloud computing, according to market analyses by the CNCF (Cloud Native Computing Foundation)
- The complexity of hybrid and multicloud environments is a key driver of observability adoption, with organizations seeking unified visibility of their distributed systems
How to Implement Observability
Implementing observability isn’t just about installing tools — it’s a culture and process change. Here’s a practical guide.
Technology Stack
A typical observability stack includes components for collecting, storing, and visualizing each data type.
| Component | Popular Tools | Function |
|---|---|---|
| Metrics Collection | Prometheus, StatsD, Telegraf | Collect and aggregate metrics |
| Metrics Storage | Prometheus, InfluxDB, VictoriaMetrics | Store time series |
| Log Collection | Fluentd, Logstash, Fluent Bit | Collect and format logs |
| Log Storage | Elasticsearch, Loki, Splunk | Index and search logs |
| Distributed Tracing | Jaeger, Zipkin, OpenTelemetry | Trace requests |
| Visualization | Grafana, Kibana, Datadog | Dashboards and alerts |
Tool selection depends on your context: data volume, budget, team expertise, and tolerance for vendor lock-in.
OpenTelemetry — The Open Standard
OpenTelemetry is an open source project under the CNCF that provides a unified standard for telemetry collection: metrics, logs, and traces.
Why OpenTelemetry matters:
- Vendor-neutral: Works with any backend, avoiding lock-in
- Unified instrumentation: A single API for all three pillars
- Multi-language support: Java, Python, Go, JavaScript, .NET, Ruby, Rust
- Industry standard: Backed by Google, Microsoft, AWS, and other major companies
Current status:
- Tracing: Generally Available (GA)
- Metrics: Generally Available (GA)
- Logs: Stable in the specification; SDK maturity varies by language
OpenTelemetry lets you start with one backend (e.g., Prometheus + Jaeger) and switch to another (e.g., Datadog) without rewriting instrumentation.
Implementation Best Practices
1. Define SLIs (Service Level Indicators)
Start by defining what matters to your users:
- Latency: How fast the system responds
- Availability: Percentage of time the system is functional
- Error rate: Percentage of requests that fail
- Throughput: How many requests the system can process
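As a rough sketch, these indicators can be derived from a window of request records. The `RequestRecord` type and field names are hypothetical, and the success ratio is used as a request-based stand-in for availability, which is an assumption: a time-based availability measure would need uptime probes instead.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass
class RequestRecord:
    latency_ms: float
    ok: bool


def compute_slis(records, window_seconds):
    """Derive basic SLIs from the requests observed in one time window."""
    if not records or window_seconds <= 0:
        raise ValueError("need at least one record and a positive window")
    total = len(records)
    failed = sum(1 for record in records if not record.ok)
    return {
        "success_ratio": (total - failed) / total,
        "error_rate": failed / total,
        "throughput_rps": total / window_seconds,
        "mean_latency_ms": fmean([record.latency_ms for record in records]),
    }


window = [
    RequestRecord(100, True),
    RequestRecord(200, True),
    RequestRecord(300, True),
    RequestRecord(400, False),
]
print(compute_slis(window, window_seconds=2))
```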
2. Establish SLOs (Service Level Objectives)
Define measurable targets:
- 99.9% availability (max 8.76 hours of downtime per year)
- P95 latency < 200ms
- Error rate < 0.1%
SLOs transform observability from a technical practice into a business instrument.
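The downtime figure for a 99.9% target follows from simple arithmetic, and the same idea gives an error budget for request-based SLOs. The sketch below is illustrative; the function names and the request counts in the example are assumptions.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(slo_percent, period_hours=HOURS_PER_YEAR):
    """Downtime permitted by an availability SLO over the given period."""
    return (100 - slo_percent) / 100 * period_hours


def error_budget_remaining(slo_percent, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative once the SLO is violated)."""
    allowed_failures = total_requests * (100 - slo_percent) / 100
    return 1 - failed_requests / allowed_failures


print(f"{allowed_downtime_hours(99.9):.2f} hours/year")
# Hypothetical month: 1,000,000 requests, 250 failures, 99.9% success SLO.
print(f"{error_budget_remaining(99.9, 1_000_000, 250):.0%} of budget left")
```

Tracking the remaining budget rather than the raw error rate makes it easier to decide when to slow down releases and when there is room to take risks.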
3. Implement the Three Pillars
Metrics: Start with the “golden signals” — latency, traffic, errors, and saturation. These are the most important indicators for any system.
Logs: Use structured logging (JSON) with correlation IDs. Always include context: timestamp, service name, log level, message, and relevant extra fields.
Traces: Implement 1-10% sampling for high-volume systems. Use trace IDs in logs for correlation.
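One way to implement a fixed sampling rate is to derive the decision from the trace ID, so that every service reaches the same verdict for the same trace without coordination. The sketch below is one possible approach under that assumption, not the algorithm of any specific tracing SDK; the function name and the SHA-256 hashing choice are illustrative.

```python
import hashlib


def should_sample(trace_id, rate):
    """Deterministic head sampling: the same trace ID always yields the same decision."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes to an integer in [0, 2**64) and compare with the threshold.
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(rate * 2**64)


# Hypothetical trace IDs; roughly 5% of them should be kept.
kept = sum(should_sample(f"trace-{i}", 0.05) for i in range(10_000))
print(f"kept {kept} of 10000 traces")
```

A fixed rate like this is simple but can miss rare errors; keeping all traces that contain errors is a common refinement.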
4. Create Meaningful Alerts
Alert based on SLOs, not individual metrics. Instead of “CPU > 90%”, alert on “P95 latency > SLO”. This reduces alert noise and focuses on what matters to users.
Use multiple severity levels:
- Warning: Approaching the limit (e.g., P95 > 150ms when SLO is 200ms)
- Critical: SLO violated (e.g., P95 > 200ms)
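The two severity levels can be expressed as a small classification function. This is a sketch using the thresholds from the example above; the function name and the `warning_ratio` parameter (0.75, so that the warning starts at 150ms for a 200ms SLO) are assumptions.

```python
def classify_p95(p95_ms, slo_ms=200.0, warning_ratio=0.75):
    """Map a measured P95 latency to an alert severity relative to the SLO."""
    if p95_ms > slo_ms:
        return "critical"  # SLO violated
    if p95_ms > slo_ms * warning_ratio:
        return "warning"   # approaching the limit
    return "ok"


for measured in (120, 160, 230):
    print(measured, classify_p95(measured))
```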
5. Establish Observability Culture
Observability is not just tools — it’s process:
- Post-mortems: Document incidents and learnings
- Shared dashboards: All teams should have access
- Training: Developers need to know how to use the tools
- Ownership: Each team is responsible for their services’ observability
Real-Time Observability in Distributed Architectures
Modern distributed systems operate across multiple regions, requiring real-time visibility of events, metrics, and logs. Latency in data collection can be critical for incident response.
Processing observability data directly on distributed infrastructure enables:
- Ultra-low latency collection: Events available in under 60 seconds
- Streaming to multiple destinations: SIEM, analytics, storage
- Real-time dashboards: Metrics aggregated instantly
- Reduced origin load: Processing close to the user
Specific benefits for observability in distributed architectures:
| Benefit | Description |
|---|---|
| Reduced latency | Data collected close to users, not at a centralized origin |
| Automatic scalability | Infrastructure scales with traffic without manual intervention |
| Simplified integration | Native connectors to Splunk, Datadog, BigQuery, S3, Azure Monitor |
| Optimized cost | Pay-as-you-go model, no infrastructure to manage |
| Compliance ready | Configurable retention for auditing and regulations |
Transport Protocols: TCP vs. UDP/QUIC
The ingestion and streaming of observability data depends critically on the underlying transport protocols. Traditional systems based on HTTP/1.1 or HTTP/2 over TCP face latency limitations due to head-of-line blocking: because TCP delivers bytes strictly in order, a single lost packet holds back all subsequent data on the connection until it is retransmitted, causing cascading delays.
TCP Limitations for Observability:
- Head-of-line blocking: Packet loss paralyzes the entire connection
- Handshake overhead: Three-way handshake adds connection latency
- Aggressive congestion control: Excessive backoff on lossy networks
- Stateful connections: Difficult to multiplex multiple streams
Advantages of QUIC/HTTP/3:
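The QUIC protocol (originally short for Quick UDP Internet Connections, though the IETF standard treats QUIC as a name rather than an acronym) is the foundation of HTTP/3 and addresses these limitations through a UDP-based architecture: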
- No head-of-line blocking: Independent streams don’t affect each other
- 0-RTT connection resumption: Resume connections instantly
- Native multiplexing: Multiple streams over a single connection
- Seamless network migration: IP migration without connection breakage
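The difference in head-of-line blocking can be illustrated with a deliberately simplified model: each packet has an arrival time, and data is handed to the application only once everything ordered before it has arrived. The function name, stream labels, and timings are hypothetical, and the model ignores congestion control, acknowledgements, and retransmission logic.

```python
def delivery_times(packets, per_stream_ordering):
    """Return when each packet's data can be delivered to the application.

    packets: list of (stream_id, arrival_ms) in send order.
    per_stream_ordering=False models one ordered byte stream (TCP-like);
    per_stream_ordering=True models independently ordered streams (QUIC-like).
    """
    latest_arrival = {}
    delivered = []
    for stream_id, arrival_ms in packets:
        key = stream_id if per_stream_ordering else "connection"
        # Data cannot be delivered before everything ordered ahead of it.
        latest_arrival[key] = max(latest_arrival.get(key, 0), arrival_ms)
        delivered.append(latest_arrival[key])
    return delivered


# The first packet of stream A is lost and its retransmission arrives at 110 ms.
packets = [("A", 110), ("B", 11), ("A", 12), ("B", 13)]
print(delivery_times(packets, per_stream_ordering=False))
print(delivery_times(packets, per_stream_ordering=True))
```

In the single-stream case every packet waits for the retransmission, while with per-stream ordering only stream A is delayed and stream B's data is delivered as it arrives.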
Impact on Observability:
Modern data streaming solutions use optimized transport architectures to ensure critical security and performance events reach analysis tools in under 60 seconds, even under adverse network conditions. Transport protocol choice directly impacts:
- Delivery latency: QUIC can reduce latency by 30-50% vs. TCP on lossy networks, depending on loss rate and workload
- Reliability: Lower rate of lost events
- Throughput: Higher data volume transmitted per second
- Resilience: Better performance on unstable networks
Tool Comparison
Tool selection depends on your context, budget, and maturity. Here’s a comparison of popular options.
| Tool | Type | Open Source | Vendor Lock-in | Best For |
|---|---|---|---|---|
| Prometheus | Metrics | Yes | No | Metrics collection, alerts |
| Grafana | Visualization | Yes | No | Unified dashboards |
| Jaeger | Tracing | Yes | No | Distributed tracing |
| Elasticsearch | Logs | Partial | Medium | Log search and analysis |
| Datadog | Full Stack | No | Yes | Full SaaS platform |
| Honeycomb | Observability | No | Yes | Ad-hoc querying, debugging |
| OpenTelemetry | Collection | Yes | No | Unified standard, vendor-neutral |
To avoid lock-in, consider using OpenTelemetry for instrumentation, allowing you to switch backends without rewriting code.
Data Type Comparison
Each data type has distinct characteristics that influence cost and use case.
| Data Type | Storage Cost | Cardinality | Context | Best Use |
|---|---|---|---|---|
| Metrics | Low | Limited | Aggregated | Dashboards, alerts, trend analysis |
| Logs | High | High | Rich | Detailed debugging, auditing |
| Traces | Medium | Medium | Full journey | Latency, dependencies, causality |
An effective strategy combines all three types with differentiated retention policies to optimize costs.
Observability in Practice
20 TB/month of Data and High Availability
Magazine Luiza, one of the most innovative retail companies in Latin America with R$ 10 billion in digital sales in 2021, needed to guarantee high availability for hundreds of applications while evolving its security perimeter and improving cyber threat intelligence.
Implemented solution:
- Distributed firewall (Network Shield + WAF + DDoS Protection)
- Data Streaming to send security events in real time
- Radware Bot Manager for bot management
Verified results:
- 20 TB of data per month sent via Data Streaming
- Data visualized in real time on the team’s preferred SIEM platforms
- Millions of threats automatically blocked
- High availability guaranteed during peak events (Black Friday)
- High-granularity security micro-perimeters
Frequently Asked Questions about Observability
What is observability and what is it for?
Observability is the ability to understand the internal state of a system by examining its external outputs — logs, metrics, and traces. It serves to diagnose problems in distributed systems, correlate events across multiple services, quickly identify root causes, and reduce incident resolution time (MTTR).
What is the difference between observability and monitoring?
Monitoring collects predefined metrics and alerts on known conditions (“CPU > 90%”). Observability lets you ask arbitrary questions about the system (“why did latency spike at 2 PM on service X?”), correlating multiple data types to diagnose unknown problems. Monitoring answers questions you already know to ask; observability lets you ask questions you didn’t know you needed to ask.
What are the three pillars of observability?
The three pillars are: Metrics (aggregated numerical data like latency and error rate), Logs (records of discrete events with rich context), and Traces (request tracing across multiple services). Together, they allow you to understand what happened, when, and why, connecting different levels of detail.
How to choose observability tools?
Evaluate: (1) support for all three pillars, (2) storage and scaling cost, (3) integration with existing stack, (4) vendor lock-in, (5) ease of use, and (6) OpenTelemetry support. Prefer open standards solutions to avoid vendor dependency. Start with open source tools like Prometheus, Grafana, and Jaeger, and consider SaaS as you scale.
What is OpenTelemetry?
OpenTelemetry is an open source CNCF project that provides a unified standard for telemetry collection: metrics, logs, and traces. It is vendor-neutral, supports multiple programming languages (Java, Python, Go, JavaScript, .NET, etc.), and lets you choose any backend without changing your code instrumentation.
How to implement observability in microservices?
Start with: (1) OpenTelemetry instrumentation in each service, (2) correlation IDs in all logs to trace requests, (3) distributed tracing to connect spans across services, (4) golden signal metrics (latency, traffic, errors, saturation), and (5) unified dashboards with Grafana or similar. Implement gradually, starting with the most critical services.
How much does observability cost?
Costs vary by data volume, retention, and chosen tools. Logs are usually the most expensive (high volume), while metrics are cheaper (aggregated). Open source reduces license cost but requires operational effort. SaaS simplifies operations but may introduce vendor lock-in. Estimate 5-15% of infrastructure budget for mature observability. Start small, measure value delivered, and scale as needed.
Conclusion and Next Steps
Observability is essential for modern distributed systems. It transforms incident response from trial and error to evidence-based investigation, reducing MTTR and improving user experience.
Recommended next steps:
For beginners:
- Read our articles on each pillar: Metrics, Logs, and Distributed Tracing
- Install OpenTelemetry SDK in your application
- Set up a basic stack: Prometheus + Grafana to start
For teams with some observability:
- Assess gaps in the three pillars
- Implement correlation IDs across services
- Define SLOs based on measurable SLIs
For mature companies:
- Automate incident response with observability-driven remediation
- Integrate with SIEM platforms for security observability
- Use Data Streaming for real-time analysis
Want to implement real-time observability with ultra-low latency on the Azion Web Platform? Discover how Data Stream, Real-Time Events, and Real-Time Metrics can transform your operational visibility in a global distributed architecture.