In 2019, a major e-commerce platform faced a 4-hour outage. The monitoring system showed all servers “green” — CPU normal, memory OK, disk available. But nobody could complete checkout. The problem? A version incompatibility between two microservices that traditional monitoring couldn’t detect.
This scenario illustrates why the distinction between monitoring and observability matters. For large e-commerce platforms, an outage during peak hours can cost between US$ 1 million and US$ 5 million per hour, according to Gartner data. ITIC research reveals that over 90% of organizations estimate the cost of a single hour of downtime exceeds US$ 300,000.
Traditional monitoring is effective for detecting previously modeled conditions. Modern systems suffer from problems that were not anticipated. Understanding this difference is fundamental to choosing the right approach.
What is Monitoring?
Monitoring is the practice of collecting, processing, aggregating, and displaying predefined metrics about a system’s state. Its primary goal is to detect known conditions that indicate problems or degradation, answering questions like “is the server available?” and “is CPU above 90%?”.
Key characteristics:
- Reactive: Responds to predefined conditions after the fact
- Threshold-based: Alerts configured with fixed limits (e.g., CPU > 90%)
- Dashboard-oriented: Visualization of aggregated metrics in real time
- Simplicity focus: Easier to implement in systems with simple architecture
Typical components:
- Metrics collection: CPU, memory, disk, network, requests per second
- Aggregation: Average, sum, min, max, percentiles
- Alerts: Notifications when thresholds are exceeded
- Dashboards: Time series visualization
When monitoring tends to be sufficient:
- Monolithic systems with simple architecture
- Static infrastructure (fixed servers, few changes)
- Well-known and recurring problems
- Basic availability requirements (99% uptime)
- Small teams without dedicated reliability engineers
Practical example:
Alert: CPU on web-01 > 90% for the last 5 minutes
Dashboard: Requests per second = 1,200
Dashboard: Error rate = 0.5%
Limitation: This indicates that there’s a problem, but not necessarily why. If the issue is an incompatibility between services, CPU thresholds won’t help the investigation.
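The alert rule in the example above can be sketched in a few lines. This is a minimal illustration, not how any particular monitoring tool evaluates its rules: the `Sample` type, the `cpu_alert_firing` function, and the one-minute sampling interval are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    timestamp: float    # seconds since an arbitrary start (hypothetical)
    cpu_percent: float  # CPU usage reported by the host


def cpu_alert_firing(samples, now, threshold=90.0, window_seconds=300):
    """Return True if every sample in the last window exceeds the threshold."""
    recent = [s for s in samples if now - window_seconds <= s.timestamp <= now]
    # With no data in the window we cannot claim the condition held.
    return bool(recent) and all(s.cpu_percent > threshold for s in recent)


# Hypothetical readings taken once a minute on web-01, all at 95% CPU.
samples = [Sample(timestamp=t, cpu_percent=95.0) for t in range(0, 301, 60)]
print(cpu_alert_firing(samples, now=300))  # True
```

The rule only answers the question it was written for. A checkout failure caused by a service incompatibility never touches `cpu_percent`, so this check stays silent.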
What is Observability?
Observability is a property of systems that allows inferring the internal state by examining external outputs. In software engineering, this means the ability to ask arbitrary questions about system behavior without needing to predict those questions in advance.
The term comes from control theory, introduced by Rudolf E. Kalman in 1960. A system is observable if you can determine its internal state by measuring its outputs. In software engineering, it was popularized by Charity Majors starting around 2017 as a necessary evolution of traditional monitoring for distributed systems.
Key characteristics:
- Investigative: Allows exploring unforeseen problems in real time
- Query-oriented: Ask arbitrary questions without predefined dashboards
- Rich in context: Individual data with detailed context (user_id, trace_id)
- Correlational: Connects events across multiple systems
- Natively distributed: Designed for complex microservices architectures
The Three Pillars of Observability
Observability relies on three complementary data types:
1. Metrics:
- Numerical values aggregated over time
- Answer: “What is the trend? Is the system healthy?”
- Example: P95 latency, error rate, throughput, cache hit rate
- Cost: Low (aggregated data, high compression)
2. Logs:
- Timestamped records of discrete events
- Answer: “What exactly happened at this moment?”
- Example: Stack traces, error messages, business events
- Cost: High (individual data, high storage volume)
3. Traces (Distributed Tracing):
- Request tracing across multiple services
- Answer: “How did this request travel? Where is the bottleneck?”
- Example: API Gateway → Auth Service → Payment Service → Database journey
- Cost: Medium (sampling common for volume control)
Together, these pillars enable complete investigation: metrics detect anomalies, traces locate problematic services, logs diagnose specific causes.
Known Knowns, Known Unknowns, Unknown Unknowns
The fundamental difference between monitoring and observability becomes clear when categorizing problem types:
Known Knowns (what we know we know):
- “CPU is at 90%”
- “Disk is 80% full”
- “Server is down”
- “Error rate increased to 5%”
→ Traditional monitoring is adequate — predefined questions, threshold-based alerts.
Known Unknowns (what we know we don’t know):
- “Why did latency spike at 2 PM?”
- “Which service is slow?”
- “Where is the performance bottleneck?”
- “Which user is being affected?”
→ Observability is more suitable — ad-hoc queries, data correlation, dynamic investigation.
Unknown Unknowns (what we don’t know we don’t know):
- “The application is silently failing without explicit errors”
- “Users are experiencing degraded performance we haven’t detected”
- “A configuration change caused gradual degradation over days”
- “A new feature introduced subtle incompatibility between services”
→ Observability is essential — ability to investigate problems you never imagined could exist.
This is the crucial point: monitoring detects conditions you predicted might happen. Observability allows investigating problems you didn’t predict.
Key Differences between Monitoring and Observability
| Dimension | Monitoring | Observability |
|---|---|---|
| Definition | Action of collecting predefined data | System property of being investigable |
| Focus | Detect known conditions | Investigate unknown problems |
| Questions | Predefined (“CPU > 90%?”) | Arbitrary (“Why did latency spike at 2 PM?”) |
| Approach | Reactive (alert after the fact) | Investigative (explore in real time) |
| Data | Aggregated metrics | Metrics + Logs + Traces correlated |
| Context | Limited (aggregated) | Rich (individual events) |
| Complexity | Simple systems | Complex distributed systems |
| Dashboard | Fixed, predefined | Dynamic, ad-hoc queries |
| MTTR | Tends to be higher (hours to days) | Tends to be lower (minutes to hours) |
| Cost | Low to medium | Medium to high |
MTTR and the Cost of Investigation
The total duration of an incident directly drives the cost of downtime. Two main factors determine that duration:
- MTTD (Mean Time to Detect): Average time to detect there’s a problem
- MTTR (Mean Time to Resolve): Average time to restore the system after detection
Traditional monitoring focuses on reducing detection time for obvious failures (server down, high CPU). Observability tends to reduce both detection and resolution time for complex and silent failures — problems that can take hours to detect and days to resolve without data correlation.
Illustrative example: A silent failure that takes 4 hours to detect and 2 hours to resolve (6 hours total), at a cost of US$ 100,000 per hour, results in a loss of US$ 600,000. With well-implemented observability, the same investigation can take 30 minutes to detect and 30 minutes to resolve — reducing the cost to US$ 100,000.
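The arithmetic in the illustrative example can be written out as follows. The function name `downtime_cost` is an assumption, and the model is deliberately simple: it treats cost as linear in total incident time (detection plus resolution).

```python
def downtime_cost(detect_hours, resolve_hours, cost_per_hour):
    """Total incident cost, assuming cost accrues linearly from start to resolution."""
    return (detect_hours + resolve_hours) * cost_per_hour


# Figures taken from the illustrative example in the text.
without_observability = downtime_cost(4, 2, 100_000)    # 6 hours in total
with_observability = downtime_cost(0.5, 0.5, 100_000)   # 1 hour in total
savings = without_observability - with_observability

print(without_observability, with_observability, savings)
```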
Monitoring and Observability Work Together
Observability doesn’t replace monitoring — it extends and complements it:
Phase 1: Basic monitoring
- Infrastructure metrics (CPU, memory, disk, network)
- Threshold-based alerts
- Fixed dashboards
Phase 2: APM (Application Performance Monitoring)
- Application metrics (latency, throughput, errors)
- Basic distributed tracing
- Partial correlation
Phase 3: Complete observability
- Three correlated pillars (metrics, logs, traces)
- Dynamic ad-hoc queries
- Rich context in every event
- SLOs based on user experience
Practical recommendation: Start with monitoring, add observability as system complexity increases. Don’t skip stages — each phase builds on the previous one.
When to Use Each Approach
Scenarios Where Monitoring is Adequate
Monitoring tends to be sufficient when:
- Monolithic systems: Simple architecture, few dependencies, linear data flow
- Static infrastructure: Fixed number of servers and services, infrequent changes
- Known problems: You know what can go wrong and how to detect it
- Basic alerts: CPU, memory, disk, availability are sufficient
- Limited budget: Infrastructure cost needs to be low
- Small team: No dedicated reliability engineers, generalist team
- Simple SLAs: Basic availability only (99% uptime)
Example: A simple web application with 3 servers, single database, no microservices.
Scenarios That Benefit from Observability
Observability tends to be essential when:
- Distributed architectures: Microservices, containers, serverless functions
- Dynamic systems: Containers created and destroyed constantly, auto-scaling
- Unknown problems: You don’t know what can go wrong — and that’s normal
- Rigorous SLAs: 99.9%+ availability, latency < 100ms, specific SLOs
- Deep investigation: Root cause analysis in minutes, not hours or days
- User experience: Correlate technical metrics with real UX (Core Web Vitals)
- Compliance: Detailed event auditing, trails for PCI-DSS, GDPR
- Automation: AIOps, automatic incident response
Example: An e-commerce platform with 50+ microservices, multiple databases, distributed caches, external APIs, asynchronous payment flows.
How to Evolve from Monitoring to Observability
The evolution to observability requires gradual planning. Here’s a practical roadmap:
1. Start with Golden Signals
The Google SRE “golden signals” are an effective starting point:
- Latency: Time to respond to requests
- Traffic: Requests per second
- Errors: Rate of failed requests
- Saturation: Resource usage (CPU, memory, disk)
Implement these signals first on the most critical services.
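As a rough sketch, the four signals can be derived from a window of request records plus one resource reading. The `Request` type, the field names, and the choice of HTTP 5xx as the definition of "error" are assumptions for illustration; real services usually get these numbers from a metrics library rather than computing them by hand.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int  # HTTP status code


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def golden_signals(requests, window_seconds, cpu_percent):
    """Summarize one time window. Assumes at least one request in the window."""
    durations = [r.duration_ms for r in requests]
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_ms": percentile(durations, 95),
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "saturation_cpu_percent": cpu_percent,
    }


# Hypothetical 10-second window: 100 requests, the two slowest failed.
requests = [Request(duration_ms=i, status=500 if i > 98 else 200) for i in range(1, 101)]
signals = golden_signals(requests, window_seconds=10, cpu_percent=42.0)
print(signals)
```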
2. Implement Structured Logs
JSON-formatted logs with consistent fields facilitate correlation:
{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "error",
  "service": "payment-service",
  "trace_id": "abc123",
  "user_id": "user-456",
  "message": "Payment timeout after 30s"
}
Best practices:
- Use ISO 8601 timestamps
- Include trace_id in all logs
- Standardize field names
- Avoid unstructured logs
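One way to produce logs in this shape is a custom formatter on top of Python's standard `logging` module. This is a minimal sketch: the `JsonFormatter` class is hypothetical, and the field names simply mirror the JSON example above. The `extra` argument is the standard mechanism for attaching per-call attributes to a log record.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with consistent field names."""

    def format(self, record):
        entry = {
            # ISO 8601 timestamp in UTC
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            # Fields below arrive via `extra=`; defaults cover calls that omit them.
            "service": getattr(record, "service", "unknown"),
            "trace_id": getattr(record, "trace_id", None),
            "user_id": getattr(record, "user_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "Payment timeout after 30s",
    extra={"service": "payment-service", "trace_id": "abc123", "user_id": "user-456"},
)
```

Keeping the field list in one formatter means every service that reuses it emits the same names, which is what makes later correlation by `trace_id` practical.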
3. Add Distributed Tracing
Propagate trace IDs across all services:
- Use the W3C Trace Context standard (traceparent header)
- Instrument entry points (API gateways, load balancers)
- Propagate context through HTTP headers, messages, etc.
- Start with the most critical services
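The propagation steps above can be sketched by hand to show what travels in the header. In W3C Trace Context version 00, `traceparent` has the form `version-traceid-parentid-flags`, with a 32-hex-digit trace ID and a 16-hex-digit parent (span) ID. The function names below are assumptions for illustration, and the sketch handles version 00 only; in practice a tracing library normally does this for you.

```python
import re
import secrets

# version 00: 2 hex - 32 hex trace id - 16 hex parent id - 2 hex flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return the header's parts as a dict, or None if it is not valid."""
    match = TRACEPARENT_RE.match(header.strip())
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are defined as invalid by the specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "flags": flags,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def outgoing_headers(incoming_traceparent):
    """Continue the incoming trace with a new span ID, or start a new trace."""
    parsed = parse_traceparent(incoming_traceparent) if incoming_traceparent else None
    span_id = secrets.token_hex(8)  # 16 hex digits for this service's span
    if parsed is None:
        trace_id, flags = secrets.token_hex(16), "01"  # new, sampled trace
    else:
        trace_id, flags = parsed["trace_id"], parsed["flags"]
    return {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}


incoming = "00-4bf92f3577b34da6a3ce929d0eb0e6a3-00f067aa0ba902b7-01"
print(outgoing_headers(incoming))
```

The trace ID stays constant across every hop while each service contributes its own span ID, which is what lets a backend reassemble the full request journey.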
4. Define SLOs Based on Experience
Service Level Objectives translate technical metrics into user experience:
- SLI (Service Level Indicator): Metric measuring behavior (e.g., percentage of requests served in under 200ms)
- SLO (Service Level Objective): Target for the SLI (e.g., 99% of requests < 200ms)
- SLA (Service Level Agreement): Contractual commitment with consequences
Example: “99.9% of checkout requests must complete in under 2 seconds”
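A checkout SLO like the one above can be evaluated with a short calculation. This is a sketch under assumptions: the `slo_report` function and its field names are hypothetical, and the "error budget" is expressed simply as the number of slow requests the target still allows in the measured window.

```python
def slo_report(durations_s, threshold_s=2.0, target=0.999):
    """Compare measured behavior (the SLI) against the objective (the SLO)."""
    total = len(durations_s)
    good = sum(1 for d in durations_s if d < threshold_s)
    sli = good / total                   # fraction of requests under the threshold
    allowed_bad = (1 - target) * total   # slow requests the SLO tolerates
    actual_bad = total - good
    return {
        "sli": sli,
        "slo_met": sli >= target,
        "error_budget_remaining": allowed_bad - actual_bad,
    }


# Hypothetical window: 10,000 checkout requests, 5 of them slower than 2 seconds.
durations = [0.4] * 9_995 + [3.5] * 5
report = slo_report(durations)
print(report["sli"], report["slo_met"], round(report["error_budget_remaining"], 2))
```

Tracking the remaining budget rather than a simple pass/fail gives earlier warning: a team can see the budget shrinking before the objective is actually missed.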
5. Consider OpenTelemetry
OpenTelemetry provides unified, vendor-neutral instrumentation:
- One instrumentation, multiple backends
- Open standard with CNCF support
- APIs for metrics, logs, and traces
- Exporters for various tools
This reduces vendor lock-in and facilitates tool migration.
Best Practices and Common Pitfalls
Best Practices
Start small:
- Instrument critical services first
- Avoid instrumenting everything at once
- Iterate based on real incidents
Use intelligent sampling:
- 100% traces for errors
- Proportional sampling for success
- Adjust based on volume and cost
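The sampling policy above can be sketched as a deterministic decision keyed on the trace ID, so that every service reaches the same verdict for a given trace. The function name `keep_trace` and the 10% default rate are assumptions for illustration. Keeping every error trace implies deciding after the outcome is known, which is usually described as tail-based sampling.

```python
import hashlib


def keep_trace(trace_id, is_error, success_rate=0.1):
    """Keep all error traces; keep a fixed fraction of successful ones."""
    if is_error:
        return True
    # Hash the trace ID into [0, 1) so the decision is stable across services.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < success_rate


# Hypothetical trace IDs; roughly 10% of the successful ones should be kept.
kept = sum(keep_trace(f"trace-{i}", is_error=False) for i in range(10_000))
print(kept)
```

Raising or lowering `success_rate` is the knob for trading investigation detail against storage cost.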
Correlate data:
- Trace IDs in logs, and attached to metrics as exemplars rather than as labels
- User IDs for business context
- Consistent tags across pillars
Monitor the monitoring:
- Are alerts working?
- Is data arriving?
- Are dashboards updated?
Common Pitfalls
High cardinality:
- Metrics with many label combinations explode costs
- Example: user_id as metric label → millions of time series
- Solution: Use logs for high cardinality, metrics for aggregates
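The cost explosion is easy to see with a worst-case count: the number of possible time series for one metric is the product of the distinct values of each label. The label names and counts below are hypothetical.

```python
from math import prod


def max_series(label_cardinalities):
    """Upper bound on time series for one metric: product of label value counts."""
    return prod(label_cardinalities.values())


# Hypothetical label sets for a single request-count metric.
bounded = {"service": 50, "status_class": 5, "region": 4}
with_user_id = {**bounded, "user_id": 1_000_000}

print(max_series(bounded))       # 1000
print(max_series(with_user_id))  # 1000000000
```

Not every combination occurs in practice, so this is an upper bound, but a single unbounded label is enough to push the count far beyond what a metrics backend is priced for.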
Insufficient retention:
- Logs deleted before incident investigation
- Metrics with lost granularity after a few days
- Solution: Define retention based on SLAs and compliance requirements
Inconsistent instrumentation:
- Different service names in logs vs metrics
- Varied timestamp formats
- Solution: Standardize conventions before scaling
Uncontrolled cost:
- Data volume growing without limit
- Duplicated tools
- Solution: Define budget, use sampling, consolidate tools
Comparison: Investigation Flow
Scenario: Users report checkout slowness, but CPU and memory metrics are normal.
| Step | Traditional Monitoring | Observability |
|---|---|---|
| 1. Detection | Dashboard shows normal CPU → no alert | Metric detects P95 latency > SLO |
| 2. Investigation | Check each server manually | Ad-hoc query: “which services have high latency?” |
| 3. Location | Trial and error across multiple logs | Traces show payment service as bottleneck |
| 4. Diagnosis | Manual logs across multiple files | Correlated logs via trace_id reveal timeout |
| 5. Resolution | Tends to take hours to days | Tends to take minutes to hours |
Critical difference: Monitoring may indicate “no problem” (false negative). Observability detects and locates the actual problem.
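The observability column of the table can be mimicked over a handful of in-memory records. This is only a sketch of the investigation pattern: the span and log records, their field names, and the three helper functions are hypothetical, and a real backend would run equivalent queries over stored telemetry.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical spans (one per service hop) and logs sharing trace IDs.
spans = [
    {"trace_id": "t1", "service": "api-gateway", "duration_ms": 40},
    {"trace_id": "t1", "service": "auth-service", "duration_ms": 25},
    {"trace_id": "t1", "service": "payment-service", "duration_ms": 30_000},
    {"trace_id": "t2", "service": "api-gateway", "duration_ms": 35},
    {"trace_id": "t2", "service": "auth-service", "duration_ms": 20},
    {"trace_id": "t2", "service": "payment-service", "duration_ms": 180},
]
logs = [
    {"trace_id": "t1", "level": "error", "message": "Payment timeout after 30s"},
    {"trace_id": "t2", "level": "info", "message": "Payment authorized"},
]


def slowest_service(spans):
    """Step 2: ad-hoc question, which service has the highest mean latency?"""
    by_service = defaultdict(list)
    for span in spans:
        by_service[span["service"]].append(span["duration_ms"])
    return max(by_service, key=lambda name: mean(by_service[name]))


def slowest_trace(spans, service):
    """Step 3: locate the worst request that passed through that service."""
    candidates = [s for s in spans if s["service"] == service]
    return max(candidates, key=lambda s: s["duration_ms"])["trace_id"]


def logs_for_trace(logs, trace_id):
    """Step 4: pull the logs correlated with that trace."""
    return [entry for entry in logs if entry["trace_id"] == trace_id]


service = slowest_service(spans)
trace_id = slowest_trace(spans, service)
for entry in logs_for_trace(logs, trace_id):
    print(service, trace_id, entry["level"], entry["message"])
```

None of these questions had to be defined before the incident. They work only because every record carries a shared `trace_id`.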
Frequently Asked Questions
What is the main difference between monitoring and observability?
Monitoring is a reactive action that collects predefined metrics to detect known conditions. Observability is a system property that allows asking arbitrary questions about its internal state, investigating even unforeseen problems through the correlation of logs, metrics, and traces. Monitoring answers “is the system healthy?”, observability answers “why is it not healthy?”.
Does observability replace monitoring?
No, observability complements and extends traditional monitoring. You still need basic infrastructure alerts (CPU, memory, disk, availability), but observability adds deep investigation capability for complex distributed systems. They are not mutually exclusive: observability is the next stage of the same evolution.
What are the 3 pillars of observability?
The three pillars are: Metrics (aggregated numerical data like P95 latency, error rate, throughput), Logs (records of discrete events with detailed context like stack traces, user_id, trace_id), and Traces (request tracing across multiple services, showing the complete journey). Together, they enable complete problem investigation.
When to use monitoring or observability?
Use monitoring for simple systems, monolithic architecture, known problems, basic infrastructure alerts, small teams. Use observability for complex distributed systems, microservices, unknown problems, deep investigation, rigorous SLAs (99.9%+), SRE/DevOps teams. The natural evolution is to start with monitoring and add observability as complexity increases.
What are “unknown unknowns” in observability?
“Unknown unknowns” are problems you don’t know exist and didn’t anticipate could happen. Traditional monitoring detects “known knowns” (conditions you predicted and configured alerts for). Observability allows investigating even problems you never imagined could occur, such as a subtle incompatibility between services that gradually degrades performance without explicit errors.
What is the role of APM in monitoring vs observability?
APM (Application Performance Monitoring) is an evolution of traditional monitoring that adds distributed tracing and application metrics. It’s an intermediate step between basic monitoring and complete observability. APM still focuses on predefined questions and fixed dashboards, while observability enables arbitrary queries and investigation of unforeseen problems.
How to migrate from monitoring to observability?
Start by implementing the three pillars gradually: (1) Metrics with golden signals (latency, traffic, errors, saturation), (2) Structured logs with correlation IDs in JSON format, (3) Distributed tracing on critical services. Use OpenTelemetry for unified, vendor-neutral instrumentation. Establish a data-driven investigation culture with defined SLOs. Don’t skip stages — each phase builds on the previous one.
Conclusion
Monitoring and observability are not competitors; they are successive stages of the same practice. Monitoring is necessary for detecting obvious problems. Observability is essential for investigating complex problems in distributed systems.
Key concepts to remember:
- Monitoring: Reactive, predefined, detects known conditions
- Observability: Investigative, ad-hoc, explores unforeseen problems
- Relationship: Observability extends and complements monitoring
- Evolution: Start with monitoring, add observability as you grow
- ROI: Well-implemented observability tends to reduce MTTR
Next steps to go deeper:
- Read about the Three Pillars of Observability
- Explore the article “What is Observability?”
- Practice implementing golden signals on a critical service