Monitoring vs Observability: Differences and When to Use Each

In 2019, a major e-commerce platform faced a 4-hour outage. The monitoring system showed all servers “green” — CPU normal, memory OK, disk available. But nobody could complete checkout. The problem? A version incompatibility between two microservices that traditional monitoring couldn’t detect.

This scenario illustrates why the distinction between monitoring and observability matters. For large e-commerce platforms, an outage during peak hours can cost between US$ 1 million and US$ 5 million per hour, according to Gartner data. ITIC research reveals that over 90% of organizations estimate the cost of a single hour of downtime exceeds US$ 300,000.

Traditional monitoring is effective at detecting conditions that someone anticipated and modeled in advance. Modern distributed systems, however, often fail in ways nobody anticipated. Understanding this difference is fundamental to choosing the right approach.

What is Monitoring?

Monitoring is the practice of collecting, processing, aggregating, and displaying predefined metrics about a system’s state. Its primary goal is to detect known conditions that indicate problems or degradation, answering questions like “is the server available?” and “is CPU above 90%?”.

Key characteristics:

Reactive: Responds to predefined conditions after the fact
Threshold-based: Alerts configured with fixed limits (e.g., CPU > 90%)
Dashboard-oriented: Visualization of aggregated metrics in real time
Simplicity focus: Easier to implement in systems with simple architecture

Typical components:

Metrics collection: CPU, memory, disk, network, requests per second
Aggregation: Average, sum, min, max, percentiles
Alerts: Notifications when thresholds are exceeded
Dashboards: Time series visualization

When monitoring tends to be sufficient:

Monolithic systems with simple architecture
Static infrastructure (fixed servers, few changes)
Well-known and recurring problems
Basic availability requirements (99% uptime)
Small teams without dedicated reliability engineers

Practical example:

Alert: CPU on web-01 > 90% for the last 5 minutes
Dashboard: Requests per second = 1,200
Dashboard: Error rate = 0.5%

Limitation: This indicates that there’s a problem, but not necessarily why. If the issue is an incompatibility between services, CPU thresholds won’t help the investigation.
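As a rough illustration of how the CPU alert in the example could work, here is a minimal Python sketch. The class name `ThresholdAlert` and the one-sample-per-minute cadence are assumptions made for this example, not a reference to any particular monitoring tool.

```python
from collections import deque


class ThresholdAlert:
    """Fires when every sample in a full window exceeds a fixed limit."""

    def __init__(self, limit: float, window_size: int) -> None:
        self.limit = limit
        self.samples = deque(maxlen=window_size)

    def observe(self, value: float) -> bool:
        """Record one sample and report whether the alert should fire."""
        self.samples.append(value)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(v > self.limit for v in self.samples)


# Five one-minute CPU samples stand in for "the last 5 minutes".
cpu_alert = ThresholdAlert(limit=90.0, window_size=5)
readings = [95.0, 97.0, 92.0, 99.0, 93.0]
fired = [cpu_alert.observe(r) for r in readings]
print(fired)
```

The check only knows about the one condition it was configured for, which is exactly the limitation described above.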

What is Observability?

Observability is a property of systems that allows inferring the internal state by examining external outputs. In software engineering, this means the ability to ask arbitrary questions about system behavior without needing to predict those questions in advance.

The term comes from control theory, introduced by Rudolf E. Kalman in 1960. A system is observable if you can determine its internal state by measuring its outputs. In software engineering, it was popularized by Charity Majors starting around 2017 as a necessary evolution of traditional monitoring for distributed systems.

Key characteristics:

Investigative: Allows exploring unforeseen problems in real time
Query-oriented: Ask arbitrary questions without predefined dashboards
Rich in context: Individual data with detailed context (user_id, trace_id)
Correlational: Connects events across multiple systems
Natively distributed: Designed for complex microservices architectures

The Three Pillars of Observability

Observability relies on three complementary data types:

1. Metrics:

Numerical values aggregated over time
Answer: “What is the trend? Is the system healthy?”
Example: P95 latency, error rate, throughput, cache hit rate
Cost: Low (aggregated data, high compression)

2. Logs:

Timestamped records of discrete events
Answer: “What exactly happened at this moment?”
Example: Stack traces, error messages, business events
Cost: High (individual data, high storage volume)

3. Traces (Distributed Tracing):

Request tracing across multiple services
Answer: “How did this request travel? Where is the bottleneck?”
Example: API Gateway → Auth Service → Payment Service → Database journey
Cost: Medium (sampling common for volume control)

Together, these pillars enable complete investigation: metrics detect anomalies, traces locate problematic services, logs diagnose specific causes.

Known Knowns, Known Unknowns, Unknown Unknowns

The fundamental difference between monitoring and observability becomes clear when categorizing problem types:

Known Knowns (what we know we know):

“CPU is at 90%”
“Disk is 80% full”
“Server is down”
“Error rate increased to 5%”

→ Traditional monitoring is adequate — predefined questions, threshold-based alerts.

Known Unknowns (what we know we don’t know):

“Why did latency spike at 2 PM?”
“Which service is slow?”
“Where is the performance bottleneck?”
“Which user is being affected?”

→ Observability is more suitable — ad-hoc queries, data correlation, dynamic investigation.

Unknown Unknowns (what we don’t know we don’t know):

“The application is silently failing without explicit errors”
“Users are experiencing degraded performance we haven’t detected”
“A configuration change caused gradual degradation over days”
“A new feature introduced subtle incompatibility between services”

→ Observability is essential — ability to investigate problems you never imagined could exist.

This is the crucial point: monitoring detects conditions you predicted might happen. Observability allows investigating problems you didn’t predict.
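To make "asking a question you did not plan for" concrete, the sketch below groups per-request events by any field chosen at query time. The event fields (`service`, `client_version`, `status`) and the function `error_rate_by` are hypothetical names for this example only.

```python
from collections import defaultdict

# Hypothetical per-request events, each carrying its own context fields.
events = [
    {"service": "checkout", "client_version": "2.3", "status": 200},
    {"service": "checkout", "client_version": "2.4", "status": 500},
    {"service": "checkout", "client_version": "2.4", "status": 500},
    {"service": "checkout", "client_version": "2.3", "status": 200},
    {"service": "cart", "client_version": "2.4", "status": 200},
]


def error_rate_by(events: list[dict], field: str) -> dict[str, float]:
    """Group events by any field and return the share of 5xx responses."""
    totals: dict[str, int] = defaultdict(int)
    errors: dict[str, int] = defaultdict(int)
    for event in events:
        key = event[field]
        totals[key] += 1
        if event["status"] >= 500:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


# The grouping field is picked during the investigation, not in advance.
by_service = error_rate_by(events, "service")
by_version = error_rate_by(events, "client_version")
print(by_service)
print(by_version)
```

Grouping by service shows that checkout is failing. Grouping the same events by client version points at a single version, a dimension no predefined dashboard would necessarily have included.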

Key Differences between Monitoring and Observability

Dimension	Monitoring	Observability
Definition	Action of collecting predefined data	System property of being investigable
Focus	Detect known conditions	Investigate unknown problems
Questions	Predefined (“CPU > 90%?”)	Arbitrary (“Why did latency spike at 2 PM?”)
Approach	Reactive (alert after the fact)	Investigative (explore in real time)
Data	Aggregated metrics	Metrics + Logs + Traces correlated
Context	Limited (aggregated)	Rich (individual events)
Complexity	Simple systems	Complex distributed systems
Dashboard	Fixed, predefined	Dynamic, ad-hoc queries
MTTR	Tends to be higher (hours to days)	Tends to be lower (minutes to hours)
Cost	Low to medium	Medium to high

MTTR and the Cost of Investigation

How long an incident lasts directly drives the cost of downtime. Two measures determine that duration:

MTTD (Mean Time to Detect): Average time to detect there’s a problem
MTTR (Mean Time to Resolve): Average time to restore the system after detection

Traditional monitoring focuses on reducing detection time for obvious failures (server down, high CPU). Observability tends to reduce both detection and resolution time for complex and silent failures — problems that can take hours to detect and days to resolve without data correlation.

Illustrative example: A silent failure that takes 4 hours to detect and 2 hours to resolve (6 hours total), at a cost of US$ 100,000 per hour, results in a loss of US$ 600,000. With well-implemented observability, the same investigation can take 30 minutes to detect and 30 minutes to resolve — reducing the cost to US$ 100,000.
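The arithmetic in this illustration can be reproduced with a few lines of Python. The function name `downtime_cost` is an assumption for this sketch, and the figures are the illustrative ones from the paragraph above, not measured data.

```python
def downtime_cost(detect_hours: float, resolve_hours: float, cost_per_hour: float) -> float:
    """Total cost of an incident: time to detect plus time to resolve, times hourly cost."""
    return (detect_hours + resolve_hours) * cost_per_hour


without_observability = downtime_cost(4, 2, 100_000)
with_observability = downtime_cost(0.5, 0.5, 100_000)
savings = without_observability - with_observability

print(without_observability, with_observability, savings)
```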

Monitoring and Observability Work Together

Observability doesn’t replace monitoring — it extends and complements it:

Phase 1: Basic monitoring

Infrastructure metrics (CPU, memory, disk, network)
Threshold-based alerts
Fixed dashboards

Phase 2: APM (Application Performance Monitoring)

Application metrics (latency, throughput, errors)
Basic distributed tracing
Partial correlation

Phase 3: Complete observability

Three correlated pillars (metrics, logs, traces)
Dynamic ad-hoc queries
Rich context in every event
SLOs based on user experience

Practical recommendation: Start with monitoring, add observability as system complexity increases. Don’t skip stages — each phase builds on the previous one.

When to Use Each Approach

Scenarios Where Monitoring is Adequate

Monitoring tends to be sufficient when:

Monolithic systems: Simple architecture, few dependencies, linear data flow
Static infrastructure: Fixed number of servers and services, infrequent changes
Known problems: You know what can go wrong and how to detect it
Basic alerts: CPU, memory, disk, availability are sufficient
Limited budget: Infrastructure cost needs to be low
Small team: No dedicated reliability engineers, generalist team
Simple SLAs: Basic availability only (99% uptime)

Example: A simple web application with 3 servers, single database, no microservices.

Scenarios That Benefit from Observability

Observability tends to be essential when:

Distributed architectures: Microservices, containers, serverless functions
Dynamic systems: Containers created and destroyed constantly, auto-scaling
Unknown problems: You don’t know what can go wrong — and that’s normal
Rigorous SLAs: 99.9%+ availability, latency < 100ms, specific SLOs
Deep investigation: Root cause analysis in minutes, not hours or days
User experience: Correlate technical metrics with real UX (Core Web Vitals)
Compliance: Detailed event auditing, trails for PCI-DSS, GDPR
Automation: AIOps, automatic incident response

Example: An e-commerce platform with 50+ microservices, multiple databases, distributed caches, external APIs, asynchronous payment flows.

How to Evolve from Monitoring to Observability

The evolution to observability requires gradual planning. Here’s a practical roadmap:

1. Start with Golden Signals

The Google SRE “golden signals” are an effective starting point:

Latency: Time to respond to requests
Traffic: Requests per second
Errors: Rate of failed requests
Saturation: Resource usage (CPU, memory, disk)

Implement these signals first on the most critical services.
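A minimal sketch of how the four signals might be derived from one window of request records follows. The `Request` type, the nearest-rank percentile, and the use of CPU utilization as the saturation figure are assumptions chosen to keep the example small. It also assumes the window contains at least one request.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def golden_signals(requests: list[Request], window_seconds: float, cpu_utilization: float) -> dict[str, float]:
    """Summarize one non-empty time window into the four golden signals."""
    durations = [r.duration_ms for r in requests]
    failed = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_ms": percentile(durations, 95),
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": failed / len(requests),
        "saturation": cpu_utilization,
    }


# Twenty synthetic requests in a 10-second window; every tenth one fails.
window = [Request(duration_ms=100.0 + i, status=500 if i % 10 == 0 else 200) for i in range(20)]
signals = golden_signals(window, window_seconds=10.0, cpu_utilization=0.62)
print(signals)
```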

2. Implement Structured Logs

JSON-formatted logs with consistent fields facilitate correlation:

{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "error",
  "service": "payment-service",
  "trace_id": "abc123",
  "user_id": "user-456",
  "message": "Payment timeout after 30s"
}

Best practices:

Use ISO 8601 timestamps
Include trace_id in all logs
Standardize field names
Avoid unstructured logs
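One way to produce logs in this shape is a custom formatter on top of Python's standard `logging` module. The sketch below is one possible approach, not the only one. The `JsonFormatter` class and the logger name `payment` are hypothetical, and the in-memory stream stands in for stdout or a log shipper.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with consistent field names."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            # ISO 8601 in UTC; isoformat() writes the offset as +00:00.
            "timestamp": timestamp.isoformat(timespec="seconds"),
            "level": record.levelname.lower(),
            "service": self.service,
            # trace_id and user_id arrive through logging's `extra` argument.
            "trace_id": getattr(record, "trace_id", None),
            "user_id": getattr(record, "user_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(payload)


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter(service="payment-service"))
logger = logging.getLogger("payment")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

logger.error(
    "Payment timeout after 30s",
    extra={"trace_id": "abc123", "user_id": "user-456"},
)
entry = json.loads(stream.getvalue())
print(entry)
```

Keeping the field names in one formatter, rather than in each log call, is what makes them consistent across a service.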

3. Add Distributed Tracing

Propagate trace IDs across all services:

Use the W3C Trace Context standard (traceparent header)
Instrument entry points (API gateways, load balancers)
Propagate context through HTTP headers, messages, etc.
Start with the most critical services
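The propagation step can be sketched with the `traceparent` header format from W3C Trace Context: version, 32-hex trace ID, 16-hex parent ID, and 2-hex flags, separated by hyphens. The helper names below are hypothetical, and the sketch handles only version `00`. In practice a tracing library would normally do this for you.

```python
import re
import secrets

# version "00", 32-hex trace-id, 16-hex parent-id, 2-hex trace-flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace with random trace and span IDs."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(incoming: str) -> str:
    """Header for an outgoing call: same trace ID and flags, new parent ID."""
    match = TRACEPARENT_RE.match(incoming)
    if match is None:
        return new_traceparent()  # missing or malformed header: start a new trace
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return new_traceparent()  # all-zero IDs are invalid
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
print(outgoing)
```

Every service that forwards the header this way keeps the same trace ID, which is what lets a backend stitch the spans into one request journey.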

4. Define SLOs Based on Experience

Service Level Objectives translate technical metrics into user experience:

SLI (Service Level Indicator): Metric measuring behavior (e.g., share of requests completing in under 200ms)
SLO (Service Level Objective): Target for the SLI (e.g., 99% of requests < 200ms)
SLA (Service Level Agreement): Contractual commitment with consequences

Example: “99.9% of checkout requests must complete in under 2 seconds”
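As a sketch of how such an SLO could be checked, the code below computes the SLI for the checkout example and the share of error budget still unspent. The function names and the synthetic request durations are assumptions for illustration.

```python
def sli_fast_requests(durations_s: list[float], threshold_s: float) -> float:
    """SLI: share of requests that completed faster than the threshold."""
    fast = sum(1 for d in durations_s if d < threshold_s)
    return fast / len(durations_s)


def error_budget_remaining(sli: float, slo: float) -> float:
    """Share of the error budget left: 1.0 untouched, 0.0 spent, negative overspent."""
    allowed_bad = 1.0 - slo
    actual_bad = 1.0 - sli
    return 1.0 - actual_bad / allowed_bad


# Synthetic data: 10,000 checkout requests, 6 of them slower than 2 seconds.
durations = [0.5] * 9994 + [3.0] * 6
sli = sli_fast_requests(durations, threshold_s=2.0)
remaining = error_budget_remaining(sli, slo=0.999)
print(f"SLI={sli:.4f}, error budget remaining={remaining:.0%}")
```

With a 99.9% target, 10 slow requests out of 10,000 are allowed. Six have been used, so roughly 40% of the budget remains.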

5. Consider OpenTelemetry

OpenTelemetry provides unified, vendor-neutral instrumentation:

One instrumentation, multiple backends
Open standard with CNCF support
APIs for metrics, logs, and traces
Exporters for various tools

This reduces vendor lock-in and facilitates tool migration.

Best Practices and Common Pitfalls

Best Practices

Start small:

Instrument critical services first
Avoid instrumenting everything at once
Iterate based on real incidents

Use intelligent sampling:

100% traces for errors
Proportional sampling for success
Adjust based on volume and cost
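The sampling rule above might look like the following sketch. It assumes the decision is made after the request finishes, so the error status is known (often called tail-based sampling). The function `keep_trace` is hypothetical. Hashing the trace ID, instead of drawing a random number, makes every service reach the same decision for the same trace.

```python
import hashlib


def keep_trace(trace_id: str, is_error: bool, success_rate: float) -> bool:
    """Keep every failed trace and a fixed share of the successful ones."""
    if is_error:
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < success_rate


# Keep about 10% of successful traces out of 10,000 synthetic IDs.
kept = sum(keep_trace(f"trace-{i}", False, 0.1) for i in range(10_000))
print(kept)
```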

Correlate data:

Trace IDs in logs, and attached to metrics as exemplars rather than as labels
User IDs for business context
Consistent tags across pillars

Monitor the monitoring:

Are alerts working?
Is data arriving?
Are dashboards updated?
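The "is data arriving?" check can be as simple as comparing each source's last-seen timestamp with a maximum allowed silence. In this sketch the function `stale_sources`, the service names, and the timestamps are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def stale_sources(last_seen: dict[str, datetime], now: datetime, max_silence: timedelta) -> list[str]:
    """Names of telemetry sources that have been silent for too long."""
    return sorted(name for name, seen in last_seen.items() if now - seen > max_silence)


now = datetime(2024, 1, 15, 10, 30, tzinfo=timezone.utc)
last_seen = {
    "auth-service": now - timedelta(seconds=20),
    "payment-service": now - timedelta(minutes=12),
}
silent = stale_sources(last_seen, now, max_silence=timedelta(minutes=5))
print(silent)
```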

Common Pitfalls

High cardinality:

Metrics with many label combinations explode costs
Example: user_id as metric label → millions of time series
Solution: Use logs for high cardinality, metrics for aggregates
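The explosion is easy to quantify: in the worst case, the number of time series for a metric is the product of the distinct values of each label. The label names and counts below are hypothetical, and the result is an upper bound because not every combination necessarily occurs.

```python
import math


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of label cardinalities."""
    return math.prod(label_cardinalities.values())


base_labels = {"service": 50, "endpoint": 20, "status_code": 5}
with_user_id = {**base_labels, "user_id": 100_000}

bounded = series_count(base_labels)
exploded = series_count(with_user_id)
print(bounded, exploded)
```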

Insufficient retention:

Logs deleted before incident investigation
Metrics with lost granularity after a few days
Solution: Define retention based on SLAs and compliance requirements

Inconsistent instrumentation:

Different service names in logs vs metrics
Varied timestamp formats
Solution: Standardize conventions before scaling

Uncontrolled cost:

Data volume growing without limit
Duplicated tools
Solution: Define budget, use sampling, consolidate tools

Comparison: Investigation Flow

Scenario: Users report checkout slowness, but CPU and memory metrics are normal.

Step	Traditional Monitoring	Observability
1. Detection	Dashboard shows normal CPU → no alert	Metric detects P95 latency > SLO
2. Investigation	Check each server manually	Ad-hoc query: “which services have high latency?”
3. Location	Trial and error across multiple logs	Traces show payment service as bottleneck
4. Diagnosis	Manual logs across multiple files	Correlated logs via trace_id reveal timeout
5. Resolution	Tends to take hours to days	Tends to take minutes to hours

Critical difference: Monitoring may indicate “no problem” (false negative). Observability detects and locates the actual problem.
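The location and diagnosis steps in the observability column can be sketched as two small lookups over spans and logs that share a `trace_id`. The record layout, including the `self_time_ms` field (time spent in a service excluding downstream calls), is an assumption for this example and not the schema of any specific tracing backend.

```python
# Hypothetical spans and logs for one slow checkout request.
spans = [
    {"trace_id": "abc123", "service": "api-gateway", "self_time_ms": 12},
    {"trace_id": "abc123", "service": "auth-service", "self_time_ms": 25},
    {"trace_id": "abc123", "service": "payment-service", "self_time_ms": 30_000},
    {"trace_id": "abc123", "service": "database", "self_time_ms": 40},
]
logs = [
    {"trace_id": "abc123", "service": "payment-service", "message": "Payment timeout after 30s"},
    {"trace_id": "zzz999", "service": "payment-service", "message": "Payment accepted"},
]


def bottleneck(spans: list[dict], trace_id: str) -> str:
    """Service with the largest self time within one trace."""
    in_trace = [s for s in spans if s["trace_id"] == trace_id]
    return max(in_trace, key=lambda s: s["self_time_ms"])["service"]


def logs_for(logs: list[dict], trace_id: str, service: str) -> list[str]:
    """Log messages from one service that belong to one trace."""
    return [
        entry["message"]
        for entry in logs
        if entry["trace_id"] == trace_id and entry["service"] == service
    ]


slow_service = bottleneck(spans, "abc123")
evidence = logs_for(logs, "abc123", slow_service)
print(slow_service, evidence)
```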

Frequently Asked Questions

What is the main difference between monitoring and observability?

Monitoring is a reactive action that collects predefined metrics to detect known conditions. Observability is a system property that allows asking arbitrary questions about its internal state, investigating even unforeseen problems through the correlation of logs, metrics, and traces. Monitoring answers “is the system healthy?”, observability answers “why is it not healthy?”.

Does observability replace monitoring?

No, observability complements and extends traditional monitoring. You still need basic infrastructure alerts (CPU, memory, disk, availability), but observability adds deep investigation capability for complex distributed systems. They are not mutually exclusive: one builds on the other.

What are the 3 pillars of observability?

The three pillars are: Metrics (aggregated numerical data like P95 latency, error rate, throughput), Logs (records of discrete events with detailed context like stack traces, user_id, trace_id), and Traces (request tracing across multiple services, showing the complete journey). Together, they enable complete problem investigation.

When to use monitoring or observability?

Use monitoring for simple systems, monolithic architecture, known problems, basic infrastructure alerts, small teams. Use observability for complex distributed systems, microservices, unknown problems, deep investigation, rigorous SLAs (99.9%+), SRE/DevOps teams. The natural evolution is to start with monitoring and add observability as complexity increases.

What are “unknown unknowns” in observability?

“Unknown unknowns” are problems you don’t know exist and didn’t anticipate could happen. Traditional monitoring detects “known knowns” (conditions you predicted and configured alerts for). Observability allows investigating even problems you never imagined could occur, such as a subtle incompatibility between services that gradually degrades performance without explicit errors.

What is the role of APM in monitoring vs observability?

APM (Application Performance Monitoring) is an evolution of traditional monitoring that adds distributed tracing and application metrics. It’s an intermediate step between basic monitoring and complete observability. APM still focuses on predefined questions and fixed dashboards, while observability enables arbitrary queries and investigation of unforeseen problems.

How to migrate from monitoring to observability?

Start by implementing the three pillars gradually: (1) Metrics with golden signals (latency, traffic, errors, saturation), (2) Structured logs with correlation IDs in JSON format, (3) Distributed tracing on critical services. Use OpenTelemetry for unified, vendor-neutral instrumentation. Establish a data-driven investigation culture with defined SLOs. Don’t skip stages — each phase builds on the previous one.

Conclusion

Monitoring and observability are not competitors: observability builds on monitoring. Monitoring is necessary for detecting obvious problems. Observability is essential for investigating complex problems in distributed systems.

Key concepts to remember:

Monitoring: Reactive, predefined, detects known conditions
Observability: Investigative, ad-hoc, explores unforeseen problems
Relationship: Observability extends and complements monitoring
Evolution: Start with monitoring, add observability as you grow
ROI: Well-implemented observability tends to reduce MTTR

Next steps to go deeper:

Read about the Three Pillars of Observability
Explore “What is Observability?”
Practice implementing golden signals on a critical service
