What are Metrics? Definition, Types, and How to Use

A major Brazilian retailer processes more than 730 TB of data in 6 months through its distributed infrastructure. Without structured metrics, understanding where the performance bottleneck lies among millions of requests would be impossible. Metrics are the foundation that transforms raw data into quantitative answers: “what is the P95 latency?”, “how many errors per minute?”, “what is the current throughput?”.

Prometheus is one of the most widely adopted tools for metrics and a graduated CNCF project. Organizations with mature observability practices typically start with metrics as the first step, as they enable real-time anomaly detection and faster incident response. For more details, see the official Prometheus documentation.

What are Metrics?

Metrics are numerical values observed and collected over time, usually organized as time series. They represent the state, behavior, or performance of systems and applications. Each metric consists of:

Name: Unique identifier (e.g., http_requests_total)
Labels/Dimensions: Metadata for filtering (e.g., {method="GET", status="200"})
Numerical value: The observed or collected value
Timestamp: Moment of collection

Prometheus format (exposition):

http_requests_total{method="GET", status="200"} 12345 1622745600000
http_requests_total{method="POST", status="500"} 67 1622745600000
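
To make the four components concrete, here is a minimal Python sketch that splits one exposition line into name, labels, value, and timestamp. The function name parse_sample and the regular expressions are assumptions for illustration, not part of any Prometheus library, and the sketch ignores escaped characters inside label values.

```python
import re

# Simplified pattern for one sample line: name, optional {labels}, value,
# optional timestamp. Assumption: no escaped quotes or braces in label values.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+(?P<ts>-?\d+))?\s*$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)\s*=\s*"([^"]*)"')


def parse_sample(line):
    """Split one exposition line into (name, labels, value, timestamp_ms)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    timestamp = int(match.group("ts")) if match.group("ts") else None
    return match.group("name"), labels, float(match.group("value")), timestamp
```

Calling parse_sample on the first line above returns the name http_requests_total, the labels method=GET and status=200, the value 12345.0, and the timestamp 1622745600000.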

Difference: Metrics vs Logs vs Traces

Dimension	Metrics	Logs	Traces
Observation unit	Numerical time series	Individual event	Request/span
Detail level	Most summarized	Most detailed per event	Flow detail between services
Common question	“How much?”	“What happened?”	“Where did latency occur?”
Primary use	Monitoring, alerts, trends	Debugging, auditing	Distributed latency analysis

Beyond knowing what to measure, it’s important to understand how to model that measurement. The Prometheus instrumentation model defines four main metric types, each with specific usage characteristics.

The 4 Metric Types in the Prometheus Model

Counter, Gauge, Histogram, and Summary are metric types in the Prometheus instrumentation model, widely used to represent different measurement patterns. Other tools may have different classifications, but these concepts are applicable across various contexts.

Type	Behavior	Examples	Primary Use
Counter	Value that only increases. Resets to zero on process restart.	Total requests, total errors, bytes transmitted	Calculate rates with `rate()`, measure throughput
Gauge	Value that can go up or down. Represents current state.	CPU %, memory in use, temperature, active connections	Monitor instantaneous state, trends, capacity
Histogram	Distributes values into cumulative buckets. Allows estimating quantiles at query time.	Latency (P50, P95, P99), request size	Estimated quantiles, cross-instance aggregation
Summary	Calculates quantiles on the client at observation time.	Client-calculated latency, response time	Pre-calculated quantiles, when aggregation is not needed

Counter

A value that only increases, resetting to zero when the process is restarted.

Characteristics:

Monotonically increasing
Resets to zero on process restart
Used to calculate rates (e.g., requests per second via rate())

Examples:

http_requests_total → Total HTTP requests
errors_total → Total errors
bytes_transmitted_total → Bytes transmitted

Typical usage:

# Request rate per second over the last 5 minutes
rate(http_requests_total[5m])

# Error rate per minute
rate(errors_total[1m]) * 60
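
The idea behind rate() on a counter can be sketched in a few lines of Python. This is a simplified illustration, not Prometheus's implementation: the real function also extrapolates to the window boundaries, which is omitted here. The function names and the sample data are hypothetical.

```python
def counter_increase(samples):
    """Total increase across (timestamp_s, value) samples, tolerating resets."""
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the process restarted and the counter began again at
        # zero, so the whole current value counts as new increase.
        total += curr if curr < prev else curr - prev
    return total


def simple_rate(samples):
    """Per-second rate over the time span covered by the samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / elapsed


# Hypothetical scrapes every 15 s; the counter resets between t=30 and t=45.
scrapes = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
```

For these scrapes the increase is 30 + 30 + 10 + 30 = 100 over 60 seconds, so the reset does not appear as a negative rate.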

Gauge

A value that can go up or down, representing the current state.

Characteristics:

Instantaneous snapshot
Can freely go up or down
rate() does not make sense for gauges (they are not monotonically increasing); use delta() or deriv() to measure how a gauge changes
Used to show trends and current state

Examples:

cpu_usage_percent → Current CPU usage
memory_bytes → Memory in use
active_connections → Active connections
temperature_celsius → Temperature

Typical usage:

# Current value (instantaneous)
cpu_usage_percent

# Average over the last 5 minutes
avg_over_time(cpu_usage_percent[5m])

# Maximum over the last 1 hour
max_over_time(memory_bytes[1h])

Histogram

Counts observed values in predefined cumulative buckets, so quantiles can be estimated from those buckets at query time.

Characteristics:

Bucket counters are cumulative (each bucket includes values from smaller buckets)
Allows estimating quantiles via histogram_quantile() at query time
Allows aggregating metrics from multiple instances
More flexible than summary for distributed systems

Exposition (three kinds of series are generated: buckets, sum, and count):

# Bucket counters
http_request_duration_seconds_bucket{le="0.1"} 100
http_request_duration_seconds_bucket{le="0.5"} 150
http_request_duration_seconds_bucket{le="1.0"} 180
http_request_duration_seconds_bucket{le="+Inf"} 200

# Sum of all values
http_request_duration_seconds_sum 123.4

# Count of observations
http_request_duration_seconds_count 200

Typical usage:

# P95 latency over the last 5 minutes (aggregating multiple instances)
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)

# P99 latency
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
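
To see why these quantiles are estimates, the following Python sketch interpolates linearly inside the bucket that contains the target rank, which is the general approach histogram_quantile() takes for classic histograms. It is a simplified illustration under that assumption; the function name estimate_quantile is hypothetical, and the bucket counts are the ones from the exposition example above.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Assumes buckets are sorted by upper bound and end with +Inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound) or count == lower_count:
                # Open-ended or empty bucket: fall back to the last known bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


BUCKETS = [(0.1, 100), (0.5, 150), (1.0, 180), (math.inf, 200)]
```

With these buckets, estimate_quantile(0.95, BUCKETS) returns 1.0: the 190th observation falls in the +Inf bucket, so the estimate is capped at the highest finite bound. This is why bucket boundaries should cover the latencies you care about.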

Summary

Calculates quantiles on the client (inside the instrumented application) at observation time.

Characteristics:

Quantiles calculated on the client side
Summary quantiles are not correctly aggregatable across instances (unlike histograms)
Quantiles depend on client configuration and may vary
Less flexible, but avoids bucket storage cost

Exposition:

http_request_duration_seconds{quantile="0.5"} 0.12
http_request_duration_seconds{quantile="0.9"} 0.35
http_request_duration_seconds{quantile="0.99"} 0.89
http_request_duration_seconds_sum 123.4
http_request_duration_seconds_count 200

Typical usage:

# Direct value (already calculated)
http_request_duration_seconds{quantile="0.99"}
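
The aggregation limitation can be illustrated with a small Python sketch. The latency values and instance names are made up for the example, and percentile() uses the nearest-rank definition, which is an assumption rather than what any particular client library does.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the value at rank ceil(pct% of n)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in seconds from two instances of the same service.
instance_a = [0.1] * 990  # busy instance, every request fast
instance_b = [2.0] * 10   # nearly idle instance, every request slow

# Averaging the per-instance P99 values, as one might do with summaries.
averaged_p99 = (percentile(instance_a, 99) + percentile(instance_b, 99)) / 2

# P99 over all requests together, which is what the service actually delivered.
true_p99 = percentile(instance_a + instance_b, 99)
```

Here averaged_p99 is 1.05 seconds while true_p99 is 0.1 seconds, because the average ignores how many requests each instance served. Histogram buckets avoid this problem: they are plain counters that can be summed before the quantile is estimated.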

Histogram vs Summary: When to use each?

Criteria	Histogram	Summary
Aggregation	✅ Yes (multiple instances)	❌ No
Quantiles	⚠️ Estimated via buckets	✅ Calculated on client
Flexibility	✅ Flexible for query-time quantile estimation	❌ Predefined on client
Client cost	Low	High (calculation)
Storage cost	Medium (multiple buckets)	Low
Recommendation	Use by default	Specific cases

After understanding metric types, it’s worth knowing a set of signals that has become a reference for service monitoring.

Golden Signals: The Essential Metrics

Golden signals are the four fundamental metrics described in the Google SRE Book. They provide an essential view of service health and are an important starting point for observing distributed systems.

Note: Golden signals are a useful starting point. Mature teams also track business metrics (conversion, revenue) and application-specific metrics (critical journeys, funnels).

Golden Signal	Key Question	What to Measure
Latency	How long?	P50, P95, P99 (response time percentiles)
Traffic	How much demand?	Requests/second, bytes transmitted, active users, simultaneous connections
Errors	Failure rate?	HTTP 4xx, HTTP 5xx, timeouts
Saturation	How full?	CPU %, memory %, disk %, connections/limit, request queue

Percentiles P95 and P99 are quantiles frequently used to measure latency. P95 means 95% of requests had latency equal to or below that value; P99 represents the threshold for 99% of requests.

SLI/SLO Metrics

SLI (Service Level Indicator): Metric measuring an aspect of the service (e.g., P95 latency).

SLO (Service Level Objective): Target for the SLI (e.g., P95 latency < 200 ms, or 99.9% of requests succeeding).

Example:

SLI	SLO	Metric
Availability	99.9% successful requests	`1 - (sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])))`
Latency	P95 < 200ms	`histogram_quantile(0.95, sum by (le) (rate(latency_bucket[5m])))`
Throughput	> 1000 req/s	`sum(rate(requests_total[5m]))`

Note: Simplified examples for educational purposes. In practice, the availability SLI should reflect the service-specific success definition.
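
The availability row can be checked by hand with a short Python sketch. The request counts are hypothetical, and treating an empty window as fully available is an assumption each team should decide for itself.

```python
def availability(total_requests, failed_requests):
    """Fraction of successful requests; an empty window counts as available."""
    if total_requests == 0:
        return 1.0
    return 1 - failed_requests / total_requests


def meets_slo(sli_value, target):
    """True when the measured SLI reaches the SLO target."""
    return sli_value >= target


# Hypothetical counts for one evaluation window.
good_window = availability(total_requests=1_000_000, failed_requests=800)
bad_window = availability(total_requests=1_000_000, failed_requests=1_500)
```

With a 99.9% target, meets_slo(good_window, 0.999) is True (99.92% of requests succeeded) and meets_slo(bad_window, 0.999) is False (99.85%).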

After understanding metric types and essential signals, the next step is to understand how these metrics are represented as time series.

Prometheus Data Model

Time Series

Format: <metric_name>{<label_name>=<label_value>, ...} <value> <timestamp>

Example:

http_requests_total{method="GET", status="200", endpoint="api/users"} 12345 1622745600000

Components:

Component	Description	Example
Metric name	Unique metric name	`http_requests_total`
Labels	Filtering dimensions	`{method="GET", status="200"}`
Value	Numerical value	`12345`
Timestamp	Collection time	`1622745600000` (ms)

Labels increase the analytical power of metrics, allowing filtering and grouping. However, each unique label combination creates a new time series — and this can scale quickly.

Cardinality

Definition: Number of unique time series generated by a metric.

Mathematical formula (upper bound, reached when every label combination actually occurs):

S_total = |C₁| × |C₂| × ... × |Cₙ|

Where:

S_total = Total number of series
Cᵢ = Set of possible values for label i (|Cᵢ| is the number of values in that set)

Practical example:

Metric http_request_duration_seconds with labels:

method: 4 values (GET, POST, PUT, DELETE)
status: 3 values (2xx, 4xx, 5xx)
endpoint: 100 values

S_total = 4 × 3 × 100 = 1,200 series
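
The same arithmetic as a minimal Python sketch. The function name max_series and the endpoint names are placeholders for illustration.

```python
import math


def max_series(label_values):
    """Upper bound on series: product of the number of values per label."""
    return math.prod(len(values) for values in label_values.values())


labels = {
    "method": ["GET", "POST", "PUT", "DELETE"],
    "status": ["2xx", "4xx", "5xx"],
    "endpoint": [f"/route/{i}" for i in range(100)],  # hypothetical names
}

series = max_series(labels)  # 4 * 3 * 100
```

Running max_series on a planned label set before deploying is a cheap way to see how one extra label multiplies the total.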

Cardinality Explosion

Problem: Labels with high cardinality (e.g., user_id, request_id) generate millions of series.

Bad example:

http_requests_total{user_id="12345", request_id="abc-123-def"}
# Result: millions of series → degraded performance

Mitigations:

Avoid unique IDs as labels (user_id, request_id, session_id)
Limit possible values for each label
Prefer aggregatable models and avoid multiplying high-cardinality labels
Monitor active series volume and validate modeling before production
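
One way to limit the possible values of a label is to normalize them before they are attached to a metric. The Python sketch below assumes that numeric and UUID-like path segments are identifiers. The helper names, the :id placeholder, and the route shapes are illustrative, not a convention of any library.

```python
import re

# Assumption: numeric or UUID-like path segments are treated as IDs.
ID_SEGMENT = re.compile(r"^(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F-]{27})$")


def status_class(status_code):
    """Collapse an HTTP status code into its class, e.g. 404 -> '4xx'."""
    return f"{status_code // 100}xx"


def endpoint_template(path):
    """Replace ID-like path segments with a placeholder to bound label values."""
    segments = path.strip("/").split("/")
    return "/" + "/".join(":id" if ID_SEGMENT.match(s) else s for s in segments)
```

For example, endpoint_template("/api/users/12345") yields "/api/users/:id", so every user maps to the same endpoint label value instead of creating a series per user.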

With metric theory, types, and data modeling established, it’s worth seeing how these concepts apply in real scenarios.

Real-World Use Cases with Metrics

Marisa: E-commerce Performance Metrics

Marisa is one of the largest fashion retailers in Brazil, with over 11 million app downloads and 70% of digital sales concentrated on mobile.

Challenge:

Monitor e-commerce performance with millions of requests
Understand latency bottlenecks in real time
Correlate infrastructure metrics with user experience

Metrics implemented:

P95 Latency: Page load time
Throughput: Requests per second during peaks
Saturation: CPU/memory usage at origin vs edge
Error rate: HTTP error rate per endpoint

Verified results:

85% of traffic served by distributed infrastructure
730 TB transferred without origin impact
4.3 TB/day of images processed and optimized
Improvement in First Contentful Paint, Speed Index, and Time to Interactive

Learning: Well-structured metrics allow correlating infrastructure performance with digital experience at scale, transforming operational data into business decisions.

B2W: Security Metrics at Scale

B2W Digital brings together some of the largest e-commerce platforms in Latin America, with 2 billion visits per year and 17 million active customers.

Challenge:

Monitor security across millions of daily connections
Detect attacks in real time
Measure mitigation effectiveness

Metrics implemented:

Block rate: Blocked requests per second
Attack types: DDoS, SQL injection, XSS by category
Mitigation latency: Time between detection and blocking
Error rate per rule: Effectiveness of each Firewall rule

Verified results:

Millions of attacks automatically blocked
Transformation of events into real-time insights
Integration of metrics with SIEM via Data Streaming
Complete environment visibility in dashboards

Learning: Metrics are not just for performance — they are also fundamental for operational security, allowing measurement of defense effectiveness and incident response time.

Collecting metrics is only part of the work. Extracting useful signals depends on knowing how to aggregate data correctly.

Metric Aggregation

Aggregation Types

Type	Function	Use
Sum	`sum()`	Total values
Average	`avg()`	Average across instances
Min/Max	`min()` / `max()`	Extremes
Rate	`rate()`	Rate per second (counters)
Increase	`increase()`	Increment over period
Percentile	`histogram_quantile()`	Percentiles (histograms)

Aggregation by Labels

# Sum of requests by method (aggregates all endpoints)
sum by (method) (rate(http_requests_total[5m]))

# Average latency by service
avg by (service) (latency_seconds)

# Maximum CPU by region
max by (region) (cpu_usage_percent)
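
Conceptually, sum by (method) groups series by the value of one label and adds up each group. A minimal Python sketch of that grouping follows. The sample rates and the function name sum_by are hypothetical.

```python
from collections import defaultdict


def sum_by(samples, label):
    """Group (labels, value) samples by one label and sum each group."""
    totals = defaultdict(float)
    for labels, value in samples:
        # Series lacking the label are grouped under an empty label value.
        totals[labels.get(label, "")] += value
    return dict(totals)


# Hypothetical per-series request rates (requests per second).
rates = [
    ({"method": "GET", "endpoint": "/users"}, 12.0),
    ({"method": "GET", "endpoint": "/orders"}, 8.0),
    ({"method": "POST", "endpoint": "/orders"}, 3.0),
]
```

Here sum_by(rates, "method") returns {"GET": 20.0, "POST": 3.0}: the endpoint dimension is dropped and only method remains.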

Temporal Aggregation

# Average of a metric over the last 5 minutes
avg_over_time(cpu_usage_percent[5m])

# Maximum over 1 hour
max_over_time(memory_bytes[1h])

# Minimum over 1 day
min_over_time(active_connections[1d])
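
The *_over_time functions aggregate the samples of a single series inside a time window. A simplified Python sketch follows. It assumes a window that excludes its left edge and includes its right edge, and uses made-up gauge samples. Exact boundary handling in Prometheus depends on the version.

```python
def window(samples, now, seconds):
    """Values of (timestamp_s, value) samples in the interval (now - seconds, now]."""
    return [value for ts, value in samples if now - seconds < ts <= now]


def avg_over_window(samples, now, seconds):
    """Arithmetic mean of the samples inside the window."""
    values = window(samples, now, seconds)
    if not values:
        raise ValueError("no samples in window")
    return sum(values) / len(values)


# Hypothetical gauge samples scraped every 60 s.
cpu = [(0, 10.0), (60, 20.0), (120, 30.0), (180, 40.0), (240, 50.0), (300, 60.0)]
```

With now=300 and a 180-second window, the samples at 180, 240, and 300 are selected, so the average is 50.0, the maximum 60.0, and the minimum 40.0.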

With the concepts of metrics, types, modeling, and aggregation presented, the following frequently asked questions help consolidate learning.

Frequently Asked Questions

What are metrics?

Metrics are numerical values observed and collected over time that represent the state, behavior, or performance of systems. They are organized as time series with names, labels, and timestamps. They answer questions like “what is the current error rate?” and “is latency within SLO?”. They differ from logs (individual events) and traces (request journeys across services).

What are the 4 metric types?

The four types in the Prometheus model are: Counter (values that only increase, e.g., total requests), Gauge (values that go up and down, e.g., CPU %), Histogram (distribution into cumulative buckets, allows quantile estimation at query time), and Summary (quantiles calculated on the client). Use histogram by default for distributed systems.

What is the difference between counter and gauge?

Counter is a value that only increases, resetting to zero on process restart, used to calculate rates (e.g., rate()). Gauge is a value that can go up or down, representing the current state (e.g., CPU, memory). Use counter for accumulated totals, gauge for instantaneous values.

What are golden signals?

Golden signals are the four essential metrics described by Google SRE: Latency (response time), Traffic (demand), Errors (failure rate), and Saturation (resource usage). They provide an essential view of service health and are the foundation for SLIs/SLOs.

What is cardinality in metrics?

Cardinality is the number of unique time series generated by a metric. Its upper bound is the product of the number of possible values of each label. “Cardinality explosion” occurs when labels with many values (user_id, request_id) generate millions of series, degrading performance.

Histogram or Summary: when to use each?

Use Histogram by default (aggregates multiple instances and allows quantile estimation at query time from defined buckets). Use Summary only when you need quantiles calculated on the client and don’t need to aggregate them across instances. Histogram is more flexible and suitable for distributed systems.

How to avoid cardinality explosion?

Avoid labels with unique values (user_id, request_id), limit the possible values of each label, prefer aggregatable models, reduce or aggregate high-cardinality dimensions before exposition when possible, and monitor the number of active series.

Conclusion

Metrics are the foundation of observability. They transform system behavior into numerical data that can be queried, alerted on, and correlated. The four types in the Prometheus model — Counter, Gauge, Histogram, and Summary — cover most monitoring scenarios.

Key concepts:

Metrics = Numerical values collected over time to represent state, behavior, or performance
4 types (Prometheus): Counter (only up), Gauge (up/down), Histogram (cumulative buckets), Summary (client-side quantiles)
Golden signals: Latency, Traffic, Errors, Saturation
Cardinality: Beware of high-cardinality labels
Prometheus: Graduated CNCF project, widely adopted

Next steps:

For beginners:

Understand the 4 metric types
Implement golden signals in your application
Use Prometheus for exposition

For operations teams:

Configure SLOs based on SLIs
Monitor your metrics cardinality
Integrate with Real-Time Metrics for dashboards

For mature companies:

Optimize PromQL queries
Implement SLO-based alerts
Use histograms for latency SLIs

