TL;DR
- “Round-robin” is the default, not the decision.
- The wrong algorithm creates slowdowns that look like app bugs.
- Most outages are “gray failures” (partial degradation), not total downtime.
- AI helps by predicting saturation, detecting anomalies, and steering traffic before users notice.
- Edge-based load balancing (e.g., Azion Load Balancer + Functions) enables low-latency, policy-driven routing close to users.
Why Round‑Robin Load Balancing Causes Incidents in Real Systems
You’ve seen this incident report before.
All the dashboards are green: error rate looks normal, CPU is fine, and no nodes are marked “down.” And yet users keep saying the app feels broken.
The on-call starts digging. Is it a bad deploy? A database regression? A new query plan?
Then the real culprit appears: one Availability Zone is slow, not dead. A subset of instances is hitting longer GC pauses. One node is getting crushed by a noisy neighbor. A cache shard is cold. A connection pool is quietly maxing out. Nothing triggers the big red alarms—until retries start piling up, and your “healthy” cluster begins to drown.
This is where the myth shows up: round-robin equals load balancing.
Round-robin assumes a world where every backend behaves the same, latency is stable, and “healthy” also means “fast enough.” Real systems aren’t like that. Instances are heterogeneous, workloads are bursty, tail latency dominates what users feel, and partial degradation is normal. In these conditions, naive balancing doesn’t just fall short—it can amplify the problem by continuing to send traffic to the slowest part of the fleet.
So the goal isn’t “spread requests evenly.” The goal is to:
- Minimize tail latency (p95/p99), not just average response time
- Prevent overload spirals caused by retries, queueing, and backpressure collapse
- Reduce blast radius, so a single degraded zone, host, or dependency doesn’t poison the whole system
That’s the lens for the rest of this post: how to move from “default balancing” to decision-driven traffic steering—and how AI makes that shift practical in production.
What Modern Load Balancers Do: Health, Retries, Failover, and Traffic Steering
If you still think a load balancer’s job is “pick a backend,” you’re living in the 2014 mental model. In 2026, the load balancer is effectively a traffic control plane: it shapes user experience, defines failure behavior, and decides whether a localized problem becomes a global incident.
Beyond distribution (a lot beyond)
Health checks (and why they lie)
Health checks answer a binary question (“up/down”) in a world that fails in gradients.
- A service can return 200 OK and still be unusable under real load (queueing, lock contention, exhausted pools).
- A shallow /healthz endpoint often measures “process is alive,” not “system is healthy.”
- Even deeper checks can be misleading: they run out-of-band, hit a different code path, bypass caches, or don’t reflect end-user geography.
What you actually care about is: is this backend currently a good choice for the next request? That requires signals like latency, saturation, error rates by class, and sometimes dependency health—not just a boolean.
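To make that concrete, here’s a minimal sketch of scoring a backend on live signals instead of a boolean flag. The field names, weights, and thresholds are illustrative assumptions, not any particular load balancer’s API:

```typescript
// Sketch: score a backend as "good choice for the next request" from live signals.
// All field names and thresholds are illustrative assumptions, not a real API.

interface BackendSignals {
  p99LatencyMs: number;   // recent tail latency
  errorRate: number;      // 0..1, errors per request (5xx + timeouts)
  inFlight: number;       // requests currently being processed
  capacityHint: number;   // rough concurrency the backend handles comfortably
}

// Lower score = better choice. Infinity = do not route here.
function routabilityScore(s: BackendSignals): number {
  if (s.errorRate > 0.5) return Infinity;            // effectively failing
  const saturation = s.inFlight / s.capacityHint;     // ~1.0 means at capacity
  // Combine tail latency, saturation, and errors; the weights are arbitrary here.
  return s.p99LatencyMs * (1 + saturation) * (1 + 10 * s.errorRate);
}

// Pick the backend with the lowest score for the next request.
function pickBackend(backends: Map<string, BackendSignals>): string | null {
  let best: string | null = null;
  let bestScore = Infinity;
  for (const [name, signals] of backends) {
    const score = routabilityScore(signals);
    if (score < bestScore) {
      best = name;
      bestScore = score;
    }
  }
  return best;
}
```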
Connection management (keep-alive, HTTP/2, HTTP/3)
Load balancers are connection brokers. They decide how efficiently the client’s network becomes the origin’s workload.
- Keep-alive reduces handshake overhead, but can also “stick” clients to a degraded node longer than you expect.
- HTTP/2 multiplexing changes everything: one client connection can carry many concurrent streams. If you route that connection poorly, you just coupled a user’s entire session performance to one backend.
- HTTP/3 (QUIC) reduces handshake latency and improves loss recovery, but introduces different congestion dynamics and makes edge termination more valuable.
This is why distribution algorithms alone aren’t enough—connection-level behavior can dominate tail latency.
Retries and timeouts (the “retry storm” trap)
Retries are supposed to improve resilience. In practice, unmanaged retries are how “a little slow” becomes “everyone is down.”
The classic failure mode:
- One zone gets slower (not dead).
- Requests start timing out at the client / gateway.
- Retries kick in.
- Load increases on the already-slow zone and on shared dependencies (DB, cache, auth).
- Queues grow, latency spikes, more timeouts happen, more retries happen…
- Congratulations: you built a self-inflicted DDoS.
A modern load balancer has to treat retries as a controlled budget:
- cap retry attempts per request class,
- use adaptive timeouts,
- prefer hedging or fail-fast depending on idempotency and cost,
- and, most importantly, stop sending traffic into the fire when saturation is detected.
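A minimal sketch of the retry-budget idea: retries are allowed only while they stay below a fixed fraction of recent primary traffic, so a slow zone can’t recruit the whole fleet into a storm. The class name, ratio, and window are illustrative assumptions:

```typescript
// Sketch: a retry budget — retries may only consume a fixed fraction of the
// requests seen in the current window. Names and numbers are illustrative.

class RetryBudget {
  private windowStart = Date.now();
  private requests = 0;
  private retries = 0;

  constructor(private maxRetryRatio = 0.1, private windowMs = 10_000) {}

  // Reset counters when the window rolls over (a sliding window would be smoother).
  private roll(): void {
    const now = Date.now();
    if (now - this.windowStart >= this.windowMs) {
      this.windowStart = now;
      this.requests = 0;
      this.retries = 0;
    }
  }

  recordRequest(): void {
    this.roll();
    this.requests += 1;
  }

  // Returns true only if a retry fits inside the budget.
  tryConsumeRetry(): boolean {
    this.roll();
    if (this.retries < this.maxRetryRatio * Math.max(this.requests, 1)) {
      this.retries += 1;
      return true;
    }
    return false; // fail fast instead of feeding the retry storm
  }
}
```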
Slow-start / warmup
Even “healthy” backends shouldn’t necessarily receive full traffic immediately.
- Fresh instances have cold instruction caches, cold data caches, empty connection pools, and JIT/GC behavior that stabilizes only after warmup.
- Auto-scaling events are especially dangerous: you add capacity and immediately overload it, making it look “bad,” then oscillate.
Slow-start is the load balancer admitting a truth: capacity isn’t binary. A backend transitions from “alive” → “ready” → “fully effective,” and routing should respect that.
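A minimal sketch of a slow-start ramp, assuming a linear ramp from 10% to full weight over a warmup window (both numbers are illustrative):

```typescript
// Sketch: slow-start — ramp a new or recovered backend's effective weight from a
// small fraction up to its full weight over a warmup window.

function effectiveWeight(
  fullWeight: number,
  readySinceMs: number,      // timestamp when the backend became "ready"
  nowMs: number,
  warmupMs = 60_000,         // length of the ramp
  initialFraction = 0.1      // start at 10% of full weight
): number {
  const elapsed = nowMs - readySinceMs;
  if (elapsed >= warmupMs) return fullWeight;
  const progress = Math.max(elapsed, 0) / warmupMs;                 // 0..1
  const fraction = initialFraction + (1 - initialFraction) * progress;
  return fullWeight * fraction;
}

// Example: a backend with weight 100 that became ready 15 seconds ago gets 32.5.
console.log(effectiveWeight(100, 0, 15_000));
```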
Traffic shaping and rate limiting
Rate limiting is no longer just an API gateway feature; it’s a survival mechanism.
Modern load balancing often includes:
- per-customer quotas,
- burst control,
- fairness across tenants,
- priority lanes (interactive vs batch),
- and protective shedding (returning fast failures instead of slow timeouts).
Why this matters for business: shaping is how you protect your highest-value flows during incidents, instead of letting everything degrade equally.
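As a sketch of what protective shedding can look like in code, here is a token bucket per priority lane; the rates, burst sizes, and lane names are chosen purely for illustration:

```typescript
// Sketch: per-lane token buckets with protective shedding — batch traffic gets a
// fast 429 instead of queueing behind interactive traffic. Numbers are illustrative.

class TokenBucket {
  private tokens: number;
  private last = Date.now();

  constructor(private ratePerSec: number, private burst: number) {
    this.tokens = burst;
  }

  tryTake(): boolean {
    const now = Date.now();
    // Refill proportionally to elapsed time, capped at the burst size.
    this.tokens = Math.min(
      this.burst,
      this.tokens + ((now - this.last) / 1000) * this.ratePerSec
    );
    this.last = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

const lanes = {
  interactive: new TokenBucket(500, 100), // protect the highest-value flows
  batch: new TokenBucket(100, 20),        // shed first under pressure
};

function admit(lane: keyof typeof lanes): { admitted: boolean; status?: number } {
  return lanes[lane].tryTake()
    ? { admitted: true }
    : { admitted: false, status: 429 }; // a fast failure beats a slow timeout
}
```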
Failover across regions/clouds
“Multi-region” used to mean DNS failover and praying TTLs cooperate. In 2026, that’s not good enough.
Real failover means:
- steering traffic in seconds, not minutes,
- avoiding flapping (constant toggling),
- managing state constraints (sessions, writes, consistency),
- and routing around partial failures (a region that’s reachable but slow).
At this point the load balancer is part of your availability strategy, not an implementation detail.
Where Azion’s infrastructure changes the game
Traditional load balancers sit “near the origin,” which is convenient for the origin—but late for the user. Edge changes the geometry of failures and the speed of decisions.
Decisions closer to users reduce RTT and speed up failover
When routing decisions happen at the edge:
- you can detect user-perceived degradation faster (because you see it where it occurs),
- you can fail over without sending the user on a long round-trip to a struggling region,
- and you can apply policy based on geography, network conditions, and real-time performance.
That translates directly into business metrics: faster failover reduces abandonment, and lower latency boosts conversion.
Azion stack: load balancing + acceleration + custom logic
This is where combining components matters:
- Azion Load Balancer handles routing and backend selection with the right primitives (health, weights, failover).
- Application Acceleration improves protocol/TLS performance—optimizing handshakes, transport behavior, and end-to-end latency characteristics that classic L7 balancing often ignores.
- Functions let you implement custom routing logic at the edge: per-path rules, tenant-aware steering, gradual rollouts, circuit breakers, dynamic weights, or “send this cohort to that region” policies—without waiting for origin-side changes.
The key shift is architectural: you’re no longer limited to whatever a centralized load balancer supports. You can turn routing into software—executed close to users—so you can adapt faster than incidents evolve.
The 9 Load Balancing Algorithms (and When Each One Wins)
The uncomfortable truth: most teams pick an algorithm once (often the default), then spend years debugging the downstream consequences as if they were “application issues.” The right approach is to treat balancing as an adaptive control problem: choose the policy that matches your workload, and validate it with the signals that predict pain early (tail latency, saturation, retries, queue depth).
Below: for each algorithm, Best for, Breaks when, and Signals to watch.
1) Round Robin
Best for
- Homogeneous backends (same instance type, same perf)
- Stateless services with stable latency
- Low to moderate load where tail latency isn’t dominated by queueing
Breaks when
- Instances aren’t equal (different CPU, different noisy-neighbor conditions)
- Partial degradation (“gray failure”) hits a subset of nodes/AZs
- Any backend saturates: RR keeps feeding it traffic even when it’s already behind
Signals to watch
- Backend-level p95/p99 divergence (one node drifting “hot”)
- Rising retries/timeouts with stable average latency
- Queue depth / in-flight requests per instance growing unevenly
2) Weighted Round Robin
Best for
- Mixed instance sizes (e.g., 2x large + 6x medium) where you want proportional load
- Gradual migrations (send 10% to new stack, 90% to old)
- Capacity you understand and that stays relatively stable
Breaks when
- “Weights” are static but capacity isn’t (CPU steal, GC, downstream dependency slowdown)
- Load isn’t proportional to request count (some requests are heavier)
- Connection-level effects dominate (HTTP/2 multiplexing can make “request count” misleading)
Signals to watch
- Per-instance utilization vs assigned weight (are heavy nodes still overloaded?)
- Per-route latency distribution (some endpoints overweighted in cost)
- Error rate by backend despite “correct” weighting
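For reference, a sketch of “smooth” weighted round-robin (the interleaving scheme used by nginx); with equal weights it reduces to plain round-robin. The weights here are static, so every caveat above still applies:

```typescript
// Sketch: smooth weighted round-robin. Each round, every peer accumulates its
// weight as credit; the peer with the most credit is picked and pays back the
// total weight, which interleaves picks instead of bursting them.

interface Peer {
  name: string;
  weight: number;
  current: number; // accumulated credit
}

function pickSmoothWrr(peers: Peer[]): Peer {
  const totalWeight = peers.reduce((sum, p) => sum + p.weight, 0);
  let best = peers[0];
  for (const p of peers) {
    p.current += p.weight;
    if (p.current > best.current) best = p;
  }
  best.current -= totalWeight;
  return best;
}

// Example: weights 5/1/1 produce the interleaved sequence a,a,b,a,c,a,a.
const peers: Peer[] = [
  { name: "a", weight: 5, current: 0 },
  { name: "b", weight: 1, current: 0 },
  { name: "c", weight: 1, current: 0 },
];
console.log(Array.from({ length: 7 }, () => pickSmoothWrr(peers).name).join(","));
```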
3) Random
Best for
- Huge fleets where “good enough” distribution is fine
- Very simple systems where you want low coordination/low overhead
- As a baseline when RR causes periodic patterns (rare, but can happen)
Breaks when
- Small number of backends (variance is high; hotspots happen)
- Any meaningful heterogeneity or partial failure exists
- You need predictability (debugging can be harder)
Signals to watch
- High variance in per-instance RPS and latency
- Occasional hotspots not explained by traffic patterns
- Tail latency spikes correlated with unlucky concentration
4) Least Connections
Best for
- Workloads where concurrent connections correlate with load (classic HTTP/1.1 patterns)
- Backends that can handle similar per-connection cost
- Avoiding overload on a subset of instances when request durations vary
Breaks when
- HTTP/2/HTTP/3 multiplexing: fewer connections doesn’t mean less work
- “Connections” stay open a long time but do little work (idle keep-alive skew)
- Different instances have different capacity (least connections picks the weakest too)
Signals to watch
- Requests-per-connection and streams-per-connection (protocol-dependent)
- In-flight requests vs connection count mismatch
- Backend CPU/queue depth rising while “connections” look low
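A minimal least-connections sketch that tracks in-flight requests directly; the weighted variant in the next section is the same idea with the load divided by each backend’s weight. The bookkeeping is deliberately simplified:

```typescript
// Sketch: least-connections using live in-flight counters.

const inFlight = new Map<string, number>([
  ["backend-a", 0],
  ["backend-b", 0],
]);

function pickLeastInFlight(): string {
  let best = "";
  let bestLoad = Infinity;
  for (const [name, load] of inFlight) {
    if (load < bestLoad) {
      best = name;
      bestLoad = load;
    }
  }
  return best;
}

// Wrap the proxying call so the counter is always decremented, even on errors.
async function proxy<T>(handler: (backend: string) => Promise<T>): Promise<T> {
  const backend = pickLeastInFlight();
  inFlight.set(backend, (inFlight.get(backend) ?? 0) + 1);
  try {
    return await handler(backend); // forward the request
  } finally {
    inFlight.set(backend, (inFlight.get(backend) ?? 1) - 1);
  }
}
```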
5) Weighted Least Connections
Best for
- Heterogeneous fleets where you still want “least load” behavior
- Mixed instance sizes or mixed node performance (some are simply stronger)
- A good default for many real-world services when you can’t do something smarter yet
Breaks when
- Weights aren’t updated to reflect real-time capacity changes
- Load correlates poorly with connection count (again: multiplexing, async workloads)
- Heavy-tail request cost (a few expensive requests dominate)
Signals to watch
- Normalized load (connections / weight) vs actual saturation signals (CPU, queue)
- Tail latency differences across instance classes
- Overload indicators: growing in-flight, increased upstream queueing, 503s
6) Least Response Time / Fastest
Best for
- Latency-sensitive user traffic
- Environments with variable network paths (multi-region, edge-to-origin variability)
- Detecting “gray failures” early (slow is often the first symptom)
Breaks when
- Feedback loops: sending more traffic to the “fastest” can make it the next bottleneck
- Measurement noise: short windows can chase jitter and create flapping
- Fast-but-failing scenarios: quick errors look “fast” unless error-aware
Signals to watch
- Backend latency trend + variance (not just point estimates)
- Saturation indicators (queue depth, in-flight requests) alongside latency
- Error rate by class (timeouts vs 5xx vs 4xx) so “fast failures” aren’t rewarded
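A sketch of error-aware “fastest” selection: an EWMA of latency penalized by recent error rate, so fast failures aren’t rewarded. The smoothing factor and penalty multiplier are illustrative assumptions:

```typescript
// Sketch: pick the "fastest" backend using smoothed latency plus an error penalty.

interface Ewma {
  latencyMs: number;
  errorRate: number;
}

const stats = new Map<string, Ewma>();
const ALPHA = 0.2; // smoothing factor; longer effective windows reduce flapping

// Feed one observation (latency + success/failure) into the per-backend EWMA.
function observe(backend: string, latencyMs: number, failed: boolean): void {
  const prev = stats.get(backend) ?? { latencyMs, errorRate: 0 };
  stats.set(backend, {
    latencyMs: ALPHA * latencyMs + (1 - ALPHA) * prev.latencyMs,
    errorRate: ALPHA * (failed ? 1 : 0) + (1 - ALPHA) * prev.errorRate,
  });
}

function pickFastestHealthy(): string | null {
  let best: string | null = null;
  let bestCost = Infinity;
  for (const [backend, s] of stats) {
    // Penalize errors heavily so a fast-but-failing node does not win.
    const cost = s.latencyMs * (1 + 20 * s.errorRate);
    if (cost < bestCost) {
      best = backend;
      bestCost = cost;
    }
  }
  return best;
}
```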
7) Consistent Hashing (session affinity, cache locality)
Best for
- Cache-heavy workloads (maximize hit ratio by keeping keys “sticky”)
- Session affinity when you can’t fully externalize state (legacy reality)
- Stateful sharding patterns (routing user/account to a shard)
Breaks when
- A node degrades: stickiness keeps users pinned to a bad backend
- Uneven key distribution or “hot keys” (one shard melts)
- Scaling events: rebalancing can cause cache churn (even with consistent hashing, there’s still movement)
Signals to watch
- Cache hit ratio and eviction rate
- Per-key or per-tenant hotspots
- “Sticky” cohort latency (subset of users consistently slow)
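A minimal consistent-hash ring with virtual nodes. The hash function and virtual-node count are illustrative; real deployments often prefer stronger hashes or bounded-load variants to tame hot keys:

```typescript
// Sketch: consistent hashing with virtual nodes, using FNV-1a purely for illustration.

function fnv1a(str: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

class HashRing {
  private ring: { point: number; node: string }[] = [];

  constructor(nodes: string[], private vnodes = 100) {
    // Each node gets many points on the ring to smooth the key distribution.
    for (const node of nodes) {
      for (let i = 0; i < vnodes; i++) {
        this.ring.push({ point: fnv1a(`${node}#${i}`), node });
      }
    }
    this.ring.sort((a, b) => a.point - b.point);
  }

  // First ring point clockwise from the key's hash; wraps around at the end.
  lookup(key: string): string {
    const h = fnv1a(key);
    for (const entry of this.ring) {
      if (entry.point >= h) return entry.node;
    }
    return this.ring[0].node;
  }
}

// Example: the same cache key always maps to the same origin.
const ring = new HashRing(["origin-1", "origin-2", "origin-3"]);
console.log(ring.lookup("user:42"), ring.lookup("user:42")); // identical
```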
8) Power of Two Choices (P2C)
Pick two backends at random, send to the better one (often “lower load” or “lower latency”).
Best for
- Spiky traffic where you need fast, robust decisions without global coordination
- Large fleets where full least-connections is expensive or noisy
- Systems where you want surprisingly strong performance with simple logic
Breaks when
- Your “better” metric is wrong (e.g., connections not load, or latency not corrected for errors)
- Small fleets (random choice gives less benefit)
- Strong heterogeneity without weighting (you might underuse big nodes)
Signals to watch
- Tail latency under burst (p95/p99 during spikes)
- Load distribution (standard deviation of in-flight/CPU)
- Flap rate (are choices oscillating?) if your metric window is too short
Common pairing: P2C + slow-start to avoid stampeding new instances.
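A sketch of P2C using in-flight requests normalized by an effective (slow-start) weight as the “better” metric; the structure and names are illustrative:

```typescript
// Sketch: power-of-two-choices — draw two distinct random candidates and send the
// request to the one with lower normalized load.

interface Origin {
  name: string;
  inFlight: number;
  effectiveWeight: number; // output of a slow-start ramp; full weight when warm
}

function pickP2C(origins: Origin[]): Origin {
  if (origins.length === 1) return origins[0];
  // Two distinct random indices.
  const i = Math.floor(Math.random() * origins.length);
  let j = Math.floor(Math.random() * (origins.length - 1));
  if (j >= i) j += 1;
  const a = origins[i];
  const b = origins[j];
  // Normalize by weight so a warming-up or smaller node is not over-picked.
  const loadA = a.inFlight / Math.max(a.effectiveWeight, 1e-6);
  const loadB = b.inFlight / Math.max(b.effectiveWeight, 1e-6);
  return loadA <= loadB ? a : b;
}
```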
9) Latency + Error Budget Aware Routing (SLO-driven, dynamic)
This is where “AI-assisted” or adaptive routing tends to land: routing decisions incorporate tail latency, errors, and saturation relative to an SLO—not just raw speed.
Best for
- Mission-critical APIs where p99 and availability are business KPIs
- Multi-region or multi-cloud where conditions change quickly
- Gray failures: you want to route away from risk before it becomes downtime
Breaks when
- Your telemetry is delayed, aggregated too much, or missing backend granularity
- Your policy is too reactive (causes flapping) or too conservative (routes too late)
- Your SLOs aren’t defined per endpoint (one global SLO hides the real problem)
Signals to watch
- Error budget burn rate (per endpoint, per region)
- p95/p99 latency vs SLO thresholds (not just averages)
- Saturation predictors: queue depth, in-flight, thread pool exhaustion, upstream connect time
- Retry rate and timeout rate (often the earliest “meltdown” signal)
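A sketch of the error-budget math behind this policy, assuming a 99.9% availability SLO and an illustrative burn-rate threshold:

```typescript
// Sketch: error-budget burn rate per endpoint/region, and a simple steering trigger.

interface WindowStats {
  total: number; // requests in the window
  bad: number;   // SLO-violating requests (errors, over-threshold latency)
}

// With a 99.9% SLO, the budget is 0.1% of requests. Burn rate 1.0 means the
// budget is being consumed exactly as fast as it is granted.
function burnRate(stats: WindowStats, sloTarget = 0.999): number {
  if (stats.total === 0) return 0;
  const allowedBadFraction = 1 - sloTarget;      // 0.001
  const actualBadFraction = stats.bad / stats.total;
  return actualBadFraction / allowedBadFraction;
}

// Example policy: a sustained burn rate above ~10x in one region is a strong
// signal to shift weight away before the budget is exhausted.
function shouldSteerAway(stats: WindowStats): boolean {
  return burnRate(stats) > 10;
}

console.log(burnRate({ total: 100_000, bad: 500 })); // 0.5% bad → burn rate 5
```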
Mini-table: “If you have X symptom, try Y algorithm first.”
| Symptom you’re seeing | Try this first | Why it tends to work |
| --- | --- | --- |
| Cache-heavy endpoints, low hit ratio, expensive recompute | Consistent Hashing | Preserves locality; fewer cache misses |
| Mixed instance sizes or known capacity differences | Weighted Least Connections | Accounts for heterogeneity + current load |
| Spiky traffic + occasional hotspotting | P2C + slow-start | Robust under burst; avoids stampedes |
| One AZ/region becomes “slow” but not down | Latency + Error Budget Aware Routing (or at least Least Response Time + error-aware) | Detects gray failure and steers away early |
| RR looks fine but p99 is awful under load | Least Connections (HTTP/1.1) or P2C (in-flight/latency metric) | Reduces queueing-driven tail latency |
| Scaling events cause oscillation and cold-start pain | Weighted RR + slow-start or P2C + warmup | Gradual ramp prevents overload loops |
| Need session affinity (legacy constraints) but want resilience | Consistent Hashing + circuit breaker fallback | Keeps stickiness while allowing escape on failure |
| Fast errors are being preferred (“it’s fast because it fails fast”) | Error-aware fastest / SLO routing | Prevents routing to failing nodes |
Where AI Fits: From Reactive Balancing to Predictive Steering
Most load balancing is reactive: latency rises, errors spike, then you fail over. AI is useful when it helps you do the opposite: steer away before users notice.
AI Use Case #1 — Predictive overload detection
The goal: detect “this is going to saturate” early enough to act gently instead of violently.
Inputs (signals that move early)
- Traffic and concurrency: request rate (RPS), active requests, queue depth
- Tail latency: p95/p99 (overall and per origin)
- Saturation: CPU, memory, GC time, thread pool usage
- Resource limits: DB/redis connection pool saturation, upstream connect time
Model types
- Forecasting (time series): predict near-future saturation (e.g., 5–15 minutes) given patterns, seasonality, and current slope.
- Anomaly detection: detect “this metric combination is unusual” even if no hard threshold is crossed yet.
Outcome
- Preemptive weight adjustment (gradually reduce traffic to the risky origin)
- Traffic shifting (move specific routes/regions to safer capacity)
- Optional: pre-warm capacity (scale up earlier, warm caches gradually)
The operational win is subtle but huge: you avoid the step-function behaviors that create incident pages (sudden failover, sudden retry storms, sudden overload).
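As a sketch of the forecasting step, here is a simple least-squares trend over recent saturation samples. A production system would use proper time-series models with seasonality, but the decision shape is the same:

```typescript
// Sketch: project near-future saturation with a linear least-squares trend and
// act before the metric crosses a danger threshold. Thresholds are illustrative.

interface Sample {
  tSec: number;       // sample time, seconds
  saturation: number; // e.g. inFlight / capacity, 0..1+
}

function forecastSaturation(history: Sample[], horizonSec: number): number {
  const n = history.length;
  const meanT = history.reduce((s, p) => s + p.tSec, 0) / n;
  const meanY = history.reduce((s, p) => s + p.saturation, 0) / n;
  let num = 0;
  let den = 0;
  for (const p of history) {
    num += (p.tSec - meanT) * (p.saturation - meanY);
    den += (p.tSec - meanT) ** 2;
  }
  const slope = den === 0 ? 0 : num / den;
  const lastT = history[n - 1].tSec;
  return meanY + slope * (lastT + horizonSec - meanT);
}

// If saturation is projected to cross ~0.8 within 10 minutes, start reducing the
// origin's weight gradually instead of waiting for a hard failover.
const history: Sample[] = [
  { tSec: 0, saturation: 0.45 },
  { tSec: 60, saturation: 0.5 },
  { tSec: 120, saturation: 0.56 },
  { tSec: 180, saturation: 0.61 },
];
const projected = forecastSaturation(history, 600);
console.log(projected > 0.8 ? "start preemptive weight reduction" : "keep watching");
```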
AI Use Case #2 — Anomaly detection for “this region is weird”
Some incidents aren’t your bug—they’re the internet being the internet (or bots being bots).
AI is useful at identifying patterns like:
- Regional ISP issues (latency increases only for certain ASNs/countries)
- DDoS-like bursts and bot storms (traffic shape changes, cache bypass attempts)
- Upstream regressions (one dependency causes correlated timeouts)
- Routing path degradation (connect time rises while app time stays flat)
Trigger actions
- Shift traffic away from affected region/provider
- Tighten rate limits or bot controls temporarily
- Enable micro-caching for specific endpoints to absorb bursts
- Escalate to human review with a “why” summary (which cohort, which metric, which delta)
AI Use Case #3 — Reinforcement-style routing policies (carefully)
This is the “most powerful” and the easiest to misuse.
Reward function (example)
- Minimize: p99 latency + error rate
- Constrain: cost (egress, cross-region traffic), cache hit ratio impact, failover churn
Safety rails (non-negotiable in prod)
- Never shift more than N% per minute
- Always keep a baseline route (known-good fallback)
- Require confidence thresholds + cool-down windows to prevent flapping
- Prefer “suggest then apply” mode until it’s proven
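A sketch of what those rails can look like as code, with illustrative thresholds for confidence, shift rate, and cooldown:

```typescript
// Sketch: guardrails that gate an AI routing recommendation before it is applied.
// The thresholds (confidence, max shift per minute, cooldown) are illustrative.

interface Recommendation {
  origin: string;
  proposedWeight: number; // 0..100
  confidence: number;     // 0..1, reported by the model
}

interface GuardrailState {
  currentWeight: number;
  lastChangeMs: number;
}

const MAX_SHIFT_PER_MIN = 10;  // never move more than 10 weight points per step
const MIN_CONFIDENCE = 0.8;
const COOLDOWN_MS = 120_000;   // anti-flap window between changes

function applyWithGuardrails(
  rec: Recommendation,
  state: GuardrailState,
  nowMs: number
): number {
  if (rec.confidence < MIN_CONFIDENCE) return state.currentWeight;          // ignore weak signals
  if (nowMs - state.lastChangeMs < COOLDOWN_MS) return state.currentWeight; // still cooling down
  const delta = rec.proposedWeight - state.currentWeight;
  const clamped = Math.max(-MAX_SHIFT_PER_MIN, Math.min(MAX_SHIFT_PER_MIN, delta));
  state.currentWeight += clamped;
  state.lastChangeMs = nowMs;
  return state.currentWeight; // the model proposes, the policy decides how far to move
}
```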
Practical note: AI should propose, policy should enforce
In most orgs, the best adoption curve is:
- Human-in-the-loop first (recommendations + what would happen)
- Then autopilot in a sandbox (limited cohorts)
- Then guardrailed automation in production
AI is not a replacement for SRE discipline. It’s a force multiplier when your policies are already sane.
A Practical Architecture: AI-Assisted Load Balancing on Azion’s Infrastructure
Reference architecture
A pragmatic way to build this without rewriting your entire stack:
- Azion Application in front (entry point close to users)
- Azion Load Balancer distributing across multi-cloud / on-prem origins
- Azion Functions implementing routing intelligence:
- dynamic weighting (adjust per origin in near real-time)
- geo-based steering (country/region-aware)
- conditional routing (device, ASN, country, path, tenant)
- canary rules and “fail-open” logic (keep critical paths alive)
- Telemetry pipeline:
- logs/metrics/events → Azion Data Stream → your SIEM/warehouse/observability stack
- Optional protections:
- WAF / Firewall to dampen abuse spikes and bot storms
- Conditional micro-caching to absorb bursts or dependency slowness
This design keeps the “smart decisions” near the user while still integrating with your existing origins.
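To make the “routing as software” point concrete, here is a sketch of the decision logic such an edge function could execute per request. This is not the Azion Functions API; the types and names are hypothetical, and only the shape of tenant-aware steering plus dynamic weights is shown:

```typescript
// Sketch of per-request routing logic: tenant-aware steering first, then weighted
// random selection over dynamic weights maintained by the control loop.
// All names here are hypothetical.

interface RoutingPolicy {
  weights: Record<string, number>;          // origin -> weight, updated by the loop
  tenantOverrides: Record<string, string>;  // tenant -> pinned origin (e.g. canary cohort)
}

function chooseOrigin(policy: RoutingPolicy, tenantId: string | null): string {
  // 1) Explicit cohort steering wins (canaries, migrations, "send tenant X to region Y").
  if (tenantId && policy.tenantOverrides[tenantId]) {
    return policy.tenantOverrides[tenantId];
  }
  // 2) Otherwise, weighted random selection over the current dynamic weights.
  const entries = Object.entries(policy.weights);
  const total = entries.reduce((sum, [, w]) => sum + w, 0);
  let r = Math.random() * total;
  for (const [origin, w] of entries) {
    r -= w;
    if (r <= 0) return origin;
  }
  return entries[entries.length - 1][0]; // numeric edge-case fallback
}

// Example policy, as it might look after the loop lowered a degraded origin's weight.
const policy: RoutingPolicy = {
  weights: { "origin-us-east": 20, "origin-us-west": 80 },
  tenantOverrides: { "tenant-canary": "origin-us-west" },
};
console.log(chooseOrigin(policy, null));
```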
Data loop (closed loop)
- Collect: latency + error + saturation per origin/region/cohort
- Decide: AI generates routing recommendations (with confidence + explanation)
- Enforce: edge policy updates weights/routes via controlled rollout
- Verify: measure SLO impact; rollback automatically on regression
The important part is that it’s a loop with verification, not a one-way “AI said so.”
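A sketch of the verify-and-rollback half of that loop; the hook names, bake window, and regression threshold are hypothetical, and the wiring to real telemetry is left out:

```typescript
// Sketch: apply a recommended weight change, let it bake, and roll back
// automatically if the SLO regresses.

interface LoopHooks {
  applyWeights(weights: Record<string, number>): Promise<void>;
  readP99Ms(): Promise<number>;   // current p99 for the affected route
  sleep(ms: number): Promise<void>;
}

async function applyAndVerify(
  proposed: Record<string, number>,
  previous: Record<string, number>,
  hooks: LoopHooks,
  bakeMs = 300_000,
  maxP99RegressionPct = 10
): Promise<"kept" | "rolled-back"> {
  const baselineP99 = await hooks.readP99Ms();
  await hooks.applyWeights(proposed);   // controlled rollout happens here
  await hooks.sleep(bakeMs);            // let the change bake
  const newP99 = await hooks.readP99Ms();
  if (newP99 > baselineP99 * (1 + maxP99RegressionPct / 100)) {
    await hooks.applyWeights(previous); // regression → automatic rollback
    return "rolled-back";
  }
  return "kept";                        // verified: no harm, keep the change
}
```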
Conclusion: The New Default Is Adaptive + Azion + AI
The modern framing is simple: load balancing is an SLO control system.
- Azion’s infrastructure makes routing decisions faster (closer to the user, smaller blast radius, quicker failover).
- AI can make those decisions smarter (predict saturation, detect anomalies, recommend safer routes) — if you constrain it with guardrails, verification, and rollback.
If your goal is to reduce tail latency globally and prevent gray failures from becoming outages, the sequence is:
- pick the right baseline algorithm,
- add outlier detection + slow-start + sane retries,
- then layer an AI loop for predictive steering.
And if you want an implementation path that’s realistic in enterprise environments, one approach is combining Azion Load Balancer + Azion Functions + Azion Data Stream (+ WAF / Firewall) to build an adaptive, observable, policy-driven routing system at the edge.
FAQ: Modern Load Balancing (Beyond Round‑Robin) + AI‑Assisted Edge Traffic Steering
Is round‑robin load balancing “bad”?
Round‑robin isn’t inherently bad—it’s just a default distribution method, not a decision system. It works best when backends are homogeneous, latency is stable, and the system isn’t frequently partially degraded. In real production environments (heterogeneous instances, bursty workloads, tail latency), round‑robin can amplify incidents by continuing to send traffic to slow or saturated nodes.
Why can round‑robin cause incidents even when everything looks “healthy”?
Because many real incidents are gray failures: a zone or subset of instances becomes slow, not down. Health checks may still return 200 OK, CPU may look fine, and nodes aren’t marked unhealthy—yet user experience degrades due to queueing, GC pauses, noisy neighbors, cold caches, or saturated pools. Round‑robin keeps feeding the slow slice, increasing tail latency and retries.
What is a “gray failure” in load balancing?
A gray failure is partial degradation—the service is reachable and may respond successfully, but performance or reliability is impaired (e.g., high p99 latency, intermittent timeouts, slow dependency calls). Gray failures are dangerous because they often don’t trigger binary health alarms but still break user experience.
Why do slowdowns often look like application bugs?
Because the symptoms (timeouts, increased latency, inconsistent behavior) look like regressions in code or database queries. In reality, the system may be experiencing localized saturation (one AZ, one node class, one dependency path). Naive balancing makes it appear random and hard to reproduce, especially when only some users are routed to the degraded backend.
What should a modern load balancer optimize for?
Not “even distribution,” but user outcomes and stability, especially:
- Tail latency (p95/p99), not just averages
- Avoiding overload spirals (retries → more load → more timeouts)
- Reducing blast radius so one degraded zone/host/dependency doesn’t poison the whole service
- Fast, controlled failover and traffic steering
Why do health checks “lie”?
Health checks are often binary and out-of-band, so they may confirm “process is alive” while missing:
- saturation (queue depth, exhausted thread pools)
- dependency failures (DB/Redis pool limits)
- user-path latency issues (regional network problems)
What you really need is: Is this backend a good choice for the next request right now?
How do retries turn a small issue into an outage?
Retries can create a retry storm:
- one zone gets slow
- timeouts rise
- retries increase traffic
- dependencies and queues saturate
- more timeouts → more retries
This becomes a self-inflicted DDoS. Modern balancing requires retry budgets, adaptive timeouts, and steering away from saturation—rather than blindly retrying into the same failure.
What is “slow start” (warmup) and why does it matter?
Slow start ramps traffic gradually to newly started or recently recovered instances. New instances often have cold caches, unstable GC/JIT behavior, empty pools, and can be overloaded immediately if they receive full traffic at once—causing oscillations during autoscaling and making “new capacity” look unhealthy.
Which load balancing algorithm should I use instead of round‑robin?
It depends on your symptom and protocol behavior. Common upgrades include:
- Weighted Least Connections: good general default for heterogeneous fleets
- Least Response Time (fastest): good for latency sensitivity and gray failure detection (must be error-aware)
- P2C (Power of Two Choices): strong under bursts with simple logic
- Consistent Hashing: best for cache locality or session affinity
- SLO/Latency + Error Budget aware routing: best for mission-critical, multi-region, dynamic conditions
When does least connections work well—and when does it fail?
Works well when concurrent connections correlate with load (common in HTTP/1.1 patterns and variable request duration). Breaks in HTTP/2 or HTTP/3 scenarios where multiplexing means few connections can still carry heavy work, and with long-lived idle keep-alives that distort the metric.
What is P2C (Power of Two Choices) and why is it effective?
P2C picks two backends at random and sends the request to the “better” one (lower load/latency). It often delivers surprisingly strong tail latency under bursty traffic without needing heavy global coordination. It works best when your “better” metric is meaningful (e.g., in-flight requests, latency corrected for errors).
Why can “fastest backend” routing backfire?
If you keep sending more traffic to the currently fastest backend, it can become the next bottleneck (feedback loop). Also, if you don’t account for errors, a backend that fails quickly may look “fast.” The fix is error-aware fastest plus saturation signals and anti-flap controls (longer windows, cooldowns, max shift rates).
What signals should I monitor to prevent load-balancing-driven incidents?
The post emphasizes watching early indicators of pain:
- p95/p99 latency per backend/AZ/region (divergence is key)
- retry rate and timeout rate
- queue depth / in-flight requests
- upstream connect time (network/path issues)
- dependency pool saturation (DB/Redis connections)
- error rates by class (timeouts vs 5xx vs fast failures)
What does “SLO-driven” or “error budget aware” routing mean?
It means routing decisions are made relative to an SLO target (e.g., p99 latency threshold, availability target) and how quickly you’re burning the error budget. Instead of reacting only after a hard failure, the system steers away from risk (degrading nodes/regions) before users notice.
How does AI help load balancing in production (practically)?
AI is most useful when it enables predictive steering, not just reactive failover:
- Predict saturation (5–15 minutes ahead) using time-series forecasting
- Detect anomalies (region/ASN/provider “weirdness,” bot storms, dependency regressions)
- Recommend actions like dynamic weight adjustment, route shifts, or pre-warming capacity
In short: AI should typically propose, while policy and guardrails enforce.
What guardrails are required for AI-assisted routing?
Non-negotiable controls described in the post include:
- limit traffic shift rate (e.g., max N% per minute)
- keep a known-good baseline route
- confidence thresholds + cooldown windows to prevent flapping
- “suggest then apply” rollout (human-in-the-loop → sandbox → guardrailed automation)
- automatic rollback on SLO regression
Why does edge-based load balancing improve reliability and latency?
When routing decisions happen closer to users:
- you detect user-perceived degradation faster
- failover happens without extra long RTT to a struggling region
- policies can incorporate geography, network conditions, and real-time performance
This reduces abandonment and improves conversion by lowering latency and shortening incident impact.
What makes Azion’s approach different?
Azion acts as an edge traffic control plane, combining:
- Azion Load Balancer (routing, weights, failover primitives)
- Application Acceleration (TLS/transport optimization; HTTP/2/HTTP/3 dynamics)
- Azion Functions (custom routing logic at the edge: canaries, circuit breakers, tenant-aware steering, dynamic weights)
- Azion Data Stream (telemetry export to SIEM/observability)
- Optionally: WAF/Firewall and micro-caching for burst absorption and abuse damping.
What is a good “first fix” if p99 latency is bad but averages look fine?
Move away from naive request-count distribution and adopt policies that reduce queueing and gray failure impact:
- Least Connections (especially for HTTP/1.1)
- or P2C using in-flight/latency signals
- Then layer slow-start, error-aware routing, and controlled retries.
What’s the recommended adoption path to reduce incidents caused by load balancing?
- Choose the right baseline algorithm (not default RR)
- Add outlier detection, slow-start, and sane retries (retry budgets)
- Layer an AI loop for predictive steering with guardrails, verification, and rollback
- For an implementation path at the edge: Azion Load Balancer + Azion Functions + Azion Data Stream (+ WAF/Firewall).