"Five nines" sounds impressive. "99.9% uptime" sounds like nearly perfect reliability. But the practical difference between 99.9% and 99.99% is enormous — and the difference in engineering cost is even larger. Before committing to any availability target, understand what you're actually promising and whether your infrastructure can deliver it.
Uptime Percentages in Real Time
The number that matters is downtime, not uptime percentage:
| SLA Level | Monthly Downtime | Annual Downtime | Practical Meaning |
|---|---|---|---|
| 99.0% | 7h 18m | 3d 15h | Several hours per month; significant impact |
| 99.5% | 3h 39m | 1d 19h | Three to four hours per month |
| 99.9% | 43m 50s | 8h 46m | Less than an hour per month |
| 99.95% | 21m 55s | 4h 23m | About 20 minutes per month |
| 99.99% | 4m 23s | 52m | Four minutes per month; no planned downtime possible |
| 99.999% | 26 seconds | 5m 15s | Essentially no downtime; requires extraordinary engineering |
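The figures in the table follow from simple arithmetic. Here is a minimal sketch, assuming an average month of 365.25 / 12 days; the name `downtime_budget` is illustrative and not part of any tool:

```python
from datetime import timedelta

# Assumption: an "average" month of 365.25 / 12 days (about 30.44 days).
AVERAGE_MONTH = timedelta(days=365.25 / 12)
YEAR = timedelta(days=365.25)


def downtime_budget(sla_pct: float, period: timedelta) -> timedelta:
    """Return the downtime allowed over `period` at the given SLA percentage."""
    if not 0 <= sla_pct <= 100:
        raise ValueError("sla_pct must be between 0 and 100")
    return period * (1 - sla_pct / 100)


for sla in (99.0, 99.5, 99.9, 99.95, 99.99, 99.999):
    monthly = downtime_budget(sla, AVERAGE_MONTH)
    yearly = downtime_budget(sla, YEAR)
    print(f"{sla}%  monthly={monthly}  yearly={yearly}")
```

If your contract measures calendar months rather than an average month, the budget for February will be slightly smaller than the budget for March.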
What Each Level Requires Architecturally
99.9% Uptime
The standard for most web applications and SaaS products.
What you can do:
- Deploy updates during low-traffic windows (plan for occasional brief outages)
- Use a single-region primary with a read replica
- Handle database migrations during maintenance windows
- Restart services for configuration changes
Typical infrastructure:
- Single cloud region
- Load balancer with 2+ application server instances
- Managed database with automated failover (typically adds a few minutes of failover time)
- CDN for static assets
What you can't do:
- Take an outage on every deployment: each 5-minute restart uses about 11% of the monthly budget, so frequent releases need rolling or zero-downtime deploys
- Perform database schema migrations that lock tables for more than a few minutes
Reality check: 43 minutes of downtime per month sounds like a lot.
But a single restart that takes 2 minutes and happens twice a month
consumes 4 minutes of your budget. A single database failover can
consume 5-10 minutes. A botched deployment requiring rollback can
consume 20-30 minutes.
99.9% is achievable but requires discipline in deployments.
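The budget arithmetic above can be tracked explicitly. This is a rough sketch, assuming a 99.9% budget over an average month and incident durations taken from the midpoints of the ranges above; the `ErrorBudget` class is hypothetical:

```python
from dataclasses import dataclass, field


@dataclass
class ErrorBudget:
    """Tracks downtime spent against a monthly allowance, in minutes."""
    allowed_minutes: float
    events: list[tuple[str, float]] = field(default_factory=list)

    def record(self, label: str, minutes: float) -> None:
        self.events.append((label, minutes))

    @property
    def spent_minutes(self) -> float:
        return sum(minutes for _, minutes in self.events)

    @property
    def remaining_minutes(self) -> float:
        return self.allowed_minutes - self.spent_minutes


# 99.9% over an average month (30.4375 days) is about 43.8 minutes.
budget = ErrorBudget(allowed_minutes=30.4375 * 24 * 60 * 0.001)
budget.record("two 2-minute restarts", 4)
budget.record("database failover", 7.5)         # assumed midpoint of 5-10 minutes
budget.record("botched deploy + rollback", 25)  # assumed midpoint of 20-30 minutes

print(f"spent {budget.spent_minutes:.1f} of {budget.allowed_minutes:.1f} minutes")
print(f"remaining: {budget.remaining_minutes:.1f} minutes")
```

With these assumed durations, one unremarkable month leaves roughly 7 minutes of slack, which is why deployment discipline matters even at this tier.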
99.95% Uptime
The next tier — often required for enterprise contracts with business-critical tools.
What you need:
- Blue/green deployments or rolling deployments with zero visible downtime
- Database failover under 1 minute (requires careful setup)
- Automated rollback capability
- Proactive capacity planning (no scaling events causing brief latency spikes)
What changes architecturally:
- Connection draining during deployments becomes mandatory
- Health check integration with load balancer must be correct
- Database choice matters — some managed databases have faster failover than others
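Connection draining and health checks work together: the instance first reports itself unhealthy so the load balancer stops routing to it, then waits for in-flight requests to finish before exiting. Below is a framework-agnostic sketch of that ordering. The names (`DrainableInstance`, `health_status`) are hypothetical, and no real load balancer API is involved:

```python
import threading
import time


class DrainableInstance:
    """Tracks readiness and in-flight requests for one application instance."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._in_flight = 0
        self._draining = False

    def health_status(self) -> int:
        """HTTP status a load balancer health check would receive."""
        with self._lock:
            return 503 if self._draining else 200

    def begin_request(self) -> bool:
        """Return False if the instance is draining and must refuse new work."""
        with self._lock:
            if self._draining:
                return False
            self._in_flight += 1
            return True

    def end_request(self) -> None:
        with self._lock:
            self._in_flight -= 1

    def drain(self, timeout_s: float = 30.0, poll_s: float = 0.05) -> bool:
        """Stop accepting work, then wait for in-flight requests to finish.

        Returns True if all requests finished before the timeout.
        """
        with self._lock:
            self._draining = True
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            with self._lock:
                if self._in_flight == 0:
                    return True
            time.sleep(poll_s)
        with self._lock:
            return self._in_flight == 0
```

In a real service, `drain()` would typically be called from the shutdown signal handler, with a timeout long enough for the load balancer to notice the failing health check before the process exits.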
99.99% Uptime
Four minutes per month of allowed downtime. This level changes the engineering problem fundamentally.
What you need:
- Multi-region active-active or active-passive with sub-minute failover
- Database replication with automated failover under 30 seconds
- Zero-downtime deployments (canary, blue-green, rolling) with automated verification
- Chaos engineering practice to find failure modes before they happen
- 24/7 on-call with paging and sub-5-minute response times
- No single points of failure anywhere in the stack
What this costs:
- Infrastructure runs at 2x minimum (N+1 redundancy in multiple regions)
- Engineering time for reliability is significant (often 20-30% of eng capacity)
- Operational complexity is substantially higher
Reality check: At 99.99%, a single 5-minute outage exceeds the
entire monthly budget of 4 minutes 23 seconds. Any incident that
isn't fully auto-recovered within about 4 minutes breaches the
target for that month.
Planned maintenance requires zero-downtime deployment strategies
for everything — including database migrations, certificate
renewals, and infrastructure changes.
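To see how recovery time interacts with the budget, count how many incidents of a given duration fit into one month. This is a small sketch, assuming an average month of 30.4375 days; the function name is illustrative:

```python
import math

MINUTES_PER_MONTH = 30.4375 * 24 * 60  # assumption: average month length


def incidents_within_budget(sla_pct: float, recovery_minutes: float) -> int:
    """How many incidents of a given duration fit in one month's budget."""
    if recovery_minutes <= 0:
        raise ValueError("recovery_minutes must be positive")
    budget = MINUTES_PER_MONTH * (1 - sla_pct / 100)
    return math.floor(budget / recovery_minutes)


for recovery in (0.5, 1, 5, 30):
    fits = incidents_within_budget(99.99, recovery)
    print(f"99.99% with {recovery}-minute recovery: {fits} incident(s) per month")
```

A result of zero means a single incident of that length already breaches the target, which is the case for any recovery slower than the 4 minute 23 second budget.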
99.999% (Five Nines)
26 seconds of allowed downtime per month. This is the domain of telecommunications infrastructure, financial clearing systems, and emergency services.
Requires:
- Multiple geographically distributed active-active regions
- Database with synchronous replication across regions
- Automated failure detection and failover in under 10 seconds
- Dedicated reliability engineering teams
- Netflix-scale chaos engineering practices
- Game Days and regular failover testing
Very few consumer or enterprise software products genuinely need this level. Most that claim it don't actually achieve it.
The Cost Curve Is Not Linear
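Going from 99.9% to 99.99% removes 90% of the allowed downtime, but cost does not scale in step with the percentage. As a rough guide, infrastructure cost rises from about 1.5x to 3-4x the 99.0% baseline, and engineering complexity rises faster still: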
| Reliability Level | Relative Infrastructure Cost | Relative Engineering Complexity |
|---|---|---|
| 99.0% | 1x | Low |
| 99.5% | 1.2x | Low |
| 99.9% | 1.5x | Medium |
| 99.95% | 2x | Medium-High |
| 99.99% | 3-4x | High |
| 99.999% | 8-10x+ | Very High |
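One way to read the table is to ask how much extra infrastructure each step buys per minute of monthly downtime removed. Here is a rough sketch using the table's relative costs, with range midpoints (3.5x and 9x) as an assumption and an average month of 30.4375 days:

```python
MINUTES_PER_MONTH = 30.4375 * 24 * 60  # assumption: average month length

# (SLA %, relative infrastructure cost). Midpoints are assumed where the
# table gives a range: 3-4x -> 3.5, 8-10x+ -> 9.0.
TIERS = [
    (99.0, 1.0),
    (99.5, 1.2),
    (99.9, 1.5),
    (99.95, 2.0),
    (99.99, 3.5),
    (99.999, 9.0),
]


def marginal_cost_per_minute(tiers):
    """Yield (from_sla, to_sla, extra cost units per downtime minute removed)."""
    for (sla_a, cost_a), (sla_b, cost_b) in zip(tiers, tiers[1:]):
        minutes_saved = MINUTES_PER_MONTH * (sla_b - sla_a) / 100
        yield sla_a, sla_b, (cost_b - cost_a) / minutes_saved


for sla_a, sla_b, cost in marginal_cost_per_minute(TIERS):
    print(f"{sla_a}% -> {sla_b}%: {cost:.4f}x per minute removed")
```

Under these assumed figures, each step costs more per minute removed than the one before it, and the final step to five nines costs over a thousand times more per minute than the first step.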
Measuring Your Actual Uptime
Before committing to an SLA, measure what you're actually delivering:
```python
def analyze_historical_availability(monitoring_data, months=12):
    """
    Analyze historical availability to understand which SLA tier is achievable.

    Assumes get_last_n_months, get_month_bounds, and calculate_availability
    are helpers defined elsewhere in your codebase.
    """
    results = []
    for month in get_last_n_months(months):
        start, end = get_month_bounds(month)
        checks = monitoring_data.get_checks(start=start, end=end)
        availability = calculate_availability(checks, start, end)
        results.append({
            "month": month.strftime("%Y-%m"),
            "availability_pct": availability["availability_pct"],
            "downtime_minutes": availability["downtime_minutes"],
            "achieves_99_9": availability["availability_pct"] >= 99.9,
            "achieves_99_95": availability["availability_pct"] >= 99.95,
            "achieves_99_99": availability["availability_pct"] >= 99.99,
        })

    # Summary: count the months that met each tier
    total = len(results)  # may be fewer than `months` if data is missing
    months_at_99_9 = sum(1 for m in results if m["achieves_99_9"])
    months_at_99_95 = sum(1 for m in results if m["achieves_99_95"])
    months_at_99_99 = sum(1 for m in results if m["achieves_99_99"])
    return {
        "monthly_data": results,
        "months_analyzed": total,
        "99_9_compliance_rate": f"{months_at_99_9}/{total}",
        "99_95_compliance_rate": f"{months_at_99_95}/{total}",
        "99_99_compliance_rate": f"{months_at_99_99}/{total}",
        "recommended_sla": determine_recommended_sla(results),
    }


def determine_recommended_sla(monthly_results):
    """
    Recommend an SLA target with reasonable confidence.
    Apply a buffer: only commit to what you've consistently exceeded.
    """
    if not monthly_results:
        raise ValueError("no monthly data to base a recommendation on")

    # Commit to a target you'd have achieved in your worst month,
    # with some buffer
    min_availability = min(m["availability_pct"] for m in monthly_results)
    if min_availability >= 99.95:
        return {"target": "99.9%", "note": "Achievable based on historical data"}
    elif min_availability >= 99.92:
        return {"target": "99.9%", "note": "Achievable but monitor closely"}
    elif min_availability >= 99.55:
        return {"target": "99.5%", "note": "Safe commitment"}
    else:
        return {"target": "99.0%", "note": "Work on reliability before committing higher"}
```
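The helper functions called above are not defined in this article. As one possible shape for `calculate_availability`, here is a minimal sketch that assumes checks run at a fixed interval and that each check exposes a boolean `ok` attribute. Both are assumptions for illustration, not a description of any particular monitoring tool:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Check:
    """Hypothetical record of one monitoring probe."""
    timestamp: datetime
    ok: bool


def calculate_availability(checks, start, end):
    """Estimate availability from evenly spaced checks in [start, end).

    Each check is treated as representing an equal slice of the window,
    so downtime is the failed fraction multiplied by the window length.
    """
    in_window = [c for c in checks if start <= c.timestamp < end]
    if not in_window:
        raise ValueError("no checks in the requested window")
    failed = sum(1 for c in in_window if not c.ok)
    window_minutes = (end - start).total_seconds() / 60
    failed_fraction = failed / len(in_window)
    return {
        "availability_pct": 100 * (1 - failed_fraction),
        "downtime_minutes": window_minutes * failed_fraction,
    }


# Example: one day of 1-minute checks with three consecutive failures.
start = datetime(2024, 1, 1)
end = start + timedelta(days=1)
checks = [
    Check(start + timedelta(minutes=i), ok=i not in (100, 101, 102))
    for i in range(1440)
]
result = calculate_availability(checks, start, end)
print(result)
```

The check interval bounds the precision of this estimate: with 1-minute checks, an outage shorter than a minute can be missed entirely or counted as a full minute.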
Choosing the Right Target
Use this framework when deciding what SLA to offer:
## SLA Target Selection Framework
1. Measure your actual availability for the past 12 months
   → If any month was below 99.9%, don't commit to 99.9%
2. Identify your worst-case scenario
   → How long does your longest incident typically last?
   → How long does a database failover take in practice?
   → How long does a deployment take end-to-end?
3. Apply a safety buffer
   → Commit to the tier below your best consistent performance
   → "We achieved 99.95% for 11 of 12 months" → commit to 99.9%
4. Consider your remediation capability
   → 99.99% requires automated recovery; manual response is too slow
   → If your typical MTTR is 30 minutes, 99.99% is not achievable
5. Think about what's in your control
   → Third-party dependencies (Stripe, AWS, Twilio) can fail
   → Their downtime is often YOUR SLA breach
   → Build in exclusions for third-party failures OR raise your own reliability
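The effect of third-party dependencies can be estimated by multiplying availabilities. The sketch below assumes every dependency sits on the critical path and that failures are independent, which is a simplification. The dependency figures are placeholders, not published SLAs of any vendor:

```python
import math


def composite_availability(own_pct: float, dependency_pcts: list[float]) -> float:
    """Availability of a service that is up only when it and all deps are up."""
    fractions = [own_pct / 100] + [p / 100 for p in dependency_pcts]
    return 100 * math.prod(fractions)


# Placeholder numbers for illustration only.
result = composite_availability(99.95, [99.99, 99.95, 99.9])
print(f"composite availability: {result:.3f}%")
```

With these placeholder inputs the composite falls below 99.9%, even though every individual component is at or above 99.9%. That gap is the reason to negotiate exclusions or to degrade gracefully when a dependency is down.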
Maintenance Windows and Uptime Calculations
Planned maintenance is often excluded from SLA calculations:
With Maintenance Window Exclusion:
- Monthly downtime budget: 43 minutes (for 99.9%)
- A scheduled 4-hour maintenance window is not counted
- The full 43 minutes remains available for UNPLANNED downtime
Without Maintenance Window Exclusion:
- Monthly downtime budget: 43 minutes for all downtime, including maintenance
- No room for maintenance that takes the service offline
- Requires zero-downtime deployment strategies for all changes
Most SaaS companies exclude scheduled maintenance windows at lower SLA tiers and rely on zero-downtime deployments for commitments of 99.99% and above.
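The two accounting methods can be compared directly. This sketch assumes that, when maintenance is excluded, planned minutes are subtracted from both the downtime and the measurement period. Contracts define this differently, so treat the formula as one possible convention rather than a standard:

```python
def availability_pct(total_minutes: float, unplanned_minutes: float,
                     planned_minutes: float, exclude_planned: bool) -> float:
    """Availability over a period, with or without a maintenance exclusion."""
    if exclude_planned:
        # Planned maintenance is removed from both downtime and the period.
        measured = total_minutes - planned_minutes
        downtime = unplanned_minutes
    else:
        measured = total_minutes
        downtime = unplanned_minutes + planned_minutes
    if measured <= 0:
        raise ValueError("measurement period must be positive")
    return 100 * (1 - downtime / measured)


MONTH = 30.4375 * 24 * 60  # assumption: average month, in minutes

# 20 minutes of unplanned downtime plus one 4-hour maintenance window.
with_exclusion = availability_pct(MONTH, 20, 240, exclude_planned=True)
without_exclusion = availability_pct(MONTH, 20, 240, exclude_planned=False)
print(f"with exclusion:    {with_exclusion:.3f}%")
print(f"without exclusion: {without_exclusion:.3f}%")
```

In this example the same month meets 99.9% with the exclusion and falls short of 99.5% without it, so the contract wording matters as much as the headline percentage.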
Conclusion
The difference between uptime percentages is primarily a difference in engineering investment and operational discipline — not just a marketing choice. A 99.9% SLA is achievable for most well-run web applications; 99.99% requires significant architectural work and continuous reliability investment. Before choosing a target, measure what you're actually delivering, understand what infrastructure supports which tier, and commit conservatively. AzMonitor's historical monitoring data gives you the actual uptime measurements needed to make this decision with data rather than optimism, and ongoing monitoring to verify you're meeting whatever commitment you make.