SLA math is straightforward but easy to get wrong when you're dealing with multiple services, partial outages, or compound systems. This guide covers the core calculations needed to correctly determine availability, understand downtime budgets, and detect SLA breaches before your customers do.
Basic Availability Calculation
The fundamental formula:
Availability (%) = (Total Time - Downtime) / Total Time × 100
Or equivalently:
Availability (%) = Uptime / Total Time × 100
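As a quick sanity check of the formula, here is a minimal Python sketch. The function name `availability_pct` and the sample figures are illustrative assumptions, not part of any particular monitoring tool.

```python
def availability_pct(total_minutes: float, downtime_minutes: float) -> float:
    """Availability (%) = (Total Time - Downtime) / Total Time x 100."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    if not 0 <= downtime_minutes <= total_minutes:
        raise ValueError("downtime_minutes must be between 0 and total_minutes")
    return (total_minutes - downtime_minutes) / total_minutes * 100


# Hypothetical example: 90 minutes of downtime in a 30-day month.
total = 30 * 24 * 60  # 43,200 minutes
print(f"{availability_pct(total, 90):.3f}%")  # 99.792%
```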
Converting Time to Downtime
For a 99.9% SLA in a 30-day month:
Total minutes in 30-day month = 30 × 24 × 60 = 43,200 minutes
Allowed downtime = 43,200 × (1 - 0.999) = 43,200 × 0.001 = 43.2 minutes
def calculate_allowed_downtime(sla_target_pct, period_days):
    """
    Calculate maximum allowed downtime for an SLA target.

    Args:
        sla_target_pct: e.g., 99.9 for 99.9% SLA
        period_days: number of days in the period (30, 31, 365, etc.)

    Returns:
        dict with downtime in various units
    """
    total_minutes = period_days * 24 * 60
    allowed_downtime_minutes = total_minutes * (1 - sla_target_pct / 100)

    return {
        "sla_target_pct": sla_target_pct,
        "period_days": period_days,
        "total_minutes": total_minutes,
        "allowed_downtime_minutes": round(allowed_downtime_minutes, 1),
        "allowed_downtime_hours": round(allowed_downtime_minutes / 60, 2),
        "allowed_downtime_seconds": round(allowed_downtime_minutes * 60, 0)
    }

# Quick reference table
for pct in [99.0, 99.5, 99.9, 99.95, 99.99, 99.999]:
    result = calculate_allowed_downtime(pct, 30)
    print(f"{pct}%: {result['allowed_downtime_minutes']:.1f} minutes/month")

# Output (annotations in parentheses are not printed):
# 99.0%: 432.0 minutes/month (7h 12m)
# 99.5%: 216.0 minutes/month (3h 36m)
# 99.9%: 43.2 minutes/month
# 99.95%: 21.6 minutes/month
# 99.99%: 4.3 minutes/month
# 99.999%: 0.4 minutes/month (26 seconds)
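The table above assumes a 30-day month, but calendar months run from 28 to 31 days, which shifts the budget slightly. The sketch below derives the period length from the standard library's `calendar.monthrange`; the function name `allowed_downtime_for_month` is an assumption made for illustration.

```python
import calendar


def allowed_downtime_for_month(sla_target_pct: float, year: int, month: int) -> float:
    """Allowed downtime in minutes for one specific calendar month."""
    # monthrange returns (weekday of the 1st, number of days in the month)
    _, days_in_month = calendar.monthrange(year, month)
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1 - sla_target_pct / 100)


for year, month in [(2024, 2), (2024, 4), (2024, 7)]:
    minutes = allowed_downtime_for_month(99.9, year, month)
    print(f"{year}-{month:02d}: {minutes:.2f} minutes")

# Output:
# 2024-02: 41.76 minutes
# 2024-04: 43.20 minutes
# 2024-07: 44.64 minutes
```

Whether the SLA period is a calendar month, a fixed 30 days, or a rolling window depends on the contract, so check the definition before choosing one.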
Measuring Actual Availability from Monitoring Data
from datetime import datetime
from typing import List

def calculate_availability_from_checks(
    checks: List[dict],
    period_start: datetime,
    period_end: datetime
) -> dict:
    """
    Calculate actual availability from monitoring check history.

    Each check has: checked_at (datetime), status ('up' or 'down')
    """
    total_seconds = (period_end - period_start).total_seconds()

    # Sort checks chronologically
    sorted_checks = sorted(checks, key=lambda c: c["checked_at"])

    downtime_seconds = 0.0
    outage_start = None

    for check in sorted_checks:
        # Only process checks within our period
        if check["checked_at"] < period_start or check["checked_at"] > period_end:
            continue

        if check["status"] == "down":
            if outage_start is None:
                outage_start = check["checked_at"]
        else:  # status == "up"
            if outage_start is not None:
                outage_seconds = (check["checked_at"] - outage_start).total_seconds()
                downtime_seconds += outage_seconds
                outage_start = None

    # Handle outage that extends to end of period
    if outage_start is not None:
        outage_seconds = (period_end - outage_start).total_seconds()
        downtime_seconds += outage_seconds

    uptime_seconds = total_seconds - downtime_seconds
    availability_pct = (uptime_seconds / total_seconds) * 100

    return {
        "period_start": period_start.isoformat(),
        "period_end": period_end.isoformat(),
        "total_minutes": round(total_seconds / 60, 1),
        "uptime_minutes": round(uptime_seconds / 60, 1),
        "downtime_minutes": round(downtime_seconds / 60, 2),
        "availability_pct": round(availability_pct, 4),
        "availability_display": f"{availability_pct:.3f}%"
    }
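One limitation of the function above: it starts counting an outage at the first "down" check inside the period, so an outage already in progress at `period_start` is undercounted. One possible way to handle this, sketched below, is to seed the initial state from the last check before the period. The function name `downtime_seconds_with_carry_in` and the sample data are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Optional


def downtime_seconds_with_carry_in(
    checks: List[dict], period_start: datetime, period_end: datetime
) -> float:
    """Sum downtime inside [period_start, period_end], including an outage
    that was already in progress when the period began."""
    ordered = sorted(checks, key=lambda c: c["checked_at"])

    # Seed the state from the last check before the period, if there is one.
    outage_start: Optional[datetime] = None
    for check in ordered:
        if check["checked_at"] >= period_start:
            break
        outage_start = period_start if check["status"] == "down" else None

    downtime = 0.0
    for check in ordered:
        ts = check["checked_at"]
        if ts < period_start or ts > period_end:
            continue
        if check["status"] == "down":
            if outage_start is None:
                outage_start = ts
        elif outage_start is not None:
            downtime += (ts - outage_start).total_seconds()
            outage_start = None

    # An outage still open at the end is counted up to period_end.
    if outage_start is not None:
        downtime += (period_end - outage_start).total_seconds()
    return downtime


start = datetime(2024, 1, 1)
end = start + timedelta(hours=1)
checks = [
    {"checked_at": start - timedelta(minutes=5), "status": "down"},
    {"checked_at": start + timedelta(minutes=10), "status": "up"},
]
print(downtime_seconds_with_carry_in(checks, start, end))  # 600.0
```

Here the service was already down when the period opened, so the 10 minutes up to the first "up" check count as downtime. Counting only in-period checks would report zero.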
Compound Availability: Multiple Services
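When your product depends on multiple services and all of them must be up for it to work, the combined availability is lower than that of any individual service (assuming failures are independent and no service is at 100%):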
Combined availability = Service A × Service B × Service C ...
Example:
Service A: 99.9% available
Service B: 99.9% available (independent)
Combined: 0.999 × 0.999 = 0.998001 ≈ 99.8% available
This is why complex microservice architectures often have lower end-to-end availability than each individual service:
def calculate_compound_availability(service_availabilities: List[float]) -> dict:
    """
    Calculate combined availability for a system where all services
    must be available for the system to function.

    Args:
        service_availabilities: List of availability percentages (e.g., [99.9, 99.5])

    Returns:
        dict with the combined availability and the implied monthly downtime
    """
    # Convert percentages to decimals for multiplication
    combined = 1.0
    for avail_pct in service_availabilities:
        combined *= (avail_pct / 100)

    combined_pct = combined * 100

    return {
        "services": service_availabilities,
        "service_count": len(service_availabilities),
        "combined_availability_pct": round(combined_pct, 4),
        "combined_downtime_per_month_minutes": round(
            (1 - combined) * 30 * 24 * 60, 1
        ),
        "note": "Each additional dependency reduces combined availability"
    }

# Examples
examples = [
    [99.9, 99.9],            # Two 99.9% services
    [99.9, 99.9, 99.9],      # Three 99.9% services
    [99.9, 99.5, 99.9],      # Mixed SLA services
    [99.99, 99.99, 99.99],   # Three 99.99% services
]

for services in examples:
    result = calculate_compound_availability(services)
    print(f"{services} → {result['combined_availability_pct']:.4f}%")

# Output:
# [99.9, 99.9] → 99.8001%
# [99.9, 99.9, 99.9] → 99.7003%
# [99.9, 99.5, 99.9] → 99.3011%
# [99.99, 99.99, 99.99] → 99.9700%
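The same multiplication can be run in reverse to ask how available each dependency must be for the chain to meet a target. The sketch below assumes n independent services with identical availability, all required, so each needs target^(1/n). The function name `required_per_service_availability` is an illustrative assumption.

```python
def required_per_service_availability(target_pct: float, service_count: int) -> float:
    """Availability (%) each of n identical, independent, serially required
    services needs so that their product equals target_pct."""
    if service_count < 1:
        raise ValueError("service_count must be at least 1")
    return (target_pct / 100) ** (1 / service_count) * 100


for n in [1, 2, 5, 10]:
    needed = required_per_service_availability(99.9, n)
    print(f"{n} services: each needs {needed:.4f}%")
```

For a 99.9% end-to-end target, two serial dependencies each need roughly 99.95%, and ten need roughly 99.99%.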
Parallel Services (Redundancy)
When services run in parallel as redundant copies, the system fails only if all of them are down at the same time:
def calculate_parallel_availability(service_availabilities: List[float]) -> float:
    """
    Calculate availability when services are in PARALLEL.
    System fails only if ALL services are down simultaneously.

    Combined availability = 1 - (1-A1) × (1-A2) × ...
    """
    combined_failure_rate = 1.0
    for avail_pct in service_availabilities:
        failure_rate = 1 - (avail_pct / 100)
        combined_failure_rate *= failure_rate

    return (1 - combined_failure_rate) * 100

# Two redundant servers, each 99% available:
# Combined = 1 - (0.01 × 0.01) = 1 - 0.0001 = 99.99%
print(f"Parallel: {calculate_parallel_availability([99.0, 99.0]):.4f}%")
# Output: Parallel: 99.9900%
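Real systems usually mix both patterns: redundant tiers chained behind single points of failure. The sketch below combines the series and parallel formulas for a made-up topology. The helper names `series` and `parallel` and all availability figures are assumptions for illustration, and independence between failures is assumed throughout.

```python
from typing import List


def series(availabilities: List[float]) -> float:
    """All components required: multiply availabilities (percent in, percent out)."""
    combined = 1.0
    for pct in availabilities:
        combined *= pct / 100
    return combined * 100


def parallel(availabilities: List[float]) -> float:
    """Any one component suffices: multiply the failure probabilities."""
    all_down = 1.0
    for pct in availabilities:
        all_down *= 1 - pct / 100
    return (1 - all_down) * 100


# Hypothetical topology: one load balancer (99.99%) in front of two
# redundant app servers (99% each), backed by a single database (99.95%).
app_tier = parallel([99.0, 99.0])
system = series([99.99, app_tier, 99.95])
print(f"App tier: {app_tier:.4f}%")  # 99.9900%
print(f"System:   {system:.4f}%")    # 99.9300%
```

Redundancy lifts the app tier from 99% to 99.99%, but the non-redundant database caps the whole system at about 99.93%.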
Error Budget Math
The error budget is the complement of your SLA target: the share of the period (100% minus the target) during which you are allowed to be down.
def calculate_error_budget(sla_target_pct: float, period_days: int = 30):
    """
    Calculate the error budget for an SLA target.

    Error budget = (1 - SLA target) × period
    """
    error_budget_pct = 100 - sla_target_pct
    total_minutes = period_days * 24 * 60
    budget_minutes = total_minutes * (error_budget_pct / 100)

    return {
        "sla_target_pct": sla_target_pct,
        # Rounded to hide floating-point noise (100 - 99.9 is not exactly 0.1)
        "error_budget_pct": round(error_budget_pct, 6),
        "budget_minutes": round(budget_minutes, 1),
        "budget_seconds": round(budget_minutes * 60),
    }

def calculate_budget_burn_rate(
    error_budget_minutes: float,
    actual_downtime_minutes: float,
    elapsed_period_fraction: float
) -> dict:
    """
    Calculate how fast the error budget is being consumed.

    burn_rate > 1 means you're consuming budget faster than sustainable.
    """
    if error_budget_minutes <= 0:
        # A 100% target leaves no budget, so the ratio below is undefined
        raise ValueError("error_budget_minutes must be positive")

    # Both values are fractions (0-1); they are scaled to percent on return
    budget_consumed_pct = actual_downtime_minutes / error_budget_minutes
    expected_consumed_pct = elapsed_period_fraction

    burn_rate = budget_consumed_pct / elapsed_period_fraction if elapsed_period_fraction > 0 else 0

    remaining_budget = error_budget_minutes - actual_downtime_minutes

    return {
        "burn_rate": round(burn_rate, 2),
        "budget_consumed_pct": round(budget_consumed_pct * 100, 1),
        "expected_consumed_pct": round(expected_consumed_pct * 100, 1),
        "remaining_budget_minutes": round(remaining_budget, 1),
        "on_track": burn_rate <= 1.0,
        "at_risk": 1.0 < burn_rate <= 2.0,
        "critical": burn_rate > 2.0
    }
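A burn rate is easier to act on when translated into time remaining. The sketch below works through one hypothetical case and estimates how many days of budget are left if downtime keeps accruing at the average rate seen so far. The function name `days_until_budget_exhausted` and the linear-extrapolation assumption are mine, not part of any standard definition.

```python
def days_until_budget_exhausted(
    error_budget_minutes: float,
    actual_downtime_minutes: float,
    elapsed_days: float,
) -> float:
    """Days until the budget reaches zero, assuming downtime continues at the
    average daily rate observed so far. Returns inf if there is no downtime yet."""
    remaining = error_budget_minutes - actual_downtime_minutes
    if remaining <= 0:
        return 0.0  # budget already spent
    if actual_downtime_minutes <= 0 or elapsed_days <= 0:
        return float("inf")
    downtime_per_day = actual_downtime_minutes / elapsed_days
    return remaining / downtime_per_day


# Hypothetical: 99.9% over 30 days (43.2-minute budget),
# 20 minutes of downtime after 10 days.
budget = 43.2
burn_rate = (20 / budget) / (10 / 30)
print(f"Burn rate: {burn_rate:.2f}")  # 1.39
print(f"Days left: {days_until_budget_exhausted(budget, 20, 10):.1f}")  # 11.6
```

With 20 days still to go and only about 11.6 days of budget left at this pace, the period would end in breach unless the rate of downtime drops.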
Measuring Partial Outages
Not all incidents affect 100% of users. Some SLAs account for partial impact:
def calculate_weighted_availability(
    outage_periods: List[dict],
    total_period_minutes: float
) -> dict:
    """
    Calculate availability weighted by percentage of users affected.

    Some SLA definitions use:
    Effective downtime = actual_downtime × impact_percentage

    outage_periods: list of {
        duration_minutes: float,
        impact_pct: float  # 0-100, percentage of users affected
    }
    """
    weighted_downtime_minutes = sum(
        period["duration_minutes"] * (period["impact_pct"] / 100)
        for period in outage_periods
    )

    # Simple availability (any impact counts as downtime)
    total_downtime_minutes = sum(p["duration_minutes"] for p in outage_periods)
    simple_availability = (
        (total_period_minutes - total_downtime_minutes) / total_period_minutes * 100
    )

    # Weighted availability (impact percentage factors in)
    weighted_availability = (
        (total_period_minutes - weighted_downtime_minutes) / total_period_minutes * 100
    )

    return {
        "simple_availability_pct": round(simple_availability, 4),
        "weighted_availability_pct": round(weighted_availability, 4),
        "note": "Weighted availability is higher when only partial user impact"
    }
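To see how much the choice of method matters, the following sketch applies both calculations to two made-up incidents in a 30-day month. The incident data is hypothetical, and the arithmetic is repeated inline so the snippet runs on its own.

```python
# Hypothetical incidents in a 30-day month (43,200 minutes).
outages = [
    {"duration_minutes": 60, "impact_pct": 25},   # one region degraded
    {"duration_minutes": 10, "impact_pct": 100},  # full outage
]
total_period_minutes = 30 * 24 * 60

# Simple: every affected minute counts in full.
simple_downtime = sum(o["duration_minutes"] for o in outages)
# Weighted: each minute is scaled by the share of users affected.
weighted_downtime = sum(
    o["duration_minutes"] * o["impact_pct"] / 100 for o in outages
)

simple_pct = (total_period_minutes - simple_downtime) / total_period_minutes * 100
weighted_pct = (total_period_minutes - weighted_downtime) / total_period_minutes * 100

print(f"Simple:   {simple_downtime} min down -> {simple_pct:.4f}%")
print(f"Weighted: {weighted_downtime} min down -> {weighted_pct:.4f}%")

# Output:
# Simple:   70 min down -> 99.8380%
# Weighted: 25.0 min down -> 99.9421%
```

Against a 99.9% target (43.2 minutes allowed), the same two incidents are a breach under the simple method and within budget under the weighted one, so the SLA text needs to state which method applies.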
Real-Time SLA Compliance Check
def check_current_sla_compliance(
    monitoring_client,
    customer_id: str,
    sla_target_pct: float
) -> dict:
    """
    Check current SLA compliance status for a customer.
    Useful for real-time dashboards and automated breach detection.
    """
    # Naive UTC timestamp. datetime.utcnow() is deprecated since Python 3.12;
    # use datetime.now(timezone.utc) if your check timestamps are timezone-aware.
    now = datetime.utcnow()
    month_start = now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)

    # Calculate period stats
    total_minutes = (now - month_start).total_seconds() / 60

    # Get downtime from monitoring
    checks = monitoring_client.get_check_history(
        customer_id=customer_id,
        start=month_start,
        end=now
    )

    availability = calculate_availability_from_checks(checks, month_start, now)
    downtime_minutes = availability["downtime_minutes"]

    # Calculate budget
    full_month_minutes = 30 * 24 * 60  # Approximate
    full_budget_minutes = full_month_minutes * (1 - sla_target_pct / 100)
    remaining_budget = full_budget_minutes - downtime_minutes

    # Fraction of the (approximate) month that has elapsed so far
    elapsed_fraction = total_minutes / full_month_minutes

    return {
        "sla_target": f"{sla_target_pct}%",
        "current_availability": f"{availability['availability_pct']:.4f}%",
        "downtime_so_far_minutes": round(downtime_minutes, 1),
        "budget_remaining_minutes": round(remaining_budget, 1),
        "in_breach": downtime_minutes > full_budget_minutes,
        "at_risk": remaining_budget < 15,  # Less than 15 minutes remaining
        # project_month_end is a helper that must be defined separately; it
        # extrapolates downtime so far to a month-end availability figure.
        "projected_month_end_availability": project_month_end(
            downtime_minutes, elapsed_fraction, full_month_minutes
        )
    }
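The compliance check calls `project_month_end`, which this guide does not define. One plausible reading is a linear extrapolation of the downtime observed so far. The sketch below implements that assumption, with the argument order taken from the call site and the return type (a percentage as a float) chosen for illustration.

```python
def project_month_end(
    downtime_minutes: float,
    elapsed_fraction: float,
    full_month_minutes: float,
) -> float:
    """Projected month-end availability (%), assuming downtime keeps accruing
    at the average rate observed so far (a simplifying assumption)."""
    if elapsed_fraction <= 0:
        return 100.0  # nothing observed yet
    # Scale downtime so far up to a full month, capped at the month length.
    projected_downtime = min(downtime_minutes / elapsed_fraction, full_month_minutes)
    return round((1 - projected_downtime / full_month_minutes) * 100, 4)


# Hypothetical: 20 minutes down with one third of a 30-day month elapsed.
print(project_month_end(20.0, 1 / 3, 43200))  # 99.8611
```

A linear projection overreacts early in the month, when a single incident dominates the average, so it is best read as a warning signal and not as a forecast.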
Conclusion
SLA math seems simple but has important nuances: in a chain of dependencies, availabilities multiply and the combined figure drops with every service added; partial outages may be weighted differently; and measurement methodology (what counts as "down") materially affects the numbers. The most important habit is measuring real availability continuously rather than calculating it only when disputes arise. AzMonitor's continuous monitoring collects the check-level data that makes accurate SLA calculations possible, tracks historical uptime for baseline and trend analysis, and provides the timestamps needed to calculate exact downtime durations rather than relying on memory or imprecise estimates.