
Calculating SLA: The Math Behind Uptime Percentages and Downtime Budgets

Learn how to calculate SLA availability, compound SLAs for multiple services, measure error budgets, and verify SLA compliance using monitoring data.

AzMonitor Team · June 4, 2025 · 7 min read · 1,307 words · Updated January 20, 2026
Tags: SLA calculation, uptime calculation, availability math, error budget

SLA math is straightforward but easy to get wrong when you're dealing with multiple services, partial outages, or compound systems. This guide covers the core calculations needed to correctly determine availability, understand downtime budgets, and detect SLA breaches before your customers do.

Basic Availability Calculation

The fundamental formula:

Availability (%) = (Total Time - Downtime) / Total Time × 100

Or equivalently:

Availability (%) = Uptime / Total Time × 100
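
As a quick worked example of the formula, the sketch below computes availability from a downtime figure. The function name `availability_pct` is illustrative and does not come from any library.

```python
def availability_pct(total_minutes: float, downtime_minutes: float) -> float:
    """Availability (%) = (Total Time - Downtime) / Total Time x 100."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    if not 0 <= downtime_minutes <= total_minutes:
        raise ValueError("downtime_minutes must be between 0 and total_minutes")
    return (total_minutes - downtime_minutes) / total_minutes * 100


# 90 minutes of downtime in a 30-day month (43,200 minutes)
print(f"{availability_pct(30 * 24 * 60, 90):.4f}%")  # 99.7917%
```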

Converting an SLA Target to a Downtime Budget

For a 99.9% SLA in a 30-day month:

Total minutes in 30-day month = 30 × 24 × 60 = 43,200 minutes
Allowed downtime = 43,200 × (1 - 0.999) = 43,200 × 0.001 = 43.2 minutes

The same calculation in Python:

def calculate_allowed_downtime(sla_target_pct, period_days):
    """
    Calculate maximum allowed downtime for an SLA target.
    
    Args:
        sla_target_pct: e.g., 99.9 for 99.9% SLA
        period_days: number of days in the period (30, 31, 365, etc.)
    
    Returns:
        dict with downtime in various units
    """
    total_minutes = period_days * 24 * 60
    allowed_downtime_minutes = total_minutes * (1 - sla_target_pct / 100)
    
    return {
        "sla_target_pct": sla_target_pct,
        "period_days": period_days,
        "total_minutes": total_minutes,
        "allowed_downtime_minutes": round(allowed_downtime_minutes, 1),
        "allowed_downtime_hours": round(allowed_downtime_minutes / 60, 2),
        "allowed_downtime_seconds": round(allowed_downtime_minutes * 60, 0)
    }

# Quick reference table
for pct in [99.0, 99.5, 99.9, 99.95, 99.99, 99.999]:
    result = calculate_allowed_downtime(pct, 30)
    print(f"{pct}%: {result['allowed_downtime_minutes']:.1f} minutes/month")

# Output (durations in parentheses added for reference):
# 99.0%: 432.0 minutes/month    (7h 12m)
# 99.5%: 216.0 minutes/month    (3h 36m)
# 99.9%: 43.2 minutes/month     (43m 12s)
# 99.95%: 21.6 minutes/month    (21m 36s)
# 99.99%: 4.3 minutes/month     (4m 19s)
# 99.999%: 0.4 minutes/month    (26s)
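
Budgets are easier to read as hours, minutes, and seconds than as fractional minutes. The helper below is a minimal sketch of that conversion; `format_downtime` is a hypothetical name introduced here for illustration.

```python
def format_downtime(minutes: float) -> str:
    """Render a downtime budget given in minutes as days/hours/minutes/seconds."""
    total_seconds = round(minutes * 60)
    days, rem = divmod(total_seconds, 86_400)
    hours, rem = divmod(rem, 3_600)
    mins, secs = divmod(rem, 60)
    parts = [(days, "d"), (hours, "h"), (mins, "m"), (secs, "s")]
    # Skip zero-valued units so 432 minutes reads "7h 12m", not "0d 7h 12m 0s"
    text = " ".join(f"{value}{unit}" for value, unit in parts if value)
    return text or "0s"


for sla in (99.0, 99.9, 99.999):
    budget_minutes = 30 * 24 * 60 * (1 - sla / 100)
    print(f"{sla}%: {format_downtime(budget_minutes)}")
```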

Measuring Actual Availability from Monitoring Data

from datetime import datetime
from typing import List, Tuple

def calculate_availability_from_checks(
    checks: List[dict],
    period_start: datetime,
    period_end: datetime
) -> dict:
    """
    Calculate actual availability from monitoring check history.
    
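    Each check has: checked_at (datetime), status ('up' or 'down')

    An outage is measured from the first failed check to the next successful
    check, so precision is limited by the check interval.
    """
    total_seconds = (period_end - period_start).total_seconds()
    if total_seconds <= 0:
        raise ValueError("period_end must be after period_start")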
    
    # Sort checks chronologically
    sorted_checks = sorted(checks, key=lambda c: c["checked_at"])
    
    downtime_seconds = 0.0
    outage_start = None
    
    for check in sorted_checks:
        # Only process checks within our period
        if check["checked_at"] < period_start or check["checked_at"] > period_end:
            continue
        
        if check["status"] == "down":
            if outage_start is None:
                outage_start = check["checked_at"]
        else:  # status == "up"
            if outage_start is not None:
                outage_seconds = (check["checked_at"] - outage_start).total_seconds()
                downtime_seconds += outage_seconds
                outage_start = None
    
    # Handle outage that extends to end of period
    if outage_start is not None:
        outage_seconds = (period_end - outage_start).total_seconds()
        downtime_seconds += outage_seconds
    
    uptime_seconds = total_seconds - downtime_seconds
    availability_pct = (uptime_seconds / total_seconds) * 100
    
    return {
        "period_start": period_start.isoformat(),
        "period_end": period_end.isoformat(),
        "total_minutes": round(total_seconds / 60, 1),
        "uptime_minutes": round(uptime_seconds / 60, 1),
        "downtime_minutes": round(downtime_seconds / 60, 2),
        "availability_pct": round(availability_pct, 4),
        "availability_display": f"{availability_pct:.3f}%"
    }
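
For SLA reports it often helps to list the individual outage windows, not just the total. The sketch below applies the same down-to-up transition logic to return each window. `extract_outage_windows` and the synthetic check data are illustrative assumptions, not part of any monitoring API.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def extract_outage_windows(
    checks: List[dict], period_end: datetime
) -> List[Tuple[datetime, datetime]]:
    """Return a (start, end) pair for each run of consecutive 'down' checks."""
    windows = []
    outage_start = None
    for check in sorted(checks, key=lambda c: c["checked_at"]):
        if check["status"] == "down":
            if outage_start is None:
                outage_start = check["checked_at"]
        elif outage_start is not None:
            windows.append((outage_start, check["checked_at"]))
            outage_start = None
    if outage_start is not None:  # outage still open at the end of the period
        windows.append((outage_start, period_end))
    return windows


# Synthetic data: one check per minute for an hour, failing at minutes 10-14
start = datetime(2025, 6, 1)
checks = [
    {"checked_at": start + timedelta(minutes=m),
     "status": "down" if 10 <= m < 15 else "up"}
    for m in range(60)
]
for began, ended in extract_outage_windows(checks, start + timedelta(hours=1)):
    print(began.time(), "->", ended.time(), "duration", ended - began)
```

With one-minute checks, each window is accurate to about a minute at either edge; longer check intervals widen that uncertainty accordingly.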

Compound Availability: Multiple Services

When your product needs several independent services to all be up at once (a serial dependency chain), the combined availability is lower than that of any individual service:

Combined availability = Service A × Service B × Service C ...

Example:
Service A: 99.9% available
Service B: 99.9% available (independent)
Combined: 0.999 × 0.999 = 0.998001 = 99.8001% available (roughly 99.8%)

This is why complex microservice architectures often have lower end-to-end availability than each individual service:

def calculate_compound_availability(service_availabilities: List[float]) -> dict:
    """
    Calculate combined availability for a system where all services 
    must be available for the system to function.
    
    Args:
        service_availabilities: List of availability percentages (e.g., [99.9, 99.5])
    
    Returns:
        Combined availability
    """
    # Convert percentages to decimals for multiplication
    combined = 1.0
    for avail_pct in service_availabilities:
        combined *= (avail_pct / 100)
    
    combined_pct = combined * 100
    
    return {
        "services": service_availabilities,
        "service_count": len(service_availabilities),
        "combined_availability_pct": round(combined_pct, 4),
        "combined_downtime_per_month_minutes": round(
            (1 - combined) * 30 * 24 * 60, 1
        ),
        "note": "Each additional dependency reduces combined availability"
    }

# Examples
examples = [
    [99.9, 99.9],           # Two 99.9% services
    [99.9, 99.9, 99.9],     # Three 99.9% services
    [99.9, 99.5, 99.9],     # Mixed SLA services
    [99.99, 99.99, 99.99],  # Three 99.99% services
]

for services in examples:
    result = calculate_compound_availability(services)
    print(f"{services} → {result['combined_availability_pct']:.4f}%")

# Output:
# [99.9, 99.9] → 99.8001%
# [99.9, 99.9, 99.9] → 99.7003%
# [99.9, 99.5, 99.9] → 99.3011%
# [99.99, 99.99, 99.99] → 99.9700%
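
The same relationship can be run in reverse: given an end-to-end target and a number of serial dependencies, what does each service need to deliver? The sketch below assumes independent services with equal availability; `required_per_service_availability` is a hypothetical helper name.

```python
def required_per_service_availability(target_pct: float, service_count: int) -> float:
    """
    Per-service availability (%) needed so that `service_count` independent
    serial dependencies multiply out to `target_pct` end to end.
    """
    if service_count < 1:
        raise ValueError("service_count must be at least 1")
    # Inverse of the product rule: per-service = target ** (1 / n)
    return (target_pct / 100) ** (1 / service_count) * 100


for n in (1, 3, 10):
    needed = required_per_service_availability(99.9, n)
    print(f"{n} serial services: each needs about {needed:.4f}%")
```

The required per-service figure climbs quickly as dependencies are added, which is a useful sanity check before committing to an end-to-end SLA.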

Parallel Services (Redundancy)

When services run in parallel as redundant replicas, the system fails only if all of them are down at the same time. The formula assumes their failures are independent:

def calculate_parallel_availability(service_availabilities: List[float]) -> float:
    """
    Calculate availability when services are in PARALLEL.
    System fails only if ALL services are down simultaneously.
    
    Combined availability = 1 - (1-A1) × (1-A2) × ...
    """
    combined_failure_rate = 1.0
    for avail_pct in service_availabilities:
        failure_rate = 1 - (avail_pct / 100)
        combined_failure_rate *= failure_rate
    
    return (1 - combined_failure_rate) * 100

# Two redundant servers, each 99% available:
# Combined = 1 - (0.01 × 0.01) = 1 - 0.0001 = 99.99%
print(f"Parallel: {calculate_parallel_availability([99.0, 99.0]):.4f}%")
# Output: Parallel: 99.9900%
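
A common follow-up question is how many replicas a given target requires. The sketch below searches for the smallest count using the parallel formula above. `replicas_needed` and its `max_replicas` cap are assumptions made for this example.

```python
def replicas_needed(
    replica_availability_pct: float,
    target_pct: float,
    max_replicas: int = 20,
) -> int:
    """
    Smallest number of independent replicas whose parallel availability
    meets or exceeds target_pct.
    """
    if not 0 < replica_availability_pct < 100:
        raise ValueError("replica_availability_pct must be between 0 and 100")
    failure = 1 - replica_availability_pct / 100
    for n in range(1, max_replicas + 1):
        # Parallel availability: 1 - (probability that all n replicas are down)
        if (1 - failure ** n) * 100 >= target_pct:
            return n
    raise ValueError("target not reachable within max_replicas")


print(replicas_needed(99.0, 99.9))    # 2
print(replicas_needed(95.0, 99.99))   # 4
```

Real replicas often share a region, a deploy pipeline, or a dependency, so their failures are correlated and the achieved availability is usually lower than this independent-failure estimate.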

Error Budget Math

The error budget is the complement of your SLA target: 100% minus the target, applied to the measurement period.

def calculate_error_budget(sla_target_pct: float, period_days: int = 30):
    """
    Calculate the error budget for an SLA target.
    Error budget = (1 - SLA target) × period
    """
    # Round to avoid floating-point noise (100 - 99.9 is not exactly 0.1)
    error_budget_pct = round(100 - sla_target_pct, 6)
    total_minutes = period_days * 24 * 60
    
    budget_minutes = total_minutes * (error_budget_pct / 100)
    
    return {
        "sla_target_pct": sla_target_pct,
        "error_budget_pct": error_budget_pct,
        "budget_minutes": round(budget_minutes, 1),
        "budget_seconds": round(budget_minutes * 60),
    }

def calculate_budget_burn_rate(
    error_budget_minutes: float,
    actual_downtime_minutes: float,
    elapsed_period_fraction: float
) -> dict:
    """
    Calculate how fast the error budget is being consumed.
    
    burn_rate > 1 means you're consuming budget faster than sustainable.
    """
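    if error_budget_minutes <= 0:
        raise ValueError("error_budget_minutes must be positive (a 100% target has no budget)")
    
    # Both values are fractions (0-1); they are converted to percentages on return
    budget_consumed_pct = actual_downtime_minutes / error_budget_minutes
    expected_consumed_pct = elapsed_period_fraction
    
    # Burn rate = share of budget used / share of period elapsed
    burn_rate = budget_consumed_pct / elapsed_period_fraction if elapsed_period_fraction > 0 else 0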
    
    remaining_budget = error_budget_minutes - actual_downtime_minutes
    
    return {
        "burn_rate": round(burn_rate, 2),
        "budget_consumed_pct": round(budget_consumed_pct * 100, 1),
        "expected_consumed_pct": round(expected_consumed_pct * 100, 1),
        "remaining_budget_minutes": round(remaining_budget, 1),
        "on_track": burn_rate <= 1.0,
        "at_risk": 1.0 < burn_rate <= 2.0,
        "critical": burn_rate > 2.0
    }
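
Burn rate says whether you are ahead of or behind budget; a related question is how long the remaining budget will last. The sketch below assumes downtime keeps accruing at the average rate observed so far, which is a simplification. `days_until_budget_exhausted` is a hypothetical helper name.

```python
def days_until_budget_exhausted(
    error_budget_minutes: float,
    downtime_so_far_minutes: float,
    elapsed_days: float,
) -> float:
    """
    Days from now until the error budget runs out, assuming downtime continues
    at the average rate seen so far. Returns inf if there has been no downtime
    and 0.0 if the budget is already spent.
    """
    if elapsed_days <= 0:
        raise ValueError("elapsed_days must be positive")
    remaining = error_budget_minutes - downtime_so_far_minutes
    if remaining <= 0:
        return 0.0
    if downtime_so_far_minutes <= 0:
        return float("inf")
    downtime_per_day = downtime_so_far_minutes / elapsed_days
    return remaining / downtime_per_day


# 99.9% over 30 days gives a 43.2-minute budget; 20 minutes used after 10 days
print(f"{days_until_budget_exhausted(43.2, 20.0, 10.0):.1f} days")  # 11.6 days
```

In this example the budget would run out around day 21.6 of a 30-day period, which matches a burn rate above 1.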

Measuring Partial Outages

Not all incidents affect 100% of users. Some SLAs account for partial impact:

def calculate_weighted_availability(
    outage_periods: List[dict],
    total_period_minutes: float
) -> dict:
    """
    Calculate availability weighted by percentage of users affected.
    
    Some SLA definitions use:
    Effective downtime = actual_downtime × impact_percentage
    
    outage_periods: list of {
        duration_minutes: float,
        impact_pct: float  # 0-100, percentage of users affected
    }
    """
    weighted_downtime_minutes = sum(
        period["duration_minutes"] * (period["impact_pct"] / 100)
        for period in outage_periods
    )
    
    # Simple availability (any impact counts as downtime)
    total_downtime_minutes = sum(p["duration_minutes"] for p in outage_periods)
    simple_availability = (
        (total_period_minutes - total_downtime_minutes) / total_period_minutes * 100
    )
    
    # Weighted availability (impact percentage factors in)
    weighted_availability = (
        (total_period_minutes - weighted_downtime_minutes) / total_period_minutes * 100
    )
    
    return {
        "simple_availability_pct": round(simple_availability, 4),
        "weighted_availability_pct": round(weighted_availability, 4),
        "note": "Weighted availability is higher than simple availability when outages affect only some users"
    }
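
The choice of definition can decide whether a month counts as a breach. The worked example below uses two hypothetical incidents (the figures are made up for illustration) against a 99.9% target in a 30-day month.

```python
# Hypothetical incidents in a 30-day month (43,200 minutes)
incidents = [
    {"duration_minutes": 60, "impact_pct": 25},   # partial degradation
    {"duration_minutes": 10, "impact_pct": 100},  # full outage
]
total_period_minutes = 30 * 24 * 60

# Simple: every incident minute counts in full
raw_downtime = sum(i["duration_minutes"] for i in incidents)
# Weighted: each incident counts in proportion to the users affected
weighted_downtime = sum(
    i["duration_minutes"] * i["impact_pct"] / 100 for i in incidents
)

simple_pct = (1 - raw_downtime / total_period_minutes) * 100
weighted_pct = (1 - weighted_downtime / total_period_minutes) * 100

print(f"Simple:   {raw_downtime} min down, {simple_pct:.4f}%")
print(f"Weighted: {weighted_downtime} min down, {weighted_pct:.4f}%")
```

Here the simple method records 70 minutes of downtime, which exceeds the 43.2-minute budget, while the weighted method records 25 minutes, which stays within it. The SLA contract should state which method applies.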

Real-Time SLA Compliance Check

def check_current_sla_compliance(
    monitoring_client,
    customer_id: str,
    sla_target_pct: float
) -> dict:
    """
    Check current SLA compliance status for a customer.
    Useful for real-time dashboards and automated breach detection.
    """
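    # Naive UTC timestamp; checked_at values are assumed to be naive UTC too.
    # Note: datetime.utcnow() is deprecated as of Python 3.12.
    now = datetime.utcnow()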
    month_start = now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    
    # Calculate period stats
    total_minutes = (now - month_start).total_seconds() / 60
    
    # Get downtime from monitoring
    checks = monitoring_client.get_check_history(
        customer_id=customer_id,
        start=month_start,
        end=now
    )
    
    availability = calculate_availability_from_checks(checks, month_start, now)
    downtime_minutes = availability["downtime_minutes"]
    
    # Calculate budget
    # Approximation: assumes a 30-day month. For contractual numbers, use the
    # actual length of the current month (28-31 days).
    full_month_minutes = 30 * 24 * 60
    full_budget_minutes = full_month_minutes * (1 - sla_target_pct / 100)
    
    remaining_budget = full_budget_minutes - downtime_minutes
    
    # Fraction of the (approximate) month elapsed, capped at 1.0 so that
    # day 31 does not push it past 100%
    elapsed_fraction = min(total_minutes / full_month_minutes, 1.0)
    
    return {
        "sla_target": f"{sla_target_pct}%",
        "current_availability": f"{availability['availability_pct']:.4f}%",
        "downtime_so_far_minutes": round(downtime_minutes, 1),
        "budget_remaining_minutes": round(remaining_budget, 1),
        "in_breach": downtime_minutes > full_budget_minutes,
        "at_risk": remaining_budget < 15,  # Less than 15 minutes remaining
        "projected_month_end_availability": project_month_end(
            downtime_minutes, elapsed_fraction, full_month_minutes
        )
    }
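
The function above calls `project_month_end`, which this guide does not define. One possible definition is sketched below. It assumes downtime keeps accruing at the average rate observed so far, a plain linear extrapolation chosen for this example rather than the original helper.

```python
def project_month_end(
    downtime_minutes: float,
    elapsed_fraction: float,
    full_month_minutes: float,
) -> float:
    """
    Projected month-end availability (%), assuming downtime continues at the
    average rate observed so far (linear extrapolation).
    """
    if elapsed_fraction <= 0:
        return 100.0  # nothing observed yet, so nothing to extrapolate
    elapsed = min(elapsed_fraction, 1.0)
    # Scale downtime so far up to a full month, capped at the month length
    projected_downtime = min(downtime_minutes / elapsed, full_month_minutes)
    return round((1 - projected_downtime / full_month_minutes) * 100, 4)


# 10 minutes of downtime halfway through a 30-day month
print(project_month_end(10.0, 0.5, 30 * 24 * 60))  # 99.9537
```

Early in the month this projection is volatile, because a single short outage extrapolates to a large monthly total; treat it as a trend indicator, not a forecast.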

Conclusion

SLA math seems simple but has important nuances. Serial dependencies multiply availabilities, so each one lowers the end-to-end figure. Partial outages may be weighted differently depending on the SLA definition. And measurement methodology (what counts as "down") materially affects the numbers. The most important habit is measuring real availability continuously rather than calculating it only when disputes arise. AzMonitor's continuous monitoring collects the check-level data that makes accurate SLA calculations possible, tracks historical uptime for baseline and trend analysis, and provides the timestamps needed to calculate exact downtime durations rather than relying on memory or imprecise estimates.

AzMonitor Team
The AzMonitor team writes guides based on experience monitoring millions of endpoints daily across 10,000+ customer environments. Our expertise covers uptime monitoring, SRE practices, and reliability engineering.