
SLA vs SLO vs SLI: Understanding Service Level Terminology

Demystify SLA, SLO, and SLI with clear definitions, practical examples, and guidance on setting targets that drive reliability without burning out your team.

AzMonitor Team · May 21, 2025 · 8 min read · 1,356 words · Updated January 20, 2026
Tags: SLA, SLO, SLI, reliability engineering

SLA, SLO, and SLI — three terms that get used interchangeably by people who should know better, causing endless confusion in reliability discussions. Each has a specific meaning, and mixing them up leads to miscommunication between engineering teams, product teams, and customers. Getting them right transforms vague reliability aspirations into measurable, accountable commitments.

The Definitions

SLI (Service Level Indicator) — A metric that measures a specific aspect of service quality. Raw measurements. Numbers.

SLO (Service Level Objective) — A target value or range for an SLI. This is what you're aiming for internally.

SLA (Service Level Agreement) — A contractual commitment (often with financial consequences) made to customers, typically based on SLOs.

The relationship: SLIs measure performance → SLOs define internal targets → SLAs make external commitments.

SLIs: What You Measure

SLIs are the foundation of everything else. A good SLI:

  • Is measurable continuously and automatically
  • Directly reflects user experience
  • Has a clear numerator and denominator (for ratios)

Common SLIs:

| Category | SLI Example | Formula |
|---|---|---|
| Availability | Request success rate | Successful requests / Total requests |
| Latency | Response time | Time from request to response |
| Error rate | Error percentage | Error responses / Total responses |
| Throughput | Requests per second | Total requests / Time window |
| Freshness | Data staleness | Age of most recent data update |
| Correctness | Valid responses | Correct responses / Total responses |
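
As a rough illustration of the ratio-style SLIs in the table, the sketch below computes a request success rate and an error percentage from a list of request records. The `Request` record, its field names, and the rules that 2xx counts as success and 5xx counts as an error are assumptions made for the example; your own definition of "successful" may differ.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record; field names are illustrative."""
    status: int        # HTTP status code
    latency_ms: float  # time from request to response


def availability_sli(requests):
    """Successful requests / total requests (assumes 2xx == success)."""
    if not requests:
        return None  # no traffic: the ratio is undefined
    good = sum(1 for r in requests if 200 <= r.status < 300)
    return good / len(requests)


def error_rate_sli(requests):
    """Error responses / total responses (assumes 5xx == error)."""
    if not requests:
        return None
    bad = sum(1 for r in requests if r.status >= 500)
    return bad / len(requests)


sample = [
    Request(200, 120.0),
    Request(200, 95.0),
    Request(503, 30.0),
    Request(201, 210.0),
]
print(availability_sli(sample))  # 3 of 4 requests succeeded -> 0.75
print(error_rate_sli(sample))    # 1 of 4 requests errored   -> 0.25
```

Returning `None` for an empty window is a deliberate choice in this sketch: a period with no traffic has no meaningful ratio, and treating it as 0% or 100% would distort the SLI.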

Choosing the Right SLIs

Not every metric makes a good SLI. CPU utilization is a poor SLI — it doesn't directly measure user experience. A server at 95% CPU might still serve users fine; a server at 30% CPU might have a broken database connection causing all requests to fail.

Start with the question: "What does a user actually experience when using this service?" Then work backward to what you can measure:

User experience: "Checkout is too slow"
↓
Measurement: "95th percentile checkout completion time"
↓
SLI: "p95 latency for POST /api/checkout requests"
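
To make the last step concrete, here is a minimal sketch of computing that p95 latency SLI from raw samples. It uses the nearest-rank percentile method, which is one of several common definitions. The sample tuples and the `percentile_nearest_rank` helper are hypothetical and only for illustration.

```python
import math


def percentile_nearest_rank(values, pct):
    """Return the pct-th percentile of values using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical (method, path, latency_ms) samples: 20 checkout requests
# between 100 ms and 290 ms, plus one unrelated slow request.
samples = [("POST", "/api/checkout", 100 + 10 * i) for i in range(20)]
samples.append(("GET", "/api/products", 5000))

# Only requests on the user journey we care about feed the SLI.
checkout_latencies = [
    ms for method, path, ms in samples
    if method == "POST" and path == "/api/checkout"
]

p95 = percentile_nearest_rank(checkout_latencies, 95)
print(f"p95 checkout latency: {p95} ms")
```

Filtering by method and path before taking the percentile matters here: the slow `GET` request would otherwise pull the figure away from what checkout users actually experience.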

SLOs: Your Internal Targets

An SLO is a target range for an SLI. SLOs should be:

  • Achievable — Based on current performance with some stretch
  • Meaningful — Users actually care if you miss them
  • Time-bounded — Measured over a rolling window (7 days, 30 days)

SLO Examples

slos:
  - name: "Checkout API Availability"
    sli: "successful_checkout_requests / total_checkout_requests"
    target: 99.9%
    window: 30d
    
  - name: "Checkout Latency"
    sli: "p95_checkout_latency_ms"
    target: 500ms
    window: 7d
    
  - name: "Login Success Rate"
    sli: "successful_logins / total_login_attempts"
    target: 99.95%
    window: 30d
    
  - name: "Search Results Latency"
    sli: "p50_search_latency_ms"
    target: 200ms
    window: 24h
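
One detail the configuration leaves implicit is the direction of each comparison: availability targets are minimums, while latency targets are maximums. The sketch below shows one way to encode that when checking a measured SLI against its target. The `SLO` class and its `higher_is_better` flag are assumptions for illustration, not part of any particular tool's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """Hypothetical SLO definition with an explicit comparison direction."""
    name: str
    target: float
    higher_is_better: bool  # True for success ratios, False for latencies

    def is_met(self, measured: float) -> bool:
        if self.higher_is_better:
            return measured >= self.target
        return measured <= self.target


checkout_availability = SLO("Checkout API Availability", 0.999, higher_is_better=True)
checkout_latency = SLO("Checkout Latency", 500.0, higher_is_better=False)

print(checkout_availability.is_met(0.9994))  # True: 99.94% is above the 99.9% target
print(checkout_latency.is_met(620.0))        # False: 620 ms is above the 500 ms target
```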

Setting SLO Targets

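Don't set SLOs at aspirational levels you can't currently achieve. Anchor the target to measured performance: the function below picks a level that your history shows you already meet in most periods, which you can then tighten as reliability improves: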

import numpy as np


def suggest_slo_target(historical_performance, target_percentile=5):
    """
    Suggest an SLO target based on historical performance.

    historical_performance holds one measurement per period (for example,
    weekly availability as a fraction), where higher is better.
    By default the target is the 5th percentile of that history, so about
    95% of past periods met or exceeded it.
    """
    if not historical_performance:
        raise ValueError("historical_performance must not be empty")

    # Sort ascending so the worst periods come first
    sorted_perf = sorted(historical_performance)
    n = len(sorted_perf)

    def at_percentile(pct):
        # Simple rank lookup, clamped so pct=100 stays inside the list
        return sorted_perf[min(int(n * pct / 100), n - 1)]

    suggested_target = at_percentile(target_percentile)

    return {
        "suggested_target": round(suggested_target, 4),
        "current_average": float(np.mean(historical_performance)),
        "current_p5": at_percentile(5),
        "current_p95": at_percentile(95),
        "note": (
            f"Target will be missed approximately {target_percentile}% "
            "of the time with current performance"
        ),
    }

Error Budgets

The most powerful concept that flows from SLOs is the error budget. An error budget is the complement of your SLO: the amount of unreliability you're allowed before breaking your commitment.

Error Budget = 1 - SLO Target

For a 99.9% availability SLO over 30 days:
Error Budget = 0.1% of 30 days
             = 0.001 × 30 × 24 × 60
             = 43.2 minutes of allowed downtime per month
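
The same arithmetic shows how quickly the budget shrinks as the target gains nines. This is a small sketch with a hypothetical helper name, assuming a time-based availability SLO over a fixed window.

```python
def allowed_downtime_minutes(slo_target: float, window_days: int) -> float:
    """Error budget expressed as minutes of downtime over the window."""
    total_minutes = window_days * 24 * 60
    return (1 - slo_target) * total_minutes


for target in (0.99, 0.999, 0.9995, 0.9999):
    minutes = allowed_downtime_minutes(target, 30)
    print(f"{target:.2%} over 30 days -> {minutes:.1f} minutes of downtime")
```

At 99.9% this reproduces the 43.2 minutes worked out above, while 99.99% leaves roughly 4.3 minutes for the whole 30-day window.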

Error budgets transform reliability from a constraint into a resource that teams can spend strategically:

| Error Budget Status | Interpretation | Team Action |
|---|---|---|
| Full (100%) | No incidents this period | Can take on risky changes |
| 50% remaining | Normal operation | Deploy carefully |
| 25% remaining | Elevated risk | Review change plans |
| 10% remaining | Budget nearly exhausted | Freeze risky deploys |
| 0% remaining | SLO violated | Incident review, no new features |

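def calculate_error_budget(slo_target, window_days, actual_availability):
    """Calculate error budget status for a time-based availability SLO.

    slo_target and actual_availability are fractions (e.g. 0.999), with
    actual_availability measured over the same window as the SLO.
    """
    # A 100% target leaves no budget and would divide by zero below
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")

    total_minutes = window_days * 24 * 60
    allowed_downtime = (1 - slo_target) * total_minutes
    actual_downtime = (1 - actual_availability) * total_minutes

    budget_remaining = allowed_downtime - actual_downtime
    budget_remaining_pct = (budget_remaining / allowed_downtime) * 100

    return {
        "slo_target": slo_target,
        "actual_availability": actual_availability,
        "allowed_downtime_minutes": round(allowed_downtime, 1),
        "actual_downtime_minutes": round(actual_downtime, 1),
        "budget_remaining_minutes": round(budget_remaining, 1),
        "budget_remaining_pct": round(budget_remaining_pct, 1),
        "slo_met": actual_availability >= slo_target,
        # Downtime used relative to downtime allowed over the whole window:
        # 1.0 means the budget is exactly spent, above 1.0 the SLO is missed
        "burn_rate": actual_downtime / allowed_downtime,
    }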
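
The status table can also be turned into a simple policy lookup. The sketch below is one possible reading of it: each row is treated as the upper bound of a band, which is an assumption, since the table only lists individual points. The function name is hypothetical.

```python
def budget_action(budget_remaining_pct: float) -> str:
    """Map remaining error budget (in percent) to a team action.

    Assumes each table row is the upper bound of a band, e.g. anything
    above 10% and up to 25% remaining means "Review change plans".
    """
    if budget_remaining_pct <= 0:
        return "Incident review, no new features"
    if budget_remaining_pct <= 10:
        return "Freeze risky deploys"
    if budget_remaining_pct <= 25:
        return "Review change plans"
    if budget_remaining_pct <= 50:
        return "Deploy carefully"
    return "Can take on risky changes"


for pct in (100, 40, 20, 8, -5):
    print(f"{pct:>4}% remaining -> {budget_action(pct)}")
```

Where exactly to draw the band edges is a team decision. The point of writing the policy down as code or config is that the response to a shrinking budget is agreed before an incident, not negotiated during one.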

SLAs: External Commitments

An SLA is what you promise customers in a contract. It's usually:

  • Less aggressive than your SLO (you need buffer)
  • Tied to remedies (credits, refunds)
  • Focused on a subset of your SLOs (the most customer-impactful ones)

The gap between SLO and SLA is intentional — it gives you buffer before you owe customers money:

Internal SLO: 99.95% availability
External SLA: 99.9% availability (the customer-facing commitment)

Buffer: 0.05 percentage points. Between 99.9% and 99.95%, you're violating
        your internal SLO but not your external SLA.

This buffer lets you investigate and fix issues before they become contractual violations.
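
A minimal sketch of how the buffer creates three distinct states, using the 99.95% SLO and 99.9% SLA from the example above as defaults. The function name and the status strings are illustrative assumptions.

```python
def reliability_status(availability: float, slo: float = 0.9995, sla: float = 0.999) -> str:
    """Classify measured availability against an internal SLO and an external SLA."""
    if sla > slo:
        raise ValueError("the SLA should not be stricter than the SLO")
    if availability >= slo:
        return "meeting SLO"
    if availability >= sla:
        return "SLO missed, SLA intact"  # the buffer zone: act before it costs money
    return "SLA breached"


for measured in (0.9997, 0.9992, 0.998):
    print(f"{measured:.2%} -> {reliability_status(measured)}")
```

The middle state is the useful one: it is an internal alarm that fires while the contractual commitment is still being met.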

SLA Credit Structures

A typical SLA credit structure:

| Availability | Credit |
|---|---|
| 99.9% or above | No credit (SLA met) |
| 99.0% to below 99.9% | 10% of monthly fees |
| 95.0% to below 99.0% | 25% of monthly fees |
| Below 95.0% | 50% of monthly fees |
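
The credit tiers in the table can be expressed as a small lookup. This sketch assumes each tier's lower bound is inclusive and its upper bound exclusive, so exactly 99.9% earns no credit. The function names and the example fee are hypothetical.

```python
def sla_credit_pct(availability_pct: float) -> int:
    """Return the credit as a percentage of monthly fees for a given availability."""
    if availability_pct >= 99.9:
        return 0   # SLA met
    if availability_pct >= 99.0:
        return 10
    if availability_pct >= 95.0:
        return 25
    return 50


def credit_amount(monthly_fee: float, availability_pct: float) -> float:
    """Service credit owed for the month, in the same currency as the fee."""
    return monthly_fee * sla_credit_pct(availability_pct) / 100


# Example: a hypothetical customer paying 2000 per month, 99.5% measured availability
print(credit_amount(2000, 99.5))  # 10% tier -> 200.0
```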

When drafting SLAs, be specific about:

  • Measurement methodology — Who measures? (preferably your own monitoring)
  • Exclusions — Scheduled maintenance, force majeure, client-caused issues
  • Claim process — How customers request credits (and with what evidence)
  • Payment method — Service credits, not cash refunds, are standard

Multi-Tier SLOs

Different parts of your service can have different SLOs:

# Tiered SLO configuration
service_slos:
  critical_paths:
    - name: "Authentication"
      availability_target: 99.99%
      latency_p99_target_ms: 200
      
    - name: "Payment Processing"
      availability_target: 99.99%
      latency_p99_target_ms: 3000
      
  standard_paths:
    - name: "Product Search"
      availability_target: 99.9%
      latency_p95_target_ms: 500
      
    - name: "User Profile"
      availability_target: 99.9%
      latency_p95_target_ms: 300
      
  best_effort:
    - name: "Analytics Dashboard"
      availability_target: 99.5%
      latency_p95_target_ms: 2000

Monitoring SLOs in Practice

Configure monitoring to track SLO compliance in real time:

# Availability SLO check (Prometheus query)
# "What fraction of requests in the last 30 days were successful?"
sum(rate(http_requests_total{status=~"2.."}[30d]))
/ sum(rate(http_requests_total[30d]))

# Latency SLO check
# "What fraction of requests completed within our target latency?"
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[30d]))
/ sum(rate(http_request_duration_seconds_count[30d]))

# Error budget burn rate (fast burn = urgent alert)
# Alert if the 1-hour error ratio exceeds 5x the rate the 99.9% SLO allows.
# Sustained for an hour, a 5x burn uses about 0.7% of a 30-day budget (5 / 720).
1 - (
  sum(rate(http_requests_total{status=~"2.."}[1h]))
  / sum(rate(http_requests_total[1h]))
) > (1 - 0.999) * 5  # 5x the budget rate
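
Burn-rate multipliers and "percent of budget consumed" are related by simple arithmetic, which is worth checking when choosing alert thresholds. The sketch below works through that relationship for a 30-day window. The helper names are hypothetical, and it assumes roughly steady traffic so that error ratio and budget consumption scale together.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    budget_ratio = 1 - slo_target
    if budget_ratio <= 0:
        raise ValueError("slo_target must be below 1")
    return error_ratio / budget_ratio


def budget_consumed(rate: float, window_hours: float, slo_window_days: int = 30) -> float:
    """Fraction of the whole error budget used by burning at `rate` for `window_hours`."""
    return rate * window_hours / (slo_window_days * 24)


def burn_rate_for(budget_fraction: float, window_hours: float, slo_window_days: int = 30) -> float:
    """Burn rate needed to use `budget_fraction` of the budget in `window_hours`."""
    return budget_fraction * slo_window_days * 24 / window_hours


# A 0.5% error ratio against a 99.9% SLO is roughly a 5x burn
print(round(burn_rate(0.005, 0.999), 2))

# Burning at 5x for one hour uses about 0.69% of a 30-day budget
print(f"{budget_consumed(5, 1):.2%}")

# Using 5% of a 30-day budget in one hour would require a 36x burn
print(round(burn_rate_for(0.05, 1), 1))
```

A burn rate of 1 spends the budget exactly over the full window, so the multiplier tells you how many times faster than "sustainable" the service is currently failing.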

Common Mistakes

Setting too many SLOs — More SLOs create more work and more noise. Focus on the 3-5 SLIs that actually reflect user experience.

SLO targets too high to achieve — "99.99% availability" sounds good, but it allows only about 4.3 minutes of downtime in 30 days and requires significant infrastructure investment. Start at achievable targets and tighten over time.

Treating SLOs as ceilings, not floors — An SLO isn't a license to be exactly 99.9% available. It's a floor: aim higher, and use the SLO as a trigger for action when performance drops toward it.

Ignoring customer-facing vs internal services — Internal services can have more relaxed SLOs. Customer-facing services need tighter targets.

Conclusion

SLI → SLO → SLA: measure performance, set targets, make commitments. Get this hierarchy right and you have the foundation of a disciplined reliability practice. Error budgets derived from SLOs give engineering teams a rational framework for deciding when to take risks and when to hunker down and fix things. AzMonitor provides the continuous measurement data that makes SLI tracking automatic and SLO reporting straightforward, so your team can focus on improving reliability rather than manually compiling numbers.
