SLA, SLO, and SLI — three terms that get used interchangeably by people who should know better, causing endless confusion in reliability discussions. Each has a specific meaning, and mixing them up leads to miscommunication between engineering teams, product teams, and customers. Getting them right transforms vague reliability aspirations into measurable, accountable commitments.
The Definitions
SLI (Service Level Indicator) — A metric that measures a specific aspect of service quality. Raw measurements. Numbers.
SLO (Service Level Objective) — A target value or range for an SLI. This is what you're aiming for internally.
SLA (Service Level Agreement) — A contractual commitment (often with financial consequences) made to customers, typically based on SLOs.
The relationship: SLIs measure performance → SLOs define internal targets → SLAs make external commitments.
SLIs: What You Measure
SLIs are the foundation of everything else. A good SLI:
- Is measurable continuously and automatically
- Directly reflects user experience
- Has a clear numerator and denominator (for ratios)
Common SLIs:
| Category | SLI Example | Formula |
|---|---|---|
| Availability | Request success rate | Successful requests / Total requests |
| Latency | Response time | Time from request to response |
| Error rate | Error percentage | Error responses / Total responses |
| Throughput | Requests per second | Total requests / Time window |
| Freshness | Data staleness | Age of most recent data update |
| Correctness | Valid responses | Correct responses / Total responses |
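As a minimal sketch of the ratio-style SLIs in the table, the snippet below computes a success rate and an error rate from a list of request records. The `Request` record, its field names, and the rule that only 5xx responses count as failures are assumptions made for illustration; the definition of a "good" request should match your own service.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record; field names are illustrative."""
    status: int         # HTTP status code
    duration_ms: float  # time from request to response


def ratio_sli(good: int, total: int) -> float:
    """Return good / total, treating an empty window as fully compliant."""
    return 1.0 if total == 0 else good / total


def availability(requests: list[Request]) -> float:
    # Assumption: any non-5xx response counts as a successful request.
    good = sum(1 for r in requests if r.status < 500)
    return ratio_sli(good, len(requests))


def error_rate(requests: list[Request]) -> float:
    # Complement of availability under the same success rule.
    return 1.0 - availability(requests)


sample = [
    Request(200, 120.0),
    Request(200, 95.0),
    Request(503, 30.0),
    Request(404, 12.0),
]
print(f"availability = {availability(sample):.2%}")
print(f"error rate   = {error_rate(sample):.2%}")
```

Keeping the numerator and denominator explicit makes it easy to audit what counts as "good" and what counts as "total" when the definition is questioned later.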
Choosing the Right SLIs
Not every metric makes a good SLI. CPU utilization is a poor SLI — it doesn't directly measure user experience. A server at 95% CPU might still serve users fine; a server at 30% CPU might have a broken database connection causing all requests to fail.
Start with the question: "What does a user actually experience when using this service?" Then work backward to what you can measure:
User experience: "Checkout is too slow"
↓
Measurement: "95th percentile checkout completion time"
↓
SLI: "p95 latency for POST /api/checkout requests"
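The last step of that chain can be sketched in a few lines: filter the request log down to the endpoint in question, then take the 95th percentile of its latencies. The log tuples and sample values are hypothetical, and the nearest-rank percentile used here is one of several common definitions; monitoring systems that estimate percentiles from histogram buckets will give slightly different numbers.

```python
import math


def percentile_nearest_rank(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples in window")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical request log entries: (method, path, latency in ms).
log = [
    ("POST", "/api/checkout", 180.0),
    ("POST", "/api/checkout", 220.0),
    ("GET", "/api/products", 40.0),
    ("POST", "/api/checkout", 950.0),
    ("POST", "/api/checkout", 310.0),
]

# Keep only the requests the SLI is defined over.
checkout_latencies = [
    ms for method, path, ms in log
    if method == "POST" and path == "/api/checkout"
]
p95_checkout_ms = percentile_nearest_rank(checkout_latencies, 95)
print(f"p95 checkout latency: {p95_checkout_ms} ms")
```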
SLOs: Your Internal Targets
An SLO is a target range for an SLI. SLOs should be:
- Achievable — Based on current performance with some stretch
- Meaningful — Users actually care if you miss them
- Time-bounded — Measured over a rolling window (7 days, 30 days)
SLO Examples
slos:
  - name: "Checkout API Availability"
    sli: "successful_checkout_requests / total_checkout_requests"
    target: 99.9%
    window: 30d
  - name: "Checkout Latency"
    sli: "p95_checkout_latency_ms"
    target: 500ms
    window: 7d
  - name: "Login Success Rate"
    sli: "successful_logins / total_login_attempts"
    target: 99.95%
    window: 30d
  - name: "Search Results Latency"
    sli: "p50_search_latency_ms"
    target: 200ms
    window: 24h
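Definitions like these only become useful once something compares them against measured SLIs. The sketch below shows one way to do that, assuming a hypothetical `SLO` class and hand-entered measurements. The point it illustrates is that the comparison direction differs: success ratios must stay at or above the target, while latencies must stay at or below it.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    """Hypothetical in-memory form of one SLO entry."""
    name: str
    target: float           # 0.999 for a ratio, 500.0 for milliseconds
    higher_is_better: bool  # True for success ratios, False for latency

    def is_met(self, measured: float) -> bool:
        if self.higher_is_better:
            return measured >= self.target
        return measured <= self.target


slos = [
    SLO("Checkout API Availability", 0.999, higher_is_better=True),
    SLO("Checkout Latency (p95 ms)", 500.0, higher_is_better=False),
]

# Hypothetical measurements for the current window.
measured = {
    "Checkout API Availability": 0.9994,
    "Checkout Latency (p95 ms)": 540.0,
}

for slo in slos:
    status = "met" if slo.is_met(measured[slo.name]) else "MISSED"
    print(f"{slo.name}: {status}")
```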
Setting SLO Targets
Don't set SLOs at aspirational levels you can't currently achieve. Derive the target from historical performance, so that you already meet it in most periods, and tighten it as the service improves:
import numpy as np


def suggest_slo_target(historical_performance, target_percentile=5):
    """
    Suggest an SLO target from historical performance samples.

    By default, suggests the 5th-percentile value, i.e. a target that
    about 95% of past periods met or exceeded.
    """
    if not historical_performance:
        raise ValueError("historical_performance must not be empty")

    # Sort ascending (assumes higher is better, as for availability or success rate)
    sorted_perf = sorted(historical_performance)
    n = len(sorted_perf)

    # Index of the requested percentile, clamped to the last element
    target_index = min(int(n * target_percentile / 100), n - 1)
    suggested_target = sorted_perf[target_index]

    return {
        "suggested_target": round(suggested_target, 4),
        "current_average": float(np.mean(historical_performance)),
        "current_p5": sorted_perf[min(int(n * 0.05), n - 1)],
        "current_p95": sorted_perf[min(int(n * 0.95), n - 1)],
        "note": f"Target will be missed approximately {target_percentile}% of the time with current performance",
    }
Error Budgets
The most powerful concept that flows from SLOs is the error budget. An error budget is the complement of your SLO: the amount of unreliability you can tolerate within the window before the objective is missed.
Error Budget = 1 - SLO Target
For a 99.9% availability SLO over 30 days:
Error Budget = 0.1% of 30 days
= 0.001 × 30 × 24 × 60
= 43.2 minutes of allowed downtime per month
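The same arithmetic is easy to script for other targets. The sketch below reproduces the 43.2-minute figure and, as an assumption beyond the text, also expresses the budget as a count of failed requests, which is often more natural for request-based SLIs. Function names and the request volume are illustrative.

```python
def allowed_downtime_minutes(slo_target: float, window_days: int) -> float:
    """Time-based error budget: minutes of full downtime the SLO tolerates."""
    return (1 - slo_target) * window_days * 24 * 60


def allowed_failed_requests(slo_target: float, total_requests: int) -> int:
    """Request-based error budget, rounded to the nearest whole request."""
    return round((1 - slo_target) * total_requests)


for target in (0.99, 0.999, 0.9999):
    minutes = allowed_downtime_minutes(target, window_days=30)
    print(f"{target:.2%} over 30 days -> {minutes:.2f} minutes of downtime")

# Hypothetical volume: 10 million requests in the window.
print(allowed_failed_requests(0.999, 10_000_000), "failed requests allowed")
```

Each extra nine shrinks the budget by a factor of ten, which is why moving from 99.9% to 99.99% is a much larger commitment than the numbers suggest at first glance.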
Error budgets transform reliability from a constraint into a resource that teams can spend strategically:
| Error Budget Status | Interpretation | Team Action |
|---|---|---|
| Full (100%) | No incidents this period | Can take on risky changes |
| 50% remaining | Normal operation | Deploy carefully |
| 25% remaining | Elevated risk | Review change plans |
| 10% remaining | Budget nearly exhausted | Freeze risky deploys |
| 0% remaining | SLO violated | Incident review, no new features |
def calculate_error_budget(slo_target, window_days, actual_availability):
    """Calculate error budget status for a time-based availability SLO."""
    total_minutes = window_days * 24 * 60
    allowed_downtime = (1 - slo_target) * total_minutes
    actual_downtime = (1 - actual_availability) * total_minutes
    budget_remaining = allowed_downtime - actual_downtime

    # A 100% target leaves no budget at all; avoid dividing by zero in that case
    if allowed_downtime > 0:
        budget_remaining_pct = (budget_remaining / allowed_downtime) * 100
        # Fraction of the window's budget consumed (1.0 = fully spent)
        burn_rate = actual_downtime / allowed_downtime
    else:
        budget_remaining_pct = 0.0
        burn_rate = float("inf")

    return {
        "slo_target": slo_target,
        "actual_availability": actual_availability,
        "allowed_downtime_minutes": round(allowed_downtime, 1),
        "actual_downtime_minutes": round(actual_downtime, 1),
        "budget_remaining_minutes": round(budget_remaining, 1),
        "budget_remaining_pct": round(budget_remaining_pct, 1),
        "slo_met": actual_availability >= slo_target,
        "burn_rate": burn_rate,
    }
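The status table above can be turned into a simple policy lookup. The table lists individual points (100%, 50%, 25%, 10%, 0%), so the bands between them in this sketch are an assumption; adjust the thresholds to whatever your team agrees on.

```python
def budget_action(remaining_pct: float) -> str:
    """Map remaining error budget (percent) to a team action.

    Band edges are an assumed interpretation of the status table:
    each row applies from its value down to the next row's value.
    """
    if remaining_pct <= 0:
        return "Incident review, no new features"
    if remaining_pct <= 10:
        return "Freeze risky deploys"
    if remaining_pct <= 25:
        return "Review change plans"
    if remaining_pct <= 50:
        return "Deploy carefully"
    return "Can take on risky changes"


for pct in (100, 50, 25, 10, 0, -15):
    print(f"{pct:>4}% remaining -> {budget_action(pct)}")
```

Writing the policy down as code, even in this small form, removes the argument about what "budget nearly exhausted" means in the middle of an incident.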
SLAs: External Commitments
An SLA is what you promise customers in a contract. It's usually:
- Less aggressive than your SLO (you need buffer)
- Tied to remedies (credits, refunds)
- Focused on a subset of your SLOs (the most customer-impactful ones)
The gap between SLO and SLA is intentional — it gives you buffer before you owe customers money:
Internal SLO: 99.95% availability
External SLA: 99.9% availability (the customer-facing commitment)
Buffer: 0.05% — if you're between 99.9% and 99.95%, you're violating
your internal SLO but not your external SLA.
This buffer lets you investigate and fix issues before they become contractual violations.
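A small sketch of how the two thresholds partition measured availability, using the 99.95% SLO and 99.9% SLA from the example above. The function name and the state labels are hypothetical.

```python
def classify(availability: float, slo: float = 0.9995, sla: float = 0.999) -> str:
    """Place a measured availability relative to the internal SLO and external SLA."""
    if sla > slo:
        raise ValueError("the SLA should not be stricter than the SLO")
    if availability >= slo:
        return "healthy"            # both SLO and SLA met
    if availability >= sla:
        return "slo_missed_sla_ok"  # inside the buffer: act before it becomes contractual
    return "sla_breached"           # customer remedies apply


for measured in (0.9997, 0.9992, 0.998):
    print(f"{measured:.2%} -> {classify(measured)}")
```

The middle state is the one the buffer exists for: it should trigger internal action while no customer remedy is owed yet.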
SLA Credit Structures
A typical SLA credit structure:
| Availability | Credit |
|---|---|
| 99.9% or above | No credit (SLA met) |
| 99.0% to below 99.9% | 10% of monthly fees |
| 95.0% to below 99.0% | 25% of monthly fees |
| Below 95.0% | 50% of monthly fees |
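The credit tiers above translate directly into a lookup. This sketch assumes each tier includes its lower bound and excludes its upper bound, and that availability is expressed in percent; the function names and the monthly fee are illustrative.

```python
def sla_credit_pct(availability_pct: float) -> int:
    """Return the credit, as a percentage of monthly fees, for a measured availability."""
    if availability_pct >= 99.9:
        return 0    # SLA met
    if availability_pct >= 99.0:
        return 10
    if availability_pct >= 95.0:
        return 25
    return 50


def sla_credit(monthly_fee: float, availability_pct: float) -> float:
    """Credit amount in the same currency as the monthly fee."""
    return monthly_fee * sla_credit_pct(availability_pct) / 100


# Hypothetical customer paying 2000 per month.
for measured in (99.95, 99.5, 97.0, 90.0):
    print(f"{measured}% -> credit {sla_credit(2000.0, measured):.2f}")
```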
When drafting SLAs, be specific about:
- Measurement methodology — Who measures? (preferably your own monitoring)
- Exclusions — Scheduled maintenance, force majeure, client-caused issues
- Claim process — How customers request credits (and with what evidence)
- Payment method — Service credits, not cash refunds, are standard
Multi-Tier SLOs
Different parts of your service can have different SLOs:
# Tiered SLO configuration
service_slos:
  critical_paths:
    - name: "Authentication"
      availability_target: 99.99%
      latency_p99_target_ms: 200
    - name: "Payment Processing"
      availability_target: 99.99%
      latency_p99_target_ms: 3000
  standard_paths:
    - name: "Product Search"
      availability_target: 99.9%
      latency_p95_target_ms: 500
    - name: "User Profile"
      availability_target: 99.9%
      latency_p95_target_ms: 300
  best_effort:
    - name: "Analytics Dashboard"
      availability_target: 99.5%
      latency_p95_target_ms: 2000
Monitoring SLOs in Practice
Configure monitoring to track SLO compliance in real time:
# Availability SLO check (Prometheus query)
# "What fraction of requests in the last 30 days were successful?"
sum(rate(http_requests_total{status=~"2.."}[30d]))
/ sum(rate(http_requests_total[30d]))
# Latency SLO check
# "What fraction of requests completed within our target latency?"
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[30d]))
/ sum(rate(http_request_duration_seconds_count[30d]))
# Error budget burn rate (fast burn = urgent alert)
# Alert when the 1-hour error rate exceeds 5x the rate that would exactly
# exhaust the budget (at 5x, about 0.7% of a 30-day budget burns per hour)
1 - (
  sum(rate(http_requests_total{status=~"2.."}[1h]))
  / sum(rate(http_requests_total[1h]))
) > (1 - 0.999) * 5  # 5x the budget rate
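The relationship between a burn-rate multiplier and the share of budget it consumes is worth working through once, because the two are easy to confuse. The sketch below assumes a 30-day SLO window and hypothetical function names. A burn rate of 1 means errors arrive at exactly the pace that would use up the whole budget by the end of the window.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are accruing."""
    return error_rate / (1 - slo_target)


def budget_consumed_fraction(rate: float, alert_window_hours: float,
                             slo_window_days: int = 30) -> float:
    """Fraction of the whole window's budget consumed at a given burn rate."""
    return rate * alert_window_hours / (slo_window_days * 24)


def burn_rate_for(budget_fraction: float, alert_window_hours: float,
                  slo_window_days: int = 30) -> float:
    """Burn rate needed to consume a given budget fraction within the alert window."""
    return budget_fraction * slo_window_days * 24 / alert_window_hours


# A 0.5% error rate against a 99.9% SLO is a 5x burn rate.
print(round(burn_rate(0.005, 0.999), 2))
# Sustained for one hour, 5x consumes under 1% of a 30-day budget.
print(f"{budget_consumed_fraction(5, 1):.2%}")
# Consuming 5% of a 30-day budget within one hour requires a much higher rate.
print(round(burn_rate_for(0.05, 1), 1))
```

When choosing an alert threshold, decide first how much budget you are willing to lose before someone is paged, then derive the multiplier from it.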
Common Mistakes
Setting too many SLOs — More SLOs create more work and more noise. Focus on the 3-5 SLIs that actually reflect user experience.
SLO targets too high to achieve — "99.99% availability" sounds good, but it leaves only about 4.3 minutes of downtime per 30 days and requires significant infrastructure investment. Start at achievable targets and tighten over time.
Treating SLOs as ceilings, not floors — An SLO isn't a license to run at exactly 99.9% availability. It is a floor: aim higher, and use the SLO as a trigger for action.
Ignoring customer-facing vs internal services — Internal services can have more relaxed SLOs. Customer-facing services need tighter targets.
Conclusion
SLI → SLO → SLA: measure performance, set targets, make commitments. Get this hierarchy right and you have the foundation of a disciplined reliability practice. Error budgets derived from SLOs give engineering teams a rational framework for deciding when to take risks and when to hunker down and fix things. AzMonitor provides the continuous measurement data that makes SLI tracking automatic and SLO reporting straightforward, so your team can focus on improving reliability rather than manually compiling numbers.