Error budgets are one of the most powerful ideas in reliability engineering, yet most teams either don't use them or implement them incorrectly. The core insight is elegant: instead of treating reliability as "always maximize it," treat unreliability as a finite resource that can be spent strategically. You have a budget of downtime each month. How you spend it determines the balance between stability and velocity.
The Problem Error Budgets Solve
Without error budgets, reliability discussions are usually unproductive:
- Engineering: "We need to slow down deployments to improve reliability."
- Product: "We can't slow down — we have features to ship."
- Engineering: "But the service keeps breaking."
- Product: "Then fix the reliability problems."
Both sides are right, which is why these conversations go nowhere. Error budgets replace the debate with a data-driven policy: while budget remains, deploy freely; once it is exhausted, reliability work takes priority.
This isn't engineering imposing restrictions on product — it's an agreed-upon policy derived from your SLOs, which represent what your customers actually need.
How Error Budgets Work
Starting with a 99.9% availability SLO over a 30-day window:
Monthly minutes: 30 × 24 × 60 = 43,200 minutes
Allowed downtime: 0.1% × 43,200 = 43.2 minutes
Error Budget: 43.2 minutes per month
This 43.2 minutes is your budget. Each minute of downtime consumes some of it:
Budget Remaining = Allowed Downtime - Actual Downtime
If you had 15 minutes of downtime:
Budget Remaining = 43.2 - 15 = 28.2 minutes (65% remaining)
If you had 50 minutes of downtime:
Budget Remaining = 43.2 - 50 = -6.8 minutes (EXHAUSTED — SLO violated)
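The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a production calculator; the function names and the returned field names are assumptions made for this example.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    window_minutes = window_days * 24 * 60
    return (1 - slo_target) * window_minutes


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> dict:
    """Compare actual downtime against the budget (hypothetical field names)."""
    budget = error_budget_minutes(slo_target, window_days)
    remaining = budget - downtime_minutes
    return {
        "budget_minutes": round(budget, 1),
        "remaining_minutes": round(remaining, 1),
        "remaining_pct": round(100 * remaining / budget, 1),
        # A zero or negative remainder means the SLO has been violated
        "exhausted": remaining <= 0,
    }


print(budget_remaining(0.999, 15))  # the 15-minute case from the text
print(budget_remaining(0.999, 50))  # the 50-minute case from the text
```

A negative `remaining_pct` shows how far past the budget the service has gone, which is useful when planning recovery.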
Error Budget Burn Rate
The most useful operational concept isn't budget remaining — it's burn rate: how fast you're consuming the budget relative to the rate at which it accumulates.
A burn rate of 1.0 means you're consuming budget at exactly the rate it replenishes, so the budget lasts exactly one SLO window. A burn rate of 2.0 means you'll exhaust the budget in half the window. A burn rate of 0.5 means you'll reach the end of the window with half your budget unspent.
def calculate_burn_rate(
    slo_target,            # e.g., 0.999 for 99.9%
    actual_availability,   # measured over the window, e.g., 0.998
    window_hours,          # length of the measurement window, echoed in the result
    period_hours=30 * 24   # SLO period the budget covers (30 days)
):
    """
    Calculate error budget burn rate.
    burn_rate = 1.0: consuming budget at exactly the sustainable rate
    burn_rate > 1.0: will exhaust budget before end of period
    burn_rate < 1.0: under-consuming budget (could take more risks)
    """
    # Fraction of time we are allowed to be unavailable under the SLO
    allowed_error_rate = 1 - slo_target
    if allowed_error_rate <= 0:
        raise ValueError("slo_target must be below 1.0")
    # Fraction of time we were actually unavailable in the window
    actual_error_rate = 1 - actual_availability
    burn_rate = actual_error_rate / allowed_error_rate
    return {
        "burn_rate": round(burn_rate, 2),
        "window_hours": window_hours,
        "interpretation": (
            "consuming budget too fast" if burn_rate > 1 else
            "within budget" if burn_rate > 0.5 else
            "well within budget"
        ),
        # Time for a full budget to run out if this burn rate holds
        "hours_until_budget_exhausted": (
            period_hours / burn_rate if burn_rate > 0 else float("inf")
        ),
    }
Multi-Window Burn Rate Alerts
Google's Site Reliability Workbook recommends monitoring burn rate over multiple time windows to catch both slow leaks and fast burns:
# Multi-window burn rate alerting
alerts:
  # Fast burn: critical, page immediately
  - name: "Error Budget Fast Burn"
    conditions:
      # Burning through 2% of monthly budget in 1 hour
      - window: 1h
        burn_rate: 14.4  # 14.4x = exhausts budget in about 2 days at this rate
      # AND confirmed over 5 minutes (prevent false positives)
      - window: 5m
        burn_rate: 14.4
    severity: P1
    message: "Error budget burning fast, investigate immediately"
  # Slow burn: warning
  - name: "Error Budget Slow Burn"
    conditions:
      # 5% of monthly budget consumed in 6 hours
      - window: 6h
        burn_rate: 6
      # AND confirmed over 30 minutes
      - window: 30m
        burn_rate: 6
    severity: P2
    message: "Error budget consumption elevated, monitor closely"
  # Very slow burn: informational
  - name: "Error Budget Trending"
    conditions:
      - window: 3d
        burn_rate: 1.5  # 50% over budget pace
    severity: P3
    message: "Error budget consuming faster than average, review deployment plans"
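The thresholds in this configuration follow from a simple relationship: the burn rate that spends a given fraction of the budget within a window is that fraction multiplied by the period length and divided by the window length. The sketch below reproduces the 14.4 and 6 values under the assumption of a 30-day period; the function names are illustrative and do not belong to any alerting tool.

```python
PERIOD_HOURS = 30 * 24  # assumed 30-day SLO period, matching the examples above


def burn_rate_threshold(budget_fraction: float, window_hours: float) -> float:
    """Burn rate at which `budget_fraction` of the budget is spent in `window_hours`."""
    return budget_fraction * PERIOD_HOURS / window_hours


def should_alert(long_window_burn: float, short_window_burn: float,
                 threshold: float) -> bool:
    """Fire only when both the long and the short window exceed the threshold."""
    return long_window_burn >= threshold and short_window_burn >= threshold


fast = burn_rate_threshold(0.02, 1)  # 2% of the budget in 1 hour
slow = burn_rate_threshold(0.05, 6)  # 5% of the budget in 6 hours
print(f"fast-burn threshold: {fast:.1f}, slow-burn threshold: {slow:.1f}")

# A spike that has already subsided: the 1h window is high, the 5m window is not
print(should_alert(long_window_burn=20.0, short_window_burn=2.0, threshold=fast))
```

Requiring the short window to agree keeps the alert from firing on a burst that has already ended, and lets it clear soon after the problem is fixed.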
Prometheus Queries for Error Budget Tracking
# Error ratio over the 30-day window (fraction of requests that were not 2xx)
# Note: this counts every non-2xx response, including 3xx and 4xx, as an error
1 - (
  sum(rate(http_requests_total{job="api", status=~"2.."}[30d]))
  / sum(rate(http_requests_total{job="api"}[30d]))
)

# Error budget remaining (as a fraction of the total budget)
(
  (1 - 0.999) -  # Budget (1 - SLO target)
  (1 - sum(rate(http_requests_total{job="api", status=~"2.."}[30d])) / sum(rate(http_requests_total{job="api"}[30d])))
) / (1 - 0.999)

# 1-hour burn rate
(1 - sum(rate(http_requests_total{job="api", status=~"2.."}[1h])) / sum(rate(http_requests_total{job="api"}[1h])))
/
(1 - 0.999)

# 6-hour burn rate
(1 - sum(rate(http_requests_total{job="api", status=~"2.."}[6h])) / sum(rate(http_requests_total{job="api"}[6h])))
/
(1 - 0.999)
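To sanity-check what these queries return, the same ratios can be computed by hand from raw request counts. The following is a rough sketch; the function name, its arguments, and the sample counts are made up for illustration.

```python
def budget_from_counts(good_requests: int, total_requests: int,
                       slo_target: float = 0.999) -> dict:
    """Mirror the error ratio, burn rate, and remaining-budget queries."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    error_ratio = 1 - good_requests / total_requests
    budget = 1 - slo_target  # allowed error ratio under the SLO
    return {
        "error_ratio": error_ratio,
        "burn_rate": error_ratio / budget,
        "budget_remaining_fraction": (budget - error_ratio) / budget,
    }


# Hypothetical counts: 500 failed requests out of one million
result = budget_from_counts(good_requests=999_500, total_requests=1_000_000)
print({key: round(value, 4) for key, value in result.items()})
```

With these sample counts the service has used roughly half its budget, so the burn rate and the remaining fraction both come out near 0.5.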
The Error Budget Policy
The error budget is only useful if it drives decisions. Write down your error budget policy and get product and engineering to sign off:
# Error Budget Policy — [Team Name]
## SLO
Availability: 99.9% (43.2 minutes allowed downtime per 30 days)
## Budget Status Actions
### Budget Full (100% remaining)
Engineering may:
- Deploy multiple times per day
- Run experiments and A/B tests
- Deploy risky infrastructure changes
No restrictions on deployment velocity.
### Budget Healthy (50% to under 100% remaining)
Normal operations:
- Standard deployment process
- Changes should have rollback plans
- Risky changes need approval
### Budget Low (10-50% remaining)
Elevated caution:
- Freeze non-critical deployments
- Only deploy tested, well-reviewed changes
- Reliability work gets priority in sprint planning
### Budget Exhausted (< 10% remaining)
Emergency mode:
- Freeze all feature deployments
- Engineering focuses on reliability improvements
- No new features shipped until budget recovers
- Postmortem for how budget was exhausted
## Measuring Budget
- Measurement: External synthetic monitoring from AzMonitor (5-minute checks)
- Window: Rolling 30 days
- Source of truth: Monitoring dashboard [link]
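A policy like this is easier to enforce when the tiers are encoded somewhere tooling can read them. The sketch below maps a remaining-budget percentage to a tier; the tier names and the treatment of boundary values are assumptions for illustration, and your own policy may draw the lines differently.

```python
def budget_status(remaining_pct: float) -> str:
    """Map remaining budget (percent of total) to a policy tier.

    Boundaries are an assumption: each tier includes its lower bound.
    """
    if remaining_pct >= 100:
        return "full"
    if remaining_pct >= 50:
        return "healthy"
    if remaining_pct >= 10:
        return "low"
    return "exhausted"


for pct in (100, 65.3, 25, 9.9, -15.7):
    print(f"{pct:>6}% remaining -> {budget_status(pct)}")
```

Negative values fall through to the exhausted tier, which matches the case where the SLO has already been violated.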
Budget Recovery
When budget is exhausted, you need both a short-term and long-term plan:
def project_budget_recovery(
    current_deficit_minutes,
    monthly_budget_minutes,
    current_burn_rate
):
    """
    Estimate when the budget deficit will be cleared if the burn rate holds steady.
    Budget recovers at the full 1x rate when burn_rate = 0.
    """
    # Budget accumulates at this rate (minutes per day)
    budget_accumulation_per_day = monthly_budget_minutes / 30
    # Budget consumed per day if the current burn rate continues
    current_burn_per_day = budget_accumulation_per_day * current_burn_rate
    if current_burn_rate < 1:
        # Budget is recovering
        net_recovery_per_day = budget_accumulation_per_day - current_burn_per_day
        # minutes / (minutes per day) = days, so no unit conversion is needed
        days_to_recovery = current_deficit_minutes / net_recovery_per_day
        return {
            "recovering": True,
            "days_to_recovery": round(days_to_recovery, 1),
            "recommendation": "Budget recovering, monitor closely"
        }
    else:
        # Budget is flat (burn rate of exactly 1) or still being consumed
        return {
            "recovering": False,
            "days_to_recovery": None,
            "recommendation": "Freeze deployments immediately: budget is not recovering"
        }
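As a worked example of the recovery arithmetic, take the 6.8-minute deficit from the 50-minute outage discussed earlier. The standalone sketch below assumes budget accrues evenly across a 30-day period, which is a simplification of how a rolling window actually behaves; the function name is hypothetical.

```python
def days_to_clear_deficit(deficit_minutes: float, monthly_budget_minutes: float,
                          burn_rate: float, period_days: int = 30):
    """Days until the deficit is paid back, or None if it never is."""
    accrual_per_day = monthly_budget_minutes / period_days
    # Whatever is not burned each day goes toward paying back the deficit
    net_recovery_per_day = accrual_per_day * (1 - burn_rate)
    if net_recovery_per_day <= 0:
        return None
    return deficit_minutes / net_recovery_per_day


for rate in (0.0, 0.5, 1.2):
    days = days_to_clear_deficit(6.8, 43.2, rate)
    label = "never at this rate" if days is None else f"{days:.1f} days"
    print(f"burn rate {rate}: {label}")
```

Halving the burn rate does not halve the recovery time. At a burn rate of 0.5 only half of each day's accrual goes toward the deficit, so recovery takes twice as long as it would with no errors at all.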
Error Budgets for Different Services
Not all services need the same error budget approach:
| Service Type | Typical SLO | Budget Approach |
|---|---|---|
| External payment API | 99.99% | Very tight: track hourly |
| Customer-facing UI | 99.9% | Standard: track daily |
| Internal admin tools | 99.5% | Relaxed: track weekly |
| Batch jobs | 99.0% | Very relaxed: track monthly |
| Development environments | 95.0% | Minimal: track only for cost |
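To make the differences between these tiers concrete, the sketch below converts each SLO in the table into allowed downtime over a 30-day window. The dictionary simply restates the table; the function name is an assumption for this example.

```python
SERVICE_SLOS = {
    "External payment API": 0.9999,
    "Customer-facing UI": 0.999,
    "Internal admin tools": 0.995,
    "Batch jobs": 0.99,
    "Development environments": 0.95,
}


def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Error budget in minutes for an availability SLO over the window."""
    return (1 - slo_target) * window_days * 24 * 60


for service, slo in SERVICE_SLOS.items():
    minutes = allowed_downtime_minutes(slo)
    print(f"{service}: {minutes:.1f} min per 30 days")
```

The payment API's budget is about 4.3 minutes per 30 days, while a development environment gets 2,160 minutes (36 hours), which is why the tracking cadence differs so much between tiers.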
Integrating Budget Tracking with Deployments
Add budget status checks to your CI/CD pipeline:
#!/bin/bash
# deployment-gate.sh: check error budget before deploying
set -euo pipefail

# -f makes curl fail on HTTP errors instead of passing an error page to jq
BUDGET_REMAINING=$(curl -sf "$MONITORING_API/error-budget/remaining")
BUDGET_PCT=$(echo "$BUDGET_REMAINING" | jq -r '.budget_remaining_pct')

# Refuse to continue if the API did not return a number
if ! [[ "$BUDGET_PCT" =~ ^-?[0-9]+(\.[0-9]+)?$ ]]; then
    echo "ERROR: could not read error budget (got: '${BUDGET_PCT}')"
    exit 1
fi

echo "Error budget remaining: ${BUDGET_PCT}%"

if (( $(echo "$BUDGET_PCT < 10" | bc -l) )); then
    echo "ERROR: Error budget below 10%, deployment frozen"
    echo "To override: get approval from SRE team and run with FORCE_DEPLOY=true"
    if [ "${FORCE_DEPLOY:-}" != "true" ]; then
        exit 1
    fi
    echo "WARN: Force deploy enabled, proceeding with approval"
else
    echo "Budget check passed, proceeding with deployment"
fi
Conclusion
Error budgets transform reliability from an engineering constraint into a shared organizational resource. When product understands that risky deployments consume the budget that could be spent on new features later, reliability becomes everyone's concern — not just SRE's. Track burn rates at multiple windows, enforce your error budget policy consistently, and celebrate when the budget is full (it means your team can move fast). AzMonitor's continuous uptime and latency monitoring provides the raw data needed to calculate error budgets accurately, giving your team the numbers they need to make good decisions about deployment velocity versus reliability investment.