Error budgets are one of the most powerful ideas in reliability engineering, yet many teams either don't use them or implement them incorrectly. The core insight is elegant: instead of treating reliability as something to maximize at all costs, treat unreliability as a finite resource that can be spent strategically. You have a budget of downtime each month. How you spend it determines the balance between stability and velocity.

The Problem Error Budgets Solve

Without error budgets, reliability discussions are usually unproductive:

Engineering: "We need to slow down deployments to improve reliability."
Product: "We can't slow down — we have features to ship."
Engineering: "But the service keeps breaking."
Product: "Then fix the reliability problems."

Both sides have a point, which is why these conversations go nowhere. Error budgets replace the debate with a data-driven policy: while budget remains, deploy freely. Once it is exhausted, reliability work takes priority.

This isn't engineering imposing restrictions on product — it's an agreed-upon policy derived from your SLOs, which represent what your customers actually need.

How Error Budgets Work

Start with a 99.9% availability SLO over a 30-day window:

Monthly minutes: 30 × 24 × 60 = 43,200 minutes
Allowed downtime: 0.1% × 43,200 = 43.2 minutes
Error Budget: 43.2 minutes per month

This 43.2 minutes is your budget. Each minute of downtime consumes some of it:

Budget Remaining = Allowed Downtime - Actual Downtime

If you had 15 minutes of downtime:
Budget Remaining = 43.2 - 15 = 28.2 minutes (65% remaining)

If you had 50 minutes of downtime:
Budget Remaining = 43.2 - 50 = -6.8 minutes (EXHAUSTED — SLO violated)
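
The arithmetic above can be wrapped in a small helper. This is a minimal sketch rather than a reference implementation; the function names `error_budget_minutes` and `budget_remaining` are illustrative and do not come from any particular library.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    window_minutes = window_days * 24 * 60
    return (1 - slo_target) * window_minutes


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> dict:
    """Remaining budget in minutes and as a percentage of the total budget."""
    budget = error_budget_minutes(slo_target, window_days)
    remaining = budget - downtime_minutes
    return {
        "budget_minutes": round(budget, 1),
        "remaining_minutes": round(remaining, 1),
        "remaining_pct": round(remaining / budget * 100, 1),
        "exhausted": remaining <= 0,
    }


# The two scenarios from the text: 15 and 50 minutes of downtime
print(budget_remaining(0.999, 15))
print(budget_remaining(0.999, 50))
```

A negative `remaining_minutes` means the SLO has already been violated for the window.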

Error Budget Burn Rate

The most useful operational concept isn't budget remaining — it's burn rate: how fast you're consuming the budget relative to the rate at which it accumulates.

A burn rate of 1.0 means you're consuming budget at exactly the rate it replenishes, so the budget lasts exactly one full period. A burn rate of 2.0 means you'll exhaust the budget in half the period. A burn rate of 0.5 means you'll have used only half the budget by the end of the period.

PERIOD_HOURS = 30 * 24  # length of the SLO period the budget covers

def calculate_burn_rate(
    slo_target,           # e.g., 0.999 for 99.9%
    actual_availability,  # measured over some window
    window_hours          # measurement window (reported back for context)
):
    """
    Calculate error budget burn rate.

    burn_rate = 1.0: consuming budget at exactly the sustainable rate
    burn_rate > 1.0: will exhaust budget before end of period
    burn_rate < 1.0: under-consuming budget (could take more risks)
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")

    # Fraction of the period the SLO allows to be unavailable
    allowed_error_rate = 1 - slo_target

    # Fraction that was actually unavailable in the measured window
    actual_error_rate = 1 - actual_availability

    burn_rate = actual_error_rate / allowed_error_rate

    return {
        "window_hours": window_hours,
        "burn_rate": round(burn_rate, 2),
        "interpretation": (
            "consuming budget too fast" if burn_rate > 1 else
            "within budget" if burn_rate > 0.5 else
            "well within budget"
        ),
        # Time for a full budget to run out if this burn rate continues
        "hours_until_budget_exhausted": (
            PERIOD_HOURS / burn_rate if burn_rate > 0 else float('inf')
        )
    }

Multi-Window Burn Rate Alerts

Google's Site Reliability Workbook recommends monitoring burn rate over multiple time windows to catch both fast burns and slow leaks:

# Multi-window burn rate alerting
alerts:
  # Fast burn — critical, page immediately
  - name: "Error Budget Fast Burn"
    conditions:
      # Burning through 2% of monthly budget in 1 hour
      - window: 1h
        burn_rate: 14.4  # 14.4x = exhausts budget in 2 days at this rate
      # AND confirmed over 5 minutes (prevent false positives)
      - window: 5m
        burn_rate: 14.4
    severity: P1
    message: "Error budget burning fast — investigate immediately"
    
  # Slow burn — warning
  - name: "Error Budget Slow Burn"
    conditions:
      # 5% of monthly budget consumed in 6 hours
      - window: 6h
        burn_rate: 6
      # AND confirmed over 30 minutes
      - window: 30m
        burn_rate: 6
    severity: P2
    message: "Error budget consumption elevated — monitor closely"
    
  # Very slow burn — informational
  - name: "Error Budget Trending"
    conditions:
      - window: 3d
        burn_rate: 1.5  # 50% over budget pace
    severity: P3
    message: "Error budget consuming faster than average — review deployment plans"
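
The thresholds in this configuration follow from one relationship: burn rate equals the fraction of budget consumed, multiplied by the period length and divided by the window length. The sketch below derives the 14.4 and 6 values and shows the "both windows must agree" condition. The function names are illustrative, and the 30-day period is an assumption carried over from the earlier examples.

```python
PERIOD_HOURS = 30 * 24  # 30-day SLO period (assumption, matching the examples above)


def burn_rate_threshold(budget_fraction: float, window_hours: float) -> float:
    """Burn rate that consumes `budget_fraction` of the budget in `window_hours`."""
    return budget_fraction * PERIOD_HOURS / window_hours


def should_alert(long_window_burn: float, short_window_burn: float,
                 threshold: float) -> bool:
    """Fire only when both the long and the short window exceed the threshold."""
    return long_window_burn >= threshold and short_window_burn >= threshold


fast = burn_rate_threshold(0.02, 1)  # 2% of the budget in 1 hour
slow = burn_rate_threshold(0.05, 6)  # 5% of the budget in 6 hours
print(round(fast, 1), round(slow, 1))

# The long window is still elevated but the short window has recovered: no page
print(should_alert(long_window_burn=16.0, short_window_burn=0.4, threshold=fast))
```

Requiring the short window as well means the alert clears soon after the problem stops, instead of staying active until the incident ages out of the long window.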

Prometheus Queries for Error Budget Tracking

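# Error ratio over the 30-day window (share of requests that did not return 2xx)
# Note: this matcher counts 3xx and 4xx responses as failures; adjust it if those are not errors for your service
1 - (
  sum(rate(http_requests_total{job="api", status=~"2.."}[30d]))
  / sum(rate(http_requests_total{job="api"}[30d]))
)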

# Error budget remaining (as fraction)
(
  (1 - 0.999) -  # Budget (1 - SLO target)
  (1 - sum(rate(http_requests_total{status=~"2.."}[30d])) / sum(rate(http_requests_total[30d])))
) / (1 - 0.999)

# 1-hour burn rate
(1 - sum(rate(http_requests_total{status=~"2.."}[1h])) / sum(rate(http_requests_total[1h])))
/
(1 - 0.999)

# 6-hour burn rate  
(1 - sum(rate(http_requests_total{status=~"2.."}[6h])) / sum(rate(http_requests_total[6h])))
/
(1 - 0.999)
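
These queries measure the budget in failed requests rather than in minutes of downtime. The same arithmetic in Python can help sanity-check a dashboard against raw request counts. This is a sketch under stated assumptions: `request_budget_status` is a hypothetical helper, and the counts in the example are made up for illustration.

```python
def request_budget_status(good_requests: int, total_requests: int,
                          slo_target: float = 0.999) -> dict:
    """Mirror the PromQL expressions above using plain request counts."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    error_ratio = 1 - good_requests / total_requests  # observed failure share
    budget_ratio = 1 - slo_target                     # failure share the SLO allows

    return {
        "error_ratio": error_ratio,
        "burn_rate": error_ratio / budget_ratio,
        "budget_remaining_fraction": (budget_ratio - error_ratio) / budget_ratio,
    }


# Illustrative counts: 500 failures out of 1,000,000 requests
status = request_budget_status(good_requests=999_500, total_requests=1_000_000)
print({name: round(value, 4) for name, value in status.items()})
```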

The Error Budget Policy

The error budget is only useful if it drives decisions. Write down your error budget policy and get product and engineering to sign off:

# Error Budget Policy — [Team Name]

## SLO
Availability: 99.9% (43.2 minutes allowed downtime per 30 days)

## Budget Status Actions

### Budget Full (100% remaining)
Engineering may:
- Deploy multiple times per day
- Run experiments and A/B tests
- Deploy risky infrastructure changes
No restrictions on deployment velocity.

### Budget Healthy (50-100% remaining)
Normal operations:
- Standard deployment process
- Changes should have rollback plans
- Risky changes need approval

### Budget Low (10-50% remaining)
Elevated caution:
- Freeze non-critical deployments
- Only deploy tested, well-reviewed changes
- Reliability work gets priority in sprint planning

### Budget Exhausted (< 10% remaining)
Emergency mode:
- Freeze all feature deployments
- Engineering focuses on reliability improvements
- No new features shipped until budget recovers
- Postmortem for how budget was exhausted

## Measuring Budget
- Measurement: External synthetic monitoring from AzMonitor (5-minute checks)
- Window: Rolling 30 days
- Source of truth: Monitoring dashboard [link]
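
A policy like this is easier to enforce when the tiers are encoded once and reused by dashboards and pipelines. The following sketch maps a remaining-budget percentage to a tier name. The tier labels and the handling of boundary values (exactly 10%, 50%, and 100%) are assumptions; the policy text leaves them open.

```python
def budget_tier(remaining_pct: float) -> str:
    """Map remaining error budget (as a percentage) to a policy tier."""
    if remaining_pct < 10:
        return "exhausted"  # freeze feature deployments
    if remaining_pct < 50:
        return "low"        # freeze non-critical deployments
    if remaining_pct < 100:
        return "healthy"    # normal operations
    return "full"           # no restrictions on deployment velocity


for pct in (100, 65.3, 30, -15.7):
    print(pct, budget_tier(pct))
```

A negative percentage (an SLO that is already violated) falls into the same tier as a nearly empty budget.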

Budget Recovery

When budget is exhausted, you need both a short-term and long-term plan:

def project_budget_recovery(
    current_deficit_minutes,
    monthly_budget_minutes,
    current_burn_rate
):
    """
    Estimate when a budget deficit is paid back if the burn rate holds steady.
    Budget accrues at monthly_budget_minutes / 30 per day; with burn_rate = 0
    the deficit shrinks at that full rate.
    """
    # Budget accumulates at this rate (minutes per day)
    budget_accumulation_per_day = monthly_budget_minutes / 30

    # Budget consumed per day if the current burn rate continues
    current_burn_per_day = budget_accumulation_per_day * current_burn_rate

    if current_burn_rate < 1:
        # Budget is recovering: accrual outpaces consumption
        net_recovery_per_day = budget_accumulation_per_day - current_burn_per_day
        # minutes / (minutes per day) = days, so no further unit conversion is needed
        days_to_recovery = current_deficit_minutes / net_recovery_per_day
        return {
            "recovering": True,
            "days_to_recovery": round(days_to_recovery, 1),
            "recommendation": "Budget recovering; monitor closely"
        }
    else:
        # Budget is still being consumed at or above the accrual rate
        return {
            "recovering": False,
            "days_to_recovery": None,
            "recommendation": "Freeze deployments immediately; budget still depleting"
        }
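
The projection above treats the budget as accruing steadily. With a rolling 30-day window, budget also comes back in steps, because each past incident stops counting once it ages out of the window. The sketch below estimates that effect at whole-day granularity. The function name, the `(days_ago, downtime_minutes)` input format, and the assumption of no further downtime are all illustrative.

```python
from typing import List, Optional, Tuple


def days_until_within_budget(
    incidents: List[Tuple[int, float]],  # (days_ago, downtime_minutes)
    budget_minutes: float,
    window_days: int = 30,
) -> Optional[int]:
    """
    Days until downtime inside the rolling window falls below the budget,
    assuming no further downtime occurs. Returns None if that never happens.
    """
    for day in range(window_days + 1):
        # An incident still counts while its age is under window_days
        still_counted = sum(
            minutes for days_ago, minutes in incidents
            if days_ago + day < window_days
        )
        if still_counted < budget_minutes:
            return day
    return None


# 50 minutes of downtime against a 43.2-minute budget:
# the 20-minute incident from 25 days ago leaves the window in 5 days
print(days_until_within_budget([(25, 20.0), (10, 30.0)], 43.2))  # prints 5
```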

Error Budgets for Different Services

Not all services need the same error budget approach:

| Service Type | Typical SLO | Budget Approach |
|---|---|---|
| External payment API | 99.99% | Very tight, track hourly |
| Customer-facing UI | 99.9% | Standard, track daily |
| Internal admin tools | 99.5% | Relaxed, track weekly |
| Batch jobs | 99.0% | Very relaxed, track monthly |
| Development environments | 95.0% | Minimal, track only for cost |
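
To make the difference between these tiers concrete, the sketch below converts each SLO in the table into allowed downtime over a 30-day window. The `SLO_TARGETS` mapping simply restates the table, and `allowed_downtime_minutes` is an illustrative helper name.

```python
# SLO targets restated from the table above
SLO_TARGETS = {
    "External payment API": 0.9999,
    "Customer-facing UI": 0.999,
    "Internal admin tools": 0.995,
    "Batch jobs": 0.99,
    "Development environments": 0.95,
}


def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Error budget in minutes for the given SLO and window."""
    return (1 - slo_target) * window_days * 24 * 60


for service, target in SLO_TARGETS.items():
    minutes = allowed_downtime_minutes(target)
    print(f"{service}: {minutes:.1f} minutes per 30 days")
```

The budgets range from roughly 4.3 minutes at 99.99% to 36 hours at 95%, which is why the tracking cadence differs so much between tiers.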

Integrating Budget Tracking with Deployments

Add budget status checks to your CI/CD pipeline:

#!/bin/bash
# deployment-gate.sh: check error budget before deploying
set -euo pipefail

# Fail closed: if the API call or JSON parsing fails, the script exits non-zero
BUDGET_REMAINING=$(curl -sf "$MONITORING_API/error-budget/remaining")
BUDGET_PCT=$(echo "$BUDGET_REMAINING" | jq -er '.budget_remaining_pct')

# Reject anything that is not a plain number before passing it to bc
if ! [[ "$BUDGET_PCT" =~ ^-?[0-9]+(\.[0-9]+)?$ ]]; then
    echo "ERROR: Could not read error budget (got '${BUDGET_PCT}')"
    exit 1
fi

echo "Error budget remaining: ${BUDGET_PCT}%"

if (( $(echo "$BUDGET_PCT < 10" | bc -l) )); then
    echo "ERROR: Error budget below 10%, deployment frozen"
    echo "To override: get approval from SRE team and run with FORCE_DEPLOY=true"

    if [ "${FORCE_DEPLOY:-false}" != "true" ]; then
        exit 1
    else
        echo "WARN: Force deploy enabled, proceeding with approval"
    fi
fi

echo "Budget check passed, proceeding with deployment"
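
The decision inside the gate is simple enough to express as a pure function, which makes it easy to unit test separately from the pipeline. The sketch below is one possible shape. `gate_decision` is a hypothetical name, and treating a missing budget reading as a block (failing closed) is a design assumption, not something the script above requires.

```python
from typing import Optional, Tuple


def gate_decision(budget_pct: Optional[float], force_deploy: bool = False,
                  freeze_below_pct: float = 10.0) -> Tuple[bool, str]:
    """Return (allow_deploy, reason) for a given remaining-budget percentage."""
    if budget_pct is None:
        # No reading from monitoring: block rather than guess
        return False, "budget unknown, blocking deployment"
    if budget_pct >= freeze_below_pct:
        return True, "budget check passed"
    if force_deploy:
        return True, "budget below threshold, proceeding under approved override"
    return False, "budget below threshold, deployment frozen"


print(gate_decision(65.3))
print(gate_decision(4.0))
print(gate_decision(4.0, force_deploy=True))
print(gate_decision(None))
```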

Conclusion

Error budgets transform reliability from an engineering constraint into a shared organizational resource. When product understands that risky deployments consume the budget that could be spent on new features later, reliability becomes everyone's concern — not just SRE's. Track burn rates at multiple windows, enforce your error budget policy consistently, and celebrate when the budget is full (it means your team can move fast). AzMonitor's continuous uptime and latency monitoring provides the raw data needed to calculate error budgets accurately, giving your team the numbers they need to make good decisions about deployment velocity versus reliability investment.

Tags: error budgets, SRE, reliability, SLO


AzMonitor Team

The AzMonitor team writes guides based on experience monitoring millions of endpoints daily across 10,000+ customer environments. Our expertise covers uptime monitoring, SRE practices, and reliability engineering.


Error Budgets: How to Use Unreliability as a Strategic Resource
