On-Call Burnout: Causes, Consequences, and How to Fix It

Understand why on-call burnout happens, how to measure on-call load, and practical interventions that reduce engineer burnout without sacrificing reliability.

AzMonitor Team · April 2, 2025 · 8 min read · 1,639 words · Updated January 20, 2026
Tags: on-call burnout, alert fatigue, engineer wellness, incident management

On-call burnout is one of the most common causes of engineer attrition at companies with reliability responsibilities. Engineers who regularly lose sleep to alerts, dread their rotation weeks, and feel like they are firefighting instead of building eventually leave. The organizational cost is severe: replacing a senior engineer can cost the equivalent of 12-18 months of their salary in recruitment and ramp-up time, and reliability suffers during the gap. Addressing burnout is both a human obligation and a business imperative.

What Causes On-Call Burnout

On-call burnout doesn't come from being on-call itself — many engineers find on-call rewarding when the experience is well-designed. Burnout comes from specific, fixable problems:

High alert volume — More than 5-10 alerts per on-call shift creates exhaustion. Engineers who get woken up multiple times per night for weeks stop being able to make good decisions.

Low-signal alerts — Alerts that are noisy, flaky, or irrelevant destroy trust in the monitoring system. When engineers stop trusting that alerts represent real problems, they start ignoring or silencing them.

Unclear runbooks — Being woken at 3am for an alert with no documented resolution path is stressful and ineffective. Engineers who have to improvise every incident burn out faster.

No backup — On-call rotation with too few people means engineers cover too many weeks per year. A rotation with fewer than 5-6 people typically leads to excessive coverage frequency.

Lack of follow-through — Engineers who file post-incident action items that never get prioritized stop filing them. Nothing is more demoralizing than repeatedly fixing the same problem because the underlying cause was never addressed.

Business hours problems bleeding into on-call — When known technical debt, flaky tests, or deployment issues create on-call pages, it signals that the organization doesn't value engineers' time.

Measuring On-Call Load

You can't improve what you don't measure. Track these metrics per on-call rotation:

# on_call_metrics.py
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

@dataclass
class AlertEvent:
    alert_id: str
    triggered_at: datetime
    resolved_at: Optional[datetime]
    was_actionable: bool  # Did it require actual work?
    hour_of_day: int
    day_of_week: int
    engineer: str
    rotation_week: str

class OnCallMetricsCalculator:
    
    def calculate_rotation_metrics(self, alerts: List[AlertEvent], rotation_week: str):
        """Calculate metrics for a single on-call rotation."""
        rotation_alerts = [a for a in alerts if a.rotation_week == rotation_week]
        
        total_alerts = len(rotation_alerts)
        sleep_disrupting_alerts = [
            a for a in rotation_alerts
            if a.hour_of_day < 7 or a.hour_of_day >= 22  # Before 7am or after 10pm
        ]
        actionable_alerts = [a for a in rotation_alerts if a.was_actionable]
        noise_alerts = [a for a in rotation_alerts if not a.was_actionable]
        
        # Time to resolve
        resolved = [a for a in rotation_alerts if a.resolved_at]
        if resolved:
            resolution_times = [
                (a.resolved_at - a.triggered_at).total_seconds() / 60
                for a in resolved
            ]
            avg_resolution_minutes = sum(resolution_times) / len(resolution_times)
        else:
            avg_resolution_minutes = None
        
        return {
            "rotation_week": rotation_week,
            "total_alerts": total_alerts,
            "alerts_per_day": round(total_alerts / 7, 1),
            "sleep_disrupting_alerts": len(sleep_disrupting_alerts),
            "actionable_alerts": len(actionable_alerts),
            "noise_alerts": len(noise_alerts),
            "noise_ratio": round(len(noise_alerts) / total_alerts, 2) if total_alerts > 0 else 0,
            "avg_resolution_minutes": round(avg_resolution_minutes, 1) if avg_resolution_minutes is not None else None,
            "burnout_risk": self._assess_burnout_risk(total_alerts, sleep_disrupting_alerts)
        }
    
    def _assess_burnout_risk(self, total_alerts, sleep_disrupting_alerts):
        """Simple burnout risk assessment."""
        if len(sleep_disrupting_alerts) > 10 or total_alerts > 70:
            return "high"
        elif len(sleep_disrupting_alerts) > 5 or total_alerts > 35:
            return "medium"
        return "low"
    
    def calculate_engineer_annual_load(self, alerts, engineer, year):
        """Calculate annual on-call burden for an engineer."""
        engineer_alerts = [
            a for a in alerts
            if a.engineer == engineer
            and a.triggered_at.year == year
        ]
        
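        # Note: only weeks with at least one alert are counted, so a
        # completely quiet rotation week does not appear in weeks_on_call.
        weeks_on_call = len(set(a.rotation_week for a in engineer_alerts))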
        sleep_disruptions = sum(
            1 for a in engineer_alerts
            if a.hour_of_day < 7 or a.hour_of_day >= 22
        )
        
        return {
            "engineer": engineer,
            "year": year,
            "weeks_on_call": weeks_on_call,
            "total_alerts": len(engineer_alerts),
            "sleep_disrupting_alerts": sleep_disruptions,
            "avg_alerts_per_rotation": round(len(engineer_alerts) / weeks_on_call, 1) if weeks_on_call > 0 else 0,
            "recommendation": "reduce_rotation_frequency" if weeks_on_call > 10 else "acceptable"
        }
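
The benchmark table in the next section tracks median time to resolve, while the calculator above reports a mean. A single long incident can pull the mean far above what a typical page looks like. The following is a minimal sketch of a median-based variant; the function names and the `(triggered_at, resolved_at)` tuple shape are assumptions made for illustration, not part of the original calculator.

```python
from datetime import datetime, timedelta
from statistics import median
from typing import List, Optional, Tuple

Window = Tuple[datetime, Optional[datetime]]  # (triggered_at, resolved_at)


def resolution_minutes(windows: List[Window]) -> List[float]:
    """Minutes from trigger to resolution, skipping alerts that are still open."""
    return [
        (resolved - triggered).total_seconds() / 60
        for triggered, resolved in windows
        if resolved is not None
    ]


def median_resolution_minutes(windows: List[Window]) -> Optional[float]:
    """Median time to resolve in minutes, or None if nothing was resolved."""
    minutes = resolution_minutes(windows)
    return round(median(minutes), 1) if minutes else None


# Illustrative data: three quick fixes, one long incident, one still open.
start = datetime(2025, 4, 2, 1, 0)
sample = [
    (start, start + timedelta(minutes=10)),
    (start, start + timedelta(minutes=12)),
    (start, start + timedelta(minutes=14)),
    (start, start + timedelta(minutes=240)),
    (start, None),
]
minutes = resolution_minutes(sample)
mean_minutes = round(sum(minutes) / len(minutes), 1)

print("median:", median_resolution_minutes(sample))
print("mean:", mean_minutes)  # pulled up by the 240-minute incident
```

Reporting both values side by side is often useful: a large gap between them suggests a few long incidents rather than a uniformly slow response.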

Healthy On-Call Metrics

Use these benchmarks as starting points and adjust them to your team's context:

| Metric | Healthy Range | Concerning | Critical |
|---|---|---|---|
| Alerts per on-call shift | < 5 | 5-15 | > 15 |
| Sleep-disrupting pages per week | 0-2 | 3-5 | > 5 |
| Alert noise ratio | < 10% | 10-30% | > 30% |
| Time to resolve (median) | < 30 min | 30-120 min | > 2 hours |
| Rotation frequency | Every 6+ weeks | Every 4-6 weeks | Every 1-3 weeks |
| Post-incident action items resolved | > 80% | 50-80% | < 50% |
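
To apply these thresholds automatically, the table can be encoded as data. The sketch below is one possible encoding; the metric keys and the `classify` function are hypothetical names, and the rotation frequency row is left out because its ranges overlap at six weeks.

```python
from typing import Dict, Tuple

# (healthy_below, critical_above): lower values are better.
LOWER_IS_BETTER: Dict[str, Tuple[float, float]] = {
    "alerts_per_shift": (5, 15),
    "sleep_disrupting_pages_per_week": (3, 5),
    "noise_ratio": (0.10, 0.30),
    "median_resolution_minutes": (30, 120),
}

# (healthy_above, critical_below): higher values are better.
HIGHER_IS_BETTER: Dict[str, Tuple[float, float]] = {
    "action_items_resolved_ratio": (0.80, 0.50),
}


def classify(metric: str, value: float) -> str:
    """Return 'healthy', 'concerning', or 'critical' for one metric value."""
    if metric in LOWER_IS_BETTER:
        healthy_below, critical_above = LOWER_IS_BETTER[metric]
        if value < healthy_below:
            return "healthy"
        return "critical" if value > critical_above else "concerning"
    if metric in HIGHER_IS_BETTER:
        healthy_above, critical_below = HIGHER_IS_BETTER[metric]
        if value > healthy_above:
            return "healthy"
        return "critical" if value < critical_below else "concerning"
    raise KeyError(f"No benchmark defined for {metric!r}")


# Example: classify one rotation's numbers (made-up values).
week = {"alerts_per_shift": 12, "noise_ratio": 0.35, "action_items_resolved_ratio": 0.9}
report = {metric: classify(metric, value) for metric, value in week.items()}
print(report)
```

Keeping the thresholds in dictionaries rather than in `if` chains makes it easy to tune them per team without touching the logic.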

Interventions That Actually Work

1. Alert Audit and Cleanup

The most impactful intervention is ruthlessly auditing alerts:

def run_alert_audit(alerts, days=30):
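    """
    Identify alerts that should be eliminated or modified.

    Each alert needs `rule_name`, `was_actionable`, and `auto_resolved`
    attributes; `rule_name` and `auto_resolved` are not part of the
    AlertEvent dataclass shown earlier and would have to be added.
    `days` is the length of the period the alerts were collected over.
    """
    audit_results = {
        "never_actionable": [],
        "always_auto_resolves": [],
        "too_frequent": [],
        "wrong_time": []  # declared for completeness; no check below fills it
    }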
    
    # Group by alert rule
    by_rule = {}
    for alert in alerts:
        rule = alert.rule_name
        if rule not in by_rule:
            by_rule[rule] = []
        by_rule[rule].append(alert)
    
    for rule, rule_alerts in by_rule.items():
        actionable_count = sum(1 for a in rule_alerts if a.was_actionable)
        actionable_rate = actionable_count / len(rule_alerts)
        
        # Alerts that are almost never actionable
        if actionable_rate < 0.1 and len(rule_alerts) >= 5:
            audit_results["never_actionable"].append({
                "rule": rule,
                "fires": len(rule_alerts),
                "actionable_rate": f"{actionable_rate:.0%}",
                "recommendation": "Delete or disable"
            })
        
        # Alerts that always auto-resolve without intervention
        auto_resolved = [a for a in rule_alerts if a.auto_resolved]
        if len(auto_resolved) / len(rule_alerts) > 0.9:
            audit_results["always_auto_resolves"].append({
                "rule": rule,
                "recommendation": "Use longer evaluation window to filter transients"
            })
        
        # Alerts firing more than once per day
        daily_rate = len(rule_alerts) / days
        if daily_rate > 1:
            audit_results["too_frequent"].append({
                "rule": rule,
                "daily_rate": round(daily_rate, 1),
                "recommendation": "Raise threshold or add alert suppression"
            })
    
    return audit_results
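
The audit above declares a `wrong_time` bucket but never fills it. One plausible reading, and it is only an assumption about the intent, is "rules that fire mostly during sleep hours yet rarely need action at that time". A minimal standalone sketch of such a check follows; the `Fire` tuple, the function name, and all three thresholds are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Fire(NamedTuple):
    """One firing of an alert rule (hypothetical shape for this sketch)."""
    rule_name: str
    hour_of_day: int  # 0-23, local time of the on-call engineer
    was_actionable: bool


def is_sleep_hour(hour: int) -> bool:
    # Same window as the metrics code: before 7am or from 10pm onward.
    return hour < 7 or hour >= 22


def find_wrong_time_rules(
    fires: Iterable[Fire],
    min_fires: int = 5,
    night_share: float = 0.5,
    max_actionable_rate: float = 0.2,
) -> List[Dict[str, object]]:
    """Flag rules that mostly fire at night but are rarely actionable then."""
    by_rule = defaultdict(list)
    for fire in fires:
        by_rule[fire.rule_name].append(fire)

    flagged = []
    for rule, rule_fires in by_rule.items():
        night = [f for f in rule_fires if is_sleep_hour(f.hour_of_day)]
        if len(rule_fires) < min_fires or not night:
            continue  # too little data, or the rule never fires at night
        share = len(night) / len(rule_fires)
        actionable_rate = sum(f.was_actionable for f in night) / len(night)
        if share >= night_share and actionable_rate <= max_actionable_rate:
            flagged.append({
                "rule": rule,
                "night_share": round(share, 2),
                "night_actionable_rate": round(actionable_rate, 2),
                "recommendation": "Route to business hours or downgrade severity",
            })
    return flagged
```

Rules flagged this way are natural candidates for the non-paging severities rather than for deletion, since they may still matter during the day.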

2. Improve Runbook Coverage

Every alert that can wake someone up should have a runbook:

# Runbook Coverage Audit

## Check Coverage
- [ ] Every PagerDuty alert has a runbook_url field set
- [ ] Every runbook was updated in the last 6 months
- [ ] Every runbook includes: symptoms, immediate mitigation, root cause investigation
- [ ] Every runbook has been verified by an engineer who wasn't involved in writing it

## Runbook Quality Checklist
Good runbook: engineer can resolve within 15 minutes of reading it
Poor runbook: requires tribal knowledge not documented in the runbook

## Template
- **Alert name**: [exact alert name]
- **What broke**: [1-2 sentences explaining the issue]
- **Impact**: [who is affected and how severely]
- **Immediate mitigation** (do this first): [step-by-step]
- **Root cause investigation**: [commands/dashboards to check]
- **Escalation**: [when and who to escalate to]
- **Resolution verification**: [how to confirm the fix worked]
- **Long-term fix**: [links to action items for root cause resolution]
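
The first two checklist items (a runbook link is set, and it was updated in the last six months) are easy to automate. The sketch below assumes the paging rules can be exported into a simple list; the `PagingRule` shape, the function name, and the example URLs are all hypothetical, and no real alerting API is called.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, List, Optional


@dataclass
class PagingRule:
    """A paging alert rule plus runbook metadata (hypothetical shape)."""
    name: str
    runbook_url: Optional[str] = None
    runbook_updated: Optional[date] = None


def audit_runbook_coverage(
    rules: List[PagingRule],
    today: date,
    max_age: timedelta = timedelta(days=183),  # roughly six months
) -> Dict[str, object]:
    """Report which paging rules lack a runbook or have a stale one."""
    missing = [r.name for r in rules if not r.runbook_url]
    stale = [
        r.name
        for r in rules
        if r.runbook_url
        and (r.runbook_updated is None or today - r.runbook_updated > max_age)
    ]
    covered = len(rules) - len(missing)
    return {
        "coverage": round(covered / len(rules), 2) if rules else None,
        "missing_runbook": missing,
        "stale_runbook": stale,
    }


# Example with made-up rules and placeholder URLs.
rules = [
    PagingRule("api-5xx", "https://example.com/runbooks/api-5xx", date(2025, 3, 1)),
    PagingRule("disk-full"),
    PagingRule("queue-lag", "https://example.com/runbooks/queue-lag", date(2024, 1, 1)),
]
print(audit_runbook_coverage(rules, today=date(2025, 4, 2)))
```

The remaining checklist items (content quality and independent verification) still need a human review; a script can only confirm that a document exists and is recent.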

3. Establish Alert Severity SLAs

Not every alert needs to wake someone up immediately:

# Alert severity configuration
severity_policies:
  P1_critical:
    description: "User-facing production outage"
    response_time: "Immediate — wake on-call"
    paging: true
    notification_channels: [pagerduty, slack-incidents]
    
  P2_high:
    description: "Significant degradation, workaround exists"
    response_time: "Within 30 minutes during business hours"
    paging: false  # No sleep disruption for P2
    notification_channels: [slack-alerts]
    business_hours_only: true
    
  P3_medium:
    description: "Non-critical issue, monitoring for escalation"
    response_time: "Next business day"
    paging: false
    notification_channels: [slack-alerts]
    
  P4_low:
    description: "Information only, no action required immediately"
    response_time: "Weekly review"
    paging: false
    notification_channels: [email-digest]
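
How a policy like this turns into a routing decision depends on the alerting tool. As a rough illustration only, the sketch below mirrors the policy in a Python dictionary and decides whether to page, notify immediately, or hold a notification until business hours. The business-hours window and the `route_alert` function are assumptions, not features of any particular product.

```python
from datetime import datetime
from typing import Dict

# Mirrors the severity policy above; channel names are the same placeholders.
SEVERITY_POLICIES: Dict[str, Dict[str, object]] = {
    "P1_critical": {"paging": True, "channels": ["pagerduty", "slack-incidents"]},
    "P2_high": {"paging": False, "channels": ["slack-alerts"], "business_hours_only": True},
    "P3_medium": {"paging": False, "channels": ["slack-alerts"]},
    "P4_low": {"paging": False, "channels": ["email-digest"]},
}


def is_business_hours(now: datetime) -> bool:
    # Assumption: Monday-Friday, 09:00-17:59 in the timezone of `now`.
    return now.weekday() < 5 and 9 <= now.hour < 18


def route_alert(severity: str, now: datetime) -> Dict[str, object]:
    """Decide whether to page, notify now, or hold until business hours."""
    policy = SEVERITY_POLICIES[severity]
    deferred = bool(policy.get("business_hours_only")) and not is_business_hours(now)
    return {
        "page": bool(policy["paging"]),
        "channels": [] if deferred else list(policy["channels"]),
        "deferred_until_business_hours": deferred,
    }


# Example: the same P2 alert at 3am and at 10am on a weekday.
print(route_alert("P2_high", datetime(2025, 4, 2, 3, 0)))
print(route_alert("P2_high", datetime(2025, 4, 2, 10, 0)))
```

The key property is that only `P1_critical` can ever wake someone up; everything else is either delivered to a channel or held.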

4. Rotation Size and Frequency

Calculate minimum sustainable rotation size:

import math

def calculate_rotation_size(
    acceptable_weeks_per_year=8,
    weeks_in_year=52
):
    """
    Calculate minimum team size for sustainable on-call.
    
    If 8 weeks/year is acceptable burden per engineer,
    you need at least 52/8 = 6.5 engineers, rounded up to 7.
    """
    # math.ceil always rounds up; the built-in round() would turn 6.5 into 6
    # and leave each engineer above the acceptable load.
    min_engineers = math.ceil(weeks_in_year / acceptable_weeks_per_year)
    weeks_per_engineer = round(weeks_in_year / min_engineers, 1)
    
    return {
        "min_engineers": min_engineers,
        "weeks_per_engineer": weeks_per_engineer,
        "recommendation": (
            f"With {min_engineers} engineers, each covers "
            f"~{weeks_per_engineer} weeks/year"
        )
    }

# Example outputs:
# 8 weeks acceptable → need 7 engineers
# 6 weeks acceptable → need 9 engineers
# 4 weeks acceptable → need 13 engineers
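
The sizing function above answers "how many engineers for a target load". The reverse question, "what load does the current team carry", maps directly onto the rotation frequency row of the benchmark table. A minimal sketch follows; `rotation_load` and its `slots` parameter (the number of people on call at once, for example 2 for primary plus secondary) are hypothetical names, and an evenly shared rotation is assumed.

```python
from typing import Dict


def rotation_load(
    team_size: int,
    slots: int = 1,
    shift_weeks: int = 1,
    weeks_in_year: int = 52,
) -> Dict[str, float]:
    """On-call load per engineer for an evenly shared rotation."""
    if team_size < 1 or slots < 1:
        raise ValueError("team_size and slots must be at least 1")
    return {
        # Total on-call weeks to fill, divided evenly across the team.
        "weeks_on_call_per_year": round(weeks_in_year * slots / team_size, 1),
        # How often a turn comes round for each engineer.
        "on_call_every_n_weeks": round(team_size * shift_weeks / slots, 1),
    }


print(rotation_load(7))           # single on-call engineer, team of 7
print(rotation_load(4))           # small team
print(rotation_load(7, slots=2))  # primary + secondary
```

Adding a secondary on-call role doubles the number of weeks each engineer carries, which is worth checking before introducing a backup tier on a small team.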

5. On-Call Compensation and Recognition

Engineers who carry reliability responsibility should be compensated for it:

| Compensation Model | Description | Suitable For |
|---|---|---|
| Fixed on-call stipend | Monthly/weekly payment for being on rotation | Common in mid-size companies |
| Per-page compensation | Payment per alert received outside hours | Incentivizes alert reduction |
| Time off in lieu | Extra PTO after heavy on-call weeks | Startups with limited budget |
| On-call bonus pool | Quarterly bonus tied to reliability metrics | Ties compensation to outcomes |
| Reduced sprint commitment | Fewer story points during on-call weeks | Engineering-focused teams |

The Feedback Loop That Prevents Burnout

The most sustainable fix is establishing a feedback loop that continuously improves the on-call experience:

## On-Call Review Process

### After Every Rotation (30-minute review)
1. Review all alerts from the week
2. Mark each as: actionable / noise / needs-improvement
3. File issues for:
   - Noisy alerts to tune or remove
   - Missing runbooks
   - Recurring issues that need permanent fixes

### Monthly Metrics Review
Review team on-call metrics:
- Alert volume trend (should be decreasing)
- Sleep-disruption count trend (should be decreasing)
- Runbook coverage (should be increasing toward 100%)
- Action items closed vs opened (should be net positive)

### Quarterly Engineering Investment
Allocate engineering time specifically for:
- Reducing alert noise
- Writing and improving runbooks
- Fixing root causes of recurring incidents
- Improving test coverage to catch issues before production

This isn't optional toil — it's reliability investment.
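
The monthly metrics review can be backed by a small script that checks whether each series is moving in the expected direction. The sketch below is one simple way to do that, comparing the average of the later half of a series with the earlier half; the dictionary keys, function names, and sample numbers are illustrative assumptions.

```python
from typing import Dict, List


def trend(values: List[float]) -> str:
    """Compare the mean of the later half of a series with the earlier half."""
    if len(values) < 2:
        return "insufficient data"
    mid = len(values) // 2
    earlier = sum(values[:mid]) / mid
    later = sum(values[mid:]) / (len(values) - mid)
    if later < earlier:
        return "decreasing"
    return "increasing" if later > earlier else "flat"


def monthly_review(history: Dict[str, List[float]]) -> Dict[str, bool]:
    """Check each series against the direction the review process expects."""
    closed = sum(history["action_items_closed"])
    opened = sum(history["action_items_opened"])
    return {
        "alert_volume_ok": trend(history["alerts"]) == "decreasing",
        "sleep_disruptions_ok": trend(history["sleep_disruptions"]) == "decreasing",
        # Flat is accepted here so that coverage already at 100% still passes.
        "runbook_coverage_ok": trend(history["runbook_coverage"]) != "decreasing",
        # The backlog is not growing if at least as many items close as open.
        "action_items_ok": closed >= opened,
    }


# Example: four rotations of made-up data, oldest first.
history = {
    "alerts": [40, 35, 28, 22],
    "sleep_disruptions": [6, 5, 5, 3],
    "runbook_coverage": [0.6, 0.7, 0.8, 0.9],
    "action_items_closed": [3, 4, 5, 4],
    "action_items_opened": [5, 4, 3, 2],
}
print(monthly_review(history))
```

A half-versus-half comparison is deliberately crude; with only a handful of rotations per month, it is usually more robust than fitting a line to a few noisy points.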

Conclusion

On-call burnout is a systems problem, not an individual problem — it's caused by poor alert design, inadequate runbooks, rotation sizes that are too small, and feedback loops that don't prioritize fixing the root causes of recurring incidents. The solution is systematic: measure on-call load, audit alerts ruthlessly, establish clear severity policies, and create dedicated time to fix the underlying causes. AzMonitor's monitoring configuration — with proper check intervals, meaningful response assertions, and intelligent alert thresholds — forms part of the foundation that keeps on-call quiet. The best alert is the one that never fires because the problem was caught and fixed before it became an incident.

AzMonitor Team
The AzMonitor team writes guides based on experience monitoring millions of endpoints daily across 10,000+ customer environments. Our expertise covers uptime monitoring, SRE practices, and reliability engineering.