Alert fatigue is monitoring's dirty secret. A team sets up alerts for everything, the alerts fire constantly, engineers start ignoring them, and the monitoring that was supposed to protect production becomes worse than useless. A 2016 Google SRE study found that some teams were receiving hundreds of pages per shift — far more than anyone could investigate meaningfully. The paradox: more monitoring configured incorrectly means worse reliability outcomes.

The Anatomy of Alert Fatigue

Alert fatigue develops through a predictable progression:

1. Over-alerting: Too many alerts configured with thresholds that are too sensitive
2. False positives: Alerts that fire but don't represent actual user impact
3. Normalization: Engineers start treating alerts as background noise
4. Ignoring: Real incidents go unnoticed because the alert channel is muted or skipped
5. Failure: A critical incident goes undetected because no one believed the alert was real

A team that ignores its PagerDuty notifications isn't lazy; it's being rational. If 95% of alerts are noise, ignoring them makes sense statistically. The problem is systemic, not individual.

Measuring Alert Fatigue

Before fixing alert fatigue, measure it:

-- Alert noise analysis over the past 30 days
SELECT
    alert_name,
    COUNT(*) as total_fires,
    COUNT(CASE WHEN acknowledged_within_minutes <= 5 THEN 1 END) as quick_ack,
    COUNT(CASE WHEN linked_incident_id IS NOT NULL THEN 1 END) as led_to_incident,
    COUNT(CASE WHEN auto_resolved_without_action THEN 1 END) as auto_resolved,
    ROUND(
        COUNT(CASE WHEN linked_incident_id IS NOT NULL THEN 1 END) * 100.0 
        / COUNT(*), 2
    ) as actionability_pct
FROM alert_history
WHERE fired_at > NOW() - INTERVAL '30 days'
GROUP BY alert_name
ORDER BY total_fires DESC;

Key metrics to review:

| Metric | Healthy Target | Alert Fatigue Indicator |
|---|---|---|
| Alerts per on-call shift | < 5 actionable | > 20 total |
| Alert actionability rate | > 50% | < 20% |
| Mean time to acknowledge | < 5 min | > 30 min |
| Auto-resolved without action | < 30% | > 60% |
| On-call feedback score | > 7/10 | < 5/10 |
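
The fatigue indicators in the table can be checked mechanically once the numbers are collected. The sketch below is one possible way to do that; the `ShiftMetrics` class and its field names are assumptions made for illustration, not part of any particular tool.

```python
from dataclasses import dataclass


@dataclass
class ShiftMetrics:
    """Per-shift on-call numbers; field names are illustrative assumptions."""
    total_alerts: int
    actionability_pct: float   # % of fires linked to a real incident
    mean_ack_minutes: float
    auto_resolved_pct: float
    feedback_score: float      # on-call survey score out of 10


def fatigue_indicators(m: ShiftMetrics) -> list[str]:
    """Return the fatigue indicators from the table that these metrics trip."""
    checks = [
        (m.total_alerts > 20, "more than 20 alerts per shift"),
        (m.actionability_pct < 20, "actionability below 20%"),
        (m.mean_ack_minutes > 30, "mean time to acknowledge above 30 min"),
        (m.auto_resolved_pct > 60, "more than 60% auto-resolved"),
        (m.feedback_score < 5, "on-call feedback below 5/10"),
    ]
    return [label for tripped, label in checks if tripped]


noisy = ShiftMetrics(total_alerts=42, actionability_pct=12.0,
                     mean_ack_minutes=18.0, auto_resolved_pct=71.0,
                     feedback_score=6.0)
print(fatigue_indicators(noisy))
```

For the sample shift, three indicators trip: alert volume, actionability, and auto-resolution. Values that fall between the healthy target and the fatigue indicator are not flagged by this sketch.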

The Five Categories of Bad Alerts

1. Alerts That Self-Resolve

An alert that fires and then clears on its own within 10 minutes, without any engineer action, is almost always noise. Track auto-resolution rates:

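from collections import defaultdict


def group_by_alert(alert_history):
    """
    Group firing records by alert name and return (name, firings) pairs.
    Assumes each record exposes an alert_name attribute.
    """
    grouped = defaultdict(list)
    for firing in alert_history:
        grouped[firing.alert_name].append(firing)
    return grouped.items()


def find_auto_resolving_alerts(alert_history, threshold_minutes=10):
    """
    Find alerts that frequently auto-resolve without human intervention.
    These are likely flapping on transient issues.
    """
    auto_resolve_rates = {}

    for alert_name, firings in group_by_alert(alert_history):
        # Count firings that cleared quickly with no recorded action
        auto_resolved = sum(
            1 for f in firings
            if f.resolved_at and
               (f.resolved_at - f.fired_at).total_seconds() < threshold_minutes * 60 and
               f.action_taken is None
        )

        auto_resolve_rates[alert_name] = {
            "total_fires": len(firings),
            "auto_resolved": auto_resolved,
            "auto_resolve_rate": auto_resolved / len(firings) if firings else 0
        }

    # Return alerts with > 50% auto-resolution rate
    return {
        name: data for name, data in auto_resolve_rates.items()
        if data["auto_resolve_rate"] > 0.5
    }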

Fix: Add minimum duration requirements. An alert should only fire if the condition persists for at least 2-3 minutes.
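
A minimum duration requirement can be sketched as follows. This is a simplified illustration, not how any specific monitoring tool implements it; the function name and the `(timestamp, value)` sample format are assumptions.

```python
def fires_with_min_duration(samples, threshold, min_duration_s):
    """
    samples: list of (timestamp_seconds, value) pairs in time order.
    Returns the timestamps at which the alert would fire, i.e. when
    value > threshold has held continuously for min_duration_s.
    """
    fire_times = []
    breach_start = None   # when the current continuous breach began
    fired = False         # avoid re-firing during the same breach
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts
            if not fired and ts - breach_start >= min_duration_s:
                fire_times.append(ts)
                fired = True
        else:
            # Condition cleared: reset and wait for the next breach
            breach_start = None
            fired = False
    return fire_times


# One sample per minute: a one-minute blip, then a sustained breach.
samples = [(0, 50), (60, 95), (120, 50), (180, 92),
           (240, 93), (300, 94), (360, 95)]
print(fires_with_min_duration(samples, threshold=90, min_duration_s=0))
print(fires_with_min_duration(samples, threshold=90, min_duration_s=120))
```

With no duration requirement the alert fires twice, at 60 and 180 seconds, including once for the transient blip. With a two-minute requirement it fires once, at 300 seconds, when the sustained breach has lasted long enough.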

2. Alerts Without Actionable Steps

If an engineer can't do anything when they receive an alert, it shouldn't page them. Common examples:

"Third party API is slow" (nothing you can do)
"Background job queue depth high" (no action defined)
"Disk 60% full" (not urgent, no immediate action required)

Create a test for each alert: "If I receive this at 3 AM on a Saturday, what exact steps would I take?" If the answer is "check and go back to sleep," it shouldn't page. If there's no answer, remove or silence the alert.

3. Flapping Alerts

Flapping alerts fire and clear repeatedly in a short period, creating alert storms:

# Bad: Alert that flaps
alert:
  name: "CPU Usage High"
  condition: "cpu_usage > 80%"
  # No hysteresis - fires at 81%, clears at 79%, fires at 81%...

# Good: Alert with hysteresis and duration
alert:
  name: "CPU Usage High"
  condition: "cpu_usage > 85% for 5 consecutive minutes"
  recovery: "cpu_usage < 70% for 3 consecutive minutes"
  # Won't flip-flop around the threshold
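
To see why the gap between the fire and recovery thresholds matters, the sketch below counts fires over a series of CPU readings. It models only the threshold gap, not the "for N consecutive minutes" part, and the function name and sample readings are made up for illustration.

```python
def count_fires(values, fire_above, clear_below):
    """
    Count how many times an alert fires over a series of readings.
    The alert fires when a reading exceeds fire_above and only clears
    once a reading drops below clear_below.
    """
    firing = False
    fires = 0
    for v in values:
        if not firing and v > fire_above:
            firing = True
            fires += 1
        elif firing and v < clear_below:
            firing = False
    return fires


cpu = [78, 81, 79, 82, 79, 81, 86, 84, 79, 86, 72, 69]
print(count_fires(cpu, fire_above=80, clear_below=80))   # single threshold
print(count_fires(cpu, fire_above=85, clear_below=70))   # hysteresis band
```

On these readings the single 80% threshold fires four times, while the 85%/70% band fires once and clears only when usage has clearly recovered.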

4. Alerts That Don't Reflect User Impact

Infrastructure metrics (CPU, memory, disk) can look alarming while users experience zero impact. Conversely, users can have a terrible experience while the infrastructure looks fine.

Prefer symptom-based alerts over cause-based alerts:

| Cause-based (worse) | Symptom-based (better) |
|---|---|
| CPU > 90% | Error rate > 2% |
| Memory > 85% | Request success rate < 99% |
| Disk > 80% | Checkout completion rate dropping |
| Queue depth > 1000 | Order processing delay > 5 min |
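
A symptom-based check can be as simple as computing the error rate from request counts in the evaluation window. The following is a minimal sketch; the function names are hypothetical and the 2% default comes from the table above.

```python
def error_rate_pct(total_requests, failed_requests):
    """Percentage of requests that failed in the evaluation window."""
    if total_requests == 0:
        return 0.0   # no traffic is treated as no errors in this sketch
    return failed_requests * 100.0 / total_requests


def should_alert_on_symptom(total_requests, failed_requests, max_error_pct=2.0):
    """Alert on what users experience, regardless of CPU or memory."""
    return error_rate_pct(total_requests, failed_requests) > max_error_pct


# CPU might be at 95% here, but users are fine: 0.12% errors.
print(should_alert_on_symptom(total_requests=10_000, failed_requests=12))
# CPU might be at 40% here, but 4.5% of requests fail.
print(should_alert_on_symptom(total_requests=10_000, failed_requests=450))
```

The first call returns `False` and the second returns `True`. Whether zero traffic should itself raise an alert is a separate decision that this sketch leaves out.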

5. Duplicate Alerts

Multiple alerts for the same underlying issue create an alert storm. If the database is down, you might get:

"Payment service health check failing"
"User service health check failing"
"Database connectivity alert"
"Background worker failures"

These are all symptoms of one root cause. Use alert correlation and deduplication:

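# PagerDuty-style alert grouping (illustrative sketch, not literal syntax;
# check field names and units against the PagerDuty documentation)
alert_grouping:
  type: time
  timeout: 600  # Group alerts within 10 minutes

# Or use intelligent grouping
alert_grouping:
  type: intelligent
  time_window: 600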
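
The idea behind time-based grouping can be sketched in a few lines. This is an illustration of the general technique, not PagerDuty's actual algorithm; the function name and the `(timestamp, name)` tuple format are assumptions.

```python
def group_alerts_by_time(alerts, window_s=600):
    """
    alerts: list of (timestamp_seconds, alert_name) pairs, in any order.
    An alert joins the current group if it arrives within window_s of the
    previous alert in that group; otherwise it starts a new group.
    """
    groups = []
    for ts, name in sorted(alerts):
        if groups and ts - groups[-1][-1][0] <= window_s:
            groups[-1].append((ts, name))
        else:
            groups.append([(ts, name)])
    return groups


alerts = [
    (0, "Database connectivity alert"),
    (20, "Payment service health check failing"),
    (45, "User service health check failing"),
    (90, "Background worker failures"),
    (5000, "Disk usage high"),
]
for group in group_alerts_by_time(alerts):
    print(len(group), [name for _, name in group])
```

The four database-related alerts land in one group, so the on-call engineer would get one notification instead of four, and the unrelated alert more than an hour later starts its own group.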

The Alert Audit Process

Run a quarterly alert audit:

# Alert Audit Checklist

For each alert in your system, answer:

1. **Has it fired in the past 90 days?**
   - No → Consider removing or disabling
   
2. **What percentage of fires led to actual incidents?**
   - < 20% → Too noisy, tune threshold or add duration requirement
   
3. **What action does an engineer take when this fires at 3 AM?**
   - "None" or "check and ignore" → Downgrade to Slack-only or remove
   
4. **Is there a more direct way to detect this user impact?**
   - If yes → Replace with symptom-based alert
   
5. **Does this alert fire during expected patterns (deployments, scheduled jobs)?**
   - Yes without suppression → Add maintenance window suppression
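
Most of the checklist can be applied automatically if per-alert statistics are available. The sketch below assumes a hypothetical `AlertStats` record; question 4 (whether a symptom-based alert would be more direct) needs human judgment and is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertStats:
    """Per-alert audit inputs; all field names are illustrative."""
    name: str
    fires_last_90_days: int
    incidents_linked: int
    runbook_action: Optional[str]   # what the engineer does at 3 AM, if anything
    fires_during_deploys: bool
    has_maintenance_suppression: bool


def audit_alert(a: AlertStats) -> list[str]:
    """Return the checklist recommendations that apply to this alert."""
    if a.fires_last_90_days == 0:
        return ["consider removing or disabling"]
    recs = []
    if a.incidents_linked / a.fires_last_90_days < 0.20:
        recs.append("tune threshold or add duration requirement")
    if a.runbook_action in (None, "none", "check and ignore"):
        recs.append("downgrade to Slack-only or remove")
    if a.fires_during_deploys and not a.has_maintenance_suppression:
        recs.append("add maintenance window suppression")
    return recs


cpu_alert = AlertStats("CPU Usage High", fires_last_90_days=40,
                       incidents_linked=2, runbook_action=None,
                       fires_during_deploys=True,
                       has_maintenance_suppression=False)
print(audit_alert(cpu_alert))
```

For the sample alert (5% actionability, no defined action, fires during deployments) all three recommendations are returned.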

Alert Threshold Tuning

Set thresholds based on data, not intuition:

import numpy as np


def suggest_alert_threshold(metric_history, target_false_positive_rate=0.05):
    """
    Suggest an alert threshold that results in the target false positive rate.
    Assumes metric_history is a list of {'value': ...} samples that
    represent normal operation.
    """
    values = [m['value'] for m in metric_history]
    if not values:
        raise ValueError("metric_history is empty")

    # Find the percentile that matches the target.
    # For < 5% false positives, the threshold sits at the 95th percentile,
    # so roughly 5% of normal samples would exceed it.
    suggested_threshold = np.percentile(values, (1 - target_false_positive_rate) * 100)

    return {
        "suggested_threshold": suggested_threshold,
        "p95": np.percentile(values, 95),
        "p99": np.percentile(values, 99),
        "max": max(values),
        "estimated_false_positive_rate": target_false_positive_rate,
        "note": "Review against historical incidents to validate"
    }

Multi-Window Alert Logic

Use multiple evaluation windows to catch both gradual and sudden problems:

# Alert that catches both fast and slow failure modes
alerts:
  # Sudden spike
  - name: "Error Rate Sudden Spike"
    condition: "error_rate > 10% for 1 minute"
    severity: P1
    
  # Elevated for sustained period
  - name: "Error Rate Sustained Elevation"
    condition: "error_rate > 2% for 15 minutes"
    severity: P2
    
  # Gradual trend
  - name: "Error Rate Trending Up"
    condition: "error_rate_30min_avg > 1% AND error_rate_trend > 0.1%/hour"
    severity: P3
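
The first two rules can be evaluated against a list of per-minute error rates, as in the sketch below. The function name and input format are assumptions, and the trend-based P3 rule is omitted because it depends on how the trend is computed.

```python
def evaluate_error_rate(per_minute_error_pct):
    """
    per_minute_error_pct: error rate (%) for each recent minute, oldest first.
    Returns the highest severity whose rule matches, or None.
    """
    recent = per_minute_error_pct
    # Sudden spike: the latest one-minute sample is above 10%
    if recent and recent[-1] > 10:
        return "P1"
    # Sustained elevation: every one of the last 15 minutes is above 2%
    last_15 = recent[-15:]
    if len(last_15) == 15 and all(r > 2 for r in last_15):
        return "P2"
    return None


print(evaluate_error_rate([0.5] * 20 + [12]))   # spike in the last minute
print(evaluate_error_rate([3] * 15))            # 15 minutes above 2%
print(evaluate_error_rate([3] * 14 + [1]))      # recovered in the last minute
```

These calls return `P1`, `P2`, and `None` respectively. Checking the P1 rule first means a spike is never reported at the lower severity.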

Alert Routing Improvements

Not all alerts should go to the same place:

routing:
  P1:
    channels:
      - pagerduty    # Phone call + SMS
      - slack_incidents  # #incidents channel
    escalation: 5_minutes
    
  P2:
    channels:
      - pagerduty_low_urgency  # Push notification only
      - slack_engineering     # #engineering-alerts channel
    escalation: 30_minutes
    
  P3:
    channels:
      - slack_engineering  # Slack only, no page
    notification_hours: any  # 24/7 Slack is fine
    
  P4:
    channels:
      - slack_monitoring  # Low-priority channel
    notification_hours: business_hours_only
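
The routing table above could be applied in code along these lines. This is a sketch under stated assumptions: the channel names mirror the config, while `is_business_hours` uses an assumed definition (Monday to Friday, 09:00 to 18:00 local time) and escalation is left out.

```python
from datetime import datetime

ROUTES = {
    "P1": ["pagerduty", "slack_incidents"],
    "P2": ["pagerduty_low_urgency", "slack_engineering"],
    "P3": ["slack_engineering"],
    "P4": ["slack_monitoring"],
}


def is_business_hours(when: datetime) -> bool:
    """Assumed definition: Monday to Friday, 09:00 up to 18:00 local time."""
    return when.weekday() < 5 and 9 <= when.hour < 18


def channels_for(severity: str, when: datetime) -> list[str]:
    """Return the channels to notify for an alert of this severity."""
    if severity == "P4" and not is_business_hours(when):
        return []   # held back; a real system might queue it instead
    return list(ROUTES[severity])


saturday_3am = datetime(2024, 1, 6, 3, 0)
monday_10am = datetime(2024, 1, 8, 10, 0)
print(channels_for("P1", saturday_3am))
print(channels_for("P4", saturday_3am))
print(channels_for("P4", monday_10am))
```

A P1 pages at any hour, while a P4 raised at 3 AM on a Saturday notifies nobody and the same P4 on a Monday morning goes to the low-priority channel.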

Continuous Improvement

Run a weekly alert review as part of your SRE practice:

# Weekly Alert Review Agenda (15 minutes)

1. How many alerts fired this week? (vs last week)
2. Which alerts fired most frequently? Are they actionable?
3. Were there any missed alerts (users reported issues we didn't catch)?
4. Any alert that woke someone up without a real issue?
5. Action items: tune, remove, or add alerts

Track these weekly metrics on a dashboard for trend visibility.

Conclusion

Alert fatigue is solvable, but it requires treating your monitoring configuration with the same care as your production code. Audit regularly, measure actionability rates, prefer symptom-based over cause-based alerts, and add duration requirements to prevent flapping. The goal isn't fewer alerts — it's alerts your team trusts enough to always investigate. AzMonitor's intelligent alerting includes options for consecutive failure requirements, multi-region confirmation, and flexible routing that help you build a monitoring system engineers respect rather than tune out.


AzMonitor Team

The AzMonitor team writes guides based on experience monitoring millions of endpoints daily across 10,000+ customer environments. Our expertise covers uptime monitoring, SRE practices, and reliability engineering.


Alert Fatigue: How to Fix Your Noisy Monitoring and Restore Trust in Alerts