Alert fatigue is monitoring's dirty secret. A team sets up alerts for everything, the alerts fire constantly, engineers start ignoring them, and the monitoring that was supposed to protect production becomes worse than useless. Google's Site Reliability Engineering book (2016) suggests a ceiling of roughly two incidents per 12-hour on-call shift, yet a noisy setup can page far more often than anyone could investigate meaningfully. The paradox: more monitoring, configured incorrectly, leads to worse reliability outcomes.
The Anatomy of Alert Fatigue
Alert fatigue develops through a predictable progression:
- Over-alerting — Too many alerts configured with thresholds that are too sensitive
- False positives — Alerts that fire but don't represent actual user impact
- Normalization — Engineers start treating alerts as background noise
- Ignoring — Real incidents go unnoticed because the alert channel is muted or skipped
- Failure — A critical incident goes undetected because no one believed the alert was real
A team that ignores its PagerDuty notifications isn't lazy; it's being rational. If 95% of alerts are noise, ignoring them makes sense statistically. The problem is systemic, not individual.
Measuring Alert Fatigue
Before fixing alert fatigue, measure it:
-- Alert noise analysis over the past 30 days
SELECT
    alert_name,
    COUNT(*) AS total_fires,
    COUNT(CASE WHEN acknowledged_within_minutes <= 5 THEN 1 END) AS quick_ack,
    COUNT(CASE WHEN linked_incident_id IS NOT NULL THEN 1 END) AS led_to_incident,
    COUNT(CASE WHEN auto_resolved_without_action THEN 1 END) AS auto_resolved,
    ROUND(
        COUNT(CASE WHEN linked_incident_id IS NOT NULL THEN 1 END) * 100.0
        / COUNT(*), 2
    ) AS actionability_pct
FROM alert_history
WHERE fired_at > NOW() - INTERVAL '30 days'
GROUP BY alert_name
ORDER BY total_fires DESC;
Key metrics to review:
| Metric | Healthy Target | Alert Fatigue Indicator |
|---|---|---|
| Alerts per on-call shift | < 5 actionable | > 20 total |
| Alert actionability rate | > 50% | < 20% |
| Mean time to acknowledge | < 5 min | > 30 min |
| Auto-resolved without action | < 30% | > 60% |
| On-call feedback score | > 7/10 | < 5/10 |
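The fatigue indicators in the table can be checked mechanically once the per-shift numbers are available. The following is a minimal sketch, not a definitive implementation: the `ShiftMetrics` class and its field names are hypothetical, and the thresholds are copied from the "Alert Fatigue Indicator" column.

```python
from dataclasses import dataclass


@dataclass
class ShiftMetrics:
    """Hypothetical per-shift summary; field names are illustrative."""
    total_alerts: int
    actionability_rate: float     # fraction of fires linked to an incident (0.0-1.0)
    mean_minutes_to_ack: float
    auto_resolved_rate: float     # fraction resolved with no action (0.0-1.0)
    oncall_feedback_score: float  # survey score out of 10


def fatigue_indicators(m: ShiftMetrics) -> list[str]:
    """Return the names of metrics that cross the fatigue thresholds."""
    checks = {
        "alerts_per_shift": m.total_alerts > 20,
        "actionability_rate": m.actionability_rate < 0.20,
        "mean_time_to_ack": m.mean_minutes_to_ack > 30,
        "auto_resolved_rate": m.auto_resolved_rate > 0.60,
        "oncall_feedback_score": m.oncall_feedback_score < 5,
    }
    return [name for name, tripped in checks.items() if tripped]
```

A shift that trips several indicators at once is a stronger signal than any single number, so it may be worth reporting the full list rather than a pass/fail flag.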
The Five Categories of Bad Alerts
1. Alerts That Self-Resolve
An alert that fires and clears itself within 10 minutes, without any engineer action, is almost always noise: whatever triggered it did not need a human. Track auto-resolution rates:
from collections import defaultdict


def group_by_alert(alert_history):
    """Group firing records by name (assumes each record has an .alert_name attribute)."""
    grouped = defaultdict(list)
    for firing in alert_history:
        grouped[firing.alert_name].append(firing)
    return grouped.items()


def find_auto_resolving_alerts(alert_history, threshold_minutes=10):
    """
    Find alerts that frequently auto-resolve without human intervention.
    These are likely flapping on transient issues.
    """
    auto_resolve_rates = {}
    for alert_name, firings in group_by_alert(alert_history):
        # Count firings that cleared quickly with no recorded action
        auto_resolved = sum(
            1 for f in firings
            if f.resolved_at
            and (f.resolved_at - f.fired_at).total_seconds() < threshold_minutes * 60
            and f.action_taken is None
        )
        auto_resolve_rates[alert_name] = {
            "total_fires": len(firings),
            "auto_resolved": auto_resolved,
            "auto_resolve_rate": auto_resolved / len(firings) if firings else 0,
        }
    # Return alerts with > 50% auto-resolution rate
    return {
        name: data for name, data in auto_resolve_rates.items()
        if data["auto_resolve_rate"] > 0.5
    }
Fix: Add minimum duration requirements. An alert should only fire if the condition persists for at least 2-3 minutes.
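A minimum duration requirement can be sketched as follows. This is an illustration under stated assumptions rather than any particular tool's behavior: the function name `breached_for` and the `(timestamp, value)` sample format are made up for the example.

```python
from datetime import datetime, timedelta


def breached_for(samples, threshold, min_duration):
    """
    Return True if the condition has held without interruption, up to and
    including the latest sample, for at least `min_duration`.

    `samples` is a list of (timestamp, value) tuples sorted oldest first.
    """
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            # Remember when the current unbroken breach began
            if breach_start is None:
                breach_start = ts
        else:
            # Any healthy sample resets the clock
            breach_start = None
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= min_duration
```

With a two-minute requirement, a single bad sample followed by recovery never fires, while three consecutive bad one-minute samples do.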
2. Alerts Without Actionable Steps
If an engineer can't do anything when they receive an alert, it shouldn't page them. Common examples:
- "Third-party API is slow" (nothing you can do)
- "Background job queue depth high" (no action defined)
- "Disk 60% full" (not urgent, no immediate action required)
Create a test for each alert: "If I receive this at 3 AM on a Saturday, what exact steps would I take?" If the answer is "check and go back to sleep," it shouldn't page. If there's no answer, remove or silence the alert.
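The 3 AM test can also be run as a lint over alert definitions. The sketch below assumes each alert is a dict with hypothetical keys `name`, `pages`, and `runbook_steps`; real alerting tools store this information differently.

```python
def paging_alerts_without_runbook(alerts):
    """
    Return the names of alerts that page a human but define no concrete action.

    Each alert is a dict with hypothetical keys: "name", "pages" (bool),
    and "runbook_steps" (list of strings).
    """
    # Answers that amount to "there is nothing to do"
    non_actions = {"", "none", "check and go back to sleep"}
    flagged = []
    for alert in alerts:
        if not alert.get("pages", False):
            continue  # non-paging alerts are out of scope for this test
        steps = [s.strip().lower() for s in alert.get("runbook_steps", [])]
        real_steps = [s for s in steps if s not in non_actions]
        if not real_steps:
            flagged.append(alert["name"])
    return flagged
```

Anything this returns is a candidate for downgrading to a non-paging channel or for removal.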
3. Flapping Alerts
Flapping alerts fire and clear repeatedly in a short period, creating alert storms:
# Bad: Alert that flaps
alert:
  name: "CPU Usage High"
  condition: "cpu_usage > 80%"
  # No hysteresis - fires at 81%, clears at 79%, fires at 81%...

# Good: Alert with hysteresis and duration
alert:
  name: "CPU Usage High"
  condition: "cpu_usage > 85% for 5 consecutive minutes"
  recovery: "cpu_usage < 70% for 3 consecutive minutes"
  # Won't flip-flop around the threshold
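To see why separate fire and recovery thresholds reduce flapping, it helps to count state changes over the same data. This is a small sketch with a hypothetical helper, `count_fires`, and a made-up CPU series; it leaves out the duration requirement to isolate the effect of hysteresis.

```python
def count_fires(values, fire_above, clear_below):
    """
    Count how many times an alert fires over a series of samples.

    The alert fires when a value rises above `fire_above` and only clears
    once a value drops below `clear_below`. Passing the same number for
    both reproduces a plain threshold with no hysteresis.
    """
    firing = False
    fires = 0
    for value in values:
        if not firing and value > fire_above:
            firing = True
            fires += 1
        elif firing and value < clear_below:
            firing = False
    return fires


cpu = [79, 88, 84, 86, 83, 87, 69, 90]  # made-up CPU percentages
plain = count_fires(cpu, fire_above=85, clear_below=85)            # 4 fires
with_hysteresis = count_fires(cpu, fire_above=85, clear_below=70)  # 2 fires
```

The wider the gap between the two thresholds, the fewer transitions, at the cost of an alert that stays open longer after recovery begins.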
4. Alerts That Don't Reflect User Impact
Infrastructure metrics (CPU, memory, disk) can look alarming while users experience zero impact. And conversely, users can be having a terrible experience while infrastructure looks fine.
Prefer symptom-based alerts over cause-based alerts:
| Cause-based (worse) | Symptom-based (better) |
|---|---|
| CPU > 90% | Error rate > 2% |
| Memory > 85% | Request success rate < 99% |
| Disk > 80% | Checkout completion rate dropping |
| Queue depth > 1000 | Order processing delay > 5 min |
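A symptom-based check is computed from what users actually receive. The sketch below assumes you can obtain the HTTP status codes served in the evaluation window and treats 5xx responses as errors; both the function names and that definition of "error" are assumptions for illustration.

```python
def error_rate(status_codes):
    """Fraction of requests that returned a 5xx status (0.0 when there is no traffic)."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / len(status_codes)


def should_alert(status_codes, max_error_rate=0.02):
    """Symptom-based check: alert on what users see, regardless of CPU or memory."""
    return error_rate(status_codes) > max_error_rate
```

Note that this check stays quiet when CPU is at 95% but every request succeeds, which is the behavior the table argues for.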
5. Duplicate Alerts
Multiple alerts for the same underlying issue create an alert storm. If the database is down, you might get:
- "Payment service health check failing"
- "User service health check failing"
- "Database connectivity alert"
- "Background worker failures"
These are all symptoms of one root cause. Use alert correlation and deduplication:
# PagerDuty-style alert grouping (illustrative; check the exact schema for your setup)
alert_grouping:
  type: time
  timeout: 10  # Minutes: group alerts arriving within 10 minutes

# Or use intelligent grouping
alert_grouping:
  type: intelligent
  time_window: 600  # Seconds
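The idea behind time-based grouping can be sketched in a few lines. This is a simplified illustration, not PagerDuty's actual algorithm: the function name and the `(epoch_seconds, name)` input format are assumptions.

```python
def group_by_time_window(alerts, window_seconds=600):
    """
    Group alerts so that each alert arriving within `window_seconds` of the
    previous one joins the same group.

    `alerts` is a list of (epoch_seconds, name) tuples in any order.
    """
    groups = []
    last_ts = None
    for ts, name in sorted(alerts):
        if last_ts is None or ts - last_ts > window_seconds:
            groups.append([])  # gap too large: start a new group
        groups[-1].append(name)
        last_ts = ts
    return groups
```

Applied to the database outage above, the four alerts that arrive within a couple of minutes of each other collapse into one group, so the on-call engineer gets one notification instead of four.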
The Alert Audit Process
Run a quarterly alert audit:
# Alert Audit Checklist
For each alert in your system, answer:
1. **Has it fired in the past 90 days?**
   - No → Consider removing or disabling
2. **What percentage of fires led to actual incidents?**
   - < 20% → Too noisy, tune threshold or add duration requirement
3. **What action does an engineer take when this fires at 3 AM?**
   - "None" or "check and ignore" → Downgrade to Slack-only or remove
4. **Is there a more direct way to detect this user impact?**
   - If yes → Replace with symptom-based alert
5. **Does this alert fire during expected patterns (deployments, scheduled jobs)?**
   - Yes without suppression → Add maintenance window suppression
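The checklist translates fairly directly into code, which makes the audit repeatable. The following is a sketch only: the `stats` dict and all of its keys are hypothetical and would need to be filled in from your own alert history.

```python
def audit_alert(stats):
    """
    Apply the audit checklist to one alert and return a list of recommendations.

    `stats` is a dict with hypothetical keys: fires_90d, incident_fires_90d,
    action_at_3am, has_symptom_alternative, fires_during_expected_events,
    has_suppression.
    """
    recs = []
    fires = stats["fires_90d"]
    if fires == 0:
        recs.append("consider removing or disabling")
    elif stats["incident_fires_90d"] / fires < 0.20:
        recs.append("tune threshold or add duration requirement")
    if stats["action_at_3am"].strip().lower() in {"", "none", "check and ignore"}:
        recs.append("downgrade to Slack-only or remove")
    if stats["has_symptom_alternative"]:
        recs.append("replace with symptom-based alert")
    if stats["fires_during_expected_events"] and not stats["has_suppression"]:
        recs.append("add maintenance window suppression")
    return recs
```

An empty list means the alert passed every question; anything else becomes an action item for the audit.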
Alert Threshold Tuning
Set thresholds based on data, not intuition:
import numpy as np


def suggest_alert_threshold(metric_history, target_false_positive_rate=0.05):
    """
    Suggest alert threshold that results in target false positive rate.
    Assumes historical data represents normal operation.
    """
    values = [m['value'] for m in metric_history]
    if not values:
        raise ValueError("metric_history is empty")
    # Find percentile that represents the target
    # If we want < 5% false positives, threshold should be at 95th percentile
    suggested_threshold = np.percentile(values, (1 - target_false_positive_rate) * 100)
    return {
        "suggested_threshold": suggested_threshold,
        "p95": np.percentile(values, 95),
        "p99": np.percentile(values, 99),
        "max": max(values),
        "estimated_false_positive_rate": target_false_positive_rate,
        "note": "Review against historical incidents to validate"
    }
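A percentile only tells you how often a threshold would have fired, not whether it would have caught real problems. One way to review a candidate threshold against historical incidents is a simple backtest. This sketch uses a hypothetical `backtest_threshold` helper and assumes you have incident start and end times on the same clock as the metric samples.

```python
def backtest_threshold(samples, threshold, incident_windows):
    """
    Replay a threshold over history.

    samples: list of (timestamp, value) tuples.
    incident_windows: list of (start, end) tuples for known incidents,
        using the same timestamp type as `samples`.
    """
    def in_incident(ts):
        return any(start <= ts <= end for start, end in incident_windows)

    # An incident counts as caught if any sample inside it exceeds the threshold
    caught = sum(
        1 for start, end in incident_windows
        if any(start <= ts <= end and value > threshold for ts, value in samples)
    )
    # False positives are breaches that happened outside every incident
    normal = [value for ts, value in samples if not in_incident(ts)]
    false_positives = sum(1 for value in normal if value > threshold)
    return {
        "incidents_caught": caught,
        "incidents_total": len(incident_windows),
        "false_positive_rate": false_positives / len(normal) if normal else 0.0,
    }
```

Running this for a few candidate thresholds shows the trade-off directly: lowering the threshold catches more incidents but raises the false positive rate.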
Multi-Window Alert Logic
Use multiple evaluation windows to catch both gradual and sudden problems:
# Alert that catches both fast and slow failure modes
alerts:
  # Sudden spike
  - name: "Error Rate Sudden Spike"
    condition: "error_rate > 10% for 1 minute"
    severity: P1
  # Elevated for sustained period
  - name: "Error Rate Sustained Elevation"
    condition: "error_rate > 2% for 15 minutes"
    severity: P2
  # Gradual trend
  - name: "Error Rate Trending Up"
    condition: "error_rate_30min_avg > 1% AND error_rate_trend > 0.1%/hour"
    severity: P3
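The three windows can be evaluated together over a single stream of per-minute error rates. This is a sketch under assumptions: rates are fractions (0.02 means 2%), one sample arrives per minute, and the trend is estimated by comparing the two halves of the 30-minute window, which is one simple choice among many.

```python
def evaluate_error_rate(per_minute_rates):
    """
    Evaluate the three windows against per-minute error rates (fractions,
    oldest first) and return the highest matching severity, or None.
    """
    def sustained_above(limit, minutes):
        window = per_minute_rates[-minutes:]
        return len(window) == minutes and all(r > limit for r in window)

    if sustained_above(0.10, 1):   # sudden spike
        return "P1"
    if sustained_above(0.02, 15):  # sustained elevation
        return "P2"

    window = per_minute_rates[-30:]
    if len(window) == 30:
        average = sum(window) / 30
        # Assumed trend estimate: change between the two 15-minute halves,
        # whose midpoints are 15 minutes (0.25 hours) apart.
        older, newer = window[:15], window[15:]
        trend_per_hour = (sum(newer) / 15 - sum(older) / 15) / 0.25
        if average > 0.01 and trend_per_hour > 0.001:
            return "P3"
    return None
```

Checking the windows from most to least severe means a single incident produces one alert at the right priority rather than three overlapping ones.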
Alert Routing Improvements
Not all alerts should go to the same place:
routing:
  P1:
    channels:
      - pagerduty              # Phone call + SMS
      - slack_incidents        # #incidents channel
    escalation: 5_minutes
  P2:
    channels:
      - pagerduty_low_urgency  # Push notification only
      - slack_engineering      # #engineering-alerts channel
    escalation: 30_minutes
  P3:
    channels:
      - slack_engineering      # Slack only, no page
    notification_hours: any    # 24/7 Slack is fine
  P4:
    channels:
      - slack_monitoring       # Low-priority channel
    notification_hours: business_hours_only
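A routing table like this reduces to a small lookup at delivery time. The sketch below mirrors the channel names from the config; the `channels_for` function and the definition of business hours (Monday to Friday, 09:00 to 17:00 local time) are assumptions for illustration.

```python
from datetime import datetime

# Mirrors the routing config; channel names are taken from it as-is
ROUTING = {
    "P1": {"channels": ["pagerduty", "slack_incidents"], "business_hours_only": False},
    "P2": {"channels": ["pagerduty_low_urgency", "slack_engineering"], "business_hours_only": False},
    "P3": {"channels": ["slack_engineering"], "business_hours_only": False},
    "P4": {"channels": ["slack_monitoring"], "business_hours_only": True},
}


def channels_for(severity, now):
    """Return the channels to notify for `severity` at local time `now`."""
    rule = ROUTING[severity]
    # Assumed business hours: Monday-Friday, 09:00-17:00
    is_business_hours = now.weekday() < 5 and 9 <= now.hour < 17
    if rule["business_hours_only"] and not is_business_hours:
        return []  # hold low-priority alerts until business hours
    return rule["channels"]
```

Keeping the table as data rather than branching logic makes it easy to review in the same audit as the alerts themselves.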
Continuous Improvement
Run a weekly alert review as part of your SRE practice:
# Weekly Alert Review Agenda (15 minutes)
1. How many alerts fired this week? (vs last week)
2. Which alerts fired most frequently? Are they actionable?
3. Were there any missed alerts (users reported issues we didn't catch)?
4. Any alert that woke someone up without a real issue?
5. Action items: tune, remove, or add alerts
Track these weekly metrics on a dashboard for trend visibility.
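The first two agenda items can be pre-computed before the meeting. A minimal sketch, assuming each week's history is available as a plain list of alert names with one entry per firing; `weekly_summary` is a hypothetical helper.

```python
from collections import Counter


def weekly_summary(this_week, last_week, top_n=3):
    """
    Summarize alert volume for the weekly review.

    this_week / last_week: lists of alert names, one entry per firing.
    """
    return {
        "total": len(this_week),
        "change_vs_last_week": len(this_week) - len(last_week),
        "most_frequent": Counter(this_week).most_common(top_n),
    }
```

The `most_frequent` list gives the review a natural starting point: the alerts at the top are the ones whose actionability is worth questioning first.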
Conclusion
Alert fatigue is solvable, but it requires treating your monitoring configuration with the same care as your production code. Audit regularly, measure actionability rates, prefer symptom-based over cause-based alerts, and add duration requirements to prevent flapping. The goal isn't fewer alerts — it's alerts your team trusts enough to always investigate. AzMonitor's intelligent alerting includes options for consecutive failure requirements, multi-region confirmation, and flexible routing that help you build a monitoring system engineers respect rather than tune out.