Most monitoring systems have too many alerts, not too few. Teams enable every available alert, set tight thresholds, and configure notifications to every possible channel. The result is alert fatigue: engineers learn to ignore alerts because the signal-to-noise ratio is too low. Effective alerting requires discipline — fewer alerts, higher quality, with thresholds set from data rather than intuition.
The Four Properties of a Good Alert
Before creating any alert, verify it has all four of these properties:
Actionable — When this alert fires, there's a specific action someone should take. If there's no action, it's not an alert — it's a metric.
Accurate — The alert fires when there's a real problem and doesn't fire when there isn't. High false positive rates destroy trust. High false negative rates mean incidents go undetected.
Timely — The alert fires quickly enough that the response can prevent or limit impact. An alert that fires 20 minutes after the problem started means 20 minutes of undetected impact.
Routed correctly — The alert reaches the right person with the right priority through the right channel. The database alert goes to the database team, not everyone.
Alert Types and When to Use Each
## Alert Categories
### Symptom-based alerts (recommended for most cases)
Alert on what users experience, not what you think is causing it.
- "Error rate > 5%" (symptom) → Alert
- "Memory utilization > 80%" (possible cause) → Dashboard metric, not alert
Why: Symptom alerts catch problems regardless of root cause.
Cause-based alerts only catch the causes you anticipated.
### Threshold-based alerts
Alert when a metric crosses a predefined value.
- Response time > 2000ms for 3 consecutive checks
- Error rate > 1% over 5 minutes
- SSL certificate expires in < 30 days
Best for: Simple, well-understood metrics with clear thresholds.
### Rate-of-change alerts
Alert when a metric is changing faster than expected.
- Error rate doubled in the last 5 minutes
- Request volume dropped by > 50% from hourly average
Best for: Catching sudden changes that may not cross absolute thresholds yet.
### SLO burn rate alerts (multi-window)
Alert when error budget is being consumed faster than sustainable.
- "5% of monthly error budget burned in the last hour"
- Combines long-window (slow burn) and short-window (fast burn)
Best for: SLO-based alerting that distinguishes severity of budget consumption.
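The burn-rate idea can be made concrete with a little arithmetic. The sketch below assumes a 30-day budget period and a 99.9% SLO; the function names and the threshold of 36 are illustrative choices, not settings from any particular tool. Burn rate is the observed error rate divided by the error rate the SLO allows, so a burn rate of 1 spends exactly the whole budget over the period. The multi-window check requires both a longer and a shorter window to exceed the threshold, so the alert stops firing soon after the burn stops.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    allowed = 1.0 - slo_target
    if allowed <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_rate / allowed


def budget_fraction_burned(rate: float, window_hours: float,
                           period_hours: float = 30 * 24) -> float:
    """Fraction of the period's error budget consumed in the window at this burn rate."""
    return rate * window_hours / period_hours


def should_page(long_window_error_rate: float, short_window_error_rate: float,
                slo_target: float, threshold: float) -> bool:
    """Page only when both windows are burning budget faster than the threshold."""
    return (burn_rate(long_window_error_rate, slo_target) >= threshold
            and burn_rate(short_window_error_rate, slo_target) >= threshold)


# A burn rate of 36 sustained for one hour consumes about 5% of a 30-day budget.
fraction = budget_fraction_burned(36, window_hours=1)

# Both windows well above the threshold: page.
page_now = should_page(0.04, 0.05, slo_target=0.999, threshold=36)

# The long window is still elevated but the short window has recovered: do not page.
page_after_recovery = should_page(0.04, 0.001, slo_target=0.999, threshold=36)
```

With these assumptions, the "5% of monthly error budget burned in the last hour" example corresponds to a burn rate of roughly 36.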
Setting Thresholds
A common alerting mistake is setting thresholds by intuition. Derive them from the metric's historical distribution instead:
# threshold_analyzer.py
import statistics
from typing import List


def calculate_alert_thresholds(
    historical_values: List[float],
    percentile_for_warning: float = 95,
    percentile_for_critical: float = 99.5
) -> dict:
    """
    Calculate alert thresholds from historical metric data.
    Sets thresholds based on actual distribution, not guesses.
    """
    if not historical_values:
        raise ValueError("Need at least 1 historical data point")

    sorted_values = sorted(historical_values)
    n = len(sorted_values)

    def value_at(percentile: float) -> float:
        # Clamp the index so a percentile of 100 cannot run past the end of the list
        return sorted_values[min(int(n * percentile / 100), n - 1)]

    mean = statistics.mean(historical_values)
    # stdev needs at least two points
    stdev = statistics.stdev(historical_values) if n > 1 else 0

    return {
        "data_points": n,
        "distribution": {
            "p50": value_at(50),
            "p90": value_at(90),
            "p95": value_at(95),
            "p99": value_at(99),
            "p99_5": value_at(99.5),
            "mean": round(mean, 2),
            "stdev": round(stdev, 2)
        },
        "recommended_thresholds": {
            # Warning: values above the chosen warning percentile (default 95th)
            "warning": round(value_at(percentile_for_warning), 2),
            # Critical: values above the chosen critical percentile (default 99.5th)
            "critical": round(value_at(percentile_for_critical), 2),
        },
        "note": f"Thresholds set from {n} historical data points. Recalculate monthly."
    }


# Example usage for response time thresholds.
# load_last_30_days_of_p95_response_times() is a placeholder for your own data loader.
response_times = load_last_30_days_of_p95_response_times()
thresholds = calculate_alert_thresholds(response_times)
print(f"Warning threshold: {thresholds['recommended_thresholds']['warning']}ms")
print(f"Critical threshold: {thresholds['recommended_thresholds']['critical']}ms")
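The index-based lookup above returns an actual data point. As an alternative sketch, the standard library's `statistics.quantiles` (Python 3.8+) interpolates between neighbouring points, which can matter for small samples. The function name `interpolated_thresholds` and the synthetic sample data are assumptions for illustration only.

```python
import statistics
from typing import Dict, List


def interpolated_thresholds(values: List[float]) -> Dict[str, float]:
    """Warning at the 95th percentile, critical at the 99.5th, with interpolation."""
    if len(values) < 2:
        raise ValueError("statistics.quantiles needs at least 2 data points")
    # n=200 yields 199 cut points in 0.5% steps:
    # index 189 is the 95th percentile, index 198 is the 99.5th.
    cuts = statistics.quantiles(values, n=200, method="inclusive")
    return {"warning": round(cuts[189], 2), "critical": round(cuts[198], 2)}


# Synthetic response times from 100ms to 1099ms, standing in for real history
sample = [100 + i for i in range(1000)]
result = interpolated_thresholds(sample)
print(f"Warning threshold: {result['warning']}ms")
print(f"Critical threshold: {result['critical']}ms")
```

Whichever method is used, the important part is that both thresholds come from observed data and are recalculated as the distribution shifts.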
Alert Evaluation Windows
How you evaluate an alert affects both sensitivity and false positive rates:
# Alert evaluation patterns

# Pattern 1: Immediate (fires on first failure)
# Use for: Critical availability checks where any failure matters
- type: availability
  condition: "status == 'down'"
  evaluation: "1 of 1 checks"

# Pattern 2: Consecutive failures (requires N failures in a row)
# Use for: Reduces flapping, good for transient network issues
- type: availability
  condition: "status == 'down'"
  evaluation: "3 of 3 consecutive checks"
  check_interval: 60s
  # Time to alert: up to about 3 minutes after the failure begins

# Pattern 3: Time window (requires N failures in a window of M)
# Use for: Allows occasional flaps but catches sustained issues
- type: error_rate
  condition: "error_rate > 5%"
  evaluation: "3 of 5 checks in window"
  window: 5 minutes

# Pattern 4: Rolling average (evaluates against moving average)
# Use for: Smoothed metrics like response time percentiles
- type: latency
  condition: "p95_ms > 500"
  evaluation: "average over 5 minutes"
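The consecutive and windowed patterns above could be implemented along these lines. This is a minimal sketch: `WindowEvaluator` and its parameters are hypothetical names, not part of any monitoring product's API.

```python
from collections import deque


class WindowEvaluator:
    """Fires when at least `required` of the last `window` checks failed."""

    def __init__(self, required: int, window: int):
        if not 1 <= required <= window:
            raise ValueError("required must be between 1 and window")
        self.required = required
        # A bounded deque drops the oldest result automatically
        self.results = deque(maxlen=window)

    def record(self, failed: bool) -> bool:
        """Record one check result and return True if the alert should fire."""
        self.results.append(failed)
        return sum(self.results) >= self.required


# Pattern 2: "3 of 3 consecutive checks"
consecutive = WindowEvaluator(required=3, window=3)
# Pattern 3: "3 of 5 checks in window"
windowed = WindowEvaluator(required=3, window=5)

# A flapping service: True means the check failed
checks = [True, False, True, True, False]
consecutive_fired = [consecutive.record(c) for c in checks]
windowed_fired = [windowed.record(c) for c in checks]

print(consecutive_fired)  # [False, False, False, False, False]
print(windowed_fired)     # [False, False, False, True, True]
```

The same flapping sequence never satisfies the consecutive rule but does trip the windowed rule on the fourth check, which is the trade-off between the two patterns.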
Alert Routing Strategy
Different alerts should reach different people through different channels:
# alert_router.py
from enum import Enum


class AlertSeverity(Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


ROUTING_RULES = {
    AlertSeverity.CRITICAL: {
        "channels": ["pagerduty", "slack"],
        "notification_methods": ["push", "call", "sms"],
        "time_restriction": None,  # Any time
        "acknowledgment_timeout_minutes": 5,
        "escalation": True,
        "slack_channels": ["#incidents-active"],
        "slack_at_mention": "@oncall"
    },
    AlertSeverity.HIGH: {
        "channels": ["pagerduty", "slack"],
        "notification_methods": ["push", "sms"],
        "time_restriction": "business_hours",  # Only pages during business hours
        "acknowledgment_timeout_minutes": 30,
        "escalation": True,
        "slack_channels": ["#alerts-high"],
        "slack_at_mention": "@oncall-team"
    },
    AlertSeverity.MEDIUM: {
        "channels": ["slack", "email"],
        "notification_methods": [],  # No push/call/SMS
        "time_restriction": None,
        "acknowledgment_timeout_minutes": None,
        "escalation": False,
        "slack_channels": ["#alerts"],
        "slack_at_mention": None
    },
    AlertSeverity.LOW: {
        "channels": ["slack"],
        "notification_methods": [],
        "time_restriction": None,
        "acknowledgment_timeout_minutes": None,
        "escalation": False,
        "slack_channels": ["#monitoring-digest"],
        "slack_at_mention": None
    }
}


def route_alert(alert, team_mapping):
    """
    Route an alert to the appropriate channels based on severity and team.
    """
    severity = AlertSeverity(alert.severity)
    routing = ROUTING_RULES[severity]

    # Find team responsible for this service; fall back to the platform team
    team = team_mapping.get(alert.service_name, "platform")

    # Get team-specific PagerDuty service key.
    # get_team_pagerduty_key() is assumed to be defined elsewhere in your codebase.
    pd_service_key = get_team_pagerduty_key(team, severity)

    return {
        "pagerduty_service_key": pd_service_key if "pagerduty" in routing["channels"] else None,
        "slack_channels": routing["slack_channels"],
        "slack_mention": routing["slack_at_mention"],
        "send_email": "email" in routing["channels"],
        "apply_time_restriction": routing["time_restriction"] is not None
    }
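The routing result only flags whether a time restriction applies; something still has to evaluate it. A minimal sketch, assuming business hours mean Monday to Friday, 09:00 to 17:00 in the on-call team's local time. Both the hours and the helper names `is_business_hours` and `should_page_now` are assumptions.

```python
from datetime import datetime, time
from typing import Optional

BUSINESS_START = time(9, 0)   # assumed start of business hours
BUSINESS_END = time(17, 0)    # assumed end of business hours


def is_business_hours(now: datetime) -> bool:
    """True on Monday to Friday between BUSINESS_START and BUSINESS_END."""
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and BUSINESS_START <= now.time() < BUSINESS_END


def should_page_now(time_restriction: Optional[str], now: datetime) -> bool:
    """Page immediately unless the rule is limited to business hours and we are outside them."""
    if time_restriction is None:
        return True
    if time_restriction == "business_hours":
        return is_business_hours(now)
    raise ValueError(f"Unknown time restriction: {time_restriction}")


wednesday_morning = datetime(2025, 1, 15, 10, 0)
saturday_morning = datetime(2025, 1, 18, 10, 0)

print(should_page_now("business_hours", wednesday_morning))  # True
print(should_page_now("business_hours", saturday_morning))   # False
print(should_page_now(None, saturday_morning))               # True
```

What happens to an alert that is held outside business hours (queued until morning, or posted to Slack only) is a separate policy decision that this sketch leaves open.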
Preventing Alert Fatigue
Alert fatigue develops gradually and requires systematic monitoring:
def measure_alert_quality(alert_history, days=30):
    """
    Measure alert quality metrics to identify fatigue risks.
    """
    recent_alerts = [a for a in alert_history if a.triggered_within_days(days)]

    # Total alert volume
    alerts_per_day = len(recent_alerts) / days

    # Actionability rate
    actionable_alerts = [a for a in recent_alerts if a.required_action]
    actionability_rate = len(actionable_alerts) / len(recent_alerts) if recent_alerts else 0

    # Auto-resolution rate (alert resolved without human intervention)
    auto_resolved = [a for a in recent_alerts if a.auto_resolved]
    auto_resolution_rate = len(auto_resolved) / len(recent_alerts) if recent_alerts else 0

    # Mean time to acknowledge
    acknowledged = [a for a in recent_alerts if a.acknowledged_at]
    if acknowledged:
        mtta_minutes = sum(
            (a.acknowledged_at - a.triggered_at).total_seconds() / 60
            for a in acknowledged
        ) / len(acknowledged)
    else:
        mtta_minutes = None

    # Alerts per rule (find noisy rules)
    by_rule = {}
    for alert in recent_alerts:
        by_rule[alert.rule_name] = by_rule.get(alert.rule_name, 0) + 1
    top_noisy_rules = sorted(by_rule.items(), key=lambda x: x[1], reverse=True)[:5]

    return {
        "alerts_per_day": round(alerts_per_day, 1),
        "actionability_rate": f"{actionability_rate:.0%}",
        "auto_resolution_rate": f"{auto_resolution_rate:.0%}",
        # Compare against None so a mean of 0.0 minutes is still reported
        "mean_time_to_acknowledge_minutes": round(mtta_minutes, 1) if mtta_minutes is not None else None,
        "top_noisy_rules": top_noisy_rules,
        "health_assessment": {
            "alert_volume": "healthy" if alerts_per_day < 5 else "concerning" if alerts_per_day < 15 else "critical",
            "actionability": "healthy" if actionability_rate > 0.9 else "concerning" if actionability_rate > 0.7 else "critical",
            "auto_resolution": "healthy" if auto_resolution_rate < 0.2 else "concerning" if auto_resolution_rate < 0.5 else "critical"
        }
    }
Alert Documentation Standard
Every alert should have associated documentation:
# Alert documentation template (in code or wiki)
alert_name: "checkout-api-error-rate-critical"
description: "Fires when the checkout API error rate exceeds 5% over 5 minutes"
trigger_conditions:
  metric: "http_errors_rate"
  threshold: "> 5%"
  window: "5 minutes"
  evaluation: "3 of 5 data points"
severity: critical
team: backend
runbook_url: "https://wiki.example.com/runbooks/checkout-api-errors"
common_causes:
  - "Deployment regression"
  - "Database connection exhaustion"
  - "Downstream payment service failure"
immediate_actions:
  - "Check error logs at https://logging.example.com/checkout"
  - "Check recent deployments at https://deployments.example.com"
  - "Verify database connection pool metrics"
escalate_to: "Engineering Manager if not resolved in 30 minutes"
created_by: "jane@example.com"
created_date: "2025-01-15"
last_reviewed: "2025-06-01"
last_fired: "2025-05-20"
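A template like this is easier to enforce when a script checks it. The sketch below assumes alert definitions have already been loaded into plain dictionaries (for example from YAML files); the list of required fields and the function name `missing_documentation` are illustrative choices, not a fixed standard.

```python
# Fields every alert definition is expected to carry (an assumed minimum set)
REQUIRED_FIELDS = [
    "alert_name", "description", "trigger_conditions", "severity",
    "team", "runbook_url", "immediate_actions", "last_reviewed",
]


def missing_documentation(alert_doc: dict) -> list:
    """Return the required fields that are absent or empty in an alert definition."""
    return [field for field in REQUIRED_FIELDS if not alert_doc.get(field)]


# An incomplete definition: no trigger conditions, an empty runbook URL,
# no immediate actions, and no review date
example = {
    "alert_name": "checkout-api-error-rate-critical",
    "description": "Fires when the checkout API error rate exceeds 5% over 5 minutes",
    "severity": "critical",
    "team": "backend",
    "runbook_url": "",
}

problems = missing_documentation(example)
print(problems)
# ['trigger_conditions', 'runbook_url', 'immediate_actions', 'last_reviewed']
```

Running a check like this in CI would keep undocumented alerts from being merged, under the assumption that alert rules live in version control.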
The Alert Review Process
Alerts should be reviewed regularly, not just created once and forgotten:
## Monthly Alert Audit Process
### Step 1: Collect metrics (10 min)
- Total alerts fired in the month
- Actionable vs noise breakdown
- Top 5 noisiest alert rules
### Step 2: Review each noisy rule (30 min)
For each alert that fired more than 5x:
- Was each firing actionable? (check acknowledgment notes)
- Should the threshold be raised?
- Should the evaluation window be extended?
- Should this alert be deleted?
### Step 3: Review missing alerts (15 min)
- Were there incidents NOT caught by alerts?
- Were there customer reports before our monitoring detected anything?
- Are there new endpoints/services without coverage?
### Step 4: Review routing (15 min)
- Any alerts going to the wrong team?
- Any alerts at the wrong severity for their actual impact?
- Any new engineers who aren't in rotation schedules?
### Outcome
- Updated alert rules with new thresholds
- Deleted or disabled alerts with no actionable history
- New alerts for gaps discovered
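Step 2 of the audit can be partly automated. A minimal sketch, assuming each firing is recorded as a `(rule_name, was_actionable)` pair; that record shape and the function name `rules_needing_review` are assumptions, while the cutoff of 5 firings comes from the checklist above.

```python
from collections import defaultdict


def rules_needing_review(firings, min_firings=5):
    """Summarize rules that fired more than `min_firings` times, noisiest first."""
    totals = defaultdict(int)
    actionable = defaultdict(int)
    for rule_name, was_actionable in firings:
        totals[rule_name] += 1
        if was_actionable:
            actionable[rule_name] += 1

    report = []
    for rule_name, count in totals.items():
        if count > min_firings:
            report.append({
                "rule": rule_name,
                "firings": count,
                "actionable_rate": actionable[rule_name] / count,
                # A rule that never led to action is a candidate for deletion
                "delete_candidate": actionable[rule_name] == 0,
            })
    return sorted(report, key=lambda r: r["firings"], reverse=True)


# Hypothetical month of firings
firings = (
    [("disk-usage-warning", False)] * 8
    + [("checkout-error-rate", True)] * 6
    + [("cpu-high", False)] * 2
)

for entry in rules_needing_review(firings):
    print(entry)
```

The output is only a starting point for the review meeting: whether to raise a threshold, extend a window, or delete a rule still needs human judgment.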
Conclusion
Effective alerting is a continuous discipline, not a setup task. The best alerting systems are built incrementally — starting with the minimum viable set of alerts, measuring actionability, and adding or removing alerts based on evidence. The goal is a quiet monitoring system that pages engineers rarely but reliably when something actually needs human attention. AzMonitor's flexible alert configuration — with support for consecutive failure evaluation, custom thresholds, time-based routing, and multiple notification channels — gives teams the control needed to build an alerting system that stays trustworthy as systems and teams evolve.