Alert configuration is where most monitoring setups go wrong. Engineers spend time setting up monitors and then configure alerts with default settings — which almost always produces too much noise. After a week of false alarms, the team starts ignoring alerts, and when a real incident hits, nobody notices. Here's how to configure alerts that actually work.
The Alert Quality Litmus Test
Before configuring any alert, answer three questions:
- If this fires at 3 AM, what exact action should the on-call engineer take?
- Does this alert firing directly indicate user impact?
- Will this alert fire for transient, self-resolving issues?
If you can't answer question 1 clearly, the alert shouldn't page. If the answer to question 2 is "not necessarily," reconsider whether it should be a critical alert. If the answer to question 3 is "probably yes," add a duration requirement so the condition must persist before the alert fires.
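The three questions can be written down as a small review helper. This is a minimal sketch, not part of any monitoring tool: `AlertReview`, `recommend`, and the severity labels are hypothetical names, and downgrading to `warning` when user impact is uncertain is only one reasonable reading of "reconsider".

```python
from dataclasses import dataclass


@dataclass
class AlertReview:
    has_clear_action: bool       # Q1: is there an exact action for on-call?
    indicates_user_impact: bool  # Q2: does firing directly imply user impact?
    fires_on_transients: bool    # Q3: will it fire for self-resolving blips?


def recommend(review: AlertReview) -> dict:
    """Map the three answers to a (hypothetical) routing recommendation."""
    if not review.has_clear_action:
        severity = "no_page"   # send to a channel or dashboard instead of paging
    elif review.indicates_user_impact:
        severity = "critical"
    else:
        severity = "warning"   # assumption: uncertain impact is downgraded
    return {
        "severity": severity,
        "needs_duration_requirement": review.fires_on_transients,
    }
```

Running every proposed alert through a check like this before it ships makes the "should this page?" decision explicit rather than a default.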
Setting Alert Thresholds
Thresholds should be based on data, not guesswork.
Step 1: Establish Baselines
Run your monitors for 2-4 weeks before configuring alerts. Collect baseline data:
# Analyze baseline data to set thresholds
def calculate_alert_thresholds(historical_data, percentile_threshold=0.99):
    """
    Calculate alert thresholds based on historical performance.

    Default: base the critical threshold on the 99th percentile
    (1% of checks exceed it), so alerts fire rarely under normal conditions.
    """
    if not historical_data:
        raise ValueError("historical_data must not be empty")

    values = sorted(historical_data)
    n = len(values)

    def percentile(p):
        # Index into the sorted samples; clamp so p=1.0 stays in range
        return values[min(int(n * p), n - 1)]

    p50 = percentile(0.50)
    p90 = percentile(0.90)
    p95 = percentile(0.95)
    p99 = percentile(0.99)
    critical_base = percentile(percentile_threshold)  # p99 by default

    return {
        "p50_baseline": p50,
        "p90_baseline": p90,
        "p95_baseline": p95,
        "p99_baseline": p99,
        "suggested_warning_threshold": p95 * 1.5,  # 50% above p95
        "suggested_critical_threshold": critical_base * 2.0,  # 2x p99 by default
    }

# Example output for a 200ms median service (p90_baseline omitted):
# {
#     "p50_baseline": 200,
#     "p95_baseline": 500,
#     "p99_baseline": 800,
#     "suggested_warning_threshold": 750,
#     "suggested_critical_threshold": 1600
# }
Step 2: Add Hysteresis
Set different thresholds for opening and closing alerts to prevent flapping:
alert:
  name: "API Response Time High"
  # Alert opens when: p95 > 1500ms for 5 consecutive minutes
  condition:
    metric: response_time_p95
    operator: greater_than
    threshold: 1500
    for: 5m
  # Alert closes when: p95 drops below 800ms for 3 minutes
  recovery:
    metric: response_time_p95
    operator: less_than
    threshold: 800
    for: 3m
Without hysteresis, an alert will flap around the threshold — firing and resolving repeatedly as the metric oscillates.
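To make the open/close behavior concrete, here is a minimal sketch of a hysteresis state machine. The class name `HysteresisAlert` and its parameters are hypothetical, and it assumes one sample per minute so that "for 5m" becomes five consecutive samples.

```python
class HysteresisAlert:
    """Open above one threshold, close only below a lower one."""

    def __init__(self, open_threshold, close_threshold, open_for, close_for):
        if close_threshold >= open_threshold:
            raise ValueError("close_threshold must be below open_threshold")
        self.open_threshold = open_threshold
        self.close_threshold = close_threshold
        self.open_for = open_for    # consecutive samples needed to open
        self.close_for = close_for  # consecutive samples needed to close
        self.firing = False
        self._streak = 0

    def observe(self, value):
        """Feed one sample; return True while the alert is firing."""
        if self.firing:
            condition_met = value < self.close_threshold
            needed = self.close_for
        else:
            condition_met = value > self.open_threshold
            needed = self.open_for
        # Any sample that breaks the condition resets the streak
        self._streak = self._streak + 1 if condition_met else 0
        if self._streak >= needed:
            self.firing = not self.firing
            self._streak = 0
        return self.firing


alert = HysteresisAlert(open_threshold=1500, close_threshold=800, open_for=5, close_for=3)
for p95_ms in [1600] * 5 + [1200] * 4 + [700] * 3:
    firing = alert.observe(p95_ms)
```

In this run the alert opens on the fifth 1600ms sample, stays open while the metric sits at 1200ms (below the open threshold but above the close threshold), and closes only after three samples under 800ms.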
Consecutive Failure Requirements
For availability monitoring, require multiple consecutive failures before alerting:
# Tolerate a single failure (reduces false positives from transient issues)
monitor:
  name: "Payment API"
  url: "https://api.example.com/health"
  alert_after:
    consecutive_failures: 2  # Alert only after 2 consecutive failures
    # At a 60s check interval, the second failed check lands 60s after the first,
    # so the alert fires after roughly 1-2 minutes of sustained failure
  recovery_after:
    consecutive_successes: 2  # Recover alert after 2 consecutive successes
This prevents alerts for single network timeouts or momentary blips while ensuring sustained failures are caught.
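The counting logic behind a consecutive-failure rule is simple enough to sketch. The class below is a hypothetical illustration, not how any specific tool implements it; `ConsecutiveFailureGate` and the `"alert"` / `"recover"` event names are assumptions.

```python
class ConsecutiveFailureGate:
    """Emit 'alert' after N straight failures and 'recover' after M straight successes."""

    def __init__(self, failures_to_alert=2, successes_to_recover=2):
        self.failures_to_alert = failures_to_alert
        self.successes_to_recover = successes_to_recover
        self.alerting = False
        self._failures = 0
        self._successes = 0

    def record(self, check_ok):
        """Record one check result; return 'alert', 'recover', or None."""
        if check_ok:
            self._successes += 1
            self._failures = 0   # a success breaks the failure streak
        else:
            self._failures += 1
            self._successes = 0  # a failure breaks the success streak
        if not self.alerting and self._failures >= self.failures_to_alert:
            self.alerting = True
            return "alert"
        if self.alerting and self._successes >= self.successes_to_recover:
            self.alerting = False
            return "recover"
        return None


gate = ConsecutiveFailureGate()
events = [gate.record(ok) for ok in [True, False, True, False, False, True, True]]
```

The isolated failure in the second check produces no event; only the two back-to-back failures open the alert, and two back-to-back successes close it.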
Severity-Based Alert Configuration
Match alert severity to actual impact:
alerts:
  # P1 - Immediate page, wake people up
  critical:
    conditions:
      - metric: availability
        operator: less_than
        value: 0.90  # Less than 90% available
        for: 2m
      - metric: error_rate
        operator: greater_than
        value: 0.10  # More than 10% errors
        for: 3m
    channels:
      - type: pagerduty
        integration_key: "PD_P1_KEY"
        severity: critical
      - type: slack
        webhook: "WEBHOOK_URL"
        channel: "#incidents"
    escalation:
      timeout_minutes: 5
      next_contact: "secondary-oncall"

  # P2 - Push notification, investigate today
  warning:
    conditions:
      - metric: response_time_p95
        operator: greater_than
        value: 2000
        for: 10m
      - metric: error_rate
        operator: greater_than
        value: 0.02  # More than 2% errors
        for: 10m
    channels:
      - type: pagerduty
        integration_key: "PD_P2_KEY"
        severity: warning
      - type: slack
        webhook: "WEBHOOK_URL"
        channel: "#alerts"

  # P3 - Slack notification, review this week
  info:
    conditions:
      - metric: response_time_p95
        operator: greater_than
        value: 1000
        for: 30m
    channels:
      - type: slack
        webhook: "WEBHOOK_URL"
        channel: "#monitoring"
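The severity tiers above can be read as an ordered rule table: check the most severe tier first and return the first match. The sketch below uses the same thresholds and durations, but the `RULES` layout, the `classify` function, and the choice to treat conditions within a tier as alternatives (any one is enough) are assumptions, since the configuration does not say how conditions combine.

```python
# Each rule: (metric, comparison, threshold, minutes the breach must be sustained)
RULES = {
    "critical": [("availability", "lt", 0.90, 2), ("error_rate", "gt", 0.10, 3)],
    "warning": [("response_time_p95", "gt", 2000, 10), ("error_rate", "gt", 0.02, 10)],
    "info": [("response_time_p95", "gt", 1000, 30)],
}


def breached(value, op, threshold):
    return value < threshold if op == "lt" else value > threshold


def classify(window):
    """window maps metric name -> per-minute samples, newest last.

    Returns the highest matching severity, or None if nothing matches.
    """
    for severity in ("critical", "warning", "info"):
        for metric, op, threshold, minutes in RULES[severity]:
            samples = window.get(metric, [])[-minutes:]
            # Require a full window so a short burst cannot trigger the rule
            if len(samples) == minutes and all(breached(v, op, threshold) for v in samples):
                return severity
    return None
```

For example, `classify({"error_rate": [0.05] * 10})` matches the warning tier (more than 2% errors for 10 minutes), while nine samples of the same value match nothing yet.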
Multi-Channel Alert Configuration
Configure multiple channels to complement each other:
# P1 Alert - comprehensive notification strategy
p1_alert:
  channels:
    # Primary: Wake on-call
    - type: pagerduty
      urgency: high
      integration_key: "PD_KEY"

    # Backup: SMS if PD fails
    - type: sms
      recipients: ["on-call-sms-list@example.com"]

    # Visibility: Team awareness
    - type: slack
      channel: "#incidents"
      message: |
        🚨 CRITICAL: {{monitor_name}} is DOWN
        URL: {{monitor_url}}
        From: {{failed_regions|join(", ")}}
        Started: {{incident_start}}
        Dashboard: {{dashboard_link}}
        Runbook: {{runbook_link}}

    # External: Customer communication trigger
    - type: webhook
      url: "https://api.statuspage.io/v1/pages/PAGE_ID/incidents"
      headers:
        Authorization: "OAuth YOUR_TOKEN"
      body_template: |
        {
          "incident": {
            "name": "Investigating {{monitor_name}} issues",
            "status": "investigating",
            "impact_override": "major"
          }
        }
Configuring Time-Based Alert Behavior
Some alerts should behave differently outside business hours:
# Business hours vs after-hours alert routing
alert:
  name: "Cache Hit Rate Low"
  condition: "cache_hit_rate < 60% for 20 minutes"
  routing:
    business_hours:  # 09:00-18:00 weekdays
      channels:
        - slack: "#engineering"
      severity: warning
    after_hours:  # Evenings and weekends
      # Not critical enough to wake anyone up
      # Hold notification until business hours
      channels:
        - slack: "#monitoring"
      severity: info
      hold_until_business_hours: true
Not all issues warrant waking people up. Non-critical alerts that can wait until morning should be held.
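The routing decision itself comes down to a time check. A minimal sketch, assuming the timestamp passed in is already in the team's local timezone; `is_business_hours` and `route` are hypothetical helper names, and the channels mirror the configuration above.

```python
from datetime import datetime


def is_business_hours(ts: datetime) -> bool:
    """True for 09:00-18:00 on weekdays (Monday=0 ... Friday=4)."""
    return ts.weekday() < 5 and 9 <= ts.hour < 18


def route(ts: datetime) -> dict:
    """Pick channel, severity, and whether to hold the notification."""
    if is_business_hours(ts):
        return {"channel": "#engineering", "severity": "warning", "hold": False}
    # After hours: lower severity and hold until the next business day
    return {"channel": "#monitoring", "severity": "info", "hold": True}
```

A real implementation would also need a public-holiday calendar and a per-team timezone, both of which this sketch leaves out.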
SSL Certificate Alert Configuration
SSL alerts require different timing — they're not about current availability but about upcoming expiry:
ssl_alerts:
  name: "SSL Certificate Expiry"
  hostname: "api.example.com"
  alert_schedule:
    - days_before_expiry: 60
      severity: info
      channels: ["#engineering-weekly-digest"]
      send_once: true  # Don't repeat daily
    - days_before_expiry: 30
      severity: warning
      channels: ["#sre-alerts"]
      send_once_per_week: true
    - days_before_expiry: 14
      severity: warning
      channels: ["#sre-alerts"]
      send_daily: true
    - days_before_expiry: 7
      severity: critical
      channels: ["pagerduty", "#incidents"]
      send_daily: true
    - days_before_expiry: 3
      severity: critical
      channels: ["pagerduty", "sms", "#incidents"]
      send_every_6_hours: true
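One way to evaluate a schedule like this is to compute the days remaining and pick the most urgent tier that applies. The sketch below is illustrative only: `SCHEDULE`, `ssl_alert_tier`, and the repeat labels are hypothetical names, and the certificate's expiry time is assumed to be already known rather than fetched from the host.

```python
from datetime import datetime, timedelta, timezone

# (days_before_expiry, severity, repeat), most urgent tier first
SCHEDULE = [
    (3, "critical", "every_6_hours"),
    (7, "critical", "daily"),
    (14, "warning", "daily"),
    (30, "warning", "weekly"),
    (60, "info", "once"),
]


def ssl_alert_tier(not_after: datetime, now: datetime):
    """Return the tier for a certificate expiring at not_after, or None if it is far off."""
    days_left = (not_after - now).days
    for days, severity, repeat in SCHEDULE:
        if days_left <= days:
            return {"days_left": days_left, "severity": severity, "repeat": repeat}
    return None


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
tier = ssl_alert_tier(now + timedelta(days=10), now)
```

With 10 days left the certificate falls into the 14-day tier (warning, daily). An already expired certificate has a negative `days_left` and lands in the most urgent tier.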
Maintenance Window Configuration
Suppress alerts during planned maintenance to prevent false alarms:
# Create a maintenance window via the API
import os
from datetime import datetime, timedelta, timezone

import requests


def create_maintenance_window(monitor_ids, start_time, end_time, description, api_key):
    """Create a maintenance window to suppress alerts."""
    response = requests.post(
        "https://api.azmonitor.com/v1/maintenance",
        headers={"Authorization": f"Bearer {api_key}"},
        json={
            "monitor_ids": monitor_ids,
            "start_time": start_time.isoformat(),
            "end_time": end_time.isoformat(),
            "description": description,
            "notify_team": True,  # Notify team that maintenance is scheduled
        },
        timeout=10,  # Don't let a hung request block the deploy
    )
    response.raise_for_status()  # Fail loudly if the window was not created
    return response.json()


# Add to your CI/CD pipeline
now = datetime.now(timezone.utc)  # Timezone-aware, so isoformat() includes the UTC offset
window = create_maintenance_window(
    monitor_ids=["payment-api", "user-api"],
    start_time=now,
    end_time=now + timedelta(hours=1),
    description="Deploying payment-service v2.5.0",
    api_key=os.environ["AZMONITOR_API_KEY"],
)
print(f"Maintenance window created: {window['id']}")
# ... deploy here ...
# Window auto-expires, alerts resume automatically
Testing Alert Configurations
Never assume your alerts work — test them:
# Test 1: Test each notification channel
# Use your monitoring tool's "Send test alert" feature
# Verify receipt in: Slack, email, PagerDuty
# Test 2: Test with a real failure
# Temporarily make your monitored URL return an error
# Verify alert fires within expected time
# Verify it reaches expected channels
# Verify it resolves when URL recovers
# Test 3: Test escalation
# Don't acknowledge the test alert
# Verify escalation fires after configured timeout
# Verify escalation reaches secondary contact
# Test 4: Test maintenance windows
# Create maintenance window
# Trigger failure during window
# Verify NO alert fires during window
# Allow window to expire
# Trigger failure again
# Verify alert fires normally
Alert Message Templates
Well-crafted alert messages accelerate incident response:
# Include actionable context in every alert
alert_templates:
  critical:
    title: "[CRITICAL] {{monitor_name}} DOWN - {{failed_regions|join(', ')}} failing"
    body: |
      Service {{monitor_name}} is DOWN.
      Failing from: {{failed_regions|join(", ")}}
      First failure: {{incident_start_time}} UTC
      URL: {{monitor_url}}
      Recent response: {{status_code}} in {{response_time}}ms
      Dashboard: {{dashboard_url}}
      Runbook: {{runbook_url}}
      Acknowledge: {{acknowledge_url}}
  warning:
    title: "[WARNING] {{monitor_name}} - {{metric}} elevated"
    body: |
      Service {{monitor_name}} is degraded.
      Metric: {{metric}} = {{current_value}} (threshold: {{threshold}})
      Duration: {{duration_minutes}} minutes
      Dashboard: {{dashboard_url}}
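To preview how a template reads before wiring it into a real channel, a tiny renderer is enough. This is a sketch under stated assumptions: `render` is a hypothetical helper that handles only plain `{{name}}` placeholders, not filters such as `join`, and it is not the templating engine any particular monitoring tool uses.

```python
import re

# Matches {{name}} with optional surrounding whitespace; filters are not supported
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, context: dict) -> str:
    """Fill {{name}} placeholders from context, leaving unknown ones untouched."""

    def substitute(match):
        key = match.group(1)
        # Keep missing fields visible so a gap in the alert data is obvious
        return str(context[key]) if key in context else match.group(0)

    return PLACEHOLDER.sub(substitute, template)


preview = render(
    "Metric: {{metric}} = {{current_value}} (threshold: {{threshold}})",
    {"metric": "response_time_p95", "current_value": 2300, "threshold": 2000},
)
```

Rendering each template against sample data is a quick way to catch a missing runbook link or an unreadable title before the alert fires for real.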
Common Alert Configuration Mistakes
| Mistake | Impact | Fix |
|---|---|---|
| No consecutive failure requirement | Constant false positives | Require 2-3 consecutive failures |
| Single region checks | Alerts on monitoring provider issues | Use 2+ region confirmation |
| No hysteresis | Alert flapping | Set different open/close thresholds |
| All channels same for all severities | On-call burnout | Route by severity |
| No maintenance windows | Alerts during deployments | Automate maintenance window creation |
| No recovery notifications | Can't tell when issue resolved | Enable "all clear" notifications |
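The "2+ region confirmation" fix is the one item in this list without an example elsewhere in the article, so here is a minimal sketch. The function name `confirmed_down` and the shape of `region_results` are assumptions for illustration.

```python
def confirmed_down(region_results: dict, min_failing_regions: int = 2) -> bool:
    """Treat the service as down only if enough regions agree.

    region_results maps region name -> True if the check passed there.
    """
    failing = [region for region, ok in region_results.items() if not ok]
    return len(failing) >= min_failing_regions


single_region_blip = confirmed_down({"us-east": False, "eu-west": True, "ap-south": True})
real_outage = confirmed_down({"us-east": False, "eu-west": False, "ap-south": True})
```

A failure seen from only one region is more likely a problem on the path from that probe than a problem with the service, so it is not confirmed.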
Conclusion
Alert configuration is an iterative process. Start conservative (higher thresholds, consecutive failure requirements, multi-region confirmation), measure alert quality weekly, and tune based on what you observe. The goal is alerts that fire when something actually needs human attention and never fire for issues that resolve themselves. AzMonitor's alert configuration provides all the knobs needed to tune your monitoring precisely — consecutive failure requirements, multi-region confirmation, time-based routing, and per-channel severity configuration — so you can build a monitoring setup your team actually trusts.