An escalation policy is the safety net under your on-call rotation. When the primary responder doesn't acknowledge an alert — because they're asleep, in a meeting, or dealing with another incident — escalation ensures someone else picks it up. A well-designed escalation policy means no critical alert ever goes unaddressed. A poorly designed one floods inboxes with alerts nobody acts on.
What an Escalation Policy Defines
Every escalation policy answers three questions:
- Who gets notified first? (primary)
- If they don't respond, who's next? (escalation chain)
- How long to wait before escalating? (timeout)
Simple version:
Alert fires → Primary on-call (5 min timeout)
    ↓ No ack
Secondary on-call (5 min timeout)
    ↓ No ack
Engineering manager (10 min timeout)
    ↓ No ack
VP Engineering (continuous paging)
The complexity of your escalation policy should match the complexity of your organization and the severity of your alerts.
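To make the simple chain above concrete, here is a minimal Python sketch that answers "who is being paged right now?" for an alert that has gone unacknowledged for a given number of minutes. The names `SIMPLE_CHAIN` and `current_target` are illustrative, not part of any alerting tool, and the timeouts are the ones from the diagram.

```python
from typing import List, Optional, Tuple

# (target, minutes to wait for an ack before escalating); None means "keep paging"
SIMPLE_CHAIN: List[Tuple[str, Optional[int]]] = [
    ("primary_oncall", 5),
    ("secondary_oncall", 5),
    ("engineering_manager", 10),
    ("vp_engineering", None),
]

def current_target(minutes_unacknowledged: float, chain=SIMPLE_CHAIN) -> str:
    """Return who is being paged after this many minutes without an ack."""
    elapsed = 0.0
    for target, timeout in chain:
        # The last level has no timeout, so it absorbs all remaining time
        if timeout is None or minutes_unacknowledged < elapsed + timeout:
            return target
        elapsed += timeout
    return chain[-1][0]  # past the end of the chain: stay on the last target

print(current_target(2))   # primary_oncall
print(current_target(7))   # secondary_oncall
print(current_target(25))  # vp_engineering
```

Keeping the chain as plain data makes it easy to review the timeouts in one place and to reuse the same lookup in tests.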
Severity-Based Escalation
Different alert severities call for different escalation chains and timeouts:
# Escalation policies by severity
P1_critical:
  level_1:
    target: primary_oncall
    timeout_minutes: 3
    notification: [phone_call, sms, push]
  level_2:
    target: secondary_oncall
    timeout_minutes: 5
    notification: [phone_call, sms]
  level_3:
    target: engineering_manager
    timeout_minutes: 10
    notification: [phone_call]
  level_4:
    target: vp_engineering
    timeout_minutes: 15
    notification: [phone_call]
P2_high:
  level_1:
    target: primary_oncall
    timeout_minutes: 15
    notification: [push, sms]
  level_2:
    target: secondary_oncall
    timeout_minutes: 30
    notification: [push]
  level_3:
    target: engineering_manager
    timeout_minutes: 60
    notification: [push]
P3_medium:
  level_1:
    target: team_slack_channel
    timeout_minutes: 120
    notification: [slack]
  level_2:
    target: primary_oncall
    timeout_minutes: 480  # 8 hours (business hours)
    notification: [push]
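Per-level timeouts add up, so it helps to check how long an unacknowledged alert takes to reach each level. The sketch below copies the `timeout_minutes` values from the config above into a plain dictionary (the `TIMEOUTS` and `page_offsets` names are illustrative) and computes the minute at which each level is first notified.

```python
from itertools import accumulate

# Per-level timeouts (minutes) copied from the severity config above
TIMEOUTS = {
    "P1_critical": [3, 5, 10, 15],
    "P2_high": [15, 30, 60],
    "P3_medium": [120, 480],
}

def page_offsets(timeouts):
    """Minutes after the alert fires at which each level is first notified."""
    # Level 1 is notified at minute 0; each later level waits for all earlier timeouts
    return [0] + list(accumulate(timeouts[:-1]))

for severity, timeouts in TIMEOUTS.items():
    print(severity, page_offsets(timeouts))
```

With these numbers an unacknowledged P1 reaches the VP 18 minutes after it fires, and an unacknowledged P2 reaches the engineering manager after 45 minutes. If those totals are longer than your response targets, shorten the earlier timeouts.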
Building Escalation Policies in PagerDuty
# Create an escalation policy via the PagerDuty REST API
import pdpyras

session = pdpyras.APISession(api_key="YOUR_API_KEY")

policy = {
    "type": "escalation_policy",
    "name": "Engineering P1 Escalation",
    "description": "Critical incident escalation for engineering team",
    "escalation_rules": [
        {
            "escalation_delay_in_minutes": 3,
            "targets": [
                {
                    "type": "schedule_reference",
                    "id": "PRIMARY_SCHEDULE_ID"  # Primary on-call schedule
                }
            ]
        },
        {
            "escalation_delay_in_minutes": 5,
            "targets": [
                {
                    "type": "schedule_reference",
                    "id": "SECONDARY_SCHEDULE_ID"  # Secondary on-call schedule
                }
            ]
        },
        {
            "escalation_delay_in_minutes": 10,
            "targets": [
                {
                    "type": "user_reference",
                    "id": "ENGINEERING_MANAGER_USER_ID"
                }
            ]
        }
    ],
    # num_loops alone controls repetition: repeat the chain 3 times before stopping
    "num_loops": 3
}

# session.post returns a requests.Response, so check the status and decode the body
response = session.post("escalation_policies", json={"escalation_policy": policy})
response.raise_for_status()
result = response.json()
print(f"Created policy: {result['escalation_policy']['id']}")
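Writing each rule out by hand gets repetitive once a policy has more than a few levels. As a rough sketch, a small helper can build the `escalation_rules` list from compact tuples. The helper name `build_escalation_rules` is hypothetical; the field names are the ones used in the payload above, and the IDs are the same placeholders.

```python
def build_escalation_rules(levels):
    """Turn (delay_minutes, target_type, target_id) tuples into rule dicts."""
    return [
        {
            "escalation_delay_in_minutes": delay,
            "targets": [{"type": target_type, "id": target_id}],
        }
        for delay, target_type, target_id in levels
    ]

# Same three levels as the payload above, using the same placeholder IDs
rules = build_escalation_rules([
    (3, "schedule_reference", "PRIMARY_SCHEDULE_ID"),
    (5, "schedule_reference", "SECONDARY_SCHEDULE_ID"),
    (10, "user_reference", "ENGINEERING_MANAGER_USER_ID"),
])
print(len(rules))
```

The resulting list can be assigned to the `escalation_rules` key of the policy dictionary before it is posted.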
Notification Channels by Escalation Level
Different escalation levels should use different notification channels. Phone calls wake people up, which is exactly what a P1 needs. Push notifications are less disruptive and suit P2:
| Escalation Level | P1 Channels | P2 Channels | P3 Channels |
|---|---|---|---|
| Level 1 (primary) | Phone + SMS + Push | Push + SMS | Slack |
| Level 2 (secondary) | Phone + SMS | Push | Push |
| Level 3 (manager) | Phone | SMS | Email |
| Level 4 (executive) | Phone | Email | None |
Configure per-person notification preferences so that engineers can choose their channels while the policy's escalation requirements still hold:
# Engineer notification preferences
alice:
  high_urgency:
    - phone_call
    - sms
    - push_notification
  low_urgency:
    - push_notification
  quiet_hours_override: true  # Ring even during quiet hours for P1
bob:
  high_urgency:
    - sms
    - push_notification
  low_urgency:
    - push_notification
  quiet_hours_override: false  # Respect quiet hours (secondary will catch P1)
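One way to read these preferences is sketched below. It assumes that quiet hours silence every channel unless the alert is high urgency and the person has set `quiet_hours_override`. That interpretation is an assumption made for illustration, and the `channels_for` function is hypothetical rather than a feature of any particular tool.

```python
# Preferences copied from the config above
PREFERENCES = {
    "alice": {
        "high_urgency": ["phone_call", "sms", "push_notification"],
        "low_urgency": ["push_notification"],
        "quiet_hours_override": True,
    },
    "bob": {
        "high_urgency": ["sms", "push_notification"],
        "low_urgency": ["push_notification"],
        "quiet_hours_override": False,
    },
}

def channels_for(person, urgency, in_quiet_hours):
    """Return the channels to use for this person; urgency is 'high' or 'low'."""
    prefs = PREFERENCES[person]
    overrides_quiet = urgency == "high" and prefs["quiet_hours_override"]
    if in_quiet_hours and not overrides_quiet:
        # Assumption: quiet hours silence everything, so the next level must catch it
        return []
    return prefs[f"{urgency}_urgency"]

print(channels_for("alice", "high", in_quiet_hours=True))
print(channels_for("bob", "high", in_quiet_hours=True))
```

Under this reading, a P1 during Bob's quiet hours produces no notification for him at all, which is only safe if the next level's timeout is short.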
Handling Escalation Exceptions
Real-world escalation needs to handle edge cases:
Vacation coverage. When the primary is on vacation, escalation should skip them:
from datetime import datetime, timezone

def get_current_oncall(schedule_id, escalation_level=1):
    """Get current on-call person, accounting for overrides and vacations.

    fetch_schedule and fetch_overrides stand in for your scheduling backend.
    """
    now = datetime.now(timezone.utc)  # one timezone-aware timestamp for both lookups
    # Overrides (vacation cover, schedule swaps) take precedence over the rotation
    overrides = fetch_overrides(schedule_id, now)
    if overrides:
        return overrides[0].user  # Use override if exists
    schedule = fetch_schedule(schedule_id, now)
    return schedule.current_oncall_user
Multiple simultaneous incidents. One engineer shouldn't handle two P1 incidents at once. Cap the number of concurrent pages per person and escalate when the cap is reached:
# Prevent one person from being paged for multiple simultaneous incidents
escalation_settings:
  max_concurrent_pages_per_person: 1
  on_overload:
    action: escalate_to_next_level
    message: "Primary on-call already handling incident - escalating"
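The overload rule above amounts to "skip anyone who is already at their cap". A minimal sketch of that selection logic follows. The function name, the shape of `active_incidents`, and the fallback when everyone is busy are all assumptions made for illustration.

```python
from collections import Counter

def pick_responder(chain, active_incidents, max_concurrent=1):
    """Return the first person in the chain who is under the concurrency cap.

    chain: responders in escalation order.
    active_incidents: mapping of incident id -> responder currently handling it.
    """
    load = Counter(active_incidents.values())
    for person in chain:
        if load[person] < max_concurrent:
            return person
    # Assumption: if everyone is at the cap, page the last level rather than nobody
    return chain[-1]

chain = ["alice", "bob", "carol"]
print(pick_responder(chain, {"INC-1": "alice"}))  # bob
```

Falling back to the last level when the whole chain is busy is a deliberate choice here: a page to an overloaded person is better than an alert that reaches nobody.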
Time-zone aware escalation. For global teams, escalate to the region that's awake:
def get_regional_oncall(incident_time_utc):
    """Route escalation to the on-call team whose business hours are active"""
    hour = incident_time_utc.hour
    if 8 <= hour < 16:  # 08:00-16:00 UTC: Europe business hours
        return "europe_oncall_schedule"
    elif 16 <= hour < 22:  # 16:00-22:00 UTC: US East (the 14:00-16:00 overlap stays with Europe)
        return "us_east_oncall_schedule"
    else:  # 22:00-08:00 UTC: APAC covers the remaining hours
        return "apac_oncall_schedule"
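Regional hand-offs are easy to get wrong: a gap between windows means nobody is paged, and an overlap means the order of the branches decides who is. The sketch below checks a set of UTC windows for both. The Europe and US East windows come from the comments in the function above, and the APAC window (22:00-08:00) is inferred from its fallback branch. The `WINDOWS`, `hours_in`, and `coverage_report` names are illustrative.

```python
# (schedule, start_hour_utc, end_hour_utc); end < start means the window wraps past midnight
WINDOWS = [
    ("europe_oncall_schedule", 8, 16),
    ("us_east_oncall_schedule", 14, 22),
    ("apac_oncall_schedule", 22, 8),
]

def hours_in(start, end):
    """Set of whole UTC hours in [start, end), wrapping past midnight if needed."""
    if start < end:
        return set(range(start, end))
    return set(range(start, 24)) | set(range(0, end))

def coverage_report(windows):
    """Return (uncovered_hours, overlapping_hours) across a 24-hour day."""
    counts = {hour: 0 for hour in range(24)}
    for _name, start, end in windows:
        for hour in hours_in(start, end):
            counts[hour] += 1
    gaps = [h for h, n in counts.items() if n == 0]
    overlaps = [h for h, n in counts.items() if n > 1]
    return gaps, overlaps

print(coverage_report(WINDOWS))  # ([], [14, 15])
```

Fixed UTC hours ignore daylight saving time, so a region's real business hours shift by an hour for part of the year. Rerun a check like this whenever the windows change.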
Service-Specific Escalation
Different services might need different escalation paths based on who owns them:
service_escalation_policies:
  payment-service:
    primary_schedule: payments_team_oncall
    escalation_policy: payments_p1_escalation
    team_slack: "#payments-incidents"
  user-service:
    primary_schedule: user_platform_oncall
    escalation_policy: platform_p1_escalation
    team_slack: "#platform-incidents"
  infrastructure:
    primary_schedule: sre_oncall
    escalation_policy: sre_critical_escalation
    team_slack: "#infrastructure-incidents"
This ensures the right team gets paged for each service, rather than routing everything to a single on-call pool.
Escalation Testing
Test your escalation policies regularly — don't wait for a real incident to discover they don't work:
#!/bin/bash
# Test the escalation policy without waking anyone up:
# send a test alert through a staging integration key
curl -X POST "https://events.pagerduty.com/v2/enqueue" \
  -H "Content-Type: application/json" \
  -d '{
    "routing_key": "STAGING_INTEGRATION_KEY",
    "event_action": "trigger",
    "payload": {
      "summary": "[TEST] Escalation policy validation - NOT a real incident",
      "severity": "info",
      "source": "escalation-test",
      "custom_details": {
        "test": true,
        "test_id": "2025-06-18-01",
        "instruction": "Acknowledge this test alert and verify the escalation timing"
      }
    }
  }'
echo "Test alert sent. Verify:"
echo "1. Primary on-call received notification"
echo "2. Escalation to secondary fires after 3 minutes if not acknowledged"
echo "3. Escalation stops after acknowledgment and the alert resolves cleanly"
Run quarterly escalation tests:
- Send a test alert and deliberately don't acknowledge it
- Verify each escalation level fires in the correct order
- Verify escalation stops when acknowledged
- Verify no duplicate notifications
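The checks in this list can be automated once you can export which level was notified and when. The sketch below assumes a hypothetical log format of `(minute, level)` tuples and uses the P1 timings from earlier in this article (level 1 at minute 0, level 2 at minute 3, level 3 at minute 8) as the expected offsets. `check_escalation_log` is an illustrative helper, not part of any alerting tool.

```python
def check_escalation_log(notifications, ack_minute, expected_offsets):
    """Return a list of problems found in one test run.

    notifications: (minute, level) tuples in the order they were sent.
    ack_minute: minute the alert was acknowledged, or None if it never was.
    expected_offsets: level -> minute at which that level should first fire.
    """
    problems = []
    levels = [level for _minute, level in notifications]
    if levels != sorted(levels):
        problems.append("levels fired out of order")
    if len(set(notifications)) != len(notifications):
        problems.append("duplicate notification")
    for minute, level in notifications:
        if minute != expected_offsets.get(level):
            problems.append(f"level {level} fired at minute {minute}")
        if ack_minute is not None and minute > ack_minute:
            problems.append(f"level {level} fired after acknowledgment")
    return problems

expected = {1: 0, 2: 3, 3: 8}
print(check_escalation_log([(0, 1), (3, 2)], ack_minute=4, expected_offsets=expected))
print(check_escalation_log([(0, 1), (3, 2), (8, 3)], ack_minute=4, expected_offsets=expected))
```

The first run is clean and prints an empty list. The second reports that level 3 fired after the acknowledgment at minute 4. Real notification timestamps will not land on exact minutes, so a production version would compare against a tolerance rather than requiring equality.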
Escalation Metrics
Track how well your escalation is working:
-- Escalation analytics
SELECT
    DATE_TRUNC('week', alert_time) AS week,
    severity,
    COUNT(*) AS total_alerts,
    AVG(first_ack_minutes) AS avg_time_to_first_ack,
    COUNT(CASE WHEN escalation_level > 1 THEN 1 END) AS escalated,
    COUNT(CASE WHEN escalation_level > 2 THEN 1 END) AS double_escalated,
    COUNT(CASE WHEN never_acknowledged THEN 1 END) AS missed
FROM alert_history
WHERE alert_time > NOW() - INTERVAL '12 weeks'
GROUP BY DATE_TRUNC('week', alert_time), severity
ORDER BY week DESC, severity;
Watch for:
- Rising escalation rates (primary isn't acknowledging)
- High missed alert counts (escalation chain isn't reaching anyone)
- Very fast ack times (alerts might be auto-acknowledged without investigation)
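These warning signs can be turned into simple automated checks on each row of the weekly query. The sketch below assumes rows arrive as dictionaries keyed by the query's column aliases. The thresholds (25% escalation rate, half a minute average ack time) are arbitrary placeholders to tune for your team, and `review_week` is an illustrative name.

```python
def review_week(row, max_escalation_rate=0.25, min_plausible_ack_minutes=0.5):
    """Return warning strings for one row of the weekly analytics query."""
    warnings = []
    total = row["total_alerts"]
    if total == 0:
        return warnings
    rate = row["escalated"] / total
    if rate > max_escalation_rate:
        warnings.append(f"escalation rate {rate:.0%} exceeds {max_escalation_rate:.0%}")
    if row["missed"] > 0:
        warnings.append(f"{row['missed']} alert(s) never acknowledged")
    avg_ack = row["avg_time_to_first_ack"]
    # AVG() is NULL (None here) when no alert in the group was ever acknowledged
    if avg_ack is not None and avg_ack < min_plausible_ack_minutes:
        warnings.append("suspiciously fast acknowledgments")
    return warnings

row = {"total_alerts": 10, "escalated": 4, "missed": 1, "avg_time_to_first_ack": 0.2}
for warning in review_week(row):
    print(warning)
```

A fixed threshold catches a rate that is already high. To catch a rising rate, compare each week's value against the previous weeks for the same severity.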
When to Review Escalation Policies
Review your escalation policies:
- After any incident where the escalation didn't work as expected
- After team structure changes (new hires, departures)
- After changes to service ownership
- During quarterly rotation reviews
- When engineers report being paged outside their on-call shift
Conclusion
A good escalation policy is invisible when it works — alerts get acknowledged, incidents get resolved, and nobody falls through the cracks. It only becomes visible when it breaks: missed alerts, wrong people paged, or escalations that never fire. Build your escalation policies with the same care as your production code — test them, review them regularly, and update them as your team evolves. Combined with AzMonitor's multi-channel alerting and flexible routing, a well-designed escalation policy ensures that every alert — from a simple timeout to a full production outage — reaches the right person in the right amount of time.