
Escalation Policies: Designing Alert Escalation That Actually Works

Learn how to design alert escalation policies that ensure critical incidents always get attention while minimizing unnecessary interruptions to your team.

AzMonitor Team · June 18, 2025 · 7 min read · 1,150 words · Updated January 20, 2026
Tags: escalation, on-call, alerting, incident management

An escalation policy is the safety net under your on-call rotation. When the primary responder doesn't acknowledge an alert — because they're asleep, in a meeting, or dealing with another incident — escalation ensures someone else picks it up. A well-designed escalation policy means no critical alert ever goes unaddressed. A poorly designed one floods inboxes with alerts nobody acts on.

What an Escalation Policy Defines

Every escalation policy answers three questions:

  1. Who gets notified first? (primary)
  2. If they don't respond, who's next? (escalation chain)
  3. How long to wait before escalating? (timeout)

Simple version:

Alert fires → Primary on-call (5 min timeout)
  ↓ No ack
Secondary on-call (5 min timeout)
  ↓ No ack
Engineering manager (10 min timeout)
  ↓ No ack
VP Engineering (continuous paging)

The complexity of your escalation policy should match the complexity of your organization and the severity of your alerts.
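
The simple chain above can be modeled as an ordered list of targets and timeouts. The following is a minimal sketch, not any vendor's implementation: the `Level` class and the `current_target` function are hypothetical names introduced for illustration, and the values mirror the simple version above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Level:
    target: str
    timeout_minutes: Optional[int]  # None = keep paging this target indefinitely


# Mirrors the simple chain above (illustrative values)
CHAIN = [
    Level("primary_oncall", 5),
    Level("secondary_oncall", 5),
    Level("engineering_manager", 10),
    Level("vp_engineering", None),
]


def current_target(chain, minutes_since_alert):
    """Return the target being paged after the given unacknowledged time."""
    elapsed = 0
    for level in chain:
        if level.timeout_minutes is None:
            return level.target
        elapsed += level.timeout_minutes
        if minutes_since_alert < elapsed:
            return level.target
    # Every timeout expired: stay on the last level
    return chain[-1].target


for minute in (0, 5, 12, 25):
    print(minute, current_target(CHAIN, minute))
```

With these timeouts, an alert that nobody acknowledges reaches the secondary at minute 5, the engineering manager at minute 10, and the VP at minute 20.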

Severity-Based Escalation

Different alert severities should have different escalation urgency and chains:

# Escalation policies by severity

P1_critical:
  level_1:
    target: primary_oncall
    timeout_minutes: 3
    notification: [phone_call, sms, push]
  level_2:
    target: secondary_oncall
    timeout_minutes: 5
    notification: [phone_call, sms]
  level_3:
    target: engineering_manager
    timeout_minutes: 10
    notification: [phone_call]
  level_4:
    target: vp_engineering
    timeout_minutes: 15
    notification: [phone_call]

P2_high:
  level_1:
    target: primary_oncall
    timeout_minutes: 15
    notification: [push, sms]
  level_2:
    target: secondary_oncall
    timeout_minutes: 30
    notification: [push]
  level_3:
    target: engineering_manager
    timeout_minutes: 60
    notification: [push]

P3_medium:
  level_1:
    target: team_slack_channel
    timeout_minutes: 120
    notification: [slack]
  level_2:
    target: primary_oncall
    timeout_minutes: 480  # 8 hours (business hours)
    notification: [push]
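
To see what these timeouts mean in practice, it helps to add them up. The sketch below computes how many minutes after the alert fires each level is first notified. The `POLICIES` dictionary is a hand-copied mirror of the config above, and `page_times` is a hypothetical helper, not part of any alerting tool.

```python
from itertools import accumulate

# (target, timeout_minutes) per level, copied from the config above
POLICIES = {
    "P1_critical": [("primary_oncall", 3), ("secondary_oncall", 5),
                    ("engineering_manager", 10), ("vp_engineering", 15)],
    "P2_high": [("primary_oncall", 15), ("secondary_oncall", 30),
                ("engineering_manager", 60)],
    "P3_medium": [("team_slack_channel", 120), ("primary_oncall", 480)],
}


def page_times(levels):
    """Minutes after the alert fires at which each level is first notified."""
    timeouts = [timeout for _, timeout in levels]
    # Level 1 is notified at minute 0; each later level waits for all earlier timeouts
    offsets = [0] + list(accumulate(timeouts[:-1]))
    return [(target, offset) for (target, _), offset in zip(levels, offsets)]


for severity, levels in POLICIES.items():
    print(severity, page_times(levels))
```

Under this config an unacknowledged P1 reaches the VP of Engineering 18 minutes after it fires, while a P3 does not page the primary on-call for two hours.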

Building Escalation Policies in PagerDuty

# Create escalation policy via PagerDuty API
import pdpyras

session = pdpyras.APISession(api_key="YOUR_API_KEY")

policy = {
    "type": "escalation_policy",
    "name": "Engineering P1 Escalation",
    "description": "Critical incident escalation for engineering team",
    "escalation_rules": [
        {
            "escalation_delay_in_minutes": 3,
            "targets": [
                {
                    "type": "schedule_reference",
                    "id": "PRIMARY_SCHEDULE_ID"  # Primary on-call schedule
                }
            ]
        },
        {
            "escalation_delay_in_minutes": 5,
            "targets": [
                {
                    "type": "schedule_reference",
                    "id": "SECONDARY_SCHEDULE_ID"  # Secondary on-call schedule
                }
            ]
        },
        {
            "escalation_delay_in_minutes": 10,
            "targets": [
                {
                    "type": "user_reference",
                    "id": "ENGINEERING_MANAGER_USER_ID"
                }
            ]
        }
    ],
    "repeat_enabled": True,
    "num_loops": 3  # Repeat the escalation chain 3 times before stopping
}

response = session.post("escalation_policies", json={"escalation_policy": policy})
response.raise_for_status()  # session.post returns a requests.Response, not a dict
print(f"Created policy: {response.json()['escalation_policy']['id']}")

Notification Channels by Escalation Level

Different escalation levels should use different notification channels. Phone calls wake people up, which is what a P1 needs; push notifications are less disruptive and suit P2:

| Escalation Level | P1 Channels | P2 Channels | P3 Channels |
|---|---|---|---|
| Level 1 (primary) | Phone + SMS + Push | Push + SMS | Slack |
| Level 2 (secondary) | Phone + SMS | Push | Push |
| Level 3 (manager) | Phone | SMS | Email |
| Level 4 (executive) | Phone | Email | None |

Let each engineer choose their own notification channels, as long as the choices still meet the escalation requirements:

# Engineer notification preferences
alice:
  high_urgency:
    - phone_call
    - sms
    - push_notification
  low_urgency:
    - push_notification
  quiet_hours_override: true  # Ring even during quiet hours for P1

bob:
  high_urgency:
    - sms
    - push_notification
  low_urgency:
    - push_notification
  quiet_hours_override: false  # Respect quiet hours (secondary will catch P1)
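
One way to apply preferences like these is to resolve the channel list at page time. The sketch below is illustrative only: `channels_for` and `LOUD_CHANNELS` are hypothetical names, and it assumes that "respecting quiet hours" means dropping channels that ring (phone calls and SMS) while keeping push notifications.

```python
PREFERENCES = {
    "alice": {
        "high_urgency": ["phone_call", "sms", "push_notification"],
        "low_urgency": ["push_notification"],
        "quiet_hours_override": True,
    },
    "bob": {
        "high_urgency": ["sms", "push_notification"],
        "low_urgency": ["push_notification"],
        "quiet_hours_override": False,
    },
}

# Assumption: these are the channels that make noise and are muted in quiet hours
LOUD_CHANNELS = {"phone_call", "sms"}


def channels_for(user, urgency, in_quiet_hours):
    """Pick the channels to use for one person and one alert."""
    prefs = PREFERENCES[user]
    channels = prefs[f"{urgency}_urgency"]
    override = urgency == "high" and prefs["quiet_hours_override"]
    if in_quiet_hours and not override:
        # Respect quiet hours: keep only channels that do not ring
        channels = [c for c in channels if c not in LOUD_CHANNELS]
    return channels


print(channels_for("alice", "high", in_quiet_hours=True))
print(channels_for("bob", "high", in_quiet_hours=True))
```

During quiet hours Alice still gets the full high-urgency list, while Bob only receives a push notification, so the policy has to rely on the secondary to catch a P1 that Bob sleeps through.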

Handling Escalation Exceptions

Real-world escalation needs to handle edge cases:

Vacation coverage: when the primary is on vacation, a schedule override should route pages to whoever is covering for them:

from datetime import datetime, timezone

def get_current_oncall(schedule_id):
    """Get the current on-call person, accounting for overrides and vacations.

    fetch_schedule and fetch_overrides are placeholders for your scheduling
    tool's API client.
    """
    now = datetime.now(timezone.utc)  # one timestamp so both lookups agree

    # Overrides (vacation cover, schedule swaps) take precedence over the rotation
    overrides = fetch_overrides(schedule_id, now)
    if overrides:
        return overrides[0].user

    schedule = fetch_schedule(schedule_id, now)
    return schedule.current_oncall_user

Multiple simultaneous incidents: one engineer shouldn't handle two simultaneous P1 incidents. Cap concurrent pages per person so that a second incident escalates to the next level:

# Prevent one person from being paged for multiple simultaneous incidents
escalation_settings:
  max_concurrent_pages_per_person: 1
  on_overload:
    action: escalate_to_next_level
    message: "Primary on-call already handling incident - escalating"
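
The overload rule can be sketched as a small selection function. This is an assumption-laden illustration rather than a feature of any specific tool: `pick_responder` is a hypothetical name, and falling back to the last level when everyone is busy is a design choice made here so that a page is never dropped.

```python
def pick_responder(chain, active_pages, max_concurrent=1):
    """Return the first person in the chain who is under the page limit.

    chain: ordered list of responder names (level 1 first)
    active_pages: dict mapping responder name to open incident count
    """
    for person in chain:
        if active_pages.get(person, 0) < max_concurrent:
            return person
    # Everyone is busy: fall back to the last level rather than dropping the page
    return chain[-1]


chain = ["primary", "secondary", "manager"]
print(pick_responder(chain, {"primary": 1}))  # primary is busy
print(pick_responder(chain, {}))              # nobody is busy
```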

Time-zone aware escalation: for global teams, escalate to the region that's awake:

def get_regional_oncall(incident_time_utc):
    """Route escalation to the on-call team whose business hours are active"""
    hour = incident_time_utc.hour

    # The Europe and US East working days overlap at 14:00-16:00 UTC;
    # Europe keeps the pager until its day ends at 16:00.
    if 8 <= hour < 16:  # 08:00-16:00 UTC: Europe
        return "europe_oncall_schedule"
    elif 16 <= hour < 22:  # 16:00-22:00 UTC: US East
        return "us_east_oncall_schedule"
    else:  # 22:00-08:00 UTC: APAC
        return "apac_oncall_schedule"

Service-Specific Escalation

Different services might need different escalation paths based on who owns them:

service_escalation_policies:
  payment-service:
    primary_schedule: payments_team_oncall
    escalation_policy: payments_p1_escalation
    team_slack: "#payments-incidents"
    
  user-service:
    primary_schedule: user_platform_oncall
    escalation_policy: platform_p1_escalation
    team_slack: "#platform-incidents"
    
  infrastructure:
    primary_schedule: sre_oncall
    escalation_policy: sre_critical_escalation
    team_slack: "#infrastructure-incidents"

This ensures the right team gets paged for each service, rather than routing everything to a single on-call pool.
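
In code, this mapping is a lookup keyed by service name. The sketch below copies the entries from the config above; `route_alert` is a hypothetical helper, and sending alerts from unmapped services to the infrastructure entry is an assumption made here so that no alert ends up without an owner.

```python
SERVICE_POLICIES = {
    "payment-service": {
        "primary_schedule": "payments_team_oncall",
        "escalation_policy": "payments_p1_escalation",
        "team_slack": "#payments-incidents",
    },
    "user-service": {
        "primary_schedule": "user_platform_oncall",
        "escalation_policy": "platform_p1_escalation",
        "team_slack": "#platform-incidents",
    },
    "infrastructure": {
        "primary_schedule": "sre_oncall",
        "escalation_policy": "sre_critical_escalation",
        "team_slack": "#infrastructure-incidents",
    },
}

# Assumption: alerts from unmapped services go to the infrastructure entry
FALLBACK_SERVICE = "infrastructure"


def route_alert(service_name):
    """Return the routing entry for a service, falling back for unknown names."""
    return SERVICE_POLICIES.get(service_name, SERVICE_POLICIES[FALLBACK_SERVICE])


print(route_alert("payment-service")["escalation_policy"])
print(route_alert("legacy-batch-job")["team_slack"])
```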

Escalation Testing

Test your escalation policies regularly — don't wait for a real incident to discover they don't work:

#!/bin/bash
# Test escalation policy without waking anyone up

# Send a test alert to staging environment
curl -X POST "https://events.pagerduty.com/v2/enqueue" \
  -H "Content-Type: application/json" \
  -d '{
    "routing_key": "STAGING_INTEGRATION_KEY",
    "event_action": "trigger",
    "payload": {
      "summary": "[TEST] Escalation policy validation - NOT a real incident",
      "severity": "info",
      "source": "escalation-test",
      "custom_details": {
        "test": true,
        "test_id": "2025-06-18-01",
        "instruction": "Acknowledge this test alert and verify the escalation timing"
      }
    }
  }'

echo "Test alert sent. Verify:"
echo "1. Primary on-call received notification"
echo "2. Escalation to secondary fires after 3 minutes if not acknowledged"
echo "3. Escalation stops once the alert is acknowledged"

Run quarterly escalation tests:

  • Send a test alert and deliberately don't acknowledge it
  • Verify each escalation level fires in the correct order
  • Verify escalation stops when acknowledged
  • Verify no duplicate notifications
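
The checks in this list can be automated against the notification log from a test run. The following is a rough sketch under stated assumptions: `check_escalation_run` is a hypothetical helper, the log is assumed to be available as (target, minute) pairs, and the test is assumed to cover a single pass through the chain with no repeat loops.

```python
def check_escalation_run(expected, observed, ack_minute=None, tolerance=1):
    """Compare observed notifications with the expected escalation plan.

    expected: list of (target, minute_offset) in escalation order
    observed: list of (target, minute_offset) taken from the notification log
    ack_minute: minute the test alert was acknowledged, or None if never
    Returns a list of human-readable problems (empty means the run passed).
    """
    problems = []
    targets = [t for t, _ in observed]
    if len(targets) != len(set(targets)):
        problems.append("duplicate notifications")
    # Levels that should have fired before the acknowledgment
    due = [(t, m) for t, m in expected if ack_minute is None or m < ack_minute]
    if targets != [t for t, _ in due]:
        problems.append(f"wrong order or missing level: {targets}")
    else:
        for (target, want), (_, got) in zip(due, observed):
            if abs(got - want) > tolerance:
                problems.append(f"{target} paged at {got} min, expected {want}")
    if ack_minute is not None and any(m >= ack_minute for _, m in observed):
        problems.append("notification sent after acknowledgment")
    return problems


plan = [("primary", 0), ("secondary", 3), ("manager", 8)]
print(check_escalation_run(plan, [("primary", 0), ("secondary", 3)], ack_minute=4))
print(check_escalation_run(plan, [("primary", 0), ("manager", 8)]))
```

An empty list means the run matched the plan; anything else names the step that failed, which is easier to act on than a pass/fail flag.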

Escalation Metrics

Track how well your escalation is working:

-- Escalation analytics
SELECT
    DATE_TRUNC('week', alert_time) as week,
    severity,
    COUNT(*) as total_alerts,
    AVG(first_ack_minutes) as avg_time_to_first_ack,
    COUNT(CASE WHEN escalation_level > 1 THEN 1 END) as escalated,
    COUNT(CASE WHEN escalation_level > 2 THEN 1 END) as double_escalated,
    COUNT(CASE WHEN never_acknowledged THEN 1 END) as missed
FROM alert_history
WHERE alert_time > NOW() - INTERVAL '12 weeks'
GROUP BY DATE_TRUNC('week', alert_time), severity
ORDER BY week DESC, severity;

Watch for:

  • Rising escalation rates (primary isn't acknowledging)
  • High missed alert counts (escalation chain isn't reaching anyone)
  • Very fast ack times (alerts might be auto-acknowledged without investigation)
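
These signals can be checked automatically against each row the query returns. The sketch below is one possible shape: `review_week` is a hypothetical helper, the row is assumed to be a dict keyed by the query's column aliases, and both thresholds are illustrative values, not recommendations. It compares each week against a fixed threshold; spotting a rising trend would additionally need a comparison with earlier weeks.

```python
def review_week(row, max_escalation_rate=0.2, min_plausible_ack_minutes=0.5):
    """Return warnings for one row of the weekly query above.

    Thresholds are illustrative assumptions; tune them to your own baseline.
    """
    warnings = []
    total = row["total_alerts"]
    if total and row["escalated"] / total > max_escalation_rate:
        warnings.append("high escalation rate")
    if row["missed"] > 0:
        warnings.append("missed alerts")
    ack = row["avg_time_to_first_ack"]
    if ack is not None and ack < min_plausible_ack_minutes:
        warnings.append("suspiciously fast acks")
    return warnings


week = {"total_alerts": 40, "escalated": 12, "missed": 1,
        "avg_time_to_first_ack": 0.2}
print(review_week(week))
```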

When to Review Escalation Policies

Review escalation policies:

  • After any incident where the escalation didn't work as expected
  • After team structure changes (new hires, departures)
  • After changes to service ownership
  • During quarterly rotation reviews
  • When engineers report being paged outside their on-call shift

Conclusion

A good escalation policy is invisible when it works — alerts get acknowledged, incidents get resolved, and nobody falls through the cracks. It only becomes visible when it breaks: missed alerts, wrong people paged, or escalations that never fire. Build your escalation policies with the same care as your production code — test them, review them regularly, and update them as your team evolves. Combined with AzMonitor's multi-channel alerting and flexible routing, a well-designed escalation policy ensures that every alert — from a simple timeout to a full production outage — reaches the right person in the right amount of time.

AzMonitor Team
The AzMonitor team writes guides based on experience monitoring millions of endpoints daily across 10,000+ customer environments. Our expertise covers uptime monitoring, SRE practices, and reliability engineering.