PagerDuty is one of the most widely used platforms for on-call alert management. The difference between a well-configured PagerDuty account and a poorly configured one is significant: one reliably delivers the right alert to the right person with the right context at 3am; the other creates alert fatigue, missed incidents, and burned-out engineers. This guide walks through a deliberate PagerDuty setup from scratch.

PagerDuty Architecture Overview

PagerDuty's core concepts:

Services: Represent things you want to monitor (your checkout API, database, frontend)
Integrations: Connect monitoring tools to PagerDuty services (AzMonitor, Datadog, etc.)
Escalation Policies: Define who gets paged, and who is paged next if they don't respond
Schedules: Define who is on-call at any given time
Event Orchestration (formerly Event Rules): Routes incoming events to the right service based on their content

Monitoring Alert
    ↓
PagerDuty Integration (API key or webhook)
    ↓
PagerDuty Service
    ↓
Escalation Policy
    ↓
Schedule (who's on-call right now?)
    ↓
On-call Engineer (phone call, SMS, push notification)

Setting Up Services

Create one PagerDuty service per logical unit that can fail independently:

## Service Structure for a SaaS Application

### Recommended Services
- checkout-api (P1 - wake anyone anytime)
- authentication-service (P1)
- database-cluster (P1)
- frontend-app (P2 - page during business hours only)
- background-jobs (P2)
- reporting-service (P3 - next business day)
- staging-environment (P4 - Slack only, no pages)

### Service Settings
For P1 services:
- Escalation Policy: p1-immediate-escalation
- Alert Grouping: Intelligent (groups related alerts)
- Auto-resolve: After 4 hours if no action

For P2 services:
- Escalation Policy: p2-business-hours
- Auto-resolve: After 8 hours
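
As a rough sketch, the P1 and P2 settings above could be turned into a request body for the PagerDuty REST API's service endpoint. The helper name `build_service_payload` and the escalation policy IDs are hypothetical, and the field names (`auto_resolve_timeout` in seconds, `alert_grouping_parameters`) reflect my understanding of the service object, so check them against the current API reference before relying on them.

```python
def build_service_payload(
    name: str,
    escalation_policy_id: str,
    auto_resolve_hours: int,
    intelligent_grouping: bool = False,
) -> dict:
    """Build a service definition without sending it anywhere."""
    service = {
        "type": "service",
        "name": name,
        "escalation_policy": {
            "id": escalation_policy_id,  # placeholder ID, not a real policy
            "type": "escalation_policy_reference",
        },
        # The guide states auto-resolve in hours; this field is in seconds.
        "auto_resolve_timeout": auto_resolve_hours * 3600,
    }
    if intelligent_grouping:
        service["alert_grouping_parameters"] = {"type": "intelligent"}
    return {"service": service}


p1 = build_service_payload("checkout-api", "PXXXXP1", 4, intelligent_grouping=True)
p2 = build_service_payload("frontend-app", "PXXXXP2", 8)
```

Keeping the payload builder separate from the HTTP call makes the settings easy to review and test before anything is created in the account.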

Configuring Escalation Policies

The escalation policy defines what happens when an alert isn't acknowledged:

# P1 Escalation Policy Structure
# Note: escalation_delay_in_minutes is how long an unacknowledged incident
# stays at a rule before moving to the next one, not time since the trigger.
name: "P1 Critical - Immediate"
description: "For production outages affecting customers"

escalation_rules:
  # Level 1: Primary on-call (notified immediately, escalates after 5 min)
  - escalation_delay_in_minutes: 5
    targets:
      - type: schedule_reference
        id: primary-oncall-schedule

  # Level 2: Secondary on-call (notified at 5 min, escalates after 10 more)
  - escalation_delay_in_minutes: 10
    targets:
      - type: schedule_reference
        id: secondary-oncall-schedule

  # Level 3: Engineering Manager (notified at 15 min, escalates after 15 more)
  - escalation_delay_in_minutes: 15
    targets:
      - type: user_reference
        id: engineering-manager-id

  # Level 4: VP Engineering (notified at 30 min; the delay here only matters
  # if the policy is configured to repeat)
  - escalation_delay_in_minutes: 30
    targets:
      - type: user_reference
        id: vp-engineering-id

---
# P2 Business Hours Policy
name: "P2 High - Business Hours"
description: "Pages during business hours, queues for morning if off-hours"

escalation_rules:
  # Only page during 9am-6pm Monday-Friday (local time); the schedule
  # restricts the hours. Escalates after 30 min with no response.
  - escalation_delay_in_minutes: 30
    targets:
      - type: schedule_reference
        id: business-hours-oncall

  # Team lead (notified at 30 min)
  - escalation_delay_in_minutes: 30
    targets:
      - type: user_reference
        id: team-lead-id
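
One detail that is easy to get wrong: as I understand the PagerDuty REST API, `escalation_delay_in_minutes` is a per-rule value (how long an incident waits at that level before escalating), not minutes since the incident was triggered. The sketch below converts per-rule delays into the time at which each level is first notified. The function name `notification_timeline` is made up for illustration.

```python
from itertools import accumulate


def notification_timeline(rules: list) -> list:
    """Return (level, minutes_after_trigger) for each escalation rule."""
    if not rules:
        return []
    delays = [rule["escalation_delay_in_minutes"] for rule in rules]
    # Level 1 is notified at minute 0. Each later level is notified once the
    # delays of all earlier levels have elapsed, so the last delay is unused.
    starts = [0] + list(accumulate(delays))[:-1]
    return list(enumerate(starts, start=1))


p1_rules = [
    {"escalation_delay_in_minutes": 5},   # primary on-call
    {"escalation_delay_in_minutes": 10},  # secondary on-call
    {"escalation_delay_in_minutes": 15},  # engineering manager
    {"escalation_delay_in_minutes": 30},  # VP engineering
]
timeline = notification_timeline(p1_rules)
```

With these delays the levels are reached at 0, 5, 15, and 30 minutes, which matches the intent of the P1 policy described above.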

Setting Up On-Call Schedules

Design schedules for sustainable rotation:

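# PagerDuty API schedule configuration
# Use the PagerDuty API or Terraform to manage schedules as code

schedule_config = {
    "name": "Primary On-Call Rotation",
    "type": "schedule",
    "time_zone": "America/New_York",
    "description": "Weekly rotation for P1 incidents",

    "schedule_layers": [
        {
            "name": "Primary Engineers",
            "start": "2025-01-06T00:00:00-05:00",
            "rotation_turn_length_seconds": 604800,  # 1 week in seconds
            # Anchor for handoffs: every turn starts at this weekday and time
            "rotation_virtual_start": "2025-01-06T09:00:00-05:00",
            # Placeholder IDs; the REST API expects each entry wrapped
            # as a user reference
            "users": [
                {"user": {"id": "P1ENGINEER1", "type": "user_reference"}},
                {"user": {"id": "P2ENGINEER2", "type": "user_reference"}},
                {"user": {"id": "P3ENGINEER3", "type": "user_reference"}},
                {"user": {"id": "P4ENGINEER4", "type": "user_reference"}},
                {"user": {"id": "P5ENGINEER5", "type": "user_reference"}},
                {"user": {"id": "P6ENGINEER6", "type": "user_reference"}}
            ]
        }
    ]
}

# Send as {"schedule": schedule_config} when creating the schedule.
# Overrides for planned absences are created separately through the
# schedule's overrides endpoint, not as part of this payload.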
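
To make the rotation arithmetic concrete, here is a minimal sketch of how a single-layer weekly rotation resolves to one person at a given moment. It is not how PagerDuty computes schedules internally: it ignores overrides, restrictions, multiple layers, and daylight saving time (a fixed UTC-5 offset is assumed). The names `on_call_at` and `ENGINEERS` are invented for the example.

```python
from datetime import datetime, timedelta, timezone


def on_call_at(users, virtual_start, turn_length, when):
    """Return the user whose turn covers `when` in a single-layer rotation."""
    if not users:
        raise ValueError("rotation needs at least one user")
    if when < virtual_start:
        raise ValueError("`when` is before the rotation started")
    # Whole turns completed since the rotation anchor, wrapped around the
    # list of users so the rotation repeats.
    turns_elapsed = (when - virtual_start) // turn_length
    return users[turns_elapsed % len(users)]


EST = timezone(timedelta(hours=-5))  # fixed offset, an assumption of this sketch
ROTATION_START = datetime(2025, 1, 6, 9, 0, tzinfo=EST)
ONE_WEEK = timedelta(seconds=604800)
ENGINEERS = ["eng-1", "eng-2", "eng-3", "eng-4", "eng-5", "eng-6"]

second_week = on_call_at(
    ENGINEERS, ROTATION_START, ONE_WEEK, ROTATION_START + timedelta(days=8)
)
```

A check like this is handy in a script that previews who would be paged on a given date before a schedule change goes live.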

Schedule Rotation Best Practices

## Rotation Design Guidelines

### Minimum team size: 5-6 engineers
- Each engineer covers ~9-10 weeks per year (52 weeks split across 5-6 people)
- Enough buffer for vacations and sick days

### Handoff timing
- End of business day (e.g., Monday 5pm) — not midnight
- Allows overlap period for questions and context transfer
- Avoids handing off while engineers are asleep or just starting their day

### Restricted days
Consider exempting engineers from on-call:
- On major holidays (rely on volunteers plus a backup)
- Within 3 days of returning from vacation
- After 3 consecutive weeks on call (detect and enforce automatically)

### Secondary on-call
For critical services, have a secondary rotation:
- Secondary is paged only if primary doesn't respond in 5 minutes
- The secondary rotation can be a separate layer staffed by senior engineers
- Provides backup without requiring everyone to be primary
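
The load figures and the consecutive-weeks rule above are simple to check in code. The sketch below is illustrative only: `weeks_on_call_per_year` and `longest_streak` are hypothetical helpers, and the input is assumed to be a plain list with one assignee name per week.

```python
def weeks_on_call_per_year(team_size: int, weeks_in_year: int = 52) -> float:
    """Average number of primary on-call weeks per engineer per year."""
    if team_size < 1:
        raise ValueError("team_size must be at least 1")
    return weeks_in_year / team_size


def longest_streak(weekly_assignments: list, engineer: str) -> int:
    """Longest run of consecutive weeks assigned to `engineer`."""
    longest = current = 0
    for assignee in weekly_assignments:
        current = current + 1 if assignee == engineer else 0
        longest = max(longest, current)
    return longest


load_with_six = round(weeks_on_call_per_year(6), 1)
load_with_five = round(weeks_on_call_per_year(5), 1)
streak = longest_streak(["ana", "ana", "raj", "ana", "ana", "ana", "ana"], "ana")
```

A streak above 3 would violate the consecutive-weeks guideline, so a check like this could run whenever the rotation or its overrides change.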

Integrating Monitoring with PagerDuty

AzMonitor Integration

# AzMonitor alert configuration with PagerDuty routing
monitors:
  - name: "Checkout API Health"
    url: "https://api.example.com/checkout/health"
    interval: 60
    
    alerts:
      - type: pagerduty
        integration_key: "${PAGERDUTY_CHECKOUT_API_KEY}"
        severity_mapping:
          down: "critical"
          slow: "warning"
          ssl_expiring: "warning"
        
        # Include context in PagerDuty alert
        custom_details:
          service: "checkout-api"
          runbook: "https://wiki.example.com/runbooks/checkout-api"
          dashboard: "https://monitoring.example.com/dashboards/checkout"

Generic Webhook Integration

# Send alerts to PagerDuty via Events API v2
from datetime import datetime, timezone
from typing import Optional

import requests

EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"

def send_pagerduty_alert(
    integration_key: str,
    summary: str,
    severity: str,  # critical, error, warning, info
    source: str,
    details: Optional[dict] = None
) -> dict:
    """
    Send an alert to PagerDuty using Events API v2.
    """
    payload = {
        "routing_key": integration_key,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "severity": severity,
            "source": source,
            # Timezone-aware UTC timestamp in ISO 8601 format
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "custom_details": details or {}
        }
    }

    response = requests.post(
        EVENTS_API_URL,
        json=payload,
        headers={"Content-Type": "application/json"},
        timeout=10  # avoid hanging the caller if PagerDuty is unreachable
    )

    # Keep the returned dedup_key: it is needed to resolve this alert later
    return {
        "status_code": response.status_code,
        "dedup_key": response.json().get("dedup_key")
    }

def resolve_pagerduty_alert(integration_key: str, dedup_key: str) -> dict:
    """
    Resolve an alert in PagerDuty.
    """
    payload = {
        "routing_key": integration_key,
        "event_action": "resolve",
        "dedup_key": dedup_key
    }

    response = requests.post(
        EVENTS_API_URL,
        json=payload,
        headers={"Content-Type": "application/json"},
        timeout=10
    )

    return {"status_code": response.status_code}
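
Resolving an alert depends on knowing its `dedup_key`. One way to avoid storing the key PagerDuty returns is to supply a deterministic key on the trigger event, so the same check always maps to the same alert. The sketch below only builds event bodies and sends nothing. `make_dedup_key` and `build_event` are hypothetical helpers, and the 32-character hash is an arbitrary choice for this example.

```python
import hashlib
from typing import Optional


def make_dedup_key(service: str, check_name: str) -> str:
    """Derive a stable key so repeated triggers update one alert."""
    raw = f"{service}:{check_name}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:32]


def build_event(
    routing_key: str,
    action: str,
    dedup_key: str,
    summary: Optional[str] = None,
    severity: Optional[str] = None,
    source: Optional[str] = None,
) -> dict:
    """Build an Events API v2 body for trigger, acknowledge, or resolve."""
    if action not in {"trigger", "acknowledge", "resolve"}:
        raise ValueError(f"unsupported event_action: {action}")
    event = {
        "routing_key": routing_key,
        "event_action": action,
        "dedup_key": dedup_key,
    }
    if action == "trigger":
        # Only trigger events carry a payload with alert details.
        if not (summary and severity and source):
            raise ValueError("trigger events need summary, severity, and source")
        event["payload"] = {
            "summary": summary,
            "severity": severity,
            "source": source,
        }
    return event


key = make_dedup_key("checkout-api", "health-check")
trigger = build_event("ROUTING_KEY", "trigger", key,
                      summary="checkout-api health check failing",
                      severity="critical", source="prod-us-east-1")
resolve = build_event("ROUTING_KEY", "resolve", key)
```

Because the trigger and resolve events share one key, a monitor that recovers can close its own alert without any stored state.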

Alert Content Best Practices

The alert content determines how quickly an engineer can understand and respond:

# Bad alert content
bad_alert = {
    "summary": "Monitor failed",  # Which monitor? What's happening?
    "details": {}  # No context at all
}

# Good alert content
good_alert = {
    "summary": "CRITICAL: checkout-api returning 503 errors, checkout flow affected",
    "severity": "critical",
    "source": "AzMonitor — prod-us-east-1",
    "custom_details": {
        "monitor_name": "Checkout API Health",
        "url_checked": "https://api.example.com/checkout/health",
        "status_code": 503,
        "response_time_ms": 8234,
        "check_location": "us-east-1",
        "alert_started": "2025-06-25T14:22:00Z",
        "runbook": "https://wiki.example.com/runbooks/checkout-api-500s",
        "dashboard": "https://monitoring.example.com/dashboards/checkout",
        "incident_channel": "#incidents-active",
        "escalation_policy": "P1 Critical"
    }
}
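
A small lint step can catch thin alerts before they ever page someone. The sketch below checks an alert dictionary shaped like the examples above. The rules are assumptions made for illustration: the four-word minimum and the required `runbook` and `dashboard` keys are arbitrary, and the 1024-character summary limit is my understanding of the Events API v2 constraint.

```python
ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}
REQUIRED_DETAIL_KEYS = ("runbook", "dashboard")  # assumption for this sketch
MAX_SUMMARY_LENGTH = 1024


def alert_problems(alert: dict) -> list:
    """Return a list of human-readable problems; empty means the alert passes."""
    problems = []
    summary = alert.get("summary", "")
    if len(summary.split()) < 4:
        problems.append("summary too short to be actionable")
    if len(summary) > MAX_SUMMARY_LENGTH:
        problems.append("summary exceeds maximum length")
    if alert.get("severity") not in ALLOWED_SEVERITIES:
        problems.append("severity missing or invalid")
    if not alert.get("source"):
        problems.append("source missing")
    details = alert.get("custom_details") or {}
    for key in REQUIRED_DETAIL_KEYS:
        if not details.get(key):
            problems.append(f"custom_details.{key} missing")
    return problems


thin = {"summary": "Monitor failed", "details": {}}
rich = {
    "summary": "CRITICAL: checkout-api returning 503 errors, checkout flow affected",
    "severity": "critical",
    "source": "prod-us-east-1",
    "custom_details": {
        "runbook": "https://wiki.example.com/runbooks/checkout-api-500s",
        "dashboard": "https://monitoring.example.com/dashboards/checkout",
    },
}
```

Running a check like this in CI against alert templates is one way to keep new monitors from shipping without a runbook link.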

PagerDuty-Slack Integration

Slack notifications complement PagerDuty pages:

## PagerDuty → Slack Integration Setup

### In PagerDuty:
1. Go to Integrations → Extensions → Slack V2
2. Connect your Slack workspace
3. Configure notification routing:
   - P1 alerts → #incidents-active + @channel mention
   - P2 alerts → #alerts-p2 (no @channel)
   - Resolution → #incidents-active

### Slack Notification Content
Configure PagerDuty Slack notifications to include:
- Incident title and service name
- Link to PagerDuty incident
- Current status (triggered/acknowledged/resolved)
- Assigned engineer name

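### Useful Slash Commands
(Available commands depend on the version of the Slack integration; run /pd help in Slack to confirm what your workspace supports.)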
/pd ack [incident-id]     — Acknowledge from Slack
/pd resolve [incident-id] — Resolve from Slack
/pd trigger [service]     — Manually trigger an incident
/pd who                   — Who's on-call right now?
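
If notification routing is ever handled by a custom relay rather than the built-in integration, the routing rules above reduce to a small lookup. This is a sketch under that assumption; `ROUTES` and `slack_route` are invented names and nothing here calls the Slack or PagerDuty APIs.

```python
from typing import Optional

# Routing table mirroring the channel rules listed above.
ROUTES = {
    "P1": {"channel": "#incidents-active", "mention_channel": True},
    "P2": {"channel": "#alerts-p2", "mention_channel": False},
}
RESOLUTION_CHANNEL = "#incidents-active"


def slack_route(priority: str, status: str) -> Optional[dict]:
    """Pick the Slack destination for an incident update, or None to skip."""
    if status == "resolved":
        # Resolutions are announced without an @channel mention.
        return {"channel": RESOLUTION_CHANNEL, "mention_channel": False}
    return ROUTES.get(priority)


p1_route = slack_route("P1", "triggered")
p2_route = slack_route("P2", "triggered")
```

Priorities without an entry return `None`, which leaves the decision about lower-priority alerts explicit rather than defaulting to a noisy channel.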

Maintenance Window Configuration

Set maintenance windows to silence alerts during planned maintenance:

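# Create PagerDuty maintenance window via API
from datetime import datetime

import requests

def create_maintenance_window(
    pd_api_key: str,
    service_ids: list,
    start_time: datetime,
    end_time: datetime,
    description: str,
    from_email: str
) -> dict:
    """
    Create a maintenance window to suppress alerts during planned maintenance.

    start_time and end_time should be timezone-aware so the ISO 8601 strings
    carry an explicit offset. from_email is the address of a PagerDuty user
    on the account, which the REST API expects in the From header.
    """
    response = requests.post(
        "https://api.pagerduty.com/maintenance_windows",
        headers={
            "Authorization": f"Token token={pd_api_key}",
            "Accept": "application/vnd.pagerduty+json;version=2",
            "Content-Type": "application/json",
            "From": from_email
        },
        json={
            "maintenance_window": {
                "type": "maintenance_window",
                "start_time": start_time.isoformat(),
                "end_time": end_time.isoformat(),
                "description": description,
                "services": [
                    {"id": service_id, "type": "service_reference"}
                    for service_id in service_ids
                ]
            }
        },
        timeout=10
    )
    response.raise_for_status()  # fail loudly rather than return an error body

    return response.json()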
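
A common mistake with maintenance windows is passing naive datetimes or a window that has already ended. The helper below is a minimal sketch for computing and validating the bounds before calling the API. The name `maintenance_window_bounds` and the `now` parameter (included so the function can be tested) are assumptions of this example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


def maintenance_window_bounds(
    start: datetime,
    duration_minutes: int,
    now: Optional[datetime] = None,
) -> Tuple[datetime, datetime]:
    """Return (start, end) for a window, rejecting obviously bad input."""
    if start.tzinfo is None:
        raise ValueError("start must be timezone-aware")
    if duration_minutes <= 0:
        raise ValueError("duration_minutes must be positive")
    now = now or datetime.now(timezone.utc)
    end = start + timedelta(minutes=duration_minutes)
    if end <= now:
        raise ValueError("maintenance window is already over")
    return start, end


window_start, window_end = maintenance_window_bounds(
    datetime(2025, 7, 1, 2, 0, tzinfo=timezone.utc),
    duration_minutes=90,
    now=datetime(2025, 6, 30, 12, 0, tzinfo=timezone.utc),
)
```

The returned pair can be passed as `start_time` and `end_time` to `create_maintenance_window`.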

On-Call Metrics in PagerDuty

Use PagerDuty's reporting to track on-call health:

## Key Reports to Monitor

### Team Health Dashboard
- Mean time to acknowledge (MTTA)
- Mean time to resolve (MTTR)
- Alerts per service per week
- Interruptions outside business hours

### Signals for Burnout Risk
- Engineer with > 15 alerts in a rotation week
- Same alert triggering > 3x in a week (suggests threshold tuning needed)
- High MTTA (> 10 min) suggests pages are being missed or ignored
- Low acknowledgment rate suggests paging the wrong people

### Monthly Review Metrics
Pull from PagerDuty Analytics:
- Total incidents per service (trend up or down?)
- Average incident duration per service
- Time spent in incidents per engineer (equity review)
- Escalation rate (how often does level 1 not respond?)
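
For teams that export incident data, the metrics and burnout signals above can be computed with a few lines. The sketch below assumes a hypothetical export format (dicts with `title`, `assignee`, and `triggered`, `acknowledged`, `resolved` datetimes); it does not call the PagerDuty Analytics API. The thresholds are the ones listed above.

```python
from collections import Counter
from datetime import datetime, timedelta
from statistics import mean


def mean_minutes(incidents, end_field):
    """Mean minutes from trigger to `end_field`, or None if no data."""
    durations = [
        (i[end_field] - i["triggered"]).total_seconds() / 60
        for i in incidents
        if i.get(end_field) is not None
    ]
    return mean(durations) if durations else None


def burnout_signals(week_incidents, max_alerts=15, max_repeats=3, max_mtta_minutes=10):
    """Flag the warning signs for one rotation week of incidents."""
    per_engineer = Counter(i["assignee"] for i in week_incidents)
    per_title = Counter(i["title"] for i in week_incidents)
    mtta = mean_minutes(week_incidents, "acknowledged")
    return {
        "overloaded_engineers": sorted(e for e, n in per_engineer.items() if n > max_alerts),
        "noisy_alerts": sorted(t for t, n in per_title.items() if n > max_repeats),
        "high_mtta": mtta is not None and mtta > max_mtta_minutes,
    }


T0 = datetime(2025, 6, 23, 2, 0)
SAMPLE = [
    {"title": "disk full", "assignee": "ana", "triggered": T0,
     "acknowledged": T0 + timedelta(minutes=4), "resolved": T0 + timedelta(minutes=30)},
    {"title": "disk full", "assignee": "ana", "triggered": T0 + timedelta(hours=1),
     "acknowledged": T0 + timedelta(hours=1, minutes=8),
     "resolved": T0 + timedelta(hours=1, minutes=50)},
]
mtta = mean_minutes(SAMPLE, "acknowledged")
mttr = mean_minutes(SAMPLE, "resolved")
```

Running this weekly per rotation makes the equity and noise review a routine report rather than an ad hoc investigation.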

Conclusion

PagerDuty setup is not a one-time task — it requires ongoing refinement as services evolve and team composition changes. The fundamentals are straightforward: create services aligned with failure domains, configure escalation policies that match severity, set up schedules with enough people to sustain rotation, and integrate with monitoring tools that send clean, actionable alerts. AzMonitor's PagerDuty integration sends rich alert context with each notification — the monitor name, URL, status code, response time, and direct links to runbooks — giving on-call engineers the information they need to respond effectively before they even open their laptop.

Tags: PagerDuty, on-call setup, incident alerting, escalation policies

AzMonitor Team

The AzMonitor team writes guides based on experience monitoring millions of endpoints daily across 10,000+ customer environments. Our expertise covers uptime monitoring, SRE practices, and reliability engineering.

PagerDuty Setup: Configuring On-Call Alerting for Engineering Teams