Incident Response Playbooks: A Practical Guide for Engineering Teams

Learn how to build effective incident response playbooks that reduce MTTR, minimize confusion during outages, and help your team respond consistently to any incident.

AzMonitor Team · April 2, 2025 · 9 min read · 1,354 words · Updated January 20, 2026
Tags: incident response, playbooks, MTTR, on-call

When production is down, your team shouldn't be figuring out what to do next — they should be executing a plan they've already thought through. Incident response playbooks are that plan. They're the difference between a 12-minute recovery and a 3-hour war room with six people arguing about where to look first.

What Makes a Good Playbook

A playbook is only valuable if engineers actually use it during high-stress situations. That means it needs to be short, specific, and actionable. A 20-page document that requires careful reading is useless at 3 AM. The best playbooks have:

  • Clear trigger conditions — exactly when this playbook applies
  • Step-by-step actions — concrete commands and checks, not general guidance
  • Decision trees — if X, do Y; if Z, do W
  • Escalation paths — who to call when the playbook doesn't resolve it
  • Links to dashboards — one click to the relevant monitoring view

Playbook Structure

Every playbook should follow the same structure so engineers can navigate any of them quickly under pressure:

# [Service Name] - [Incident Type] Response Playbook

## Trigger
This playbook applies when: [exact alert condition or symptom description]

## Severity
[P1/P2/P3] - [Impact statement: "All users cannot complete checkout"]

## First Responder Actions (0-5 minutes)
1. Acknowledge the alert in PagerDuty
2. Check [dashboard link] for current status
3. Post in #incidents: "Investigating [alert name] - [your name] on it"

## Diagnosis (5-15 minutes)
Run these checks in order:
1. [Specific command or check]
2. [Specific command or check]
3. [Decision point - see Decision Tree below]

## Decision Tree
- If [condition A] → go to "Resolution: Database Issue"
- If [condition B] → go to "Resolution: Deployment Rollback"
- If neither → escalate to [team/person]

## Resolution Steps
### Resolution: Database Issue
1. [Step 1]
2. [Step 2]

### Resolution: Deployment Rollback
1. [Step 1]
2. [Step 2]

## Escalation
If not resolved in 30 minutes, page: [person/team]

## Post-Incident
- File incident report within 24 hours
- Schedule postmortem within 5 business days
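A consistent structure is easier to keep when it is checked automatically. The sketch below is one possible way to do that, assuming playbooks are stored as Markdown text that uses the `##` headings from the template above. The names `REQUIRED_SECTIONS` and `missing_sections` are hypothetical and not part of any tool.

```python
import re

# Section names taken from the template above; adjust to match your own template.
REQUIRED_SECTIONS = [
    "Trigger",
    "Severity",
    "First Responder Actions",
    "Diagnosis",
    "Decision Tree",
    "Resolution Steps",
    "Escalation",
    "Post-Incident",
]


def missing_sections(playbook_text, required=REQUIRED_SECTIONS):
    """Return the required section names that have no matching '## ' heading."""
    # Match level-2 headings only ("## Trigger"), not "### ..." subsections.
    headings = re.findall(r"^##[ \t]+(.+?)[ \t]*$", playbook_text, flags=re.MULTILINE)
    return [
        name
        for name in required
        # Prefix match so "## Diagnosis (5-15 minutes)" satisfies "Diagnosis".
        if not any(h.lower().startswith(name.lower()) for h in headings)
    ]
```

Running `missing_sections(open(path).read())` in CI for every playbook file would flag drafts that skip a section before they reach an on-call engineer.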

Building Playbooks from Past Incidents

The best source of playbook content is your own incident history. Mine previous incidents for:

  1. What alerts fired? — Use this as trigger conditions
  2. What did the responder check first? — These become diagnosis steps
  3. What commands were run? — Add these as specific actions
  4. What was the root cause? — Create a resolution path for each common cause
  5. What caused delays? — These become improvements to the playbook

Review your incident tickets from the last 6 months. Group similar incidents by symptom. For each group, write a playbook.
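The grouping step can be sketched in a few lines. This is a minimal illustration, assuming incident tickets have been exported as dictionaries with a `symptom` field. The field names, the `group_by_symptom` helper, and the sample records are all hypothetical.

```python
from collections import defaultdict


def group_by_symptom(incidents, min_count=2):
    """Group incident records by symptom and keep only symptoms that recur."""
    groups = defaultdict(list)
    for incident in incidents:
        # Normalise so "Payment errors " and "payment errors" land in one group.
        key = incident["symptom"].strip().lower()
        groups[key].append(incident)
    recurring = {k: v for k, v in groups.items() if len(v) >= min_count}
    # Most frequent symptom first: these are the first playbooks to write.
    return sorted(recurring.items(), key=lambda kv: len(kv[1]), reverse=True)


# Hypothetical sample data standing in for an export from a ticketing system.
incidents = [
    {"id": "INC-1", "symptom": "Payment errors", "root_cause": "bad deploy"},
    {"id": "INC-2", "symptom": "payment errors ", "root_cause": "db connection refused"},
    {"id": "INC-3", "symptom": "Slow checkout", "root_cause": "cache miss storm"},
    {"id": "INC-4", "symptom": "payment errors", "root_cause": "bad deploy"},
]

for symptom, group in group_by_symptom(incidents):
    causes = sorted({i["root_cause"] for i in group})
    print(f"{symptom}: {len(group)} incidents, causes: {causes}")
```

Each distinct root cause within a group is a candidate for its own resolution path in that group's playbook.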

Example: Payment Service Down Playbook

Here's a real-world example for a payment processing service:

# Payment Service - Complete Outage Playbook

## Trigger
- Alert: "Payment API availability < 95% for 2 consecutive checks"
- Alert: "payment_service_error_rate > 10%"
- Manual: Users reporting payment failures

## Severity: P1
All users cannot complete purchases. Revenue impact begins immediately.

## First Responder Actions (0-5 minutes)
1. Acknowledge PagerDuty alert
2. Open Payment Service Dashboard: https://monitoring.example.com/dashboard/payments
3. Post in #incident-response:
   "@here Investigating payment service alert - [your name] responding"
4. Join incident call bridge: +1-800-555-CONF / https://meet.example.com/incidents

## Diagnosis Checklist

### Step 1: Check Payment API Health

curl -H "Authorization: Bearer $MONITOR_TOKEN" https://api.example.com/health/payment

Expected: {"status": "healthy"}
If unhealthy: Check payment-service logs (Step 2)
If healthy: Issue may be intermittent - check recent deployments (Step 3)

### Step 2: Check Service Logs

kubectl logs -n payments deployment/payment-service --tail=100 | grep ERROR

Common errors and next steps:
- "connection refused" to DB → go to Step 4 (Database)
- "Stripe API timeout" → go to Step 5 (Stripe)
- "OOMKilled" → go to Step 6 (Memory)

### Step 3: Check Recent Deployments

kubectl rollout history deployment/payment-service -n payments

If deployed in last 2 hours → go to Resolution A: Deployment Rollback

### Step 4: Database Health

kubectl exec -n payments deployment/payment-service -- nc -zv postgres-primary 5432

If connection fails: Check DB team status page at [link]

### Step 5: Stripe API Status
Check: https://status.stripe.com
If Stripe is degraded: Enable payment queue mode (Resolution C)

### Step 6: Memory/Pod Issues

kubectl get pods -n payments
kubectl describe pod [pod-name] -n payments

If OOMKilled: Scale up replicas immediately (Resolution B)

## Resolution Procedures

### Resolution A: Deployment Rollback

kubectl rollout undo deployment/payment-service -n payments
kubectl rollout status deployment/payment-service -n payments

Verify: Check dashboard - error rate should drop within 2 minutes.

### Resolution B: Scale Up Replicas

kubectl scale deployment/payment-service -n payments --replicas=6

Current normal: 3 replicas. Scale to 6 for incident, notify platform team.

### Resolution C: Enable Payment Queue Mode

Enable async payment processing (queues payments to process when Stripe recovers)

kubectl set env deployment/payment-service -n payments PAYMENT_MODE=queue

Customer impact: Payments accepted but not immediately confirmed. Notify support.

## Escalation Path
- 0-15 min: First responder owns resolution
- 15-30 min: Page payments team lead (@payments-lead on Slack)
- 30-60 min: Page VP Engineering (PagerDuty escalation policy "P1-Escalation")
- 60+ min: Executive escalation via Crisis Protocol

## Communication Template
Every 15 minutes, post status to #status-updates:
"Payment service: [STATUS]. Root cause: [KNOWN/INVESTIGATING]. 
Estimated recovery: [TIME/UNKNOWN]. Impact: [USER IMPACT]."

## Post-Incident Requirements
- Incident report: Within 24 hours
- Postmortem: Within 5 business days
- Customer communication: If > 30 minutes, customer success team notifies affected accounts
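The decision points in this example are simple enough to express as lookup tables, which is a useful test of whether a playbook is specific enough. The sketch below encodes the Step 2 log routing and the Escalation Path from the example. The function names are hypothetical, and two behaviours are assumptions because the playbook does not state them: boundary times (exactly 15, 30, or 60 minutes) escalate to the next tier, and unmatched logs return `None`.

```python
# Error substrings and target steps copied from Step 2 of the example playbook.
LOG_ROUTES = [
    ("connection refused", "Step 4 (Database)"),
    ("Stripe API timeout", "Step 5 (Stripe)"),
    ("OOMKilled", "Step 6 (Memory)"),
]

# Upper time bounds in minutes, copied from the Escalation Path section.
ESCALATION_TIERS = [
    (15, "First responder owns resolution"),
    (30, "Page payments team lead"),
    (60, "Page VP Engineering"),
]


def next_step(log_lines):
    """Return the playbook step for the first log line that matches a known error."""
    for line in log_lines:
        for pattern, step in LOG_ROUTES:
            if pattern.lower() in line.lower():
                return step
    return None  # assumption: no match means the responder decides or escalates


def escalation_action(minutes_elapsed):
    """Return the escalation action that applies after the given elapsed time."""
    for upper_bound, action in ESCALATION_TIERS:
        # Assumption: a boundary value (e.g. exactly 15) moves to the next tier.
        if minutes_elapsed < upper_bound:
            return action
    return "Executive escalation via Crisis Protocol"
```

If a rule cannot be written down this plainly, the playbook step it came from is probably too vague to follow at 3 AM.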

Playbook Testing

A playbook that's never been tested will fail when you need it most. Run quarterly playbook drills:

# Quarterly Playbook Drill Checklist

## Pre-Drill
- [ ] Select playbook to test
- [ ] Notify on-call team of drill (or run surprise drill)
- [ ] Ensure no real incidents are active

## During Drill
- [ ] Trigger synthetic incident (or simulate via test environment)
- [ ] Time each playbook step
- [ ] Note where responders hesitate or deviate from playbook
- [ ] Document actual commands used

## Post-Drill Review
- [ ] Did the playbook accurately describe the situation?
- [ ] Were all commands correct and executable?
- [ ] Did the decision tree cover the scenarios encountered?
- [ ] Total time to resolution vs. SLA target?
- [ ] Update playbook with corrections
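Timing each step by hand is easy to forget mid-drill. One possible approach is a small timer that the drill facilitator wraps around each step. This is a sketch only: `DrillTimer` is a hypothetical helper, and the 15-minute target in the usage lines is an illustrative number, not a recommendation from this guide.

```python
import time
from contextlib import contextmanager


class DrillTimer:
    """Record how long each playbook step takes during a drill."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so the timer can be tested without sleeping
        self.steps = []      # list of (step_name, seconds)

    @contextmanager
    def step(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record the duration even if the step raised an exception.
            self.steps.append((name, self._clock() - start))

    def total_seconds(self):
        return sum(seconds for _, seconds in self.steps)

    def report(self, target_seconds):
        """Compare total drill time with a target and list the slowest steps first."""
        total = self.total_seconds()
        return {
            "total_seconds": total,
            "within_target": total <= target_seconds,
            "slowest_first": sorted(self.steps, key=lambda s: s[1], reverse=True),
        }


timer = DrillTimer()
with timer.step("Acknowledge alert"):
    pass  # the responder performs the step here
print(timer.report(target_seconds=15 * 60))
```

The `slowest_first` list points at the steps where responders hesitated, which is usually where the playbook needs rewriting.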

Playbook Maintenance

Playbooks rot. Services change, infrastructure changes, and playbooks that aren't maintained become actively harmful — they mislead responders. Maintain them with:

| Maintenance Trigger | Action |
|---|---|
| New deployment or infrastructure change | Review and update relevant playbooks |
| Incident occurs | After postmortem, update playbook to add new scenario |
| Quarterly review | Audit all playbooks for accuracy |
| New team member | Use onboarding to validate playbooks are understandable |

# Automated playbook staleness detection
import os
from datetime import datetime, timedelta


def check_playbook_freshness(playbooks_dir, max_age_days=90):
    """Return playbooks not modified within max_age_days, oldest first."""
    now = datetime.now()
    threshold = now - timedelta(days=max_age_days)
    stale = []

    for filename in os.listdir(playbooks_dir):
        if not filename.endswith('.md'):
            continue
        filepath = os.path.join(playbooks_dir, filename)
        # File mtime is a proxy for "last updated"; a fresh git clone resets it.
        modified = datetime.fromtimestamp(os.path.getmtime(filepath))

        if modified < threshold:
            stale.append({
                "playbook": filename,
                "last_modified": modified.isoformat(),
                "days_old": (now - modified).days,
            })

    # The caller decides how to alert; this function only reports.
    return sorted(stale, key=lambda x: x['days_old'], reverse=True)

Integrating Playbooks with Monitoring

The most effective setup links monitoring alerts directly to playbooks:

# Alert with playbook link
alert:
  name: "Payment Service Degraded"
  condition: "payment_error_rate > 5%"
  notification:
    pagerduty:
      severity: critical
      body: |
        Payment service error rate is {{ value }}%.
        
        Playbook: https://wiki.example.com/playbooks/payment-service-degraded
        Dashboard: https://monitoring.example.com/d/payments
        Runbook author: payments-team

When an on-call engineer gets paged, they immediately have the playbook URL in the alert. No hunting through documentation systems.
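That guarantee only holds if every alert actually carries a link. The sketch below shows one way to check this, assuming alert definitions like the one above have already been loaded into dictionaries (for example with a YAML parser). The `alerts_missing_playbook` helper and the second sample alert are hypothetical.

```python
import re

# Matches the "Playbook: <url>" line used in the alert body above.
PLAYBOOK_URL = re.compile(r"Playbook:\s*(https?://\S+)")


def alerts_missing_playbook(alerts):
    """Return names of alerts whose PagerDuty body has no 'Playbook: <url>' line."""
    missing = []
    for alert in alerts:
        body = alert.get("notification", {}).get("pagerduty", {}).get("body", "")
        if not PLAYBOOK_URL.search(body):
            missing.append(alert.get("name", "<unnamed>"))
    return missing


alerts = [
    {
        "name": "Payment Service Degraded",
        "notification": {"pagerduty": {"body": (
            "Payment service error rate is {{ value }}%.\n"
            "Playbook: https://wiki.example.com/playbooks/payment-service-degraded\n"
        )}},
    },
    {
        "name": "Checkout Latency High",  # hypothetical alert with no playbook link
        "notification": {"pagerduty": {"body": "Checkout p95 latency is high."}},
    },
]

print(alerts_missing_playbook(alerts))  # ['Checkout Latency High']
```

Run as part of the review for alert changes, a check like this keeps new alerts from shipping without a playbook.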

Conclusion

Playbooks are infrastructure. They're built once, maintained continuously, and pay dividends every time an incident occurs. The best engineering teams treat playbooks like code — they're reviewed, tested, and updated in the same way as production software. When your monitoring alerts fire, AzMonitor surfaces not just what's broken but links you directly to the actions that fix it. Pair that with well-maintained playbooks and your team can recover from incidents in minutes, not hours.

AzMonitor Team
The AzMonitor team writes guides based on experience monitoring millions of endpoints daily across 10,000+ customer environments. Our expertise covers uptime monitoring, SRE practices, and reliability engineering.