
Postmortem Templates: Structured Formats for Effective Incident Reviews

Ready-to-use postmortem templates for different incident types, with guidance on what to include, what to skip, and how to drive action items to completion.

AzMonitor Team · April 30, 2025 · 8 min read · 1,674 words · Updated January 20, 2026
Tags: postmortem, incident review, root cause analysis, templates

Postmortem templates solve a practical problem: when an incident happens, you shouldn't spend mental energy deciding how to structure the write-up. A pre-defined template lets you focus on the content: what went wrong, why, and what to do differently. Good templates are specific enough to guide thinking but open enough to capture the unique circumstances of each incident.

Template Design Principles

Before choosing or creating a template, understand what makes postmortem templates effective:

Scope appropriately — A 20-minute P3 outage doesn't need the same analysis depth as a 4-hour P1 data incident. Have multiple templates calibrated to incident severity.

Separate timeline from analysis — Facts in one section, interpretation in another. This makes postmortems more accurate and less defensive.

Action items are outputs, not afterthoughts — The template should force specific, owned action items, not vague commitments.

Blameless language — Templates that ask "Who caused this?" produce defensive postmortems. Templates that ask "What conditions made this possible?" produce learning.

Required vs optional sections — Mark which sections are required for all incidents and which are optional for complex cases.
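
As a rough illustration of calibrating templates to severity, the sketch below routes an incident to one of the three templates described in this article. The template identifiers, the `Incident` fields, and the rule that security incidents always get the security template are assumptions made for the example, not a prescribed policy.

```python
from dataclasses import dataclass

# Hypothetical template identifiers; map them to wherever your team stores templates.
STANDARD = "standard-postmortem"
SECURITY = "security-postmortem"
LIGHTWEIGHT = "lightweight-review"


@dataclass(frozen=True)
class Incident:
    severity: int              # 1 = P1, 2 = P2, 3 = P3
    is_security: bool = False  # assumed flag set during triage


def choose_template(incident: Incident) -> str:
    """Pick the template whose depth matches the incident."""
    if incident.is_security:
        # Assumption: security incidents need the regulatory sections at any severity.
        return SECURITY
    if incident.severity <= 2:
        # P1 and P2 incidents get the full standard postmortem.
        return STANDARD
    # Everything else gets the lightweight review.
    return LIGHTWEIGHT


print(choose_template(Incident(severity=3)))
```

Keeping the rule in one small function makes it easy to change the cut-off when the team decides a P3 with customer impact deserves the full treatment.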

Template 1: Standard Incident Postmortem

Use for most P1 and P2 incidents.

# Postmortem: [Incident Title]

**Date of incident**: YYYY-MM-DD  
**Date of postmortem**: YYYY-MM-DD  
**Severity**: P[1/2/3]  
**Duration**: [X hours Y minutes]  
**Author**: [Name]  
**Reviewers**: [Names]  
**Status**: Draft | In Review | Final

---

## Summary

[2-4 sentence summary: what failed, who was affected, and how it was resolved.
Write this for a non-technical stakeholder. Avoid jargon.]

Example: "On April 16, our checkout API experienced intermittent failures
for 28 minutes, affecting approximately 15% of users attempting to complete
purchases. The issue was caused by a connection pool misconfiguration
introduced in a deployment at 14:10 UTC. The service was restored via rollback
at 14:38 UTC."

---

## Impact

| Metric | Value |
|---|---|
| Start time | YYYY-MM-DD HH:MM UTC |
| End time | YYYY-MM-DD HH:MM UTC |
| Duration | X hours Y minutes |
| Users affected | N (X% of total) |
| Revenue impact | $X (estimated) |
| SLA impact | X minutes toward monthly budget |
| Components affected | [list] |
| Regions affected | [list] |

---

## Timeline

[All times UTC. Include: when issue started, when monitoring detected it, 
when on-call was paged, key investigation events, mitigation steps, resolution]

| Time | Event | Actor |
|---|---|---|
| HH:MM | [Event] | [Person/System] |

---

## Root Cause

[1-3 paragraphs explaining WHY this happened. Avoid stopping at the proximate
cause — dig to the systemic cause. Use "5 Whys" approach if helpful.]

**Proximate cause**: [The immediate technical cause]

**Contributing factors**:
- [Factor 1: what made this possible/likely]
- [Factor 2]
- [Factor 3]

---

## What Went Well

[Genuinely positive aspects of the response. These reinforce good practices.]

- Monitoring detected the issue within [X] seconds of onset
- On-call response was fast — acknowledged in [X] minutes
- Rollback procedure was well-documented and executed quickly
- Communication updates were timely and clear

---

## What Could Be Improved

[Specific, factual observations — not blame. Focus on systems and processes.]

- Detection took [X] minutes longer than ideal because [reason]
- The runbook for this alert was outdated — missing the current rollback procedure
- Communication to affected enterprise customers was delayed [X] minutes

---

## Action Items

[Every action item must have: owner, due date, and be specific enough to verify completion]

| Action Item | Owner | Due Date | Priority |
|---|---|---|---|
| Add integration test for connection pool exhaustion | @backend-team | 2025-05-07 | P1 |
| Update runbook for checkout-api with current rollback steps | @jane | 2025-04-30 | P2 |
| Tighten checkout-api alert threshold so minor degradation triggers a page; the current threshold misses it | @carol | 2025-05-14 | P2 |
| Add connection pool metrics to checkout-api dashboard | @bob | 2025-05-07 | P3 |

---

## Lessons Learned

[2-5 sentences on the key insights from this incident that should change how
the team works going forward.]

---

## Appendix (optional)

- Graphs showing impact period
- Relevant log excerpts  
- Links to related incidents
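
Outside the template itself, the Duration and SLA rows of the Impact table can be derived from three timestamps rather than estimated by hand. The following is a minimal sketch; the function name, the timestamps, and the 99.9% monthly target are all assumptions chosen for illustration.

```python
from datetime import datetime, timezone


def impact_metrics(started, detected, resolved, monthly_budget_minutes):
    """Return duration, detection gap, and the share of the monthly budget used."""
    duration_min = (resolved - started).total_seconds() / 60
    detection_gap_min = (detected - started).total_seconds() / 60
    return {
        "duration_minutes": duration_min,
        "detection_gap_minutes": detection_gap_min,
        "budget_consumed": duration_min / monthly_budget_minutes,
    }


# Assumed target: 99.9% availability over a 30-day month, about 43.2 minutes of budget.
budget = 30 * 24 * 60 * (1 - 0.999)

# Hypothetical timestamps, all in UTC as the Timeline section requires.
metrics = impact_metrics(
    started=datetime(2025, 1, 1, 9, 0, tzinfo=timezone.utc),
    detected=datetime(2025, 1, 1, 9, 4, tzinfo=timezone.utc),
    resolved=datetime(2025, 1, 1, 9, 30, tzinfo=timezone.utc),
    monthly_budget_minutes=budget,
)
print(metrics)
```

Computing these values from the same timestamps that appear in the Timeline keeps the Impact table and the Timeline from drifting apart.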

Template 2: Security Incident Postmortem

Security incidents require additional sections for regulatory compliance:

# Security Incident Postmortem: [Title]

**Date of incident**: YYYY-MM-DD  
**Incident classification**: [Data Breach | Unauthorized Access | DDoS | Malware | Other]  
**Severity**: P[1/2]  
**Status**: [Contained | Remediated | Under Investigation]  
**Author**: [Name]  
**Reviewers**: [Security Team, Legal, Privacy Officer]  
**CONFIDENTIAL — ATTORNEY-CLIENT PRIVILEGE** (if applicable)

---

## Incident Summary

[Brief description for executive stakeholders]

## Detection

- How was the incident detected? (monitoring alert / customer report / third party)
- When was it detected vs when did it actually start?
- Detection gap: [X minutes/hours]

## Scope Assessment

### Data Potentially Affected
- Data types: [customer PII / financial data / API keys / other]
- Records potentially exposed: [count or "under investigation"]
- Time window of exposure: [start] to [end]

### Systems Affected
- [List all systems involved]

## Containment Actions

| Time | Action | Status |
|---|---|---|
| | Revoked compromised credentials | Complete |
| | Blocked attacker IPs | Complete |
| | Isolated affected systems | Complete |

## Regulatory Obligations

| Obligation | Threshold | Status |
|---|---|---|
| GDPR notification to authorities | 72 hours from awareness | [Required/Not Required/Complete] |
| GDPR notification to individuals | Without undue delay | [Required/Not Required/Pending] |
| HIPAA breach notification | Within 60 days of discovery | [Required/Not Required] |
| PCI DSS notification | Immediately | [Required/Not Required] |

## Root Cause

[How did the attacker gain access / how did the vulnerability exist?]

## Action Items

[Security-specific action items]

| Action | Owner | Due | Priority |
|---|---|---|---|
| Rotate all secrets in affected systems | @security | Immediate | P0 |
| Patch identified vulnerability | @engineering | YYYY-MM-DD | P1 |
| Implement additional logging for [vector] | @security | YYYY-MM-DD | P2 |
| Security training for affected process | @hr | YYYY-MM-DD | P2 |
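
Outside the template, the fixed windows from the Regulatory Obligations table can be turned into concrete deadlines as soon as the awareness time is known. This is a sketch only: the dictionary and function names are hypothetical, only the two obligations with a numeric window are covered, and whether an obligation applies at all is a question for legal and privacy reviewers.

```python
from datetime import datetime, timedelta, timezone

# Windows copied from the Regulatory Obligations table; applicability is a legal call.
NOTIFICATION_WINDOWS = {
    "GDPR notification to authorities": timedelta(hours=72),
    "HIPAA breach notification": timedelta(days=60),
}


def notification_deadlines(aware_at: datetime) -> dict:
    """Map each obligation to the latest time notification can be sent."""
    return {name: aware_at + window for name, window in NOTIFICATION_WINDOWS.items()}


# Hypothetical awareness time in UTC.
aware_at = datetime(2025, 4, 16, 14, 0, tzinfo=timezone.utc)
for obligation, deadline in notification_deadlines(aware_at).items():
    print(f"{obligation}: {deadline.isoformat()}")
```

Recording the awareness timestamp in UTC at the moment of detection is what makes these deadlines defensible later.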

Template 3: Lightweight P3 Incident Review

For minor incidents that don't warrant full postmortem overhead:

# Incident Quick Review: [Title]

**Date**: YYYY-MM-DD  
**Duration**: X minutes  
**Severity**: P3  
**Author**: [On-call engineer]

---

**What happened?** (2-3 sentences)

**Why did it happen?** (1-2 sentences)

**How was it resolved?** (1 sentence)

**What would have prevented it?** (1-2 sentences)

**Action items:**
- [ ] [Specific action] — Owner: @name, Due: YYYY-MM-DD

---

Making Action Items Stick

The most common postmortem failure is action items that never get completed:

## Action Item Quality Criteria

### Good action item:
"Add integration test covering connection pool limit behavior to 
checkout-api test suite"
- Owner: @bob
- Due: 2025-05-07
- Success criteria: Test exists and passes in CI

### Poor action item:
"Improve testing"
- No owner
- No due date
- Unclear what "improve" means
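
The criteria above can be checked mechanically before the meeting ends. The sketch below is one possible lint, assuming a hypothetical `ActionItem` record; the vague-verb list and the "short description" rule are crude heuristics, not a standard.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Assumed list of verbs that often signal an unverifiable item; tune for your team.
VAGUE_VERBS = {"improve", "investigate", "consider", "review", "look"}


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None
    success_criteria: Optional[str] = None


def quality_problems(item: ActionItem) -> list:
    """Return a list of reasons the item fails the quality criteria (empty if none)."""
    problems = []
    if not item.owner:
        problems.append("no owner")
    if item.due is None:
        problems.append("no due date")
    if not item.success_criteria:
        problems.append("no success criteria")
    words = item.description.split()
    # Heuristic: a very short description that opens with a vague verb cannot be verified.
    if words and len(words) < 4 and words[0].lower() in VAGUE_VERBS:
        problems.append("description too vague to verify")
    return problems


good = ActionItem(
    description="Add integration test covering connection pool limit behavior "
                "to checkout-api test suite",
    owner="@bob",
    due=date(2025, 5, 7),
    success_criteria="Test exists and passes in CI",
)
poor = ActionItem(description="Improve testing")

print(quality_problems(good))
print(quality_problems(poor))
```

A check like this will not catch every weak item, but it does stop the obvious ones from leaving the room without an owner or a date.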

### Tracking Action Items

1. Enter every action item in your project management system (Jira, Linear, etc.)
   immediately after the postmortem meeting — not "later"

2. Link the postmortem document to the action items

3. Review open postmortem action items in the weekly engineering meeting

4. At the start of the following postmortem, check what percentage of 
   previous action items were completed

5. If an action item is repeatedly not completed, it's either not a real
   priority (remove it) or has a blocker that needs to be addressed
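
Steps 4 and 5 can be supported by a small report run before each postmortem. The following sketch assumes a hypothetical `TrackedItem` record with a `times_deferred` counter that is incremented whenever a due date is pushed back; most trackers would need that counter derived from their change history.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class TrackedItem:
    title: str
    due: date
    completed: bool = False
    times_deferred: int = 0  # hypothetical counter, bumped each time the due date moves


def review_open_items(items, today, max_deferrals=2):
    """Return (completion rate of items already due, titles needing a decision)."""
    due = [i for i in items if i.due <= today]
    done = [i for i in due if i.completed]
    # None signals "nothing was due yet" rather than a misleading 0%.
    rate = len(done) / len(due) if due else None
    # Items deferred repeatedly should be either unblocked or explicitly dropped.
    stuck = [
        i.title for i in items
        if not i.completed and i.times_deferred >= max_deferrals
    ]
    return rate, stuck


items = [
    TrackedItem("Update checkout-api runbook", date(2025, 4, 30), completed=True),
    TrackedItem("Add pool metrics to dashboard", date(2025, 5, 7), times_deferred=3),
    TrackedItem("Add pool exhaustion test", date(2025, 6, 1)),
]
rate, stuck = review_open_items(items, today=date(2025, 5, 10))
print(rate, stuck)
```

Passing `today` explicitly keeps the report reproducible when it is attached to a postmortem document.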

Postmortem Meeting Format

The written postmortem is prepared before the meeting; the meeting is for discussion and alignment:

## Postmortem Meeting Agenda (60 minutes)

**Participants**: Incident responders plus 1-2 relevant engineers who were not involved in the incident

0:00 - 0:05 | Facilitator: Ground rules (blameless, looking for systems not people)
0:05 - 0:15 | Author: Walk through timeline
0:15 - 0:25 | Discussion: Root cause — are we confident in the analysis?
0:25 - 0:35 | Discussion: What went well (10 min — don't skip this)
0:35 - 0:45 | Discussion: What to improve
0:45 - 0:55 | Review and finalize action items (assign owners in meeting)
0:55 - 1:00 | Confirm publication plan and communication

## Facilitator Rules
- If discussion becomes about blame or defensiveness: redirect to "what in our
  systems or processes made this outcome possible?"
- Keep timeline discussion focused on facts, not interpretation
- Ensure action items get specific owners before meeting ends
- Confirm publication plan — postmortems should be shared broadly

Postmortem Metrics

Track the health of your postmortem process:

from datetime import datetime, timedelta, timezone


def calculate_postmortem_health_metrics(postmortems, action_items, days=90):
    """
    Measure how well your postmortem process is working.

    Assumes incident_date, published_at, and created_at are timezone-aware
    UTC datetimes, and that due_date is a date.
    """
    # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC timestamp
    now = datetime.now(timezone.utc)
    window_start = now - timedelta(days=days)
    recent_incidents = [p for p in postmortems if p.incident_date >= window_start]
    recent_action_items = [a for a in action_items if a.created_at >= window_start]

    # Postmortem completion rate: share of recent incidents with a published postmortem
    incidents_with_postmortem = [p for p in recent_incidents if p.published]
    completion_rate = len(incidents_with_postmortem) / len(recent_incidents) if recent_incidents else 0

    # Time to publish, in whole days from incident to publication
    days_to_publish = [
        (p.published_at - p.incident_date).days
        for p in incidents_with_postmortem
        if p.published_at
    ]
    avg_days_to_publish = sum(days_to_publish) / len(days_to_publish) if days_to_publish else None

    # Action item completion rate, counting only items whose due date has passed
    today = now.date()
    due_items = [a for a in recent_action_items if a.due_date <= today]
    completed_items = [a for a in due_items if a.completed]
    action_item_completion = len(completed_items) / len(due_items) if due_items else 0

    return {
        "postmortem_completion_rate": f"{completion_rate:.0%}",
        # Compare against None so a same-day average of 0.0 is not reported as missing
        "avg_days_to_publish": round(avg_days_to_publish, 1) if avg_days_to_publish is not None else None,
        "action_item_completion_rate": f"{action_item_completion:.0%}",
        "total_incidents": len(recent_incidents),
        "total_action_items": len(recent_action_items),
        "overdue_action_items": len([a for a in recent_action_items if a.is_overdue()]),
    }

Conclusion

Good postmortem templates create the conditions for learning, so that nobody has to work out how to write a postmortem while under pressure. The templates here are starting points: adapt them to your team's size, culture, and incident types. The most important element isn't the format but the follow-through on action items. Postmortems that result in completed action items are valuable; postmortems that result in neglected TODO lists are theater.

AzMonitor's alert history and uptime data contribute concrete, timestamped evidence to incident timelines, making the factual foundation of postmortems more accurate and reducing the reliance on fallible memory.

AzMonitor Team
The AzMonitor team writes guides based on experience monitoring millions of endpoints daily across 10,000+ customer environments. Our expertise covers uptime monitoring, SRE practices, and reliability engineering.