
War Room Setup: Running Effective Major Incident Response

Learn how to set up and run an effective incident war room — virtual or physical — with clear roles, communication protocols, and decision-making frameworks.

AzMonitor Team · April 9, 2025 · Updated January 20, 2026
Tags: war room, incident response, major incident, incident command

A war room — whether a physical conference room or a dedicated virtual channel — is where a major incident gets resolved. Without structure, incidents devolve into chaos: multiple people making conflicting changes, unclear ownership of tasks, and communication that consumes more energy than the actual fix. With proper war room structure, the same people resolve the same incident faster, with less confusion and better documentation.

When to Declare a War Room Incident

Not every alert warrants a war room. The trigger should be based on impact severity:

## War Room Trigger Criteria

Declare a war room (P1/P2 incident) when:
- [ ] More than 10% of users affected by outage or severe degradation
- [ ] A critical revenue path is non-functional (checkout, login, payments)
- [ ] A data integrity issue is detected or suspected
- [ ] SLA breach is imminent (within 30 minutes of breach threshold)
- [ ] A security incident is in progress
- [ ] External partners/dependencies report failures affecting your service
- [ ] Multiple systems failing simultaneously (possible cascade)

Do NOT declare a war room for:
- Single user reports of issues (route to support)
- Non-critical feature degradation with workarounds
- Issues affecting only internal tools
- Scheduled maintenance events
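
The declaration criteria above can be encoded as a simple check so that alert routing applies them the same way every time. The following is a minimal sketch under stated assumptions: `IncidentSignal` and `war_room_triggers` are hypothetical names, the fields would need to be populated from your own monitoring data, and the "do not declare" exclusions are deliberately left to human judgment.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IncidentSignal:
    """Hypothetical snapshot of what is known when an alert fires."""
    pct_users_affected: float = 0.0            # 0-100
    critical_path_down: bool = False           # checkout, login, payments
    data_integrity_suspected: bool = False
    minutes_to_sla_breach: Optional[float] = None  # None = no breach projected
    security_incident: bool = False
    partner_failures_reported: bool = False
    failing_systems: int = 0


def war_room_triggers(s: IncidentSignal) -> List[str]:
    """Return every checklist criterion that the signal satisfies."""
    reasons = []
    if s.pct_users_affected > 10:
        reasons.append("more than 10% of users affected")
    if s.critical_path_down:
        reasons.append("critical revenue path non-functional")
    if s.data_integrity_suspected:
        reasons.append("data integrity issue detected or suspected")
    if s.minutes_to_sla_breach is not None and s.minutes_to_sla_breach <= 30:
        reasons.append("SLA breach within 30 minutes")
    if s.security_incident:
        reasons.append("security incident in progress")
    if s.partner_failures_reported:
        reasons.append("external dependency failures affecting the service")
    if s.failing_systems > 1:
        reasons.append("multiple systems failing simultaneously")
    return reasons
```

Any non-empty result would justify declaring a war room. Returning the matched reasons rather than a bare boolean gives the IC something concrete to quote in the initial notification.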

War Room Roles

The most important decision in war room setup is assigning clear roles. Ambiguous ownership is one of the most common causes of slow incident resolution.

Incident Commander (IC)

The IC owns the incident resolution process — not the technical fix. Their job is coordination, not coding.

## Incident Commander Responsibilities
- Declare the incident and assign initial roles
- Maintain situational awareness across all workstreams
- Make decisions when technical team members disagree
- Authorize significant actions (rollbacks, failovers, customer notifications)
- Set and enforce communication cadence
- Declare incident resolution when criteria are met

## IC Does NOT:
- Make technical changes to production systems
- Investigate root causes
- Write status page updates (delegates to Communications Lead)
- Deep-dive into logs (delegates to Technical Lead)

## IC Scripts:
"This is a P1 incident. [Name] is Technical Lead, [Name] is Communications Lead.
We're investigating [symptom]. Status update in 15 minutes."

Technical Lead

The engineer leading technical investigation and remediation.

## Technical Lead Responsibilities
- Lead root cause investigation
- Coordinate technical responders (assign specific investigation tasks)
- Propose and implement remediation steps
- Report findings and status to IC every 15 minutes
- Document technical timeline in real-time

## Technical Lead Scripts:
"We have three hypotheses: [1], [2], [3]. 
[Name] investigate database metrics, [Name] check deployment history, 
[Name] review error logs. Report back in 10 minutes."

Communications Lead (Comms)

Owns all external and stakeholder communication.

## Communications Lead Responsibilities
- Write and publish status page updates
- Send customer-facing emails if warranted
- Update internal stakeholders (customer success, sales, leadership)
- Handle support team communications
- Draft post-incident customer notification

## Comms Lead Does NOT:
- Speculate about root cause in public communications
- Commit to resolution timelines without IC approval
- Communicate directly with press or investors (escalate to leadership)

Scribe

Documents everything in real-time — the incident timeline that becomes the foundation of the postmortem.

## Scribe Responsibilities
- Record timestamped log of all actions taken and observations made
- Track open questions and hypotheses
- Document all customer-facing communications sent
- Record who made what decision and when
- Keep incident timeline document current

## Scribe Template (timestamped log):
14:32 - IC: War room declared. Team assembled.
14:33 - TL: Investigating reports of 502 errors on checkout API
14:35 - DB metrics show no anomalies
14:38 - Found spike in error rate starting 14:22 in app logs
14:40 - TL: Hypothesis — deployment at 14:20 introduced issue
14:42 - IC approved rollback of v2.4.1 → v2.4.0
14:45 - Rollback complete, monitoring for recovery
14:52 - Error rate returning to baseline
14:58 - Checkout API confirmed healthy. Incident resolved.
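
The timeline entries above follow a consistent `HH:MM - ROLE: message` shape. As an illustration only, a Scribe working from a script or chat bot could keep that shape uniform with a small helper like the one below. `TimelineLog` is a hypothetical name, not part of any tool mentioned in this guide.

```python
from datetime import datetime
from typing import List, Optional


class TimelineLog:
    """Collects timestamped incident entries in the order they are recorded."""

    def __init__(self) -> None:
        self.entries: List[str] = []

    def record(self, message: str, role: Optional[str] = None,
               at: Optional[datetime] = None) -> str:
        """Append one entry; `at` defaults to the current local time."""
        timestamp = (at or datetime.now()).strftime("%H:%M")
        prefix = f"{role}: " if role else ""
        entry = f"{timestamp} - {prefix}{message}"
        self.entries.append(entry)
        return entry
```

Calling `log.record("War room declared. Team assembled.", role="IC")` produces an entry in the same format as the first line of the template. Passing `at` explicitly lets the Scribe backfill an observation that was made a few minutes earlier.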

Virtual War Room Setup

Many teams now run their war rooms virtually, and Slack is a common platform for the incident channel.

Slack War Room Channel Structure

#incident-active       — Current active incident discussion
#incident-p1-[date]    — Dedicated channel per P1 incident
#incident-status       — External updates for broader team
#incident-history      — Archive of resolved incidents

Creating the incident channel:

# Slack API to create incident channel programmatically
# (via Slack bot or PagerDuty integration)

# Channel naming convention: incident-YYYYMMDD-HH-short-description
# Example: incident-20250615-14-checkout-api-errors
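
The naming convention above is easy to generate programmatically. The sketch below builds only the channel name and makes no network call. `incident_channel_name` is a hypothetical helper, and the character rules reflect Slack's channel naming constraints as generally documented (lowercase letters, digits, hyphens and underscores, up to 80 characters).

```python
import re
from datetime import datetime

MAX_CHANNEL_NAME_LEN = 80  # Slack's limit for channel names


def incident_channel_name(declared_at: datetime, description: str) -> str:
    """Build a name like incident-20250615-14-checkout-api-errors."""
    # Collapse anything that is not a lowercase letter or digit into a hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    name = f"incident-{declared_at:%Y%m%d-%H}-{slug}"
    # Truncate to the limit and avoid ending on a dangling hyphen.
    return name[:MAX_CHANNEL_NAME_LEN].rstrip("-")
```

The resulting string could then be passed as the `name` argument to Slack's `conversations.create` Web API method from a bot, or handed to whatever incident tooling creates the channel for you.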

Zoom/Meet War Room Protocol

## Virtual War Room Video Protocol

### Initial Setup (first 5 minutes)
- Dedicated meeting link in Slack channel topic
- All responders join video, unmute briefly to confirm presence
- IC introduces themselves and summarizes the situation

### During Incident
- Technical team stays muted unless speaking (large calls = noise)
- IC facilitates discussion, not open-mic chaos
- Use hand-raise or "I have an update" in chat to signal
- Screen sharing: only the person presenting the relevant dashboard or logs shares their screen

### Status Check Cadence
IC calls status checks every 15 minutes:
- "Technical Lead, what's your current theory?"
- "Are we still aligned on [current mitigation approach]?"
- "Any blockers that I need to unblock?"

### Decision Points
When proposing significant actions:
IC: "We're proposing to roll back the database migration.
     Technical Lead, what's your confidence level?"
TL: "High confidence — the migration is the most likely cause."
IC: "Approved. [Name], proceed with rollback. Scribe, note the time."

Decision Making Under Pressure

War rooms fail when decisions are delayed. Establish clear decision authorities:

| Decision Type | Who Decides | Approval Needed |
|---|---|---|
| Code rollback | Technical Lead | IC awareness |
| Database schema rollback | Technical Lead + IC | IC approval |
| Full service failover | IC | IC + Engineering VP |
| Customer notification | Comms Lead | IC approval |
| Escalate to vendor | IC | IC decision |
| Engage customer (enterprise) | Customer Success | IC awareness |
| Press/investor communication | Leadership | Leadership only |
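
If your team uses an incident bot, the decision authorities above can live in a lookup table so nobody has to find the runbook mid-incident. This is one possible sketch: `Authority`, `DECISION_AUTHORITY` and `lookup` are hypothetical names, and the entries are copied from the table above.

```python
from typing import Dict, NamedTuple, Optional


class Authority(NamedTuple):
    decides: str
    approval: str


DECISION_AUTHORITY: Dict[str, Authority] = {
    "code rollback": Authority("Technical Lead", "IC awareness"),
    "database schema rollback": Authority("Technical Lead + IC", "IC approval"),
    "full service failover": Authority("IC", "IC + Engineering VP"),
    "customer notification": Authority("Comms Lead", "IC approval"),
    "escalate to vendor": Authority("IC", "IC decision"),
    "engage customer (enterprise)": Authority("Customer Success", "IC awareness"),
    "press/investor communication": Authority("Leadership", "Leadership only"),
}


def lookup(decision_type: str) -> Optional[Authority]:
    """Return who decides and what approval is needed, or None if unlisted."""
    return DECISION_AUTHORITY.get(decision_type.strip().lower())
```

A `None` result means the decision type is not covered by the table, in which case the question goes to the IC rather than being guessed at.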

The Reversibility Framework

When uncertain whether to act:

## Reversibility Decision Framework

Ask: Can we reverse this action if it doesn't work?

Highly reversible (act quickly):
- Code rollback to previous version
- Feature flag disable
- Traffic routing change
- Cache flush
- Connection pool restart

Partially reversible (consider carefully):
- Configuration changes
- Database index creation/removal
- SSL certificate changes

Irreversible (require IC approval + careful consideration):
- Database migrations/rollbacks
- Data deletion or modification
- Third-party API configuration changes
- Customer-visible setting changes

Rule: For reversible actions with reasonable confidence, act fast.
For irreversible actions, verify your hypothesis more carefully first.
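
The framework above amounts to a classification plus one rule. A minimal sketch of it follows, with hypothetical names throughout. Treating an unlisted action as irreversible is an assumption of this sketch, chosen because it is the most cautious default.

```python
from enum import Enum


class Reversibility(Enum):
    HIGH = "act quickly"
    PARTIAL = "consider carefully"
    NONE = "require IC approval and verify the hypothesis first"


# Classification copied from the framework above.
ACTION_REVERSIBILITY = {
    "code rollback": Reversibility.HIGH,
    "feature flag disable": Reversibility.HIGH,
    "traffic routing change": Reversibility.HIGH,
    "cache flush": Reversibility.HIGH,
    "connection pool restart": Reversibility.HIGH,
    "configuration change": Reversibility.PARTIAL,
    "database index change": Reversibility.PARTIAL,
    "ssl certificate change": Reversibility.PARTIAL,
    "database migration": Reversibility.NONE,
    "data deletion or modification": Reversibility.NONE,
    "third-party api configuration change": Reversibility.NONE,
    "customer-visible setting change": Reversibility.NONE,
}


def classify(action: str) -> Reversibility:
    """Look up an action; unknown actions are treated as irreversible (assumption)."""
    return ACTION_REVERSIBILITY.get(action.strip().lower(), Reversibility.NONE)


def needs_ic_approval(action: str) -> bool:
    """Only irreversible actions are gated on explicit IC approval here."""
    return classify(action) is Reversibility.NONE
```

For example, `classify("Cache flush").value` gives the guidance string "act quickly", while `needs_ic_approval("database migration")` is true.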

Status Update Templates

Consistent update cadence reduces stakeholder anxiety and support ticket volume:

## Initial Incident Notification (post to #incident-status within 10 min)
---
INCIDENT DECLARED — [timestamp]
Status: Investigating
Impact: [What is affected, how many users estimated]
Symptoms: [What customers are experiencing]
Team: [IC name] leading response
Next update: [timestamp + 15 minutes]
Status page: [link]
---

## Progress Update Template (every 15-30 min)
---
INCIDENT UPDATE — [timestamp]
Status: [Investigating | Identified | Mitigating | Resolved]
Progress: [1-2 sentences on what was found/done]
Impact: [Updated impact assessment if changed]
Next update: [timestamp]
---

## Resolution Notification
---
INCIDENT RESOLVED — [timestamp]
Duration: [X hours Y minutes]
Impact: [Final scope — N users, N% of traffic]
Resolution: [1-2 sentences what was done]
Postmortem: Will be published within 5 business days at [link]
---
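
Filling in these templates by hand is error-prone under pressure, especially the "next update" time. As a sketch, the initial notification could be rendered by a small function like the one below. The function name, parameters and timestamp format are assumptions made for illustration, and the field order follows the initial notification template above.

```python
from datetime import datetime, timedelta


def initial_notification(declared_at: datetime, impact: str, symptoms: str,
                         ic_name: str, status_page_url: str,
                         update_interval_min: int = 15) -> str:
    """Render the initial notification with the next-update time computed."""
    fmt = "%Y-%m-%d %H:%M"
    next_update = declared_at + timedelta(minutes=update_interval_min)
    lines = [
        # \u2014 is the dash used in the template's header line.
        f"INCIDENT DECLARED \u2014 {declared_at.strftime(fmt)}",
        "Status: Investigating",
        f"Impact: {impact}",
        f"Symptoms: {symptoms}",
        f"Team: {ic_name} leading response",
        f"Next update: {next_update.strftime(fmt)}",
        f"Status page: {status_page_url}",
    ]
    return "\n".join(lines)
```

Computing the next update from the declaration time keeps the promised cadence honest: an incident declared at 14:32 commits the team to an update at 14:47.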

War Room Anti-Patterns

| Anti-Pattern | Symptom | Fix |
|---|---|---|
| Too many cooks | Multiple people making uncoordinated changes | IC enforces single Technical Lead for prod changes |
| Analysis paralysis | 45 minutes discussing hypotheses, no action | IC: "We have enough to try X. Let's try it." |
| Premature declaration | War room called for issue already resolving | Confirm issue is ongoing before escalating |
| No documentation | Post-incident, no one agrees on what happened | Enforce Scribe role from minute one |
| Scope creep | Team starts fixing adjacent issues | IC: "We'll file tickets. Focus on restoration." |
| Silent room | Responders aren't sharing findings | IC asks for specific updates on a schedule |
| No resolution criteria | "It's probably fine now?" | Define resolution criteria at incident declaration |

Post War Room: Handoff and Follow-Through

When the incident resolves:

## Resolution Checklist

### Immediate (within 1 hour of resolution)
- [ ] IC declares incident resolved with timestamp
- [ ] Comms Lead posts resolution to all channels and status page
- [ ] Scribe shares incident timeline with team
- [ ] Any temporary changes documented (to be made permanent or reverted)
- [ ] On-call engineer stays available for 2 hours post-resolution

### Within 24 hours
- [ ] All responders write their section of the incident timeline
- [ ] Incident tagged in PagerDuty/OpsGenie with correct severity and duration
- [ ] Known contributing factors documented even if not fully investigated
- [ ] Postmortem meeting scheduled within 5 business days

### Within 5 business days
- [ ] Blameless postmortem conducted
- [ ] Root cause analysis completed
- [ ] Action items created with owners and due dates
- [ ] Action items reviewed in next engineering planning session
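
The checklist deadlines can be computed from the resolution timestamp so they land in calendars rather than in memory. The sketch below is illustrative: the function and key names are hypothetical, and "business days" here means Monday to Friday with no holiday calendar, which is a simplifying assumption.

```python
from datetime import date, datetime, timedelta
from typing import Dict, Union


def add_business_days(start: date, days: int) -> date:
    """Advance `days` weekdays past `start`, skipping Saturdays and Sundays."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday-Friday
            remaining -= 1
    return current


def follow_up_deadlines(resolved_at: datetime) -> Dict[str, Union[datetime, date]]:
    """Map each checklist window to a concrete due time."""
    return {
        "immediate_tasks_due": resolved_at + timedelta(hours=1),
        "on_call_available_until": resolved_at + timedelta(hours=2),
        "timeline_sections_due": resolved_at + timedelta(hours=24),
        "postmortem_due": add_business_days(resolved_at.date(), 5),
    }
```

For an incident resolved on a Friday, the postmortem deadline falls on the following Friday, since the weekend does not count toward the five business days.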

Conclusion

A well-run war room is a practiced discipline — teams that run mock incidents and regularly review their processes respond faster and less chaotically when real incidents happen. The structure of roles, communication cadence, and decision authority isn't bureaucracy; it's the scaffolding that lets technical people do their best work under pressure. AzMonitor's monitoring provides the detection layer that triggers the war room: clear alerting, detailed response history, and uptime data that helps responders quickly assess the scope and timeline of an incident before the war room even begins.

AzMonitor Team
The AzMonitor team writes guides based on experience monitoring millions of endpoints daily across 10,000+ customer environments. Our expertise covers uptime monitoring, SRE practices, and reliability engineering.