How you communicate during an incident often matters more than how quickly you fix it. A customer who gets clear, honest, timely updates might be frustrated but understanding. A customer who discovers the outage themselves and hears nothing for two hours is likely done with your product. Incident communication is a skill — and like monitoring itself, it needs to be planned in advance rather than improvised under pressure.

The Communication Problem During Incidents

During an outage, your team is simultaneously trying to:

Diagnose and fix the problem
Coordinate internally
Communicate with customers
Update executives
Manage a status page

Without clear ownership and templates, communication gets dropped. Engineers focus on the technical problem and nobody updates customers for 90 minutes. Or well-meaning but untrained people make promises ("we'll be back up in 10 minutes") that engineering can't keep.

The solution is a communication plan that's defined before incidents happen, with clear ownership and pre-written templates.

Communication Channels and Audiences

| Audience | Channel | What They Need | Frequency |
|---|---|---|---|
| Customers | Status page, email | Impact, status, resolution time | Every 30 min or on change |
| Enterprise customers | Dedicated Slack / email | Detailed impact, workarounds | Every 20 min |
| Internal teams | Slack #incidents | Technical details, who's doing what | Continuous |
| Executives | Slack DM or email | Business impact summary | Every 30-60 min for P1 |
| Support team | Slack #support-escalation | What to tell customers, workarounds | At start and on major changes |
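One way to make this table actionable is to keep it as data that tooling can check against. The sketch below is only an illustration: the `AudiencePlan` class, the `COMMS_PLAN` list, and the `overdue_audiences` helper are hypothetical names, and the executive interval uses the upper bound of the 30-60 minute range as an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class AudiencePlan:
    audience: str
    channel: str
    # Maximum minutes between updates; None means no fixed interval
    # (continuous or event-driven audiences).
    max_interval_min: Optional[int]


# Values copied from the table above. For executives the table gives a
# range (30-60 min); the upper bound is used here as an assumption.
COMMS_PLAN = [
    AudiencePlan("Customers", "Status page, email", 30),
    AudiencePlan("Enterprise customers", "Dedicated Slack / email", 20),
    AudiencePlan("Internal teams", "Slack #incidents", None),
    AudiencePlan("Executives", "Slack DM or email", 60),
    AudiencePlan("Support team", "Slack #support-escalation", None),
]


def overdue_audiences(minutes_since_update: Dict[str, int]) -> List[str]:
    """Return audiences whose last update is older than their interval."""
    overdue = []
    for plan in COMMS_PLAN:
        if plan.max_interval_min is None:
            continue  # no fixed cadence to enforce
        elapsed = minutes_since_update.get(plan.audience, 0)
        if elapsed > plan.max_interval_min:
            overdue.append(plan.audience)
    return overdue
```

A comms lead's tooling could call `overdue_audiences` with the minutes elapsed since each audience last heard from you and surface whichever names come back.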

The Incident Communication Owner

Assign a dedicated communication role during incidents — separate from the engineer fixing the problem. The incident commander (or comms lead) owns:

Status page updates
Customer-facing communications
Executive updates
Support team briefings
Tracking the timeline for postmortem

The technical lead focuses entirely on resolution. The comms lead focuses entirely on keeping everyone informed. These roles should not be the same person during major incidents.

Status Page Updates

Status pages are your primary customer communication channel. Follow these rules:

Update within 5 minutes of declaring an incident. Even if you know nothing yet, a status update saying "We are investigating reports of issues with [service]" is better than silence.

Update every 30 minutes at minimum. Even if nothing has changed, post an update: "Our team continues to work on this issue. We have not yet identified the root cause. Next update in 30 minutes."

Be specific about impact. "Some users may experience issues" is useless. "Users are unable to complete purchases. Authentication appears to be functioning normally" tells customers what's affected.

Don't promise recovery times you can't deliver. "We expect resolution within 30 minutes" that takes 3 hours is worse than "no ETA yet."
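The two timing rules above (first update within 5 minutes, then at least every 30 minutes) are simple enough to check mechanically. This is a minimal sketch under those two numbers only; the function names are illustrative and not part of any status page product.

```python
from datetime import datetime, timedelta
from typing import Optional

FIRST_UPDATE_DEADLINE = timedelta(minutes=5)   # after the incident is declared
UPDATE_INTERVAL = timedelta(minutes=30)        # between subsequent updates


def next_update_due(declared_at: datetime,
                    last_update_at: Optional[datetime]) -> datetime:
    """Return the latest time by which the next status update should be posted."""
    if last_update_at is None:
        # Nothing posted yet: the 5-minute rule applies.
        return declared_at + FIRST_UPDATE_DEADLINE
    return last_update_at + UPDATE_INTERVAL


def is_update_overdue(now: datetime,
                      declared_at: datetime,
                      last_update_at: Optional[datetime] = None) -> bool:
    """True if the status page has gone silent for longer than the rules allow."""
    return now > next_update_due(declared_at, last_update_at)
```

All three timestamps should use the same timezone convention (for example, all UTC) so the comparison is meaningful.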

## Status Page Update Templates

### Initial Update (within 5 minutes)
**Investigating - [Affected Service] Issues**
We are investigating reports of issues affecting [service/feature].
Users may experience [specific symptom].
Our team is actively working to identify and resolve the issue.
We will provide updates every 30 minutes.
Posted: 14:23 UTC

---

### Progress Update (during incident)
**Update - Identified Root Cause**
We have identified the cause of the current [service] issues:
[Brief, non-technical description of the cause]

Our team is actively working to implement a fix. 
We expect to have more information within [30 minutes / 1 hour].
Current impact: [Updated impact description]
Posted: 15:00 UTC

---

### Resolution Update
**Resolved - [Affected Service] Issues**
The issue affecting [service] has been resolved.

Summary: [Brief description of what happened]

All systems are now operating normally.
We will be publishing a detailed postmortem within 5 business days.

Duration: [Start time] to [End time] ([X] minutes)
Posted: 15:47 UTC
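Templates like these can be stored as format strings so that nobody retypes them under pressure. The sketch below renders the initial update; `INITIAL_UPDATE_TEMPLATE` and `render_initial_update` are hypothetical names, and the placeholder fields are one possible mapping of the bracketed slots above.

```python
from datetime import datetime, timezone

# The initial-update template from above, with bracketed slots turned
# into named format fields (an assumed mapping, adjust to your wording).
INITIAL_UPDATE_TEMPLATE = (
    "Investigating - {service} Issues\n"
    "We are investigating reports of issues affecting {service}.\n"
    "Users may experience {symptom}.\n"
    "Our team is actively working to identify and resolve the issue.\n"
    "We will provide updates every 30 minutes.\n"
    "Posted: {posted}"
)


def render_initial_update(service: str, symptom: str, posted_at: datetime) -> str:
    """Fill in the template. posted_at must be timezone-aware."""
    if posted_at.tzinfo is None:
        raise ValueError("posted_at must be timezone-aware")
    # Always display the timestamp in UTC, as the templates do.
    posted = posted_at.astimezone(timezone.utc).strftime("%H:%M UTC")
    return INITIAL_UPDATE_TEMPLATE.format(
        service=service, symptom=symptom, posted=posted
    )
```

Rejecting naive datetimes is a deliberate choice here: a timestamp silently interpreted in the server's local timezone is an easy way to publish the wrong time on a public page.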

Customer-Facing Communication

For direct customer communication, language matters. Test your drafts against these criteria:

Does it acknowledge the user's experience? Start with empathy, not excuses.

Does it explain impact clearly? What specifically can't they do right now?

Does it set expectations? What are you doing? When will you know more?

Does it provide workarounds? Can they use an alternative path while you fix things?

## Email Template for Significant Outage

Subject: [Service] is currently experiencing issues - [Date]

We're writing to let you know that [Service] is currently experiencing 
an outage that is preventing [specific functionality].

**What's affected:**
[Specific features or services unavailable]

**What's working:**
[Features unaffected - help users understand what they CAN do]

**What we're doing:**
Our engineering team identified this issue at [time] and is actively 
working to restore service. 

**Workaround (if available):**
While we work to resolve this, you can [workaround description].

**Next update:**
We will email you again by [specific time] with an update, or 
sooner if the situation resolves.

We apologize for the disruption and appreciate your patience.

[Company] Status Page: [URL]

— The [Company] Team

Internal Incident Communication

The Slack incident channel is your war room. Keep it functional:

Pin the incident commander's name at the top so everyone knows who to address.

Use structured updates, not stream of consciousness:

[STATUS UPDATE 14:47]
Owner: @sarah-devops
Status: INVESTIGATING
Current theory: Database connection pool exhaustion
Actions taken: Scaled up payment-service replicas (x3 → x6) - no improvement
Next step: Investigating DB connection settings
ETA for next update: 15:00

Separate debugging chatter from updates. Use threads for technical back-and-forth; keep the main channel for status updates.

Establish a clear "resolved" message format:

[RESOLVED 15:47 UTC]
Root cause: Deployment at 14:15 included incorrect connection pool max setting
Fix applied: Rolled back deployment + corrected config
Downtime: 14:22 - 15:47 (85 minutes)
Users affected: ~30% of payment transactions
Postmortem scheduled: Monday 10:00 UTC, #postmortem-meetings
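The downtime figure in a resolved message is easy to get wrong by hand, so it is worth computing. The following sketch reproduces the arithmetic in the example above (14:22 to 15:47); `downtime_minutes` and `format_resolved` are illustrative helpers, not part of any Slack API.

```python
from datetime import datetime


def downtime_minutes(start: datetime, end: datetime) -> int:
    """Whole minutes between the start and end of an outage."""
    if end < start:
        raise ValueError("end must not be earlier than start")
    return int((end - start).total_seconds() // 60)


def format_resolved(start: datetime, end: datetime,
                    root_cause: str, fix: str) -> str:
    """Build the core lines of the resolved message (times assumed to be UTC)."""
    minutes = downtime_minutes(start, end)
    return "\n".join([
        f"[RESOLVED {end:%H:%M} UTC]",
        f"Root cause: {root_cause}",
        f"Fix applied: {fix}",
        f"Downtime: {start:%H:%M} - {end:%H:%M} ({minutes} minutes)",
    ])
```

With a start of 14:22 and an end of 15:47 on the same day, `downtime_minutes` returns 85, matching the example message.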

Executive Communication

Executives need business context, not technical detail:

## Executive Briefing Template

Time: [14:45 UTC]

**What's happening:**
Our payment service is currently unavailable. Users cannot complete purchases.

**Business impact:**
Approximately [X] users are affected. Based on normal transaction volume, 
this represents roughly $[X] in delayed transactions per hour.

**Current status:**
Engineering identified the root cause 15 minutes ago and is implementing 
a fix. We expect service restoration within [30 minutes].

**Customer communication:**
Our status page has been updated. Customer success has been briefed.
We will proactively notify enterprise accounts.

**Next update:**
We will brief you again at 15:15 UTC or immediately if the situation changes.

Keep each section to one or two sentences. If executives have questions, they'll ask. Your job is to give them enough to respond to stakeholder inquiries, not a technical deep-dive.
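The "delayed transactions per hour" figure in the briefing is a rough estimate: normal hourly volume times average transaction value times the share of traffic affected. A minimal sketch of that arithmetic follows; the function names and the input numbers in the usage note are made up for illustration.

```python
def delayed_revenue_per_hour(normal_tx_per_hour: float,
                             avg_tx_value: float,
                             affected_fraction: float) -> float:
    """Estimate the value of transactions delayed per hour of outage."""
    if not 0.0 <= affected_fraction <= 1.0:
        raise ValueError("affected_fraction must be between 0 and 1")
    return normal_tx_per_hour * avg_tx_value * affected_fraction


def impact_line(normal_tx_per_hour: float,
                avg_tx_value: float,
                affected_fraction: float) -> str:
    """Format the estimate as a sentence fragment for the briefing."""
    value = delayed_revenue_per_hour(
        normal_tx_per_hour, avg_tx_value, affected_fraction
    )
    # Round to whole dollars: the estimate is not precise enough for cents.
    return f"roughly ${value:,.0f} in delayed transactions per hour"
```

For example, with hypothetical inputs of 1,200 transactions per hour, a $40 average value, and 30% of transactions affected, `impact_line(1200, 40.0, 0.3)` yields "roughly $14,400 in delayed transactions per hour".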

Managing Communication During Extended Outages

For incidents lasting more than 2 hours, adjust the update cadence as the outage goes on:

Hour 1: Every 30 minutes
Hour 2: Every 30 minutes
Hours 3-4: Every 45 minutes (with explanation for customers: "We're actively working...")
Hour 5 onward: Every 60 minutes + personal calls to top enterprise customers

For very long outages, consider a customer call with major enterprise accounts. Proactive direct outreach — especially before they call you — significantly reduces anger and churn risk.
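The cadence schedule above can be expressed as a small lookup so reminders adjust automatically as an incident drags on. This sketch assumes the boundaries fall at exactly 2 and 4 hours of elapsed time, which is one reading of the schedule; both function names are hypothetical.

```python
def update_interval_minutes(elapsed_minutes: int) -> int:
    """Minutes between customer updates, given time since the incident began."""
    if elapsed_minutes < 0:
        raise ValueError("elapsed_minutes must be non-negative")
    if elapsed_minutes < 120:   # hours 1-2
        return 30
    if elapsed_minutes < 240:   # hours 3-4
        return 45
    return 60                   # beyond four hours


def needs_enterprise_calls(elapsed_minutes: int) -> bool:
    """True once the outage is long enough to warrant personal calls."""
    return elapsed_minutes >= 240
```

Keeping the thresholds in one function means the reminder scheduler and the comms lead's checklist cannot drift apart.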

Post-Incident Communication

After resolution, communicate:

Immediate resolution notice — update status page, send email to affected users.

Postmortem summary — 3-5 days later, publish a plain-language summary of what happened and what you're doing to prevent recurrence. This is optional for small incidents but expected for major ones.

## Post-Incident Report Template

**What happened:**
On [date], our [service] experienced a [duration] outage that prevented 
users from [specific action].

**Root cause:**
A configuration change deployed at [time] caused [brief explanation without jargon].

**How we fixed it:**
We [action taken] at [time], which restored service by [time].

**What we're doing to prevent this:**
1. [Specific action 1]
2. [Specific action 2]
3. [Specific action 3]

We take reliability seriously and apologize for the disruption.
Questions? Contact [support email address].

Communication Automation

Automate the mechanical parts of communication:

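# Auto-post initial status update when incident is declared.
# status_page, slack, scheduler, and the *_TEMPLATE constants are placeholders
# for whatever clients and templates your own tooling provides.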
def on_incident_created(incident):
    if incident.severity in ["P1", "P2"]:
        # Post to status page immediately
        status_page.create_incident(
            name=f"Investigating {incident.affected_service} issues",
            status="investigating",
            impact=incident.estimated_impact,
            body=INITIAL_TEMPLATE.format(
                service=incident.affected_service,
                symptom=incident.user_symptom
            )
        )
        
        # Alert support team
        slack.post(
            channel="#support-escalation",
            text=SUPPORT_BRIEFING_TEMPLATE.format(
                incident_id=incident.id,
                affected_service=incident.affected_service,
                user_impact=incident.user_symptom,
                workaround=incident.workaround or "None available yet"
            )
        )
        
        # Schedule reminder if no update in 30 minutes
        scheduler.schedule(
            delay_minutes=30,
            task=remind_status_page_update,
            args=[incident.id]
        )
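The handler above schedules `remind_status_page_update` but does not show it. One possible shape for that task is sketched below. Everything here is an assumption: the `IncidentState` record, the injected `load_incident` and `notify` callables, and the extra parameters, which a real scheduler would need bound in advance (for example with `functools.partial`).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Optional


@dataclass
class IncidentState:
    id: str
    resolved: bool
    last_status_update: datetime


def remind_status_page_update(
    incident_id: str,
    load_incident: Callable[[str], Optional[IncidentState]],
    notify: Callable[[str], None],
    now: datetime,
    max_age: timedelta = timedelta(minutes=30),
) -> bool:
    """Nudge the comms lead if the status page has gone stale.

    Returns True if a reminder was sent, False otherwise.
    """
    incident = load_incident(incident_id)
    if incident is None or incident.resolved:
        return False  # nothing to remind about
    age = now - incident.last_status_update
    if age < max_age:
        return False  # someone already posted a recent update
    minutes = int(age.total_seconds() // 60)
    notify(
        f"Incident {incident.id}: status page not updated "
        f"for {minutes} minutes"
    )
    return True
```

Passing the lookup and notification functions in as arguments keeps the task testable without a live Slack workspace or incident database.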

Conclusion

Incident communication is a system — it requires pre-defined roles, pre-written templates, and clear escalation paths. When you plan communication before incidents happen, you ensure that even in high-stress situations, customers get timely updates, executives have the information they need, and your support team isn't fielding calls blind. Good communication doesn't make incidents less painful, but it does make them less damaging to customer relationships. Pair your AzMonitor alerts with a communication plan and you have both the detection and the response covered.


AzMonitor Team

The AzMonitor team writes guides based on experience monitoring millions of endpoints daily across 10,000+ customer environments. Our expertise covers uptime monitoring, SRE practices, and reliability engineering.
