How you communicate during an incident matters almost as much as how quickly you fix it. Customers who receive clear, timely status updates tend to be far more forgiving of outages than customers left in silence, guessing whether anyone is working on the problem. Companies that handle incidents well often emerge with stronger customer trust than before, because transparent communication during adversity builds credibility that marketing cannot replicate.
The Communication Failure Pattern
Most companies make the same mistakes during incidents:
- Too late: The first public communication comes 45 minutes into an incident that has already generated hundreds of support tickets. By then, customers are frustrated and the update feels reactive.
- Too vague: "We are investigating reports of issues affecting some users" tells customers nothing. They still don't know if they're affected, what's broken, or when it will be fixed.
- Too infrequent: An initial post is followed by two hours of silence. Customers read silence as a sign that no one is working on the problem.
- Too technical: Root cause explanations that reference database replication lag or Kubernetes node failures mean nothing to most customers.
- Too optimistic: Teams commit to resolution timelines that then slip, which erodes trust.
Communication Cadence
Establish update frequency before the incident happens, not during it:
| Incident Severity | Initial Update | Update Frequency | Resolution Update |
|---|---|---|---|
| P1 (Critical) | Within 10 minutes | Every 15-20 minutes | Immediate on resolution |
| P2 (High) | Within 20 minutes | Every 30 minutes | Within 5 minutes of resolution |
| P3 (Medium) | Within 60 minutes | Every hour | End of business day |
| P4 (Low) | Next business day | As needed | When resolved |
The cadence is a commitment to customers. If you say "next update in 15 minutes," publish one in 15 minutes even if you have no new information. The update can say "we're still investigating" — but it must come.
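One way to keep that commitment is to encode the cadence table as data and compute each deadline from it. The sketch below is only an illustration: the names `CADENCE`, `next_update_due`, and `is_overdue` are hypothetical, it uses the upper bound of the P1 range (20 minutes) as the interval, and it leaves out P4 because that tier has no fixed interval.

```python
from datetime import datetime, timedelta, timezone

# Minutes until the first public update and between later updates, keyed by
# severity. Values mirror the cadence table; P1 uses the upper bound of its
# 15-20 minute range. P4 is omitted because it has no fixed interval.
CADENCE = {
    1: {"initial": 10, "interval": 20},
    2: {"initial": 20, "interval": 30},
    3: {"initial": 60, "interval": 60},
}


def next_update_due(severity, declared_at, last_update_at=None):
    """Return the deadline for the next public update, or None for P4."""
    rule = CADENCE.get(severity)
    if rule is None:
        return None  # no fixed cadence for this severity
    if last_update_at is None:
        return declared_at + timedelta(minutes=rule["initial"])
    return last_update_at + timedelta(minutes=rule["interval"])


def is_overdue(severity, declared_at, last_update_at, now):
    """True when the promised update time has passed without a post."""
    due = next_update_due(severity, declared_at, last_update_at)
    return due is not None and now > due


declared = datetime(2025, 4, 16, 14, 22, tzinfo=timezone.utc)
first_deadline = next_update_due(1, declared)  # 14:32 UTC for a P1
```

A check like `is_overdue` could feed a reminder to whoever owns communications, so an update goes out on time even when there is nothing new to report.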
Writing Effective Status Updates
The Three-Part Structure
Every status update should answer three questions:
- What happened? (factual, no speculation about cause)
- What are we doing? (concrete actions where known, not just "investigating")
- What should you do / what's next? (action for users, next update time)
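If updates are drafted in a tool rather than by hand, the three questions can be made required fields. The following is a minimal sketch under that assumption; the `StatusUpdate` class and its field names are hypothetical and not tied to any status page product.

```python
from dataclasses import dataclass


@dataclass
class StatusUpdate:
    status: str             # e.g. "Investigating", "Identified", "Resolved"
    timestamp: str          # e.g. "2025-04-16 14:32 UTC"
    what_happened: str      # factual description of customer impact
    what_we_are_doing: str  # concrete actions in progress
    what_is_next: str       # action for users and next update time

    def render(self):
        """Return the update text, refusing to render if a part is empty."""
        parts = {
            "what_happened": self.what_happened,
            "what_we_are_doing": self.what_we_are_doing,
            "what_is_next": self.what_is_next,
        }
        missing = [name for name, text in parts.items() if not text.strip()]
        if missing:
            raise ValueError(f"Status update is missing: {', '.join(missing)}")
        header = f"{self.status} - [{self.timestamp}]"
        return "\n\n".join([header, *parts.values()])


update = StatusUpdate(
    status="Investigating",
    timestamp="2025-04-16 14:32 UTC",
    what_happened="Some logins are failing with a 500 Internal Server Error.",
    what_we_are_doing="Our engineering team is reviewing recent changes.",
    what_is_next="Next update by 14:50 UTC.",
)
print(update.render())
```

Raising an error on an empty part is a deliberate choice here: it makes it hard to publish an update that skips the next update time.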
Examples
Bad initial update:
We are aware of reports of issues and our team is investigating.
We will provide more information as it becomes available.
Good initial update:
Investigating — [2025-04-16 14:32 UTC]
We are investigating an issue causing login failures for a subset
of users. Users attempting to log in may see a "500 Internal Server
Error" or receive no response.
The issue began approximately 14:15 UTC. We do not yet know the cause
but our engineering team is actively investigating.
If you need urgent access, please contact support at support@example.com.
Next update by 14:50 UTC.
Bad progress update:
Update: We are still working on this issue and hope to have it
resolved soon.
Good progress update:
Identified — [2025-04-16 14:48 UTC]
We have identified the cause: a configuration change deployed at
14:10 UTC is causing authentication failures for users in the
EU region.
We are rolling back the change now. This rollback typically takes
5-10 minutes to complete.
Users in North America and Asia-Pacific are not affected.
Next update by 15:00 UTC or when the issue is resolved.
Good resolution update:
Resolved — [2025-04-16 15:02 UTC]
The issue affecting login for EU users has been resolved.
Timeline:
- 14:10 UTC: Configuration change deployed
- 14:22 UTC: Login failures begin
- 14:32 UTC: Issue detected and investigation started
- 14:48 UTC: Root cause identified
- 14:55 UTC: Rollback initiated
- 15:02 UTC: Service restored
Duration: 40 minutes. Affected users: approximately 12% of EU users.
We will publish a full incident report within 5 business days.
We apologize for the disruption.
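The 40-minute duration in the resolution update above is measured from the first login failures (14:22 UTC) to service restoration (15:02 UTC), not from the deploy. A small sketch of deriving such figures from the timeline, so the published numbers always agree with it, might look like this. The helper names are hypothetical, and the code assumes all entries fall on the same day.

```python
from datetime import datetime

# Timeline entries from the resolution update above (all times UTC).
TIMELINE = [
    ("14:10", "Configuration change deployed"),
    ("14:22", "Login failures begin"),
    ("14:32", "Issue detected and investigation started"),
    ("14:48", "Root cause identified"),
    ("14:55", "Rollback initiated"),
    ("15:02", "Service restored"),
]


def find_time(timeline, event):
    """Return the HH:MM time recorded for an event label."""
    for time, label in timeline:
        if label == event:
            return time
    raise KeyError(event)


def minutes_between(start, end):
    """Whole minutes between two same-day HH:MM times."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


impact_start = find_time(TIMELINE, "Login failures begin")
detected = find_time(TIMELINE, "Issue detected and investigation started")
restored = find_time(TIMELINE, "Service restored")

duration = minutes_between(impact_start, restored)        # 40
time_to_detect = minutes_between(impact_start, detected)  # 10
```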
Status Page Component Labels
Write status components from the customer's perspective, not your infrastructure's perspective:
| Don't Use | Use Instead |
|---|---|
| Database cluster | Data storage |
| Redis cache layer | Performance features |
| Load balancer | Website availability |
| CDN edge nodes | Content delivery |
| Message queue | Background processing |
| Auth service | Login and authentication |
| API gateway | API (for developers) |
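If incidents are opened from internal alerts, the translation can happen automatically. Here is a rough sketch assuming alerts carry the internal component names from the table; `CUSTOMER_LABELS` and both helper functions are hypothetical names.

```python
# Internal component name -> customer-facing label, taken from the table.
CUSTOMER_LABELS = {
    "Database cluster": "Data storage",
    "Redis cache layer": "Performance features",
    "Load balancer": "Website availability",
    "CDN edge nodes": "Content delivery",
    "Message queue": "Background processing",
    "Auth service": "Login and authentication",
    "API gateway": "API (for developers)",
}


def customer_label(internal_name):
    """Translate one internal name, failing loudly if it has no label."""
    try:
        return CUSTOMER_LABELS[internal_name]
    except KeyError:
        raise KeyError(
            f"No customer-facing label defined for {internal_name!r}"
        ) from None


def customer_labels(internal_names):
    """Translate a list of internal names, removing duplicates in order."""
    labels = []
    for name in internal_names:
        label = customer_label(name)
        if label not in labels:
            labels.append(label)
    return labels


affected = customer_labels(["Auth service", "Redis cache layer", "Auth service"])
```

Failing on an unmapped component, rather than publishing the internal name, keeps infrastructure terms off the status page by default.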
What Not to Say
| Phrase | Why to Avoid | Better Alternative |
|---|---|---|
| "Some users are affected" | Vague: every customer thinks they're affected | "Approximately 15% of users in the EU region" |
| "We're working on it" | No information about what is being done | "We're rolling back the deployment from 14:10 UTC" |
| "Should be resolved shortly" | Creates unmet expectations | "We expect resolution by 15:30 UTC" OR say nothing about timeline |
| "We apologize for any inconvenience" | Weak, dismissive | "We apologize for the disruption this caused" |
| "This was caused by a third-party provider" | Sounds like blame-shifting | "This was caused by an issue with our payment provider. Here's the impact..." |
| Root cause jargon | Confuses non-technical customers | Plain language description of user impact |
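Some of these phrases are regular enough to catch before publishing. The sketch below is an illustrative draft checker, not an exhaustive one: the patterns and the `lint_update` name are assumptions, and a match should prompt a human rewrite rather than block the post.

```python
import re

# Pattern -> advice, loosely based on the phrases in the table above.
DISCOURAGED = {
    r"\bsome users\b": "State the scope, e.g. a percentage and region.",
    r"\bworking on it\b": "Name the specific action being taken.",
    r"\b(resolved|fixed) (shortly|soon)\b": "Give a specific time or omit the timeline.",
    r"\bany inconvenience\b": "Apologize for the disruption directly.",
}


def lint_update(text):
    """Return (matched phrase, advice) pairs for discouraged wording."""
    findings = []
    for pattern, advice in DISCOURAGED.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            findings.append((match.group(0), advice))
    return findings


draft = "Some users are affected. We're working on it."
for phrase, advice in lint_update(draft):
    print(f"'{phrase}': {advice}")
```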
Handling Uncertainty
You won't always know the cause quickly. Communicate honestly about what you don't know:
# Handling Unknown Causes
## When you know symptoms but not cause:
"We're seeing elevated error rates on the checkout API.
We have not yet identified the cause. Our team is
actively investigating. Next update in 15 minutes."
## When you have a hypothesis but aren't certain:
"We believe the issue is related to a configuration change
deployed at 14:10 UTC. We're investigating this hypothesis
and will provide an update in 15 minutes."
(Note: say "we believe" not "we know")
## When resolution timeline is uncertain:
Never: "This will be fixed in the next few minutes"
OK: "We're working to resolve this as quickly as possible"
Better: "We expect to have more information about resolution
timeline in our next update at 15:30 UTC."
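Teams that generate updates from templates could tie the wording to how sure they are about the cause. This is a small sketch of that idea, reusing the phrasing from the examples in this article; the `Confidence` enum and `cause_statement` function are hypothetical.

```python
from enum import Enum


class Confidence(Enum):
    UNKNOWN = "unknown"        # symptoms known, cause not identified
    HYPOTHESIS = "hypothesis"  # likely cause, not yet verified
    CONFIRMED = "confirmed"    # cause verified


def cause_statement(confidence, cause=None):
    """Return a sentence about the cause that matches the confidence level."""
    if confidence is Confidence.UNKNOWN:
        return "We have not yet identified the cause."
    if not cause:
        raise ValueError("A cause description is required at this confidence level")
    if confidence is Confidence.HYPOTHESIS:
        # "We believe", never "we know", until the cause is verified.
        return f"We believe the issue is related to {cause}."
    return f"We have identified the cause: {cause}."


print(cause_statement(
    Confidence.HYPOTHESIS,
    "a configuration change deployed at 14:10 UTC",
))
```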
Multi-Channel Communication Strategy
Different stakeholders need information through different channels:
# incident_comms.py
class IncidentCommunicationsManager:
    """
    Manage multi-channel incident communication.

    Helper methods such as compose_initial_update,
    get_affected_enterprise_accounts, and schedule_customer_call
    are not shown here.
    """

    def __init__(self, status_page, slack, email):
        # Client objects are injected; their interfaces are assumed
        # by the calls below.
        self.status_page = status_page
        self.slack = slack
        self.email = email

    def send_initial_notification(self, incident):
        """Send initial notification to all appropriate channels."""
        update_text = self.compose_initial_update(incident)

        # 1. Status page (primary customer channel)
        self.status_page.create_incident(
            title=incident.title,
            status="investigating",
            body=update_text,
            components=incident.affected_components,
        )

        # 2. Internal Slack (team awareness)
        self.slack.post_to_channel(
            channel="#incident-status",
            message=f":red_circle: P{incident.severity} INCIDENT DECLARED\n{update_text}",
        )

        # 3. Enterprise customer direct notification (P1 only)
        if incident.severity == 1:
            self.notify_enterprise_customers(incident)

        # 4. Support team notification
        self.slack.post_to_channel(
            channel="#support-team",
            message=(
                "Incident in progress: please route customer reports to the "
                f"incident channel. Status page: {self.status_page.url}"
            ),
        )

    def notify_enterprise_customers(self, incident):
        """Send direct email to enterprise accounts affected."""
        affected_accounts = self.get_affected_enterprise_accounts(incident)
        for account in affected_accounts:
            contact = account.primary_technical_contact
            self.email.send(
                to=contact.email,
                subject=f"Service Disruption Affecting {account.name}",
                template="enterprise_incident_notification",
                context={
                    "contact_name": contact.name,
                    "company_name": account.name,
                    "incident_description": incident.customer_description,
                    "status_page_url": self.status_page.url,
                    "csm_name": account.customer_success_manager.name,
                    "csm_email": account.customer_success_manager.email,
                },
            )

    def send_resolution_notification(self, incident):
        """Send resolution notification and schedule postmortem."""
        # Update status page
        self.status_page.resolve_incident(
            incident_id=incident.id,
            resolution_body=self.compose_resolution_update(incident),
        )

        # Internal notification
        duration_min = int(incident.duration_seconds // 60)
        postmortem_day = incident.postmortem_date.strftime("%B %d")
        self.slack.post_to_channel(
            channel="#incident-status",
            message=(
                ":white_check_mark: INCIDENT RESOLVED\n"
                f"Duration: {duration_min} minutes\n"
                f"Postmortem scheduled for {postmortem_day}"
            ),
        )

        # Enterprise follow-up (P1 and P2 only)
        if incident.severity <= 2:
            for account in self.get_affected_enterprise_accounts(incident):
                self.schedule_customer_call(account, incident)
Subscriber Notification Timing
Send subscriber notifications for meaningful changes, not for every edit to the status page:
## Subscriber Notification Strategy
### DO send subscriber notifications for:
- Initial incident declaration (when impact is confirmed, not suspected)
- Status change (Investigating → Identified → Monitoring → Resolved)
- Resolution
### DON'T send subscriber notifications for:
- Every update (creates notification fatigue)
- When the incident resolves in under 5 minutes (not worth the noise)
- Scheduled maintenance (use advance maintenance window notifications)
### Why this matters:
Too many notifications → users unsubscribe → miss future real incidents
Too few notifications → users don't know you're transparent
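These rules are simple enough to express as a single decision function. The sketch below is one possible reading of them, not a definitive policy: the function and parameter names are hypothetical, and it assumes a short incident is only kept quiet when subscribers were never told about it in the first place.

```python
STATUSES = ("investigating", "identified", "monitoring", "resolved")
QUIET_RESOLUTION_SECONDS = 5 * 60  # incidents shorter than this stay quiet


def should_notify_subscribers(previous_status, new_status, *,
                              impact_confirmed=True,
                              scheduled_maintenance=False,
                              duration_seconds=None,
                              already_notified=False):
    """Decide whether a status page change should notify subscribers."""
    if new_status not in STATUSES:
        raise ValueError(f"Unknown status: {new_status!r}")
    if scheduled_maintenance:
        return False  # handled by advance maintenance notifications
    if (new_status == "resolved" and not already_notified
            and duration_seconds is not None
            and duration_seconds < QUIET_RESOLUTION_SECONDS):
        return False  # resolved in under 5 minutes, nobody was notified
    if previous_status is None:
        return impact_confirmed  # initial declaration, confirmed impact only
    # Notify on status changes, stay quiet for updates within a status.
    return new_status != previous_status


should_notify_subscribers(None, "investigating")             # True
should_notify_subscribers("investigating", "investigating")  # False
should_notify_subscribers("investigating", "identified")     # True
```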
Post-Incident Customer Communication
Within 24-48 hours of a significant incident, send affected customers a written follow-up:
# Post-Incident Email Template (to affected enterprise customers)
Subject: Incident Report — [Service] disruption on [Date]
[First Name],
On [date] at [time] UTC, [brief description of what customers experienced].
The issue lasted [duration] and affected [scope — N% of users, N customers].
## What happened
[2-3 sentence plain English explanation of root cause. Avoid jargon.]
## What we did
[Bullet points of specific actions taken]
- [Time]: Detected issue via automated monitoring
- [Time]: Identified root cause
- [Time]: Implemented fix
## How we're preventing this
[3-5 specific action items from the postmortem]
- We're adding [specific test] to prevent this class of error
- We're improving detection time by [specific change]
## Your impact
Based on our records, your account experienced [specific impact — N failed requests, N minute outage].
[If applicable: SLA credit] Per our SLA terms, we're crediting your account [X] days/amount. You'll see this reflected in your next invoice.
We're sorry for the disruption. Please reach out if you have any questions.
[Your name]
[Title]
Conclusion
Status update quality during incidents is a competitive differentiator: customers often remember how they were treated when things went wrong more vividly than the incident itself. Teams that communicate clearly, frequently, and honestly build a reservoir of trust that weathers future incidents. Pair transparent communication with monitoring that detects issues quickly in the first place. AzMonitor's public status pages, with automatic incident updates, subscriber email notifications, and uptime history, give you a platform for publishing professional, trust-building updates without building the infrastructure from scratch.