How you communicate during an incident matters almost as much as how quickly you fix it. Customers who receive clear, timely status updates are significantly more forgiving of outages than customers who are left in silence, guessing whether anyone is working on the problem. The companies that handle incidents well often emerge with stronger customer trust than before the incident — because transparent communication during adversity builds credibility that no marketing can replicate.

The Communication Failure Pattern

Most companies make the same mistakes during incidents:

Too late — The first public communication comes 45 minutes into an incident that's already generated hundreds of support tickets. By then, customers are frustrated and the update feels reactive.

Too vague — "We are investigating reports of issues affecting some users" tells customers nothing. They still don't know if they're affected, what's broken, or when it will be fixed.

Too infrequent — An initial post followed by two hours of silence. Silence signals that no one is working on it.

Too technical — Root cause explanations that reference database replication lag or Kubernetes node failures mean nothing to most customers.

Too optimistic — Committing to resolution timelines that slip, eroding trust.

Communication Cadence

Establish update frequency before the incident happens, not during it:

| Incident Severity | Initial Update | Update Frequency | Resolution Update |
|---|---|---|---|
| P1 (Critical) | Within 10 minutes | Every 15-20 minutes | Immediate on resolution |
| P2 (High) | Within 20 minutes | Every 30 minutes | Within 5 minutes of resolution |
| P3 (Medium) | Within 60 minutes | Every hour | End of business day |
| P4 (Low) | Next business day | As needed | When resolved |

The cadence is a commitment to customers. If you say "next update in 15 minutes," publish one in 15 minutes even if you have no new information. The update can say "we're still investigating" — but it must come.
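
One way to keep that commitment is to have tooling track when the next update is owed. The sketch below is illustrative only: the names `UPDATE_INTERVAL_MINUTES`, `next_update_due`, and `is_update_overdue` are hypothetical, and it assumes the lower bound of the P1 range (15 minutes) and no fixed cadence for P4.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Update intervals in minutes, taken from the cadence table.
# Assumption: P1 uses the lower bound of its 15-20 minute range;
# P4 is "as needed", so it has no entry.
UPDATE_INTERVAL_MINUTES = {1: 15, 2: 30, 3: 60}


def next_update_due(severity: int, last_update: datetime) -> Optional[datetime]:
    """Return when the next update is owed, or None if there is no fixed cadence."""
    interval = UPDATE_INTERVAL_MINUTES.get(severity)
    if interval is None:
        return None
    return last_update + timedelta(minutes=interval)


def is_update_overdue(severity: int, last_update: datetime, now: datetime) -> bool:
    """True when the promised update time has passed without a new post."""
    due = next_update_due(severity, last_update)
    return due is not None and now > due


last_post = datetime(2025, 4, 16, 14, 32, tzinfo=timezone.utc)
print(next_update_due(1, last_post))  # 15 minutes after the last post
```

A check like `is_update_overdue` could drive a reminder to the incident lead a few minutes before the deadline, so the "still investigating" update goes out on time.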

Writing Effective Status Updates

The Three-Part Structure

Every status update should answer three questions:

What happened? (factual, no speculation about cause)
What are we doing? (concrete actions, not "investigating")
What should you do / what's next? (action for users, next update time)
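
If updates are composed through tooling, the three questions can be enforced as required fields. This is a minimal sketch under stated assumptions: `StatusUpdate` and its field names are hypothetical, not part of any status page API, and the header format is a simplified version of the one used in the examples in this post.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class StatusUpdate:
    status: str             # e.g. "Investigating", "Identified", "Resolved"
    posted_at: datetime     # expected to be in UTC
    what_happened: str      # factual description of customer impact
    what_we_are_doing: str  # concrete action in progress
    what_is_next: str       # action for users and the next update time

    def render(self) -> str:
        """Render the update, refusing to publish if any part is empty."""
        parts = [self.what_happened, self.what_we_are_doing, self.what_is_next]
        if any(not part.strip() for part in parts):
            raise ValueError("every update must answer all three questions")
        header = f"{self.status} [{self.posted_at:%Y-%m-%d %H:%M} UTC]"
        return "\n\n".join([header, *parts])


update = StatusUpdate(
    status="Investigating",
    posted_at=datetime(2025, 4, 16, 14, 32, tzinfo=timezone.utc),
    what_happened="Some login attempts are failing with a 500 error.",
    what_we_are_doing="Our engineering team is reviewing recent changes.",
    what_is_next="Next update by 14:50 UTC.",
)
print(update.render())
```

Raising on an empty field is a deliberate choice here: it is easier to fix a missing "what's next" before publishing than to explain its absence afterwards.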

Examples

Bad initial update:

We are aware of reports of issues and our team is investigating. 
We will provide more information as it becomes available.

Good initial update:

Investigating — [2025-04-16 14:32 UTC]

We are investigating an issue causing login failures for a subset 
of users. Users attempting to log in may see a "500 Internal Server 
Error" or receive no response.

The issue began approximately 14:15 UTC. We do not yet know the cause 
but our engineering team is actively investigating.

If you need urgent access, please contact support at [email protected].

Next update by 14:50 UTC.

Bad progress update:

Update: We are still working on this issue and hope to have it 
resolved soon.

Good progress update:

Identified — [2025-04-16 14:48 UTC]

We have identified the cause: a configuration change deployed at 
14:10 UTC is causing authentication failures for users in the 
EU region.

We are rolling back the change now. This rollback typically takes 
5-10 minutes to complete.

Users in North America and Asia-Pacific are not affected.

Next update by 15:00 UTC or when the issue is resolved.

Good resolution update:

Resolved — [2025-04-16 15:02 UTC]

The issue affecting login for EU users has been resolved.

Timeline:
- 14:10 UTC: Configuration change deployed
- 14:22 UTC: Login failures begin
- 14:32 UTC: Issue detected and investigation started
- 14:48 UTC: Root cause identified
- 14:50 UTC: Rollback initiated
- 15:02 UTC: Service restored

Duration: 40 minutes. Affected users: approximately 12% of EU users.

We will publish a full incident report within 5 business days. 
We apologize for the disruption.
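
The duration figure in a resolution update is easy to get wrong under pressure, so it can help to compute it from the timeline instead of by hand. The sketch below assumes all timestamps fall on the same UTC day; `minutes_between` is a hypothetical helper name.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Minutes from start to end, both 'HH:MM' strings on the same UTC day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Times taken from the resolution update example.
impact_start = "14:22"  # login failures begin
detected = "14:32"      # issue detected and investigation started
restored = "15:02"      # service restored

print("Time to detect:", minutes_between(impact_start, detected), "minutes")
print("Customer-facing duration:", minutes_between(impact_start, restored), "minutes")
```

Measuring from the first login failures (14:22) instead of the deployment (14:10) gives the 40-minute duration quoted in the example, because customers experienced the outage from the first failure, not from the deploy.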

Status Page Component Labels

Write status components from the customer's perspective, not your infrastructure's perspective:

| Don't Use | Use Instead |
|---|---|
| Database cluster | Data storage |
| Redis cache layer | Performance features |
| Load balancer | Website availability |
| CDN edge nodes | Content delivery |
| Message queue | Background processing |
| Auth service | Login and authentication |
| API gateway | API (for developers) |

What Not to Say

| Phrase | Why to Avoid | Better Alternative |
|---|---|---|
| "Some users are affected" | Vague: every customer thinks they're affected | "Approximately 15% of users in the EU region" |
| "We're working on it" | No information about what is being done | "We're rolling back the deployment from 14:10 UTC" |
| "Should be resolved shortly" | Creates unmet expectations | "We expect resolution by 15:30 UTC" OR say nothing about timeline |
| "We apologize for any inconvenience" | Weak, dismissive | "We apologize for the disruption this caused" |
| "This was caused by a third-party provider" | Sounds like blame-shifting | "This was caused by an issue with our payment provider. Here's the impact..." |
| Root cause jargon | Confuses non-technical customers | Plain language description of user impact |
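
A simple pre-publish check can catch some of these phrases before an update goes out. The following is a rough sketch, not a complete linter: the patterns in `VAGUE_PHRASES` and the `lint_update` function are hypothetical, and they cover only a few of the phrasings from the table.

```python
import re
from typing import List

# Regex patterns for phrases to avoid, each paired with advice.
# Assumption: this short list is illustrative; a real list would be
# tuned to your own team's writing habits.
VAGUE_PHRASES = {
    r"\bsome users\b": "vague scope: give a percentage or region",
    r"\bworking on it\b": "name the concrete action being taken",
    r"\b(resolved|fixed) (shortly|soon)\b": "give a specific time or no timeline",
    r"\bany inconvenience\b": "apologize for the actual disruption",
}


def lint_update(text: str) -> List[str]:
    """Return advice for each vague phrase found in a draft update."""
    findings = []
    for pattern, advice in VAGUE_PHRASES.items():
        if re.search(pattern, text, flags=re.IGNORECASE):
            findings.append(advice)
    return findings


draft = "Some users are affected. We apologize for any inconvenience."
for advice in lint_update(draft):
    print("-", advice)
```

A check like this works best as a prompt to the author, not a hard block: during a P1, a flagged update published on time is still better than a perfect one published late.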

Handling Uncertainty

You won't always know the cause quickly. Communicate honestly about what you don't know:

# Handling Unknown Causes

## When you know symptoms but not cause:
"We're seeing elevated error rates on the checkout API. 
We have not yet identified the cause. Our team is 
actively investigating. Next update in 15 minutes."

## When you have a hypothesis but aren't certain:
"We believe the issue is related to a configuration change 
deployed at 14:10 UTC. We're investigating this hypothesis 
and will provide an update in 15 minutes."
(Note: say "we believe" not "we know")

## When resolution timeline is uncertain:
Never: "This will be fixed in the next few minutes"
OK: "We're working to resolve this as quickly as possible"
Better: "We expect to have more information about resolution 
timeline in our next update at 15:30 UTC."

Multi-Channel Communication Strategy

Different stakeholders need information through different channels:

# incident_comms.py
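class IncidentCommunicationsManager:
    """
    Manage multi-channel incident communication.

    The status page, Slack, and email clients are injected so that each
    channel can be swapped out or mocked in tests.
    """

    def __init__(self, status_page, slack, email):
        self.status_page = status_page
        self.slack = slack
        self.email = email
    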
    def send_initial_notification(self, incident):
        """Send initial notification to all appropriate channels."""
        
        update_text = self.compose_initial_update(incident)
        
        # 1. Status page (primary customer channel)
        self.status_page.create_incident(
            title=incident.title,
            status="investigating",
            body=update_text,
            components=incident.affected_components
        )
        
        # 2. Internal Slack (team awareness)
        self.slack.post_to_channel(
            channel="#incident-status",
            message=f":red_circle: P{incident.severity} INCIDENT DECLARED\n{update_text}"
        )
        
        # 3. Enterprise customer direct notification (P1 only)
        if incident.severity == 1:
            self.notify_enterprise_customers(incident)
        
        # 4. Support team notification
        self.slack.post_to_channel(
            channel="#support-team",
            message=f"Incident in progress — please route customer reports to the incident channel. Status page: {self.status_page.url}"
        )
    
    def notify_enterprise_customers(self, incident):
        """Send direct email to enterprise accounts affected."""
        
        affected_accounts = self.get_affected_enterprise_accounts(incident)
        
        for account in affected_accounts:
            contact = account.primary_technical_contact
            
            self.email.send(
                to=contact.email,
                subject=f"Service Disruption Affecting {account.name}",
                template="enterprise_incident_notification",
                context={
                    "contact_name": contact.name,
                    "company_name": account.name,
                    "incident_description": incident.customer_description,
                    "status_page_url": self.status_page.url,
                    "csm_name": account.customer_success_manager.name,
                    "csm_email": account.customer_success_manager.email
                }
            )
    
    def send_resolution_notification(self, incident):
        """Send resolution notification and schedule postmortem."""
        
        # Update status page
        self.status_page.resolve_incident(
            incident_id=incident.id,
            resolution_body=self.compose_resolution_update(incident)
        )
        
        # Internal notification
        duration_min = int(incident.duration_seconds / 60)
        self.slack.post_to_channel(
            channel="#incident-status",
            message=f":white_check_mark: INCIDENT RESOLVED\nDuration: {duration_min} minutes\nPostmortem scheduled for {incident.postmortem_date.strftime('%B %d')}"
        )
        
        # Enterprise follow-up
        if incident.severity <= 2:
            for account in self.get_affected_enterprise_accounts(incident):
                self.schedule_customer_call(account, incident)

Subscriber Notification Timing

Status page subscriber notifications should be sent strategically:

## Subscriber Notification Strategy

### DO send subscriber notifications for:
- Initial incident declaration (when impact is confirmed, not suspected)
- Status change (Investigating → Identified → Monitoring → Resolved)
- Resolution

### DON'T send subscriber notifications for:
- Every update (creates notification fatigue)
- When the incident resolves in under 5 minutes (not worth the noise)
- Scheduled maintenance (use advance maintenance window notifications)

### Why this matters:
Too many notifications → users unsubscribe → miss future real incidents
Too few notifications → users don't know you're transparent
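
These rules can be expressed as a small decision function. The sketch below is one possible reading of the strategy, with assumptions called out: `should_notify_subscribers` and its parameters are hypothetical names, `previous_status` is `None` when nothing has been announced yet, and the five-minute rule is applied only to incidents that were never announced.

```python
from typing import Optional


def should_notify_subscribers(
    previous_status: Optional[str],
    new_status: str,
    impact_confirmed: bool = True,
    duration_minutes: float = 0.0,
    scheduled_maintenance: bool = False,
) -> bool:
    """Decide whether a status page post should also notify subscribers."""
    # Scheduled maintenance uses advance maintenance-window notifications.
    if scheduled_maintenance:
        return False
    # Only notify once impact is confirmed, not merely suspected.
    if not impact_confirmed:
        return False
    # Assumption: an incident that was never announced and resolved
    # in under 5 minutes is not worth the noise.
    if previous_status is None and new_status == "resolved" and duration_minutes < 5:
        return False
    # Notify on the initial declaration and on status changes,
    # but not on updates that keep the same status.
    return previous_status != new_status


print(should_notify_subscribers(None, "investigating"))          # initial declaration
print(should_notify_subscribers("investigating", "identified"))  # status change
print(should_notify_subscribers("identified", "identified"))     # routine update
```

Routine updates still appear on the status page itself; the function only controls whether subscribers get an email or push notification for them.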

Post-Incident Customer Communication

Within 24-48 hours after a significant incident:

# Post-Incident Email Template (to affected enterprise customers)

Subject: Incident Report — [Service] disruption on [Date]

[First Name],

On [date] at [time] UTC, [brief description of what customers experienced].
The issue lasted [duration] and affected [scope — N% of users, N customers].

## What happened

[2-3 sentence plain English explanation of root cause. Avoid jargon.]

## What we did

[Bullet points of specific actions taken]
- [Time]: Detected issue via automated monitoring
- [Time]: Identified root cause
- [Time]: Implemented fix

## How we're preventing this

[3-5 specific action items from the postmortem]
- We're adding [specific test] to prevent this class of error
- We're improving detection time by [specific change]

## Your impact

Based on our records, your account experienced [specific impact — N failed requests, N minute outage].

[If applicable: SLA credit] Per our SLA terms, we're crediting your account [X] days/amount. You'll see this reflected in your next invoice.

We're sorry for the disruption. Please reach out if you have any questions.

[Your name]
[Title]

Conclusion

Status update quality during incidents is a competitive differentiator: customers remember how they were treated when things went wrong more vividly than they remember the incident itself. Teams that communicate clearly, frequently, and honestly build a reservoir of trust that weathers future incidents. Pair transparent communication with monitoring that detects issues quickly in the first place. AzMonitor's public status pages, with automatic incident updates, subscriber email notifications, and uptime history, give you a platform for professional, trust-building updates without building the infrastructure from scratch.

Tags: incident communication, status updates, customer communication, transparency

AzMonitor Team

The AzMonitor team writes guides based on experience monitoring millions of endpoints daily across 10,000+ customer environments. Our expertise covers uptime monitoring, SRE practices, and reliability engineering.

Status Updates During Incidents: Communication That Builds Trust