Treating every alert as equally urgent is a recipe for burnout and missed real emergencies. When everything is critical, nothing is. A well-designed severity framework lets your team triage incidents instantly, apply appropriate urgency, and protect on-call engineers from unnecessary disruption — while ensuring genuine emergencies get immediate attention.
Why Severity Levels Matter
Without defined severity levels, every incident depends on the judgment of whoever happens to see it first. That leads to inconsistency: one engineer pages the entire team for a minor degradation; another under-reacts to a full outage because they don't want to "overreact." Severity levels replace guesswork with defined criteria.
A good severity framework:
- Is objective — based on impact, not feeling
- Is fast to apply — can be determined in 30 seconds
- Drives consistent response — same severity = same urgency everywhere
- Is reviewable — you can audit whether incidents were correctly classified
The P1-P4 Framework
Many engineering organizations use a four-tier severity system. Here's a definition you can adapt to your own services:
P1 — Critical
Definition: Complete service outage or severe degradation affecting all or most users. Revenue or reputational damage is occurring or imminent.
Examples:
- Login/authentication completely broken
- Checkout or payment processing down
- All API endpoints returning 5xx errors
- Data loss or corruption actively occurring
- Security breach in progress
Response requirements:
- Page the primary on-call immediately
- Automatic escalation if not acknowledged in 5 minutes
- Communication to stakeholders within 15 minutes
- All-hands war room for complex issues
- Executive notification for outages > 30 minutes
- Status page updated within 5 minutes
P2 — High
Definition: Significant functionality degraded or unavailable for a subset of users, or full degradation of non-critical functionality. No current data loss but risk is elevated.
Examples:
- Checkout extremely slow (> 10 seconds) for all users
- Feature unavailable for specific user segments
- Performance degradation affecting user experience significantly
- Backup monitoring showing failures
Response requirements:
- Notify on-call engineer via Slack + push notification
- Acknowledge within 30 minutes
- Resolution within 4 hours
- Status page update if externally visible
P3 — Medium
Definition: Minor functionality impaired or non-critical service degraded. Workaround exists. Limited user impact.
Examples:
- Non-critical feature unavailable
- Performance worse than normal but still within SLA thresholds
- Elevated error rates in background jobs
- Monitoring alert that doesn't affect users directly
Response requirements:
- Slack notification to on-call and team channel
- Investigate during business hours
- Resolution within 24 hours
- No immediate page required
P4 — Low
Definition: Minor issues, cosmetic problems, or informational items requiring no immediate action.
Examples:
- Planned maintenance approaching
- Non-critical performance trend worth investigating
- Documentation or minor UI issues
- Informational threshold exceeded
Response requirements:
- Create ticket in backlog
- Address in next sprint or maintenance window
- No direct notification to on-call
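The response requirements above can also be written down as data, which makes them easy to check automatically. The following is a minimal Python sketch, not a definitive implementation: `ResponseTargets`, `TARGETS`, and `breached_targets` are hypothetical names, and the 5-minute P1 escalation threshold is treated here as the P1 acknowledgement target. No resolution target is stated for P1 or P4, so those are left empty rather than guessed.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class ResponseTargets:
    """Response targets for one severity level (None means no target is stated)."""
    pages_immediately: bool
    acknowledge_within: Optional[timedelta]
    resolve_within: Optional[timedelta]


# Values taken from the P1-P4 definitions above.
TARGETS = {
    "P1": ResponseTargets(True, timedelta(minutes=5), None),
    "P2": ResponseTargets(False, timedelta(minutes=30), timedelta(hours=4)),
    "P3": ResponseTargets(False, None, timedelta(hours=24)),
    "P4": ResponseTargets(False, None, None),
}


def breached_targets(severity: str, open_for: timedelta, acknowledged: bool) -> list[str]:
    """Return the names of the targets an open incident has already missed."""
    targets = TARGETS[severity]
    missed = []
    if (targets.acknowledge_within is not None
            and not acknowledged
            and open_for > targets.acknowledge_within):
        missed.append("acknowledge")
    if targets.resolve_within is not None and open_for > targets.resolve_within:
        missed.append("resolve")
    return missed
```

For example, a P2 that has been open for five hours without acknowledgement has missed both its acknowledgement and its resolution target, while a P4 never breaches anything because it has no targets.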
Severity Decision Matrix
When an incident occurs, use this matrix to determine severity:
| User Impact | Revenue Impact | Data Risk | Severity |
|---|---|---|---|
| All users affected | High | Any | P1 |
| Most users affected | High | None | P1 |
| Most users affected | Low | None | P2 |
| Some users affected | High | None | P1 |
| Some users affected | Low | None | P2 |
| Few users affected | Any | None | P3 |
| No user impact | None | None | P3/P4 |
When in doubt, escalate severity rather than downgrade. You can always downgrade after investigation; you can't undo the delay from under-reacting to a P1.
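To make the matrix quick to apply, it can be encoded as a small function. This is a sketch under stated assumptions: `UserImpact` and `classify` are hypothetical names, and the matrix leaves some combinations open (data risk outside the top row, or all users affected with low revenue impact). Those gaps are filled here by applying the "when in doubt, escalate" rule, which is one reasonable reading rather than the only one.

```python
from enum import Enum


class UserImpact(Enum):
    NONE = 0
    FEW = 1
    SOME = 2
    MOST = 3
    ALL = 4


def classify(user_impact: UserImpact, high_revenue_impact: bool, data_at_risk: bool) -> str:
    """Map the three matrix inputs to a severity label."""
    # Assumption: the matrix only lists data risk in its top row. Active data
    # loss appears among the P1 examples, so any data risk is treated as P1.
    if data_at_risk:
        return "P1"
    # Assumption: "all users affected" with low revenue impact is not in the
    # matrix; following "when in doubt, escalate", it is treated as P1.
    if user_impact is UserImpact.ALL:
        return "P1"
    if user_impact in (UserImpact.MOST, UserImpact.SOME):
        return "P1" if high_revenue_impact else "P2"
    if user_impact is UserImpact.FEW:
        return "P3"
    # The matrix says "P3/P4" here; the higher of the two is returned so a
    # human can downgrade after a quick look.
    return "P3"
```

For instance, `classify(UserImpact.SOME, high_revenue_impact=True, data_at_risk=False)` returns `"P1"`, matching the fourth row of the matrix.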
Alert Configuration by Severity
Map your monitoring alerts directly to severity levels:
# Monitoring alert configuration with severity mapping
alerts:
  # P1 - Immediate page
  - name: "Payment Service Completely Down"
    condition: "payment_availability < 90% for 3 consecutive checks"
    severity: P1
    channels:
      - pagerduty_critical
      - slack_incidents
      - email_oncall
    escalation_after_minutes: 5

  # P1 - Immediate page
  - name: "Login Endpoint Down"
    condition: "login_success_rate < 95% for 2 consecutive checks"
    severity: P1
    channels:
      - pagerduty_critical

  # P2 - Urgent notification
  - name: "API Latency High"
    condition: "api_p95_latency > 2000ms for 10 minutes"
    severity: P2
    channels:
      - pagerduty_high
      - slack_engineering

  # P2 - Urgent notification
  - name: "Error Rate Elevated"
    condition: "error_rate > 2% for 5 minutes"
    severity: P2
    channels:
      - slack_engineering

  # P3 - Business hours notification
  - name: "Cache Hit Rate Dropping"
    condition: "cache_hit_rate < 80% for 30 minutes"
    severity: P3
    channels:
      - slack_engineering
    only_during_business_hours: true

  # P4 - Ticket only
  - name: "SSL Expiry in 30 days"
    condition: "ssl_days_until_expiry < 30"
    severity: P4
    channels:
      - create_jira_ticket
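The routing that this configuration implies can be sketched in a few lines of Python. Treat this as an illustration only: the channel names mirror the config above, while `CHANNELS`, `in_business_hours`, `channels_for`, and the Monday-to-Friday 09:00-17:00 window are assumptions made for the example, not part of any particular monitoring tool.

```python
from datetime import datetime, time

# Channel names mirror the alert configuration above.
CHANNELS = {
    "P1": ["pagerduty_critical", "slack_incidents", "email_oncall"],
    "P2": ["pagerduty_high", "slack_engineering"],
    "P3": ["slack_engineering"],
    "P4": ["create_jira_ticket"],
}

# Assumed working hours in local time; adjust to your team.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


def in_business_hours(now: datetime) -> bool:
    """True on Monday-Friday between BUSINESS_START and BUSINESS_END."""
    return now.weekday() < 5 and BUSINESS_START <= now.time() < BUSINESS_END


def channels_for(severity: str, now: datetime) -> list[str]:
    """Return the channels to notify right now; an empty list means 'hold'."""
    if severity == "P3" and not in_business_hours(now):
        return []  # P3 waits for business hours; nobody is disturbed at night
    return CHANNELS[severity]
```

A P3 raised at 3 AM on a Saturday returns an empty list, while a P1 at the same moment still goes to every critical channel.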
Severity in Practice: Common Mistakes
Overcalling P1 — Teams that cry wolf with P1 designations train their engineers to treat pages as low-urgency. Reserve P1 for genuine emergencies.
Undercalling to avoid "disruption" — Engineers sometimes downgrade severity to avoid paging someone at night. This is the opposite of helpful — it delays response to real problems.
Not escalating severity during incidents — An incident that starts as P2 (partial degradation) and spreads to all users should be upgraded to P1. Severity should reflect current impact, not initial assessment.
No severity for third-party failures — When Stripe is down and payments are failing, that's still a P1 for your users — even though you can't fix it. Severity is about user impact, not root cause.
Escalation Policies by Severity
Automate escalation so it doesn't depend on human judgment during high-stress moments:
P1 Escalation:
  0 min: Page primary on-call (Slack + SMS + phone)
  5 min (if unacknowledged): Page secondary on-call
  15 min: Page engineering manager
  30 min: Page VP Engineering
  60 min: Page C-suite

P2 Escalation:
  0 min: Slack notification to on-call
  30 min (if unacknowledged): Page on-call
  2 hr: Notify engineering manager

P3:
  0 min: Slack notification to team channel
  Next business day: Review in standup
Configure this in your alerting tool:
# PagerDuty escalation policy
escalation_policy:
  name: "P1 - Critical Incidents"
  rules:
    - escalation_delay_in_minutes: 0
      targets:
        - type: schedule
          id: primary-oncall-schedule
    - escalation_delay_in_minutes: 5
      targets:
        - type: schedule
          id: secondary-oncall-schedule
    - escalation_delay_in_minutes: 15
      targets:
        - type: user
          id: engineering-manager
    - escalation_delay_in_minutes: 30
      targets:
        - type: user
          id: vp-engineering
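It can help to test an escalation chain before relying on it. The sketch below replays the P1 timeline in plain Python; `P1_ESCALATION` and `paged_so_far` are hypothetical names, and it assumes that every later step is skipped once the incident is acknowledged. The timeline only says "if unacknowledged" for the 5-minute step, so verify that behavior against your own alerting tool.

```python
from typing import Optional

# (minutes since the incident opened, who gets paged), copied from the
# P1 escalation timeline above.
P1_ESCALATION = [
    (0, "primary on-call"),
    (5, "secondary on-call"),
    (15, "engineering manager"),
    (30, "VP Engineering"),
    (60, "C-suite"),
]


def paged_so_far(minutes_open: int, acknowledged_at: Optional[int] = None) -> list[str]:
    """List everyone paged up to `minutes_open`, stopping at acknowledgement."""
    paged = []
    for delay, who in P1_ESCALATION:
        if delay > minutes_open:
            break  # this step has not been reached yet
        # Assumption: an acknowledgement cancels every step scheduled at or
        # after it. The first page (delay 0) always goes out.
        if delay > 0 and acknowledged_at is not None and acknowledged_at <= delay:
            break
        paged.append(who)
    return paged
```

Twenty minutes into an unacknowledged P1, the primary, the secondary, and the engineering manager have all been paged. If the primary acknowledged at minute 3, only the primary was ever paged.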
Severity Tracking and Retrospectives
Track severity distribution over time to identify patterns:
-- Monthly incident severity distribution
SELECT
  DATE_TRUNC('month', created_at) AS month,
  severity,
  COUNT(*) AS incident_count,
  AVG(EXTRACT(EPOCH FROM (resolved_at - created_at)) / 60) AS avg_mttr_minutes
FROM incidents
WHERE created_at > NOW() - INTERVAL '6 months'
GROUP BY DATE_TRUNC('month', created_at), severity
ORDER BY month DESC, severity;
Watch for these patterns:
| Pattern | Meaning | Action |
|---|---|---|
| P1 count increasing | More frequent critical failures | Architecture review |
| High share of P3/P4 incidents | Good severity calibration | Keep up current practices |
| P2 count > P1 count | Good: issues are caught before full outage | Maintain early detection |
| Everything classified P1 | Overcalling severity | Revisit severity criteria |
| Many P3s open longer than 24 hours | Teams not acting on medium incidents | Review P3 response SLA |
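Some of these patterns can be flagged automatically from a list of incident severities. The following Python sketch is illustrative only: `severity_warnings` is a hypothetical helper, the 50% cutoff for "overcalling" is an arbitrary assumption to tune against your own history, and reading "fewer P2s than P1s" as a warning sign is an inference from the table rather than something it states directly.

```python
from collections import Counter


def severity_warnings(severities: list[str], p1_share_limit: float = 0.5) -> list[str]:
    """Flag calibration problems in one period's incident severities."""
    counts = Counter(severities)
    total = sum(counts.values())
    warnings = []
    if total == 0:
        return warnings
    # "Everything classified P1" suggests overcalling.
    if counts["P1"] / total > p1_share_limit:
        warnings.append("possible P1 overcalling")
    # The table treats P2 > P1 as healthy; the reverse may mean problems are
    # only noticed once they are full outages.
    if counts["P2"] < counts["P1"]:
        warnings.append("more P1 than P2: issues may be caught late")
    return warnings
```

Feeding it one month's `severity` column from the query above gives a quick first pass before a retrospective; it does not replace reading the incidents themselves.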
Communicating Severity to Stakeholders
Different stakeholders need different information based on severity:
# Stakeholder Communication Matrix
## P1 Incidents
**Engineering team:** Full technical details via Slack + incident channel
**Customer Success:** Impact statement within 15 minutes:
"We are experiencing an issue affecting [X% of users / all users].
Engineering is actively working on resolution. Next update in 30 minutes."
**Status Page:** Update immediately with incident details
**Executive:** Brief (3-sentence) summary within 30 minutes
## P2 Incidents
**Engineering team:** Slack notification
**Customer Success:** Heads-up if customer-visible
**Status Page:** Update if externally visible impact
**Executive:** No notification unless escalates to P1
## P3/P4 Incidents
**Engineering team:** Slack or Jira ticket
**No external communication required**
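The matrix above is simple enough to encode, which keeps the first customer-facing message consistent under pressure. This is a minimal sketch: `audiences` and `impact_statement` are hypothetical helpers, the audience labels are made up for the example, and "customer-visible" and "externally visible" are treated as the same condition.

```python
def audiences(severity: str, externally_visible: bool = False) -> list[str]:
    """Return who should hear about an incident, following the matrix above."""
    if severity == "P1":
        return ["engineering", "customer_success", "status_page", "executive"]
    if severity == "P2":
        # Customer Success and the status page are only involved when the
        # impact can be seen from outside.
        extra = ["customer_success", "status_page"] if externally_visible else []
        return ["engineering"] + extra
    return ["engineering"]  # P3/P4: no external communication


def impact_statement(affected: str, next_update_minutes: int = 30) -> str:
    """Fill in the P1 impact-statement template for Customer Success."""
    return (
        f"We are experiencing an issue affecting {affected}. "
        "Engineering is actively working on resolution. "
        f"Next update in {next_update_minutes} minutes."
    )
```

Calling `impact_statement("40% of users")` produces the template text with the placeholder filled in, ready to paste into the incident channel.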
Conclusion
A clear severity framework removes ambiguity from incident response. P1 always means the same thing: immediate attention, all hands if needed, customer communication. P3 means business hours, no pages. When your monitoring alerts include severity levels and your team agrees on what each level means, incidents become manageable — even at 3 AM. Combine clear severity definitions with AzMonitor's multi-channel alerting and you have a system where the right people get notified with the right urgency, every time.