Treating every alert as equally urgent is a recipe for burnout and missed real emergencies. When everything is critical, nothing is. A well-designed severity framework lets your team triage incidents instantly, apply appropriate urgency, and protect on-call engineers from unnecessary disruption — while ensuring genuine emergencies get immediate attention.

Why Severity Levels Matter

Without defined severity levels, every incident depends on the judgment of whoever happens to see it first. That leads to inconsistency: one engineer pages the entire team for a minor degradation; another under-reacts to a full outage because they don't want to "overreact." Severity levels replace guesswork with defined criteria.

A good severity framework:

Is objective — based on impact, not feeling
Is fast to apply — can be determined in 30 seconds
Drives consistent response — same severity = same urgency everywhere
Is reviewable — you can audit whether incidents were correctly classified

The P1-P4 Framework

Many engineering organizations use a four-tier severity system. Here's one workable definition:

P1 — Critical

Definition: Complete service outage or severe degradation affecting all or most users. Revenue or reputational damage is occurring or imminent.

Examples:

Login/authentication completely broken
Checkout or payment processing down
All API endpoints returning 5xx errors
Data loss or corruption actively occurring
Security breach in progress

Response requirements:

Page the primary on-call immediately
Automatic escalation if not acknowledged in 5 minutes
Communication to stakeholders within 15 minutes
All-hands war room for complex issues
Executive notification for outages > 30 minutes
Status page updated within 5 minutes

P2 — High

Definition: Significant functionality degraded or unavailable for a subset of users, or full degradation of non-critical functionality. No current data loss but risk is elevated.

Examples:

Checkout extremely slow (> 10 seconds) for all users
Feature unavailable for specific user segments
Performance degradation affecting user experience significantly
Backup monitoring showing failures

Response requirements:

Notify on-call engineer via Slack + push notification
Acknowledge within 30 minutes
Resolution within 4 hours
Status page update if externally visible

P3 — Medium

Definition: Minor functionality impaired or non-critical service degraded. Workaround exists. Limited user impact.

Examples:

Non-critical feature unavailable
Performance degraded from normal but still within SLA thresholds
Elevated error rates in background jobs
Monitoring alert that doesn't affect users directly

Response requirements:

Slack notification to on-call and team channel
Investigate during business hours
Resolution within 24 hours
No immediate page required

P4 — Low

Definition: Minor issues, cosmetic problems, or informational items requiring no immediate action.

Examples:

Planned maintenance approaching
Non-critical performance trend worth investigating
Documentation or minor UI issues
Informational threshold exceeded

Response requirements:

Create ticket in backlog
Address in next sprint or maintenance window
No direct notification to on-call
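
The response requirements for the four levels can also be kept as data, so that tooling and people read the same numbers. The following is a minimal Python sketch, not a prescribed implementation: the `ResponsePolicy` class, its field names, and the `ack_overdue` helper are hypothetical, and the values are copied from the requirements listed above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResponsePolicy:
    notify: str                          # how the on-call hears about it
    pages_immediately: bool              # True only where the text says "page"
    ack_within_minutes: Optional[int]    # None = no acknowledgement deadline stated
    resolve_within_hours: Optional[int]  # None = no resolution target stated


# Values copied from the P1-P4 response requirements.
POLICIES = {
    "P1": ResponsePolicy("page primary on-call", True, 5, None),
    "P2": ResponsePolicy("Slack + push notification", False, 30, 4),
    "P3": ResponsePolicy("Slack to on-call and team channel", False, None, 24),
    "P4": ResponsePolicy("backlog ticket only", False, None, None),
}


def ack_overdue(severity: str, minutes_since_alert: int) -> bool:
    """True once an unacknowledged alert has reached its acknowledgement deadline."""
    deadline = POLICIES[severity].ack_within_minutes
    return deadline is not None and minutes_since_alert >= deadline
```

With this table, `ack_overdue("P1", 5)` is true while `ack_overdue("P3", 600)` is false, because P3 has no acknowledgement deadline.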

Severity Decision Matrix

When an incident occurs, use this matrix to determine severity:

| User Impact | Revenue Impact | Data Risk | Severity |
|---|---|---|---|
| All users affected | High | Any | P1 |
| Most users affected | High | None | P1 |
| Most users affected | Low | None | P2 |
| Some users affected | High | None | P1 |
| Some users affected | Low | None | P2 |
| Few users affected | Any | None | P3 |
| No user impact | None | None | P3/P4 |

When in doubt, escalate severity rather than downgrade. You can always downgrade after investigation; you can't undo the delay from under-reacting to a P1.
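
The matrix and the "when in doubt, escalate" rule can be sketched as a small lookup. This is an illustration under stated assumptions, not the only reasonable encoding: the function name `classify` and the impact labels (`"all"`, `"most"`, `"some"`, `"few"`, `"none"`) are hypothetical, and any combination the matrix does not list, including data risk outside the all-users row, is treated as P1.

```python
# (user_impact, revenue_impact) -> severity, for rows where data risk is "None".
MATRIX = {
    ("most", "high"): "P1",
    ("most", "low"): "P2",
    ("some", "high"): "P1",
    ("some", "low"): "P2",
}


def classify(user_impact: str, revenue_impact: str, data_at_risk: bool = False) -> str:
    """Return "P1", "P2" or "P3", escalating when the matrix is silent."""
    if user_impact == "all" and revenue_impact == "high":
        return "P1"  # the matrix lists data risk as "Any" for this row
    if data_at_risk:
        return "P1"  # assumption: no other row covers data risk, so escalate
    if user_impact in ("few", "none"):
        return "P3"  # "none" is P3/P4 in the matrix; start at the higher one
    # Combinations the matrix omits (e.g. all users, low revenue) escalate to P1.
    return MATRIX.get((user_impact, revenue_impact), "P1")
```

Defaulting unknown combinations to P1 is deliberate: a human can downgrade after a first look, which matches the guidance above.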

Alert Configuration by Severity

Map your monitoring alerts directly to severity levels:

# Monitoring alert configuration with severity mapping
alerts:
  # P1 - Immediate page
  - name: "Payment Service Completely Down"
    condition: "payment_availability < 90% for 3 consecutive checks"
    severity: P1
    channels:
      - pagerduty_critical
      - slack_incidents
      - email_oncall
    escalation_after_minutes: 5
    
  # P1 - Immediate page
  - name: "Login Endpoint Down"  
    condition: "login_success_rate < 95% for 2 consecutive checks"
    severity: P1
    channels:
      - pagerduty_critical
      
  # P2 - Urgent notification
  - name: "API Latency High"
    condition: "api_p95_latency > 2000ms for 10 minutes"
    severity: P2
    channels:
      - pagerduty_high
      - slack_engineering
      
  # P2 - Urgent notification
  - name: "Error Rate Elevated"
    condition: "error_rate > 2% for 5 minutes"
    severity: P2
    channels:
      - slack_engineering
      
  # P3 - Business hours notification
  - name: "Cache Hit Rate Dropping"
    condition: "cache_hit_rate < 80% for 30 minutes"
    severity: P3
    channels:
      - slack_engineering
    only_during_business_hours: true
    
  # P4 - Ticket only
  - name: "SSL Expiry in 30 days"
    condition: "ssl_days_until_expiry < 30"
    severity: P4
    channels:
      - create_jira_ticket
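
A mapping like this drifts over time, so it can help to check it automatically. The sketch below is one possible check written in Python over plain dictionaries that mirror the configuration format; the `lint_alert` function is hypothetical, as is the rule that any channel whose name starts with `pagerduty` counts as a page. It flags P1 alerts that cannot page anyone and P3/P4 alerts that would.

```python
from typing import Dict, List


def lint_alert(alert: Dict) -> List[str]:
    """Return human-readable problems for one alert definition."""
    problems = []
    # Assumption: channel names starting with "pagerduty" are paging channels.
    pages = any(ch.startswith("pagerduty") for ch in alert.get("channels", []))
    severity = alert["severity"]
    name = alert["name"]
    if severity == "P1" and not pages:
        problems.append(f"{name}: P1 alert has no paging channel")
    if severity in ("P3", "P4") and pages:
        problems.append(f"{name}: {severity} alert should not page on-call")
    return problems


alerts = [
    {"name": "Login Endpoint Down", "severity": "P1", "channels": ["pagerduty_critical"]},
    {"name": "Cache Hit Rate Dropping", "severity": "P3", "channels": ["slack_engineering"]},
    # Hypothetical, deliberately misconfigured entry to show a finding.
    {"name": "Disk Usage Trend", "severity": "P4", "channels": ["pagerduty_high"]},
]

for alert in alerts:
    for problem in lint_alert(alert):
        print(problem)  # prints: Disk Usage Trend: P4 alert should not page on-call
```

P2 is left unchecked on purpose, since the configuration above uses both paging and Slack-only channels at that level.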

Severity in Practice: Common Mistakes

Overcalling P1 — Teams that cry wolf with P1 designations train their engineers to treat pages as low-urgency. Reserve P1 for genuine emergencies.

Undercalling to avoid "disruption" — Engineers sometimes downgrade severity to avoid paging someone at night. This is the opposite of helpful — it delays response to real problems.

Not escalating severity during incidents — An incident that starts as P2 (partial degradation) and spreads to all users should be upgraded to P1. Severity should reflect current impact, not initial assessment.

No severity for third-party failures — When Stripe is down and payments are failing, that's still a P1 for your users — even though you can't fix it. Severity is about user impact, not root cause.

Escalation Policies by Severity

Define escalation automatically — don't rely on human judgment during high-stress moments:

P1 Escalation:
  0 min: Page primary on-call (Slack + SMS + phone)
  5 min (if unacknowledged): Page secondary on-call
  15 min: Page engineering manager
  30 min: Page VP Engineering
  60 min: Page C-suite

P2 Escalation:
  0 min: Slack notification to on-call
  30 min (if unacknowledged): Page on-call
  2 hr: Notify engineering manager

P3:
  0 min: Slack notification to team channel
  Next business day: Review in standup
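
To make the timelines above concrete, here is a small Python sketch that answers "who should have been notified by now?" for an open incident. The `Step` type and `due_targets` function are hypothetical names. Only the steps marked "if unacknowledged" in the timelines are skipped after an acknowledgement; treating the remaining steps as purely time-based is an assumption about how to read them.

```python
from typing import List, NamedTuple, Optional


class Step(NamedTuple):
    at_minute: int
    target: str
    only_if_unacknowledged: bool


P1_STEPS = [
    Step(0, "primary on-call", False),
    Step(5, "secondary on-call", True),
    Step(15, "engineering manager", False),
    Step(30, "VP Engineering", False),
    Step(60, "C-suite", False),
]

P2_STEPS = [
    Step(0, "on-call (Slack)", False),
    Step(30, "on-call (page)", True),
    Step(120, "engineering manager", False),
]


def due_targets(steps: List[Step], minutes_elapsed: int,
                acknowledged_at: Optional[int] = None) -> List[str]:
    """Targets that should have been notified for a still-open incident."""
    due = []
    for step in steps:
        if step.at_minute > minutes_elapsed:
            continue  # this step has not been reached yet
        acked_in_time = acknowledged_at is not None and acknowledged_at <= step.at_minute
        if step.only_if_unacknowledged and acked_in_time:
            continue  # someone already acknowledged, so skip the fallback page
        due.append(step.target)
    return due
```

For example, 20 minutes into a P1 that was acknowledged at minute 3, `due_targets(P1_STEPS, 20, acknowledged_at=3)` returns the primary on-call and the engineering manager but not the secondary on-call.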

Configure this in your alerting tool:

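# PagerDuty-style escalation policy (illustrative YAML, not a literal API payload)
# escalation_delay_in_minutes is how long an unacknowledged incident waits on a
# rule before moving to the next one, so the delays are per step, not cumulative.
escalation_policy:
  name: "P1 - Critical Incidents"
  rules:
    # Paged at 0 min; escalate after 5 min without acknowledgement
    - escalation_delay_in_minutes: 5
      targets:
        - type: schedule
          id: primary-oncall-schedule
    # Paged at 5 min; escalate 10 min later (15 min total)
    - escalation_delay_in_minutes: 10
      targets:
        - type: schedule
          id: secondary-oncall-schedule
    # Paged at 15 min; escalate 15 min later (30 min total)
    - escalation_delay_in_minutes: 15
      targets:
        - type: user
          id: engineering-manager
    # Paged at 30 min; the 60-minute C-suite step is not configured here
    - escalation_delay_in_minutes: 30
      targets:
        - type: user
          id: vp-engineering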

Severity Tracking and Retrospectives

Track severity distribution over time to identify patterns:

-- Monthly incident severity distribution
SELECT
    DATE_TRUNC('month', created_at) as month,
    severity,
    COUNT(*) as incident_count,
    AVG(EXTRACT(EPOCH FROM (resolved_at - created_at))/60) as avg_mttr_minutes
FROM incidents
WHERE created_at > NOW() - INTERVAL '6 months'
GROUP BY DATE_TRUNC('month', created_at), severity
ORDER BY month DESC, severity;

Watch for these patterns:

| Pattern | Meaning | Action |
|---|---|---|
| P1 count increasing | More frequent critical failures | Architecture review |
| High P3/P4 ratio | Good severity calibration | Keep up current practices |
| P2 count > P1 | Good: catching issues before full outage | Maintain early detection |
| Everything classified P1 | Overcalling severity | Revisit severity criteria |
| Lots of P3 > 24hr old | Teams not acting on medium incidents | Review P3 response SLA |
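
Two of these patterns are easy to check mechanically once the severities for a period have been exported. The Python sketch below is only a starting point: the function names are hypothetical, and the 50% cutoff used to approximate "everything classified P1" is an assumed value to tune against your own history.

```python
from collections import Counter
from typing import Dict, Iterable, List


def severity_shares(severities: Iterable[str]) -> Dict[str, float]:
    """Fraction of incidents at each level; empty dict when there are none."""
    counts = Counter(severities)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {sev: counts[sev] / total for sev in ("P1", "P2", "P3", "P4")}


def review_flags(severities: List[str], p1_share_limit: float = 0.5) -> List[str]:
    """Turn a period's severities into talking points for a retrospective."""
    shares = severity_shares(severities)
    flags = []
    # Assumed threshold: more than half of all incidents being P1 suggests overcalling.
    if shares and shares["P1"] > p1_share_limit:
        flags.append("P1 share is high: revisit severity criteria")
    if shares and shares["P2"] > shares["P1"]:
        flags.append("more P2 than P1: early detection is working")
    return flags
```

A month with six P1s, two P2s and two P3s would produce the first flag; the remaining patterns in the table need trend or age data and are left out of this sketch.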

Communicating Severity to Stakeholders

Different stakeholders need different information based on severity:

# Stakeholder Communication Matrix

## P1 Incidents

**Engineering team:** Full technical details via Slack + incident channel
**Customer Success:** Impact statement within 15 minutes:
  "We are experiencing an issue affecting [X% of users / all users].
   Engineering is actively working on resolution. Next update in 30 minutes."
**Status Page:** Update within 5 minutes with incident details
**Executive:** Brief (3-sentence) summary within 30 minutes

## P2 Incidents
**Engineering team:** Slack notification
**Customer Success:** Heads-up if customer-visible
**Status Page:** Update if externally visible impact
**Executive:** No notification unless escalates to P1

## P3/P4 Incidents
**Engineering team:** Slack or Jira ticket
**No external communication required**
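
The P1 impact statement is a fill-in-the-blank template, so it can be generated rather than typed under pressure. A minimal Python sketch follows; the `impact_statement` function and the `AUDIENCES` table are hypothetical names, and the wording and audiences are taken from the matrix above.

```python
def impact_statement(percent_affected: float, next_update_minutes: int = 30) -> str:
    """Fill in the Customer Success template for a P1 incident."""
    if percent_affected >= 100:
        scope = "all users"
    else:
        scope = f"{percent_affected:g}% of users"
    return (
        f"We are experiencing an issue affecting {scope}. "
        "Engineering is actively working on resolution. "
        f"Next update in {next_update_minutes} minutes."
    )


# Who hears about an incident at each level, per the matrix above.
AUDIENCES = {
    "P1": ["engineering", "customer success", "status page", "executives"],
    "P2": ["engineering", "customer success (if customer-visible)",
           "status page (if externally visible)"],
    "P3": ["engineering"],
    "P4": ["engineering"],
}

print(impact_statement(40))
```

Keeping the wording in one place means every P1 update uses the same phrasing, whoever sends it.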

Conclusion

A clear severity framework removes ambiguity from incident response. P1 always means the same thing: immediate attention, all hands if needed, customer communication. P3 means business hours, no pages. When your monitoring alerts include severity levels and your team agrees on what each level means, incidents become manageable — even at 3 AM. Combine clear severity definitions with AzMonitor's multi-channel alerting and you have a system where the right people get notified with the right urgency, every time.

AzMonitor Team

The AzMonitor team writes guides based on experience monitoring millions of endpoints daily across 10,000+ customer environments. Our expertise covers uptime monitoring, SRE practices, and reliability engineering.

Incident Severity Levels: How to Define and Use P1-P4 Classifications
