On-call done right is a manageable part of engineering work. On-call done wrong destroys team morale, causes burnout, and drives away your best engineers — who have the most options. The difference is almost entirely in how the schedule is structured, what's expected, and how the organization treats the burden of being available at night.
The On-Call Burden Problem
Before designing your schedule, understand what makes on-call burdensome:
- Frequency: on a small team, each person is on call more often
- Alert volume: noisy alerts at 3 AM are the primary driver of burnout
- Lack of support: facing a production problem alone, with no backup
- No recovery time: returning to regular work straight after a sleepless night
- Uncompensated burden: being on call with no acknowledgment or compensation
A schedule that addresses these factors keeps on-call sustainable. One that ignores them will eventually lose engineers.
On-Call Rotation Patterns
Simple Weekly Rotation
The most common pattern for small teams:
Week 1: Alice (primary)
Week 2: Bob (primary)
Week 3: Carol (primary)
Week 4: Dave (primary)
Week 5: Alice (primary) [repeats]
Pros: Simple, predictable, easy to plan personal life around.
Cons: A full week is a long shift, so a major incident on Monday can leave you drained for the remaining six days. Every shift includes a weekend.
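The repeating pattern above reduces to modular arithmetic on the number of whole weeks since the rotation began. Here is a minimal sketch, assuming a rotation that starts on Monday 2025-01-06. The names `ROSTER`, `ROTATION_START`, and `primary_for` are illustrative and not part of any tool.

```python
from datetime import date

# Illustrative roster and start date; both are assumptions for this sketch.
ROSTER = ["Alice", "Bob", "Carol", "Dave"]
ROTATION_START = date(2025, 1, 6)  # a Monday; week 1 begins here


def primary_for(day: date) -> str:
    """Return the primary on-call for the week containing `day`."""
    if day < ROTATION_START:
        raise ValueError("day is before the rotation started")
    weeks_elapsed = (day - ROTATION_START).days // 7
    # The modulo wraps back to the first person after the last one.
    return ROSTER[weeks_elapsed % len(ROSTER)]


print(primary_for(date(2025, 1, 6)))  # week 1
print(primary_for(date(2025, 2, 3)))  # week 5, back to the first person
```

Keeping the roster order and start date in one place means a swap or a new hire is a one-line change.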
Split Weekly Rotation (Weekday/Weekend)
Distribute the weekend burden more explicitly:
Team of 4, rotating weekday and weekend separately:
Weekdays:
Mon-Fri: Each person covers 1 week in 4
Weekends:
Sat-Sun: Each person covers 1 weekend in 4 (separate from weekday)
Result: The weekend on-call is usually a different person from that week's weekday on-call.
This works well when weekends have different traffic patterns, or when only some team members are willing to cover weekends.
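One way to keep the two rotations separate is to run the same roster twice with an offset for weekends. The sketch below assumes a team of four and an offset of two weeks, so the weekend person is never that week's weekday person. The offset and all names are assumptions chosen for illustration.

```python
from datetime import date

ROSTER = ["Alice", "Bob", "Carol", "Dave"]  # hypothetical team of 4
ROTATION_START = date(2025, 1, 6)           # a Monday (assumed)
WEEKEND_OFFSET = 2                          # assumed: weekend duty sits two slots away


def on_call_for(day: date) -> str:
    """Return who is on call on `day` under a split weekday/weekend rotation."""
    weeks_elapsed = (day - ROTATION_START).days // 7
    is_weekend = day.weekday() >= 5  # Monday is 0, so 5 and 6 are Sat and Sun
    offset = WEEKEND_OFFSET if is_weekend else 0
    return ROSTER[(weeks_elapsed + offset) % len(ROSTER)]


print(on_call_for(date(2025, 1, 8)))   # a Wednesday in week 1
print(on_call_for(date(2025, 1, 11)))  # the Saturday of the same week
```

With this offset each person still covers one week in four and one weekend in four, but never back to back.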
Follow-the-Sun (Global Teams)
For teams distributed across time zones, hand off during business hours:
08:00-17:00 UTC: Europe team
17:00-02:00 UTC: US East team
02:00-08:00 UTC: Asia-Pacific team
Handoff happens at shift start — brief sync on active issues.
Pros: Nobody takes calls in the middle of their night. Cons: Requires sufficient coverage in each region. Handoffs add coordination overhead.
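Routing a page to the right region is a lookup on the current UTC hour. This sketch uses the three shift windows listed above. The function name `team_on_call` and the `SHIFTS` table are illustrative. Note that the US East window wraps past midnight UTC, which needs its own branch.

```python
from datetime import datetime, timezone

# (start_hour_utc, end_hour_utc, team); the end hour is exclusive.
SHIFTS = [
    (8, 17, "Europe"),
    (17, 2, "US East"),       # wraps past midnight UTC
    (2, 8, "Asia-Pacific"),
]


def team_on_call(moment: datetime) -> str:
    """Return the team covering `moment`, which must be timezone-aware."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    hour = moment.astimezone(timezone.utc).hour
    for start, end, team in SHIFTS:
        if start < end:
            covered = start <= hour < end
        else:  # window crosses midnight
            covered = hour >= start or hour < end
        if covered:
            return team
    raise ValueError(f"no team covers {hour:02d}:00 UTC")


print(team_on_call(datetime(2025, 6, 9, 1, 30, tzinfo=timezone.utc)))
```

Checking all 24 hours against the table is a cheap way to catch coverage gaps when shift times change.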
Primary/Secondary Pattern
Add a backup layer to avoid leaving one person alone:
Week of 2025-06-09:
Primary on-call: Alice
Secondary on-call: Bob
Flow:
1. Alert fires → Alice (primary) gets paged
2. If no ack within 10 minutes → Bob (secondary) gets paged
3. If no ack within 5 more minutes → Engineering manager gets paged
The secondary doesn't need to investigate everything — just needs to be available if the primary needs backup or gets overwhelmed.
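The escalation flow above can be written down as a small table of delays. The sketch below is not tied to any paging tool and only models who has been paged after a given number of unacknowledged minutes. The names `ESCALATION_STEPS` and `paged_so_far` are made up for the example.

```python
# (minutes after the alert fires, who gets paged), taken from the flow above.
ESCALATION_STEPS = [
    (0, "primary"),
    (10, "secondary"),
    (15, "engineering manager"),  # 10 minutes plus 5 more
]


def paged_so_far(minutes_unacknowledged: float) -> list[str]:
    """Return everyone paged for an alert that is still unacknowledged."""
    return [
        who
        for delay, who in ESCALATION_STEPS
        if minutes_unacknowledged >= delay
    ]


print(paged_so_far(4))
print(paged_so_far(12))
```

Writing the delays as data makes it easy to review them alongside the time-to-ack targets the team actually achieves.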
Configuring Schedules in PagerDuty
# PagerDuty API - create schedule programmatically
import pdpyras

session = pdpyras.APISession(api_key="YOUR_KEY")

schedule = {
    "type": "schedule",
    "name": "Engineering Primary On-Call",
    "time_zone": "America/New_York",
    "schedule_layers": [
        {
            "name": "Weekly Rotation",
            "users": [
                {"user": {"id": "ALICE_ID", "type": "user_reference"}},
                {"user": {"id": "BOB_ID", "type": "user_reference"}},
                {"user": {"id": "CAROL_ID", "type": "user_reference"}},
                {"user": {"id": "DAVE_ID", "type": "user_reference"}},
            ],
            "rotation_virtual_start": "2025-01-06T00:00:00-05:00",
            "rotation_turn_length_seconds": 604800,  # 1 week
            "start": "2025-01-06T00:00:00-05:00",
        }
    ],
}

# session.post returns a requests.Response; the created schedule is in its JSON body
created_schedule = session.post("schedules", json={"schedule": schedule})
Handoff Procedures
A good handoff prevents the incoming on-call engineer from starting blind:
# On-Call Handoff Template
## Handing off: [Outgoing Name]
## Taking over: [Incoming Name]
## Handoff time: [Date, Time, Timezone]
## Open Incidents
[List any active or recently resolved incidents]
- None at this time / [Incident details if applicable]
## Active Investigations
[Things that are being watched but aren't incidents yet]
- Elevated database latency on postgres-replica-2 - watching for 48 hours
- SSL cert for api.example.com expires in 21 days - ticket created
## Recent Changes
[Deployments or config changes from the past 48 hours to watch for]
- payment-service v2.4.1 deployed Thu 14:00 UTC - stable so far
- Nginx rate limit config updated Wed 09:00 UTC - monitor for 429s
## Known Issues
[Non-critical issues the incoming person should know about]
- Monitoring check for legacy endpoint times out occasionally - known, false positive
- Background job runs slow on Sunday mornings - team is investigating
## Notes
[Anything else the incoming person should know]
- Marketing campaign running this weekend - expect 2x normal traffic Sat 10:00-18:00 UTC
- On-call buddy this week: Eve (secondary)
## Useful Links
- Dashboard: [link]
- Runbooks: [link]
- Escalation: [link]
This takes 15 minutes to write but can save hours during the new on-call's week.
Minimum Team Size for On-Call
On-call with too small a team is unsustainable:
| Team Size | On-Call Frequency per Person | Sustainability |
|---|---|---|
| 2 people | Every other week | Unsustainable long-term |
| 3 people | Every third week | Stressful |
| 4 people | Every fourth week | Minimum viable |
| 5-6 people | Every 5-6 weeks | Good |
| 7+ people | Less than monthly | Comfortable |
If your team is too small for a sustainable rotation, you have two options: grow the team (setting on-call expectations during hiring) or reduce the operational burden (fix alerts, automate remediation, improve reliability).
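The frequencies in the table are simple division, and the same arithmetic shows what happens when a secondary is staffed from the same team. Here is a rough sketch, assuming 52 one-week shifts per year. The function name and the `roles_per_week` parameter are illustrative.

```python
def weeks_on_call_per_year(team_size: int, roles_per_week: int = 1) -> float:
    """Average weeks per year each person spends on call.

    roles_per_week is 1 for a primary-only rotation and 2 when a
    secondary is drawn from the same team.
    """
    if team_size < 1:
        raise ValueError("team_size must be at least 1")
    return 52 * roles_per_week / team_size


for size in (2, 3, 4, 6, 8):
    primary_only = weeks_on_call_per_year(size)
    with_secondary = weeks_on_call_per_year(size, roles_per_week=2)
    print(f"{size} people: {primary_only:.1f} weeks primary-only, "
          f"{with_secondary:.1f} weeks with a secondary")
```

A team of four on a primary/secondary rotation carries the same number of duty weeks per person as a team of two on a primary-only rotation, which is worth weighing before adding a backup layer.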
On-Call Compensation
Engineers who carry production responsibility outside business hours should be compensated. This isn't optional — it's both an ethical and practical necessity (teams that feel uncompensated have higher turnover).
Common compensation models:
| Model | Mechanics | Best For |
|---|---|---|
| Flat stipend | Fixed amount per week on-call | Simple, predictable |
| Per-incident pay | Pay per incident worked | Variable, aligns incentives |
| Time off | Extra PTO after on-call | Non-monetary preference |
| Points system | Points redeemed for time off | Flexible |
| Salary bump | On-call expectations in salary negotiation | Cleanest for planning |
Example policy:
# On-Call Compensation Policy
## Stipend
- $X per week on primary on-call rotation
- $Y per week on secondary on-call rotation
## Incident Pay
- Incidents worked outside 09:00-18:00 local time: 1 hour minimum pay
- Incidents lasting > 2 hours: actual hours worked
## Recovery Time
- After any incident requiring > 1 hour of work between 22:00-07:00:
Entitled to arrive 2 hours late the next business day (or equivalent time off)
## Incident-Free Week Bonus
- Weeks with zero incidents: recognition in team channel
- Quarter with zero P1 incidents: team dinner
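The recovery-time rule in the example policy is easy to apply inconsistently by hand, so it helps to compute it. The sketch below counts how many minutes of an incident fall between 22:00 and 07:00 and checks them against the one-hour threshold. It assumes naive datetimes in the responder's local time and minute-level precision. All names are hypothetical.

```python
from datetime import datetime, time, timedelta

NIGHT_START = time(22, 0)
NIGHT_END = time(7, 0)
THRESHOLD_MINUTES = 60  # "more than 1 hour of work" in the example policy


def night_minutes(start: datetime, end: datetime) -> int:
    """Count the minutes of [start, end) that fall between 22:00 and 07:00."""
    minutes = 0
    current = start
    while current < end:
        clock = current.time()
        if clock >= NIGHT_START or clock < NIGHT_END:
            minutes += 1
        current += timedelta(minutes=1)
    return minutes


def earns_recovery_time(start: datetime, end: datetime) -> bool:
    """True when the incident involved more than an hour of night work."""
    return night_minutes(start, end) > THRESHOLD_MINUTES


incident_start = datetime(2025, 6, 9, 21, 30)
incident_end = datetime(2025, 6, 9, 23, 15)
print(night_minutes(incident_start, incident_end))
print(earns_recovery_time(incident_start, incident_end))
```

The minute-by-minute loop is deliberately simple. Incidents last hours, not months, so clarity matters more than speed here.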
Reducing On-Call Burden
The best on-call improvement isn't scheduling — it's reducing the work itself:
Fix noisy alerts: An alert audit can often eliminate 60-80% of nighttime pages by removing false positives and non-actionable alerts.
Automate common remediations: If the same issue has needed the same fix three times, automate the fix.
Improve reliability: Each reliability improvement directly reduces on-call incidents.
Establish business hours for non-urgent work: Encourage teams to schedule risky deployments early in the business day and week, not on Friday afternoons, when problems spill into the weekend.
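An alert audit can start as a simple tally of how often each alert paged someone and how often the page led to real action. The sketch below assumes you can export pages as `(alert_name, was_actionable)` pairs. The thresholds and the sample alert names are invented for illustration.

```python
from collections import Counter


def audit_alerts(pages, min_pages=3, max_actionable_ratio=0.2):
    """Return (alert, page_count, actionable_ratio) for likely noisy alerts.

    pages: iterable of (alert_name, was_actionable) tuples.
    An alert is flagged when it paged at least `min_pages` times and
    at most `max_actionable_ratio` of those pages needed action.
    """
    total = Counter()
    actionable = Counter()
    for name, was_actionable in pages:
        total[name] += 1
        if was_actionable:
            actionable[name] += 1

    noisy = []
    for name, count in total.items():
        ratio = actionable[name] / count
        if count >= min_pages and ratio <= max_actionable_ratio:
            noisy.append((name, count, ratio))
    # Most frequent offenders first.
    return sorted(noisy, key=lambda row: row[1], reverse=True)


sample_pages = (
    [("disk_usage_warn", False)] * 5
    + [("api_5xx", True)] * 3
    + [("cert_expiry", False)]
)
print(audit_alerts(sample_pages))
```

Flagged alerts are candidates for tuning, demotion to a ticket, or deletion. The tally only points to where to look.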
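# Track on-call burden per engineer
from collections import defaultdict


def is_after_hours(detected_at):
    # Assumption: detected_at is a datetime in the responder's local time,
    # and "after hours" means outside 09:00-18:00 as in the example policy.
    return not (9 <= detected_at.hour < 18)


def is_sleep_hours(detected_at):
    # Sleep hours are 22:00-07:00 local time.
    return detected_at.hour >= 22 or detected_at.hour < 7


def calculate_oncall_burden(incidents, period_days=90):
    """
    Calculate on-call burden metrics per engineer.
    Helps identify if burden is evenly distributed.

    period_days is not applied here; pass only incidents from that window.
    """
    burden = defaultdict(lambda: {
        "weeks_oncall": 0,  # not derivable from incidents; fill in from schedule data
        "incidents_worked": 0,
        "after_hours_incidents": 0,
        "total_incident_minutes": 0,
        "sleep_disruptions": 0,  # incidents between 22:00-07:00
    })
    for incident in incidents:
        responder = incident.primary_responder
        burden[responder]["incidents_worked"] += 1
        burden[responder]["total_incident_minutes"] += incident.duration_minutes
        if is_after_hours(incident.detected_at):
            burden[responder]["after_hours_incidents"] += 1
        if is_sleep_hours(incident.detected_at):
            burden[responder]["sleep_disruptions"] += 1
    return dict(burden)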
On-Call Metrics to Track
Track on-call health over time to catch problems before they cause attrition:
| Metric | Target | Red Flag |
|---|---|---|
| Pages per on-call week | < 5 | > 20 |
| Sleep-hour incidents per person/quarter | < 3 | > 10 |
| After-hours incidents | < 10/week | > 30/week |
| On-call rotation size | 5+ people | < 4 people |
| Time to ack | < 5 min | > 15 min |
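These thresholds are easy to turn into an automated check. The sketch below covers the four metrics where a higher value is worse; rotation size is left out because it runs the other way. The metric keys and the middle "watch" label are assumptions, not part of the table.

```python
# metric -> (target, red_flag), copied from the table above.
THRESHOLDS = {
    "pages_per_oncall_week": (5, 20),
    "sleep_hour_incidents_per_person_quarter": (3, 10),
    "after_hours_incidents_per_week": (10, 30),
    "time_to_ack_minutes": (5, 15),
}


def classify(metric: str, value: float) -> str:
    """Return 'ok', 'watch', or 'red flag' for a measured value."""
    target, red_flag = THRESHOLDS[metric]
    if value < target:
        return "ok"
    if value > red_flag:
        return "red flag"
    return "watch"  # between the target and the red-flag line


measured = {"pages_per_oncall_week": 12, "time_to_ack_minutes": 3}
for metric, value in measured.items():
    print(metric, value, classify(metric, value))
```

Running a check like this after each rotation gives the quarterly survey some numbers to sit beside.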
Survey your on-call engineers quarterly: "How sustainable is on-call on a scale of 1-10?" Track the trend. When scores drop, investigate why.
Conclusion
Sustainable on-call is built on predictable schedules, clear handoffs, fair compensation, and continuous investment in reducing the operational burden. Teams that get this right retain their best engineers and maintain production reliability without sacrificing wellbeing. The technical foundation — good monitoring, clear alerts, and automated detection — is what AzMonitor provides. The human foundation is the scheduling, compensation, and culture practices described here. Both are necessary; neither alone is sufficient.