Site Reliability Engineering (SRE) emerged from Google in the early 2000s as an answer to a specific problem: how do you manage the operations of global-scale internet services that are too complex for traditional operations approaches? The answer Google developed — and eventually published in their SRE book — has become the template for how modern internet companies think about reliability, operations, and the relationship between engineering and ops.
What SRE Actually Is
The term "site reliability engineering" suggests a job title, but it's better understood as a discipline and philosophy. Ben Treynor Sloss, who founded Google's SRE team, defined it as "what happens when a software engineer is tasked with what used to be called operations."
The key insight is that operations at scale (running services reliably for millions of users) is an engineering problem that requires engineering solutions, not just processes and heroics. An SRE writes code to fix operational problems, automates repetitive tasks, and measures reliability with the same rigor that software engineers apply to code quality.
SRE vs Traditional Ops
| Aspect | Traditional Ops | SRE |
|---|---|---|
| Primary tool | Runbooks, manual processes | Automation, code |
| Reliability approach | Best effort, firefighting | Engineered, measured |
| Relationship with dev | Separate, often adversarial | Collaborative, shared goals |
| Success metric | Low incident count | Error budgets, SLOs |
| Change posture | Conservative, avoid risk | Calibrated risk with error budgets |
| On-call burden | Unlimited (until burnout) | Capped at 50% operational work |
SRE vs DevOps
SRE is often compared to DevOps. They share values (collaboration, automation, shared responsibility) but have different specifics:
- DevOps is a cultural philosophy about breaking down silos between development and operations
- SRE is a specific implementation of those principles with defined practices: SLOs, error budgets, toil reduction, and specific on-call management approaches
The two are complementary rather than competing: SRE practices fit naturally inside a DevOps culture, and teams already using DevOps tooling can layer SLOs and error budgets on top.
The Core SRE Concepts
Reliability as a Feature
In traditional product development, reliability is an afterthought — a constraint that slows down feature development. SRE reframes reliability as a feature that must be explicitly designed, measured, and prioritized alongside user-facing features.
Users don't use a product because it's reliable — they use it because it solves a problem. But they stop using it if it's unreliable. Reliability enables the use of all other features.
Service Level Objectives (SLOs)
SLOs are the central organizing concept of SRE. They define what "reliable enough" means for a service:
# SLO definitions for a payment service
slos:
  - name: "Payment API Availability"
    sli: "successful_payment_requests / total_payment_requests"
    target: 99.95%
    window: 30d
  - name: "Payment Processing Latency"
    sli: "p99_payment_processing_latency_ms"
    target: 2000ms
    window: 7d
  - name: "Transaction Correctness"
    sli: "correct_transaction_outcomes / total_transactions"
    target: 99.999%
    window: 30d
SLOs answer: "How reliable does this service need to be?" The answer isn't "as reliable as possible" — it's "reliable enough to satisfy users, as defined by measurable criteria."
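To make the relationship between an SLI and its SLO target concrete, here is a minimal Python sketch that computes an availability SLI from request counts and compares it with a target. The `AvailabilitySLO` class, the function names, and the request counts are illustrative assumptions, not part of any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class AvailabilitySLO:
    """Hypothetical container for an availability objective."""
    name: str
    target: float  # required fraction of good requests, e.g. 0.9995


def availability_sli(good_requests: int, total_requests: int) -> float:
    """Return the fraction of requests that succeeded."""
    if total_requests == 0:
        # No traffic means nothing failed; treat the SLI as fully met.
        return 1.0
    return good_requests / total_requests


def meets_slo(slo: AvailabilitySLO, good_requests: int, total_requests: int) -> bool:
    """Compare the measured SLI against the SLO target."""
    return availability_sli(good_requests, total_requests) >= slo.target


payment_slo = AvailabilitySLO(name="Payment API Availability", target=0.9995)

# Made-up request counts for one measurement window
print(meets_slo(payment_slo, good_requests=999_600, total_requests=1_000_000))  # True
print(meets_slo(payment_slo, good_requests=999_400, total_requests=1_000_000))  # False
```

In practice the counts would come from your metrics store, and the window would match the one declared in the SLO.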
Error Budgets
Error budgets flow directly from SLOs:
For a 99.9% SLO over 30 days:
Error Budget = 0.1% of 30 days = 43.2 minutes
This is the "allowed unreliability" that can be spent on:
- Normal incident occurrence
- Risky deployments
- Infrastructure maintenance
The error budget is a resource that teams can spend strategically. When the budget is full, teams can deploy aggressively. When it's exhausted, reliability work takes priority over feature development.
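The arithmetic above is easy to script. The sketch below reproduces the 43.2-minute figure and shows one way a team might turn remaining budget into a release decision. The function names and the freeze rule are assumptions for illustration; your own error budget policy defines the real thresholds.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Total allowed downtime in minutes for an availability SLO."""
    window_minutes = window_days * 24 * 60
    return (1 - slo_target) * window_minutes


def budget_remaining(slo_target: float, window_days: int, downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    if budget <= 0:
        raise ValueError("an SLO target of 100% leaves no error budget")
    return (budget - downtime_minutes) / budget


def release_policy(remaining_fraction: float) -> str:
    """Hypothetical policy: freeze feature releases once the budget is spent."""
    if remaining_fraction <= 0:
        return "freeze: prioritize reliability work"
    return "ship: budget available"


print(round(error_budget_minutes(0.999, 30), 1))  # 43.2
remaining = budget_remaining(0.999, 30, downtime_minutes=21.6)
print(release_policy(remaining))  # ship: budget available
```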
Toil
SRE introduced "toil" as a specific concept: manual, repetitive operational work that scales with service growth but doesn't leave the service fundamentally better.
Toil characteristics:
- Manual: requires human action rather than automation
- Repetitive: performed over and over in the same way
- Scales with the service: volume grows in proportion to service size
- No lasting value: completing the task leaves the service no better than before
- Automatable: could be replaced by automation
Examples of toil:
- Manually restarting services when they crash
- Executing deployment runbooks step by step
- Reviewing logs to find error patterns
- Manually rotating SSL certificates
SRE teams have a principle: toil should consume no more than 50% of an SRE's time. The other 50% is engineering work that reduces toil and improves reliability long-term.
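One way to check the 50% cap is to tag logged work as toil or engineering and compute the ratio. The sketch below assumes a hypothetical `WorkEntry` record and made-up hours; how you actually collect this data (tickets, time tracking, surveys) is up to your team.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WorkEntry:
    """Hypothetical record of time spent on one kind of work."""
    description: str
    hours: float
    is_toil: bool


def toil_fraction(entries: List[WorkEntry]) -> float:
    """Share of logged hours that were spent on toil."""
    total = sum(e.hours for e in entries)
    if total == 0:
        return 0.0
    return sum(e.hours for e in entries if e.is_toil) / total


def over_toil_cap(entries: List[WorkEntry], cap: float = 0.5) -> bool:
    """True when toil exceeds the cap (50% by default)."""
    return toil_fraction(entries) > cap


week = [
    WorkEntry("manual service restarts", 6, is_toil=True),
    WorkEntry("step-by-step deployment runbooks", 10, is_toil=True),
    WorkEntry("manual certificate rotation", 3, is_toil=True),
    WorkEntry("building deployment automation", 21, is_toil=False),
]
print(toil_fraction(week))   # 0.475
print(over_toil_cap(week))   # False
```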
# Automating toil: SSL certificate rotation
# Instead of manually: check expiry → run certbot → restart nginx → verify
# Automate: continuous monitoring + automated renewal + automated reload
import time


class CertificateManager:
    # get_monitored_domains, check_expiry, renew_certificate, reload_web_server,
    # log_renewal, and alert are left for your environment to implement.

    def __init__(self):
        self.check_interval = 86400  # Seconds between checks (daily)

    def start_automated_renewal(self):
        """Run automated certificate renewal, eliminating manual rotation toil."""
        while True:
            for domain in self.get_monitored_domains():
                days_until_expiry = self.check_expiry(domain)
                if days_until_expiry < 30:
                    # Auto-renew via Let's Encrypt
                    success = self.renew_certificate(domain)
                    if success:
                        # Reload the web server so it picks up the new certificate
                        self.reload_web_server()
                        self.log_renewal(domain, "success")
                    else:
                        # Alert humans only when automation fails
                        self.alert(f"Certificate renewal failed for {domain}")
            time.sleep(self.check_interval)
SRE Team Structures
Embedded SRE
SRE engineers work within product teams, providing reliability expertise while learning the service deeply:
Product Team A:
- 6 software engineers
- 1 SRE (embedded)
- SRE focuses on: SLOs, on-call, monitoring, release process
Product Team B:
- 8 software engineers
- 1 SRE (embedded)
- SRE focuses on: capacity planning, incident management, automation
Central SRE Team
A central SRE team supports multiple product teams, providing shared infrastructure and consulting:
Central SRE Team (5 engineers):
- Owns: monitoring infrastructure, deployment pipeline, on-call tooling
- Supports: All product teams with SLO definition and incident management
- Consults: Architecture reviews, production readiness reviews
Product Teams A, B, C:
- Own their services
- Consult SRE team for reliability guidance
- Follow SRE team standards
Production Readiness Reviews
SRE teams often gate production access through production readiness reviews (PRRs). A service must meet minimum reliability standards before the SRE team takes on support for it:
# Production Readiness Checklist
## SLOs
- [ ] Availability SLO defined and measured
- [ ] Latency SLO defined and measured
- [ ] Error budget policy documented
## Monitoring
- [ ] Health endpoint implemented and monitored
- [ ] Dashboards created for golden signals (latency, traffic, errors, saturation)
- [ ] Alerts configured for SLO breaches
## Incident Response
- [ ] Runbook exists for top 5 failure modes
- [ ] On-call rotation set up with team
- [ ] Escalation path defined
## Capacity
- [ ] Load tested to 2x expected peak traffic
- [ ] Auto-scaling configured and tested
- [ ] Resource limits set appropriately
## Security
- [ ] Authentication implemented on all endpoints
- [ ] Secrets managed via secrets manager (not hardcoded)
- [ ] Rate limiting implemented
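A checklist like this can also be tracked in code so that a review produces a clear list of gaps. The following is a minimal sketch; the `ReadinessReview` class and the check names are hypothetical and only mirror a few items from the checklist.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ReadinessReview:
    """Hypothetical record of a production readiness review."""
    service: str
    checks: Dict[str, bool] = field(default_factory=dict)

    def missing(self) -> List[str]:
        """Names of checklist items that are not yet satisfied."""
        return [name for name, done in self.checks.items() if not done]

    def is_ready(self) -> bool:
        """A service is ready only when every item is checked off."""
        return not self.missing()


review = ReadinessReview(
    service="payment-api",
    checks={
        "availability SLO defined and measured": True,
        "error budget policy documented": True,
        "runbook exists for top 5 failure modes": False,
        "load tested to 2x expected peak traffic": False,
    },
)
print(review.is_ready())  # False
print(review.missing())
```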
The SRE Engagement Model
A key innovation in SRE is making SRE team involvement conditional on service reliability. The original Google model:
SRE supports if:
- Service has SLO defined
- Service has adequate monitoring
- Service has an on-call rotation that includes the service team
- Error budget policy is agreed upon
SRE disengages (gives back pager) if:
- Service exceeds error budget consistently
- SRE ops work exceeds 50% of capacity
- Fundamental reliability improvements aren't being made
This creates an economic incentive: product teams that build unreliable services are responsible for their own on-call burden. This aligns incentives better than the traditional model, where the ops team handles all incidents regardless of code quality.
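The engagement rules can be expressed as a small decision function. This is only a sketch: the `ServiceStatus` fields are hypothetical, and reading "consistently" as two consecutive over-budget windows is an assumption you would replace with whatever your own policy says.

```python
from dataclasses import dataclass


@dataclass
class ServiceStatus:
    """Hypothetical snapshot of a service's standing with the SRE team."""
    has_slo: bool
    has_monitoring: bool
    team_shares_on_call: bool
    has_error_budget_policy: bool
    windows_over_budget: int   # consecutive SLO windows that exceeded the budget
    sre_ops_fraction: float    # share of SRE time spent on operational work


def engagement_decision(
    status: ServiceStatus,
    max_windows_over_budget: int = 2,  # assumed meaning of "consistently"
    ops_cap: float = 0.5,
) -> str:
    """Return 'not eligible', 'hand back pager', or 'sre supports'."""
    prerequisites_met = all([
        status.has_slo,
        status.has_monitoring,
        status.team_shares_on_call,
        status.has_error_budget_policy,
    ])
    if not prerequisites_met:
        return "not eligible"
    if (status.windows_over_budget >= max_windows_over_budget
            or status.sre_ops_fraction > ops_cap):
        return "hand back pager"
    return "sre supports"


healthy = ServiceStatus(True, True, True, True, windows_over_budget=0, sre_ops_fraction=0.3)
noisy = ServiceStatus(True, True, True, True, windows_over_budget=3, sre_ops_fraction=0.7)
print(engagement_decision(healthy))  # sre supports
print(engagement_decision(noisy))    # hand back pager
```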
SRE Metrics: DORA and Beyond
The DORA (DevOps Research and Assessment) metrics measure software delivery and operational performance:
| Metric | Elite Performers | High | Medium | Low |
|---|---|---|---|---|
| Deployment frequency | Multiple times/day | 1/day-1/week | 1/week-1/month | < monthly |
| Lead time for changes | < 1 hour | 1 day - 1 week | 1 week - 1 month | > 1 month |
| Time to restore service | < 1 hour | < 1 day | 1 day - 1 week | > 1 week |
| Change failure rate | 0-15% | 16-30% | 16-30% | 16-30% |
# Calculate DORA metrics from your data
from statistics import median


def calculate_dora_metrics(deployments, incidents, window_days=30):
    """Calculate DORA metrics for a period of window_days days."""
    # Deployment frequency
    deployments_per_day = len(deployments) / window_days

    # Lead time for changes: hours from commit to deploy
    lead_times = [
        (d.deployed_at - d.committed_at).total_seconds() / 3600
        for d in deployments
    ]
    # statistics.median averages the two middle values for even-length lists
    median_lead_time_hours = median(lead_times) if lead_times else 0

    # Mean time to restore (MTTR), resolved incidents only
    mttr_values = [
        (i.resolved_at - i.detected_at).total_seconds() / 3600
        for i in incidents if i.resolved_at
    ]
    mean_mttr_hours = sum(mttr_values) / len(mttr_values) if mttr_values else 0

    # Change failure rate: share of deployments linked to at least one incident
    total_deployments = len(deployments)
    failed_deployments = sum(
        1 for d in deployments
        if any(i.deployment_id == d.id for i in incidents)
    )
    change_failure_rate = failed_deployments / total_deployments if total_deployments else 0

    return {
        "deployment_frequency_per_day": round(deployments_per_day, 2),
        "median_lead_time_hours": round(median_lead_time_hours, 1),
        "mean_mttr_hours": round(mean_mttr_hours, 1),
        "change_failure_rate_pct": round(change_failure_rate * 100, 1)
    }
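Once you have a restore time in hours, such as the `mean_mttr_hours` value above, you can map it onto the tiers from the table. The sketch below does this for the "Time to restore service" row only; the function name is illustrative, and the thresholds are copied from the table rather than from any official DORA tooling.

```python
def restore_time_tier(hours: float) -> str:
    """Map a time-to-restore value (in hours) to a performance tier."""
    if hours < 0:
        raise ValueError("restore time cannot be negative")
    if hours < 1:
        return "Elite"   # under 1 hour
    if hours < 24:
        return "High"    # under 1 day
    if hours <= 24 * 7:
        return "Medium"  # 1 day to 1 week
    return "Low"         # more than 1 week


print(restore_time_tier(0.5))   # Elite
print(restore_time_tier(6))     # High
print(restore_time_tier(72))    # Medium
print(restore_time_tier(200))   # Low
```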
Starting with SRE Practices
You don't need a dedicated SRE team to adopt SRE practices. Start with:
- Week 1-2: Define SLOs for your 2-3 most critical services
- Week 3-4: Implement error budgets and create a policy
- Month 2: Audit toil and automate your top 3 most frequent manual tasks
- Month 3: Establish a postmortem process for all P1/P2 incidents
- Month 4: Create production readiness criteria for new services
The tools and metrics that SRE uses — monitoring, alerting, incident tracking — are the same ones that AzMonitor provides. The SRE practices give these tools purpose: SLOs define what the monitoring should be measuring, error budgets define when the alerts should fire, and incident practices define what to do when they do.
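One common way to let an error budget decide when alerts fire is burn-rate alerting: page when the budget is being consumed much faster than the SLO window allows. The sketch below is a simplified single-window version with illustrative function names. The default threshold of 14.4 is the figure used in Google's SRE Workbook examples, where it corresponds to spending 2% of a 30-day budget in one hour; treat it as a starting assumption rather than a rule.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than sustainable the budget is being spent.

    A burn rate of 1.0 means the budget lasts exactly the SLO window.
    """
    allowed_error_ratio = 1 - slo_target
    if allowed_error_ratio <= 0:
        raise ValueError("an SLO target of 100% leaves no error budget")
    return error_ratio / allowed_error_ratio


def should_page(error_ratio_last_hour: float, slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page when the one-hour burn rate reaches the threshold."""
    return burn_rate(error_ratio_last_hour, slo_target) >= threshold


# With a 99.9% SLO, a 2% error ratio burns the budget about 20x too fast.
print(should_page(0.02, slo_target=0.999))   # True
# A 0.5% error ratio is about 5x, below the paging threshold.
print(should_page(0.005, slo_target=0.999))  # False
```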
Conclusion
SRE is fundamentally about treating reliability as an engineering problem with engineering solutions — measuring it precisely with SLOs, managing it rationally with error budgets, reducing operational toil with automation, and learning systematically from incidents. Whether you have a dedicated SRE team or a small engineering team practicing SRE principles, the practices create the same outcome: services that are reliably excellent rather than unreliably heroic. AzMonitor provides the monitoring infrastructure that makes SRE measurable — the continuous uptime and performance data that feeds SLO calculations, error budget tracking, and incident detection.