Site Reliability Engineering (SRE) emerged from Google in the early 2000s as an answer to a specific problem: how do you manage the operations of global-scale internet services that are too complex for traditional operations approaches? The answer Google developed — and eventually published in their SRE book — has become the template for how modern internet companies think about reliability, operations, and the relationship between engineering and ops.

What SRE Actually Is

The term "site reliability engineering" suggests a job title, but it's better understood as a discipline and philosophy. Ben Treynor Sloss, who founded Google's SRE team, defined it as "what happens when a software engineer is tasked with what used to be called operations."

The key insight is that operations at scale — running services reliably for millions of users — is an engineering problem that requires engineering solutions, not just processes and heroics. An SRE writes code to fix operational problems, automates repetitive tasks, and measures reliability with the same rigor that software engineers measure code quality.

SRE vs Traditional Ops

| Aspect | Traditional Ops | SRE |
|---|---|---|
| Primary tool | Runbooks, manual processes | Automation, code |
| Reliability approach | Best effort, firefighting | Engineered, measured |
| Relationship with dev | Separate, often adversarial | Collaborative, shared goals |
| Success metric | Low incident count | Error budgets, SLOs |
| Change posture | Conservative, avoid risk | Calibrated risk with error budgets |
| On-call burden | Unlimited (until burnout) | Capped at 50% operational work |

SRE vs DevOps

SRE is often compared to DevOps. They share values (collaboration, automation, shared responsibility) but have different specifics:

- DevOps is a cultural philosophy about breaking down silos between development and operations.
- SRE is a specific implementation of those principles with defined practices: SLOs, error budgets, toil reduction, and specific approaches to on-call management.

The two are compatible: you can adopt SRE practices inside a DevOps culture, and DevOps tooling and habits can support a team that is still building out its SRE practices.

The Core SRE Concepts

Reliability as a Feature

In traditional product development, reliability is an afterthought — a constraint that slows down feature development. SRE reframes reliability as a feature that must be explicitly designed, measured, and prioritized alongside user-facing features.

Users don't use a product because it's reliable — they use it because it solves a problem. But they stop using it if it's unreliable. Reliability enables the use of all other features.

Service Level Objectives (SLOs)

SLOs are the central organizing concept of SRE. They define what "reliable enough" means for a service:

# SLO definitions for a payment service
slos:
  - name: "Payment API Availability"
    sli: "successful_payment_requests / total_payment_requests"
    target: 99.95%
    window: 30d
    
  - name: "Payment Processing Latency"
    sli: "p99_payment_processing_latency_ms"
    target: 2000ms
    window: 7d
    
  - name: "Transaction Correctness"
    sli: "correct_transaction_outcomes / total_transactions"
    target: 99.999%
    window: 30d

SLOs answer: "How reliable does this service need to be?" The answer isn't "as reliable as possible" — it's "reliable enough to satisfy users, as defined by measurable criteria."
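As a minimal sketch of how a ratio-based SLI like the availability one above could be checked against its target, assuming the request counts are already available from your monitoring system. The `Slo` class and the sample counts are hypothetical and not tied to any particular tool:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A ratio-based SLO: good events / total events must stay at or above target."""
    name: str
    target: float  # e.g. 0.9995 for 99.95%

    def sli(self, good_events: int, total_events: int) -> float:
        # With no traffic there is nothing to violate, so treat the SLI as 1.0.
        if total_events == 0:
            return 1.0
        return good_events / total_events

    def is_met(self, good_events: int, total_events: int) -> bool:
        return self.sli(good_events, total_events) >= self.target


# Hypothetical 30-day request counts for the payment API.
availability = Slo(name="Payment API Availability", target=0.9995)
print(availability.is_met(good_events=999_600, total_events=1_000_000))  # 99.96% of requests succeeded
print(availability.is_met(good_events=999_400, total_events=1_000_000))  # 99.94% of requests succeeded
```

Treating an empty window as compliant is a design choice made for this sketch; some teams prefer to report "no data" instead.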

Error Budgets

Error budgets flow directly from SLOs:

For a 99.9% SLO over 30 days:
Error Budget = 0.1% of 30 days = 0.001 × 43,200 minutes = 43.2 minutes

This is the "allowed unreliability" that can be spent on:
- Normal incident occurrence
- Risky deployments
- Infrastructure maintenance

The error budget is a resource that teams can spend strategically. When the budget is full, teams can deploy aggressively. When it's exhausted, reliability work takes priority over feature development.
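The arithmetic above, and the idea of spending the budget, can be sketched in a few lines. This is an illustration under stated assumptions: the function names and the 25% "caution" threshold are hypothetical, and a real error budget policy would be agreed on by each team.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Total allowed downtime, in minutes, for an availability SLO over a window."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    window_minutes = window_days * 24 * 60
    return (1 - slo_target) * window_minutes


def budget_remaining_fraction(slo_target: float, window_days: int,
                              downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


def release_posture(remaining_fraction: float) -> str:
    # Hypothetical policy thresholds, chosen only for illustration.
    if remaining_fraction <= 0:
        return "freeze features, prioritize reliability work"
    if remaining_fraction < 0.25:
        return "deploy cautiously"
    return "deploy normally"


budget = error_budget_minutes(0.999, 30)
print(round(budget, 1))  # 43.2, matching the calculation above

# 35 minutes of downtime so far in the window leaves roughly 19% of the budget.
remaining = budget_remaining_fraction(0.999, 30, downtime_minutes=35)
print(release_posture(remaining))
```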

Toil

SRE introduced "toil" as a specific concept: manual, repetitive operational work that scales with service growth but doesn't leave the service fundamentally better.

Toil characteristics:

- Manual: requires human action rather than automation
- Repetitive: performed over and over in the same way
- Scales with the service: volume grows in proportion to service size
- No lasting value: completing the task leaves the service in the same state as before
- Automatable: could, at least in theory, be replaced by automation

Examples of toil:

- Manually restarting services when they crash
- Executing deployment runbooks step by step
- Reviewing logs to find error patterns
- Manually rotating SSL certificates

SRE teams have a principle: toil should consume no more than 50% of an SRE's time. The other 50% is engineering work that reduces toil and improves reliability long-term.
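One way to make the 50% ceiling checkable is to measure it from a time log. The sketch below assumes engineers tag logged hours as toil or engineering work; the tuple format, names, and sample data are hypothetical.

```python
from collections import defaultdict

TOIL_CAP = 0.5  # the 50% ceiling described above


def toil_fraction(entries):
    """entries: iterable of (engineer, hours, is_toil) tuples from a time log.

    Returns {engineer: fraction of logged hours spent on toil}.
    """
    toil = defaultdict(float)
    total = defaultdict(float)
    for engineer, hours, is_toil in entries:
        total[engineer] += hours
        if is_toil:
            toil[engineer] += hours
    return {name: toil[name] / total[name] for name in total if total[name] > 0}


# Hypothetical one-week log.
log = [
    ("ana", 12, True),   # manual restarts, ticket handling
    ("ana", 28, False),  # automation project
    ("raj", 26, True),
    ("raj", 14, False),
]
fractions = toil_fraction(log)
over_cap = [name for name, frac in fractions.items() if frac > TOIL_CAP]
print(fractions)
print("over the toil cap:", over_cap)
```

Anyone who lands in `over_cap` is a signal to redirect operational work or to prioritize automation, not a performance judgment on the engineer.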

# Automating toil: SSL certificate rotation
# Instead of manually: check expiry → run certbot → restart nginx → verify
# Automate: continuous monitoring + automated renewal + automated reload

import time


class CertificateManager:
    """Skeleton of an automated renewal loop.

    get_monitored_domains, check_expiry, renew_certificate, reload_web_server,
    log_renewal, and alert are hooks you implement for your own certificate
    tooling, web server, and alerting system.
    """

    RENEWAL_THRESHOLD_DAYS = 30

    def __init__(self, check_interval=86400):
        self.check_interval = check_interval  # Seconds between checks (daily)

    def start_automated_renewal(self):
        """Run automated certificate renewal - eliminates toil"""
        while True:
            for domain in self.get_monitored_domains():
                days_until_expiry = self.check_expiry(domain)

                if days_until_expiry < self.RENEWAL_THRESHOLD_DAYS:
                    # Auto-renew via Let's Encrypt
                    success = self.renew_certificate(domain)

                    if success:
                        # Auto-reload web server so it picks up the new certificate
                        self.reload_web_server()
                        self.log_renewal(domain, "success")
                    else:
                        # Alert humans only when automation fails
                        self.alert(f"Certificate renewal failed for {domain}")

            time.sleep(self.check_interval)
SRE Team Structures

Embedded SRE

SRE engineers work within product teams, providing reliability expertise while learning the service deeply:

Product Team A:
  - 6 software engineers
  - 1 SRE (embedded)
  - SRE focuses on: SLOs, on-call, monitoring, release process

Product Team B:
  - 8 software engineers
  - 1 SRE (embedded)
  - SRE focuses on: capacity planning, incident management, automation

Central SRE Team

A central SRE team supports multiple product teams, providing shared infrastructure and consulting:

Central SRE Team (5 engineers):
  - Owns: monitoring infrastructure, deployment pipeline, on-call tooling
  - Supports: All product teams with SLO definition and incident management
  - Consults: Architecture reviews, production readiness reviews

Product Teams A, B, C:
  - Own their services
  - Consult SRE team for reliability guidance
  - Follow SRE team standards

Production Readiness Reviews

SRE teams often gate production access through production readiness reviews (PRRs). A service must meet minimum reliability standards before the SRE team agrees to support it:

# Production Readiness Checklist

## SLOs
- [ ] Availability SLO defined and measured
- [ ] Latency SLO defined and measured
- [ ] Error budget policy documented

## Monitoring
- [ ] Health endpoint implemented and monitored
- [ ] Dashboards created for golden signals (latency, traffic, errors, saturation)
- [ ] Alerts configured for SLO breaches

## Incident Response
- [ ] Runbook exists for top 5 failure modes
- [ ] On-call rotation set up with team
- [ ] Escalation path defined

## Capacity
- [ ] Load tested to 2x expected peak traffic
- [ ] Auto-scaling configured and tested
- [ ] Resource limits set appropriately

## Security
- [ ] Authentication implemented on all endpoints
- [ ] Secrets managed via secrets manager (not hardcoded)
- [ ] Rate limiting implemented
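If the checklist is tracked as data rather than as a document, the review outcome can be computed. This is a rough sketch under that assumption; the dictionary layout, the `readiness_report` name, and the sample service are hypothetical.

```python
def readiness_report(checklist):
    """Return (ready, missing) for a checklist shaped {section: {item: done}}."""
    missing = [
        f"{section}: {item}"
        for section, items in checklist.items()
        for item, done in items.items()
        if not done
    ]
    return len(missing) == 0, missing


# Hypothetical state of a service under review (only two sections shown).
checkout_service = {
    "SLOs": {
        "Availability SLO defined and measured": True,
        "Error budget policy documented": False,
    },
    "Capacity": {
        "Load tested to 2x expected peak traffic": True,
        "Auto-scaling configured and tested": False,
    },
}

ready, missing = readiness_report(checkout_service)
print("ready for SRE support:", ready)
for item in missing:
    print("missing ->", item)
```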

The SRE Engagement Model

A key innovation in SRE is making the SRE team's involvement conditional on how reliable the service is. The original Google model:

SRE supports if:

- Service has SLO defined
- Service has adequate monitoring
- Service has an on-call rotation that includes the service team
- Error budget policy is agreed upon

SRE disengages (gives back pager) if:

- Service exceeds error budget consistently
- SRE ops work exceeds 50% of capacity
- Fundamental reliability improvements aren't being made

This creates an economic incentive: product teams that build unreliable services are responsible for their own on-call burden. This aligns incentives better than the traditional model, where the ops team handles all incidents regardless of code quality.
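The engagement rules above can be written down as a small decision function. This is a sketch only: the field names are hypothetical, "consistently" is interpreted here as two or more consecutive SLO windows with the budget exhausted (an assumption), and the "fundamental improvements aren't being made" criterion is left out because it needs human judgment.

```python
from dataclasses import dataclass


@dataclass
class ServiceStatus:
    has_slo: bool
    has_monitoring: bool
    dev_team_on_call: bool
    budget_policy_agreed: bool
    budget_overspent_windows: int  # consecutive SLO windows with the budget exhausted
    sre_ops_fraction: float        # share of SRE time spent on ops work for this service


def engagement_decision(s: ServiceStatus, max_overspent_windows: int = 2) -> str:
    prerequisites = [s.has_slo, s.has_monitoring, s.dev_team_on_call, s.budget_policy_agreed]
    if not all(prerequisites):
        return "not eligible"
    if s.budget_overspent_windows >= max_overspent_windows or s.sre_ops_fraction > 0.5:
        return "hand back pager"
    return "sre supports"


healthy = ServiceStatus(True, True, True, True, budget_overspent_windows=0, sre_ops_fraction=0.3)
noisy = ServiceStatus(True, True, True, True, budget_overspent_windows=3, sre_ops_fraction=0.3)
print(engagement_decision(healthy))
print(engagement_decision(noisy))
```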

SRE Metrics: DORA and Beyond

The DORA (DevOps Research and Assessment) metrics measure software delivery and operational performance:

| Metric | Elite Performers | High | Medium | Low |
|---|---|---|---|---|
| Deployment frequency | Multiple times/day | 1/day-1/week | 1/week-1/month | < monthly |
| Lead time for changes | < 1 hour | 1 day - 1 week | 1 week - 1 month | > 1 month |
| Time to restore service | < 1 hour | < 1 day | 1 day - 1 week | > 1 week |
| Change failure rate | 0-15% | 16-30% | 16-30% | 16-30% |

# Calculate DORA metrics from your data
import statistics


def calculate_dora_metrics(deployments, incidents, window_days=30):
    """Calculate DORA metrics for a time period of window_days days"""
    
    # Deployment frequency
    deployments_per_day = len(deployments) / window_days
    
    # Lead time for changes
    lead_times = [
        (d.deployed_at - d.committed_at).total_seconds() / 3600
        for d in deployments
    ]
    # statistics.median averages the two middle values for even-length lists;
    # fall back to 0 when there were no deployments in the window
    median_lead_time_hours = statistics.median(lead_times) if lead_times else 0
    
    # Mean time to restore (MTTR)
    mttr_values = [
        (i.resolved_at - i.detected_at).total_seconds() / 3600
        for i in incidents if i.resolved_at
    ]
    mean_mttr_hours = sum(mttr_values) / len(mttr_values) if mttr_values else 0
    
    # Change failure rate
    total_deployments = len(deployments)
    failed_deployments = len([
        d for d in deployments 
        if any(i.deployment_id == d.id for i in incidents)
    ])
    change_failure_rate = failed_deployments / total_deployments if total_deployments else 0
    
    return {
        "deployment_frequency_per_day": round(deployments_per_day, 2),
        "median_lead_time_hours": round(median_lead_time_hours, 1),
        "mean_mttr_hours": round(mean_mttr_hours, 1),
        "change_failure_rate_pct": round(change_failure_rate * 100, 1)
    }

Starting with SRE Practices

You don't need a dedicated SRE team to adopt SRE practices. Start with:

- Week 1-2: Define SLOs for your 2-3 most critical services
- Week 3-4: Implement error budgets and create a policy
- Month 2: Audit toil and automate your top 3 most frequent manual tasks
- Month 3: Establish a postmortem process for all P1/P2 incidents
- Month 4: Create production readiness criteria for new services

The tools and metrics that SRE uses — monitoring, alerting, incident tracking — are the same ones that AzMonitor provides. The SRE practices give these tools purpose: SLOs define what the monitoring should be measuring, error budgets define when the alerts should fire, and incident practices define what to do when they do.

Conclusion

SRE is fundamentally about treating reliability as an engineering problem with engineering solutions — measuring it precisely with SLOs, managing it rationally with error budgets, reducing operational toil with automation, and learning systematically from incidents. Whether you have a dedicated SRE team or a small engineering team practicing SRE principles, the practices create the same outcome: services that are reliably excellent rather than unreliably heroic. AzMonitor provides the monitoring infrastructure that makes SRE measurable — the continuous uptime and performance data that feeds SLO calculations, error budget tracking, and incident detection.


AzMonitor Team

The AzMonitor team writes guides based on experience monitoring millions of endpoints daily across 10,000+ customer environments. Our expertise covers uptime monitoring, SRE practices, and reliability engineering.


SRE Fundamentals: What Site Reliability Engineering Is and How It Works