SLA reporting is the evidence that your reliability commitments are being met — or the early warning system when they aren't. Without structured reporting, SLA discussions happen only at renewal time or after a breach, when they're already contentious. Regular, automated SLA reports shift the conversation from "were you reliable?" to "here's the data showing how we performed."
Who Reads SLA Reports and What They Need
Different audiences need different information from the same underlying data:
| Audience | Primary Question | Format | Frequency |
|---|---|---|---|
| Customers (enterprise) | "Were you reliable last month?" | Availability percentage, incident summary | Monthly |
| Customer success team | "Which accounts are at risk?" | Accounts below SLA threshold, trends | Weekly |
| Engineering leadership | "How are we tracking against SLOs?" | Error budget burn rate, MTTR trends | Weekly |
| Finance/Legal | "Do we owe SLA credits?" | Breach events, credit calculations | Monthly |
| Engineering team | "Where are we spending reliability budget?" | Per-service metrics, incident frequency | Weekly |
Core SLA Metrics to Report
Availability
def calculate_availability(checks, period_start, period_end):
    """
    Calculate availability percentage over a time period.

    Availability = (total_time - downtime) / total_time * 100
    """
    total_seconds = (period_end - period_start).total_seconds()
    if total_seconds <= 0:
        raise ValueError("period_end must be after period_start")

    # Sum up downtime periods. An outage runs from the first "down" check
    # to the next "up" check.
    downtime_seconds = 0
    current_outage_start = None
    sorted_checks = sorted(checks, key=lambda c: c.checked_at)

    for check in sorted_checks:
        if check.status == "down" and current_outage_start is None:
            current_outage_start = check.checked_at
        elif check.status == "up" and current_outage_start is not None:
            downtime_seconds += (check.checked_at - current_outage_start).total_seconds()
            current_outage_start = None

    # Handle ongoing outage at end of period
    if current_outage_start is not None:
        downtime_seconds += (period_end - current_outage_start).total_seconds()

    availability_pct = (total_seconds - downtime_seconds) / total_seconds * 100

    return {
        "availability_pct": round(availability_pct, 4),
        "uptime_seconds": total_seconds - downtime_seconds,
        "downtime_seconds": downtime_seconds,
        "downtime_minutes": round(downtime_seconds / 60, 1),
        "period_days": (period_end - period_start).days
    }
Key Availability Thresholds
| SLA Level (Monthly Availability) | Max Monthly Downtime |
|---|---|
| 99.0% | 7 hours 18 minutes |
| 99.5% | 3 hours 39 minutes |
| 99.9% | 43 minutes 50 seconds |
| 99.95% | 21 minutes 55 seconds |
| 99.99% | 4 minutes 23 seconds |
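The downtime allowances above follow directly from the availability target and the length of the period. The sketch below reproduces them, assuming an average month of 365.2425 / 12 days (about 30.44 days); the function names are illustrative, and a contract may define the measurement period differently.

```python
from datetime import timedelta

# Assumption: an "average" Gregorian month. A contract may instead use the
# actual calendar month or a fixed 30-day window.
AVG_MONTH_DAYS = 365.2425 / 12


def max_downtime(sla_pct, period_days=AVG_MONTH_DAYS):
    """Return the downtime an availability target allows over a period."""
    allowed_fraction = (100 - sla_pct) / 100
    return timedelta(days=period_days * allowed_fraction)


def format_duration(delta):
    """Format a timedelta as 'Hh Mm Ss', rounded to the nearest second."""
    total = round(delta.total_seconds())
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}h {minutes}m {seconds}s"


for target in (99.0, 99.5, 99.9, 99.95, 99.99):
    print(f"{target}% -> {format_duration(max_downtime(target))}")
```

Passing `period_days=30` instead gives the allowance for a fixed 30-day window, which is slightly smaller (43 minutes 12 seconds at 99.9%).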
Latency
import math


def calculate_latency_slo_compliance(response_times, target_p95_ms, target_p99_ms):
    """
    Compare observed latency percentiles with their SLO targets and report
    what percentage of requests finished within the P95 target.
    """
    if not response_times:
        raise ValueError("response_times must not be empty")

    sorted_times = sorted(response_times)
    n = len(sorted_times)

    def percentile(p):
        # Nearest-rank percentile: the smallest value with at least p% of
        # requests at or below it.
        return sorted_times[max(math.ceil(n * p / 100) - 1, 0)]

    actual_p95 = percentile(95)
    actual_p99 = percentile(99)
    requests_under_p95_target = sum(1 for t in sorted_times if t <= target_p95_ms)

    return {
        "total_requests": n,
        "p50_ms": percentile(50),
        "p95_ms": actual_p95,
        "p95_target_ms": target_p95_ms,
        "p95_compliant": actual_p95 <= target_p95_ms,
        "p99_ms": actual_p99,
        "p99_target_ms": target_p99_ms,
        "p99_compliant": actual_p99 <= target_p99_ms,
        "requests_under_p95_target": requests_under_p95_target,
        "p95_compliance_rate": round(requests_under_p95_target / n * 100, 2)
    }
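Error budget burn rate, which the audience table lists for engineering leadership, can be derived from the same request data. The following is a minimal sketch under stated assumptions: the SLO is request-based, traffic is roughly uniform over the period, and the function name and parameters are illustrative rather than part of any monitoring API.

```python
def error_budget_status(total_requests, failed_requests, slo_pct, elapsed_fraction):
    """
    Summarize error budget consumption for a request-based SLO.

    elapsed_fraction is the share of the reporting period that has passed
    (0 < elapsed_fraction <= 1). A burn rate of 1.0 means the budget would
    be used up exactly at the end of the period if the current pace holds.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 < elapsed_fraction <= 1:
        raise ValueError("elapsed_fraction must be in (0, 1]")
    if not 0 < slo_pct < 100:
        raise ValueError("slo_pct must be between 0 and 100 (exclusive)")

    allowed_error_rate = (100 - slo_pct) / 100
    observed_error_rate = failed_requests / total_requests

    # Burn rate: how fast errors arrive relative to the pace the SLO allows.
    burn_rate = observed_error_rate / allowed_error_rate
    # Assumes uniform traffic, so budget use scales with elapsed time.
    budget_consumed_pct = burn_rate * elapsed_fraction * 100

    return {
        "observed_error_rate_pct": round(observed_error_rate * 100, 4),
        "burn_rate": round(burn_rate, 2),
        "budget_consumed_pct": round(budget_consumed_pct, 1),
        "budget_exhausted": budget_consumed_pct >= 100,
    }


# Halfway through the month: 500 failures in 1,000,000 requests, 99.9% SLO.
status = error_budget_status(1_000_000, 500, slo_pct=99.9, elapsed_fraction=0.5)
print(status)
```

A burn rate above 1.0 sustained for long enough means the budget runs out before the period ends, which is the signal a weekly report should surface.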
Monthly SLA Report Template
# Service Level Agreement Report
## [Customer Name] | [Month Year]
---
### Availability Summary
| Service | SLA Target | Actual | Status | Downtime |
|---|---|---|---|---|
| API | 99.9% | 99.97% | ✓ Met | 8m 46s |
| Dashboard | 99.5% | 99.99% | ✓ Met | 0m |
| Authentication | 99.9% | 99.92% | ✓ Met | 6m 14s |
**Overall availability: 99.95%** (SLA target: 99.9%)
---
### Incident Summary
[Month] had 2 incidents affecting your account:
**Incident 1: Authentication service degradation**
- Date: [Date], [Time UTC]
- Duration: 6 minutes 14 seconds
- Impact: Login failures for ~12% of users
- Root cause: Configuration change, since rolled back
- Status: Resolved. Postmortem published at [link]
**Incident 2: API elevated latency**
- Date: [Date], [Time UTC]
- Duration: 8 minutes 46 seconds
- Impact: P99 latency increased from 180ms to 2.1 seconds
- Root cause: Database query optimization deployed
- Status: Resolved.
---
### Performance Metrics
| Metric | Target | Actual | Trend |
|---|---|---|---|
| P95 Response Time | < 500ms | 187ms | ↔ Stable |
| P99 Response Time | < 1000ms | 312ms | ↔ Stable |
| Error Rate | < 0.1% | 0.03% | ↔ Stable |
---
### Historical Availability (Last 12 Months)
| Month | Availability | Incidents | Downtime |
|---|---|---|---|
| May 2025 | 99.95% | 2 | 15m |
| Apr 2025 | 99.99% | 0 | 0m |
| Mar 2025 | 99.92% | 1 | 35m |
| [continue...] | | | |
---
### SLA Credit Calculation
Based on [Month] performance, no SLA credits are due.
Your service availability (99.95%) exceeded the contracted
SLA threshold (99.9%).
---
*Report generated [Date]. Data reflects monitoring from [start] to [end].
For questions, contact your Customer Success Manager.*
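The credit section of the report depends on a schedule that maps measured availability to a credit. The sketch below shows one way to express that, assuming a tiered schedule; the tier values, the function name, and the `monthly_fee` parameter are hypothetical, and real terms come from each customer's contract.

```python
# Hypothetical schedule: (minimum availability %, credit as % of monthly fee),
# ordered from best to worst availability. Replace with the contracted terms.
CREDIT_TIERS = [
    (99.9, 0),   # SLA target met: no credit
    (99.0, 10),
    (95.0, 25),
    (0.0, 50),
]


def calculate_sla_credit(availability_pct, monthly_fee, tiers=CREDIT_TIERS):
    """Return the credit owed for a month, given the measured availability."""
    for min_availability, credit_pct in tiers:
        if availability_pct >= min_availability:
            return {
                "credit_pct": credit_pct,
                "amount": round(monthly_fee * credit_pct / 100, 2),
            }
    raise ValueError("availability_pct is below every tier in the schedule")


print(calculate_sla_credit(99.95, monthly_fee=5000))  # target met
print(calculate_sla_credit(99.5, monthly_fee=5000))   # falls in the 10% tier
```

Returning an `amount` key keeps the result compatible with a report that only needs to know whether any credit is owed.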
Automated Report Generation
# sla_report_generator.py
from calendar import monthrange
from datetime import date, datetime, timedelta, timezone

import jinja2

# calculate_availability is the function defined in the Availability section.


class SLAReportGenerator:
    """
    Builds monthly SLA reports per customer.

    calculate_sla_credits, email_report and notify_customer_success are
    assumed to be implemented elsewhere in this class.
    """

    def __init__(self, monitoring_client, customer_db, template_path):
        self.monitoring = monitoring_client
        self.customers = customer_db
        self.template = jinja2.Environment(
            loader=jinja2.FileSystemLoader(template_path)
        ).get_template("monthly_sla_report.html")

    def generate_customer_report(self, customer_id, year, month):
        """Generate complete SLA report for a customer."""
        customer = self.customers.get(customer_id)
        period_start, period_end = self.get_month_bounds(year, month)

        # Fetch monitoring data for this customer's services
        availability_data = {}
        for service in customer.monitored_services:
            checks = self.monitoring.get_checks(
                monitor_id=service.monitor_id,
                start=period_start,
                end=period_end
            )
            availability_data[service.name] = calculate_availability(
                checks, period_start, period_end
            )

        # Fetch incidents
        incidents = self.monitoring.get_incidents(
            customer_id=customer_id,
            start=period_start,
            end=period_end
        )

        # Calculate SLA credits
        credits = self.calculate_sla_credits(
            customer=customer,
            availability_data=availability_data,
            period_start=period_start,
            period_end=period_end
        )

        # Build report data
        report_data = {
            "customer": customer,
            "period": date(year, month, 1).strftime("%B %Y"),
            "period_start": period_start,
            "period_end": period_end,
            "availability": availability_data,
            "incidents": incidents,
            "credits": credits,
            "generated_at": datetime.now(timezone.utc)
        }

        # Render HTML report
        html_report = self.template.render(**report_data)

        # Also generate machine-readable version
        json_report = {
            "customer_id": customer_id,
            "period": f"{year}-{month:02d}",
            "availability": availability_data,
            "incident_count": len(incidents),
            "credits_owed": credits["amount"]
        }

        return {
            "html": html_report,
            "json": json_report,
            "customer": customer,
            "credits_owed": credits["amount"] > 0
        }

    def generate_all_monthly_reports(self, year, month):
        """Generate reports for all enterprise customers."""
        reports = []
        for customer in self.customers.get_enterprise_customers():
            report = self.generate_customer_report(customer.id, year, month)
            reports.append(report)

            # Send report to customer
            self.email_report(customer, report)

            # Flag accounts with credits owed
            if report["credits_owed"]:
                self.notify_customer_success(customer, report)

        return reports

    def get_month_bounds(self, year, month):
        """Get start and end of a calendar month (naive datetimes, treated as UTC)."""
        days_in_month = monthrange(year, month)[1]
        start = datetime(year, month, 1)
        # End at midnight on the first day of the next month so the final
        # second of the month is counted.
        end = start + timedelta(days=days_in_month)
        return start, end
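Monthly reports are normally generated shortly after the month closes, so a scheduled job needs to work out which month to report on. A minimal sketch, assuming the job runs early in the following month; `previous_month` is an illustrative helper, not part of the generator above.

```python
from datetime import date


def previous_month(today):
    """Return (year, month) for the calendar month before `today`."""
    if today.month == 1:
        # January rolls back to December of the previous year.
        return today.year - 1, 12
    return today.year, today.month - 1


year, month = previous_month(date.today())
print(f"Reporting period: {year}-{month:02d}")

# In a scheduled job this would feed the generator, for example:
#   generator.generate_all_monthly_reports(year, month)
```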
Internal Engineering SLA Dashboard
For internal visibility, create a dashboard showing all accounts' SLA status:
from datetime import datetime, timezone

# calculate_customer_availability and count_recent_incidents are assumed to be
# helpers defined elsewhere in your codebase.

# Maps each account status to its counter in the dashboard summary.
SUMMARY_KEYS = {
    "meeting": "accounts_meeting_sla",
    "at_risk": "accounts_at_risk",
    "breach": "accounts_in_breach",
}


def generate_engineering_sla_dashboard(customers, period_days=30):
    """
    Dashboard for engineering team showing SLA health across all accounts.
    """
    dashboard = {
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "period_days": period_days,
        "summary": {
            "total_enterprise_accounts": 0,
            "accounts_meeting_sla": 0,
            "accounts_at_risk": 0,
            "accounts_in_breach": 0,
            "total_credits_owed": 0  # Not populated by this function
        },
        "accounts": []
    }

    for customer in customers.get_enterprise_customers():
        availability = calculate_customer_availability(customer, days=period_days)
        sla_target = customer.contract.availability_target
        buffer = availability - sla_target

        if availability >= sla_target:
            status = "meeting"
        elif availability >= sla_target - 0.1:
            status = "at_risk"  # Up to 0.1 percentage points below the SLA target
        else:
            status = "breach"

        account_data = {
            "customer_id": customer.id,
            "customer_name": customer.name,
            "mrr": customer.mrr,
            "sla_target": sla_target,
            "actual_availability": availability,
            "buffer_pct": buffer,
            "status": status,
            "incident_count": count_recent_incidents(customer, days=period_days)
        }
        dashboard["accounts"].append(account_data)

        dashboard["summary"]["total_enterprise_accounts"] += 1
        dashboard["summary"][SUMMARY_KEYS[status]] += 1

    # Sort by status (breaches first, then at-risk, then meeting)
    dashboard["accounts"].sort(
        key=lambda a: {"breach": 0, "at_risk": 1, "meeting": 2}[a["status"]]
    )

    return dashboard
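Because each account entry carries both `status` and `mrr`, the dashboard can also show how much recurring revenue sits in each status bucket. A minimal sketch, assuming the dashboard dictionary has the shape built above; `mrr_by_status` and the sample accounts are illustrative only.

```python
from collections import defaultdict


def mrr_by_status(dashboard):
    """Total monthly recurring revenue per SLA status."""
    totals = defaultdict(float)
    for account in dashboard["accounts"]:
        totals[account["status"]] += account["mrr"]
    return dict(totals)


# Made-up sample data in the same shape as the dashboard's "accounts" list.
sample_dashboard = {
    "accounts": [
        {"customer_name": "Example A", "status": "breach", "mrr": 4000},
        {"customer_name": "Example B", "status": "at_risk", "mrr": 2500},
        {"customer_name": "Example C", "status": "meeting", "mrr": 9000},
        {"customer_name": "Example D", "status": "meeting", "mrr": 3000},
    ]
}

print(mrr_by_status(sample_dashboard))
```

Weighting the view by revenue helps decide which at-risk accounts to look at first.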
Conclusion
Effective SLA reporting requires clarity about who the audience is, automated generation to ensure consistency, and enough detail to be actionable without overwhelming non-technical stakeholders. Monthly customer-facing reports build trust; weekly internal dashboards keep engineering teams aware of drift before it becomes a breach. AzMonitor's monitoring data — check history, response times, incident records — is the raw material that makes SLA reporting accurate and automatable. When your availability data is collected systematically by external monitoring, you can generate reports with confidence that the numbers reflect actual customer-experienced uptime rather than internal self-assessments.