API SLAs are promises. When you tell customers your API will be available 99.9% of the time with p95 latency under 300ms, you're making a commitment that has consequences — financial credits, churn risk, contract violations. Without rigorous SLA monitoring, you're either violating promises you don't know about or paying penalties for incidents you can't prove didn't happen.
Defining Meaningful API SLAs
An API SLA needs specific, measurable definitions. Vague commitments create disputes. Start by defining:
What counts as "available"? An API that returns 500 errors is technically "reachable" but not serving users. Define availability as the percentage of successful requests (2xx responses) out of all attempts.
What's measured? Synthetic monitoring from external locations? Real user traffic? Both have tradeoffs. Synthetic is consistent; real user traffic reflects actual user experience.
What's excluded? Planned maintenance, force majeure events, and client-caused issues are typically excluded from SLA calculations.
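The three definitions above can be combined into a single calculation. The following is a minimal sketch, assuming each request record carries a timestamp and an HTTP status code, that planned maintenance windows are known as (start, end) pairs, and that 4xx responses are treated as client-caused. The names `RequestRecord` and `availability_with_exclusions`, and the 4xx rule itself, are illustrative assumptions rather than terms from any particular SLA.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class RequestRecord:
    timestamp: datetime
    status: int  # HTTP status code


def availability_with_exclusions(records, maintenance_windows):
    """Percentage of 2xx responses among requests that count toward the SLA.

    Assumptions (illustrative only): requests inside a planned maintenance
    window are excluded, and 4xx responses are treated as client-caused.
    """
    def in_maintenance(ts):
        return any(start <= ts < end for start, end in maintenance_windows)

    counted = [
        r for r in records
        if not in_maintenance(r.timestamp) and not (400 <= r.status < 500)
    ]
    if not counted:
        return None  # nothing measurable in this window
    successes = sum(1 for r in counted if 200 <= r.status < 300)
    return successes / len(counted) * 100


records = [
    RequestRecord(datetime(2024, 1, 1, 0, 0), 200),
    RequestRecord(datetime(2024, 1, 1, 0, 1), 500),   # counts as a failure
    RequestRecord(datetime(2024, 1, 1, 0, 2), 404),   # client-caused, excluded
    RequestRecord(datetime(2024, 1, 1, 2, 30), 503),  # during maintenance, excluded
    RequestRecord(datetime(2024, 1, 1, 4, 0), 201),
]
windows = [(datetime(2024, 1, 1, 2, 0), datetime(2024, 1, 1, 3, 0))]
print(availability_with_exclusions(records, windows))
```

Three requests count here (two successes and one 500), so the exclusions change the result considerably. Whatever rules you pick, write them into the SLA text so both sides compute the same number.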
SLA Components for APIs
| Component | Definition | Typical Target |
|---|---|---|
| Availability | % of requests returning 2xx | 99.9% - 99.99% |
| Latency (p95) | 95th percentile response time | < 300ms - 1000ms |
| Error Rate | % of requests returning 5xx | < 0.1% - 0.5% |
| Throughput | Guaranteed requests per second | Varies by tier |
| Recovery Time | Time to restore after outage | < 15min - 4hr |
Calculating API Availability
The most common SLA metric is availability, usually expressed as uptime percentage:
def calculate_api_availability(checks, window_hours=720, sla_target=99.9):
    """
    Calculate API availability from monitoring check results.

    Args:
        checks: List of (timestamp, success) tuples
        window_hours: Measurement window in hours (720 = 30 days)
        sla_target: Committed availability percentage

    Returns:
        Dict with the availability percentage, estimated downtime minutes,
        and whether the SLA target was met
    """
    total_checks = len(checks)
    successful_checks = sum(1 for _, success in checks if success)

    # With no data, report 0% availability instead of dividing by zero
    availability_pct = (successful_checks / total_checks) * 100 if total_checks else 0.0

    # Downtime is estimated from the share of failed checks, so it is only
    # as precise as the check interval
    downtime_minutes = (1 - availability_pct / 100) * window_hours * 60

    return {
        "availability_pct": round(availability_pct, 4),
        "downtime_minutes": round(downtime_minutes, 1),
        "successful_checks": successful_checks,
        "total_checks": total_checks,
        "sla_target": sla_target,
        "sla_met": availability_pct >= sla_target,
    }
# Convert uptime % to allowed downtime.
# Monthly figures assume an average month of about 30.44 days (365.25 / 12).
SLA_DOWNTIME = {
    99.0: {"monthly": "7h 18m", "weekly": "1h 41m", "daily": "14m 24s"},
    99.5: {"monthly": "3h 39m", "weekly": "50m 24s", "daily": "7m 12s"},
    99.9: {"monthly": "43m 49s", "weekly": "10m 4s", "daily": "1m 26s"},
    99.95: {"monthly": "21m 54s", "weekly": "5m 2s", "daily": "43s"},
    99.99: {"monthly": "4m 22s", "weekly": "1m 0s", "daily": "8.6s"},
}
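Each entry in that table follows from one formula: allowed downtime = (1 - target / 100) x window length. As a rough sketch, the budget can be computed directly instead of looked up. The `allowed_downtime` function and the `WINDOWS` mapping are hypothetical names, and the monthly window assumes an average month of 365.25 / 12 days.

```python
from datetime import timedelta

# Window lengths; "monthly" is an assumed average month (365.25 / 12 days).
WINDOWS = {
    "daily": timedelta(days=1),
    "weekly": timedelta(days=7),
    "monthly": timedelta(days=365.25 / 12),
}


def allowed_downtime(target_pct, window):
    """Return the downtime budget for an availability target over a window."""
    if not 0 <= target_pct <= 100:
        raise ValueError("target_pct must be between 0 and 100")
    return window * (1 - target_pct / 100)


for name, window in WINDOWS.items():
    budget = allowed_downtime(99.9, window)
    print(f"{name}: {budget.total_seconds():.1f} seconds")
```

Computing the budget this way also makes it easy to use the exact length of the billing period when a contract defines the month differently.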
Latency SLA Tracking
Availability is necessary but insufficient. An API that responds slowly is failing even if it technically responds. Track latency SLAs separately:
-- Calculate latency SLA compliance per endpoint over the last 30 days.
-- PERCENTILE_CONT is an ordered-set aggregate (PostgreSQL does not allow it
-- with OVER), so percentiles are computed once per endpoint in the outer query.
WITH latency_data AS (
    SELECT
        endpoint,
        response_time_ms
    FROM api_checks
    WHERE timestamp > NOW() - INTERVAL '30 days'
)
SELECT
    endpoint,
    AVG(response_time_ms) AS avg_ms,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY response_time_ms) AS p95_ms,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY response_time_ms) AS p99_ms,
    -- Share of checks that individually met the 300 ms target
    COUNT(*) FILTER (WHERE response_time_ms <= 300) * 100.0 / COUNT(*) AS pct_within_300ms,
    CASE WHEN
        PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY response_time_ms) <= 300
        THEN 'MET' ELSE 'VIOLATED'
    END AS latency_sla_status
FROM latency_data
GROUP BY endpoint
ORDER BY p95_ms DESC;
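To sanity-check the query outside the database, the same per-endpoint statistics can be computed in Python. This is a sketch under assumptions: `percentile_cont` and `latency_summary` are hypothetical helper names, and the percentile uses linear interpolation between neighboring samples, which is how PostgreSQL describes `PERCENTILE_CONT`.

```python
import math


def percentile_cont(values, fraction):
    """Continuous percentile with linear interpolation between neighbors."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    position = fraction * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight


def latency_summary(latencies_ms, target_ms=300):
    """Mirror the per-endpoint columns of the query for one list of samples."""
    p95 = percentile_cont(latencies_ms, 0.95)
    within = sum(1 for v in latencies_ms if v <= target_ms)
    return {
        "avg_ms": sum(latencies_ms) / len(latencies_ms),
        "p95_ms": p95,
        "pct_within_target": within * 100.0 / len(latencies_ms),
        "latency_sla_status": "MET" if p95 <= target_ms else "VIOLATED",
    }


# 18 fast responses and 2 slow ones: the average stays under the target,
# but the p95 lands on a slow response.
samples = [100] * 18 + [2000] * 2
print(latency_summary(samples))
```

The sample data shows why the SLA is written against a percentile: the average is 290 ms, under the 300 ms target, while one request in ten takes two seconds.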
Setting Up SLA Monitoring Checks
Configure monitoring with SLA compliance in mind:
# SLA-oriented monitoring configuration
monitors:
  - name: "Payments API - SLA Monitor"
    url: "https://api.example.com/v2/health"
    interval: 60  # Check every minute for accurate uptime calculation
    regions:
      - us-east-1
      - eu-west-1
    sla:
      availability_target: 99.9
      latency_p95_target: 500  # ms
      error_rate_target: 0.1  # percent
    alerts:
      # SLA breach prediction (project current trajectory)
      - condition: "projected_monthly_availability < 99.9"
        severity: critical
        message: "Current performance trajectory will breach SLA this month"
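The configuration refers to `projected_monthly_availability` without saying how it is derived. One possible reading, sketched below, is to compute two numbers: a trend figure that assumes the downtime rate seen so far continues, and a best-case figure that assumes no further downtime this month. The function name, its parameters, and the 30-day month are assumptions for illustration, not the behavior of any specific monitoring tool.

```python
MONTH_MINUTES = 30 * 24 * 60  # assumed 30-day measurement window


def project_month(downtime_minutes, elapsed_minutes, target_pct=99.9,
                  month_minutes=MONTH_MINUTES):
    """Project end-of-month availability from downtime observed so far."""
    if not 0 < elapsed_minutes <= month_minutes:
        raise ValueError("elapsed_minutes must be within the month")

    # Trend: the downtime rate observed so far continues until month end.
    trend_pct = 100 * (1 - downtime_minutes / elapsed_minutes)
    # Best case: no further downtime for the rest of the month.
    best_case_pct = 100 * (1 - downtime_minutes / month_minutes)

    return {
        "trend_availability_pct": round(trend_pct, 4),
        "best_case_availability_pct": round(best_case_pct, 4),
        "breach_predicted": trend_pct < target_pct,
        "breach_certain": best_case_pct < target_pct,
    }


# 20 minutes of downtime after 10 days of a 30-day month
print(project_month(downtime_minutes=20, elapsed_minutes=10 * 24 * 60))
```

In this example the trend falls below 99.9% while the best case stays above it, so the month is still recoverable. That distinction is useful when deciding whether an alert calls for a change freeze or for customer communication.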
SLA Credit Calculation
When SLAs are breached, calculate credits owed:
def calculate_sla_credits(
    actual_availability,
    sla_tiers,
    monthly_spend
):
    """
    Calculate SLA credits based on actual uptime vs committed levels.

    sla_tiers: List of (min_availability, credit_percentage) tuples. A credit
        applies when availability is at or above min_availability and below
        the next higher tier, e.g. [(99.9, 0), (99.0, 10), (95.0, 25), (0, 50)]
    """
    credit_percentage = 0
    # Walk tiers from the highest floor down; the first floor that the actual
    # availability reaches decides the credit. Include a (0, N) floor tier so
    # that every value matches.
    for min_availability, credit_pct in sorted(sla_tiers, reverse=True):
        if actual_availability >= min_availability:
            credit_percentage = credit_pct
            break

    credit_amount = monthly_spend * (credit_percentage / 100)
    return {
        "actual_availability": actual_availability,
        "credit_percentage": credit_percentage,
        "credit_amount": credit_amount,
        "monthly_spend": monthly_spend
    }


# Example SLA credit structure
SLA_TIERS = [
    (99.9, 0),    # No credit if SLA met
    (99.0, 10),   # 10% credit for 99.0% - 99.9%
    (95.0, 25),   # 25% credit for 95.0% - 99.0%
    (90.0, 50),   # 50% credit for 90.0% - 95.0%
    (0, 100),     # Full credit below 90%
]

result = calculate_sla_credits(
    actual_availability=99.2,
    sla_tiers=SLA_TIERS,
    monthly_spend=5000
)
# Output: {'actual_availability': 99.2, 'credit_percentage': 10, 'credit_amount': 500.0, 'monthly_spend': 5000}
Building SLA Reports
Automated SLA reports help both internal tracking and customer communication:
import calendar
from datetime import datetime, timezone


def generate_monthly_sla_report(api_name, month, year, checks_data):
    """Generate monthly SLA compliance report"""
    report = {
        "api": api_name,
        "period": f"{year}-{month:02d}",
        "generated_at": datetime.now(timezone.utc).isoformat(),
    }

    # Overall availability
    total = len(checks_data)
    successful = sum(1 for c in checks_data if c['status'] == 'success')
    availability = (successful / total * 100) if total > 0 else 0
    # Use the real length of the month rather than a fixed 30 days
    minutes_in_month = calendar.monthrange(year, month)[1] * 24 * 60
    report["availability"] = {
        "percentage": round(availability, 4),
        "target": 99.9,
        "met": availability >= 99.9,
        "total_checks": total,
        "successful_checks": successful,
        "downtime_minutes": round((1 - availability / 100) * minutes_in_month, 1)
    }

    # Latency compliance (successful checks only)
    latencies = [c['response_time_ms'] for c in checks_data if c['status'] == 'success']
    if latencies:
        sorted_latencies = sorted(latencies)
        # Index of the 95th percentile sample; always within bounds
        p95_index = int(len(sorted_latencies) * 0.95)
        report["latency"] = {
            "p50_ms": sorted_latencies[len(sorted_latencies) // 2],
            "p95_ms": sorted_latencies[p95_index],
            "target_p95_ms": 300,
            "met": sorted_latencies[p95_index] <= 300
        }

    # Incidents (find_incidents is a helper assumed to be defined elsewhere)
    incidents = find_incidents(checks_data)
    report["incidents"] = [
        {
            "start": inc['start'].isoformat(),
            "end": inc['end'].isoformat(),
            "duration_minutes": inc['duration_minutes'],
            "affected_regions": inc['regions']
        }
        for inc in incidents
    ]

    return report
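The report relies on a `find_incidents` helper that is not shown. One way it could work, sketched here under assumptions, is to group consecutive failed checks into incidents. The `timestamp` and `region` keys on each check are hypothetical (the original only shows `status` and `response_time_ms`), and any successful check closes the open incident, so a multi-region setup would likely want to track incidents per region instead.

```python
from datetime import datetime, timedelta


def _close(incident):
    """Finalize an incident dict with its duration and a sorted region list."""
    duration = (incident["end"] - incident["start"]).total_seconds() / 60
    return {
        "start": incident["start"],
        "end": incident["end"],
        "duration_minutes": round(duration, 1),
        "regions": sorted(incident["regions"]),
    }


def find_incidents(checks_data):
    """Group consecutive failed checks into incidents (illustrative sketch)."""
    incidents = []
    current = None
    for check in sorted(checks_data, key=lambda c: c["timestamp"]):
        if check["status"] != "success":
            if current is None:
                current = {"start": check["timestamp"], "regions": set()}
            current["end"] = check["timestamp"]
            current["regions"].add(check["region"])
        elif current is not None:
            # The first success after a run of failures marks recovery.
            current["end"] = check["timestamp"]
            incidents.append(_close(current))
            current = None
    if current is not None:
        incidents.append(_close(current))  # still failing at end of data
    return incidents


t0 = datetime(2024, 1, 1, 12, 0)
checks = [
    {"timestamp": t0 + timedelta(minutes=i), "status": s, "region": "us-east-1"}
    for i, s in enumerate(["success", "failure", "failure", "success"])
]
print(find_incidents(checks))
```

Because duration is measured between check timestamps, its resolution is bounded by the check interval. That is one reason the configuration earlier checks every minute.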
Multi-Tier API SLA Management
Different customers often get different SLAs based on their plan:
| Plan | Availability | P95 Latency | Support Response | Cost |
|---|---|---|---|---|
| Free | 99.0% | 1000ms | Community | $0 |
| Pro | 99.5% | 500ms | 24h email | $99/mo |
| Business | 99.9% | 300ms | 4h priority | $499/mo |
| Enterprise | 99.99% | 150ms | 1h dedicated | Custom |
Track SLA compliance separately for each tier:
def check_sla_compliance_by_tier(
    monitoring_data,
    customer_tiers
):
    """Check if each tier is meeting its SLA commitments"""
    results = {}

    for tier_name, tier_config in customer_tiers.items():
        # Only the checks that belong to this tier
        tier_checks = [
            c for c in monitoring_data
            if c['customer_tier'] == tier_name
        ]

        # calculate_availability and calculate_p95_latency are helpers
        # assumed to be defined elsewhere
        availability = calculate_availability(tier_checks)
        p95_latency = calculate_p95_latency(tier_checks)

        results[tier_name] = {
            "availability": {
                "actual": availability,
                "target": tier_config['availability_sla'],
                "compliant": availability >= tier_config['availability_sla']
            },
            "latency": {
                "p95_actual": p95_latency,
                "target": tier_config['latency_sla'],
                "compliant": p95_latency <= tier_config['latency_sla']
            }
        }

    return results
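The two helpers used above, `calculate_availability` and `calculate_p95_latency`, are not defined in the article. A minimal sketch of what they might look like follows, assuming each check is a dict with `status` and `response_time_ms` keys as in the report function. Returning 0.0 and infinity for empty input is an assumption made here so that a tier with no data is reported as non-compliant rather than raising an error.

```python
import math


def calculate_availability(checks):
    """Percentage of checks whose status is 'success' (0.0 when empty)."""
    if not checks:
        return 0.0
    successful = sum(1 for c in checks if c["status"] == "success")
    return successful / len(checks) * 100


def calculate_p95_latency(checks):
    """95th percentile latency of successful checks, in milliseconds.

    Returns infinity when there are no successful checks, so a comparison
    against a latency target evaluates to non-compliant.
    """
    latencies = sorted(
        c["response_time_ms"] for c in checks if c["status"] == "success"
    )
    if not latencies:
        return math.inf
    return latencies[int(len(latencies) * 0.95)]


# 19 successful checks with rising latency, plus one failure
sample_checks = [
    {"status": "success", "response_time_ms": 100 + i * 10} for i in range(19)
] + [{"status": "failure", "response_time_ms": 5000}]

print(calculate_availability(sample_checks))
print(calculate_p95_latency(sample_checks))
```

The percentile index matches the one in the report function, so per-tier numbers and monthly reports stay consistent with each other.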
Alerting for SLA Risk
Set up proactive alerts before SLA breaches occur:
alerts:
  # Alert when monthly budget is 50% consumed
  - name: "SLA Budget Half Consumed"
    condition: |
      (monthly_downtime_minutes / sla_allowed_downtime_minutes) > 0.5
    severity: warning
    message: "More than 50% of the monthly SLA downtime budget has been consumed"

  # Alert when on track to breach
  - name: "SLA Breach Predicted"
    condition: |
      projected_end_of_month_availability < sla_target
    severity: critical
    message: "Current trajectory will breach SLA - immediate action required"

  # Alert on sustained latency degradation
  - name: "Latency SLA At Risk"
    condition: |
      p95_latency_last_hour > (latency_sla_target * 0.9)
    severity: warning
    message: "Latency within 10% of SLA limit"
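To make the three conditions concrete, the sketch below evaluates them against a snapshot of metrics. The variable names are taken from the alert conditions above. The `evaluate_sla_alerts` function and the dict-based snapshot are assumptions for illustration and do not reflect how any particular alerting engine parses its rules.

```python
def evaluate_sla_alerts(metrics):
    """Return the names of the alerts whose conditions hold for a snapshot."""
    fired = []

    allowed = metrics["sla_allowed_downtime_minutes"]
    # Guard against a zero budget (a 100% target) to avoid dividing by zero.
    budget_used = metrics["monthly_downtime_minutes"] / allowed if allowed else 1.0
    if budget_used > 0.5:
        fired.append("SLA Budget Half Consumed")

    if metrics["projected_end_of_month_availability"] < metrics["sla_target"]:
        fired.append("SLA Breach Predicted")

    if metrics["p95_latency_last_hour"] > metrics["latency_sla_target"] * 0.9:
        fired.append("Latency SLA At Risk")

    return fired


snapshot = {
    "monthly_downtime_minutes": 25,
    "sla_allowed_downtime_minutes": 43.2,   # 99.9% over a 30-day month
    "projected_end_of_month_availability": 99.93,
    "sla_target": 99.9,
    "p95_latency_last_hour": 280,           # ms
    "latency_sla_target": 300,              # ms
}
print(evaluate_sla_alerts(snapshot))
```

With this snapshot the budget and latency warnings fire while the breach prediction does not. That combination is the early-warning state these alerts are meant to surface.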
Conclusion
API SLA monitoring turns commitments into accountable metrics. It requires precise definitions, continuous measurement, and automatic reporting — not manual calculations after the fact. The teams that handle SLAs best are those that monitor continuously, alert before breaches happen, and have automated reports ready for customer reviews. AzMonitor provides the continuous uptime and latency tracking needed to measure API SLA compliance accurately, with historical data that makes generating monthly reports straightforward rather than painful.