API SLA Monitoring: Tracking and Reporting on API Service Agreements

Learn how to define, measure, and report on API SLAs, including availability, latency, and error rate commitments for internal and external consumers.

AzMonitor Team · March 19, 2025 · 8 min read · 1,266 words · Updated January 20, 2026
Tags: API SLA, service level agreement, API availability, SLA reporting

API SLAs are promises. When you tell customers your API will be available 99.9% of the time with p95 latency under 300ms, you're making a commitment that has consequences — financial credits, churn risk, contract violations. Without rigorous SLA monitoring, you're either violating promises you don't know about or paying penalties for incidents you can't prove didn't happen.

Defining Meaningful API SLAs

An API SLA needs specific, measurable definitions. Vague commitments create disputes. Start by defining:

What counts as "available"? An API that returns 500 errors is technically "reachable" but not serving users. Define availability as the percentage of successful requests (2xx responses) out of all attempts that count toward the SLA.

What's measured? Synthetic monitoring from external locations? Real user traffic? Both have tradeoffs. Synthetic is consistent; real user traffic reflects actual user experience.

What's excluded? Planned maintenance, force majeure events, and client-caused issues are typically excluded from SLA calculations.
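
These three definitions can be written down as a small classifier. The following is a minimal sketch, assuming that requests inside agreed maintenance windows and client-caused 4xx responses are excluded, and that anything else that is not a 2xx counts as a failure. The function names and the exact exclusion policy are illustrative; the contract wording should decide the real rules.

```python
from datetime import datetime, timezone


def classify_request(status_code, timestamp, maintenance_windows=()):
    """Return 'excluded', 'success', or 'failure' for one request."""
    # Requests during agreed maintenance windows do not count either way.
    for start, end in maintenance_windows:
        if start <= timestamp < end:
            return "excluded"
    # Assumption: client-caused errors (4xx) are excluded from the SLA.
    if 400 <= status_code < 500:
        return "excluded"
    return "success" if 200 <= status_code < 300 else "failure"


def availability_pct(requests, maintenance_windows=()):
    """requests: iterable of (status_code, timestamp) tuples."""
    outcomes = [
        classify_request(code, ts, maintenance_windows) for code, ts in requests
    ]
    counted = [o for o in outcomes if o != "excluded"]
    if not counted:
        return None  # nothing measurable in this window
    return 100.0 * counted.count("success") / len(counted)
```

Returning None for an empty window keeps "no data" distinct from "0% available", which matters when a report has to be defended in a dispute.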

SLA Components for APIs

| Component | Definition | Typical Target |
|---|---|---|
| Availability | % of requests returning 2xx | 99.9% - 99.99% |
| Latency (p95) | 95th percentile response time | < 300ms - 1000ms |
| Error Rate | % of requests returning 5xx | < 0.1% - 0.5% |
| Throughput | Guaranteed requests per second | Varies by tier |
| Recovery Time | Time to restore after outage | < 15min - 4hr |

Calculating API Availability

The most common SLA metric is availability, usually expressed as uptime percentage:

def calculate_api_availability(checks, window_hours=720, sla_target=99.9):  # 720 h = 30 days
    """
    Calculate API availability from monitoring check results.
    
    Args:
        checks: List of (timestamp, success) tuples
        window_hours: Measurement window in hours
        sla_target: Availability target in percent
    
    Returns:
        Dict with availability percentage, estimated downtime minutes,
        check counts, and whether the SLA target was met
    """
    total_checks = len(checks)
    successful_checks = sum(1 for _, success in checks if success)
    
    # With no data, report 0% rather than assuming the API was up.
    availability_pct = (successful_checks / total_checks) * 100 if total_checks else 0.0
    # Downtime is estimated from the share of failed checks, so its
    # precision depends on the check interval.
    downtime_pct = 100 - availability_pct
    downtime_minutes = (downtime_pct / 100) * window_hours * 60
    
    return {
        "availability_pct": round(availability_pct, 4),
        "downtime_minutes": round(downtime_minutes, 1),
        "successful_checks": successful_checks,
        "total_checks": total_checks,
        "sla_target": sla_target,
        "sla_met": availability_pct >= sla_target
    }

# Allowed downtime per uptime target
# (monthly figures assume an average month of about 30.44 days)
SLA_DOWNTIME = {
    99.0:  {"monthly": "7h 18m", "weekly": "1h 41m", "daily": "14m 24s"},
    99.5:  {"monthly": "3h 39m", "weekly": "50m 24s", "daily": "7m 12s"},
    99.9:  {"monthly": "43m 49s", "weekly": "10m 4s", "daily": "1m 26s"},
    99.95: {"monthly": "21m 54s", "weekly": "5m 2s", "daily": "43s"},
    99.99: {"monthly": "4m 22s", "weekly": "1m 0s", "daily": "8.6s"},
}
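
The figures in this lookup table follow from one multiplication, so they can be derived for any target and window rather than hard-coded. A minimal sketch; the function names are illustrative and not part of any monitoring product:

```python
def allowed_downtime_seconds(sla_pct, window_hours):
    """Downtime budget, in seconds, for an uptime target over a window."""
    return (1 - sla_pct / 100) * window_hours * 3600


def format_duration(seconds):
    """Render a number of seconds as e.g. '43m 12s' (rounded to whole seconds)."""
    total = round(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    parts = []
    if hours:
        parts.append(f"{hours}h")
    if minutes:
        parts.append(f"{minutes}m")
    if secs or not parts:
        parts.append(f"{secs}s")
    return " ".join(parts)


# 99.9% over a 30-day (720 h) window
print(format_duration(allowed_downtime_seconds(99.9, 720)))  # 43m 12s
```

A fixed 30-day window gives 43m 12s at 99.9%, slightly less than the monthly figure in the table, which is based on an average-length month. Whichever window the contract uses, the monitoring calculation should use the same one.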

Latency SLA Tracking

Availability is necessary but insufficient. An API that responds slowly is failing even if it technically responds. Track latency SLAs separately:

-- Calculate latency SLA compliance over the last 30 days (PostgreSQL syntax)
WITH latency_data AS (
  SELECT
    endpoint,
    response_time_ms
  FROM api_checks
  WHERE timestamp > NOW() - INTERVAL '30 days'
)
SELECT
  endpoint,
  AVG(response_time_ms) AS avg_ms,
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY response_time_ms) AS p95_ms,
  PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY response_time_ms) AS p99_ms,
  -- Share of checks that finished within the 300 ms target
  COUNT(*) FILTER (WHERE response_time_ms <= 300) * 100.0 / COUNT(*) AS pct_within_300ms,
  CASE WHEN 
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY response_time_ms) <= 300 
    THEN 'MET' ELSE 'VIOLATED' 
  END AS latency_sla_status
FROM latency_data
GROUP BY endpoint
ORDER BY p95_ms DESC;

Setting Up SLA Monitoring Checks

Configure monitoring with SLA compliance in mind:

# SLA-oriented monitoring configuration
monitors:
  - name: "Payments API - SLA Monitor"
    url: "https://api.example.com/v2/health"
    interval: 60  # Check every minute for accurate uptime calculation
    regions:
      - us-east-1
      - eu-west-1
    sla:
      availability_target: 99.9
      latency_p95_target: 500  # ms
      error_rate_target: 0.1   # percent
    alerts:
      # SLA breach prediction (project current trajectory)
      - condition: "projected_monthly_availability < 99.9"
        severity: critical
        message: "Current performance trajectory will breach SLA this month"
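
The configuration above refers to a projected monthly availability without saying how it is derived. One possible approach, sketched here under stated assumptions, is to take the downtime already recorded this month and extend the recent downtime rate (for example, the last 24 hours) across the remaining minutes. The function name and parameters are hypothetical, and the month is assumed to be 30 days.

```python
def projected_monthly_availability(
    downtime_so_far_min,
    elapsed_min,
    recent_downtime_min,
    recent_window_min,
    month_min=30 * 24 * 60,
):
    """Project end-of-month availability (%) from the recent downtime rate."""
    if recent_window_min <= 0:
        raise ValueError("recent_window_min must be positive")
    remaining_min = max(month_min - elapsed_min, 0)
    # Assumption: the rest of the month behaves like the recent window.
    recent_rate = recent_downtime_min / recent_window_min
    projected_downtime = downtime_so_far_min + recent_rate * remaining_min
    projected_downtime = min(projected_downtime, month_min)
    return 100 * (1 - projected_downtime / month_min)


# 10 days in, 20 minutes down so far, 2 minutes down in the last 24 hours
projection = projected_monthly_availability(20, 10 * 1440, 2, 1440)
print(round(projection, 3))  # about 99.861, below a 99.9 target
```

A linear projection like this is deliberately pessimistic after an incident, since it assumes the recent failure rate persists. That is usually the right bias for an early warning, but the recent window should be long enough that a single failed check does not trigger a critical alert.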

SLA Credit Calculation

When SLAs are breached, calculate credits owed:

def calculate_sla_credits(
    actual_availability,
    sla_tiers,
    monthly_spend
):
    """
    Calculate SLA credits based on actual uptime vs committed levels.
    
    sla_tiers: List of (min_availability, credit_percentage) tuples, where
               credit_percentage applies when availability is at or above
               min_availability but below the next tier up,
               e.g. [(99.9, 0), (99.0, 10), (95.0, 25), (0, 100)].
               Include a 0 floor so every availability value matches a tier.
    """
    credit_percentage = 0
    
    # Walk tiers from the highest floor down; the first floor that the
    # actual availability reaches determines the credit.
    for min_availability, credit_pct in sorted(sla_tiers, reverse=True):
        if actual_availability >= min_availability:
            credit_percentage = credit_pct
            break
    
    credit_amount = monthly_spend * (credit_percentage / 100)
    
    return {
        "actual_availability": actual_availability,
        "credit_percentage": credit_percentage,
        "credit_amount": credit_amount,
        "monthly_spend": monthly_spend
    }

# Example SLA credit structure
SLA_TIERS = [
    (99.9, 0),    # No credit if SLA met (99.9% and above)
    (99.0, 10),   # 10% credit for 99.0% up to 99.9%
    (95.0, 25),   # 25% credit for 95.0% up to 99.0%
    (90.0, 50),   # 50% credit for 90.0% up to 95.0%
    (0, 100),     # Full credit below 90%
]

result = calculate_sla_credits(
    actual_availability=99.2,
    sla_tiers=SLA_TIERS,
    monthly_spend=5000
)
# result == {'actual_availability': 99.2, 'credit_percentage': 10,
#            'credit_amount': 500.0, 'monthly_spend': 5000}

Building SLA Reports

Automated SLA reports help both internal tracking and customer communication:

import calendar
from datetime import datetime, timezone

def generate_monthly_sla_report(api_name, month, year, checks_data):
    """Generate monthly SLA compliance report"""
    report = {
        "api": api_name,
        "period": f"{year}-{month:02d}",
        "generated_at": datetime.now(timezone.utc).isoformat(),
    }
    
    # Overall availability
    total = len(checks_data)
    successful = sum(1 for c in checks_data if c['status'] == 'success')
    availability = (successful / total * 100) if total > 0 else 0
    # Use the real length of the reporting month, not a fixed 30 days
    minutes_in_month = calendar.monthrange(year, month)[1] * 24 * 60
    
    report["availability"] = {
        "percentage": round(availability, 4),
        "target": 99.9,
        "met": availability >= 99.9,
        "total_checks": total,
        "successful_checks": successful,
        "downtime_minutes": round((1 - availability/100) * minutes_in_month, 1)
    }
    
    # Latency compliance (successful checks only)
    latencies = [c['response_time_ms'] for c in checks_data if c['status'] == 'success']
    if latencies:
        sorted_latencies = sorted(latencies)
        # Nearest-rank p95: the value at rank ceil(0.95 * n), in integer math
        p95_index = (95 * len(sorted_latencies) + 99) // 100 - 1
        
        report["latency"] = {
            "p50_ms": sorted_latencies[len(sorted_latencies) // 2],
            "p95_ms": sorted_latencies[p95_index],
            "target_p95_ms": 300,
            "met": sorted_latencies[p95_index] <= 300
        }
    
    # Incidents (find_incidents is a helper that must be supplied separately)
    incidents = find_incidents(checks_data)
    report["incidents"] = [
        {
            "start": inc['start'].isoformat(),
            "end": inc['end'].isoformat(),
            "duration_minutes": inc['duration_minutes'],
            "affected_regions": inc['regions']
        }
        for inc in incidents
    ]
    
    return report
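
The report generator calls a `find_incidents` helper that the article does not define. One possible shape for it is sketched below, assuming each check dict also carries a `timestamp` (a `datetime`) and a `region` field, neither of which is shown in the original data. Failures are merged into a single incident when they occur within two check intervals of each other, and an incident is taken to end one interval after its last failed check. Both rules are illustrative choices.

```python
from datetime import datetime, timedelta, timezone


def find_incidents(checks_data, check_interval=timedelta(minutes=1)):
    """Group failed checks that are close together in time into incidents."""
    failures = sorted(
        (c for c in checks_data if c["status"] != "success"),
        key=lambda c: c["timestamp"],
    )

    open_incidents = []
    for check in failures:
        ts = check["timestamp"]
        current = open_incidents[-1] if open_incidents else None
        # Extend the latest incident if this failure follows closely enough.
        if current and ts - current["last_failure"] <= 2 * check_interval:
            current["last_failure"] = ts
            current["regions"].add(check["region"])
        else:
            open_incidents.append(
                {"start": ts, "last_failure": ts, "regions": {check["region"]}}
            )

    incidents = []
    for inc in open_incidents:
        # Assumption: service recovered one interval after the last failure.
        end = inc["last_failure"] + check_interval
        incidents.append({
            "start": inc["start"],
            "end": end,
            "duration_minutes": round((end - inc["start"]).total_seconds() / 60, 1),
            "regions": sorted(inc["regions"]),
        })
    return incidents
```

Merging by time gap rather than by strictly consecutive checks keeps a single-region outage from being split into many one-check incidents when healthy results from other regions are interleaved with the failures.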

Multi-Tier API SLA Management

Different customers often get different SLAs based on their plan:

| Plan | Availability | P95 Latency | Support Response | Cost |
|---|---|---|---|---|
| Free | 99.0% | 1000ms | Community | $0 |
| Pro | 99.5% | 500ms | 24h email | $99/mo |
| Business | 99.9% | 300ms | 4h priority | $499/mo |
| Enterprise | 99.99% | 150ms | 1h dedicated | Custom |

Track SLA compliance separately for each tier:

def check_sla_compliance_by_tier(
    monitoring_data,
    customer_tiers
):
    """Check if each tier is meeting its SLA commitments"""
    results = {}
    
    for tier_name, tier_config in customer_tiers.items():
        tier_checks = [
            c for c in monitoring_data
            if c['customer_tier'] == tier_name
        ]
        
        availability = calculate_availability(tier_checks)
        p95_latency = calculate_p95_latency(tier_checks)
        
        results[tier_name] = {
            "availability": {
                "actual": availability,
                "target": tier_config['availability_sla'],
                "compliant": availability >= tier_config['availability_sla']
            },
            "latency": {
                "p95_actual": p95_latency,
                "target": tier_config['latency_sla'],
                "compliant": p95_latency <= tier_config['latency_sla']
            }
        }
    
    return results
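
The tier check above relies on two helpers, `calculate_availability` and `calculate_p95_latency`, that are not defined in the article. A minimal sketch of both follows, assuming each check is a dict with the `status` and `response_time_ms` fields used in the report code. Raising on empty input is a design choice made here so that a tier with no traffic is not silently reported as compliant or as breached.

```python
def calculate_availability(checks):
    """Percentage of checks whose status is 'success'."""
    if not checks:
        raise ValueError("no checks to evaluate")
    successful = sum(1 for c in checks if c["status"] == "success")
    return 100.0 * successful / len(checks)


def calculate_p95_latency(checks):
    """Nearest-rank 95th percentile latency (ms) of successful checks."""
    latencies = sorted(
        c["response_time_ms"] for c in checks if c["status"] == "success"
    )
    if not latencies:
        raise ValueError("no successful checks to evaluate")
    # Rank is ceil(0.95 * n), computed with integers to avoid float rounding.
    rank = (95 * len(latencies) + 99) // 100
    return latencies[rank - 1]
```

Callers that want to tolerate tiers with no data can catch the `ValueError` and record the tier as "not measured" instead.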

Alerting for SLA Risk

Set up proactive alerts before SLA breaches occur:

alerts:
  # Alert when monthly budget is 50% consumed
  - name: "SLA Budget Half Consumed"
    condition: |
      (monthly_downtime_minutes / sla_allowed_downtime_minutes) > 0.5
    severity: warning
    message: "More than 50% of monthly SLA downtime budget consumed"
    
  # Alert when on track to breach
  - name: "SLA Breach Predicted"
    condition: |
      projected_end_of_month_availability < sla_target
    severity: critical
    message: "Current trajectory will breach SLA - immediate action required"
    
  # Alert on sustained latency degradation
  - name: "Latency SLA At Risk"
    condition: |
      p95_latency_last_hour > (latency_sla_target * 0.9)
    severity: warning
    message: "Latency within 10% of SLA limit"
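
The budget condition in the first alert compares downtime used against downtime allowed. The sketch below shows one way that ratio could be computed and compared with how far through the month you are, assuming a 30-day window. The function name, the returned keys, and the "burning too fast" rule are illustrative and not part of any specific alerting product.

```python
def error_budget_status(
    downtime_minutes,
    sla_target_pct,
    elapsed_fraction,
    window_minutes=30 * 24 * 60,
):
    """Summarize how much of the downtime budget has been used."""
    allowed = (1 - sla_target_pct / 100) * window_minutes
    if allowed <= 0:
        raise ValueError("a 100% target leaves no downtime budget")
    consumed = downtime_minutes / allowed
    return {
        "allowed_minutes": round(allowed, 2),
        "consumed_fraction": round(consumed, 4),
        # Mirrors the "budget half consumed" warning condition
        "half_consumed": consumed > 0.5,
        # Budget is being used faster than the month is passing
        "burning_too_fast": consumed > elapsed_fraction,
    }


# 25 minutes of downtime halfway through the month against a 99.9% target
status = error_budget_status(25, 99.9, elapsed_fraction=0.5)
```

Comparing consumption with elapsed time separates two situations the fixed 50% threshold cannot: half the budget gone on day 3 is urgent, while half gone on day 28 is usually fine.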

Conclusion

API SLA monitoring turns commitments into accountable metrics. It requires precise definitions, continuous measurement, and automatic reporting — not manual calculations after the fact. The teams that handle SLAs best are those that monitor continuously, alert before breaches happen, and have automated reports ready for customer reviews. AzMonitor provides the continuous uptime and latency tracking needed to measure API SLA compliance accurately, with historical data that makes generating monthly reports straightforward rather than painful.

AzMonitor Team
The AzMonitor team writes guides based on experience monitoring millions of endpoints daily across 10,000+ customer environments. Our expertise covers uptime monitoring, SRE practices, and reliability engineering.