API Performance Benchmarking: Establishing Baselines and Detecting Regressions

Learn how to benchmark API performance, establish latency baselines, detect performance regressions in CI/CD, and set meaningful response time thresholds.

AzMonitor Team · March 12, 2025 · 9 min read · 1,600 words · Updated January 20, 2026
Tags: API performance, benchmarking, latency, performance testing, k6

API performance benchmarking answers a question that uptime monitoring can't: not just "is the API responding?" but "is it responding as fast as it should?" Establishing performance baselines before problems occur is the only way to detect regressions — without a baseline, you can't tell whether 800ms is normal or 2x slower than last week. Benchmarking creates the reference point that makes performance degradation visible.

Why Baselines Matter

The most common performance monitoring mistake is setting thresholds based on intuition rather than measurement. Teams choose 2000ms as a threshold because it sounds reasonable, without knowing their actual P99 is 150ms. When something genuinely degrades, the monitor doesn't fire until it's 13x slower — which means the degradation has been happening for a long time before it's noticed.

A proper performance baseline gives you:

  • Accurate thresholds — Set alerts at 2x your P95, not an arbitrary number
  • Regression detection — Compare current performance against historical baseline
  • Deployment validation — Automatically flag deployments that regress performance
  • SLO input — Base latency SLOs on measured reality, not aspirational targets
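
To make the threshold arithmetic concrete, here is a minimal sketch using the figures from the example above (a guessed 2000ms threshold against a measured 150ms P99). The helper names `slowdown_before_alert` and `threshold_from_p95`, and the 120ms P95 value, are illustrative assumptions rather than measurements.

```python
def slowdown_before_alert(threshold_ms: float, baseline_ms: float) -> float:
    """How many times slower than baseline a request must be before the alert fires."""
    if baseline_ms <= 0:
        raise ValueError("baseline_ms must be positive")
    return threshold_ms / baseline_ms


def threshold_from_p95(p95_ms: float, multiplier: float = 2.0) -> float:
    """Derive an alert threshold from a measured P95 instead of guessing."""
    return p95_ms * multiplier


# Guessed threshold from the text: 2000ms against a measured P99 of 150ms
guessed = slowdown_before_alert(2000, 150)

# Hypothetical measured P95 of 120ms, using the 2x rule
measured_threshold = threshold_from_p95(120)
measured = slowdown_before_alert(measured_threshold, 120)

print(f"guessed threshold fires at {guessed:.1f}x baseline")
print(f"measured threshold ({measured_threshold:.0f}ms) fires at {measured:.1f}x baseline")
```

The guessed threshold leaves a blind spot of roughly 13x, while the measured one fires as soon as latency doubles.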

Benchmark Tools

k6 for Load Testing and Benchmarking

// k6-benchmark.js
import http from 'k6/http';
import { check, sleep } from 'k6';
import { Rate, Trend } from 'k6/metrics';

// Custom metrics
const errorRate = new Rate('errors');
const apiLatency = new Trend('api_latency', true);

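export const options = {
  // Baseline measurement scenario (not load test)
  scenarios: {
    baseline: {
      executor: 'constant-vus',
      vus: 5,          // Low concurrency for baseline
      duration: '5m',  // Enough samples for statistical significance
    }
  },
  // p(99) is not in k6's default summary stats; request it explicitly
  // so handleSummary can read values['p(99)']
  summaryTrendStats: ['min', 'avg', 'med', 'p(90)', 'p(95)', 'p(99)', 'max'],
  thresholds: {
    http_req_duration: ['p(95)<500', 'p(99)<1000'],
    errors: ['rate<0.01'],  // Less than 1% errors
  }
};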

export default function() {
  const BASE_URL = __ENV.BASE_URL || 'https://api.example.com';
  const TOKEN = __ENV.API_TOKEN;
  
  const params = {
    headers: {
      'Authorization': `Bearer ${TOKEN}`,
      'Content-Type': 'application/json',
    },
    timeout: '10s',
  };
  
  // Benchmark key endpoints
  const endpoints = [
    { name: 'list_users', url: `${BASE_URL}/v3/users?limit=20` },
    { name: 'get_user', url: `${BASE_URL}/v3/users/test-user-id` },
    { name: 'search', url: `${BASE_URL}/v3/search?q=test` },
    { name: 'health', url: `${BASE_URL}/health` },
  ];
  
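  for (const endpoint of endpoints) {
    // Tag each request with a stable name so results group by endpoint
    // rather than by raw URL (k6's default value for the "name" tag)
    const res = http.get(
      endpoint.url,
      Object.assign({}, params, { tags: { name: endpoint.name } })
    );
    
    // Record k6's own request timing (send + wait + receive); a wall-clock
    // Date.now() delta would also include connection setup and script overhead
    apiLatency.add(res.timings.duration, { endpoint: endpoint.name });
    
    check(res, {
      'status is 200': (r) => r.status === 200,
      'response time < 500ms': (r) => r.timings.duration < 500,
    });
    
    // Count only failed responses as errors; slow responses are already
    // covered by the http_req_duration thresholds
    errorRate.add(res.status !== 200);
    
    sleep(0.1); // Small delay between requests
  }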
}

export function handleSummary(data) {
  // Output detailed benchmark results
  const results = {};
  
  for (const [metric, values] of Object.entries(data.metrics)) {
    if (metric.startsWith('http_req_duration') || metric === 'api_latency') {
      results[metric] = {
        min: values.values.min,
        avg: values.values.avg,
        med: values.values.med,
        p90: values.values['p(90)'],
        p95: values.values['p(95)'],
        p99: values.values['p(99)'],
        max: values.values.max,
      };
    }
  }
  
  return {
    'benchmark-results.json': JSON.stringify({ results, timestamp: new Date().toISOString() }, null, 2),
    stdout: `\nBenchmark complete. Results saved to benchmark-results.json\n`,
  };
}

Apache Bench for Quick Baselines

# Quick benchmark of a specific endpoint
# 1000 requests, 10 concurrent users
ab -n 1000 -c 10 \
  -H "Authorization: Bearer ${API_TOKEN}" \
  -H "Content-Type: application/json" \
  https://api.example.com/v3/users

# Output:
# Requests per second:    245.30 [#/sec] (mean)
# Time per request:       40.766 [ms] (mean)
# 
# Percentage of the requests served within a certain time (ms)
#   50%     38
#   66%     41
#   75%     44
#   80%     46
#   90%     54
#   95%     62
#   98%     78
#   99%     89
#  100%    203 (longest request)
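
To turn a quick `ab` run into numbers you can store as a baseline, the percentile table can be parsed directly. The sketch below works on the sample output shown above; the function name `parse_ab_percentiles` is hypothetical, and the regular expression assumes the table layout stays as printed here.

```python
import re

# Percentile table copied from the sample ab output above
AB_OUTPUT = """\
Percentage of the requests served within a certain time (ms)
  50%     38
  66%     41
  75%     44
  80%     46
  90%     54
  95%     62
  98%     78
  99%     89
 100%    203 (longest request)
"""


def parse_ab_percentiles(text: str) -> dict[int, int]:
    """Map percentile -> latency in ms from ab's percentile table."""
    percentiles = {}
    for line in text.splitlines():
        # Rows look like "  95%     62"; the header line does not match
        match = re.match(r"\s*(\d+)%\s+(\d+)", line)
        if match:
            percentiles[int(match.group(1))] = int(match.group(2))
    return percentiles


table = parse_ab_percentiles(AB_OUTPUT)
print(f"P95={table[95]}ms P99={table[99]}ms max={table[100]}ms")
```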

wrk for More Realistic Load Patterns

# wrk benchmark: 4 threads, 50 open connections, 60 seconds,
# with the latency distribution printed (--latency)
wrk -t4 -c50 -d60s \
  --latency \
  -H "Authorization: Bearer ${API_TOKEN}" \
  https://api.example.com/v3/users

# Output includes:
# Latency Distribution
#    50%    38.24ms
#    75%    48.12ms
#    90%    62.55ms
#    99%   112.34ms

Establishing a Baseline

Run benchmarks in a consistent, controlled environment:

# benchmark_baseline.py
import json
import subprocess
import statistics
from datetime import datetime
from pathlib import Path

def run_k6_benchmark(script, env_vars, output_file):
    """Run k6 benchmark and capture results"""
    cmd = [
        "k6", "run",
        "--env", f"BASE_URL={env_vars['BASE_URL']}",
        "--env", f"API_TOKEN={env_vars['API_TOKEN']}",
        "--out", f"json={output_file}",
        script
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0

def extract_percentiles(k6_output_file):
    """Extract key percentiles from k6 JSON output"""
    metrics = {}
    
    with open(k6_output_file) as f:
        for line in f:
            try:
                entry = json.loads(line)
                if entry.get("type") == "Point" and entry.get("metric") == "http_req_duration":
                    endpoint = entry.get("data", {}).get("tags", {}).get("name", "unknown")
                    value = entry["data"]["value"]
                    
                    if endpoint not in metrics:
                        metrics[endpoint] = []
                    metrics[endpoint].append(value)
            except json.JSONDecodeError:
                continue
    
    baselines = {}
    for endpoint, values in metrics.items():
        sorted_values = sorted(values)
        n = len(sorted_values)
        
        baselines[endpoint] = {
            "sample_count": n,
            "p50_ms": sorted_values[int(n * 0.50)],
            "p90_ms": sorted_values[int(n * 0.90)],
            "p95_ms": sorted_values[int(n * 0.95)],
            "p99_ms": sorted_values[int(n * 0.99)],
            "max_ms": sorted_values[-1],
            "mean_ms": statistics.mean(values),
            "stdev_ms": statistics.stdev(values) if len(values) > 1 else 0
        }
    
    return baselines

def save_baseline(baselines, baseline_path):
    """Save baseline with metadata"""
    baseline = {
        # Timezone-aware timestamp; datetime.utcnow() is deprecated since Python 3.12
        "created_at": datetime.now().astimezone().isoformat(),
        "environment": "production",
        "metrics": baselines,
        "thresholds": {
            endpoint: {
                # Alert threshold = 2x P95 baseline
                "warning_ms": data["p95_ms"] * 2,
                # Critical threshold = 3x P95 baseline
                "critical_ms": data["p95_ms"] * 3
            }
            for endpoint, data in baselines.items()
        }
    }
    
    Path(baseline_path).write_text(json.dumps(baseline, indent=2))
    return baseline
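
The index arithmetic in `extract_percentiles` (`sorted_values[int(n * p)]`) is a close approximation for large samples, but when `n * p` is a whole number it selects the element one position above the nearest-rank percentile. With exactly 100 samples, for example, P99 lands on the maximum. A minimal sketch of the nearest-rank definition follows; the helper name `nearest_rank` is an assumption, not part of the script above.

```python
import math


def nearest_rank(sorted_values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not sorted_values:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    rank = math.ceil(pct * len(sorted_values) / 100)  # 1-based rank
    return sorted_values[rank - 1]


# Synthetic samples: 1ms, 2ms, ..., 100ms
samples = sorted(float(ms) for ms in range(1, 101))

nearest = nearest_rank(samples, 99)
by_index = samples[int(len(samples) * 0.99)]

print(f"nearest-rank P99: {nearest}ms, index-based P99: {by_index}ms")
```

For five-minute runs with thousands of samples the difference is negligible; it matters mostly for short smoke benchmarks with few requests.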

Regression Detection in CI/CD

Run performance tests on every deployment and compare against baseline:

# regression_detector.py
import json
import sys
from pathlib import Path

def detect_regression(current_results, baseline_path, threshold_multiplier=2.0):
    """
    Compare current benchmark results against baseline.
    Returns list of regressions detected.
    """
    baseline = json.loads(Path(baseline_path).read_text())
    regressions = []
    
    for endpoint, current in current_results.items():
        if endpoint not in baseline["metrics"]:
            # New endpoint, no baseline to compare
            continue
        
        baseline_p95 = baseline["metrics"][endpoint]["p95_ms"]
        current_p95 = current["p95_ms"]
        
        ratio = current_p95 / baseline_p95 if baseline_p95 > 0 else float('inf')
        
        if ratio >= threshold_multiplier:
            regressions.append({
                "endpoint": endpoint,
                "baseline_p95_ms": baseline_p95,
                "current_p95_ms": current_p95,
                "ratio": round(ratio, 2),
                "regression_pct": round((ratio - 1) * 100, 1),
                "severity": "critical" if ratio >= 3.0 else "warning"
            })
    
    return regressions

def generate_regression_report(regressions, current_results, baseline):
    """Generate CI-friendly report"""
    if not regressions:
        print("Performance check PASSED — no regressions detected")
        return True
    
    print(f"\nPerformance regression detected in {len(regressions)} endpoint(s):\n")
    print(f"{'Endpoint':<30} {'Baseline P95':>12} {'Current P95':>12} {'Change':>10} {'Severity':>10}")
    print("-" * 80)
    
    for reg in regressions:
        print(
            f"{reg['endpoint']:<30} "
            f"{reg['baseline_p95_ms']:>10.0f}ms "
            f"{reg['current_p95_ms']:>10.0f}ms "
            f"{reg['regression_pct']:>+9.1f}% "
            f"{reg['severity']:>10}"
        )
    
    print("\nFailing CI check due to performance regression.")
    return False

# In CI pipeline
if __name__ == "__main__":
    current_file = sys.argv[1]
    baseline_file = sys.argv[2]
    
    current = json.loads(Path(current_file).read_text())
    baseline = json.loads(Path(baseline_file).read_text())
    
    regressions = detect_regression(current["metrics"], baseline_file)
    passed = generate_regression_report(regressions, current, baseline)
    
    sys.exit(0 if passed else 1)
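
As a worked example of the ratio rule, the sketch below applies the same comparison to a few in-memory values. The `classify` helper and all P95 figures are hypothetical; it mirrors the 2.0 and 3.0 cut-offs used in `detect_regression` rather than importing it.

```python
def classify(baseline_p95_ms: float, current_p95_ms: float,
             warn_ratio: float = 2.0, critical_ratio: float = 3.0) -> str:
    """Return 'ok', 'warning', or 'critical' from the current/baseline P95 ratio."""
    ratio = current_p95_ms / baseline_p95_ms if baseline_p95_ms > 0 else float("inf")
    if ratio >= critical_ratio:
        return "critical"
    if ratio >= warn_ratio:
        return "warning"
    return "ok"


# Hypothetical P95 values in ms: (baseline, current)
samples = {
    "list_users": (60, 75),   # 1.25x slower
    "get_user": (40, 90),     # 2.25x slower
    "search": (200, 640),     # 3.2x slower
}

results = {name: classify(base, cur) for name, (base, cur) in samples.items()}

for name, (base, cur) in samples.items():
    print(f"{name:<12} {cur / base:.2f}x -> {results[name]}")
```

Note that with the default multiplier of 2.0, the 1.25x slowdown passes the gate. Lowering `threshold_multiplier` catches smaller regressions at the cost of more failures caused by run-to-run noise.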

CI/CD Integration

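# GitHub Actions workflow with performance gates
name: API Performance Benchmark

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

jobs:
  benchmark:
    runs-on: ubuntu-latest
    
    steps:
      - uses: actions/checkout@v4
      
      # k6 is not in Ubuntu's default apt repositories, so install it
      # with Grafana's setup action instead of a bare apt-get install
      - name: Install k6
        uses: grafana/setup-k6-action@v1
      
      - name: Deploy to staging
        run: ./scripts/deploy-staging.sh
      
      - name: Wait for deployment
        run: ./scripts/wait-for-healthy.sh "${{ vars.STAGING_URL }}"
      
      - name: Run performance benchmark
        env:
          BASE_URL: ${{ vars.STAGING_URL }}
          API_TOKEN: ${{ secrets.STAGING_API_TOKEN }}
        run: |
          k6 run \
            --env BASE_URL="$BASE_URL" \
            --env API_TOKEN="$API_TOKEN" \
            --out json=benchmark-raw.json \
            ./tests/k6-benchmark.js
      
      # --out json writes one JSON object per line; convert it into the
      # per-endpoint summary that regression_detector.py expects.
      # Assumes benchmark_baseline.py lives in ./scripts alongside it.
      - name: Summarize results
        run: |
          PYTHONPATH=./scripts python3 -c "
          from benchmark_baseline import extract_percentiles, save_baseline
          save_baseline(extract_percentiles('benchmark-raw.json'), 'benchmark-current.json')
          "
      
      # Assumes AWS credentials are already configured for this job
      - name: Download baseline from S3
        run: |
          aws s3 cp s3://benchmarks/baseline.json ./baseline.json
      
      - name: Check for regressions
        run: |
          python3 ./scripts/regression_detector.py \
            benchmark-current.json \
            baseline.json
      
      - name: Update baseline (main branch only)
        if: github.ref == 'refs/heads/main' && success()
        run: |
          aws s3 cp ./benchmark-current.json \
            s3://benchmarks/baseline.json
      
      - name: Upload results
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: benchmark-results
          path: benchmark-current.json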

Latency Percentiles and What They Tell You

Different decisions call for different percentiles:

| Percentile | Use Case | What It Measures |
|---|---|---|
| P50 (median) | "Typical" user experience | Half of requests are faster than this |
| P90 | "Most" user experience | 90% of requests are faster than this |
| P95 | SLO target setting | 95% of requests are faster than this; the standard latency SLO target |
| P99 | Identifying tail latency | Only the slowest 1% of requests exceed this |
| P99.9 | Enterprise SLA targets | Only the slowest 0.1% of requests exceed this |
| Max | Outlier detection | Single worst request |
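
A small synthetic sample shows why these percentiles answer different questions. The latencies below are invented for illustration, and the sketch uses the standard library's `statistics.quantiles` with the inclusive method, which interpolates between neighbouring samples.

```python
import statistics

# Hypothetical latencies in ms: 90 fast requests, 9 slow ones, 1 extreme outlier
latencies = [40] * 90 + [400] * 9 + [3000]

# n=100 yields 99 cut points; cuts[k - 1] is the k-th percentile
cuts = statistics.quantiles(latencies, n=100, method="inclusive")
p50, p95, p99 = cuts[49], cuts[94], cuts[98]
mean = statistics.mean(latencies)

print(f"mean={mean}ms p50={p50}ms p95={p95}ms p99={p99}ms max={max(latencies)}ms")
```

Here the mean (102ms) matches no real request, the median hides the slow tail entirely, P95 and P99 expose it, and the maximum reflects a single outlier.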

def set_monitoring_thresholds_from_baseline(baseline):
    """
    Derive monitoring thresholds from measured baseline.
    """
    thresholds = {}
    
    for endpoint, metrics in baseline["metrics"].items():
        p95 = metrics["p95_ms"]
        p99 = metrics["p99_ms"]
        
        thresholds[endpoint] = {
            # Uptime monitor: alert if P50 exceeds 2x baseline P95
            # (This means something is very wrong)
            "uptime_alert_threshold_ms": p95 * 2,
            
            # Performance alert: P95 is 50% above baseline
            "performance_warning_ms": p95 * 1.5,
            
            # Performance critical: P95 has doubled
            "performance_critical_ms": p95 * 2,
            
            # SLO target: 99% of requests under this threshold
            "slo_p99_target_ms": p99 * 1.2,  # 20% headroom
        }
    
    return thresholds

Continuous Performance Monitoring

Beyond CI/CD gates, run continuous performance benchmarks against production:

# Scheduled performance benchmark — runs every 6 hours
# This detects gradual performance drift that CI tests miss

monitors:
  # AzMonitor handles continuous uptime + basic response time
  - name: "API - Users Endpoint"
    url: "https://api.example.com/v3/users?limit=1"
    interval: 60
    assertions:
      - type: response_time
        operator: less_than
        value: 500  # Set from measured P95 baseline
        
  - name: "API - Search Endpoint"
    url: "https://api.example.com/v3/search?q=test"
    interval: 60
    assertions:
      - type: response_time
        operator: less_than
        value: 1200  # Search is naturally slower

# Separate scheduled k6 benchmark for detailed percentile tracking
scheduled_benchmarks:
  - name: "production-performance-baseline"
    schedule: "0 */6 * * *"  # Every 6 hours
    script: "k6-benchmark.js"
    environment: production
    compare_against: rolling_7d_baseline
    alert_on_regression: true
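
The `rolling_7d_baseline` comparison in the configuration above is not specified further, so the following is only one plausible reading: take the median of the P95 values recorded over the previous seven days and compare the latest run against it. The function name, the history format, and the 1.3x alert ratio are all assumptions for this sketch.

```python
from datetime import datetime, timedelta, timezone
from statistics import median


def rolling_baseline_p95(history, now, window_days=7):
    """Median of the P95 values recorded within the last `window_days`, or None."""
    cutoff = now - timedelta(days=window_days)
    recent = [p95 for ts, p95 in history if cutoff <= ts <= now]
    return median(recent) if recent else None


now = datetime(2026, 1, 20, 12, 0, tzinfo=timezone.utc)

# Hypothetical history: one run every 6 hours for 10 days, P95 alternating 102/98 ms
history = [
    (now - timedelta(hours=6 * i), 100.0 + (2 if i % 2 else -2))
    for i in range(1, 41)
]

baseline = rolling_baseline_p95(history, now)
current_p95 = 135.0            # hypothetical latest run
ratio = current_p95 / baseline
ALERT_RATIO = 1.3              # assumed drift threshold

print(f"baseline={baseline}ms current={current_p95}ms ratio={ratio:.2f}")
print("drift alert" if ratio >= ALERT_RATIO else "within tolerance")
```

One design caveat: a rolling baseline follows slow drift upward, so it is worth also keeping a fixed reference baseline to compare against periodically.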

Conclusion

API performance benchmarking transforms monitoring from reactive to proactive. Without baselines, you're guessing at thresholds and missing gradual degradation that happens too slowly to trigger point-in-time alerts. With baselines, you have the context to detect a 30% slowdown before it becomes a user-visible problem, and to automatically block deployments that regress performance. AzMonitor's response time assertions, when configured from measured baselines rather than guesses, become gates that catch the regressions that matter, with alerting thresholds grounded in real performance data.
