Monitoring API latency with averages is like measuring traffic congestion by counting total cars — the number tells you something, but misses the experience of individual drivers stuck in gridlock. Percentiles tell the true story of your API's performance, revealing the experience of every user, including the 1% who wait 10 seconds while everyone else gets a sub-second response.

Why Averages Lie About API Performance

Consider an API endpoint with these response times over 1,000 requests:

990 requests: 50ms
9 requests: 200ms
1 request: 10,000ms

Average response time: (990 × 50 + 9 × 200 + 1 × 10,000) / 1,000 = 61.3ms

An average of 61ms looks excellent. But 1 in every 1,000 requests took 10 seconds, which at scale means thousands of bad experiences per day.

Percentiles reveal what the average hides:

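P50: 50ms (the typical experience)
P90: 50ms (90% of requests are this fast or faster)
P95: 50ms
P99: 50ms (the 990 fastest requests all took 50ms)
P999: 200ms (99.9th percentile)
Max: 10,000ms (the outlier)

In this sample the slow requests make up exactly 1% of traffic, so they sit just above P99. P999 and the maximum expose the tail clearly, and that signal is actionable. The average is not.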
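The numbers for this sample can be checked with a short script. This is a minimal sketch that assumes the nearest-rank definition of a percentile (other definitions interpolate and give slightly different values); the function name `nearest_rank` is only illustrative.

```python
import math

def nearest_rank(sorted_values, pct):
    """Smallest sample such that at least pct% of samples are at or below it."""
    # round() guards against float noise such as 999.0000000000001 before ceil().
    rank = math.ceil(round(pct * len(sorted_values) / 100, 9))
    return sorted_values[max(rank, 1) - 1]

# The sample distribution from the example, in milliseconds.
latencies_ms = sorted([50] * 990 + [200] * 9 + [10_000])

mean_ms = sum(latencies_ms) / len(latencies_ms)
summary = {p: nearest_rank(latencies_ms, p) for p in (50, 90, 95, 99, 99.9, 100)}

print(f"mean = {mean_ms:.1f}ms")
for p, value in summary.items():
    print(f"P{p} = {value}ms")
```

With only 1% of requests being slow, the percentiles up to P99 stay at the fast value under this definition, which is why it pays to look at P99.9 and the maximum as well.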

Understanding Percentile Calculations

P50 (50th percentile / Median): Half of all requests complete in this time or faster. The "typical" user experience.

P75 (75th percentile): 75% of requests complete in this time or faster. Three out of four requests are at least this fast.

P90 (90th percentile): 90% of requests are at or below this latency. The slowest 10% of requests sit above it.

P95 (95th percentile): 95% of requests complete in this time or faster. A common alerting threshold for APIs.

P99 (99th percentile): 99% of requests complete in this time or faster. The slowest 1% sit above it: extreme outliers, but still real users.

P999 (99.9th percentile): The slowest 0.1% — used by high-volume services where even 0.1% = thousands of users.
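To see why a tiny tail fraction still matters at volume, the arithmetic can be sketched as follows. The daily request count is a hypothetical figure chosen only for illustration.

```python
def requests_slower_than(percentile, daily_requests):
    """Approximate number of daily requests that are slower than the given percentile."""
    tail_fraction = (100 - percentile) / 100
    return round(daily_requests * tail_fraction)

daily_requests = 5_000_000  # hypothetical traffic volume

for p in (50, 95, 99, 99.9):
    slower = requests_slower_than(p, daily_requests)
    print(f"P{p}: about {slower:,} requests per day are slower than this value")
```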

Setting API Latency Thresholds by Percentile

| Endpoint Type | P50 Target | P95 Target | P99 Target |
|-------------|-----------|-----------|-----------|
| Simple read API | < 50ms | < 200ms | < 500ms |
| Complex query API | < 200ms | < 800ms | < 2000ms |
| Write/mutation API | < 100ms | < 500ms | < 1500ms |
| Search API | < 200ms | < 800ms | < 2000ms |
| Authentication | < 100ms | < 300ms | < 1000ms |
| Webhook endpoint | < 500ms | < 2000ms | < 5000ms |

These are targets, not requirements. Your actual targets should be based on your current baseline and user expectations for your specific application.
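One way to put such targets to work is to compare measured percentiles against them automatically. The sketch below encodes the table's values; the dictionary keys, the function name `missed_targets`, and the measured numbers are illustrative assumptions, not part of any particular tool.

```python
# Targets from the table, in milliseconds: (P50, P95, P99).
TARGETS_MS = {
    "simple_read": (50, 200, 500),
    "complex_query": (200, 800, 2000),
    "write_mutation": (100, 500, 1500),
    "search": (200, 800, 2000),
    "authentication": (100, 300, 1000),
    "webhook": (500, 2000, 5000),
}

def missed_targets(endpoint_type, measured_ms):
    """Return the percentiles whose measured value is not below its target."""
    names = ("p50", "p95", "p99")
    targets = dict(zip(names, TARGETS_MS[endpoint_type]))
    # The table states targets as "< X ms", so equal-to counts as a miss.
    return [name for name in names if measured_ms[name] >= targets[name]]

# Hypothetical measurements for a simple read endpoint.
measured = {"p50": 42, "p95": 180, "p99": 650}
print(missed_targets("simple_read", measured))
```

Replacing the tuples with values derived from your own baseline keeps the check aligned with the advice to treat these numbers as starting points.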

Alert Strategy for API Latency Percentiles

Alert on P95, page on P99:

Warning alert (P95 > threshold): Slack notification. Investigate when you have capacity. At least 5% of requests are slower than the threshold.
Critical alert (P99 > threshold): PagerDuty/SMS. Investigate immediately. The pattern suggests a deeper problem likely to worsen.

# AzMonitor API latency alerting
api_monitors:
  - endpoint: https://api.yourapp.com/v1/users
    response_time_alerts:
      - percentile: p95
        threshold: 800ms
        severity: warning
        channels: [slack]
      - percentile: p99
        threshold: 2000ms
        severity: critical
        channels: [pagerduty, sms]
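The decision logic behind a two-tier rule set like this can be sketched in a few lines. This is an illustration of the idea only, not how AzMonitor evaluates alerts internally; the rule structure and the `evaluate` function are assumptions made for the example.

```python
ALERT_RULES = [
    # Checked most severe first, so a P99 breach wins over a P95 breach.
    {"percentile": "p99", "threshold_ms": 2000, "severity": "critical",
     "channels": ["pagerduty", "sms"]},
    {"percentile": "p95", "threshold_ms": 800, "severity": "warning",
     "channels": ["slack"]},
]

def evaluate(measured_ms, rules=ALERT_RULES):
    """Return (severity, channels) for the first breached rule, or ("ok", [])."""
    for rule in rules:
        if measured_ms[rule["percentile"]] > rule["threshold_ms"]:
            return rule["severity"], rule["channels"]
    return "ok", []

# Hypothetical measurements over one evaluation window.
print(evaluate({"p95": 900, "p99": 1500}))
print(evaluate({"p95": 900, "p99": 2500}))
```

Ordering the rules by severity means a window that breaches both thresholds pages once instead of also sending a lower-priority notification.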

Measuring Percentiles at Scale

For production APIs handling thousands of requests per second, percentile calculation requires the right data structures. You can't store every request in memory.

Histogram-based approximation (recommended): Tools like Prometheus use bucket-based histograms that provide approximate percentiles with bounded memory:

# Histogram bucket upper bounds in seconds (these match the Prometheus client library defaults)
histogram_buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
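To make the trade-off concrete, here is a simplified sketch of estimating a percentile from bucket counts by interpolating linearly inside the bucket that contains the target rank. It is written in the same spirit as bucket-based tools but is not Prometheus's implementation; the class `LatencyHistogram` and its methods are hypothetical.

```python
import bisect
import math

BUCKETS_S = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]

class LatencyHistogram:
    def __init__(self, bounds=BUCKETS_S):
        self.bounds = list(bounds)
        # One counter per bucket plus a final overflow slot for larger values.
        self.counts = [0] * (len(self.bounds) + 1)

    def observe(self, seconds):
        # bisect_left finds the first bucket whose upper bound is >= the value.
        self.counts[bisect.bisect_left(self.bounds, seconds)] += 1

    def quantile(self, q):
        total = sum(self.counts)
        if total == 0:
            return math.nan
        rank = q * total
        cumulative = 0
        for i, count in enumerate(self.counts):
            previous = cumulative
            cumulative += count
            if count > 0 and cumulative >= rank:
                if i == len(self.bounds):
                    # Overflow slot: only a lower bound is known.
                    return self.bounds[-1]
                lower = self.bounds[i - 1] if i > 0 else 0.0
                upper = self.bounds[i]
                return lower + (upper - lower) * (rank - previous) / count
        return self.bounds[-1]

hist = LatencyHistogram()
for seconds in [0.05] * 990 + [0.2] * 9 + [10.0]:
    hist.observe(seconds)

print(f"estimated P50:   {hist.quantile(0.5):.4f}s")
print(f"estimated P99.9: {hist.quantile(0.999):.4f}s")
```

Memory use is fixed at one counter per bucket regardless of traffic, and the cost is precision: an estimate can only be as fine as the bucket it falls into, so bucket boundaries are best placed near the thresholds you alert on.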

HdrHistogram: High Dynamic Range histogram provides accurate percentiles with low memory usage — commonly used in benchmarking tools.

External monitoring (AzMonitor): External API monitoring provides exact measurements for each check — not statistical approximations. AzMonitor records the precise response time for every check and provides percentile analysis across time windows.

Tail Latency Amplification in Distributed Systems

In microservices architectures, tail latency amplifies as requests fan out across services:

User Request → Service A → Service B + Service C (parallel)
                          → Service D

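If each service has a P99 of 100ms, the percentiles themselves do not add up, but the chances of hitting a slow call do (assuming the services' latencies are independent):
B and C in parallel: both finish within 100ms with probability 0.99 × 0.99 ≈ 98%, so 100ms is only about the P98 of the parallel step
Full request (A, B, C, D): all four stay within 100ms with probability 0.99^4 ≈ 96%, so about 4% of user requests include at least one tail-latency call

Each additional service in your call chain gives a request another chance to land in a tail. A user request that touches 5 services, each with a P99 of 100ms, has a 1 − 0.99^5 ≈ 4.9% chance of hitting at least one call slower than 100ms. What is a 1-in-100 event for each service becomes roughly a 1-in-20 event for the user.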

This "tail latency amplification" means that individual service P99s must be much better than your user-facing P99 target.
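One way to quantify the effect is to compute the chance that a request touching several services hits at least one call in that service's tail. The sketch below assumes the services' latencies are independent, which real systems only approximate; both function names are illustrative.

```python
def tail_hit_probability(num_services, per_service_tail=0.01):
    """Chance that at least one of the calls lands in its own service's tail."""
    return 1 - (1 - per_service_tail) ** num_services

def required_per_service_tail(num_services, target_tail=0.01):
    """Per-service tail fraction that keeps the combined tail at target_tail."""
    return 1 - (1 - target_tail) ** (1 / num_services)

for n in (1, 2, 5, 10):
    print(f"{n:>2} services: {tail_hit_probability(n):.1%} of requests hit a slow call")

print(f"per-service tail budget for 5 services: {required_per_service_tail(5):.3%}")
```

For five services, the per-service budget comes out at about 0.2%, meaning each service would need to meet the latency goal at roughly its P99.8 for the user-facing request to meet it at P99.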

Common Causes of High Tail Latency

Database lock contention: Most queries run fast, but occasional queries hit a lock and wait. These become the P99 outliers.

Garbage collection pauses: JVM-based services (Java, Scala) experience periodic GC pauses. Go's garbage collector is designed for short pauses, but it can still slow down occasional requests.

Cold starts (serverless): AWS Lambda and similar serverless functions have cold start latency (100-1000ms) for infrequently-called functions. Cold starts are often the P99 outlier in serverless architectures.

Connection pool exhaustion: When all database connections are in use, new requests wait for one to become free. These waits can be significant and appear as P99 spikes.

Network congestion and packet loss: Occasional TCP retransmissions add 100-300ms to affected requests.

Monitoring External API Latency

If your service calls external APIs (Stripe, SendGrid, Auth0), their latency becomes your latency from the user's perspective. Monitor external API response times the same way you monitor your own:

third_party_monitors:
  - name: "Stripe API"
    url: https://api.stripe.com/v1/balance
    auth: bearer_token
    response_time_alert:
      p95: 1000ms
      p99: 3000ms

  - name: "SendGrid API"
    url: https://api.sendgrid.com/v3/mail/send
    method: POST
    response_time_alert:
      p95: 2000ms

AzMonitor's API monitoring records response time for every check — start monitoring your API latency and get percentile-based alerting that actually reflects user experience.


AzMonitor Team

The AzMonitor team writes guides based on experience monitoring millions of endpoints daily across 10,000+ customer environments. Our expertise covers uptime monitoring, SRE practices, and reliability engineering.


API Latency Monitoring: P50, P95, P99 Percentiles Explained
