The API gateway is the front door to your entire service catalog. Every request passes through it before reaching your backend services. When the gateway has problems — high latency, misconfigured routing, failed authentication plugins — the impact is total. Every API is affected simultaneously. Yet many teams monitor their backend services carefully while treating the gateway as an afterthought.
What to Monitor in an API Gateway
API gateways introduce a unique layer of infrastructure between clients and services. This layer can fail or degrade in ways that are distinct from application-level failures:
- Routing failures: A route misconfiguration can send traffic to the wrong backend or return 404s for valid endpoints.
- Plugin failures: Authentication plugins, rate limiters, and request transformers can fail while the gateway itself stays up.
- Latency overhead: The gateway adds latency to every request. This overhead should stay consistently small, typically single-digit milliseconds depending on the gateway. Spikes indicate gateway-level problems.
- TLS termination: Certificate issues at the gateway level affect all services simultaneously.
- Connection pool exhaustion: If the gateway runs out of upstream connections, all services degrade together.
AWS API Gateway Monitoring
AWS API Gateway exposes metrics through CloudWatch. The most important ones:
| Metric | Description | Alert Threshold |
|---|---|---|
| Count | Total number of API calls | Baseline deviation |
| Latency | Full request latency including integration | > 3000ms |
| IntegrationLatency | Backend response time (excludes gateway) | > 2500ms |
| 4XXError | Client-side errors | > 2% rate |
| 5XXError | Server-side errors | > 1% rate |
| CacheHitCount | Response cache hits | Track efficiency |
| CacheMissCount | Response cache misses | Track efficiency |
The difference between Latency and IntegrationLatency is the gateway's own overhead. If Latency is 500ms and IntegrationLatency is 490ms, the gateway adds ~10ms — normal. If the gateway adds 200ms, something's wrong.
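As a minimal sketch of that subtraction, the function below takes two aligned series of per-period values, one for Latency and one for IntegrationLatency, and returns the gateway's own overhead for each period. The function name and the sample numbers are illustrative assumptions, not CloudWatch output.

```python
def gateway_overhead_ms(latency_ms, integration_latency_ms):
    """Per-period gateway overhead: Latency minus IntegrationLatency."""
    if len(latency_ms) != len(integration_latency_ms):
        raise ValueError("both series must cover the same periods")
    # Clamp at zero so small measurement skew never yields a negative overhead.
    return [
        max(0.0, total - backend)
        for total, backend in zip(latency_ms, integration_latency_ms)
    ]


# Hypothetical per-period averages in milliseconds.
latency = [500.0, 510.0, 720.0]
integration = [490.0, 498.0, 515.0]
print(gateway_overhead_ms(latency, integration))  # [10.0, 12.0, 205.0]
```

The last period is the kind of reading worth investigating: the backend barely slowed down, but the gateway's share jumped to roughly 200ms.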
CloudWatch Alarms for API Gateway
# Create CloudWatch alarm for 5XX errors
aws cloudwatch put-metric-alarm \
  --alarm-name "APIGateway-5XXErrors-High" \
  --alarm-description "API Gateway 5XX error rate above 1%" \
  --metric-name 5XXError \
  --namespace AWS/ApiGateway \
  --dimensions Name=ApiName,Value=my-api \
  --statistic Average \
  --period 300 \
  --threshold 0.01 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --alarm-actions "arn:aws:sns:us-east-1:123456789:alerts"
# Alarm for latency (the ApiName dimension is required; without it the alarm matches no metric)
aws cloudwatch put-metric-alarm \
  --alarm-name "APIGateway-Latency-High" \
  --alarm-description "API Gateway p99 latency above 3000ms" \
  --metric-name Latency \
  --namespace AWS/ApiGateway \
  --dimensions Name=ApiName,Value=my-api \
  --extended-statistic p99 \
  --period 300 \
  --threshold 3000 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 3 \
  --alarm-actions "arn:aws:sns:us-east-1:123456789:alerts"
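When the same alarms are needed for several APIs, it can help to generate the parameters in code rather than repeat the CLI call. The sketch below only builds the keyword arguments for CloudWatch's `put_metric_alarm` operation; the helper name `build_alarm` and the alarm naming scheme are assumptions made for illustration.

```python
def build_alarm(api_name, metric, threshold, sns_topic_arn,
                statistic="Average", period_s=300, evaluation_periods=2):
    """Build keyword arguments for a CloudWatch put_metric_alarm call."""
    params = {
        "AlarmName": f"APIGateway-{api_name}-{metric}-High",
        "Namespace": "AWS/ApiGateway",
        "MetricName": metric,
        "Dimensions": [{"Name": "ApiName", "Value": api_name}],
        "Period": period_s,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }
    # Percentiles such as "p99" belong in ExtendedStatistic;
    # Average, Sum, Minimum, Maximum and SampleCount belong in Statistic.
    if statistic.startswith("p"):
        params["ExtendedStatistic"] = statistic
    else:
        params["Statistic"] = statistic
    return params


topic = "arn:aws:sns:us-east-1:123456789:alerts"
error_alarm = build_alarm("my-api", "5XXError", 0.01, topic)
latency_alarm = build_alarm("my-api", "Latency", 3000, topic,
                            statistic="p99", evaluation_periods=3)
```

With boto3 installed and credentials configured, each dictionary could then be passed along as `boto3.client("cloudwatch").put_metric_alarm(**params)`.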
Kong Gateway Monitoring
Kong is a popular open-source API gateway with rich plugin support. Monitor it through its Admin API and Prometheus metrics:
# Kong Prometheus plugin configuration
plugins:
  - name: prometheus
    config:
      status_code_metrics: true
      latency_metrics: true
      bandwidth_metrics: true
      upstream_health_metrics: true
Key Kong Prometheus metrics:
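# Request rate by service
sum by (service) (rate(kong_http_requests_total[5m]))
# Error rate per service (5xx responses as a fraction of all requests)
sum by (service) (rate(kong_http_requests_total{code=~"5.."}[5m]))
/ sum by (service) (rate(kong_http_requests_total[5m]))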
# Gateway latency (not including upstream)
histogram_quantile(0.99, rate(kong_kong_latency_ms_bucket[5m]))
# Upstream (backend) latency
histogram_quantile(0.99, rate(kong_upstream_latency_ms_bucket[5m]))
# Bandwidth usage
rate(kong_bandwidth_bytes_total[5m])
Kong Health Check Configuration
Kong can actively health-check upstream services. Configure this to get automatic failover and visibility into upstream health:
# Kong upstream with active health checks
upstreams:
  - name: user-service
    healthchecks:
      active:
        healthy:
          interval: 10
          successes: 2
        unhealthy:
          interval: 5
          http_failures: 3
          http_statuses: [429, 500, 503]
        http_path: "/health"
        timeout: 1
      passive:
        healthy:
          successes: 5
        unhealthy:
          http_failures: 5
          http_statuses: [500, 503]
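To make the `successes` and `http_failures` thresholds concrete, here is a simplified model of how a target could flip between healthy and unhealthy. It is a sketch under stated assumptions, not Kong's implementation: the class name is hypothetical, and any status outside the unhealthy list is treated as a success, which is coarser than Kong's separate healthy status list.

```python
class TargetHealth:
    """Toy model of threshold-based health tracking for one upstream target."""

    def __init__(self, successes_needed=2, failures_needed=3,
                 unhealthy_statuses=(429, 500, 503)):
        self.successes_needed = successes_needed
        self.failures_needed = failures_needed
        self.unhealthy_statuses = set(unhealthy_statuses)
        self.healthy = True
        self._successes = 0
        self._failures = 0

    def record(self, status):
        """Record one probe result and return the resulting health state."""
        if status in self.unhealthy_statuses:
            self._failures += 1
            self._successes = 0  # a failure interrupts any success streak
            if self._failures >= self.failures_needed:
                self.healthy = False
        else:
            self._successes += 1
            self._failures = 0  # a success interrupts any failure streak
            if self._successes >= self.successes_needed:
                self.healthy = True
        return self.healthy


target = TargetHealth()
print([target.record(s) for s in (500, 500, 500, 200, 200)])
# [True, True, False, False, True]
```

The target stays in rotation through two failures, drops out on the third, and needs two consecutive successes to return.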
Monitor Kong's upstream health through the Admin API:
# Check upstream health status
curl -s http://kong-admin:8001/upstreams/user-service/health | jq '
  .data[] | {
    address: .target,
    health: .health,
    weight: .weight
  }
'
# Output:
# {
# "address": "user-service-1:8080",
# "health": "HEALTHY",
# "weight": 100
# }
# {
# "address": "user-service-2:8080",
# "health": "UNHEALTHY",
# "weight": 0
# }
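For alerting, the same response can be reduced to a list of targets that need attention. The sketch below works on the already-parsed JSON body and assumes each entry carries `target` and `health` keys; the function name and bucket names are illustrative choices.

```python
def summarize_upstream_health(health_response):
    """Group upstream targets by reported health state."""
    summary = {"healthy": [], "unhealthy": [], "unchecked": []}
    for entry in health_response.get("data", []):
        target = entry.get("target", "unknown")
        state = entry.get("health", "")
        if state == "HEALTHY":
            summary["healthy"].append(target)
        elif state in ("UNHEALTHY", "DNS_ERROR"):
            summary["unhealthy"].append(target)
        else:
            # For example HEALTHCHECKS_OFF: no health information is available.
            summary["unchecked"].append(target)
    return summary


# Hypothetical response body, shaped like the output shown above.
response = {
    "data": [
        {"target": "user-service-1:8080", "health": "HEALTHY", "weight": 100},
        {"target": "user-service-2:8080", "health": "UNHEALTHY", "weight": 0},
    ]
}
print(summarize_upstream_health(response)["unhealthy"])  # ['user-service-2:8080']
```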
Nginx API Gateway Monitoring
For teams using Nginx as an API gateway, the key metrics come from the ngx_http_stub_status_module or the commercial Nginx Plus status:
# Enable status module
server {
    listen 8080;
    location /nginx_status {
        stub_status on;
        allow 127.0.0.1;
        deny all;
    }
}
Parse the status output:
# Nginx status endpoint output:
# Active connections: 45
# server accepts handled requests
# 1000 1000 5432
# Reading: 0 Writing: 5 Waiting: 40
# Parse with curl and awk
curl -s http://localhost:8080/nginx_status | \
awk '/Active/ {print "active_connections=" $3}
/Reading/ {print "reading=" $2, "writing=" $4, "waiting=" $6}'
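If the values feed a metrics pipeline rather than a shell script, the same parsing can be sketched in Python. The function name is an assumption; the derived `dropped` value relies on `handled` normally equalling `accepts` unless a resource limit was hit.

```python
import re


def parse_stub_status(text):
    """Parse the plain-text output of the stub_status endpoint into a dict."""
    active = re.search(r"Active connections:\s*(\d+)", text)
    counters = re.search(
        r"^[ \t]*(\d+)[ \t]+(\d+)[ \t]+(\d+)[ \t]*$", text, re.MULTILINE
    )
    states = re.search(
        r"Reading:\s*(\d+)\s+Writing:\s*(\d+)\s+Waiting:\s*(\d+)", text
    )
    if not (active and counters and states):
        raise ValueError("unrecognized stub_status output")
    accepts, handled, total_requests = (int(v) for v in counters.groups())
    reading, writing, waiting = (int(v) for v in states.groups())
    return {
        "active": int(active.group(1)),
        "accepts": accepts,
        "handled": handled,
        "requests": total_requests,
        "reading": reading,
        "writing": writing,
        "waiting": waiting,
        # Connections accepted but not handled, e.g. when worker limits are hit.
        "dropped": accepts - handled,
    }


sample = (
    "Active connections: 45 \n"
    "server accepts handled requests\n"
    " 1000 1000 5432 \n"
    "Reading: 0 Writing: 5 Waiting: 40 \n"
)
stats = parse_stub_status(sample)
print(stats["active"], stats["requests"], stats["dropped"])  # 45 5432 0
```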
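For detailed upstream monitoring, define the backends with the ngx_http_upstream_module. The open-source module does not expose a status endpoint, so upstream visibility comes from logging its variables (such as $upstream_addr, $upstream_status, and $upstream_response_time) in the access log:
upstream api_backends {
    server backend1:8080;
    server backend2:8080;
    # Keep up to 32 idle connections to the backends open per worker process
    keepalive 32;
}
# Record which backend served each request and how long it took
# (format name and log path are examples)
log_format upstream_timing '$remote_addr "$request" $status '
                           'upstream=$upstream_addr urt=$upstream_response_time';
access_log /var/log/nginx/api_access.log upstream_timing;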
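One way to turn upstream timings into a per-backend metric is to aggregate them from the access log. The sketch below assumes a hypothetical log format in which every line ends with `upstream=$upstream_addr urt=$upstream_response_time`; both the format and the function name are assumptions for illustration.

```python
import re
from collections import defaultdict

# Matches the assumed "upstream=<addr> urt=<seconds>" fields of each log line.
UPSTREAM_RE = re.compile(r"upstream=(?P<addr>\S+) urt=(?P<urt>\S+)")


def mean_upstream_latency_ms(log_lines):
    """Average upstream response time per backend address, in milliseconds."""
    samples = defaultdict(list)
    for line in log_lines:
        match = UPSTREAM_RE.search(line)
        if not match:
            continue
        try:
            seconds = float(match["urt"])
        except ValueError:
            # "-" (no upstream contacted) and multi-attempt values are skipped.
            continue
        samples[match["addr"]].append(seconds * 1000.0)
    return {addr: sum(vals) / len(vals) for addr, vals in samples.items()}


lines = [
    '10.0.0.1 "GET /users HTTP/1.1" 200 upstream=backend1:8080 urt=0.010',
    '10.0.0.2 "GET /users HTTP/1.1" 200 upstream=backend1:8080 urt=0.030',
    '10.0.0.3 "GET /users HTTP/1.1" 200 upstream=backend2:8080 urt=0.250',
    '10.0.0.4 "GET /missing HTTP/1.1" 404 upstream=- urt=-',
]
averages = mean_upstream_latency_ms(lines)
```

Requests that were retried across several servers produce comma-separated values in these variables; this sketch ignores them rather than guess how to attribute the time.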
Gateway Latency Overhead Analysis
Calculate how much latency the gateway adds versus what your backend contributes:
Total Latency = Gateway Overhead + Backend Processing + Network
Gateway Overhead = Total Latency - Integration Latency
Target gateway overhead by gateway type:
| Gateway | Acceptable Overhead | High Overhead |
|---|---|---|
| AWS API Gateway | < 10ms | > 50ms |
| Kong | < 5ms | > 20ms |
| Nginx | < 2ms | > 10ms |
| Traefik | < 3ms | > 15ms |
| Envoy | < 2ms | > 10ms |
When gateway overhead spikes, check for:
- Plugin execution taking too long
- Rate limiter checking Redis with high latency
- JWT validation with slow key fetching
- Excessive logging or request/response transformation
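The per-gateway targets above can be encoded directly so that a measured overhead is labelled the same way everywhere. This is a minimal sketch: the dictionary keys, the function name, and the "elevated" label for values between the two thresholds are assumptions, since the table only names the acceptable and high ranges.

```python
# (acceptable below, high above) in milliseconds, taken from the targets table.
OVERHEAD_TARGETS_MS = {
    "aws_api_gateway": (10, 50),
    "kong": (5, 20),
    "nginx": (2, 10),
    "traefik": (3, 15),
    "envoy": (2, 10),
}


def classify_overhead(gateway, overhead_ms):
    """Label a measured gateway overhead against the target for its gateway type."""
    acceptable_below, high_above = OVERHEAD_TARGETS_MS[gateway]
    if overhead_ms < acceptable_below:
        return "acceptable"
    if overhead_ms > high_above:
        return "high"
    return "elevated"


print(classify_overhead("kong", 3))              # acceptable
print(classify_overhead("kong", 12))             # elevated
print(classify_overhead("aws_api_gateway", 200)) # high
```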
Monitoring Route Configuration
Invalid route configurations are a common source of incidents. Monitor the health of routing:
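# Validate all routes are reachable
import requests


def validate_gateway_routes(gateway_admin_url,
                            gateway_base_url="https://api.example.com",
                            sample_check=True):
    """
    Fetch all configured routes and validate that each one responds correctly.

    With sample_check=True only the first path of each route is tested;
    otherwise every path is tested.
    """
    # Get the configured routes (first page only; large configs are paginated)
    response = requests.get(f"{gateway_admin_url}/routes", timeout=5)
    response.raise_for_status()
    routes = response.json()["data"]

    results = []
    for route in routes:
        # "service" can be null for routes not attached to a service
        service_name = (route.get("service") or {}).get("id", "unknown")
        paths = route.get("paths") or []
        if not paths:
            continue

        for test_path in (paths[:1] if sample_check else paths):
            try:
                probe = requests.get(
                    f"{gateway_base_url}{test_path}",
                    headers={"X-Monitor": "true"},
                    timeout=5,
                    allow_redirects=False,
                )
                status = probe.status_code
            except requests.RequestException:
                # Connection errors and timeouts count as unhealthy
                status = None

            results.append({
                "route_id": route["id"],
                "service": service_name,
                "path": test_path,
                "status": status,
                "healthy": status is not None and status not in (404, 502, 503),
            })

    unhealthy = [r for r in results if not r["healthy"]]
    return {
        "total_routes": len(results),  # one entry per tested path
        "healthy": len(results) - len(unhealthy),
        "unhealthy": unhealthy,
    }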
Cross-Gateway Latency Comparison
If you run multiple API gateways (for different environments or regions), compare their latency profiles:
| Gateway Instance | Avg Latency | P95 Latency | Error Rate |
|---|---|---|---|
| gateway-us-east-1 | 8ms | 24ms | 0.02% |
| gateway-us-west-2 | 11ms | 31ms | 0.03% |
| gateway-eu-west-1 | 9ms | 28ms | 0.02% |
| gateway-ap-east-1 | 45ms | 130ms | 0.08% |
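The ap-east-1 gateway stands out: its average and P95 latency are roughly four to six times those of the other regions, and its error rate is the highest. Plausible causes to check first are a configuration difference or slow upstream DNS lookups in that region.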
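Spotting an outlier like this can be automated by comparing each instance against the fleet median. The sketch below uses the P95 values from the table; the function name and the 2x factor are arbitrary assumptions that would need tuning for a real fleet.

```python
from statistics import median


def find_latency_outliers(p95_by_gateway, factor=2.0):
    """Return gateways whose P95 latency exceeds factor times the fleet median."""
    baseline = median(p95_by_gateway.values())
    return sorted(
        name for name, p95 in p95_by_gateway.items() if p95 > factor * baseline
    )


p95_ms = {
    "gateway-us-east-1": 24,
    "gateway-us-west-2": 31,
    "gateway-eu-west-1": 28,
    "gateway-ap-east-1": 130,
}
print(find_latency_outliers(p95_ms))  # ['gateway-ap-east-1']
```

The median is used instead of the mean so that the slow instance does not inflate its own baseline.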
Alerting for Gateway Issues
alerts:
  # Gateway is adding excessive latency
  - name: "Gateway Latency Overhead High"
    condition: "gateway_overhead_ms > 50"
    severity: warning
  # High proportion of gateway errors
  - name: "Gateway Error Rate Critical"
    condition: "gateway_5xx_rate > 0.01"
    severity: critical
    runbook: "https://wiki.example.com/runbooks/api-gateway-5xx"
  # Upstream connections exhausted
  - name: "Gateway Connection Pool Exhausted"
    condition: "gateway_upstream_connections_available < 10"
    severity: critical
  # Route check failure
  - name: "Gateway Route Misconfigured"
    condition: "gateway_route_health_check_failed = 1"
    severity: critical
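Rules in this shape are simple enough to evaluate with a few lines of code, which is handy for testing thresholds before wiring them into an alerting system. The evaluator below is a sketch that assumes every condition has the form `metric operator threshold` separated by spaces; the function name and the sample metric values are hypothetical.

```python
import operator

# Comparison operators this sketch understands in "metric op threshold" conditions.
OPERATORS = {
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
    "=": operator.eq,
}


def firing_alerts(alerts, metrics):
    """Return the names of alerts whose condition holds for the given metrics."""
    firing = []
    for alert in alerts:
        metric, op, threshold = alert["condition"].split()
        value = metrics.get(metric)
        # Metrics with no current value are skipped rather than treated as zero.
        if value is not None and OPERATORS[op](value, float(threshold)):
            firing.append(alert["name"])
    return firing


alerts = [
    {"name": "Gateway Latency Overhead High",
     "condition": "gateway_overhead_ms > 50"},
    {"name": "Gateway Error Rate Critical",
     "condition": "gateway_5xx_rate > 0.01"},
    {"name": "Gateway Connection Pool Exhausted",
     "condition": "gateway_upstream_connections_available < 10"},
]
metrics = {
    "gateway_overhead_ms": 72.0,
    "gateway_5xx_rate": 0.002,
    "gateway_upstream_connections_available": 4,
}
print(firing_alerts(alerts, metrics))
# ['Gateway Latency Overhead High', 'Gateway Connection Pool Exhausted']
```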
Conclusion
The API gateway sits at the intersection of all your services — when it has problems, everything has problems. Thorough gateway monitoring covers latency overhead (distinguish backend latency from gateway overhead), routing validation, plugin health, upstream connection pools, and error rates. AzMonitor can monitor your gateway's public endpoints from multiple regions to catch routing issues and latency problems that internal monitoring might miss, providing an outside-in view of your API perimeter health.