Database failures are disproportionately responsible for web application outages. A crashed web server is bad; a crashed database is catastrophic — it takes down every application that depends on it simultaneously and often requires careful data validation before recovery. Preventing database-caused outages starts with monitoring the right metrics before failures occur.

Why Database Monitoring Is Different

Database monitoring requires a different approach than application monitoring:

Gradual degradation: Databases don't usually fail suddenly. They degrade gradually as indexes become bloated, connection pools fill, or query plans change with data growth. The right monitoring catches this gradual degradation.

Dependencies cascade: When a database is slow or unavailable, every application service that depends on it slows down or fails. Database health is the health of everything that depends on it.

Recovery is complicated: Rolling back an application deployment takes minutes. Recovering a corrupted database takes hours and may involve data loss. Prevention is exponentially more valuable than recovery.

Critical Database Metrics to Monitor

Query Performance Metrics

Query latency percentiles: P50, P95, P99 for all query types:

-- PostgreSQL: Find slowest queries (requires pg_stat_statements extension)
SELECT 
  query,
  calls,
  mean_exec_time as avg_ms,
  percentile_cont(0.95) WITHIN GROUP (ORDER BY exec_time) as p95_ms,
  percentile_cont(0.99) WITHIN GROUP (ORDER BY exec_time) as p99_ms
FROM pg_stat_statements
JOIN pg_stat_statements_history ...
ORDER BY p99_ms DESC
LIMIT 20;

Alert when:

P95 query time increases 50% above baseline
Any query exceeds 1 second average (investigate, not necessarily alert)
Specific critical queries exceed defined thresholds

Slow query log: Enable slow query logging to capture queries that exceed a threshold:

-- PostgreSQL: Log queries taking > 1 second
ALTER SYSTEM SET log_min_duration_statement = 1000;  -- milliseconds
SELECT pg_reload_conf();

Connection Pool Metrics

Connection pool exhaustion is a common cause of database-related outages. When no connections are available, all new database requests wait indefinitely (or timeout):

-- PostgreSQL: Current connections
SELECT 
  state,
  count(*) as connections,
  max(now() - query_start) as max_duration
FROM pg_stat_activity
WHERE datname = 'your_database'
GROUP BY state;

Alert thresholds:

Warning: Connection pool at 70% capacity
Critical: Connection pool at 85% capacity
Emergency: Connection pool at 95%+ (application requests starting to queue)

Replication Lag

If you use read replicas or have standby servers for failover, replication lag determines how outdated replica data is and how long failover recovery might take:

-- PostgreSQL: Replication lag on replicas
SELECT 
  client_addr,
  state,
  sent_lsn,
  write_lsn,
  flush_lsn,
  replay_lsn,
  (sent_lsn - replay_lsn) AS replication_lag_bytes
FROM pg_stat_replication;

Alert when:

Replication lag exceeds 30 seconds (reads from replica are stale)
Replication lag exceeds 5 minutes (failover would lose significant data)
Replica is disconnected from primary

Index Efficiency

Poor index efficiency causes query slowdowns that get worse as data grows:

-- PostgreSQL: Find tables with high sequential scan ratios (missing indexes)
SELECT 
  schemaname,
  tablename,
  seq_scan,
  idx_scan,
  seq_scan / (seq_scan + idx_scan + 1.0) as seq_scan_ratio
FROM pg_stat_user_tables
WHERE seq_scan + idx_scan > 100
ORDER BY seq_scan_ratio DESC;

A high sequential scan ratio on large tables indicates missing indexes. Alert when a table's sequential scan ratio exceeds 10% and the table has > 100K rows.

Database Size and Growth Rate

Unexpected database size growth can fill disks and cause outages:

-- PostgreSQL: Database size and growth
SELECT 
  pg_size_pretty(pg_database_size('your_database')) as size,
  pg_database_size('your_database') as size_bytes;

Monitor:

Total database size (alert at 80% disk capacity)
Weekly growth rate (alert if growth rate doubles unexpectedly)
Individual table sizes (alert if a table grows unexpectedly fast — could indicate data leak)

Lock Contention

Table locks can cause cascading slowdowns across the application:

-- PostgreSQL: Find blocked queries
SELECT 
  pid,
  now() - query_start as duration,
  query,
  state
FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
ORDER BY duration DESC;

Alert when:

Queries are waiting on locks for > 30 seconds
More than 5 queries are simultaneously waiting on locks

Health Check Endpoint for Database

Expose a health endpoint that validates database connectivity and basic operation:

# FastAPI health check endpoint
from fastapi import FastAPI
import databases
import time

app = FastAPI()
database = databases.Database(DATABASE_URL)

@app.get("/health")
async def health_check():
    checks = {}
    healthy = True
    
    # Database connectivity
    try:
        start = time.time()
        result = await database.fetch_one("SELECT 1 as health")
        latency_ms = (time.time() - start) * 1000
        checks["database"] = {
            "status": "healthy",
            "latency_ms": round(latency_ms, 2)
        }
        if latency_ms > 100:
            checks["database"]["warning"] = "high_latency"
    except Exception as e:
        checks["database"] = {"status": "unhealthy", "error": str(e)}
        healthy = False
    
    status_code = 200 if healthy else 503
    return JSONResponse(content={
        "status": "healthy" if healthy else "degraded",
        "checks": checks
    }, status_code=status_code)

Monitor this endpoint with AzMonitor at 1-minute intervals. Alert if the latency check shows > 100ms for a simple SELECT 1 — it indicates database stress even before full failures.

Database Monitoring Strategy

External monitoring (AzMonitor): Monitor your health check endpoint. This tells you if the database is accessible and responding to basic queries.

Internal monitoring (Prometheus/Grafana, CloudWatch, Datadog): Monitor detailed metrics — connection pool usage, slow queries, replication lag, disk usage.

Database-native tools: PostgreSQL has pg_stat_* views; MySQL has information_schema; most databases have built-in performance tools.

The external monitor is your early warning system. When it fires, you turn to internal metrics to understand why.

Set up database health monitoring with AzMonitor alongside your application endpoint monitoring. Your database health check endpoint is one of the most important monitors in your configuration.

Related: API health checks guide for designing comprehensive health endpoints that include database validation.

Tags:database monitoringperformance monitoringPostgreSQLdatabase metrics

Back to blog

AzMonitor Team

The AzMonitor team writes guides based on experience monitoring millions of endpoints daily across 10,000+ customer environments. Our expertise covers uptime monitoring, SRE practices, and reliability engineering.

Try AzMonitor free

3 monitors free forever · No credit card needed · Set up in 2 minutes

Start monitoring free →

Database Performance Monitoring: Metrics You Can't Ignore

Why Database Monitoring Is Different

Critical Database Metrics to Monitor

Query Performance Metrics

Connection Pool Metrics

Replication Lag

Index Efficiency

Database Size and Growth Rate

Lock Contention

Health Check Endpoint for Database

Database Monitoring Strategy

Related articles

Image Optimization Monitoring: WebP, AVIF, and Lazy Loading

RUM vs Lab Data: Which Performance Metrics Should You Trust?

Server Response Time: Benchmarks and Optimization Strategies