Microservices architecture fundamentally changes how monitoring works. Instead of monitoring one application with one database, you're monitoring dozens or hundreds of services, each with its own health, dependencies, and failure modes. A single user request might touch 15 services, and any one of them can fail independently.
Traditional uptime monitoring (check one URL, get one status) breaks down in microservices environments. This guide covers the patterns and practices that actually work.
The Microservices Monitoring Challenge
In a monolithic application, the app is either up or it isn't. In microservices:
- Individual services can be up while dependent services are down
- A "degraded" service (high latency, elevated error rate) can cascade into failures elsewhere
- Service A failing might not cause user-visible problems if Service B handles the load
- Service A failing might cause catastrophic cascading failures if other services depend on it
This complexity requires monitoring strategies that understand service relationships and cascade effects.
Health Check Patterns for Microservices
Liveness Probes
A liveness probe answers: "Is this process alive and should it keep running?" If it fails, the orchestration system (Kubernetes) restarts the container.
// Simple liveness check - just confirms the process can handle requests
func livenessHandler(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
	json.NewEncoder(w).Encode(map[string]string{"status": "alive"})
}
Liveness probes should be simple and should succeed unless the process is truly stuck. A liveness probe that checks database connectivity will restart your service every time the database is temporarily unavailable, which usually makes things worse: the restart cannot fix the database, and it adds startup and reconnection load on top of the outage.
Readiness Probes
A readiness probe answers: "Is this service ready to serve traffic?" If it fails, traffic is routed away from this instance (but the instance isn't restarted).
// Readiness check - verifies dependencies are accessible
func readinessHandler(w http.ResponseWriter, r *http.Request) {
	// Check database
	if err := db.Ping(); err != nil {
		w.WriteHeader(http.StatusServiceUnavailable)
		json.NewEncoder(w).Encode(map[string]interface{}{
			"status": "not_ready",
			"reason": "database_unavailable",
		})
		return
	}

	// Check cache
	if err := cache.Ping(); err != nil {
		w.WriteHeader(http.StatusServiceUnavailable)
		json.NewEncoder(w).Encode(map[string]interface{}{
			"status": "not_ready",
			"reason": "cache_unavailable",
		})
		return
	}

	w.WriteHeader(http.StatusOK)
	json.NewEncoder(w).Encode(map[string]string{"status": "ready"})
}
Deep Health Checks
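A deep health check (sometimes called a diagnostic endpoint) answers: "Is this service fully healthy and functioning as expected?" It is not the same thing as a Kubernetes startup probe, which only holds back liveness and readiness checks until the application has finished starting. A deep health check reports per-dependency detail and is meant for dashboards and external monitors rather than for restart decisions.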
// Deep health check - comprehensive dependency verification
func healthHandler(w http.ResponseWriter, r *http.Request) {
	health := HealthStatus{
		Service: "order-service",
		Version: buildVersion,
		Checks:  map[string]CheckResult{},
	}
	allHealthy := true

	// Database check with timing
	dbStart := time.Now()
	if err := db.Ping(); err != nil {
		health.Checks["database"] = CheckResult{Status: "unhealthy", Error: err.Error()}
		allHealthy = false
	} else {
		health.Checks["database"] = CheckResult{Status: "healthy", Latency: time.Since(dbStart).Milliseconds()}
	}

	// Upstream service check. The 2-second timeout is an illustrative value;
	// without one, a hung payment service would hang this handler too.
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get("http://payment-service/health")
	if err == nil {
		resp.Body.Close() // close the body so the connection is not leaked
	}
	if err != nil || resp.StatusCode != http.StatusOK {
		health.Checks["payment_service"] = CheckResult{Status: "unhealthy"}
		allHealthy = false
	} else {
		health.Checks["payment_service"] = CheckResult{Status: "healthy"}
	}

	// Set the status code before writing the body
	if allHealthy {
		health.Status = "healthy"
		w.WriteHeader(http.StatusOK)
	} else {
		health.Status = "degraded"
		w.WriteHeader(http.StatusServiceUnavailable)
	}
	json.NewEncoder(w).Encode(health)
}
See our complete microservices health check guide for the full implementation reference.
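The handler above uses `HealthStatus` and `CheckResult` without defining them. The sketch below shows one plausible shape for those types and the JSON they would produce. The field names mirror the handler, but the JSON tags and the version string `1.4.2` are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CheckResult is an assumed shape for the result of one dependency check.
// Note that omitempty drops a latency of exactly 0 ms from the output.
type CheckResult struct {
	Status  string `json:"status"`
	Latency int64  `json:"latency_ms,omitempty"`
	Error   string `json:"error,omitempty"`
}

// HealthStatus is an assumed shape for the whole deep health response.
type HealthStatus struct {
	Service string                 `json:"service"`
	Version string                 `json:"version"`
	Status  string                 `json:"status"`
	Checks  map[string]CheckResult `json:"checks"`
}

// render serializes a HealthStatus to compact JSON, or "" on failure.
func render(h HealthStatus) string {
	b, err := json.Marshal(h)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	h := HealthStatus{
		Service: "order-service",
		Version: "1.4.2", // placeholder value
		Status:  "degraded",
		Checks: map[string]CheckResult{
			"database":        {Status: "healthy", Latency: 3},
			"payment_service": {Status: "unhealthy", Error: "timeout"},
		},
	}
	fmt.Println(render(h))
}
```

Keeping a per-check `status` next to the overall one lets an external monitor alert on the top-level field while responders read the detail.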
External vs Internal Monitoring in Microservices
Internal monitoring (Kubernetes health probes): Ensures containers are restarted when they fail. This is your infrastructure layer.
External monitoring (AzMonitor): Validates that the service works from the outside — the same perspective as your users. External monitoring doesn't know or care about your Kubernetes cluster; it only knows what your service returns.
Both are required. Internal monitoring keeps your instances healthy. External monitoring confirms the combination of all internal systems delivers correct responses to users.
The External Monitoring Layer for Microservices
For external uptime monitoring in a microservices environment, focus on:
1. API Gateway / Entry Points: Monitor the public-facing entry points to your system. These are the aggregation points where all internal service failures manifest as user-visible errors.
GET https://api.yourapp.com/health
→ Aggregated health of all upstream services
→ Alert if any dependency is degraded
2. Critical User Journey Endpoints: Monitor the API endpoints that represent complete user workflows, not just individual service health checks:
| User Journey | Endpoint to Monitor |
|-------------|---------------------|
| User login | POST /auth/login |
| Core feature | GET /api/feature/list |
| Data creation | POST /api/resource |
| Data retrieval | GET /api/resource/{id} |
3. Event Bus / Message Queue Health: In event-driven microservices, message queue health is critical:
Monitor: Queue depth (alert if growing unbounded)
Monitor: Consumer lag (alert if consumers fall behind)
Monitor: Failed message count (alert on failed message accumulation)
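"Growing unbounded" needs a working definition before it can drive an alert. One simple option, sketched below, is to alert when the queue depth has increased on every one of the last few samples. The function name, the window size, and the sample values are assumptions for illustration; how the samples are collected depends on your broker.

```go
package main

import "fmt"

// growingUnbounded reports whether queue depth rose on each of the last
// `window` consecutive sample-to-sample steps.
func growingUnbounded(samples []int, window int) bool {
	if window < 1 || len(samples) < window+1 {
		return false // not enough data to decide
	}
	recent := samples[len(samples)-window-1:]
	for i := 1; i < len(recent); i++ {
		if recent[i] <= recent[i-1] {
			return false // depth dropped or held steady at least once
		}
	}
	return true
}

func main() {
	bursty := []int{10, 40, 12, 35, 9}    // spikes that drain again
	backlog := []int{10, 20, 45, 90, 180} // consumers are not keeping up

	fmt.Println(growingUnbounded(bursty, 3))  // false
	fmt.Println(growingUnbounded(backlog, 3)) // true
}
```

A rule like this tolerates short bursts that drain on their own and fires only on a sustained backlog.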
Alert Strategies for Microservices
Single-service alerts generate too much noise in microservices environments. Use alert correlation:
Symptom-based alerting: Alert on user-visible symptoms rather than individual service failures. "Payment API returning 5% errors" is more actionable than "Payment service health check failing."
Dependency-aware suppression: When Service A is down and Services B, C, D all depend on A, suppress alerts for B, C, D and focus on A. This prevents alert floods from cascading failures.
Alert deduplication: Group related alerts into a single incident. 20 services failing simultaneously due to a database outage is one incident, not 20.
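Dependency-aware suppression can be sketched as a graph walk: a failing service is suppressed when anything it depends on, directly or transitively, is also failing, and whatever remains is the root cause to alert on. The dependency map and service names below are hypothetical, and the sketch assumes the dependency graph has no cycles.

```go
package main

import (
	"fmt"
	"sort"
)

// dependsOnFailing reports whether svc depends, directly or transitively,
// on a failing service. seen prevents revisiting nodes.
func dependsOnFailing(svc string, deps map[string][]string, failing, seen map[string]bool) bool {
	for _, d := range deps[svc] {
		if seen[d] {
			continue
		}
		seen[d] = true
		if failing[d] || dependsOnFailing(d, deps, failing, seen) {
			return true
		}
	}
	return false
}

// rootCauses returns the failing services that should page someone:
// those with no failing dependency upstream. Assumes an acyclic graph.
func rootCauses(deps map[string][]string, failing []string) []string {
	set := make(map[string]bool, len(failing))
	for _, s := range failing {
		set[s] = true
	}
	var roots []string
	for _, s := range failing {
		if !dependsOnFailing(s, deps, set, map[string]bool{s: true}) {
			roots = append(roots, s)
		}
	}
	sort.Strings(roots)
	return roots
}

func main() {
	// B and C call A; D calls B.
	deps := map[string][]string{"B": {"A"}, "C": {"A"}, "D": {"B"}}

	fmt.Println(rootCauses(deps, []string{"A", "B", "C", "D"})) // [A]
	fmt.Println(rootCauses(deps, []string{"D", "X"}))           // [D X]
}
```

In the first case four failing services collapse into one incident for A; in the second, D and X are unrelated failures and both surface.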
Monitoring Cascading Failures
Cascading failures — where one service's failure causes others to fail — are a major risk in microservices. Detect them early by monitoring:
Error rate trends: A service seeing increasing 5xx responses is under stress and may soon fail completely. Alert on trends, not just binary up/down status.
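As a rough illustration of alerting on a trend rather than a threshold, the sketch below fits a least-squares line through recent error-rate samples and flags the series when the slope exceeds a chosen value. The sample values and the slope limit of 0.005 per interval are assumptions for the example, not recommended settings.

```go
package main

import "fmt"

// slope returns the least-squares slope of ys against sample index 0..n-1,
// that is, the average change in error rate per sampling interval.
func slope(ys []float64) float64 {
	if len(ys) < 2 {
		return 0
	}
	n := float64(len(ys))
	var sumX, sumY, sumXY, sumXX float64
	for i, y := range ys {
		x := float64(i)
		sumX += x
		sumY += y
		sumXY += x * y
		sumXX += x * x
	}
	return (n*sumXY - sumX*sumY) / (n*sumXX - sumX*sumX)
}

// trending reports whether the error rate is rising faster than minSlope.
func trending(ys []float64, minSlope float64) bool {
	return slope(ys) > minSlope
}

func main() {
	rising := []float64{0.01, 0.02, 0.03, 0.04} // 1% -> 4% of requests failing
	flat := []float64{0.02, 0.02, 0.02, 0.02}

	fmt.Printf("%.3f %v\n", slope(rising), trending(rising, 0.005)) // 0.010 true
	fmt.Println(trending(flat, 0.005))                              // false
}
```

The rising series would still pass a static "alert above 5%" rule, yet the slope shows it gaining a percentage point every interval.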
Latency percentiles (P99): High P99 latency in Service A creates pressure on services that call A. A P99 spike often precedes a cascade.
Connection pool utilization: When a service's connection pool to a database or upstream service is near capacity, new requests start waiting for a free connection and timeouts soon follow. Alert at around 80% utilization, and tune the threshold to your traffic pattern.
Service Mesh Integration
If you're using a service mesh (Istio, Linkerd, Consul Connect), you gain rich telemetry about service-to-service communication. Integrate this data with your external monitoring:
- Service mesh provides internal metrics (latency, error rates between services)
- External monitoring provides user-perspective availability data
- Combined: you know what users experience AND why
AzMonitor integrates with standard observability formats, making it straightforward to correlate external check data with service mesh telemetry.
Building a Microservices Monitoring Dashboard
For effective microservices monitoring, structure your dashboard in layers:
Layer 1: User Impact
- Overall service availability (%)
- API error rate
- P95 response time across all services
Layer 2: Service Health
- Health status of each service (healthy/degraded/down)
- Per-service error rates and latencies
Layer 3: Infrastructure
- Per-service resource utilization
- Database connection pool status
- Message queue depths
Start external monitoring with AzMonitor for Layer 1 and the critical endpoints in Layer 2. Try it free and build your microservices monitoring foundation in under an hour.