Microservices multiply monitoring complexity. A monolith is one service to monitor. A microservices system might have 50, 100, or 500 services, each with its own APIs, dependencies, failure modes, and deployment cycles. When a user reports a slow checkout, the root cause might be in any of a dozen services. Without systematic monitoring, you're guessing.
The Microservices Monitoring Challenge
In a monolith, a slow database query is easy to find — it's your database. In microservices, you need to know:
- Which service in the call chain is slow?
- Which service is calling which other service?
- When Service A degrades, which dependent services are affected?
- Is a cascading failure in progress?
- Which team owns the failing service?
These questions require monitoring at three levels: individual service health, service-to-service communication, and end-to-end user journeys.
Service Dependency Maps
Before you can monitor microservices effectively, you need to know what depends on what. Build a service dependency map:
```yaml
# Service dependency manifest (example)
services:
  checkout-service:
    owns: ["checkout-team"]
    depends_on:
      - service: user-service
        type: synchronous
        criticality: required
      - service: inventory-service
        type: synchronous
        criticality: required
      - service: payment-service
        type: synchronous
        criticality: required
      - service: notification-service
        type: asynchronous
        criticality: optional
  payment-service:
    owns: ["payments-team"]
    depends_on:
      - service: fraud-detection-service
        type: synchronous
        criticality: required
      - service: stripe-api
        type: external
        criticality: required
```
This map tells you that checkout-service is directly impacted by any failure in user-service, inventory-service, or payment-service. An alert on payment-service should also trigger awareness for the checkout team.
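A dependency map is most useful when you can query it. The sketch below walks the map in reverse to list every service affected by a failure, directly or transitively. The `DEPENDS_ON` dictionary is a hypothetical in-memory copy of the manifest above, and `impacted_by` is an illustrative helper name. It ignores the `criticality` field, so optional dependencies count as impact too.

```python
from collections import deque

# Hypothetical in-memory copy of the manifest: service -> direct dependencies.
DEPENDS_ON = {
    "checkout-service": [
        "user-service",
        "inventory-service",
        "payment-service",
        "notification-service",
    ],
    "payment-service": ["fraud-detection-service", "stripe-api"],
}


def impacted_by(failed, depends_on):
    """Return every service that depends on `failed`, directly or transitively."""
    # Invert the edges: dependency -> set of services that call it.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk up the caller chain.
    impacted = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for caller in dependents.get(current, ()):
            if caller not in impacted:
                impacted.add(caller)
                queue.append(caller)
    return impacted


print(sorted(impacted_by("fraud-detection-service", DEPENDS_ON)))
```

With this data, a failure in fraud-detection-service reaches payment-service and, through it, checkout-service, which is the awareness chain an alert router would need.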
Health Check Endpoints for Microservices
Every service should expose a standardized health endpoint. Implement a structured health response:
```go
// Go: Structured health endpoint
// Assumes package-level db (*sql.DB), cache (a go-redis client), AppVersion,
// StartTime, and checkDownstreamService are defined elsewhere in the service.
import (
	"encoding/json"
	"net/http"
	"time"
)

type HealthResponse struct {
	Status    string           `json:"status"`
	Timestamp string           `json:"timestamp"`
	Version   string           `json:"version"`
	Uptime    int64            `json:"uptime_seconds"`
	Checks    map[string]Check `json:"checks"`
}

type Check struct {
	Status  string `json:"status"`
	Latency int    `json:"latency_ms,omitempty"`
	Error   string `json:"error,omitempty"`
}

func HealthHandler(w http.ResponseWriter, r *http.Request) {
	checks := map[string]Check{}
	overallStatus := "healthy"

	// Check database (required: a failure makes the service unhealthy)
	dbStart := time.Now()
	if err := db.PingContext(r.Context()); err != nil {
		checks["database"] = Check{Status: "unhealthy", Error: err.Error()}
		overallStatus = "unhealthy"
	} else {
		checks["database"] = Check{
			Status:  "healthy",
			Latency: int(time.Since(dbStart).Milliseconds()),
		}
	}

	// Check Redis cache (optional: a failure only degrades the service)
	cacheStart := time.Now()
	if err := cache.Ping(r.Context()).Err(); err != nil {
		checks["cache"] = Check{Status: "degraded", Error: err.Error()}
		if overallStatus == "healthy" {
			overallStatus = "degraded"
		}
	} else {
		checks["cache"] = Check{
			Status:  "healthy",
			Latency: int(time.Since(cacheStart).Milliseconds()),
		}
	}

	// Check downstream dependency
	depStart := time.Now()
	if err := checkDownstreamService(); err != nil {
		checks["inventory-service"] = Check{Status: "unhealthy", Error: err.Error()}
		overallStatus = "unhealthy"
	} else {
		checks["inventory-service"] = Check{
			Status:  "healthy",
			Latency: int(time.Since(depStart).Milliseconds()),
		}
	}

	response := HealthResponse{
		Status:    overallStatus,
		Timestamp: time.Now().UTC().Format(time.RFC3339),
		Version:   AppVersion,
		Uptime:    int64(time.Since(StartTime).Seconds()),
		Checks:    checks,
	}

	// Only "unhealthy" returns 503; "degraded" still returns 200.
	statusCode := http.StatusOK
	if overallStatus == "unhealthy" {
		statusCode = http.StatusServiceUnavailable
	}
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(statusCode)
	json.NewEncoder(w).Encode(response)
}
```
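The handler above hard-codes which checks are required and which are optional. As a minimal sketch of the same rule in isolation, the function below folds per-dependency results into one status, assuming the caller supplies the set of required dependency names. `overall_status` and `http_status_code` are illustrative names, not part of any library.

```python
def overall_status(checks, required):
    """Fold per-dependency check results into one service status.

    checks:   mapping of dependency name -> "healthy" or any other status
    required: names whose failure makes the whole service unhealthy
    """
    status = "healthy"
    for name, result in checks.items():
        if result == "healthy":
            continue
        if name in required:
            return "unhealthy"  # a required dependency is failing
        status = "degraded"     # an optional dependency is failing
    return status


def http_status_code(status):
    # Only "unhealthy" maps to 503, so degraded instances keep receiving traffic.
    return 503 if status == "unhealthy" else 200


checks = {"database": "healthy", "cache": "unhealthy"}
status = overall_status(checks, required={"database"})
print(status, http_status_code(status))
```

Keeping the rule in one function makes it easier to apply the same health semantics across services written by different teams.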
Monitoring Service Communication Patterns
Microservices communicate in two primary patterns, each with different monitoring needs:
Synchronous (HTTP/gRPC) Monitoring
For synchronous calls, track these metrics per service pair:
```promql
# Request rate between service pairs
sum(rate(http_client_requests_total{source_service="checkout", target_service="payment"}[5m]))

# Error rate between service pairs (fraction of requests returning 5xx)
sum(rate(http_client_requests_total{source_service="checkout", target_service="payment", status=~"5.."}[5m]))
  /
sum(rate(http_client_requests_total{source_service="checkout", target_service="payment"}[5m]))

# p99 latency for service-to-service calls (aggregate buckets by le across instances)
histogram_quantile(0.99,
  sum by (le) (
    rate(http_client_request_duration_seconds_bucket{
      source_service="checkout",
      target_service="payment"
    }[5m])
  )
)
```
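A p99 computed from a histogram is an estimate, not an exact measurement, and it helps to see why. The sketch below follows the linear interpolation that Prometheus documents for classic histograms: find the bucket containing the target rank, then interpolate between its bounds. The bucket boundaries and counts are made-up illustration data, and this is a simplified reimplementation, not the Prometheus source.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    The bucket list must include the +Inf bucket, whose count is the total.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative bucket counts (seconds) for checkout -> payment calls.
buckets = [(0.1, 900), (0.5, 980), (1.0, 995), (math.inf, 1000)]
print(round(histogram_quantile(0.99, buckets), 3))
```

Because the result is interpolated inside a bucket, the estimate is only as precise as the bucket boundaries. If you alert at 2 seconds, it is worth having a bucket boundary at or near 2 seconds.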
Asynchronous (Message Queue) Monitoring
For event-driven communication through Kafka, RabbitMQ, or SQS:
```python
# Kafka consumer lag monitoring (kafka-python)
from kafka import KafkaConsumer
from kafka.admin import KafkaAdminClient


def check_kafka_consumer_lag(bootstrap_servers, topic, group_id, lag_threshold=1000):
    """
    Monitor Kafka consumer lag as a proxy for async processing health.
    High lag = consumers can't keep up with producers.
    """
    admin = KafkaAdminClient(bootstrap_servers=bootstrap_servers)
    consumer = KafkaConsumer(bootstrap_servers=bootstrap_servers)
    try:
        # Committed offsets for the group: {TopicPartition: OffsetAndMetadata}
        committed = admin.list_consumer_group_offsets(group_id)
        partitions = [tp for tp in committed if tp.topic == topic]
        # Latest (log-end) offset per partition: {TopicPartition: int}
        end_offsets = consumer.end_offsets(partitions)
    finally:
        consumer.close()
        admin.close()

    total_lag = 0
    partition_lags = {}
    for tp in partitions:
        # Lag = messages produced but not yet committed by the group
        lag = max(end_offsets[tp] - committed[tp].offset, 0)
        total_lag += lag
        partition_lags[f"{topic}-{tp.partition}"] = lag

    status = "healthy"
    if total_lag > lag_threshold:
        status = "degraded"
    if total_lag > lag_threshold * 10:
        status = "unhealthy"

    return {
        "status": status,
        "total_lag": total_lag,
        "partition_lags": partition_lags,
        "threshold": lag_threshold,
    }
```
Distributed Tracing Integration
Distributed tracing is the most powerful tool for microservices debugging. When combined with monitoring, it transforms "something is slow" into "the payment-service call to fraud-detection is adding 400ms for 2% of requests."
```python
# OpenTelemetry setup for a microservice
from flask import Flask, jsonify, request
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.flask import FlaskInstrumentor

app = Flask(__name__)

# Configure tracing
provider = TracerProvider()
processor = BatchSpanProcessor(OTLPSpanExporter(
    endpoint="http://otel-collector:4317"
))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

# Auto-instrument HTTP calls
RequestsInstrumentor().instrument()
FlaskInstrumentor().instrument_app(app)

# Manual instrumentation for business logic
tracer = trace.get_tracer(__name__)


# inventory_client and payment_client are this service's own HTTP clients,
# defined elsewhere.
@app.route("/checkout", methods=["POST"])
def checkout():
    with tracer.start_as_current_span("checkout.process") as span:
        order = request.json
        span.set_attribute("order.id", order["id"])
        span.set_attribute("order.amount", order["total"])

        with tracer.start_as_current_span("checkout.validate_inventory"):
            inventory_result = inventory_client.check(order["items"])

        with tracer.start_as_current_span("checkout.process_payment"):
            payment_result = payment_client.charge(order["payment"])

        return jsonify({"status": "success", "order_id": order["id"]})
```
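Spans from different services end up in one trace because each outgoing request carries the trace context in a header. The instrumentation libraries handle this for you; the sketch below only illustrates the idea using the W3C Trace Context `traceparent` format (`version-traceid-parentid-flags`). The function names are illustrative, only version `00` is handled, and `tracestate` is ignored.

```python
import re
import secrets

# version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return (trace_id, parent_span_id, sampled), or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    trace_id, span_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None  # all-zero IDs are invalid
    return trace_id, span_id, bool(int(flags, 16) & 0x01)


def child_traceparent(incoming):
    """Build the header to send downstream: same trace ID, fresh span ID."""
    parsed = parse_traceparent(incoming)
    if parsed is None:
        # No valid context: start a new trace (assumes we choose to sample it).
        trace_id, sampled = secrets.token_hex(16), True
    else:
        trace_id, _, sampled = parsed
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(child_traceparent(incoming))
```

The trace ID stays constant across every hop, which is what lets a tracing backend stitch checkout, payment, and fraud-detection spans into a single timeline.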
Circuit Breaker Monitoring
Circuit breakers prevent cascading failures, but they need monitoring to be effective:
```go
// Circuit breaker with monitoring
// alerting.Send and Alert are placeholders for your own alerting client.
type CircuitBreaker struct {
	name            string
	state           string // "closed", "open", "half-open"
	failures        int
	successes       int
	lastStateChange time.Time
	threshold       int
	metrics         *prometheus.GaugeVec
}

func (cb *CircuitBreaker) recordStateChange(newState string) {
	oldState := cb.state
	cb.state = newState
	cb.lastStateChange = time.Now()

	// Track state transitions in metrics (0 = closed, 1 = open, 2 = half-open)
	cb.metrics.WithLabelValues(cb.name, "state").Set(
		map[string]float64{
			"closed":    0,
			"open":      1,
			"half-open": 2,
		}[newState],
	)

	// Alert when the circuit opens
	if newState == "open" {
		log.Printf("[CIRCUIT_OPEN] %s: circuit opened after %d failures",
			cb.name, cb.failures)
		alerting.Send(Alert{
			Name:     "Circuit Breaker Opened",
			Service:  cb.name,
			Severity: "warning",
			Message:  fmt.Sprintf("%s circuit opened - downstream failures detected", cb.name),
		})
	}

	// Log recovery when an open circuit closes directly
	if oldState == "open" && newState == "closed" {
		log.Printf("[CIRCUIT_CLOSED] %s: circuit recovered", cb.name)
	}
}
```
Monitor circuit breaker states across all services:
| Service | Circuit State | Failure Count | Last State Change |
|---|---|---|---|
| checkout→payment | Closed | 0 | Never opened |
| checkout→inventory | Half-Open | 12 | 2 min ago |
| checkout→fraud | Open | 47 | 5 min ago |
| payment→stripe | Closed | 0 | Never opened |
If the fraud-detection circuit is open, calls to it are failing fast. Depending on the fallback, checkout is either rejecting orders or proceeding without fraud checks. The second case is a security risk that needs immediate attention.
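The three states in the table follow a small state machine: closed until failures reach a threshold, open until a cool-down passes, then half-open while probe requests decide whether to close again. The sketch below is a minimal single-threaded illustration of those transitions. The `threshold` and `reset_timeout` defaults are arbitrary, and the `transitions` list stands in for the metrics and alerting hooks a real breaker would call.

```python
import time


class CircuitBreaker:
    def __init__(self, threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.threshold = threshold          # consecutive failures before opening
        self.reset_timeout = reset_timeout  # seconds to stay open before probing
        self.clock = clock                  # injectable so tests can control time
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0
        self.transitions = []               # (old_state, new_state) history

    def _set_state(self, new_state):
        if new_state != self.state:
            self.transitions.append((self.state, new_state))
            self.state = new_state

    def allow_request(self):
        # After the cool-down, move to half-open and let probe requests through.
        if self.state == "open" and self.clock() - self.opened_at >= self.reset_timeout:
            self._set_state("half-open")
        return self.state != "open"

    def record_success(self):
        self.failures = 0
        self._set_state("closed")

    def record_failure(self):
        self.failures += 1
        # A failed probe, or too many failures while closed, opens the circuit.
        if self.state == "half-open" or self.failures >= self.threshold:
            self.opened_at = self.clock()
            self._set_state("open")


breaker = CircuitBreaker(threshold=2)
breaker.record_failure()
breaker.record_failure()
print(breaker.state, breaker.allow_request())
```

Every entry appended to `transitions` is a point where a production breaker would update its state gauge, and where an alert on the open state would fire.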
Service Mesh Observability
Service meshes like Istio and Linkerd provide automatic observability for all service-to-service traffic:
```yaml
# Istio VirtualService with traffic management and observability
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: payment-service
spec:
  hosts:
    - payment-service
  http:
    - timeout: 3s
      retries:
        attempts: 3
        perTryTimeout: 1s
        retryOn: gateway-error,connect-failure,retriable-4xx
      route:
        - destination:
            host: payment-service
            port:
              number: 8080
```
Istio automatically generates Prometheus metrics for every service pair. No code changes required to get:
- Request rate, error rate, latency per service pair
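- Success rate, derived from response codes
- Retry and circuit breaker activity, via response flags and Envoy proxy statistics (some of which must be enabled explicitly)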
Service-Level Monitoring Alerts
Define alerts at the service level, not just the infrastructure level:
```yaml
# Service-level alerts
alerts:
  - name: "Checkout Service Degraded"
    condition: "checkout_service_error_rate > 1%"
    severity: critical
    runbook: "https://wiki/checkout-outage"
    routing: ["checkout-team-pagerduty"]
  - name: "Inter-Service Latency Spike"
    condition: |
      service_to_service_p99_latency{
        source="checkout",
        target="payment"
      } > 2000ms
    severity: warning
  - name: "Cascading Failure Risk"
    condition: "unhealthy_dependencies_count > 2"
    severity: critical
    message: "Multiple downstream services failing - potential cascade"
```
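The "Cascading Failure Risk" rule depends on knowing, per service, how many of its dependencies are currently unhealthy and which team to notify. As a rough sketch of that evaluation, the function below combines a dependency map with an ownership map. `DEPENDS_ON`, `OWNERS`, and `alerts_for` are hypothetical names, and the threshold of 2 mirrors the `unhealthy_dependencies_count > 2` condition above.

```python
# Hypothetical maps derived from a service dependency manifest.
DEPENDS_ON = {
    "checkout-service": [
        "user-service",
        "inventory-service",
        "payment-service",
        "notification-service",
    ],
    "payment-service": ["fraud-detection-service", "stripe-api"],
}
OWNERS = {"checkout-service": "checkout-team", "payment-service": "payments-team"}


def alerts_for(unhealthy, depends_on, owners, cascade_threshold=2):
    """Build one alert per service that has at least one unhealthy dependency."""
    alerts = []
    for service, deps in depends_on.items():
        failing = sorted(dep for dep in deps if dep in unhealthy)
        if not failing:
            continue
        # More failing dependencies than the threshold suggests a cascade.
        severity = "critical" if len(failing) > cascade_threshold else "warning"
        alerts.append({
            "service": service,
            "team": owners.get(service, "unowned"),
            "severity": severity,
            "failing_dependencies": failing,
        })
    return alerts


down = {"user-service", "inventory-service", "payment-service"}
for alert in alerts_for(down, DEPENDS_ON, OWNERS):
    print(alert["team"], alert["severity"], alert["failing_dependencies"])
```

Routing by owner keeps the page with the team that can act on it, while the dependency count separates an isolated failure from a likely cascade.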
Conclusion
Monitoring microservices requires thinking at multiple levels simultaneously: individual service health, service-to-service communication, distributed traces, and end-to-end user journeys. The teams that do this well invest in standardized health endpoints, distributed tracing from day one, service dependency maps, and automated alerting that knows which team owns which service. AzMonitor's API monitoring capabilities can serve as the external health check layer for your microservices, providing the outside-in availability view that complements your internal observability stack.