Microservices multiply monitoring complexity. A monolith has one service to monitor. A microservices system might have 50, 100, or 500 services, each with its own APIs, dependencies, failure modes, and deployment cycles. When a user reports a slow checkout, the root cause might be in any of a dozen services. Without systematic monitoring, you're guessing.

The Microservices Monitoring Challenge

In a monolith, a slow database query is easy to find because there is only one database to look at. In microservices, you need to know:

Which service in the call chain is slow?
Which service is calling which other service?
When Service A degrades, which dependent services are affected?
Is a cascading failure in progress?
Which team owns the failing service?

These questions require monitoring at three levels: individual service health, service-to-service communication, and end-to-end user journeys.

Service Dependency Maps

Before you can monitor microservices effectively, you need to know what depends on what. Build a service dependency map:

# Service dependency manifest (example)
services:
  checkout-service:
    owned_by: ["checkout-team"]
    depends_on:
      - service: user-service
        type: synchronous
        criticality: required
      - service: inventory-service
        type: synchronous
        criticality: required
      - service: payment-service
        type: synchronous
        criticality: required
      - service: notification-service
        type: asynchronous
        criticality: optional

  payment-service:
    owned_by: ["payments-team"]
    depends_on:
      - service: fraud-detection-service
        type: synchronous
        criticality: required
      - service: stripe-api
        type: external
        criticality: required

This map tells you that checkout-service is directly impacted by any failure in user-service, inventory-service, or payment-service. An alert on payment-service should also trigger awareness for the checkout team.
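A dependency map becomes more useful once it can be queried. The following is a minimal sketch, assuming the manifest has been loaded into a plain dictionary (the `DEPENDS_ON` name and the `impacted_services` helper are illustrative, not part of any tool). It walks the map in reverse to find every service that could be affected when one dependency fails, including indirect callers.

```python
from collections import deque

# Hypothetical in-memory copy of the manifest: service -> direct dependencies.
DEPENDS_ON = {
    "checkout-service": [
        "user-service",
        "inventory-service",
        "payment-service",
        "notification-service",
    ],
    "payment-service": ["fraud-detection-service", "stripe-api"],
}


def impacted_services(failed, depends_on):
    """Return every service that depends on `failed`, directly or transitively."""
    # Invert the edges: dependency -> set of services that call it.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk up the call chain from the failed service.
    impacted, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for caller in dependents.get(current, ()):
            if caller not in impacted:
                impacted.add(caller)
                queue.append(caller)
    return impacted
```

With this map, a failure in stripe-api reaches payment-service directly and checkout-service indirectly, so both owning teams would want to know about it.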

Health Check Endpoints for Microservices

Every service should expose a standardized health endpoint. Implement a structured health response:

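// Go: Structured health endpoint
package health

import (
    "database/sql"
    "encoding/json"
    "net/http"
    "time"

    "github.com/redis/go-redis/v9" // assumed Redis client; any client exposing Ping(ctx).Err() fits
)

// Package-level state, assumed to be initialized at startup.
// checkDownstreamService is likewise assumed to be defined elsewhere in the service.
var (
    db         *sql.DB
    cache      *redis.Client
    AppVersion = "dev"
    StartTime  = time.Now()
)
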
type HealthResponse struct {
    Status     string            `json:"status"`
    Timestamp  string            `json:"timestamp"`
    Version    string            `json:"version"`
    Uptime     int64             `json:"uptime_seconds"`
    Checks     map[string]Check  `json:"checks"`
}

type Check struct {
    Status  string `json:"status"`
    Latency int    `json:"latency_ms,omitempty"`
    Error   string `json:"error,omitempty"`
}

func HealthHandler(w http.ResponseWriter, r *http.Request) {
    checks := map[string]Check{}
    overallStatus := "healthy"
    
    // Check database
    dbStart := time.Now()
    if err := db.PingContext(r.Context()); err != nil {
        checks["database"] = Check{Status: "unhealthy", Error: err.Error()}
        overallStatus = "unhealthy"
    } else {
        checks["database"] = Check{
            Status: "healthy",
            Latency: int(time.Since(dbStart).Milliseconds()),
        }
    }
    
    // Check Redis cache
    cacheStart := time.Now()
    if err := cache.Ping(r.Context()).Err(); err != nil {
        checks["cache"] = Check{Status: "degraded", Error: err.Error()}
        if overallStatus == "healthy" {
            overallStatus = "degraded"
        }
    } else {
        checks["cache"] = Check{
            Status: "healthy",
            Latency: int(time.Since(cacheStart).Milliseconds()),
        }
    }
    
    // Check downstream dependency
    depStart := time.Now()
    if err := checkDownstreamService(); err != nil {
        checks["inventory-service"] = Check{Status: "unhealthy", Error: err.Error()}
        overallStatus = "unhealthy"
    } else {
        checks["inventory-service"] = Check{
            Status: "healthy",
            Latency: int(time.Since(depStart).Milliseconds()),
        }
    }
    
    response := HealthResponse{
        Status:    overallStatus,
        Timestamp: time.Now().UTC().Format(time.RFC3339),
        Version:   AppVersion,
        Uptime:    int64(time.Since(StartTime).Seconds()),
        Checks:    checks,
    }
    
    statusCode := http.StatusOK
    if overallStatus == "unhealthy" {
        statusCode = http.StatusServiceUnavailable
    }
    
    w.Header().Set("Content-Type", "application/json")
    w.WriteHeader(statusCode)
    json.NewEncoder(w).Encode(response)
}
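The handler above encodes a simple rule: a failing required check (database, inventory-service) makes the service unhealthy, while a failing optional check (cache) only degrades it. A minimal sketch of that rule in isolation, with hypothetical function names, makes it easier to test independently of any HTTP framework:

```python
def overall_status(check_ok, required):
    """Combine per-check results into one status.

    check_ok: mapping of check name -> True if the check passed.
    required: names of checks whose failure makes the service unhealthy.
    """
    status = "healthy"
    for name, ok in check_ok.items():
        if ok:
            continue
        if name in required:
            return "unhealthy"  # a required dependency is down
        status = "degraded"     # an optional dependency is down
    return status


def http_status(status):
    """Only 'unhealthy' maps to 503; 'degraded' still returns 200."""
    return 503 if status == "unhealthy" else 200
```

Returning 200 for a degraded service keeps load balancers sending traffic to it, while the response body still tells monitoring which check is failing.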

Monitoring Service Communication Patterns

Microservices communicate in two primary patterns, each with different monitoring needs:

Synchronous (HTTP/gRPC) Monitoring

For synchronous calls, track these metrics per service pair:

# Request rate between service pairs
rate(http_client_requests_total{source_service="checkout", target_service="payment"}[5m])

# Error rate between service pairs
# (both sides are aggregated so the label sets match when dividing)
sum(rate(http_client_requests_total{source_service="checkout", target_service="payment", status=~"5.."}[5m]))
/ sum(rate(http_client_requests_total{source_service="checkout", target_service="payment"}[5m]))

# p99 latency for service-to-service calls
histogram_quantile(0.99,
  sum by (le) (rate(http_client_request_duration_seconds_bucket{
    source_service="checkout",
    target_service="payment"
  }[5m]))
)
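It helps to know how a p99 is derived from histogram buckets, because the result is an estimate bounded by the bucket layout. The sketch below is a simplified re-implementation of the interpolation idea behind `histogram_quantile`, assuming cumulative buckets sorted by upper bound and ending with a `+Inf` bucket. It is for intuition only and does not handle every edge case the real function covers.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound,
             whose last entry is (math.inf, total_count).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound) or count == lower_count:
                # The quantile falls beyond the last finite bucket (or the
                # bucket is empty): the best available answer is its lower edge.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound
```

If the slowest 2% of requests all land in the `+Inf` bucket, the reported p99 is capped at the highest finite bucket boundary, so bucket boundaries should extend past the latency thresholds you alert on.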

Asynchronous (Message Queue) Monitoring

For event-driven communication through Kafka, RabbitMQ, or SQS:

# Kafka consumer lag monitoring (kafka-python)
from kafka import KafkaConsumer
from kafka.admin import KafkaAdminClient


def check_kafka_consumer_lag(bootstrap_servers, topic, group_id, lag_threshold=1000):
    """
    Monitor Kafka consumer lag as a proxy for async processing health.
    High lag = consumers can't keep up with producers.
    """
    admin = KafkaAdminClient(bootstrap_servers=bootstrap_servers)
    consumer = KafkaConsumer(bootstrap_servers=bootstrap_servers)
    try:
        # Committed offsets for the group: {TopicPartition: OffsetAndMetadata}
        committed = admin.list_consumer_group_offsets(group_id)
        partitions = [tp for tp in committed if tp.topic == topic]
        # Latest (log-end) offsets per partition: {TopicPartition: int}
        end_offsets = consumer.end_offsets(partitions) if partitions else {}
    finally:
        consumer.close()
        admin.close()

    # Lag per partition = newest offset in the log - last committed offset
    partition_lags = {}
    for tp in partitions:
        lag = max(end_offsets[tp] - committed[tp].offset, 0)
        partition_lags[f"{topic}-{tp.partition}"] = lag
    total_lag = sum(partition_lags.values())

    status = "healthy"
    if total_lag > lag_threshold:
        status = "degraded"
    if total_lag > lag_threshold * 10:
        status = "unhealthy"

    return {
        "status": status,
        "total_lag": total_lag,
        "partition_lags": partition_lags,
        "threshold": lag_threshold
    }

Distributed Tracing Integration

Distributed tracing is the most powerful tool for microservices debugging. When combined with monitoring, it transforms "something is slow" into "the payment-service call to fraud-detection is adding 400ms for 2% of requests."

# OpenTelemetry setup for a microservice
from flask import Flask, jsonify, request
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.flask import FlaskInstrumentor

app = Flask(__name__)

# Configure tracing
provider = TracerProvider()
processor = BatchSpanProcessor(OTLPSpanExporter(
    endpoint="http://otel-collector:4317"
))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

# Auto-instrument HTTP calls
RequestsInstrumentor().instrument()
FlaskInstrumentor().instrument_app(app)

# Manual instrumentation for business logic
# (inventory_client and payment_client are assumed to be this service's own
# HTTP clients, defined elsewhere)
tracer = trace.get_tracer(__name__)

@app.route("/checkout", methods=["POST"])
def checkout():
    with tracer.start_as_current_span("checkout.process") as span:
        order = request.json
        span.set_attribute("order.id", order["id"])
        span.set_attribute("order.amount", order["total"])
        
        with tracer.start_as_current_span("checkout.validate_inventory"):
            inventory_result = inventory_client.check(order["items"])
        
        with tracer.start_as_current_span("checkout.process_payment"):
            payment_result = payment_client.charge(order["payment"])
        
        return jsonify({"status": "success", "order_id": order["id"]})

Circuit Breaker Monitoring

Circuit breakers prevent cascading failures, but they need monitoring to be effective:

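// Circuit breaker with monitoring
import (
    "fmt"
    "log"
    "time"

    "github.com/prometheus/client_golang/prometheus"
)

// alerting.Send and Alert are assumed to come from an internal alerting package.
type CircuitBreaker struct {
    name            string
    state           string // "closed", "open", "half-open"
    failures        int
    successes       int
    lastStateChange time.Time
    threshold       int
    metrics         *prometheus.GaugeVec
}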

func (cb *CircuitBreaker) recordStateChange(newState string) {
    oldState := cb.state
    cb.state = newState
    cb.lastStateChange = time.Now()
    
    // Track state transitions in metrics
    cb.metrics.WithLabelValues(cb.name, "state").Set(
        map[string]float64{
            "closed":    0,
            "open":      1,
            "half-open": 2,
        }[newState],
    )
    
    // Alert on state changes
    if newState == "open" {
        log.Printf("[CIRCUIT_OPEN] %s: circuit opened after %d failures", 
            cb.name, cb.failures)
        alerting.Send(Alert{
            Name:     "Circuit Breaker Opened",
            Service:  cb.name,
            Severity: "warning",
            Message:  fmt.Sprintf("%s circuit opened - downstream failures detected", cb.name),
        })
    }
    
    if oldState == "open" && newState == "closed" {
        log.Printf("[CIRCUIT_CLOSED] %s: circuit recovered", cb.name)
    }
}

Monitor circuit breaker states across all services:

| Service | Circuit State | Failure Count | Last State Change |
|---|---|---|---|
| checkout→payment | Closed | 0 | Never opened |
| checkout→inventory | Half-Open | 12 | 2 min ago |
| checkout→fraud | Open | 47 | 5 min ago |
| payment→stripe | Closed | 0 | Never opened |

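An open circuit on the fraud-detection dependency means those calls are failing fast instead of reaching the service. If the fallback is to proceed without a fraud check, checkout is currently bypassing fraud checks. That's a security risk that needs immediate attention.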
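The three states in the table follow a small state machine. Here is a minimal sketch of it, assuming a consecutive-failure threshold and a fixed reset timeout (both parameter names are illustrative). It records every transition so that metrics and alerts can be driven from one place. It is not thread-safe and lets all requests through while half-open, which a production breaker would restrict to a single probe.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds to stay open before probing
        self.clock = clock                  # injectable for tests
        self.state = "closed"
        self.failures = 0
        self.opened_at = None
        self.transitions = []               # (old_state, new_state) history

    def _set_state(self, new_state):
        if new_state != self.state:
            self.transitions.append((self.state, new_state))
            self.state = new_state

    def allow_request(self):
        """Return True if a call to the dependency should be attempted."""
        if self.state == "open":
            if self.clock() - self.opened_at >= self.reset_timeout:
                self._set_state("half-open")  # allow a probe request
                return True
            return False                      # fail fast
        return True

    def record_success(self):
        self.failures = 0
        self._set_state("closed")

    def record_failure(self):
        self.failures += 1
        # A failed probe, or too many consecutive failures, opens the circuit.
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
            self._set_state("open")
```

Exporting the `transitions` history as a counter, alongside the current state gauge, makes flapping circuits visible even when the state looks fine at scrape time.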

Service Mesh Observability

Service meshes like Istio and Linkerd provide automatic observability for all service-to-service traffic:

# Istio VirtualService with traffic management and observability
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: payment-service
spec:
  hosts:
    - payment-service
  http:
    - timeout: 3s
      retries:
        attempts: 3
        perTryTimeout: 1s
        retryOn: gateway-error,connect-failure,retriable-4xx
      route:
        - destination:
            host: payment-service
            port:
              number: 8080

Istio automatically generates Prometheus metrics for every service pair. No code changes required to get:

Request rate, error rate, latency per service pair
Retry rate and success rate
Circuit breaker status

Service-Level Monitoring Alerts

Define alerts at the service level, not just the infrastructure level:

# Service-level alerts
alerts:
  - name: "Checkout Service Degraded"
    condition: "checkout_service_error_rate > 1%"
    severity: critical
    runbook: "https://wiki/checkout-outage"
    routing: ["checkout-team-pagerduty"]
    
  - name: "Inter-Service Latency Spike"
    condition: |
      service_to_service_p99_latency{
        source="checkout", 
        target="payment"
      } > 2000ms
    severity: warning
    
  - name: "Cascading Failure Risk"
    condition: "unhealthy_dependencies_count > 2"
    severity: critical
    message: "Multiple downstream services failing - potential cascade"
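The cascading-failure rule and team routing can be combined in a few lines. The following is a sketch under assumptions: the `OWNERS` mapping, the `cascading_failure_alert` function, and the `<team>-pagerduty` routing convention are illustrative and mirror the example configuration rather than any specific alerting product.

```python
# Hypothetical ownership lookup, derived from the service dependency manifest.
OWNERS = {
    "checkout-service": "checkout-team",
    "payment-service": "payments-team",
}


def cascading_failure_alert(service, dependency_status, owners, max_unhealthy=2):
    """Return an alert dict if more than `max_unhealthy` dependencies are unhealthy.

    dependency_status: mapping of dependency name -> "healthy" | "degraded" | "unhealthy".
    Returns None when the threshold is not exceeded.
    """
    unhealthy = sorted(
        name for name, status in dependency_status.items() if status == "unhealthy"
    )
    if len(unhealthy) <= max_unhealthy:
        return None
    team = owners.get(service, "unowned")
    return {
        "name": "Cascading Failure Risk",
        "service": service,
        "severity": "critical",
        "routing": [f"{team}-pagerduty"],
        "message": f"{len(unhealthy)} downstream services failing: {', '.join(unhealthy)}",
    }
```

Routing through an ownership lookup means a service without a registered owner surfaces as `unowned-pagerduty`, which is itself a useful signal that the dependency map is incomplete.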

Conclusion

Monitoring microservices requires thinking at multiple levels simultaneously: individual service health, service-to-service communication, distributed traces, and end-to-end user journeys. The teams that do this well invest in standardized health endpoints, distributed tracing from day one, service dependency maps, and automated alerting that knows which team owns which service. AzMonitor's API monitoring capabilities can serve as the external health check layer for your microservices, providing the outside-in availability view that complements your internal observability stack.

Tags: microservices, distributed systems, API monitoring, service mesh


AzMonitor Team

The AzMonitor team writes guides based on experience monitoring millions of endpoints daily across 10,000+ customer environments. Our expertise covers uptime monitoring, SRE practices, and reliability engineering.


Microservices API Monitoring: Observability at Scale
