Monitoring tells you something is wrong. Observability tells you why. The distinction matters enormously in complex distributed systems where a problem might originate anywhere across dozens of services and surface anywhere else. Observability — the ability to understand your system's internal state from its external outputs — rests on three pillars: logs, metrics, and distributed traces. Understanding what each pillar provides and how they work together is fundamental to building systems that are diagnosable when (not if) they break.

Why Monitoring Alone Isn't Enough

Traditional monitoring answers the question: "Is my system working?" Metrics tell you CPU is at 95%, error rate is 2%, latency p99 is 3 seconds. These numbers tell you something is wrong. They don't tell you:

Which request is causing the high CPU?
Why are only some requests erroring?
Where in the multi-service call chain is the 3-second latency coming from?
Does the problem affect specific users, data patterns, or geographic regions?

Answering these questions requires observability — the ability to explore your system's behavior in ways you didn't predict in advance.

Pillar 1: Logs

Logs are timestamped records of discrete events. They're the most granular source of information about what happened and when.

Structured vs Unstructured Logs

Unstructured logs are human-readable text:

[2025-10-22 14:32:01] ERROR: Failed to process payment for user 12345

Structured logs are machine-parseable JSON (or similar):

{
  "timestamp": "2025-10-22T14:32:01Z",
  "level": "error",
  "event": "payment.processing.failed",
  "user_id": "12345",
  "payment_id": "pay_abc123",
  "amount": 9900,
  "currency": "usd",
  "error": "insufficient_funds",
  "duration_ms": 245,
  "service": "payment-service",
  "version": "2.4.1",
  "trace_id": "7f2a1b3c4d5e6f7a",
  "span_id": "1a2b3c4d"
}
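
A record like this can be produced with nothing more than the standard library. The following is a minimal sketch, not a recommended production setup: the `JsonFormatter` class, the `build_logger` helper, and the `fields` key passed through `extra` are illustrative names chosen here, and a library such as structlog handles the same job with less code.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "event": record.getMessage(),
        }
        # Context passed as extra={"fields": {...}} is merged into the output.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("payment-service")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
log.error(
    "payment.processing.failed",
    extra={"fields": {"user_id": "12345", "error": "insufficient_funds", "duration_ms": 245}},
)

# Each line in the stream is now a parseable JSON document.
record = json.loads(buffer.getvalue())
print(record["event"], record["error"])
```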

Structured logs enable:

Filtering by any field (WHERE user_id = "12345")
Aggregating across events (GROUP BY error)
Linking to traces (via trace_id)
Alerting on log patterns
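
To make those capabilities concrete, here is a small sketch that treats a handful of JSON log lines as queryable data. The sample lines are invented for illustration; in practice a log platform runs the equivalent queries over far more data.

```python
import json
from collections import Counter

# Invented sample data: one JSON document per line, as a log shipper would see it.
raw_lines = [
    '{"level": "error", "user_id": "12345", "error": "insufficient_funds", "trace_id": "7f2a1b3c4d5e6f7a"}',
    '{"level": "error", "user_id": "67890", "error": "card_expired", "trace_id": "aa11bb22cc33dd44"}',
    '{"level": "info", "user_id": "12345", "event": "payment.processing.started"}',
    '{"level": "error", "user_id": "12345", "error": "insufficient_funds", "trace_id": "0f0e0d0c0b0a0908"}',
]
events = [json.loads(line) for line in raw_lines]

# Filtering by any field: the equivalent of WHERE user_id = "12345"
for_user = [e for e in events if e.get("user_id") == "12345"]

# Aggregating across events: the equivalent of grouping error events by their error field
errors_by_type = Counter(e["error"] for e in events if e.get("level") == "error")

# Linking to traces: collect the trace IDs attached to one kind of error
trace_ids = [e["trace_id"] for e in events if e.get("error") == "insufficient_funds"]

print(len(for_user), dict(errors_by_type), trace_ids)
```

None of these queries is possible against the unstructured line without writing a fragile regular expression first.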

What to Log

# Good logging practice: log events with context, not just errors
import stripe
import structlog

log = structlog.get_logger()

def process_payment(payment_id, user_id, amount):
    log.info("payment.processing.started",
        payment_id=payment_id,
        user_id=user_id,
        amount=amount
    )

    # Validate before the try block so a validation failure is logged once,
    # not a second time by the generic handler below
    if amount <= 0:
        log.warning("payment.validation.failed",
            payment_id=payment_id,
            reason="invalid_amount",
            amount=amount
        )
        raise ValueError("Amount must be positive")

    try:
        # Process (charge_stripe is an application helper defined elsewhere)
        result = charge_stripe(amount)

        log.info("payment.processing.succeeded",
            payment_id=payment_id,
            stripe_charge_id=result.id,
            processing_time_ms=result.processing_time
        )

        return result

    except stripe.CardError as e:
        # Expected business outcome: warn, and never log the full card number
        log.warning("payment.processing.card_declined",
            payment_id=payment_id,
            decline_code=e.code
        )
        raise

    except Exception as e:
        # Unexpected failure: record the type and message for later search
        log.error("payment.processing.failed",
            payment_id=payment_id,
            error_type=type(e).__name__,
            error_message=str(e)
        )
        raise

Log Aggregation and Search

Logs from multiple services need to be centralized for effective use:

| Tool | Best For |
|---|---|
| Elasticsearch + Kibana (ELK) | Flexible full-text search, large scale |
| Loki + Grafana | Log-metric correlation, Prometheus-native |
| Datadog Logs | All-in-one observability platform |
| CloudWatch Logs | AWS-native simplicity |
| Splunk | Enterprise, compliance, complex queries |

Pillar 2: Metrics

Metrics are numeric measurements over time. They're efficient (low storage, fast aggregation) and excellent for alerting, dashboards, and trend analysis.

Metric Types

Counter — Monotonically increasing value. Total requests, total errors, total bytes.

from prometheus_client import Counter

requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'path', 'status_code']
)

# Increment on each request
requests_total.labels(
    method='POST',
    path='/api/payments',
    status_code='200'
).inc()

Gauge — Current value that can go up or down. Active connections, queue depth, memory usage.

from prometheus_client import Gauge

active_connections = Gauge(
    'websocket_active_connections',
    'Number of active WebSocket connections'
)

# Set on connect/disconnect
active_connections.inc()  # on connect
active_connections.dec()  # on disconnect

Histogram — Distribution of values. Request durations, response sizes, payment amounts.

from prometheus_client import Histogram

request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['endpoint'],
    buckets=[0.1, 0.25, 0.5, 1.0, 2.5, 5.0]
)

# Observe each request
with request_duration.labels(endpoint='/api/checkout').time():
    process_checkout()

The Four Golden Signals

For most services, these four metrics capture the essential health picture:

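Latency — How long requests take. Track the latency of failed requests separately from successful ones: fast failures can drag the overall number down and hide a problem, and a slow error is worse than a fast one.

Traffic — How many requests per second your service is handling.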

Errors — Rate of failed requests (5xx errors, application errors, validation failures).

Saturation — How "full" your service is. CPU utilization, connection pool usage, queue depth.

# Golden signals in Prometheus

# Latency (p99), aggregated across all series
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)

# Traffic (requests per second)
sum(rate(http_requests_total[5m]))

# Error rate (fraction of requests returning 5xx)
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))

# Saturation (CPU utilization per instance, averaged across cores)
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
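
It helps to know how the p99 query arrives at its number, because the answer is an estimate rather than a measurement. The sketch below is a simplified Python rendition of the interpolation that `histogram_quantile` performs on classic cumulative buckets. The bucket counts are invented, and the real function handles more edge cases than this.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound,
    with the last entry being the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total  # position of the requested observation
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls past the largest finite bucket, so the
                # best available answer is that bucket's upper bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Invented counts for the buckets [0.1, 0.25, 0.5, 1.0, 2.5, 5.0, +Inf]
example_buckets = [
    (0.1, 50), (0.25, 80), (0.5, 90), (1.0, 95),
    (2.5, 98), (5.0, 100), (math.inf, 100),
]
p99 = histogram_quantile(0.99, example_buckets)
print(f"estimated p99: {p99:.2f}s")  # about 3.75s, interpolated within the 2.5-5.0 bucket
```

Two practical consequences follow. The estimate is only as precise as the bucket boundaries around it, and with these buckets no request can ever be reported as slower than 5 seconds, so choose boundaries that bracket your latency targets.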

Pillar 3: Distributed Traces

Distributed traces show the complete path of a request through your system. They're the most direct way to answer "where in my microservices is this request spending time?"

How Tracing Works

Every request gets a unique trace_id. As it passes through services, each service creates a span that records what happened and how long it took. Spans have parent-child relationships that reconstruct the full call tree.

Trace: 7f2a1b3c4d5e6f7a
│
├── Span: api-gateway (25ms total)
│   ├── Span: auth-service (8ms)
│   └── Span: checkout-service (15ms)
│       ├── Span: inventory-service (4ms)
│       ├── Span: payment-service (9ms)
│       │   └── Span: stripe-api (7ms)  ← HERE is the bottleneck
│       └── Span: notification-service (2ms, async)

Without tracing, you know only that the request takes 25ms end to end. With tracing, you can see that the Stripe API call accounts for 7ms of that, and you can see it for each individual request.
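
The parent-child structure is simple enough to model directly. The sketch below rebuilds the call tree above from a flat list of spans, which is roughly how a tracing backend receives them, and then follows the slowest child at each level. The `Span` class and helper functions are illustrative and not part of any tracing library, and the self-time calculation assumes children run one after another.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    span_id: str
    name: str
    duration_ms: int
    parent_id: Optional[str] = None
    children: List["Span"] = field(default_factory=list)


def build_tree(spans):
    """Link spans to their parents and return the root span."""
    by_id = {s.span_id: s for s in spans}
    root = None
    for s in spans:
        if s.parent_id is None:
            root = s
        else:
            by_id[s.parent_id].children.append(s)
    return root


def self_time(span):
    """Time spent in the span itself, excluding its (sequential) children."""
    return span.duration_ms - sum(c.duration_ms for c in span.children)


def slowest_path(span):
    """Follow the longest-running child at each level of the tree."""
    path = [span.name]
    while span.children:
        span = max(span.children, key=lambda child: child.duration_ms)
        path.append(span.name)
    return path


# Spans arrive as a flat list; only the IDs tie them together.
spans = [
    Span("a", "api-gateway", 25),
    Span("b", "auth-service", 8, parent_id="a"),
    Span("c", "checkout-service", 15, parent_id="a"),
    Span("d", "inventory-service", 4, parent_id="c"),
    Span("e", "payment-service", 9, parent_id="c"),
    Span("f", "stripe-api", 7, parent_id="e"),
    Span("g", "notification-service", 2, parent_id="c"),
]
root = build_tree(spans)
print(" -> ".join(slowest_path(root)))
print("gateway self time:", self_time(root), "ms")
```

Following the slowest child leads from the gateway through checkout and payment to the Stripe call, which matches the bottleneck marked in the diagram.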

OpenTelemetry: The Standard

OpenTelemetry (OTel) is the vendor-neutral standard for distributed tracing (and for metrics and logs). It provides SDKs for most major languages:

# Python: OpenTelemetry setup
from flask import Flask, jsonify, request
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.flask import FlaskInstrumentor

app = Flask(__name__)

# Configure tracer
provider = TracerProvider()
exporter = OTLPSpanExporter(endpoint="http://otel-collector:4317")
processor = BatchSpanProcessor(exporter)
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

# Auto-instrument frameworks
RequestsInstrumentor().instrument()
FlaskInstrumentor().instrument_app(app)

# Manual instrumentation for business logic
tracer = trace.get_tracer(__name__)

@app.route('/checkout', methods=['POST'])
def checkout():
    with tracer.start_as_current_span("checkout.process") as span:
        order = request.json

        # Add useful attributes to the span
        span.set_attribute("order.id", order["id"])
        span.set_attribute("order.total_cents", order["total"])
        span.set_attribute("order.item_count", len(order["items"]))

        # check_inventory and charge_payment are application helpers
        with tracer.start_as_current_span("checkout.validate_inventory"):
            inventory_result = check_inventory(order["items"])

        with tracer.start_as_current_span("checkout.charge_payment") as payment_span:
            payment_result = charge_payment(order["payment"])
            # Record the charge ID on the span that made the charge
            payment_span.set_attribute("payment.charge_id", payment_result.charge_id)

        return jsonify({"status": "success", "order_id": order["id"]})

Trace Sampling

You don't need to trace every request. For high-traffic services, sampling 1-10% of traffic is usually enough:

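from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import (
    Decision, ParentBased, Sampler, SamplingResult, TraceIdRatioBased,
)

# Sample 10% of requests
sampler = TraceIdRatioBased(0.1)

# But always trace errors and slow requests
class SmartSampler(Sampler):
    """Sample more of the interesting requests.

    A head sampler runs when a span starts, so it only sees attributes
    supplied at that moment. Status codes and durations are usually known
    only when the request ends; keeping every error or slow request
    reliably needs tail-based sampling (for example in a collector).
    """

    def __init__(self, ratio=0.1):
        self._fallback = TraceIdRatioBased(ratio)

    def should_sample(self, parent_context, trace_id, name, kind=None,
                      attributes=None, links=None, trace_state=None):
        attributes = attributes or {}

        # Always trace errors
        if attributes.get("http.status_code", 0) >= 500:
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)

        # Always trace slow requests (flag set by middleware)
        if attributes.get("is_slow_request"):
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)

        # Sample 10% of everything else, deterministically by trace ID
        return self._fallback.should_sample(
            parent_context, trace_id, name, kind, attributes, links, trace_state
        )

    def get_description(self):
        return "SmartSampler"

# Respect the parent's decision so a trace is kept or dropped as a whole
provider = TracerProvider(sampler=ParentBased(SmartSampler()))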
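
Ratio-based sampling works across services because the decision is a pure function of the trace ID: every service that sees the same trace reaches the same verdict without coordinating. The sketch below illustrates that idea by comparing the low 64 bits of the ID against a threshold. The function and constant names are illustrative, not OpenTelemetry API.

```python
import random

MASK_64 = (1 << 64) - 1


def should_sample(trace_id: int, ratio: float) -> bool:
    """Keep a trace if the low 64 bits of its ID fall below ratio * 2**64."""
    bound = round(ratio * (1 << 64))
    return (trace_id & MASK_64) < bound


# Simulate random 128-bit trace IDs with a fixed seed for repeatability.
rng = random.Random(42)
trace_ids = [rng.getrandbits(128) for _ in range(100_000)]

kept = [tid for tid in trace_ids if should_sample(tid, 0.1)]
fraction_kept = len(kept) / len(trace_ids)

# A second service evaluating the same IDs makes identical decisions.
kept_downstream = [tid for tid in trace_ids if should_sample(tid, 0.1)]

print(f"kept {fraction_kept:.1%} of traces")
```

Because no randomness is involved at decision time, a trace is never half-recorded as long as every service uses the same ratio.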

Connecting the Three Pillars

The pillars are most powerful when connected:

From metrics to logs — Your error rate metric spikes. Click to see the logs matching the same time window, filtered to error level.

From logs to traces — A log entry shows a slow request. The trace_id in the log takes you directly to the full trace for that specific request.

From traces to metrics — A trace shows the payment service is slow. Switch to the payment service metrics dashboard to see if latency is broadly elevated or isolated to this trace.

Alert: Error rate spiked to 5%
  ↓
Logs search: level=error AND time=[spike window]
  → Found error: "payment_processor_timeout"
  → Found trace_id: 7f2a1b3c4d5e6f7a
  ↓
Trace: 7f2a1b3c4d5e6f7a
  → payment-service span: 12s (!)
  → stripe-api span: 11s (timeout)
  ↓
Metrics: stripe-api call latency
  → p99 jumped from 800ms to 11s at 14:30
  ↓
External monitoring: Stripe status page
  → Stripe reported degradation starting 14:28
  
Root cause: Stripe API degradation
Next: Enable payment queue mode

This diagnostic journey took 5 minutes because each signal connected to the others.
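
The link between logs and traces in that journey depends on every log line carrying the active trace ID. One way to arrange this with the standard library is a logging filter that stamps each record automatically. This is a sketch under assumptions: `current_trace_id` is a hypothetical context variable set by request middleware, whereas in an OpenTelemetry setup the ID would be read from the current span's context instead.

```python
import contextvars
import io
import logging

# Hypothetical holder for the trace ID of the request being handled.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


class TraceIdFilter(logging.Filter):
    """Attach the active trace ID to every record passing through the handler."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s trace_id=%(trace_id)s"))
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("checkout")
logger.handlers.clear()
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False


def handle_request(trace_id: str) -> None:
    """Simulate middleware setting the trace ID for the duration of a request."""
    token = current_trace_id.set(trace_id)
    try:
        # Application code logs normally and never mentions the trace ID.
        logger.error("payment_processor_timeout")
    finally:
        current_trace_id.reset(token)


handle_request("7f2a1b3c4d5e6f7a")
print(stream.getvalue().strip())
```

Application code stays free of tracing concerns, yet every line it emits can be joined to its trace.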

Where External Monitoring Fits

External monitoring (like AzMonitor) is a fourth layer that complements the three pillars:

Logs — What happened inside your system
Metrics — How your system is performing internally
Traces — How requests flow through your system
External monitoring — What users experience from outside your system

External monitoring catches failures that internal observability misses: network path issues, CDN problems, regional DNS failures, issues that affect users before they generate any internal telemetry. It's the ground truth for availability from the user's perspective.
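
At its core, an external check is an HTTP request made from outside your infrastructure, with the status and latency recorded. The following is a minimal sketch of such a probe using only the standard library. The `ProbeResult` type, the `probe` function, and the example URL are illustrative, and a real monitoring service adds multiple regions, retries, and alerting on top.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    url: str
    ok: bool
    status: Optional[int]
    latency_ms: float
    error: Optional[str] = None


def probe(url, timeout=5.0, opener=urllib.request.urlopen):
    """Request a URL once and report what an outside user would experience."""
    start = time.monotonic()
    status = None
    error = None
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx or 5xx status.
        status, error = exc.code, str(exc)
    except OSError as exc:
        # DNS failure, refused connection, TLS error, or timeout:
        # the request never produced an HTTP response.
        error = str(exc)
    latency_ms = (time.monotonic() - start) * 1000
    ok = error is None and status is not None and 200 <= status < 400
    return ProbeResult(url, ok, status, latency_ms, error)


# Example usage (performs a real network request, so it is left commented out):
# result = probe("https://example.com/health")
# print(result.ok, result.status, round(result.latency_ms), "ms")
```

The second `except` branch covers the failures the paragraph above describes: they occur before any request reaches your services, so no internal log, metric, or trace exists for them.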

Building Your Observability Stack

Start with what provides the most immediate value:

Week 1 — Add structured logging and centralize logs
Week 2 — Add Prometheus metrics and basic dashboards
Week 3 — Set up external monitoring for critical endpoints
Month 2 — Add distributed tracing for top 5 critical paths
Month 3 — Connect pillars: trace IDs in logs, log links from metrics

Conclusion

The three pillars of observability aren't competing alternatives — they're complementary tools that answer different questions about your system. Logs tell you what happened. Metrics tell you how often and how severe. Traces tell you where in the request flow. External monitoring tells you what users actually experience. Together, they transform "something is wrong and we don't know why" into "we know what's wrong, where it is, and how to fix it." AzMonitor sits at the external monitoring layer of this stack, providing the user-perspective availability data that grounds the rest of your observability in what actually matters: whether your service works for real people.

AzMonitor Team

The AzMonitor team writes guides based on experience monitoring millions of endpoints daily across 10,000+ customer environments. Our expertise covers uptime monitoring, SRE practices, and reliability engineering.

The Three Pillars of Observability: Logs, Metrics, and Traces