Monitoring tells you something is wrong. Observability tells you why. The distinction matters enormously in complex distributed systems where a problem might originate anywhere across dozens of services and surface anywhere else. Observability — the ability to understand your system's internal state from its external outputs — rests on three pillars: logs, metrics, and distributed traces. Understanding what each pillar provides and how they work together is fundamental to building systems that are diagnosable when (not if) they break.
Why Monitoring Alone Isn't Enough
Traditional monitoring answers the question: "Is my system working?" Metrics tell you CPU is at 95%, error rate is 2%, latency p99 is 3 seconds. These numbers tell you something is wrong. They don't tell you:
- Which request is causing the high CPU?
- Why are only some requests erroring?
- Where in the multi-service call chain is the 3-second latency coming from?
- Does the problem affect specific users, data patterns, or geographic regions?
Answering these questions requires observability — the ability to explore your system's behavior in ways you didn't predict in advance.
Pillar 1: Logs
Logs are timestamped records of discrete events. They're the most granular source of information about what happened and when.
Structured vs Unstructured Logs
Unstructured logs are human-readable text:
[2025-10-22 14:32:01] ERROR: Failed to process payment for user 12345
Structured logs are machine-parseable JSON (or similar):
{
  "timestamp": "2025-10-22T14:32:01Z",
  "level": "error",
  "event": "payment.processing.failed",
  "user_id": "12345",
  "payment_id": "pay_abc123",
  "amount": 9900,
  "currency": "usd",
  "error": "insufficient_funds",
  "duration_ms": 245,
  "service": "payment-service",
  "version": "2.4.1",
  "trace_id": "7f2a1b3c4d5e6f7a",
  "span_id": "1a2b3c4d"
}
Structured logs enable:
- Filtering by any field (WHERE user_id = "12345")
- Aggregating across events (GROUP BY error_type)
- Linking to traces (via trace_id)
- Alerting on log patterns
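To make the first three capabilities concrete, here is a minimal sketch that filters and aggregates newline-delimited JSON logs using only the standard library. The sample log lines and the helper names `filter_events` and `count_by` are made up for illustration; a real log backend would run the equivalent query server-side.

```python
import json
from collections import Counter

# Hypothetical sample of newline-delimited JSON log lines.
raw_lines = [
    '{"level": "error", "event": "payment.processing.failed", "user_id": "12345", '
    '"error": "insufficient_funds", "trace_id": "7f2a1b3c4d5e6f7a"}',
    '{"level": "info", "event": "payment.processing.succeeded", "user_id": "67890"}',
    '{"level": "error", "event": "payment.processing.failed", "user_id": "12345", '
    '"error": "card_expired", "trace_id": "aa11bb22cc33dd44"}',
    '{"level": "error", "event": "payment.processing.failed", "user_id": "55555", '
    '"error": "insufficient_funds", "trace_id": "0123456789abcdef"}',
]

events = [json.loads(line) for line in raw_lines]


def filter_events(events, **criteria):
    """Return events whose fields equal every given key=value pair."""
    return [e for e in events if all(e.get(k) == v for k, v in criteria.items())]


def count_by(events, field):
    """Count events per value of `field`, skipping events that lack it."""
    return Counter(e[field] for e in events if field in e)


# Filtering: every error for one user.
user_errors = filter_events(events, user_id="12345", level="error")

# Aggregating: error events grouped by their "error" field.
errors_by_type = count_by(filter_events(events, level="error"), "error")

# Linking: the trace IDs to look up in the tracing backend.
trace_ids = [e["trace_id"] for e in user_errors]
```

None of this is possible with the unstructured line without writing a fragile regular expression for each question.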
What to Log
# Good logging practice: log events with context, not just errors
import stripe
import structlog

log = structlog.get_logger()

def process_payment(payment_id, user_id, amount):
    log.info("payment.processing.started",
        payment_id=payment_id,
        user_id=user_id,
        amount=amount
    )
    try:
        # Validate
        if amount <= 0:
            log.warning("payment.validation.failed",
                payment_id=payment_id,
                reason="invalid_amount",
                amount=amount
            )
            raise ValueError("Amount must be positive")
        # Process (charge_stripe is assumed to be defined elsewhere)
        result = charge_stripe(amount)
        log.info("payment.processing.succeeded",
            payment_id=payment_id,
            stripe_charge_id=result.id,
            processing_time_ms=result.processing_time
        )
        return result
    except stripe.CardError as e:
        log.warning("payment.processing.card_declined",
            payment_id=payment_id,
            decline_code=e.code,
            # Don't log the full card number!
        )
        raise
    except Exception as e:
        log.error("payment.processing.failed",
            payment_id=payment_id,
            error_type=type(e).__name__,
            error_message=str(e)
        )
        raise
Log Aggregation and Search
Logs from multiple services need to be centralized for effective use:
| Tool | Best For |
|---|---|
| Elasticsearch + Kibana (ELK) | Flexible full-text search, large scale |
| Loki + Grafana | Log-metric correlation, Prometheus-native |
| Datadog Logs | All-in-one observability platform |
| CloudWatch Logs | AWS-native simplicity |
| Splunk | Enterprise, compliance, complex queries |
Pillar 2: Metrics
Metrics are numeric measurements over time. They're efficient (low storage, fast aggregation) and excellent for alerting, dashboards, and trend analysis.
Metric Types
Counter — Monotonically increasing value. Total requests, total errors, total bytes.
from prometheus_client import Counter

requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'path', 'status_code']
)

# Increment on each request
requests_total.labels(
    method='POST',
    path='/api/payments',
    status_code='200'
).inc()
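A counter's raw value is rarely interesting on its own; what you usually chart is how fast it grows. The sketch below shows the idea under simplifying assumptions: `per_second_rate` is a hypothetical helper, the samples are invented, and a drop in value is treated as a counter reset (for example after a process restart). Prometheus's own `rate()` additionally extrapolates to the window edges, which this sketch does not attempt.

```python
def per_second_rate(samples):
    """Estimate the average per-second increase of a counter.

    `samples` is a list of (unix_timestamp, counter_value) pairs in time order.
    When the value drops, the counter is assumed to have restarted from zero,
    so the whole post-reset value counts as new increase.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        increase += (curr - prev) if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# Scraped every 15 seconds; the process restarts between t=30 and t=45.
samples = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
rate = per_second_rate(samples)  # (30 + 30 + 10 + 30) / 60 seconds
```

Because resets can be detected this way, a counter survives restarts in a way that a hand-maintained "requests in the last minute" gauge would not.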
Gauge — Current value that can go up or down. Active connections, queue depth, memory usage.
from prometheus_client import Gauge

active_connections = Gauge(
    'websocket_active_connections',
    'Number of active WebSocket connections'
)

# Set on connect/disconnect
active_connections.inc()  # on connect
active_connections.dec()  # on disconnect
Histogram — Distribution of values. Request durations, response sizes, payment amounts.
from prometheus_client import Histogram

request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['endpoint'],
    buckets=[0.1, 0.25, 0.5, 1.0, 2.5, 5.0]
)

# Observe each request
with request_duration.labels(endpoint='/api/checkout').time():
    process_checkout()
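A Prometheus histogram stores cumulative counts per bucket rather than individual observations, so percentiles are estimates whose accuracy depends on the bucket boundaries. The sketch below illustrates the usual approach (linear interpolation inside the bucket that contains the target rank). `estimate_quantile` and the bucket counts are made up for this example; it is an approximation of the idea, not Prometheus's actual code.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, ending with (math.inf, total_count).
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # No upper edge to interpolate towards in the open-ended bucket.
                return prev_bound
            in_bucket = count - prev_count
            fraction = (rank - prev_count) / in_bucket
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound


# Invented counts for the bucket layout used above.
buckets = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99),
           (2.5, 100), (5.0, 100), (math.inf, 100)]
p90 = estimate_quantile(0.90, buckets)
p99 = estimate_quantile(0.99, buckets)
```

The estimate can never be more precise than the bucket it lands in, which is why bucket boundaries should cluster around the latencies you care about, such as your SLO threshold.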
The Four Golden Signals
For most services, these four metrics capture the essential health picture:
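Latency — How long requests take. Track the latency of failed requests separately from successful ones: fast failures can make the overall numbers look healthier than they are, and a slow error is worse than a fast one.
Traffic — How many requests per second your service is handling.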
Errors — Rate of failed requests (5xx errors, application errors, validation failures).
Saturation — How "full" your service is. CPU utilization, connection pool usage, queue depth.
# Golden signals in Prometheus
# Latency (p99)
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
# Traffic (requests per second)
sum(rate(http_requests_total[5m]))
# Error rate (aggregate both sides so the label sets match)
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))
# Saturation (CPU, averaged across cores per instance)
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
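If the queries feel abstract, the same four signals can be worked out by hand from raw request records. This is only an illustrative sketch: `RequestRecord`, `golden_signals`, and the `in_flight`/`capacity` inputs (standing in for something like connection pool usage) are assumptions made for the example, and the p99 uses the simple nearest-rank method.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    duration_s: float
    status_code: int


def golden_signals(records, window_s, in_flight, capacity):
    """Summarize one time window of requests into the four golden signals."""
    ok_durations = sorted(r.duration_s for r in records if r.status_code < 500)
    failed = [r for r in records if r.status_code >= 500]
    if ok_durations:
        # Nearest-rank p99 over successful requests only.
        index = max(0, math.ceil(0.99 * len(ok_durations)) - 1)
        p99 = ok_durations[index]
    else:
        p99 = None
    return {
        "latency_p99_s": p99,
        "traffic_rps": len(records) / window_s,
        "error_ratio": len(failed) / len(records) if records else 0.0,
        "saturation": in_flight / capacity,
    }


# One minute of invented traffic: 98 fast requests, 1 slow one, 1 server error.
records = (
    [RequestRecord(0.1, 200)] * 98
    + [RequestRecord(2.0, 200)]
    + [RequestRecord(0.05, 500)]
)
signals = golden_signals(records, window_s=60, in_flight=40, capacity=50)
```

Note how the single fast 500 response would have pulled the latency numbers down if it had been mixed in with the successful requests.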
Pillar 3: Distributed Traces
Distributed traces show the complete path of a request through your system. They're the most direct way to answer "where in my microservices is this request spending time?"
How Tracing Works
Every request gets a unique trace_id. As it passes through services, each service creates a span that records what happened and how long it took. Spans have parent-child relationships that reconstruct the full call tree.
Trace: 7f2a1b3c4d5e6f7a
│
├── Span: api-gateway (25ms total)
│   ├── Span: auth-service (8ms)
│   └── Span: checkout-service (15ms)
│       ├── Span: inventory-service (4ms)
│       ├── Span: payment-service (9ms)
│       │   └── Span: stripe-api (7ms) ← HERE is the bottleneck
│       └── Span: notification-service (2ms, async)
Without tracing, you know checkout takes 25ms. With tracing, you know Stripe API calls take 7ms of that, and you can see it per-request.
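The reconstruction step can be sketched in a few lines. The flat span records below mirror the example trace; the field names (`span_id`, `parent_id`, `duration_ms`) and the helper functions are assumptions made for illustration, not the schema of any particular tracing backend.

```python
from collections import defaultdict

# Hypothetical flat span records for trace 7f2a1b3c4d5e6f7a.
spans = [
    {"span_id": "a", "parent_id": None, "name": "api-gateway", "duration_ms": 25},
    {"span_id": "b", "parent_id": "a", "name": "auth-service", "duration_ms": 8},
    {"span_id": "c", "parent_id": "a", "name": "checkout-service", "duration_ms": 15},
    {"span_id": "d", "parent_id": "c", "name": "inventory-service", "duration_ms": 4},
    {"span_id": "e", "parent_id": "c", "name": "payment-service", "duration_ms": 9},
    {"span_id": "f", "parent_id": "e", "name": "stripe-api", "duration_ms": 7},
    {"span_id": "g", "parent_id": "c", "name": "notification-service", "duration_ms": 2},
]


def build_children(spans):
    """Index spans by parent_id so the tree can be walked from the root."""
    children = defaultdict(list)
    for span in spans:
        children[span["parent_id"]].append(span)
    return children


def render(children, parent_id=None, depth=0):
    """Return the call tree as indented text lines."""
    lines = []
    for span in children.get(parent_id, []):
        lines.append("  " * depth + f'{span["name"]} ({span["duration_ms"]}ms)')
        lines.extend(render(children, span["span_id"], depth + 1))
    return lines


def slowest_path(children):
    """Follow the longest-running child at each level, starting at the root."""
    path, parent_id = [], None
    while children.get(parent_id):
        slowest = max(children[parent_id], key=lambda s: s["duration_ms"])
        path.append(slowest["name"])
        parent_id = slowest["span_id"]
    return path


children = build_children(spans)
tree_lines = render(children)
path = slowest_path(children)
```

Here `path` walks from the gateway through checkout and payment down to the Stripe call, which is the same chain the diagram highlights.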
OpenTelemetry: The Standard
OpenTelemetry (OTel) is the vendor-neutral standard for distributed tracing (and metrics and logs), with SDKs for most major languages:
# Python: OpenTelemetry setup
from flask import Flask, jsonify, request
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.flask import FlaskInstrumentor

app = Flask(__name__)

# Configure tracer
provider = TracerProvider()
exporter = OTLPSpanExporter(endpoint="http://otel-collector:4317")
processor = BatchSpanProcessor(exporter)
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

# Auto-instrument frameworks
RequestsInstrumentor().instrument()
FlaskInstrumentor().instrument_app(app)

# Manual instrumentation for business logic
# (check_inventory and charge_payment are assumed to be defined elsewhere)
tracer = trace.get_tracer(__name__)

@app.route('/checkout', methods=['POST'])
def checkout():
    with tracer.start_as_current_span("checkout.process") as span:
        order = request.json
        # Add useful attributes to the span
        span.set_attribute("order.id", order["id"])
        span.set_attribute("order.total_cents", order["total"])
        span.set_attribute("order.item_count", len(order["items"]))
        with tracer.start_as_current_span("checkout.validate_inventory"):
            inventory_result = check_inventory(order["items"])
        with tracer.start_as_current_span("checkout.charge_payment"):
            payment_result = charge_payment(order["payment"])
        span.set_attribute("payment.charge_id", payment_result.charge_id)
        return jsonify({"status": "success", "order_id": order["id"]})
Trace Sampling
You don't need to trace every request — sampling 1-10% of traffic is usually enough:
from opentelemetry.sdk.trace.sampling import (
    Decision, Sampler, SamplingResult, TraceIdRatioBased,
)

# Sample 10% of requests
sampler = TraceIdRatioBased(0.1)

# But always trace errors and slow requests
class SmartSampler(Sampler):
    """Sample more of the interesting requests"""

    def should_sample(self, parent_context, trace_id, name, kind=None,
                      attributes=None, links=None, trace_state=None):
        attributes = attributes or {}
        # Always trace errors. This only works if the status is already an
        # attribute when the span starts; otherwise the decision has to be
        # made after the fact (tail sampling, e.g. in the collector).
        if attributes.get("http.status_code", 0) >= 500:
            return SamplingResult(Decision.RECORD_AND_SAMPLE)
        # Always trace slow requests (flag set by middleware)
        if attributes.get("is_slow_request"):
            return SamplingResult(Decision.RECORD_AND_SAMPLE)
        # Sample 10% of everything else
        if trace_id % 100 < 10:
            return SamplingResult(Decision.RECORD_AND_SAMPLE)
        return SamplingResult(Decision.DROP)

    def get_description(self):
        return "SmartSampler"

# Use it with: TracerProvider(sampler=SmartSampler())
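Ratio-based sampling works across services because the decision is derived from the trace ID itself, so every service that sees the same ID reaches the same verdict without coordinating. The sketch below shows that general idea using only the standard library. It is a simplified illustration, not the OpenTelemetry SDK's code, and `should_sample` and `TRACE_ID_MASK` are names invented for the example.

```python
import random

TRACE_ID_MASK = (1 << 64) - 1  # look only at the low 64 bits of the 128-bit ID


def should_sample(trace_id: int, ratio: float) -> bool:
    """Deterministically keep roughly `ratio` of all trace IDs."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    bound = round(ratio * (1 << 64))
    return (trace_id & TRACE_ID_MASK) < bound


# Simulate 10,000 random 128-bit trace IDs (seeded so the run is repeatable).
rng = random.Random(42)
trace_ids = [rng.getrandbits(128) for _ in range(10_000)]
kept = [t for t in trace_ids if should_sample(t, 0.1)]
kept_fraction = len(kept) / len(trace_ids)
```

Because the function has no hidden state, calling it again with the same trace ID in a downstream service yields the same answer, which keeps traces from being half-recorded.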
Connecting the Three Pillars
The pillars are most powerful when connected:
From metrics to logs — Your error rate metric spikes. Click to see the logs matching the same time window, filtered to error level.
From logs to traces — A log entry shows a slow request. The trace_id in the log takes you directly to the full trace for that specific request.
From traces to metrics — A trace shows the payment service is slow. Switch to the payment service metrics dashboard to see if latency is broadly elevated or isolated to this trace.
Alert: Error rate spiked to 5%
↓
Logs search: level=error AND time=[spike window]
→ Found error: "payment_processor_timeout"
→ Found trace_id: 7f2a1b3c4d5e6f7a
↓
Trace: 7f2a1b3c4d5e6f7a
→ payment-service span: 12s (!)
→ stripe-api span: 11s (timeout)
↓
Metrics: stripe-api call latency
→ p99 jumped from 800ms to 11s at 14:30
↓
External monitoring: Stripe status page
→ Stripe reported degradation starting 14:28
Root cause: Stripe API degradation
Next: Enable payment queue mode
This diagnostic journey took 5 minutes because each signal connected to the others.
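The jump from logs to traces only works if every log line carries the trace ID. One way to arrange that with the standard library is sketched below: a `contextvars` variable holds the current request's trace ID and a logging filter stamps it onto each record. The names `current_trace_id`, `TraceIdFilter`, and `JsonFormatter` are hypothetical; with a tracing SDK in place you would read the ID from the active span instead of setting it by hand.

```python
import contextvars
import io
import json
import logging

# Holds the trace ID of the request currently being handled.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True


class JsonFormatter(logging.Formatter):
    """Render records as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname.lower(),
            "event": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
        })


stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.addFilter(TraceIdFilter())
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("payment-service")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# Middleware would set this at the start of each request.
token = current_trace_id.set("7f2a1b3c4d5e6f7a")
logger.error("payment_processor_timeout")
current_trace_id.reset(token)

entry = json.loads(stream.getvalue().splitlines()[0])
```

With the ID in every line, the log search in the walkthrough above returns a value you can paste straight into the trace viewer.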
Where External Monitoring Fits
External monitoring (like AzMonitor) is a fourth layer that complements the three pillars:
- Logs — What happened inside your system
- Metrics — How your system is performing internally
- Traces — How requests flow through your system
- External monitoring — What users experience from outside your system
External monitoring catches failures that internal observability misses: network path issues, CDN problems, regional DNS failures, issues that affect users before they generate any internal telemetry. It's the ground truth for availability from the user's perspective.
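At its simplest, an external check is an HTTP request made from outside your infrastructure, recording whether it succeeded and how long it took. The following is a minimal sketch of such a probe using the standard library. `check_endpoint` and its result fields are invented for illustration and say nothing about how any particular monitoring product works; the `opener` parameter exists only so the function can be exercised without network access.

```python
import time
import urllib.error
import urllib.request


def check_endpoint(url, timeout_s=5.0, opener=urllib.request.urlopen):
    """Probe `url` once and report availability as an outside client sees it."""
    started = time.monotonic()
    try:
        with opener(url, timeout=timeout_s) as response:
            status = response.status
        ok, error = 200 <= status < 400, None
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx/5xx status.
        status, ok, error = exc.code, False, str(exc)
    except OSError as exc:
        # DNS failure, refused connection, TLS error, or timeout.
        status, ok, error = None, False, str(exc)
    latency_ms = (time.monotonic() - started) * 1000
    return {"url": url, "ok": ok, "status": status,
            "latency_ms": latency_ms, "error": error}


# Example (performs a real request, so it is left commented out):
# result = check_endpoint("https://example.com/health")
```

A DNS or network failure shows up here as `ok: False` with no status code, which is exactly the kind of outage that never produces a log line or a span inside your own services.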
Building Your Observability Stack
Start with what provides the most immediate value:
- Week 1 — Add structured logging and centralize logs
- Week 2 — Add Prometheus metrics and basic dashboards
- Week 3 — Set up external monitoring for critical endpoints
- Month 2 — Add distributed tracing for top 5 critical paths
- Month 3 — Connect pillars: trace IDs in logs, log links from metrics
Conclusion
The three pillars of observability aren't competing alternatives — they're complementary tools that answer different questions about your system. Logs tell you what happened. Metrics tell you how often and how severe. Traces tell you where in the request flow. External monitoring tells you what users actually experience. Together, they transform "something is wrong and we don't know why" into "we know what's wrong, where it is, and how to fix it." AzMonitor sits at the external monitoring layer of this stack, providing the user-perspective availability data that grounds the rest of your observability in what actually matters: whether your service works for real people.