
Chaos Engineering: Testing System Reliability by Breaking Things on Purpose

Learn how chaos engineering works, how to implement chaos experiments safely, and how to use controlled failures to find and fix reliability weaknesses before users do.

AzMonitor Team · November 19, 2025 · 9 min read · 1,496 words · Updated January 20, 2026
Tags: chaos engineering, reliability testing, fault injection, SRE

Chaos engineering is the practice of intentionally introducing failures into production systems to discover how they behave under adverse conditions. The core principle sounds counterintuitive: deliberately break things to make them more reliable. The logic is sound — if failures will happen anyway (and they will), it's better to discover how your system responds in a controlled experiment than as a surprise incident at 3 AM.

Netflix pioneered this discipline with Chaos Monkey in 2011. Today, chaos engineering is standard practice at companies running complex distributed systems. It's moved from edgy engineering experiment to established reliability discipline.

The Case for Chaos Engineering

The fundamental problem chaos engineering solves: confidence in an untested distributed system is usually false confidence.

You can read your code and believe it handles failures correctly. You can write unit tests for failure cases. You can review architecture diagrams and see redundancy everywhere. None of this tells you how the system actually behaves when a real failure occurs with real load, real data, and real timing.

Real systems fail in ways that:

  • Weren't anticipated during design
  • Tests don't cover (especially integration and timing issues)
  • Only manifest at specific load levels
  • Cause cascading failures across multiple services

Chaos engineering reveals these issues under controlled conditions, where you have monitoring, time to investigate, and the ability to stop the experiment.

Chaos Engineering Principles

Before running chaos experiments, establish these foundations:

1. Hypothesis-driven experiments — Not "break things randomly" but "we believe our system will continue to function when service X fails, because we have fallbacks Y and Z."

2. Start in staging — Run experiments in staging environments first to build familiarity and safety.

3. Minimize blast radius — Start with experiments that affect a small percentage of traffic or a single instance before broader failures.

4. Have a kill switch — Always be able to stop the experiment immediately.

5. Monitor everything — You need visibility into what's happening to learn from the experiment.

6. Run during business hours initially — So your full team is available if something goes wrong.
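
Principles 3 and 4 can be made concrete in a few lines. The following is a minimal sketch, not a definitive implementation: `pick_targets`, `KillSwitch`, and the pod names are hypothetical and do not come from any chaos tool.

```python
import math
import random
import threading


def pick_targets(instances, percent, rng=None):
    """Pick a random subset of roughly `percent` of instances.

    The count is rounded down, but never below one, so very small
    fleets still get a single target.
    """
    if not instances:
        return []
    rng = rng or random.Random()
    count = max(1, math.floor(len(instances) * percent / 100))
    return rng.sample(list(instances), count)


class KillSwitch:
    """Thread-safe flag that an operator or an automated check can trip."""

    def __init__(self):
        self._stop = threading.Event()
        self.reason = None

    def trip(self, reason):
        self.reason = reason
        self._stop.set()

    @property
    def tripped(self):
        return self._stop.is_set()


# Usage: limit the experiment to about 10% of a hypothetical 20-pod fleet
pods = [f"payment-{i}" for i in range(20)]
targets = pick_targets(pods, percent=10, rng=random.Random(7))
switch = KillSwitch()
for pod in targets:
    if switch.tripped:
        break  # stop injecting as soon as anyone trips the switch
    print("would inject failure into", pod)
```

Checking the switch before every injection step, rather than once at the start, is what lets you stop the experiment immediately.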

Defining a Chaos Experiment

Every chaos experiment follows a structure:

# Chaos Experiment: Payment Service Dependency Failure

## Hypothesis
We believe that when the fraud detection service becomes unavailable,
payment processing will continue to function because:
1. We have a circuit breaker that opens after 5 failures
2. Failed fraud checks are logged and queued for async review
3. Payments are processed with a "manual_review" flag when fraud service is unavailable

## Steady State
Normal operation: fraud service responds in < 100ms, payment success rate > 99.5%

## Method
Inject failure: Block all traffic from payment service to fraud service for 5 minutes

## Expected Outcome
- Payment processing continues (with manual_review flag)
- Circuit breaker opens within 30 seconds
- Alert fires within 2 minutes notifying team of fraud service degradation
- Error rate remains below 1%
- No data loss or transaction failures

## Success Criteria
- Payment success rate stays above 99%
- Appropriate alerts fire
- Fallback behavior engaged

## Abort Conditions
Stop immediately if:
- Payment success rate drops below 95%
- Customer-visible errors are observed
- Error rate exceeds 5%

## Monitoring During Experiment
Dashboard: https://monitoring.example.com/payments
Alert channel: #payment-team-alerts
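
Abort conditions are easier to enforce when they are written as data rather than prose. This sketch encodes the thresholds from the example experiment above. The metric key names (`success_rate`, `error_rate`, `customer_visible_errors`) are assumptions about what your monitoring exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbortConditions:
    min_success_rate: float = 0.95  # "Payment success rate drops below 95%"
    max_error_rate: float = 0.05    # "Error rate exceeds 5%"


def abort_reasons(metrics, conditions=AbortConditions()):
    """Return the list of abort conditions violated by a metrics snapshot."""
    reasons = []
    if metrics["success_rate"] < conditions.min_success_rate:
        reasons.append("payment success rate below threshold")
    if metrics["error_rate"] > conditions.max_error_rate:
        reasons.append("error rate above threshold")
    if metrics.get("customer_visible_errors", 0) > 0:
        reasons.append("customer-visible errors observed")
    return reasons


# Usage: an empty list means the experiment may continue
healthy = abort_reasons({"success_rate": 0.995, "error_rate": 0.002})
degraded = abort_reasons(
    {"success_rate": 0.94, "error_rate": 0.06, "customer_visible_errors": 3}
)
print(healthy, degraded)
```

Returning every violated condition, instead of a single boolean, gives the experiment report a record of why the run was stopped.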

Common Chaos Experiments

1. Kill a Service Instance

The most basic experiment: terminate a service instance and verify the system recovers:

# Using Chaos Monkey / Gremlin to kill a random instance
# Or direct Kubernetes pod deletion

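# Kill a random pod in the payment deployment (shuf picks one name at random)
POD=$(kubectl get pods -l app=payment-service -o name | shuf -n 1)
# --force is required with --grace-period=0 (immediate, ungraceful deletion)
kubectl delete "$POD" --grace-period=0 --force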

# Monitor recovery
kubectl get pods -l app=payment-service -w

# Verify: New pod should start within 30 seconds
# Verify: Health check should recover within 60 seconds
# Verify: No customer-visible errors (check monitoring)

2. Inject Network Latency

Add artificial latency to service-to-service communication:

# Using tc (traffic control) to add 500ms latency to all outgoing packets on eth0
# Run as root on the fraud-service node:
tc qdisc add dev eth0 root netem delay 500ms

# Run your experiment (5 minutes)
# Monitor: Does payment service timeout correctly?
# Monitor: Does circuit breaker open?
# Monitor: Do alerts fire?

# Cleanup
tc qdisc del dev eth0 root

The same experiment can be run with a chaos engineering tool (Gremlin, Chaos Mesh, or AWS Fault Injection Simulator). With Chaos Mesh:

# Chaos Mesh network delay experiment
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: fraud-service-delay
spec:
  action: delay
  mode: one
  selector:
    namespaces:
      - fraud
    labelSelectors:
      "app": "fraud-service"
  delay:
    latency: "500ms"
    jitter: "50ms"
  duration: "5m"
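
The latency experiment asks whether the caller's circuit breaker opens. For reference, here is a minimal sketch of the behavior under test, assuming a breaker that opens after 5 consecutive failures as in the example hypothesis. The `CircuitBreaker` class is illustrative and is not the API of any particular resilience library.

```python
import time


class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures.

    While open, calls skip the dependency and return the fallback.
    After `reset_timeout_s`, calls are allowed through again, and one
    more failure re-opens the breaker.
    """

    def __init__(self, failure_threshold=5, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    @property
    def is_open(self):
        if self._opened_at is None:
            return False
        return self._clock() - self._opened_at < self.reset_timeout_s

    def call(self, func, fallback):
        if self.is_open:
            return fallback()
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            return fallback()
        # A success closes the breaker and clears the failure count
        self._failures = 0
        self._opened_at = None
        return result


# Usage: the fraud check times out, so payments fall back to manual review
def check_fraud():
    raise TimeoutError("fraud service timed out")


def flag_for_manual_review():
    return "manual_review"


breaker = CircuitBreaker(failure_threshold=5, reset_timeout_s=30.0)
results = [breaker.call(check_fraud, flag_for_manual_review) for _ in range(5)]
print(results[-1], "breaker open:", breaker.is_open)
```

During the experiment, the thing to confirm is that the sixth and later calls no longer reach the slow dependency at all.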

3. Resource Exhaustion

Test what happens when a service runs out of memory or CPU:

# Chaos Mesh: stress CPU
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: payment-service-cpu-stress
spec:
  mode: one
  selector:
    labelSelectors:
      "app": "payment-service"
  stressors:
    cpu:
      workers: 4
      load: 80  # 80% CPU load
  duration: "3m"

Watch for:

  • Does autoscaling trigger correctly?
  • Do requests fail or just slow down?
  • Does the circuit breaker open before resources are exhausted?
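
The "fail or just slow down" question can be answered by comparing a metrics snapshot taken under stress with the baseline. A rough sketch follows. The dictionary keys and the default thresholds (1 percentage point of extra errors, 2x p95 latency) are assumptions to tune for your service.

```python
def classify_degradation(baseline, stressed, error_tolerance=0.01, latency_factor=2.0):
    """Label a stress run as 'failing', 'slower', or 'healthy' relative to baseline.

    Both arguments are dicts with 'error_rate' (0..1) and 'p95_ms' keys.
    """
    if stressed["error_rate"] > baseline["error_rate"] + error_tolerance:
        return "failing"
    if stressed["p95_ms"] > baseline["p95_ms"] * latency_factor:
        return "slower"
    return "healthy"


# Usage: errors are unchanged but p95 latency tripled
baseline = {"error_rate": 0.002, "p95_ms": 120}
under_stress = {"error_rate": 0.004, "p95_ms": 380}
verdict = classify_degradation(baseline, under_stress)
print(verdict)
```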

4. Database Failure

Test database failover and read replica behavior:

# Programmatic chaos experiment: block DB connections
import time

# measure_error_rate(), block_db_connections(), and unblock_db_connections()
# are environment-specific helpers you must supply (for example, wrappers
# around your metrics API and firewall rules). They are not defined here.

def run_db_failure_experiment(duration_seconds=60, abort_error_rate=0.05):
    """
    Test system behavior when primary database is unavailable.
    Requires: read replicas configured, application handles failure gracefully
    """
    print("Starting DB failure experiment")
    print(f"Duration: {duration_seconds} seconds")

    # Establish baseline
    baseline = measure_error_rate(window=60)
    print(f"Baseline error rate: {baseline['error_rate']:.2%}")

    max_error_rate = 0.0
    aborted = False

    try:
        # Introduce failure inside try so the finally block always restores
        print("Blocking database connections...")
        block_db_connections()
        start_time = time.time()

        # Monitor during experiment
        while time.time() - start_time < duration_seconds:
            metrics = measure_error_rate(window=10)
            max_error_rate = max(max_error_rate, metrics['error_rate'])
            print(f"Error rate: {metrics['error_rate']:.2%}, "
                  f"Latency p95: {metrics['p95_ms']}ms")

            # Abort if error rate too high
            if metrics['error_rate'] > abort_error_rate:
                print("ABORT: Error rate exceeded threshold")
                aborted = True
                break

            time.sleep(10)
    finally:
        # Always restore
        print("Restoring database connections...")
        unblock_db_connections()

    # Measure recovery
    time.sleep(10)
    recovery = measure_error_rate(window=30)
    print(f"Recovery error rate: {recovery['error_rate']:.2%}")

    return {
        "baseline_error_rate": baseline['error_rate'],
        # Highest value seen during the run, not just the last sample
        "max_error_rate_during_chaos": max_error_rate,
        "recovery_error_rate": recovery['error_rate'],
        "aborted": aborted,
        # <= (not <) so a 0% baseline followed by a 0% recovery counts as recovered
        "recovered_successfully": recovery['error_rate'] <= baseline['error_rate'] * 1.05
    }
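
The experiment function above relies on a `measure_error_rate` helper that the article does not define. One possible shape for its core calculation is sketched below, assuming you can collect `(succeeded, latency_ms)` pairs for the measurement window. The name `summarize_requests` and the sample format are hypothetical.

```python
def summarize_requests(samples):
    """Summarize (succeeded, latency_ms) samples into error rate and p95 latency."""
    if not samples:
        return {"error_rate": 0.0, "p95_ms": 0.0}
    n = len(samples)
    errors = sum(1 for succeeded, _ in samples if not succeeded)
    latencies = sorted(latency for _, latency in samples)
    # Nearest-rank percentile: ceil(0.95 * n), computed with integers
    # to avoid floating-point rounding surprises
    rank = (95 * n + 99) // 100
    return {
        "error_rate": errors / n,
        "p95_ms": latencies[rank - 1],
    }


# Usage: 100 requests with latencies 1..100 ms, two of which failed
samples = [(i % 50 != 0, float(i)) for i in range(1, 101)]
summary = summarize_requests(samples)
print(f"Error rate: {summary['error_rate']:.2%}, Latency p95: {summary['p95_ms']}ms")
```

The returned keys match the ones the experiment loop reads (`error_rate` and `p95_ms`), so a real `measure_error_rate(window=...)` could fetch samples for the window and delegate to a function like this.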

Chaos Engineering in Kubernetes

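Kubernetes environments have dedicated chaos tools such as Chaos Mesh, and cloud-level failures can be injected with a service such as AWS Fault Injection Simulator: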

# Chaos Mesh - comprehensive chaos for Kubernetes
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: payment-pod-failure
spec:
  action: pod-kill
  mode: random-max-percent
  value: "30"  # Kill a random share of matching pods, up to 30%
  selector:
    namespaces:
      - production
    labelSelectors:
      "app": "payment-service"
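  # Note: the `scheduler` field is Chaos Mesh 1.x syntax; Chaos Mesh 2.x
  # schedules experiments with a separate `Schedule` resource instead.
  scheduler:
    cron: "@every 1h"  # Run hourly (GameDay automation)
  duration: "1m"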
---
# AWS Fault Injection Simulator (FIS) for cloud resources
# Abbreviated: a deployable template also needs RoleArn and StopConditions
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  AZFailureExperiment:
    Type: AWS::FIS::ExperimentTemplate
    Properties:
      Description: "Simulate AZ failure by stopping instances"
      Targets:
        PaymentInstances:
          ResourceType: aws:ec2:instance
          ResourceTags:
            Service: payment-api
            AZ: us-east-1a
          SelectionMode: ALL
      Actions:
        StopInstances:
          ActionId: aws:ec2:stop-instances
          Parameters:
            startInstancesAfterDuration: PT5M
          Targets:
            Instances: PaymentInstances

Building a Chaos Engineering Practice

Start small and expand deliberately:

Phase 1 - Staging experiments only (Month 1-2)

  • Kill single instances and verify restart
  • Test database failover
  • Verify circuit breakers work

Phase 2 - Limited production experiments (Month 3-4)

  • 1% of production traffic, controlled experiments
  • Run only during business hours with team watching
  • Focus on your highest-confidence fallback mechanisms

Phase 3 - Regular GameDays (Month 5+)

  • Scheduled chaos experiments during business hours
  • Multiple simultaneous failures
  • Cross-functional team participation
  • Treat as learning exercises, not tests to pass

Phase 4 - Automated chaos (Month 9+)

  • Chaos Monkey style automated random instance termination
  • Regular scheduled experiments
  • Chaos as part of CI/CD pipeline

Connecting Chaos Engineering to Monitoring

Chaos experiments are only valuable with good monitoring. The experiment teaches you something if:

  • Your monitoring detected the failure quickly
  • Your dashboards showed the right information
  • Your alerts fired appropriately
  • You could observe the recovery

If monitoring was blind to the chaos you introduced, that's a finding: your monitoring has gaps.

# Verify monitoring coverage during chaos experiment
import time

# get_active_alerts(), wait_for_alert(), and wait_for_alert_resolve() are
# helpers around your alerting system that you must supply. This sketch assumes
# get_active_alerts() returns alert names and that the wait_* helpers return a
# time.time() timestamp, or None on timeout.

def run_experiment_with_monitoring_verification(experiment):
    """
    Run chaos experiment and verify monitoring correctly detected it.
    """
    findings = []
    expected_alert_name = experiment.expected_alert

    # Record starting state: if the alert is already firing, detection
    # cannot be attributed to this experiment
    if expected_alert_name in get_active_alerts():
        raise RuntimeError(f"Alert '{expected_alert_name}' is already active")

    # Run experiment
    start_time = time.time()
    experiment.start()
    try:
        # Wait for monitoring to detect
        detection_time = wait_for_alert(expected_alert_name, timeout=300)
    finally:
        # Always stop the failure injection, even if waiting raised
        experiment.stop()

    detection_latency = None
    if detection_time is None:
        # Monitoring missed the failure - this is a finding
        print(f"MONITORING GAP: Alert '{expected_alert_name}' did not fire!")
        findings.append({
            "type": "monitoring_gap",
            "description": f"No alert fired for {experiment.failure_type}",
            "recommendation": f"Add alert for {experiment.failure_type}"
        })
    else:
        detection_latency = detection_time - start_time
        print(f"Monitoring detected failure in {detection_latency:.1f}s")

    # Verify recovery monitoring
    recovery_time = wait_for_alert_resolve(expected_alert_name, timeout=300)

    return {
        "failure_detected": detection_time is not None,
        "detection_latency_seconds": detection_latency,
        "recovery_detected": recovery_time is not None,
        "monitoring_coverage": detection_time is not None,
        "findings": findings
    }

Conclusion

Chaos engineering converts theoretical resilience into demonstrated resilience. The difference between thinking your circuit breaker works and knowing it works is an experiment. Teams that practice chaos engineering regularly discover failure modes they never anticipated, fix them before users encounter them, and build genuine confidence in their systems' reliability.

External monitoring such as AzMonitor plays a key role here: during experiments, it's your ground truth for "are users experiencing failures?" while your internal monitoring and circuit breakers handle the internal response. When your chaos experiment ends and AzMonitor shows a clean recovery, you've verified something real.
