Chaos engineering is the practice of intentionally introducing failures into production systems to discover how they behave under adverse conditions. The core principle sounds counterintuitive: deliberately break things to make them more reliable. The logic is sound — if failures will happen anyway (and they will), it's better to discover how your system responds in a controlled experiment than as a surprise incident at 3 AM.
Netflix pioneered this discipline with Chaos Monkey in 2011. Today, chaos engineering is standard practice at companies running complex distributed systems. It has moved from an edgy engineering experiment to an established reliability discipline.
The Case for Chaos Engineering
The fundamental problem chaos engineering solves: confidence in a distributed system that has never been exercised under real failure is often false confidence.
You can read your code and believe it handles failures correctly. You can write unit tests for failure cases. You can review architecture diagrams and see redundancy everywhere. None of this tells you how the system actually behaves when a real failure occurs with real load, real data, and real timing.
Real systems fail in ways that:
- Weren't anticipated during design
- Tests don't cover (especially integration and timing issues)
- Only manifest at specific load levels
- Cause cascading failures across multiple services
Chaos engineering reveals these issues under controlled conditions, where you have monitoring, time to investigate, and the ability to stop the experiment.
Chaos Engineering Principles
Before running chaos experiments, establish these foundations:
1. Hypothesis-driven experiments — Not "break things randomly" but "we believe our system will continue to function when service X fails, because we have fallbacks Y and Z."
2. Start in staging — Run experiments in staging environments first to build familiarity and safety.
3. Minimize blast radius — Start with experiments that affect a small percentage of traffic or a single instance before broader failures.
4. Have a kill switch — Always be able to stop the experiment immediately.
5. Monitor everything — You need visibility into what's happening to learn from the experiment.
6. Run during business hours initially — So your full team is available if something goes wrong.
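Principles 3 and 4 can be enforced in tooling rather than left to convention. The sketch below is one possible shape, not a reference implementation: `select_targets` and `KillSwitch` are hypothetical names, and the "at least one instance" rule is an assumption made for illustration.

```python
import math
import random
import threading


def select_targets(instances, fraction, rng=None):
    """Pick a random subset of instances, limited to `fraction` of the fleet.

    Returns floor(fraction * total) instances, but at least one when any
    exist, so a small fleet still gets a single-instance experiment.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    instances = list(instances)
    if not instances:
        return []
    rng = rng or random.Random()
    count = max(1, math.floor(len(instances) * fraction))
    return rng.sample(instances, count)


class KillSwitch:
    """Thread-safe flag that an operator or an automated check can trip."""

    def __init__(self):
        self._event = threading.Event()
        self.reason = None

    def trip(self, reason):
        # Record why the experiment was stopped, then raise the flag.
        self.reason = reason
        self._event.set()

    @property
    def tripped(self):
        return self._event.is_set()
```

An experiment loop would check `switch.tripped` before every injection step and restore normal operation as soon as it returns True.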
Defining a Chaos Experiment
Every chaos experiment follows a structure:
# Chaos Experiment: Payment Service Dependency Failure
## Hypothesis
We believe that when the fraud detection service becomes unavailable,
payment processing will continue to function because:
1. We have a circuit breaker that opens after 5 failures
2. Failed fraud checks are logged and queued for async review
3. Payments are processed with a "manual_review" flag when fraud service is unavailable
## Steady State
Normal operation: fraud service responds in < 100ms, payment success rate > 99.5%
## Method
Inject failure: Block all traffic from payment service to fraud service for 5 minutes
## Expected Outcome
- Payment processing continues (with manual_review flag)
- Circuit breaker opens within 30 seconds
- Alert fires within 2 minutes notifying team of fraud service degradation
- Error rate remains below 1%
- No data loss or transaction failures
## Success Criteria
- Payment success rate stays above 99%
- Appropriate alerts fire
- Fallback behavior engaged
## Abort Conditions
Stop immediately if:
- Payment success rate drops below 95%
- Customer-visible errors are observed
- Error rate exceeds 5%
## Monitoring During Experiment
Dashboard: https://monitoring.example.com/payments
Alert channel: #payment-team-alerts
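The abort conditions and success criteria above are easier to apply consistently when they are encoded rather than checked by eye. The following is a minimal sketch under that assumption; the class and function names are hypothetical, and the thresholds are copied from the experiment definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentThresholds:
    """Numeric limits from the payment-service experiment definition."""
    min_success_rate: float = 0.99    # success criterion: stay above 99%
    abort_success_rate: float = 0.95  # abort if success rate drops below 95%
    abort_error_rate: float = 0.05    # abort if error rate exceeds 5%


def abort_reason(t, success_rate, error_rate, customer_visible_errors):
    """Return a description of the violated abort condition, or None."""
    if success_rate < t.abort_success_rate:
        return f"payment success rate {success_rate:.1%} is below {t.abort_success_rate:.0%}"
    if error_rate > t.abort_error_rate:
        return f"error rate {error_rate:.1%} exceeds {t.abort_error_rate:.0%}"
    if customer_visible_errors > 0:
        return "customer-visible errors observed"
    return None


def hypothesis_held(t, success_rate, alert_fired, fallback_engaged):
    """True only when every success criterion is met."""
    return success_rate > t.min_success_rate and alert_fired and fallback_engaged
```

Calling `abort_reason` on every monitoring tick gives the operator a single, explicit reason string to log when the experiment is stopped.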
Common Chaos Experiments
1. Kill a Service Instance
The most basic experiment: terminate a service instance and verify the system recovers:
# Chaos Monkey or Gremlin can automate this; the direct Kubernetes equivalent is pod deletion.
# Pick one pod at random from the payment deployment (shuf ships with GNU coreutils)
POD=$(kubectl get pods -l app=payment-service -o name | shuf -n 1)
# Force-delete it to simulate a crash rather than a graceful shutdown
kubectl delete "$POD" --grace-period=0 --force
# Monitor recovery
kubectl get pods -l app=payment-service -w
# Verify: New pod should start within 30 seconds
# Verify: Health check should recover within 60 seconds
# Verify: No customer-visible errors (check monitoring)
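The "recover within 60 seconds" check can be timed rather than watched. Here is a rough sketch of a polling helper; `wait_until_healthy` is a hypothetical name, and `is_healthy` stands in for whatever health probe your service exposes (for example, an HTTP request to its health endpoint).

```python
import time


def wait_until_healthy(is_healthy, timeout_s=60.0, interval_s=1.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll `is_healthy()` until it returns True.

    Returns the number of seconds until the first healthy response,
    or None if the service is still unhealthy after `timeout_s`.
    `clock` and `sleep` are injectable so the helper can be tested
    without real waiting.
    """
    start = clock()
    while True:
        if is_healthy():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)
```

Start the timer immediately after deleting the pod; a `None` result means the 60-second recovery expectation was not met and should be recorded as a finding.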
2. Inject Network Latency
Add artificial latency to service-to-service communication:
# Using tc (traffic control) to add 500ms latency to outgoing connections
# Run on the fraud-service node:
tc qdisc add dev eth0 root netem delay 500ms
# Run your experiment (5 minutes)
# Monitor: Does payment service timeout correctly?
# Monitor: Does circuit breaker open?
# Monitor: Do alerts fire?
# Cleanup
tc qdisc del dev eth0 root
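For reference, the behavior this latency experiment is meant to exercise looks roughly like the sketch below: a breaker that opens after five consecutive failures and routes payments to a "manual_review" fallback. This is an illustrative minimal version, not the implementation of any particular library; the class, the 30-second reset timeout, and `fraud_verdict` are assumptions.

```python
import time


class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures.

    While open, calls go straight to the fallback. After
    `reset_timeout_s`, one trial call is let through (half-open).
    """

    def __init__(self, failure_threshold=5, reset_timeout_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    @property
    def is_open(self):
        if self._opened_at is None:
            return False
        # Once the reset timeout has elapsed, stop short-circuiting.
        return self._clock() - self._opened_at < self.reset_timeout_s

    def call(self, func, fallback):
        if self.is_open:
            return fallback()
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            return fallback()
        # A success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def fraud_verdict(breaker, check_fraud, payment):
    """Return the fraud service's verdict, or 'manual_review' if it is unavailable."""
    return breaker.call(lambda: check_fraud(payment), lambda: "manual_review")
```

With 500ms of injected latency, `check_fraud` only fails if the client enforces a timeout shorter than that, which is exactly what the "does payment service timeout correctly?" check is probing.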
Using a chaos engineering tool (Gremlin, Chaos Mesh, or AWS Fault Injection Simulator):
# Chaos Mesh network delay experiment
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: fraud-service-delay
spec:
  action: delay
  mode: one
  selector:
    namespaces:
      - fraud
    labelSelectors:
      "app": "fraud-service"
  delay:
    latency: "500ms"
    jitter: "50ms"
  duration: "5m"
3. Resource Exhaustion
Test what happens when a service runs out of memory or CPU:
# Chaos Mesh: stress CPU
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: payment-service-cpu-stress
spec:
  mode: one
  selector:
    labelSelectors:
      "app": "payment-service"
  stressors:
    cpu:
      workers: 4
      load: 80 # 80% CPU load
  duration: "3m"
Watch for:
- Does autoscaling trigger correctly?
- Do requests fail or just slow down?
- Does the circuit breaker open before resources are exhausted?
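The "fail or just slow down" question can be answered from request samples collected before and during the stress. A possible sketch follows; `classify_degradation` is a hypothetical helper, and the 2x slowdown factor and 1% error budget are arbitrary assumptions you would tune to your own steady state.

```python
from statistics import median


def classify_degradation(baseline, stressed, slow_factor=2.0, max_error_rate=0.01):
    """Compare request samples taken before and during the stress test.

    Each argument is a non-empty list of (latency_ms, succeeded) tuples.
    Returns "failing", "slower", or "unaffected".
    """
    if not baseline or not stressed:
        raise ValueError("both sample lists must be non-empty")

    def error_rate(samples):
        return sum(1 for _, ok in samples if not ok) / len(samples)

    def median_latency(samples):
        return median(latency for latency, _ in samples)

    # Errors take priority: a fast failure is still a failure.
    if error_rate(stressed) > max_error_rate:
        return "failing"
    if median_latency(stressed) > slow_factor * median_latency(baseline):
        return "slower"
    return "unaffected"
```

A "slower" result under CPU stress is usually the benign outcome; "failing" suggests timeouts or queues are overflowing before autoscaling reacts.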
4. Database Failure
Test database failover and read replica behavior:
# Programmatic chaos experiment: block DB connections.
# measure_error_rate, block_db_connections, and unblock_db_connections are
# environment-specific helpers you supply (for example, a metrics query and
# a firewall rule toggle).
import time
def run_db_failure_experiment(duration_seconds=60):
    """
    Test system behavior when the primary database is unavailable.
    Requires: read replicas configured, application handles failure gracefully
    """
    print("Starting DB failure experiment")
    print(f"Duration: {duration_seconds} seconds")
    # Establish baseline
    baseline = measure_error_rate(window=60)
    print(f"Baseline error rate: {baseline['error_rate']:.2%}")
    # Introduce failure
    print("Blocking database connections...")
    block_db_connections()
    start_time = time.time()
    max_error_rate = 0.0
    try:
        # Monitor during experiment
        while time.time() - start_time < duration_seconds:
            metrics = measure_error_rate(window=10)
            # Track the worst reading, not just the last one
            max_error_rate = max(max_error_rate, metrics['error_rate'])
            print(f"Error rate: {metrics['error_rate']:.2%}, "
                  f"Latency p95: {metrics['p95_ms']}ms")
            # Abort if error rate too high
            if metrics['error_rate'] > 0.05:
                print("ABORT: Error rate exceeded threshold")
                break
            time.sleep(10)
    finally:
        # Always restore
        print("Restoring database connections...")
        unblock_db_connections()
    # Measure recovery
    time.sleep(10)
    recovery = measure_error_rate(window=30)
    print(f"Recovery error rate: {recovery['error_rate']:.2%}")
    return {
        "baseline_error_rate": baseline['error_rate'],
        "max_error_rate_during_chaos": max_error_rate,
        "recovery_error_rate": recovery['error_rate'],
        # 5% relative drift plus a small absolute margin, so a 0% baseline can still pass
        "recovered_successfully": recovery['error_rate'] <= baseline['error_rate'] * 1.05 + 0.001
    }
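The experiment relies on `measure_error_rate` returning an error rate and a p95 latency. How you obtain those depends on your metrics backend; the sketch below shows only the arithmetic, assuming you can fetch raw request records for the window. `summarize_requests` and the record fields `status` and `latency_ms` are hypothetical.

```python
def summarize_requests(requests):
    """Compute error rate and p95 latency from raw request records.

    `requests` is a list of dicts with an integer 'status' (HTTP status
    code) and a numeric 'latency_ms'. Only 5xx responses count as errors.
    """
    if not requests:
        return {"error_rate": 0.0, "p95_ms": 0.0}
    n = len(requests)
    errors = sum(1 for r in requests if r["status"] >= 500)
    latencies = sorted(r["latency_ms"] for r in requests)
    # Nearest-rank percentile: the smallest sample with at least 95% of
    # samples at or below it. Integer math avoids float rounding issues.
    rank = (95 * n + 99) // 100
    return {"error_rate": errors / n, "p95_ms": latencies[rank - 1]}
```

Whether 4xx responses should count as errors is a judgment call; during a database outage, a spike in 4xx may itself be a symptom worth tracking separately.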
Chaos Engineering in Kubernetes
Kubernetes environments have specific chaos tools:
# Chaos Mesh - comprehensive chaos for Kubernetes
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: payment-pod-failure
spec:
  action: pod-kill
  mode: random-max-percent
  value: "30" # Kill up to 30% of matching pods
  selector:
    namespaces:
      - production
    labelSelectors:
      "app": "payment-service"
  # The scheduler field is Chaos Mesh 1.x syntax; in 2.x, recurring runs are defined with a separate Schedule resource
  scheduler:
    cron: "@every 1h" # Run hourly (GameDay automation)
  duration: "1m"
---
# AWS Fault Injection Simulator (FIS) for cloud resources
# Fragment only: a deployable template also needs at least RoleArn and StopConditions
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  AZFailureExperiment:
    Type: AWS::FIS::ExperimentTemplate
    Properties:
      Description: "Simulate AZ failure by stopping instances"
      Targets:
        PaymentInstances:
          ResourceType: aws:ec2:instance
          # Selection is by tag, so this assumes instances carry an AZ tag
          ResourceTags:
            Service: payment-api
            AZ: us-east-1a
          SelectionMode: ALL
      Actions:
        StopInstances:
          ActionId: aws:ec2:stop-instances
          Parameters:
            startInstancesAfterDuration: PT5M
          Targets:
            Instances: PaymentInstances
Building a Chaos Engineering Practice
Start small and expand deliberately:
Phase 1 - Staging experiments only (Month 1-2)
- Kill single instances and verify restart
- Test database failover
- Verify circuit breakers work
Phase 2 - Limited production experiments (Month 3-4)
- 1% of production traffic, controlled experiments
- Run only during business hours with team watching
- Focus on your highest-confidence fallback mechanisms
Phase 3 - Regular GameDays (Month 5+)
- Scheduled chaos experiments during business hours
- Multiple simultaneous failures
- Cross-functional team participation
- Treat as learning exercises, not tests to pass
Phase 4 - Automated chaos (Month 9+)
- Chaos Monkey style automated random instance termination
- Regular scheduled experiments
- Chaos as part of CI/CD pipeline
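Phases 2 and 3 restrict experiments to business hours. Once runs are scheduled or automated, that rule can be enforced in code instead of by habit. A minimal sketch, assuming a Monday-to-Friday, 09:00 to 17:00 window in the team's local time (both the window and the function name are assumptions):

```python
from datetime import datetime, time as dtime


def within_business_hours(now, start=dtime(9, 0), end=dtime(17, 0)):
    """True on Monday-Friday between `start` (inclusive) and `end` (exclusive).

    `now` should already be in the on-call team's local timezone.
    """
    return now.weekday() < 5 and start <= now.time() < end
```

An automated runner would call `within_business_hours(datetime.now())` before injecting anything and skip the run otherwise.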
Connecting Chaos Engineering to Monitoring
Chaos experiments are only valuable with good monitoring. The experiment teaches you something if:
- Your monitoring detected the failure quickly
- Your dashboards showed the right information
- Your alerts fired appropriately
- You could observe the recovery
If monitoring was blind to the chaos you introduced, that's a finding: your monitoring has gaps.
# Verify monitoring coverage during chaos experiment.
# get_active_alerts, wait_for_alert, and wait_for_alert_resolve are helpers
# you supply for your alerting system.
import time
def run_experiment_with_monitoring_verification(experiment):
    """
    Run chaos experiment and verify monitoring correctly detected it.
    """
    findings = []
    detection_latency = None
    expected_alert_name = experiment.expected_alert
    # Record starting state
    start_alerts = get_active_alerts()
    # Run experiment
    start_time = time.time()
    experiment.start()
    try:
        # Wait for monitoring to detect
        detection_time = wait_for_alert(expected_alert_name, timeout=300)
        if detection_time is None:
            # Monitoring missed the failure - this is a finding
            print(f"MONITORING GAP: Alert '{expected_alert_name}' did not fire!")
            findings.append({
                "type": "monitoring_gap",
                "description": f"No alert fired for {experiment.failure_type}",
                "recommendation": f"Add alert for {experiment.failure_type}"
            })
        else:
            detection_latency = detection_time - start_time
            print(f"Monitoring detected failure in {detection_latency:.1f}s")
    finally:
        # Always stop the experiment, even if a check above raises
        experiment.stop()
    # Verify recovery monitoring
    recovery_time = wait_for_alert_resolve(expected_alert_name, timeout=300)
    return {
        "failure_detected": detection_time is not None,
        "detection_latency_seconds": detection_latency,
        "recovery_detected": recovery_time is not None,
        "monitoring_coverage": detection_time is not None,
        "findings": findings
    }
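The function above assumes a `wait_for_alert` helper that returns the detection timestamp or nothing. One way such a helper might look is sketched below. Unlike the call in the article, this version takes the alert source as an explicit `get_active_alerts` argument so it stays self-contained; that parameter, the polling interval, and the injectable `clock` and `sleep` are all assumptions.

```python
import time


def wait_for_alert(alert_name, get_active_alerts, timeout=300, poll_interval=5,
                   clock=time.time, sleep=time.sleep):
    """Poll until `alert_name` appears in the active alerts.

    `get_active_alerts` is a callable returning a collection of alert
    names. Returns the timestamp at which the alert was first seen, or
    None if it did not fire within `timeout` seconds.
    """
    deadline = clock() + timeout
    while True:
        if alert_name in get_active_alerts():
            return clock()
        if clock() >= deadline:
            return None
        sleep(poll_interval)
```

Because polling only observes the alert at interval boundaries, the measured detection latency can overstate the true value by up to `poll_interval` seconds.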
Conclusion
Chaos engineering converts theoretical resilience into demonstrated resilience. The difference between thinking your circuit breaker works and knowing it works is an experiment. Teams that practice chaos engineering regularly discover failure modes they never anticipated, fix them before users encounter them, and build genuine confidence in their systems' reliability.
AzMonitor's external monitoring is a critical component of chaos engineering: during experiments, it's your ground truth for "are users experiencing failures?" while your internal monitoring and circuit breakers handle the internal response. When your chaos experiment ends and AzMonitor shows a clean recovery, you've verified something real.