Chaos Engineering
Chaos Mesh, Litmus, Gremlin, Chaos Monkey - inject controlled failure to find weaknesses before they find you
Chaos engineering is the discipline of intentionally injecting failure — killing pods, partitioning networks, slowing disks — to find out how your system actually behaves. Production has every form of failure waiting for you; the question is whether you find them on a Tuesday afternoon with the on-call ready, or at 3 a.m. with no warning.
The name traces back to Netflix's Chaos Monkey (2011), which randomly terminated EC2 instances in production so that every service had to tolerate instance loss.
Why Chaos Engineering
| Without | With |
|---|---|
| Failure assumptions live in your head | Tested by killing things and observing |
| "We'd survive an AZ failure" — never tested | AZ outage simulated, confirmed, fixed |
| New service ships, fails on first real outage | Failure modes found in staging |
| On-call learns the runbook during the incident | Runbooks exercised during game days |
| Monitoring gaps revealed at 3 a.m. | Found during a controlled experiment |
The goal isn't to break things for fun — it's to find weaknesses while you're paying attention.
The Principles
From the Principles of Chaos Engineering, which originated at Netflix:
- Build a hypothesis about steady-state behavior. "P99 latency stays under 500ms; error rate stays under 0.1%."
- Vary real-world events — kill a pod, drop network packets, slow a disk.
- Run in production (eventually). Staging tells you less than production.
- Automate experiments to run continuously. Once you've validated a recovery path, keep validating it.
- Minimize blast radius. Start small; expand as confidence grows.
The last point is critical. Chaos engineering done badly is just causing incidents.
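A steady-state hypothesis is easiest to act on when it is written as a check that can pass or fail. The following is a minimal sketch, assuming you can pull a window of request latencies and error counts from your monitoring system. The thresholds come from the example hypothesis above; the `SteadyState`, `p99`, and `holds` names are illustrative and not part of any tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    """Thresholds from the example hypothesis (assumed values, tune per service)."""
    max_p99_ms: float = 500.0
    max_error_rate: float = 0.001  # 0.1%


def p99(latencies_ms):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    # ceil(0.99 * n) in integer arithmetic, to avoid float rounding surprises
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def holds(hypothesis, latencies_ms, errors, requests):
    """Return True when the observed window satisfies the hypothesis."""
    if requests == 0 or not latencies_ms:
        return False  # no traffic means no evidence, so do not claim success
    error_rate = errors / requests
    return (
        p99(latencies_ms) < hypothesis.max_p99_ms
        and error_rate < hypothesis.max_error_rate
    )
```

Running the same check before, during, and after fault injection shows whether the system stayed within bounds and whether it returned to them.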
The Players
| Tool | Where it runs | Notes |
|---|---|---|
| Chaos Mesh | Kubernetes | CNCF; rich fault types; Kubernetes-native |
| Litmus | Kubernetes | CNCF; cloud-native chaos; large library of experiments |
| Gremlin | SaaS / multi-platform | Enterprise; safe defaults; UI-driven |
| Chaos Monkey | AWS | The original; began as part of Netflix's Simian Army (since retired) and is now a standalone tool that depends on Spinnaker |
| AWS Fault Injection Service (FIS, formerly Fault Injection Simulator) | AWS-native | Cloud provider's native chaos service |
| GCP Chaos Engineering (alpha tooling) | GCP-native | Catching up |
| Toxiproxy | Network-level | TCP proxy; "this connection has 1 s of added latency" or "this connection times out" |
| Pumba | Docker | Per-container chaos |
| kube-monkey | Kubernetes | Simpler than Chaos Mesh; pod-killing focused |
| PowerfulSeal | Kubernetes | Older; still works |
For new projects:
- On Kubernetes → Chaos Mesh or Litmus (both CNCF, both solid).
- AWS-native → AWS FIS (no extra ops, integrated billing).
- Multi-platform enterprise → Gremlin.
- Network chaos only → Toxiproxy in your test environments.
What You Can Break
| Category | Examples |
|---|---|
| Pod / process | Kill, restart, OOM, fork bomb |
| Network | Drop packets, add latency, partition, corrupt packets, DNS failure |
| Disk / I/O | Fill disk, slow disk, fail reads/writes |
| CPU / memory | Stress to saturation, exhaust memory |
| Time | Skew clocks (a surprising number of bugs hide here) |
| Kernel | Inject kernel-level failures (advanced) |
| Application | HTTP 500s from specific endpoints, latency injection |
| Cloud-level | Stop instances, fail AZ, throttle APIs |
Each tool implements a different subset. Chaos Mesh and Litmus together cover essentially all of them on Kubernetes.
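To make the pod and network categories concrete, here is a hedged sketch that builds two Chaos Mesh experiment manifests as Python dictionaries and prints one as JSON. The field names follow the `chaos-mesh.org/v1alpha1` `PodChaos` and `NetworkChaos` resources as I understand them; check them against the version you install. The experiment names, the `staging` namespace, and the `app: web` label are hypothetical.

```python
import json

API_VERSION = "chaos-mesh.org/v1alpha1"


def _selector(namespace, labels):
    """Limit the experiment to pods in one namespace with the given labels."""
    return {"namespaces": [namespace], "labelSelectors": labels}


def pod_kill(name, namespace, labels):
    """PodChaos manifest that kills a single matching pod."""
    return {
        "apiVersion": API_VERSION,
        "kind": "PodChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "pod-kill",
            "mode": "one",  # one pod among those selected: smallest blast radius
            "selector": _selector(namespace, labels),
        },
    }


def network_delay(name, namespace, labels, latency="100ms", duration="60s"):
    """NetworkChaos manifest that adds latency to one matching pod for a bounded time."""
    return {
        "apiVersion": API_VERSION,
        "kind": "NetworkChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "delay",
            "mode": "one",
            "selector": _selector(namespace, labels),
            "delay": {"latency": latency},
            "duration": duration,  # the fault is lifted when this elapses
        },
    }


if __name__ == "__main__":
    manifest = pod_kill("kill-one-web-pod", "staging", {"app": "web"})
    print(json.dumps(manifest, indent=2))
```

Because `kubectl apply -f -` accepts JSON as well as YAML, the printed output could be piped straight into a cluster that has Chaos Mesh installed. Keeping `mode` at `one` and setting a `duration` are the two settings in this sketch that bound the blast radius.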
Learning Path
1. Getting Started: install Chaos Mesh on a kind cluster; kill a pod; inject network latency; observe recovery.
2. Patterns: game days, blast radius, continuous experiments, fault scenarios, hypotheses.
3. Best Practices: safety, blast radius control, production experiments, organizational adoption, common pitfalls.
When NOT to Practice Chaos
Honest cases:
- You don't have observability. If you can't see what's happening, chaos just causes incidents — you won't know what broke.
- You haven't fixed the obvious stuff. No HA in your DB? Single-node Redis? Find those first; you don't need chaos to know they're broken.
- Your service has no resilience features. Chaos shouldn't be the first time you think about retries, timeouts, circuit breakers.
- You can't roll back. Every experiment needs a kill switch.
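The kill-switch requirement can be sketched as a small experiment runner that refuses to start on an unhealthy system, aborts as soon as the steady state is violated, and rolls back on every exit path. This is an illustration under stated assumptions, not the API of any chaos tool: `inject`, `rollback`, and `steady_state_ok` are hypothetical callables you would supply yourself.

```python
import time


def run_experiment(inject, rollback, steady_state_ok,
                   checks=10, interval_s=1.0, sleep=time.sleep):
    """Inject a fault, watch the steady state, and always roll back.

    Returns "skipped" if the system was unhealthy before injection,
    "aborted" if the steady state broke during the experiment,
    and "passed" if it held for every check.
    """
    if not steady_state_ok():
        return "skipped"  # never add a fault to a system that is already failing

    inject()
    try:
        for _ in range(checks):
            sleep(interval_s)
            if not steady_state_ok():
                return "aborted"  # kill switch: stop early, rollback still runs
        return "passed"
    finally:
        rollback()  # runs on pass, abort, or an unexpected exception
```

Putting the rollback in `finally` means that an exception raised by a health check still removes the fault, which is the property an on-call engineer cares about most.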
Maturity ladder:
- Step 1: have monitoring, alerting, SLOs.
- Step 2: have basic resilience (retries, replicas, health checks).
- Step 3: practice game days in staging.
- Step 4: run controlled chaos in staging.
- Step 5: run controlled chaos in production with safeguards.
Many teams stop at step 3 and still get much of the value.
The point of chaos engineering isn't to break production. It's to find out what would break production before it actually does. If your team isn't yet ready to find that out and act on it, fix the obvious things first. Chaos rewards mature systems; it punishes immature ones.