Chaos Mesh, Litmus, Gremlin, Chaos Monkey - inject controlled failure to find weaknesses before they find you

Chaos Engineering

Chaos engineering is the discipline of intentionally injecting failure — killing pods, partitioning networks, slowing disks — to find out how your system actually behaves. Production has every form of failure waiting for you; the question is whether you find them on a Tuesday afternoon with the on-call ready, or at 3 a.m. with no warning.

The term traces back to Netflix's Chaos Monkey (built around 2010, open-sourced in 2012), which randomly terminated EC2 instances in production so that every service had to survive instance loss.

Why Chaos Engineering

Without chaos engineering	With chaos engineering
Failure assumptions live in your head	Tested by killing things and observing
"We'd survive an AZ failure" — never tested	AZ outage simulated, confirmed, fixed
New service ships, fails on first real outage	Failure modes found in staging
On-call learns the runbook during the incident	Runbooks exercised during game days
Monitoring gaps revealed at 3 a.m.	Found during a controlled experiment

The goal isn't to break things for fun — it's to find weaknesses while you're paying attention.

The Principles

From Netflix's Principles of Chaos:

Build a hypothesis about steady-state behavior. "P99 latency stays under 500ms; error rate stays under 0.1%."
Vary real-world events — kill a pod, drop network packets, slow a disk.
Run in production (eventually). Staging tells you less than production.
Automate experiments to run continuously. Once you've validated a recovery path, keep validating it.
Minimize blast radius. Start small; expand as confidence grows.

The last point is critical. Chaos engineering done badly is just causing incidents.
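
A steady-state hypothesis is easier to act on once it is written as a check that returns true or false. The following is a minimal sketch, not part of any chaos tool: the thresholds come from the example hypothesis above, while the names `SteadyState`, `p99`, and `hypothesis_holds` are hypothetical, and the inputs are assumed to be latency samples and request counts pulled from whatever monitoring you already run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    """Thresholds from the example hypothesis in the text."""
    max_p99_latency_ms: float = 500.0
    max_error_rate: float = 0.001  # 0.1%


def p99(latencies_ms):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    # Integer form of ceil(0.99 * n), which avoids float rounding surprises.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def hypothesis_holds(latencies_ms, error_count, request_count,
                     steady=SteadyState()):
    """Return True only if both thresholds are met for the observed window."""
    if request_count == 0 or not latencies_ms:
        # No traffic means the hypothesis cannot be confirmed.
        return False
    error_rate = error_count / request_count
    return (p99(latencies_ms) < steady.max_p99_latency_ms
            and error_rate < steady.max_error_rate)
```

Evaluating the same check before, during, and after an experiment gives a clear pass or fail result instead of a judgment call made while staring at dashboards.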

The Players

Tool	Where it runs	Notes
Chaos Mesh	Kubernetes	CNCF; rich fault types; Kubernetes-native
Litmus	Kubernetes	CNCF; cloud-native chaos; large library of experiments
Gremlin	SaaS / multi-platform	Enterprise; safe defaults; UI-driven
Chaos Monkey	AWS and other Spinnaker-managed platforms	The original; formerly part of Netflix's Simian Army, now a standalone project that depends on Spinnaker
AWS Fault Injection Service (FIS, formerly Fault Injection Simulator)	AWS-native	Cloud provider's native chaos service
GCP Chaos Engineering (alpha tooling)	GCP-native	Catching up
Toxiproxy	Network-level	TCP proxy; "this connection has 500 ms of added latency" or "this connection times out"
Pumba	Docker	Per-container chaos
kube-monkey	Kubernetes	Simpler than Chaos Mesh; pod-killing focused
PowerfulSeal	Kubernetes	Older; still works

For new projects:

On Kubernetes → Chaos Mesh or Litmus (both CNCF, both solid).
AWS-native → AWS FIS (no extra ops, integrated billing).
Multi-platform enterprise → Gremlin.
Network chaos only → Toxiproxy in your test environments.

What You Can Break

Category	Examples
Pod / process	Kill, restart, OOM, fork bomb
Network	Drop packets, add latency, partition, corrupt packets, DNS failure
Disk / I/O	Fill disk, slow disk, fail reads/writes
CPU / memory	Stress to saturation, exhaust memory
Time	Skew clocks (a surprising number of bugs hide here)
Kernel	Inject kernel-level failures (advanced)
Application	HTTP 500s from specific endpoints, latency injection
Cloud-level	Stop instances, fail AZ, throttle APIs

Each tool implements a different subset. Chaos Mesh and Litmus together cover essentially all of them on Kubernetes.
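
The application-level row (injected errors and latency) can be tried without any external tool. Below is a hedged sketch of a decorator that adds a delay and, with some probability, raises an error before calling the wrapped function. Everything here is hypothetical illustration: `inject_faults`, `InjectedFault`, and the `enabled` switch are invented names, and a real service would map the exception to an HTTP 500 in its own framework.

```python
import functools
import random
import time


class InjectedFault(Exception):
    """Raised in place of a real failure when a fault is injected."""


def inject_faults(error_rate=0.0, latency_s=0.0, enabled=lambda: True,
                  rng=random.random, sleep=time.sleep):
    """Wrap a function so calls are delayed and sometimes fail.

    error_rate: probability in [0, 1] that a call raises InjectedFault.
    latency_s:  extra delay added to every call while enabled.
    enabled:    callable checked on each call; acts as the off switch.
    rng, sleep: injectable so the behavior can be tested deterministically.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if enabled():
                if latency_s > 0:
                    sleep(latency_s)
                if rng() < error_rate:
                    raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_faults(error_rate=0.1, latency_s=0.2)
def get_profile(user_id):
    # Stand-in for a real handler; 10% of calls fail, all are 200 ms slower.
    return {"id": user_id}
```

Passing `enabled` as a callable rather than a constant means the fault can be switched off at runtime, for example from a feature flag, without redeploying.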

Learning Path

1. Getting Started

Install Chaos Mesh on a kind cluster; kill a pod; inject network latency; observe recovery
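
As a preview of the pod-kill and latency steps, here is a sketch that builds the two Chaos Mesh resources as Python dictionaries and prints one as JSON. The `chaos-mesh.org/v1alpha1` API group and the `PodChaos` and `NetworkChaos` kinds are Chaos Mesh's own; the resource names, the `demo` namespace, and the `app=web` label are assumptions made for illustration, so check the fields against the Chaos Mesh version you install.

```python
import json


def _selector(namespace, app_label):
    """Target only pods labelled app=<app_label> in one namespace."""
    return {"namespaces": [namespace], "labelSelectors": {"app": app_label}}


def pod_kill_manifest(name, namespace, app_label):
    """PodChaos that kills a single matching pod."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "pod-kill",
            "mode": "one",  # one randomly chosen pod keeps the blast radius small
            "selector": _selector(namespace, app_label),
        },
    }


def network_delay_manifest(name, namespace, app_label,
                           latency="100ms", duration="60s"):
    """NetworkChaos that adds latency to one matching pod for a fixed time."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "NetworkChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "delay",
            "mode": "one",
            "selector": _selector(namespace, app_label),
            "delay": {"latency": latency},
            "duration": duration,  # the fault is removed when this elapses
        },
    }


if __name__ == "__main__":
    # Hypothetical names; pipe the output to `kubectl apply -f -`.
    print(json.dumps(pod_kill_manifest("kill-one-web-pod", "demo", "web"), indent=2))
```

Because `kubectl apply` accepts JSON as well as YAML, the printed output can be applied directly to a kind cluster that already has Chaos Mesh installed.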

2. Patterns

Game days, blast radius, continuous experiments, fault scenarios, hypotheses

3. Best Practices

Safety, blast radius control, production experiments, organizational adoption, common pitfalls

When NOT to Practice Chaos

Honest cases:

You don't have observability. If you can't see what's happening, chaos just causes incidents — you won't know what broke.
You haven't fixed the obvious stuff. No HA in your DB? Single-node Redis? Fix those first; you don't need chaos to know they're broken.
Your service has no resilience features. Chaos shouldn't be the first time you think about retries, timeouts, circuit breakers.
You can't roll back. Every experiment needs a kill switch.
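
The kill-switch requirement can be sketched as a small experiment runner: it refuses to start on an unhealthy system, polls a steady-state check while the fault is active, and always rolls back. This is an illustrative outline rather than any tool's API; `run_experiment` and its callback parameters are hypothetical, and `inject`, `rollback`, and `steady_state_ok` are assumed to be functions you supply.

```python
import time


def run_experiment(inject, rollback, steady_state_ok, duration_s,
                   poll_interval_s=1.0, clock=time.monotonic, sleep=time.sleep):
    """Run one chaos experiment with an automatic abort.

    Returns "skipped" if the system was unhealthy before injection,
    "aborted" if steady state was violated while the fault was active,
    and "completed" if the fault ran for the full duration.
    """
    if not steady_state_ok():
        # Never start chaos on a system that is already degraded.
        return "skipped"

    inject()
    try:
        deadline = clock() + duration_s
        while clock() < deadline:
            if not steady_state_ok():
                return "aborted"  # kill switch: stop as soon as users are hurt
            sleep(poll_interval_s)
        return "completed"
    finally:
        # Runs on completion, abort, or an unexpected exception.
        rollback()
```

Putting the rollback in a `finally` block means the fault is removed even if the health check itself raises, which is the failure mode most likely to leave an experiment running unattended.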

Maturity ladder:

Step 1: have monitoring, alerting, SLOs.
Step 2: have basic resilience (retries, replicas, health checks).
Step 3: practice game days in staging.
Step 4: run controlled chaos in staging.
Step 5: run controlled chaos in production with safeguards.

Many teams stop at step 3 and still get much of the value.

The point of chaos engineering isn't to break production. It's to find out what would break production before it actually does. If your team isn't yet ready to find that out and act on it, fix the obvious things first. Chaos rewards mature systems; it punishes immature ones.
