Steven's Knowledge

Chaos Engineering

Chaos Mesh, Litmus, Gremlin, Chaos Monkey - inject controlled failure to find weaknesses before they find you

Chaos engineering is the discipline of intentionally injecting failure — killing pods, partitioning networks, slowing disks — to find out how your system actually behaves. Production has every form of failure waiting for you; the question is whether you find them on a Tuesday afternoon with the on-call ready, or at 3 a.m. with no warning.

The phrase comes from Netflix's Chaos Monkey (2011), which randomly killed EC2 instances in production to ensure every service could survive instance loss.

Why Chaos Engineering

| Without | With |
| --- | --- |
| Failure assumptions live in your head | Tested by killing things and observing |
| "We'd survive an AZ failure" (never tested) | AZ outage simulated, confirmed, fixed |
| New service ships, fails on first real outage | Failure modes found in staging |
| On-call learns the runbook during the incident | Runbooks exercised during game days |
| Monitoring gaps revealed at 3 a.m. | Found during a controlled experiment |

The goal isn't to break things for fun — it's to find weaknesses while you're paying attention.

The Principles

From Netflix's Principles of Chaos:

  1. Build a hypothesis about steady-state behavior. "P99 latency stays under 500ms; error rate stays under 0.1%."
  2. Vary real-world events — kill a pod, drop network packets, slow a disk.
  3. Run in production (eventually). Staging tells you less than production.
  4. Automate experiments to run continuously. Once you've validated a recovery path, keep validating it.
  5. Minimize blast radius. Start small; expand as confidence grows.

The last point is critical. Chaos engineering done badly is just causing incidents.
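
As an illustration of principle 1, a steady-state hypothesis can be written down as an executable check. The sketch below is a minimal example, not a prescribed implementation: it uses the thresholds quoted above (P99 under 500 ms, error rate under 0.1%), and the names `SteadyState`, `p99`, and `hypothesis_holds` are hypothetical. It assumes you can pull a window of request latencies and error counts from your own monitoring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    """Thresholds from the example hypothesis in principle 1."""
    max_p99_ms: float = 500.0
    max_error_rate: float = 0.001  # 0.1%


def p99(latencies_ms):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    # Integer form of ceil(0.99 * n), which avoids float rounding surprises.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def hypothesis_holds(latencies_ms, errors, total, steady=SteadyState()):
    """Return True if the observed window satisfies the hypothesis."""
    if total == 0 or not latencies_ms:
        return False  # no traffic: steady state cannot be claimed
    error_rate = errors / total
    return (p99(latencies_ms) < steady.max_p99_ms
            and error_rate < steady.max_error_rate)
```

Evaluating the same check before, during, and after an experiment turns "the system seemed fine" into a comparison against a stated threshold.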

The Players

| Tool | Where it runs | Notes |
| --- | --- | --- |
| Chaos Mesh | Kubernetes | CNCF; rich fault types; Kubernetes-native |
| Litmus | Kubernetes | CNCF; cloud-native chaos; large library of experiments |
| Gremlin | SaaS / multi-platform | Enterprise; safe defaults; UI-driven |
| Chaos Monkey | AWS | The original; began as part of Netflix's Simian Army, now a standalone project that requires Spinnaker |
| AWS Fault Injection Service (FIS, formerly Fault Injection Simulator) | AWS-native | Cloud provider's native chaos service |
| GCP Chaos Engineering (alpha tooling) | GCP-native | Catching up |
| Toxiproxy | Network-level | TCP proxy; "this connection has 1 s of added latency" |
| Pumba | Docker | Per-container chaos |
| kube-monkey | Kubernetes | Simpler than Chaos Mesh; pod-killing focused |
| PowerfulSeal | Kubernetes | Older; still works |

For new projects:

  • On Kubernetes: Chaos Mesh or Litmus (both CNCF, both solid).
  • AWS-native: AWS FIS (no extra ops, integrated billing).
  • Multi-platform enterprise: Gremlin.
  • Network chaos only: Toxiproxy in your test environments.
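
To give a feel for what a Kubernetes-native experiment looks like, here is a rough sketch that builds a Chaos Mesh `PodChaos` manifest in Python and prints it as JSON. The function name, experiment name, namespace, and `app` label are hypothetical placeholders. The sketch assumes Chaos Mesh is already installed in the cluster and that the target pods carry an `app` label.

```python
import json


def pod_kill_experiment(name, namespace, app_label):
    """Build a Chaos Mesh PodChaos manifest that kills one matching pod."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "pod-kill",
            "mode": "one",  # one randomly chosen pod: smallest blast radius
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": {"app": app_label},
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical target: the "checkout" app in a "staging" namespace.
    manifest = pod_kill_experiment("kill-one-checkout-pod", "staging", "checkout")
    print(json.dumps(manifest, indent=2))
```

Because `kubectl` accepts JSON as well as YAML, the output can be piped to `kubectl apply -f -`. Starting with `mode: one` in a staging namespace keeps the first experiment small.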

What You Can Break

| Category | Examples |
| --- | --- |
| Pod / process | Kill, restart, OOM, fork bomb |
| Network | Drop packets, add latency, partition, corrupt packets, DNS failure |
| Disk / I/O | Fill disk, slow disk, fail reads/writes |
| CPU / memory | Stress to saturation, exhaust memory |
| Time | Skew clocks (a surprising number of bugs hide here) |
| Kernel | Inject kernel-level failures (advanced) |
| Application | HTTP 500s from specific endpoints, latency injection |
| Cloud-level | Stop instances, fail AZ, throttle APIs |

Each tool implements a different subset. Chaos Mesh and Litmus together cover essentially all of them on Kubernetes.
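
Application-level faults (injected errors and latency) are the easiest category to try without any platform tooling. The following is a minimal sketch of a decorator that delays every call and fails a configurable fraction of them. `inject_faults` and `InjectedFault` are hypothetical names, not part of any tool listed above, and a real service would map the exception to an HTTP 500 in its own framework.

```python
import functools
import random
import time


class InjectedFault(Exception):
    """Raised in place of a real failure, e.g. mapped to an HTTP 500."""


def inject_faults(error_rate=0.0, latency_s=0.0, rng=None, sleep=time.sleep):
    """Decorator: add latency to every call and fail a fraction of calls."""
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                sleep(latency_s)  # latency injection
            if rng.random() < error_rate:
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Hypothetical handler: 10% injected errors, 200 ms of extra latency.
@inject_faults(error_rate=0.1, latency_s=0.2)
def get_cart(user_id):
    return {"user": user_id, "items": []}
```

Passing `rng` and `sleep` as parameters keeps the wrapper deterministic in tests: a seeded `random.Random` and a fake sleep function make the injected behavior reproducible.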

When NOT to Practice Chaos

Honest cases:

  • You don't have observability. If you can't see what's happening, chaos just causes incidents — you won't know what broke.
  • You haven't fixed the obvious stuff. No HA in your DB? Single-node Redis? Fix those first; you don't need chaos to know they're broken.
  • Your service has no resilience features. Chaos shouldn't be the first time you think about retries, timeouts, circuit breakers.
  • You can't roll back. Every experiment needs a kill switch.
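
One way to picture the kill switch from the last bullet is an experiment runner that refuses to start on an unhealthy system, aborts as soon as health degrades, and always rolls back. This is a simplified sketch under assumptions: `run_experiment` and its callback parameters are hypothetical, and `is_healthy` stands in for whatever steady-state check your observability provides.

```python
def run_experiment(inject, rollback, is_healthy, max_checks=10, wait=lambda: None):
    """Inject a fault, watch health, and always roll back.

    Returns "skipped" if the system was unhealthy before injection,
    "aborted" if health broke during the run, "completed" otherwise.
    """
    if not is_healthy():
        return "skipped"  # never start chaos on an already degraded system
    inject()
    try:
        for _ in range(max_checks):
            if not is_healthy():
                return "aborted"  # kill switch: stop at the first bad check
            wait()  # pause between checks, e.g. time.sleep in real use
        return "completed"
    finally:
        rollback()  # runs on completion, abort, and unexpected exceptions
```

Putting the rollback in `finally` means the fault is removed even if the health check itself raises, which is the property a kill switch needs.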

Maturity ladder:

  1. Have monitoring, alerting, and SLOs.
  2. Have basic resilience (retries, replicas, health checks).
  3. Practice game days in staging.
  4. Run controlled chaos in staging.
  5. Run controlled chaos in production with safeguards.

Most teams stop at step 3 and still get most of the value.

The point of chaos engineering isn't to break production. It's to learn what would break production before it actually does. If your team isn't yet ready to find that out and act on it, fix the obvious things first. Chaos rewards mature systems; it punishes immature ones.
