Best Practices
Safety controls, production chaos, organizational adoption, compliance, and avoiding self-inflicted outages
Chaos engineering is one of the few practices where a careless mistake creates the very incident you're trying to prevent. These guardrails keep the discipline useful and the system upright.
Safety First
Always have an abort. Every experiment defines automatic conditions to stop early. Common ones:
- Error rate exceeds a threshold
- Latency P99 crosses a line
- A page fires
- Any human says "stop"
If abort logic is hard to implement, the experiment is too risky. Reduce scope until it's easy.
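The abort conditions above can be sketched as a single check that an experiment runner calls on every polling tick. This is a minimal illustration, not any tool's built-in API: `AbortThresholds`, `Snapshot`, `abort_reasons`, and the default limits are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbortThresholds:
    """Hypothetical limits; derive real ones from your steady-state baseline."""
    max_error_rate: float = 0.02       # fraction of failed requests
    max_p99_latency_ms: float = 800.0  # milliseconds


@dataclass(frozen=True)
class Snapshot:
    """One reading of the signals the experiment is watching."""
    error_rate: float
    p99_latency_ms: float
    page_fired: bool = False
    human_said_stop: bool = False


def abort_reasons(snapshot: Snapshot, limits: AbortThresholds) -> list[str]:
    """Return every abort condition that is currently true (empty list = keep going)."""
    reasons = []
    if snapshot.error_rate > limits.max_error_rate:
        reasons.append("error rate above threshold")
    if snapshot.p99_latency_ms > limits.max_p99_latency_ms:
        reasons.append("p99 latency above threshold")
    if snapshot.page_fired:
        reasons.append("a page fired")
    if snapshot.human_said_stop:
        reasons.append("a human said stop")
    return reasons


if __name__ == "__main__":
    reading = Snapshot(error_rate=0.05, p99_latency_ms=300.0)
    print(abort_reasons(reading, AbortThresholds()))
```

If a check like this cannot be written in a few lines for a given experiment, that is a hint the experiment's scope is still too large.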
Limit blast radius. Start with a single pod, a single instance, a single user. Grow only when previous runs are boring.
Run with humans watching. First runs of any experiment have an operator present. They watch dashboards, ready to abort. Only after weeks of clean runs does the experiment become automated/scheduled.
Have an escape hatch. Before running, know exactly which command stops the chaos:
# Chaos Mesh
kubectl delete <chaos-resource>
# Litmus
kubectl delete chaosengine <name>
# Gremlin
gremlin halt --all
Practice the escape hatch first, on a trivial experiment. Don't learn it during a real abort.
Earn Production
The progression to production chaos:
- Dev cluster, fake app. Verify the tool works. Verify you understand it.
- Staging, real app, tiny scope. Kill 1 pod of a non-critical service. Validate that steady-state alerts fire correctly.
- Staging, real app, real scope. Game days that simulate real incidents.
- Production, dark experiments. Inject faults that affect synthetic / shadow traffic only — real users unaffected.
- Production, real but limited. 1 pod of 1 service, off-peak, with auto-abort. On-call notified.
- Production, normal practice. Scheduled experiments, business hours, expanded scope.
Skip a level and you're a cautionary tale.
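One way to make "no skipped levels" enforceable is a small gate that refuses a stage until every earlier one is recorded as complete. The stage identifiers and the `may_run` helper below are hypothetical labels for the six levels listed above, not part of any chaos tool.

```python
# Hypothetical identifiers for the six levels, in the order they must be earned.
STAGES = [
    "dev-fake-app",
    "staging-tiny-scope",
    "staging-real-scope",
    "prod-dark",
    "prod-limited",
    "prod-normal",
]


def may_run(stage: str, completed: set[str]) -> bool:
    """Allow a stage only if every earlier stage has already been completed."""
    index = STAGES.index(stage)  # raises ValueError for an unknown stage
    return all(earlier in completed for earlier in STAGES[:index])


if __name__ == "__main__":
    done = {"dev-fake-app", "staging-tiny-scope"}
    print(may_run("staging-real-scope", done))  # next level up
    print(may_run("prod-limited", done))        # would skip two levels
```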
Communicate Before Every Production Experiment
Before a prod chaos run:
- On-call team notified (give them an opt-out)
- Customer success / support notified
- No active incidents
- No major product launches or releases in flight
- No major customer events (Black Friday, Super Bowl, your enterprise's annual meeting)
- Maintenance window or low-traffic period
- Status page entry pre-written (in case impact happens)
- Slack channel with everyone watching
After:
- Post-run summary in the channel
- What was learned, what to fix
- Tickets filed for each gap
Don't Optimize for Theater
Chaos engineering can become performance: pretty dashboards, monthly game days that prove nothing because the scenarios are stale.
Antidotes:
- Pick scenarios from recent real incidents. "Last month we had a degraded DB primary; let's run that scenario quarterly and verify the runbook still works."
- Rotate facilitators so the same person doesn't build the same hypotheses.
- After each game day, change something — a runbook, an alert, a config. If 3 game days in a row produce zero changes, your scenarios are too easy.
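The "three game days with zero changes" signal is easy to track if each game day records how many runbooks, alerts, or configs it changed. A minimal sketch, with the function name and input shape assumed for illustration:

```python
def scenarios_look_stale(changes_per_game_day: list[int], streak: int = 3) -> bool:
    """True when the most recent `streak` game days each produced zero changes."""
    if len(changes_per_game_day) < streak:
        return False  # not enough history to judge
    return all(count == 0 for count in changes_per_game_day[-streak:])


if __name__ == "__main__":
    # Oldest first: two productive game days, then three that changed nothing.
    print(scenarios_look_stale([2, 1, 0, 0, 0]))
```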
Production Experiments: The 1% Rule
When testing in production for the first time, scope to 1% of the blast radius:
- 1 of 100 pods
- 1 of N regions
- 1% of user traffic (via feature flag or routing)
- 1 of N tables (don't `DROP DATABASE production_users`)
If 1% is safe, you can ramp. But 1% is the starting point, not a stage to rush through: many experiments should stay at 1% forever because the learning is the same and the risk is much lower.
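The arithmetic behind the initial scope can be sketched as follows. The helper name `initial_scope` is an assumption for this example; it uses integer math and never returns less than one unit, since an experiment that targets zero pods teaches nothing.

```python
def initial_scope(total_units: int, percent: int = 1) -> int:
    """Units to target on a first production run: `percent` of the total, minimum 1."""
    if total_units < 1:
        raise ValueError("total_units must be at least 1")
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    return max(1, total_units * percent // 100)


if __name__ == "__main__":
    for total in (3, 100, 250, 1000):
        print(total, "->", initial_scope(total))
```

Note that for small fleets the minimum of one unit is far more than 1% (one of three regions is a third of capacity), so the percentage is only meaningful once the total is large.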
Avoid Self-Inflicted Pages
If a chaos experiment pages your on-call at 2 AM, you've broken trust. The on-call's reaction will be "stop running chaos" — and they're right.
Defaults:
- Run during business hours
- Run when you yourself are watching
- Suppress paging for the affected service for the experiment duration (and slightly after)
- Re-enable paging as soon as that suppression window ends
- If the experiment surfaces a real bug that fires later, that's fine — that's the value
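Two of these defaults reduce to simple time calculations: whether a proposed start falls in business hours, and how long paging should stay muted. The sketch below assumes a Monday to Friday, 09:00 to 17:00 working day and a 10-minute grace period; both are illustrative values, and wiring the window into your paging tool is left out.

```python
from datetime import datetime, timedelta


def in_business_hours(moment: datetime, open_hour: int = 9, close_hour: int = 17) -> bool:
    """True on Monday-Friday between open_hour (inclusive) and close_hour (exclusive)."""
    return moment.weekday() < 5 and open_hour <= moment.hour < close_hour


def suppression_window(
    start: datetime,
    duration: timedelta,
    grace: timedelta = timedelta(minutes=10),
) -> tuple[datetime, datetime]:
    """Return (mute_from, mute_until): the experiment plus a short grace period."""
    return start, start + duration + grace


if __name__ == "__main__":
    start = datetime(2024, 3, 5, 10, 0)  # a Tuesday morning
    print(in_business_hours(start))
    print(suppression_window(start, timedelta(minutes=30)))
```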
Capacity & Recovery Buffers
After a chaos experiment, the system needs time to fully recover before the next one:
- Pods restart and warm up
- Caches refill
- Connection pools rebuild
- Saturation metrics return to baseline
A common mistake is to chain experiments back-to-back without recovery, then attribute the cascading failures to "the system is fragile" when the real cause is "we didn't let it recover."
Rule of thumb: wait 5× the experiment duration before starting another in the same blast radius.
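As a worked example of the rule of thumb, a 10-minute experiment that ends at 12:00 leaves the same blast radius off-limits until 12:50. The helper name below is hypothetical; the multiplier is the 5× from the text.

```python
from datetime import datetime, timedelta


def earliest_next_start(
    ended_at: datetime, duration: timedelta, multiplier: int = 5
) -> datetime:
    """Earliest time another experiment may start in the same blast radius."""
    return ended_at + multiplier * duration


if __name__ == "__main__":
    ended = datetime(2024, 3, 5, 12, 0)
    print(earliest_next_start(ended, timedelta(minutes=10)))
```

A time-based wait is only a lower bound; it is still worth confirming that saturation metrics are back at baseline before the next run.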
Security & Access Control
Chaos tools are powerful: they can take down production. Lock them down:
- Chaos Mesh / Litmus dashboards live behind SSO + MFA
- RBAC: experiments in the `production` namespace require an extra approval / role
- All experiment definitions live in Git, reviewed via PR
- Audit log of who ran what, when
Chaos-as-code (ChaosEngine / PodChaos YAML in a repo) is much better than ad-hoc dashboard runs because it gets review and history for free.
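Keeping definitions in a repo also makes policy checks possible in CI. The sketch below builds a minimal Chaos Mesh `PodChaos` manifest as a Python dict and rejects production targets that lack an approver. The annotation key `chaos.example.com/approved-by` and both function names are assumptions for illustration; the manifest fields follow the `chaos-mesh.org/v1alpha1` schema as commonly documented, so verify them against the version you run.

```python
import json

# Hypothetical annotation a reviewer sets when approving a production run.
APPROVAL_ANNOTATION = "chaos.example.com/approved-by"


def pod_kill_manifest(name, namespace, app_label, approved_by=None):
    """Build a PodChaos manifest that kills one pod matching app=<app_label>."""
    annotations = {APPROVAL_ANNOTATION: approved_by} if approved_by else {}
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        "metadata": {"name": name, "namespace": namespace, "annotations": annotations},
        "spec": {
            "action": "pod-kill",
            "mode": "one",  # a single matching pod: the smallest blast radius
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": {"app": app_label},
            },
        },
    }


def review_errors(manifest) -> list[str]:
    """Policy check a CI job could run against every experiment in a PR."""
    errors = []
    targets = manifest["spec"]["selector"].get("namespaces", [])
    approved = manifest["metadata"].get("annotations", {}).get(APPROVAL_ANNOTATION)
    if "production" in targets and not approved:
        errors.append("production experiments need an approver annotation")
    return errors


if __name__ == "__main__":
    manifest = pod_kill_manifest("kill-one-cart-pod", "production", "cart")
    print(json.dumps(manifest, indent=2))
    print(review_errors(manifest))
```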
Compliance Considerations
Some industries have specific concerns:
- Healthcare (HIPAA): Don't inject faults that could corrupt patient records. Test failover paths in non-prod environments using synthetic data.
- Finance (SOC2, PCI): Document chaos experiments as part of resilience controls. They can become evidence of operational maturity.
- GDPR: Logging from chaos experiments shouldn't capture user PII; sanitize at source.
In regulated industries, chaos engineering is often positively viewed because it demonstrates deliberate resilience testing. Document it.
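"Sanitize at source" can mean attaching a filter to the logger the experiment harness uses, so PII never reaches a handler. This is a minimal sketch using the standard library's `logging.Filter`; it only redacts email addresses, the regular expression is deliberately simple, and real PII handling needs a broader pattern set agreed with your compliance team.

```python
import logging
import re

# Simplified email pattern for illustration; it will not catch every valid address.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(line: str) -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL.sub("[REDACTED_EMAIL]", line)


class RedactPII(logging.Filter):
    """Logging filter that redacts the fully formatted message before it is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = redact(record.getMessage())
        record.args = ()  # message is already formatted; drop the raw arguments
        return True       # keep the record, now sanitized


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("chaos")
    log.addFilter(RedactPII())
    log.info("latency fault hit session for %s", "alice@example.com")
```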
Organizational Adoption
Adoption pattern that works:
- One champion. A senior engineer who's read the books and run a tool. They make the case and run the first experiments.
- One pilot service. A non-critical service runs chaos in staging. The team experiences the loop.
- One game day. Incident-shaped exercise across teams. Builds shared muscle and shared vocabulary.
- Centralized tooling. Platform team provides Chaos Mesh / Litmus / Gremlin as a service; product teams write experiments.
- SLO coupling. Any service that has an SLO must also have a chaos plan. At this stage the discipline is the default rather than an exception.
Don't:
- Force teams to do chaos. They'll do it badly and resent it.
- Build a centralized "chaos team" that runs experiments on product teams. They'll be hated and bypassed.
- Treat chaos engineering as a project. It's an ongoing practice.
Tooling Choices Summary
| Tool | Best for |
|---|---|
| Chaos Mesh | K8s-native; rich fault catalog; OSS; CRD-driven |
| Litmus | K8s; large experiment library; CNCF incubating project |
| Gremlin | SaaS; enterprise support; non-K8s targets |
| Chaos Monkey | Continuous random instance kill; if you're on AWS / Spinnaker |
| AWS FIS | AWS-level faults (stop EC2, throttle APIs, fail AZ for RDS) |
| Toxiproxy | Network faults at app level; great for integration tests |
| Pumba | Docker chaos (kill, pause, netem) outside K8s |
| PowerfulSeal | K8s + cloud + custom scenarios; declarative |
Most teams: Chaos Mesh in K8s + AWS FIS for cloud-level faults + Toxiproxy in integration tests.
Checklist
Before running any production chaos experiment:
- Hypothesis written down
- Steady state metrics defined and monitored
- Abort conditions defined (auto + manual)
- Blast radius is the smallest meaningful scope
- On-call team notified and has not opted out
- No active incidents or major events
- Operator watching the dashboard
- Escape hatch command tested
- Status page entry pre-drafted
- Post-run findings will be filed as tickets
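The checklist can double as a pre-flight gate: the run is refused unless every item is explicitly confirmed. The item keys and the `missing_items` helper below are hypothetical names that mirror the list above.

```python
# Hypothetical keys, one per checklist item above.
CHECKLIST = (
    "hypothesis_written",
    "steady_state_metrics_monitored",
    "abort_conditions_defined",
    "blast_radius_minimal",
    "on_call_notified_not_opted_out",
    "no_active_incidents_or_events",
    "operator_watching",
    "escape_hatch_tested",
    "status_page_entry_drafted",
    "findings_will_be_ticketed",
)


def missing_items(status: dict[str, bool]) -> list[str]:
    """Return checklist items that are absent or not confirmed, in checklist order."""
    return [item for item in CHECKLIST if not status.get(item, False)]


if __name__ == "__main__":
    status = {item: True for item in CHECKLIST}
    status["escape_hatch_tested"] = False
    blockers = missing_items(status)
    print("GO" if not blockers else f"NO-GO: {blockers}")
```

Treating a missing key as unconfirmed is deliberate: an item nobody answered should block the run just like an item answered "no".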
What's Next
You have a chaos engineering practice. Connect it to the broader resilience picture:
- Monitoring — chaos experiments depend on observability; without it you can't measure anything
- Tracing — distributed tracing reveals which downstream a chaos experiment is actually affecting
- Service Mesh — mesh-level fault injection (Istio, Linkerd) complements chaos tools for L7 faults
- CI/CD — automate chaos experiments as a pipeline stage in staging