Best Practices

Safety controls, production chaos, organizational adoption, compliance, and avoiding self-inflicted outages

Chaos engineering is one of the few practices where a careless mistake creates the very incident you're trying to prevent. These guardrails keep the discipline useful and the system upright.

Safety First

Always have an abort. Every experiment defines conditions that stop it early, most of them automatic. Common ones:

Error rate exceeds a threshold
Latency P99 crosses a line
A page fires
Any human says "stop"

If abort logic is hard to implement, the experiment is too risky. Reduce scope until it's easy.
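The abort conditions listed above can be sketched as a single check that a watcher loop calls on every metrics sample. This is a minimal illustration, not any tool's API: the names `AbortThresholds`, `Snapshot`, and `abort_reasons` are hypothetical, and the default limits are placeholders to replace with values from your own SLOs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbortThresholds:
    # Hypothetical limits; tune them to the service's own SLOs.
    max_error_rate: float = 0.05        # fraction of failed requests
    max_p99_latency_ms: float = 800.0   # milliseconds


@dataclass(frozen=True)
class Snapshot:
    error_rate: float
    p99_latency_ms: float
    page_fired: bool
    human_said_stop: bool


def abort_reasons(snapshot: Snapshot, limits: AbortThresholds) -> list[str]:
    """Return every abort condition that currently holds (empty list means keep going)."""
    reasons = []
    if snapshot.error_rate > limits.max_error_rate:
        reasons.append("error rate above threshold")
    if snapshot.p99_latency_ms > limits.max_p99_latency_ms:
        reasons.append("p99 latency above threshold")
    if snapshot.page_fired:
        reasons.append("a page fired")
    if snapshot.human_said_stop:
        reasons.append("a human said stop")
    return reasons
```

Any non-empty result should trigger the escape hatch for the tool in use. Returning the reasons rather than a bare boolean makes the post-run summary easier to write.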

Limit blast radius. Start with a single pod, a single instance, a single user. Grow only when previous runs are boring.

Run with humans watching. The first runs of any experiment have an operator present, watching dashboards and ready to abort. Only after weeks of clean runs does the experiment become automated and scheduled.

Have an escape hatch. Before running, know exactly which command stops the chaos:

# Chaos Mesh (the kind is the experiment's CRD, e.g. podchaos or networkchaos)
kubectl delete <chaos-kind> <name> -n <namespace>

# Litmus
kubectl delete chaosengine <name> -n <namespace>

# Gremlin
gremlin halt --all

Practice the escape hatch first, on a trivial experiment. Don't learn it during a real abort.

Earn Production

The progression to production chaos:

Dev cluster, fake app. Verify the tool works. Verify you understand it.
Staging, real app, tiny scope. Kill 1 pod of a non-critical service. Validate that steady-state alerts fire correctly.
Staging, real app, real scope. Game days that simulate real incidents.
Production, dark experiments. Inject faults that affect synthetic / shadow traffic only — real users unaffected.
Production, real but limited. 1 pod of 1 service, off-peak, with auto-abort. On-call notified.
Production, normal practice. Scheduled experiments, business hours, expanded scope.

Skip a level and you're a cautionary tale.

Communicate Before Every Production Experiment

Before a prod chaos run:

On-call team notified (give them an opt-out)
Customer success / support notified
No active incidents
No major product launches or releases in flight
No major customer events (Black Friday, Super Bowl, your enterprise's annual meeting)
Maintenance window or low-traffic period
Status page entry pre-written (in case impact happens)
Slack channel with everyone watching
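One way to make the pre-run list enforceable is a small gate that refuses to start until every item is confirmed. The sketch below assumes one hypothetical key per item in the list above; how each item gets confirmed (a form, a bot, a manual tick) is left open.

```python
# Hypothetical keys, one per item in the pre-run list.
REQUIRED_CHECKS = (
    "on_call_notified",
    "support_notified",
    "no_active_incidents",
    "no_launches_in_flight",
    "no_major_customer_events",
    "low_traffic_window",
    "status_page_draft_ready",
    "watch_channel_open",
)


def missing_checks(confirmed: dict[str, bool]) -> list[str]:
    """Return the checklist items that are not confirmed, in checklist order."""
    return [name for name in REQUIRED_CHECKS if not confirmed.get(name, False)]


def ready_to_run(confirmed: dict[str, bool]) -> bool:
    """True only when every required item has been explicitly confirmed."""
    return not missing_checks(confirmed)
```

An item that is absent from the input counts as unconfirmed, so forgetting a check blocks the run instead of silently passing it.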

After:

Post-run summary in the channel
What was learned, what to fix
Tickets filed for each gap

Don't Optimize for Theater

Chaos engineering can become performance: pretty dashboards, monthly game days that prove nothing because the scenarios are stale.

Antidotes:

Pick scenarios from recent real incidents. "Last month we had a degraded DB primary; let's run that scenario quarterly and verify the runbook still works."
Rotate facilitators so the same person doesn't build the same hypotheses.
After each game day, change something — a runbook, an alert, a config. If 3 game days in a row produce zero changes, your scenarios are too easy.
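The "three game days with zero changes" signal is easy to track if each game day records how many runbooks, alerts, or configs it changed. A minimal sketch, assuming a hypothetical list of per-game-day change counts ordered oldest to newest:

```python
def scenarios_too_easy(changes_per_game_day: list[int], streak: int = 3) -> bool:
    """True when the most recent `streak` game days each produced zero changes."""
    if len(changes_per_game_day) < streak:
        return False  # not enough history to judge
    return all(count == 0 for count in changes_per_game_day[-streak:])
```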

Production Experiments: The 1% Rule

When testing in production for the first time, scope the experiment to about 1% of what it could affect:

1 of 100 pods
1 of N regions
1% of user traffic (via feature flag or routing)
1 of N tables (don't DROP DATABASE production_users)

If 1% is safe, you can ramp. But ramping is optional, not the default: many experiments should stay at 1% forever because the learning is the same and the risk is much lower.
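The arithmetic behind the 1% rule can be sketched as below. The function name `initial_scope` is hypothetical, and rounding down with a minimum of one target is an assumption about how to treat fleets that do not divide evenly.

```python
def initial_scope(total_targets: int, percent: int = 1) -> int:
    """Targets for a first production run: `percent` of the total, rounded down, never fewer than one."""
    if total_targets < 1:
        raise ValueError("need at least one target")
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    # Integer arithmetic avoids floating-point rounding surprises.
    return max(1, total_targets * percent // 100)
```

For 100 pods this gives 1, and for 1,000 pods it gives 10. With fewer than 100 targets, a single target is already more than 1% of the fleet, so treat that run with correspondingly more care.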

Avoid Self-Inflicted Pages

If a chaos experiment pages your on-call at 2 AM, you've broken trust. The on-call's reaction will be "stop running chaos" — and they're right.

Defaults:

Run during business hours
Run when you yourself are watching
Suppress paging for the affected service for the experiment duration (and slightly after)
Re-enable paging as soon as that window ends
If the experiment surfaces a real bug that fires later, that's fine — that's the value
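The suppression window can be computed up front so that paging comes back on a schedule rather than when someone remembers. In this sketch the 10-minute grace period is an assumed value for "slightly after", and `suppression_window` is a hypothetical helper, not part of any paging tool.

```python
from datetime import datetime, timedelta


def suppression_window(
    start: datetime,
    duration: timedelta,
    grace: timedelta = timedelta(minutes=10),  # assumed value for "slightly after"
) -> tuple[datetime, datetime]:
    """Return (mute_from, mute_until) covering the experiment plus a short grace period."""
    if duration <= timedelta(0) or grace < timedelta(0):
        raise ValueError("duration must be positive and grace non-negative")
    return start, start + duration + grace
```

Feeding both timestamps to the paging tool's scheduled-maintenance feature, where one exists, avoids relying on a manual un-mute.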

Capacity & Recovery Buffers

After a chaos experiment, the system needs time to fully recover before the next one:

Pods restart and warm up
Caches refill
Connection pools rebuild
Saturation metrics return to baseline

A common mistake: chaining experiments back-to-back without recovery, then attributing the cascading failures to "the system is fragile" when the real cause is "we didn't let it recover."

Rule of thumb: wait 5× the experiment duration before starting another in the same blast radius.
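As a worked example of the rule of thumb, the sketch below computes the earliest start time for the next experiment in the same blast radius. It assumes the wait is measured from the end of the previous experiment; the function name is hypothetical.

```python
from datetime import datetime, timedelta

RECOVERY_MULTIPLIER = 5  # the 5x rule of thumb from the text


def earliest_next_start(previous_end: datetime, previous_duration: timedelta) -> datetime:
    """Earliest time to start another experiment in the same blast radius."""
    if previous_duration <= timedelta(0):
        raise ValueError("duration must be positive")
    return previous_end + RECOVERY_MULTIPLIER * previous_duration
```

A 30-minute experiment that ends at 14:30 would not be followed by another in the same blast radius before 17:00. The time is a lower bound: saturation metrics should still be back at baseline before the next run starts.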

Security & Access Control

Chaos tools are powerful: they can take down production. Lock them down:

Chaos Mesh / Litmus dashboards live behind SSO + MFA
RBAC: experiments in the production namespace require an extra approval or role
All experiment definitions live in Git, reviewed via PR
Audit log of who ran what, when

Chaos-as-code (ChaosEngine / PodChaos YAML in a repo) is much better than ad-hoc dashboard runs because it gets review and history for free.

Compliance Considerations

Some industries have specific concerns:

Healthcare (HIPAA): Don't inject faults that could corrupt patient records. Test failover paths in non-prod environments using synthetic data.
Finance (SOC 2, PCI DSS): Document chaos experiments as part of resilience controls. They can become evidence of operational maturity.
GDPR: Logging from chaos experiments shouldn't capture user PII; sanitize at source.

In regulated industries, chaos engineering is often viewed positively because it demonstrates deliberate resilience testing. Document it.

Organizational Adoption

Adoption pattern that works:

One champion. A senior engineer who's read the books and run a tool. They make the case and run the first experiments.
One pilot service. A non-critical service runs chaos in staging. The team experiences the loop.
One game day. Incident-shaped exercise across teams. Builds shared muscle and shared vocabulary.
Centralized tooling. Platform team provides Chaos Mesh / Litmus / Gremlin as a service; product teams write experiments.
SLO-coupling. When a service has an SLO, it must have a chaos plan. At this stage the discipline becomes the default practice.

Don't:

Force teams to do chaos. They'll do it badly and resent it.
Build a centralized "chaos team" that runs experiments on product teams. They'll be hated and bypassed.
Treat chaos engineering as a project. It's an ongoing practice.

Tooling Choices Summary

Tool	Best for
Chaos Mesh	K8s-native; rich fault catalog; OSS; CRD-driven
Litmus	K8s; large experiment library; CNCF incubating
Gremlin	SaaS; enterprise support; non-K8s targets
Chaos Monkey	Continuous random instance termination; suited to AWS with Spinnaker
AWS FIS	AWS-level faults (stop EC2 instances, throttle API calls, force RDS failover)
Toxiproxy	Network faults at app level; great for integration tests
Pumba	Docker chaos (kill, pause, netem) outside K8s
PowerfulSeal	K8s + cloud + custom scenarios; declarative

Most teams: Chaos Mesh in K8s + AWS FIS for cloud-level faults + Toxiproxy in integration tests.

What's Next

You have a chaos engineering practice. Connect it to the broader resilience picture:

Monitoring — chaos experiments depend on observability; without it you can't measure anything
Tracing — distributed tracing reveals which downstream a chaos experiment is actually affecting
Service Mesh — mesh-level fault injection (Istio, Linkerd) complements chaos tools for L7 faults
CI/CD — automate chaos experiments as a pipeline stage in staging
