Steven's Knowledge

Best Practices

Safety controls, production chaos, organizational adoption, compliance, and avoiding self-inflicted outages

Chaos engineering is one of the few practices where a careless mistake creates the very incident you're trying to prevent. These guardrails keep the discipline useful and the system upright.

Safety First

Always have an abort. Every experiment defines automatic conditions to stop early. Common ones:

  • Error rate exceeds a threshold
  • Latency P99 crosses a line
  • A page fires
  • Any human says "stop"

If abort logic is hard to implement, the experiment is too risky. Reduce scope until it's easy.
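The abort conditions above can be expressed as one small, pure function that a monitoring loop calls on every tick. This is a minimal sketch, not a tool integration: the `Snapshot` shape, the function name, and the default thresholds are assumptions made for illustration. Feed it from whatever metrics source you already trust.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snapshot:
    """One reading of the signals the experiment watches (hypothetical shape)."""
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    latency_p99_ms: float   # 99th-percentile latency in milliseconds
    page_fired: bool        # True if any page fired during the run
    human_said_stop: bool   # True if anyone asked to stop


def abort_reasons(snap: Snapshot,
                  max_error_rate: float = 0.02,   # illustrative threshold
                  max_p99_ms: float = 800.0) -> List[str]:
    """Return every abort condition that currently holds; empty means keep going."""
    reasons = []
    if snap.error_rate > max_error_rate:
        reasons.append("error rate above threshold")
    if snap.latency_p99_ms > max_p99_ms:
        reasons.append("P99 latency above threshold")
    if snap.page_fired:
        reasons.append("a page fired")
    if snap.human_said_stop:
        reasons.append("a human said stop")
    return reasons
```

Any non-empty result should trigger the escape hatch. Returning the reasons rather than a bare boolean gives the post-run summary something concrete to quote.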

Limit blast radius. Start with a single pod, a single instance, a single user. Grow only when previous runs are boring.

Run with humans watching. First runs of any experiment have an operator present. They watch dashboards, ready to abort. Only after weeks of clean runs does the experiment become automated/scheduled.

Have an escape hatch. Before running, know exactly which command stops the chaos:

# Chaos Mesh
kubectl delete <chaos-kind> <name> -n <namespace>   # kind is e.g. podchaos or networkchaos

# Litmus
kubectl delete chaosengine <name>

# Gremlin
gremlin halt --all

Practice the escape hatch first, on a trivial experiment. Don't learn it during a real abort.

Earn Production

The progression to production chaos:

  1. Dev cluster, fake app. Verify the tool works. Verify you understand it.
  2. Staging, real app, tiny scope. Kill 1 pod of a non-critical service. Validate that steady-state alerts fire correctly.
  3. Staging, real app, real scope. Game days that simulate real incidents.
  4. Production, dark experiments. Inject faults that affect synthetic / shadow traffic only — real users unaffected.
  5. Production, real but limited. 1 pod of 1 service, off-peak, with auto-abort. On-call notified.
  6. Production, normal practice. Scheduled experiments, business hours, expanded scope.

Skip a level and you're a cautionary tale.
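The no-skipping rule is simple enough to enforce in tooling. A minimal sketch, assuming you record the highest level a service has completed; the `Stage` names and the `can_run_at` helper are hypothetical, not part of any chaos tool:

```python
from enum import IntEnum


class Stage(IntEnum):
    """The six levels of the progression, in order."""
    DEV_FAKE_APP = 1
    STAGING_TINY_SCOPE = 2
    STAGING_REAL_SCOPE = 3
    PROD_DARK = 4
    PROD_LIMITED = 5
    PROD_NORMAL = 6


def can_run_at(requested: Stage, highest_completed: int) -> bool:
    """Allow a stage only if every earlier stage has been completed.

    highest_completed is 0 when nothing has been run yet. Re-running an
    earlier stage is always allowed.
    """
    return requested <= highest_completed + 1
```

For example, a service that has only finished level 2 can run level 3, but a request for `Stage.PROD_DARK` is refused.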

Communicate Before Every Production Experiment

Before a prod chaos run:

  • On-call team notified (give them an opt-out)
  • Customer success / support notified
  • No active incidents
  • No major product launches or releases in flight
  • No major customer events (Black Friday, Super Bowl, your enterprise's annual meeting)
  • Maintenance window or low-traffic period
  • Status page entry pre-written (in case impact happens)
  • Slack channel with everyone watching

After:

  • Post-run summary in the channel
  • What was learned, what to fix
  • Tickets filed for each gap

Don't Optimize for Theater

Chaos engineering can become performance: pretty dashboards, monthly game days that prove nothing because the scenarios are stale.

Antidotes:

  • Pick scenarios from recent real incidents. "Last month we had a degraded DB primary; let's run that scenario quarterly and verify the runbook still works."
  • Rotate facilitators so the same person doesn't build the same hypotheses.
  • After each game day, change something — a runbook, an alert, a config. If 3 game days in a row produce zero changes, your scenarios are too easy.
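The "three game days with zero changes" signal can be checked mechanically if you log how many changes each game day produced. A minimal sketch; the function name and the input format (a list of change counts, oldest first) are assumptions:

```python
from typing import Sequence


def scenarios_too_easy(changes_per_game_day: Sequence[int], streak: int = 3) -> bool:
    """True if the most recent `streak` game days each produced zero changes."""
    if streak < 1:
        raise ValueError("streak must be at least 1")
    recent = changes_per_game_day[-streak:]
    # Fewer than `streak` game days on record is not enough evidence either way.
    return len(recent) == streak and all(n == 0 for n in recent)
```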

Production Experiments: The 1% Rule

When testing in production for the first time, scope to 1% of the blast radius:

  • 1 of 100 pods
  • 1 of N regions
  • 1% of user traffic (via feature flag or routing)
  • 1 of N tables (don't DROP DATABASE production_users)

If 1% is safe, you can ramp. But ramping is optional, not the goal: many experiments should stay at 1% forever because the learning is the same and the risk is much lower.
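The arithmetic behind the rule, as a minimal sketch. The helper name and the integer-percent parameter are assumptions; the point is that the result rounds down but never drops below one unit:

```python
def initial_scope(total_units: int, percent: int = 1) -> int:
    """Number of units (pods, regions, tables) to target on a first production run.

    Rounds down so the scope never exceeds the requested percentage, but
    always returns at least 1 so the experiment can run at all. With fewer
    than 100 units, that single unit is necessarily more than 1% of the total.
    """
    if total_units < 1:
        raise ValueError("total_units must be at least 1")
    if not 0 < percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    return max(1, total_units * percent // 100)
```

With 30 pods the answer is still 1 pod, which is about 3% of capacity rather than 1%. For small fleets, check that losing one unit is survivable before calling the scope safe.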

Avoid Self-Inflicted Pages

If a chaos experiment pages your on-call at 2 AM, you've broken trust. The on-call's reaction will be "stop running chaos" — and they're right.

Defaults:

  • Run during business hours
  • Run when you yourself are watching
  • Suppress paging for the affected service for the experiment duration (and slightly after)
  • Re-enable paging as soon as that suppression window ends; never leave it off
  • If the experiment surfaces a real bug that fires later, that's fine — that's the value
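Suppressing paging is only safe if paging is guaranteed to come back. One way to get that guarantee is a context manager. This is a minimal sketch: `silence` and `unsilence` are hypothetical callables that wrap whatever alerting system you use, and the injectable `sleep` exists only to make the grace period testable.

```python
import time
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def paging_suppressed(silence: Callable[[], None],
                      unsilence: Callable[[], None],
                      grace_seconds: float = 0.0,
                      sleep: Callable[[float], None] = time.sleep) -> Iterator[None]:
    """Silence paging for the duration of the block, plus a short grace period.

    unsilence runs even if the experiment raises, so paging is never left
    off by accident.
    """
    silence()
    try:
        yield
        # Wait out the grace period only on a clean finish; after an abort
        # we want paging back as soon as possible.
        sleep(grace_seconds)
    finally:
        unsilence()
```

Usage would look like `with paging_suppressed(mute_service, unmute_service, grace_seconds=120): run_experiment()`, where all three names are placeholders for your own functions.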

Capacity & Recovery Buffers

After a chaos experiment, the system needs time to fully recover before the next one:

  • Pods restart and warm up
  • Caches refill
  • Connection pools rebuild
  • Saturation metrics return to baseline

A common mistake is to chain experiments back-to-back without recovery time, then blame the resulting cascading failures on "the system is fragile" when the real cause is "we didn't let it recover."

Rule of thumb: wait 5× the experiment duration before starting another in the same blast radius.
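The rule of thumb as a worked calculation. A minimal sketch; the function name is an assumption, and the 5× multiplier is the heuristic above, not a measured recovery time:

```python
from datetime import datetime, timedelta


def earliest_next_start(previous_end: datetime,
                        previous_duration: timedelta,
                        multiplier: int = 5) -> datetime:
    """Earliest time another experiment may start in the same blast radius."""
    return previous_end + multiplier * previous_duration
```

A 10-minute experiment that ends at 10:00 leaves the same blast radius off-limits until 10:50. Treat the result as a lower bound: if saturation metrics have not returned to baseline by then, keep waiting.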

Security & Access Control

Chaos tools are powerful: they can take down production. Lock them down:

  • Chaos Mesh / Litmus dashboards live behind SSO + MFA
  • RBAC: experiments in production namespaces require an extra approval or a dedicated role
  • All experiment definitions live in Git, reviewed via PR
  • Audit log of who ran what, when

Chaos-as-code (ChaosEngine / PodChaos YAML in a repo) is much better than ad-hoc dashboard runs because it gets review and history for free.

Compliance Considerations

Some industries have specific concerns:

  • Healthcare (HIPAA): Don't inject faults that could corrupt patient records. Test failover paths in non-prod environments using synthetic data.
  • Finance (SOC 2, PCI DSS): Document chaos experiments as part of resilience controls. They can become evidence of operational maturity.
  • GDPR: Logging from chaos experiments shouldn't capture user PII; sanitize at source.

In regulated industries, chaos engineering is often viewed positively because it demonstrates deliberate resilience testing. Document it.

Organizational Adoption

Adoption pattern that works:

  1. One champion. A senior engineer who's read the books and run a tool. They make the case and run the first experiments.
  2. One pilot service. A non-critical service runs chaos in staging. The team experiences the loop.
  3. One game day. Incident-shaped exercise across teams. Builds shared muscle and shared vocabulary.
  4. Centralized tooling. Platform team provides Chaos Mesh / Litmus / Gremlin as a service; product teams write experiments.
  5. SLO coupling. Any service that has an SLO must also have a chaos plan. At this point the discipline has become the default.

Don't:

  • Force teams to do chaos. They'll do it badly and resent it.
  • Build a centralized "chaos team" that runs experiments on product teams. They'll be hated and bypassed.
  • Treat chaos engineering as a project. It's an ongoing practice.

Tooling Choices Summary

Tool           Best for
Chaos Mesh     K8s-native; rich fault catalog; OSS; CRD-driven
Litmus         K8s; large experiment library; CNCF incubating
Gremlin        SaaS; enterprise support; non-K8s targets
Chaos Monkey   Continuous random instance kills; suited to AWS / Spinnaker setups
AWS FIS        AWS-level faults (stop EC2, throttle APIs, fail AZ for RDS)
Toxiproxy      Network faults at app level; great for integration tests
Pumba          Docker chaos (kill, pause, netem) outside K8s
PowerfulSeal   K8s + cloud + custom scenarios; declarative

Most teams: Chaos Mesh in K8s + AWS FIS for cloud-level faults + Toxiproxy in integration tests.

Checklist

Before running any production chaos experiment:

  • Hypothesis written down
  • Steady state metrics defined and monitored
  • Abort conditions defined (auto + manual)
  • Blast radius is the smallest meaningful scope
  • On-call team notified and has not opted out
  • No active incidents or major events
  • Operator watching the dashboard
  • Escape hatch command tested
  • Status page entry pre-drafted
  • Post-run findings will be filed as tickets
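The checklist works best as a hard gate rather than a reminder. A minimal sketch, assuming each item is confirmed by a human and recorded as a string; the item wording and the `unmet_items` helper are illustrative:

```python
from typing import Iterable, List

# One entry per checklist item, in the order listed above.
CHECKLIST = (
    "hypothesis written down",
    "steady state metrics defined and monitored",
    "abort conditions defined (auto + manual)",
    "blast radius is the smallest meaningful scope",
    "on-call team notified and has not opted out",
    "no active incidents or major events",
    "operator watching the dashboard",
    "escape hatch command tested",
    "status page entry pre-drafted",
    "post-run findings will be filed as tickets",
)


def unmet_items(confirmed: Iterable[str]) -> List[str]:
    """Return checklist items not yet confirmed; an empty list means go."""
    done = set(confirmed)
    return [item for item in CHECKLIST if item not in done]
```

A launcher script can refuse to start the experiment while `unmet_items(...)` is non-empty and print the remaining items instead.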

What's Next

You have a chaos engineering practice. Connect it to the broader resilience picture:

  • Monitoring — chaos experiments depend on observability; without it you can't measure anything
  • Tracing — distributed tracing reveals which downstream services a chaos experiment is actually affecting
  • Service Mesh — mesh-level fault injection (Istio, Linkerd) complements chaos tools for L7 faults
  • CI/CD — automate chaos experiments as a pipeline stage in staging
