Patterns
Game days, blast radius, hypothesis-driven experiments, continuous chaos, fault scenario catalog
The tool injects faults. The discipline turns those faults into learning. These patterns are how mature teams practice chaos engineering.
Hypothesis-Driven Experiments
Every experiment is a prediction that gets validated or falsified. Don't just "break things and see."
Template:
Hypothesis: If [fault] happens, then [system behavior] will occur because [reasoning].
Steady state: [Metric we measure before, during, after]
Blast radius: [What's affected; how to abort]
Success criteria: [What "passed" means]
Example:
Hypothesis: If we kill 1 of 3 web pods, P99 latency stays under 200 ms because the load balancer drains traffic from the dead pod within 5 s.
Steady state: Request rate, error rate, P99 latency (all from Prometheus)
Blast radius: 1 pod, 1 service, 5 minutes max. Auto-rollback if error rate > 1%.
Success criteria: Error rate stays below 0.1%; P99 latency stays under 200 ms.
If the experiment confirms the hypothesis: you've earned confidence in that behavior. If it falsifies it: you've found a bug or a wrong mental model. Both are wins.
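The template can also be kept as a small record so each run is classified the same way. The following is a minimal Python sketch, not a prescribed format: the names `Experiment`, `Observation`, and `evaluate` are hypothetical, and the thresholds are taken from the example's hypothesis (200 ms) and its auto-rollback guard (1%).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Experiment:
    hypothesis: str
    max_error_rate: float    # success criterion, as a fraction (0.001 = 0.1%)
    max_p99_ms: float        # success criterion for P99 latency
    abort_error_rate: float  # blast-radius guard, as a fraction


@dataclass(frozen=True)
class Observation:
    error_rate: float
    p99_ms: float


def evaluate(exp: Experiment, during: Observation) -> str:
    """Classify a run as 'aborted', 'confirmed', or 'falsified'."""
    if during.error_rate > exp.abort_error_rate:
        return "aborted"
    if during.error_rate < exp.max_error_rate and during.p99_ms < exp.max_p99_ms:
        return "confirmed"
    return "falsified"


pod_kill = Experiment(
    hypothesis="Killing 1 of 3 web pods keeps P99 latency under 200 ms",
    max_error_rate=0.001,
    max_p99_ms=200.0,
    abort_error_rate=0.01,
)

print(evaluate(pod_kill, Observation(error_rate=0.0004, p99_ms=180.0)))  # prints "confirmed"
print(evaluate(pod_kill, Observation(error_rate=0.0004, p99_ms=320.0)))  # prints "falsified"
```

A "falsified" result is recorded as a finding rather than a failure of the exercise, which matches the point above that both outcomes are wins.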
Blast Radius
The first principle of safe chaos: start tiny and grow.
A blast radius diagram:
┌─────────────────────────────────┐
│ Production                      │
│ ┌─────────────────────────────┐ │
│ │ Region us-east-1            │ │
│ │ ┌─────────────────────────┐ │ │
│ │ │ Cluster A               │ │ │
│ │ │ ┌─────────────────────┐ │ │ │
│ │ │ │ Service X           │ │ │ │
│ │ │ │ ┌───────┐           │ │ │ │
│ │ │ │ │ 1 pod │           │ │ │ │  ← start here
│ │ │ │ └───────┘           │ │ │ │
│ │ │ └─────────────────────┘ │ │ │
│ │ └─────────────────────────┘ │ │
│ └─────────────────────────────┘ │
└─────────────────────────────────┘
Progression:
- Single pod, single service, dev cluster
- Single pod, single service, staging
- Multiple pods, single service, staging
- Single pod in production (with abort)
- Whole service in production (with abort)
- Multiple services / cross-AZ in production
Move up the ladder only after the previous level is boring — no surprises three runs in a row.
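The promotion rule can be made explicit so nobody skips a rung. This is a small sketch under assumptions: `LADDER`, `REQUIRED_CLEAN_RUNS`, and `next_level` are hypothetical names, and a "clean" run is whatever your team records as a run with no surprises.

```python
# The progression above, lowest blast radius first.
LADDER = [
    "single pod, single service, dev cluster",
    "single pod, single service, staging",
    "multiple pods, single service, staging",
    "single pod in production (with abort)",
    "whole service in production (with abort)",
    "multiple services / cross-AZ in production",
]

REQUIRED_CLEAN_RUNS = 3  # "no surprises three runs in a row"


def next_level(current: int, recent_runs: list[bool]) -> int:
    """Return the ladder index to run next.

    recent_runs holds one flag per run at the current level, oldest first;
    True means the run produced no surprises.
    """
    last = recent_runs[-REQUIRED_CLEAN_RUNS:]
    earned = len(last) == REQUIRED_CLEAN_RUNS and all(last)
    if earned and current < len(LADDER) - 1:
        return current + 1
    return current


print(LADDER[next_level(0, [True, True, True])])   # promoted to staging
print(LADDER[next_level(0, [True, False, True])])  # stays on the dev cluster
```

Because only the most recent three runs count, a single surprising run resets the streak and keeps the experiment at its current level.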
Abort Conditions
Every production experiment defines automatic abort conditions. Examples:
- Error rate > 1% for 30 seconds
- P99 latency > 500 ms for 1 minute
- Any pages fired
- Manual button pressed by on-call
In Chaos Mesh, use a StatusCheck for this; in Litmus, a probe; in Gremlin, halt conditions. The principle is the same: a human or automated guard with a hand on the kill switch, always.
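Independent of the tool, the guard logic amounts to "a threshold breached continuously for some duration, or any hard stop signal." The sketch below illustrates that logic in plain Python; it is not the API of Chaos Mesh, Litmus, or Gremlin, and `AbortRule`, `RULES`, and `should_abort` are hypothetical names. It assumes every sample carries a value for every metric named in a rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbortRule:
    metric: str
    threshold: float
    sustained_s: float  # how long the breach must last before aborting


# The first two example conditions from the list above.
RULES = [
    AbortRule("error_rate", 0.01, 30),  # error rate > 1% for 30 seconds
    AbortRule("p99_ms", 500.0, 60),     # P99 latency > 500 ms for 1 minute
]


def should_abort(samples, rules=RULES, pages_fired=False, manual_stop=False):
    """Decide whether to abort.

    samples is a list of (timestamp_s, {metric: value}) tuples, oldest first.
    """
    if pages_fired or manual_stop:
        return True
    if not samples:
        return False
    now = samples[-1][0]
    for rule in rules:
        breach_start = None  # start of the current unbroken breach, if any
        for ts, values in samples:
            if values[rule.metric] > rule.threshold:
                if breach_start is None:
                    breach_start = ts
            else:
                breach_start = None
        if breach_start is not None and now - breach_start >= rule.sustained_s:
            return True
    return False


sustained = [(t, {"error_rate": 0.02, "p99_ms": 120.0}) for t in (0, 10, 20, 30)]
print(should_abort(sustained))  # prints True: error rate breached for 30 s
```

Requiring a sustained breach avoids aborting on a single noisy sample, while the page and manual-stop flags bypass the timers entirely.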
Steady State
Define the system's normal in measurable terms before the experiment:
- Requests per second
- Error rate
- Latency percentiles (P50, P99)
- Queue depth, consumer lag
- CPU/memory utilization
- Business metrics (checkouts/min, signups/min)
Capture these from your real observability stack — Prometheus, Datadog, CloudWatch. If steady state isn't measurable, don't run the experiment; you have no way to detect impact.
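One simple way to turn "detect impact" into a check is to compare each metric's mean during the fault against its baseline mean. The sketch below assumes the samples have already been pulled from your observability stack; `steady_state_deviations` and the 10% default tolerance are illustrative assumptions, not recommended values.

```python
from statistics import mean


def steady_state_deviations(baseline, during, tolerance=0.10):
    """Return {metric: relative_change} for metrics that moved more than tolerance.

    baseline and during map metric names to lists of numeric samples.
    """
    flagged = {}
    for name, base_samples in baseline.items():
        if name not in during or not during[name]:
            # Mirrors the rule above: an unmeasurable metric blocks the experiment.
            raise ValueError(f"no samples for {name!r}; steady state is not measurable")
        base = mean(base_samples)
        now = mean(during[name])
        if base == 0:
            change = 0.0 if now == 0 else float("inf")
        else:
            change = (now - base) / abs(base)
        if abs(change) > tolerance:
            flagged[name] = change
    return flagged


baseline = {"rps": [1000, 1020, 980], "checkouts_per_min": [50, 50]}
during = {"rps": [990, 1010], "checkouts_per_min": [30, 34]}
print(steady_state_deviations(baseline, during))
```

In this sample data the request rate holds steady while checkouts drop by about a third, which is the kind of business-metric impact that infrastructure metrics alone would miss.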
Game Days
A game day is a scheduled, multi-team chaos exercise. Format:
Before (1 week):
- Pick a scenario ("region failover," "DB primary loss," "Redis cluster split-brain")
- Write hypotheses
- Define abort conditions
- Invite affected teams and incident responders
During (2-4 hours):
- Brief everyone on scope and abort
- Inject the fault
- Observe and respond as if it's a real incident
- Don't fix proactively — let the system fail naturally so you see real behavior
- Abort if real impact occurs
After:
- Blameless retro: what worked, what didn't, what surprised us
- File tickets for every gap (alerting, runbook, code, capacity)
- Schedule a re-run in 3-6 months
Game days are the highest-bandwidth way to find resilience gaps that span people, processes, and code — the gaps that pure automation can't surface.
Continuous Chaos
Once ad-hoc experiments are routine, move to continuous chaos: faults run on a schedule, so resilience is verified by default rather than assumed.
Netflix's Chaos Monkey terminates random instances during business hours, every weekday. The schedule itself enforces the discipline: if your service can't survive a random restart at 2 PM, you'll find out at 2 PM today instead of 2 AM next month.
# Continuous PodChaos - runs every hour for 5 minutes
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: hourly-pod-kill
spec:
  schedule: '@hourly'
  type: PodChaos
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces: [production]
      labelSelectors: { tier: stateless }
    duration: '5m'
Move to continuous chaos only when:
- Steady-state alerting works reliably
- Auto-rollback is in place
- Affected teams know and have signed off
- It runs during business hours (don't page on-call at 3 AM with self-inflicted noise)
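These preconditions can be checked in code before each scheduled injection. The following is a hedged sketch: `Readiness` and `may_inject` are hypothetical names, the 09:00 to 17:00 window and the weekday restriction are assumptions standing in for your own definition of business hours, and `now` is expected to be in the team's local time.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Readiness:
    alerting_reliable: bool  # steady-state alerting works reliably
    auto_rollback: bool      # auto-rollback is in place
    teams_signed_off: bool   # affected teams know and have signed off


def may_inject(ready: Readiness, now: datetime, start_hour: int = 9, end_hour: int = 17) -> bool:
    """Allow a scheduled fault only if every precondition holds right now."""
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    in_hours = start_hour <= now.hour < end_hour
    return (
        ready.alerting_reliable
        and ready.auto_rollback
        and ready.teams_signed_off
        and is_weekday
        and in_hours
    )


ready = Readiness(alerting_reliable=True, auto_rollback=True, teams_signed_off=True)
print(may_inject(ready, datetime(2024, 1, 3, 14, 0)))  # Wednesday 14:00, prints True
print(may_inject(ready, datetime(2024, 1, 3, 3, 0)))   # Wednesday 03:00, prints False
```

Running a check like this as the first step of a scheduled job gives a second guard in addition to the cron expression itself.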
Hypothesis Catalog: Common Faults
Reusable hypotheses to inspire your own:
Compute faults
| Fault | Hypothesis |
|---|---|
| Kill pod | Service routes around dead pod in < 5s |
| Kill node | Pods reschedule; service unaffected |
| CPU spike (95%) | Service degrades but doesn't crash; HPA scales out |
| Memory pressure | OOMKill triggers restart; no data loss |
| Disk full | Writes fail with clear errors; reads continue |
Network faults
| Fault | Hypothesis |
|---|---|
| 200ms added latency | Within SLO; no cascading timeouts |
| 50% packet loss | Retries succeed; user-facing error rate < 1% |
| Partition from cache | Falls back to source of truth; latency rises but no errors |
| Partition from DB | Reads from replica; writes fail with retry-after |
| DNS resolution fails | Cached entries serve traffic for 5+ min |
Dependency faults
| Fault | Hypothesis |
|---|---|
| 3rd-party API returns 500 | Circuit breaker trips; fallback served |
| 3rd-party API slow (5s) | Timeout fires at 2s; retry; eventual success |
| DB primary fails | Failover < 30s; some writes fail; reads continue |
| Message queue down | Producer queues locally up to N msgs; alerts |
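Several of the dependency hypotheses assume a circuit breaker sits in front of the call. For readers who have not built one, here is a deliberately minimal Python sketch of the idea; `CircuitBreaker` and its parameters are hypothetical, and a production service would more likely rely on an established resilience library than on this class.

```python
import time


class CircuitBreaker:
    """Open after N consecutive failures; allow a trial call after a cool-down."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock  # injectable so tests can control time
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()  # open: skip the dependency entirely
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0  # any success closes the breaker
        return result


def flaky_api():
    raise RuntimeError("HTTP 500 from third-party API")


breaker = CircuitBreaker(failure_threshold=3, reset_after_s=30.0)
for _ in range(5):
    print(breaker.call(flaky_api, lambda: "cached response"))
```

A chaos experiment for the "3rd-party API returns 500" row would then verify that, after the threshold is reached, calls stop reaching the dependency and the fallback is served.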
Cloud-level faults
| Fault | Hypothesis |
|---|---|
| Single AZ outage | Traffic shifts to other AZs; no user impact |
| Region failover | RTO < 15 min; data loss < 1 min (RPO) |
| Spot instance reclaim | Workload migrates; batch job restarts cleanly |
Anti-Patterns
Random destruction without hypotheses. "We killed stuff and nothing broke" tells you nothing about specific properties. Define hypotheses, or the exercise is theater rather than engineering.
Skipping steady state. Without baseline metrics, you can't detect impact. You'll either think it's fine when it isn't, or panic at noise.
Starting in production. Yes, prod is the only environment that matters. But you don't get to start there on day one. Earn it: staging → prod with tiny blast radius → prod with bigger blast radius. Expect this to take months, not days.
No abort. Experiments without abort conditions become outages.
Running it once. A green chaos experiment a year ago says nothing about today's system. Continuous or scheduled re-runs catch regressions.
Blaming teams when experiments find gaps. The experiment didn't cause the gap; it exposed it. The gap would've fired at 3 AM otherwise. Reward the find.
What's Next
- Best Practices — running chaos in production safely; org adoption; security and compliance considerations