Game days, blast radius, hypothesis-driven experiments, continuous chaos, fault scenario catalog

Patterns

Tools inject faults; discipline turns those faults into learning. These patterns are how mature teams practice chaos engineering.

Hypothesis-Driven Experiments

Every experiment is a prediction that gets validated or falsified. Don't just "break things and see."

Template:

Hypothesis: If [fault] happens, then [system behavior] will occur because [reasoning].
Steady state: [Metrics we measure before, during, and after]
Blast radius: [What's affected; how to abort]
Success criteria: [What "passed" means]

Example:

Hypothesis: If we kill 1 of 3 web pods, P99 latency stays under 200 ms because the load balancer drains traffic from the dead pod within 5 s.
Steady state: Request rate, error rate, P99 latency (all from Prometheus)
Blast radius: 1 pod, 1 service, 5 minutes max. Auto-rollback if error rate > 1%.
Success criteria: Error rate stays below 0.1%; P99 latency < 200 ms.

If the experiment confirms the hypothesis, you've earned confidence in that behavior. If it falsifies the hypothesis, you've found a bug or a wrong mental model. Both are wins.
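One way to keep hypotheses honest is to write them down as data, so that "confirmed" or "falsified" is computed rather than argued. The sketch below is a minimal illustration, not part of any chaos tool: the `Hypothesis` and `Observation` classes and the `evaluate` function are hypothetical names, and the thresholds are taken from the pod-kill example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """One chaos experiment, following the template fields above."""
    fault: str
    expected_behavior: str
    reasoning: str
    max_error_rate: float      # success criterion as a fraction (0.001 = 0.1%)
    max_p99_latency_ms: float  # success criterion in milliseconds


@dataclass(frozen=True)
class Observation:
    """Worst values seen while the fault was active (hypothetical shape)."""
    error_rate: float
    p99_latency_ms: float


def evaluate(h: Hypothesis, obs: Observation) -> str:
    """Return 'confirmed' if every success criterion held, else 'falsified'."""
    ok = (obs.error_rate < h.max_error_rate
          and obs.p99_latency_ms < h.max_p99_latency_ms)
    return "confirmed" if ok else "falsified"


kill_one_pod = Hypothesis(
    fault="kill 1 of 3 web pods",
    expected_behavior="P99 latency stays under 200 ms",
    reasoning="load balancer drains the dead pod within 5 s",
    max_error_rate=0.001,
    max_p99_latency_ms=200.0,
)

# Illustrative observations, not real measurements.
print(evaluate(kill_one_pod, Observation(error_rate=0.0004, p99_latency_ms=180.0)))
print(evaluate(kill_one_pod, Observation(error_rate=0.0004, p99_latency_ms=320.0)))
```

Writing the success criteria as fields forces them to be stated before the fault is injected, which is the point of the template.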

Blast Radius

The first principle of safe chaos: start tiny and grow.

A blast radius diagram:

┌───────────────────────────────┐
│ Production                    │
│  ┌──────────────────────┐     │
│  │ Region us-east-1     │     │
│  │  ┌────────────────┐  │     │
│  │  │ Cluster A      │  │     │
│  │  │  ┌──────────┐  │  │     │
│  │  │  │ Service X│  │  │     │
│  │  │  │  ┌─────┐ │  │  │     │
│  │  │  │  │1 pod│ │  │  │     │ ← start here
│  │  │  │  └─────┘ │  │  │     │
│  │  │  └──────────┘  │  │     │
│  │  └────────────────┘  │     │
│  └──────────────────────┘     │
└───────────────────────────────┘

Progression:

Single pod, single service, dev cluster
Single pod, single service, staging
Multiple pods, single service, staging
Single pod in production (with abort)
Whole service in production (with abort)
Multiple services / cross-AZ in production

Move up the ladder only after the previous level is boring — no surprises three runs in a row.
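The "three boring runs" rule can be made mechanical. The following sketch assumes each run at a level is recorded as a boolean (True meaning no surprises); `LADDER`, `REQUIRED_CLEAN_RUNS`, and `next_level` are hypothetical names chosen for illustration.

```python
LADDER = [
    "single pod, single service, dev cluster",
    "single pod, single service, staging",
    "multiple pods, single service, staging",
    "single pod in production (with abort)",
    "whole service in production (with abort)",
    "multiple services / cross-AZ in production",
]

REQUIRED_CLEAN_RUNS = 3  # "no surprises three runs in a row"


def next_level(current: int, recent_runs: list[bool]) -> int:
    """Return the ladder index to run next.

    recent_runs holds one bool per run at the current level, oldest first;
    True means the run produced no surprises.
    """
    last = recent_runs[-REQUIRED_CLEAN_RUNS:]
    earned = len(last) == REQUIRED_CLEAN_RUNS and all(last)
    if earned and current < len(LADDER) - 1:
        return current + 1
    return current  # stay put until the current level is boring
```

Only the most recent runs count, so a single surprise resets the streak and the team stays at the current level.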

Abort Conditions

Every production experiment defines automatic abort conditions. Examples:

Error rate > 1% for 30 seconds
P99 latency > 500 ms for 1 minute
Any pages fired
Manual button pressed by on-call

In Chaos Mesh, use a StatusCheck for this; in Litmus, a probe; in Gremlin, halt conditions. The principle is the same: a human or automated guard with a hand on the kill switch, always.
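Independent of the tool, the abort logic itself is small. The sketch below is an assumption-laden illustration of "threshold breached for a sustained window": `AbortRule`, `RULES`, and `should_abort` are hypothetical names, and the samples are assumed to arrive as `(timestamp_seconds, metrics_dict)` tuples from whatever monitoring you already run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbortRule:
    metric: str
    threshold: float
    sustained_s: float  # how long the breach must last before aborting


RULES = [
    AbortRule("error_rate", 0.01, 30),     # error rate > 1% for 30 seconds
    AbortRule("p99_latency_ms", 500, 60),  # P99 latency > 500 ms for 1 minute
]


def should_abort(samples, rules=RULES, pages_fired=0, manual_stop=False):
    """Return True if any abort condition has been met.

    samples: list of (timestamp_s, {metric: value}) tuples, oldest first.
    A rule trips once its metric has stayed above the threshold across
    consecutive samples spanning at least sustained_s seconds.
    """
    if pages_fired > 0 or manual_stop:
        return True
    for rule in rules:
        breach_start = None
        for ts, metrics in samples:
            if metrics[rule.metric] > rule.threshold:
                if breach_start is None:
                    breach_start = ts
                if ts - breach_start >= rule.sustained_s:
                    return True
            else:
                breach_start = None  # recovered, so the streak resets
    return False
```

Requiring a sustained breach keeps a single noisy sample from ending the experiment, while a fired page or the manual button aborts immediately.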

Steady State

Define the system's normal in measurable terms before the experiment:

Requests per second
Error rate
Latency percentiles (P50, P99)
Queue depth, consumer lag
CPU/memory utilization
Business metrics (checkouts/min, signups/min)

Capture these from your real observability stack — Prometheus, Datadog, CloudWatch. If steady state isn't measurable, don't run the experiment; you have no way to detect impact.
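A steady-state check boils down to comparing the same metrics before and during the fault. The sketch below assumes you have already pulled samples from your observability stack into plain lists; `steady_state_report` and the 10% default tolerance are hypothetical choices for illustration, not a recommendation.

```python
from statistics import mean


def steady_state_report(baseline, during, tolerance=0.10):
    """Compare per-metric means and flag metrics that moved beyond tolerance.

    baseline, during: dict of metric name -> list of samples.
    tolerance: allowed relative change (0.10 = 10%), an assumed default.
    """
    # Refuse to judge impact when a metric has no data on either side.
    missing = [m for m, v in baseline.items() if not v or not during.get(m)]
    if missing:
        raise ValueError(f"steady state not measurable for: {missing}")

    report = {}
    for metric, samples in baseline.items():
        base = mean(samples)
        now = mean(during[metric])
        if base != 0:
            change = (now - base) / base
        elif now != 0:
            change = float("inf")  # any movement off a zero baseline counts
        else:
            change = 0.0
        report[metric] = {
            "baseline": base,
            "during": now,
            "deviated": abs(change) > tolerance,
        }
    return report
```

Raising on missing data mirrors the rule above: if a metric cannot be measured, the experiment has no way to detect impact and should not run.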

Game Days

A game day is a scheduled, multi-team chaos exercise. Format:

Before (1 week):

Pick a scenario ("region failover," "DB primary loss," "Redis cluster split-brain")
Write hypotheses
Define abort conditions
Invite affected teams and incident responders

During (2-4 hours):

Brief everyone on scope and abort
Inject the fault
Observe and respond as if it's a real incident
Don't fix proactively — let the system fail naturally so you see real behavior
Abort if real impact occurs

After:

Blameless retro: what worked, what didn't, what surprised us
File tickets for every gap (alerting, runbook, code, capacity)
Schedule a re-run in 3-6 months

Game days are the highest-bandwidth way to find resilience gaps that span people, processes, and code — the gaps that pure automation can't surface.

Continuous Chaos

Once ad-hoc experiments have matured, move to continuous chaos: faults run on a schedule, so the system has to stay resilient by default.

Netflix's Chaos Monkey kills random instances during business hours, every weekday. The discipline is enforced by chaos: if your service can't survive a random pod restart at 2 PM, you'll find out at 2 PM today instead of 2 AM next month.

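# Continuous PodChaos: kills one stateless pod at the top of every hour
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: hourly-pod-kill
spec:
  schedule: '@hourly'
  type: PodChaos
  concurrencyPolicy: Forbid  # never start a run while the previous one is active
  historyLimit: 5            # keep only the most recent run records
  podChaos:
    action: pod-kill         # one-shot action, so no duration is set
    mode: one
    selector:
      namespaces: [production]
      labelSelectors: { tier: stateless }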

Move to continuous chaos only when:

Steady-state alerting works reliably
Auto-rollback is in place
Affected teams know and have signed off
It runs during business hours (don't page on-call at 3 AM with self-inflicted noise)
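The checklist above can be enforced as a gate that a scheduler consults before each run. This is a minimal sketch with hypothetical names (`in_business_hours`, `may_run_continuous`); the 09:00 to 17:00 window is an assumption, and `now` is expected to be in the owning team's local time.

```python
from datetime import datetime, time


def in_business_hours(now: datetime, start=time(9, 0), end=time(17, 0)) -> bool:
    """True Monday through Friday, from start (inclusive) to end (exclusive)."""
    return now.weekday() < 5 and start <= now.time() < end


def may_run_continuous(now: datetime, alerting_ok: bool,
                       auto_rollback: bool, teams_signed_off: bool) -> bool:
    """All four preconditions must hold before a scheduled fault runs."""
    return all([alerting_ok, auto_rollback, teams_signed_off,
                in_business_hours(now)])
```

Keeping the gate separate from the schedule means a missing sign-off or broken alerting pauses chaos without anyone editing the schedule itself.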

Hypothesis Catalog: Common Faults

Reusable hypotheses to inspire your own:

Compute faults

Fault	Hypothesis
Kill pod	Service routes around dead pod in < 5s
Kill node	Pods reschedule; service unaffected
CPU spike (95%)	Service degrades but doesn't crash; HPA scales out
Memory pressure	OOMKill triggers restart; no data loss
Disk full	Writes fail with clear errors; reads continue

Network faults

Fault	Hypothesis
200ms added latency	Within SLO; no cascading timeouts
50% packet loss	Retries succeed; user-facing error rate < 1%
Partition from cache	Falls back to source of truth; latency rises but no errors
Partition from DB	Reads from replica; writes fail with retry-after
DNS resolution fails	Cached entries serve traffic for 5+ min

Dependency faults

Fault	Hypothesis
3rd-party API returns 500	Circuit breaker trips; fallback served
3rd-party API slow (5s)	Timeout fires at 2s; retry; eventual success
DB primary fails	Failover < 30s; some writes fail; reads continue
Message queue down	Producer queues locally up to N msgs; alerts
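Several of these hypotheses assume a circuit breaker sits in front of the dependency. For readers who have not built one, the sketch below shows the behavior the "circuit breaker trips; fallback served" hypothesis is testing. It is a simplified illustration with hypothetical names and defaults, not a production implementation or any particular library's API.

```python
import time


class CircuitBreaker:
    """Serve a fallback after repeated failures, then retry after a cool-down."""

    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock      # injectable so tests can control time
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            # Open: skip the dependency until the cool-down has elapsed.
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()
            # Half-open: allow one trial call; a single failure re-opens.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0  # success closes the circuit fully
        return result
```

During an experiment that makes the third-party API return 500s, the hypothesis holds if calls stop reaching the dependency once the breaker opens and users keep receiving the fallback.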

Cloud-level faults

Fault	Hypothesis
Single AZ outage	Traffic shifts to other AZs; no user impact
Region failover	RTO < 15 min; data loss < 1 min (RPO)
Spot instance reclaim	Workload migrates; batch job restarts cleanly

Anti-Patterns

Random destruction without hypotheses. "We killed stuff and nothing broke" tells you nothing about specific properties. Define hypotheses, or the exercise is theater rather than an experiment.

Skipping steady state. Without baseline metrics, you can't detect impact. You'll either think it's fine when it isn't, or panic at noise.

Starting in production. Yes, prod is the environment that ultimately matters, but you don't get to start there on day one. Earn it: staging → prod with tiny blast radius → prod with bigger blast radius. Expect that to take months, not days.

No abort. Experiments without abort conditions become outages.

Running it once. A green chaos experiment a year ago says nothing about today's system. Continuous or scheduled re-runs catch regressions.

Blaming teams when experiments find gaps. The experiment didn't cause the gap; it exposed it. Left alone, the gap would have surfaced as a real incident at 3 AM. Reward the find.

What's Next

Best Practices — running chaos in production safely; org adoption; security and compliance considerations
