Steven's Knowledge

Patterns

Game days, blast radius, hypothesis-driven experiments, continuous chaos, fault scenario catalog

The tool injects faults. The discipline turns it into learning. These patterns are how mature teams practice chaos engineering.

Hypothesis-Driven Experiments

Every experiment is a prediction that gets validated or falsified. Don't just "break things and see."

Template:

Hypothesis: If [fault] happens, then [system behavior] will occur because [reasoning].
Steady state: [Metric we measure before, during, after]
Blast radius: [What's affected; how to abort]
Success criteria: [What "passed" means]

Example:

Hypothesis: If we kill 1 of 3 web pods, P99 latency stays under 200ms because the load balancer drains traffic from the dead pod within 5s.
Steady state: Request rate, error rate, P99 latency (all from Prometheus)
Blast radius: 1 pod, 1 service, 5 minutes max. Auto-rollback if error rate > 1%.
Success criteria: Error rate stays below 0.1%; P99 latency stays under 200ms.

If the experiment confirms the hypothesis, you've earned confidence in that behavior. If it falsifies the hypothesis, you've found a bug or a wrong mental model. Both are wins.
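One way to keep hypotheses honest is to write them down as data and let code decide pass or fail. The sketch below is a minimal illustration, not a feature of any chaos tool: the `Hypothesis` and `Observation` classes and their field names are hypothetical, and the thresholds reuse the 0.1% error rate and the 200ms P99 bound from the hypothesis statement above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Metrics sampled during the experiment window (hypothetical shape)."""
    error_rate: float       # fraction of failed requests, e.g. 0.0005 = 0.05%
    p99_latency_ms: float   # 99th-percentile latency in milliseconds


@dataclass(frozen=True)
class Hypothesis:
    """A prediction plus the success criteria that confirm or falsify it."""
    fault: str
    expectation: str
    max_error_rate: float
    max_p99_latency_ms: float

    def evaluate(self, observed: Observation) -> bool:
        """Return True if the observation confirms the hypothesis."""
        return (
            observed.error_rate < self.max_error_rate
            and observed.p99_latency_ms < self.max_p99_latency_ms
        )


kill_one_pod = Hypothesis(
    fault="kill 1 of 3 web pods",
    expectation="load balancer drains the dead pod within 5s",
    max_error_rate=0.001,       # 0.1%
    max_p99_latency_ms=200.0,
)

# Illustrative observations, not real measurements.
confirmed = kill_one_pod.evaluate(Observation(error_rate=0.0004, p99_latency_ms=180.0))
falsified = kill_one_pod.evaluate(Observation(error_rate=0.0004, p99_latency_ms=320.0))
```

Writing the criteria before the run removes the temptation to move the goalposts once the results are in.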

Blast Radius

The first principle of safe chaos: start tiny and grow.

A blast radius diagram:

┌───────────────────────────────┐
│ Production                    │
│  ┌──────────────────────┐     │
│  │ Region us-east-1     │     │
│  │  ┌────────────────┐  │     │
│  │  │ Cluster A      │  │     │
│  │  │  ┌──────────┐  │  │     │
│  │  │  │ Service X│  │  │     │
│  │  │  │  ┌─────┐ │  │  │     │
│  │  │  │  │1 pod│ │  │  │     │ ← start here
│  │  │  │  └─────┘ │  │  │     │
│  │  │  └──────────┘  │  │     │
│  │  └────────────────┘  │     │
│  └──────────────────────┘     │
└───────────────────────────────┘

Progression:

  1. Single pod, single service, dev cluster
  2. Single pod, single service, staging
  3. Multiple pods, single service, staging
  4. Single pod in production (with abort)
  5. Whole service in production (with abort)
  6. Multiple services / cross-AZ in production

Move up the ladder only after the previous level is boring — no surprises three runs in a row.
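The "three boring runs" rule is simple enough to encode. The following is a sketch under assumptions: the `LADDER` list mirrors the progression above, while `next_level` and the boolean run history (True meaning "no surprises") are hypothetical names chosen for illustration.

```python
LADDER = [
    "single pod, single service, dev cluster",
    "single pod, single service, staging",
    "multiple pods, single service, staging",
    "single pod in production (with abort)",
    "whole service in production (with abort)",
    "multiple services / cross-AZ in production",
]

REQUIRED_CLEAN_RUNS = 3


def next_level(current: int, recent_runs: list) -> int:
    """Return the ladder index to use for the next experiment.

    recent_runs holds one bool per run at the current level, oldest first;
    True means the run produced no surprises.
    """
    last = recent_runs[-REQUIRED_CLEAN_RUNS:]
    earned = len(last) == REQUIRED_CLEAN_RUNS and all(last)
    if earned and current < len(LADDER) - 1:
        return current + 1
    return current


# Three clean runs in a row at level 0 earn a move to level 1.
promoted = next_level(0, [True, True, True])
# A surprise inside the last three runs keeps the experiment where it is.
held = next_level(0, [True, True, False, True])
```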

Abort Conditions

Every production experiment defines automatic abort conditions. Examples:

  • Error rate > 1% for 30 seconds
  • P99 latency > 500 ms for 1 minute
  • Any pages fired
  • Manual button pressed by on-call

In Chaos Mesh, use a StatusCheck for this; in Litmus, a probe; in Gremlin, halt conditions. The principle is the same: a human or automated guard with a hand on the kill switch, always.
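The abort conditions listed above combine thresholds with durations, which is easy to get subtly wrong. This is a tool-agnostic sketch, not the Chaos Mesh, Litmus, or Gremlin mechanism: `Sample`, `sustained`, and `should_abort` are hypothetical names, and the samples are assumed to arrive in chronological order from whatever metrics source you use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    t: float               # seconds since the experiment started
    error_rate: float      # fraction of failed requests
    p99_latency_ms: float


def sustained(samples, breached, window_s):
    """True if `breached` held continuously for at least window_s seconds."""
    start = None
    for s in samples:
        if breached(s):
            if start is None:
                start = s.t        # first sample of a new breach streak
            if s.t - start >= window_s:
                return True
        else:
            start = None           # streak broken; start over
    return False


def should_abort(samples, page_fired=False, manual_abort=False):
    """Apply the example abort conditions from the list above."""
    if page_fired or manual_abort:
        return True
    return (
        sustained(samples, lambda s: s.error_rate > 0.01, 30)
        or sustained(samples, lambda s: s.p99_latency_ms > 500, 60)
    )


# Error rate above 1% for a full 30 seconds: abort.
spike = [Sample(t, 0.02, 100.0) for t in (0, 10, 20, 30)]
# Same error rate, but it recovers at t=20, so no 30-second streak exists.
blip = [
    Sample(0, 0.02, 100.0),
    Sample(10, 0.02, 100.0),
    Sample(20, 0.005, 100.0),
    Sample(30, 0.02, 100.0),
    Sample(40, 0.02, 100.0),
]
```

Requiring a sustained breach rather than a single bad sample keeps a momentary blip from cancelling an otherwise valid experiment.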

Steady State

Define the system's normal in measurable terms before the experiment:

  • Requests per second
  • Error rate
  • Latency percentiles (P50, P99)
  • Queue depth, consumer lag
  • CPU/memory utilization
  • Business metrics (checkouts/min, signups/min)

Capture these from your real observability stack — Prometheus, Datadog, CloudWatch. If steady state isn't measurable, don't run the experiment; you have no way to detect impact.
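As an illustration of "measurable," the sketch below computes a P99 from raw latency samples and flags a deviation from baseline. It is an assumption-laden example: real stacks usually compute percentiles server-side (for instance from histograms), the nearest-rank method is just one percentile definition, and `percentile`, `deviates`, and the 20% tolerance are hypothetical choices.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0..100)."""
    if not values:
        raise ValueError("no samples: steady state is not measurable")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def deviates(baseline, during, tolerance=0.2):
    """True if `during` exceeds `baseline` by more than the tolerance."""
    return during > baseline * (1 + tolerance)


# Illustrative latencies in ms: 1..100 before the fault, doubled during it.
before = list(range(1, 101))
during = [2 * v for v in before]

baseline_p99 = percentile(before, 99)
fault_p99 = percentile(during, 99)
impact_detected = deviates(baseline_p99, fault_p99)
```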

Game Days

A game day is a scheduled, multi-team chaos exercise. Format:

Before (1 week):

  • Pick a scenario ("region failover," "DB primary loss," "Redis cluster split-brain")
  • Write hypotheses
  • Define abort conditions
  • Invite affected teams and incident responders

During (2-4 hours):

  • Brief everyone on scope and abort
  • Inject the fault
  • Observe and respond as if it's a real incident
  • Don't fix proactively — let the system fail naturally so you see real behavior
  • Abort if real impact occurs

After:

  • Blameless retro: what worked, what didn't, what surprised us
  • File tickets for every gap (alerting, runbook, code, capacity)
  • Schedule a re-run in 3-6 months

Game days are the highest-bandwidth way to find resilience gaps that span people, processes, and code — the gaps that pure automation can't surface.

Continuous Chaos

Once ad-hoc experiments are routine, move to continuous chaos: faults run on a schedule, so resilience is verified constantly instead of being assumed.

Netflix's Chaos Monkey kills random instances during business hours, every weekday. The discipline is enforced by chaos: if your service can't survive a random pod restart at 2 PM, you'll find out at 2 PM today instead of 2 AM next month.

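# Continuous PodChaos: kills one random stateless pod at the top of each hour,
# weekdays 09:00-16:00 (cron times follow the controller's clock)
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: hourly-pod-kill
spec:
  schedule: '0 9-16 * * 1-5'
  concurrencyPolicy: Forbid   # never overlap two runs
  historyLimit: 5             # keep the last few runs for review
  type: PodChaos
  podChaos:
    action: pod-kill          # one-shot action, so no duration is set
    mode: one
    selector:
      namespaces: [production]
      labelSelectors: { tier: stateless }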

Move to continuous chaos only when:

  • Steady-state alerting works reliably
  • Auto-rollback is in place
  • Affected teams know and have signed off
  • It runs during business hours (don't page on-call at 3 AM with self-inflicted noise)
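The checklist above can double as a gate in front of the scheduler. This is a minimal sketch, assuming business hours mean Monday to Friday, 09:00 to 17:00 local time; `in_business_hours` and `may_run_continuous` are hypothetical helpers, and the three boolean inputs stand in for checks you would wire to your own alerting and approval records.

```python
from datetime import datetime


def in_business_hours(now, start_hour=9, end_hour=17):
    """True on Monday-Friday between start_hour (inclusive) and end_hour (exclusive)."""
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def may_run_continuous(now, alerting_ok, auto_rollback, teams_signed_off):
    """All four prerequisites from the checklist must hold."""
    return (
        alerting_ok
        and auto_rollback
        and teams_signed_off
        and in_business_hours(now)
    )


# Wednesday 14:00 with every prerequisite met: allowed.
allowed = may_run_continuous(datetime(2024, 1, 3, 14, 0), True, True, True)
# Same prerequisites at 03:00: blocked, nobody should be paged for this.
blocked = may_run_continuous(datetime(2024, 1, 3, 3, 0), True, True, True)
```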

Hypothesis Catalog: Common Faults

Reusable hypotheses to inspire your own:

Compute faults

| Fault | Hypothesis |
| --- | --- |
| Kill pod | Service routes around dead pod in < 5s |
| Kill node | Pods reschedule; service unaffected |
| CPU spike (95%) | Service degrades but doesn't crash; HPA scales out |
| Memory pressure | OOMKill triggers restart; no data loss |
| Disk full | Writes fail with clear errors; reads continue |

Network faults

| Fault | Hypothesis |
| --- | --- |
| 200ms added latency | Within SLO; no cascading timeouts |
| 50% packet loss | Retries succeed; user-facing error rate < 1% |
| Partition from cache | Falls back to source of truth; latency rises but no errors |
| Partition from DB | Reads from replica; writes fail with retry-after |
| DNS resolution fails | Cached entries serve traffic for 5+ min |

Dependency faults

| Fault | Hypothesis |
| --- | --- |
| 3rd-party API returns 500 | Circuit breaker trips; fallback served |
| 3rd-party API slow (5s) | Timeout fires at 2s; retry; eventual success |
| DB primary fails | Failover < 30s; some writes fail; reads continue |
| Message queue down | Producer queues locally up to N msgs; alerts |
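To make the first dependency hypothesis concrete, here is a minimal circuit breaker sketch showing the behavior the experiment would verify: after repeated failures the breaker opens and serves the fallback without calling the dependency. The `CircuitBreaker` class, its threshold, and the reset interval are illustrative assumptions, not the API of any particular resilience library.

```python
import time


class CircuitBreaker:
    """Opens after N consecutive failures; allows a trial call after a cooldown."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None       # None means the breaker is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()   # open: skip the dependency entirely
            # Cooldown elapsed: fall through and let one trial call run.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0           # success closes the breaker
        self.opened_at = None
        return result


calls = []


def failing_api():
    """Stand-in for a third-party API that always returns a 500."""
    calls.append(1)
    raise RuntimeError("HTTP 500")


# Frozen clock: the cooldown never elapses during this demo.
breaker = CircuitBreaker(failure_threshold=3, reset_after_s=30.0, clock=lambda: 0.0)
results = [breaker.call(failing_api, lambda: "cached") for _ in range(5)]
```

All five callers get the fallback, but the dependency is only hit three times; the last two calls are short-circuited, which is the signal a chaos experiment would look for.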

Cloud-level faults

| Fault | Hypothesis |
| --- | --- |
| Single AZ outage | Traffic shifts to other AZs; no user impact |
| Region failover | RTO < 15 min; data loss < 1 min (RPO) |
| Spot instance reclaim | Workload migrates; batch job restarts cleanly |

Anti-Patterns

Random destruction without hypotheses. "We killed stuff and nothing broke" tells you nothing about specific properties. Define hypotheses, or the exercise is resilience theater.

Skipping steady state. Without baseline metrics, you can't detect impact. You'll either think it's fine when it isn't, or panic at noise.

Starting in production. Production is the environment that ultimately matters, but you don't start there on day one. Earn it: staging → prod with a tiny blast radius → prod with a bigger blast radius. Expect this to take months, not days.

No abort. Experiments without abort conditions become outages.

Running it once. A green chaos experiment a year ago says nothing about today's system. Continuous or scheduled re-runs catch regressions.

Blaming teams when experiments find gaps. The experiment didn't cause the gap; it exposed it. Otherwise the gap would have surfaced as a real incident at 3 AM. Reward the find.

What's Next

  • Best Practices — running chaos in production safely; org adoption; security and compliance considerations
