Getting Started
Install Chaos Mesh on a kind cluster, kill a pod, inject network latency, observe recovery
This page installs Chaos Mesh on a local Kubernetes cluster and runs three small experiments: killing a pod, slowing the network, and partitioning two services. By the end you'll have the tools to start asking "what if this fails?" of your own systems.
Prerequisites
You need a Kubernetes cluster (see Getting Started with K8s). For learning, a local kind cluster is enough:
kind create cluster --name chaos
kubectl get nodes
Install Chaos Mesh
curl -sSL https://mirrors.chaos-mesh.org/v2.7.0/install.sh | bash
# Or via Helm
helm repo add chaos-mesh https://charts.chaos-mesh.org
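helm install chaos-mesh chaos-mesh/chaos-mesh \
  -n chaos-mesh --create-namespace \
  --version 2.7.0 \
  --set chaosDaemon.runtime=containerd \
  --set chaosDaemon.socketPath=/run/containerd/containerd.sock  # kind nodes run containerd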
# Verify
kubectl get pods -n chaos-mesh
Open the dashboard:
kubectl port-forward -n chaos-mesh svc/chaos-dashboard 2333
open http://localhost:2333
(Token-based auth is configurable; for local dev it's permissive.)
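The "Verify" step above relies on reading the pod list by eye. A minimal sketch of a scripted check follows; `all_pods_ready` is a hypothetical helper name, and it assumes the default column layout of `kubectl get pods --no-headers` (NAME, READY, STATUS, ...).

```shell
# Reads `kubectl get pods --no-headers` output on stdin.
# Succeeds only if at least one pod is listed and every pod is
# Running with all of its containers ready (READY column like "1/1").
all_pods_ready() {
  awk '
    {
      split($2, ready, "/")   # READY column: ready containers / total containers
      if ($3 != "Running" || ready[1] != ready[2]) bad++
      total++
    }
    END { exit (total > 0 && bad == 0) ? 0 : 1 }
  '
}
```

Usage: kubectl get pods -n chaos-mesh --no-headers | all_pods_ready && echo "Chaos Mesh is up". If you only need to block until the pods are ready, kubectl wait --for=condition=Ready pods --all -n chaos-mesh --timeout=300s does that without any parsing.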
Deploy a Target App
# demo-app.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector: { matchLabels: { app: web } }
  template:
    metadata: { labels: { app: web } }
    spec:
      containers:
        - name: web
          image: nginxdemos/hello:plain-text
          ports: [{ containerPort: 80 }]
          readinessProbe:
            httpGet: { path: /, port: 80 }
            periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata: { name: web }
spec:
  selector: { app: web }
  ports: [{ port: 80 }]

kubectl apply -f demo-app.yaml
kubectl port-forward svc/web 8080:80 &
# Hit it; observe normal operation
while true; do curl -s http://localhost:8080 | head -1; sleep 1; done
Experiment 1: Kill a Pod
A core test: does the service survive a random pod restart?
# pod-kill.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-one-web-pod
spec:
  action: pod-kill
  mode: one # affect 1 random pod
  selector:
    namespaces: [default]
    labelSelectors:
      app: web
  duration: '30s' # optional; auto-cleanup

kubectl apply -f pod-kill.yaml
# Watch
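kubectl get pods -l app=web -w
You'll see one pod killed and a replacement created right away by the Deployment. Meanwhile, requests sent through the Service should keep succeeding, because the Service routes around the dead pod. One caveat: kubectl port-forward svc/web pins to a single pod, so if that pod is the one killed, the forward itself drops and the curl loop fails until you restart it. To test the Service itself, run the loop from a pod inside the cluster.
If requests through the Service see errors, your service has a resilience gap: perhaps sticky sessions, a single replica behind an ingress, or a missing readiness probe. That's the value: you found a problem that would otherwise have hit in production on some Tuesday.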
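Watching a curl loop scroll by makes it easy to miss a short burst of errors. The sketch below records one HTTP status code per request and summarizes the run afterwards. Both function names (`probe`, `summarize_probe`) are hypothetical, and the two-second timeout is an arbitrary choice for illustration.

```shell
# Reads one HTTP status code per line, as printed by curl -w '%{http_code}\n'.
# curl prints 000 when the connection itself fails, so anything that is not
# 2xx or 3xx is counted as a failure.
summarize_probe() {
  awk '
    {
      total++
      if ($1 ~ /^[23][0-9][0-9]$/) { run = 0 }
      else { failed++; run++; if (run > longest) longest = run }
    }
    END { printf "total=%d failed=%d longest_failure_run=%d\n", total, failed, longest }
  '
}

# Sends N requests to URL, roughly one per second, printing each status code.
probe() {
  url=$1; n=$2
  i=0
  while [ "$i" -lt "$n" ]; do
    curl -s -o /dev/null --max-time 2 -w '%{http_code}\n' "$url"
    i=$((i + 1))
    sleep 1
  done
}
```

Usage: start probe http://localhost:8080 60 | summarize_probe, then apply pod-kill.yaml in another terminal. A result of failed=0 supports the hypothesis that a single pod restart is invisible to clients; a non-zero longest_failure_run gives a rough measure of how long the gap lasted.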
Experiment 2: Network Latency
How does your service handle 200 ms added latency between pods?
# latency.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: add-latency
spec:
  action: delay
  mode: all
  selector:
    namespaces: [default]
    labelSelectors: { app: web }
  delay:
    latency: '200ms'
    correlation: '50'
    jitter: '50ms'
  duration: '60s'

kubectl apply -f latency.yaml
For 60 seconds, all pods labeled app=web see 200 ms ± 50 ms of added network latency. Observe your latency metrics: does P99 stay within what your SLO assumes? Does an upstream service start timing out? Are timeouts configured at all?
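If you have no latency dashboard yet, you can still get a rough P50/P99 from raw samples. The sketch below computes a nearest-rank percentile over one latency value per line; `percentile` is a hypothetical helper, and it assumes an integer percentile and plain decimal samples such as those printed by curl's %{time_total}.

```shell
# Reads one latency sample (in seconds) per line on stdin and prints the
# nearest-rank percentile. Usage: percentile 99 < samples.txt
percentile() {
  LC_ALL=C sort -n | awk -v p="$1" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((p * NR + 99) / 100)   # ceil(p * NR / 100) for integer p
      if (rank < 1) rank = 1
      print v[rank]
    }
  '
}
```

To collect samples, run something like curl -s -o /dev/null -w '%{time_total}\n' http://web in a loop from a client pod inside the cluster, once before and once during the experiment, and compare percentile 50 and percentile 99 for the two runs. Measuring through kubectl port-forward may not show the added delay, since that traffic may not cross the pod network interface where the delay is injected.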
Experiment 3: Network Partition
Simulate a network partition, where two services that can normally talk to each other are isolated:
# partition.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: partition-from-db
spec:
  action: partition
  mode: all
  selector:
    namespaces: [default]
    labelSelectors: { app: web }
  direction: both
  target:
    mode: all
    selector:
      namespaces: [default]
      labelSelectors: { app: database }
  duration: '120s'

kubectl apply -f partition.yaml
This manifest assumes pods labeled app=database exist in the default namespace; the demo app above does not deploy any. For 2 minutes, the web pods can't reach database pods. What happens?
- Best case: web pods serve cached / degraded responses; recover when partition heals.
- Bad case: web pods crash; user-facing 5xx; data inconsistencies after recovery.
The point of running this in staging is finding out which one you have, without users noticing.
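One way to decide which case you have is to record HTTP status codes during the partition and again after it heals, then compare the two runs. The sketch below is a hypothetical helper (`partition_verdict`) that takes two files of status codes, one per line. It only sees request success and failure, so it cannot detect data inconsistencies; those need a separate check against your data store.

```shell
# Classifies a partition run from two files of HTTP status codes (one per line):
# $1 = probes taken during the partition, $2 = probes taken after it healed.
# Any code that is not 2xx or 3xx (including curl's 000) counts as a failure.
partition_verdict() {
  during=$1; after=$2
  if grep -qvE '^[23][0-9]{2}$' "$after"; then
    echo "not recovered: errors continue after the partition healed"
  elif grep -qvE '^[23][0-9]{2}$' "$during"; then
    echo "bad case: user-facing errors during the partition, recovered afterwards"
  else
    echo "best case: no failed requests during or after the partition"
  fi
}
```

Usage: partition_verdict during.txt after.txt, where each file holds the output of curl -s -o /dev/null -w '%{http_code}\n' run in a loop against your user-facing endpoint.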
Cleanup
Chaos Mesh recovers each injected fault automatically when its duration elapses. To stop an experiment early:
kubectl delete podchaos kill-one-web-pod
kubectl delete networkchaos add-latency partition-from-db
Equivalent: Litmus
# Install Litmus
kubectl apply -f https://litmuschaos.github.io/litmus/2.16.0/litmus-2.16.0.yaml

# Run a pod-kill experiment
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: web-chaos
spec:
  appinfo:
    appns: default
    applabel: app=web
    appkind: deployment
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '30'
            - name: CHAOS_INTERVAL
              value: '10'
Litmus has a larger experiment library and a more workflow-oriented model. Chaos Mesh has more granular fault injection.
Equivalent: AWS Fault Injection Simulator
# Define an experiment template via AWS console or CLI
# Then start it:
aws fis start-experiment --experiment-template-id EXTabc123
AWS FIS handles AWS-level chaos (stop instances, throttle APIs, fail over an RDS instance across AZs). Use it for "what if AWS itself misbehaves" experiments.
A Real First Experiment for Your System
Don't start with the most exotic experiment. Start with:
- Kill a random pod in a stateless service. Should be a no-op.
- Add 100 ms of latency to one inter-service call. Verify your timeouts.
- Partition the cache from one of your services. Does the service degrade gracefully?
If all three pass, you have a more resilient system than you knew. If any fail, you've found weaknesses to fix.
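"Pass" is easier to judge when the threshold is written down before the experiment runs. A minimal sketch of such a gate follows, assuming you already have a count of failed and total requests from a probe; `within_budget` and its arguments are hypothetical names, not part of Chaos Mesh.

```shell
# Succeeds when the failure percentage is within the allowed budget.
# Usage: within_budget FAILED TOTAL MAX_PERCENT
within_budget() {
  failed=$1; total=$2; max_percent=$3
  [ "$total" -gt 0 ] || return 2   # no samples: the run proved nothing
  # Integer form of failed / total * 100 <= max_percent.
  [ $((failed * 100)) -le $((max_percent * total)) ]
}
```

For the "should be a no-op" pod kill, the budget is zero: within_budget 0 60 0 passes, while a single failed request out of 60 does not. For the latency and cache experiments, pick the percentage your SLO tolerates.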
Tear Down
kubectl delete -f demo-app.yaml
helm uninstall chaos-mesh -n chaos-mesh
kind delete cluster --name chaos
What's Next
You can now inject controlled failures. Next:
- Patterns — game days, blast radius, continuous experiments, hypothesis-driven design
- Best Practices — safety, production experiments, organizational adoption