Steven's Knowledge

Getting Started

Install Chaos Mesh on a kind cluster, kill a pod, inject network latency, observe recovery

This page installs Chaos Mesh on a local Kubernetes cluster and runs three small experiments: killing a pod, slowing the network, and partitioning two services. By the end you'll have the tools to start asking "what if this fails?" of your own systems.

Prerequisites

A Kubernetes cluster (see Getting Started with K8s). For learning, a local kind cluster is enough:

kind create cluster --name chaos
kubectl get nodes

Install Chaos Mesh

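# Option 1: install script (see the Chaos Mesh quick start for kind-specific flags)
curl -sSL https://mirrors.chaos-mesh.org/v2.7.0/install.sh | bash

# Option 2: Helm (use one option, not both)
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm install chaos-mesh chaos-mesh/chaos-mesh \
  -n chaos-mesh --create-namespace \
  --version 2.7.0 \
  --set chaosDaemon.runtime=containerd \
  --set chaosDaemon.socketPath=/run/containerd/containerd.sock  # kind nodes run containerd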

# Verify
kubectl get pods -n chaos-mesh

Open the dashboard:

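kubectl port-forward -n chaos-mesh svc/chaos-dashboard 2333 &
open http://localhost:2333   # macOS; use xdg-open on Linux

(Token-based auth is configurable and is on by default in recent releases. For a throwaway local cluster, installing with the Helm value dashboard.securityMode=false skips the token prompt.)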

Deploy a Target App

# demo-app.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector: { matchLabels: { app: web } }
  template:
    metadata: { labels: { app: web } }
    spec:
      containers:
        - name: web
          image: nginxdemos/hello:plain-text
          ports: [{ containerPort: 80 }]
          readinessProbe:
            httpGet: { path: /, port: 80 }
            periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata: { name: web }
spec:
  selector: { app: web }
  ports: [{ port: 80 }]

kubectl apply -f demo-app.yaml
kubectl port-forward svc/web 8080:80 &

# Hit it; observe normal operation
while true; do curl -s http://localhost:8080 | head -1; sleep 1; done
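
To turn the loop above into a number you can compare before and during an experiment, a minimal sketch is to record HTTP status codes and tally them. The function names probe and tally_codes are made up for this example, and counting only 2xx responses as success is an assumption you may want to change.

```shell
# Count 2xx responses as ok and everything else as fail.
# curl reports 000 when it cannot connect or times out, so those count as fail.
tally_codes() {
  awk '/^2[0-9][0-9]$/ { ok++; next }
       { fail++ }
       END { printf "ok=%d fail=%d\n", ok, fail }'
}

# Send COUNT requests to URL, roughly one per second, printing one status code per line.
probe() {
  url=$1
  count=$2
  i=0
  while [ "$i" -lt "$count" ]; do
    curl -s -o /dev/null --max-time 2 -w '%{http_code}\n' "$url"
    i=$((i + 1))
    sleep 1
  done
}
```

For example, running probe http://localhost:8080 60 | tally_codes while an experiment is active gives a single ok/fail line to compare against a baseline run.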

Experiment 1: Kill a Pod

A core test: does the service survive a random pod restart?

# pod-kill.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-one-web-pod
spec:
  action: pod-kill
  mode: one          # affect 1 random pod
  selector:
    namespaces: [default]
    labelSelectors:
      app: web
  duration: '30s'    # optional; auto-cleanup

kubectl apply -f pod-kill.yaml

# Watch
kubectl get pods -l app=web -w

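You'll see one pod killed and immediately recreated by the Deployment. Meanwhile, your curl loop should keep returning successfully, because the Service routes around the dead pod.

If your curl loop sees errors, your service may have a resilience gap: sticky sessions, a single replica behind an ingress, or a missing readiness probe. One caveat for this local setup: kubectl port-forward svc/web pins to a single pod rather than load-balancing, so if that pod is the one killed, the forward itself drops and has to be restarted. Rule that out before blaming the service. Finding a real gap is the value: you found a problem that would otherwise hit on Tuesday.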
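
Recovery time is worth measuring as well as watching. The sketch below polls the Deployment until every replica is ready again and prints the elapsed seconds. The helper names all_ready and wait_recovered are hypothetical; the kubectl jsonpath fields (.status.readyReplicas and .spec.replicas) are standard Deployment fields.

```shell
# Succeeds when a "ready/desired" string such as "3/3" shows full readiness.
# An empty ready count (e.g. "/3") means no replica is ready yet.
all_ready() {
  ready=${1%%/*}
  desired=${1##*/}
  [ -n "$ready" ] && [ "$ready" = "$desired" ]
}

# Poll the named Deployment once per second until it is fully ready,
# then print how long that took.
wait_recovered() {
  deploy=$1
  start=$(date +%s)
  until all_ready "$(kubectl get deploy "$deploy" \
      -o jsonpath='{.status.readyReplicas}/{.spec.replicas}')"; do
    sleep 1
  done
  echo "recovered after $(( $(date +%s) - start ))s"
}
```

Run wait_recovered web right after applying pod-kill.yaml. The Deployment status can lag the kill by a moment, so treat the printed time as approximate.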

Experiment 2: Network Latency

How does your service handle 200 ms added latency between pods?

# latency.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: add-latency
spec:
  action: delay
  mode: all
  selector:
    namespaces: [default]
    labelSelectors: { app: web }
  delay:
    latency: '200ms'
    correlation: '50'
    jitter: '50ms'
  duration: '60s'

kubectl apply -f latency.yaml

For 60 seconds, all pods labeled app=web see 200 ms ± 50 ms of added latency on their outgoing traffic. Observe your latency metrics: does P99 stay within what your SLO assumes? Does an upstream service start timing out? Are timeouts configured at all?
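
If you have no latency dashboards on a local cluster, a rough sketch is to time requests from a temporary pod inside the cluster and summarize them. Probing in-cluster is the safer measurement here, since traffic through kubectl port-forward may not cross the pod interface where the delay is applied. The function name latency_stats and the pod name probe are made up for this example, and the curlimages/curl image is an assumption; any image with curl and a shell should work.

```shell
# Summarize request times given in seconds, one per line, as milliseconds.
# Non-numeric lines (such as kubectl's "pod deleted" message) are ignored.
latency_stats() {
  awk '/^[0-9]+(\.[0-9]+)?$/ {
         t = $1 + 0
         if (n == 0 || t < min) min = t
         if (n == 0 || t > max) max = t
         sum += t
         n++
       }
       END {
         if (n) printf "n=%d min=%.0fms avg=%.0fms max=%.0fms\n", n, min * 1000, sum / n * 1000, max * 1000
       }'
}

# Example use (needs the cluster and the web Service from above):
#   kubectl run probe --rm -i --restart=Never --image=curlimages/curl --command -- \
#     sh -c 'for i in $(seq 20); do curl -s -o /dev/null -w "%{time_total}\n" http://web; done' \
#     | latency_stats
```

Run it once before applying latency.yaml and once while the experiment is active, then compare the two summary lines.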

Experiment 3: Network Partition

Simulate a split-brain — two services that can normally talk are isolated:

# partition.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: partition-from-db
spec:
  action: partition
  mode: all
  selector:
    namespaces: [default]
    labelSelectors: { app: web }
  direction: both
  target:
    mode: all
    selector:
      namespaces: [default]
      labelSelectors: { app: database }
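  duration: '120s'

kubectl apply -f partition.yaml

For 2 minutes, the web pods can't reach database pods. (The demo app above has no app=database workload, so point the target selector at a real dependency in your own cluster.) What happens?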

  • Best case: web pods serve cached / degraded responses; recover when partition heals.
  • Bad case: web pods crash; user-facing 5xx; data inconsistencies after recovery.

The point of running this in staging is finding out which one you have, without users noticing.

Cleanup

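Chaos Mesh reverts an injected fault when its duration expires, but the experiment objects stay in the cluster until deleted. To stop an experiment early or remove finished ones: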

kubectl delete podchaos kill-one-web-pod
kubectl delete networkchaos add-latency partition-from-db

Equivalent: Litmus

# Install Litmus
kubectl apply -f https://litmuschaos.github.io/litmus/2.16.0/litmus-2.16.0.yaml

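# Run a pod-delete experiment (Litmus' name for pod kill) by applying a ChaosEngine.
# A working engine also needs spec.chaosServiceAccount set to a service account
# with RBAC for the experiment; it is omitted here.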
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: web-chaos
spec:
  appinfo:
    appns: default
    applabel: app=web
    appkind: deployment
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '30'
            - name: CHAOS_INTERVAL
              value: '10'

Litmus has a larger experiment library and a more workflow-oriented model. Chaos Mesh has more granular fault injection.

Equivalent: AWS Fault Injection Service (FIS)

# Define an experiment template via AWS console or CLI
# Then start it:
aws fis start-experiment --experiment-template-id EXTabc123

AWS FIS handles AWS-level chaos (stop instances, throttle APIs, force an RDS failover). Use it for "what if AWS itself misbehaves" experiments.

A Real First Experiment for Your System

Don't start with the most exotic experiment. Start with:

  1. Kill a random pod in a stateless service. Should be a no-op.
  2. Add 100 ms latency to one inter-service call. Verify your timeouts.
  3. Partition the cache from one of your services. Does the service degrade gracefully?

If all three pass, you have a more resilient system than you knew. If any fail, you've found weaknesses to fix.
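
Deciding what "pass" means before running an experiment keeps the result honest. One simple way, sketched below, is to state the hypothesis as a maximum failure percentage and check the observed counts against it. The function name within_error_budget and the 1% threshold in the example are illustrative assumptions, not values from any tool.

```shell
# Succeeds when fail / (ok + fail) is at or below MAX_PCT percent.
# Fails when there were no requests at all, since that proves nothing.
within_error_budget() {
  ok=$1
  fail=$2
  max_pct=$3
  total=$((ok + fail))
  [ "$total" -gt 0 ] && [ $((fail * 100)) -le $((max_pct * total)) ]
}

# Example: 99 successes and 1 failure against a 1% budget.
if within_error_budget 99 1 1; then
  echo "hypothesis held"
else
  echo "hypothesis failed"
fi
```

Integer arithmetic keeps the check portable across POSIX shells; multiply both sides rather than dividing so small samples are not rounded away.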

Tear Down

kubectl delete -f demo-app.yaml
helm uninstall chaos-mesh -n chaos-mesh
kind delete cluster --name chaos

What's Next

You can inject controlled failure. Next:

  • Patterns — game days, blast radius, continuous experiments, hypothesis-driven design
  • Best Practices — safety, production experiments, organizational adoption
