Steven's Knowledge

Best Practices

Production Vault - HA topology, unsealing, auth methods, K8s integration, audit, and disaster recovery

Dev mode taught you the API. Production Vault is a different beast — it's stateful, it can be sealed (locked), and it sits in the critical path of every workload. A Vault outage is a deploy outage. Treat it accordingly.

Storage Backend

Vault stores its data in a configurable backend. The right choice defines your operational story:

Backend | Notes
Integrated Storage (Raft) | The recommended default: built-in clustering, no external dependency. Use this for new deployments.
Consul | Older recommendation; still solid, but adds another system to operate.
File / In-memory | Single-node only; for testing.
Cloud object storage (S3, GCS, ...) | HA support varies by backend (S3 has no locking, so it is single-node only); fine as a backup target, not as the primary HA store.

Integrated Raft means 3 or 5 Vault nodes, no external Consul cluster, simpler ops. It's what you want.
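
The 3-or-5 advice follows from Raft's majority rule. A minimal Python sketch of the arithmetic (the function names are illustrative, not part of Vault):

```python
def quorum(nodes: int) -> int:
    """Smallest majority of a cluster with `nodes` voting members."""
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """How many nodes can be lost while a majority is still reachable."""
    return nodes - quorum(nodes)


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"{n} nodes: quorum {quorum(n)}, survives {tolerated_failures(n)} failure(s)")
```

A 4-node cluster needs 3 votes and still survives only one failure, the same as a 3-node cluster, which is why the even sizes buy nothing.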

High Availability Topology

                    ┌──────────────────┐
              ┌────►│  Vault Leader    │ ◄─── writes
              │     └──────────────────┘
              │              │ Raft replication
   ┌─────────┴────┐    ┌─────▼─────────┐    ┌──────────────┐
   │  Client      │    │ Vault Standby │    │ Vault Standby│
   │              │    └───────────────┘    └──────────────┘
   └──────────────┘             ▲                    ▲
                                └────── reads ──────┘ (with performance standbys)

  • 3 or 5 nodes in the cluster. Keep the count odd: an even number raises the quorum size without tolerating any more failures.
  • One leader, others standby. Standbys forward requests to the leader; reads can be served locally only from performance standbys (Enterprise).
  • A load balancer in front, health-checking /v1/sys/health, which returns 200 for the active node and 429 for a standby.
  • Spread across availability zones / racks, so that one zone failure doesn't kill quorum.
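
One way a health checker might interpret sys/health responses, sketched in Python. The 200 and 429 codes come from the text above; the other codes are Vault's documented defaults and are worth verifying against your version. The names HEALTH_STATES, classify_health, and accepts_writes are hypothetical.

```python
# Default status codes returned by GET /v1/sys/health.
HEALTH_STATES = {
    200: "active",               # initialized, unsealed, active
    429: "standby",              # unsealed, standby
    472: "dr-secondary",         # disaster-recovery secondary (Enterprise)
    473: "performance-standby",  # performance standby (Enterprise)
    501: "uninitialized",
    503: "sealed",
}


def classify_health(status_code: int) -> str:
    """Map a sys/health status code to a short state name."""
    return HEALTH_STATES.get(status_code, "unknown")


def accepts_writes(status_code: int) -> bool:
    """Only the active node should receive traffic from a write-path pool."""
    return status_code == 200
```

A load balancer that treats only 200 as healthy ends up sending all traffic to the leader, which is usually what you want outside Enterprise.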

Sealing and Unsealing

Vault encrypts everything with keys that are ultimately protected by a root key (formerly called the master key), which exists in plaintext only in memory. On startup or restart Vault is sealed: it can't decrypt anything until you supply the unseal keys. This is good (no plaintext keys on disk) and operationally awkward (the cluster won't come back without intervention).

Three ways to handle it:

Method | How
Shamir's Secret Sharing | Split the unseal key into N shares; any M of them unseal. Human ops.
Auto-unseal with cloud KMS | AWS KMS / GCP KMS / Azure Key Vault / OCI KMS does the unseal. What you want in production.
HSM | Hardware security module holds the key. Enterprise/regulated environments.
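
To make the M-of-N idea concrete, here is a toy Python sketch of Shamir's scheme over a prime field. It illustrates the math only: Vault's own implementation differs in detail, and the function names are made up for this example. Do not use it to protect real keys.

```python
import secrets

# Field modulus: 2**127 - 1 is a Mersenne prime, larger than the demo secrets.
PRIME = 2**127 - 1


def split_secret(secret: int, shares: int, threshold: int) -> list[tuple[int, int]]:
    """Split `secret` into `shares` points; any `threshold` of them recover it."""
    if not 1 <= threshold <= shares:
        raise ValueError("threshold must be between 1 and shares")
    if not 0 <= secret < PRIME:
        raise ValueError("secret out of range for the field")
    # Random polynomial of degree threshold-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def evaluate(x: int) -> int:
        result = 0
        for coeff in reversed(coeffs):  # Horner's rule
            result = (result * x + coeff) % PRIME
        return result

    return [(x, evaluate(x)) for x in range(1, shares + 1)]


def recover_secret(points: list[tuple[int, int]]) -> int:
    """Lagrange-interpolate the polynomial at x = 0 to get the constant term."""
    secret = 0
    for i, (xi, yi) in enumerate(points):
        numerator, denominator = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                numerator = (numerator * -xj) % PRIME
                denominator = (denominator * (xi - xj)) % PRIME
        secret = (secret + yi * numerator * pow(denominator, -1, PRIME)) % PRIME
    return secret
```

With `split_secret(key, shares=5, threshold=3)`, any three of the five points passed to `recover_secret` return the original value, while fewer than three reveal nothing useful about it.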

Auto-unseal example (Raft + AWS KMS):

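# vault.hcl

# Integrated Storage (Raft): data lives on local disk and is replicated to peers.
storage "raft" {
  path    = "/vault/data"
  node_id = "vault-1"
}

# API listener; TLS is mandatory in production.
listener "tcp" {
  address       = "0.0.0.0:8200"
  tls_cert_file = "/etc/vault/tls/cert.pem"
  tls_key_file  = "/etc/vault/tls/key.pem"
}

# Auto-unseal: Vault asks this KMS key to decrypt its root key at startup.
seal "awskms" {
  region     = "us-east-1"
  kms_key_id = "alias/vault-unseal"
}

# cluster_addr: this node's address for server-to-server (Raft) traffic.
# api_addr: the address advertised to clients.
cluster_addr = "https://vault-1.example.com:8201"
api_addr     = "https://vault.example.com:8200"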

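Now restarts unseal automatically. Recovery keys (M-of-N) are still issued; they authorize privileged operations such as generating a new root token or rekeying.

Auto-unseal trades one risk for another. If your KMS goes down, Vault cannot unseal until it is back. If you lose the KMS key for good, Vault is bricked, because recovery keys cannot decrypt the root key and so cannot unseal Vault on their own. Protect the KMS key: restrict who can disable or delete it. Print the recovery keys, distribute them to humans, store them out-of-band. Test the recovery procedure once a year.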

Auth Methods

token is for humans and bootstrapping. Production workloads use an identity-aware auth method:

Method | Best for
Kubernetes | Pods authenticate with their ServiceAccount JWT
AWS IAM | EC2 / ECS / Lambda / EKS workloads with an IAM role
GCP / Azure | Same idea on those clouds
OIDC / JWT | GitHub Actions, GitLab CI, generic CI runners
AppRole | Anything else that can hold a secret_id (legacy fallback)
userpass / LDAP / SSO | Humans

Kubernetes Auth (the most common)

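# The file paths below exist inside a pod, so run this from a Vault pod
vault auth enable kubernetes

# Tell Vault how to reach the Kubernetes API and validate ServiceAccount tokens
vault write auth/kubernetes/config \
  kubernetes_host="https://kubernetes.default.svc" \
  token_reviewer_jwt=@/var/run/secrets/kubernetes.io/serviceaccount/token \
  kubernetes_ca_cert=@/var/run/secrets/kubernetes.io/serviceaccount/ca.crt

# Map one ServiceAccount in one namespace to a policy and a 1h token TTL
vault write auth/kubernetes/role/myapp \
  bound_service_account_names=myapp \
  bound_service_account_namespaces=production \
  policies=app \
  ttl=1h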

A pod with that ServiceAccount can now exchange its projected JWT for a Vault token:

# Inside the pod
JWT=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
# --fail makes curl exit non-zero on an HTTP error instead of piping an error body to jq
VAULT_TOKEN=$(curl --silent --fail --request POST \
  --data "{\"jwt\":\"$JWT\",\"role\":\"myapp\"}" \
  "$VAULT_ADDR/v1/auth/kubernetes/login" | jq -r .auth.client_token)
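
The same exchange from application code, as a hedged Python sketch using only the standard library. The endpoint, payload fields, and token path come from the curl example above; build_login_request and login are hypothetical helper names, and error handling is left out for brevity.

```python
import json
import urllib.request

SA_TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"


def build_login_request(vault_addr: str, jwt: str, role: str) -> urllib.request.Request:
    """Build the POST to auth/kubernetes/login without sending it."""
    body = json.dumps({"jwt": jwt, "role": role}).encode("utf-8")
    return urllib.request.Request(
        url=f"{vault_addr.rstrip('/')}/v1/auth/kubernetes/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def login(vault_addr: str, role: str, token_path: str = SA_TOKEN_PATH,
          timeout: float = 5.0) -> str:
    """Exchange the pod's ServiceAccount JWT for a Vault client token."""
    with open(token_path, encoding="utf-8") as handle:
        jwt = handle.read().strip()
    request = build_login_request(vault_addr, jwt, role)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload["auth"]["client_token"]
```

Building the JSON with `json.dumps` avoids the quoting pitfalls of interpolating the JWT into a shell string.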

Or use the Vault Agent Injector — a mutating webhook that adds an init container to your pods which does the auth and writes secrets to a shared volume. No code changes in your app.

CI Auth with OIDC (GitHub Actions example)

- name: Auth to Vault
  uses: hashicorp/vault-action@v3
  with:
    url: https://vault.example.com
    method: jwt
    role: gha-deploy
    secrets: |
      secret/data/deploy   aws_access_key | AWS_ACCESS_KEY_ID ;
      secret/data/deploy   aws_secret_key | AWS_SECRET_ACCESS_KEY ;

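Vault verifies GitHub's OIDC token, so no long-lived secret is stored in GitHub. The job needs permissions: id-token: write so it can request that token, and the Vault role (gha-deploy here) should bind claims such as the repository so that only the intended workflows can log in. This is the modern way to wire CI to a secret store.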
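
On the Vault side, a role has to decide which GitHub tokens it accepts. The Python sketch below builds a plausible payload for the JWT auth method's role endpoint (auth/jwt/role/gha-deploy). The parameter names are the JWT auth method's role fields; the repository, audience, policy, and TTL values are placeholders, and github_jwt_role is a hypothetical helper. It also assumes the jwt mount is already configured to trust GitHub's issuer, https://token.actions.githubusercontent.com.

```python
import json


def github_jwt_role(repository: str, policies: list[str], audience: str,
                    ttl: str = "15m") -> dict:
    """Role payload that only admits tokens minted for one repository."""
    return {
        "role_type": "jwt",
        "user_claim": "repository",                  # claim used as the entity alias
        "bound_audiences": [audience],               # must match the token's aud claim
        "bound_claims": {"repository": repository},  # reject other repositories
        "token_policies": policies,
        "token_ttl": ttl,
    }


if __name__ == "__main__":
    # Placeholder values; substitute your own organisation, repo, and policy.
    role = github_jwt_role("example-org/example-repo", ["deploy"],
                           audience="https://github.com/example-org")
    print(json.dumps(role, indent=2))
```

Without a bound claim like `repository`, any workflow on GitHub that can reach your Vault could present a valid GitHub-signed token, so the binding is what makes the trust specific.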

Audit Devices

Turn on audit logging on day one — there's no useful forensics without it.

vault audit enable file file_path=/vault/logs/audit.log
# Or stream to syslog, or to a socket → your log pipeline

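Every request and response is logged, with sensitive fields HMAC'd. Ship those logs to your SIEM / ELK. Note that once an audit device is enabled, Vault refuses to serve requests it cannot log to at least one device, so a full disk behind the only audit device becomes an outage; enabling a second device guards against that.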
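
Because sensitive values appear only as HMACs, finding a known token in the logs means hashing the candidate the same way and comparing. The Python sketch below shows the idea with a made-up salt. In a real deployment the per-device salt is not available to you, so you would ask Vault to compute the hash (the sys/audit-hash endpoint exists for this). The function names are hypothetical.

```python
import hashlib
import hmac

PREFIX = "hmac-sha256:"


def audit_hash(salt: bytes, value: str) -> str:
    """HMAC-SHA256 a value and format it the way it shows up in audit entries."""
    digest = hmac.new(salt, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return PREFIX + digest


def matches(salt: bytes, candidate: str, logged: str) -> bool:
    """True if `candidate` hashes to the value seen in the log."""
    return hmac.compare_digest(audit_hash(salt, candidate), logged)
```

The same input always produces the same hash for a given device, which is what lets you correlate one token across many log entries without ever storing it in plaintext.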

Kubernetes Integration Patterns

Three ways to get secrets to pods, ranked by sophistication:

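1. Vault Agent Injector

Annotations on the Deployment's pod template tell the injector to add init and sidecar containers that authenticate to Vault and write secrets to a shared volume:

# Under spec.template in the Deployment (the pod template)
metadata:
  annotations: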
    vault.hashicorp.com/agent-inject: "true"
    vault.hashicorp.com/role: "myapp"
    vault.hashicorp.com/agent-inject-secret-db: "database/creds/myapp-read"
    vault.hashicorp.com/agent-inject-template-db: |
      {{- with secret "database/creds/myapp-read" -}}
      DB_USER={{ .Data.username }}
      DB_PASS={{ .Data.password }}
      {{- end -}}

The app reads /vault/secrets/db — never knows Vault exists. The agent renews leases on its own.
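
On the application side, consuming the rendered file can be as small as the Python sketch below. It assumes the KEY=VALUE template shown above; parse_env_file and load_db_credentials are hypothetical helper names.

```python
from pathlib import Path


def parse_env_file(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks, comments, and malformed lines."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, separator, value = line.partition("=")
        if not separator:
            continue  # no '=' on this line
        values[key.strip()] = value  # keep any further '=' inside the value
    return values


def load_db_credentials(path: str = "/vault/secrets/db") -> dict[str, str]:
    """Read the file the agent rendered into the shared volume."""
    return parse_env_file(Path(path).read_text(encoding="utf-8"))
```

Since the agent rewrites the file when credentials rotate, an app that reads it once at startup will eventually hold stale credentials; re-reading on connection failure is one simple way to pick up the new ones.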

2. External Secrets Operator + Vault

The External Secrets Operator pulls from Vault and creates native Kubernetes Secrets. Apps consume them via standard envFrom / volume mounts.

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-creds
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
    kind: ClusterSecretStore
  target:
    name: db-creds-secret
  data:
    - secretKey: password
      remoteRef:
        key: secret/app/db
        property: password

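Trade-off: secrets land as Kubernetes Secrets, which are only base64-encoded (not encrypted unless etcd encryption at rest is enabled) and readable by anyone with RBAC access to them. Less ideal for true zero-trust, but a smooth integration path.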

3. CSI Secrets Store + Vault provider

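Mount secrets as files via the Kubernetes Secrets Store CSI Driver with the Vault provider. Similar to the agent injector, but uses CSI semantics: secrets arrive as a volume mount rather than through an injected container.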

For most teams, Vault Agent Injector is the right starting point; graduate to ESO if you need broader K8s ecosystem integration.

Backup and Disaster Recovery

Vault's data is the recipe for accessing everything else — losing it is catastrophic.

# Take a Raft snapshot
vault operator raft snapshot save backup-$(date +%F).snap

# Restore (replaces all data!)
vault operator raft snapshot restore backup-2026-05-21.snap

Automation:

  • Schedule snapshots every 15-60 minutes via a CronJob.
  • Ship snapshots off-cluster to S3 / GCS with encryption (separate KMS key).
  • Test restores quarterly. An untested backup is not a backup.
  • Cross-region replica for DR if you have Enterprise (Performance / DR replication).
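
A scheduled job along those lines might look like the Python sketch below. The `vault operator raft snapshot save` command is the one shown above; the file-naming scheme, the retention helper, and all function names are assumptions for illustration. Timestamps replace the date-only name because several snapshots per day would otherwise overwrite each other.

```python
import datetime as dt
import subprocess


def snapshot_filename(moment: dt.datetime) -> str:
    """UTC-timestamped name; lexicographic order equals chronological order."""
    utc = moment.astimezone(dt.timezone.utc)
    return f"backup-{utc:%Y-%m-%dT%H%M%SZ}.snap"


def snapshot_command(path: str) -> list[str]:
    """argv for taking a Raft snapshot with the Vault CLI."""
    return ["vault", "operator", "raft", "snapshot", "save", path]


def snapshots_to_prune(filenames: list[str], keep: int) -> list[str]:
    """Return the oldest snapshot names beyond the newest `keep`."""
    snapshots = sorted(
        name for name in filenames
        if name.startswith("backup-") and name.endswith(".snap")
    )
    if keep <= 0:
        return snapshots
    return snapshots[:-keep]


def take_snapshot(directory: str) -> str:
    """Run the snapshot; relies on VAULT_ADDR and VAULT_TOKEN in the environment."""
    path = f"{directory}/{snapshot_filename(dt.datetime.now(dt.timezone.utc))}"
    subprocess.run(snapshot_command(path), check=True)
    return path
```

Uploading to S3 / GCS and pruning there is left out; the pruning helper only decides which names are old enough to delete.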

Operational Habits

A handful that pay off:

  1. TLS for everything. Self-signed for dev; ACM / Let's Encrypt / your CA in production. Vault should never speak plain HTTP.
  2. Never use the root token from a workflow. Use it once for initial setup, then revoke it; if you need one again, generate a new one with a quorum of unseal or recovery keys (vault operator generate-root). Day-to-day work uses identity-bound tokens.
  3. Policy as code. Policies in git, deployed by CI. vault policy write from terminals is a smell.
  4. One mount per logical concern. secret/, database/, pki/internal/, transit/customer-data/ — paths express intent.
  5. Lease TTLs short by default. 1h, not 30 days. If something can't handle 1h renewal, instrument it.
  6. Don't enable everything. Each secrets engine and auth method is an attack surface. Enable what you actively use.
  7. Monitor the unseal status. A standby that won't unseal during a leader election ruins your HA.
  8. Watch Vault token usage and failed login metrics. Anomalies = early signal of misuse.

Checklist

Production Vault checklist

  • 3 or 5 nodes with Integrated Storage (Raft), spread across AZs
  • Auto-unseal via cloud KMS or HSM
  • TLS on all listeners; cert rotation automated
  • Load balancer health-checks sys/health
  • Audit device enabled, logs shipped to a SIEM
  • No long-lived tokens in workloads; identity-based auth (K8s/AWS/OIDC) everywhere
  • Policies version-controlled and deployed by CI
  • Default lease TTLs short (≤ 1h) with renewal
  • Raft snapshots on a schedule, shipped off-cluster, encrypted
  • DR procedure tested at least annually
  • Root token revoked after initial setup; regenerated only for break-glass, then revoked again
  • Monitoring: leader stability, unseal status, audit log volume, token issuance rate
  • Documented break-glass procedure (who unseals, where the keys are)
