Best Practices
Production Vault - HA topology, unsealing, auth methods, K8s integration, audit, and disaster recovery
Dev mode taught you the API. Production Vault is a different beast — it's stateful, it can be sealed (locked), and it sits in the critical path of every workload. A Vault outage is a deploy outage. Treat it accordingly.
Storage Backend
Vault stores its data in a configurable backend. The right choice defines your operational story:
| Backend | Notes |
|---|---|
| Integrated Storage (Raft) | The recommended choice; built-in clustering, no external dep. Use this for new deployments. |
| Consul | Older recommendation; still solid but adds an operational system to run |
| File / In-memory | Single-node only; testing |
| Cloud (S3, GCS, ...) | S3 has no locking, so it is single-node only; GCS can coordinate HA. Fine as a backup destination, not the path for a new HA cluster. |
Integrated Raft means 3 or 5 Vault nodes, no external Consul cluster, simpler ops. It's what you want.
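The "3 or 5" rule falls out of majority arithmetic. A minimal sketch, assuming the usual Raft majority rule (the function names are illustrative, not part of any Vault API):

```python
def quorum(nodes: int) -> int:
    """Smallest majority of a Raft cluster with `nodes` voting members."""
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return nodes // 2 + 1


def failures_tolerated(nodes: int) -> int:
    """How many voters can be lost while a majority is still reachable."""
    return nodes - quorum(nodes)


if __name__ == "__main__":
    for n in range(1, 8):
        print(f"{n} nodes: quorum={quorum(n)}, tolerates {failures_tolerated(n)}")
```

Going from 3 to 4 nodes raises the quorum from 2 to 3 but still tolerates only one failure, which is why even cluster sizes buy nothing.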
High Availability Topology
                ┌──────────────────┐
          ┌────►│   Vault Leader   │ ◄─── writes
          │     └──────────────────┘
          │              │ Raft replication
┌─────────┴────┐   ┌─────▼─────────┐   ┌──────────────┐
│    Client    │   │ Vault Standby │   │ Vault Standby│
│              │   └───────────────┘   └──────────────┘
└──────────────┘           ▲                   ▲
                           └────── reads ──────┘ (with performance standbys)

- 3 or 5 nodes in the cluster (odd, because an even count adds a node without adding failure tolerance).
- One leader, others standby. Reads can be served from performance standbys (Enterprise).
- A load balancer in front, health-checking sys/health: it returns 200 for active, 429 for standby.
- Spread across availability zones / racks, so that one zone failure doesn't kill quorum.
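The health-check rule above can be sketched as a small routing decision. The 200 and 429 codes come from the text; the remaining codes are the defaults documented for sys/health, and the function names are hypothetical:

```python
# Status codes returned by Vault's sys/health endpoint with default query parameters.
HEALTH_STATES = {
    200: "active",
    429: "standby",
    472: "dr-secondary",
    473: "performance-standby",
    501: "uninitialized",
    503: "sealed",
}


def classify(status_code: int) -> str:
    """Map a sys/health status code to a node state."""
    return HEALTH_STATES.get(status_code, "unknown")


def should_route(status_code: int, allow_read_replicas: bool = False) -> bool:
    """Decide whether a load balancer should send traffic to this node."""
    state = classify(status_code)
    if state == "active":
        return True
    # Performance standbys (Enterprise) can serve reads if the pool allows it.
    return allow_read_replicas and state == "performance-standby"
```

Most load balancers express this as "healthy means HTTP 200", which pins traffic to the leader and lets failover happen without any extra logic.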
Sealing and Unsealing
Vault encrypts its data with keys that are protected by a root key (called the master key in older docs), and the decrypted keys live only in memory. On startup or restart Vault is sealed: it can't decrypt anything until it is given the unseal keys. This is good (no plaintext on disk) and operationally awkward (the cluster won't come back without intervention).
Three ways to handle it:
| Method | How |
|---|---|
| Shamir's Secret Sharing | Split unseal key into N shares; need M to unseal. Human ops. |
| Auto-unseal with cloud KMS | AWS KMS / GCP KMS / Azure Key Vault / OCI does the unseal. What you want in production. |
| HSM | Hardware security module holds the key. Enterprise/regulated environments. |
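To make the "N shares, need M" idea in the table concrete, here is a toy Shamir split over a prime field. It is an illustration of the technique only, not Vault's implementation (which works byte-wise over a small finite field), and the prime and function names are choices made for this sketch:

```python
import secrets

PRIME = 2**127 - 1  # a Mersenne prime; the secret must be smaller than this


def split(secret: int, shares: int, threshold: int) -> list[tuple[int, int]]:
    """Split `secret` into `shares` points; any `threshold` of them recover it."""
    if not 1 <= threshold <= shares:
        raise ValueError("need 1 <= threshold <= shares")
    if not 0 <= secret < PRIME:
        raise ValueError("secret out of range")
    # Random polynomial of degree threshold-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def evaluate(x: int) -> int:
        acc = 0
        for c in reversed(coeffs):  # Horner's rule
            acc = (acc * x + c) % PRIME
        return acc

    return [(x, evaluate(x)) for x in range(1, shares + 1)]


def combine(points: list[tuple[int, int]]) -> int:
    """Lagrange-interpolate the polynomial at x = 0 to recover the secret."""
    secret = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

With `split(s, 5, 3)`, any three of the five points passed to `combine` return `s`, while two points leave the constant term undetermined.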
Auto-unseal example (Raft + AWS KMS):
# vault.hcl
storage "raft" {
  path    = "/vault/data"
  node_id = "vault-1"
}

listener "tcp" {
  address       = "0.0.0.0:8200"
  tls_cert_file = "/etc/vault/tls/cert.pem"
  tls_key_file  = "/etc/vault/tls/key.pem"
}

seal "awskms" {
  region     = "us-east-1"
  kms_key_id = "alias/vault-unseal"
}

cluster_addr = "https://vault-1.example.com:8201"
api_addr     = "https://vault.example.com:8200"

Now restarts unseal automatically. Recovery keys (M-of-N) are still issued, for privileged operations such as generating a new root token.
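Auto-unseal trades one risk for another. If your KMS goes down or you lose access to the key, Vault cannot unseal, and the recovery keys will not rescue you: they authorize privileged operations such as generating a root token or rekeying, but they cannot decrypt the root key. Guard the KMS key against deletion and lock down who can change its policy. Still print the recovery keys, distribute them to humans, and store them out-of-band. Test the recovery procedure once a year.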
Auth Methods
token is for humans and bootstrapping. Production workloads use an identity-aware auth method:
| Method | Best for |
|---|---|
| Kubernetes | Pods authenticate with their ServiceAccount JWT |
| AWS IAM | EC2 / ECS / Lambda / EKS workloads with an IAM role |
| GCP / Azure | Same idea on those clouds |
| OIDC / JWT | GitHub Actions, GitLab CI, generic CI runners |
| AppRole | Anything else that can hold a secret_id (legacy fallback) |
| userpass / LDAP / SSO | Humans |
Kubernetes Auth (the most common)
vault auth enable kubernetes

vault write auth/kubernetes/config \
    kubernetes_host="https://kubernetes.default.svc" \
    token_reviewer_jwt="@/var/run/secrets/kubernetes.io/serviceaccount/token" \
    kubernetes_ca_cert=@/var/run/secrets/kubernetes.io/serviceaccount/ca.crt

vault write auth/kubernetes/role/myapp \
    bound_service_account_names=myapp \
    bound_service_account_namespaces=production \
    policies=app \
    ttl=1h

A pod with that ServiceAccount can now exchange its projected JWT for a Vault token:
# Inside the pod
JWT=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
VAULT_TOKEN=$(curl -s --request POST \
    --data "{\"jwt\":\"$JWT\",\"role\":\"myapp\"}" \
    "$VAULT_ADDR/v1/auth/kubernetes/login" | jq -r .auth.client_token)

Or use the Vault Agent Injector, a mutating webhook that adds an init container to your pods which does the auth and writes secrets to a shared volume. No code changes in your app.
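For an app that does want to log in itself, the curl exchange above might look like this in Python using only the standard library. This is a sketch: the helper names are hypothetical, and a real client would add retries and token renewal.

```python
import json
import urllib.request


def build_login_request(vault_addr: str, jwt: str, role: str) -> urllib.request.Request:
    """Build the POST to auth/kubernetes/login; json.dumps handles the escaping."""
    body = json.dumps({"jwt": jwt, "role": role}).encode("utf-8")
    return urllib.request.Request(
        url=f"{vault_addr.rstrip('/')}/v1/auth/kubernetes/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_client_token(response_body: bytes) -> str:
    """Pull auth.client_token out of the login response."""
    payload = json.loads(response_body)
    return payload["auth"]["client_token"]


def login(vault_addr: str, jwt_path: str, role: str) -> str:
    """Read the ServiceAccount JWT from disk and exchange it for a Vault token."""
    with open(jwt_path, encoding="utf-8") as fh:
        jwt = fh.read().strip()
    request = build_login_request(vault_addr, jwt, role)
    with urllib.request.urlopen(request, timeout=10) as response:
        return extract_client_token(response.read())
```

Building the body with `json.dumps` avoids the hand-escaped quoting the shell version needs.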
CI Auth with OIDC (GitHub Actions example)
# The job also needs `permissions: id-token: write` so GitHub will issue an OIDC token.
- name: Auth to Vault
  uses: hashicorp/vault-action@v3
  with:
    url: https://vault.example.com
    method: jwt
    role: gha-deploy
    secrets: |
      secret/data/deploy aws_access_key | AWS_ACCESS_KEY_ID ;
      secret/data/deploy aws_secret_key | AWS_SECRET_ACCESS_KEY ;

GitHub's OIDC token is verified by Vault; no long-lived secret in GitHub. This is the modern way to wire CI to a secret store.
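When a JWT login is rejected, it usually helps to see which claims the token carries and compare them with what the Vault role binds. The sketch below decodes a token's claims without verifying the signature (debugging only) and applies a simplified exact-match check. The matching logic is an illustration, not Vault's own, and the claim values are made up.

```python
import base64
import json


def decode_claims(jwt: str) -> dict:
    """Return the claims of a JWT WITHOUT verifying its signature (debugging only)."""
    parts = jwt.split(".")
    if len(parts) != 3:
        raise ValueError("expected header.payload.signature")
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64url padding
    return json.loads(base64.urlsafe_b64decode(payload))


def matches_bound_claims(claims: dict, bound: dict) -> bool:
    """True when every bound claim is present in the token with the same value."""
    return all(claims.get(name) == value for name, value in bound.items())
```

Binding a role to claims such as `repository` and `ref` is what stops an unrelated repository from assuming the `gha-deploy` role.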
Audit Devices
Turn on audit logging on day one — there's no useful forensics without it.
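vault audit enable file file_path=/vault/logs/audit.log
# Or stream to syslog, or to a socket → your log pipeline

Every request and response is logged, with sensitive fields HMAC'd rather than written in plaintext. Ship those logs to your SIEM / ELK. Once an audit device is enabled, Vault refuses requests it cannot log, so a single device with a full disk becomes an outage; enable a second device for redundancy.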
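The HMAC'd fields are what let you check whether a known secret appears in the log without the log ever containing it. A minimal sketch of the idea follows. The salt here is a stand-in: the real one is internal to each audit device, so in practice you would ask Vault to compute the hash via its sys/audit-hash endpoint.

```python
import hashlib
import hmac


def audit_hash(salt: bytes, value: str) -> str:
    """HMAC-SHA256 a value and format it the way audit log fields appear."""
    digest = hmac.new(salt, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"hmac-sha256:{digest}"


def appears_in_log(salt: bytes, candidate: str, logged_value: str) -> bool:
    """Check whether `candidate` is the plaintext behind a logged HMAC field."""
    return hmac.compare_digest(audit_hash(salt, candidate), logged_value)
```

The same input always hashes to the same value, so a leaked token can be traced through every request it made.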
Kubernetes Integration Patterns
Three ways to get secrets to pods, ranked by sophistication:
1. Vault Agent Injector (simple, popular)
Annotations on the pod template (spec.template.metadata in a Deployment) tell the injector to add an init/sidecar container that auths to Vault and writes secrets to a shared volume:

# under spec.template of the Deployment
metadata:
  annotations:
    vault.hashicorp.com/agent-inject: "true"
    vault.hashicorp.com/role: "myapp"
    vault.hashicorp.com/agent-inject-secret-db: "database/creds/myapp-read"
    vault.hashicorp.com/agent-inject-template-db: |
      {{- with secret "database/creds/myapp-read" -}}
      DB_USER={{ .Data.username }}
      DB_PASS={{ .Data.password }}
      {{- end -}}

The app reads /vault/secrets/db and never needs to know Vault exists. The agent renews leases on its own.
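On the application side, consuming the rendered file is ordinary file parsing. A sketch, assuming the KEY=VALUE layout produced by the template above (the function names are illustrative):

```python
from pathlib import Path


def parse_env_lines(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first "=" only, so values may themselves contain "=".
        key, sep, value = line.partition("=")
        if not sep:
            # Deliberately do not echo the line: it may contain a secret.
            raise ValueError("malformed line (expected KEY=VALUE)")
        values[key.strip()] = value
    return values


def load_db_credentials(path: str = "/vault/secrets/db") -> dict[str, str]:
    """Read the file the Vault Agent rendered into the shared volume."""
    return parse_env_lines(Path(path).read_text(encoding="utf-8"))
```

Because dynamic database credentials rotate, an app should re-read the file when a connection is refused rather than caching the values for the life of the process.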
2. External Secrets Operator + Vault
The External Secrets Operator pulls from Vault and creates native Kubernetes Secrets. Apps consume them via standard envFrom / volume mounts.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-creds
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
    kind: ClusterSecretStore
  target:
    name: db-creds-secret
  data:
    - secretKey: password
      remoteRef:
        key: secret/app/db
        property: password

Trade-off: secrets land as Kubernetes Secrets, which are base64-encoded rather than encrypted. That is less ideal for true zero-trust, but it is a smooth integration path.
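To see why base64 offers no protection, note that anyone who can read the Secret object can reverse the encoding in one call. A small sketch (the function name and sample value are made up):

```python
import base64


def decode_secret_data(data: dict[str, str]) -> dict[str, str]:
    """Decode the `data` map of a Kubernetes Secret manifest."""
    return {key: base64.b64decode(value).decode("utf-8") for key, value in data.items()}


if __name__ == "__main__":
    print(decode_secret_data({"password": "aHVudGVyMg=="}))
    # prints {'password': 'hunter2'}
```

What actually protects these values is RBAC on the Secret and encryption at rest for etcd, so both matter more once ESO is in the picture.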
3. CSI Secrets Store + Vault provider
Mount secrets as files via the Kubernetes CSI Secret Store driver. Similar to the agent injector but uses CSI semantics.
For most teams, Vault Agent Injector is the right starting point; graduate to ESO if you need broader K8s ecosystem integration.
Backup and Disaster Recovery
Vault's data is the recipe for accessing everything else — losing it is catastrophic.
# Take a Raft snapshot
vault operator raft snapshot save backup-$(date +%F).snap

# Restore (replaces all data!)
vault operator raft snapshot restore backup-2026-05-21.snap

Automation:
- Schedule snapshots every 15-60 minutes via a CronJob.
- Ship snapshots off-cluster to S3 / GCS with encryption (separate KMS key).
- Test restores quarterly. An untested backup is not a backup.
- Cross-region replica for DR if you have Enterprise (Performance / DR replication).
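A scheduled job needs two small pieces beyond the CLI call: a file name that does not collide when several snapshots are taken on the same day (a date-only name would), and a freshness check to alert on. A sketch with hypothetical helper names; only the `vault operator raft snapshot save` command comes from the text above:

```python
from datetime import datetime, timedelta, timezone


def snapshot_name(now: datetime) -> str:
    """Timestamped name so sub-daily snapshots never overwrite each other.

    Expects a timezone-aware datetime; it is normalised to UTC.
    """
    stamp = now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H%M%SZ")
    return f"backup-{stamp}.snap"


def snapshot_command(now: datetime) -> list[str]:
    """Argument vector for subprocess.run(..., check=True)."""
    return ["vault", "operator", "raft", "snapshot", "save", snapshot_name(now)]


def is_stale(last_snapshot: datetime, now: datetime,
             max_age: timedelta = timedelta(hours=1)) -> bool:
    """True when the newest snapshot is older than the allowed age."""
    return now - last_snapshot > max_age
```

Alerting on `is_stale` catches the failure mode where the CronJob silently stops running and nobody notices until a restore is needed.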
Operational Habits
A handful that pay off:
- TLS for everything. Self-signed for dev; ACM / Let's Encrypt / your CA in production. Vault should never speak plain HTTP.
- Never use the root token from a workflow. Use the initial root token once for setup, then revoke it; if you need one again, generate it with a quorum of key holders and revoke it when done. Day-to-day work uses identity-bound tokens.
- Policy as code. Policies in git, deployed by CI. vault policy write from terminals is a smell.
- One mount per logical concern. secret/, database/, pki/internal/, transit/customer-data/: paths express intent.
- Lease TTLs short by default. 1h, not 30 days. If something can't handle 1h renewal, instrument it.
- Don't enable everything. Each secrets engine and auth method is an attack surface. Enable what you actively use.
- Monitor the unseal status. A standby that won't unseal during a leader election ruins your HA.
- Watch Vault token usage and failed login metrics. Anomalies = early signal of misuse.
Checklist
Production Vault checklist
- 3 or 5 nodes with Integrated Storage (Raft), spread across AZs
- Auto-unseal via cloud KMS or HSM
- TLS on all listeners; cert rotation automated
- Load balancer health-checks sys/health
- Audit device enabled, logs shipped to a SIEM
- No long-lived tokens in workloads; identity-based auth (K8s/AWS/OIDC) everywhere
- Policies version-controlled and deployed by CI
- Default lease TTLs short (≤ 1h) with renewal
- Raft snapshots on a schedule, shipped off-cluster, encrypted
- DR procedure tested at least annually
- Root token revoked after initial setup; regenerate one only when needed and revoke it again afterwards
- Monitoring: leader stability, unseal status, audit log volume, token issuance rate
- Documented break-glass procedure (who unseals, where the keys are)