Best Practices
Production cache - HA topology, sizing, eviction, persistence, observability, and the common pitfalls
The cache is in the critical path. A cache outage often looks like a database outage, because traffic falls through to a database that cannot absorb it. Treat the cache like one of your databases: redundant, observed, capacity-planned.
HA Topology
For Redis, three self-managed production modes, plus a managed option:
| Mode | Notes |
|---|---|
| Single instance + replica | Simple; manual failover; minimum production setup |
| Sentinel | Automatic failover for primary-replica setups; routes clients to current primary |
| Cluster | Sharded with auto-failover; required for very large datasets or write-throughput beyond one node |
| Managed (ElastiCache / MemoryStore) | Cloud provider runs all of the above for you |
Sizing:
- 3 primaries minimum for Cluster mode, so a majority can still vote when one fails.
- One replica per primary at minimum; two for critical workloads.
- Spread across availability zones.
- 3 or 5 Sentinel instances (an odd count) if not running Cluster.
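The odd-count advice follows from majority voting: a failover has to be authorized by more than half of the voters (Sentinels, or primaries in Cluster mode). A minimal arithmetic sketch, using hypothetical helper names `majority` and `tolerated_failures` that are not part of any Redis client:

```python
def majority(voters: int) -> int:
    """Smallest number of voters that forms a strict majority."""
    if voters < 1:
        raise ValueError("need at least one voter")
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voters can be lost while a majority can still be formed."""
    return voters - majority(voters)


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"{n} voters: majority={majority(n)}, "
              f"tolerates {tolerated_failures(n)} failure(s)")
```

Going from 3 to 4 voters buys no extra failure tolerance, which is why 3 or 5 is the usual recommendation. Sentinel's configurable `quorum` setting (how many Sentinels must agree that the primary is down) is a separate knob from this majority.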
Sizing and Eviction
Redis memory is finite. Decide what happens when it fills up:
```
# redis.conf
maxmemory 8gb
maxmemory-policy allkeys-lru
```

| Policy | Behavior |
|---|---|
| noeviction | Writes fail when full. Choose this for "Redis is my data store, not cache." |
| allkeys-lru | Evict least-recently-used across all keys. The usual choice for cache use (not the Redis default, which is noeviction). |
| allkeys-lfu | Evict least-frequently-used. Better when access patterns are stable. |
| volatile-lru | LRU only among keys with a TTL; protects keys you don't want evicted |
| volatile-ttl | Evict the keys closest to expiring |
| allkeys-random | Random; cheap; sometimes good for very flat access patterns |
The wrong choice can be catastrophic. With noeviction, a full cache starts rejecting writes, so your application sees errors instead of cache misses. Pick allkeys-lru or allkeys-lfu for typical cache use.
Memory Headroom
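Keep used memory under ~75% of maxmemory. Once usage reaches maxmemory, writes may have to evict before they can proceed, and keys you still wanted start disappearing, so the hit ratio drops. Headroom gives you time to react before that point. Capacity plan with growth in mind.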
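A minimal sketch of the headroom check, assuming the input is a dictionary shaped like the output of `INFO memory` (the `used_memory` and `maxmemory` fields). The function names and the `"no-limit"` / `"ok"` / `"alert"` labels are illustrative, not part of Redis:

```python
from typing import Mapping, Optional

HEADROOM_THRESHOLD = 0.75  # the ~75% guideline from the text


def memory_utilization(info: Mapping[str, int]) -> Optional[float]:
    """Return used_memory / maxmemory, or None when no limit is configured."""
    limit = int(info.get("maxmemory", 0))
    if limit <= 0:
        return None  # maxmemory 0 means "no limit"
    return int(info["used_memory"]) / limit


def headroom_status(info: Mapping[str, int],
                    threshold: float = HEADROOM_THRESHOLD) -> str:
    """Classify memory usage as 'no-limit', 'ok', or 'alert'."""
    ratio = memory_utilization(info)
    if ratio is None:
        return "no-limit"
    return "alert" if ratio > threshold else "ok"


if __name__ == "__main__":
    gib = 1024 ** 3
    print(headroom_status({"used_memory": 7 * gib, "maxmemory": 8 * gib}))
```

Treating a missing limit as its own state keeps the "No maxmemory" pitfall visible instead of silently reporting a healthy instance.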
Persistence (Or Not)
Two options, which can be enabled at the same time:
| Persistence | What it does | Pros | Cons |
|---|---|---|---|
| RDB (save N M) | Periodic snapshots to disk | Compact; fast restart | Loses data since last snapshot on crash |
| AOF (appendonly yes) | Append every write to a log | Better durability | Larger file; rewrite needed periodically |
| Both | RDB + AOF | Belt and braces | Disk IO cost |
For a pure cache, persistence is optional — losing the cache means a slow warm-up but no data loss. For Redis as a primary store, use AOF with appendfsync everysec.
Naming Conventions
A namespace scheme saves you when debugging:

```
<domain>:<entity>:<id>[:<aspect>]
user:profile:42
user:session:abc-123
order:cart:42
auth:token:xyz
metrics:hourly:2026-05-21-14
```

- Colon-separated, by convention.
- Predictable prefixes make SCAN-based inspection useful.
- Versions in keys avoid invalidation pain (user:profile:42:v3).
- Don't put TTL or expiry in the key name; that's metadata for Redis, not you.
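One way to keep the scheme consistent is to build keys through a single helper instead of formatting strings at each call site. The sketch below assumes the `<domain>:<entity>:<id>[:<aspect>]` layout above, with the version as an optional trailing `v<N>` segment. `make_key` and `scan_pattern` are hypothetical names, not library functions:

```python
from typing import Optional, Union

SEPARATOR = ":"


def _check(part: str) -> str:
    """Reject segments that would break the colon-separated layout."""
    if not part or SEPARATOR in part or any(ch.isspace() for ch in part):
        raise ValueError(f"invalid key segment: {part!r}")
    return part


def make_key(domain: str, entity: str, ident: Union[int, str],
             aspect: Optional[str] = None,
             version: Optional[int] = None) -> str:
    """Build <domain>:<entity>:<id>[:<aspect>][:v<version>]."""
    parts = [_check(domain), _check(entity), _check(str(ident))]
    if aspect is not None:
        parts.append(_check(aspect))
    if version is not None:
        parts.append(f"v{version}")
    return SEPARATOR.join(parts)


def scan_pattern(domain: str, entity: str) -> str:
    """Glob pattern for SCAN ... MATCH covering every key of one entity."""
    return SEPARATOR.join([_check(domain), _check(entity), "*"])


if __name__ == "__main__":
    print(make_key("user", "profile", 42, version=3))
    print(scan_pattern("user", "profile"))
```

Because every key goes through `_check`, a stray colon or space in an identifier fails at write time instead of producing a key that no prefix scan will find.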
Observability
A cache without visibility is a hidden bottleneck. Watch:
| Metric | Threshold |
|---|---|
| Hit ratio (keyspace_hits / (keyspace_hits + keyspace_misses)) | < 80% = investigate; < 50% = the cache isn't pulling its weight |
| Evictions/sec | > 0 sustained = memory pressure; bump maxmemory or shorten TTLs |
| Used memory | Keep below 75% of maxmemory; alert above that |
| Connected clients | Spikes correlate with client-pool issues |
| Slowlog (SLOWLOG GET 100) | KEYS *, big HGETALL, anything > 1ms |
| CMD/sec | Sudden drops = client connectivity; spikes = traffic |
| Replication lag | Replicas should be < 1s behind |
| Latency p99 | Should be sub-millisecond on healthy single-node |
The Redis exporter for Prometheus ships all of these. Hook into your existing dashboards.
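If you compute the hit ratio yourself, note that `keyspace_hits` and `keyspace_misses` are cumulative counters, so a ratio over the difference between two samples reflects current behaviour better than the lifetime total. A minimal sketch, assuming dictionaries shaped like `INFO stats` output; the function names and status labels are illustrative:

```python
from typing import Mapping, Optional


def hit_ratio(curr: Mapping[str, int],
              prev: Optional[Mapping[str, int]] = None) -> Optional[float]:
    """Hit ratio between two samples, or since startup when prev is None."""
    hits = int(curr["keyspace_hits"])
    misses = int(curr["keyspace_misses"])
    if prev is not None:
        hits -= int(prev["keyspace_hits"])
        misses -= int(prev["keyspace_misses"])
    if hits < 0 or misses < 0:
        return None  # counters went backwards, e.g. after a restart
    total = hits + misses
    if total == 0:
        return None  # no lookups in the window
    return hits / total


def classify(ratio: Optional[float]) -> str:
    """Apply the thresholds from the table: < 50% and < 80%."""
    if ratio is None:
        return "no-data"
    if ratio < 0.5:
        return "ineffective"
    if ratio < 0.8:
        return "investigate"
    return "ok"


if __name__ == "__main__":
    before = {"keyspace_hits": 90, "keyspace_misses": 10}
    after = {"keyspace_hits": 190, "keyspace_misses": 60}
    print(classify(hit_ratio(after, before)))
```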
Pitfalls and How to Avoid Them
| Pitfall | Symptom | Fix |
|---|---|---|
| KEYS * in production | Server pause (single-threaded) | Use SCAN |
| Big GET / HGETALL on huge values | Latency spike | Paginate via HSCAN, LRANGE; consider splitting the key |
| Hot key | One key saturates a single shard | Split the key into several suffixed copies so reads spread across hash slots; add a local caching layer in front |
| Cache stampede | DB collapses when popular key expires | Lock the refresh; jittered TTLs; soft expiry |
| Dual-write race | Cache shows old data after DB write | Invalidate AFTER DB commit; consider DEL before AND after |
| TLS termination on the proxy, not Redis | Plain-text traffic in the cluster | Enable Redis TLS, especially across AZ |
| Long-lived connections from short-lived runners | Connection storm at scale | Use a connection pool; close on shutdown |
| No maxmemory | OOM kills the process | Always set maxmemory and a policy |
| Caching huge unbounded objects | Memory explosion | Cap value sizes; reject above N KB |
| Treating cache as source of truth | Data loss on restart | If you need it, use a database |
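The cache stampede fix in the table combines two ideas: jittered TTLs so popular keys do not all expire together, and a lock so only one caller refreshes an expired key. The sketch below shows both in a single process, using a plain dictionary in place of Redis. `jittered_ttl` and `SingleFlightCache` are hypothetical names, and a multi-process deployment would need a shared lock (for example a Redis key set with `SET ... NX` and an expiry), which is not shown here:

```python
import random
import threading
import time
from typing import Any, Callable, Dict, Optional, Tuple


def jittered_ttl(base_seconds: float, spread: float = 0.1) -> float:
    """Scale base_seconds by a random factor in [1 - spread, 1 + spread]."""
    return base_seconds * (1 + random.uniform(-spread, spread))


class SingleFlightCache:
    """In-process stand-in: one loader call per expired key, others wait."""

    def __init__(self, ttl_seconds: float) -> None:
        self._ttl = ttl_seconds
        self._data: Dict[str, Tuple[float, Any]] = {}
        self._locks: Dict[str, threading.Lock] = {}
        self._guard = threading.Lock()

    def _lock_for(self, key: str) -> threading.Lock:
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def _fresh(self, key: str) -> Optional[Tuple[float, Any]]:
        entry = self._data.get(key)
        if entry is not None and entry[0] > time.monotonic():
            return entry
        return None

    def get(self, key: str, loader: Callable[[str], Any]) -> Any:
        entry = self._fresh(key)
        if entry is not None:
            return entry[1]
        with self._lock_for(key):
            # Re-check: another thread may have refreshed while we waited.
            entry = self._fresh(key)
            if entry is not None:
                return entry[1]
            value = loader(key)  # the expensive DB call, made once
            expires_at = time.monotonic() + jittered_ttl(self._ttl)
            self._data[key] = (expires_at, value)
            return value
```

The second freshness check inside the lock is what turns many concurrent misses into a single database call; without it, each waiting thread would reload the value in turn.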
Security
- TLS between clients and Redis (tls-port, tls-cert-file, tls-key-file).
- requirepass (or, better, ACL users in Redis 6+).
- Bind to internal interfaces only (bind 10.0.0.5), never public IPs.
- Disable dangerous commands in production: rename-command FLUSHALL "", same for KEYS, DEBUG, CONFIG.
- Per-environment instances: don't share Redis between staging and production "for cost."
Connection Management
Redis connections are cheap to keep, expensive to churn:
| Practice | Why |
|---|---|
| Use a client connection pool | Avoid handshake overhead per request |
| SO_KEEPALIVE | Detect dead connections faster |
| Reasonable timeouts on BLPOP/BRPOP | Don't pin a connection forever |
| One connection per blocking operation | Don't block your shared pool |
| Pipelining for bulk operations | Drastically reduce RTT cost |
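To illustrate the pipelining row, here is a sketch of a batched read. It assumes the client exposes redis-py's pipeline interface (`pipeline(transaction=False)`, then queued `get` calls, then `execute()`). `fetch_many` is a hypothetical helper, and `FakeClient` / `FakePipeline` are in-memory stand-ins so the example runs without a server:

```python
from typing import Any, Dict, Iterable, List


def fetch_many(client: Any, keys: Iterable[str],
               batch_size: int = 500) -> Dict[str, Any]:
    """GET many keys with one round trip per batch instead of one per key."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    key_list = list(keys)
    found: Dict[str, Any] = {}
    for start in range(0, len(key_list), batch_size):
        batch = key_list[start:start + batch_size]
        pipe = client.pipeline(transaction=False)  # batching only, no MULTI/EXEC
        for key in batch:
            pipe.get(key)  # queued client-side until execute()
        # execute() sends the batch and returns replies in request order.
        for key, value in zip(batch, pipe.execute()):
            found[key] = value
    return found


class FakePipeline:
    """Hypothetical stand-in that counts round trips instead of using a socket."""

    def __init__(self, owner: "FakeClient") -> None:
        self._owner = owner
        self._queued: List[str] = []

    def get(self, key: str) -> "FakePipeline":
        self._queued.append(key)
        return self

    def execute(self) -> List[Any]:
        self._owner.round_trips += 1
        return [self._owner.data.get(key) for key in self._queued]


class FakeClient:
    """Hypothetical stand-in for a Redis client, backed by a dict."""

    def __init__(self, data: Dict[str, Any]) -> None:
        self.data = dict(data)
        self.round_trips = 0

    def pipeline(self, transaction: bool = True) -> FakePipeline:
        return FakePipeline(self)
```

Batching in chunks keeps any single pipeline from building a very large request or reply, which would otherwise recreate the big-value latency spike listed under pitfalls.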
Multi-Region / Geo
Redis is not geo-distributed by default. Strategies:
- Active/passive with replication across regions; failover on outage.
- Per-region independent caches (recommended) — accept the "warm-up" cost over the complexity of cross-region writes.
- Active/active via Redis Enterprise CRDTs; complex; usually not worth it for a cache.
If you need active/active state, that's a job for a real database, not Redis.
When to Reach for Something Else
- Local memoization (in-process LRU like lru-cache) is faster than Redis for hot data shared only within one process.
- Two-tier cache: in-process LRU in front of Redis covers the very hottest keys.
- Search workloads belong in search systems, not Redis.
- Pub/sub at scale: Redis Pub/Sub doesn't persist messages; use a message queue.
- Vector / ANN: Redis has a vector module, but dedicated vector stores (pgvector, Qdrant, Weaviate) are usually better.
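The two-tier idea from the list above can be sketched as a small in-process LRU that falls back to a remote lookup on a miss. `TwoTierCache` is a hypothetical class, and `remote_get` stands in for whatever function reads from Redis. The sketch omits a TTL on the local tier, which a real deployment would want in order to bound staleness:

```python
from collections import OrderedDict
from typing import Any, Callable, Optional


class TwoTierCache:
    """In-process LRU in front of a slower remote cache lookup."""

    def __init__(self, remote_get: Callable[[str], Optional[Any]],
                 capacity: int = 1024) -> None:
        if capacity < 1:
            raise ValueError("capacity must be positive")
        self._remote_get = remote_get
        self._capacity = capacity
        self._local: "OrderedDict[str, Any]" = OrderedDict()
        self.local_hits = 0
        self.remote_lookups = 0

    def get(self, key: str) -> Optional[Any]:
        if key in self._local:
            self._local.move_to_end(key)  # mark as most recently used
            self.local_hits += 1
            return self._local[key]
        self.remote_lookups += 1
        value = self._remote_get(key)
        if value is not None:
            self._local[key] = value
            if len(self._local) > self._capacity:
                self._local.popitem(last=False)  # evict least recently used
        return value


if __name__ == "__main__":
    remote = {"user:profile:42": "alice"}
    cache = TwoTierCache(remote.get, capacity=2)
    cache.get("user:profile:42")
    cache.get("user:profile:42")
    print(cache.local_hits, cache.remote_lookups)
```

Keeping the local tier small is deliberate: it only needs to hold the hottest keys, and a small capacity limits how much stale data a process can hold after the remote value changes.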
Checklist
Production cache checklist
- HA topology (replicas, Sentinel, or Cluster) with automated failover
- maxmemory and maxmemory-policy configured explicitly
- Memory usage monitored; alert on > 75% of maxmemory
- Hit ratio, evictions, latency, slowlog scraped to Prometheus
- Persistence strategy chosen (RDB / AOF / both / off) for the use case
- TLS in transit; ACL users for clients; bind to internal interfaces
- Dangerous commands renamed or disabled
- Naming convention documented; namespaces per service
- Client connection pool tuned; pipelining for bulk reads
- Cache stampede protection on the hottest endpoints
- Negative cache for missing keys
- No KEYS * calls in production code
- Disaster recovery: tested restore from snapshot; runbook for cache-cold scenarios
- Per-environment instances; no shared cache across envs