Steven's Knowledge

Best Practices

Production cache - HA topology, sizing, eviction, persistence, observability, and the common pitfalls

The cache is in the critical path. A cache outage often looks like a database outage, because traffic falls through to a database that can't handle the load. Treat the cache like one of your databases: redundant, observed, capacity-planned.

HA Topology

For Redis, three self-managed production modes, plus a managed option:

| Mode | Notes |
| --- | --- |
| Single instance + replica | Simple; manual failover; minimum production setup |
| Sentinel | Automatic failover for primary-replica setups; routes clients to current primary |
| Cluster | Sharded with auto-failover; required for very large datasets or write throughput beyond one node |
| Managed (ElastiCache / Memorystore) | Cloud provider runs all of the above for you |

Sizing:

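  • 3 primary nodes minimum for Cluster mode, so that a majority of primaries can still agree on a failover when one is lost.
  • One replica per primary at minimum; two for critical workloads.
  • Spread across availability zones.
  • 3 or 5 Sentinel processes if not running Cluster (an odd count avoids tied votes).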
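
The preference for odd node counts comes down to majority arithmetic. The sketch below is illustrative only; the function names are made up for this example and are not part of Redis or any client library.

```python
def majority(voters: int) -> int:
    """Smallest number of votes that is more than half of `voters`."""
    if voters < 1:
        raise ValueError("need at least one voter")
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voters can be lost while a majority is still reachable."""
    return voters - majority(voters)


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"{n} voters: majority={majority(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
```

Going from 3 to 4 voters does not buy any extra fault tolerance (both survive one failure), which is why 3 or 5 is the usual recommendation.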

Sizing and Eviction

Redis memory is finite. Decide what happens when it fills up:

# redis.conf
maxmemory 8gb
maxmemory-policy allkeys-lru

| Policy | Behavior |
| --- | --- |
| noeviction | Writes fail when full. Choose this for "Redis is my data store, not cache." This is what Redis uses if you set no policy. |
| allkeys-lru | Evict least-recently-used across all keys. The usual choice for cache use. |
| allkeys-lfu | Evict least-frequently-used. Better when access patterns are stable. |
| volatile-lru | LRU only among keys with a TTL; protects keys you don't want evicted |
| volatile-ttl | Evict the keys closest to expiring |
| allkeys-random | Random; cheap; sometimes good for very flat access patterns |

The wrong choice can be catastrophic. With noeviction, a full cache starts rejecting writes, so your application sees errors instead of cache misses. Pick allkeys-lru or allkeys-lfu for typical cache use.
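
To make the difference concrete, here is a toy in-process model of the two behaviours. It is a sketch, not how Redis is implemented: capacity is counted in keys rather than bytes, the LRU is exact (Redis approximates it by sampling), and `TinyCache` and `WriteRejected` are hypothetical names invented for the example.

```python
from collections import OrderedDict


class WriteRejected(Exception):
    """Stands in for the error a full noeviction instance returns on writes."""


class TinyCache:
    """Toy bounded cache supporting two policies: noeviction and allkeys-lru."""

    def __init__(self, capacity, policy):
        if policy not in ("noeviction", "allkeys-lru"):
            raise ValueError(f"unsupported policy: {policy}")
        self.capacity = capacity
        self.policy = policy
        self.evictions = 0
        self._data = OrderedDict()  # oldest access first, newest last

    def get(self, key):
        if key not in self._data:
            return None  # cache miss
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def set(self, key, value):
        if key in self._data:
            self._data[key] = value
            self._data.move_to_end(key)
            return
        if len(self._data) >= self.capacity:
            if self.policy == "noeviction":
                # The caller sees an error, not a miss.
                raise WriteRejected("cache is full and the policy forbids eviction")
            self._data.popitem(last=False)  # drop the least recently used key
            self.evictions += 1
        self._data[key] = value
```

With allkeys-lru a full cache quietly drops its coldest key; with noeviction the same write raises, and that error surfaces in the application.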

Memory Headroom

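Keep used memory under ~75% of maxmemory. Eviction only starts once used memory reaches maxmemory, and from that point writes pay the cost of evicting other keys, so the headroom is what absorbs bursts and growth before that happens. Capacity plan with growth in mind.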
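
A headroom check can be sketched as a small function over the memory section of INFO. This assumes the caller has already fetched that section as a dict containing the `used_memory` and `maxmemory` fields (both in bytes); the function names and the default threshold constant are illustrative.

```python
from typing import Mapping, Optional

HEADROOM_THRESHOLD = 0.75  # the ~75% guideline from the text


def memory_usage_ratio(info: Mapping[str, int]) -> Optional[float]:
    """Return used_memory / maxmemory, or None when no limit is configured."""
    maxmemory = info.get("maxmemory", 0)
    if maxmemory <= 0:
        # maxmemory 0 means "no limit", so a ratio is meaningless.
        return None
    return info["used_memory"] / maxmemory


def over_headroom(info: Mapping[str, int],
                  threshold: float = HEADROOM_THRESHOLD) -> bool:
    """True when used memory is above the chosen fraction of maxmemory."""
    ratio = memory_usage_ratio(info)
    return ratio is not None and ratio > threshold
```

A `None` ratio is worth alerting on in its own right, since it means the instance has no maxmemory set.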

Persistence (Or Not)

Two options; both can be enabled at the same time:

| Persistence | What it does | Pros | Cons |
| --- | --- | --- | --- |
| RDB (save N M) | Periodic snapshots to disk | Compact; fast restart | Loses data since last snapshot on crash |
| AOF (appendonly yes) | Append every write to a log | Better durability | Larger file; rewrite needed periodically |
| Both | RDB + AOF | Belt and braces | Disk IO cost |

For a pure cache, persistence is optional — losing the cache means a slow warm-up but no data loss. For Redis as a primary store, use AOF with appendfsync everysec.

Naming Conventions

A namespace scheme saves you when debugging:

<domain>:<entity>:<id>[:<aspect>]

user:profile:42
user:session:abc-123
order:cart:42
auth:token:xyz
metrics:hourly:2026-05-21-14

  • Colon-separated, by convention.
  • Predictable prefixes make SCAN-based inspection useful.
  • Versions in keys avoid invalidation pain (user:profile:42:v3).
  • Don't put TTL or expiry in the key name — that's metadata for Redis, not you.
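
One way to keep the scheme consistent is to build keys through a single helper rather than formatting strings at each call site. The sketch below is an assumption about how that might look; `make_key` and `scan_pattern` are hypothetical helpers, and the validation rules are only one reasonable choice.

```python
from typing import Optional, Union

SEPARATOR = ":"


def make_key(domain: str, entity: str, ident: Union[str, int],
             aspect: Optional[str] = None,
             version: Optional[int] = None) -> str:
    """Build <domain>:<entity>:<id>[:<aspect>][:v<version>]."""
    parts = [domain, entity, str(ident)]
    if aspect is not None:
        parts.append(aspect)
    if version is not None:
        parts.append(f"v{version}")
    for part in parts:
        # Reject anything that would make the key ambiguous to parse later.
        if not part or SEPARATOR in part or any(ch.isspace() for ch in part):
            raise ValueError(f"invalid key segment: {part!r}")
    return SEPARATOR.join(parts)


def scan_pattern(domain: str, entity: Optional[str] = None) -> str:
    """Glob pattern covering one namespace, for use with SCAN ... MATCH."""
    prefix = domain if entity is None else f"{domain}{SEPARATOR}{entity}"
    return f"{prefix}{SEPARATOR}*"
```

For example, `make_key("user", "profile", 42, version=3)` yields `user:profile:42:v3`, and `scan_pattern("user", "profile")` yields `user:profile:*`.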

Observability

A cache without visibility is a hidden bottleneck. Watch:

| Metric | Threshold |
| --- | --- |
| Hit ratio (keyspace_hits / (keyspace_hits + keyspace_misses)) | < 80% = investigate; < 50% = the cache isn't pulling its weight |
| Evictions/sec | > 0 sustained = memory pressure; bump maxmemory or shorten TTLs |
| Used memory | < 75% of maxmemory |
| Connected clients | Spikes correlate with client-pool issues |
| Slowlog (SLOWLOG GET 100) | KEYS *, big HGETALL, anything > 1ms |
| Commands/sec | Sudden drops = client connectivity; spikes = traffic |
| Replication lag | Replicas should be < 1s behind |
| Latency p99 | Should be sub-millisecond on healthy single-node |

The Redis exporter for Prometheus ships all of these. Hook into your existing dashboards.
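
If you compute the hit ratio yourself, note that keyspace_hits and keyspace_misses are cumulative counters, so a ratio over a recent window needs the difference between two samples. A minimal sketch, assuming each sample is a dict holding those two INFO fields; the function names and labels are illustrative, and the thresholds are the ones from the table above.

```python
from typing import Mapping, Optional


def hit_ratio(stats: Mapping[str, int]) -> Optional[float]:
    """keyspace_hits / (keyspace_hits + keyspace_misses), or None with no lookups."""
    hits = stats["keyspace_hits"]
    misses = stats["keyspace_misses"]
    total = hits + misses
    if total == 0:
        return None
    return hits / total


def windowed_hit_ratio(previous: Mapping[str, int],
                       current: Mapping[str, int]) -> Optional[float]:
    """Hit ratio for the interval between two samples of the counters."""
    delta = {
        "keyspace_hits": current["keyspace_hits"] - previous["keyspace_hits"],
        "keyspace_misses": current["keyspace_misses"] - previous["keyspace_misses"],
    }
    if delta["keyspace_hits"] < 0 or delta["keyspace_misses"] < 0:
        # Counters went backwards (restart or stats reset); skip this window.
        return None
    return hit_ratio(delta)


def classify(ratio: Optional[float]) -> str:
    """Map a ratio onto the thresholds used in this section."""
    if ratio is None:
        return "no data"
    if ratio < 0.5:
        return "not pulling its weight"
    if ratio < 0.8:
        return "investigate"
    return "ok"
```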

Pitfalls and How to Avoid Them

| Pitfall | Symptom | Fix |
| --- | --- | --- |
| KEYS * in production | Server pause (single-threaded) | Use SCAN |
| Big GET / HGETALL on huge values | Latency spike | Paginate via HSCAN, LRANGE; consider splitting the key |
| Hot key | One key saturates a single shard | Shard the key (hash slot), local caching layer in front |
| Cache stampede | DB collapses when popular key expires | Lock the refresh; jittered TTLs; soft expiry |
| Dual-write race | Cache shows old data after DB write | Invalidate AFTER DB commit; consider DEL before AND after |
| TLS termination on the proxy, not Redis | Plain-text traffic in the cluster | Enable Redis TLS, especially across AZ |
| Long-lived connections from short-lived runners | Connection storm at scale | Use a connection pool; close on shutdown |
| No maxmemory | OOM kills the process | Always set maxmemory and a policy |
| Caching huge unbounded objects | Memory explosion | Cap value sizes; reject above N KB |
| Treating cache as source of truth | Data loss on restart | If you need it, use a database |
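
Two of the stampede fixes, locking the refresh and jittering TTLs, can be sketched in a few lines. This is an in-process illustration under stated assumptions: a plain dict stands in for the cache, the lock only coordinates threads inside one process (coordinating several processes would need a lock held in the cache itself), and `SingleFlightCache` and `jittered_ttl` are hypothetical names. Soft expiry is not covered.

```python
import random
import threading
import time


def jittered_ttl(base_seconds, spread=0.1):
    """Shift base_seconds by a random amount of up to +/- spread (a fraction)."""
    return base_seconds * (1 + random.uniform(-spread, spread))


class SingleFlightCache:
    """Lets only one caller per key run the expensive loader at a time."""

    def __init__(self, loader, ttl_seconds):
        self._loader = loader
        self._ttl = ttl_seconds
        self._entries = {}  # key -> (value, expires_at)
        self._locks = {}    # key -> lock guarding that key's refresh
        self._guard = threading.Lock()

    def _lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def _fresh(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[1] > time.monotonic():
            return entry
        return None

    def get(self, key):
        entry = self._fresh(key)
        if entry is not None:
            return entry[0]
        with self._lock_for(key):
            # Re-check: another thread may have refreshed while we waited.
            entry = self._fresh(key)
            if entry is not None:
                return entry[0]
            value = self._loader(key)
            expires_at = time.monotonic() + jittered_ttl(self._ttl)
            self._entries[key] = (value, expires_at)
            return value
```

When many threads miss on the same key at once, one of them calls the loader and the rest wait and reuse its result. The jitter keeps keys that were written together from all expiring in the same instant.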

Security

  • TLS between clients and Redis (tls-port, tls-cert-file, tls-key-file).
  • requirepass (or, better, ACL users in Redis 6+).
  • Bind to internal interfaces only (bind 10.0.0.5), never public IPs.
  • Disable dangerous commands in production: rename-command FLUSHALL "", same for KEYS, DEBUG, CONFIG.
  • Per-environment instances — don't share Redis between staging and production "for cost."

Connection Management

Redis connections are cheap to keep, expensive to churn:

| Practice | Why |
| --- | --- |
| Use a client connection pool | Avoid handshake overhead per request |
| SO_KEEPALIVE | Detect dead connections faster |
| Reasonable timeouts on BLPOP/BRPOP | Don't pin a connection forever |
| One connection per blocking operation | Don't block your shared pool |
| Pipelining for bulk operations | Drastically reduce RTT cost |
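
As an illustration of pipelining, the sketch below reads many keys in batches, paying one round trip per batch instead of one per key. It assumes `client` is a redis-py style object whose `pipeline(transaction=False)` returns a pipeline with `get` and `execute`; the helper names and the batch size are arbitrary choices for the example.

```python
from typing import Dict, Iterable, Iterator, List, Optional


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_many(client, keys: Iterable[str],
               batch_size: int = 500) -> Dict[str, Optional[bytes]]:
    """GET many keys using one pipelined round trip per batch."""
    results: Dict[str, Optional[bytes]] = {}
    for batch in chunked(list(keys), batch_size):
        # transaction=False: plain pipelining without MULTI/EXEC wrapping.
        pipe = client.pipeline(transaction=False)
        for key in batch:
            pipe.get(key)  # queued locally, nothing is sent yet
        # execute() sends the batch and returns replies in queue order.
        for key, value in zip(batch, pipe.execute()):
            results[key] = value
    return results
```

Batching bounds how much reply data is buffered at once; a single pipeline holding millions of commands trades round trips for memory on both the client and the server.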

Multi-Region / Geo

Redis is not geo-distributed by default. Strategies:

  • Active/passive with replication across regions; failover on outage.
  • Per-region independent caches (recommended) — accept the "warm-up" cost over the complexity of cross-region writes.
  • Active/active via Redis Enterprise CRDTs; complex; usually not worth it for a cache.

If you need active/active state, that's a job for a real database, not Redis.

When to Reach for Something Else

  • Local memoization (in-process LRU like lru-cache) is faster than Redis for hot data shared only within one process.
  • Two-tier cache: in-process LRU in front of Redis covers the very hottest keys.
  • Search workloads belong in a search system, not Redis.
  • Pub/sub at scale: Redis Pub/Sub doesn't persist messages; use a message queue.
  • Vector / ANN: Redis has a vector module, but purpose-built options (pgvector, Qdrant, Weaviate) are usually better.
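
The two-tier idea from the list above can be sketched as a small in-process LRU with a short TTL sitting in front of a remote lookup. Everything here is an assumption made for illustration: `TwoTierCache` is a hypothetical class, `remote_get` stands for whatever function reads from Redis, and the sizes and TTL are placeholders.

```python
import time
from collections import OrderedDict


class TwoTierCache:
    """In-process LRU with a short TTL in front of a slower remote cache."""

    def __init__(self, remote_get, max_entries=1024, local_ttl=5.0,
                 clock=time.monotonic):
        self._remote_get = remote_get
        self._max_entries = max_entries
        self._local_ttl = local_ttl
        self._clock = clock
        self._local = OrderedDict()  # key -> (value, expires_at)
        self.local_hits = 0
        self.remote_lookups = 0

    def get(self, key):
        now = self._clock()
        entry = self._local.get(key)
        if entry is not None:
            value, expires_at = entry
            if expires_at > now:
                self._local.move_to_end(key)  # mark as most recently used
                self.local_hits += 1
                return value
            del self._local[key]  # expired locally; fall through to remote
        self.remote_lookups += 1
        value = self._remote_get(key)
        if value is not None:
            self._local[key] = (value, now + self._local_ttl)
            while len(self._local) > self._max_entries:
                self._local.popitem(last=False)  # evict least recently used
        return value
```

The local TTL is deliberately short: it bounds how stale a process can be after the value changes in Redis, since nothing here invalidates the local copy.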

Checklist

Production cache checklist

  • HA topology (replicas, Sentinel, or Cluster) with automated failover
  • maxmemory and maxmemory-policy configured explicitly
  • Memory usage monitored; alert on > 75% of maxmemory
  • Hit ratio, evictions, latency, slowlog scraped to Prometheus
  • Persistence strategy chosen (RDB / AOF / both / off) for the use case
  • TLS in transit; ACL users for clients; bind to internal interfaces
  • Dangerous commands renamed or disabled
  • Naming convention documented; namespaces per service
  • Client connection pool tuned; pipelining for bulk reads
  • Cache stampede protection on the hottest endpoints
  • Negative cache for missing keys
  • No KEYS * calls in production code
  • Disaster recovery: tested restore from snapshot; runbook for cache-cold scenarios
  • Per-environment instances; no shared cache across envs
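
The negative-cache item in the checklist means remembering, for a short time, that a key does not exist, so repeated lookups for missing data stop reaching the database. A minimal in-process sketch, assuming `load` returns None for a missing record; `NegativeCachingLoader` and the TTL values are hypothetical.

```python
import time

_MISSING = object()  # marker stored in place of "no such record"


class NegativeCachingLoader:
    """Caches both found values and confirmed misses, with separate TTLs."""

    def __init__(self, load, positive_ttl=300.0, negative_ttl=30.0,
                 clock=time.monotonic):
        self._load = load
        self._positive_ttl = positive_ttl
        self._negative_ttl = negative_ttl
        self._clock = clock
        self._entries = {}  # key -> (value or _MISSING, expires_at)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            value = entry[0]
            return None if value is _MISSING else value
        value = self._load(key)
        if value is None:
            # Remember the miss briefly so repeats do not reach the backend.
            self._entries[key] = (_MISSING, now + self._negative_ttl)
        else:
            self._entries[key] = (value, now + self._positive_ttl)
        return value
```

The negative TTL is kept shorter than the positive one so that a record created shortly after a miss becomes visible quickly.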