Best Practices

The cache is in the critical path. A cache outage often looks like a database outage, because every request now falls through to a database that was never sized to serve that load. Treat the cache like one of your databases: redundant, observed, and capacity-planned.

HA Topology

Redis has three self-managed production modes, plus managed offerings:

Mode	Notes
Single instance + replica	Simple; manual failover; minimum production setup
Sentinel	Automatic failover for primary-replica setups; routes clients to current primary
Cluster	Sharded with auto-failover; required for very large datasets or write-throughput beyond one node
Managed (ElastiCache / Memorystore)	Cloud provider runs all of the above for you

Sizing:

3 primaries minimum for Cluster mode, so a majority can still vote a replica into place if one primary fails.
One replica per primary at minimum; two for critical workloads.
Spread across availability zones.
If not running Cluster, run 3 or 5 Sentinels (an odd number) on separate hosts.
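The odd-number advice follows from majority voting. In this simplified model (an illustration, not Redis source code), a group of N voters, whether Sentinels authorizing a failover or Cluster primaries promoting a replica, needs a strict majority to act. Sentinel's own `quorum` setting only controls when a primary is flagged as down; the failover itself still needs a majority of Sentinels. The sketch counts how many members can fail before a majority becomes unreachable.

```python
def majority(voters: int) -> int:
    """Smallest number of votes that is a strict majority of `voters`."""
    return voters // 2 + 1


def failures_tolerated(voters: int) -> int:
    """Members that can be lost while a majority is still reachable."""
    return voters - majority(voters)


# An even count buys no extra fault tolerance over the odd count below it.
for n in (2, 3, 4, 5):
    print(f"{n} voters: majority {majority(n)}, survives {failures_tolerated(n)} failure(s)")
```

Four voters survive one failure, the same as three, which is why 3 or 5 is the usual recommendation.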

Sizing and Eviction

Redis memory is finite. Decide what happens when it fills up:

```
# redis.conf
maxmemory 8gb
maxmemory-policy allkeys-lru
```

Policy	Behavior
`noeviction`	Writes fail when memory is full. This is Redis's default; choose it when Redis is your data store, not a cache.
`allkeys-lru`	Evict the least-recently-used key across all keys. The usual choice for cache use.
`allkeys-lfu`	Evict least-frequently-used. Better when access patterns are stable.
`volatile-lru`	LRU only among keys with a TTL; keys without one are never evicted. If no key has a TTL, writes fail as with `noeviction`.
`volatile-ttl`	Evict the keys closest to expiring
`allkeys-random`	Random; cheap; sometimes good for very flat access patterns

The wrong choice can be catastrophic. With noeviction, a cache that fills up starts rejecting writes, so your application sees errors instead of cache misses. Pick allkeys-lru or allkeys-lfu for typical cache use.
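A toy model makes the difference visible. The sketch below uses a hypothetical `BoundedCache` class that counts capacity in keys rather than bytes and uses exact LRU, whereas Redis approximates LRU by sampling `maxmemory-samples` keys. It shows `allkeys-lru` quietly dropping the coldest key while `noeviction` turns the same write into an error.

```python
from __future__ import annotations

from collections import OrderedDict


class OutOfMemoryError(Exception):
    """Stand-in for the OOM error Redis returns to writes under noeviction."""


class BoundedCache:
    """Toy key-count-bounded cache; not how Redis measures or evicts memory."""

    def __init__(self, max_keys: int, policy: str = "allkeys-lru") -> None:
        if policy not in ("allkeys-lru", "noeviction"):
            raise ValueError(f"unsupported policy: {policy}")
        self.max_keys = max_keys
        self.policy = policy
        self._data: OrderedDict[str, str] = OrderedDict()

    def get(self, key: str) -> str | None:
        if key not in self._data:
            return None  # a miss: the caller falls back to the database
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def set(self, key: str, value: str) -> None:
        if key in self._data:
            self._data.move_to_end(key)
        elif len(self._data) >= self.max_keys:
            if self.policy == "noeviction":
                raise OutOfMemoryError("cache full, write rejected")
            self._data.popitem(last=False)  # drop the least recently used key
        self._data[key] = value


lru = BoundedCache(2)
lru.set("a", "1")
lru.set("b", "2")
lru.get("a")         # "a" is now more recent than "b"
lru.set("c", "3")    # evicts "b"
print(lru.get("b"))  # None: a cache miss, not an error
```

Under `BoundedCache(2, "noeviction")` the third `set` raises instead, which is exactly the failure mode described above.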

Memory Headroom

Keep used memory under ~75% of maxmemory. Above that there is little buffer: a traffic spike or a burst of large writes pushes Redis into eviction, and eviction work adds latency to the writes that trigger it. Capacity plan with growth in mind.
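The 75% target turns into simple capacity arithmetic. This sketch assumes steady compound monthly growth (a simplification; real growth is rarely that smooth), and the helper names are hypothetical. It computes current utilization and the `maxmemory` needed so that projected usage still lands under the target.

```python
import math

GIB = 1024**3
HEADROOM_TARGET = 0.75  # keep used memory under ~75% of maxmemory


def utilization(used_bytes: int, maxmemory_bytes: int) -> float:
    """Fraction of maxmemory in use (INFO memory: used_memory / maxmemory)."""
    return used_bytes / maxmemory_bytes


def required_maxmemory(used_bytes: int, monthly_growth: float, months: int) -> int:
    """maxmemory that keeps projected usage at or under HEADROOM_TARGET."""
    projected = used_bytes * (1 + monthly_growth) ** months
    return math.ceil(projected / HEADROOM_TARGET)


print(f"{utilization(5 * GIB, 8 * GIB):.1%}")  # 62.5%
needed = required_maxmemory(5 * GIB, monthly_growth=0.03, months=12)
print(f"{needed / GIB:.1f} GiB")  # 9.5 GiB: an 8gb setting needs raising within the year
```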

Persistence (Or Not)

Two options, which can be enabled together:

Persistence	What it does	Pros	Cons
RDB (`save N M`)	Periodic snapshots to disk	Compact; fast restart	Loses data since last snapshot on crash
AOF (`appendonly yes`)	Append every write to a log	Better durability	Larger file; rewrite needed periodically
Both	RDB + AOF	Belt and braces	Disk IO cost

For a pure cache, persistence is optional: losing the cache means a slow warm-up but no data loss. For Redis as a primary store, use AOF with appendfsync everysec, which bounds the loss on a crash to roughly one second of writes.

Naming Conventions

A namespace scheme saves you when debugging:

<domain>:<entity>:<id>[:<aspect>]

user:profile:42
user:session:abc-123
order:cart:42
auth:token:xyz
metrics:hourly:2026-05-21-14

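Colon-separated, by convention.
Predictable prefixes make SCAN-based inspection useful (SCAN 0 MATCH user:session:*).
Versions in keys avoid invalidation pain: bump user:profile:42:v3 to v4 and the old entries age out through TTL or eviction.
Don't put TTL or expiry in the key name; expiry is metadata Redis already tracks per key (EXPIRE / TTL).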
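A small helper keeps every call site on the same scheme. The function names `make_key` and `scan_pattern` are hypothetical, and rejecting colons inside a segment is one possible rule, chosen so that a segment can never masquerade as an extra namespace level.

```python
from __future__ import annotations

SEPARATOR = ":"


def make_key(domain: str, entity: str, id_: str | int,
             aspect: str | None = None, version: int | None = None) -> str:
    """Build <domain>:<entity>:<id>[:<aspect>][:v<version>]."""
    parts = [domain, entity, str(id_)]
    if aspect is not None:
        parts.append(aspect)
    if version is not None:
        parts.append(f"v{version}")  # bump to invalidate every old entry at once
    for part in parts:
        if not part or SEPARATOR in part:
            raise ValueError(f"invalid key segment: {part!r}")
    return SEPARATOR.join(parts)


def scan_pattern(domain: str, entity: str) -> str:
    """Glob for SCAN ... MATCH over one entity type."""
    return SEPARATOR.join([domain, entity, "*"])


print(make_key("user", "profile", 42, version=3))  # user:profile:42:v3
print(scan_pattern("user", "session"))             # user:session:*
```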

Observability

A cache without visibility is a hidden bottleneck. Watch:

Metric	Threshold
Hit ratio (`keyspace_hits / (keyspace_hits + keyspace_misses)`)	< 80% = investigate; < 50% = the cache isn't pulling its weight
Evictions/sec	> 0 sustained = memory pressure; bump `maxmemory` or shorten TTLs
Used memory	< 75% of `maxmemory`
Connected clients	Spikes correlate with client-pool issues
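Slowlog (`SLOWLOG GET 100`)	`KEYS *`, big `HGETALL`, anything > 1ms (lower `slowlog-log-slower-than` from its 10 ms default to capture these)
Commands/sec (`instantaneous_ops_per_sec`)	Sudden drops = client connectivity problems; spikes = traffic surges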
Replication lag	Replicas should be < 1s behind
Latency p99	Should be sub-millisecond on a healthy single node

The Prometheus Redis exporter exposes most of these as metrics; hook it into your existing dashboards.
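To turn the table into alerts, a check can run over two samples of INFO taken some seconds apart. This sketch assumes INFO has already been parsed into a dict of ints (redis-py's `info()` returns one). `keyspace_hits`, `keyspace_misses`, `evicted_keys`, `used_memory`, and `maxmemory` are Redis INFO fields; the function names are hypothetical and the thresholds are the ones in the table. Because the hit and eviction counters are cumulative since startup, the sketch works on the difference between samples, and a real alert should fire only if a condition persists across several windows.

```python
from __future__ import annotations


def hit_ratio(hits: int, misses: int) -> float | None:
    """hits / (hits + misses); None when there were no lookups."""
    lookups = hits + misses
    return hits / lookups if lookups else None


def cache_warnings(before: dict[str, int], after: dict[str, int], seconds: float) -> list[str]:
    warnings: list[str] = []
    ratio = hit_ratio(after["keyspace_hits"] - before["keyspace_hits"],
                      after["keyspace_misses"] - before["keyspace_misses"])
    if ratio is not None and ratio < 0.5:
        warnings.append(f"hit ratio {ratio:.0%}: cache isn't pulling its weight")
    elif ratio is not None and ratio < 0.8:
        warnings.append(f"hit ratio {ratio:.0%}: investigate")
    evictions_per_sec = (after["evicted_keys"] - before["evicted_keys"]) / seconds
    if evictions_per_sec > 0:
        warnings.append("evictions > 0: memory pressure")
    maxmemory = after.get("maxmemory", 0)
    if maxmemory == 0:
        warnings.append("maxmemory not set")  # 0 means no limit
    elif after["used_memory"] / maxmemory > 0.75:
        warnings.append("used memory above 75% of maxmemory")
    return warnings


before = {"keyspace_hits": 1000, "keyspace_misses": 200, "evicted_keys": 0,
          "used_memory": 6 * 2**30, "maxmemory": 8 * 2**30}
after = {"keyspace_hits": 1700, "keyspace_misses": 500, "evicted_keys": 40,
         "used_memory": 7 * 2**30, "maxmemory": 8 * 2**30}
print(cache_warnings(before, after, 60.0))
```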

Pitfalls and How to Avoid Them

Pitfall	Symptom	Fix
`KEYS *` in production	Server blocks while it walks every key (commands execute on one thread)	Use `SCAN`
Big `GET` / `HGETALL` on huge values	Latency spike	Paginate via `HSCAN`, `LRANGE`; consider splitting the key
Hot key	One key saturates a single shard	Split it into several copies or sub-keys that hash to different slots; add a local caching layer in front
Cache stampede	DB collapses when popular key expires	Lock the refresh; jittered TTLs; soft expiry
Dual-write race	Cache shows old data after DB write	Invalidate AFTER DB commit; consider `DEL` before AND after
TLS termination on the proxy, not Redis	Plain-text traffic in the cluster	Enable Redis TLS, especially across AZs
Long-lived connections from short-lived runners	Connection storm at scale	Use a connection pool; close on shutdown
No `maxmemory`	OOM kills the process	Always set `maxmemory` and a policy
Caching huge unbounded objects	Memory explosion	Cap value sizes; reject above N KB
Treating cache as source of truth	Data loss on restart	If you need it, use a database
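The stampede fixes from the table combine naturally. This is an in-process sketch that assumes a single application process with threads; the class and function names are hypothetical. Across many processes the per-key lock would itself live in Redis, typically a `SET` with `NX` and `PX` so that only one caller wins and the lock expires if that caller dies. Jitter spreads expirations so keys written together do not all expire in the same second.

```python
from __future__ import annotations

import random
import threading
import time
from typing import Callable


def jittered_ttl(base_seconds: float, jitter: float = 0.1) -> float:
    """Randomize a TTL by +/- `jitter` (a fraction) so expirations spread out."""
    return base_seconds * (1 + random.uniform(-jitter, jitter))


class StampedeGuardedCache:
    """Cache-aside where at most one caller per key refreshes from the DB."""

    def __init__(self, base_ttl: float) -> None:
        self.base_ttl = base_ttl
        self._entries: dict[str, tuple[object, float]] = {}  # key -> (value, expires_at)
        self._locks: dict[str, threading.Lock] = {}
        self._locks_guard = threading.Lock()

    def _fresh(self, key: str) -> tuple[object, float] | None:
        entry = self._entries.get(key)
        if entry is not None and entry[1] > time.monotonic():
            return entry
        return None

    def _lock_for(self, key: str) -> threading.Lock:
        with self._locks_guard:
            return self._locks.setdefault(key, threading.Lock())

    def get_or_load(self, key: str, loader: Callable[[], object]) -> object:
        entry = self._fresh(key)
        if entry is not None:
            return entry[0]  # fast path: no lock on a hit
        with self._lock_for(key):
            entry = self._fresh(key)  # another caller may have refreshed it meanwhile
            if entry is not None:
                return entry[0]
            value = loader()  # only the lock holder reaches the database
            self._entries[key] = (value, time.monotonic() + jittered_ttl(self.base_ttl))
            return value
```

Waiting callers block until the refresh finishes. Soft expiry (serving the stale value while one caller refreshes in the background) avoids even that wait, at the cost of briefly stale reads.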

Security

TLS between clients and Redis (tls-port, tls-cert-file, tls-key-file).
requirepass (or, better, ACL users in Redis 6+).
Bind to internal interfaces only (bind 10.0.0.5), never public IPs.
Disable dangerous commands in production: rename-command FLUSHALL "", same for KEYS, DEBUG, CONFIG.
Per-environment instances — don't share Redis between staging and production "for cost."

Connection Management

Redis connections are cheap to keep, expensive to churn:

Practice	Why
Use a client connection pool	Avoid handshake overhead per request
`SO_KEEPALIVE`	Detect dead connections faster
Reasonable timeouts on `BLPOP`/`BRPOP`	Don't pin a connection forever
One connection per blocking operation	Don't block your shared pool
Pipelining for bulk operations	One round trip per batch instead of one per command
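Back-of-the-envelope arithmetic shows why pipelining matters. This sketch counts only network round trips (server execution time and bandwidth are ignored, an assumption), the function name is hypothetical, and the 0.5 ms RTT is an assumed figure. In redis-py, for example, `pipeline()` queues commands client-side and `execute()` sends them together.

```python
import math


def network_wait_ms(commands: int, rtt_ms: float, batch_size: int = 1) -> float:
    """Time spent waiting on round trips; batch_size=1 means no pipelining."""
    if commands < 0 or batch_size < 1:
        raise ValueError("commands must be >= 0 and batch_size >= 1")
    round_trips = math.ceil(commands / batch_size)
    return round_trips * rtt_ms


# 10,000 writes at an assumed 0.5 ms round-trip time:
print(network_wait_ms(10_000, 0.5))                  # 5000.0 (ms), one command at a time
print(network_wait_ms(10_000, 0.5, batch_size=100))  # 50.0 (ms), batches of 100
```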

Multi-Region / Geo

Redis is not geo-distributed by default. Strategies:

Active/passive with replication across regions; failover on outage.
Per-region independent caches (recommended) — accept the "warm-up" cost over the complexity of cross-region writes.
Active/active via Redis Enterprise CRDTs; complex; usually not worth it for a cache.

If you need active/active state, that's a job for a real database, not Redis.

When to Reach for Something Else

Local memoization (an in-process LRU such as lru-cache) is faster than Redis for hot data that only one process needs, since it skips the network round trip.
Two-tier cache: in-process LRU in front of Redis covers the very hottest keys.
Search workloads belong in Search systems, not Redis.
Pub/sub at scale — Redis Pub/Sub doesn't persist; use Message Queues.
Vector / ANN — Redis has a vector module, but dedicated vector DBs (pgvector, Qdrant, Weaviate) are usually better.
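The two-tier idea can be sketched as a small, short-lived LRU in process memory in front of the shared cache. Here `remote` is any object with a `get(key)` method (a redis-py client qualifies, though it returns bytes unless created with `decode_responses=True`); the class name, sizes, and the 1-second local TTL are assumptions. The local TTL is what bounds staleness: after a key is invalidated in Redis, other processes may serve the old value for up to that long.

```python
from __future__ import annotations

import time
from collections import OrderedDict
from typing import Any, Callable


class TwoTierCache:
    """In-process LRU with a short TTL in front of a shared remote cache."""

    def __init__(self, remote: Any, local_max_keys: int = 1000,
                 local_ttl: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.remote = remote
        self.local_max_keys = local_max_keys
        self.local_ttl = local_ttl
        self._clock = clock
        self._local: OrderedDict[str, tuple[Any, float]] = OrderedDict()

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._local.get(key)
        if entry is not None:
            if entry[1] > now:
                self._local.move_to_end(key)  # refresh LRU position
                return entry[0]  # served without a network hop
            del self._local[key]  # local copy expired
        value = self.remote.get(key)  # fall through to the shared cache
        if value is not None:
            self._local[key] = (value, now + self.local_ttl)
            if len(self._local) > self.local_max_keys:
                self._local.popitem(last=False)  # evict least recently used
        return value
```

Keeping `local_max_keys` small means only the very hottest keys live in process memory, which matches the intent of the two-tier layout.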

Checklist
