Best Practices

Production DNS - TTL strategy, redundancy, monitoring, DDoS protection, anti-patterns

DNS sits in the critical path of nearly every request. A bad change to a DNS zone can keep you offline for hours while caches expire. Treat it as a production system, not a control panel.

TTL Strategy

The single most-tuned knob. Some patterns that work:

Record	Suggested TTL	Why
Apex `A`/`AAAA` (pointing at CDN)	300s (5 min)	May need fast failover
`CNAME` (pointing at a hostname)	300-3600s	Stable, but you'll change it eventually
`MX`	3600s (1 hour)	Rarely changes
`NS` at apex	86400s (1 day)	Almost never changes; long TTL is fine
`TXT` for SPF/DKIM/DMARC	3600s	Stable; minor changes
`TXT` for verification	300s	Set, verify, leave
`CAA`	86400s	Stable
Failover targets	60s	Tight failover loops; cost: query volume
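
The "cost: query volume" trade-off can be put into rough numbers. The sketch below assumes the worst case, in which every recursive resolver refetches a record the moment its TTL expires; real resolvers only refetch on demand, so this is an upper bound, and the function name is illustrative.

```python
SECONDS_PER_DAY = 86_400


def max_refetches_per_day(ttl_seconds: int) -> int:
    """Upper bound on how often one resolver refetches a record in a day."""
    if ttl_seconds <= 0:
        raise ValueError("TTL must be positive")
    return SECONDS_PER_DAY // ttl_seconds


# Compare the TTLs from the table above.
for ttl in (60, 300, 3600, 86_400):
    print(f"TTL {ttl:>6}s: at most {max_refetches_per_day(ttl):>5} refetches per resolver per day")
```

Going from a 3600s TTL to a 60s TTL multiplies that ceiling by 60, which is why the shortest TTLs are best kept for records that genuinely need fast failover.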

The Pre-Change Drop

Always lower TTL before a planned record change:

T-72h:  current TTL is 3600, set new TTL to 300
T-24h:  most resolvers globally now honor 5-min TTL
T:      make the actual change
T+1h:   verify propagation
T+24h:  raise TTL back to 3600

Skip this and your "5-minute change" creates an hour of partial outage for anyone whose cached value still hits the old endpoint.
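
The timing behind the drop can be sketched as simple date arithmetic. This assumes resolvers honor TTLs exactly, in which case the last cache entry carrying the old TTL expires one old-TTL after the lowering. The much larger margin in the timeline above is presumably there for resolvers that stretch or clamp TTLs. Function names are illustrative.

```python
from datetime import datetime, timedelta


def earliest_safe_change(lowered_at: datetime, old_ttl_seconds: int) -> datetime:
    """A cache filled just before the TTL was lowered holds the old TTL this long."""
    return lowered_at + timedelta(seconds=old_ttl_seconds)


def worst_case_stale_window(ttl_at_change_seconds: int) -> timedelta:
    """How long a resolver may keep serving the old value after the change."""
    return timedelta(seconds=ttl_at_change_seconds)


lowered = datetime(2024, 1, 1, 12, 0)
print(earliest_safe_change(lowered, 3600))  # one hour after the TTL was lowered
print(worst_case_stale_window(300))         # with the drop: five minutes
print(worst_case_stale_window(3600))        # without the drop: one hour
```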

Redundancy

Single point of failure: your DNS provider. AWS Route 53 has had outages. Cloudflare has had outages. The mitigations:

Multiple Authoritative Nameservers in Diverse Networks

Cloudflare uses anycast + many POPs. That's good. But if their anycast announcement gets withdrawn (it has happened), every POP becomes unreachable at the same time.

The fix: secondary DNS at a different provider. Tools like OctoDNS publish the same zone to two providers, and the NS records at your registrar list both. A resolver tries one nameserver and, if it doesn't respond, tries another.

example.com   NS   ns1.cloudflare.com.
example.com   NS   ns2.cloudflare.com.
example.com   NS   ns1.route53.amazonaws.com.
example.com   NS   ns2.route53.amazonaws.com.

Cost: 2× DNS costs, plus ops effort to keep zones in sync. Reserve this for cases where a DNS outage costs you serious money.
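
A quick way to check that an NS set really spans more than one provider is to group the hostnames by their parent domain. This is a minimal sketch: treating the last two labels as "the provider" is an assumption that breaks for multi-label suffixes such as `co.uk`, and the hostnames are the illustrative ones from the example above.

```python
from collections import defaultdict


def providers_for(ns_hosts, suffix_labels=2):
    """Group nameserver hostnames by their last `suffix_labels` labels."""
    groups = defaultdict(list)
    for host in ns_hosts:
        labels = host.rstrip(".").lower().split(".")
        groups[".".join(labels[-suffix_labels:])].append(host)
    return dict(groups)


def has_provider_diversity(ns_hosts):
    """True when the NS set spans at least two distinct parent domains."""
    return len(providers_for(ns_hosts)) >= 2


ns_set = [
    "ns1.cloudflare.com.",
    "ns2.cloudflare.com.",
    "ns1.route53.amazonaws.com.",
    "ns2.route53.amazonaws.com.",
]
print(providers_for(ns_set))
print(has_provider_diversity(ns_set))
```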

Long TTLs for Critical Records

Counterintuitively, long TTLs help redundancy: if your DNS provider goes down, recursive resolvers around the world still hold cached answers. The trade-off is slow changes when you need them.

A common compromise: 86400s TTL on NS records, 3600s on app records, 300s on records that may need to change.

Monitoring

DNS-specific monitoring you probably don't have but should:

Signal	What to monitor
Authoritative server health	Query from outside; alert on NXDOMAIN, SERVFAIL, timeout
DNSSEC chain validity	Cron job that validates the chain with `delv` (the replacement for `dig +sigchase`, which current BIND releases no longer ship); alert before signature expiry
NS records at the registrar	Compare against expected list; alert on drift
TLD glue / DS records	Annual check that your TLD entry hasn't drifted
Resolution time	`dig` from multiple regions; alert on p99 spikes
Negative caching (NXDOMAIN)	Sudden NXDOMAIN spike for a known-good domain = misconfiguration

dnscheck.io, intodns.com are useful one-shots. For continuous monitoring: a small synthetic script + your existing alerting.
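
The evaluation half of such a synthetic script might look like the sketch below. It assumes the probes themselves (for example `dig` run from several regions) have already been collected into records. The `Probe` type, the rcode strings, and the 200 ms budget are all hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    region: str
    rcode: str        # "NOERROR", "NXDOMAIN", "SERVFAIL", or "TIMEOUT"
    latency_ms: float


BAD_RCODES = {"NXDOMAIN", "SERVFAIL", "TIMEOUT"}


def p99(values):
    """Nearest-rank 99th percentile, using integer arithmetic for the rank."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n)
    return ordered[rank - 1]


def alerts(probes, p99_budget_ms=200.0):
    """Return one alert string per failed probe, plus one for slow resolution."""
    found = [f"{p.region}: {p.rcode}" for p in probes if p.rcode in BAD_RCODES]
    ok_latencies = [p.latency_ms for p in probes if p.rcode == "NOERROR"]
    if ok_latencies and p99(ok_latencies) > p99_budget_ms:
        found.append(
            f"p99 latency {p99(ok_latencies):.0f} ms exceeds {p99_budget_ms:.0f} ms"
        )
    return found
```

Feeding the returned strings into whatever alerting you already run keeps the DNS checks from becoming a separate system to maintain.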

DDoS Protection

DNS is a popular DDoS target. The classic attack is amplification: a small query with a spoofed source address triggers a much larger response, which lands on the victim and multiplies the attacker's effective bandwidth.

Defenses:

Anycast-hosted DNS (Cloudflare, Route 53, NS1) absorbs L3/L4 floods.
Rate limiting per source at the authoritative server level.
Don't run a public open resolver — separate any recursive servers from anything public.
DNSSEC for answer authenticity, plus DoT/DoH for last-mile integrity (separate from origin DDoS, but worth doing).

Self-hosted DNS on a single VM is a DDoS waiting to happen. Use a hosted provider for public-facing authoritative DNS.
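
To make "rate limiting per source" concrete, here is a minimal token-bucket sketch. It only illustrates the idea; authoritative server software and hosted providers ship their own rate-limiting features, and the class name and parameters here are hypothetical.

```python
class SourceRateLimiter:
    """Token bucket per source address: `rate_per_sec` sustained, `burst` peak."""

    def __init__(self, rate_per_sec: float, burst: float):
        self.rate = rate_per_sec
        self.burst = burst
        self._state = {}  # source -> (tokens, last_seen_timestamp)

    def allow(self, source: str, now: float) -> bool:
        tokens, last = self._state.get(source, (self.burst, now))
        # Refill for the time elapsed since the last query, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self._state[source] = (tokens - 1, now)
            return True
        self._state[source] = (tokens, now)
        return False


limiter = SourceRateLimiter(rate_per_sec=1.0, burst=2)
# Three queries from the same source in the same instant: the third is dropped.
print([limiter.allow("192.0.2.55", now=0.0) for _ in range(3)])
```

Passing `now` explicitly keeps the sketch deterministic; a real deployment would read a monotonic clock and evict idle sources from the state table.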

Multi-Region Patterns

If you serve traffic globally:

Latency-Based Routing

DNS resolves the same name to a different IP based on which region is fastest from the resolver:

api.example.com (latency-based) ──► us-east → 203.0.113.10
                                ──► eu-west → 198.51.100.10
                                ──► ap-southeast → 192.0.2.10

Caveat: resolver location ≠ user location. A user in Singapore on Google Public DNS may resolve via a US Google resolver and get pointed at us-east. Workarounds: EDNS Client Subnet (where supported), or run multiple regional CDN endpoints fronting the same backend.
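
The selection step, and the caveat, can be sketched in a few lines. The latency figures are made up, and the key assumption is that the provider measures latency from the resolver's vantage point rather than the user's.

```python
ENDPOINTS = {
    "us-east": "203.0.113.10",
    "eu-west": "198.51.100.10",
    "ap-southeast": "192.0.2.10",
}


def pick_endpoint(latency_ms_by_region):
    """Return (region, ip) for the region with the lowest measured latency."""
    region = min(latency_ms_by_region, key=latency_ms_by_region.get)
    return region, ENDPOINTS[region]


# Latencies as seen from a resolver in the US (hypothetical numbers).
seen_from_us_resolver = {"us-east": 12, "eu-west": 90, "ap-southeast": 180}
# Latencies the Singapore user would actually experience (hypothetical numbers).
seen_from_user = {"us-east": 210, "eu-west": 170, "ap-southeast": 8}

print(pick_endpoint(seen_from_us_resolver))  # what DNS answers
print(pick_endpoint(seen_from_user))         # what would have been best for the user
```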

Geo Routing

queries from EU      → eu.example.com
queries from US      → us.example.com
all others           → default.example.com

Useful for data residency (EU traffic stays in EU). Same resolver-location caveat applies.
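
The decision a geo-routing policy makes can be sketched as a lookup, assuming the provider has already mapped the query source to a country code. `EU_COUNTRIES` below is a deliberately partial, hypothetical list; a real policy would use the provider's own region definitions.

```python
# Partial list for illustration only; not the full set of EU member states.
EU_COUNTRIES = frozenset({"DE", "FR", "IE", "NL", "ES", "IT"})


def geo_target(country_code: str) -> str:
    """Map the country of the query source to the hostname serving it."""
    code = country_code.upper()
    if code in EU_COUNTRIES:
        return "eu.example.com"
    if code == "US":
        return "us.example.com"
    return "default.example.com"


for code in ("de", "US", "SG"):
    print(code, "->", geo_target(code))
```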

Health-Checked Failover

Active/passive with DNS-level health checks:

api.example.com (primary)   ──► 203.0.113.10
                if down:    ──► 198.51.100.10

Failover speed is bound by TTL: resolvers keep handing out the dead IP until their cached answer expires. Fast failover needs short TTLs (and the matching query-volume costs).
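
As a rough worked example, worst-case failover time is the time the health check needs to notice the failure plus one full TTL for caches to drain. The check interval and failure threshold below are hypothetical settings, not defaults of any particular provider.

```python
def worst_case_failover_seconds(check_interval: int, failures_to_trip: int, ttl: int) -> int:
    """Detection time (consecutive failed checks) plus one full TTL of stale cache."""
    return check_interval * failures_to_trip + ttl


# 30s checks, 3 consecutive failures required before the record is swapped.
print(worst_case_failover_seconds(30, 3, ttl=60))    # short TTL
print(worst_case_failover_seconds(30, 3, ttl=3600))  # long TTL
```

With these assumed settings a 60s TTL gives about two and a half minutes in the worst case, while a 3600s TTL pushes it past an hour.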

Domain Hijacking and Lock

The nightmare scenario: someone takes over your domain registration. Defenses:

Registrar lock (also called clientTransferProhibited) on every domain.
2FA on registrar accounts — no SMS-only; use hardware key or TOTP.
Registry lock (premium feature) — requires out-of-band verification for any change.
Auto-renew on so you don't lose the domain to lapse.
Renew for 5-10 years on important domains (cheap insurance).
Monitor WHOIS for changes — services like DNSSpy alert on changes.

Subdomain Takeover

Subtle, common, dangerous: a subdomain CNAMEs to a service you've stopped using (an old Heroku app, a deleted S3 bucket, a removed Vercel project). The provider's namespace is open; an attacker registers your-old-heroku-app.herokuapp.com, and now your subdomain points at their content.

Defenses:

Delete the CNAME when you delete the service. Always.
Inventory your DNS — periodic audit of CNAMEs and what they point at.
Tools like subjack scan for records vulnerable to takeover; dnstwist covers the related problem of lookalike domains.
Reserve subdomains at services you use (some providers let you "claim" the name even if not actively using it).
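
The inventory audit can be partly automated. The sketch below flags CNAMEs that point into a third-party namespace and whose target no longer resolves. The suffix list is illustrative (taken from the services named above), and a failed lookup is only one signal: some providers answer for every name in their namespace, so a dedicated scanner will catch cases this misses.

```python
import socket

# Illustrative suffixes only; extend with the providers you actually use.
THIRD_PARTY_SUFFIXES = (".herokuapp.com", ".s3.amazonaws.com", ".vercel.app")


def system_resolves(host: str) -> bool:
    """Check resolution through the system resolver."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False


def dangling_cnames(cnames, resolves=system_resolves):
    """cnames maps record name -> CNAME target; returns (name, target) pairs to review."""
    risky = []
    for name, target in sorted(cnames.items()):
        host = target.rstrip(".").lower()
        if host.endswith(THIRD_PARTY_SUFFIXES) and not resolves(host):
            risky.append((name, host))
    return risky
```

The `resolves` parameter is injectable so the check can be tested offline or pointed at a specific resolver.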

Anti-Patterns

Anti-pattern	Symptom	Fix
24h TTL on production records	A bad change is a day of partial outage	300-3600s on records that change
Records changed by clicks, no audit	"Who deleted that A record?"	Zone-as-code, PR review, audit log
`CNAME` at the zone apex	Conflicts with the apex `NS`/`SOA` records; resolution fails unpredictably	Use A/AAAA or provider flattening
App-level routing in DNS (per-client answers, the DNS analogue of `Vary: User-Agent`)	Cache hit rates collapse, weird behavior	Route in the app or LB layer
Self-hosted public DNS on a single host	DDoS takes you down	Use a hosted anycast provider
Hardcoded IPs in code instead of DNS	Migrations require redeploys	Always use DNS names
No CAA records	Any CA can issue for your domain	Add CAA
Public-facing recursive resolver	Amplification attack source	Don't run one; or restrict to known clients
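
A few of these anti-patterns are mechanical enough to lint for. The sketch below checks three of them against a list of records. The record tuple layout and the 3600s ceiling are assumptions chosen for illustration, not a standard format.

```python
def lint_zone(zone, records, max_ttl=3600):
    """records: iterable of (name, type, ttl, value). Returns a list of findings."""
    apex = zone.rstrip(".").lower()
    findings = []
    has_caa = False
    for name, rtype, ttl, _value in records:
        owner = name.rstrip(".").lower()
        if rtype == "CNAME" and owner == apex:
            findings.append(f"{owner}: CNAME at zone apex")
        if rtype in {"A", "AAAA", "CNAME"} and ttl > max_ttl:
            findings.append(f"{owner}: {rtype} TTL {ttl}s exceeds {max_ttl}s")
        if rtype == "CAA" and owner == apex:
            has_caa = True
    if not has_caa:
        findings.append(f"{apex}: no CAA record")
    return findings


records = [
    ("example.com.", "CNAME", 300, "cdn.example.net."),
    ("www.example.com.", "A", 86400, "203.0.113.10"),
]
for finding in lint_zone("example.com", records):
    print(finding)
```

Run as part of PR review on a zone-as-code repository, a check like this catches the mistake before it reaches resolvers.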

Zone Hygiene

Periodic audit, every 6-12 months:

Inventory all records. Diff against what's expected.
Check CNAME targets are still live and yours.
Verify SPF/DKIM/DMARC still match current mail providers.
Verify CAA lists only the CAs you actually use.
Drop unused records — old subdomains, dead services, abandoned A records.

This is also a great use of zone-as-code: drift detection becomes "did anyone change anything since last apply?"
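
Drift detection itself reduces to a set difference between what the repository says and what the provider serves. A minimal sketch, assuming both sides have been normalized to `(name, type, value)` tuples:

```python
def zone_drift(expected, actual):
    """Compare two collections of (name, type, value) records."""
    exp, act = set(expected), set(actual)
    return {
        "missing": sorted(exp - act),     # in the repo, not live
        "unexpected": sorted(act - exp),  # live, not in the repo
    }


expected = [
    ("www.example.com", "A", "203.0.113.10"),
    ("example.com", "MX", "10 mail.example.com"),
]
actual = [
    ("www.example.com", "A", "203.0.113.10"),
    ("old.example.com", "CNAME", "your-old-heroku-app.herokuapp.com"),
]
print(zone_drift(expected, actual))
```

Anything under "unexpected" was either added by hand or left behind by a retired service, and both are worth a look during the audit.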

Checklist
