Best Practices
Production DNS - TTL strategy, redundancy, monitoring, DDoS protection, anti-patterns
DNS sits in the critical path of nearly every request. A bad zone change can keep you offline for hours while stale answers age out of caches. Treat it as a production system, not a control panel.
TTL Strategy
The single most-tuned knob. Some patterns that work:
| Record | Suggested TTL | Why |
|---|---|---|
| Apex A/AAAA (pointing at CDN) | 300s (5 min) | May need fast failover |
| CNAME (pointing at a hostname) | 300-3600s | Stable, but you'll change it eventually |
| MX | 3600s (1 hour) | Rarely changes |
| NS at apex | 86400s (1 day) | Almost never changes; long TTL is fine |
| TXT for SPF/DKIM/DMARC | 3600s | Stable; minor changes |
| TXT for verification | 300s | Set, verify, leave |
| CAA | 86400s | Stable |
| Failover targets | 60s | Tight failover loops; cost: query volume |
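The table can be turned into a small policy check that runs against a zone export. This is a minimal sketch, assuming records are available as plain (name, type, TTL) tuples; `Record`, `MAX_TTL`, and `over_ttl` are hypothetical names for illustration, not part of any DNS tool, and the ceilings simply mirror the suggested values above.

```python
from typing import NamedTuple


class Record(NamedTuple):
    name: str
    rtype: str
    ttl: int  # seconds


# Ceilings copied from the suggested-TTL table; adjust to your own policy.
MAX_TTL = {
    "A": 300,
    "AAAA": 300,
    "CNAME": 3600,
    "MX": 3600,
    "TXT": 3600,
    "NS": 86400,
    "CAA": 86400,
}
DEFAULT_MAX_TTL = 3600  # assumption: fallback for record types not listed


def ceiling(rtype: str) -> int:
    """Return the policy ceiling for a record type."""
    return MAX_TTL.get(rtype, DEFAULT_MAX_TTL)


def over_ttl(records: list[Record]) -> list[Record]:
    """Return records whose TTL exceeds the ceiling for their type."""
    return [r for r in records if r.ttl > ceiling(r.rtype)]


# Made-up sample zone for illustration.
zone = [
    Record("example.com", "A", 86400),
    Record("example.com", "MX", 3600),
    Record("www.example.com", "CNAME", 300),
]
for r in over_ttl(zone):
    print(f"{r.name} {r.rtype}: TTL {r.ttl}s exceeds {ceiling(r.rtype)}s")
```

A check like this fits naturally in CI for a zone-as-code repository, where it can fail the build before an over-long TTL ships.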
The Pre-Change Drop
Always lower TTL before a planned record change:

    T-72h: current TTL is 3600, set new TTL to 300
    T-24h: most resolvers globally now honor 5-min TTL
    T: make the actual change
    T+1h: verify propagation
    T+24h: raise TTL back to 3600

Skip this and your "5-minute change" creates an hour of partial outage for anyone whose cached value still hits the old endpoint.
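The arithmetic behind the timeline can be sketched as follows. A resolver that cached the record just before the TTL was lowered keeps it for up to the old TTL, so the drop has to land at least that long before the change. The function name `plan_ttl_drop` and the `safety_factor` padding are assumptions for illustration; the 72-hour lead in the runbook above is far more conservative than the lower bound computed here.

```python
from datetime import datetime, timedelta, timezone


def plan_ttl_drop(change_at: datetime, old_ttl: int, low_ttl: int,
                  safety_factor: int = 2) -> dict:
    """Compute key times for a planned record change.

    old_ttl and low_ttl are in seconds. safety_factor pads the lead time
    for resolvers that hold records longer than the TTL allows.
    """
    lead = timedelta(seconds=old_ttl * safety_factor)
    return {
        # Latest moment to lower the TTL so caches hold the short TTL in time.
        "lower_ttl_by": change_at - lead,
        "change_at": change_at,
        # After the change, answers cached under low_ttl age out by this time.
        "old_answer_gone_by": change_at + timedelta(seconds=low_ttl),
    }


plan = plan_ttl_drop(
    change_at=datetime(2025, 1, 10, 12, 0, tzinfo=timezone.utc),
    old_ttl=3600,
    low_ttl=300,
)
for step, when in plan.items():
    print(f"{step}: {when.isoformat()}")
```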
Redundancy
Single point of failure: your DNS provider. AWS Route 53 has had outages. Cloudflare has had outages. The mitigations:
Multiple Authoritative Nameservers in Diverse Networks
Cloudflare uses anycast + many POPs. That's good. But if their anycast announcement gets withdrawn (it has happened), they all become unreachable simultaneously.
The fix: secondary DNS at a different provider. Tools like OctoDNS publish to two providers; your registrar's NS records list both. A resolver tries one; if it doesn't respond, tries the other.

    example.com NS ns1.cloudflare.com.
    example.com NS ns2.cloudflare.com.
    example.com NS ns1.route53.amazonaws.com.
    example.com NS ns2.route53.amazonaws.com.

Cost: 2× DNS costs, ops effort to keep zones in sync. Reserved for "DNS outage costs me serious money."
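Keeping the two providers in sync is the ongoing cost, so a drift check is worth automating. The sketch below assumes each provider's zone has already been exported into a dictionary keyed by (name, type) with a set of values; how you fetch those exports depends on the provider and is not shown. `diff_zones` is a hypothetical helper, and skipping SOA is an assumption, since each provider normally generates its own.

```python
IGNORED_TYPES = {"SOA"}  # assumption: provider-generated, expected to differ


def diff_zones(primary: dict, secondary: dict) -> dict:
    """Compare two zones given as {(name, rtype): set_of_values}.

    Returns the record keys that are missing from the secondary, present
    only in the secondary, or present in both with different values.
    """
    p = {k: v for k, v in primary.items() if k[1] not in IGNORED_TYPES}
    s = {k: v for k, v in secondary.items() if k[1] not in IGNORED_TYPES}
    return {
        "missing": sorted(k for k in p if k not in s),
        "extra": sorted(k for k in s if k not in p),
        "changed": sorted(k for k in p if k in s and p[k] != s[k]),
    }


# Made-up sample data for illustration.
primary = {
    ("www", "A"): {"203.0.113.10"},
    ("@", "MX"): {"10 mail.example.com."},
    ("api", "A"): {"203.0.113.20"},
}
secondary = {
    ("www", "A"): {"203.0.113.10"},
    ("api", "A"): {"198.51.100.20"},
    ("old", "CNAME"): {"legacy.example.net."},
}
print(diff_zones(primary, secondary))
```

Any non-empty list in the result is a reason to alert: resolvers pick nameservers unpredictably, so a record that differs between providers produces intermittent, hard-to-reproduce failures.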
Long TTLs for Critical Records
Counterintuitively, long TTLs help redundancy — if your DNS provider goes down, recursive resolvers globally still have cached answers. Trade against the cost: slow changes when you need them.
A common compromise: 86400s TTL on NS records, 3600s on app records, 300s on records that may need to change.
Monitoring
DNS-specific monitoring you probably don't have but should:
| Signal | What to monitor |
|---|---|
| Authoritative server health | Query from outside; alert on NXDOMAIN, SERVFAIL, timeout |
| DNSSEC chain validity | Cron job that validates the chain (delv, or dig +dnssec; dig +sigchase has been removed from current BIND releases); alert before signature expiry |
| NS records at the registrar | Compare against expected list; alert on drift |
| TLD glue / DS records | Annual check that your TLD entry hasn't drifted |
| Resolution time | dig from multiple regions; alert on p99 spikes |
| Negative caching (NXDOMAIN) | Sudden NXDOMAIN spike for a known-good domain = misconfiguration |
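For the signature-expiry alert in the table, the expiration timestamp can be read directly from an RRSIG record in presentation format, as printed by tools such as `dig +dnssec`. This is a minimal sketch: the function names are hypothetical, the sample line is made up, and it assumes the timestamp is in the usual `YYYYMMDDHHmmSS` form rather than epoch seconds.

```python
from datetime import datetime, timezone


def rrsig_expiry(line: str) -> datetime:
    """Parse the signature expiration from one RRSIG line.

    Fields after the RRSIG token are: type covered, algorithm, labels,
    original TTL, expiration, inception, key tag, signer name, signature.
    """
    fields = line.split()
    rdata = fields[fields.index("RRSIG") + 1:]
    expiry = datetime.strptime(rdata[4], "%Y%m%d%H%M%S")
    return expiry.replace(tzinfo=timezone.utc)


def days_until_expiry(line: str, now: datetime) -> float:
    """Days remaining before the signature expires (negative if expired)."""
    return (rrsig_expiry(line) - now).total_seconds() / 86400


# Made-up RRSIG line for illustration; the signature is a placeholder.
sample = ("example.com. 3600 IN RRSIG A 13 2 3600 "
          "20250115000000 20250101000000 12345 example.com. c2lnbmF0dXJl")
now = datetime(2025, 1, 10, tzinfo=timezone.utc)
print(f"{days_until_expiry(sample, now):.1f} days until expiry")
```

Alerting when the remaining time drops below your re-signing interval gives you a window to fix a stuck signer before validating resolvers start returning SERVFAIL.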
dnscheck.io and intodns.com are useful for one-off checks. For continuous monitoring: a small synthetic script + your existing alerting.
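A synthetic check can be as small as the sketch below, which uses only the standard library: it builds a single-question DNS query, sends it over UDP, and classifies the result as a response code or a timeout. The function names are hypothetical, and the sketch deliberately skips things a hardened probe would do, such as verifying the response ID, retrying over TCP on truncation, and measuring latency.

```python
import random
import socket
import struct
from typing import Optional

RCODES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(name: str, qtype: int = 1,
                query_id: Optional[int] = None) -> bytes:
    """Encode a single-question query (qtype 1 = A, class IN)."""
    if query_id is None:
        query_id = random.randint(0, 0xFFFF)
    # Header: ID, flags (0x0100 sets recursion desired), 1 question, 0 others.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels)
    return header + qname + b"\x00" + struct.pack("!HH", qtype, 1)


def rcode_of(response: bytes) -> str:
    """Read the response code from the low four bits of the flags field."""
    (flags,) = struct.unpack("!H", response[2:4])
    code = flags & 0x000F
    return RCODES.get(code, f"RCODE{code}")


def probe(server: str, name: str, timeout: float = 2.0) -> str:
    """Send one query to server:53 and return the rcode name or "TIMEOUT"."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (server, 53))
            response, _ = sock.recvfrom(512)
        except socket.timeout:
            return "TIMEOUT"
    return rcode_of(response)
```

Calling `probe()` with the address of one of your authoritative servers and a known-good name, from several regions on a schedule, and alerting on anything other than `NOERROR`, covers the first row of the table.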
DDoS Protection
DNS is a popular DDoS target. The classic attack: amplification — small spoofed query, huge response, magnifies attacker's bandwidth.
Defenses:
- Anycast-hosted DNS (Cloudflare, Route 53, NS1) absorbs L3/L4 floods.
- Rate limiting per source at the authoritative server level.
- Don't run a public open resolver — separate any recursive servers from anything public.
- DNSSEC + DoT/DoH for last-mile integrity (separate from origin DDoS, but worth doing).
Self-hosted DNS on a single VM is a DDoS waiting to happen. Use a hosted provider for public-facing authoritative DNS.
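To make the per-source rate limiting defense concrete, here is a token-bucket sketch. It illustrates the idea only: production authoritative servers ship their own mechanisms (for example, response rate limiting in BIND), and the class name, parameters, and unbounded in-memory table here are assumptions for illustration.

```python
import time


class SourceRateLimiter:
    """Token bucket per source IP: `rate` tokens per second, up to `burst`."""

    def __init__(self, rate: float, burst: int, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock  # injectable so the behavior can be tested
        self._buckets = {}  # source_ip -> (tokens, last_seen)

    def allow(self, source_ip: str) -> bool:
        """Return True if a query from source_ip should be answered."""
        now = self.clock()
        tokens, last = self._buckets.get(source_ip, (float(self.burst), now))
        # Refill for the elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self._buckets[source_ip] = (tokens - 1, now)
            return True
        self._buckets[source_ip] = (tokens, now)
        return False


limiter = SourceRateLimiter(rate=5, burst=10)
print(limiter.allow("192.0.2.55"))
```

Because amplification attacks spoof the victim's address, real implementations usually bucket by network prefix rather than single address and cap the table size; both are left out to keep the sketch short.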
Multi-Region Patterns
If you serve traffic globally:
Latency-Based Routing
DNS resolves the same name to a different IP based on which region is fastest from the resolver:

    api.example.com (latency-based) ──► us-east      → 203.0.113.10
                                    ──► eu-west      → 198.51.100.10
                                    ──► ap-southeast → 192.0.2.10

Caveat: resolver location ≠ user location. A user in Singapore on Google Public DNS may resolve via a US Google resolver and get pointed at us-east. Workarounds: EDNS Client Subnet (where supported), or run multiple regional CDN endpoints fronting the same backend.
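The selection logic, and the reason the resolver-location caveat bites, can be shown in a few lines. This sketch assumes the provider holds a latency estimate from the querying resolver to each region; the `pick_endpoint` name and the latency figures are made up, while the endpoint addresses are the ones from the diagram above.

```python
ENDPOINTS = {
    "us-east": "203.0.113.10",
    "eu-west": "198.51.100.10",
    "ap-southeast": "192.0.2.10",
}


def pick_endpoint(latency_ms: dict, endpoints: dict = ENDPOINTS) -> str:
    """Return the endpoint in the region with the lowest measured latency.

    latency_ms maps region name to latency as seen from the resolver,
    not from the end user.
    """
    candidates = [region for region in latency_ms if region in endpoints]
    best = min(candidates, key=lambda region: latency_ms[region])
    return endpoints[best]


# Hypothetical measurements: the same Singapore user, two different resolvers.
via_local_resolver = {"us-east": 220.0, "eu-west": 170.0, "ap-southeast": 12.0}
via_us_resolver = {"us-east": 8.0, "eu-west": 85.0, "ap-southeast": 210.0}

print(pick_endpoint(via_local_resolver))
print(pick_endpoint(via_us_resolver))
```

The second call returns the us-east address for a user who is physically in Singapore, which is exactly the misrouting that EDNS Client Subnet is meant to reduce.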
Geo Routing

    queries from EU → eu.example.com
    queries from US → us.example.com
    all others      → default.example.com

Useful for data residency (EU traffic stays in EU). Same resolver-location caveat applies.
Health-Checked Failover
Active/passive with DNS-level health checks:

    api.example.com (primary) ──► 203.0.113.10
                     if down: ──► 198.51.100.10

Bound by TTL. Fast failover needs short TTLs (and matching query-volume costs).
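"Bound by TTL" can be made concrete with a rough worst-case estimate: the time the health check needs to declare the primary down, plus one full TTL for cached answers to expire. The sketch below uses hypothetical names and parameters and ignores check timeouts, provider propagation delay, and resolvers that overhold records.

```python
PRIMARY = "203.0.113.10"
SECONDARY = "198.51.100.10"


def answer_for(primary_healthy: bool) -> str:
    """Active/passive: serve the primary unless its health check is failing."""
    return PRIMARY if primary_healthy else SECONDARY


def worst_case_failover_seconds(check_interval: int, failures_to_trip: int,
                                ttl: int) -> int:
    """Rough upper bound on how long clients may keep hitting a dead primary.

    Detection takes failures_to_trip consecutive failed checks, and a
    resolver that cached the old answer just before the switch keeps it
    for one full TTL.
    """
    return check_interval * failures_to_trip + ttl


# Assumed settings: 30s checks, 3 failures to trip, 60s TTL.
print(worst_case_failover_seconds(check_interval=30, failures_to_trip=3, ttl=60))
# The same checks with a 1-hour TTL.
print(worst_case_failover_seconds(check_interval=30, failures_to_trip=3, ttl=3600))
```

With a one-hour TTL the TTL term dominates completely, which is why the failover row in the TTL table suggests 60s.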
Domain Hijacking and Lock
The nightmare scenario: someone takes over your domain registration. Defenses:
- Registrar lock (also called clientTransferProhibited) on every domain.
- 2FA on registrar accounts — no SMS-only; use hardware key or TOTP.
- Registry lock (premium feature) — requires out-of-band verification for any change.
- Auto-renew on so you don't lose the domain to lapse.
- Renew for 5-10 years on important domains (cheap insurance).
- Monitor WHOIS for changes — services like DNSSpy alert on changes.
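Some of these defenses can be checked mechanically. The sketch below swaps WHOIS for RDAP, its JSON-based successor, because RDAP responses are structured: a domain object carries a `status` list and an `events` list. It assumes the response has already been fetched and parsed into a dictionary; the function name, the 60-day threshold, and the sample data are made up for illustration.

```python
from datetime import datetime, timezone


def registration_warnings(rdap: dict, now: datetime,
                          min_days: int = 60) -> list:
    """Flag a missing transfer lock or an approaching expiration date."""
    warnings = []
    # RDAP spells the EPP status clientTransferProhibited with spaces.
    if "client transfer prohibited" not in rdap.get("status", []):
        warnings.append("registrar lock (clientTransferProhibited) is not set")
    for event in rdap.get("events", []):
        if event.get("eventAction") == "expiration":
            stamp = event["eventDate"].replace("Z", "+00:00")
            days = (datetime.fromisoformat(stamp) - now).days
            if days < min_days:
                warnings.append(f"registration expires in {days} days")
    return warnings


# Made-up RDAP fragment for illustration.
sample = {
    "status": ["client delete prohibited"],
    "events": [
        {"eventAction": "registration", "eventDate": "2015-02-01T00:00:00Z"},
        {"eventAction": "expiration", "eventDate": "2025-02-01T00:00:00Z"},
    ],
}
now = datetime(2025, 1, 2, tzinfo=timezone.utc)
for warning in registration_warnings(sample, now):
    print(warning)
```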
Subdomain Takeover
Subtle, common, dangerous: a subdomain CNAMEs to a service you've stopped using (an old Heroku app, a deleted S3 bucket, a removed Vercel project). The provider's namespace is open; an attacker registers your-old-heroku-app.herokuapp.com, and now your subdomain points at their content.
Defenses:
- Delete the CNAME when you delete the service. Always.
- Inventory your DNS — periodic audit of CNAMEs and what they point at.
- Use tools like subjack and dnstwist to scan for vulnerable records.
- Reserve subdomains at services you use (some providers let you "claim" the name even if not actively using it).
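The inventory step can start with a very simple check: list every CNAME whose target no longer resolves. This is a minimal sketch with hypothetical function names and made-up sample records. It catches only one class of dangling record, because many providers keep answering DNS for unclaimed names; dedicated scanners go further by fingerprinting the provider's "no such app" response.

```python
import socket
from typing import Callable


def resolves(hostname: str) -> bool:
    """Return True if the hostname currently resolves to any address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


def dangling_cnames(cnames: dict,
                    is_live: Callable[[str], bool] = resolves) -> list:
    """Return the names whose CNAME target fails the liveness check.

    cnames maps record name to CNAME target. is_live is injectable so the
    audit logic can be exercised without real DNS lookups.
    """
    return sorted(name for name, target in cnames.items()
                  if not is_live(target))


# Made-up inventory and a stubbed liveness check for illustration.
inventory = {
    "www.example.com": "example.cdn.example.net",
    "promo.example.com": "your-old-heroku-app.herokuapp.com",
    "docs.example.com": "docs-site.example.net",
}
live_targets = {"example.cdn.example.net", "docs-site.example.net"}
print(dangling_cnames(inventory, is_live=lambda host: host in live_targets))
```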
Anti-Patterns
| Anti-pattern | Symptom | Fix |
|---|---|---|
| 24h TTL on production records | A bad change is a day of partial outage | 300-3600s on records that change |
| Records changed by clicks, no audit | "Who deleted that A record?" | Zone-as-code, PR review, audit log |
| Apex CNAME via misuse | Half of resolvers fail | Use A/AAAA or provider flattening |
| Vary: User-Agent in DNS (i.e., trying to do app-level routing in DNS) | Hit rates collapse, weird behavior | Route in the app or LB layer |
| Self-hosted public DNS on a single host | DDoS takes you down | Use a hosted anycast provider |
| Hardcoded IPs in code instead of DNS | Migrations require redeploys | Always use DNS names |
| No CAA records | Any CA can issue for your domain | Add CAA |
| Public-facing recursive resolver | Amplification attack source | Don't run one; or restrict to known clients |
Zone Hygiene
Periodic audit, every 6-12 months:
- Inventory all records. Diff against what's expected.
- Check CNAME targets are still live and yours.
- Verify SPF/DKIM/DMARC still match current mail providers.
- Verify CAA lists only the CAs you actually use.
- Drop unused records — old subdomains, dead services, abandoned A records.
This is also a great use of zone-as-code: drift detection becomes "did anyone change anything since last apply?"
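One audit step from the list above, confirming that CAA names only the CAs you actually use, is easy to script. The sketch assumes CAA records are available as presentation-format strings such as `0 issue "letsencrypt.org"`; the function name and the sample records are hypothetical.

```python
def unexpected_caa_issuers(caa_records: list, allowed: set) -> set:
    """Return CA identifiers authorized in CAA but absent from `allowed`.

    Only issue and issuewild tags authorize issuance; other tags such as
    iodef are ignored. Parameters after ';' are dropped, and an empty
    issuer (a bare ';') means "no CA may issue", so it is skipped.
    """
    found = set()
    for record in caa_records:
        _flags, tag, value = record.split(maxsplit=2)
        if tag in ("issue", "issuewild"):
            issuer = value.strip('"').split(";")[0].strip()
            if issuer:
                found.add(issuer)
    return found - allowed


# Made-up zone contents for illustration.
records = [
    '0 issue "letsencrypt.org"',
    '0 issue "digicert.com"',
    '0 issuewild ";"',
    '0 iodef "mailto:security@example.com"',
]
print(unexpected_caa_issuers(records, allowed={"letsencrypt.org"}))
```

A non-empty result means the zone still authorizes a CA you have stopped using, which is the kind of leftover this audit is meant to catch.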
Checklist
Production DNS checklist
- Zone managed as code (OctoDNS / Terraform / dnscontrol) in version control
- Authoritative DNS at a reputable provider (Cloudflare / Route 53 / NS1)
- Secondary DNS at a different provider for critical domains
- CAA records restricting which CAs may issue certs
- DNSSEC enabled
- SPF + DKIM + DMARC configured if you send mail
- Registrar lock (clientTransferProhibited) on
- 2FA (hardware key or TOTP) on registrar accounts
- Auto-renew on; long renewal period for important domains
- TTLs set per record type per criticality
- Pre-change TTL drop documented as a runbook
- Synthetic monitoring of DNS resolution from multiple regions
- Periodic audit of CNAME targets (subdomain takeover)
- WHOIS / NS-record monitoring with alerts on change
- Reverse DNS set for mail-sending IPs