Steven's Knowledge

Best Practices

Production DNS - TTL strategy, redundancy, monitoring, DDoS protection, anti-patterns

DNS is in the critical path of literally every request. A bad change to a DNS zone can take you offline for hours while caches expire. Treat it as a production system, not a control panel.

TTL Strategy

The single most-tuned knob. Some patterns that work:

| Record | Suggested TTL | Why |
| --- | --- | --- |
| Apex A/AAAA (pointing at CDN) | 300s (5 min) | May need fast failover |
| CNAME (pointing at a hostname) | 300-3600s | Stable, but you'll change it eventually |
| MX | 3600s (1 hour) | Rarely changes |
| NS at apex | 86400s (1 day) | Almost never changes; long TTL is fine |
| TXT for SPF/DKIM/DMARC | 3600s (1 hour) | Stable; minor changes |
| TXT for verification | 300s (5 min) | Set, verify, leave |
| CAA | 86400s (1 day) | Stable |
| Failover targets | 60s (1 min) | Tight failover loops; cost: query volume |

The Pre-Change Drop

Always lower TTL before a planned record change:

T-72h:  current TTL is 3600, set new TTL to 300
T-24h:  most resolvers globally now honor 5-min TTL
T:      make the actual change
T+1h:   verify propagation
T+24h:  raise TTL back to 3600

Skip this and your "5-minute change" becomes up to an hour of partial outage: every resolver that cached the old value under the 3600s TTL keeps sending users to the old endpoint until that entry expires.
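The timing behind the pre-change drop can be sketched in a few lines of Python. This is a minimal illustration, not a tool: the function name `earliest_safe_change` and the `safety_factor` padding are assumptions made for the example. The idea is that a resolver which cached the record just before you lowered the TTL may keep it for the full old TTL, so the change is only "safe" once that window has passed.

```python
from datetime import datetime, timedelta, timezone


def earliest_safe_change(lowered_at: datetime, old_ttl_seconds: int,
                         safety_factor: float = 2.0) -> datetime:
    """Earliest time at which caches should only hold the lowered TTL.

    A resolver that cached the record right before the TTL was lowered may
    keep it for up to old_ttl_seconds. safety_factor is a hypothetical
    padding (not a standard) for resolvers that stretch TTLs.
    """
    if old_ttl_seconds <= 0:
        raise ValueError("old_ttl_seconds must be positive")
    return lowered_at + timedelta(seconds=old_ttl_seconds * safety_factor)


# TTL lowered from 3600s at noon UTC; with 2x padding, wait until 14:00 UTC.
lowered = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(earliest_safe_change(lowered, 3600))  # 2024-01-01 14:00:00+00:00
```

The 72-hour lead in the schedule above is far more conservative than this lower bound, which leaves room for resolvers that ignore TTLs and for the humans involved.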

Redundancy

Single point of failure: your DNS provider. AWS Route 53 has had outages. Cloudflare has had outages. The mitigations:

Multiple Authoritative Nameservers in Diverse Networks

Cloudflare uses anycast + many POPs. That's good. But if their anycast announcement gets withdrawn (it has happened), every POP becomes unreachable at the same time.

The fix: secondary DNS at a different provider. Tools like OctoDNS publish the same zone to two providers, and the NS delegation at your registrar lists both. A resolver tries one nameserver; if it doesn't respond, it tries another. The hostnames below are illustrative; each provider assigns its own nameserver names.

example.com   NS   ns1.cloudflare.com.
example.com   NS   ns2.cloudflare.com.
example.com   NS   ns1.route53.amazonaws.com.
example.com   NS   ns2.route53.amazonaws.com.

Cost: 2× DNS costs, ops effort to keep zones in sync. Reserved for "DNS outage costs me serious money."
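A quick way to confirm that a delegation really spans more than one provider is to group the NS hostnames and count the groups. The sketch below is an assumption-laden illustration: `providers_for` and `has_provider_diversity` are hypothetical names, and "provider" is approximated by the last two labels of the hostname. That heuristic is crude; Route 53, for example, spreads one delegation set across several domains, so a real check would need an explicit hostname-to-provider mapping.

```python
from collections import defaultdict


def providers_for(nameservers: list[str]) -> dict[str, list[str]]:
    """Group NS hostnames by a naive provider key (the last two labels).

    Assumption: the last two labels identify the provider. This breaks for
    providers that use several domains or multi-label public suffixes.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for ns in nameservers:
        labels = ns.rstrip(".").lower().split(".")
        groups[".".join(labels[-2:])].append(ns)
    return dict(groups)


def has_provider_diversity(nameservers: list[str], minimum: int = 2) -> bool:
    """True if the NS set spans at least `minimum` distinct provider keys."""
    return len(providers_for(nameservers)) >= minimum


# The illustrative delegation from the text above.
delegation = [
    "ns1.cloudflare.com.",
    "ns2.cloudflare.com.",
    "ns1.route53.amazonaws.com.",
    "ns2.route53.amazonaws.com.",
]
print(has_provider_diversity(delegation))      # True
print(has_provider_diversity(delegation[:2]))  # False
```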

Long TTLs for Critical Records

Counterintuitively, long TTLs help redundancy — if your DNS provider goes down, recursive resolvers globally still have cached answers. Trade against the cost: slow changes when you need them.

A common compromise: 86400s TTL on NS records, 3600s on app records, 300s on records that may need to change.

Monitoring

DNS-specific monitoring you probably don't have but should:

| Signal | What to monitor |
| --- | --- |
| Authoritative server health | Query from outside; alert on NXDOMAIN, SERVFAIL, timeout |
| DNSSEC chain validity | Cron job that validates the chain (delv, or dig +sigchase on older BIND builds that still ship it); alert before signature expiry |
| NS records at the registrar | Compare against expected list; alert on drift |
| TLD glue / DS records | Annual check that your TLD entry hasn't drifted |
| Resolution time | dig from multiple regions; alert on p99 spikes |
| Negative caching (NXDOMAIN) | Sudden NXDOMAIN spike for a known-good domain = misconfiguration |

dnscheck.io, intodns.com are useful one-shots. For continuous monitoring: a small synthetic script + your existing alerting.
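As a rough idea of what "a small synthetic script" might look like, here is a standard-library-only sketch that sends one UDP query and classifies the result as NOERROR, NXDOMAIN, SERVFAIL, or a timeout. The function names (`build_query`, `parse_rcode`, `probe`, `should_alert`) are hypothetical, and the probe is deliberately simplified: it does not verify the transaction ID, retry over TCP on truncation, or validate label lengths. A library such as dnspython would be a more robust base for real monitoring.

```python
import secrets
import socket
import struct

RCODES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(name: str, qtype: int = 1, recursion_desired: bool = False) -> bytes:
    """Encode a single-question DNS query (qtype 1 = A, class IN)."""
    flags = 0x0100 if recursion_desired else 0x0000  # RD bit only
    header = struct.pack("!HHHHHH", secrets.randbits(16), flags, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def parse_rcode(response: bytes) -> str:
    """Extract the response code from the low 4 bits of the flags field."""
    if len(response) < 12:
        raise ValueError("short DNS response")
    code = struct.unpack("!H", response[2:4])[0] & 0x000F
    return RCODES.get(code, f"RCODE{code}")


def probe(server: str, name: str, timeout: float = 2.0) -> str:
    """Query `server` over UDP; return the rcode name or 'TIMEOUT'."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (server, 53))
            data, _ = sock.recvfrom(4096)
        except socket.timeout:
            return "TIMEOUT"
    return parse_rcode(data)


def should_alert(result: str) -> bool:
    """Alert on the failure modes listed in the monitoring table."""
    return result in {"NXDOMAIN", "SERVFAIL", "TIMEOUT"}
```

Run `probe` from several regions against each authoritative nameserver IP and feed `should_alert` into whatever alerting you already have.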

DDoS Protection

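DNS is a popular DDoS target. The classic attack is amplification: the attacker sends small queries with the victim's address spoofed as the source, and the much larger responses land on the victim, multiplying the attacker's bandwidth.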

Defenses:

  • Anycast-hosted DNS (Cloudflare, Route 53, NS1) absorbs L3/L4 floods.
  • Rate limiting per source at the authoritative server level.
  • Don't run a public open resolver — separate any recursive servers from anything public.
  • DNSSEC for authenticated answers, plus DoT/DoH on the last hop between client and resolver (separate from origin DDoS, but worth doing).
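Per-source rate limiting is usually a token bucket under the hood. The sketch below shows the mechanism only; `SourceRateLimiter` is a hypothetical class written for illustration, not any server's API. Production authoritative servers ship their own implementation (for example Response Rate Limiting), and this version keeps one bucket per source forever, so it would need eviction before handling real traffic.

```python
from dataclasses import dataclass, field


@dataclass
class SourceRateLimiter:
    """Token bucket per source address (illustrative, unbounded memory)."""
    rate: float   # tokens refilled per second, per source
    burst: float  # bucket capacity, i.e. the largest allowed burst
    _buckets: dict = field(default_factory=dict)  # source -> (tokens, last_seen)

    def allow(self, source: str, now: float) -> bool:
        """Return True if a query from `source` at time `now` may be answered."""
        tokens, last = self._buckets.get(source, (self.burst, now))
        # Refill for the time elapsed since this source was last seen.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self._buckets[source] = (tokens, now)
        return allowed


limiter = SourceRateLimiter(rate=1.0, burst=2.0)
print([limiter.allow("198.51.100.7", 0.0) for _ in range(3)])  # [True, True, False]
print(limiter.allow("198.51.100.7", 1.0))                      # True
```

Passing `now` in explicitly keeps the class deterministic to test; a caller would normally supply `time.monotonic()`.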

Self-hosted DNS on a single VM is a DDoS waiting to happen. Use a hosted provider for public-facing authoritative DNS.

Multi-Region Patterns

If you serve traffic globally:

Latency-Based Routing

DNS resolves the same name to a different IP based on which region is fastest from the resolver:

api.example.com (latency-based) ──► us-east → 203.0.113.10
                                ──► eu-west → 198.51.100.10
                                ──► ap-southeast → 192.0.2.10

Caveat: resolver location ≠ user location. A user in Singapore on Google Public DNS may resolve via a US Google resolver and get pointed at us-east. Workarounds: EDNS Client Subnet (where supported), or run multiple regional CDN endpoints fronting the same backend.
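The resolver-location caveat is easier to see with the selection logic written out. In this sketch the latency figures are invented for illustration, and `pick_region` / `answer_for` are hypothetical names; only the region names and addresses come from the diagram above. The key point is that the input is latency as seen from the resolver, which is all the DNS provider can measure without EDNS Client Subnet.

```python
# Region -> endpoint, as in the diagram above (documentation address ranges).
ENDPOINTS = {
    "us-east": "203.0.113.10",
    "eu-west": "198.51.100.10",
    "ap-southeast": "192.0.2.10",
}


def pick_region(latency_ms: dict[str, float]) -> str:
    """Return the region with the lowest measured latency."""
    if not latency_ms:
        raise ValueError("no latency measurements")
    return min(latency_ms, key=latency_ms.get)


def answer_for(resolver_latency_ms: dict[str, float]) -> str:
    """Answer the query based on latency from the *resolver*, not the user."""
    return ENDPOINTS[pick_region(resolver_latency_ms)]


# Hypothetical measurements: a Singapore user whose queries arrive via a
# US-based resolver is steered by the resolver's latencies.
via_us_resolver = {"us-east": 12.0, "eu-west": 95.0, "ap-southeast": 210.0}
via_local_resolver = {"us-east": 230.0, "eu-west": 180.0, "ap-southeast": 8.0}
print(answer_for(via_us_resolver))     # 203.0.113.10
print(answer_for(via_local_resolver))  # 192.0.2.10
```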

Geo Routing

queries from EU      → eu.example.com
queries from US      → us.example.com
all others           → default.example.com

Useful for data residency (EU traffic stays in EU). Same resolver-location caveat applies.

Health-Checked Failover

Active/passive with DNS-level health checks:

api.example.com (primary)   ──► 203.0.113.10
                if down:    ──► 198.51.100.10

Bound by TTL. Fast failover needs short TTLs (and matching query-volume costs).
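"Bound by TTL" can be made concrete with a little arithmetic. As a simplified model (an assumption, since providers differ in how they count failures), the worst case is the time the health check needs to declare the primary down plus one full TTL for caches that refreshed just before the switch. The function and parameter names are hypothetical.

```python
def worst_case_failover_seconds(ttl: int, check_interval: int,
                                failure_threshold: int) -> int:
    """Upper bound on how long clients may keep hitting a dead primary.

    detection = consecutive failed checks needed to mark the primary down
    ttl       = longest a resolver may keep serving the stale answer
    """
    detection = check_interval * failure_threshold
    return detection + ttl


# 30s checks, 3 failures to trip: the TTL dominates once it grows.
print(worst_case_failover_seconds(ttl=60, check_interval=30, failure_threshold=3))    # 150
print(worst_case_failover_seconds(ttl=3600, check_interval=30, failure_threshold=3))  # 3690
```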

Domain Hijacking and Lock

The nightmare scenario: someone takes over your domain registration. Defenses:

  • Registrar lock (also called clientTransferProhibited) on every domain.
  • 2FA on registrar accounts — no SMS-only; use hardware key or TOTP.
  • Registry lock (premium feature) — requires out-of-band verification for any change.
  • Auto-renew on so you don't lose the domain to lapse.
  • Renew for 5-10 years on important domains (cheap insurance).
  • Monitor WHOIS for changes — services like DNSSpy alert on changes.

Subdomain Takeover

Subtle, common, dangerous: a subdomain CNAMEs to a service you've stopped using (an old Heroku app, a deleted S3 bucket, a removed Vercel project). The provider's namespace is open; an attacker registers your-old-heroku-app.herokuapp.com, and now your subdomain points at their content.

Defenses:

  • Delete the CNAME when you delete the service. Always.
  • Inventory your DNS — periodic audit of CNAMEs and what they point at.
  • Tools like subjack scan for dangling records that are open to takeover. dnstwist covers the adjacent problem of lookalike (typosquatted) domains.
  • Reserve subdomains at services you use (some providers let you "claim" the name even if not actively using it).
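The periodic CNAME audit can start as something very small. The sketch below flags CNAMEs whose target no longer resolves; `find_dangling_cnames` and `resolves` are hypothetical helpers, and the record names in the example are made up. Treat it as a first pass only: many providers answer for unclaimed names via wildcard DNS, so a target that resolves is not proof that it is still yours, which is why dedicated scanners also fingerprint the HTTP response.

```python
import socket
from typing import Callable


def resolves(hostname: str) -> bool:
    """True if the hostname currently resolves to at least one address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


def find_dangling_cnames(cnames: dict[str, str],
                         target_exists: Callable[[str], bool] = resolves) -> list[str]:
    """Return record names whose CNAME target fails the existence check."""
    return sorted(name for name, target in cnames.items()
                  if not target_exists(target))


# Offline example with a stubbed existence check (hypothetical records).
records = {
    "shop.example.com": "shop.myshop.example",
    "old.example.com": "your-old-heroku-app.herokuapp.com",
}
live_targets = {"shop.myshop.example"}
print(find_dangling_cnames(records, lambda target: target in live_targets))
# ['old.example.com']
```

Injecting the existence check keeps the audit testable without network access and lets you swap in a provider-specific probe later.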

Anti-Patterns

| Anti-pattern | Symptom | Fix |
| --- | --- | --- |
| 24h TTL on production records | A bad change is a day of partial outage | 300-3600s on records that change |
| Records changed by clicks, no audit | "Who deleted that A record?" | Zone-as-code, PR review, audit log |
| CNAME at the zone apex | Conflicts with the apex SOA/NS records; resolution fails on some resolvers | Use A/AAAA or provider flattening |
| App-level routing in DNS (the DNS equivalent of Vary: User-Agent) | Cache hit rates collapse, weird behavior | Route in the app or LB layer |
| Self-hosted public DNS on a single host | DDoS takes you down | Use a hosted anycast provider |
| Hardcoded IPs in code instead of DNS | Migrations require redeploys | Always use DNS names |
| No CAA records | Any CA can issue for your domain | Add CAA |
| Public-facing recursive resolver | Amplification attack source | Don't run one; or restrict to known clients |

Zone Hygiene

Periodic audit, every 6-12 months:

  • Inventory all records. Diff against what's expected.
  • Check CNAME targets are still live and yours.
  • Verify SPF/DKIM/DMARC still match current mail providers.
  • Verify CAA lists only the CAs you actually use.
  • Drop unused records — old subdomains, dead services, abandoned A records.

This is also a great use of zone-as-code: drift detection becomes "did anyone change anything since last apply?"
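Stripped of tooling, drift detection is a set difference between the records you declared and the records the provider actually serves. The sketch below assumes records are modelled as `(name, type, value)` tuples and that you already have both sets in hand; `zone_drift` is a hypothetical helper, and tools such as OctoDNS or dnscontrol do the equivalent comparison as part of their plan or preview step.

```python
Record = tuple[str, str, str]  # (name, type, value); a simplified model


def zone_drift(expected: set[Record], actual: set[Record]) -> dict[str, list[Record]]:
    """Compare the declared zone with the live zone.

    missing    = declared in code but absent from the live zone
    unexpected = present in the live zone but not declared in code
    """
    return {
        "missing": sorted(expected - actual),
        "unexpected": sorted(actual - expected),
    }


# Hypothetical zone contents for illustration.
expected = {
    ("example.com", "MX", "10 mail.example.com."),
    ("www.example.com", "CNAME", "cdn.example.net."),
}
actual = {
    ("www.example.com", "CNAME", "cdn.example.net."),
    ("old.example.com", "A", "203.0.113.99"),
}
print(zone_drift(expected, actual))
```

An empty result on both keys means nobody has changed the zone by hand since the last apply.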

Checklist

Production DNS checklist

  • Zone managed as code (OctoDNS / Terraform / dnscontrol) in version control
  • Authoritative DNS at a reputable provider (Cloudflare / Route 53 / NS1)
  • Secondary DNS at a different provider for critical domains
  • CAA records restricting which CAs may issue certs
  • DNSSEC enabled
  • SPF + DKIM + DMARC configured if you send mail
  • Registrar lock (clientTransferProhibited) on
  • 2FA (hardware key or TOTP) on registrar accounts
  • Auto-renew on; long renewal period for important domains
  • TTLs set per record type per criticality
  • Pre-change TTL drop documented as a runbook
  • Synthetic monitoring of DNS resolution from multiple regions
  • Periodic audit of CNAME targets (subdomain takeover)
  • WHOIS / NS-record monitoring with alerts on change
  • Reverse DNS set for mail-sending IPs
