Best Practices
Production VPN / zero trust - key management, observability, scaling, security hardening, pitfalls
Networking is in the critical path of literally every internal request. These practices keep mesh VPN and zero trust solid in production.
Tie Network Access to Identity
The single highest-leverage principle. Network membership = IdP group membership.
- Use SSO (Google Workspace, Okta, Microsoft Entra ID, formerly Azure AD) as the identity source.
- Group membership in your IdP determines tailnet ACLs / Access policies.
- Removing someone from the IdP removes their network access — don't manage a separate user list.
When the on-call rotation changes, you update an IdP group. The ACL refers to the group, so access follows as soon as the membership change syncs.
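The group indirection can be sketched in a few lines. This is a simplified Python model, not Tailscale's policy evaluator; the group name, users, tags, and the `can_access` helper are all hypothetical.

```python
# Simplified model of group-based access; not Tailscale's actual evaluator.
# Group names, users, and tags below are hypothetical.

def can_access(user: str, dst_tag: str,
               groups: dict[str, set[str]], acls: list[dict]) -> bool:
    """Return True if any accept rule links one of the user's groups to dst_tag."""
    user_groups = {name for name, members in groups.items() if user in members}
    return any(
        rule["action"] == "accept"
        and dst_tag in rule["dst"]
        and user_groups & set(rule["src"])
        for rule in acls
    )

groups = {"group:on-call": {"alice@example.com"}}
acls = [{"action": "accept", "src": ["group:on-call"], "dst": ["tag:prod"]}]

print(can_access("alice@example.com", "tag:prod", groups, acls))  # True

# Rotation change: only the group membership is edited, the ACL is untouched.
groups["group:on-call"] = {"bob@example.com"}
print(can_access("alice@example.com", "tag:prod", groups, acls))  # False
print(can_access("bob@example.com", "tag:prod", groups, acls))    # True
```

The rule list never changes during the rotation, which is why the ACL stays reviewable while people come and go.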
Auth Keys: Use Sparingly, Rotate Aggressively
Tailscale auth keys (and similar provisioning tokens) bypass interactive login. They're convenient and dangerous.
| Practice | Why |
|---|---|
| Ephemeral auth keys for CI / runners | Devices auto-deregister; one less stale node |
| Pre-approved auth keys for production servers | Bootstrap without console access |
| Short expiry (24-72h) for short-lived workloads | Limits blast radius if leaked |
| Reusable keys only for production fleet | And only with strict tag enforcement |
| Tag-locked auth keys | A leaked key can only enroll devices with specific tags |
| Rotate periodically | Even for long-lived keys |
Treat auth keys like API keys — store them in your secret manager, never commit them.
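The table above can be turned into a periodic audit. The sketch below assumes you already have key metadata as plain dictionaries; the field names (`tags`, `reusable`, `ephemeral`, `expires`) and the `tag:prod` convention are illustrative assumptions, not a real API schema.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical key metadata; field names are illustrative, not a real API schema.
MAX_SHORT_LIVED = timedelta(hours=72)

def audit_key(key: dict, now: datetime) -> list[str]:
    """Return the practices from the table that this key record violates."""
    problems = []
    if not key["tags"]:
        problems.append("not tag-locked")
    if key["reusable"] and "tag:prod" not in key["tags"]:
        problems.append("reusable outside the production fleet")
    if key["ephemeral"] and key["expires"] - now > MAX_SHORT_LIVED:
        problems.append("ephemeral key lives longer than 72h")
    return problems

now = datetime(2024, 1, 1, tzinfo=timezone.utc)
ci_key = {"id": "ci-runner", "tags": ["tag:ci"], "reusable": False,
          "ephemeral": True, "expires": now + timedelta(hours=24)}
old_key = {"id": "old-script", "tags": [], "reusable": True,
           "ephemeral": False, "expires": now + timedelta(days=90)}

for key in (ci_key, old_key):
    print(key["id"], audit_key(key, now))
```

Running a check like this on a schedule is one way to make "rotate periodically" enforceable rather than aspirational.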
Device Approval Workflow
For production-bound devices, require manual approval before they join the network:
- Device runs tailscale up.
- Admin gets notified.
- Admin approves in the Tailscale admin UI.
- Device joins with allowed tags.
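For CI / ephemeral workers, skip manual approval by enrolling with pre-approved, tag-locked auth keys. Routes and exit nodes advertised by trusted owners or tags can be auto-approved in the policy file:
"autoApprovers": {
  "exitNode": ["group:admins"],
  "routes": {
    "10.0.0.0/16": ["tag:subnet-router"]
  }
}
SSH Through the Mesh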
Tailscale SSH replaces traditional SSH key management:
- No ~/.ssh/authorized_keys to maintain.
- Identity from your IdP.
- ACL controls who can ssh where.
- Optional session recording for audit.
- Optional check mode (re-auth before sensitive actions).
For production:
"ssh": [
  // On-call gets root only with re-auth in last 5 minutes
  {
    "action": "check",
    "src": ["group:on-call"],
    "dst": ["tag:prod"],
    "users": ["root"],
    "checkPeriod": "5m"
  },
  // Engineers get their own user, no re-auth required
  {
    "action": "accept",
    "src": ["group:engineers"],
    "dst": ["tag:staging"],
    "users": ["autogroup:nonroot"]
  }
]
The check action prompts the user to re-authenticate before granting access, which makes it a step-up auth pattern.
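The step-up decision can be sketched as a small function. This is a simplified model of the idea, not Tailscale's implementation; `parse_period` and `needs_reauth` are hypothetical helpers, and only minute and hour suffixes are handled.

```python
from datetime import datetime, timedelta, timezone

# Simplified model of the "check" decision; not Tailscale's implementation.

def parse_period(text: str) -> timedelta:
    """Parse a duration such as "5m" or "12h" (only m and h are handled here)."""
    units = {"m": "minutes", "h": "hours"}
    return timedelta(**{units[text[-1]]: int(text[:-1])})

def needs_reauth(rule: dict, last_auth: datetime, now: datetime) -> bool:
    """True when a check rule should send the user back to the IdP."""
    if rule["action"] != "check":
        return False  # plain accept rules never prompt
    return now - last_auth > parse_period(rule["checkPeriod"])

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
root_rule = {"action": "check", "checkPeriod": "5m"}

print(needs_reauth(root_rule, now - timedelta(minutes=3), now))   # False
print(needs_reauth(root_rule, now - timedelta(minutes=10), now))  # True
```

A short period for root on production and no check at all for staging keeps the friction where the risk is.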
Observability
What to monitor:
| Signal | Why |
|---|---|
| Connection attempts denied by ACL | Wrong policy or attack? |
| New device registrations | Authorized? Tagged correctly? |
| Exit node usage | Who's routing through what; cost & policy |
| Subnet router traffic | Volume / latency to "behind the mesh" |
| DERP relay fallback rate | Direct connections failing? NAT issue? |
| MFA challenges per Access app | Spike = phishing? |
Tailscale can stream its audit and network flow logs to S3 or your own endpoint (the log streaming feature). Cloudflare Access logs are exported through Logpush. Pipe both to your SIEM / ELK.
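As one example of turning a signal from the table into a number, the sketch below estimates the DERP relay fallback rate from a status snapshot. It assumes the shape of `tailscale status --json` output: a `Peer` map whose entries carry `Active` and `CurAddr`, with an empty `CurAddr` meaning no direct path. Treat those field names as assumptions and verify them against your client version; the sample data is made up.

```python
import json

def relay_fallback_rate(status: dict) -> float:
    """Share of active peers with no direct address, i.e. relayed via DERP."""
    active = [p for p in (status.get("Peer") or {}).values() if p.get("Active")]
    if not active:
        return 0.0
    relayed = [p for p in active if not p.get("CurAddr")]
    return len(relayed) / len(active)

# Made-up snapshot in the assumed shape of `tailscale status --json`.
sample = json.loads("""
{
  "Peer": {
    "nodekey:a": {"HostName": "db-1",   "Active": true,  "CurAddr": "203.0.113.10:41641"},
    "nodekey:b": {"HostName": "ci-7",   "Active": true,  "CurAddr": ""},
    "nodekey:c": {"HostName": "web-2",  "Active": true,  "CurAddr": "198.51.100.4:41641"},
    "nodekey:d": {"HostName": "web-3",  "Active": true,  "CurAddr": "192.0.2.8:41641"},
    "nodekey:e": {"HostName": "laptop", "Active": false, "CurAddr": ""}
  }
}
""")

print(f"{relay_fallback_rate(sample):.0%}")  # 25%
```

Idle peers are excluded because they have no current path at all, direct or relayed.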
Scaling Considerations
Mesh VPN scales differently than perimeter VPN:
| Axis | Concern |
|---|---|
| Devices in the tailnet | ACL push time grows; coordination cost |
| Cross-device connections | Each pair needs key exchange |
| DERP / TURN relay traffic | Falls back when peer-to-peer fails — bandwidth cost |
| ACL complexity | Large ACL JSON gets unreadable; refactor |
For very large deployments (1,000+ devices), consider the Tailscale Enterprise tier, or Twingate, which has different scaling characteristics. Headscale (self-hosted) can struggle past a few hundred devices without tuning.
Most teams don't hit these limits. A mesh stays comfortable at roughly 5,000 devices on the SaaS plans.
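To see why the cross-device axis matters, the worst case is every device talking to every other one. The arithmetic below is an upper bound only; ACLs and real traffic patterns keep the number of live connections far lower.

```python
def peer_pairs(devices: int) -> int:
    """Upper bound on distinct device pairs in a full mesh: n * (n - 1) / 2."""
    return devices * (devices - 1) // 2

for n in (100, 1_000, 5_000):
    print(n, peer_pairs(n))
# 100 4950
# 1000 499500
# 5000 12497500
```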
Cloudflare Tunnel: Production
A few production practices:
- Run cloudflared as a service / Docker container, not a foreground process. Use systemd or Docker for restarts.
- Multiple tunnel replicas for HA: cloudflared connects from multiple hosts; Cloudflare load-balances.
- Don't put a single tunnel on a single dev laptop for anything important.
- Restrict tunnel scope: the ingress config should be explicit, with no wildcards that expose unintended services.
- Combine with Access policies for any sensitive route.
- Monitor connection health: the Cloudflare dashboard shows tunnel uptime.
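The "explicit ingress" rule lends itself to a small lint. The sketch below assumes the ingress list from the cloudflared config has already been parsed into Python dictionaries (for example with a YAML parser); `lint_ingress` and the hostnames are hypothetical.

```python
# Lints a cloudflared ingress list already parsed into dicts.
# Hostnames and the lint_ingress helper are hypothetical.

def lint_ingress(rules: list[dict]) -> list[str]:
    """Flag wildcard hostnames and catch-all rules that forward real traffic."""
    problems = []
    if not rules or "hostname" in rules[-1]:
        problems.append("last rule must be a catch-all without a hostname")
    elif not rules[-1]["service"].startswith("http_status:"):
        problems.append(f"catch-all forwards traffic to {rules[-1]['service']}")
    for rule in rules:
        host = rule.get("hostname", "")
        if "*" in host:
            problems.append(f"wildcard hostname: {host}")
    return problems

good = [
    {"hostname": "wiki.example.com", "service": "http://localhost:8080"},
    {"service": "http_status:404"},
]
risky = [
    {"hostname": "*.example.com", "service": "http://localhost:3000"},
    {"service": "http://localhost:3000"},
]

print(lint_ingress(good))   # []
print(lint_ingress(risky))
```

Ending the list with a status-only catch-all means an unexpected hostname gets an error page instead of whatever service happens to be listed last.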
For Kubernetes:
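# cloudflared as a Deployment
apiVersion: apps/v1
kind: Deployment
metadata: { name: cloudflared }
spec:
  replicas: 2
  # selector and matching template labels are required for a valid Deployment
  selector:
    matchLabels: { app: cloudflared }
  template:
    metadata:
      labels: { app: cloudflared }
    spec:
      containers:
        - name: cloudflared
          image: cloudflare/cloudflared:latest  # pin a specific tag in production
          args: ["tunnel", "--config", "/etc/cloudflared/config.yml", "run"]
          volumeMounts:
            - { name: config, mountPath: /etc/cloudflared }
      volumes:
        - name: config
          configMap: { name: cloudflared-config }
Two replicas and a ConfigMap for the tunnel config. The tunnel credentials belong in a Secret mounted alongside it (not shown in this manifest).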
Security Hardening
Practices that close common gaps:
Tailscale
- Enable SSO with MFA at the IdP level.
- Session limits: force re-auth every N days through Tailscale's key expiry setting.
- Disable Tailscale Funnel unless you actively use it (it exposes a tailnet device to the public internet).
- Tagged-device approval policy — require manual approval for new devices in production tags.
- Don't let end users disable key expiry on their own devices; they shouldn't be able to extend their session indefinitely.
- Two-person rule for changing ACLs (via Git PR, not the admin UI directly).
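When ACL changes go through Git, the PR pipeline can reject the most dangerous rule automatically. The sketch below assumes the policy file has already been parsed into a dictionary (Tailscale policy files allow comments, so a plain JSON parser may not read them directly); `find_allow_all` is a hypothetical helper.

```python
# CI-style check for allow-all rules; the policy is assumed to be parsed already.

def find_allow_all(policy: dict) -> list[int]:
    """Return indexes of ACL rules that let every source reach every destination."""
    hits = []
    for i, rule in enumerate(policy.get("acls", [])):
        if (rule.get("action") == "accept"
                and "*" in rule.get("src", [])
                and "*:*" in rule.get("dst", [])):
            hits.append(i)
    return hits

policy = {
    "acls": [
        {"action": "accept", "src": ["*"], "dst": ["*:*"]},
        {"action": "accept", "src": ["group:engineers"], "dst": ["tag:staging:22"]},
    ]
}

print(find_allow_all(policy))  # [0]
```

Failing the build on a non-empty result gives the second reviewer a safety net rather than replacing them.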
Cloudflare Tunnel + Access
- All app routes behind Access — no "I'll add Access later" for any sensitive route.
- Service tokens for service-to-service through Access; not "open it up for the IP."
- MFA required on all Access policies.
- WAF in front of Access as an additional layer.
- Logs to SIEM, not just the Cloudflare dashboard.
Key Management
In a mesh VPN, every device generates and holds its own private key. Lose the key → lose the device's identity.
- OS keychain storage (Tailscale does this by default).
- Rotate when devices change owners.
- Revoke immediately when devices are lost or stolen — Tailscale admin UI → delete device.
Common Pitfalls
| Pitfall | Symptom | Fix |
|---|---|---|
| Default-allow ACLs | Anyone in tailnet can ssh anywhere | Tighten ACLs; default-deny |
| Auth keys committed to git | Anyone with the repo can join | Treat as secrets |
| Cloudflare Tunnel without Access | "Internal app" publicly reachable | Add Access policy |
| Untagged production devices | Can't write strict ACLs | Tag every device |
| Forgot to enable MagicDNS | Connection strings break randomly | Enable in admin UI |
| Tailscale SSO without MFA | Phishing → tailnet access | Enforce MFA at IdP |
| Sharing tailnets across orgs | Permission soup | Federation or separate tailnets |
| Mesh + traditional VPN coexisting forever | Maintenance burden | Migration plan; sunset the old VPN |
| No alerting on new device registration | Compromise goes unnoticed | Log + alert |
| Long-lived auth keys never rotated | Stale credentials | Periodic rotation |
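For the "no alerting on new device registration" pitfall, the core of an alert is a diff between two inventory snapshots. The record shape (`id`, `name`, `tags`) and both helpers below are hypothetical; how the snapshots are fetched is left out.

```python
# Diff two device inventory snapshots; record fields are hypothetical.

def new_devices(previous: list[dict], current: list[dict]) -> list[dict]:
    """Devices present in the current snapshot but not in the previous one."""
    known = {d["id"] for d in previous}
    return [d for d in current if d["id"] not in known]

def needs_attention(device: dict) -> bool:
    """Untagged new devices deserve a closer look than tagged ones."""
    return not device.get("tags")

yesterday = [{"id": "n1", "name": "db-1", "tags": ["tag:prod"]}]
today = yesterday + [
    {"id": "n2", "name": "ci-42", "tags": ["tag:ci"]},
    {"id": "n3", "name": "unknown-laptop", "tags": []},
]

for device in new_devices(yesterday, today):
    level = "ALERT" if needs_attention(device) else "info"
    print(level, device["name"])
```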
When Things Go Wrong
Quick diagnosis:
# What's my tailscale state?
tailscale status
tailscale ip
tailscale netcheck # checks UDP reachability, NAT behavior, and latency to DERP relays
# Can I reach this peer?
tailscale ping my-server
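# What ACL rules apply? The packet filter this node received is part of the netmap dump
# (debug subcommands are unstable and may change between releases)
tailscale debug netmap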
# cloudflared health
cloudflared tunnel info my-tunnel
For broader connectivity issues, tailscale netcheck is the first stop: it reports NAT behavior, UDP reachability, and latency to the DERP relays.
Migration Strategies
From a legacy VPN to mesh:
- Run both in parallel for weeks. Move specific users / use cases to the mesh first.
- Migrate by service — staging → internal tools → prod, in order of risk.
- Don't decommission the old VPN until you've gone through at least one quarter without falling back.
- Plan for the disaster case — if the new system fails, what's the fallback?
From "I expose stuff publicly with a reverse proxy" to "Cloudflare Tunnel":
- Set up the tunnel for one service in parallel with the existing reverse proxy.
- Cut over DNS for that service.
- Verify a week of normal operation.
- Repeat per service.
- Decommission the reverse proxy.
Checklist
Production zero-trust networking checklist
- Identity provider integrated (Google Workspace, Okta, Microsoft Entra ID)
- MFA enforced at IdP for all users
- Default-deny ACLs; explicit allow rules per group
- Devices tagged by role; tag-restricted operations
- Auth keys are ephemeral or short-lived; no permanent keys
- Production device registration requires approval
- Cloudflare Tunnel runs as a service with HA replicas
- Every Cloudflare Tunnel route protected by Access policy
- Access policies require MFA
- Tailscale SSH or equivalent replaces traditional SSH keys
- Session recording for production SSH sessions
- Tailnet logs and Access logs streamed to SIEM
- Alert on: new device registrations, ACL denials, exit-node usage changes
- Periodic ACL review (quarterly)
- Documented migration plan from any legacy VPN
- Break-glass procedure when SaaS coordination server is unavailable