Best Practices

Production VPN / zero trust - key management, observability, scaling, security hardening, pitfalls

Networking sits in the critical path of every internal request. These practices keep a mesh VPN and zero trust setup reliable in production.

Tie Network Access to Identity

The single highest-leverage principle. Network membership = IdP group membership.

Use SSO (Google Workspace, Okta, Microsoft Entra ID, formerly Azure AD) as the identity source.
Group membership in your IdP determines tailnet ACLs / Access policies.
Removing someone from the IdP removes their network access — don't manage a separate user list.

When the on-call rotation changes, you update an IdP group. The ACL refers to the group, so access follows as soon as the group change syncs from the IdP.
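One way to hold the policy to this principle is to lint it for rules that name individual users instead of groups. The following is a minimal sketch, assuming the policy file has already been parsed into a Python dict with an `acls` list of `src` / `dst` rules. The function name, the "contains @" heuristic, and the sample policy are illustrative assumptions, not part of any Tailscale tooling.

```python
def find_user_pinned_rules(policy):
    """Return (rule_index, source) pairs where an ACL source is an individual user.

    Assumption: a source containing "@" that does not start with "group:" or
    "autogroup:" is a user login such as "alice@example.com".
    """
    findings = []
    for index, rule in enumerate(policy.get("acls", [])):
        for source in rule.get("src", []):
            is_group = source.startswith(("group:", "autogroup:"))
            if "@" in source and not is_group:
                findings.append((index, source))
    return findings


# Hypothetical policy: rule 0 follows the principle, rule 1 pins a single user.
sample_policy = {
    "acls": [
        {"action": "accept", "src": ["group:on-call"], "dst": ["tag:prod:22"]},
        {"action": "accept", "src": ["alice@example.com"], "dst": ["tag:prod:22"]},
    ]
}

if __name__ == "__main__":
    for index, source in find_user_pinned_rules(sample_policy):
        print(f"rule {index}: '{source}' should be replaced by a group")
```

Run in CI against the policy repository, a check like this turns "don't manage a separate user list" into something a pull request can fail on.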

Auth Keys: Use Sparingly, Rotate Aggressively

Tailscale auth keys (and similar provisioning tokens) bypass interactive login. They're convenient and dangerous.

Practice	Why
Ephemeral auth keys for CI / runners	Devices auto-deregister; one less stale node
Pre-approved auth keys for production servers	Bootstrap without console access
Short expiry (24-72h) for short-lived workloads	Limits blast radius if leaked
Reusable keys only for production fleet	And only with strict tag enforcement
Tag-locked auth keys	A leaked key can only enroll devices with specific tags
Rotate periodically	Even for long-lived keys

Treat auth keys like API keys — store them in your secret manager, never commit them.
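The table above can be turned into a periodic audit. This is a sketch under stated assumptions: the key records are hypothetical dicts (`tags`, `created`, `expires`), not the shape returned by any real API, and the 72-hour default reflects the upper end of the short-lived range in the table.

```python
from datetime import datetime, timedelta, timezone


def audit_auth_key(key, now, max_lifetime=timedelta(hours=72)):
    """Return a list of policy problems for one (hypothetical) auth key record."""
    problems = []
    if not key["tags"]:
        # An untagged key can enroll a device under any identity it is allowed.
        problems.append("not tag-locked")
    if key["expires"] - key["created"] > max_lifetime:
        problems.append("lifetime exceeds limit")
    if key["expires"] <= now:
        # Expired keys are useless but often linger in secret managers.
        problems.append("expired")
    return problems


# Illustrative data only.
NOW = datetime(2024, 1, 10, tzinfo=timezone.utc)
ci_key = {
    "tags": ["tag:ci"],
    "created": NOW - timedelta(hours=12),
    "expires": NOW + timedelta(hours=12),
}
stale_key = {
    "tags": [],
    "created": NOW - timedelta(days=120),
    "expires": NOW - timedelta(days=30),
}

if __name__ == "__main__":
    for name, key in [("ci_key", ci_key), ("stale_key", stale_key)]:
        print(name, audit_auth_key(key, NOW))
```

For long-lived production keys, pass a larger `max_lifetime` that matches your rotation schedule rather than dropping the check.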

Device Approval Workflow

For production-bound devices, require manual approval before they join the network:

Device runs tailscale up.
Admin gets notified.
Admin approves in the Tailscale admin UI.
Device joins with allowed tags.

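For CI / ephemeral workers, skip manual device approval by enrolling with pre-approved, tag-locked auth keys. Subnet routes and exit nodes advertised by those devices can be auto-approved in the policy file: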

"autoApprovers": {
  "exitNode": ["group:admins"],
  "routes": {
    "10.0.0.0/16": ["tag:subnet-router"]
  }
}

SSH Through the Mesh

Tailscale SSH replaces traditional SSH key management:

No ~/.ssh/authorized_keys to maintain.
Identity from your IdP.
ACL controls who can ssh where.
Optional session recording for audit.
Optional check mode (re-auth before sensitive actions).

For production:

"ssh": [
  // On-call gets root only with re-auth in last 5 minutes
  {
    "action": "check",
    "src": ["group:on-call"],
    "dst": ["tag:prod"],
    "users": ["root"],
    "checkPeriod": "5m"
  },
  // Engineers get their own user, no re-auth required
  {
    "action": "accept",
    "src": ["group:engineers"],
    "dst": ["tag:staging"],
    "users": ["autogroup:nonroot"]
  }
]

The check action prompts the user to re-authenticate before granting access — a step-up auth pattern.
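The step-up pattern behind a check period reduces to a freshness test on the last authentication. The sketch below illustrates that logic only; it is not Tailscale's implementation, and the function and parameter names are assumptions.

```python
from datetime import datetime, timedelta, timezone


def needs_reauth(last_auth, now, check_period=timedelta(minutes=5)):
    """Return True when the user must re-authenticate before access is granted.

    `last_auth` is the time of the most recent successful authentication,
    or None if the user has never authenticated.
    """
    if last_auth is None:
        return True
    return now - last_auth > check_period


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    recent = now - timedelta(minutes=3)
    old = now - timedelta(minutes=10)
    print(needs_reauth(recent, now))  # authenticated within the period
    print(needs_reauth(old, now))     # authentication is stale
```

A short period such as five minutes suits root on production; a longer period trades some assurance for fewer prompts.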

Observability

What to monitor:

Signal	Why
Connection attempts denied by ACL	Wrong policy or attack?
New device registrations	Authorized? Tagged correctly?
Exit node usage	Who's routing through what; cost & policy
Subnet router traffic	Volume / latency to "behind the mesh"
DERP relay fallback rate	Direct connections failing? NAT issue?
MFA challenges per Access app	Spike = phishing?

Tailscale can stream configuration audit logs and network flow logs to a SIEM or to S3-compatible storage (the log streaming feature). Cloudflare Access logs can be exported with Logpush. Pipe both to your SIEM / ELK.
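As an example of turning one of these signals into an alert, the sketch below computes a relay fallback rate. The record format (a `path` field that is either `"direct"` or `"relay"`) is a hypothetical normalized shape you would produce from your own logs, and the 20% threshold is an arbitrary starting point.

```python
def relay_fallback_rate(connections):
    """Fraction of connections that used a relay instead of a direct path."""
    if not connections:
        return 0.0
    relayed = sum(1 for conn in connections if conn["path"] == "relay")
    return relayed / len(connections)


def should_alert(connections, threshold=0.2):
    """Return True when the relay fallback rate exceeds the threshold."""
    return relay_fallback_rate(connections) > threshold


# Illustrative records: one of four connections fell back to a relay.
sample_connections = [
    {"peer": "db-1", "path": "direct"},
    {"peer": "db-2", "path": "direct"},
    {"peer": "ci-7", "path": "relay"},
    {"peer": "web-3", "path": "direct"},
]

if __name__ == "__main__":
    rate = relay_fallback_rate(sample_connections)
    print(f"relay fallback rate: {rate:.0%}, alert: {should_alert(sample_connections)}")
```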

Scaling Considerations

Mesh VPN scales differently than perimeter VPN:

Axis	Concern
Devices in the tailnet	ACL push time grows; coordination cost
Cross-device connections	Each pair needs key exchange
DERP / TURN relay traffic	Falls back when peer-to-peer fails — bandwidth cost
ACL complexity	Large ACL JSON gets unreadable; refactor

For very large deployments (1,000+ devices), consider the Tailscale Enterprise tier, or Twingate, which has different scaling characteristics. Headscale (self-hosted) can struggle past a few hundred devices without tuning.

Most teams don't hit these limits. A mesh is typically comfortable at ~5,000 devices on the SaaS plans.
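The "each pair needs key exchange" concern grows quadratically with tailnet size. A quick calculation makes the scale concrete; treat the result as an upper bound, since only pairs that the ACL permits and that actually communicate matter in practice.

```python
def potential_pairs(device_count):
    """Number of distinct device pairs in a full mesh: n * (n - 1) / 2."""
    if device_count < 0:
        raise ValueError("device_count must be non-negative")
    return device_count * (device_count - 1) // 2


if __name__ == "__main__":
    for count in (100, 1_000, 5_000):
        print(f"{count} devices -> {potential_pairs(count)} potential pairs")
```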

Cloudflare Tunnel: Production

A few production practices:

Run cloudflared as a service or Docker container, not a foreground process. Let systemd or a Docker restart policy bring it back after a crash.
Multiple tunnel replicas for HA: run cloudflared for the same tunnel on multiple hosts, and Cloudflare spreads traffic across the healthy connections.
Don't put a single tunnel on a single dev laptop for anything important.
Restrict tunnel scope — ingress config should be explicit, no wildcards that expose unintended services.
Combine with Access policies for any sensitive route.
Monitor connection health — Cloudflare dashboard shows tunnel uptime.
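The "restrict tunnel scope" practice can be checked mechanically. The sketch below assumes the ingress section of the cloudflared config has already been loaded into a list of dicts with `hostname` and `service` keys; the function name and sample rules are illustrative. It flags wildcard hostnames and a missing catch-all rule at the end of the list.

```python
def lint_ingress(rules):
    """Return warnings for a cloudflared-style list of ingress rules."""
    warnings = []
    for index, rule in enumerate(rules):
        hostname = rule.get("hostname", "")
        if "*" in hostname:
            warnings.append(f"rule {index}: wildcard hostname '{hostname}'")
    # The final rule should match everything, so it must not set a hostname.
    if not rules or "hostname" in rules[-1]:
        warnings.append("last rule should be a catch-all without a hostname")
    return warnings


# Hypothetical ingress rules: the first one is broader than intended.
sample_ingress = [
    {"hostname": "*.internal.example.com", "service": "http://localhost:8080"},
    {"hostname": "grafana.example.com", "service": "http://localhost:3000"},
    {"service": "http_status:404"},
]

if __name__ == "__main__":
    for warning in lint_ingress(sample_ingress):
        print(warning)
```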

For Kubernetes:

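# cloudflared as a Deployment
apiVersion: apps/v1
kind: Deployment
metadata: { name: cloudflared }
spec:
  replicas: 2
  selector:
    matchLabels: { app: cloudflared }   # required by apps/v1; must match the pod labels
  template:
    metadata:
      labels: { app: cloudflared }
    spec:
      containers:
        - name: cloudflared
          image: cloudflare/cloudflared:latest   # pin a specific version tag in production
          args: ["tunnel", "--config", "/etc/cloudflared/config.yml", "run"]
          volumeMounts:
            - { name: config, mountPath: /etc/cloudflared }
            # credentials-file in config.yml should point into this mount
            - { name: creds, mountPath: /etc/cloudflared-creds, readOnly: true }
      volumes:
        - name: config
          configMap: { name: cloudflared-config }
        - name: creds
          secret: { secretName: cloudflared-credentials }   # assumed Secret name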

Two replicas, ConfigMap for the tunnel config, secret for the credentials.

Security Hardening

Practices that close common gaps:

Tailscale

Enable SSO with MFA at the IdP level.
Session limits: force re-auth every N days with Tailscale's key expiry setting.
Disable Tailscale Funnel unless you actively use it (it exposes a tailnet device to the public internet).
Tagged-device approval policy — require manual approval for new devices in production tags.
Don't disable key expiry on end-user devices: with expiry off, a device never has to re-authenticate, which extends the session indefinitely.
Two-person rule for changing ACLs (via Git PR, not the admin UI directly).

Cloudflare Tunnel + Access

All app routes behind Access — no "I'll add Access later" for any sensitive route.
Service tokens for service-to-service through Access; not "open it up for the IP."
MFA required on all Access policies.
WAF in front of Access as an additional layer.
Logs to SIEM, not just the Cloudflare dashboard.

Key Management

In a mesh VPN, each device holds its own private key, and that key is the device's identity. Lose the key → lose the device's identity.

OS keychain storage (Tailscale does this by default).
Rotate when devices change owners.
Revoke immediately when devices are lost or stolen — Tailscale admin UI → delete device.

Common Pitfalls

Pitfall	Symptom	Fix
Default-allow ACLs	Anyone in tailnet can ssh anywhere	Tighten ACLs; default-deny
Auth keys committed to git	Anyone with the repo can join	Treat as secrets
Cloudflare Tunnel without Access	"Internal app" publicly reachable	Add Access policy
Untagged production devices	Can't write strict ACLs	Tag every device
Forgot to enable MagicDNS	Hostname-based connection strings fail to resolve	Enable in admin UI
Tailscale SSO without MFA	Phishing → tailnet access	Enforce MFA at IdP
Sharing tailnets across orgs	Permission soup	Federation or separate tailnets
Mesh + traditional VPN coexisting forever	Maintenance burden	Migration plan; sunset the old VPN
No alerting on new device registration	Compromise goes unnoticed	Log + alert
Long-lived auth keys never rotated	Stale credentials	Periodic rotation
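The first pitfall in the table is easy to detect automatically. The sketch below assumes the policy is available as a parsed dict and looks for an accept rule that lets any source reach any destination and port (`"*"` to `"*:*"`). The function name and sample policy are illustrative.

```python
def find_allow_all_rules(policy):
    """Return indexes of ACL rules that accept any source to any destination."""
    findings = []
    for index, rule in enumerate(policy.get("acls", [])):
        if rule.get("action") != "accept":
            continue
        if "*" in rule.get("src", []) and "*:*" in rule.get("dst", []):
            findings.append(index)
    return findings


# Hypothetical policy: rule 0 is the allow-all rule, rule 1 is scoped.
sample_acl_policy = {
    "acls": [
        {"action": "accept", "src": ["*"], "dst": ["*:*"]},
        {"action": "accept", "src": ["group:engineers"], "dst": ["tag:staging:443"]},
    ]
}

if __name__ == "__main__":
    for index in find_allow_all_rules(sample_acl_policy):
        print(f"rule {index} allows everything; replace it with scoped rules")
```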

When Things Go Wrong

Quick diagnosis:

# What's my tailscale state?
tailscale status
tailscale ip
tailscale netcheck         # reports NAT behavior, UDP reachability, and latency to DERP relays

# Can I reach this peer?
tailscale ping my-server

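# What ACL rules apply? Check the rule preview in the admin console, or dump
# the network map this node received (debug output is unstable and may change).
tailscale debug netmap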

# cloudflared health
cloudflared tunnel info my-tunnel

For broader connectivity issues, tailscale netcheck is the first stop: it reports UDP reachability, NAT behavior, port-mapping support, and latency to each DERP relay region.

Migration Strategies

From a legacy VPN to mesh:

Run both in parallel for several weeks. Start the mesh with specific users or use cases.
Migrate by service — staging → internal tools → prod, in order of risk.
Don't decommission the old VPN until you've gone through at least one quarter without falling back.
Plan for the disaster case — if the new system fails, what's the fallback?

From "I expose stuff publicly with a reverse proxy" to "Cloudflare Tunnel":

Set up the tunnel for one service in parallel with the existing reverse proxy.
Cut over DNS for that service.
Verify a week of normal operation.
Repeat per service.
Decommission the reverse proxy.

Checklist
