Steven's Knowledge

Best Practices

Production VPN / zero trust - key management, observability, scaling, security hardening, pitfalls

Networking sits in the critical path of every internal request. These practices keep a mesh VPN and zero-trust setup reliable in production.

Tie Network Access to Identity

The single highest-leverage principle. Network membership = IdP group membership.

  • Use SSO (Google Workspace, Okta, Microsoft Entra ID, formerly Azure AD) as the identity source.
  • Group membership in your IdP determines tailnet ACLs / Access policies.
  • Removing someone from the IdP removes their network access — don't manage a separate user list.

When the on-call rotation changes, you update an IdP group. The ACL refers to the group, so access follows as soon as the group change syncs to the network.
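
To make the group indirection concrete, here is a minimal Python sketch. It is a toy model, not Tailscale's or Cloudflare's evaluation logic: the group names, user addresses, rule shape, and the helpers can_reach and rotate_on_call are all hypothetical.

```python
# Hypothetical data: group names, users, and the rule shape are illustrative only.
idp_groups = {
    "group:on-call": {"alice@example.com", "bob@example.com"},
    "group:engineers": {"alice@example.com", "carol@example.com"},
}

# Each rule grants one IdP group access to devices carrying one tag.
rules = [
    {"src": "group:on-call", "dst": "tag:prod"},
    {"src": "group:engineers", "dst": "tag:staging"},
]


def can_reach(user: str, tag: str) -> bool:
    """Return True if any rule grants one of the user's groups access to tag."""
    return any(
        rule["dst"] == tag and user in idp_groups.get(rule["src"], set())
        for rule in rules
    )


def rotate_on_call(leaving: str, joining: str) -> None:
    """Simulate the IdP-side change; the rules themselves are never edited."""
    idp_groups["group:on-call"].discard(leaving)
    idp_groups["group:on-call"].add(joining)
```

Calling rotate_on_call("bob@example.com", "carol@example.com") changes who can reach tag:prod without touching rules, which is the property the section argues for.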

Auth Keys: Use Sparingly, Rotate Aggressively

Tailscale auth keys (and similar provisioning tokens) bypass interactive login. They're convenient and dangerous.

| Practice | Why |
| --- | --- |
| Ephemeral auth keys for CI / runners | Devices auto-deregister; one less stale node |
| Pre-approved auth keys for production servers | Bootstrap without console access |
| Short expiry (24-72h) for short-lived workloads | Limits blast radius if leaked |
| Reusable keys only for production fleet | And only with strict tag enforcement |
| Tag-locked auth keys | A leaked key can only enroll devices with specific tags |
| Rotate periodically | Even for long-lived keys |

Treat auth keys like API keys — store them in your secret manager, never commit them.
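
The practices above can be checked mechanically. The following Python sketch assumes you keep your own inventory of issued keys; the AuthKey record and its field names are hypothetical and do not mirror any vendor API type.

```python
from dataclasses import dataclass, field


@dataclass
class AuthKey:
    """Hypothetical inventory record for one provisioning key (not a real API type)."""
    name: str
    reusable: bool
    ephemeral: bool
    expiry_hours: int
    tags: list[str] = field(default_factory=list)


def audit(key: AuthKey, max_hours: int = 72) -> list[str]:
    """Return the practices from the table that this key violates."""
    findings = []
    if not key.tags:
        # A key without tags can enroll a device with no tag restrictions.
        findings.append("not tag-locked")
    if key.ephemeral and key.expiry_hours > max_hours:
        # Keys for short-lived workloads should expire quickly.
        findings.append(f"expiry exceeds {max_hours}h")
    return findings
```

Run audit over the inventory on a schedule and treat any non-empty result as a reason to reissue the key.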

Device Approval Workflow

For production-bound devices, require manual approval before they join the network:

  1. Device runs tailscale up.
  2. Admin gets notified.
  3. Admin approves in the Tailscale admin UI.
  4. Device joins with allowed tags.

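For CI / ephemeral workers, skip the manual step with pre-approved, tag-locked auth keys. Subnet routes and exit nodes that devices advertise can then be auto-approved by tag or group in the policy file: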

"autoApprovers": {
  "exitNode": ["group:admins"],
  "routes": {
    "10.0.0.0/16": ["tag:subnet-router"]
  }
}

SSH Through the Mesh

Tailscale SSH replaces traditional SSH key management:

  • No ~/.ssh/authorized_keys to maintain.
  • Identity from your IdP.
  • ACL controls who can ssh where.
  • Optional session recording for audit.
  • Optional check mode (re-auth before sensitive actions).

For production:

"ssh": [
  // On-call gets root only with re-auth in last 5 minutes
  {
    "action": "check",
    "src": ["group:on-call"],
    "dst": ["tag:prod"],
    "users": ["root"],
    "checkPeriod": "5m"
  },
  // Engineers can log in as any non-root user, no re-auth required
  {
    "action": "accept",
    "src": ["group:engineers"],
    "dst": ["tag:staging"],
    "users": ["autogroup:nonroot"]
  }
]

The check action prompts the user to re-authenticate before granting access — a step-up auth pattern.
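
The step-up decision itself is a simple freshness test. This Python sketch models the idea only; it is not Tailscale's implementation, and needs_reauth is a hypothetical helper whose default mirrors the 5-minute checkPeriod in the rule above.

```python
from datetime import datetime, timedelta


def needs_reauth(last_auth: datetime, now: datetime,
                 check_period: timedelta = timedelta(minutes=5)) -> bool:
    """True when the last successful login is older than the check period."""
    return now - last_auth > check_period
```

A shorter period means more prompts and a smaller window in which a hijacked session can reach root on production hosts.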

Observability

What to monitor:

| Signal | Why |
| --- | --- |
| Connection attempts denied by ACL | Wrong policy or attack? |
| New device registrations | Authorized? Tagged correctly? |
| Exit node usage | Who's routing through what; cost & policy |
| Subnet router traffic | Volume / latency to "behind the mesh" |
| DERP relay fallback rate | Direct connections failing? NAT issue? |
| MFA challenges per Access app | Spike = phishing? |

Tailscale can stream its configuration audit logs and network flow logs to S3 or a SIEM endpoint (log streaming). Cloudflare exports Access logs through Logpush. Pipe both into your SIEM / ELK.
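
One of the signals above, the DERP relay fallback rate, can be approximated per node from the output of tailscale status --json. The Python sketch below is hedged: it assumes the JSON has a top-level "Peer" map whose entries carry "Active" and "CurAddr" fields, with an empty CurAddr meaning no direct path. Verify those field names against your client version. relay_fallback_rate is a hypothetical helper.

```python
import json


def relay_fallback_rate(status_json: str) -> float:
    """Fraction of active peers with no direct path (empty CurAddr)."""
    # Assumed shape: {"Peer": {"<key>": {"Active": bool, "CurAddr": str, ...}}}
    peers = json.loads(status_json).get("Peer") or {}
    active = [p for p in peers.values() if p.get("Active")]
    if not active:
        return 0.0
    relayed = [p for p in active if not p.get("CurAddr")]
    return len(relayed) / len(active)
```

Feed it the captured command output on a schedule and export the result as a gauge; a sustained rise usually points at a NAT or firewall change rather than at individual peers.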

Scaling Considerations

Mesh VPN scales differently than perimeter VPN:

| Axis | Concern |
| --- | --- |
| Devices in the tailnet | ACL push time grows; coordination cost |
| Cross-device connections | Each pair needs key exchange |
| DERP / TURN relay traffic | Falls back when peer-to-peer fails; bandwidth cost |
| ACL complexity | Large ACL JSON gets unreadable; refactor |

For very large deployments (1,000+ devices), consider the Tailscale Enterprise tier or Twingate, which has different scaling characteristics. Headscale (self-hosted) struggles past a few hundred devices without tuning.

Most teams never hit these limits. The SaaS plans handle a mesh of roughly 5,000 devices comfortably.
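
The "each pair needs key exchange" concern grows quadratically, which a short calculation makes visible. This Python sketch computes the upper bound for a full mesh. Real tailnets typically sit well below it, because ACLs restrict which pairs may talk and connections are only set up when traffic flows. max_peer_pairs is a hypothetical helper.

```python
def max_peer_pairs(devices: int) -> int:
    """Upper bound on device pairs in a full mesh: n * (n - 1) / 2."""
    return devices * (devices - 1) // 2


for n in (100, 1_000, 5_000):
    print(f"{n:>5} devices -> up to {max_peer_pairs(n):>10,} pairs")
```

Going from 1,000 to 5,000 devices multiplies the device count by 5 but the pair bound by about 25, which is why tight ACLs also help with scale.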

Cloudflare Tunnel: Production

A few production practices:

  • Run cloudflared as a service / Docker container, not a foreground process. systemd or Docker for restart.
  • Multiple tunnel replicas for HA — cloudflared connects from multiple hosts; Cloudflare load-balances.
  • Don't put a single tunnel on a single dev laptop for anything important.
  • Restrict tunnel scope: the ingress config should be explicit, with no wildcards that expose unintended services.
  • Combine with Access policies for any sensitive route.
  • Monitor connection health — Cloudflare dashboard shows tunnel uptime.
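
The "explicit ingress, no wildcards" rule can be enforced before deploy. The Python sketch below assumes the ingress section of the cloudflared config has already been parsed into a list of dicts with hostname and service keys, and that the final rule is the catch-all. lint_ingress and the example hostnames are hypothetical.

```python
def lint_ingress(rules: list[dict]) -> list[str]:
    """Flag wildcard hostnames and catch-all rules that forward traffic."""
    if not rules:
        return ["no ingress rules defined"]
    problems = []
    *routed, catch_all = rules
    for rule in routed:
        host = rule.get("hostname", "")
        if not host:
            problems.append(f"rule for {rule.get('service')} has no hostname")
        elif "*" in host:
            problems.append(f"wildcard hostname: {host}")
    if catch_all.get("hostname"):
        problems.append("last rule must be a catch-all without hostname")
    elif not str(catch_all.get("service", "")).startswith("http_status:"):
        # A catch-all that proxies to a backend exposes it under any hostname.
        problems.append("catch-all forwards traffic instead of returning a status")
    return problems
```

A config that routes grafana.example.com to a local port and ends with service: http_status:404 passes; one that routes *.example.com or ends with a proxied catch-all is reported.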

For Kubernetes:

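# cloudflared as a Deployment
apiVersion: apps/v1
kind: Deployment
metadata: { name: cloudflared }
spec:
  replicas: 2
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels: { app: cloudflared }
  template:
    metadata:
      labels: { app: cloudflared }
    spec:
      containers:
        - name: cloudflared
          # Pin a specific version tag in production instead of latest
          image: cloudflare/cloudflared:latest
          args: ["tunnel", "--config", "/etc/cloudflared/config.yml", "run"]
          volumeMounts:
            - { name: config, mountPath: /etc/cloudflared }
            # Mount path is an assumption; it must match credentials-file in config.yml
            - { name: creds, mountPath: /etc/cloudflared-creds, readOnly: true }
      volumes:
        - name: config
          configMap: { name: cloudflared-config }
        - name: creds
          # Secret name is illustrative
          secret: { secretName: cloudflared-credentials }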

Two replicas, ConfigMap for the tunnel config, secret for the credentials.

Security Hardening

Practices that close common gaps:

Tailscale

  • Enable SSO with MFA at the IdP level.
  • Session limits: force re-auth every N days with Tailscale's key expiry setting.
  • Disable Tailscale Funnel unless you actively use it (it exposes a tailnet device to the public internet).
  • Tagged-device approval policy — require manual approval for new devices in production tags.
  • Restrict who can disable key expiry on a device: end users shouldn't be able to extend their session indefinitely.
  • Two-person rule for changing ACLs (via Git PR, not the admin UI directly).
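
Once ACL changes go through pull requests, a CI check can reject the most dangerous regression before a reviewer even looks. The Python sketch below assumes the policy file is plain JSON (comments stripped) with an "acls" list of rules carrying action, src, and dst. allow_all_rules is a hypothetical helper.

```python
import json


def allow_all_rules(policy_text: str) -> list[int]:
    """Return indexes of accept rules that match every source and destination."""
    policy = json.loads(policy_text)
    hits = []
    for i, rule in enumerate(policy.get("acls", [])):
        if (rule.get("action") == "accept"
                and "*" in rule.get("src", [])
                and "*:*" in rule.get("dst", [])):
            hits.append(i)
    return hits
```

Fail the pipeline when the returned list is non-empty; the second reviewer can then focus on intent rather than on spotting wildcards.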

Cloudflare Tunnel + Access

  • All app routes behind Access — no "I'll add Access later" for any sensitive route.
  • Service tokens for service-to-service through Access; not "open it up for the IP."
  • MFA required on all Access policies.
  • WAF in front of Access as an additional layer of defense.
  • Logs to SIEM, not just the Cloudflare dashboard.

Key Management

In a mesh VPN, each device generates its own private key, and only the public key is shared through the coordination server. Whoever holds the private key holds the device's identity.

  • Keep node keys in OS-protected storage; Tailscale handles this itself, using the platform keychain where one is available.
  • Rotate when devices change owners.
  • Revoke immediately when devices are lost or stolen — Tailscale admin UI → delete device.

Common Pitfalls

| Pitfall | Symptom | Fix |
| --- | --- | --- |
| Default-allow ACLs | Anyone in tailnet can ssh anywhere | Tighten ACLs; default-deny |
| Auth keys committed to git | Anyone with the repo can join | Treat as secrets |
| Cloudflare Tunnel without Access | "Internal app" publicly reachable | Add Access policy |
| Untagged production devices | Can't write strict ACLs | Tag every device |
| Forgot to enable MagicDNS | Connection strings break randomly | Enable in admin UI |
| Tailscale SSO without MFA | Phishing → tailnet access | Enforce MFA at IdP |
| Sharing tailnets across orgs | Permission soup | Federation or separate tailnets |
| Mesh + traditional VPN coexisting forever | Maintenance burden | Migration plan; sunset the old VPN |
| No alerting on new device registration | Compromise goes unnoticed | Log + alert |
| Long-lived auth keys never rotated | Stale credentials | Periodic rotation |
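
The "auth keys committed to git" pitfall is cheap to catch with a pre-commit or CI scan. The Python sketch below assumes Tailscale keys start with "tskey-" followed by a type segment such as auth; the exact pattern is an approximation, and find_keys is a hypothetical helper. A dedicated secret scanner is the more thorough option.

```python
import re

# Assumption: keys look like "tskey-<type>-<id>-<secret>"; adjust if the format differs.
KEY_PATTERN = re.compile(r"tskey-[a-z]+-[A-Za-z0-9-]+")


def find_keys(text: str) -> list[tuple[int, str]]:
    """Return (line number, redacted match) for every key-like string."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in KEY_PATTERN.finditer(line):
            # Keep only a short prefix so the report does not leak the key again.
            hits.append((lineno, match.group()[:14] + "..."))
    return hits
```

Any hit should be treated as a leak: revoke the key first, then clean up the repository history.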

When Things Go Wrong

Quick diagnosis:

# What's my tailscale state?
tailscale status
tailscale ip
tailscale netcheck         # reports UDP reachability, NAT behavior, and latency to DERP relays

# Can I reach this peer?
tailscale ping my-server

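# What ACL rules apply? There is no dedicated CLI command; the netmap dump should
# include the packet filter this node received (or use the admin console's rule preview)
tailscale debug netmap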

# cloudflared health
cloudflared tunnel info my-tunnel

For broader connectivity issues, tailscale netcheck is the first stop: it reports UDP reachability, NAT mapping behavior, port-mapping support, and latency to each DERP relay region.

Migration Strategies

From a legacy VPN to mesh:

  1. Run both in parallel for several weeks. Introduce the mesh first for specific users and use cases.
  2. Migrate by service — staging → internal tools → prod, in order of risk.
  3. Don't decommission the old VPN until you've gone through at least one quarter without falling back.
  4. Plan for the disaster case — if the new system fails, what's the fallback?

From "I expose stuff publicly with a reverse proxy" to "Cloudflare Tunnel":

  1. Set up the tunnel for one service in parallel with the existing reverse proxy.
  2. Cut over DNS for that service.
  3. Verify a week of normal operation.
  4. Repeat per service.
  5. Decommission the reverse proxy.

Checklist

Production zero-trust networking checklist

  • Identity provider integrated (Google Workspace, Okta, Microsoft Entra ID)
  • MFA enforced at IdP for all users
  • Default-deny ACLs; explicit allow rules per group
  • Devices tagged by role; tag-restricted operations
  • Auth keys are ephemeral or short-lived; no permanent keys
  • Production device registration requires approval
  • Cloudflare Tunnel runs as a service with HA replicas
  • Every Cloudflare Tunnel route protected by Access policy
  • Access policies require MFA
  • Tailscale SSH or equivalent replaces traditional SSH keys
  • Session recording for production SSH sessions
  • Tailnet logs and Access logs streamed to SIEM
  • Alert on: new device registrations, ACL denials, exit-node usage changes
  • Periodic ACL review (quarterly)
  • Documented migration plan from any legacy VPN
  • Break-glass procedure when SaaS coordination server is unavailable
