Steven's Knowledge

Best Practices

Production service mesh - rollout strategy, sizing, observability, debugging, upgrade path, and when to walk away

A mesh is a platform commitment. These are the patterns that separate "we installed Istio" from "we operate a mesh in production."

Roll Out by Namespace, Not All at Once

Both Istio and Linkerd let you mesh selectively: Istio through the istio-injection namespace label, Linkerd through the linkerd.io/inject annotation on a namespace or workload. Never mesh the whole cluster on day one.

A sensible sequence:

  1. Pick one stateless HTTP service.
  2. Mesh it and compare meshed vs. unmeshed metrics (latency, error rate, resource usage) for a week.
  3. Mesh that namespace; observe under real traffic.
  4. Roll forward namespace by namespace. Stop and investigate if anything regresses.

Database namespaces, infra DaemonSets, and anything that does not speak plain request/response HTTP (Kafka clients, long-lived gRPC streams) deserve extra caution.
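
As a sketch of the namespace-level opt-in described above: Istio reads the istio-injection label and Linkerd reads the linkerd.io/inject annotation. The namespace name checkout is hypothetical, and only the document matching your mesh should be applied.

```yaml
# Istio: label the namespace so newly created pods get a sidecar injected.
apiVersion: v1
kind: Namespace
metadata:
  name: checkout              # hypothetical namespace name
  labels:
    istio-injection: enabled
---
# Linkerd: annotate the namespace instead (use this variant on a Linkerd mesh).
apiVersion: v1
kind: Namespace
metadata:
  name: checkout              # hypothetical namespace name
  annotations:
    linkerd.io/inject: enabled
```

In both meshes injection happens at pod creation, so pods that are already running stay unmeshed until they are restarted.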

Don't Mesh Stateful or Latency-Sensitive Workloads Blindly

  • Stateful workloads (Postgres, Kafka, etcd, Redis): meshing them rarely helps, and proxy-level retries on non-idempotent writes can duplicate or corrupt data. Most teams mesh the clients, not the database pods.
  • Latency-critical paths: the sidecar adds 1-3 ms per hop. Against a P99 budget of 5 ms, a single hop consumes 20-60% of that budget.
  • Long-lived connections (HTTP/2 streams, websockets): meshes proxy them fine, but trace and metric data can mislead.

When in doubt: bench it unmeshed, then meshed, with realistic load.

Sizing the Sidecar

Every meshed pod gets a sidecar. 100 services × 5 replicas each is 500 sidecars; at 50 MB each (Istio classic), that's 25 GB of memory just for proxies.

Setting | What to tune
--- | ---
sidecar.istio.io/proxyCPU, sidecar.istio.io/proxyMemory | Right-size per workload; small services need much less
Linkerd proxy resources | Set requests/limits via annotations
Ambient mode (Istio) | Removes the per-pod sidecar; the cost moves to per-node ztunnel proxies and optional waypoint proxies
Sidecar CRD (Istio) | Restricts what config is pushed to each sidecar; a large memory win in big clusters
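
A minimal sketch of per-workload proxy sizing through pod-template annotations. The workload name, image, and resource values are illustrative assumptions, not recommendations; measure real proxy usage before picking numbers.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: small-api                       # hypothetical workload
  namespace: production
spec:
  replicas: 2
  selector:
    matchLabels: { app: small-api }
  template:
    metadata:
      labels: { app: small-api }
      annotations:
        # Istio sidecar requests and limits (illustrative values).
        sidecar.istio.io/proxyCPU: "50m"
        sidecar.istio.io/proxyMemory: "64Mi"
        sidecar.istio.io/proxyCPULimit: "200m"
        sidecar.istio.io/proxyMemoryLimit: "128Mi"
        # Linkerd equivalents; use these instead on a Linkerd mesh.
        # config.linkerd.io/proxy-cpu-request: "50m"
        # config.linkerd.io/proxy-memory-request: "64Mi"
        # config.linkerd.io/proxy-cpu-limit: "200m"
        # config.linkerd.io/proxy-memory-limit: "128Mi"
    spec:
      containers:
        - name: app
          image: registry.example.com/small-api:1.0.0   # placeholder image
```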

In big clusters the istiod / control-plane sizing matters more than the sidecars. Watch its CPU and config-push latency.
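
One way to cut both sidecar memory and control-plane push load is a namespace-wide Sidecar resource that limits which hosts each proxy learns about. This is a sketch that assumes workloads in production only call services in their own namespace and in istio-system; older Istio releases serve this kind under networking.istio.io/v1beta1.

```yaml
apiVersion: networking.istio.io/v1
kind: Sidecar
metadata:
  name: default              # one namespace-wide default, no workloadSelector
  namespace: production
spec:
  egress:
    - hosts:
        - "./*"              # services in the same namespace
        - "istio-system/*"   # control plane and shared gateways
```

Any cross-namespace dependency that is not listed becomes unreachable through the proxy's service registry, so roll this out with the same namespace-by-namespace caution as the mesh itself.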

Upgrade Strategy

Meshes touch every meshed pod. Plan upgrades like data migrations.

  • Read the release notes. Both projects have minor versions that change CRDs or default behavior.
  • Test in a non-prod cluster first with realistic workload.
  • Upgrade control plane first, then data plane (proxy) versions.
  • Rolling-restart workloads so they pick up the new sidecars. Make sure your PDBs and maxUnavailable settings can absorb the churn.
  • Pin sidecar version if you need to delay proxy upgrades.

Istio's revision-based (canary) upgrades and Linkerd's upgrade flow (linkerd upgrade piped into kubectl apply --prune) make this tractable but not automatic.
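
A sketch of two pieces that support this upgrade flow: pinning a namespace to a specific Istio control-plane revision, and a PodDisruptionBudget that bounds how many api pods can be evicted at once. The revision name canary and the app label are hypothetical.

```yaml
# Pin the namespace to one istiod revision; pods pick up that revision's
# proxy on their next restart. The istio-injection label must not be set
# alongside istio.io/rev, because it takes precedence.
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    istio.io/rev: canary       # hypothetical revision name
---
# Bound voluntary disruptions (node drains, evictions) during the upgrade.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
  namespace: production
spec:
  maxUnavailable: 1
  selector:
    matchLabels: { app: api }
```

Note that a PDB governs evictions; the pace of a Deployment's rolling restart is set by its own rollingUpdate maxUnavailable and maxSurge values, so both need to be checked.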

Security: Don't Stop at mTLS

mTLS gives you encrypted transport and verified identities. By itself it doesn't say who's allowed to call whom. Always add authorization policy:

Istio

apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: api-only-from-web
  namespace: production
spec:
  selector:
    matchLabels: { app: api }
  action: ALLOW
  rules:
    - from:
        - source:
            principals: ["cluster.local/ns/production/sa/web"]

Linkerd

apiVersion: policy.linkerd.io/v1beta1
kind: Server
metadata:
  name: api-server
  namespace: production
spec:
  podSelector:
    matchLabels: { app: api }
  port: 8080
  proxyProtocol: HTTP/1
---
apiVersion: policy.linkerd.io/v1alpha1   # AuthorizationPolicy is served under v1alpha1
kind: AuthorizationPolicy
metadata:
  name: api-only-from-web
  namespace: production
spec:
  targetRef:
    group: policy.linkerd.io
    kind: Server
    name: api-server
  requiredAuthenticationRefs:
    - group: policy.linkerd.io
      kind: MeshTLSAuthentication
      name: web-sa
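
The Linkerd policy above refers to a MeshTLSAuthentication named web-sa that is not shown. A minimal sketch of it, assuming the web workload runs under a ServiceAccount called web in the production namespace:

```yaml
apiVersion: policy.linkerd.io/v1alpha1
kind: MeshTLSAuthentication
metadata:
  name: web-sa
  namespace: production
spec:
  identityRefs:
    - kind: ServiceAccount
      name: web                # assumed ServiceAccount of the web workload
```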

Start in audit / permissive mode, see what would have been denied, then flip to enforce.
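
For Istio, one way to get this "would have been denied" view is the dry-run annotation on the policy, which Istio documents as experimental. The sketch below reuses the policy from earlier in this section; recent Linkerd releases document an audit policy mode as well, so check the docs for the version you run.

```yaml
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: api-only-from-web
  namespace: production
  annotations:
    istio.io/dry-run: "true"   # evaluate and report, but do not enforce
spec:
  selector:
    matchLabels: { app: api }
  action: ALLOW
  rules:
    - from:
        - source:
            principals: ["cluster.local/ns/production/sa/web"]
```

Removing the annotation (or setting it to "false") switches the same policy to enforcement.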

Observability

A mesh emits a lot. Wire it into what you already run:

Signal | Where it goes
--- | ---
Metrics | Prometheus; meshes ship dashboards out of the box
Traces | Jaeger / Tempo / a SaaS; the proxies generate spans, but the application must forward the trace headers (B3 / W3C Trace Context) for traces to join up
Access logs | Your logging pipeline; see ELK
Topology | Kiali (Istio), Linkerd Viz dashboard

Don't double-instrument. If the mesh emits the L7 metrics, drop the app-level HTTP middleware that does the same. One source of truth.
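
On Istio, mesh-wide access logging and trace sampling can be switched on with a Telemetry resource. This is a sketch under assumptions: the built-in envoy access-log provider is used, a tracing provider is already configured in the mesh config, and the 1% sampling rate is only an example. Older Istio releases serve this kind under telemetry.istio.io/v1alpha1.

```yaml
apiVersion: telemetry.istio.io/v1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system      # root namespace, so it applies mesh-wide
spec:
  accessLogging:
    - providers:
        - name: envoy          # built-in Envoy access log to stdout
  tracing:
    - randomSamplingPercentage: 1.0   # example rate; tune to your traffic volume
```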

Debugging Skills That Pay Off

When something breaks in a mesh, you need to look at proxies, not just apps:

# Istio
istioctl proxy-config cluster <pod>         # what does this proxy think upstreams look like?
istioctl proxy-config routes <pod>          # how is it routing?
istioctl analyze                            # static analysis of config
istioctl pc log <pod> --level debug         # crank up Envoy logging

# Linkerd
linkerd viz tap deploy/web                  # live L7 traffic
linkerd viz stat -n my-ns deploy            # success rates and latency
linkerd diagnostics proxy-metrics deploy/web  # raw proxy stats for one workload
linkerd check                               # everything-is-fine sanity check

A common pattern: the app reports "connection refused" while the proxy reports that it has no upstream endpoints, because a DestinationRule subset's labels match no pods. Always check the proxy view.
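
A sketch of the kind of DestinationRule involved, with hypothetical host and label values. The subset only has endpoints if some pod behind the api service actually carries the listed labels.

```yaml
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: api
  namespace: production
spec:
  host: api.production.svc.cluster.local
  subsets:
    - name: v1
      labels:
        version: v1   # must match a label on the api pods; a typo here leaves the subset empty
```

Comparing these labels against the pod labels, and then against the output of istioctl proxy-config cluster for the calling pod, usually locates the mismatch quickly.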

Cost Awareness

Meshes are expensive on three axes:

  1. Compute: one sidecar per pod (services × replicas), plus the control plane.
  2. Latency — 1-3 ms per call adds up across chains of microservices.
  3. People — someone has to learn Envoy, debug xDS, manage upgrades.

If you have a small team and a handful of services, the simpler answer is often:

  • Kubernetes NetworkPolicies for L3/L4 segmentation.
  • An API Gateway for north-south.
  • App-level libraries for retries and timeouts.
  • mTLS via cert-manager + an internal CA for the few services that truly need it.
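
As a sketch of the first alternative, a plain Kubernetes NetworkPolicy can express "only web may call api" at L3/L4 without any mesh. The labels and port mirror the authorization examples earlier on this page, and the policy only takes effect if the cluster's CNI plugin enforces NetworkPolicies.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-allow-web
  namespace: production
spec:
  podSelector:
    matchLabels: { app: api }      # policy applies to the api pods
  policyTypes: [Ingress]
  ingress:
    - from:
        - podSelector:
            matchLabels: { app: web }   # only web pods in this namespace
      ports:
        - protocol: TCP
          port: 8080
```

Unlike the mesh policies, this matches on pod labels and IPs rather than on cryptographic workload identity.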

Adopt a mesh when the cross-cutting concerns are causing real pain and you have the platform-team capacity to own it.

When to Pull the Plug

Signs you should reconsider:

  • Mesh-related incidents outnumber app-related ones.
  • Nobody on call understands the proxy stack well enough to debug it.
  • Latency budget for end users is being eaten by sidecar hops.
  • Feature usage is "just mTLS" — which you could get from a cheaper solution.

A common landing: migrate to Istio Ambient or roll back to NetworkPolicies + per-service mTLS. Both are legitimate.

Checklist

Production service-mesh checklist

  • Rolled out namespace-by-namespace, not cluster-wide
  • Stateful workloads either unmeshed or carefully validated
  • Sidecar resource requests/limits set per workload class
  • Authorization policies in audit mode, then enforced
  • Control plane HA (2+ replicas) and right-sized
  • Cert rotation tested (kill the issuer, watch what happens)
  • Upgrade runbook with rolling-restart plan and PDBs
  • Mesh metrics → Prometheus; dashboards in Grafana
  • Trace context propagation verified end-to-end
  • Egress traffic policy explicit (don't leave it default-open)
  • On-call team trained on istioctl proxy-config / linkerd viz debugging
  • Documented "how to mesh / unmesh a workload" for service teams
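
For the egress item, a minimal Istio sketch: switch outbound traffic to registry-only and then allow individual external hosts explicitly. The host api.payments.example.com is a placeholder, and the IstioOperator form assumes an istioctl-based install; Helm installs set the same meshConfig value through chart values.

```yaml
# Block outbound traffic to hosts the mesh does not know about.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    outboundTrafficPolicy:
      mode: REGISTRY_ONLY
---
# Explicitly allow one external dependency.
apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: allow-payments-api
  namespace: production
spec:
  hosts: ["api.payments.example.com"]   # placeholder external host
  location: MESH_EXTERNAL
  resolution: DNS
  ports:
    - number: 443
      name: tls
      protocol: TLS
```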
