Best Practices
Production service mesh - rollout strategy, sizing, observability, debugging, upgrade path, and when to walk away
A mesh is a platform commitment. These are the patterns that separate "we installed Istio" from "we operate a mesh in production."
Roll Out by Namespace, Not All at Once
Both Istio and Linkerd let you mesh selectively — by namespace label (Istio) or annotation (Linkerd inject). Never mesh the whole cluster on day one.
A sensible sequence:
- Pick one stateless HTTP service.
- Mesh it, compare meshed vs unmeshed metrics for a week.
- Mesh that namespace; observe under real traffic.
- Roll forward namespace by namespace. Stop and investigate if anything regresses.
Database namespaces, infra DaemonSets, and anything using non-HTTP protocols (Kafka clients, gRPC streaming with quirks) deserve extra caution.
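As a rough illustration of selective meshing, the manifests below opt one namespace into injection and keep a single database workload out of it. The namespace names and the `orders-db` StatefulSet are hypothetical. `istio-injection`, `sidecar.istio.io/inject`, and `linkerd.io/inject` are the usual opt-in and opt-out switches, but confirm them against the mesh version you run.

```yaml
# Istio: opt one namespace into sidecar injection via a label.
apiVersion: v1
kind: Namespace
metadata:
  name: checkout                 # hypothetical namespace
  labels:
    istio-injection: enabled
---
# Linkerd: the equivalent opt-in is an annotation.
apiVersion: v1
kind: Namespace
metadata:
  name: checkout-linkerd         # hypothetical namespace
  annotations:
    linkerd.io/inject: enabled
---
# Keep a stateful workload unmeshed even though its namespace is meshed.
# Only the opt-out line for the mesh you actually run is needed.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: orders-db                # hypothetical workload
  namespace: checkout
spec:
  serviceName: orders-db
  replicas: 1
  selector:
    matchLabels: { app: orders-db }
  template:
    metadata:
      labels:
        app: orders-db
        sidecar.istio.io/inject: "false"   # Istio opt-out (pod label)
      annotations:
        linkerd.io/inject: disabled        # Linkerd opt-out (pod annotation)
    spec:
      containers:
        - name: postgres
          image: postgres:16
```

Opting out at the pod template keeps the decision next to the workload, so a later namespace-wide rollout does not silently pull the database into the mesh.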
Don't Mesh Stateful or Latency-Sensitive Workloads Blindly
- Stateful sets (Postgres, Kafka, etcd, Redis) — meshing them rarely helps and can corrupt state if retries fire on non-idempotent writes. Most teams mesh the clients, not the database pods.
- Latency-critical paths — the sidecar adds 1-3 ms per hop. If your service has a P99 budget of 5 ms, that hurts.
- Long-lived connections (HTTP/2 streams, websockets) — meshes proxy fine but trace and metric data can mislead.
When in doubt: bench it unmeshed, then meshed, with realistic load.
Sizing the Sidecar
Every meshed pod gets a sidecar. With 100 services at 5 replicas each, that is 500 sidecars. At roughly 50 MB each (Istio classic), that's about 25 GB of memory just for proxies.
| Setting | What to tune |
|---|---|
| sidecar.istio.io/proxyCPU / proxyMemory | Right-size per workload; small services need much less |
| Linkerd proxy resources | Set requests/limits via annotations |
| Ambient mode (Istio) | Removes the per-pod sidecar; the cost moves to a per-node ztunnel and any waypoint proxies you deploy |
| Sidecar CRD (Istio) | Restricts what config is pushed to each sidecar; a large memory win in big clusters |
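A minimal sketch of per-workload proxy sizing through pod annotations, assuming a small service that needs far less than the mesh defaults. The Deployment name, image, and every resource value are illustrative placeholders, not recommendations. The Istio annotations are the ones named in the table; the `config.linkerd.io/proxy-*` names are Linkerd's proxy resource annotations, and both sets should be checked against the version you run.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: small-service            # hypothetical workload
  namespace: production
spec:
  replicas: 2
  selector:
    matchLabels: { app: small-service }
  template:
    metadata:
      labels: { app: small-service }
      annotations:
        # Istio: sidecar requests and limits for this workload only.
        sidecar.istio.io/proxyCPU: "50m"
        sidecar.istio.io/proxyMemory: "64Mi"
        sidecar.istio.io/proxyCPULimit: "200m"
        sidecar.istio.io/proxyMemoryLimit: "128Mi"
        # Linkerd: equivalent settings (use one mesh's set, not both).
        config.linkerd.io/proxy-cpu-request: "50m"
        config.linkerd.io/proxy-memory-request: "32Mi"
        config.linkerd.io/proxy-cpu-limit: "200m"
        config.linkerd.io/proxy-memory-limit: "128Mi"
    spec:
      containers:
        - name: app
          image: example.com/small-service:1.0   # placeholder image
```

Measure actual proxy usage under load before settling on numbers; a memory limit set too low gets the proxy OOM-killed and takes the pod's traffic with it.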
In big clusters the istiod / control-plane sizing matters more than the sidecars. Watch its CPU and config-push latency.
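To make the Sidecar CRD row concrete, here is a sketch that limits what istiod pushes to proxies in one namespace. It assumes workloads in `production` only call services in their own namespace and in `istio-system`, and it uses the `networking.istio.io/v1` API version (older Istio releases serve the same resource as `v1beta1`).

```yaml
apiVersion: networking.istio.io/v1
kind: Sidecar
metadata:
  name: default                  # a Sidecar without a workloadSelector covers the whole namespace
  namespace: production
spec:
  egress:
    - hosts:
        - "./*"                  # services in the same namespace
        - "istio-system/*"       # control plane and shared gateways
```

Services outside the listed hosts are no longer known to the proxy, so add every namespace a workload legitimately depends on before applying this. The payoff is less config per sidecar and less push work for the control plane.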
Upgrade Strategy
Meshes touch every meshed pod. Plan upgrades like data migrations.
- Read the release notes. Both projects have minor versions that change CRDs or default behavior.
- Test in a non-prod cluster first with realistic workload.
- Upgrade control plane first, then data plane (proxy) versions.
- Rolling restart to pick up new sidecars. Make sure your PDBs and maxUnavailable settings can absorb the churn.
- Pin sidecar version if you need to delay proxy upgrades.
Istio's revisions (side-by-side control planes) and Linkerd's upgrades applied with kubectl apply --prune make this tractable but not automatic.
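The upgrade steps above could look roughly like this in manifests. The revision name `1-22-0`, the `api` workload, the image, and the replica and budget numbers are all hypothetical; the revision label only works if a control plane was installed under that same revision name.

```yaml
# Istio revision-based upgrade: point one namespace at the new control plane.
# Remove any istio-injection label from the namespace; it takes precedence over istio.io/rev.
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    istio.io/rev: 1-22-0         # hypothetical revision name
---
# Bound the churn of the rolling restart that picks up the new proxy.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: production
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1          # at most one pod down at a time
      maxSurge: 1
  selector:
    matchLabels: { app: api }
  template:
    metadata:
      labels: { app: api }
    spec:
      containers:
        - name: api
          image: example.com/api:1.0   # placeholder image
---
# Protect the same workload during node drains that overlap the upgrade.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api
  namespace: production
spec:
  minAvailable: 4
  selector:
    matchLabels: { app: api }
```

The rolling restart itself is governed by the Deployment strategy; the PDB covers voluntary evictions such as node drains, so both need to agree on how many pods can be missing at once.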
Security: Don't Stop at mTLS
mTLS gives you encrypted transport and verified identities. By itself it doesn't say who's allowed to call whom. Always add authorization policy:
Istio

```yaml
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: api-only-from-web
  namespace: production
spec:
  selector:
    matchLabels: { app: api }
  action: ALLOW
  rules:
    - from:
        - source:
            # Identity of the caller: <trust domain>/ns/<namespace>/sa/<service account>
            principals: ["cluster.local/ns/production/sa/web"]
```

Linkerd
```yaml
apiVersion: policy.linkerd.io/v1beta1
kind: Server
metadata:
  name: api-server
  namespace: production
spec:
  podSelector:
    matchLabels: { app: api }
  port: 8080
  proxyProtocol: HTTP/1
---
# AuthorizationPolicy is served under v1alpha1, not v1beta1.
apiVersion: policy.linkerd.io/v1alpha1
kind: AuthorizationPolicy
metadata:
  name: api-only-from-web
  namespace: production
spec:
  targetRef:
    group: policy.linkerd.io
    kind: Server
    name: api-server
  requiredAuthenticationRefs:
    - group: policy.linkerd.io
      kind: MeshTLSAuthentication
      name: web-sa
```

Start in audit / permissive mode, see what would have been denied, then flip to enforce.
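The Linkerd policy above refers to a MeshTLSAuthentication named `web-sa` that the example does not define. A plausible definition, assuming the caller runs under a ServiceAccount called `web` in the same namespace (mirroring the principal in the Istio example), might be:

```yaml
apiVersion: policy.linkerd.io/v1alpha1
kind: MeshTLSAuthentication
metadata:
  name: web-sa
  namespace: production
spec:
  identityRefs:
    - kind: ServiceAccount
      name: web                  # assumed service account of the calling workload
```

Without this resource the AuthorizationPolicy has nothing to match callers against, so apply the two together.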
Observability
A mesh emits a lot. Wire it into what you already run:
| Signal | Where it goes |
|---|---|
| Metrics | Prometheus — meshes ship dashboards out of the box |
| Traces | Jaeger / Tempo / a SaaS; proxies generate spans, but the application must forward the trace headers (B3 / W3C trace context) for spans to join into one trace |
| Access logs | Your logging pipeline — see ELK |
| Topology | Kiali (Istio), Linkerd Viz dashboard |
Don't double-instrument. If the mesh emits the L7 metrics, drop the app-level HTTP middleware that does the same. One source of truth.
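For the access-log row in the table, one way to switch logging on in Istio is the Telemetry API. This sketch assumes Istio's built-in `envoy` log provider and the `telemetry.istio.io/v1` API version (older releases use `v1alpha1`). The filter that keeps only server errors is an illustrative choice to limit log volume, not a requirement.

```yaml
apiVersion: telemetry.istio.io/v1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system        # the root namespace makes this apply mesh-wide
spec:
  accessLogging:
    - providers:
        - name: envoy            # built-in provider; writes to the proxy's stdout
      filter:
        expression: "response.code >= 500"   # log only server errors
```

The logs land on each proxy container's stdout, so the existing log shipper picks them up without extra wiring.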
Debugging Skills That Pay Off
When something breaks in a mesh, you need to look at proxies, not just apps:
```bash
# Istio
istioctl proxy-config cluster <pod>      # what does this proxy think upstreams look like?
istioctl proxy-config routes <pod>       # how is it routing?
istioctl analyze                         # static analysis of config
istioctl pc log <pod> --level debug      # crank up Envoy logging

# Linkerd
linkerd viz tap deploy/web                              # live L7 traffic
linkerd viz stat -n my-ns deploy                        # success rates and latency
linkerd diagnostics proxy-metrics -n my-ns deploy/web   # raw proxy stats (needs a target resource)
linkerd check                                           # everything-is-fine sanity check
```

A common pattern: the app says "connection refused"; the proxy says "no upstream endpoints" because the labels on a DestinationRule subset match no pods. Always check the proxy view.
Cost Awareness
Meshes are expensive on three axes:
- Compute — sidecars × pods × replicas; control plane.
- Latency — 1-3 ms per call adds up across chains of microservices.
- People — someone has to learn Envoy, debug xDS, manage upgrades.
If you have a small team and a handful of services, the simpler answer is often:
- Kubernetes NetworkPolicies for L3/L4 segmentation.
- An API Gateway for north-south.
- App-level libraries for retries and timeouts.
- mTLS via cert-manager + an internal CA for the few services that truly need it.
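As a sketch of the NetworkPolicy option in the list above, the following allows only `web` pods to reach `api` pods on one port. The labels, namespace, and port 8080 are assumptions carried over from the authorization examples, and the policy only takes effect if the cluster's CNI plugin enforces NetworkPolicy.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-allow-from-web
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api                   # the pods being protected
  policyTypes: [Ingress]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: web           # only these pods may connect
      ports:
        - protocol: TCP
          port: 8080
```

This is L3/L4 only: it identifies callers by pod labels and IPs rather than cryptographic identity, which is the trade-off against mesh authorization policy.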
Adopt a mesh when the cross-cutting concerns are causing real pain and you have the platform-team capacity to own it.
When to Pull the Plug
Signs you should reconsider:
- Mesh-related incidents outnumber app-related ones.
- Nobody on call understands the proxy stack well enough to debug it.
- Latency budget for end users is being eaten by sidecar hops.
- Feature usage is "just mTLS" — which you could get from a cheaper solution.
A common landing: migrate to Istio Ambient or roll back to NetworkPolicies + per-service mTLS. Both are legitimate.
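If the landing is Istio Ambient, moving a namespace over is a label change rather than a sidecar injection. This sketch assumes Istio was installed with ambient support (ztunnel running on the nodes); the namespace name is a placeholder.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    istio.io/dataplane-mode: ambient   # traffic is captured by the node's ztunnel, no sidecar
```

Drop the sidecar injection label from the namespace and restart the workloads so the old sidecars go away; treat the move as a migration to test, not a toggle.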
Checklist
Production service-mesh checklist
- Rolled out namespace-by-namespace, not cluster-wide
- Stateful workloads either unmeshed or carefully validated
- Sidecar resource requests/limits set per workload class
- Authorization policies in audit mode, then enforced
- Control plane HA (2+ replicas) and right-sized
- Cert rotation tested (kill the issuer, watch what happens)
- Upgrade runbook with rolling-restart plan and PDBs
- Mesh metrics → Prometheus; dashboards in Grafana
- Trace context propagation verified end-to-end
- Egress traffic policy explicit (don't leave it default-open)
- On-call team trained on istioctl proxy-config / linkerd viz debugging
- Documented "how to mesh / unmesh a workload" for service teams
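For the egress item in the checklist, one way to make the policy explicit in Istio is to allow only registered destinations and then list each external dependency. The host `api.example.com` and the resource names are hypothetical, and the IstioOperator fragment assumes installation through `istioctl install -f`.

```yaml
# Mesh-wide: block outbound traffic to hosts the mesh does not know about.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    outboundTrafficPolicy:
      mode: REGISTRY_ONLY        # the default, ALLOW_ANY, leaves egress open
---
# Explicitly register one external dependency.
apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: allow-example-api        # hypothetical name
  namespace: production
spec:
  hosts: ["api.example.com"]     # hypothetical external host
  location: MESH_EXTERNAL
  resolution: DNS
  ports:
    - number: 443
      name: tls
      protocol: TLS
```

Roll this out the same way as authorization policy: inventory what workloads actually call first, because flipping to `REGISTRY_ONLY` breaks every unlisted destination at once.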