Best Practices
Production API gateway - HA topology, versioning, observability, security hardening, anti-patterns, ops habits
The gateway is a single point of failure by design — every external request goes through it. Treat it like one: redundant, observable, boring.
HA Topology
                 ┌──────────────┐
                 │   Cloud LB   │  health-checks each node
                 └──────┬───────┘
      ┌─────────────────┼─────────────────┐
      ▼                 ▼                 ▼
┌──────────┐      ┌──────────┐      ┌──────────┐
│ Gateway1 │      │ Gateway2 │      │ Gateway3 │  stateless
└──────────┘      └──────────┘      └──────────┘
      └─────────────────┼─────────────────┘
                        ▼
                 ┌──────────────┐
                 │   Backends   │
                 └──────────────┘
- Stateless gateway instances so the LB can send any request to any node.
- State (rate-limit counters, sessions) in Redis or similar.
- At least 2 instances, ideally 3+, across availability zones.
- Health-check endpoint (e.g. /status on Kong) for the LB.
- PodDisruptionBudgets on Kubernetes so rolling updates don't drop the gateway to zero.
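To illustrate why the counters live outside the gateway, here is a minimal Python sketch of a fixed-window rate limit shared by two nodes. `SharedCounterStore` is a hypothetical in-memory stand-in for Redis, and `GatewayNode` is an illustrative name, not any gateway's real API.

```python
import time
from typing import Dict, Optional, Tuple


class SharedCounterStore:
    """In-memory stand-in for a shared store such as Redis (hypothetical)."""

    def __init__(self) -> None:
        self._counts: Dict[Tuple[str, int], int] = {}

    def incr(self, key: Tuple[str, int]) -> int:
        # A real store would do this atomically and expire old windows.
        self._counts[key] = self._counts.get(key, 0) + 1
        return self._counts[key]


class GatewayNode:
    """A stateless node: it holds a reference to the store, never the counts."""

    def __init__(self, store: SharedCounterStore, limit: int, window_s: int = 60) -> None:
        self.store = store
        self.limit = limit
        self.window_s = window_s

    def allow(self, consumer: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        window = int(now // self.window_s)  # fixed-window bucket index
        return self.store.incr((consumer, window)) <= self.limit


store = SharedCounterStore()
node_a = GatewayNode(store, limit=3)
node_b = GatewayNode(store, limit=3)

# The LB may alternate nodes freely; the limit is still enforced globally.
decisions = [node.allow("acme", now=1_000.0) for node in (node_a, node_b, node_a, node_b)]
print(decisions)  # [True, True, True, False]
```

If each node kept its own counter, the same four requests would all pass, and the effective limit would scale with the number of instances.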
Config-as-Code
Don't manage gateway state with ad hoc admin API calls or clicks in an admin UI. Use declarative config + GitOps:
# Kong: deck (declarative configuration tool)
deck gateway diff kong.yml # see what would change
deck gateway sync kong.yml # apply
# Envoy Gateway: standard Kubernetes resources, applied by your GitOps controller
kubectl apply -f gateway/
Every change goes through PR review. The admin API is for humans investigating, not for automated workflows.
Versioning
External clients hate breaking changes. Strategies that work:
| Strategy | Example |
|---|---|
| Path versioning | /v1/users, /v2/users |
| Header versioning | X-API-Version: 2 |
| Hostname versioning | v2.api.example.com |
| Accept header | Accept: application/vnd.example.v2+json |
Path versioning is the most pragmatic — visible to humans, easy to route in the gateway:
services:
- { name: users-v1, url: http://users-v1:8080, routes: [{ paths: ["/v1/users"] }] }
- { name: users-v2, url: http://users-v2:8080, routes: [{ paths: ["/v2/users"] }] }
When deprecating, return Deprecation and Sunset headers for months before turning a version off. Watch the gateway logs to find the last few callers.
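As a rough sketch of what those two headers look like on the wire, the Python below builds them from a pair of dates. It assumes the RFC 9745 format for Deprecation (a structured-field date) and the RFC 8594 format for Sunset (an HTTP-date). The function name and the dates are illustrative; in practice a gateway plugin or response transformer would set the headers.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Dict


def deprecation_headers(deprecated_at: datetime, sunset_at: datetime) -> Dict[str, str]:
    """Build response headers announcing that an API version is going away."""
    if sunset_at < deprecated_at:
        raise ValueError("sunset must not precede deprecation")
    return {
        # RFC 9745: Deprecation carries a structured-field date, "@<unix seconds>".
        "Deprecation": f"@{int(deprecated_at.timestamp())}",
        # RFC 8594: Sunset carries an HTTP-date in GMT.
        "Sunset": format_datetime(sunset_at.astimezone(timezone.utc), usegmt=True),
    }


# Illustrative dates only: six months between deprecation and shutdown.
headers = deprecation_headers(
    deprecated_at=datetime(2025, 7, 1, tzinfo=timezone.utc),
    sunset_at=datetime(2026, 1, 1, tzinfo=timezone.utc),
)
print(headers["Sunset"])  # Thu, 01 Jan 2026 00:00:00 GMT
```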
Observability
A gateway sees everything. Make sure you can see what it sees:
| Signal | What to capture |
|---|---|
| Access logs | One line per request, structured JSON, including correlation ID, consumer, route, status, latency, upstream |
| Metrics | Request rate, error rate, latency histograms per route + consumer |
| Traces | Start the trace at the gateway, propagate traceparent / b3 headers downstream |
| Errors | Upstream errors and gateway-side errors (rate limits, auth failures), counted separately |
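One possible shape for the access-log row, sketched in Python. The field names and the X-Correlation-ID header are assumptions for illustration; real gateways have their own log formats and plugins for this.

```python
import json
import uuid
from typing import Dict


def access_log_line(
    request_headers: Dict[str, str],
    *,
    consumer: str,
    route: str,
    status: int,
    latency_ms: float,
    upstream: str,
) -> str:
    """Render one request as a single line of structured JSON."""
    # Reuse the caller's correlation ID if present; otherwise mint one at the edge.
    correlation_id = request_headers.get("X-Correlation-ID") or str(uuid.uuid4())
    record = {
        "correlation_id": correlation_id,
        "consumer": consumer,
        "route": route,
        "status": status,
        "latency_ms": latency_ms,
        "upstream": upstream,
    }
    # sort_keys keeps the field order stable, which makes lines easier to grep.
    return json.dumps(record, sort_keys=True)


line = access_log_line(
    {"X-Correlation-ID": "req-123"},
    consumer="acme",
    route="/v1/users",
    status=200,
    latency_ms=12.5,
    upstream="users-v1:8080",
)
print(line)
```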
Wire these into Prometheus & Grafana. The "golden dashboard" for a gateway:
- Total RPS and error rate (the biggest panel)
- P50 / P95 / P99 latency
- Top routes by RPS, by error rate, by latency
- 4xx vs 5xx breakdown (4xx = client; 5xx = you)
- Per-consumer rate-limit hit rate
- Upstream health status
Set SLO-based alerts on the gateway, not on each backend. The user only ever sees what the gateway returned.
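A minimal sketch of the arithmetic behind an SLO burn-rate alert, computed from gateway-side counts of 5xx responses. The 14.4 threshold and the long/short window pairing follow a commonly cited multi-window convention and are assumptions here, not values from this guide. In practice this would be a Prometheus alerting rule rather than application code.

```python
def burn_rate(errors: int, requests: int, slo: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    if requests == 0:
        return 0.0
    error_budget = 1.0 - slo  # e.g. 0.001 for a 99.9% SLO
    return (errors / requests) / error_budget


def should_page(long_window: float, short_window: float, threshold: float = 14.4) -> bool:
    """Page only if both windows burn fast: sustained and still happening."""
    return long_window >= threshold and short_window >= threshold


# Hypothetical counts from gateway metrics (5xx responses vs. all requests).
last_hour = burn_rate(errors=180, requests=10_000, slo=0.999)
last_5_min = burn_rate(errors=20, requests=1_000, slo=0.999)
print(should_page(last_hour, last_5_min))  # True
```

Requiring both windows avoids paging on a spike that has already ended, while still catching a fast, ongoing burn.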
Security Hardening
A small checklist that closes most footguns:
- Run on a hardened image. Distroless or a vendor-blessed minimal image.
- TLS termination at the gateway with modern ciphers; backends speak mTLS internally if zero-trust.
- Disable HTTP/1.0 and old TLS versions. TLS 1.2+ only.
- Rate-limit /login-style endpoints aggressively. Often a separate, tighter limit than the rest.
- Bind the admin API to an internal network only. Never the public internet.
- Audit log every config change. Who, when, what diff.
- Bot mitigation on user-facing routes. Cloudflare / Fastly / a WAF in front, or Kong/Envoy plugins.
- Body-size limits. A 100 MB POST shouldn't reach your backend by accident.
- Header limits. Reject pathological header counts / sizes.
- CORS narrow. No * origin on production endpoints that carry credentials.
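The CORS item can be sketched as an allowlist check in Python. The origins below are hypothetical, and a real gateway would express this through its CORS plugin or filter config rather than custom code.

```python
from typing import Dict, FrozenSet, Optional

# Hypothetical allowlist; list exact origins, never "*".
ALLOWED_ORIGINS: FrozenSet[str] = frozenset(
    {"https://app.example.com", "https://admin.example.com"}
)


def cors_headers(
    origin: Optional[str], allowed: FrozenSet[str] = ALLOWED_ORIGINS
) -> Dict[str, str]:
    """Return CORS response headers for an allowed origin, or none at all."""
    if origin is None or origin not in allowed:
        # No CORS headers: the browser blocks the cross-origin read.
        return {}
    return {
        # Echo the exact origin; browsers reject "*" on credentialed requests.
        "Access-Control-Allow-Origin": origin,
        "Access-Control-Allow-Credentials": "true",
        # Tell caches that the response varies by requesting origin.
        "Vary": "Origin",
    }


print(cors_headers("https://app.example.com")["Access-Control-Allow-Origin"])
print(cors_headers("https://evil.example.net"))  # {}
```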
Common Anti-Patterns
| Anti-pattern | Symptom | Fix |
|---|---|---|
| Business logic in plugins | Custom Lua / WASM doing data joins or workflow | Move logic to a BFF or domain service |
| One gateway plugin per "policy" | Hundreds of plugins; config diffs are unreviewable | Group by service; templated policies |
| The gateway and every service implementing auth | Inconsistencies | Auth at gateway, identity propagated as headers/mTLS |
| Gateway pulled into every release train | Coupled to app schedule | Decouple gateway changes from app deploys |
| One gigantic shared gateway | "We can't change anything without breaking someone" | Per-domain / per-tenant gateway tiers |
When to Split
For most teams, one gateway tier is enough for a long time. Reasons to split:
- Different SLOs. Partner B2B API needs 99.99%; internal admin tool can be 99.9%.
- Different attack surface. Public mobile API vs internal-only LAN.
- Different teams. Platform team can't review every commercial-API config change.
- Geographical regions. Latency and data-residency requirements.
A common pattern: a public-edge gateway (Cloudflare + WAF) → a per-domain gateway (Kong / Envoy) → services. Each layer does its own job.
Operational Habits
A few that pay off:
- Track config drift. Periodically run deck gateway dump and diff the output against the config in git.
- Capacity-test before launch. Real load against a staging gateway. Measure your gateway, not just upstream.
- Test failure paths. Backend down → does the gateway 502 cleanly? Slow upstream → does the gateway time out at the right point?
- Tail the access log during incidents. It's your single source of truth for "did the request even arrive?"
- Document the policies. "Why is rate limit X 100/min?" should have an answer in git, not Slack history.
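The failure-path habit can be rehearsed with a toy model before pointing real traffic at staging. This Python sketch maps upstream failures to the status codes a client should see. The `proxy` function and the three fake upstreams are illustrative, and the 504-for-timeout mapping is the usual HTTP convention rather than something this guide prescribes.

```python
from typing import Callable, Tuple

Response = Tuple[int, str]


def proxy(call_upstream: Callable[[], Response]) -> Response:
    """Forward a request and translate upstream failures into clean errors."""
    try:
        return call_upstream()
    except TimeoutError:
        return 504, "upstream timed out"
    except ConnectionError:
        return 502, "upstream unreachable"


def healthy() -> Response:
    return 200, "ok"


def backend_down() -> Response:
    raise ConnectionRefusedError("connection refused")  # subclass of ConnectionError


def slow_upstream() -> Response:
    raise TimeoutError("no response within the gateway timeout")


for fake in (healthy, backend_down, slow_upstream):
    print(fake.__name__, proxy(fake))
```

The same three cases make a reasonable smoke test against a staging gateway: stop a backend, add artificial latency, and check the status code the client receives.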
Checklist
Production-ready API gateway checklist
- 2+ instances across availability zones, behind an L4 LB
- Stateless gateway nodes; state in Redis / shared store
- Declarative config in git; admin API not used in automation
- TLS at the edge; modern ciphers; no TLS 1.0/1.1
- Auth (JWT / OIDC / API key / mTLS) enforced for all non-public routes
- Rate limits per consumer and per IP; aggressive on /login-style endpoints
- CORS narrow; specific origins only
- Body size and header limits set
- Admin API on internal network only
- Structured access logs + metrics + traces shipped to your observability stack
- Correlation ID injected at the edge; propagated to backends
- PodDisruptionBudget so upgrades don't take the gateway down
- Versioning strategy documented; deprecation policy with Sunset headers
- WAF / bot mitigation in front (Cloudflare / Fastly / managed AWS)
- SLO and burn-rate alerts based on gateway-side metrics