Best Practices

Production API gateway - HA topology, versioning, observability, security hardening, anti-patterns, ops habits

The gateway sits on the path of every external request, so a gateway outage is an outage for everything behind it. Treat it accordingly: redundant, observable, boring.

HA Topology

                  ┌──────────────┐
                  │  Cloud LB    │  health-checks each node
                  └──────┬───────┘
       ┌─────────────────┼─────────────────┐
       ▼                 ▼                 ▼
  ┌──────────┐     ┌──────────┐     ┌──────────┐
  │ Gateway1 │     │ Gateway2 │     │ Gateway3 │   stateless
  └──────────┘     └──────────┘     └──────────┘
       └─────────────────┼─────────────────┘
                         ▼
                  ┌──────────────┐
                  │   Backends   │
                  └──────────────┘

Stateless gateway instances, so the LB can send any request to any instance.
State (rate-limit counters, sessions) in Redis or similar.
At least 2 instances, ideally 3+, across availability zones.
Health-check endpoint (e.g. /status on Kong) for the LB.
PodDisruptionBudgets on Kubernetes so rolling updates don't drop the gateway to zero.
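
Why shared state matters can be sketched in a few lines. The example below is a minimal illustration, not any gateway's real implementation: `SharedCounterStore` is a hypothetical in-memory stand-in for Redis, and `GatewayInstance` is a hypothetical stateless node that applies a fixed-window limit. Because the counter lives outside the instances, the limit holds no matter which node the LB picks.

```python
import time
from typing import Callable, Dict, Tuple


class SharedCounterStore:
    """Hypothetical in-memory stand-in for a shared store such as Redis."""

    def __init__(self) -> None:
        self._counts: Dict[Tuple[str, int], int] = {}

    def incr(self, key: str, window: int) -> int:
        # Plays the role of an atomic increment on a per-window key.
        bucket = (key, window)
        self._counts[bucket] = self._counts.get(bucket, 0) + 1
        return self._counts[bucket]


class GatewayInstance:
    """Hypothetical stateless gateway node: it keeps no counters of its own."""

    def __init__(
        self,
        store: SharedCounterStore,
        limit: int,
        window_seconds: int = 60,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._store = store
        self._limit = limit
        self._window_seconds = window_seconds
        self._clock = clock

    def allow(self, consumer: str) -> bool:
        # Fixed-window limit: one counter per consumer per time window.
        window = int(self._clock() // self._window_seconds)
        return self._store.incr(consumer, window) <= self._limit


store = SharedCounterStore()
fixed_clock = lambda: 1_000.0  # frozen time so all requests share one window
gw1 = GatewayInstance(store, limit=3, clock=fixed_clock)
gw2 = GatewayInstance(store, limit=3, clock=fixed_clock)

# The LB alternates between instances; the shared counter still enforces 3.
decisions = [gw.allow("consumer-a") for gw in (gw1, gw2, gw1, gw2)]
print(decisions)  # [True, True, True, False]
```

If each instance kept its own counter instead, the same consumer would get the limit once per instance, and the effective limit would change every time the fleet scaled.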

Config-as-Code

Don't manage gateway state with ad hoc admin API calls or clicks in an admin UI. Use declarative config and GitOps:

# Kong: deck (declarative configuration tool)
deck gateway diff kong.yml          # see what would change
deck gateway sync kong.yml          # apply

# Envoy Gateway: standard Kubernetes resources, applied by your GitOps controller
kubectl apply -f gateway/

Every change goes through PR review. Direct admin API access is for humans investigating, not a second path for changing config.

Versioning

External clients hate breaking changes. Strategies that work:

Strategy	Example
Path versioning	`/v1/users`, `/v2/users`
Header versioning	`X-API-Version: 2`
Hostname versioning	`v2.api.example.com`
Accept header	`Accept: application/vnd.example.v2+json`

Path versioning is the most pragmatic — visible to humans, easy to route in the gateway:

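_format_version: "3.0"   # assumes Kong 3.x; match your Kong version
services:
  - { name: users-v1, url: http://users-v1:8080, routes: [{ name: users-v1-route, paths: ["/v1/users"] }] }
  - { name: users-v2, url: http://users-v2:8080, routes: [{ name: users-v2-route, paths: ["/v2/users"] }] }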
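
To show why path versioning is easy to route, here is a minimal sketch of prefix-based upstream selection. It is a simplification that matches on whole path segments and is not a description of Kong's or Envoy's actual matching rules; `ROUTES` and `pick_upstream` are hypothetical names that reuse the paths and URLs from the config above.

```python
from typing import Dict, Optional

# Hypothetical routing table mirroring the two versioned services above.
ROUTES: Dict[str, str] = {
    "/v1/users": "http://users-v1:8080",
    "/v2/users": "http://users-v2:8080",
}


def pick_upstream(path: str, routes: Dict[str, str] = ROUTES) -> Optional[str]:
    """Return the upstream for the longest matching path prefix, or None."""
    matches = [
        prefix
        for prefix in routes
        if path == prefix or path.startswith(prefix + "/")
    ]
    if not matches:
        return None
    # The longest prefix wins, so more specific routes take priority.
    return routes[max(matches, key=len)]


print(pick_upstream("/v1/users/42"))  # http://users-v1:8080
print(pick_upstream("/v2/users"))     # http://users-v2:8080
print(pick_upstream("/v3/users"))     # None
```

The version is the first thing in the path, so the routing decision needs no header parsing and shows up unchanged in every access log line.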

When deprecating, return Deprecation and Sunset headers for months before turning a version off. Watch the gateway logs to find the last few callers.
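
A minimal sketch of building those two headers, assuming the `Sunset` value is an HTTP-date (RFC 8594) and the `Deprecation` value uses the structured-field date form `@<unix-seconds>` (RFC 9745; older drafts used other forms, so check what your clients expect). The function name `deprecation_headers` and the dates are illustrative, and both arguments are assumed to be timezone-aware datetimes.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Dict


def deprecation_headers(deprecated_at: datetime, sunset_at: datetime) -> Dict[str, str]:
    """Build response headers announcing a deprecation and its shutdown date."""
    if sunset_at < deprecated_at:
        raise ValueError("sunset must not be earlier than the deprecation date")
    return {
        # Structured-field date: "@" followed by seconds since the Unix epoch.
        "Deprecation": f"@{int(deprecated_at.timestamp())}",
        # HTTP-date in GMT, as used by other HTTP date headers.
        "Sunset": format_datetime(sunset_at.astimezone(timezone.utc), usegmt=True),
    }


headers = deprecation_headers(
    deprecated_at=datetime(2025, 1, 1, tzinfo=timezone.utc),  # illustrative date
    sunset_at=datetime(2025, 7, 1, tzinfo=timezone.utc),      # illustrative date
)
for name, value in headers.items():
    print(f"{name}: {value}")
```

In practice the gateway would attach these to every response on the deprecated route, for example through a response-transformation plugin or filter.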

Observability

A gateway sees everything. Make sure you can see what it sees:

Signal	What to capture
Access logs	One line per request, structured JSON, including correlation ID, consumer, route, status, latency, upstream
Metrics	Request rate, error rate, latency histograms per route + consumer
Traces	Start the trace at the gateway, propagate `traceparent` / `b3` headers downstream
Errors	Upstream errors and gateway-side errors (rate limits, auth failures), counted separately

Wire these into Prometheus & Grafana. The "golden dashboard" for a gateway:

Total RPS and error rate (the biggest panel)
P50 / P95 / P99 latency
Top routes by RPS, by error rate, by latency
4xx vs 5xx breakdown (4xx = client; 5xx = you)
Per-consumer rate-limit hit rate
Upstream health status

Set SLO-based alerts on the gateway, not on each backend. The user only ever sees what the gateway returned.

Security Hardening

A small checklist that closes most footguns:

Run on a hardened image. Distroless or a vendor-blessed minimal image.
TLS termination at the gateway with modern ciphers; backends speak mTLS internally if zero-trust.
Disable HTTP/1.0 and old TLS versions. TLS 1.2+ only.
Rate-limit /login-style endpoints aggressively. Often a separate, tighter limit than the rest.
Bind the admin API to an internal network only. Never the public internet.
Audit log every config change. Who, when, what diff.
Bot mitigation on user-facing routes. Cloudflare / Fastly / a WAF in front, or Kong/Envoy plugins.
Body-size limits. A 100 MB POST shouldn't reach your backend by accident.
Header limits. Reject pathological header counts / sizes.
Keep CORS narrow. No wildcard (*) origin on credentials-bearing endpoints in production.
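
The CORS item is the easiest one to get subtly wrong, so here is a minimal sketch of an allowlist-based policy. `ALLOWED_ORIGINS` and `cors_headers` are hypothetical names and the origin is a placeholder; in a real deployment this logic would normally live in the gateway's CORS plugin or filter configuration rather than in custom code.

```python
from typing import Dict, FrozenSet, Optional

# Hypothetical allowlist; replace with the origins that really call your API.
ALLOWED_ORIGINS: FrozenSet[str] = frozenset({"https://app.example.com"})


def cors_headers(
    origin: Optional[str],
    allowed: FrozenSet[str] = ALLOWED_ORIGINS,
) -> Dict[str, str]:
    """Return CORS response headers for an allowlisted origin, else none."""
    if origin is None or origin not in allowed:
        # Unknown origin: send no CORS headers, so the browser blocks the read.
        return {}
    return {
        # Echo the exact origin; "*" is not valid together with credentials.
        "Access-Control-Allow-Origin": origin,
        "Access-Control-Allow-Credentials": "true",
        # Keep caches from serving one origin's response to another origin.
        "Vary": "Origin",
    }


print(cors_headers("https://app.example.com"))
print(cors_headers("https://evil.example"))  # {}
```

Exact string matching is deliberate here: substring or suffix checks on the origin are a common way for a look-alike domain to slip through.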

Common Anti-Patterns

Anti-pattern	Symptom	Fix
Business logic in plugins	Custom Lua / WASM doing data joins or workflow	Move logic to a BFF or domain service
One gateway plugin per "policy"	Hundreds of plugins; config diffs are unreviewable	Group by service; templated policies
The gateway and every service implementing auth	Inconsistencies	Auth at gateway, identity propagated as headers/mTLS
Gateway pulled into every release train	Coupled to app schedule	Decouple gateway changes from app deploys
One gigantic shared gateway	"We can't change anything without breaking someone"	Per-domain / per-tenant gateway tiers

When to Split

For most teams, one gateway tier is enough for a long time. Reasons to split:

Different SLOs. Partner B2B API needs 99.99%; internal admin tool can be 99.9%.
Different attack surface. Public mobile API vs internal-only LAN.
Different teams. Platform team can't review every commercial-API config change.
Geographical regions. Latency and data-residency requirements.
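
To make the SLO difference concrete, the short calculation below converts an availability target into allowed downtime. It assumes a 30-day window, which is a common but not universal choice; the function name is illustrative.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of full outage an availability SLO permits over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)


print(round(allowed_downtime_minutes(99.9), 2))   # 43.2
print(round(allowed_downtime_minutes(99.99), 2))  # 4.32
```

Roughly 43 minutes a month leaves room for a human to respond to a page; roughly 4 minutes does not, which is why the two targets rarely fit on one shared gateway tier.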

A common pattern: a public-edge gateway (Cloudflare + WAF) → a per-domain gateway (Kong / Envoy) → services. Each layer does its own job.

Operational Habits

A few that pay off:

Track config drift. Run deck gateway diff kong.yml on a schedule; any reported change means the live config has drifted from what is in git.
Capacity-test before launch. Real load against a staging gateway. Measure your gateway, not just upstream.
Test failure paths. Backend down → does the gateway 502 cleanly? Slow upstream → does the gateway time out at the right point?
Tail the access log during incidents. It's your single source of truth for "did the request even arrive?"
Document the policies. "Why is rate limit X 100/min?" should have an answer in git, not Slack history.
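
The failure-path habit is easier to keep when the expected behavior is written down as assertions. The sketch below is a toy model, not a real proxy: `forward` is a hypothetical function that maps upstream failures to the status codes a gateway would typically return (502 for an unreachable backend, 504 for a timeout). In a real test suite the same assertions would run against a staging gateway with a stopped or artificially slow upstream.

```python
from typing import Callable, Tuple

Response = Tuple[int, str]


def forward(call_upstream: Callable[[], Response]) -> Response:
    """Toy model of a gateway hop: pass through success, map failures cleanly."""
    try:
        return call_upstream()
    except TimeoutError:
        # Slow upstream: the gateway gives up at its own deadline.
        return 504, "upstream timed out"
    except ConnectionError:
        # Backend down or refusing connections.
        return 502, "upstream unavailable"


def healthy() -> Response:
    return 200, "ok"


def backend_down() -> Response:
    raise ConnectionRefusedError("connection refused")


def backend_slow() -> Response:
    raise TimeoutError("deadline exceeded")


print(forward(healthy))       # (200, 'ok')
print(forward(backend_down))  # (502, 'upstream unavailable')
print(forward(backend_slow))  # (504, 'upstream timed out')
```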
