Steven's Knowledge

Best Practices

Production API gateway - HA topology, versioning, observability, security hardening, anti-patterns, ops habits

The gateway is a single point of failure by design — every external request goes through it. Treat it like one: redundant, observable, boring.

HA Topology

                  ┌──────────────┐
                  │  Cloud LB    │  health-checks each node
                  └──────┬───────┘
       ┌─────────────────┼─────────────────┐
       ▼                 ▼                 ▼
  ┌──────────┐     ┌──────────┐     ┌──────────┐
  │ Gateway1 │     │ Gateway2 │     │ Gateway3 │   stateless
  └──────────┘     └──────────┘     └──────────┘
       └─────────────────┼─────────────────┘
                         ▼
                  ┌──────────────┐
                  │   Backends   │
                  └──────────────┘

  • Stateless gateway instances, so the LB can send any request to any instance.
  • State (rate-limit counters, sessions) in Redis or similar.
  • At least 2 instances, ideally 3+, across availability zones.
  • Health-check endpoint (e.g. /status on Kong) for the LB.
  • PodDisruptionBudgets on Kubernetes so node drains and cluster upgrades don't drop the gateway to zero; the Deployment's rolling-update settings (maxUnavailable) cover ordinary deploys.
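
To make the "state in Redis or similar" point concrete, here is a minimal Python sketch of a fixed-window rate limiter whose counters live behind a small store interface. The names CounterStore, InMemoryStore and FixedWindowLimiter are hypothetical, and the in-memory dict only stands in for a shared store so the example runs on its own; it is not how any particular gateway implements rate limiting.

```python
import time
from typing import Callable, Dict, Protocol


class CounterStore(Protocol):
    """Hypothetical interface; a real deployment would back it with Redis or similar."""

    def incr(self, key: str) -> int:
        ...


class InMemoryStore:
    """Local stand-in so the sketch runs; it is NOT shared across processes."""

    def __init__(self) -> None:
        self._counts: Dict[str, int] = {}

    def incr(self, key: str) -> int:
        self._counts[key] = self._counts.get(key, 0) + 1
        return self._counts[key]


class FixedWindowLimiter:
    """Allow at most `limit` requests per consumer in each `window_s`-second window."""

    def __init__(self, store: CounterStore, limit: int, window_s: int,
                 clock: Callable[[], float] = time.time) -> None:
        self._store = store
        self._limit = limit
        self._window_s = window_s
        self._clock = clock

    def allow(self, consumer: str) -> bool:
        # The window index is part of the key, so every instance derives the same key.
        window = int(self._clock() // self._window_s)
        return self._store.incr(f"ratelimit:{consumer}:{window}") <= self._limit


# Two "gateway instances" hold no counters of their own; they share one store.
shared = InMemoryStore()
gateway1 = FixedWindowLimiter(shared, limit=3, window_s=60, clock=lambda: 1000.0)
gateway2 = FixedWindowLimiter(shared, limit=3, window_s=60, clock=lambda: 1000.0)
decisions = [gateway1.allow("alice"), gateway2.allow("alice"),
             gateway1.allow("alice"), gateway2.allow("alice")]
```

Because both limiters count against the same store, the fourth request is refused no matter which instance the LB picked. A real shared store would also need expiry on old window keys, which this stand-in omits.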

Config-as-Code

Don't manage gateway state by hand through the admin API or UI. Use declarative config and GitOps:

# Kong: deck (declarative configuration tool)
deck gateway diff kong.yml          # see what would change
deck gateway sync kong.yml          # apply

# Envoy Gateway: standard Kubernetes resources, applied by your GitOps controller
kubectl apply -f gateway/

Every change goes through PR review. Direct admin API calls are for humans investigating, not for automated workflows; automation applies the reviewed declarative config.

Versioning

External clients hate breaking changes. Strategies that work:

Strategy             Example
Path versioning      /v1/users, /v2/users
Header versioning    X-API-Version: 2
Hostname versioning  v2.api.example.com
Accept header        Accept: application/vnd.example.v2+json

Path versioning is the most pragmatic — visible to humans, easy to route in the gateway:

services:
  - { name: users-v1, url: http://users-v1:8080, routes: [{ paths: ["/v1/users"] }] }
  - { name: users-v2, url: http://users-v2:8080, routes: [{ paths: ["/v2/users"] }] }

When deprecating, return Deprecation and Sunset headers for months before turning a version off. Watch the gateway logs to find the last few callers.
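
As a rough illustration of the deprecation step, the Python sketch below builds the two response headers for a version that is being retired. The function name deprecation_headers and the dates are hypothetical. The header value formats follow my reading of RFC 9745 (Deprecation, "@" plus a Unix timestamp) and RFC 8594 (Sunset, an HTTP-date), so check them against the specs before relying on them.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Dict


def deprecation_headers(deprecated_at: datetime, sunset_at: datetime) -> Dict[str, str]:
    """Build response headers announcing a deprecated API version.

    Both arguments are assumed to be timezone-aware datetimes.
    """
    if sunset_at < deprecated_at:
        raise ValueError("sunset must not precede deprecation")
    return {
        # Assumed RFC 9745 form: "@" followed by a Unix timestamp.
        "Deprecation": f"@{int(deprecated_at.timestamp())}",
        # Assumed RFC 8594 form: an HTTP-date in GMT.
        "Sunset": format_datetime(sunset_at.astimezone(timezone.utc), usegmt=True),
    }


# Hypothetical schedule: v1 deprecated at the start of 2025, switched off at year end.
headers = deprecation_headers(
    datetime(2025, 1, 1, tzinfo=timezone.utc),
    datetime(2025, 12, 31, 23, 59, 59, tzinfo=timezone.utc),
)
```

In practice the gateway would attach these headers to every response on the deprecated route, for example through a response-transformation policy, so clients see the warning without any backend change.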

Observability

A gateway sees everything. Make sure you can see what it sees:

Signal       What to capture
Access logs  One line per request, structured JSON, including correlation ID, consumer, route, status, latency, upstream
Metrics      Request rate, error rate, latency histograms per route + consumer
Traces       Start the trace at the gateway, propagate traceparent / b3 headers downstream
Errors       Upstream errors and gateway-side errors (rate limits, auth failures), counted separately
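
The access-log row can be sketched in Python as one JSON object per request. The function name access_log_line, the field names, and the X-Correlation-ID header are assumptions made for illustration; real gateways have their own log formats and plugins for this.

```python
import json
import uuid
from typing import Mapping


def access_log_line(method: str, path: str, status: int, latency_ms: float,
                    route: str, consumer: str, upstream: str,
                    request_headers: Mapping[str, str]) -> str:
    """Return one JSON access-log line for a finished request."""
    # Reuse the caller's correlation ID if present, otherwise mint one at the edge.
    # Assumes header names were already lower-cased by the caller.
    correlation_id = request_headers.get("x-correlation-id") or str(uuid.uuid4())
    record = {
        "correlation_id": correlation_id,
        "consumer": consumer,
        "route": route,
        "method": method,
        "path": path,
        "status": status,
        "latency_ms": latency_ms,
        "upstream": upstream,
    }
    # sort_keys keeps the field order stable, which makes lines easier to diff and grep.
    return json.dumps(record, sort_keys=True)


line = access_log_line("GET", "/v1/users", 200, 12.5, "users-v1", "partner-a",
                       "users-v1:8080", {"x-correlation-id": "abc-123"})
```

The same correlation ID would then be forwarded to the backend, so a single value ties the gateway log line to the downstream logs and traces.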

Wire the metrics into Prometheus & Grafana. The "golden dashboard" for a gateway:

  • Total RPS and error rate (the biggest panel)
  • P50 / P95 / P99 latency
  • Top routes by RPS, by error rate, by latency
  • 4xx vs 5xx breakdown (4xx = client; 5xx = you)
  • Per-consumer rate-limit hit rate
  • Upstream health status

Set SLO-based alerts on the gateway, not on each backend. The user only ever sees what the gateway returned.
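
One way to turn gateway-side metrics into an SLO alert is a burn rate: the observed error ratio divided by the error ratio the SLO allows. The Python sketch below assumes that only 5xx responses count against the SLO, and the 14.4 paging threshold is a commonly cited fast-burn value used here as an assumption; the function names are hypothetical.

```python
def burn_rate(total_requests: int, error_requests: int, slo_target: float) -> float:
    """Return how fast the error budget is being spent (1.0 means exactly on budget)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    if total_requests == 0:
        return 0.0  # no traffic, nothing burned
    error_budget = 1.0 - slo_target          # allowed error ratio
    observed = error_requests / total_requests
    return observed / error_budget


def should_page(rate: float, threshold: float = 14.4) -> bool:
    """Page when the budget burns faster than the (assumed) threshold."""
    return rate >= threshold


# 150 gateway-returned 5xx out of 10,000 requests against a 99.9% SLO.
rate = burn_rate(10_000, 150, 0.999)
page = should_page(rate)
```

Here the observed error ratio is 1.5% against an allowed 0.1%, so the budget is burning roughly fifteen times faster than the SLO permits.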

Security Hardening

A small checklist that closes most footguns:

  • Run on a hardened image. Distroless or a vendor-blessed minimal image.
  • TLS termination at the gateway with modern ciphers; backends speak mTLS internally if zero-trust.
  • Disable HTTP/1.0 and old TLS versions. TLS 1.2+ only.
  • Rate-limit /login-style endpoints aggressively. Often a separate, tighter limit than the rest.
  • Bind the admin API to an internal network only. Never the public internet.
  • Audit log every config change. Who, when, what diff.
  • Bot mitigation on user-facing routes. Cloudflare / Fastly / a WAF in front, or Kong/Envoy plugins.
  • Body-size limits. A 100 MB POST shouldn't reach your backend by accident.
  • Header limits. Reject pathological header counts / sizes.
  • Keep CORS narrow. No wildcard (*) origin on production endpoints that carry credentials.
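
The body-size, header-limit and CORS items can be sketched in Python as simple checks run before a request is proxied. Every limit, the allowed origin, and the function names reject_status and cors_allow_origin are illustrative assumptions; in a real gateway these are usually configuration settings or plugins, not hand-written code.

```python
from typing import Mapping, Optional

# All limits and origins below are illustrative assumptions, not recommendations.
MAX_BODY_BYTES = 1 * 1024 * 1024
MAX_HEADER_COUNT = 100
MAX_HEADER_BYTES = 8 * 1024
ALLOWED_ORIGINS = frozenset({"https://app.example.com"})


def reject_status(headers: Mapping[str, str]) -> Optional[int]:
    """Return the HTTP status to reject with, or None if the request may proceed."""
    if len(headers) > MAX_HEADER_COUNT:
        return 431  # Request Header Fields Too Large
    if any(len(name) + len(value) > MAX_HEADER_BYTES for name, value in headers.items()):
        return 431
    declared = headers.get("content-length")
    if declared is not None:
        try:
            size = int(declared)
        except ValueError:
            return 400  # malformed Content-Length
        if size < 0:
            return 400
        if size > MAX_BODY_BYTES:
            return 413  # Content Too Large
    # Only the declared length is checked; chunked bodies need a streaming limit too.
    return None


def cors_allow_origin(origin: Optional[str]) -> Optional[str]:
    """Echo the origin only if it is on the allowlist; never answer with '*'."""
    return origin if origin in ALLOWED_ORIGINS else None


oversized = reject_status({"content-length": str(100 * 1024 * 1024)})
```

Returning the specific origin (or nothing) keeps credentialed responses tied to known front ends, and rejecting on the declared length stops an oversized upload before any bytes reach the backend.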

Common Anti-Patterns

Anti-pattern                                     Symptom                                               Fix
Business logic in plugins                        Custom Lua / WASM doing data joins or workflow        Move logic to a BFF or domain service
One gateway plugin per "policy"                  Hundreds of plugins; config diffs are unreviewable    Group by service; templated policies
The gateway and every service implementing auth  Inconsistencies                                       Auth at gateway, identity propagated as headers/mTLS
Gateway pulled into every release train          Coupled to app schedule                               Decouple gateway changes from app deploys
One gigantic shared gateway                      "We can't change anything without breaking someone"   Per-domain / per-tenant gateway tiers

When to Split

For most teams, one gateway tier is enough for a long time. Reasons to split:

  • Different SLOs. Partner B2B API needs 99.99%; internal admin tool can be 99.9%.
  • Different attack surface. Public mobile API vs internal-only LAN.
  • Different teams. Platform team can't review every commercial-API config change.
  • Geographical regions. Latency and data-residency requirements.
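
The gap between the two SLO figures is easier to judge as allowed downtime. A small Python sketch of the arithmetic, assuming a 30-day window (the window length and function name are assumptions):

```python
def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the downtime budget, in minutes, for an availability SLO over a window."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1], e.g. 0.9999")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


partner_budget = allowed_downtime_minutes(0.9999)  # partner B2B API
admin_budget = allowed_downtime_minutes(0.999)     # internal admin tool
```

Over 30 days, 99.99% leaves about 4.3 minutes of downtime and 99.9% about 43 minutes. A shared tier has to be operated to the stricter figure, which is one reason to split.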

A common pattern: a public-edge gateway (Cloudflare + WAF) → a per-domain gateway (Kong / Envoy) → services. Each layer does its own job.

Operational Habits

A few that pay off:

  1. Track config drift. Run deck gateway diff kong.yml periodically; any reported change means the live gateway no longer matches git.
  2. Capacity-test before launch. Real load against a staging gateway. Measure your gateway, not just upstream.
  3. Test failure paths. Backend down → does the gateway 502 cleanly? Slow upstream → does the gateway time out at the right point?
  4. Tail the access log during incidents. It's your single source of truth for "did the request even arrive?"
  5. Document the policies. "Why is rate limit X 100/min?" should have an answer in git, not Slack history.
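
For the drift check in item 1, a scheduled job can compare the committed config with a dump of the live gateway and alert on any difference. The Python sketch below shows only the comparison step, using the standard library. The function name config_drift, the file labels, and the sample YAML are hypothetical, and obtaining the live dump is left to the gateway's own tooling.

```python
import difflib
from typing import List


def config_drift(committed: str, live: str) -> List[str]:
    """Return unified-diff lines between committed and live config (empty if identical)."""
    return list(difflib.unified_diff(
        committed.splitlines(), live.splitlines(),
        fromfile="git/kong.yml", tofile="live/kong.yml", lineterm=""))


# Hypothetical example: someone changed the upstream port outside of git.
committed = "services:\n  - name: users-v1\n    url: http://users-v1:8080\n"
live = "services:\n  - name: users-v1\n    url: http://users-v1:9090\n"
drift = config_drift(committed, live)
```

An empty result means the gateway matches git; anything else is worth an alert and a PR to either revert the live change or record it.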

Checklist

Production-ready API gateway checklist

  • 2+ instances across availability zones, behind an L4 LB
  • Stateless gateway nodes; state in Redis / shared store
  • Declarative config in git; admin API not used in automation
  • TLS at the edge; modern ciphers; no TLS 1.0/1.1
  • Auth (JWT / OIDC / API key / mTLS) enforced for all non-public routes
  • Rate limits per consumer and per IP; aggressive on /login-style endpoints
  • CORS narrow; specific origins only
  • Body size and header limits set
  • Admin API on internal network only
  • Structured access logs + metrics + traces shipped to your observability stack
  • Correlation ID injected at the edge; propagated to backends
  • PodDisruptionBudget so node drains and cluster upgrades don't take the gateway down
  • Versioning strategy documented; deprecation policy with Sunset headers
  • WAF / bot mitigation in front (Cloudflare / Fastly / AWS WAF)
  • SLO and burn-rate alerts based on gateway-side metrics
