Best Practices

Production-ready Kubernetes - RBAC, security, GitOps, manifest management, and operational habits

The features in earlier pages let you run a cluster. This page is about running one safely and repeatably at scale.

Manifest Management

Hand-written YAML doesn't scale past one environment. Pick a tool:

Tool	Approach	Best for
Kustomize	Patch a base for each env	Built into kubectl (`-k`); minimal cognitive load
Helm	Templated charts + values	Reusable packages, community charts
CDK8s / Pulumi	Real programming language	Strong typing; complex logic
Jsonnet	Functional templating	Large orgs with shared standards

The Kustomize pattern:

manifests/
├── base/
│   ├── deployment.yaml
│   ├── service.yaml
│   └── kustomization.yaml
└── overlays/
    ├── staging/
    │   ├── kustomization.yaml
    │   └── replicas.yaml
    └── production/
        ├── kustomization.yaml
        └── replicas.yaml

kubectl apply -k manifests/overlays/production
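
As a minimal sketch of what those files might contain, the base lists its resources and the production overlay points at the base and applies a patch. The three documents below are separate files (paths in the comments); the Deployment name `api-server` and the replica count are assumptions for illustration.

```yaml
# manifests/base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml
---
# manifests/overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: production        # applied to every resource in the overlay
resources:
  - ../../base
patches:
  - path: replicas.yaml
---
# manifests/overlays/production/replicas.yaml
# Strategic-merge patch: only the fields that differ from the base.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server           # assumed name; must match the Deployment in base/
spec:
  replicas: 6                # illustrative value
```

Running `kubectl kustomize manifests/overlays/production` prints the rendered manifests without applying them, which is a convenient way to review what an overlay actually changes.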

GitOps

Stop running kubectl apply from laptops. Git is the source of truth; an operator reconciles the cluster to match.

Tool	Notes
ArgoCD	UI-driven; syncs by polling Git or on a webhook; Application CRDs
Flux	CLI-first, GitOps Toolkit, pull-based reconciliation

A typical setup:

One git repo (or directory) per environment.
Argo/Flux watches it; any commit becomes a cluster change.
Changes go through PR review.
Out-of-band cluster edits get auto-corrected (or trigger an alert).

# ArgoCD Application — points the cluster at a git directory
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api-server-production
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/myorg/infra.git
    targetRevision: main
    path: manifests/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

RBAC: Least Privilege

Every API call is authenticated and authorized. Don't use cluster-admin for normal work.

# Read-only access to one namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: production
  name: viewer
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments", "statefulsets"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: alice-viewer
  namespace: production
subjects:
  - kind: User
    name: alice@example.com
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: viewer
  apiGroup: rbac.authorization.k8s.io

Workload identity

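Every pod runs as a ServiceAccount (the namespace's `default` one unless you say otherwise). Pods that talk to the K8s API should get a dedicated ServiceAccount bound to only the permissions they need: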

apiVersion: v1
kind: ServiceAccount
metadata:
  name: cert-renewer
  namespace: production
---
# Bind it to a Role (secret-writer must exist in the same namespace)
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cert-renewer
  namespace: production
subjects:
  - kind: ServiceAccount
    name: cert-renewer
    namespace: production
roleRef:
  kind: Role
  name: secret-writer
  apiGroup: rbac.authorization.k8s.io
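
The binding above refers to a `secret-writer` Role without showing it, and the ServiceAccount only takes effect once a pod names it. A minimal sketch of both pieces follows; the Secret name `api-tls`, the verbs, and the image are assumptions chosen for illustration.

```yaml
# Hypothetical Role: write access to a single named Secret, nothing else
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: secret-writer
  namespace: production
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    resourceNames: ["api-tls"]             # assumed Secret name
    verbs: ["get", "update", "patch"]
---
# The workload opts in to the ServiceAccount by name
apiVersion: v1
kind: Pod
metadata:
  name: cert-renewer
  namespace: production
spec:
  serviceAccountName: cert-renewer
  containers:
    - name: renewer
      image: myregistry/cert-renewer:v1.0.0   # assumed image
```

To check the result, `kubectl auth can-i update secrets/api-tls -n production --as=system:serviceaccount:production:cert-renewer` should answer `yes`, and the same query for any other Secret should answer `no`.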

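For cloud APIs (AWS, GCP, Azure), use workload identity federation (IRSA or EKS Pod Identity on AWS, Workload Identity Federation for GKE on GCP, Microsoft Entra Workload ID on Azure): your cloud provider trusts a specific ServiceAccount and issues short-lived credentials. No more long-lived keys in Secrets.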

Pod Security

spec:
  securityContext:
    runAsNonRoot: true                     # refuse to start as root
    runAsUser: 1000
    fsGroup: 2000
    seccompProfile: { type: RuntimeDefault }
  containers:
    - name: app
      image: myregistry/app:v1.2.3
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true       # writes go to mounted volumes only
        capabilities:
          drop: ["ALL"]
      resources:                           # ALWAYS set these
        requests: { cpu: "250m", memory: "256Mi" }
        limits:   { cpu: "1000m", memory: "512Mi" }

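Enforce these at admission. Pod Security Admission (built-in) covers the non-root, privilege-escalation, capability, and seccomp settings through its `restricted` profile; use Kyverno / OPA Gatekeeper for richer policies such as a read-only root filesystem or mandatory resource limits.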
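
Pod Security Admission is switched on per namespace with labels. A minimal sketch, assuming the `production` namespace used elsewhere on this page:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    # Reject pods that violate the restricted profile
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    # Also surface violations as client warnings and audit-log annotations
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
```

A cautious rollout is to start with only `warn` and `audit` set, fix whatever they report, and add `enforce` once the namespace is clean.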

Image Hygiene

Habit	Why
Pin tags, never `latest`	A changed tag is what triggers a rolling update, and a pinned tag tells you what is running and what to roll back to
Use digests in production (`@sha256:...`)	The same tag can point to different images over time
Scan images in CI (Trivy, Grype)	Block known CVEs before they ship
Sign images (Cosign / Sigstore)	Verify provenance at admission
Distroless / minimal base	Smaller blast radius, fewer CVEs
Private registry with caching	Faster pulls, less Docker Hub rate-limiting

Observability

You can't operate what you can't see. Minimum:

Logs — ship from every pod (Fluent Bit DaemonSet → Loki / Elasticsearch / a SaaS).
Metrics — see Prometheus & Grafana.
Traces — OpenTelemetry SDK in apps, OTel Collector in cluster, ship to Jaeger / Tempo / a SaaS.
Events — kubectl get events is useful but ephemeral; ship them too.

Resource Hygiene

Practice	Why
Resource `requests` on every container	Without them, the scheduler packs blindly and utilization-based HPA has nothing to measure against
Resource `limits` on every container	Without them, one bad pod starves the node
Don't overcommit memory	An OOMKill terminates the container and loses its in-flight work; CPU throttling only slows it down
LimitRange per namespace	Fills in defaults for missing requests/limits and rejects out-of-range values at admission
ResourceQuota per namespace	Caps total CPU/memory/storage for one team
Cluster Autoscaler	Add/remove nodes based on pending pods

# Namespace-wide caps
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "100"
    requests.memory: "200Gi"
    limits.cpu: "200"
    limits.memory: "400Gi"
    pods: "200"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: production-defaults
  namespace: production
spec:
  limits:
    - type: Container
      defaultRequest: { cpu: "100m", memory: "128Mi" }
      default:        { cpu: "500m", memory: "512Mi" }

Multi-tenancy

For shared clusters (multiple teams, multiple environments):

One namespace per app + env at minimum.
NetworkPolicies default-deny + explicit allowlists.
ResourceQuota + LimitRange per namespace.
RBAC: humans get Roles in their namespaces, not ClusterRoles.
PriorityClass for critical workloads so they are scheduled first and are the last to be preempted or evicted under node pressure.
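
Two of the items above can be sketched as manifests. The pod labels (`app: api-server`, `app: frontend`), the port, and the PriorityClass name and value are assumptions for illustration.

```yaml
# Default-deny: selects every pod in the namespace and allows nothing
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes: ["Ingress", "Egress"]
---
# Explicit allow: api-server pods accept traffic from frontend pods on 8080
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api-server
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
---
# Cluster-scoped; pods opt in with spec.priorityClassName
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: business-critical
value: 100000
globalDefault: false
description: "Workloads that should be preempted and evicted last."
```

Two caveats apply. NetworkPolicies only take effect when the cluster's network plugin enforces them, and a default-deny on egress also blocks DNS lookups until a policy explicitly allows them.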

Operational Habits

A handful that pay off:

Never edit live resources. Always change git → reconcile. Drift is debt.
One thing per PR. Replicas, env, image — separate changes, separate rollouts.
Test deploys on staging first. Same Kustomize/Helm overlay, different values.
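Practice rollbacks. Reverting the commit (or kubectl rollout undo where nothing reconciles from Git, since self-healing sync would undo the undo) should be muscle memory.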
Capacity-plan with kubectl top and HPA history, not gut feel.
Run game days. Kill a pod, drain a node, fail a region. See what breaks.
Pin and upgrade. K8s versions, controller versions, Helm chart versions.
Don't run databases on K8s unless someone owns operating them. A managed service (RDS, Cloud SQL) is almost always the right call.
