Best Practices

Policy lifecycle, performance, debugging, working with compliance teams, common pitfalls, scaling across teams

The operational realities of running policy-as-code without making engineering miserable.

The Policy Lifecycle

Every policy goes through:

Propose: an RFC-style document covering what the rule is, why it exists, who is affected, and the exception path.
Implement + test: code, unit tests, sample violations.
Audit mode: the policy runs but doesn't block; collect metrics for at least 2 weeks.
Warn mode: requests that would be blocked produce a warning instead; affected teams are informed.
Enforce mode: blocks for real.
Maintain: review hits monthly; sunset rules that fire constantly (broken) or never (irrelevant).

Skipping audit/warn is the single most common cause of policy-induced incidents.
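One way to make the "no skipping" rule mechanical is a small promotion gate in the rollout tooling. The sketch below is illustrative only: the `Stage` enum, `MIN_AUDIT_PERIOD`, and `next_stage` are hypothetical names, it covers the rollout stages but not the ongoing Maintain step, and it assumes the only timed gate is the two-week audit window.

```python
from datetime import date, timedelta
from enum import Enum


class Stage(Enum):
    PROPOSE = 1
    IMPLEMENT = 2
    AUDIT = 3
    WARN = 4
    ENFORCE = 5


# Minimum time in audit mode before promotion (the "at least 2 weeks" rule).
MIN_AUDIT_PERIOD = timedelta(weeks=2)


def next_stage(current: Stage, entered_on: date, today: date) -> Stage:
    """Return the stage a policy may move to today; stages are never skipped."""
    if current is Stage.ENFORCE:
        return current  # already at the final rollout stage
    if current is Stage.AUDIT and today - entered_on < MIN_AUDIT_PERIOD:
        return current  # not enough audit data collected yet
    return Stage(current.value + 1)
```

Because the function only ever returns the current stage or the one directly after it, a policy cannot jump from Implement to Enforce in one step.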

Performance

OPA / Gatekeeper / Kyverno evaluate on every admission request. Slow policies = slow API server = slow everything.

Keep rules simple. Deep walk()s and nested loops are slow.
Use indices. Where possible, structure rules so OPA can build a hash lookup instead of scanning.
Cache external data in OPA bundles, not via per-request HTTP calls.
Measure: Gatekeeper exposes Prometheus metrics for admission latency. Alert on P99 > 500ms.

A rule that takes 100ms to evaluate, called on every Pod creation, will eventually be the bottleneck. Test policies at scale (synthetic load against admission webhook) before enforcing in prod.
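When a synthetic load test produces a list of admission latencies, the P99 check can be sketched as follows. This is a minimal example using the nearest-rank definition of a percentile; `percentile`, `within_budget`, and `P99_BUDGET_MS` are hypothetical names, and a Prometheus histogram query would estimate the value differently.

```python
import math


def percentile(samples_ms: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


P99_BUDGET_MS = 500.0  # the alert threshold suggested above


def within_budget(samples_ms: list[float]) -> bool:
    """True if the P99 admission latency stays at or under the budget."""
    return percentile(samples_ms, 99) <= P99_BUDGET_MS
```

Running this against load-test output before switching a policy to enforce mode gives a simple pass/fail signal for the latency budget.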

Debugging

When a policy fires unexpectedly:

OPA REPL: opa run policy.rego lets you load the policy and run input := {...} interactively.
opa eval: opa eval --data policy.rego --input bad.json 'data.policy.deny' evaluates the deny rule against a sample input and prints the resulting messages.
Gatekeeper logs: kubectl logs -n gatekeeper-system deployment/gatekeeper-controller-manager — see what's being denied and why.
Kyverno policy-reporter: visual dashboard of violations and trends.

For an engineer hitting a policy block, the error message is the only signal. Always include the why and a link to the policy source:

deny[msg] {
  ...
  msg := sprintf("Container %v runs as root (forbidden by no-root-containers policy). See: https://policies.example.com/no-root", [container.name])
}

A "denied" with no context wastes an hour of engineer time.
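The "why plus link" rule can itself be checked in CI. The sketch below is one possible lint, under the assumption that an actionable message names the policy and contains an http(s) link; `message_problems` and `URL_PATTERN` are hypothetical names, not part of OPA or Gatekeeper.

```python
import re

# Assumption: a message is actionable if it names the policy and links to its source.
URL_PATTERN = re.compile(r"https?://\S+")


def message_problems(msg: str, policy_name: str) -> list[str]:
    """Return the reasons a denial message would leave an engineer guessing."""
    problems = []
    if policy_name not in msg:
        problems.append("does not name the policy")
    if not URL_PATTERN.search(msg):
        problems.append("has no link to the policy source")
    return problems
```

A policy test suite could feed each sample violation's message through this check and fail the build when the list is non-empty.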

Working with Compliance

Policy-as-code is your most powerful compliance tool — if you connect it to the framework.

For SOC2 / ISO 27001 / FedRAMP / PCI:

Map each control to one or more policies. "CC6.1 - Logical access" → "K8s RBAC required", "Pod security context required", "Image signature required."
Generate evidence automatically: a daily report of policy hits, exemptions, and audit-mode violations becomes your compliance evidence.
Treat exceptions as audited: every policy bypass is logged with reason + duration + approver.

Compliance auditors love this because it turns "we have a policy" into "we have a policy and here's evidence it's enforced." Many controls collapse into a single Gatekeeper constraint.
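As a rough illustration of automatic evidence generation, the sketch below rolls one day's policy decisions up to the control level. The policy names, the `outcome` values, and the shape of each decision record are assumptions made for the example; only the CC6.1 control ID comes from the text above.

```python
from collections import Counter

# Hypothetical mapping; the policy names paraphrase the CC6.1 example above.
CONTROL_TO_POLICIES = {
    "CC6.1": ["k8s-rbac-required", "pod-security-context-required", "image-signature-required"],
}


def evidence_report(decisions: list[dict]) -> dict:
    """Summarize decisions per control: denials, audit-mode violations, exemptions."""
    counts = Counter((d["policy"], d["outcome"]) for d in decisions)
    report = {}
    for control, policies in CONTROL_TO_POLICIES.items():
        report[control] = {
            outcome: sum(counts[(p, outcome)] for p in policies)
            for outcome in ("denied", "audit_violation", "exempted")
        }
    return report
```

Decisions from policies that are not mapped to any control are ignored here, which is itself a useful signal: an unmapped policy produces no compliance evidence.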

Multi-Tenancy Considerations

Different teams may need different policies. Patterns:

Tier-based: tier=production namespaces get strict policies; tier=experimental gets relaxed.
Team labels: Constraint scoped to namespaces labeled team=platform vs. team=data.
Project-level OPA bundles: each team's policies pulled from their own folder, evaluated against their own resources.

For very large orgs, use a hub-and-spoke model: a central team owns the core policies (security, compliance, and anything mandated at the corporate level); individual teams add team-specific policies that extend, but never override, the core.
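The "extend, not override" rule amounts to a set union. The following sketch shows one way to compute the policies that apply to a namespace; all policy names, `CORE_POLICIES`, `TIER_POLICIES`, and `effective_policies` are hypothetical, and the choice to treat an unknown tier as production is an assumption made here for safety.

```python
# Hypothetical policy names, for illustration only.
CORE_POLICIES = {"no-root-containers", "image-signature-required"}

TIER_POLICIES = {
    "production": {"resource-limits-required", "no-latest-tag"},
    "experimental": set(),
}


def effective_policies(tier: str, team_policies: set[str]) -> set[str]:
    """Union of core, tier, and team policies; teams can add rules but never remove core ones."""
    # Assumption: an unrecognized tier gets the strictest (production) set.
    tier_set = TIER_POLICIES.get(tier, TIER_POLICIES["production"])
    return CORE_POLICIES | tier_set | team_policies
```

Since the result is always a superset of `CORE_POLICIES`, no team-level input can weaken the central baseline.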

Disaster Recovery: Don't Block Yourself Out

A policy that's too strict can lock you out. Patterns to avoid this:

Set failurePolicy: Ignore on the validating webhook in the early days, so that if the webhook itself fails, the request is allowed (fail open).
Once mature, switch to failurePolicy: Fail (deny on webhook failure) so that a webhook outage can no longer bypass policy.
Always exempt cluster operators, for example with a label in the style of policy.kubernetes.io/exempt: cluster-admin.
Have a "policy override" namespace that's exempt from most rules, for emergency deploys.

A real-world story: a policy required image signatures. The image signing service went down, so the cluster couldn't start any new workload. The policy denied even the rescue image needed to fix the signing service. Have an escape hatch.
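The interaction between fail-open, fail-closed, and an exempt namespace can be modeled in a few lines. This is a simplified model for reasoning, not how the API server is implemented; `admit` and `EXEMPT_NAMESPACES` are hypothetical names, and in a real cluster the exemption would typically be expressed through the webhook's namespaceSelector.

```python
from typing import Optional

EXEMPT_NAMESPACES = {"policy-override"}  # hypothetical emergency namespace


def admit(namespace: str, policy_allows: Optional[bool], failure_policy: str) -> bool:
    """Model of the admission outcome for one request.

    policy_allows is None when the webhook could not answer (timeout, crash).
    failure_policy mirrors the webhook setting: "Ignore" fails open, "Fail" fails closed.
    """
    if namespace in EXEMPT_NAMESPACES:
        return True  # escape hatch: the webhook is never consulted
    if policy_allows is None:
        return failure_policy == "Ignore"
    return policy_allows
```

In the signing-service story, the rescue deploy corresponds to `admit("policy-override", None, "Fail")`, which succeeds only because the exempt namespace exists.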

Cost-Benefit per Policy

Not every rule is worth a policy. Quick test before writing:

Question	Answer
What real damage would a violation cause?	If "nothing significant" — skip
How often will it fire?	If "constantly" — pattern problem, not a policy gap
Can we make the right thing easy instead?	Often yes (Terraform module, scaffold)
Will engineers thank you or curse you?	If the latter — calibrate

Policy is one tool; documentation, guardrails-as-libraries, default-safe templates are all alternatives. Pick the lowest-friction tool that achieves the goal.

Scaling Across Teams

Running policy for 10 teams is a different practice from running it for 100. Patterns at scale:

Central platform team owns the policy engine, including upgrades and infrastructure.
Distributed teams write team-specific policies that extend the central ones.
Office hours — a regular slot for teams to discuss exemptions, propose changes.
Self-service exemption workflow — PR template that includes justification, expiration, approver list. Auto-merges if approver signs.
Quarterly policy review — every active policy reviewed: still needed? Still right? Should it be tighter?
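The self-service exemption workflow depends on a machine-checkable request. A minimal sketch of that check is shown below, assuming the PR template is parsed into a record with the three fields named above; the `Exemption` dataclass and `exemption_errors` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Exemption:
    policy: str
    justification: str
    expires: date
    approved_by: str


def exemption_errors(ex: Exemption, approvers: set[str], today: date) -> list[str]:
    """Return the reasons an exemption request cannot auto-merge (empty list means OK)."""
    errors = []
    if not ex.justification.strip():
        errors.append("missing justification")
    if ex.expires <= today:
        errors.append("expiration must be in the future")
    if ex.approved_by not in approvers:
        errors.append("approver is not on the approver list")
    return errors
```

Requiring an expiration date on every exemption also gives the quarterly review a natural list of items to revisit.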

Audit Trail

Every policy decision should be loggable. Configure:

Gatekeeper: audit logs to stdout, scraped by your logging system.
Kyverno: built-in reports CRDs; integrate with policy-reporter.
OPA: decision logs to remote endpoint or local file.

What to keep: the input (sanitized), the decision (allow/deny/why), the user/SA, the timestamp. Retain 12+ months for compliance.
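To make the "what to keep" list concrete, here is a sketch of a log record builder that redacts sensitive fields before serializing. The key names in `SENSITIVE_KEYS` and the record layout are assumptions for illustration; none of the engines above emit exactly this format.

```python
import json
from datetime import datetime, timezone

SENSITIVE_KEYS = {"password", "token", "secret"}  # assumption: adjust to your inputs


def sanitize(value):
    """Recursively replace values stored under sensitive keys."""
    if isinstance(value, dict):
        return {
            k: "[REDACTED]" if k.lower() in SENSITIVE_KEYS else sanitize(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [sanitize(v) for v in value]
    return value


def decision_record(policy_input: dict, allowed: bool, reason: str,
                    user: str, when: datetime) -> str:
    """Serialize one decision: sanitized input, decision, user/SA, timestamp."""
    return json.dumps({
        "input": sanitize(policy_input),
        "decision": {"allowed": allowed, "reason": reason},
        "user": user,
        "timestamp": when.astimezone(timezone.utc).isoformat(),
    }, sort_keys=True)
```

Sanitizing before the record leaves the process keeps secrets out of a log store that must be retained for a year or more.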

Common Pitfalls

Audit mode skipped. A new policy goes straight to enforce. First prod incident teaches you why audit exists.

No exemption process. Engineers either route around the policy or work gets stuck. Build the exemption process before turning enforcement on.

Policy rot. Rules that fire 1000× a day and no one cares. Either the rule is broken or the convention changed; either way, fix it. Don't let alert blindness happen at the policy layer.
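Policy rot is easier to catch when the review starts from a generated list rather than from memory. The sketch below labels a policy from its recent daily hit counts; `classify_policy` is a hypothetical helper, and the default threshold simply reuses the 1000-per-day figure above.

```python
def classify_policy(daily_hits: list[int], noisy_threshold: int = 1000) -> str:
    """Label a policy from its recent daily hit counts."""
    if not daily_hits:
        return "no-data"
    if sum(daily_hits) == 0:
        return "never-fires"       # possibly irrelevant: candidate for sunset
    if sum(daily_hits) / len(daily_hits) >= noisy_threshold:
        return "fires-constantly"  # rule is broken or the convention changed
    return "healthy"
```

Anything not labeled "healthy" goes on the agenda for the next policy review.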

Vendor lock-in to wrong tool. Building everything in Sentinel for Terraform Cloud, then later wanting cross-cutting policies — painful migration. Default to OPA unless you have a strong reason otherwise.

Policies invisible to engineers. Engineers shouldn't need a Slack archeology project to find which rule blocked them. Surface policy docs in the error message and in your IDP / Backstage portal.

Treating compliance as the only user. Compliance is one consumer; engineers are the other. Optimize for both, especially the latter.

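Webhook timeout. The admission webhook timeout is short: timeoutSeconds defaults to 10s in Kubernetes and cannot exceed 30s. A slow policy times out, and the request either fails open (the policy is silently skipped) or fails closed (the cluster stops accepting changes). Monitor admission latency.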

No tests on policy upgrades. The policy engine itself (OPA, Gatekeeper, Kyverno) gets upgraded. Test that your existing policies still behave the same way before rolling.
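A simple way to test an engine upgrade is to record the allow/deny decision for a fixed set of fixture inputs on the old version, repeat on the new version, and diff the two. The sketch below covers only the diff step; `decision_changes` is a hypothetical helper, and how the decisions are collected (for example by running opa eval against each fixture) is left open.

```python
def decision_changes(before: dict[str, bool], after: dict[str, bool]) -> dict[str, tuple]:
    """Map each fixture whose allow/deny decision differs between engine versions.

    Fixtures missing from one side are reported with None for that side.
    """
    changes = {}
    for name in sorted(before.keys() | after.keys()):
        old, new = before.get(name), after.get(name)
        if old != new:
            changes[name] = (old, new)
    return changes
```

An empty result means the upgrade preserved behavior on every fixture; any entry is a reason to pause the rollout and investigate.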

What's Next

You have a policy-as-code practice. Connect it to:

GitOps — policies themselves deployed via GitOps; bundles in Git
Secrets — policies enforce "no secrets in plaintext"
Service Mesh — mesh-level policies (mTLS required, authz rules) complement K8s admission
CI/CD — conftest in CI gates merges
Supply Chain Security — image signature policies
