Best Practices
Policy lifecycle, performance, debugging, working with compliance teams, common pitfalls, scaling across teams
The operational realities of running policy-as-code without making engineering miserable.
The Policy Lifecycle
Every policy goes through:
- Propose: RFC-style document covering what the rule is, why it exists, who is affected, and the exception path.
- Implement + test: code, unit tests, sample violations.
- Audit mode: the policy runs but doesn't block; collect metrics for at least 2 weeks.
- Warn mode: requests still pass, but would-be blocks are reported and affected teams are informed.
- Enforce mode: blocks for real.
- Maintain: review hits monthly; sunset rules that fire constantly (broken) or never (irrelevant).
Skipping audit/warn is the single most common cause of policy-induced incidents.
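The audit, warn, enforce progression can be gated in tooling so that a policy cannot skip a stage. A minimal Python sketch, assuming a hypothetical `next_mode` helper and a 14-day minimum audit window (taken from the two-week guidance in the lifecycle); the mode names are illustrative and not tied to any engine's configuration values:

```python
from datetime import date, timedelta

# Ordered rollout stages; names are illustrative, not engine settings.
MODES = ["audit", "warn", "enforce"]
MIN_AUDIT_DAYS = 14  # assumption: the two-week minimum from the lifecycle


def next_mode(current: str, entered_on: date, today: date) -> str:
    """Return the mode a policy may move to next, never skipping a stage."""
    if current not in MODES:
        raise ValueError(f"unknown mode: {current}")
    if current == "enforce":
        return "enforce"  # already at the final stage
    if current == "audit" and today - entered_on < timedelta(days=MIN_AUDIT_DAYS):
        return "audit"  # not enough audit data collected yet
    return MODES[MODES.index(current) + 1]
```

A promotion job or PR check could call this and refuse any change that jumps straight from audit to enforce.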
Performance
OPA / Gatekeeper / Kyverno evaluate on every admission request. Slow policies = slow API server = slow everything.
- Keep rules simple. Deep `walk()`s and nested loops are slow.
- Use indices. Where possible, structure rules so OPA can build a hash lookup instead of scanning.
- Cache external data in OPA bundles, not via per-request HTTP calls.
- Measure: Gatekeeper exposes Prometheus metrics for admission latency. Alert on P99 > 500ms.
A rule that takes 100ms to evaluate, called on every Pod creation, will eventually be the bottleneck. Test policies at scale (synthetic load against admission webhook) before enforcing in prod.
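When load-testing a webhook, the P99 check can be run against raw latency samples. The sketch below is one way to do that in Python, assuming a nearest-rank percentile and the 500 ms budget mentioned above; `percentile` and `breaches_budget` are hypothetical helper names, and in a Prometheus setup the same check would normally be a query over the exported metrics rather than a script:

```python
import math

P99_BUDGET_MS = 500.0  # threshold from the guidance above


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def breaches_budget(latencies_ms: list[float]) -> bool:
    """True when the P99 admission latency exceeds the budget."""
    return percentile(latencies_ms, 99) > P99_BUDGET_MS
```

Two slow requests out of a hundred are enough to breach the budget here, which is why averages hide the problem and P99 does not.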
Debugging
When a policy fires unexpectedly:
- OPA REPL: `opa run policy.rego` lets you load the policy and run `input := {...}` interactively.
- opa eval: `opa eval --data policy.rego --input bad.json 'data.policy.deny'` prints why.
- Gatekeeper logs: `kubectl logs -n gatekeeper-system deployment/gatekeeper-controller-manager` to see what's being denied and why.
- Kyverno policy-reporter: visual dashboard of violations and trends.
For an engineer hitting a policy block, the error message is the only signal. Always include the why and a link to the policy source:
```rego
deny[msg] {
    ...
    msg := sprintf("Container %v runs as root (forbidden by no-root-containers policy). See: https://policies.example.com/no-root", [container.name])
}
```

A "denied" with no context wastes an hour of engineer time.
Working with Compliance
Policy-as-code is your most powerful compliance tool — if you connect it to the framework.
For SOC2 / ISO 27001 / FedRAMP / PCI:
- Map each control to one or more policies. "CC6.1 - Logical access" → "K8s RBAC required", "Pod security context required", "Image signature required."
- Generate evidence automatically: a daily report of policy hits, exemptions, and audit-mode violations becomes your compliance evidence.
- Treat exceptions as audited: every policy bypass is logged with reason + duration + approver.
Compliance auditors love this because it turns "we have a policy" into "we have a policy and here's evidence it's enforced." Many controls collapse into a single Gatekeeper constraint.
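The control-to-policy mapping and the daily evidence report can be kept as plain data. A rough Python sketch, assuming hypothetical policy names derived from the example above, a placeholder second control (`CC7.2`) with no policies yet, and a `daily_hits` dictionary of per-policy violation counts supplied by whatever reporting you already have:

```python
# Hypothetical mapping from compliance controls to policy names.
CONTROL_TO_POLICIES: dict[str, list[str]] = {
    "CC6.1": [
        "k8s-rbac-required",
        "pod-security-context-required",
        "image-signature-required",
    ],
    "CC7.2": [],  # placeholder: a control with no policy mapped yet
}


def unmapped_controls(mapping: dict[str, list[str]]) -> list[str]:
    """Controls that have no policy behind them, i.e. gaps in coverage."""
    return sorted(control for control, policies in mapping.items() if not policies)


def evidence_for(control: str, mapping: dict[str, list[str]],
                 daily_hits: dict[str, int]) -> dict[str, int]:
    """Per-policy hit counts for one control; policies without hits report 0."""
    return {policy: daily_hits.get(policy, 0) for policy in mapping[control]}
```

Running `unmapped_controls` in CI is a cheap way to notice when a control has been added to the framework without a policy to back it.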
Multi-Tenancy Considerations
Different teams may need different policies. Patterns:
- Tier-based: `tier=production` namespaces get strict policies; `tier=experimental` gets relaxed.
- Team labels: Constraint scoped to namespaces labeled `team=platform` vs. `team=data`.
- Project-level OPA bundles: each team's policies pulled from their own folder, evaluated against their own resources.
For very large orgs, a policy hub-and-spoke: central team owns core policies (security, compliance, the must-have-by-corporate); teams add team-specific policies that extend (not override) the core.
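The "extend, not override" rule in the hub-and-spoke model can be checked mechanically when team policies are merged with the core set. A small Python sketch, assuming policies are represented as a hypothetical name-to-enforcement-mode dictionary:

```python
def effective_policies(core: dict[str, str], team: dict[str, str]) -> dict[str, str]:
    """Combine core and team policies; team entries may add rules but not replace core ones."""
    overridden = sorted(set(core) & set(team))
    if overridden:
        raise ValueError(f"team policies may not override core policies: {overridden}")
    return {**core, **team}
```

A team that tries to redefine a core policy under the same name gets an error at merge time instead of silently weakening it.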
Disaster Recovery: Don't Block Yourself Out
A policy that's too strict can lock you out. Patterns to avoid this:
- `failurePolicy: Ignore` on the validating webhook in early days: if the webhook itself fails, allow the request (fail open).
- Once mature: `failurePolicy: Fail` (deny on webhook failure), which closes the loop.
- Always exempt cluster operators: `policy.kubernetes.io/exempt: cluster-admin` style labels.
- Have a "policy override" namespace that's exempt from most rules, for emergency reactive deploys.
A real-world story: a policy required image signatures. The image signing service went down. The cluster couldn't pull any image. The policy denied even the rescue image used to fix the signing service. Have an escape hatch.
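The interaction between the failure policy and the escape hatch is easier to reason about as a decision order. The following Python function is a toy model of that order, not the Kubernetes API: `admit` and the `policy-override` namespace name are assumptions for illustration, while `Ignore` and `Fail` are the two `failurePolicy` values discussed above.

```python
def admit(namespace: str, webhook_healthy: bool, policy_allows: bool,
          failure_policy: str, exempt_namespaces: frozenset[str]) -> bool:
    """Toy model of the admission outcome for a single request."""
    if failure_policy not in ("Ignore", "Fail"):
        raise ValueError(f"unknown failurePolicy: {failure_policy}")
    if namespace in exempt_namespaces:
        return True  # escape hatch: the policy is never consulted
    if not webhook_healthy:
        return failure_policy == "Ignore"  # fail open vs. fail closed
    return policy_allows
```

In the signing-service story, the rescue deployment corresponds to a request where `policy_allows` is false; it only gets through if it targets an exempt namespace.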
Cost-Benefit per Policy
Not every rule is worth a policy. Quick test before writing:
| Question | Answer |
|---|---|
| What real damage would a violation cause? | If "nothing significant" — skip |
| How often will it fire? | If "constantly" — pattern problem, not a policy gap |
| Can we make the right thing easy instead? | Often yes (Terraform module, scaffold) |
| Will engineers thank you or curse you? | If the latter — calibrate |
Policy is one tool; documentation, guardrails-as-libraries, default-safe templates are all alternatives. Pick the lowest-friction tool that achieves the goal.
Scaling Across Teams
Running policy for 10 teams is a different practice from running it for 100. Patterns at scale:
- Central platform team owns the policy engine, including upgrades and infrastructure.
- Distributed teams write team-specific policies that extend the central ones.
- Office hours — a regular slot for teams to discuss exemptions, propose changes.
- Self-service exemption workflow — PR template that includes justification, expiration, approver list. Auto-merges if approver signs.
- Quarterly policy review — every active policy reviewed: still needed? Still right? Should it be tighter?
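The self-service exemption workflow comes down to a few checks a bot can run on each exemption PR. A sketch in Python, assuming a hypothetical `Exemption` record and a 90-day cap on exemption length (the cap is an assumption, not something the workflow above prescribes):

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Exemption:
    policy: str
    justification: str
    expires_on: date
    approvers: list[str]   # people allowed to approve this exemption
    signed_by: list[str]   # people who have actually approved it


def can_auto_merge(ex: Exemption, today: date, max_days: int = 90) -> bool:
    """True when the request has a reason, a bounded future expiry, and a valid sign-off."""
    has_reason = bool(ex.justification.strip())
    not_expired = today < ex.expires_on
    bounded = (ex.expires_on - today).days <= max_days  # assumed cap on duration
    approved = any(person in ex.approvers for person in ex.signed_by)
    return has_reason and not_expired and bounded and approved
```

Requests that fail any check stay open for a human to review rather than being rejected outright.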
Audit Trail
Every policy decision should be loggable. Configure:
- Gatekeeper: audit logs to stdout, scraped by your logging system.
- Kyverno: built-in reports CRDs; integrate with policy-reporter.
- OPA: decision logs to remote endpoint or local file.
What to keep: the input (sanitized), the decision (allow/deny/why), the user/SA, the timestamp. Retain 12+ months for compliance.
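A retained record with those four fields might look like the following Python sketch. The field names, the `SENSITIVE_KEYS` set, and the `[REDACTED]` marker are assumptions for illustration; each engine has its own log format, and this only shows the shape of a sanitized record:

```python
from datetime import datetime, timezone

SENSITIVE_KEYS = {"password", "token", "secret"}  # assumption: keys to redact


def sanitize(value):
    """Recursively replace the values of sensitive keys with a redaction marker."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else sanitize(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [sanitize(item) for item in value]
    return value


def decision_record(request_input: dict, allowed: bool, reason: str,
                    user: str, now: datetime) -> dict:
    """Build one audit record: sanitized input, decision, user/SA, timestamp."""
    return {
        "timestamp": now.isoformat(),
        "user": user,
        "decision": "allow" if allowed else "deny",
        "reason": reason,
        "input": sanitize(request_input),
    }
```

Sanitizing before the record is written means the 12-month archive never holds the raw secret values.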
Common Pitfalls
Audit mode skipped. A new policy goes straight to enforce. First prod incident teaches you why audit exists.
No exemption process. Engineers either route around the policy or work gets stuck. Build the exemption process before turning enforcement on.
Policy rot. Rules that fire 1000× a day and no one cares. Either the rule is broken or the convention changed; either way, fix it. Don't let alert blindness happen at the policy layer.
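Policy rot can be caught with a simple classification over hit counts during the monthly review. A minimal Python sketch, assuming a hypothetical `classify_rule` helper and a "noisy" threshold of 1000 hits per day borrowed from the example above; the right threshold depends on cluster size:

```python
def classify_rule(hits_per_day: list[int], noisy_threshold: int = 1000) -> str:
    """Label a rule from its daily hit counts over the review window."""
    if not hits_per_day:
        raise ValueError("no data for this rule")
    total = sum(hits_per_day)
    if total == 0:
        return "never-fires"  # candidate for sunsetting as irrelevant
    if total / len(hits_per_day) >= noisy_threshold:
        return "noisy"  # likely broken, or the convention has changed
    return "healthy"
```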
Vendor lock-in to wrong tool. Building everything in Sentinel for Terraform Cloud, then later wanting cross-cutting policies — painful migration. Default to OPA unless you have a strong reason otherwise.
Policies invisible to engineers. Engineers shouldn't need a Slack archeology project to find which rule blocked them. Surface policy docs in the error message and in your IDP / Backstage portal.
Treating compliance as the only user. Compliance is one consumer; engineers are the other. Optimize for both, especially the latter.
Webhook timeout. Default webhook timeout is short (10s in K8s). A slow policy times out and either fails-open (bad) or fails-closed (cluster broken). Monitor admission latency.
No tests on policy upgrades. The policy engine itself (OPA, Gatekeeper, Kyverno) gets upgraded. Test that your existing policies still behave the same way before rolling.
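One way to test an engine upgrade is to replay a fixed set of fixtures through the old and new versions and compare the allow/deny results. The Python sketch below assumes you have already collected both result sets as hypothetical fixture-name-to-decision dictionaries; how they are produced depends on the engine:

```python
def decision_diff(before: dict[str, bool],
                  after: dict[str, bool]) -> dict[str, tuple]:
    """Fixtures whose decision changed, or that are missing on one side, after the upgrade."""
    return {
        name: (before.get(name), after.get(name))
        for name in sorted(set(before) | set(after))
        if before.get(name) != after.get(name)
    }
```

An empty result means the upgrade did not change any recorded decision; anything else is worth reading before rolling out.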
Checklist
Policy-as-code production readiness:
- Policy source in Git, reviewed via PR
- Every policy has unit tests (positive + negative cases)
- CI runs `opa test` / `conftest test` / equivalent
- Audit mode used before enforce mode (≥ 2 weeks)
- Error messages include why and a link to the policy doc
- Exemption process defined and used (not bypassed)
- Webhook failure policy considered (fail-open vs fail-closed)
- Admission latency monitored (P99 < 500ms)
- System namespaces excluded from user policies
- Audit logs retained (≥ 12 months for compliance)
- Each compliance control mapped to one or more policies
- Quarterly review of active policies (relevance, hit rate)
- Escape hatch for emergency cluster operations
- Policy upgrades tested before applying
What's Next
You have a policy-as-code practice. Connect it to:
- GitOps — policies themselves deployed via GitOps; bundles in Git
- Secrets — policies enforce "no secrets in plaintext"
- Service Mesh — mesh-level policies (mTLS required, authz rules) complement K8s admission
- CI/CD — conftest in CI gates merges
- Supply Chain Security — image signature policies