Steven's Knowledge

Patterns

Policy structure, bundles, testing, mutation, exceptions, gradual rollout, library structure, cross-cutting policies

The patterns that turn ad-hoc rules into a maintainable policy system.

Policy as a Function

A policy is a pure function: input → decision. Treat it like code:

  • Single responsibility per rule. "Pods must not run as root" — not "Pods must not run as root AND must have labels AND must use approved images." Three rules, three reasons.
  • Tested like code. Every rule has positive (should deny) and negative (should allow) test cases.
  • Reviewed like code. PRs into the policy repo go through code review.
  • Released like code. Don't push to prod directly; stage and observe first.
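
As a minimal sketch of single responsibility, the two rules below each check one thing and report one reason. The package name example.pod_rules and the team label key are illustrative assumptions. The syntax matches the pre-1.0 Rego used on this page; OPA 1.0 and later expects the if and contains keywords unless run with --v0-compatible.

```rego
package example.pod_rules

# Rule 1: one concern (root user), one message.
deny[msg] {
  container := input.request.object.spec.containers[_]
  container.securityContext.runAsUser == 0
  # object.get supplies a fallback so a missing name cannot make the rule undefined.
  name := object.get(container, "name", "<unnamed>")
  msg := sprintf("Container %v must not run as root", [name])
}

# Rule 2: a separate concern (required label), a separate message.
deny[msg] {
  not input.request.object.metadata.labels.team
  msg := "Pod must carry a team label"
}
```

Because each rule produces its own message, a rejected pod reports exactly which requirement it missed, and either rule can be tested, exempted, or rolled back without touching the other.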

Bundles and Discovery

OPA loads policies from "bundles": gzipped tarballs, optionally signed, served over HTTP(S). The policy files live in Git; CI builds the bundle; OPA polls for new versions and downloads them.

policy-repo/
├── kubernetes/
│   ├── required_labels.rego
│   ├── no_privileged.rego
│   └── ...
├── terraform/
│   └── s3_no_public.rego
└── tests/
    └── ...

CI:

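opa test policy-repo/                # all tests pass
opa build -b policy-repo/ -o bundle.tar.gz
aws s3 cp bundle.tar.gz s3://policy-bundles/v1.2.3/   # immutable, versioned copy
aws s3 cp bundle.tar.gz s3://policy-bundles/latest/   # stable path that clients poll

OPA instances and other bundle clients pull s3://policy-bundles/latest/ periodically. Gatekeeper does not consume bundles; its policies ship as ConstraintTemplate and Constraint resources, which the same pipeline can apply from Git. Either way, updating policy = updating Git = updating what is deployed.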

Testing Policies

Two layers:

  1. Unit tests: opa test runs Rego test rules. Fast, hermetic.

    test_root_user_denied {
      deny[_] with input as {"request": {"object": {"spec": {"containers": [{"securityContext": {"runAsUser": 0}}]}}}}
    }
    
    test_non_root_user_allowed {
      count(deny) == 0 with input as {"request": {"object": {"spec": {"containers": [{"securityContext": {"runAsUser": 1000}}]}}}}
    }

  2. Integration tests: spin up Gatekeeper in CI, apply known-bad and known-good resources, assert that bad ones fail and good ones succeed.

    kubectl apply -f testdata/bad-pod.yaml && exit 1   # should fail
    kubectl apply -f testdata/good-pod.yaml || exit 1  # should succeed

Coverage matters: every policy needs both positive and negative tests. Missing the negative case is how you accidentally block legitimate workloads.
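
A slightly fuller test file might look like the sketch below. It bundles a copy of the root check with its tests so it can be run on its own with opa test; the package name example.no_root and the pod_with helper are hypothetical, and the syntax is the same pre-1.0 Rego as the tests above.

```rego
package example.no_root

# Rule under test (same shape as the root check used elsewhere on this page).
deny[msg] {
  container := input.request.object.spec.containers[_]
  container.securityContext.runAsUser == 0
  msg := "Container must not run as root"
}

# Hypothetical helper: wraps a container list in a minimal admission request.
pod_with(containers) = {"request": {"object": {"spec": {"containers": containers}}}}

# Positive case: one root container among several is enough to deny.
test_denies_when_any_container_is_root {
  pod := pod_with([{"securityContext": {"runAsUser": 1000}}, {"securityContext": {"runAsUser": 0}}])
  deny[_] with input as pod
}

# Negative case: a container with no securityContext does not trip this rule.
test_allows_container_without_security_context {
  pod := pod_with([{"image": "nginx"}])
  count(deny) == 0 with input as pod
}
```

The second test records a design decision: this rule only catches an explicit runAsUser of 0. If containers that omit the field should also be rejected, that belongs in a separate rule with its own pair of tests.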

Mutation Before Validation

When you can fix it automatically, do — don't make engineers fix things you could fix yourself:

  • Default runAsNonRoot: true on pods that don't specify it
  • Inject team label based on the namespace
  • Add resource requests/limits at sensible defaults
  • Strip privileged flags silently

A common stack: Kyverno mutates first (add labels, defaults), then Gatekeeper validates the now-augmented resource. Engineers experience fewer "your pod is missing field X" rejections.

Exceptions and Phased Rollout

Real policies have legitimate exceptions. Some patterns:

Allow-list namespaces

spec:
  match:
    excludedNamespaces: [kube-system, istio-system]

System namespaces often need things user namespaces shouldn't.

Labels-as-opt-out

The resource carries a label such as policy.example.com/skip-root-check: high-priority, where the value records the reason. The policy reads the label and skips the check when it is present. The audit log captures which workloads opted out.

deny[msg] {
  not input.request.object.metadata.labels["policy.example.com/skip-root-check"]
  container := input.request.object.spec.containers[_]
  container.securityContext.runAsUser == 0
  msg := "Container runs as root and has no exemption"
}
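
The rule above only checks that the label exists. If the opt-out should require a recognized reason, one possible variant is sketched below; the package name and the approved_reasons set are hypothetical, and in practice the list could live in external data instead.

```rego
package example.root_exemption

# Hypothetical allow-list of accepted exemption reasons.
approved_reasons := {"high-priority", "legacy-migration"}

# True only when the opt-out label is present and carries an approved reason.
has_valid_exemption {
  reason := input.request.object.metadata.labels["policy.example.com/skip-root-check"]
  approved_reasons[reason]
}

deny[msg] {
  not has_valid_exemption
  container := input.request.object.spec.containers[_]
  container.securityContext.runAsUser == 0
  msg := "Container runs as root and has no approved exemption"
}
```

A missing label and an unrecognized reason both leave has_valid_exemption undefined, so the check fails closed.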

Warn mode → Enforce mode

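Gatekeeper supports enforcementAction: deny (block the request; the default), enforcementAction: warn (admit the request but return a warning to the client), and enforcementAction: dryrun (admit and only record the violation in audit results):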

spec:
  enforcementAction: warn   # for 2 weeks
  match: ...

Roll out as warn, watch the audit log for legitimate hits, then switch to deny. Avoids the "Friday afternoon outage because a policy fired in prod for the first time" pattern.

Time-boxed exceptions

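Encode the expiration in the label: policy.example.com/skip-until: 2026-06-01. The policy honors the exemption only while the current time (time.now_ns() in Rego) is earlier than skip-until. Once the date passes, the workload falls back under enforcement automatically, with no cleanup step.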
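
A minimal sketch of the expiry check, assuming the label value is a plain YYYY-MM-DD date and reusing the root check from earlier as the rule being exempted. The package name is hypothetical; time.parse_ns takes a Go-style layout string, so "2006-01-02" is the layout for a date, not a sample value.

```rego
package example.time_boxed_exemption

# True only while the skip-until date is still in the future.
exemption_active {
  until := input.request.object.metadata.labels["policy.example.com/skip-until"]
  # A date with no time component parses as midnight UTC at the start of that day.
  expires_ns := time.parse_ns("2006-01-02", until)
  time.now_ns() < expires_ns
}

deny[msg] {
  not exemption_active
  container := input.request.object.spec.containers[_]
  container.securityContext.runAsUser == 0
  msg := "Container runs as root and has no unexpired exemption"
}
```

With OPA's default handling of built-in errors, a missing or unparsable date leaves exemption_active undefined, so a malformed label does not grant an exemption. The exemption ends at the start of the named day, not the end of it.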

Library Structure

For policy repos that grow past ~20 files:

policies/
├── lib/
│   ├── kubernetes/
│   │   ├── pods.rego          # helpers: containers(), is_root(), etc.
│   │   └── images.rego        # helpers: registry_of(), tag_of()
│   └── common/
│       └── labels.rego
├── rules/
│   ├── pod_security.rego      # uses lib/kubernetes/pods.rego
│   ├── image_registries.rego  # uses lib/kubernetes/images.rego
│   └── required_labels.rego
├── tests/
│   └── (mirrors rules/)
└── policy.yaml                # bundle manifest

DRY the predicates; keep rules thin. The same is_root() helper is referenced from multiple deny rules.
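
The split between helpers and rules could look like the sketch below, shown as two files. The package paths follow the directory layout above but are assumptions, as is the choice to have the containers helper cover initContainers as well.

```rego
# lib/kubernetes/pods.rego
package lib.kubernetes.pods

# Set of all containers in the pod under review, regular and init.
containers[c] {
  c := input.request.object.spec.containers[_]
}

containers[c] {
  c := input.request.object.spec.initContainers[_]
}

# Predicate: the container explicitly runs as UID 0.
is_root(c) {
  c.securityContext.runAsUser == 0
}
```

```rego
# rules/pod_security.rego
package rules.pod_security

import data.lib.kubernetes.pods

# The rule stays thin: iterate, apply a shared predicate, report.
deny[msg] {
  c := pods.containers[_]
  pods.is_root(c)
  msg := "Container must not run as root"
}
```

If the definition of "root" later widens, for example to cover pod-level securityContext, only is_root changes and every rule that uses it picks up the new behavior.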

Cross-Cutting Policies (OPA's Big Win)

One Rego policy can run in many contexts:

  • Kubernetes admission via Gatekeeper / OPA sidecar
  • Terraform CI via conftest
  • API authorization via OPA sidecar called per request
  • Image admission via OPA + image scanner integration

# Same "no_public_buckets" policy
package public_buckets

deny[msg] {
  # K8s: check Service of type=LoadBalancer
  input.kind == "Service"
  input.spec.type == "LoadBalancer"
  not input.metadata.annotations["allow-public"]
  msg := "Public LoadBalancer requires explicit allow-public annotation"
}

deny[msg] {
  # Terraform: check S3 bucket
  resource := input.resource_changes[_]
  resource.type == "aws_s3_bucket"
  resource.change.after.acl == "public-read"
  msg := sprintf("S3 bucket %v has public ACL", [resource.address])
}

One concept ("nothing public without explicit approval"), many enforcement points.

Policy as Data

For lookup-heavy rules, separate policy (the logic) from data (the table):

# rules/allowed_images.rego
package allowed_images

deny[msg] {
  container := input.request.object.spec.containers[_]
  registry := split(container.image, "/")[0]
  not data.allowed_registries[registry]
  msg := sprintf("Image %v from disallowed registry", [container.image])
}

data/allowed_registries.json:

{
  "allowed_registries": {
    "gcr.io": true,
    "registry.company.com": true,
    "quay.io": true
  }
}

Adding a registry is a one-line PR with a known-safe shape. The logic stays stable; the data evolves.
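
Keeping the table in data also makes the logic easy to test, because a test can substitute its own table with the with keyword. The sketch below is meant to run alongside rules/allowed_images.rego via opa test; the file path and the pod helper are hypothetical.

```rego
# tests/allowed_images_test.rego (hypothetical path)
package allowed_images

# Hypothetical helper: a minimal admission request with one container image.
pod(image) = {"request": {"object": {"spec": {"containers": [{"image": image}]}}}}

test_listed_registry_allowed {
  inp := pod("gcr.io/team/app:1.0")
  count(deny) == 0 with input as inp with data.allowed_registries as {"gcr.io": true}
}

test_unlisted_registry_denied {
  inp := pod("evil.example/app:1.0")
  deny[_] with input as inp with data.allowed_registries as {"gcr.io": true}
}

test_bare_image_name_denied {
  # split("nginx", "/")[0] is "nginx", which is not a listed registry.
  inp := pod("nginx")
  deny[_] with input as inp with data.allowed_registries as {"gcr.io": true}
}
```

The last test documents a side effect of taking the first path segment as the registry: images referenced without a registry prefix are rejected unless their first segment happens to be listed.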

Mutating Webhook Order Matters

In Kubernetes, multiple mutating webhooks can fire. Order matters:

  • Istio injects its sidecar
  • Linkerd adds proxy injection annotations
  • Your policy mutator adds labels
  • Network policy controller adds default-deny

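If two mutators touch the same field, the last writer wins, and you should not rely on a particular call order. Set reinvocationPolicy: IfNeeded on a mutating webhook so it is called again when a later mutator changes the object. Validating webhooks need no such setting: they always run after all mutation has finished.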

Audit Mode for Discovery

Before enforcing a policy, run it in audit-only mode and see what would fire. Gatekeeper's audit periodically scans existing cluster state against each constraint:

kubectl get constraints -A
# Shows current violations against each constraint, without blocking

This is how you find out what's already wrong before enforcing a policy that blocks new things. Often the answer is "fix what's there first, then enforce."

Composing Policies with Different Tools

A practical multi-layer stack:

Layer                      Tool                           Why
Pre-merge (Terraform)      conftest in CI                 Catch before resources exist
K8s admission (block)      Gatekeeper / Kyverno           Stop bad workloads at the gate
K8s admission (defaults)   Kyverno mutate                 Make right-by-default easy
Runtime (cloud config)     Cloud Custodian / AWS Config   Catch drift in cloud resources
API authorization          OPA / Cedar / OpenFGA          Per-request access checks
Image signature            Cosign + policy                Only signed images run

Each layer catches what the previous missed. Defense in depth.

Anti-Patterns

Policies as wikis. "Here's a YAML for a NetworkPolicy you should apply." Engineers won't. Make it a Gatekeeper constraint or a Kyverno generate rule.

Block-only. If every policy is "deny," your engineers experience PaC as friction. Mix in mutation and warnings for soft enforcement.

Untested policies in prod. The first time the rule fires is the first time you find the false positive. Test in CI, run in warn mode in staging, then enforce.

Cross-cutting policies in vendor-specific tools. If you use Sentinel for Terraform and OPA for K8s and a custom CI script for images, the same rule lives in three places. Pick one (usually OPA) and standardize.

Policy author ≠ subject expert. A central security team writing policies for product teams without consulting them produces theoretical rules that don't fit real workflows. Co-write policies with the people who'll comply with them.

What's Next

  • Best Practices — lifecycle, performance, debugging, compliance, pitfalls
