Best Practices
Alert fatigue management, response runbooks, compliance, performance, common pitfalls, scaling
The operational realities of runtime security that actually catches incidents without exhausting the team.
Alert Fatigue Is the Failure Mode
A loud Falco that no one reads is worse than no Falco — it gives false comfort. Discipline:
- Every alert tier has an owner and a response time
- PagerDuty pages are reserved for CRITICAL — and CRITICAL must be tuned to genuinely actionable
- Weekly review of "what's noisy": disable, add an exception, or escalate for investigation
- Track MTTR for alerts — if NOTICE alerts sit for weeks unread, they're not real
- Suppress during known maintenance: an engineer running `kubectl exec` in a debug session shouldn't page
The metric: % of alerts investigated within their tier's response time. Below 80% means the tiers are mis-calibrated.
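As a rough illustration of that metric, the sketch below computes the share of alerts investigated within their tier's response time. The tier names, owners, and response times in `TIER_POLICY` are hypothetical placeholders, and the `Alert` record is an assumed shape rather than anything Falco emits.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List, Optional

# Hypothetical tier policy: owners and response times are illustrative only.
TIER_POLICY = {
    "CRITICAL": {"owner": "on-call", "response": timedelta(minutes=15)},
    "WARNING": {"owner": "security-team", "response": timedelta(hours=24)},
    "NOTICE": {"owner": "platform-team", "response": timedelta(days=7)},
}


@dataclass
class Alert:
    tier: str
    # Time from firing to first investigation; None means never investigated.
    time_to_investigate: Optional[timedelta]


def sla_compliance(alerts: List[Alert]) -> float:
    """Percentage of alerts investigated within their tier's response time."""
    if not alerts:
        return 100.0
    on_time = sum(
        1
        for a in alerts
        if a.time_to_investigate is not None
        and a.time_to_investigate <= TIER_POLICY[a.tier]["response"]
    )
    return 100.0 * on_time / len(alerts)


def tiers_miscalibrated(alerts: List[Alert], threshold: float = 80.0) -> bool:
    """True when compliance falls below the threshold (80% in the text above)."""
    return sla_compliance(alerts) < threshold
```

Feeding this from a weekly export of alert timestamps is one way to make the review concrete: a low number points at tiers that need re-tuning, not at a team that needs to work harder.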
Response Runbook
For each alert family, document:
# Runbook: Shell in Container Detection
## When fired
A shell process (bash, sh, zsh) spawned inside a production container.
## Step 1: Verify
- Check the alert details: which pod, which image, what command?
- Was this a planned debug session? (check on-call schedule, recent PagerDuty)
- Was a deploy in flight at that time? (check ArgoCD / CI logs)
## Step 2: Triage (15 minutes)
If unexplained:
- Capture state: `kubectl describe pod`, `kubectl logs`, `kubectl get events`
- Save Falco's full event context for the timeframe ±5 min
- Check related signals: network egress to unusual IPs from this pod?
- Check parent process: was it from a legitimate exec?
## Step 3: Contain
If suspicious:
- Isolate: scale to 0 OR network policy block egress
- Snapshot before deletion (PVC snapshot for forensics)
- Page security team
## Step 4: Investigate
- Pull all Falco events for this pod in last hour
- Inspect image: was a recent dependency updated?
- Check supply chain: signed? SBOM clean?
## Step 5: Eradicate / Recover
- Rebuild pod from clean image
- Rotate any secrets that touched the pod
- Document timeline
Pre-written runbooks make 3 AM triage manageable. A vague "investigate the alert" doesn't survive an actual incident.
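The state-capture part of Step 2 is easy to pre-script so nobody types it from memory mid-incident. The sketch below only builds the `kubectl` command lines for one pod and does not run them; the function name and the example namespace and pod are hypothetical.

```python
import shlex
from typing import List


def triage_commands(namespace: str, pod: str) -> List[List[str]]:
    """Build the Step 2 state-capture commands for one pod (nothing is executed)."""
    return [
        ["kubectl", "describe", "pod", pod, "-n", namespace],
        ["kubectl", "logs", pod, "-n", namespace, "--all-containers", "--timestamps"],
        # Limit events to those that reference this pod.
        ["kubectl", "get", "events", "-n", namespace,
         "--field-selector", f"involvedObject.name={pod}"],
    ]


if __name__ == "__main__":
    # Hypothetical namespace and pod name, for illustration only.
    for cmd in triage_commands("payments", "checkout-7d9f"):
        print(shlex.join(cmd))
```

Returning argument lists instead of shell strings means the same output can be passed to `subprocess.run` later without quoting issues, with the results written to an incident folder.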
Performance
eBPF is light, but not free. Watch:
- CPU: Falco/Tetragon should typically be <2% CPU per node
- Memory: ~100-500 MB
- Event drop rate: high event volume can exceed buffer; rules should be efficient
- Kernel hooks: many active rules = more kernel work
Tuning:
- Disable rules you don't need (especially heavyweight `evt.type=execve` rules in high-process-rate envs)
- Use kernel-side filters where possible (Tetragon, Cilium)
- Sample noisy events
- Use `priority` thresholds to reduce userspace processing
Test under load before production rollout. Sysdig/Aqua/Tetragon all publish performance benchmarks; review for your kernel version.
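One way to turn those figures into a load-test gate is a small budget check per node. This is a sketch under assumptions: the `NodeSample` fields are hypothetical and would be filled from whatever metrics your agent actually exports, and the budgets mirror the numbers above.

```python
from dataclasses import dataclass
from typing import List

# Budgets based on the figures in this section; adjust for your environment.
CPU_BUDGET_PERCENT = 2.0
MEMORY_BUDGET_MB = 500.0
DROP_RATE_BUDGET = 0.0


@dataclass
class NodeSample:
    """Hypothetical per-node measurement taken during a load test."""
    node: str
    cpu_percent: float    # agent CPU as a percentage of the node
    memory_mb: float      # agent resident memory
    events_total: int     # events generated, including dropped ones
    events_dropped: int


def drop_rate(sample: NodeSample) -> float:
    """Fraction of events dropped; 0.0 when no events were generated."""
    if sample.events_total == 0:
        return 0.0
    return sample.events_dropped / sample.events_total


def budget_violations(sample: NodeSample) -> List[str]:
    """Return a human-readable list of exceeded budgets for one node."""
    problems = []
    if sample.cpu_percent >= CPU_BUDGET_PERCENT:
        problems.append(f"{sample.node}: CPU {sample.cpu_percent:.1f}% over budget")
    if sample.memory_mb > MEMORY_BUDGET_MB:
        problems.append(f"{sample.node}: memory {sample.memory_mb:.0f} MB over budget")
    if drop_rate(sample) > DROP_RATE_BUDGET:
        problems.append(f"{sample.node}: drop rate {drop_rate(sample):.2%}")
    return problems
```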
Compliance Mapping
Document which controls each rule satisfies:
| Control | Rule(s) | Severity |
|---|---|---|
| PCI DSS 5.3.4 (anti-malware on systems handling card data) | crypto-mining detection, suspicious binaries | WARNING |
| PCI DSS 11.5.1 (file integrity monitoring) | sensitive file write, binary modification | WARNING |
| SOC 2 CC7.2 (monitoring of operations) | broad event coverage; SIEM ingestion | INFO+ |
| NIST 800-53 SI-4 (information system monitoring) | network anomaly, exec patterns | NOTICE+ |
| ISO 27001 A.12.4.1 (event logging) | all events to retained log | INFO+ |
A compliance audit asks: "show how you detect unauthorized access." Your answer is: "here are the Falco rules; here's the SIEM showing events; here's the response timeline for sample incidents."
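Keeping the mapping as data makes it checkable. The sketch below, with hypothetical rule identifiers standing in for the rule families in the table, reports controls that no enabled rule currently covers.

```python
from typing import Dict, List, Set

# Hypothetical rule identifiers; the control names follow the table above.
CONTROL_RULES: Dict[str, List[str]] = {
    "PCI DSS 5.3.4": ["crypto_mining_detection", "suspicious_binary"],
    "PCI DSS 11.5.1": ["sensitive_file_write", "binary_modification"],
    "NIST 800-53 SI-4": ["network_anomaly", "exec_pattern"],
}


def controls_without_coverage(
    control_rules: Dict[str, List[str]], enabled_rules: Set[str]
) -> List[str]:
    """Controls for which none of the mapped rules is currently enabled."""
    return sorted(
        control
        for control, rules in control_rules.items()
        if not any(rule in enabled_rules for rule in rules)
    )
```

Running a check like this in CI against the deployed ruleset would flag the case where a noisy rule was disabled and quietly took a control's only evidence with it.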
Long-Term Retention
Runtime events often outlive their immediate use:
- Incident reconstruction: forensic timeline weeks/months after the event
- Compliance: 6-12+ months typical
- Pattern discovery: trends only visible over time
Tier the storage:
- Hot (last 7-30 days): SIEM hot index for fast queries
- Warm (30-180 days): cheaper SIEM tier or OpenSearch with ILM
- Cold (180 days+): S3 + Glacier; queryable via Athena when needed (objects in archive storage classes generally have to be restored first)
Encrypt at rest. Access control: only security team queries raw events; engineers see aggregated alerts.
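The tiering rule itself is simple enough to state in code. This sketch assumes the upper end of the ranges above as cutoffs (30 days for hot, 180 days for warm); the function name is illustrative.

```python
from datetime import datetime, timedelta, timezone

# Cutoffs assumed from the upper end of the ranges above.
HOT_MAX_AGE = timedelta(days=30)
WARM_MAX_AGE = timedelta(days=180)


def storage_tier(event_time: datetime, now: datetime) -> str:
    """Pick the storage tier for an event based on its age."""
    age = now - event_time
    if age < HOT_MAX_AGE:
        return "hot"
    if age < WARM_MAX_AGE:
        return "warm"
    return "cold"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(storage_tier(now - timedelta(days=90), now))
```

In practice the transitions are usually handled by the storage system's own lifecycle policies; the function is mainly useful for deciding where a query for a given time range has to look.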
Multi-Cluster and Multi-Cloud
Scale considerations:
- Centralized aggregation: one SIEM ingests from many clusters/clouds
- Consistent ruleset version: rule changes deployed via GitOps to all clusters
- Per-cluster overlays: dev relaxed, prod strict, regulated environments strictest
- Identity-aware correlation: same user across clusters? Cross-account API calls?
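The base-plus-overlay idea can be sketched as a dictionary merge. Rule names, priorities, and environment names here are hypothetical; a real setup would express the same thing in rule files managed through GitOps.

```python
from typing import Dict

# Hypothetical shared ruleset: rule name -> alert priority.
BASE_RULES: Dict[str, str] = {
    "shell_in_container": "WARNING",
    "sensitive_file_write": "WARNING",
    "crypto_mining": "CRITICAL",
}

# Per-environment overrides: dev relaxed, prod strict, regulated strictest.
OVERLAYS: Dict[str, Dict[str, str]] = {
    "dev": {"shell_in_container": "NOTICE"},
    "prod": {"shell_in_container": "CRITICAL"},
    "regulated": {
        "shell_in_container": "CRITICAL",
        "sensitive_file_write": "CRITICAL",
    },
}


def effective_rules(env: str) -> Dict[str, str]:
    """Merge the base ruleset with one environment's overlay."""
    merged = dict(BASE_RULES)
    merged.update(OVERLAYS.get(env, {}))
    return merged
```

The overlays change priority and never remove a rule, so every cluster still runs the same ruleset version and only the response differs.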
Architecture pattern:
Cluster A ┐
Cluster B ├→ Falco/Tetragon → Kafka topic per env → SIEM
Cluster C ┘                                       → Long-term S3
Endpoint Coverage Beyond Containers
Runtime security catches container behavior. Don't forget:
- VMs and bare metal: same eBPF tools (Falco, Tetragon, Tracee) on VMs
- Serverless / Lambda: limited visibility; use Lambda extensions, CloudTrail patterns
- Edge nodes: lightweight agents (for example Falco on the node, with Falcosidekick forwarding its events)
- CI runners: GitHub Actions runners are a common attack target; instrument them too
Comprehensive coverage requires multiple tools. Don't assume containers = your whole attack surface.
Privilege and Authentication
Runtime tools see a lot. Lock down access:
- The Falco DaemonSet has privileged container access to install eBPF probes. Audit who can read its logs / metrics.
- The SIEM with runtime events contains sensitive data (process names, file paths, internal IPs). Tight RBAC.
- Rule changes are powerful: a malicious rule could blind detection. GitOps + code review.
- Operators of runtime security are a high-trust role; separate duty from those they monitor.
Cost Considerations
Runtime security costs:
- Open source tools: zero license cost, but operating them takes real engineering time
- Commercial (Sysdig, Aqua, Wiz): per-node or per-workload pricing; can be substantial at scale
- SIEM ingestion: runtime events are voluminous; can blow up SIEM cost
- Storage retention: long-term archive is cheap, but only if tiered
Optimization:
- Sample low-priority events before SIEM ingest
- Aggregate repetitive events (10k of the same alert = one alert with count)
- Drop noisy rules that don't yield investigations
- Per-environment retention: production gets long, dev gets short
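Aggregation before ingest can be as small as a counter keyed on the fields that define "the same alert". The sketch assumes events are dictionaries with `rule` and `pod` keys, which is an illustrative shape and not a fixed schema.

```python
from collections import Counter
from typing import Dict, Iterable, List


def aggregate(events: Iterable[Dict[str, str]]) -> List[Dict[str, object]]:
    """Collapse repeated (rule, pod) events into one record with a count."""
    counts = Counter((e["rule"], e["pod"]) for e in events)
    # most_common() orders by count, highest first.
    return [
        {"rule": rule, "pod": pod, "count": n}
        for (rule, pod), n in counts.most_common()
    ]


if __name__ == "__main__":
    sample = [{"rule": "shell_in_container", "pod": "web-1"}] * 3
    print(aggregate(sample))
```

A real pipeline would do this per time window and keep first and last timestamps, so the single record still supports a forensic timeline.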
Connect to FinOps for security cost attribution.
Common Pitfalls
Out-of-the-box rules in production. They fire too often, and the team stops paying attention. Tune for two weeks before declaring "monitored".
No SIEM. Alerts go to stdout; no one queries them. Always route to a searchable system.
Prevention without testing. An untested Tetragon `Sigkill` action in production can break legitimate workloads. Run detection-only first, collect weeks of clean data, then enable prevention.
Single-rule blindness. A rule fires often; you disable it. Now you can't detect what it was for. Carve out exceptions, don't disable wholesale.
No runbooks. Alert fires; on-call doesn't know what to do. Pre-written runbook per alert family.
Compliance theater. Falco installed, no one tunes or watches, but the audit checkbox says "yes." The compliance value is in the response, not the install.
Forgetting workload changes. A new microservice goes live; rules tuned for the old fleet aren't tuned for it. Tuning is continuous.
Detection-only forever. After years of clean detection, never moving to prevention on known-bad. Earn prevention through trust — but earn it.
Single-vendor lock-in. Runtime security tied to a SaaS that changes pricing. Build on OSS where possible (Falco, Tetragon).
Adoption Pattern
A working adoption sequence:
- Pick a tool: Falco for breadth, Tetragon if you're on Cilium
- Deploy to non-prod: capture baseline, tune false positives
- Wire to SIEM and Slack: alert routing live
- Write 5 runbooks: top alert families
- Deploy to prod, detection only: weeks of observation, more tuning
- Start triaging real alerts: build the on-call muscle
- Add custom rules: specific to your environment
- Enable prevention on selected high-value rules: only after high confidence
- Compliance evidence: map to controls, present to auditors
Take months, not weeks. Each layer earns the next.
Checklist
Runtime security production readiness:
- Falco or Tetragon deployed on all production clusters
- Rules tuned for environment (low false positive rate)
- Events routed to SIEM with severity-based fan-out
- Slack / PagerDuty / Opsgenie integration tested
- Runbooks for top 5-10 alert families
- Long-term event retention (6-12+ months) configured
- Per-cluster rule policies documented
- Compliance mapping (rule ↔ control)
- Performance monitored (CPU < 2%, no event drops)
- Multi-cluster aggregation if applicable
- Rule changes deployed via GitOps; reviewed via PR
- Operator access to runtime data tightly controlled
- Cost monitored; sampling/aggregation if needed
- Quarterly review of alert volumes and quality
- Coverage extends beyond K8s (VMs, CI, edge as appropriate)
What's Next
You have a runtime security practice. Connect it to:
- Supply Chain Security — build-time + runtime cover the lifecycle
- Policy as Code — admission gates + runtime detection
- Service Mesh — mTLS + L7 controls complement runtime
- Observability Pipelines — security events flow through the same plumbing
- Disaster Recovery — runtime forensics inform incident response
- Chaos Engineering — security chaos drills validate detection works