Best Practices

Alert fatigue management, response runbooks, compliance, performance, common pitfalls, scaling

This page covers the operational realities of running runtime security that catches real incidents without exhausting the team.

Alert Fatigue Is the Failure Mode

A loud Falco that no one reads is worse than no Falco, because it gives false comfort. The discipline that prevents this:

Every alert tier has an owner and a response time
PagerDuty pages are reserved for CRITICAL, and CRITICAL must be tuned until every page is genuinely actionable
Weekly review of "what's noisy": for each noisy rule, disable it, add an exception, or escalate it for investigation
Track MTTR for alerts: if NOTICE alerts sit unread for weeks, nobody is treating them as alerts
Suppress on known maintenance: an engineer running kubectl exec during a planned debug session shouldn't page
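
The paging and suppression rules above can be sketched as a small routing check. This is only an illustration: the `MaintenanceWindow` type, the `should_page` function, and the alert fields (`priority`, `namespace`, `time`) are hypothetical names, not part of Falco or PagerDuty.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class MaintenanceWindow:
    """A declared debug/maintenance session (hypothetical structure)."""
    namespace: str
    start: datetime
    end: datetime


def should_page(alert: dict, windows: list) -> bool:
    """Page only for CRITICAL alerts that fall outside every declared window."""
    if alert["priority"] != "CRITICAL":
        return False  # lower tiers go to a queue or chat channel, not a pager
    return not any(
        w.namespace == alert["namespace"] and w.start <= alert["time"] <= w.end
        for w in windows
    )


# Usage: a planned debug session in the "payments" namespace from 10:00 to 11:00 UTC.
window = MaintenanceWindow(
    "payments",
    datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc),
    datetime(2024, 1, 1, 11, 0, tzinfo=timezone.utc),
)
alert = {
    "priority": "CRITICAL",
    "namespace": "payments",
    "time": datetime(2024, 1, 1, 10, 30, tzinfo=timezone.utc),
}
print(should_page(alert, [window]))  # False: inside the declared window
```

Suppressed alerts should still be recorded for later review; the sketch only decides whether a human is woken up.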

The metric: % of alerts investigated within their tier's response time. Below 80% means the tiers are mis-calibrated.
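
As a rough sketch, the metric can be computed from a list of alerts and how long each took to be investigated. The response targets below are placeholder values chosen for illustration, and `pct_investigated_in_time` is a hypothetical helper, not something a SIEM provides under that name.

```python
from datetime import timedelta

# Hypothetical per-tier response targets; substitute your own.
RESPONSE_TARGETS = {
    "CRITICAL": timedelta(minutes=15),
    "WARNING": timedelta(hours=4),
    "NOTICE": timedelta(days=3),
}


def pct_investigated_in_time(alerts):
    """Return the % of alerts investigated within their tier's target.

    `alerts` is an iterable of (priority, time_to_investigate) pairs, where
    time_to_investigate is a timedelta, or None if nobody looked at the alert.
    Returns None when there are no alerts to measure.
    """
    total = 0
    on_time = 0
    for priority, time_to_investigate in alerts:
        total += 1
        target = RESPONSE_TARGETS[priority]
        if time_to_investigate is not None and time_to_investigate <= target:
            on_time += 1
    if total == 0:
        return None
    return 100.0 * on_time / total


sample = [
    ("CRITICAL", timedelta(minutes=5)),  # on time
    ("CRITICAL", timedelta(hours=1)),    # late
    ("WARNING", timedelta(hours=1)),     # on time
    ("NOTICE", None),                    # never investigated
]
print(pct_investigated_in_time(sample))  # 50.0
```

Alerts that were never investigated count against the metric, which is the point: unread NOTICE alerts pull the number down until the tier is recalibrated.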

Response Runbook

For each alert family, document:

# Runbook: Shell in Container Detection

## When fired
A shell process (bash, sh, zsh) spawned inside a production container.

## Step 1: Verify
- Check the alert details: which pod, which image, what command?
- Was this a planned debug session? (check on-call schedule, recent PagerDuty)
- Was a deploy in flight at that time? (check ArgoCD / CI logs)

## Step 2: Triage (15 minutes)
If unexplained:
- Capture state: `kubectl describe pod`, `kubectl logs`, `kubectl get events`
- Save Falco's full event context for the timeframe ±5 min
- Check related signals: network egress to unusual IPs from this pod?
- Check parent process: was it from a legitimate exec?

## Step 3: Contain
If suspicious:
- Isolate: apply a NetworkPolicy that blocks egress, OR scale to 0 (scaling to 0 destroys in-memory state, so capture evidence first)
- Snapshot before deletion (PVC snapshot for forensics)
- Page security team

## Step 4: Investigate
- Pull all Falco events for this pod in last hour
- Inspect image: was a recent dependency updated?
- Check supply chain: signed? SBOM clean?

## Step 5: Eradicate / Recover
- Rebuild pod from clean image
- Rotate any secrets that touched the pod
- Document timeline

Pre-written runbooks make 3 AM triage manageable. A vague "investigate the alert" doesn't survive an actual incident.

Performance

eBPF is light, but not free. Watch:

CPU: Falco/Tetragon should typically stay under 2% CPU per node
Memory: ~100-500 MB
Event drop rate: high event volume can overflow the event buffer, causing drops; keep rules efficient
Kernel hooks: more active rules mean more kernel work
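
One way to watch the drop rate is to compare two snapshots of cumulative counters and alert when the fraction dropped in the interval is too high. The sketch below assumes each snapshot is a dict with `events` and `drops` keys; those key names, and the 1% threshold, are illustrative assumptions rather than the field names or defaults of any particular tool.

```python
def drop_rate(prev: dict, curr: dict) -> float:
    """Fraction of events dropped between two cumulative counter snapshots.

    Assumes 'events' counts everything the probe saw (including dropped
    events) and 'drops' counts those that never reached userspace.
    """
    delta_events = curr["events"] - prev["events"]
    delta_drops = curr["drops"] - prev["drops"]
    if delta_events <= 0:
        return 0.0  # no traffic in the interval (or a counter reset)
    return delta_drops / delta_events


DROP_RATE_THRESHOLD = 0.01  # hypothetical: flag anything above 1%

previous = {"events": 1000, "drops": 0}
current = {"events": 3000, "drops": 100}
rate = drop_rate(previous, current)
if rate > DROP_RATE_THRESHOLD:
    print(f"drop rate {rate:.1%} exceeds threshold; tune rules or buffers")
```

A sustained non-zero drop rate means detections may be missing events, so it is worth alerting on in its own right.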

Tuning:

Disable rules you don't need (especially heavyweight evt.type=execve rules in high-process-rate envs)
Use kernel-side filters where possible (Tetragon, Cilium)
Sample noisy events
Use priority thresholds to reduce userspace processing

Test under load before production rollout. Vendors and projects such as Sysdig, Aqua, and Tetragon publish performance guidance; check how it applies to your kernel version and workload.

Compliance Mapping

Document which controls each rule satisfies:

Control	Rule(s)	Severity
PCI DSS v4.0 5.2.1 (anti-malware on systems handling card data)	crypto-mining detection, suspicious binaries	WARNING
PCI DSS v4.0 11.5.2 (change detection / file integrity monitoring)	sensitive file write, binary modification	WARNING
SOC 2 CC7.2 (monitoring of operations)	broad event coverage; SIEM ingestion	INFO+
NIST 800-53 SI-4 (information system monitoring)	network anomaly, exec patterns	NOTICE+
ISO 27001 A.12.4.1 (event logging)	all events to retained log	INFO+

A compliance audit asks: "show how you detect unauthorized access." Your answer is: "here are the Falco rules; here's the SIEM showing events; here's the response timeline for sample incidents."

Long-Term Retention

Runtime events often outlive their immediate use:

Incident reconstruction: forensic timeline weeks/months after the event
Compliance: 6-12+ months typical
Pattern discovery: trends only visible over time

Tier the storage:

Hot (last 7-30 days): SIEM hot index for fast queries
Warm (30-180 days): cheaper SIEM tier or OpenSearch with ILM
Cold (180 days+): S3 with Glacier storage classes; queryable via Athena when needed (archived objects may need to be restored first)
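
The tiering decision itself is simple enough to sketch. The function below maps an event's age to a tier name using the 30-day and 180-day boundaries from the list above; `storage_tier` is a hypothetical helper, and in practice this logic usually lives in index lifecycle or bucket lifecycle policies rather than application code.

```python
from datetime import datetime, timedelta, timezone


def storage_tier(event_time: datetime, now: datetime,
                 hot_days: int = 30, warm_days: int = 180) -> str:
    """Return 'hot', 'warm', or 'cold' based on the event's age."""
    age = now - event_time
    if age <= timedelta(days=hot_days):
        return "hot"    # SIEM hot index, fast queries
    if age <= timedelta(days=warm_days):
        return "warm"   # cheaper tier
    return "cold"       # object storage archive


now = datetime(2024, 7, 1, tzinfo=timezone.utc)
print(storage_tier(datetime(2024, 6, 20, tzinfo=timezone.utc), now))  # hot
print(storage_tier(datetime(2024, 3, 1, tzinfo=timezone.utc), now))   # warm
print(storage_tier(datetime(2023, 7, 1, tzinfo=timezone.utc), now))   # cold
```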

Encrypt at rest. Access control: only security team queries raw events; engineers see aggregated alerts.

Multi-Cluster and Multi-Cloud

Scale considerations:

Centralized aggregation: one SIEM ingests from many clusters/clouds
Consistent ruleset version: rule changes deployed via GitOps to all clusters
Per-cluster overlays: dev relaxed, prod strict, regulated environments strictest
Identity-aware correlation: same user across clusters? Cross-account API calls?
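
A minimal sketch of the base-plus-overlay idea: one shared ruleset, with each environment overriding only the priorities it needs to change. The rule names, priorities, and the `effective_rules` helper are all hypothetical; real deployments would typically express the same thing as rule files layered by a GitOps tool.

```python
# Shared baseline, deployed identically to every cluster (hypothetical rules).
BASE_RULES = {
    "shell_in_container": "WARNING",
    "sensitive_file_write": "WARNING",
    "crypto_mining": "CRITICAL",
}

# Per-environment overrides: dev relaxed, prod strict, regulated strictest.
OVERLAYS = {
    "dev": {"shell_in_container": "NOTICE"},
    "prod": {"shell_in_container": "CRITICAL"},
    "regulated": {
        "shell_in_container": "CRITICAL",
        "sensitive_file_write": "CRITICAL",
    },
}


def effective_rules(env: str) -> dict:
    """Merge the baseline with the overlay for `env`; overlay values win."""
    return {**BASE_RULES, **OVERLAYS.get(env, {})}


print(effective_rules("dev")["shell_in_container"])   # NOTICE
print(effective_rules("prod")["shell_in_container"])  # CRITICAL
```

Keeping overlays small makes drift visible in code review: anything not listed in an overlay is guaranteed to match the baseline.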

Architecture pattern:

Cluster A ┐
Cluster B ├→ Falco/Tetragon → Kafka topic per env → SIEM
Cluster C ┘                                       → Long-term S3

Endpoint Coverage Beyond Containers

Runtime security catches container behavior. Don't forget:

VMs and bare metal: same eBPF tools (Falco, Tetragon, Tracee) on VMs
Serverless / Lambda: limited visibility; use Lambda extensions, CloudTrail patterns
Edge nodes: lightweight agents (for example Falco on the node, with Falcosidekick forwarding its events)
CI runners: GitHub Actions runners are a common attack target; instrument them too

Comprehensive coverage requires multiple tools. Don't assume containers = your whole attack surface.

Privilege and Authentication

Runtime tools see a lot. Lock down access:

The Falco DaemonSet has privileged container access to install eBPF probes. Audit who can read its logs / metrics.
The SIEM with runtime events contains sensitive data (process names, file paths, internal IPs). Tight RBAC.
Rule changes are powerful: a malicious rule could blind detection. GitOps + code review.
Operators of runtime security are a high-trust role; separate duty from those they monitor.

Cost Considerations

Runtime security costs:

Open source tools: no license fee, but operating them costs real engineering time
Commercial (Sysdig, Aqua, Wiz): per-node or per-workload pricing; can be substantial at scale
SIEM ingestion: runtime events are voluminous; can blow up SIEM cost
Storage retention: long-term archive is cheap, but only if tiered

Optimization:

Sample low-priority events before SIEM ingest
Aggregate repetitive events (10k of the same alert = one alert with count)
Drop noisy rules that don't yield investigations
Per-environment retention: production gets long, dev gets short
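
Aggregation is the easiest of these to illustrate. The sketch below collapses repeated events into one record per (rule, pod) pair with a count; the event field names `rule` and `pod` and the `aggregate` helper are assumptions made for the example.

```python
from collections import Counter


def aggregate(events):
    """Collapse repeated events into one record per (rule, pod) with a count."""
    counts = Counter((e["rule"], e["pod"]) for e in events)
    return [
        {"rule": rule, "pod": pod, "count": n}
        for (rule, pod), n in counts.items()
    ]


events = (
    [{"rule": "shell_in_container", "pod": "api-1"}] * 3
    + [{"rule": "sensitive_file_write", "pod": "api-1"}]
)
for record in aggregate(events):
    print(record)
# {'rule': 'shell_in_container', 'pod': 'api-1', 'count': 3}
# {'rule': 'sensitive_file_write', 'pod': 'api-1', 'count': 1}
```

Aggregating over a fixed time window before SIEM ingest keeps the signal (what fired, where, how often) while cutting the volume that drives ingestion cost.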

Connect to FinOps for security cost attribution.

Common Pitfalls

Out-of-the-box rules in production. They fire too often; team stops paying attention. Tune for two weeks before declaring "monitored".

No SIEM. Alerts go to stdout; no one queries them. Always route to a searchable system.

Prevention without testing. Tetragon Sigkill in production breaks legitimate workloads. Run detection-only first, and enable prevention only after weeks of clean data.

Single-rule blindness. A rule fires often; you disable it. Now you can't detect what it was for. Carve out exceptions, don't disable wholesale.

No runbooks. Alert fires; on-call doesn't know what to do. Pre-written runbook per alert family.

Compliance theater. Falco installed, no one tunes or watches, but the audit checkbox says "yes." The compliance value is in the response, not the install.

Forgetting workload changes. A new microservice goes live; rules tuned for the old fleet aren't tuned for it. Tuning is continuous.

Detection-only forever. After years of clean detection, never moving to prevention on known-bad. Earn prevention through trust — but earn it.

Single-vendor lock-in. Runtime security tied to a SaaS that changes pricing. Build on OSS where possible (Falco, Tetragon).

Adoption Pattern

A working adoption sequence:

Pick a tool: Falco for breadth, Tetragon if you're on Cilium
Deploy to non-prod: capture baseline, tune false positives
Wire to SIEM and Slack: alert routing live
Write 5 runbooks: top alert families
Deploy to prod, detection only: weeks of observation, more tuning
Start triaging real alerts: build the on-call muscle
Add custom rules: specific to your environment
Enable prevention on selected high-value rules: only after high confidence
Compliance evidence: map to controls, present to auditors

Expect this to take months, not weeks. Each layer earns the next.

What's Next

You have a runtime security practice. Connect it to:

Supply Chain Security — build-time + runtime cover the lifecycle
Policy as Code — admission gates + runtime detection
Service Mesh — mTLS + L7 controls complement runtime
Observability Pipelines — security events flow through the same plumbing
Disaster Recovery — runtime forensics inform incident response
Chaos Engineering — security chaos drills validate detection works
