Best Practices
Tagging policy, FinOps team structure, engineering incentives, common pitfalls, governance
The tools and patterns are easy. The hard part is organizational: getting engineers, finance, and product to share a vocabulary and a habit.
Tagging Is Everything
Tagging is the foundation. If tags don't work, nothing works. Discipline:
- Tag at creation, not after. Enforce via Terraform module, OPA / Cloud Custodian / SCP — not "we'll add tags later" (you won't).
- Keep the mandatory tag list short: 5-7 tags. Longer lists don't get enforced.
- Reject untagged resources. A Service Control Policy or a Terraform pre-commit check blocks untagged creation.
- Audit weekly. A report of untagged resources, owners auto-pinged.
- Don't change tag semantics. Once `Team=growth` means something, never repurpose it. Add a new tag if semantics shift.
A tagged resource is cost-attributable; an untagged one is everyone's and no one's.
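The "reject untagged" and "audit weekly" rules above can be sketched as a small check. This is a minimal illustration, not a specific tool's API: the resource dictionaries, the `missing_tags` and `audit` helpers, and the sample inventory are all hypothetical, and the required tag names are borrowed from the checklist later in this page.

```python
# Mandatory tag keys (assumed; align with your own tagging policy).
REQUIRED_TAGS = ("Env", "Team", "Service", "CostCenter", "Owner")


def missing_tags(tags, required=REQUIRED_TAGS):
    """Return the required tag keys that are absent or blank, in declared order."""
    gaps = []
    for key in required:
        value = tags.get(key)
        if value is None or not str(value).strip():
            gaps.append(key)
    return gaps


def audit(resources, required=REQUIRED_TAGS):
    """Map resource id -> missing tag keys, for resources that fail the policy."""
    report = {}
    for resource in resources:
        gaps = missing_tags(resource.get("tags", {}), required)
        if gaps:
            report[resource["id"]] = gaps
    return report


if __name__ == "__main__":
    # Hypothetical inventory; in practice this would come from your cloud API or IaC plan.
    inventory = [
        {"id": "i-001", "tags": {"Env": "prod", "Team": "growth", "Service": "api",
                                 "CostCenter": "cc-12", "Owner": "alice"}},
        {"id": "i-002", "tags": {"Env": "dev", "Team": ""}},
    ]
    for resource_id, gaps in audit(inventory).items():
        print(f"{resource_id} is missing: {', '.join(gaps)}")
```

The same function can back both uses: fail a pre-commit hook when `missing_tags` returns anything, and feed `audit` output into the weekly report.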
FinOps Team Structure
There's no one right structure. Patterns by org size:
| Stage | Structure |
|---|---|
| Small (< $500k/yr cloud) | One platform/SRE engineer, ~20% time |
| Medium ($500k–$10M/yr) | One dedicated FinOps practitioner (often hybrid platform-finance) |
| Large (>$10M/yr) | A FinOps team (3-7), reporting to platform or CFO; embedded champions per team |
| Enterprise (>$100M/yr) | A FinOps function with VP, dedicated tooling team, partnerships with finance |
The FinOps practitioner's job is enablement, not policing. They build the dashboards, write the policies, run quarterly reviews, train engineers — they don't try to single-handedly cut every cost.
Engineering Incentives
Engineers don't optimize cost unless they see it. Approaches:
- Make cost a visible metric in service dashboards. Alongside latency and error rate.
- Add cost to engineering reviews. A new service launch includes projected cost; a quarterly review includes actual.
- Showback to teams monthly. A Slack message: "this month your team's cost was $X, +/- Y% vs last month, here are the biggest items."
- Cost in SLO conversations. "We could cut latency from 50ms to 30ms but it'd 4x cost — is that worth it?"
- Reward cost wins. Recognition in eng all-hands, just like reliability wins.
Don't:
- Punish for cost without context. A team running an experimental ML workload should expect higher spend.
- Make cost a primary OKR. It distorts incentives toward cost-cutting over value.
- Charge teams without giving them control. If a team can't pick their instance type, don't bill them for it.
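The monthly showback message mentioned in the approaches above is easy to generate once costs are attributed per team. The sketch below only formats the text; the function name, its parameters, and the sample figures are hypothetical, and delivery to Slack is left out.

```python
def showback_message(team, current, previous, top_items, limit=3):
    """Format a monthly showback summary for one team.

    current / previous are monthly totals in dollars; top_items maps a
    line item (service, account, ...) to its cost this month.
    """
    if previous > 0:
        change = (current - previous) / previous * 100
        trend = f"{change:+.1f}% vs last month"
    else:
        trend = "no prior month to compare"

    lines = [f"{team}: this month's cloud cost was ${current:,.0f} ({trend})."]
    lines.append("Biggest items:")
    ranked = sorted(top_items.items(), key=lambda item: item[1], reverse=True)
    for name, cost in ranked[:limit]:
        lines.append(f"- {name}: ${cost:,.0f}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(showback_message("growth", 12000, 10000,
                           {"EC2": 7000, "S3": 2000, "RDS": 3000}))
```

Keeping the message to a total, a trend, and the top few items matches the intent above: visibility with context, not a full invoice.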
Governance
| Policy | Mechanism |
|---|---|
| Require tags on creation | SCP / IAM condition / Terraform validation |
| Block expensive instance types in dev accounts | IAM condition limiting ec2:RunInstances |
| Auto-stop dev/staging at night | EventBridge schedule + Lambda |
| Auto-delete old snapshots | Lambda on a schedule |
| Block public S3 buckets (security too) | Account-level Block Public Access |
| Cap per-account spend | AWS Budgets + auto-shutdown via Lambda (extreme) |
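The first two rows of the table can be expressed as a Service Control Policy. The sketch below builds one as a Python dictionary so it can be reviewed and serialized to JSON. It relies on the `aws:RequestTag/<key>` and `ec2:InstanceType` condition keys; the helper name, the tag list, and the allowed instance patterns are assumptions for illustration, and a policy like this should be tried in a sandbox account before it is attached anywhere that matters.

```python
import json


def dev_guardrail_scp(required_tags, allowed_instance_types):
    """Build an SCP that denies untagged or oversized EC2 launches.

    required_tags: tag keys that must be present in the launch request.
    allowed_instance_types: patterns such as "t3.*" (hypothetical choice).
    """
    instance_arn = "arn:aws:ec2:*:*:instance/*"
    statements = []
    for tag in required_tags:
        statements.append({
            "Sid": f"DenyRunInstancesWithout{tag}",
            "Effect": "Deny",
            "Action": "ec2:RunInstances",
            "Resource": instance_arn,
            # Null == "true" means the tag key is missing from the request.
            "Condition": {"Null": {f"aws:RequestTag/{tag}": "true"}},
        })
    statements.append({
        "Sid": "DenyInstanceTypesOutsideAllowList",
        "Effect": "Deny",
        "Action": "ec2:RunInstances",
        "Resource": instance_arn,
        # Deny any launch whose type matches none of the allowed patterns.
        "Condition": {"StringNotLike": {"ec2:InstanceType": list(allowed_instance_types)}},
    })
    return {"Version": "2012-10-17", "Statement": statements}


if __name__ == "__main__":
    policy = dev_guardrail_scp(["Team", "Env"], ["t3.*", "t4g.*"])
    print(json.dumps(policy, indent=2))
```

One statement per tag keeps each denial reason readable in access-denied messages, at the cost of a longer policy document.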
A common pattern: dev accounts have hard policies; prod has soft alerts. Dev should be cheap by construction; prod should be visible and reviewable.
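Auto-stopping dev and staging is one concrete way to make dev "cheap by construction". The decision logic a scheduled function would run might look like the sketch below. The working hours, the `Env` tag values, and the instance dictionaries are assumptions; the actual stop call to the cloud API is deliberately left out.

```python
from datetime import datetime


def should_be_running(now, start_hour=8, stop_hour=20):
    """True on weekdays between start_hour (inclusive) and stop_hour (exclusive)."""
    is_weekday = now.weekday() < 5  # Monday=0 .. Friday=4
    return is_weekday and start_hour <= now.hour < stop_hour


def instances_to_stop(instances, now, environments=("dev", "staging")):
    """Return ids of running non-prod instances that should be off at `now`."""
    if should_be_running(now):
        return []
    return [
        instance["id"]
        for instance in instances
        if instance.get("state") == "running"
        and instance.get("tags", {}).get("Env") in environments
    ]


if __name__ == "__main__":
    # Hypothetical fleet; a scheduled job would fetch this from the cloud API.
    fleet = [
        {"id": "i-dev", "state": "running", "tags": {"Env": "dev"}},
        {"id": "i-prod", "state": "running", "tags": {"Env": "prod"}},
    ]
    print(instances_to_stop(fleet, datetime.now()))
```

Because the selection depends on the `Env` tag, this only works if the tagging policy is already enforced.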
Common Pitfalls
"Just lift and shift." Lifting on-prem workloads to EC2 without re-architecting often increases cost. Cloud rewards elastic, decoupled designs.
Optimizing during or right after migration. Don't try to optimize amid the chaos of migration. Land first, then optimize about 3 months later, once the baseline is stable.
Ignoring data transfer. Egress and cross-AZ traffic are silent bill drivers. Architect with awareness; don't discover at month-end.
Buying RIs without analysis. Vendors and consultants will push RIs without modeling your actual usage. Model it yourself first: rightsize with Compute Optimizer, then size the commitment against recommendations from a tool such as Cloudability.
Custom dashboards instead of vendor tools. Three engineers spend 6 months building a dashboard that a tool like Vantage gives you in an hour for around $1k/mo. Buy unless you're at a scale where custom is genuinely cheaper.
Treating Spot as free. Spot saves money if your workload handles eviction. Engineering time to make a workload Spot-compatible can outweigh savings for small workloads.
Confusing list price with actual cost. Most large customers have negotiated EDPs / committed-use discounts. Use your rate, not on-demand list, in optimization math.
Saving money on the wrong things. Cutting CI runner costs saves $10k/yr but slows every engineer by 10 min/day — net negative.
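The CI example is worth working through with numbers. The sketch below nets the infrastructure savings against lost engineering time; the headcount, loaded hourly rate, and working days per year are assumptions picked for illustration, not figures from any real team.

```python
def net_annual_impact(savings_per_year, engineers, minutes_lost_per_day,
                      loaded_hourly_rate, working_days=230):
    """Infra savings minus the cost of engineer time lost, per year (dollars)."""
    hours_lost = engineers * (minutes_lost_per_day / 60) * working_days
    productivity_cost = hours_lost * loaded_hourly_rate
    return savings_per_year - productivity_cost


def break_even_engineers(savings_per_year, minutes_lost_per_day,
                         loaded_hourly_rate, working_days=230):
    """Headcount at which the lost time exactly cancels the savings."""
    cost_per_engineer = (minutes_lost_per_day / 60) * working_days * loaded_hourly_rate
    return savings_per_year / cost_per_engineer


if __name__ == "__main__":
    # Assumed: 20 engineers at a $75/hour loaded rate, each losing 10 min/day.
    net = net_annual_impact(10_000, engineers=20, minutes_lost_per_day=10,
                            loaded_hourly_rate=75)
    print(f"Net impact: ${net:,.0f} per year")
    print(f"Break-even headcount: {break_even_engineers(10_000, 10, 75):.1f}")
```

Under these assumptions the lost time costs about $57,500 a year against $10,000 saved, and the change only pays off for a team of roughly three engineers or fewer.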
Multi-Cloud and FOCUS
If you're on more than one cloud:
- FOCUS spec gives you a unified billing schema. Use it.
- Vantage, Cloudability, CloudZero all support multi-cloud. Build one dashboard, not three.
- Negotiate centrally. Handle EDPs and committed-use discounts in one place; uncoordinated vendor-by-vendor negotiation leaves money on the table.
- Don't pretend portability is free. Avoiding lock-in often costs more than the lock-in would.
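Once every provider's billing data is in a FOCUS-shaped table, one query can serve all clouds. The sketch below groups cost by a tag across providers. The column names `ProviderName`, `ServiceName`, `BilledCost`, and `Tags` follow the FOCUS schema as I understand it; the sample rows and the `cost_by_tag` helper are hypothetical.

```python
from collections import defaultdict


def cost_by_tag(rows, tag_key, untagged_label="untagged"):
    """Sum BilledCost per value of one tag across all providers.

    rows: FOCUS-shaped dicts with at least BilledCost and Tags.
    Rows without the tag are grouped under untagged_label.
    """
    totals = defaultdict(float)
    for row in rows:
        tags = row.get("Tags") or {}
        totals[tags.get(tag_key, untagged_label)] += row["BilledCost"]
    return dict(totals)


if __name__ == "__main__":
    # Invented sample data, already normalized to FOCUS-style columns.
    rows = [
        {"ProviderName": "AWS", "ServiceName": "Amazon EC2",
         "BilledCost": 1200.0, "Tags": {"Team": "growth"}},
        {"ProviderName": "Google Cloud", "ServiceName": "Compute Engine",
         "BilledCost": 800.0, "Tags": {"Team": "growth"}},
        {"ProviderName": "AWS", "ServiceName": "Amazon S3",
         "BilledCost": 150.0, "Tags": {}},
    ]
    for team, total in sorted(cost_by_tag(rows, "Team").items()):
        print(f"{team}: ${total:,.2f}")
```

The explicit "untagged" bucket keeps unattributed spend visible in the same report instead of silently dropping it.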
Continuous Improvement
FinOps is a habit, not a project. Operating rhythm:
| Cadence | Activity |
|---|---|
| Daily | Anomaly alerts triaged |
| Weekly | Untagged resource report; quick-wins review |
| Monthly | Showback to teams; trend review; budget vs actual |
| Quarterly | Rightsizing pass on top-10 services; commitment review |
| Annually | EDP negotiation; full architecture cost review |
Checklist
FinOps practice readiness check:
- All cloud resources have at least 5 tags (Env, Team, Service, CostCenter, Owner)
- Cost & Usage Report (or equivalent) exported to S3/BigQuery, queryable in Athena/BigQuery
- At least one cost dashboard exists, shared monthly with engineering
- Anomaly detection alerts to Slack within 24 hours
- Kubecost / OpenCost running on K8s clusters (if applicable)
- Budgets set per team / per account with forecast alerts at 80%, 100%
- At least 60% of steady compute covered by Savings Plans / commitments
- Dev/staging auto-stop nights and weekends
- S3 lifecycle policies for cold data
- Infracost or similar cost feedback in IaC PRs
- Designated FinOps practitioner (or % of someone's time)
- Monthly cost review meeting on the calendar
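Two checklist items are numeric and easy to automate: forecast alerts at 80% and 100% of budget, and the 60% commitment coverage target. The sketch below shows the arithmetic only. The function names and inputs are hypothetical, and in practice the forecast and coverage figures would come from your billing tool.

```python
def budget_alerts(forecast, budget, thresholds=(0.8, 1.0)):
    """Return the budget fractions that the forecast spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    return [t for t in thresholds if forecast >= t * budget]


def commitment_coverage(committed_spend, steady_spend):
    """Fraction of steady-state compute spend covered by commitments."""
    if steady_spend <= 0:
        raise ValueError("steady_spend must be positive")
    return min(committed_spend / steady_spend, 1.0)


def meets_coverage_target(committed_spend, steady_spend, target=0.6):
    """True when coverage is at or above the target (60% in the checklist)."""
    return commitment_coverage(committed_spend, steady_spend) >= target


if __name__ == "__main__":
    # Assumed figures for one team in one month.
    print(budget_alerts(forecast=8_500, budget=10_000))
    print(meets_coverage_target(committed_spend=45, steady_spend=100))
```

Alerting on the forecast rather than on actual spend gives teams time to react before the month closes.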
What's Next
You have a FinOps practice. Connect it to:
- Monitoring — service-level metrics drive rightsizing decisions
- IaC — tagging, policies, and Infracost feedback live in Terraform
- CI/CD — cost diffs in PRs; auto-stop in pipelines
- Containerization — Karpenter, VPA, HPA all serve FinOps goals