Best Practices

Tagging policy, FinOps team structure, engineering incentives, common pitfalls, governance

The tools and patterns are easy. The hard part is organizational: getting engineers, finance, and product to share a vocabulary and a habit.

Tagging Is Everything

Tagging is the foundation of cost attribution. If tags are missing or inconsistent, every report built on them is unreliable. The discipline:

Tag at creation, not after. Enforce through Terraform modules, OPA, Cloud Custodian, or SCPs rather than promising to add tags later (you won't).
Keep the mandatory tag list short. Five to seven tags; longer lists tend not to get enforced.
Reject untagged resources. A Service Control Policy or a Terraform pre-commit check blocks untagged creation.
Audit weekly. Generate a report of untagged resources and automatically ping their owners.
Don't change tag semantics. Once Team=growth means something, never repurpose it. Add a new tag if semantics shift.

A tagged resource is cost-attributable; an untagged one is everyone's and no one's.
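The weekly audit described above can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the tag names in REQUIRED_TAGS and the Resource type are hypothetical, and a real audit would read resources from your cloud provider's inventory or tagging API.

```python
from dataclasses import dataclass, field

# Hypothetical mandatory tag set; the only guidance assumed here is
# "keep it to five to seven tags".
REQUIRED_TAGS = ("Team", "Service", "Environment", "CostCenter", "Owner")


@dataclass
class Resource:
    """Illustrative stand-in for a cloud resource and its tags."""
    resource_id: str
    tags: dict = field(default_factory=dict)


def missing_tags(resource, required=REQUIRED_TAGS):
    """Return the required tag keys that are absent or blank on the resource."""
    return [key for key in required if not resource.tags.get(key, "").strip()]


def untagged_report(resources, required=REQUIRED_TAGS):
    """Map resource id -> missing tag keys, listing only non-compliant resources."""
    report = {}
    for resource in resources:
        gaps = missing_tags(resource, required)
        if gaps:
            report[resource.resource_id] = gaps
    return report
```

Treating a blank value as missing is a design choice: a tag such as Owner set to an empty string passes a naive key check but is useless for attribution.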

FinOps Team Structure

There's no one right structure. Patterns by org size:

Stage	Structure
Small (< $500k/yr cloud)	One platform/SRE engineer, ~20% time
Medium ($500k–$10M/yr)	One dedicated FinOps practitioner (often hybrid platform-finance)
Large (>$10M/yr)	A FinOps team (3-7 people), reporting to platform or the CFO; embedded champions per team
Enterprise (>$100M/yr)	A FinOps function with VP, dedicated tooling team, partnerships with finance

The FinOps practitioner's job is enablement, not policing. They build the dashboards, write the policies, run quarterly reviews, train engineers — they don't try to single-handedly cut every cost.

Engineering Incentives

Engineers don't optimize cost unless they see it. Approaches:

Make cost a visible metric in service dashboards. Show it alongside latency and error rate.
Add cost to engineering reviews. A new service launch includes projected cost; a quarterly review includes actual.
Showback to teams monthly. A Slack message: "this month your team's cost was $X, +/- Y% vs last month, here are the biggest items."
Cost in SLO conversations. "We could cut latency from 50ms to 30ms but it'd 4x cost — is that worth it?"
Reward cost wins. Recognition in eng all-hands, just like reliability wins.
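The monthly showback message can be generated from two numbers and a cost breakdown. The sketch below assumes you already have per-team totals and per-item costs from your billing export; the function name and message wording are illustrative, and posting to Slack is left out.

```python
def showback_message(team, this_month, last_month, line_items, top_n=3):
    """Format a showback summary: total, month-over-month change, biggest items.

    line_items maps an item name (service, account, ...) to its cost this month.
    """
    if last_month > 0:
        change = (this_month - last_month) / last_month * 100
        trend = f"{change:+.1f}% vs last month"
    else:
        # Avoid dividing by zero for teams with no prior spend.
        trend = "no spend last month to compare against"

    biggest = sorted(line_items.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    items = ", ".join(f"{name} (${cost:,.0f})" for name, cost in biggest)
    return (
        f"{team}: this month's cost was ${this_month:,.0f}, {trend}. "
        f"Biggest items: {items}."
    )
```

For example, a team that spent $12,000 this month against $10,000 last month would see "+20.0% vs last month" followed by its three largest items.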

Don't:

Punish for cost without context. A team running an experimental ML workload should expect higher spend.
Make cost a primary OKR. It distorts incentives toward cost-cutting over value.
Charge teams without giving them control. If a team can't pick their instance type, don't bill them for it.

Governance

Policy	Mechanism
Require tags on creation	SCP / IAM condition / Terraform validation
Block expensive instance types in dev accounts	SCP or IAM deny on `ec2:RunInstances`, conditioned on `ec2:InstanceType`
Auto-stop dev/staging at night	EventBridge schedule + Lambda
Auto-delete old snapshots	Lambda on a schedule
Block public S3 buckets (security too)	Account-level Block Public Access
Cap per-account spend	AWS Budgets + auto-shutdown via Lambda (extreme)

A common pattern: dev accounts have hard policies; prod has soft alerts. Dev should be cheap by construction; prod should be visible and reviewable.
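As one example of a hard policy for dev accounts, the sketch below builds a deny statement that blocks `ec2:RunInstances` for any instance type outside an allowlist. The allowlist contents and the function name are assumptions for illustration; review the generated JSON against your own account structure before attaching it as an SCP or IAM policy.

```python
import json


def deny_unapproved_instance_types(allowed_types):
    """Build a policy document denying ec2:RunInstances unless the
    requested instance type is in the allowlist."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyNonApprovedInstanceTypes",
                "Effect": "Deny",
                "Action": "ec2:RunInstances",
                "Resource": "arn:aws:ec2:*:*:instance/*",
                "Condition": {
                    # Deny applies when the type matches none of these values.
                    "StringNotEquals": {"ec2:InstanceType": sorted(allowed_types)}
                },
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical dev allowlist: small burstable instances only.
    policy = deny_unapproved_instance_types({"t3.micro", "t3.small", "t3.medium"})
    print(json.dumps(policy, indent=2))
```

Generating the document from a list keeps the allowlist in one reviewable place, which suits the "cheap by construction" goal for dev.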

Common Pitfalls

"Just lift and shift." Lifting on-prem workloads to EC2 without re-architecting often increases cost. Cloud rewards elastic, decoupled designs.

Optimizing too early in a migration. Don't try to optimize during the chaos of migration. Land first, then optimize about 3 months later, once the baseline is stable.

Ignoring data transfer. Egress and cross-AZ traffic are silent bill drivers. Account for them at design time; don't discover them at month-end.

Buying RIs without analysis. Vendors and consultants may push Reserved Instances (RIs) without modeling your actual usage. Base purchases on your own usage data, using recommendations from Compute Optimizer or Cloudability as a starting point.

Custom dashboards instead of vendor tools. A typical failure: three engineers spend 6 months building a dashboard that Vantage gives you in an hour for $1k/mo. Buy unless you're at a scale where custom is genuinely cheaper.

Treating Spot as free. Spot saves money if your workload handles eviction. Engineering time to make a workload Spot-compatible can outweigh savings for small workloads.

Confusing list price with actual cost. Most large customers have negotiated Enterprise Discount Programs (EDPs) or committed-use discounts. Use your effective rate, not the on-demand list price, in optimization math.

Saving money on the wrong things. Cutting CI runner costs may save $10k/yr, but if it slows every engineer by 10 min/day, the lost engineering time usually costs more than the savings. That is a net negative.
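The CI runner trade-off is easy to check with arithmetic. The sketch below uses the $10k/yr saving and 10 min/day slowdown from the pitfall above; the headcount, loaded hourly cost, and number of working days are assumptions you should replace with your own figures.

```python
def net_annual_impact(savings_per_year, engineers, minutes_lost_per_day,
                      loaded_hourly_cost, working_days=230):
    """Infrastructure savings minus the cost of engineering time lost.

    A negative result means the 'saving' costs more than it returns.
    """
    lost_hours = engineers * minutes_lost_per_day / 60 * working_days
    productivity_cost = lost_hours * loaded_hourly_cost
    return savings_per_year - productivity_cost


if __name__ == "__main__":
    # Assumed: 20 engineers at a loaded cost of $100/hour, 230 working days.
    net = net_annual_impact(10_000, engineers=20, minutes_lost_per_day=10,
                            loaded_hourly_cost=100)
    print(f"Net annual impact: ${net:,.0f}")
```

Under these assumptions the lost time is worth roughly $76.7k a year, so the $10k saving leaves the organization about $66.7k worse off.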

Multi-Cloud and FOCUS

If you're on more than one cloud:

The FOCUS spec (FinOps Open Cost and Usage Specification) gives you a unified billing schema. Use it.
Vantage, Cloudability, CloudZero all support multi-cloud. Build one dashboard, not three.
Negotiate centrally. Handle EDPs and committed-use discounts through one team; piecemeal, vendor-by-vendor negotiation leaves money on the table.
Don't pretend portability is free. Avoiding lock-in often costs more than the lock-in would.
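Once billing data from each provider is in a FOCUS-shaped export, a single aggregation covers all clouds. The sketch below assumes rows are dictionaries carrying the FOCUS columns ProviderName, ServiceName, and EffectiveCost; verify the column names against the spec version your exports use, and treat the function name as illustrative.

```python
from collections import defaultdict
from decimal import Decimal


def cost_by_provider_and_service(rows):
    """Sum EffectiveCost per (ProviderName, ServiceName) pair.

    Decimal avoids floating-point drift when adding many small charges.
    """
    totals = defaultdict(Decimal)
    for row in rows:
        key = (row["ProviderName"], row["ServiceName"])
        totals[key] += Decimal(str(row["EffectiveCost"]))
    return dict(totals)
```

Because every provider's rows share one schema, the same function feeds one dashboard instead of three provider-specific ones.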

Continuous Improvement

FinOps is a habit, not a project. Operating rhythm:

Cadence	Activity
Daily	Anomaly alerts triaged
Weekly	Untagged resource report; quick-wins review
Monthly	Showback to teams; trend review; budget vs actual
Quarterly	Rightsizing pass on top-10 services; commitment review
Annually	EDP negotiation; full architecture cost review

What's Next

You have a FinOps practice. Connect it to:

Monitoring — service-level metrics drive rightsizing decisions
IaC — tagging, policies, and Infracost feedback live in Terraform
CI/CD — cost diffs in PRs; auto-stop in pipelines
Containerization — Karpenter, VPA, HPA all serve FinOps goals
