Steven's Knowledge

Best Practices

Tagging policy, FinOps team structure, engineering incentives, common pitfalls, governance

The tools and patterns are easy. The hard part is organizational: getting engineers, finance, and product to share a vocabulary and a habit.

Tagging Is Everything

Tagging is the foundation. If tags don't work, nothing works. The discipline that makes them work:

  • Tag at creation, not after. Enforce via Terraform module, OPA / Cloud Custodian / SCP — not "we'll add tags later" (you won't).
  • Keep the mandatory tag list short. 5-7 tags. Anything longer will not get enforced.
  • Reject untagged resources. A Service Control Policy or a Terraform pre-commit check blocks untagged creation.
  • Audit weekly. A report of untagged resources, owners auto-pinged.
  • Don't change tag semantics. Once Team=growth means something, never repurpose it. Add a new tag if semantics shift.

A tagged resource is cost-attributable; an untagged one is everyone's and no one's.
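The weekly audit can start as a very small check. The sketch below assumes resources have already been fetched into a plain dictionary of resource ID to tags (how you fetch them depends on your cloud and tooling). The required keys are the five the checklist on this page names, and the function names are illustrative only.

```python
# Required tag keys, taken from the checklist on this page.
REQUIRED_TAGS = ("Env", "Team", "Service", "CostCenter", "Owner")


def missing_tags(resource_tags, required=REQUIRED_TAGS):
    """Return the required keys that are absent or blank, in policy order."""
    gaps = []
    for key in required:
        value = resource_tags.get(key)
        if value is None or not str(value).strip():
            gaps.append(key)
    return gaps


def untagged_report(resources):
    """Map resource ID -> missing keys, only for resources that fail the policy."""
    report = {}
    for resource_id, tags in resources.items():
        gaps = missing_tags(tags)
        if gaps:
            report[resource_id] = gaps
    return report
```

The same `missing_tags` check could run in a pre-commit hook against planned resources, so the weekly report only catches what slipped past creation-time enforcement.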

FinOps Team Structure

There's no one right structure. Patterns by org size:

Stage | Structure
Small (< $500k/yr cloud) | One platform/SRE engineer, ~20% time
Medium ($500k–$10M/yr) | One dedicated FinOps practitioner (often hybrid platform-finance)
Large (> $10M/yr) | A FinOps team (3-7), reporting to platform or CFO; embedded champions per team
Enterprise (> $100M/yr) | A FinOps function with VP, dedicated tooling team, partnerships with finance

The FinOps practitioner's job is enablement, not policing. They build the dashboards, write the policies, run quarterly reviews, train engineers — they don't try to single-handedly cut every cost.

Engineering Incentives

Engineers don't optimize cost unless they see it. Approaches:

  • Make cost a visible metric in service dashboards. Alongside latency and error rate.
  • Add cost to engineering reviews. A new service launch includes projected cost; a quarterly review includes actual.
  • Showback to teams monthly. A Slack message: "this month your team's cost was $X, +/- Y% vs last month, here are the biggest items."
  • Cost in SLO conversations. "We could cut latency from 50ms to 30ms but it'd 4x cost — is that worth it?"
  • Reward cost wins. Recognition in eng all-hands, just like reliability wins.

Don't:

  • Punish for cost without context. A team running an experimental ML workload should expect higher spend.
  • Make cost a primary OKR. It distorts incentives toward cost-cutting over value.
  • Charge teams without giving them control. If a team can't pick their instance type, don't bill them for it.
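The monthly showback message described above is mostly string formatting once the numbers exist. This is a minimal sketch, assuming you already have each team's current and previous monthly totals and a list of `(item, cost)` pairs from your billing data. The function name and the choice to show the top three items are assumptions, and posting to Slack is left out.

```python
def showback_message(team, current, previous, items, top_n=3):
    """Format one team's monthly cost summary as plain text."""
    if previous > 0:
        change = (current - previous) / previous * 100
        trend = f"{change:+.1f}% vs last month"
    else:
        # No baseline, so a percentage would be meaningless.
        trend = "no prior month to compare"

    lines = [f"{team}: ${current:,.0f} this month ({trend})"]
    biggest = sorted(items, key=lambda item: item[1], reverse=True)[:top_n]
    for name, cost in biggest:
        lines.append(f"  - {name}: ${cost:,.0f}")
    return "\n".join(lines)
```

Guarding the zero-baseline case matters in practice: new teams and new accounts have no previous month, and a division error in a scheduled job is an easy way to lose the habit.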

Governance

Policy | Mechanism
Require tags on creation | SCP / IAM condition / Terraform validation
Block expensive instance types in dev accounts | IAM condition limiting ec2:RunInstances
Auto-stop dev/staging at night | EventBridge schedule + Lambda
Auto-delete old snapshots | Lambda on a schedule
Block public S3 buckets (security too) | Account-level Block Public Access
Cap per-account spend | AWS Budgets + auto-shutdown via Lambda (extreme)

A common pattern: dev accounts have hard policies; prod has soft alerts. Dev should be cheap by construction; prod should be visible and reviewable.
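The auto-stop policy comes down to one decision per instance, which is worth keeping separate from the cloud API calls so it can be tested. The sketch below shows only that decision. The working-hours window (08:00 to 20:00) and the `KeepAlive` opt-out tag are assumptions for illustration, and the scheduled function that would call this and actually stop instances is not shown.

```python
from datetime import datetime

# Environments this rule may stop; prod is deliberately excluded.
STOPPABLE_ENVS = {"dev", "staging"}


def should_be_stopped(tags, now, start_hour=8, end_hour=20):
    """Return True if an instance with these tags should be off at `now`.

    `now` is a datetime in the team's local time zone.
    """
    if str(tags.get("Env", "")).lower() not in STOPPABLE_ENVS:
        return False
    # Hypothetical opt-out tag for long-running dev jobs.
    if str(tags.get("KeepAlive", "")).lower() == "true":
        return False

    is_weekend = now.weekday() >= 5  # Saturday = 5, Sunday = 6
    in_working_hours = start_hour <= now.hour < end_hour
    return is_weekend or not in_working_hours
```

An untagged instance is never stopped by this rule, which is one more reason to enforce tags at creation.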

Common Pitfalls

"Just lift and shift." Lifting on-prem workloads to EC2 without re-architecting often increases cost. Cloud rewards elastic, decoupled designs.

Optimizing too early in a migration. Don't try to optimize during the chaos of migration or immediately after it. Land first, then optimize about 3 months later, once the baseline is stable.

Ignoring data transfer. Egress and cross-AZ traffic are silent bill drivers. Architect with awareness; don't discover at month-end.

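Buying RIs without analysis. Vendors and consultants will push Reserved Instances (RIs) without modeling your actual usage. Model it first: rightsize with Compute Optimizer, then size the commitment from recommendations based on your own usage history, such as those in AWS Cost Explorer or Cloudability.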

Custom dashboards instead of vendor tools. Three engineers spent 6 months building a dashboard Vantage gives you in an hour for $1k/mo. Buy unless you're at a scale where custom is genuinely cheaper.

Treating Spot as free. Spot saves money if your workload handles eviction. Engineering time to make a workload Spot-compatible can outweigh savings for small workloads.

Confusing list price with actual cost. Most large customers have negotiated EDPs / committed-use discounts. Use your rate, not on-demand list, in optimization math.

Saving money on the wrong things. Cutting CI runner costs saves $10k/yr but slows every engineer by 10 min/day — net negative.
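The CI runner example is worth running as arithmetic before approving a cut. The sketch below uses the $10k/yr saving and the 10 min/day slowdown from the text. The headcount, loaded hourly rate, and 230 working days per year are assumptions to replace with your own numbers.

```python
def net_annual_impact(savings_per_year, engineers, minutes_lost_per_day,
                      loaded_hourly_rate, working_days=230):
    """Annual savings minus the cost of engineer time lost to the change.

    A negative result means the "saving" costs more than it returns.
    """
    hours_lost = engineers * (minutes_lost_per_day / 60) * working_days
    productivity_cost = hours_lost * loaded_hourly_rate
    return savings_per_year - productivity_cost


# Assumed inputs: 50 engineers at a $100/hour loaded rate.
impact = net_annual_impact(10_000, engineers=50, minutes_lost_per_day=10,
                           loaded_hourly_rate=100)
print(f"Net annual impact: ${impact:,.0f}")
```

Under those assumptions the lost time is worth roughly $192k a year against $10k saved. Even with only 3 engineers the lost time (about $11.5k) already exceeds the saving.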

Multi-Cloud and FOCUS

If you're on more than one cloud:

  • FOCUS spec gives you a unified billing schema. Use it.
  • Vantage, Cloudability, CloudZero all support multi-cloud. Build one dashboard, not three.
  • Negotiate centrally. Handle EDPs and committed-use discounts from one central team; negotiating each vendor in isolation leaves money on the table.
  • Don't pretend portability is free. Avoiding lock-in often costs more than the lock-in would.
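Once every provider's billing export is in FOCUS shape, "one dashboard, not three" reduces to a single aggregation. The sketch below assumes rows have already been loaded as dictionaries using the FOCUS 1.0 column names `ProviderName`, `ServiceName`, and `BilledCost`. Check the column names against the FOCUS version your exports actually use; the function name is illustrative.

```python
from collections import defaultdict


def cost_by_provider_and_service(rows):
    """Sum BilledCost per (ProviderName, ServiceName) across all providers."""
    totals = defaultdict(float)
    for row in rows:
        key = (row["ProviderName"], row["ServiceName"])
        # Exports often deliver costs as strings, so coerce to float.
        totals[key] += float(row["BilledCost"])
    return dict(totals)
```

The point of the unified schema is that this function needs no per-provider branches. Without FOCUS, each cloud would need its own column mapping before rows could be summed together.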

Continuous Improvement

FinOps is a habit, not a project. Operating rhythm:

Cadence | Activity
Daily | Anomaly alerts triaged
Weekly | Untagged resource report; quick-wins review
Monthly | Showback to teams; trend review; budget vs actual
Quarterly | Rightsizing pass on top-10 services; commitment review
Annually | EDP negotiation; full architecture cost review

Checklist

FinOps practice readiness check:

  • All cloud resources have at least 5 tags (Env, Team, Service, CostCenter, Owner)
  • Cost & Usage Report (or equivalent) exported to S3/BigQuery, queryable in Athena/BigQuery
  • At least one cost dashboard exists, shared monthly with engineering
  • Anomaly detection alerts to Slack within 24 hours
  • Kubecost / OpenCost running on K8s clusters (if applicable)
  • Budgets set per team / per account with forecast alerts at 80% and 100%
  • At least 60% of steady compute covered by Savings Plans / commitments
  • Dev/staging environments auto-stop on nights and weekends
  • S3 lifecycle policies for cold data
  • Infracost or similar cost feedback in IaC PRs
  • Designated FinOps practitioner (or % of someone's time)
  • Monthly cost review meeting on the calendar
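Two of the checklist items are numeric and easy to check mechanically: the 80% / 100% forecast alerts and the 60% commitment coverage target. This is a minimal sketch, assuming the forecast, budget, and hourly spend figures come from your own billing data. The function names are illustrative and not part of any vendor API.

```python
def crossed_thresholds(forecast, budget, thresholds=(0.8, 1.0)):
    """Return the budget thresholds the forecast has reached or passed."""
    return [t for t in thresholds if forecast >= budget * t]


def commitment_coverage(covered_hourly_spend, steady_hourly_spend):
    """Fraction of steady compute spend covered by commitments."""
    if steady_hourly_spend <= 0:
        return 0.0
    return covered_hourly_spend / steady_hourly_spend


def meets_coverage_target(covered_hourly_spend, steady_hourly_spend, target=0.6):
    """True when coverage is at or above the checklist's 60% target."""
    return commitment_coverage(covered_hourly_spend, steady_hourly_spend) >= target
```

Measuring coverage against steady spend rather than peak spend is deliberate: commitments bought against peak usage sit idle whenever load drops.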

What's Next

You have a FinOps practice. Connect it to:

  • Monitoring — service-level metrics drive rightsizing decisions
  • IaC — tagging, policies, and Infracost feedback live in Terraform
  • CI/CD — cost diffs in PRs; auto-stop in pipelines
  • Containerization — Karpenter, VPA, HPA all serve FinOps goals
