Steven's Knowledge

Patterns

Showback/chargeback, rightsizing loops, commitment strategy, Spot/preemptible, anomaly detection, unit economics

Patterns that move FinOps from "look at the bill" to "spend with intent."

Showback and Chargeback

Model      | Definition                                        | Effect
-----------|---------------------------------------------------|--------------------------------------
Showback   | Each team sees their cost; no internal billing    | Awareness; weak incentive
Chargeback | Each team's cost deducts from their budget        | Strong incentive; can create friction
Hybrid     | Showback by default; chargeback above a threshold | Most common in mid-size orgs

Start with showback. The first time a team sees "your service costs $40k/month," it usually finds optimizations on its own. Chargeback creates a real incentive but also invites gaming ("we'll just use the central platform team's resources"), so introduce it only once the org is mature enough to handle that.
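The hybrid model can be sketched in a few lines. This is a minimal illustration, not a billing implementation: the threshold value, the team names, and the choice to charge back only the spend above the threshold (rather than the whole bill) are all assumptions.

```python
from dataclasses import dataclass

# Hypothetical threshold: monthly spend above this is charged back to the team.
CHARGEBACK_THRESHOLD_USD = 25_000.0


@dataclass
class TeamStatement:
    team: str
    total_cost: float    # always reported to the team (showback)
    charged_back: float  # portion deducted from the team's budget (chargeback)


def build_statements(monthly_cost_by_team, threshold=CHARGEBACK_THRESHOLD_USD):
    """Hybrid model: every team sees its cost; only the excess is charged back."""
    statements = []
    # Most expensive team first, so the report reads top-down by impact.
    for team, cost in sorted(monthly_cost_by_team.items(), key=lambda kv: -kv[1]):
        excess = max(0.0, cost - threshold)
        statements.append(TeamStatement(team, cost, excess))
    return statements


# Illustrative numbers only.
costs = {"checkout": 40_000.0, "search": 12_500.0, "platform": 25_000.0}
statements = build_statements(costs)
for s in statements:
    print(f"{s.team:<10} cost=${s.total_cost:>10,.2f}  charged_back=${s.charged_back:>10,.2f}")
```

Charging only the excess keeps small teams in pure showback while still giving the largest spenders a budget signal.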

Rightsizing Loops

Rightsizing isn't a one-time event; it's a loop:

   ┌─> Measure utilization (CPU, mem, IOPS, network)
   │       │
   │       v
   │   Identify oversized resources (utilization < 40%)
   │       │
   │       v
   │   Propose change (downsize / consolidate)
   │       │
   │       v
   │   Apply (off-peak, with rollback ready)
   │       │
   │       v
   └── Verify (no perf regression, no on-call wakeups)

For VMs: AWS Compute Optimizer, Azure Advisor, GCP Recommender. For K8s: Goldilocks (VPA recommendations) or Kubecost rightsizing.

Cadence: top 10 most expensive workloads, every quarter. Trying to rightsize everything burns engineer time on micro-savings.
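The "identify" step and the top-10 cadence can be combined into one filter. A minimal sketch, assuming utilization is already exported as p95 fractions per workload; the `Workload` fields and sample data are hypothetical, and taking the higher of CPU and memory is one possible rule so that a memory-bound service is not flagged on low CPU alone.

```python
from dataclasses import dataclass

UTILIZATION_THRESHOLD = 0.40  # "utilization < 40%" from the loop above
TOP_N = 10                    # "top 10 most expensive workloads"


@dataclass
class Workload:
    name: str
    monthly_cost: float
    p95_cpu: float  # fraction of provisioned CPU, 0.0 to 1.0
    p95_mem: float  # fraction of provisioned memory, 0.0 to 1.0


def rightsizing_candidates(workloads, top_n=TOP_N, threshold=UTILIZATION_THRESHOLD):
    """Return oversized workloads among the top_n most expensive ones."""
    most_expensive = sorted(workloads, key=lambda w: w.monthly_cost, reverse=True)[:top_n]
    # A workload counts as oversized only if its busiest dimension is under the threshold.
    return [w for w in most_expensive if max(w.p95_cpu, w.p95_mem) < threshold]


# Illustrative data only.
workloads = [
    Workload("orders-db", 18_000, 0.35, 0.82),   # memory-bound: not flagged
    Workload("search-api", 12_000, 0.22, 0.31),  # flagged
    Workload("batch-etl", 9_000, 0.65, 0.40),    # busy enough: not flagged
    Workload("legacy-admin", 300, 0.05, 0.10),   # idle, but cheap
]
candidates = rightsizing_candidates(workloads, top_n=3)
print([w.name for w in candidates])
```

With `top_n=3` the idle but cheap `legacy-admin` never reaches the review, which is the point of the cadence rule.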

Commitment Strategy

Cloud providers give big discounts for commitment:

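AWS construct             | Discount  | Commitment                                    | Best for
--------------------------|-----------|-----------------------------------------------|--------------------------
Compute Savings Plan      | up to 66% | 1 or 3 years; any instance family/region      | Steady compute baseline
EC2 Instance Savings Plan | up to 72% | 1 or 3 years; specific family in one region   | Predictable workloads
Reserved Instances (RI)   | up to 72% | 1 or 3 years; specific config                 | Legacy; SP usually better
Spot Instances            | up to 90% | None; can be reclaimed with 2 minutes' notice | Fault-tolerant workloads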

A practical strategy:

  1. Baseline coverage: Compute Savings Plan for ~70% of steady-state usage (1-year, no upfront).
  2. Burst on-demand: Above baseline runs at full price; that's fine, it's bursty.
  3. Spot for batch & stateless: Karpenter on AWS, Spot.io across clouds.
  4. Re-evaluate quarterly as baseline shifts.
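Step 1 can be sketched as a sizing calculation over hourly spend. This is a rough illustration under stated assumptions: "steady state" is taken as a low percentile of hourly spend (the 10th percentile here is a hypothetical choice), and everything is expressed in on-demand-equivalent dollars, whereas a real Savings Plan commitment is quoted in discounted dollars per hour.

```python
def steady_state(hourly_spend, percentile=0.10):
    """Estimate the always-on baseline as a low percentile of hourly spend."""
    ordered = sorted(hourly_spend)
    index = int(percentile * (len(ordered) - 1))
    return ordered[index]


def plan_commitment(hourly_spend, coverage=0.70):
    """Size a commitment at `coverage` of the steady-state baseline."""
    baseline = steady_state(hourly_spend)
    commitment = coverage * baseline
    # Spend up to the commitment is covered; anything above runs on-demand or Spot.
    covered = sum(min(hour, commitment) for hour in hourly_spend)
    return {
        "baseline": baseline,
        "hourly_commitment": commitment,
        "covered_share": covered / sum(hourly_spend),
    }


# Illustrative day: 20 quiet hours at $10/h, 4 peak hours at $30/h.
plan = plan_commitment([10.0] * 20 + [30.0] * 4)
print(plan)
```

Re-running this each quarter on fresh data is one way to implement step 4.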

Don't over-commit. A 3-year RI for a service that gets deprecated in 18 months keeps billing for the remaining 18, and unless the discount is deep it ends up costing more than on-demand would have.
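Whether an abandoned commitment ends up costing more than on-demand depends on the discount. A minimal break-even sketch, assuming a flat discount rate, a commitment that cannot be resold or reassigned, and costs measured in "months of on-demand spend"; the 40% and 60% rates are illustrative, not quoted prices.

```python
def break_even_months(term_months, discount):
    """Months of actual use after which the commitment beats on-demand."""
    # The commitment costs term * (1 - discount) on-demand-months in total,
    # whether or not the workload is still running.
    return term_months * (1.0 - discount)


def commitment_penalty(term_months, discount, months_used):
    """Positive result: the commitment cost more than on-demand would have."""
    committed_cost = term_months * (1.0 - discount)
    on_demand_cost = float(months_used)
    return committed_cost - on_demand_cost


# 3-year term, workload retired after 18 months.
print(break_even_months(36, 0.40))       # break-even point at a 40% discount
print(commitment_penalty(36, 0.40, 18))  # positive: worse than on-demand
print(commitment_penalty(36, 0.60, 18))  # negative: still cheaper than on-demand
```

The takeaway is to compare the expected lifetime of the workload against the break-even point before choosing a term.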

Spot / Preemptible Usage

Spot instances cost 60-90% less than on-demand but can be reclaimed on short notice: 2 minutes on AWS, and about 30 seconds for GCP and Azure Spot VMs. Good targets:

  • Batch jobs (data pipelines, ML training, CI runners)
  • Stateless web tiers with replicas
  • Dev/staging environments
  • Anything that can checkpoint or retry

Bad targets:

  • Stateful single-instance DBs
  • Long-lived workloads that can't tolerate interruption
  • Anything where 2-min eviction breaks SLO

Tooling that makes Spot safe:

  • AWS Karpenter auto-replaces interrupted nodes within seconds
  • Spot.io orchestrates Spot + on-demand across clouds
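  • Cluster Autoscaler with a mixed instances policy diversifies across instance types; pair it with a separate on-demand node group (for example via the priority expander) so it can fall back when Spot capacity runs out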
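On the workload side, "can checkpoint or retry" usually means something like the loop below. A minimal sketch: `interrupted` is a hypothetical callable standing in for whatever signals eviction (on AWS it might poll the instance metadata `spot/instance-action` path), and the JSON checkpoint file is an assumption chosen for brevity.

```python
import json
import os
import tempfile


def save_checkpoint(path, next_index):
    """Write the checkpoint atomically so a kill mid-write cannot corrupt it."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp, path)


def run_batch(items, process, checkpoint_path, interrupted):
    """Process items in order, resuming after the last completed one.

    Returns True when every item is done, False when stopped by an eviction notice.
    """
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_index"]

    for i in range(start, len(items)):
        if interrupted():
            return False  # stop cleanly; a replacement instance resumes later
        process(items[i])
        save_checkpoint(checkpoint_path, i + 1)
    return True


# Simulated run: the first instance is evicted after three items, the second finishes.
processed = []
checkpoint = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
signals = iter([False, False, False, True])
first_done = run_batch(list(range(6)), processed.append, checkpoint, lambda: next(signals))
second_done = run_batch(list(range(6)), processed.append, checkpoint, lambda: False)
print(first_done, second_done, processed)
```

Checkpointing after every item is the simplest policy; for cheap items, checkpointing every N items trades a little rework for fewer writes.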

Anomaly Detection

You want to know when spend suddenly changes — not at month-end review.

  • AWS Cost Anomaly Detection (free) — ML-based; alerts via SNS
  • Vantage / CloudZero anomaly alerts — daily reports
  • Custom: Athena query on CUR; alert if any service's daily cost > 2× previous 7-day avg
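-- Flag services whose cost yesterday exceeded 2x their average over the 7 days before it.
WITH daily AS (
  SELECT
    date(line_item_usage_start_date) AS usage_date,
    line_item_product_code AS svc,
    SUM(line_item_unblended_cost) AS cost
  FROM main_cur
  -- Partition filter; widen it when the lookback window crosses a year boundary.
  WHERE year = '2026'
  GROUP BY 1, 2
),
baseline AS (
  SELECT
    svc,
    AVG(cost) AS avg_cost
  FROM daily
  -- Days -8 through -2: the 7 days before yesterday, so yesterday is not in its own baseline.
  WHERE usage_date BETWEEN date_add('day', -8, current_date)
                       AND date_add('day', -2, current_date)
  GROUP BY svc
)
SELECT y.svc, y.cost, b.avg_cost, y.cost / NULLIF(b.avg_cost, 0) AS ratio
FROM daily y
JOIN baseline b ON b.svc = y.svc
WHERE y.usage_date = date_add('day', -1, current_date)
  AND y.cost > 100                -- ignore services too small to matter
  AND y.cost > 2 * b.avg_cost
ORDER BY ratio DESC;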

Common anomaly causes: a test left running, a recursive Lambda, a CloudWatch metric stream sending to S3 in a loop, a new feature ingesting more data than expected.
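The 2× rule is easy to prototype offline against exported daily costs before wiring up alerts. A minimal sketch, assuming costs arrive as one list per service, oldest day first, with the last entry being the day under test; the service names and figures are illustrative.

```python
from statistics import mean


def find_anomalies(daily_costs, min_cost=100.0, ratio_threshold=2.0, window=7):
    """Return (service, latest, baseline, ratio) for days above ratio_threshold x baseline."""
    anomalies = []
    for svc, costs in daily_costs.items():
        if len(costs) < window + 1:
            continue  # not enough history to form a baseline
        latest = costs[-1]
        baseline = mean(costs[-(window + 1):-1])  # the `window` days before the latest
        if latest > min_cost and baseline > 0 and latest > ratio_threshold * baseline:
            anomalies.append((svc, latest, baseline, latest / baseline))
    return sorted(anomalies, key=lambda a: a[3], reverse=True)


# Illustrative data only.
daily_costs = {
    "AmazonEC2": [500] * 7 + [1300],  # 2.6x baseline: flagged
    "AWSLambda": [20] * 7 + [90],     # 4.5x, but under the $100 floor: ignored
    "AmazonS3": [200] * 7 + [250],    # 1.25x: normal growth
}
anomalies = find_anomalies(daily_costs)
for svc, latest, baseline, ratio in anomalies:
    print(f"{svc}: ${latest} vs ${baseline} baseline ({ratio:.1f}x)")
```

The absolute floor matters as much as the ratio: without it, tiny services dominate the alert channel.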

Unit Economics

The most strategic FinOps practice. Pick metrics that connect cost to business value:

Business metric   | Cost ratio
------------------|---------------------------------------
Request           | $ per million requests
Customer          | $ per active user per month
Transaction       | $ per checkout / per order
Build             | $ per CI build
Tenant (B2B SaaS) | $ per customer per month, by plan tier

Build a dashboard with cost ÷ business metric over time:

$ per active user per month
0.40 ┤                                  ╭─── alert!
0.30 ┤                          ╭───────╯
0.20 ┤──────────────────────────╯
0.10 ┤
     └─────────────────────────────────────────
       Jan  Feb  Mar  Apr  May  Jun  Jul  Aug

If the ratio is falling: scaling efficiently. If rising: investigate before total cost gets dramatic.
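The ratio and the "rising" check can be sketched as below. The monthly figures are made up to resemble the chart, and the 10% month-over-month tolerance is a hypothetical alert threshold.

```python
def cost_per_unit(monthly_cost, monthly_units):
    """Cost divided by business metric, per month; months with no units are skipped."""
    return {
        month: monthly_cost[month] / monthly_units[month]
        for month in monthly_cost
        if monthly_units.get(month)
    }


def rising_months(ratios, tolerance=0.10):
    """Months where the ratio rose more than `tolerance` over the previous month."""
    months = list(ratios)  # dicts keep insertion order, so this is chronological
    return [
        cur
        for prev, cur in zip(months, months[1:])
        if ratios[cur] > ratios[prev] * (1 + tolerance)
    ]


# Illustrative data: cloud cost in USD and monthly active users.
cost = {"May": 20_000, "Jun": 31_500, "Jul": 33_000, "Aug": 46_000}
users = {"May": 100_000, "Jun": 105_000, "Jul": 110_000, "Aug": 115_000}

ratios = cost_per_unit(cost, users)
alerts = rising_months(ratios)
print(ratios)
print(alerts)
```

Total cost rises every month in this data, but only June and August are flagged, because July's growth was matched by user growth.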

Architecture-Level Patterns

Some savings only come from changing the architecture:

Pattern                                                                | Savings
-----------------------------------------------------------------------|----------------------------------------------------------------
Reduce egress: cache in CDN; keep traffic in-region                    | Egress is often 5-15% of bill
S3 → S3 Intelligent-Tiering                                            | Auto-moves cold data to IA; saves 40-95% on storage
Multi-AZ only where needed: dev doesn't need it                        | Cross-AZ traffic at $0.01/GB adds up
Lambda for spiky workloads, ECS/EKS for steady                         | Avoid Lambda for 24/7 work; avoid containers for 1% duty cycle
Route S3/DynamoDB traffic through VPC endpoints instead of NAT Gateway | NAT is $0.045/hr + per-GB processing; gateway endpoints for S3/DynamoDB are free
Aurora Serverless for spiky DB workloads                               | Pay per ACU-hour vs. 24/7 provisioned
Move logs to Loki / OpenSearch from CloudWatch                         | CloudWatch Logs at $0.50/GB ingest is brutal at scale

Pick one per quarter. Don't try them all.
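To decide which pattern is worth the quarter, estimate the saving first. A worked example for the NAT Gateway row, assuming the $0.045/hr figure from the table, a per-GB processing rate of $0.045 (an assumed us-east-1 list price; check current pricing for your region), and gateway endpoints with no hourly or per-GB charge. Traffic volumes are illustrative.

```python
HOURS_PER_MONTH = 730
NAT_HOURLY_USD = 0.045  # from the table above
NAT_PER_GB_USD = 0.045  # assumption: us-east-1 data-processing rate


def nat_monthly_cost(gb_processed, gateways=1):
    """Hourly charge for each gateway plus per-GB data processing."""
    return gateways * NAT_HOURLY_USD * HOURS_PER_MONTH + gb_processed * NAT_PER_GB_USD


def gateway_endpoint_savings(total_gb, s3_dynamodb_gb):
    """Monthly saving from moving S3/DynamoDB traffic off the NAT Gateway."""
    before = nat_monthly_cost(total_gb)
    after = nat_monthly_cost(total_gb - s3_dynamodb_gb)
    return before - after


# 50 TB/month through NAT, of which 40 TB is S3 traffic.
savings = gateway_endpoint_savings(total_gb=50_000, s3_dynamodb_gb=40_000)
print(f"NAT before: ${nat_monthly_cost(50_000):,.2f}/month")
print(f"Savings:    ${savings:,.2f}/month")
```

The hourly charge barely matters at this volume; the per-GB processing fee is where the saving comes from.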

FinOps in CI/CD

Cost as a deploy-time concern:

  • Infracost in Terraform PRs: "this PR adds $342/month"
  • Cost guardrails: block PRs that exceed N% cost increase without approval
  • Pre-merge K8s rightsizing checks: warn if requested resources >> historical usage
# Infracost in GitHub Actions
- name: Set up Infracost
  uses: infracost/actions/setup@v3
  with:
    api-key: ${{ secrets.INFRACOST_API_KEY }}
- name: Generate cost diff
  run: infracost diff --path . --format json --out-file /tmp/infracost.json
# Block scalar (|) so the backslash line continuations reach the shell intact.
- name: Comment on the PR
  run: |
    infracost comment github --path /tmp/infracost.json \
      --repo $GITHUB_REPOSITORY \
      --pull-request ${{ github.event.pull_request.number }} \
      --github-token ${{ github.token }}

A reviewer who sees "this PR adds $1.2k/month" makes a different decision than one who only sees the diff.
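The guardrail bullet can be sketched as a small check over the Infracost JSON report. This is an assumption-laden sketch: it expects top-level `pastTotalMonthlyCost` and `diffTotalMonthlyCost` fields holding numeric strings, which should be verified against the output of your Infracost version, and the 10% limit is a hypothetical value for "N%".

```python
MAX_INCREASE_PCT = 10.0  # hypothetical value for "N%"


def check_guardrail(report, max_increase_pct=MAX_INCREASE_PCT):
    """Return (ok, pct_increase) for a parsed Infracost JSON report.

    Field names are assumed, not confirmed here; adjust to the real report shape.
    """
    past = float(report.get("pastTotalMonthlyCost") or 0)
    diff = float(report.get("diffTotalMonthlyCost") or 0)
    if past <= 0:
        # No baseline to compare against: any new cost needs approval.
        return diff <= 0, (float("inf") if diff > 0 else 0.0)
    pct = 100.0 * diff / past
    return pct <= max_increase_pct, pct


# Illustrative reports, shaped like the assumed JSON.
big_change = {"pastTotalMonthlyCost": "1000", "diffTotalMonthlyCost": "342"}
small_change = {"pastTotalMonthlyCost": "1000", "diffTotalMonthlyCost": "50"}

print(check_guardrail(big_change))    # over the limit: block until approved
print(check_guardrail(small_change))  # within the limit: pass
```

In a pipeline, a step would load `/tmp/infracost.json` with `json.load`, call `check_guardrail`, and exit non-zero when the result is not ok and the PR has no approval label.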

Anti-Patterns

Optimizing what's easy to measure. Egress is easy to see and easy to obsess over; meanwhile the $200k/month idle Redshift cluster sits ignored. Always sort by total cost.

One-time cleanups. A 3-day FinOps sprint that drops the bill 20%, then nothing for a year. Without a loop, waste regrows.

Cost-cutting at the expense of velocity. "We can't ship that, it'll add cost" is the wrong default. Cost matters; velocity matters more. Optimize what's already shipped.

RI sprawl. Buying RIs for workloads that change every quarter. Use Savings Plans (more flexible) instead.

Chargeback before showback. Forces teams to play accounting games before they even understand their cost.

What's Next

  • Best Practices — tagging policy, team structure, eng incentives, common pitfalls