Steven's Knowledge

Best Practices

Production-ready Terraform - project layout, CI/CD pipelines, security, testing, drift detection, and policy-as-code

The patterns here are what separate a Terraform repo that scales from one that becomes a liability.

Project Layout

A common, scalable layout:

infrastructure/
├── modules/                       # reusable building blocks
│   ├── vpc/
│   ├── eks-cluster/
│   ├── database/
│   └── application/
├── environments/
│   ├── staging/
│   │   ├── backend.tf            # key = staging/terraform.tfstate
│   │   ├── providers.tf
│   │   ├── main.tf               # composes modules
│   │   ├── variables.tf
│   │   └── terraform.tfvars      # staging values
│   └── production/
│       ├── backend.tf            # key = production/terraform.tfstate
│       └── ...
└── .github/workflows/
    ├── plan.yml                  # on PR
    └── apply.yml                 # on merge to main

Key properties:

  • One state file per environment — staging and production are independent blast radii.
  • Modules are versioned: external modules are pinned to a git tag or registry version, while in-repo modules referenced by relative path are versioned with the repository itself.
  • Environment dirs are thin: they call modules and pass values. No business logic.
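
As a rough illustration of what a "thin" environment directory might contain, the sketch below pairs a per-environment backend with a main.tf that only wires modules together. The bucket, lock table, module inputs, and module outputs are hypothetical placeholders, not taken from a real repository.

```hcl
# environments/staging/backend.tf
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"   # hypothetical bucket name
    key            = "staging/terraform.tfstate" # one state file per environment
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"           # hypothetical lock table
    encrypt        = true
  }
}

# environments/staging/main.tf
module "vpc" {
  source = "../../modules/vpc"

  name       = "myapp-staging" # hypothetical input
  cidr_block = var.vpc_cidr    # hypothetical input, declared in variables.tf
}

module "database" {
  source = "../../modules/database"

  name       = "myapp-staging"
  vpc_id     = module.vpc.vpc_id             # hypothetical module output
  subnet_ids = module.vpc.private_subnet_ids # hypothetical module output
}
```

Production would use the same files with a different backend key and different values in terraform.tfvars, which keeps the two environments structurally identical.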

CI/CD: Plan on PR, Apply on Merge

The standard pipeline:

# .github/workflows/plan.yml
name: Terraform Plan
on:
  pull_request:
    paths:
      - 'infrastructure/**'

jobs:
  plan:
    strategy:
      matrix:
        env: [staging, production]
    runs-on: ubuntu-latest
    permissions:
      id-token: write           # for AWS OIDC
      pull-requests: write      # to comment the plan
      contents: read
    defaults:
      run:
        working-directory: infrastructure/environments/${{ matrix.env }}

    steps:
      - uses: actions/checkout@v4

      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ vars.TF_PLAN_ROLE }}     # read-only role
          aws-region: us-east-1

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.7.5

      - run: terraform fmt -check -recursive
      - run: terraform init
      - run: terraform validate
      - run: terraform plan -no-color -out=tfplan

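      - name: Comment plan on PR
        uses: actions/github-script@v7
        with:
          script: |
            // defaults.run.working-directory applies only to `run` steps, so set cwd explicitly.
            const cwd = 'infrastructure/environments/${{ matrix.env }}';
            const plan = require('child_process')
              .execSync('terraform show -no-color tfplan', { cwd })
              .toString();
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              // GitHub caps comment bodies at 65,536 characters, hence the slice.
              body: '### ${{ matrix.env }}\n```\n' + plan.slice(0, 60000) + '\n```',
            });

# .github/workflows/apply.yml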
name: Terraform Apply
on:
  push:
    branches: [main]
    paths:
      - 'infrastructure/**'

jobs:
  apply:
    strategy:
      max-parallel: 1
      matrix:
        env: [staging, production]
    runs-on: ubuntu-latest
    environment: ${{ matrix.env }}        # requires reviewer approval for production
    permissions:
      id-token: write
      contents: read
    defaults:
      run:
        working-directory: infrastructure/environments/${{ matrix.env }}

    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ vars.TF_APPLY_ROLE }}    # higher-privilege role
          aws-region: us-east-1
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.7.5                     # same pinned version as plan.yml
      - run: terraform init -input=false
      - run: terraform apply -input=false -auto-approve

Important properties:

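  • OIDC, not access keys. The CI job exchanges its OIDC token for short-lived AWS credentials by assuming a role. No long-lived secrets.
  • Different role for plan vs. apply. Plan needs read access to the managed resources plus access to the state backend and its lock; apply needs write.
  • GitHub Environments with required reviewers. Production apply pauses for a human approval.
  • Plan output in the PR. Reviewers see the diff before approving the merge. Note that apply.yml runs a fresh plan at merge time, so the applied changes can differ from the reviewed plan if main or the real infrastructure has moved since.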
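
The OIDC and environment-approval points meet in the trust policy of the role that CI assumes. The following is a minimal sketch, assuming the GitHub OIDC provider is already registered in the AWS account; the repository name and role name are hypothetical.

```hcl
# Assumes the GitHub OIDC identity provider already exists in this AWS account.
data "aws_iam_openid_connect_provider" "github" {
  url = "https://token.actions.githubusercontent.com"
}

data "aws_iam_policy_document" "apply_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [data.aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    # Only jobs bound to the "production" GitHub Environment may assume this role.
    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:example-org/infrastructure:environment:production"] # hypothetical repo
    }
  }
}

resource "aws_iam_role" "terraform_apply" {
  name               = "terraform-apply-production" # hypothetical role name
  assume_role_policy = data.aws_iam_policy_document.apply_trust.json
}
```

Scoping the sub claim to the environment means the required-reviewer gate also protects the credentials: a job that skips the environment cannot assume the role. A plan role would use a looser sub condition (for example, one matching pull request runs) and a read-only permissions policy.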

Security

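| Principle | What it looks like in practice |
|---|---|
| No secrets in HCL or tfvars | Pull from AWS Secrets Manager / SSM / Vault via data sources (the values still land in state, so protect the state file too) |
| Mark sensitive values | sensitive = true on variables and outputs containing secrets; this redacts CLI output only, the values remain in state |
| Encrypt state at rest | S3 SSE, GCS CMEK, or Terraform Cloud (encrypted by default) |
| Restrict who can run apply | Separate IAM role; gate behind CI + approval |
| Least-privilege provider creds | Plan role: read-only actions (Describe*, List*, Get*) for the services in use, plus access to the state backend and its lock |
| Scan for risky changes | Tools like tfsec, checkov, trivy config in CI against the HCL source; checkov and Conftest can also evaluate the plan JSON |
| Pin everything | Terraform version, provider versions, module versions |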
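
Several rows of this table translate directly into HCL. The snippet below is an illustrative sketch only: the provider version constraint, secret name, and variable name are assumptions chosen for the example.

```hcl
terraform {
  required_version = "~> 1.7.0" # pin the CLI to the 1.7.x series

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.40" # hypothetical constraint; pin to whatever you have tested
    }
  }
}

# Read the secret at plan/apply time instead of committing it to the repo.
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "staging/database/password" # hypothetical secret name
}

variable "api_token" {
  type      = string
  sensitive = true # redacted in plan and apply output
}

output "db_password" {
  value     = data.aws_secretsmanager_secret_version.db_password.secret_string
  sensitive = true # Terraform rejects a sensitive value in a non-sensitive output
}
```

Both the data source result and the sensitive variable are still written to state in plain text, which is why state encryption and restricted state access appear in the same table.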

Sample Plan Security Scan

- name: Run tfsec
  uses: aquasecurity/tfsec-action@v1.0.3
  with:
    soft_fail: false                # block merge on findings

- name: Run checkov
  uses: bridgecrewio/checkov-action@v12    # pin a release tag or commit SHA, not master
  with:
    directory: infrastructure/environments/${{ matrix.env }}
    framework: terraform

Policy as Code

For organizations, scan-then-block isn't enough — you want deny by default with explicit exceptions. Three popular choices:

| Tool | Where it runs | Language |
|---|---|---|
| OPA / Conftest | Plan output (JSON) | Rego |
| Sentinel | Terraform Cloud / Enterprise | Sentinel DSL |
| Checkov / tfsec / KICS | HCL source | Built-in rule packs |

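Example Conftest policy that rejects S3 buckets with a public ACL:

# policy/s3.rego
package main

import rego.v1

# ACL values that expose a bucket to anonymous users.
public_acls := {"public-read", "public-read-write"}

# AWS provider v4+ sets ACLs on aws_s3_bucket_acl; older configs set acl on the bucket.
deny contains msg if {
  some resource in input.resource_changes
  resource.type in {"aws_s3_bucket", "aws_s3_bucket_acl"}
  resource.change.after.acl in public_acls
  msg := sprintf("S3 bucket %s must not use a public ACL", [resource.address])
}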

Wired into CI:

terraform show -json tfplan > plan.json
conftest test plan.json --policy policy/

Testing

Three layers, increasing in fidelity and cost:

1. Static checks (every PR, free)

terraform fmt -check -recursive
terraform validate
tflint --recursive
tfsec .
checkov -d .

2. Module unit tests with the native test framework

Terraform 1.6+ has a built-in test framework. Tests live in .tftest.hcl files alongside the module:

# modules/database/tests/defaults.tftest.hcl
variables {
  name       = "test-db"
  vpc_id     = "vpc-12345678"
  subnet_ids = ["subnet-aaa", "subnet-bbb"]
}

run "defaults_plan_cleanly" {
  command = plan

  assert {
    condition     = aws_db_instance.this.instance_class == "db.t3.medium"
    error_message = "default instance class should be db.t3.medium"
  }

  assert {
    condition     = aws_db_instance.this.allocated_storage == 20
    error_message = "default storage should be 20 GB"
  }
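}

Run the tests from the module directory. terraform test discovers .tftest.hcl files in the module root and its tests/ directory. Because command = plan here uses the real AWS provider, the run needs valid AWS credentials:

terraform init
terraform test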
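
Where credentials are not available (for example, on PRs from forks), Terraform 1.7+ can substitute a mocked provider. The sketch below assumes the database module exposes an instance_class input, which is a hypothetical name; the file name is also illustrative.

```hcl
# modules/database/tests/mocked.tftest.hcl (hypothetical file)

# Replaces the real AWS provider with a mock, so no credentials or API calls are needed.
mock_provider "aws" {}

variables {
  name       = "test-db"
  vpc_id     = "vpc-12345678"
  subnet_ids = ["subnet-aaa", "subnet-bbb"]
}

run "instance_class_can_be_overridden" {
  command = plan

  # Run-level variables override the file-level defaults above.
  variables {
    instance_class = "db.r6g.large" # hypothetical module input
  }

  assert {
    condition     = aws_db_instance.this.instance_class == "db.r6g.large"
    error_message = "instance_class input should be passed through to the DB instance"
  }
}
```

Mocked runs only verify the module's own logic. Attributes computed by the provider are generated placeholders, so assertions should stick to values derived from configuration.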

3. End-to-end tests with Terratest

Spin up real infrastructure in an isolated AWS account, assert against it, tear it down. Best for module releases:

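// test/database_test.go
package test

import (
    "testing"

    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/stretchr/testify/assert"
)

func TestDatabase(t *testing.T) {
    opts := &terraform.Options{
        TerraformDir: "../examples/basic",
    }

    // Deferred first so the infrastructure is destroyed even if apply or an assertion fails.
    defer terraform.Destroy(t, opts)
    terraform.InitAndApply(t, opts)

    endpoint := terraform.Output(t, opts, "endpoint")
    assert.NotEmpty(t, endpoint)
}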

Slow and costs real money — reserve for shared modules that downstream teams depend on.

Drift Detection

Even with strict CI, things drift: someone clicks in the console, an autoscaler resizes things, AWS auto-rotates credentials. Detect drift on a schedule:

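# .github/workflows/drift.yml
name: Terraform Drift
on:
  schedule:
    - cron: "0 6 * * *"           # daily at 06:00 UTC

jobs:
  drift:
    strategy:
      fail-fast: false            # check every environment even if one fails
      matrix:
        env: [staging, production]
    runs-on: ubuntu-latest
    permissions:
      id-token: write             # for AWS OIDC
      contents: read
    defaults:
      run:
        working-directory: infrastructure/environments/${{ matrix.env }}
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ vars.TF_PLAN_ROLE }}     # read-only role
          aws-region: us-east-1
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.7.5
      - run: terraform init
      - id: plan
        run: terraform plan -detailed-exitcode -no-color -out=tfplan
        continue-on-error: true   # exit code 2 (drift) must not fail the step
      - if: steps.plan.outputs.exitcode == '1'
        run: exit 1               # a genuine plan error should fail the job
      - if: steps.plan.outputs.exitcode == '2'
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            { "text": "🚨 Terraform drift detected in ${{ github.repository }} (${{ matrix.env }})" }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}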

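plan -detailed-exitcode returns 0 (no changes), 1 (error), or 2 (changes), which makes it easy to script. The steps.plan.outputs.exitcode value comes from the wrapper script that hashicorp/setup-terraform installs by default.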
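
Not all drift deserves an alert. For attributes that another system is expected to change, such as an autoscaler adjusting capacity, one option is to tell Terraform to ignore them so the scheduled plan stays quiet. A minimal sketch follows; the resource names and the referenced launch template and variable are hypothetical.

```hcl
resource "aws_autoscaling_group" "workers" {
  name                = "myapp-staging-workers" # hypothetical name
  min_size            = 2
  max_size            = 10
  desired_capacity    = 2
  vpc_zone_identifier = var.subnet_ids # hypothetical input

  launch_template {
    id      = aws_launch_template.workers.id # hypothetical, defined elsewhere
    version = "$Latest"
  }

  lifecycle {
    # The autoscaler owns desired_capacity after creation; ignoring it keeps
    # routine scaling from showing up as drift in the scheduled plan.
    ignore_changes = [desired_capacity]
  }
}
```

The trade-off is that Terraform will no longer correct that attribute, so ignore_changes is best limited to fields with a clearly identified external owner.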

Operating Tips

A handful of habits that prevent self-inflicted incidents:

  1. Small, frequent applies. Large diffs are hard to review and slow to roll back. Keep PRs to one logical change.
  2. Read the plan before approving. Especially the -/+ (replace) and - (destroy) lines.
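  3. Never terraform destroy against production. Use prevent_destroy on critical resources (it only protects while the resource block remains in the configuration). To remove a single resource, delete it from the configuration and apply through the normal pipeline; keep terraform apply -destroy -target=... as a last resort.
  4. Don't use -target to "work around" failures. It is meant for exceptional recovery, not routine use. If you keep needing it, there is an underlying dependency or state problem to fix.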
  5. Tag everything. A default_tags block on the provider gets you 80% of the way — Project, Environment, ManagedBy = "terraform", Owner.
  6. Adopt a naming convention early. <project>-<env>-<resource> is fine. Whatever you pick, automate it via locals.
  7. Document the unobvious. A short comment on why a lifecycle.ignore_changes is there saves the next person an hour.
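
Tips 3, 5, and 6 can be combined in a few lines of configuration. The sketch below is one possible arrangement; the project name, owner, and bucket are hypothetical values chosen for illustration.

```hcl
locals {
  project     = "myapp" # hypothetical
  environment = "staging"
  name_prefix = "${local.project}-${local.environment}" # <project>-<env>
}

provider "aws" {
  region = "us-east-1"

  # Applied to every taggable resource created through this provider.
  default_tags {
    tags = {
      Project     = local.project
      Environment = local.environment
      ManagedBy   = "terraform"
      Owner       = "platform-team" # hypothetical
    }
  }
}

resource "aws_s3_bucket" "audit_logs" {
  bucket = "${local.name_prefix}-audit-logs" # <project>-<env>-<resource>

  lifecycle {
    prevent_destroy = true # any plan that would destroy this bucket fails
  }
}
```

Deriving names from locals means the convention is enforced by construction rather than by review.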

Checklist

Pre-production Terraform checklist

  • Remote backend with locking (S3 + DynamoDB, GCS, or TFC)
  • State file encrypted at rest, versioning enabled
  • Terraform and provider versions pinned
  • Separate state per environment
  • CI runs fmt, validate, plan, security scans on every PR
  • Plan output posted to the PR
  • Apply gated behind code review + environment approval for production
  • OIDC-based credentials in CI (no long-lived keys)
  • Critical resources marked prevent_destroy
  • default_tags on the provider
  • Daily drift detection
  • Module tests for anything shared across teams
