Best Practices

The patterns here are what separate a Terraform repo that scales from one that becomes a liability.

Project Layout

A common, scalable layout:

infrastructure/
├── modules/                       # reusable building blocks
│   ├── vpc/
│   ├── eks-cluster/
│   ├── database/
│   └── application/
├── environments/
│   ├── staging/
│   │   ├── backend.tf            # key = staging/terraform.tfstate
│   │   ├── providers.tf
│   │   ├── main.tf               # composes modules
│   │   ├── variables.tf
│   │   └── terraform.tfvars      # staging values
│   └── production/
│       ├── backend.tf            # key = production/terraform.tfstate
│       └── ...
└── .github/workflows/
    ├── plan.yml                  # on PR
    └── apply.yml                 # on merge to main

Key properties:

One state file per environment — staging and production are independent blast radii.
Modules are versioned: in-repo modules referenced by path are versioned with the repo's commits, and external modules are pinned to git tags or registry versions.
Environment dirs are thin: they call modules and pass values. No business logic.
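
As a rough sketch of what a thin environment directory might contain, the two files below configure the state backend and compose two of the modules. The bucket, lock table, module inputs, and module outputs are hypothetical names chosen for illustration; only the state key and the database module's `name`, `vpc_id`, and `subnet_ids` inputs come from this page.

```hcl
# environments/staging/backend.tf
terraform {
  backend "s3" {
    bucket         = "example-terraform-state" # hypothetical bucket name
    key            = "staging/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "example-terraform-locks" # hypothetical lock table
  }
}

# environments/staging/main.tf
module "vpc" {
  source = "../../modules/vpc"

  name       = "example-staging" # hypothetical input
  cidr_block = var.vpc_cidr      # hypothetical input, set in terraform.tfvars
}

module "database" {
  source = "../../modules/database"

  name       = "example-staging-db"
  vpc_id     = module.vpc.vpc_id             # hypothetical module output
  subnet_ids = module.vpc.private_subnet_ids # hypothetical module output
}
```

The production directory would look the same apart from the backend key and the values passed in, which is what keeps the two environments comparable.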

CI/CD: Plan on PR, Apply on Merge

The standard pipeline:

# .github/workflows/plan.yml
name: Terraform Plan
on:
  pull_request:
    paths:
      - 'infrastructure/**'

jobs:
  plan:
    strategy:
      matrix:
        env: [staging, production]
    runs-on: ubuntu-latest
    permissions:
      id-token: write           # for AWS OIDC
      pull-requests: write      # to comment the plan
      contents: read
    defaults:
      run:
        working-directory: infrastructure/environments/${{ matrix.env }}

    steps:
      - uses: actions/checkout@v4

      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ vars.TF_PLAN_ROLE }}     # read-only role
          aws-region: us-east-1

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.7.5

      - run: terraform fmt -check -recursive
      - run: terraform init
      - run: terraform validate
      - run: terraform plan -no-color -out=tfplan

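      - name: Comment plan on PR
        uses: actions/github-script@v7
        with:
          script: |
            // defaults.run.working-directory only applies to `run` steps,
            // so point execSync at the environment directory explicitly.
            const plan = require('child_process').execSync('terraform show -no-color tfplan', {
              cwd: 'infrastructure/environments/${{ matrix.env }}',
            }).toString();
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: '```\n' + plan.slice(0, 60000) + '\n```',
            });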

# .github/workflows/apply.yml
name: Terraform Apply
on:
  push:
    branches: [main]
    paths:
      - 'infrastructure/**'

jobs:
  apply:
    strategy:
      max-parallel: 1
      matrix:
        env: [staging, production]
    runs-on: ubuntu-latest
    environment: ${{ matrix.env }}        # requires reviewer approval for production
    permissions:
      id-token: write
      contents: read
    defaults:
      run:
        working-directory: infrastructure/environments/${{ matrix.env }}

    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ vars.TF_APPLY_ROLE }}    # higher-privilege role
          aws-region: us-east-1
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.7.5      # same pinned version as the plan workflow
      - run: terraform init
      - run: terraform apply -auto-approve

Important properties:

OIDC, not access keys. The CI exchanges its OIDC token for a short-lived AWS role. No long-lived secrets.
Different role for plan vs. apply. Plan only needs read; apply needs write.
GitHub Environments with required reviewers. Production apply pauses for a human approval.
Plan output in the PR. Reviewers see the diff before approving the merge.
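
The OIDC and split-role properties above could be expressed in Terraform roughly as follows. This is a sketch under assumptions: the GitHub OIDC provider is already registered in the AWS account, and the repository name and role name are hypothetical. The AWS-managed `ReadOnlyAccess` policy is used here only as a convenient stand-in for a tighter custom policy.

```hcl
# Assumes the GitHub OIDC provider is already registered in the account.
data "aws_iam_openid_connect_provider" "github" {
  url = "https://token.actions.githubusercontent.com"
}

data "aws_iam_policy_document" "plan_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [data.aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    # Only pull_request runs of this one repository may assume the plan role.
    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:example-org/infrastructure:pull_request"] # hypothetical repo
    }
  }
}

resource "aws_iam_role" "plan" {
  name               = "terraform-plan" # hypothetical role name
  assume_role_policy = data.aws_iam_policy_document.plan_trust.json
}

resource "aws_iam_role_policy_attachment" "plan_read_only" {
  role       = aws_iam_role.plan.name
  policy_arn = "arn:aws:iam::aws:policy/ReadOnlyAccess"
}
```

An apply role would follow the same shape with a broader permissions policy and a `sub` condition such as `repo:example-org/infrastructure:environment:production`, since jobs that declare a GitHub Environment carry the environment name in that claim.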

Security

Principle	What it looks like in practice
No secrets in HCL or tfvars	Pull from AWS Secrets Manager / SSM / Vault via `data` sources
Mark sensitive outputs	`sensitive = true` on variables and outputs containing secrets (redacts CLI output; the values are still written to state)
Encrypt state at rest	S3 SSE, GCS CMEK, or Terraform Cloud (encrypted by default)
Restrict who can run apply	Separate IAM role; gate behind CI + approval
Least-privilege provider creds	Plan role: read-only actions (`Describe*`, `List*`, `Get*`), plus access to the state backend and its lock
Scan plans for risky changes	Tools like `tfsec`, `checkov`, `trivy config` in CI
Pin everything	Terraform version, provider versions, module versions
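
A few of the rows above, sketched in HCL. The secret path and the version constraints are illustrative assumptions, not values taken from this page's infrastructure.

```hcl
# Pin the Terraform and provider versions.
terraform {
  required_version = "~> 1.7.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.40" # illustrative constraint
    }
  }
}

# Read the secret at plan/apply time instead of committing it to tfvars.
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "example/staging/db-password" # hypothetical secret name
}

# Sensitive inputs are redacted in plan and apply output.
variable "api_token" {
  type      = string
  sensitive = true
}

# Outputs derived from secrets must be marked sensitive as well.
output "db_password" {
  value     = data.aws_secretsmanager_secret_version.db_password.secret_string
  sensitive = true
}
```

Because the secret value still ends up in the state file, encrypting state and restricting who can read it remain necessary even with `sensitive = true` everywhere.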

Sample Plan Security Scan

- name: Run tfsec
  uses: aquasecurity/tfsec-action@v1.0.3
  with:
    soft_fail: false                # block merge on findings

- name: Run checkov
  # Pin to a release tag or commit SHA in real pipelines; a moving branch is not pinned.
  uses: bridgecrewio/checkov-action@master
  with:
    directory: infrastructure/environments/${{ matrix.env }}
    framework: terraform

Policy as Code

For organizations, scan-then-block isn't enough — you want deny by default with explicit exceptions. Three popular choices:

Tool	Where it runs	Language
OPA / Conftest	Plan output (JSON)	Rego
Sentinel	Terraform Cloud / Enterprise	Sentinel DSL
Checkov / tfsec / KICS	HCL source	Built-in rule packs

Example Conftest policy that forbids public S3 buckets:

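# policy/s3.rego
package main

# Enables `contains`, `if`, and `in`, which are mandatory syntax from OPA 1.0.
import rego.v1

# The ACL may be set inline on the bucket or, with AWS provider v4+, on aws_s3_bucket_acl.
deny contains msg if {
  some resource in input.resource_changes
  resource.type in {"aws_s3_bucket", "aws_s3_bucket_acl"}
  resource.change.after.acl == "public-read"
  msg := sprintf("S3 bucket %s must not be public-read", [resource.address])
}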

Wired into CI:

terraform show -json tfplan > plan.json
conftest test plan.json --policy policy/
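
To make the policy concrete, here is a sketch of one bucket the rule would reject and one it would accept. Both bucket names are hypothetical. The inline `acl` argument is deprecated in recent AWS provider versions but still shows up in the plan JSON, which is what the policy inspects.

```hcl
# Rejected: the planned value of `acl` is "public-read".
resource "aws_s3_bucket" "public_assets" {
  bucket = "example-public-assets" # hypothetical bucket name
  acl    = "public-read"
}

# Accepted: no public ACL, and public access is blocked outright.
resource "aws_s3_bucket" "logs" {
  bucket = "example-logs" # hypothetical bucket name
}

resource "aws_s3_bucket_public_access_block" "logs" {
  bucket = aws_s3_bucket.logs.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```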

Testing

Three layers, increasing in fidelity and cost:

1. Static checks (every PR, free)

terraform fmt -check -recursive
terraform validate
tflint --recursive
tfsec .
checkov -d .

2. Module unit tests with the native test framework

Terraform 1.6+ has a built-in test framework. Tests live in .tftest.hcl files in the module's root or its tests/ directory, which is where terraform test looks by default:

# modules/database/tests/defaults.tftest.hcl
variables {
  name       = "test-db"
  vpc_id     = "vpc-12345678"
  subnet_ids = ["subnet-aaa", "subnet-bbb"]
}

run "defaults_apply_cleanly" {
  command = plan

  assert {
    condition     = aws_db_instance.this.instance_class == "db.t3.medium"
    error_message = "default instance class should be db.t3.medium"
  }

  assert {
    condition     = aws_db_instance.this.allocated_storage == 20
    error_message = "default storage should be 20 GB"
  }
}

terraform test
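
Plan-mode tests like the one above still initialize the real AWS provider, so they need credentials. One way around that, assuming Terraform 1.7 or later, is `mock_provider`. The sketch below also overrides a module input inside the run block; the `instance_class` variable is a hypothetical input of the database module.

```hcl
# modules/database/tests/mocked.tftest.hcl

# Replace the real AWS provider with a mock, so no credentials or API calls are needed.
mock_provider "aws" {}

variables {
  name       = "test-db"
  vpc_id     = "vpc-12345678"
  subnet_ids = ["subnet-aaa", "subnet-bbb"]
}

run "instance_class_override" {
  command = plan

  # Run-level variables take precedence over the file-level block above.
  variables {
    instance_class = "db.r6g.large" # hypothetical module variable
  }

  assert {
    condition     = aws_db_instance.this.instance_class == "db.r6g.large"
    error_message = "instance_class should be passed through to the DB instance"
  }
}
```

Mocked runs only check how the module wires its inputs together; they say nothing about whether AWS would accept the resulting configuration, which is what the end-to-end layer is for.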

3. End-to-end tests with Terratest

Spin up real infrastructure in an isolated AWS account, assert against it, tear it down. Best for module releases:

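// test/database_test.go
package test

import (
    "testing"

    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/stretchr/testify/assert"
)

func TestDatabase(t *testing.T) {
    opts := &terraform.Options{
        TerraformDir: "../examples/basic",
    }

    // Deferred first so the infrastructure is destroyed even if an assertion fails.
    defer terraform.Destroy(t, opts)
    terraform.InitAndApply(t, opts)

    endpoint := terraform.Output(t, opts, "endpoint")
    assert.NotEmpty(t, endpoint)
}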

Slow and costs real money — reserve for shared modules that downstream teams depend on.

Drift Detection

Even with strict CI, things drift: someone clicks in the console, an autoscaler resizes things, AWS auto-rotates credentials. Detect drift on a schedule:

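# .github/workflows/drift.yml
name: Terraform Drift
on:
  schedule:
    - cron: "0 6 * * *"           # daily at 06:00 UTC

jobs:
  drift:
    strategy:
      fail-fast: false
      matrix:
        env: [staging, production]
    runs-on: ubuntu-latest
    permissions:
      id-token: write             # for AWS OIDC
      contents: read
    defaults:
      run:
        working-directory: infrastructure/environments/${{ matrix.env }}
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ vars.TF_PLAN_ROLE }}     # read-only role
          aws-region: us-east-1
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.7.5
      - run: terraform init
      - id: plan
        # The exitcode output comes from the setup-terraform wrapper (enabled by default).
        run: terraform plan -detailed-exitcode -no-color
        continue-on-error: true       # exit code 2 means drift, not failure
      - if: steps.plan.outputs.exitcode == '1'
        run: exit 1                   # a genuine plan error should fail the job
      - if: steps.plan.outputs.exitcode == '2'
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            { "text": "🚨 Terraform drift detected in ${{ github.repository }} (${{ matrix.env }})" }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
          SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK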

plan -detailed-exitcode returns 0 (no changes), 1 (error), or 2 (changes) — perfect for scripting.
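
Some drift is expected, such as an autoscaler changing capacity, and would trigger this alert every day. One common way to quiet it is `lifecycle.ignore_changes`, sketched below with hypothetical resource names and variables.

```hcl
resource "aws_autoscaling_group" "workers" {
  name                = "example-workers"      # hypothetical name
  min_size            = 2
  max_size            = 10
  desired_capacity    = 2
  vpc_zone_identifier = var.private_subnet_ids # hypothetical variable

  launch_template {
    id      = var.launch_template_id # hypothetical variable
    version = "$Latest"
  }

  lifecycle {
    # Scaling policies own desired_capacity after creation, so changes to it
    # are expected and should not be reported as drift.
    ignore_changes = [desired_capacity]
  }
}
```

Keeping the ignore list as narrow as possible matters here: every ignored attribute is one that drift detection can no longer see.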

Operating Tips

A handful of habits that prevent self-inflicted incidents:

Small, frequent applies. Large diffs are hard to review and slow to roll back. Keep PRs to one logical change.
Read the plan before approving. Especially the -/+ (replace) and - (destroy) lines.
Never terraform destroy against production. Use prevent_destroy on critical resources, and prefer surgical terraform apply -destroy -target=... if you genuinely need to remove one thing.
Don't use -target to "work around" failures. It's an escape hatch for exceptional recovery, not a routine workflow. If you find yourself needing it, you have an underlying dependency or state problem to fix.
Tag everything. A default_tags block on the provider gets you 80% of the way — Project, Environment, ManagedBy = "terraform", Owner.
Adopt a naming convention early. <project>-<env>-<resource> is fine. Whatever you pick, automate it via locals.
Document the unobvious. A short comment on why a lifecycle.ignore_changes is there saves the next person an hour.
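
Several of these habits can be encoded directly in configuration. The sketch below combines provider-level default tags, a naming prefix built in locals, and `prevent_destroy` on a critical resource. The project, owner, and bucket names are hypothetical.

```hcl
locals {
  project     = "example" # hypothetical project name
  environment = "staging"

  # <project>-<env>-<resource> naming, built once and reused everywhere.
  name_prefix = "${local.project}-${local.environment}"
}

provider "aws" {
  region = "us-east-1"

  # Applied to every taggable resource this provider creates.
  default_tags {
    tags = {
      Project     = local.project
      Environment = local.environment
      ManagedBy   = "terraform"
      Owner       = "platform-team" # hypothetical owner
    }
  }
}

resource "aws_s3_bucket" "audit_logs" {
  bucket = "${local.name_prefix}-audit-logs"

  lifecycle {
    # Any plan that would destroy this bucket fails instead of proceeding.
    prevent_destroy = true
  }
}
```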

Checklist
