Best Practices
Production-ready Terraform - project layout, CI/CD pipelines, security, testing, drift detection, and policy-as-code
The patterns here are what separate a Terraform repo that scales from one that becomes a liability.
Project Layout
A common, scalable layout:
.
├── infrastructure/
│   ├── modules/                  # reusable building blocks
│   │   ├── vpc/
│   │   ├── eks-cluster/
│   │   ├── database/
│   │   └── application/
│   └── environments/
│       ├── staging/
│       │   ├── backend.tf        # key = staging/terraform.tfstate
│       │   ├── providers.tf
│       │   ├── main.tf           # composes modules
│       │   ├── variables.tf
│       │   └── terraform.tfvars  # staging values
│       └── production/
│           ├── backend.tf        # key = production/terraform.tfstate
│           └── ...
└── .github/workflows/            # must sit at the repository root
    ├── plan.yml                  # on PR
    └── apply.yml                 # on merge to main

Key properties:
- One state file per environment, so staging and production are independent blast radii.
- Modules are reusable units: referenced by relative path in-repo, or versioned externally via git tags or a registry.
- Environment dirs are thin: they call modules and pass values. No business logic.
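As a rough sketch of what a thin environment directory might contain, assuming an S3 backend and the in-repo `vpc` and `database` modules from the layout above. The bucket name, lock table, module inputs, and module outputs are hypothetical placeholders, not values from this guide.

```hcl
# environments/staging/backend.tf
terraform {
  backend "s3" {
    bucket         = "example-terraform-state" # hypothetical bucket name
    key            = "staging/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "example-terraform-locks" # hypothetical lock table
    encrypt        = true
  }
}

# environments/staging/main.tf
# The environment only wires modules together and passes values.
module "vpc" {
  source = "../../modules/vpc"

  name       = "staging"    # hypothetical inputs
  cidr_block = var.vpc_cidr # assumed to be declared in variables.tf
}

module "database" {
  source = "../../modules/database"

  name       = "staging-db"
  vpc_id     = module.vpc.vpc_id             # assumes the vpc module exports vpc_id
  subnet_ids = module.vpc.private_subnet_ids # assumes this output exists too
}
```

The production directory would look the same apart from the backend `key` and the values passed in, which is what keeps the two environments comparable.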
CI/CD: Plan on PR, Apply on Merge
The standard pipeline:
# .github/workflows/plan.yml
name: Terraform Plan
on:
  pull_request:
    paths:
      - 'infrastructure/**'
jobs:
  plan:
    strategy:
      matrix:
        env: [staging, production]
    runs-on: ubuntu-latest
    permissions:
      id-token: write        # for AWS OIDC
      pull-requests: write   # to comment the plan
      contents: read
    defaults:
      run:
        working-directory: infrastructure/environments/${{ matrix.env }}
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ vars.TF_PLAN_ROLE }}   # read-only role
          aws-region: us-east-1
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.7.5
      - run: terraform fmt -check -recursive
      - run: terraform init
      - run: terraform validate
      - run: terraform plan -no-color -out=tfplan
      - name: Comment plan on PR
        uses: actions/github-script@v7
        with:
          script: |
            // defaults.run.working-directory only applies to `run:` steps, so set cwd explicitly
            const cwd = 'infrastructure/environments/${{ matrix.env }}';
            const plan = require('child_process').execSync('terraform show -no-color tfplan', { cwd }).toString();
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: '```\n' + plan.slice(0, 60000) + '\n```',
            });

# .github/workflows/apply.yml
name: Terraform Apply
on:
  push:
    branches: [main]
    paths:
      - 'infrastructure/**'
jobs:
  apply:
    strategy:
      max-parallel: 1
      matrix:
        env: [staging, production]
    runs-on: ubuntu-latest
    environment: ${{ matrix.env }}   # requires reviewer approval for production
    permissions:
      id-token: write
      contents: read
    defaults:
      run:
        working-directory: infrastructure/environments/${{ matrix.env }}
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ vars.TF_APPLY_ROLE }}   # higher-privilege role
          aws-region: us-east-1
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.7.5   # same pin as plan.yml
      - run: terraform init
      - run: terraform apply -auto-approve

Important properties:
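- OIDC, not access keys. The CI job exchanges its OIDC token for short-lived AWS credentials by assuming a role. No long-lived secrets.
- Different role for plan vs. apply. Plan needs only read access to the resources it manages (plus access to the state backend and its lock); apply needs write.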
- GitHub Environments with required reviewers. Production apply pauses for a human approval.
- Plan output in the PR. Reviewers see the diff before approving the merge.
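The OIDC and split-role points can be sketched in Terraform itself. This is a minimal sketch, assuming the GitHub OIDC provider is already registered in the AWS account; the repository `example-org/example-repo` and the role name are hypothetical, and the AWS-managed `ReadOnlyAccess` policy stands in for whatever read-only policy you actually use.

```hcl
# Look up the existing GitHub Actions OIDC provider (assumed to exist already).
data "aws_iam_openid_connect_provider" "github" {
  url = "https://token.actions.githubusercontent.com"
}

data "aws_iam_policy_document" "plan_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [data.aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    # Only pull_request runs from this (hypothetical) repository may assume the role.
    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:example-org/example-repo:pull_request"]
    }
  }
}

resource "aws_iam_role" "terraform_plan" {
  name               = "terraform-plan" # hypothetical role name
  assume_role_policy = data.aws_iam_policy_document.plan_trust.json
}

# Read-only access to resources; state and lock-table access would be granted separately.
resource "aws_iam_role_policy_attachment" "plan_read_only" {
  role       = aws_iam_role.terraform_plan.name
  policy_arn = "arn:aws:iam::aws:policy/ReadOnlyAccess"
}
```

An apply role would follow the same shape with a broader policy and a `sub` condition scoped to the GitHub Environment (for example `repo:example-org/example-repo:environment:production`), so only approved production jobs can assume it.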
Security
| Principle | What it looks like in practice |
|---|---|
| No secrets in HCL or tfvars | Pull from AWS Secrets Manager / SSM / Vault via data sources |
| Mark sensitive outputs | sensitive = true on variables and outputs containing secrets |
| Encrypt state at rest | S3 SSE, GCS CMEK, or Terraform Cloud (encrypted by default) |
| Restrict who can run apply | Separate IAM role; gate behind CI + approval |
| Least-privilege provider creds | Plan role: read-only actions (`Describe*`, `List*`, `Get*`) for the services in use |
| Scan plans for risky changes | Tools like tfsec, checkov, trivy config in CI |
| Pin everything | Terraform version, provider versions, module versions |
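Several rows of the table translate directly into HCL. The following is a sketch under stated assumptions: the provider constraint and the secret name are hypothetical, and the Terraform version simply mirrors the 1.7.5 pin used in the plan workflow.

```hcl
terraform {
  required_version = "1.7.5" # pin Terraform itself

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.40" # hypothetical constraint; pin to what you have tested
    }
  }
}

# Read the secret at plan/apply time instead of committing it to HCL or tfvars.
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "staging/database/password" # hypothetical secret name
}

variable "db_username" {
  type      = string
  sensitive = true # redacted in plan and apply output
}

output "db_password" {
  value     = data.aws_secretsmanager_secret_version.db_password.secret_string
  sensitive = true # required, since the value derives from a sensitive attribute
}
```

Note that `sensitive = true` only redacts CLI output. The value is still written to state in plain text, which is why the encrypted-state row matters.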
Sample Plan Security Scan
- name: Run tfsec
  uses: aquasecurity/tfsec-action@v1.0.3
  with:
    soft_fail: false   # block merge on findings
- name: Run checkov
  uses: bridgecrewio/checkov-action@master   # prefer pinning to a release tag or commit SHA
  with:
    directory: infrastructure/environments/${{ matrix.env }}
    framework: terraform

Policy as Code
For organizations, scan-then-block isn't enough — you want deny by default with explicit exceptions. Three popular choices:
| Tool | Where it runs | Language |
|---|---|---|
| OPA / Conftest | Plan output (JSON) | Rego |
| Sentinel | Terraform Cloud / Enterprise | Sentinel DSL |
| Checkov / tfsec / KICS | HCL source | Built-in rule packs |
Example Conftest policy that forbids public S3 buckets:
# policy/s3.rego
package main

deny[msg] {
    resource := input.resource_changes[_]
    resource.type == "aws_s3_bucket"
    resource.change.after.acl == "public-read"
    msg := sprintf("S3 bucket %s must not be public-read", [resource.address])
}

Wired into CI:
terraform show -json tfplan > plan.json
conftest test plan.json --policy policy/

Testing
Three layers, increasing in fidelity and cost:
1. Static checks (every PR, free)
terraform fmt -check -recursive
terraform validate
tflint --recursive
tfsec .
checkov -d .

2. Module unit tests with the native test framework
Terraform 1.6+ has a built-in test framework. Tests live in .tftest.hcl files alongside the module:
# modules/database/tests/defaults.tftest.hcl
variables {
  name       = "test-db"
  vpc_id     = "vpc-12345678"
  subnet_ids = ["subnet-aaa", "subnet-bbb"]
}

run "defaults_apply_cleanly" {
  command = plan

  assert {
    condition     = aws_db_instance.this.instance_class == "db.t3.medium"
    error_message = "default instance class should be db.t3.medium"
  }

  assert {
    condition     = aws_db_instance.this.allocated_storage == 20
    error_message = "default storage should be 20 GB"
  }
}

terraform test

3. End-to-end tests with Terratest
Spin up real infrastructure in an isolated AWS account, assert against it, tear it down. Best for module releases:
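// test/database_test.go
package test

import (
	"testing"

	"github.com/gruntwork-io/terratest/modules/terraform"
	"github.com/stretchr/testify/assert"
)

func TestDatabase(t *testing.T) {
	opts := &terraform.Options{
		TerraformDir: "../examples/basic",
	}
	// Deferred first so the infrastructure is destroyed even if an assertion fails.
	defer terraform.Destroy(t, opts)
	terraform.InitAndApply(t, opts)

	endpoint := terraform.Output(t, opts, "endpoint")
	assert.NotEmpty(t, endpoint)
}

Slow and costs real money, so reserve it for shared modules that downstream teams depend on.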
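Between the plan-based test in layer 2 and a full Terratest run, Terraform 1.7 and later can also mock providers, so a module test needs no cloud credentials. A minimal sketch, assuming the same `database` module inputs and resource address as the earlier test file; the file name and run label are hypothetical.

```hcl
# modules/database/tests/mocked.tftest.hcl (hypothetical file name)

# Replace the real AWS provider with a mock: no credentials, no API calls.
mock_provider "aws" {}

variables {
  name       = "test-db"
  vpc_id     = "vpc-12345678"
  subnet_ids = ["subnet-aaa", "subnet-bbb"]
}

run "defaults_with_mocked_provider" {
  command = plan

  # instance_class comes from configuration, so it is known at plan time.
  assert {
    condition     = aws_db_instance.this.instance_class == "db.t3.medium"
    error_message = "default instance class should be db.t3.medium"
  }
}
```

Mocked runs only check the module's own logic. They say nothing about whether AWS would accept the configuration, which is what the Terratest layer is for.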
Drift Detection
Even with strict CI, things drift: someone clicks in the console, an autoscaler resizes things, AWS auto-rotates credentials. Detect drift on a schedule:
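# .github/workflows/drift.yml
on:
  schedule:
    - cron: "0 6 * * *"   # daily at 06:00 UTC
jobs:
  drift:
    strategy:
      matrix:
        env: [staging, production]
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read
    defaults:
      run:
        working-directory: infrastructure/environments/${{ matrix.env }}
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ vars.TF_PLAN_ROLE }}   # the read-only plan role is enough
          aws-region: us-east-1
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.7.5
      - run: terraform init
      - id: plan
        run: |
          terraform plan -detailed-exitcode -no-color -out=tfplan
        continue-on-error: true
      # the exitcode output is provided by the setup-terraform wrapper
      - if: steps.plan.outputs.exitcode == '2'
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            { "text": "🚨 Terraform drift detected in ${{ github.repository }} (${{ matrix.env }})" }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}

`plan -detailed-exitcode` returns 0 (no changes), 1 (error), or 2 (changes), which makes it easy to script. Because of `continue-on-error`, an exit code of 1 (a failed plan) passes silently here, so alert on it separately if that matters to you.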
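Some drift is expected, such as the autoscaler case mentioned above, and it will trip the exit code 2 alert every day unless Terraform is told to ignore it. One possible approach is `lifecycle.ignore_changes`. This is a sketch only; the resource name, sizes, and variables are hypothetical.

```hcl
variable "private_subnet_ids" {
  type = list(string)
}

variable "launch_template_id" {
  type = string
}

resource "aws_autoscaling_group" "workers" {
  name                = "example-workers" # hypothetical name
  min_size            = 2
  max_size            = 10
  desired_capacity    = 2
  vpc_zone_identifier = var.private_subnet_ids

  launch_template {
    id      = var.launch_template_id
    version = "$Latest"
  }

  lifecycle {
    # Scaling policies change desired_capacity at runtime. Ignoring it keeps
    # that expected change from showing up as drift in the scheduled plan.
    ignore_changes = [desired_capacity]
  }
}
```

Keep the ignore list as narrow as possible: every attribute listed there is also one where real, unwanted drift will no longer be reported.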
Operating Tips
A handful of habits that prevent self-inflicted incidents:
- Small, frequent applies. Large diffs are hard to review and slow to roll back. Keep PRs to one logical change.
- Read the plan before approving. Especially the `-/+` (replace) and `-` (destroy) lines.
- Never `terraform destroy` against production. Use `prevent_destroy` on critical resources, and prefer surgical `terraform apply -destroy -target=...` if you genuinely need to remove one thing.
- Don't use `-target` to "work around" failures. It's a state surgery tool. If you need it, you have a state problem to fix.
- Tag everything. A `default_tags` block on the provider gets you 80% of the way: Project, Environment, ManagedBy = "terraform", Owner.
- Adopt a naming convention early. `<project>-<env>-<resource>` is fine. Whatever you pick, automate it via locals.
- Document the unobvious. A short comment on why a `lifecycle.ignore_changes` is there saves the next person an hour.
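The tagging, naming, and `prevent_destroy` habits fit in a few lines of HCL. A minimal sketch, with the project name, owner, and bucket as hypothetical placeholders:

```hcl
locals {
  project     = "example" # hypothetical values
  environment = "staging"

  # <project>-<env> prefix, reused by every resource name below.
  name_prefix = "${local.project}-${local.environment}"
}

provider "aws" {
  region = "us-east-1"

  # Applied to every taggable resource created through this provider.
  default_tags {
    tags = {
      Project     = local.project
      Environment = local.environment
      ManagedBy   = "terraform"
      Owner       = "platform-team" # hypothetical owner
    }
  }
}

resource "aws_s3_bucket" "audit_logs" {
  bucket = "${local.name_prefix}-audit-logs"

  lifecycle {
    # Any plan that would delete this bucket fails instead of proceeding.
    prevent_destroy = true
  }
}
```

Because the name prefix lives in one local, renaming the convention later is a one-line change, although renaming existing resources may still force replacements.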
Checklist
Pre-production Terraform checklist
- Remote backend with locking (S3 + DynamoDB, GCS, or TFC)
- State file encrypted at rest, versioning enabled
- Terraform and provider versions pinned
- Separate state per environment
- CI runs `fmt`, `validate`, `plan`, security scans on every PR
- Plan output posted to the PR
- Apply gated behind code review + environment approval for production
- OIDC-based credentials in CI (no long-lived keys)
- Critical resources marked `prevent_destroy`
- `default_tags` on the provider
- Daily drift detection
- Module tests for anything shared across teams