Best Practices
Production CI/CD - speed, caching, security, pipeline design, deployment strategies, and cross-platform principles
These principles apply whether you're on GitHub Actions, GitLab CI, CircleCI, Jenkins, or anything else. The platform syntax differs; the goals don't.
Speed Is a Feature
Slow CI strangles teams. Aim for PR feedback in under 10 minutes. Above that, devs context-switch, batch up changes, stop running CI locally, and ship slower.
Profile First
Don't guess where time goes. Most platforms surface per-step timing — find the slowest steps and attack those. Common offenders:
| Slowdown | Fix |
|---|---|
| Dependency install from scratch every run | Caching (npm, pip, Go modules, Cargo, ...) |
| Single-process test runs | Parallelize across machines or test sharding |
| Re-builds of everything on every commit | Build caching (Docker layer cache, build tool cache) |
| Sequential stages where DAG would do | Explicit job dependencies with needs: (supported by both GitLab CI and GitHub Actions) |
| Hitting a slow registry / package mirror | Pull-through cache; geographically nearer mirror |
| Full E2E suite on every PR | Gate E2E behind PR labels / nightly; smaller smoke tests on PRs |
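As a rough sketch of the "profile first" step, the snippet below ranks steps by their share of total pipeline time. The step names and durations are made-up sample data, and slowest_steps is a hypothetical helper; in practice the timings would come from whatever per-step report your platform exposes.

```python
def slowest_steps(timings: dict[str, float], top: int = 3) -> list[tuple[str, float]]:
    """Return the `top` steps by duration, each with its share of total time."""
    total = sum(timings.values())
    if total == 0:
        return []
    ranked = sorted(timings.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, seconds / total) for name, seconds in ranked[:top]]


# Made-up durations in seconds for one pipeline run.
sample = {"install": 240, "build": 120, "test": 200, "lint": 40}

for name, share in slowest_steps(sample, top=2):
    print(f"{name}: {share:.0%} of pipeline time")
```

Here dependency install and tests account for most of the run, so caching and sharding would be the first things to try.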
Cache Smart
Cache deterministic things (dependencies pinned by lockfile, base images, build tool caches). Don't cache side effects (test output, dynamic config). The cache key should change exactly when the cacheable content should change:
# Good: lockfile change invalidates
key: npm-${{ hashFiles('package-lock.json') }}
# Bad: branch-based; stale caches forever
key: npm-${{ github.ref_name }}
Both Actions and GitLab support size-bound caches with LRU eviction. Don't cache 5GB of node_modules per branch; set sensible paths.
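To illustrate why a content-derived key behaves well, here is a minimal Python sketch of the idea: the key is a hash of the lockfile bytes, so it changes exactly when the lockfile does. The cache_key helper and the sample lockfile contents are illustrative assumptions; the platform's own hashFiles function does its own hashing.

```python
import hashlib


def cache_key(prefix: str, lockfile_bytes: bytes) -> str:
    """Derive a cache key from lockfile content (hypothetical helper)."""
    digest = hashlib.sha256(lockfile_bytes).hexdigest()
    return f"{prefix}-{digest[:16]}"


# Made-up lockfile contents standing in for package-lock.json.
lock_v1 = b'{"lodash": "4.17.20"}'
lock_v2 = b'{"lodash": "4.17.21"}'

# Same content on any branch gives the same key, so the cache is reused.
assert cache_key("npm", lock_v1) == cache_key("npm", lock_v1)
# A dependency bump changes the key, so the stale cache is not restored.
assert cache_key("npm", lock_v1) != cache_key("npm", lock_v2)
```

A branch-based key has neither property: it stays the same when dependencies change and differs between branches that share identical dependencies.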
Parallelize Wisely
Two axes:
- Job-level: independent jobs run in parallel by default. Use needs: to express only the real dependencies.
- Test-level: split your test suite across N runners. Jest has --shard and pytest has pytest-split; Go's -parallel flag parallelizes within a single go test process rather than across runners.
Test sharding pays off above ~5 min of test runtime. Below that, splitting just adds overhead.
Security
No Long-Lived Cloud Credentials
The number-one win of the last few years: OIDC.
# GitHub Actions
permissions:
  id-token: write
  contents: read
- uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: arn:aws:iam::123:role/gha-deploy
# GitLab CI
deploy:
  id_tokens:
    AWS_TOKEN: { aud: sts.amazonaws.com }
  script:
    - aws sts assume-role-with-web-identity ...
The CI platform mints a short-lived JWT; AWS/GCP/Azure trusts it under specific conditions (this repo, this branch, this environment). Stop storing AWS_ACCESS_KEY_ID in CI secrets.
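The "specific conditions" amount to comparing claims in the token against a trust policy. The sketch below shows that comparison for GitHub-issued tokens, assuming the claims have already been signature-checked. The is_trusted helper, the myorg/api repository, and the policy are hypothetical; the cloud provider performs the real verification, including signature and expiry.

```python
EXPECTED_ISSUER = "https://token.actions.githubusercontent.com"


def is_trusted(claims: dict, *, audience: str, allowed_subjects: set[str]) -> bool:
    """Check already-verified JWT claims against a trust policy.

    Signature and expiry checks are deliberately out of scope here; the
    cloud provider does those before evaluating conditions like these.
    """
    return (
        claims.get("iss") == EXPECTED_ISSUER
        and claims.get("aud") == audience
        and claims.get("sub") in allowed_subjects
    )


# Hypothetical policy: only the main branch of myorg/api may assume the role.
policy = {"repo:myorg/api:ref:refs/heads/main"}

main_run = {
    "iss": EXPECTED_ISSUER,
    "aud": "sts.amazonaws.com",
    "sub": "repo:myorg/api:ref:refs/heads/main",
}
feature_run = {**main_run, "sub": "repo:myorg/api:ref:refs/heads/feature-x"}

assert is_trusted(main_run, audience="sts.amazonaws.com", allowed_subjects=policy)
assert not is_trusted(feature_run, audience="sts.amazonaws.com", allowed_subjects=policy)
```

Keeping the allowed subjects narrow (one repo, one branch or environment) is what makes the short-lived token safer than a stored access key.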
Limit Secret Access
GitHub Actions and GitLab CI both separate:
- Repository-level secrets: visible to any workflow / pipeline in the repo.
- Environment-level secrets: visible only when the job declares that environment.
- Protected branch / tag restrictions: secrets load only on protected refs, and by default not on a PR from a fork.
A common pattern: production secrets are environment-scoped to production, and the environment requires reviewer approval. A malicious or accidental change can't reach production secrets without a human gate.
Watch What Untrusted Code Can Do
Pull requests from forks can execute code in your CI. Both platforms default to not exposing secrets to fork PRs, but the runner itself executes attacker-controlled code. Mitigations:
- Run untrusted PR builds on separate, sandboxed runners (or hosted, never self-hosted).
- Treat the PR's pipeline as "lint + build + test only" — no deploy.
- Manually trigger trusted deploy pipelines after review (or after the PR merges).
Pin Third-Party Actions / Templates
# BAD: latest, could change tomorrow
- uses: some-marketplace/awesome-action@main
# OK: tag (mutable)
- uses: some-marketplace/awesome-action@v3
# BEST: commit SHA (immutable)
- uses: some-marketplace/awesome-action@a1b2c3d4e5f6...
A compromised action runs in your pipeline with your secrets. Pin to a SHA for security-sensitive actions (cloud-credential setup, deploy steps, anything with id-token: write).
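Pinning is easy to check mechanically. The following sketch scans workflow text for uses: references that are not a full 40-character commit SHA. The unpinned_actions helper and the sample SHA are illustrative assumptions, and the regex is a simplification that ignores local (./path) and docker:// references.

```python
import re

# A full commit SHA is 40 hex characters; tags and branches are not.
USES_RE = re.compile(r"^\s*-?\s*uses:\s*(?P<action>[^@\s]+)@(?P<ref>\S+)", re.MULTILINE)
SHA_RE = re.compile(r"^[0-9a-f]{40}$")


def unpinned_actions(workflow_text: str) -> list[str]:
    """Return 'action@ref' for every uses: line not pinned to a full SHA."""
    return [
        f"{m['action']}@{m['ref']}"
        for m in USES_RE.finditer(workflow_text)
        if not SHA_RE.match(m["ref"])
    ]


# Sample workflow text; the 40-character SHA is a made-up placeholder.
workflow = """
steps:
  - uses: some-marketplace/awesome-action@main
  - uses: some-marketplace/awesome-action@v3
  - uses: some-marketplace/awesome-action@0123456789abcdef0123456789abcdef01234567
"""

assert unpinned_actions(workflow) == [
    "some-marketplace/awesome-action@main",
    "some-marketplace/awesome-action@v3",
]
```

A check like this could run as a lint step on PRs so that a floating reference never reaches a job that holds credentials.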
Audit the Logs
Both platforms log every workflow execution. Ship those logs off-platform to your SIEM / ELK. Watch for:
- Workflows triggered by forks attempting access to protected resources.
- Sudden spikes in secret access.
- Self-hosted runners showing up that aren't yours.
Pipeline Design
One Immutable Artifact
The cardinal rule. Build once in CI, deploy that exact artifact (with digest!) to every environment.
# CI
- build image → push as ghcr.io/myorg/api@sha256:abc...
# Deploy staging
- kubectl set image deployment/api api=ghcr.io/myorg/api@sha256:abc...
# Deploy production
- kubectl set image deployment/api api=ghcr.io/myorg/api@sha256:abc...
If staging and production deploy different builds, you're not testing what you ship. Use digests, not tags, to pin.
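One way to enforce the rule is a small guard that refuses to proceed unless every environment references the same digest-pinned image. This is a sketch under assumptions: is_digest_pinned and assert_same_artifact are hypothetical helpers, and the digest is a made-up value.

```python
import re

# A digest reference ends in @sha256: followed by 64 hex characters.
DIGEST_RE = re.compile(r"^[^@\s]+@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image_ref: str) -> bool:
    """True if the reference pins an image by sha256 digest, not by tag."""
    return DIGEST_RE.match(image_ref) is not None


def assert_same_artifact(deploys: dict[str, str]) -> None:
    """Raise if environments reference tag-based or differing images."""
    for env, ref in deploys.items():
        if not is_digest_pinned(ref):
            raise ValueError(f"{env}: {ref!r} is not pinned by digest")
    if len(set(deploys.values())) > 1:
        raise ValueError(f"environments deploy different artifacts: {deploys}")


# Made-up digest for illustration.
image = "ghcr.io/myorg/api@sha256:" + "ab" * 32

assert_same_artifact({"staging": image, "production": image})  # passes silently
assert not is_digest_pinned("ghcr.io/myorg/api:latest")
```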
Promote Through Environments
PR → build, lint, test
main → build, deploy to staging, run smoke tests
release → promote (re-tag) staging image, deploy to production
Promotions are fast because they don't rebuild; they just re-tag and re-deploy. The same @sha256:abc... that ran in staging for 48 hours now runs in production.
Use Environment Approvals as the Production Gate
GitHub Environments and GitLab CI Environments both support manual approval gates. Require them for production:
- Reviewer must approve before the job runs.
- The approval is logged with the workflow run.
- A deploy can't sneak through outside of working hours by accident.
Roll Forward, Not Back, by Default
When a deploy is bad:
- Preferred: revert the commit in git → CI ships the previous code through the normal pipeline. Same controls, same audit trail.
- Emergency: kubectl rollout undo or equivalent. Fast, but it skips your CI gates. Document each instance.
Don't develop a culture of cluster-side hotfixes; they accumulate and become tech debt.
Avoid Pipeline-as-Logic
CI YAML is a config language, not a program. When you find yourself doing:
- Loops in YAML
- Conditional if chains 10 deep
- Shell scripts spread across 5 inline run: blocks
… extract that logic into a real script (./scripts/deploy.sh, ./scripts/release.py) that humans can run locally. The CI pipeline just calls it. You can reproduce CI failures on your laptop.
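A minimal sketch of what such an extracted script might look like: all decisions live in ordinary functions, and the pipeline only passes arguments. The deployment name api, the positional arguments, and the --execute flag are illustrative assumptions, not a prescribed interface. By default the script only prints the command it would run.

```python
import argparse
import shlex
import subprocess


def deploy_command(image: str, deployment: str = "api") -> list[str]:
    """Build the kubectl invocation; 'api' is a placeholder deployment name."""
    return ["kubectl", "set", "image", f"deployment/{deployment}", f"{deployment}={image}"]


def main(argv: list[str]) -> int:
    parser = argparse.ArgumentParser(description="Deploy a digest-pinned image.")
    parser.add_argument("environment", choices=["staging", "production"])
    parser.add_argument("image", help="image reference, e.g. repo@sha256:...")
    parser.add_argument("--execute", action="store_true", help="run instead of printing")
    args = parser.parse_args(argv)

    cmd = deploy_command(args.image)
    if not args.execute:
        # Dry run: show exactly what CI would do, without touching a cluster.
        print(f"[dry-run] {args.environment}: {shlex.join(cmd)}")
        return 0
    return subprocess.run(cmd, check=False).returncode


# Same call a CI step would make, runnable on a laptop.
main(["staging", "ghcr.io/myorg/api@sha256:abc..."])
```

A real script would end with the usual `if __name__ == "__main__":` guard calling `main(sys.argv[1:])`; it is left out here so the example runs as written.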
Deployment Strategies
A short tour of the patterns CI/CD enables:
| Strategy | How |
|---|---|
| Rolling update | Replace N old pods with new, gradually (Kubernetes Deployment default) |
| Blue/Green | Stand up new version completely; switch traffic atomically |
| Canary | New version takes 1%/5%/25% of traffic; monitor; expand |
| Feature flags | Deploy code but keep features off; flip per user / org |
Feature flags decouple deployment from release. A risky feature can be deployed dark, enabled for one internal user, then ramped up — even though it's the same artifact in production all along.
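The ramp-up can be sketched as deterministic bucketing: hash the flag and user into a bucket from 0 to 99 and enable the feature when the bucket falls below the current rollout percentage. The is_enabled helper and the flag and user names are hypothetical; hosted flag services implement their own variants of this idea.

```python
import hashlib


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministically bucket a user into [0, 100) and compare to the ramp."""
    key = f"{flag}:{user_id}".encode("utf-8")
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % 100
    return bucket < rollout_percent


users = [f"user-{i}" for i in range(1000)]

# 0% is fully dark, 100% is fully on; the deployed artifact never changes.
assert not any(is_enabled("new-checkout", u, 0) for u in users)
assert all(is_enabled("new-checkout", u, 100) for u in users)

# Ramping up only adds users: anyone enabled at 5% stays enabled at 25%.
early = {u for u in users if is_enabled("new-checkout", u, 5)}
later = {u for u in users if is_enabled("new-checkout", u, 25)}
assert early <= later
```

Because the bucket depends only on the flag and user, a given user sees a stable experience as the percentage grows.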
Observability of the Pipeline
Track CI/CD itself, not just what it ships:
| Metric | Why |
|---|---|
| Pipeline success rate | Trending down = something's flaky; investigate |
| P50 / P95 pipeline duration | The developer feedback loop |
| Deploys per day | DORA metric — high-performing teams deploy frequently |
| Lead time for changes | PR open → deployed to prod |
| Mean time to recovery (MTTR) | Deploy broke → fixed |
| Change failure rate | % of deploys that needed rollback |
The last four rows are the DORA metrics, which correlate strongly with team performance. Instrument them.
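As a starting point for instrumentation, the sketch below computes the four DORA metrics from a list of deploy records. The Deploy record shape, the dora_metrics helper, and the sample history are assumptions for illustration; lead time is measured from PR open to production, matching the table above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deploy:
    opened_at: datetime                      # PR opened
    deployed_at: datetime                    # reached production
    failed: bool = False                     # needed rollback or other remediation
    restored_at: Optional[datetime] = None   # service restored, if it failed


def dora_metrics(deploys: list[Deploy], window_days: int) -> dict[str, float]:
    """Compute the four DORA metrics over a window of deploy records."""
    if not deploys:
        raise ValueError("need at least one deploy")
    failures = [d for d in deploys if d.failed]
    lead_hours = [(d.deployed_at - d.opened_at).total_seconds() / 3600 for d in deploys]
    recovery_hours = [
        (d.restored_at - d.deployed_at).total_seconds() / 3600
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time_hours": median(lead_hours),
        "change_failure_rate": len(failures) / len(deploys),
        "mttr_hours": sum(recovery_hours) / len(recovery_hours) if recovery_hours else 0.0,
    }


# Invented four-day history: one deploy per day, one of which failed.
t0, hour, day = datetime(2024, 1, 1, 9, 0), timedelta(hours=1), timedelta(days=1)
history = [
    Deploy(t0, t0 + 2 * hour),
    Deploy(t0 + day, t0 + day + 4 * hour, failed=True, restored_at=t0 + day + 5 * hour),
    Deploy(t0 + 2 * day, t0 + 2 * day + 6 * hour),
    Deploy(t0 + 3 * day, t0 + 3 * day + 4 * hour),
]
print(dora_metrics(history, window_days=4))
```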
Pipeline Hygiene
A handful of habits:
- Pin tool versions. Node 20, Python 3.12, Terraform 1.7.5: exact, not latest.
- Reproduce locally. Whatever CI does, you should be able to run on your laptop.
- Treat flaky tests as bugs. Each flake erodes trust. Quarantine, then fix, then re-enable.
- Pipelines fail loudly. set -euo pipefail in shell, fail-fast in matrix jobs.
- Don't disable failing tests. If you must, file a ticket and put a date on the skip.
- Run security scans in CI. SAST, dependency scan, container image scan — all platforms have built-in or integrated options.
- One-button rollback. Make it easier than rolling forward when needed.
- Document the deployment process in the repo. A new engineer should follow it without Slack help.
Checklist
Production-ready CI/CD checklist
- Pipeline definitions in version control with the code they build
- PR feedback in < 10 min (lint + unit tests at minimum)
- Caching for dependencies and build outputs
- DAG (needs:) used to avoid unnecessary stage barriers
- OIDC for cloud credentials; no long-lived AWS_ACCESS_KEY_ID etc.
- Secrets scoped to environment, not repo-wide
- Third-party actions / templates pinned by SHA
- Build once, deploy the same digest to every environment
- Production deploys gated behind manual approval (environment)
- Smoke tests after deploy; automated rollback on failure
- Pipeline logs shipped to a SIEM
- DORA metrics tracked (deploy frequency, lead time, MTTR, failure rate)
- Tool versions and base images pinned, not floating
- Self-hosted runners (if any) ephemeral and sandboxed
- Deployment / rollback runbook in the repo