Steven's Knowledge

Best Practices

Production CI/CD - speed, caching, security, pipeline design, deployment strategies, and the cross-platform principles

These principles apply whether you're on GitHub Actions, GitLab CI, CircleCI, Jenkins, or anything else. The platform syntax differs; the goals don't.

Speed Is a Feature

Slow CI strangles teams. Aim for PR feedback in under 10 minutes. Beyond that, developers context-switch while they wait, batch up changes into bigger PRs, stop running the checks locally, and ship slower.

Profile First

Don't guess where time goes. Most platforms surface per-step timing — find the slowest steps and attack those. Common offenders:

Slowdown | Fix
Dependency install from scratch every run | Caching (npm, pip, Go modules, Cargo, ...)
Single-process test runs | Parallelize across machines or test sharding
Re-builds of everything on every commit | Build caching (Docker layer cache, build tool cache)
Sequential stages where DAG would do | needs: (GitLab) or needs: (GHA) explicit dependencies
Hitting a slow registry / package mirror | Pull-through cache; geographically nearer mirror
Full E2E suite on every PR | Gate E2E behind PR labels / nightly; smaller smoke tests on PRs
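
Once per-step timings are exported, ranking them takes a few lines. The following is a minimal sketch, assuming the timings are already available as a mapping of step name to seconds; the step names and numbers are made up for illustration.

```python
def slowest_steps(timings, top=3):
    """Return the `top` slowest steps as (name, seconds, percent of total)."""
    total = sum(timings.values())
    if total == 0:
        return []
    ranked = sorted(timings.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, secs, round(100 * secs / total, 1)) for name, secs in ranked[:top]]


# Hypothetical per-step timings in seconds, not taken from a real pipeline.
example = {"checkout": 12, "install": 240, "build": 180, "unit-tests": 420, "lint": 48}

for name, secs, pct in slowest_steps(example):
    print(f"{name}: {secs}s ({pct}% of pipeline)")
```

Sorting by share of total runtime makes it obvious which one or two steps are worth optimizing first.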

Cache Smart

Cache deterministic things (dependencies pinned by lockfile, base images, build tool caches). Don't cache side effects (test output, dynamic config). The cache key should change exactly when the cacheable content should change:

# Good: lockfile change invalidates
key: npm-${{ hashFiles('package-lock.json') }}

# Bad: branch-based; stale caches forever
key: npm-${{ github.ref_name }}

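Cache storage is bounded. GitHub Actions enforces a per-repository size limit and evicts the least recently used entries once it is exceeded; on GitLab, retention depends on how the runner's cache storage is configured. Don't cache 5GB of node_modules per branch; set sensible paths.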
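
The idea behind a content-derived key can be sketched outside any CI platform. This is an illustrative Python version, not the platform's implementation: `cache_key` is a hypothetical helper and the lockfile contents are invented.

```python
import hashlib


def cache_key(prefix, lockfile_bytes):
    """Derive a cache key that changes only when the lockfile content changes."""
    digest = hashlib.sha256(lockfile_bytes).hexdigest()
    return f"{prefix}-{digest[:16]}"


# Two invented lockfiles that differ by a single dependency version.
lock_v1 = b'{"lockfileVersion": 3, "packages": {"left-pad": "1.3.0"}}'
lock_v2 = b'{"lockfileVersion": 3, "packages": {"left-pad": "1.3.1"}}'

same_content = cache_key("npm", lock_v1) == cache_key("npm", lock_v1)
changed_content = cache_key("npm", lock_v1) == cache_key("npm", lock_v2)
print(same_content, changed_content)
```

Identical lockfiles always map to the same key, and any change to a pinned version produces a new one, which is exactly the invalidation behavior a branch-based key lacks.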

Parallelize Wisely

Two axes:

  • Job-level: independent jobs run in parallel by default. Use needs: to express only the real dependencies.
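  • Test-level: split your test suite across N runners. Jest has --shard and pytest has the pytest-split plugin. Go's -parallel flag only controls concurrency within a single go test run, so sharding Go tests across runners means dividing the package list yourself.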

Test sharding pays off above ~5 min of test runtime. Below that, splitting just adds overhead.
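
Test-level sharding boils down to every runner computing the same partition independently. Here is a minimal sketch, assuming each runner knows its own index and the total shard count; the function names and test paths are hypothetical. It assigns by hash rather than by recorded duration, so shards are not guaranteed to be balanced.

```python
import zlib


def shard_for(test_path, total_shards):
    """Map a test file to a shard index in [0, total_shards) deterministically."""
    # crc32 is stable across processes and machines, unlike Python's built-in hash().
    return zlib.crc32(test_path.encode("utf-8")) % total_shards


def select_tests(all_tests, shard_index, total_shards):
    """Return the subset of tests that the runner with `shard_index` should execute."""
    return [t for t in sorted(all_tests) if shard_for(t, total_shards) == shard_index]


# Hypothetical test files.
tests = [f"tests/test_module_{i}.py" for i in range(20)]
for index in range(4):
    print(index, select_tests(tests, index, 4))
```

Because the assignment depends only on the file path, no coordination between runners is needed, and every test lands in exactly one shard.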

Security

No Long-Lived Cloud Credentials

The number-one win of the last few years: OIDC.

# GitHub Actions
permissions:
  id-token: write   # lets the job request an OIDC token
  contents: read

steps:
  - uses: aws-actions/configure-aws-credentials@v4
    with:
      role-to-assume: arn:aws:iam::123:role/gha-deploy
      aws-region: us-east-1   # example region; the action requires one

# GitLab CI
deploy:
  id_tokens:
    AWS_TOKEN: { aud: sts.amazonaws.com }
  script:
    - aws sts assume-role-with-web-identity ...

The CI platform mints a short-lived JWT; AWS/GCP/Azure trusts it under specific conditions (this repo, this branch, this environment). Stop storing AWS_ACCESS_KEY_ID in CI secrets.
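
The "specific conditions" are just comparisons against claims in the token. In practice the cloud provider evaluates them for you from the role's trust policy; the sketch below only illustrates the logic. It assumes the token's signature has already been verified, and it assumes a subject of the form repo:<owner>/<repo>:ref:<ref>, which is the shape GitHub uses for branch-triggered runs. `is_trusted` and the repository name are hypothetical.

```python
def is_trusted(claims, allowed_repo, allowed_ref, expected_audience):
    """Check decoded token claims against the trust conditions.

    Signature verification is deliberately out of scope for this sketch.
    """
    expected_subject = f"repo:{allowed_repo}:ref:{allowed_ref}"
    return (
        claims.get("aud") == expected_audience
        and claims.get("sub") == expected_subject
    )


# Hypothetical decoded claims for a run on the main branch.
claims = {"aud": "sts.amazonaws.com", "sub": "repo:myorg/api:ref:refs/heads/main"}

print(is_trusted(claims, "myorg/api", "refs/heads/main", "sts.amazonaws.com"))
print(is_trusted(claims, "myorg/api", "refs/heads/release", "sts.amazonaws.com"))
```

A token minted for a different repository or branch fails the comparison, so there is no standing credential to steal.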

Limit Secret Access

GitHub Actions and GitLab CI both separate:

  • Repository-level secrets: visible to any workflow / pipeline in the repo.
  • Environment-level secrets: visible only when the job declares that environment.
  • Protected branch / tag restrictions: protected secrets load only on protected refs, and by default secrets are not exposed to a PR from a fork.

A common pattern: production secrets are environment-scoped to production, and the environment requires reviewer approval. A malicious or accidental change can't reach production secrets without a human gate.

Watch What Untrusted Code Can Do

Pull requests from forks can execute code in your CI. Both platforms default to not exposing secrets to fork PRs, but the runner itself executes attacker-controlled code. Mitigations:

  • Run untrusted PR builds on separate, sandboxed runners (or hosted, never self-hosted).
  • Treat the PR's pipeline as "lint + build + test only" — no deploy.
  • Manually trigger trusted deploy pipelines after review (or after the PR merges).

Pin Third-Party Actions / Templates

# BAD: latest, could change tomorrow
- uses: some-marketplace/awesome-action@main

# OK: tag (mutable)
- uses: some-marketplace/awesome-action@v3

# BEST: commit SHA (immutable)
- uses: some-marketplace/awesome-action@a1b2c3d4e5f6...

A compromised action runs in your pipeline with your secrets. Pin to a SHA for security-sensitive actions (cloud-credential setup, deploy steps, anything with id-token: write).
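
Pinning is easy to enforce mechanically. The following is a rough sketch of a check that could run as a lint step, assuming workflow files are scanned line by line for `uses:` references. `unpinned_actions` is a hypothetical helper, the regular expressions are a simplification rather than a YAML parser, and the 40-character SHA in the sample is made up.

```python
import re

USES_RE = re.compile(r"^\s*-?\s*uses:\s*(\S+)")
FULL_SHA_RE = re.compile(r"^[0-9a-f]{40}$")


def unpinned_actions(workflow_text):
    """Return `uses:` references that are not pinned to a full 40-character commit SHA."""
    unpinned = []
    for line in workflow_text.splitlines():
        match = USES_RE.match(line)
        if not match:
            continue
        ref = match.group(1)
        if ref.startswith("./") or ref.startswith("docker://"):
            continue  # local actions and container references are out of scope here
        _, _, version = ref.partition("@")
        if not FULL_SHA_RE.match(version):
            unpinned.append(ref)
    return unpinned


# Sample workflow fragment; the SHA is invented for illustration.
workflow = """
steps:
  - uses: actions/checkout@v4
  - uses: some-marketplace/awesome-action@main
  - uses: aws-actions/configure-aws-credentials@0123456789abcdef0123456789abcdef01234567
"""

for ref in unpinned_actions(workflow):
    print("not pinned to a SHA:", ref)
```

Failing the build when this list is non-empty turns the pinning policy into something reviewers no longer have to remember.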

Audit the Logs

Both platforms log every workflow execution. Ship those logs off-platform to your SIEM / ELK. Watch for:

  • Workflows triggered by forks attempting access to protected resources.
  • Sudden spikes in secret access.
  • Self-hosted runners showing up that aren't yours.

Pipeline Design

One Immutable Artifact

The cardinal rule. Build once in CI, deploy that exact artifact (with digest!) to every environment.

# CI
- build image → push as ghcr.io/myorg/api@sha256:abc...

# Deploy staging (Deployment and container are both assumed to be named "api")
- kubectl set image deployment/api api=ghcr.io/myorg/api@sha256:abc...

# Deploy production
- kubectl set image deployment/api api=ghcr.io/myorg/api@sha256:abc...

If staging and production deploy different builds, you're not testing what you ship. Use digests, not tags, to pin.
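
A guard for this rule can be sketched as a small check that runs before promotion. It assumes you can list the image reference deployed to each environment; `is_digest_pinned` and `same_artifact` are hypothetical helpers, and the digest below is a placeholder, not a real image.

```python
import re

# name@sha256:<64 hex chars>; a tag-only reference such as name:v1 does not match.
DIGEST_RE = re.compile(r"^[^@\s]+@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image_ref):
    """Return True if the reference pins an image by sha256 digest."""
    return bool(DIGEST_RE.match(image_ref))


def same_artifact(deployments):
    """Return True if every environment runs the same digest-pinned image."""
    refs = set(deployments.values())
    return len(refs) == 1 and all(is_digest_pinned(ref) for ref in refs)


placeholder_digest = "sha256:" + "ab" * 32  # invented digest for illustration
image = f"ghcr.io/myorg/api@{placeholder_digest}"

print(same_artifact({"staging": image, "production": image}))
print(same_artifact({"staging": image, "production": "ghcr.io/myorg/api:latest"}))
```

The second call fails both because the references differ and because a tag can be repointed after the fact.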

Promote Through Environments

PR        → build, lint, test
main      → build, deploy to staging, run smoke tests
release   → promote (re-tag) staging image, deploy to production

Promotions are fast because they don't rebuild — they just re-tag and re-deploy. The same @sha256:abc... that ran in staging for 48 hours now runs in production.

Use Environment Approvals as the Production Gate

GitHub Environments and GitLab CI Environments both support manual approval gates. Require them for production:

  • Reviewer must approve before the job runs.
  • The approval is logged with the workflow run.
  • A deploy can't sneak through outside of working hours by accident.

Roll Forward, Not Back, by Default

When a deploy is bad:

  • Preferred: revert the commit in git → CI builds and deploys the reverted code through the normal pipeline. Same controls, same audit trail.
  • Emergency: kubectl rollout undo or equivalent — fast, but skips your CI gates. Document each instance.

Don't develop a culture of cluster-side hotfixes; they accumulate and become tech debt.

Avoid Pipeline-as-Logic

CI YAML is a config language, not a program. When you find yourself doing:

  • Loops in YAML
  • Conditional if chains 10 deep
  • Shell scripts spread across 5 inline run: blocks

… extract that logic into a real script (./scripts/deploy.sh, ./scripts/release.py) that humans can run locally. The CI pipeline just calls it. You can reproduce CI failures on your laptop.
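
As a rough sketch of what such an extracted script might look like, here is a small Python entry point with a dry-run mode. The Deployment and container names default to "api" purely as an assumption, and the flag names are illustrative choices, not a convention from any platform.

```python
import argparse
import shlex
import subprocess


def build_command(deployment, container, image):
    """Build the kubectl invocation that points a Deployment at a specific image."""
    return ["kubectl", "set", "image", f"deployment/{deployment}", f"{container}={image}"]


def main(argv=None):
    parser = argparse.ArgumentParser(description="Deploy one image to the current cluster.")
    parser.add_argument("--image", required=True, help="digest-pinned image reference")
    parser.add_argument("--deployment", default="api")  # assumed name
    parser.add_argument("--container", default="api")   # assumed name
    parser.add_argument("--dry-run", action="store_true", help="print the command only")
    args = parser.parse_args(argv)

    command = build_command(args.deployment, args.container, args.image)
    if args.dry_run:
        print(shlex.join(command))
        return 0
    return subprocess.run(command, check=False).returncode


# Dry run with an explicit argument list, so nothing touches a cluster.
# A real script would call main() under the usual __main__ guard instead.
main(["--image", "ghcr.io/myorg/api@sha256:abc...", "--dry-run"])
```

The pipeline step then shrinks to a single call to the script, and the same dry run works on a laptop.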

Deployment Strategies

A short tour of the patterns CI/CD enables:

Strategy | How
Rolling update | Replace N old pods with new, gradually (Kubernetes Deployment default)
Blue/Green | Stand up new version completely; switch traffic atomically
Canary | New version takes 1%/5%/25% of traffic; monitor; expand
Feature flags | Deploy code but keep features off; flip per user / org

Feature flags decouple deployment from release. A risky feature can be deployed dark, enabled for one internal user, then ramped up — even though it's the same artifact in production all along.
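
Percentage ramps, whether for a flag or a canary, are usually implemented by hashing a stable identifier into a bucket. The following is a minimal sketch, assuming 100 buckets and a per-flag salt; `in_rollout`, the flag name, and the user IDs are hypothetical.

```python
import hashlib


def in_rollout(flag, user_id, percent):
    """Return True if this user falls inside the first `percent` of 100 buckets."""
    # Salting with the flag name keeps different flags from targeting the same users.
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


users = [f"user-{i}" for i in range(1000)]
for percent in (1, 5, 25):
    enabled = sum(in_rollout("new-checkout", u, percent) for u in users)
    print(f"{percent}% ramp enables {enabled} of {len(users)} users")
```

Because a user's bucket never changes, anyone enabled at 5% stays enabled at 25%, so ramping up only ever adds users.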

Observability of the Pipeline

Track CI/CD itself, not just what it ships:

Metric | Why
Pipeline success rate | Trending down = something's flaky; investigate
P50 / P95 pipeline duration | The user feedback loop
Deploys per day | DORA metric: high-performing teams deploy frequently
Lead time for changes | PR open → deployed to prod
Mean time to recovery (MTTR) | Deploy broke → fixed
Change failure rate | % of deploys that needed rollback

The four DORA metrics correlate strongly with team performance. Instrument them.
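
Instrumenting them needs little more than a timestamped record per deploy. This is a minimal sketch, assuming each record carries when the PR was opened, when it reached production, whether it failed, and when service was restored; the field names and sample data are invented.

```python
from datetime import datetime, timedelta
from statistics import median


def dora_summary(deploys, window_days):
    """Compute the four DORA metrics from a list of deploy records."""
    if not deploys:
        return None
    failures = [d for d in deploys if d["failed"]]
    lead_times = [d["deployed_at"] - d["opened_at"] for d in deploys]
    recoveries = [d["restored_at"] - d["deployed_at"] for d in failures]
    mttr = sum(recoveries, timedelta()) / len(recoveries) if recoveries else None
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deploys),
        "mean_time_to_recovery": mttr,
    }


# Invented sample: four deploys over two days, one of which failed.
t0 = datetime(2024, 1, 1, 9, 0)
hours = lambda n: timedelta(hours=n)
deploys = [
    {"opened_at": t0, "deployed_at": t0 + hours(2), "failed": False, "restored_at": None},
    {"opened_at": t0 + hours(3), "deployed_at": t0 + hours(7), "failed": True, "restored_at": t0 + hours(8)},
    {"opened_at": t0 + hours(24), "deployed_at": t0 + hours(30), "failed": False, "restored_at": None},
    {"opened_at": t0 + hours(26), "deployed_at": t0 + hours(34), "failed": False, "restored_at": None},
]

summary = dora_summary(deploys, window_days=2)
for metric, value in summary.items():
    print(metric, value)
```

The median is used for lead time so that one unusually slow PR does not dominate the number.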

Pipeline Hygiene

A handful of habits:

  1. Pin tool versions. Node 20, Python 3.12, Terraform 1.7.5 — exact, not latest.
  2. Reproduce locally. Whatever CI does, you should be able to run on your laptop.
  3. Treat flaky tests as bugs. Each flake erodes trust. Quarantine, then fix, then re-enable.
  4. Pipelines fail loudly. set -euo pipefail in shell, fail-fast in matrix jobs.
  5. Don't disable failing tests. If you must, file a ticket and put a date on the skip.
  6. Run security scans in CI. SAST, dependency scan, container image scan — all platforms have built-in or integrated options.
  7. One-button rollback. Make it easier than rolling forward when needed.
  8. Document the deployment process in the repo. A new engineer should follow it without Slack help.

Checklist

Production-ready CI/CD checklist

  • Pipeline definitions in version control with the code they build
  • PR feedback in < 10 min (lint + unit tests at minimum)
  • Caching for dependencies and build outputs
  • DAG (needs:) used to avoid unnecessary stage barriers
  • OIDC for cloud credentials; no long-lived AWS_ACCESS_KEY_ID etc
  • Secrets scoped to environment, not repo-wide
  • Third-party actions / templates pinned by SHA
  • Build once, deploy the same digest to every environment
  • Production deploys gated behind manual approval (environment)
  • Smoke tests after deploy; automated rollback on failure
  • Pipeline logs shipped to a SIEM
  • DORA metrics tracked (deploy frequency, lead time, MTTR, failure rate)
  • Tool versions and base images pinned, not floating
  • Self-hosted runners (if any) ephemeral and sandboxed
  • Deployment / rollback runbook in the repo
