Steven's Knowledge

Best Practices

Repo structure, branch strategy, RBAC, disaster recovery, scaling to many teams, common pitfalls

GitOps is a discipline more than a tool. These practices keep the discipline intact as the org grows.

Repo Structure

For more than ~3 services, separate app code from deployment config:

github.com/org/app-checkout      ← team-owned, source code
github.com/org/app-payments      ← team-owned
github.com/org/infra-config      ← platform-owned, all manifests

Inside infra-config:

infra-config/
├── argocd/
│   ├── root.yaml
│   └── applications/
│       └── {one Application yaml per app}
├── apps/
│   └── checkout/
│       ├── base/
│       └── overlays/{staging,prod}/
├── infrastructure/
│   ├── ingress-nginx/
│   ├── cert-manager/
│   ├── monitoring/
│   └── policy/
└── clusters/
    ├── staging/
    └── prod-us-east/

Why one config repo: it is easy to audit, it needs only one CODEOWNERS file, and cross-cutting changes can be reviewed in a single PR.
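
As an illustration of how argocd/root.yaml can tie the layout together, here is a minimal app-of-apps sketch. The Application name, the use of the default project, and the sync policy are assumptions, not settings taken from this guide.

```yaml
# argocd/root.yaml (sketch): the parent Application that manages all child Applications
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root                    # hypothetical name
  namespace: argocd
spec:
  project: default              # assumption: a dedicated platform project may be preferable
  source:
    repoURL: https://github.com/org/infra-config.git
    targetRevision: main
    path: argocd/applications   # every Application yaml in this directory is applied
  destination:
    server: https://kubernetes.default.svc   # the cluster ArgoCD itself runs in
    namespace: argocd
  syncPolicy:
    automated:
      prune: true      # deleting a child Application file removes that Application
      selfHeal: true   # manual edits to child Applications are reverted
```

With this in place, adding a service means adding one file under argocd/applications/ in a PR.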

Branch Strategy

The simplest model that works:

  • main: what's deployed (or about to be). Protected, requires PR review.
  • Feature branches: short-lived; PR-merged into main.
  • No develop/release branches. Trunk-based development for the config repo.

Promotion happens via directory overlays (overlays/staging vs overlays/prod), not branches. Branches encode time; directories encode environment, and an environment is a long-lived place rather than a point in time.
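
A prod overlay might look roughly like the sketch below. The namespace, image name, and replica count are hypothetical; only the directory layout comes from the structure described earlier.

```yaml
# apps/checkout/overlays/prod/kustomization.yaml (sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: payments-checkout               # hypothetical namespace
resources:
  - ../../base                             # shared manifests for every environment
images:
  - name: registry.example.com/checkout    # hypothetical image name used in base
    newTag: v1.2.3                         # promote by changing this line in a PR
replicas:
  - name: checkout                         # hypothetical Deployment name from base
    count: 3                               # prod-only override
```

Promoting a release to prod is then a one-line diff to newTag in this file, reviewed like any other change on main.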

RBAC and Multi-Tenancy

You don't want every team to be able to deploy to every namespace. Restrict each team with an ArgoCD AppProject:

apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata: { name: team-payments, namespace: argocd }
spec:
  description: Payments team
  sourceRepos:
    - https://github.com/org/infra-config.git
  destinations:
    - { server: '*', namespace: 'payments-*' }
  clusterResourceWhitelist: []   # no cluster-scoped resources
  namespaceResourceBlacklist:
    - { group: '', kind: ResourceQuota }
  roles:
    - name: deployer
      policies:
        - p, proj:team-payments:deployer, applications, sync, team-payments/*, allow
      groups: [org:payments]

Combined with Git CODEOWNERS:

# .github/CODEOWNERS
/apps/checkout/  @org/payments
/apps/refunds/   @org/payments
/clusters/       @org/platform
/infrastructure/ @org/platform

The two layers — repo-level review and ArgoCD project — prevent a payments engineer from accidentally deploying to ingress-nginx.
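
For illustration, an Application that lives inside the team-payments project could look like this. The Application name and target namespace are assumptions; the namespace is chosen so that it matches the 'payments-*' destination pattern.

```yaml
# argocd/applications/checkout-prod.yaml (sketch)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: checkout-prod            # hypothetical name
  namespace: argocd
spec:
  project: team-payments         # the project's rules apply to this Application
  source:
    repoURL: https://github.com/org/infra-config.git   # must be listed in sourceRepos
    targetRevision: main
    path: apps/checkout/overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: payments-checkout # allowed because it matches 'payments-*'
```

If the destination namespace were changed to ingress-nginx, ArgoCD would reject the Application because the project does not permit that destination.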

Disaster Recovery

What if the cluster is wiped? GitOps makes recovery straightforward:

  1. Create a new cluster and install ArgoCD: kubectl create namespace argocd, then kubectl apply -n argocd -f .../install.yaml
  2. Apply the root Application: kubectl apply -f argocd/root.yaml
  3. Wait. Everything reconciles.

Caveats:

  • Cluster-specific secrets: Sealed Secrets keys, External Secrets Operator's auth to Vault, ArgoCD's repo creds. Back these up outside Git.
  • Stateful data: PVCs, databases. GitOps doesn't restore data. Velero or DB-native backups do.
  • ArgoCD's own state: minimal but not zero (a record of current sync status and history). It is rebuilt on the next sync.
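
For the stateful-data caveat, a scheduled Velero backup is one option. The sketch below assumes Velero is installed in the velero namespace with a working backup storage location; the schedule name, namespace list, and retention are hypothetical.

```yaml
# Nightly Velero backup of one namespace, including volume snapshots (sketch)
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: payments-nightly         # hypothetical name
  namespace: velero
spec:
  schedule: "0 2 * * *"          # cron syntax: every day at 02:00
  template:
    includedNamespaces:
      - payments-checkout        # hypothetical namespace holding the PVCs
    snapshotVolumes: true        # snapshot persistent volumes, not only manifests
    ttl: 168h0m0s                # keep each backup for 7 days
```

The Schedule itself can live in the config repo, so the backup policy is restored by GitOps even though the data is not.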

Practice DR. Pick a quiet Friday afternoon, spin up a new cluster, apply your root, time how long until everything's healthy. The first time it'll surprise you.

Image Tag Strategy

Don't use latest. Use immutable tags, and prefer ones that sort:

  • Semver: v1.2.3 — best for releases
  • Git SHA: 1234abc — best for continuous deployment
  • Date + SHA: 2026-05-21-1234abc — sortable + traceable

Image automation policies key off these patterns. ArgoCD will tell you when the tag in Git differs from what is running (OutOfSync), but it cannot tell you which tag is correct. Choosing the tag is CI's job, not GitOps's.

Notifications

When sync succeeds or fails, who knows?

# Argo CD Notifications
apiVersion: v1
kind: ConfigMap
metadata: { name: argocd-notifications-cm, namespace: argocd }
data:
  service.slack: |
    token: $slack-token
  template.app-sync-failed: |
    message: |
      Application {{.app.metadata.name}} sync failed.
      Repo: {{.app.spec.source.repoURL}}
      Revision: {{.app.status.sync.revision}}
  trigger.on-sync-failed: |
    - description: Notify on sync fail
      send: [app-sync-failed]
      when: app.status.operationState.phase in ['Error', 'Failed']

Don't page on every sync. Page on:

  • Out-of-sync > 10 minutes (drift)
  • Sync failed
  • Health degraded > 5 minutes
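
A health trigger and a per-application subscription could be sketched as follows. The template name, trigger name, and Slack channel are hypothetical, and this version fires on state only; the duration thresholds in the list above are not expressed here.

```yaml
# Keys to add under data: in argocd-notifications-cm (fragment, not a full manifest)
template.app-health-degraded: |
  message: |
    Application {{.app.metadata.name}} is Degraded.
trigger.on-health-degraded: |
  - description: Notify when application health is Degraded
    send: [app-health-degraded]
    when: app.status.health.status == 'Degraded'
---
# Annotations to add to an Application so it subscribes to the triggers (fragment)
metadata:
  annotations:
    notifications.argoproj.io/subscribe.on-sync-failed.slack: payments-alerts      # hypothetical channel
    notifications.argoproj.io/subscribe.on-health-degraded.slack: payments-alerts
```

Defining a trigger does nothing until some Application or project subscribes to it, which is why the annotation is part of the sketch.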

Performance

ArgoCD can manage thousands of applications, but defaults won't get you there. Knobs:

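  • Application controller replicas: scale the argocd-application-controller StatefulSet and set ARGOCD_CONTROLLER_REPLICAS to match (default 1; shards are assigned per cluster, not per app)
  • Repo server replicas: more replicas if you have many Helm charts to render
  • Processing parallelism: ArgoCD caches Kubernetes resources in the controller; raise --status-processors and --operation-processors along with --kubectl-parallelism-limit, and cap concurrent manifest generation on the repo server with --parallelismlimit
  • Sync waves and concurrency: argocd-cm application.sync.impl.timeout.seconds (confirm the key name for your ArgoCD version)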
  • Webhook from Git instead of polling: reduces lag from ~3 min to seconds

If you start to feel slowness, separate the control plane: ArgoCD per cluster, or sharded application controller.
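
As a rough starting point, the controller and repo server settings can be expressed declaratively in argocd-cmd-params-cm. The key names below are taken from the argocd-cmd-params-cm reference and should be checked against your ArgoCD version; the values are illustrative, not recommendations.

```yaml
# Tuning parameters read by ArgoCD components at startup (sketch)
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  controller.status.processors: "50"          # workers that compare live state against Git
  controller.operation.processors: "25"       # workers that execute sync operations
  controller.kubectl.parallelism.limit: "20"  # cap on concurrent kubectl invocations
  reposerver.parallelism.limit: "10"          # cap on concurrent manifest generations
```

The components read these values at startup, so a restart of the controller and repo server is needed after changing them.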

Scaling to Many Teams

Patterns that don't break at 50 teams:

  • One config repo, many code repos. Platform team curates structure; teams own subdirectories via CODEOWNERS.
  • Templates and ApplicationSets. New service onboarding: create a directory from a template, ApplicationSet picks it up.
  • Self-service via PR: teams don't need ArgoCD UI access for routine deploys; they need PR access. UI for debugging only.
  • Backstage / IDP integration: app catalog generates ApplicationSets, links to ArgoCD pane.
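
The template-plus-ApplicationSet pattern can be sketched with a Git directory generator. The ApplicationSet name, the default project, and the convention that the namespace equals the app directory name are assumptions.

```yaml
# One Application per apps/<name>/overlays/prod directory (sketch)
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: apps-prod                  # hypothetical name
  namespace: argocd
spec:
  generators:
    - git:
        repoURL: https://github.com/org/infra-config.git
        revision: main
        directories:
          - path: apps/*/overlays/prod   # each match produces one Application
  template:
    metadata:
      name: '{{path[1]}}-prod'     # path[1] is the app directory, e.g. checkout
    spec:
      project: default             # assumption: real setups would map to a team project
      source:
        repoURL: https://github.com/org/infra-config.git
        targetRevision: main
        path: '{{path}}'           # the full matched directory
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{path[1]}}'   # assumption: namespace equals the app name
```

Onboarding a new service is then a PR that adds apps/<name>/ with a base and overlays; no ArgoCD change is needed.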

Common Pitfalls

Forgetting Helm value overrides. ArgoCD Application references a chart; values come from valueFiles or inline. Forgetting to set environment-specific overrides means staging = prod accidentally.
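
A sketch of an Application that makes the environment-specific values explicit is shown below. The chart path, file names, and namespace are hypothetical.

```yaml
# Helm-based Application with an explicit prod values file (sketch)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: checkout-prod              # hypothetical name
  namespace: argocd
spec:
  project: team-payments
  source:
    repoURL: https://github.com/org/infra-config.git
    targetRevision: main
    path: charts/checkout          # hypothetical chart location in the repo
    helm:
      valueFiles:
        - values.yaml              # shared defaults
        - values-prod.yaml         # later files override earlier ones
  destination:
    server: https://kubernetes.default.svc
    namespace: payments-checkout   # hypothetical namespace
```

If the staging Application lists only values.yaml and the prod one forgets values-prod.yaml, both render identical manifests, which is exactly the accident described above.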

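Auto-prune surprise. Removing a manifest from Git deletes the resource. Great for cleanup, bad if you forgot that a PVC was defined in the file you removed. Protect resources that must survive with the argocd.argoproj.io/sync-options: Prune=false annotation, and use ArgoCD finalizers carefully.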
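
One way to guard a resource against pruning is the Prune=false sync option, set per resource. The PVC name, namespace, and size below are hypothetical.

```yaml
# PVC that ArgoCD will not delete even if its manifest is removed from Git (sketch)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: checkout-data              # hypothetical name
  namespace: payments-checkout     # hypothetical namespace
  annotations:
    argocd.argoproj.io/sync-options: Prune=false   # skip this resource when pruning
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```

The trade-off is that the Application reports OutOfSync after the manifest is removed, which acts as a visible reminder that something was left behind on purpose.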

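Sync wave ordering is subtle. The annotation argocd.argoproj.io/sync-wave: "1" orders resources, and ArgoCD waits for each wave to be healthy before starting the next, but a resource kind with no health check counts as healthy as soon as it is applied. Use argocd.argoproj.io/hook: PreSync + Jobs for true gating.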
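
A gating Job might be sketched like this. The Job name, image, and command are hypothetical; the sync does not proceed to the main manifests until the Job completes successfully.

```yaml
# PreSync hook: runs to completion before the rest of the sync is applied (sketch)
apiVersion: batch/v1
kind: Job
metadata:
  name: checkout-db-migrate        # hypothetical name
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded   # clean up after success
spec:
  backoffLimit: 1                  # retry once, then fail the sync
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: registry.example.com/checkout-migrate:v1.2.3   # hypothetical image
          command: ["/bin/migrate", "up"]                       # hypothetical command
```

If the Job fails, the sync fails and the new Deployment is never applied, which is the gating behavior that a sync-wave annotation alone may not give you.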

Mixed manual and GitOps changes. A new annotation in Git, but someone also added one manually. Behavior depends on ignoreDifferences config. Pick one source of truth.
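
Where a field is legitimately owned by something other than Git, it is clearer to say so explicitly. The fragment below is a sketch that assumes an autoscaler manages Deployment replica counts; it belongs under spec in an Application.

```yaml
# Fragment of an Application spec: ignore a field that another controller owns
spec:
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas           # assumption: an HPA sets this, so Git should not fight it
```

Every entry here is a deliberate exception to Git being the source of truth, so the list is worth keeping short and reviewed.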

ArgoCD not in GitOps. ArgoCD itself should be deployed via GitOps (app-of-apps) for self-management. Otherwise upgrading ArgoCD is kubectl apply outside the model.

Cluster credentials in plaintext. ArgoCD stores cluster connection details in Kubernetes Secrets by default, which are base64-encoded rather than encrypted. Treat the argocd namespace like the keys to the kingdom: tight RBAC, audit logging, network policies.

Drift-tolerant culture. "Just kubectl edit it real quick" — the moment that's accepted, GitOps fails. Make direct cluster writes painful (no kubeconfigs for prod, break-glass procedure).
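
One way to make direct writes impossible by default is to give engineers only the built-in read-only view ClusterRole on prod. The binding name and group name below are hypothetical and depend on your identity provider.

```yaml
# Read-only access for engineers on the prod cluster (sketch)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: engineers-read-only        # hypothetical name
subjects:
  - kind: Group
    name: org:engineers            # hypothetical group from your identity provider
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: view                       # built-in role: read most resources, no Secrets, no writes
  apiGroup: rbac.authorization.k8s.io
```

Write access then exists only through ArgoCD's service account and a separately audited break-glass credential.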

Compliance and Audit

GitOps is naturally auditable:

  • Who changed prod? Git history.
  • When did we deploy v1.2.3? Find the commit that updated the image tag.
  • Show me the desired state at 02:34 UTC on May 18. Find the last commit before that time with git rev-list -1 --before="<timestamp>" main, then git show <hash>.
  • Did this approved change actually reach prod? ArgoCD sync history matches commit SHA.

For SOC2 / FedRAMP: protect the main branch, require approval, sign commits (git commit -S), and you have most of the controls.

Checklist

GitOps production readiness:

  • Config repo separate from app code repos
  • Main branch protected, requires PR review
  • CODEOWNERS enforces team boundaries
  • ArgoCD Projects + RBAC mirror team boundaries
  • ArgoCD itself managed via app-of-apps
  • Secrets handled via SOPS / Sealed Secrets / External Secrets — none in plaintext
  • No direct kubectl apply to production (RBAC blocks it)
  • Image tags are immutable (no latest)
  • Notifications wired to Slack/Discord for sync failures
  • Webhook from Git host to ArgoCD (sub-minute sync)
  • Disaster recovery procedure tested
  • PR previews available for app teams
  • Backup of Sealed Secrets controller keys / ESO auth state

What's Next

You have a GitOps practice. Connect it to:

  • CI/CD — CI builds and pushes images; GitOps deploys them
  • Secrets — External Secrets Operator + Vault is the GitOps-friendly secret stack
  • Service Mesh — mesh config lives in Git; Argo Rollouts integrates with mesh for canaries
  • Feature Flags — flags decouple release from deploy; GitOps deploys, flags release
