Steven's Knowledge

Advanced Patterns

Ansible Vault, dynamic inventory, conditionals and loops, error handling, and zero-downtime rolling deploys

The features here turn "I can run a playbook" into "I run production deploys with this."

Ansible Vault

You'll have secrets in your repo eventually — DB passwords, API tokens, TLS keys. Vault encrypts them at rest.

# Encrypt an entire file
ansible-vault encrypt group_vars/all/secrets.yml

# Edit in-place (decrypt, edit, re-encrypt)
ansible-vault edit group_vars/all/secrets.yml

# Encrypt a single string (paste the output into YAML)
ansible-vault encrypt_string 'super-secret-token' --name 'api_token'

# Decrypt back to plaintext
ansible-vault decrypt group_vars/all/secrets.yml

Encrypting a whole file replaces its contents with ciphertext that Ansible decrypts transparently at run time:

# group_vars/all/secrets.yml (after `ansible-vault encrypt`)
$ANSIBLE_VAULT;1.1;AES256
35613165613032643962366434623031623530373132343361373465383361353161386364316632
6131386339393963313061643238666565303432316162320a3431366434623835616338306539...
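
A common companion pattern, sketched here under assumed names, keeps the encrypted file small and points at it from a plaintext file. The `vault_` prefix, both variable names, and the `vars.yml` filename are illustrative conventions, not something Ansible requires.

```yaml
# group_vars/all/vars.yml (plaintext, safe to grep and review)
db_password: "{{ vault_db_password }}"      # hypothetical variable names
api_token: "{{ vault_api_token }}"

# group_vars/all/secrets.yml (contents BEFORE `ansible-vault encrypt`)
vault_db_password: "example-only-password"  # placeholder value
vault_api_token: "super-secret-token"
```

With this split, searching the repo for `db_password` still shows where a secret is used, even though its value lives only in the encrypted file.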

Run a playbook with the vault password:

# Prompt for the password
ansible-playbook site.yml --ask-vault-pass

# Or read it from a file (used in CI)
ansible-playbook site.yml --vault-password-file ~/.vault_pass

Store the vault password in a secret manager (1Password, AWS Secrets Manager, GitHub Actions secret) and never check it in. Changing it means running ansible-vault rekey on every encrypted file, so treat it like a root credential.

Dynamic Inventory

Static YAML inventories don't scale to auto-scaling fleets. Use inventory plugins that query the cloud:

# inventory/aws_ec2.yml
---
plugin: amazon.aws.aws_ec2
regions:
  - us-east-1
keyed_groups:
  - key: tags.Role                              # group by Role tag
    prefix: role
  - key: tags.Environment
    prefix: env
hostnames:
  - tag:Name
compose:
  ansible_host: private_ip_address

# Preview what the plugin returns
ansible-inventory -i inventory/aws_ec2.yml --graph

# Use it like any other inventory
# Quote the pattern: an unquoted & would send the command to the background
ansible-playbook -i inventory/aws_ec2.yml site.yml --limit 'role_webserver:&env_production'

Common dynamic inventory plugins: amazon.aws.aws_ec2, google.cloud.gcp_compute, azure.azcollection.azure_rm, kubernetes.core.k8s (formerly community.kubernetes.k8s).
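
The generated groups can also be targeted from inside a play instead of with --limit. The sketch below assumes instances tagged Role=webserver and Environment=production, which the keyed_groups settings above would turn into the groups role_webserver and env_production.

```yaml
# Targets only hosts that are in BOTH generated groups.
- name: Configure production web servers
  hosts: "role_webserver:&env_production"
  gather_facts: false
  tasks:
    - name: Show which address Ansible connects to
      ansible.builtin.debug:
        msg: "{{ inventory_hostname }} -> {{ ansible_host }}"
```

Because of the compose entry, ansible_host should resolve to each instance's private IP while inventory_hostname stays the Name tag.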

Conditionals

Run a task only when a condition is true:

- name: Install Docker on Ubuntu
  apt:
    name: docker.io
    state: present
  when: ansible_distribution == "Ubuntu"

- name: Install Docker on RedHat-likes
  yum:
    name: docker
    state: present
  when: ansible_os_family == "RedHat"

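- name: Only restart in production
  service:
    name: nginx
    state: restarted
  when:                               # list items are ANDed together
    - deploy_env == "production"      # avoid the name `environment`; it is reserved in Ansible
    - app_config is changed           # `changed` tests a result registered by an earlier task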

Conditions are evaluated per host — the task is skipped on hosts that don't match, not on the whole play.
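
Conditions often test the registered result of an earlier task rather than a fact. A minimal sketch, assuming a hypothetical binary path and the illustrative register name app_binary:

```yaml
- name: Check whether the app binary exists
  ansible.builtin.stat:
    path: /usr/local/bin/app          # hypothetical path
  register: app_binary

- name: Report hosts where the binary is missing
  ansible.builtin.debug:
    msg: "app is not installed on {{ inventory_hostname }}"
  when: not app_binary.stat.exists    # evaluated separately for each host
```

The stat task runs everywhere, and the debug task is skipped on every host where the file is present.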

Loops

Loop a task over a list or dict:

# List of strings
- name: Install several packages
  apt:
    name: "{{ item }}"
    state: present
  loop:
    - nginx
    - postgresql-client
    - htop

# List of dicts (more readable)
- name: Create deploy users on each host
  user:
    name: "{{ item.name }}"
    groups: "{{ item.groups }}"
    shell: "{{ item.shell | default('/bin/bash') }}"
  loop:
    - { name: alice, groups: sudo }
    - { name: bob, groups: developers, shell: /usr/bin/zsh }

# Loop a dict (key + value)
- name: Set sysctl values
  ansible.posix.sysctl:
    name: "{{ item.key }}"
    value: "{{ item.value }}"
    state: present
    reload: true
  loop: "{{ sysctl_settings | dict2items }}"
  vars:
    sysctl_settings:
      vm.swappiness: 10
      net.core.somaxconn: 4096
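
Two loop features tend to matter once the items are dicts: loop_control.label to keep the output short, and the per-item results of a registered loop. The sketch below reuses the user example; the names deploy_users and user_results are assumptions made for illustration.

```yaml
- name: Create deploy users
  ansible.builtin.user:
    name: "{{ item.name }}"
    groups: "{{ item.groups }}"
  loop: "{{ deploy_users }}"
  loop_control:
    label: "{{ item.name }}"          # log only the name, not the whole dict
  register: user_results
  vars:
    deploy_users:
      - { name: alice, groups: sudo }
      - { name: bob, groups: developers }

- name: List users that were created or modified
  ansible.builtin.debug:
    # A registered loop stores one entry per item under .results
    msg: "{{ user_results.results | selectattr('changed') | map(attribute='item.name') | list }}"
```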

Error Handling

Production playbooks anticipate failure:

- name: Pull image (may fail temporarily)
  docker_image:
    name: "{{ app_image }}"
    source: pull
  register: pull_result
  retries: 3
  delay: 10
  until: pull_result is success

- name: Optional cleanup task that's allowed to fail
  shell: rm -f /tmp/lockfile
  ignore_errors: true

- name: Validate config; fail loudly if invalid
  command: nginx -t
  register: result
  changed_when: false
  # nginx -t writes its report to stderr, not stdout
  failed_when: result.rc != 0 or "successful" not in result.stderr

# Try / rescue / always (Ansible: block / rescue / always)
- block:
    - name: Stop the old version
      service:
        name: app
        state: stopped

    - name: Swap symlink to the new release
      file:
        src: "/var/app/releases/{{ release_id }}"
        dest: /var/app/current
        state: link

    - name: Start the new version
      service:
        name: app
        state: started

  rescue:
    - name: Roll back symlink on failure
      file:
        src: "{{ previous_release }}"
        dest: /var/app/current
        state: link

    - name: Start the previous version
      service:
        name: app
        state: started

  always:
    - name: Notify deploy result
      uri:
        url: "https://hooks.example.com/deploy"
        method: POST
        body_format: json
        body: { status: "{{ 'ok' if ansible_failed_task is not defined else 'rolled-back' }}" }
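
One behavior worth knowing: when every rescue task succeeds, the host is no longer treated as failed and the play continues. If the failure should still stop the run after cleanup, the rescue section can re-raise it. A minimal sketch, assuming a hypothetical migration command:

```yaml
- block:
    - name: Run database migration
      ansible.builtin.command: /var/app/current/bin/migrate   # hypothetical command

  rescue:
    - name: Report what failed
      ansible.builtin.debug:
        # ansible_failed_task and ansible_failed_result are set inside rescue
        msg: "Task '{{ ansible_failed_task.name }}' failed: {{ ansible_failed_result.msg | default('no message') }}"

    - name: Re-raise so the host is still marked failed
      ansible.builtin.fail:
        msg: "Migration failed; stopping after cleanup"
```

Without the final fail task, settings such as max_fail_percentage would not see this host as a failure.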

Rolling Deploys

Restarting every host at once means downtime. Roll the deploy:

- name: Deploy application
  hosts: webservers
  become: true
  serial: "25%"                          # 25% of hosts at a time
  max_fail_percentage: 0                 # stop if any host fails
  any_errors_fatal: true                 # don't continue past failures

  vars:
    app_image: "myregistry/app:{{ app_version }}"

  pre_tasks:
    - name: Remove from load balancer
      uri:
        url: "http://{{ lb_api }}/deregister/{{ inventory_hostname }}"
        method: POST
      delegate_to: localhost              # run on the control node, not the target

  tasks:
    - name: Pull new image
      docker_image:
        name: "{{ app_image }}"
        source: pull

    - name: Recreate container
      docker_container:
        name: app
        image: "{{ app_image }}"
        state: started
        recreate: true
        restart_policy: unless-stopped
        ports: ["{{ app_port }}:3000"]
        env:
          NODE_ENV: production
          DATABASE_URL: "{{ database_url }}"

    - name: Wait for health check
      uri:
        url: "http://localhost:{{ app_port }}/health"
        status_code: 200
      register: health
      until: health.status == 200
      retries: 30
      delay: 2

  post_tasks:
    - name: Re-register with load balancer
      uri:
        url: "http://{{ lb_api }}/register/{{ inventory_hostname }}"
        method: POST
      delegate_to: localhost

Key knobs:

Setting                   What it does
serial: 1                 One host at a time (slowest, safest)
serial: "25%"             A fraction of hosts at a time
serial: [1, 2, "50%"]     Canary: 1, then 2, then 50%
max_fail_percentage: 0    Abort the whole play if any host fails
any_errors_fatal: true    Stop the play (not just one host) on first failure
delegate_to: localhost    Run on the control node; useful for LB API calls
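
The canary form of serial can be sketched as a small play. The debug task is only there to make the batches visible; ansible_play_batch is a built-in variable that holds the hosts in the current batch.

```yaml
- name: Canary rollout
  hosts: webservers
  serial:
    - 1          # first batch: a single canary host
    - 2          # second batch: two more hosts
    - "50%"      # later batches: half the hosts each, until none remain
  max_fail_percentage: 0
  tasks:
    - name: Show the current batch
      ansible.builtin.debug:
        msg: "Batch: {{ ansible_play_batch | join(', ') }}"
      run_once: true     # with serial, this runs once per batch
```

A failure on the canary host stops the play before the larger batches are touched.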

Fact Caching

Fact gathering takes seconds per host. For frequent playbook runs, cache them:

# ansible.cfg
[defaults]
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_facts
fact_caching_timeout = 86400

smart means "gather only if not already cached." Re-runs of a playbook against unchanged hosts skip fact-gathering entirely until the cached entry expires (24 hours with the timeout above).
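
Cached facts go stale when a host changes underneath them, for example after a kernel upgrade or a new network interface. One way to refresh them on demand, sketched here as a standalone play against the webservers group used earlier:

```yaml
- name: Refresh cached facts
  hosts: webservers
  gather_facts: false
  tasks:
    - name: Drop the cached facts for these hosts
      ansible.builtin.meta: clear_facts

    - name: Gather fresh facts (this also rewrites the cache)
      ansible.builtin.setup:
```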

What's Next

You can write Ansible that handles real production runs. The last piece is operating it well — project layout, CI/CD, testing, Terraform integration → Best Practices.
