Advanced Patterns
Ansible Vault, dynamic inventory, conditionals and loops, error handling, and zero-downtime rolling deploys
The features here turn "I can run a playbook" into "I run production deploys with this."
Ansible Vault
You'll have secrets in your repo eventually — DB passwords, API tokens, TLS keys. Vault encrypts them at rest.
# Encrypt an entire file
ansible-vault encrypt group_vars/all/secrets.yml
# Edit in-place (decrypt, edit, re-encrypt)
ansible-vault edit group_vars/all/secrets.yml
# Encrypt a single string (paste the output into YAML)
ansible-vault encrypt_string 'super-secret-token' --name 'api_token'
# Decrypt back to plaintext
ansible-vault decrypt group_vars/all/secrets.yml
The encrypted file is plain text that is safe to commit, and Ansible decrypts it transparently at run time:
# group_vars/all/secrets.yml (after `ansible-vault encrypt`)
$ANSIBLE_VAULT;1.1;AES256
35613165613032643962366434623031623530373132343361373465383361353161386364316632
6131386339393963313061643238666565303432316162320a3431366434623835616338306539...
Run a playbook with the vault password:
# Prompt for the password
ansible-playbook site.yml --ask-vault-pass
# Or read it from a file (used in CI)
ansible-playbook site.yml --vault-password-file ~/.vault_pass
Store the vault password in a secret manager (1Password, AWS Secrets Manager, GitHub Actions secrets) and never check it in. Rotation is possible but tedious, because ansible-vault rekey has to re-encrypt every vaulted file, so treat the password like a root credential.
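As a minimal sketch of how a vaulted value is consumed, the task below reads api_token (the name passed to encrypt_string above) like any other variable. The endpoint URL is a placeholder, and no_log is added on the assumption that you want the decrypted token kept out of task output.

```yaml
# Sketch only: the URL is a hypothetical placeholder, not a real service.
- name: Call an API with the vaulted token
  ansible.builtin.uri:
    url: "https://api.example.com/v1/status"
    headers:
      Authorization: "Bearer {{ api_token }}"   # decrypted in memory at run time
  no_log: true   # keep the decrypted token out of logs and verbose output
```

The playbook still has to be run with --ask-vault-pass or --vault-password-file, otherwise Ansible cannot decrypt the variable.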
Dynamic Inventory
Static YAML inventories don't scale to auto-scaling fleets. Use inventory plugins that query the cloud:
# inventory/aws_ec2.yml
---
plugin: amazon.aws.aws_ec2
regions:
  - us-east-1
keyed_groups:
  - key: tags.Role          # group by Role tag, e.g. role_webserver
    prefix: role
  - key: tags.Environment   # e.g. env_production
    prefix: env
hostnames:
  - tag:Name
compose:
  ansible_host: private_ip_address
# Preview what the plugin returns
ansible-inventory -i inventory/aws_ec2.yml --graph
# Use it like any other inventory (quote the pattern so the shell does not interpret the &)
ansible-playbook -i inventory/aws_ec2.yml site.yml --limit 'role_webserver:&env_production'
Common dynamic inventory plugins: amazon.aws.aws_ec2, google.cloud.gcp_compute, azure.azcollection.azure_rm, kubernetes.core.k8s (formerly community.kubernetes.k8s).
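The generated groups can also be targeted from inside a playbook rather than with --limit. A minimal sketch, assuming instances tagged Role=webserver and Environment=production exist so that the role_webserver and env_production groups are created:

```yaml
# Sketch: runs only on hosts that are in BOTH generated groups.
- name: Check connectivity to production web servers
  hosts: "role_webserver:&env_production"
  gather_facts: false
  tasks:
    - name: Verify Ansible can reach the host
      ansible.builtin.ping:
```

Inside YAML the pattern only needs ordinary string quoting, since no shell is involved.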
Conditionals
Run a task only when a condition is true:
- name: Install Docker on Ubuntu
  apt:
    name: docker.io
    state: present
  when: ansible_distribution == "Ubuntu"
- name: Install Docker on RedHat-likes
  yum:
    name: docker
    state: present
  when: ansible_os_family == "RedHat"
- name: Only restart in production
  service:
    name: nginx
    state: restarted
  when:                            # list items are ANDed together
    - deploy_env == "production"   # "environment" is a reserved name, so use another variable
    - deploy_result is changed     # deploy_result: assumed to be registered by an earlier task
Conditions are evaluated per host: the task is skipped on hosts that don't match, not for the whole play.
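Conditions often depend on the result of an earlier task rather than on facts. A minimal sketch, assuming a hypothetical marker file and a hypothetical init script (neither path comes from this guide):

```yaml
- name: Check whether the app has been initialized
  ansible.builtin.stat:
    path: /etc/app/initialized                        # hypothetical marker file
  register: init_marker

- name: Run one-time initialization
  ansible.builtin.command: /usr/local/bin/app-init    # hypothetical script
  when: not init_marker.stat.exists                   # evaluated separately on each host
```

The registered variable is per host, so hosts that already have the marker file skip the second task while the others run it.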
Loops
Loop a task over a list or dict:
# List of strings
- name: Install several packages
  apt:
    name: "{{ item }}"
    state: present
  loop:
    - nginx
    - postgresql-client
    - htop
# List of dicts (more readable)
- name: Create deploy users on each host
  user:
    name: "{{ item.name }}"
    groups: "{{ item.groups }}"
    shell: "{{ item.shell | default('/bin/bash') }}"
  loop:
    - { name: alice, groups: sudo }
    - { name: bob, groups: developers, shell: /usr/bin/zsh }
# Loop a dict (key + value)
- name: Set sysctl values
  ansible.posix.sysctl:
    name: "{{ item.key }}"
    value: "{{ item.value }}"
    state: present
    reload: true
  loop: "{{ sysctl_settings | dict2items }}"
  vars:
    sysctl_settings:
      vm.swappiness: 10
      net.core.somaxconn: 4096
Error Handling
Production playbooks anticipate failure:
- name: Pull image (may fail temporarily)
  docker_image:
    name: "{{ app_image }}"
    source: pull
  register: pull_result
  retries: 3
  delay: 10
  until: pull_result is success
- name: Optional cleanup task that's allowed to fail
  shell: rm -f /tmp/lockfile
  ignore_errors: true
- name: Validate config; fail loudly if invalid
  command: nginx -t
  register: result
  changed_when: false
  # nginx -t reports its verdict on stderr, not stdout
  failed_when: '"successful" not in result.stderr'
# block / rescue / always (Ansible's equivalent of try / except / finally)
- block:
    - name: Stop the old version
      service:
        name: app
        state: stopped
    - name: Swap symlink to the new release
      file:
        src: "/var/app/releases/{{ release_id }}"
        dest: /var/app/current
        state: link
    - name: Start the new version
      service:
        name: app
        state: started
  rescue:
    - name: Roll back symlink on failure
      file:
        src: "{{ previous_release }}"
        dest: /var/app/current
        state: link
    - name: Start the previous version
      service:
        name: app
        state: started
  always:
    - name: Notify deploy result
      uri:
        url: "https://hooks.example.com/deploy"
        method: POST
        body_format: json
        body: { status: "{{ 'ok' if ansible_failed_task is not defined else 'rolled-back' }}" }
Rolling Deploys
Restarting every host at once means downtime. Roll the deploy:
- name: Deploy application
  hosts: webservers
  become: true
  serial: "25%"              # 25% of hosts at a time
  max_fail_percentage: 0     # stop if any host fails
  any_errors_fatal: true     # don't continue past failures
  vars:
    app_image: "myregistry/app:{{ app_version }}"
  pre_tasks:
    - name: Remove from load balancer
      uri:
        url: "http://{{ lb_api }}/deregister/{{ inventory_hostname }}"
        method: POST
      delegate_to: localhost   # run on the control node, not the target
  tasks:
    - name: Pull new image
      docker_image:
        name: "{{ app_image }}"
        source: pull
    - name: Recreate container
      docker_container:
        name: app
        image: "{{ app_image }}"
        state: started
        recreate: true
        restart_policy: unless-stopped
        ports: ["{{ app_port }}:3000"]
        env:
          NODE_ENV: production
          DATABASE_URL: "{{ database_url }}"
    - name: Wait for health check
      uri:
        url: "http://localhost:{{ app_port }}/health"
        status_code: 200
      register: health
      until: health.status == 200
      retries: 30
      delay: 2
  post_tasks:
    - name: Re-register with load balancer
      uri:
        url: "http://{{ lb_api }}/register/{{ inventory_hostname }}"
        method: POST
      delegate_to: localhost
Key knobs:
| Setting | What it does |
|---|---|
| serial: 1 | One host at a time (slowest, safest) |
| serial: "25%" | A fraction of hosts at a time |
| serial: [1, 2, "50%"] | Canary: 1 host, then 2, then 50% |
| max_fail_percentage: 0 | Abort the whole play if any host fails |
| any_errors_fatal: true | Stop the play (not just one host) on first failure |
| delegate_to: localhost | Run on the control node; useful for LB API calls |
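To illustrate the canary form of serial from the table, here is a minimal sketch. The batch sizes are the ones shown above; the single task is a stand-in for real deploy tasks, and app is assumed to be a service that exists on the hosts. With ten hosts, Ansible should run batches of 1, 2, and 5, then finish the remaining 2, because the last entry repeats until no hosts are left.

```yaml
- name: Canary rollout
  hosts: webservers
  become: true
  serial:
    - 1        # canary: a single host first
    - 2        # then a slightly larger batch
    - "50%"    # then half the fleet per batch until done
  max_fail_percentage: 0   # a failure in any batch aborts the remaining batches
  tasks:
    - name: Restart the app (stand-in for the real deploy tasks)
      ansible.builtin.service:
        name: app
        state: restarted
```

Starting with a single host limits the blast radius: a bad release fails on one machine before it reaches the rest of the fleet.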
Fact Caching
Fact gathering takes seconds per host. For frequent playbook runs, cache them:
# ansible.cfg
[defaults]
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_facts
fact_caching_timeout = 86400
smart means "gather only if not already cached." Re-runs of a playbook against unchanged hosts skip fact gathering entirely until the cache entry expires (86400 seconds, i.e. 24 hours).
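A side benefit of caching is that facts for hosts outside the current play can be read from the cache through hostvars. A minimal sketch, assuming a hypothetical databases group whose facts were gathered by an earlier run and have not yet expired:

```yaml
- name: Use cached facts from another group
  hosts: webservers
  tasks:
    - name: Show the first database host's distribution
      ansible.builtin.debug:
        # Undefined (and the task fails) if that host's facts are not in the cache.
        msg: "{{ hostvars[groups['databases'][0]].ansible_facts.distribution }}"
```

Without a cache, the same lookup only works when the database hosts had their facts gathered earlier in the same run.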
What's Next
You can write Ansible that handles real production runs. The last piece is operating it well — project layout, CI/CD, testing, Terraform integration → Best Practices.