Speeding up Ansible Playbook runs

Ansible is a great tool for configuration management, but because of the way it’s designed, a common complaint is that it’s not as fast as other tools like Salt, Chef or Puppet. This is because Ansible doesn’t rely on an agent listening on each host (although it can) and instead uses a deployment methodology based on SSH. This post isn’t about the pros and cons of each tool, but rather about ways to improve upon Ansible’s default configuration. Ansible ships with very conservative default values. This is smart in my opinion because it offers greater compatibility out of the box. Here I highlight some safe adjustments that can be made to the default configuration for improved performance (speed!).

Real World Playbook Test

For this test I’m using a real-world playbook that I use in my homelab when provisioning a new CentOS VM. It configures some basic things (hostname, SSH keys, etc.), installs common packages/utilities and tunes some OS configurations.

I’m running the playbook from a CentOS 7 VM on an ESXi 6.5 host. The playbook will run against 12 target VMs, all on the same VMNetwork as the Ansible VM. The Ansible VM has 4 vCPUs and 8GB of RAM.

Before tuning Ansible, we’ll need to gather some metrics about how each playbook run performs. Fortunately, Ansible v2.0 and higher ships with two built-in callbacks that can be enabled: timer and profile_tasks. Timer outputs the total playbook run time, similar to running the time command before an ansible-playbook command. The second, and IMO the more interesting of the two, is profile_tasks. This callback displays a nice summary of each task and how long it took to execute.

To enable these callbacks, edit (or create) an ansible.cfg file. You can check whether you already have an Ansible config file by running:

$ ansible --version
ansible 2.5.3
config file = /home/directory/ansible/ansible.cfg

This tells you the location of the configuration file that Ansible uses and the version. If you don’t see a config file listed you can create one in the directory where your playbooks will be run.

We’re going to add the following line to the config file under the [defaults] section:

[defaults]
callback_whitelist = timer, profile_tasks

I’m running the following playbook command:

ansible-playbook init_centos.yml -e @group_vars/vault.yml --limit vms

Here’s the output from the playbook run with the default configuration:

Friday 08 June 2018 16:04:29 -0400 (0:00:16.486) 0:02:04.805 ***
=================================================================
TASK | Install packages ---------------------------------- 20.73s
TASK | Start filebeat and enable service ----------------- 16.49s
TASK | Install filebeat ----------------------------------- 6.15s
Gathering Facts ------------------------------------------- 5.76s
TASK | Install rpms for Spacewalk / RHN ------------------- 5.15s
checkmk : TASK | Copy Checkmk Agent Listener -------------- 2.81s
checkmk : TASK | Copy Checkmk Agent ----------------------- 2.52s
TASK | Copy Influxdata repo (for Telegraf) ---------------- 2.22s
TASK | Ensure mount directories exist --------------------- 2.21s
TASK | Copy Telegraf config ------------------------------- 2.18s
TASK | Copy filebeat config template ---------------------- 2.17s
TASK | Copy user ssh/config ------------------------------ 2.16s
TASK | Update /etc/services file -------------------------- 2.14s
TASK | Set /etc/hostname ---------------------------------- 2.13s
TASK | Disable SELinux (Centos 7) ------------------------- 2.10s
TASK | Copy ssh keys -------------------------------------- 2.09s
TASK | Install prowl -------------------------------------- 2.09s
TASK | Copy .bash_logout for user ------------------------- 2.08s
TASK | Copy .bashrc for user ------------------------------ 2.08s
TASK | Copy iTerm2 bash shell integration for user -------- 2.07s
Playbook run took 0 days, 0 hours, 2 minutes, 4 seconds

The important line here is the last one: “Playbook run took … 2 minutes, 4 seconds”. That’s 124 seconds. Not terrible, but if you’re deploying to a large number of machines (say 50 or 100), those minutes can quickly add up.

Let’s start making some configuration tweaks and see if we can speed things up.

Enable SSH Pipelining

To enable SSH pipelining, add this to your ansible.cfg file. The Ansible documentation lists the setting under the [ssh_connection] heading, so that is the safest place for it:

pipelining = True

From the Ansible manual: “Enabling pipelining reduces the number of SSH operations required to execute a module on the remote server, by executing many ansible modules without actual file transfer.”

Let’s run the same playbook again but with this configuration option set and see what happens:

Friday 08 June 2018 16:07:19 -0400 (0:00:23.585) 0:01:56.055 ***
=================================================================
TASK | Start filebeat and enable service ----------------- 23.58s
TASK | Install packages ---------------------------------- 16.75s
TASK | Install filebeat ----------------------------------- 6.17s
Gathering Facts ------------------------------------------- 5.50s
TASK | Install rpms for Spacewalk / RHN ------------------- 4.61s
checkmk : TASK | Copy Checkmk Agent Listener -------------- 2.33s
checkmk : TASK | Copy Checkmk Agent ----------------------- 2.26s
TASK | Set /etc/hostname ---------------------------------- 1.91s
TASK | Copy Influxdata repo (for Telegraf) ---------------- 1.90s
TASK | Copy ssh/config for user --------------------------- 1.88s
TASK | Copy ssh keys for user ----------------------------- 1.87s
TASK | Copy .bash_logout for user ------------------------- 1.83s
TASK | Copy Telegraf config ------------------------------- 1.82s
TASK | Update /etc/services file -------------------------- 1.82s
TASK | Copy .bashrc for user ----------------------------- 1.82s
TASK | Copy Telegraf environment default ------------------ 1.81s
TASK | Copy .bashrc for root ------------------------------ 1.80s
TASK | Install prowl -------------------------------------- 1.79s
TASK | Install prowl API key ------------------------------ 1.77s
TASK | Copy .bash_logout for root ------------------------- 1.77s
Playbook run took 0 days, 0 hours, 1 minutes, 55 seconds

Here we can see that the playbook run completed 9 seconds faster (115 seconds versus 124). Not bad. Let’s see if we can tweak it some more.

Reduce poll interval to 5s

The default poll interval is 15 seconds. This is how often Ansible checks on a task that’s running to decide whether it can proceed. According to the Ansible documentation, the setting applies to asynchronous tasks that don’t supply their own poll value. Let’s set it to 5 seconds and see what happens. Add or edit this line in the ansible.cfg file, again under the [defaults] heading:

poll_interval = 5

Let’s run it:

Friday 08 June 2018 16:10:09 -0400 (0:00:13.277) 0:01:46.888 ***
=================================================================
TASK | Install packages ---------------------------------- 18.66s
TASK | Start filebeat and enable service ----------------- 13.28s
TASK | Install filebeat ----------------------------------- 5.61s
Gathering Facts ------------------------------------------- 5.50s
TASK | Install rpms for Spacewalk / RHN ------------------- 4.77s
checkmk : TASK | Copy Checkmk Agent Listener -------------- 2.33s
checkmk : TASK | Copy Checkmk Agent ----------------------- 2.19s
TASK | Copy filebeat config template ---------------------- 2.01s
TASK | Copy ssh/config for user -------------------------- 1.87s
TASK | Copy Telegraf environment default ------------------ 1.86s
TASK | Set /etc/hostname ---------------------------------- 1.86s
TASK | Copy ssh keys for user ---------------------------- 1.84s
TASK | Copy Influxdata repo (for Telegraf) ---------------- 1.84s
TASK | Copy .bash_logout for root ------------------------- 1.80s
TASK | Update /etc/services file -------------------------- 1.79s
TASK | Install prowlnotify -------------------------------- 1.77s
TASK | Copy .bashrc for root ------------------------------ 1.77s
TASK | Copy .bash_logout for user ------------------------- 1.76s
TASK | Copy .bashrc for user ------------------------------ 1.76s
TASK | Copy sudoers file --------------------------------- 1.76s
Playbook run took 0 days, 0 hours, 1 minutes, 46 seconds

It took 106 seconds to run the playbook that time. That’s 18 seconds faster than what we started with. Nice.

Let’s try another tweak and see if we can’t do even better.

Increase forks to 25

For my use case I’m increasing the number of simultaneous forks to 25 from the default value of 5. Again, Ansible ships with pretty sane defaults. We don’t want sane, we want fast. Let’s see how this does:
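The forks setting itself is not shown in the snippets in this post, so here is a minimal sketch of the line involved. It assumes the option sits under the same [defaults] heading as the earlier ones, which is where the Ansible documentation lists forks:

```ini
[defaults]
# Run against up to 25 hosts in parallel (the default is 5)
forks = 25
```

With only 12 target VMs, any value of 12 or more lets Ansible work on every host at once. The run with this change looks like this: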

Friday 08 June 2018 16:12:10 -0400 (0:00:17.528) 0:01:25.858 ***
=================================================================
TASK | Start filebeat and enable service ----------------- 17.53s
TASK | Install packages ---------------------------------- 10.29s
TASK | Install filebeat ----------------------------------- 8.51s
Gathering Facts ------------------------------------------- 3.77s
TASK | Install rpms for Spacewalk / RHN ------------------- 3.01s
checkmk : TASK | Copy Checkmk Agent Listener -------------- 1.59s
TASK | Disable SELinux (Centos 7) ------------------------- 1.40s
TASK | Update /etc/services file -------------------------- 1.39s
TASK | Install prowlnotify -------------------------------- 1.39s
checkmk : TASK | Copy Checkmk Agent ----------------------- 1.36s
TASK | Copy sudoers file --------------------------------- 1.30s
TASK | Copy Telegraf config ------------------------------- 1.30s
TASK | Install treesize in /usr/local/bin ----------------- 1.29s
TASK | Copy .bash_logout for user ------------------------- 1.27s
TASK | Copy .bashrc for root ------------------------------ 1.27s
TASK | Install prowl -------------------------------------- 1.25s
TASK | Copy .bash_logout for root ------------------------- 1.23s
TASK | Copy ssh/config for user --------------------------- 1.23s
TASK | Copy filebeat config template ---------------------- 1.22s
TASK | Copy Telegraf environment default ------------------ 1.22s
Playbook run took 0 days, 0 hours, 1 minutes, 25 seconds

Very nice. Now we’re at 85 seconds. Remember, I’m running the exact same playbook just with new configuration values (options). This is very good but I think there’s more we can do.

Enable fact_caching

By enabling this setting we’re telling Ansible to keep the facts it gathers in a local file. You can also point it at a Redis cache. See the documentation for details.

Fact gathering is what happens when Ansible says “Gathering Facts” about your target hosts; fact caching stores those results so they can be reused on later runs. If we don’t change our targets’ hardware (or virtual hardware) very often, this can be very helpful. Enable it by adding this to your ansible.cfg file:

fact_caching = jsonfile
fact_caching_connection = /tmp/.ansible_fact_cache
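Two related settings are worth knowing about, although they are not part of the runs timed in this post. With the default gathering policy, Ansible still collects facts at the start of every play, so a cache mostly pays off when gathering is set to smart. The sketch below assumes the documented gathering and fact_caching_timeout keys under [defaults]; the 86400 second timeout is purely an illustrative value:

```ini
[defaults]
# Assumption (not from the tested config): only gather facts
# for hosts that have no valid cached facts
gathering = smart
# Assumption: treat cached facts as stale after 24 hours
fact_caching_timeout = 86400
```

The timings below use only the two fact_caching lines shown above.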

What happens when we run it now?

Friday 08 June 2018 17:25:14 -0400 (0:00:03.000) 0:01:15.530 ***
=================================================================
TASK | Install packages ---------------------------------- 17.33s
TASK | Install filebeat ----------------------------------- 4.46s
TASK | Install rpms for Spacewalk / RHN ------------------- 3.87s
Gathering Facts ------------------------------------------- 3.82s
TASK | Start filebeat and enable service ------------------ 3.00s
checkmk : TASK | Copy Checkmk Agent Listener -------------- 2.34s
checkmk : TASK | Copy Checkmk Agent ----------------------- 1.47s
TASK | Install prowl -------------------------------------- 1.40s
TASK | Update /etc/services file -------------------------- 1.38s
TASK | Install prowlnotify -------------------------------- 1.33s
TASK | Set /etc/hostname ---------------------------------- 1.33s
TASK | Ensure mount directories exist --------------------- 1.28s
TASK | Copy iTerm2 bash shell integration for user -------- 1.27s
TASK | Copy Telegraf environment default ------------------ 1.25s
TASK | Copy Influxdata repo (for Telegraf) ---------------- 1.24s
TASK | Copy .bash_logout for user ------------------------- 1.23s
TASK | Copy .bashrc for user ------------------------------ 1.22s
TASK | Copy .bashrc for root ------------------------------ 1.21s
TASK | Disable SELinux (Centos 7) ------------------------- 1.20s
checkmk : TASK | Create Checkmk Agent Unit ---------------- 1.20s
Playbook run took 0 days, 0 hours, 1 minutes, 15 seconds

75 seconds. Very nice. These tweaks have made a huge difference.

Let’s recap

We’ve reduced our playbook run time from 2 minutes and 4 seconds down to 1 minute and 15 seconds (124 seconds -> 75 seconds). That’s about 40% less time to run the exact same playbook with just a few configuration tweaks.

By adding or editing these configuration values we were able to cut 49 seconds off our playbook run time. Now, these results aren’t going to be the same for everyone, every playbook or every environment. There are many factors that account for Ansible performance.

It’s clear, however, that modifying the defaults as we did here results in significant performance gains and can save you time on deployments.
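Pulling the settings from this post into one place, an ansible.cfg along these lines would cover all five tweaks. It is a sketch assembled from the snippets above rather than a copy of the file used for the test. The one assumption is the placement of pipelining under [ssh_connection], which is where the Ansible documentation lists it:

```ini
[defaults]
# Report total run time and per-task timings
callback_whitelist = timer, profile_tasks
# Check on running tasks every 5 seconds instead of the default 15
poll_interval = 5
# Run against up to 25 hosts in parallel instead of the default 5
forks = 25
# Cache gathered facts as JSON files on the control machine
fact_caching = jsonfile
fact_caching_connection = /tmp/.ansible_fact_cache

[ssh_connection]
# Assumption: section placement follows the Ansible documentation
pipelining = True
```

Running ansible --version from the playbook directory, as shown earlier, confirms which config file Ansible actually picks up.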

( I’ll add a pretty table with summary here someday. )
