Postcronspam

2018-11-30 - Progress

This is a postmortem of an incident that caused a large amount of cronspam, but not an outage. However, the incident exposed a lot of latent problems that need addressing.

Description of the incident

I arrived at work late on Tuesday morning to find that the DHCP servers were sending cronspam every minute from monit. monit thought dhcpd was not working, although it was.

A few minutes before I arrived, a colleague had run our Ansible playbook to update the DHCP server configuration. This was the trigger for the cronspam.

Cause of the cronspam

We are using monit as a basic daemon supervisor for our critical services. The monit configuration doesn't have an "include" facility (or at least it didn't when we originally set it up) so we are using Ansible's "assemble" feature to concatenate configuration file fragments into a complete monit config.

The problem was that our Ansible setup had no explicit dependencies linking three steps: installing monit config fragments, reassembling the complete config, and restarting monit.

Running the complete playbook caused the monit config to be reassembled, so an incorrect but previously inactive config fragment was activated, causing the cronspam.
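A stale assembled config like this can be detected without running Ansible at all. The following is a minimal sketch, not our actual tooling: it approximates what "assemble" does (concatenate the fragments in name order, adding a newline between fragments when one is missing) and compares the result with the live config. The function names are made up for illustration, and it assumes no delimiter or regexp filtering is configured.

```python
from pathlib import Path


def expected_config(fragment_dir):
    """Concatenate fragments in name order, roughly as "assemble" does."""
    parts = []
    for path in sorted(Path(fragment_dir).iterdir(), key=lambda p: p.name):
        if not path.is_file():
            continue
        data = path.read_bytes()
        # Put a newline between fragments if the previous one lacks it.
        if parts and not parts[-1].endswith(b"\n"):
            parts.append(b"\n")
        parts.append(data)
    return b"".join(parts)


def config_is_stale(fragment_dir, assembled_file):
    """True if the assembled file no longer matches its fragments."""
    return Path(assembled_file).read_bytes() != expected_config(fragment_dir)
```

A check like this would have reported the DHCP servers as stale at any point in the nine months before the incident, since the broken fragment was on disk but missing from the assembled config.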

Origin of the problem

How was there an inactive monit config fragment on the DHCP servers?

The DHCP servers had an OS upgrade and reinstall in February. This was when the spammy broken monit config fragment was written.

What were the mistakes at that time?

  • The config fragment was not properly tested. A good monit config is normally silent, but in this case we didn't check that it sent cronspam when things were broken, which would have revealed that the config fragment was not actually installed properly.

  • The Ansible playbook was not verified to be properly idempotent. It should be possible to wipe a machine and reinstall it with one run of Ansible, and a second run should be all green. We didn't check the second run properly. Check mode isn't enough to verify idempotency of "assemble".

  • During routine config changes in the nine months since the servers were reinstalled, the usual practice was to run the DHCP-specific subset of the Ansible playbook (because that is much faster) so the bug was not revealed.
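The "second run should be all green" check can be scripted. This is a rough sketch under some assumptions: it runs the real ansible-playbook command twice in normal (not check) mode, and it reads the "changed=" counts from the PLAY RECAP printed by the default output format. The function names and the host names in the comment are hypothetical.

```python
import re
import subprocess

# Matches one host line of the PLAY RECAP, for example:
#   dhcp0 : ok=12 changed=0 unreachable=0 failed=0
RECAP_LINE = re.compile(r"^(\S+)[ \t]*:[ \t]+ok=\d+[ \t]+changed=(\d+)",
                        re.MULTILINE)


def changed_counts(output):
    """Map each host in the recap to its number of changed tasks."""
    return {host: int(n) for host, n in RECAP_LINE.findall(output)}


def second_run_is_green(playbook):
    """Run the playbook twice; the second run should change nothing."""
    cmd = ["ansible-playbook", playbook]
    subprocess.run(cmd, check=True)  # the first run is allowed to change things
    second = subprocess.run(cmd, check=True, capture_output=True, text=True)
    counts = changed_counts(second.stdout)
    # An empty recap means the output was not understood, so do not pass.
    return bool(counts) and all(n == 0 for n in counts.values())
```

Both runs are real runs on purpose, because check mode isn't enough to verify the idempotency of "assemble".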

Deeper issues

There was a lot more anxiety than there should have been when debugging this problem, because at the time the Ansible playbooks were going through a lot of churn for upgrading and reinstalling other servers, and it wasn't clear whether or not this had caused some unexpected change.

This gets close to the heart of the matter:

  • It should always be safe to check out and run the Ansible playbook against the production systems, and expect that nothing will change.

There are other issues related to being a (nearly) solo developer, which makes it easier to get into bad habits. The DHCP server config has the most contributions from colleagues at the moment, so it is not really surprising that this is where we find out the consequences of the bad habits of soloists.

Resolutions

It turns out that monit and dhcpd do not really get along. The monit UDP health checker doesn't work with DHCP (which was the cause of the cronspam) and monit's process checker gets upset by dhcpd being restarted when it needs to be reconfigured.

The monit DHCP UDP checker has been disabled; the process checker needs review to see if it can be useful without sending cronspam on every reconfig.

There should be routine testing to ensure the Ansible playbooks committed to the git server run green, at least in check mode. Unfortunately it's risky to automate this because it requires root access to all the servers; at the moment root access is restricted to admins in person.

We should be in the habit of running the complete playbook on all the servers (e.g. before pushing to the git server), to detect any differences between check mode and normal (active) mode. This is necessary because some Ansible tasks are skipped in check mode, so a green check run does not prove that a real run will be green.

Future work

This incident also highlights longstanding problems with our low bus protection factor and lack of automated testing. The resolutions listed above are small steps towards addressing these weaknesses.