Make before break

2019-11-18 - Progress - Tony Finch

This afternoon I did a tricky series of reconfigurations. The immediate need was to do some prep work for improving our DNS blocks; I also wanted to make some progress towards completing the renaming/renumbering project that has been on the back burner for most of this year; and I wanted to fix a bad quick-and-dirty hack I made in the past.

Along the way I think I became convinced there's an opportunity for a significant improvement.

Avoiding cockups

Since Rachel Kroll wrote about "make before break" recently, the phrase has been on my mind. Rachel's article is a great example of why I am working towards a transactional API and user interface for IP Register.

More generally, make-before-break is a fundamental safety technique (try Go Ape to see it in another context) and it's a core part of the way I go about things. It's why I check web servers work before changing the DNS and why I play tricks with redirects to provision Let's Encrypt certificates before a web server goes live.

This afternoon's work was a series of configuration changes that were planned, tested, and put into production. By keeping the dependencies in mind and following the make-before-break rule, I could do it without booking downtime in an out-of-hours at-risk period.

High level plan

The "bad quick-and-dirty hack" was to use our hidden primary server as a zone transfer relay/fanout server for a third-party vendor RPZ block list. This violated our security principle that the primary server should not talk to the outside world. It was an appallingly bad choice, driven (if I remember correctly) by the limited length of the ACL the vendor would allow (too short to list all the servers in our outward-facing DNS clusters) and by the fact that I had not yet developed a plan for separate zone transfer relay servers.

The most difficult part of the reshuffle plan was to change the service architecture to add these zone transfer relay servers. Although I sketched out the idea a year ago, I did not really start planning how to get there from here until now. This afternoon was the smallest possible first step.

Since the new RPZ feeds will work basically the same as our existing feed, fixing the "bad quick-and-dirty hack" by taking a small step towards separate zone transfer relay servers also solves the immediate need.

The zone transfer servers are going to take over the IPv4 addresses currently used by auth0.dns and auth1.dns since there are lots of other people with configurations that have those addresses wired in for zone transfers, and rather than creating work for others, it's comparatively easy for me to change a few glue records.

Low level process

Our DNS server configuration has a static part provisioned by Ansible, and a dynamic part provisioned from our database back-ends and a simplified configuration file. On an orthogonal axis are the different flavours of server: authoritative, recursive, hidden primary, and (embryonic) zone transfer servers.

The work went roughly as follows. Each point was a plan / code / test / deploy cycle.

  • Prepare ACLs in static config

    There was some accumulated cruft that needed cleaning up, but the significant parts were to add zone transfer source declarations (masters clauses in BIND configuration files) for the new RPZ provider and for the embryonic zone transfer servers.

    PREP: add new as-yet-unused config clauses

  • Create new dynamic config for zone transfer servers

    We're turning auth servers into xfer servers so this was just the tiniest necessary difference: xfer servers relay the RPZ block lists, but auth servers don't.

    NEEDS: not actually any of the new static config clauses, because the new RPZ block lists are not quite ready

    PREP: new as-yet-unused config file

  • Create new static config for xfer servers

    Again the tiniest necessary difference.

    NEEDS: the dynamic config from the previous step

    PREP: new as-yet-unused config file

  • Add auth0 and auth1 to RPZ vendor ACL

    PREP: they need to be able to get the zones!

  • Reconfigure auth0 and auth1 to use xfer config

    Still acting as auth servers because the config has hardly diverged at all.

    NEEDS: the looser ACLs from the previous step

    NEEDS: the config from the two steps before that

    MAKE: zone transfer servers can now relay RPZ block lists

  • Reconfigure recursive servers

    Get RPZ block lists from auth0 / auth1 (future zone transfer relay servers) instead of hidden primary. A user-facing change so always needs extra care and attention.

    NEEDS: static config from first step

    NEEDS: RPZ block lists on xfer servers from previous step

    FLIP: from old hotness to new and busted

  • Fix zone transfers for RPZ block lists

    By mistake the auth/xfer servers were getting the RPZ block lists from our hidden primary, not directly from the vendor, because that's what our recursive servers used to do. This kind of latent bug is why you double-check before removing things that are not supposed to be used any more...

  • Drop RPZ block list from hidden primary

    BREAK: recursive servers used to depend on this

  • Drop hidden primary from RPZ vendor ACL

    BREAK: zone transfers for RPZ block lists needed this

  • Remove evil firewall hole on hidden primary

    BREAK: zone transfers for RPZ block lists needed this
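
The ordering rule behind the list above can be sketched as a small checker: every step's NEEDS must already be done, and nothing tagged BREAK may run before the FLIP. This is only an illustration in Python. The step names are made up for the example, the "fix zone transfers" step has no tag in the list so it is treated as MAKE here, and the NEEDS given to the BREAK steps are inferred from the notes rather than stated in them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str                      # hypothetical label, not a real config name
    kind: str                      # "PREP", "MAKE", "FLIP", or "BREAK"
    needs: tuple[str, ...] = ()    # names of steps that must come first


def check_plan(steps):
    """Return a list of problems; an empty list means the order is safe."""
    problems = []
    done = set()
    flipped = False
    for step in steps:
        for dep in step.needs:
            if dep not in done:
                problems.append(f"{step.name}: needs {dep}, which is not done yet")
        if step.kind == "BREAK" and not flipped:
            problems.append(f"{step.name}: BREAK before FLIP")
        if step.kind == "FLIP":
            flipped = True
        done.add(step.name)
    return problems


PLAN = [
    Step("static-acls", "PREP"),
    Step("xfer-dynamic-config", "PREP"),
    Step("xfer-static-config", "PREP", needs=("xfer-dynamic-config",)),
    Step("vendor-acl-add-auth", "PREP"),
    Step("auth-use-xfer-config", "MAKE",
         needs=("vendor-acl-add-auth", "xfer-dynamic-config", "xfer-static-config")),
    Step("recursive-use-xfer", "FLIP",
         needs=("static-acls", "auth-use-xfer-config")),
    # Untagged in the list above; treated as MAKE (assumption).
    Step("fix-xfer-source", "MAKE", needs=("auth-use-xfer-config",)),
    # NEEDS for the BREAK steps are inferred, not taken from the list.
    Step("primary-drop-rpz", "BREAK", needs=("recursive-use-xfer", "fix-xfer-source")),
    Step("vendor-acl-drop-primary", "BREAK", needs=("fix-xfer-source",)),
    Step("primary-close-firewall", "BREAK", needs=("fix-xfer-source",)),
]

if __name__ == "__main__":
    for problem in check_plan(PLAN):
        print(problem)
```

Run against the plan in the order given, the checker reports nothing. Moving any BREAK step ahead of the FLIP, or ahead of something it needs, produces a complaint for each violated rule.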

Reflection

That looks a lot neater in terms of planning and execution than it felt like at the time. I think that's because the scruffy indecisiveness happened mostly when I was working out what I needed to do, and I didn't do anything until I was sure it was heading in the right direction and not going to break anything.

Maybe you can see some of the difficulty in the number of PREP steps. There could have been fewer steps if the configuration modules were more self-contained.

One of the things about the DNS server config that makes me uncomfortable is that the dynamic and static configurations are quite tightly coupled, in that the dynamic config depends on ACLs whose names are defined in the static config. So although they look like separate parts of the system from the way the code is laid out, careless changes to either of them can easily break things.

Some of this discomfort dates back to when I redid the server setup with Ansible: before that the DNS server configuration scripts were more unified. I moved half of the configuration to Ansible without cutting the dependencies. My excuse is that it was a mad rush and I was already redesigning too many parts of the system!

Improvement?

It would be fairly easy to move the definition of the ACLs into the dynamic configuration script, so that it can produce a self-contained configuration file with ACLs expanded in-line instead of being referred to by name. This would also make the dynamic configuration more like a catalog zone, which I would like to use to make our recursive and authoritative servers more self-configuring.
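
A minimal sketch of what "ACLs expanded in-line" could look like, assuming the ACL definitions live in a plain table inside the generator. The ACL names and addresses are invented (documentation prefixes), and the output is a BIND-style fragment with everything except allow-transfer left out, so this is not a complete zone clause.

```python
# Hypothetical ACL table; in this sketch it lives in the dynamic config
# generator rather than in a separately provisioned static file.
ACLS = {
    "xfer-clients": ["192.0.2.0/24", "2001:db8:1::/48"],
    "internal": ["198.51.100.0/24"],
}


def inline_acl(name, acls=ACLS):
    """Expand an ACL name into a literal address match list."""
    try:
        members = acls[name]
    except KeyError:
        # Fail at generation time instead of leaving a dangling name
        # for the name server to reject when it loads the file.
        raise ValueError(f"unknown ACL {name!r}") from None
    return "{ " + " ".join(f"{m};" for m in members) + " }"


def zone_clause(zone, transfer_acl, acls=ACLS):
    """Render a zone clause that does not refer to any ACL by name."""
    return (
        f'zone "{zone}" {{\n'
        f"    // type, file, and other options omitted in this sketch\n"
        f"    allow-transfer {inline_acl(transfer_acl, acls)};\n"
        f"}};\n"
    )


if __name__ == "__main__":
    print(zone_clause("example.org", "xfer-clients"))
```

The point of the design is that a misspelled or missing ACL name becomes an error in the generator, not a broken reference between two files that are deployed separately.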

But RPZ block lists are more difficult to decouple in this way. Partly this is because the dependency is in the opposite direction: zone configs depend on the static config for ACL names, but for RPZ block lists the static config also depends on the zones. We have a mildly tricky bootstrapping hack to cut this loop on a freshly provisioned server. And a minor planning blight for the future is that RPZ config depending on catalog zones is a no-no.

The main blocker to fixing this is that the simplified zone config is really nice: one line per zone expands to 4 or 5 lines of config for 3 or 4 flavours of server (with help from a short script). So the challenge is to come up with something similarly short that keeps the different flavours of server in sync, without the mistakes of the past.
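
To make the "one line per zone" idea concrete, here is a rough Python sketch of that kind of expansion. It is not the real script or its input format: the flavour names, the per-flavour templates, and the named primaries lists (hidden-primary, xfer-servers) are all assumptions made for the example.

```python
# One BIND-style template per server flavour. Everything here is
# illustrative; the real templates and upstream names are not known.
SECONDARY = 'zone "{zone}" {{ type secondary; primaries {{ {upstream}; }}; file "{zone}"; }};'

TEMPLATES = {
    "primary": 'zone "{zone}" {{ type primary; file "{zone}"; }};',
    "xfer": SECONDARY,
    "auth": SECONDARY,
    "recursive": SECONDARY,
}

# Which (hypothetical) named primaries list each flavour pulls from.
UPSTREAM = {
    "xfer": "hidden-primary",
    "auth": "xfer-servers",
    "recursive": "xfer-servers",
}


def expand_zone(zone):
    """Expand one zone name into a config line for every flavour."""
    return {
        flavour: template.format(zone=zone, upstream=UPSTREAM.get(flavour, ""))
        for flavour, template in TEMPLATES.items()
    }


def expand_file(text):
    """Expand a one-zone-per-line file, skipping blanks and # comments."""
    configs = {flavour: [] for flavour in TEMPLATES}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        for flavour, clause in expand_zone(line).items():
            configs[flavour].append(clause)
    return configs


if __name__ == "__main__":
    for flavour, clauses in expand_file("example.org\nexample.net\n").items():
        print(f"# {flavour}")
        print("\n".join(clauses))
```

Because every flavour is generated from the same list in one pass, the flavours cannot drift out of sync with each other. The hard part the paragraph above describes, keeping the input this short once ACLs and RPZ dependencies move into it, is not addressed by the sketch.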

Perhaps this is another case where YAML and Jinja2 will displace Perl, for those parts of the dynamic config that are not (in practice) dynamic...