BIND patches as a byproduct of setting up new DNS servers

2015-01-17 - Progress - Tony Finch

On Friday evening I reached a BIG milestone in my project to replace Cambridge University's DNS servers. I finished porting and rewriting the dynamic name server configuration and zone data update scripts, and I was - at last! - able to get the new servers up to pretty much full functionality, pulling lists of zones and their contents from the IP Register database and the managed zone service, and with DNSSEC signing on the new hidden master.

There is still some final cleanup and robustifying to do, and checks to make sure I haven't missed anything. And I have to work out the exact process I will follow to put the new system into live service with minimum risk and disruption. But the end is tantalizingly within reach!

In the last couple of weeks I have also got several small patches into BIND.

  • Jan 7: documentation for named -L

    This was a follow-up to a patch I submitted in April last year. The named -L option specifies a log file to use at startup for recording the BIND version banners and other startup information. Previously this information would always go to syslog regardless of your logging configuration.
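
    For example, something like this (a hypothetical invocation - the log file path here is made up):

        # record the version banners and other startup messages in a file
        named -c /etc/named.conf -L /var/log/named-startup.log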

    This feature will be in BIND 9.11.

  • Jan 8: typo in comment

    Trivial :-)

  • Jan 12: teach nsdiff to AXFR from non-standard ports

    Not a BIND patch, but one of my own companion utilities. Our managed zone service (MZS) runs a name server on a non-standard port, and our new setup will use nsdiff | nsupdate to implement bump-in-the-wire signing for the MZS.
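
    The idea is roughly as follows - a hypothetical sketch with made-up server name, port, and paths, using dig for the transfer rather than nsdiff's own new port support:

        # pull the unsigned zone from the MZS server on its non-standard port,
        # then compute and apply the differences to the local signed copy
        dig -p 5353 @mzs.example.org example.org axfr > /tmp/example.org.unsigned
        nsdiff example.org /tmp/example.org.unsigned | nsupdate -l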

  • Jan 13: document default DNSKEY TTL

    It took me a while to work out where that value came from. I submitted the patch on Jan 4; it is included in the 9.10 ARM (Administrator Reference Manual).

  • Jan 13: automatically tune max-journal-size

    Our old DNS build scripts have a couple of mechanisms for tuning BIND's max-journal-size setting. By default a zone's incremental update journal will grow without bound, which is not helpful. Having to set the parameter by hand is annoying, especially since it should be simple to automatically tune the limit based on the size of the zone.

    Rather than re-implementing some annoying plumbing for yet another setting, I thought I would try to automate it away. I have submitted this patch as RT#38324. In response I was told there is also RT#36279, which sounds like a request for this feature, and RT#25274, which sounds like another implementation of my patch; judging by the ticket number, the latter dates from 2011.

    I hope this gets into 9.11, or something like it. I suppose that rather than maintaining this patch I could do something equivalent in my build scripts...
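
    Something along those lines in a build script, say - a rough sketch where the zone name, file locations, and the factor of two are all made up:

        # emit a zone clause with a journal limit derived from the zone file size
        zone=example.org
        file=/var/named/master/$zone
        size=$(wc -c < "$file")
        limit=$((size * 2))     # arbitrary rule for this sketch: twice the zone size
        echo "zone \"$zone\" {"
        echo "    type master;"
        echo "    file \"$file\";"
        echo "    max-journal-size $limit;"
        echo "};"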

    Edited to add: this was eventually committed upstream for BIND 9.12.

  • Jan 14: doc: ignore and clean up isc-notes-html.xsl

    I found some cruft in a supposedly-clean source tree.

    This one actually got committed under my name, which I think is a first for me and BIND :-) (RT#38330)

  • Jan 14: close new zone file before renaming, for win32 compatibility

  • Jan 14: use a safe temporary new zone file name

    These two arose from a problem report on the bind-users list. The conversation moved to private mail, which I find a bit annoying - I tend to think it is more helpful for other users if problems are fixed in public.

    It turned out that BIND's error logging in this area is basically negligible, even when you turn on debug logging :-( Fortunately, Process Explorer on Windows can monitor filesystem events, and it reported a 'SHARING VIOLATION' and a 'NAME NOT FOUND'. That gave me the clue that this was a POSIX vs Windows portability bug: unlike POSIX, Windows will not rename a file that is still open.

    So in the end this problem was more interesting than I expected.

  • Jan 16: critical: ratelimiter.c:151: REQUIRE(ev->ev_sender == ((void *)0)) failed

    My build scripts are designed so that Ansible sets up the name servers with a static configuration which contains everything except for the zone {} clauses. The zone configuration is provisioned by the dynamic reconfiguration scripts. Ansible runs are triggered manually; dynamic reconfiguration runs from cron.

    I discovered a number of problems with bootstrapping from a bare server with no zones to a fully-populated server with all the zones and their contents on the new hidden master.

    The process is basically (a rough sketch follows the list):

    • if there are any missing master files, initialise them as minimal zone files

    • write the zone configuration file and run rndc reconfig

    • run nsdiff | nsupdate for every zone to fill it with the correct contents
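
    A very rough sketch of those three steps - the zone list file, paths, placeholder records, and the config-writing helper are all made up here:

        # step 1: initialise any missing master files as minimal zones
        while read -r zone; do
            master=/var/named/master/$zone
            if [ ! -f "$master" ]; then
                {
                    echo '$TTL 3600'
                    echo '@ IN SOA ns0.example.org. hostmaster.example.org. 1 3600 900 604800 3600'
                    echo '@ IN NS ns0.example.org.'
                } > "$master"
            fi
        done < zone-list

        # step 2: write the zone configuration and tell named about it
        write-zone-config < zone-list > /etc/named/zones.conf   # hypothetical helper
        rndc reconfig

        # step 3: fill each zone with its correct contents
        while read -r zone; do
            nsdiff "$zone" "zones/$zone" | nsupdate -l
        done < zone-list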

    When bootstrapping, the master server loaded 123 new zones; then, shortly after the nsdiff | nsupdate process started, named crashed with the assertion failure quoted above.

    Mark Andrews replied overnight (he lives in Australia) with a patch which fixed the problem. Yay!

Another bootstrapping problem was to do with BIND's zone integrity checks. nsdiff is not very clever about the order in which it emits changes; in particular it does not ensure that hostnames exist before any NS or MX or SRV records are created to point to them. You can turn off most of the integrity checks, but not the NS record checks.

This causes trouble for us when bootstrapping the cam.ac.uk zone, which is the only zone we have with in-zone NS records. It also has lots of delegations that can trip the checks.

My solution is to create a special bootstrap version of the zone, which contains the apex and delegation records (built from configuration stored in git) but not the bulk of the zone contents from the IP Register database. The zone can then be successfully loaded in two stages: first nsdiff cam.ac.uk DB.bootstrap | nsupdate -l, then nsdiff cam.ac.uk zones/cam.ac.uk | nsupdate -l.
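
In shell the two-stage load looks roughly like this - the apex and delegation fragment file names are made-up stand-ins for the configuration kept in git:

    # build the bootstrap zone from the fragments, load it, then load the full zone
    cat apex-records delegation-records > DB.bootstrap
    nsdiff cam.ac.uk DB.bootstrap | nsupdate -l
    nsdiff cam.ac.uk zones/cam.ac.uk | nsupdate -l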

Bootstrapping isn't something I expect to do very often, but I want to be sure it is easy to rebuild all the servers from scratch, including the hidden master, in case of major OS upgrades, VM vs hardware changes, disasters, etc.

No more special snowflake servers!