BIND 9.12.2 and serve-stale

2018-08-03 - News - Tony Finch

Earlier this year, we had an abortive attempt to turn on BIND 9.12's new serve-stale feature. This helps to make the DNS more resilient when there are local network problems or when DNS servers out on the Internet are temporarily unreachable. After many trials and tribulations we have at last successfully enabled serve-stale.

Popular websites tend to have very short DNS TTLs, so when there are network problems their names soon expire from resolver caches and stop resolving. As a result, network problems look more like DNS problems, and they get reported to the wrong people. We hope that serve-stale will reduce this kind of misattribution.
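
For reference, serve-stale is controlled by a handful of named.conf options. The following is a minimal sketch, assuming the BIND 9.12 option names stale-answer-enable, stale-answer-ttl and max-stale-ttl; the values are illustrative rather than our production settings.

    options {
        // answer from the cache when fresh data cannot be fetched
        stale-answer-enable yes;

        // how long an expired record may be retained and served (illustrative)
        max-stale-ttl 86400;

        // TTL attached to a stale answer when it is handed out (illustrative)
        stale-answer-ttl 1;
    };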

Pinning down CVE-2018-5737

The original attempt to roll out serve-stale was rolled back after one of the recursive DNS servers crashed. My normal upgrade testing wasn't enough to trigger the crash, which happened after a few hours of production load.

Since this was a crash that could be triggered by query traffic, it counted as a security bug. After I reported it to ISC.org, there followed a lengthy effort to reproduce it in a repeatable manner, so that it could be debugged and fixed.

I have a tool called adns-masterfile which I use for testing server upgrades and suchlike. I eventually found that sending lots of reverse DNS queries was a good way to provoke the crash; the reverse DNS has quite a large proportion of broken DNS servers, which exercise the serve-stale machinery.

The most reliable recipe I found crashed the server after about an hour; sometimes it crashed sooner, but not predictably. I used a cache dump (from rndc dumpdb) truncated after the .arpa TLD, so it contained 58MB of reverse DNS, nearly 700,000 queries. I then set up several concurrent copies of adns-masterfile to run in loops. The different copies tended to synchronize with each other, because when one of them got blocked on a broken domain name the others would catch up. So I added random delays between each run, to encourage the different copies to make queries from different parts of the dump file.
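
The harness amounted to something like the following sketch. The adns-masterfile invocation (just the dump file as its argument), the worker count, and the delay range are illustrative assumptions, not the exact setup.

    #!/usr/bin/env python3
    # Sketch of the crash-reproduction harness: several copies of
    # adns-masterfile loop over the truncated cache dump, with a random
    # pause before each pass so the copies drift out of step.
    import random
    import subprocess
    import threading
    import time

    DUMPFILE = "named_dump.arpa"  # rndc dumpdb output, truncated after .arpa
    WORKERS = 4                   # illustrative number of concurrent query streams

    def run_queries(n: int) -> None:
        while True:
            # random stagger so the workers query different parts of the file
            time.sleep(random.uniform(0, 300))
            print(f"worker {n}: starting a pass over {DUMPFILE}", flush=True)
            # assumed invocation: adns-masterfile makes a query for every
            # record in a master-file-format input
            subprocess.run(["adns-masterfile", DUMPFILE])

    threads = [threading.Thread(target=run_queries, args=(i,)) for i in range(WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()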

It was difficult for our friends at ISC.org to provoke a crash. After valgrind failed to provide any clues, I tried using Mozilla's rr debugger, which supports record/replay time-travel debugging with efficient reverse execution. It allowed me to bundle up the binary, libraries, and execution trace and send them to ISC.org so they could investigate what happened in detail.
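
The rr workflow was roughly as follows; the paths and named options here are illustrative, not the exact command lines used.

    # run named in the foreground under rr and reproduce the crash
    rr record /usr/local/sbin/named -g -c /etc/named.conf

    # bundle the binary, shared libraries and trace into the trace
    # directory so it can be copied to another machine
    rr pack

    # replay the trace, with reverse execution available in the debugger
    rr replay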

Cosmetic issues

I waited for BIND 9.12.2 before deploying the fixed serve-stale implementation because earlier versions had very verbose logging that could not easily be turned off.

I submitted a patch that moved serve-stale logging to a separate category so that it can be turned on and off independently of other logging or moved to a separate file. This was merged for the 9.12.2 release, which made it usable in production.
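
With 9.12.2, the stale-related log messages can be routed wherever is convenient. A minimal sketch, assuming the new logging category is named serve-stale and using a hypothetical channel name:

    logging {
        // hypothetical channel: keep serve-stale messages in their own file
        channel stale_log {
            file "serve-stale.log" versions 5 size 20m;
            print-time yes;
        };
        category serve-stale { stale_log; };

        // or, to discard the messages entirely:
        // category serve-stale { null; };
    };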

Improved upgrade workflow

I also investigated options for better testing of new versions before putting them into production. The disadvantage of adns-masterfile is that it makes a large number of unique queries, whereas the CVE-2018-5737 crash required repetition.

I now have a little script which can extract queries from tcpdump output and replay them against another server. I can use it to mirror production traffic onto a staging server, and let it soak for several hours before performing a live/staging switch-over.
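
The script amounts to something like this sketch (not the script itself): it assumes the queries are captured as text with tcpdump -l -n dst port 53, that the dnspython library is available, and that the staging server address is the illustrative one below.

    #!/usr/bin/env python3
    # Sketch of a capture-and-replay tool: read tcpdump's one-line DNS output
    # on stdin, pull out the query type and name, and send the same query to
    # a staging server with dnspython.
    import re
    import sys

    import dns.message
    import dns.query

    STAGING = "203.0.113.53"  # illustrative staging server address

    # matches the question in tcpdump's DNS output,
    # e.g. "... 12345+ A? www.example.org. (33)"
    QUESTION = re.compile(r"\s(\S+)\? (\S+) \(\d+\)$")

    for line in sys.stdin:
        m = QUESTION.search(line.rstrip())
        if not m:
            continue
        qtype, qname = m.group(1), m.group(2)
        try:
            query = dns.message.make_query(qname, qtype)
            dns.query.udp(query, STAGING, timeout=1)
        except Exception:
            # timeouts and odd names are expected; keep replaying
            pass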

Selfish motivations

Earlier this year we had a number of network outages during which we lost connectivity to about half of the Internet. I was quite keen to get serve-stale deployed before these outages were fixed, so I could observe it working; however I lost that race.

The outages were triggered by work on the network links between our CUDN border equipment and JANET's routers. In theory, this should not have affected our connectivity, because traffic should have seamlessly moved to use our other JANET uplink.

However, the routing changes propagated further than expected: they appeared to one of JANET's connectivity providers as route withdrawals and re-advertisements. If more than a few of these ups and downs happened within a number of minutes, our routes were deemed to be flapping, which triggered the provider's flap-damping protection mechanism: our routes were then ignored for 20 minutes.

Addressing this required quite a lot of back-and-forth between network engineers in three organizations. The problem hasn't been completely eliminated, but the flap-damping has been made less sensitive, and we have amended our border router work processes to avoid generating multiple up/down events.