2018-03-28 - Future - Tony Finch
Yesterday I enabled serve-stale on our recursive DNS servers, and after a few hours one of them crashed messily. The automatic failover setup handled the crash reasonably well, and I disabled serve-stale to avoid any more crashes.
How did this crash slip through our QA processes?
Test server
My test server is the recursive resolver for my workstations, and the primary master for my personal zones. It runs a recent development snapshot of BIND. I use it to try out new features, often months before they are included in a release, and I help to shake out the bugs.
In this case I was relatively late enabling serve-stale, so I had
only been running it for five weeks before enabling it in production.
It's hard to tell whether a longer test at this stage would have exposed the bug, because there are relatively few junk queries on my test server.
Pre-heat
Usually when I roll out a new version of BIND, I will pre-heat the cache of an upgraded standby server before bringing it into production. This involves making about a million queries against the server based on a cache dump from a live server. This also serves as a basic smoke test that the upgrade is OK.
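A pre-heat along these lines could be sketched in Python as follows. This is not the script I actually use. It assumes the cache dump is text in the master-file style written by rndc dumpdb (owner name, TTL, class, type, rdata, with indented continuation lines that reuse the previous owner name), and the names QTYPES, parse_dump, build_query and preheat are made up for illustration. It ignores negative cache entries and escaped characters in names.

```python
import socket
import struct

# Hypothetical subset of record types worth replaying; others are skipped.
QTYPES = {"A": 1, "NS": 2, "CNAME": 5, "SOA": 6, "PTR": 12,
          "MX": 15, "TXT": 16, "AAAA": 28, "SRV": 33}


def parse_dump(lines):
    """Yield unique (owner name, type) pairs from cache dump text lines."""
    seen = set()
    owner = None
    for line in lines:
        # Skip blank lines, comments and $-directives.
        if not line.strip() or line.lstrip().startswith((";", "$")):
            continue
        fields = line.split()
        if not line[0].isspace():
            owner = fields.pop(0)  # unindented line: a new owner name
        # Drop the TTL and class columns (assumed numeric TTL, class IN).
        while fields and (fields[0].isdigit() or fields[0] == "IN"):
            fields.pop(0)
        if owner is None or not fields or fields[0] not in QTYPES:
            continue
        key = (owner.lower(), fields[0])
        if key not in seen:
            seen.add(key)
            yield key


def build_query(name, qtype, qid):
    """Encode a minimal recursive DNS query (RD set, class IN, no EDNS)."""
    header = struct.pack("!HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    labels = [l for l in name.encode("ascii").rstrip(b".").split(b".") if l]
    qname = b"".join(bytes([len(l)]) + l for l in labels) + b"\0"
    return header + qname + struct.pack("!HH", QTYPES[qtype], 1)


def preheat(server, pairs, port=53):
    """Send each query over UDP, one at a time, ignoring the answers."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(2.0)
        for i, (name, qtype) in enumerate(pairs):
            sock.sendto(build_query(name, qtype, i & 0xFFFF), (server, port))
            try:
                sock.recv(4096)
            except socket.timeout:
                pass  # a slow or failed lookup is fine for warming
```

Something like preheat("192.0.2.53", parse_dump(open("named_dump.db"))) would then drive the standby server; the address and file name here are placeholders. Sending queries one at a time is slow for a million names, so a real run would want some concurrency.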
I didn't do a pre-heat before enabling serve-stale because it was
just a config change that can be done without affecting service.
But it isn't clear that a pre-heat would have exposed this bug, because the crash required a particular pattern of failing queries, and the cache dump did not contain the exact problem query (though it did contain some closely related ones).
Possible improvements?
An alternative might be to use live traffic as test data, instead of a
static dump. A bit of code could read a dnstap feed on a live
server, and replay the queries against another server. There are two
useful modes:
- test traffic: replay incoming (recursive client-facing) queries; this reproduces the current live full query load on another server for testing, in a way that is likely to have reproduced yesterday's crash. 
- continuous warming: replay outgoing (iterative Internet-facing) queries; these are queries used to refill the cache, so they are relatively low volume, and suitable for keeping a standby server's cache populated. 
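The two modes map naturally onto dnstap message types: CLIENT_QUERY for queries arriving from clients and RESOLVER_QUERY for queries the resolver sends to authoritative servers. Here is a rough Python sketch of the selection and replay logic. It assumes the dnstap frames have already been decoded elsewhere into simple objects carrying the message type name and the raw DNS query bytes; TapMessage, MODES, select, set_rd and replay are hypothetical names. It also assumes that outgoing iterative queries need the RD flag set before a standby resolver will recurse for them.

```python
import socket
from dataclasses import dataclass


@dataclass
class TapMessage:
    """Hypothetical stand-in for an already-decoded dnstap message."""
    type: str             # dnstap message type name, e.g. "CLIENT_QUERY"
    query_message: bytes  # DNS query in wire format


# Which dnstap message type feeds each replay mode.
MODES = {
    "test-traffic": "CLIENT_QUERY",
    "continuous-warming": "RESOLVER_QUERY",
}


def select(messages, mode):
    """Yield the wire-format queries that belong to the given mode."""
    wanted = MODES[mode]
    for msg in messages:
        # 12 bytes is the size of a DNS header; skip anything shorter.
        if msg.type == wanted and len(msg.query_message) >= 12:
            yield msg.query_message


def set_rd(wire):
    """Return a copy of the query with the RD (recursion desired) bit set."""
    return wire[:2] + bytes([wire[2] | 0x01]) + wire[3:]


def replay(messages, mode, target, port=53):
    """Fire the selected queries at another server over UDP."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for wire in select(messages, mode):
            if mode == "continuous-warming":
                # Assumption: iterative queries were sent with RD clear,
                # so set it to make the standby do the recursion itself.
                wire = set_rd(wire)
            sock.sendto(wire, (target, port))
```

This does not wait for responses or limit the sending rate, both of which would matter when replaying a full client query load.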
There are a few cases where researchers have expressed interest in DNS
query data, of either of the above types. In order to satisfy them we
would need to be able to split a full dnstap feed so that recipients
only get the data they want.
This live DNS replay idea needs a similar dnstap splitter.
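The core of such a splitter is small. As a sketch only, in Python: assume each decoded dnstap message is presented as a (type name, payload) pair, and each recipient is a set of wanted type names plus a callable that accepts payloads. The name split_feed and this interface are hypothetical.

```python
def split_feed(messages, subscribers):
    """Pass each message to every subscriber that asked for its type.

    messages:    iterable of (type_name, payload) pairs
    subscribers: list of (wanted_type_names, sink) pairs, where sink is
                 any callable taking one payload
    """
    for msg_type, payload in messages:
        for wanted, sink in subscribers:
            if msg_type in wanted:
                sink(payload)
```

For example, a researcher's sink could subscribe to {"CLIENT_QUERY", "CLIENT_RESPONSE"} while the cache warmer subscribes to {"RESOLVER_QUERY"}. A real splitter would also need to deal with framing, slow recipients, and any anonymization that has to happen before data leaves our systems.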
