DNS server resilience and network outages

2020-02-18 - News - Tony Finch

Our recursive DNS servers are set up to be fairly resilient. Each server is physical hardware, so the DNS needs only power and networking to work, avoiding hidden circular dependencies.

We use keepalived to determine which of the physical servers is in live service. It does two things for us:

  • We can move live service from one server to another with minimal disruption, so we can patch and upgrade servers without downtime.

  • The live DNS service can recover automatically from things like server hardware failure or power failure.

This note is about coping with network outages, which are more difficult.

Problem

This morning there was some planned maintenance work on our data centre network which did not go smoothly. This caused more disruption than expected to our DNS service: although the DNS service was moved out of the affected building yesterday, DNS was not properly isolated from the network faults.

Mitigation

Until the network problems have been fixed, the DNS servers in the affected building have been completely disabled, so that their connectivity problems do not affect the live DNS servers in the other two buildings.

Normally we run with two live servers, one hot standby, and one test server. At the moment we have only the two live servers which are acting as hot standby for each other.

The rest of this note has more detailed background information about how our DNS servers cope with failures, and why this mitigation was chosen.

How keepalived works

Keepalived is an implementation of VRRP, the virtual router redundancy protocol. VRRP is one of a number of first hop redundancy protocols that were originally designed to provide better resilience for a subnet's default gateway. But these protocols are also useful for application servers: as well as DNS servers, keepalived is often used with HAProxy to make web servers more resilient.

VRRP uses periodic multicast packets to implement an election protocol between the servers. The server with the highest priority is elected as the winner. It becomes the live server by using ARP to take over the service IP address. For our DNS servers, 131.111.8.42 is a floating address that moves to whichever server wins a VRRP election. There is another instance of VRRP for 131.111.12.20, configured so that a different server wins its election, which gives us two different live DNS servers.
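
To make the election concrete, here is a toy model of it in Python. It is a simplification (real VRRP breaks priority ties by comparing the servers' primary IP addresses, and keepalived does rather more than this), and the server names and priorities are made up for illustration:

    # Toy model of a VRRP election: every server advertises a priority, and
    # the highest priority a server can hear (its own included) wins the
    # floating service address.

    FLOATING_ADDRESS = "131.111.8.42"

    # Hypothetical priorities -- not our real configuration.
    advertised_priority = {
        "server-a": 200,
        "server-b": 150,
        "server-c": 100,
    }

    def elect(visible):
        """Given the priorities a server can currently see, return the
        name of the election winner."""
        return max(visible, key=visible.get)

    print(elect(advertised_priority), "takes over", FLOATING_ADDRESS)

The second floating address, 131.111.12.20, is just another instance of the same election, with the priorities arranged so that a different server wins.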

Split brains

When there is an outage in the data centre ethernet, a DNS server can no longer see VRRP multicast packets from the other servers, so it considers itself to be the highest priority server available and elects itself the winner. Of course the other servers are likely to do the same, so we end up with multiple servers that think they currently own the live service address. This "split brain" situation will be resolved when connectivity is restored.
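
Continuing the toy model from the previous sketch, this is roughly what a split brain looks like: each side of the partition only sees its own advertisements, so each side elects its own winner (again with made-up names and priorities):

    # Sketch of a split brain: a partition divides the servers into groups
    # that cannot hear each other's VRRP multicasts, and each group elects
    # its own "live" server for the same floating address.

    advertised_priority = {"server-a": 200, "server-b": 150, "server-c": 100}

    def elect(visible):
        return max(visible, key=visible.get)

    partition_1 = {"server-a"}
    partition_2 = {"server-b", "server-c"}

    for group in (partition_1, partition_2):
        visible = {name: advertised_priority[name] for name in group}
        print(sorted(group), "->", elect(visible), "claims the address")

    # Both groups now believe they own the service address; the conflict
    # only resolves once advertisements can flow between them again.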

Recovering from a split brain involves two things:

  • At the VRRP level, the servers must see each other's multicast packets and recognise their correct position in the priority order and whether they should win the election or not;

  • At the ethernet level, the service IP address and MAC address need to be associated with the right physical switch port so that the right server gets the live traffic (see the sketch below).
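
The ethernet-level step is driven by ARP: when a server takes over the floating address it broadcasts the new IP-to-MAC mapping, which also refreshes the switches' MAC learning tables. The sketch below, built with scapy, shows what such an announcement might look like; the MAC address and interface name are made up, and this illustrates the idea rather than the exact packets keepalived sends:

    # Gratuitous ARP announcing that the floating address now lives behind
    # this server's MAC address. Broadcasting it updates neighbours' ARP
    # caches, and the frame's source MAC teaches the switches which port
    # the address is now behind. Needs root to send.
    from scapy.all import ARP, Ether, sendp

    FLOATING_ADDRESS = "131.111.8.42"
    MAC = "02:00:00:00:00:2a"    # hypothetical MAC of the winning server
    IFACE = "eth0"               # hypothetical interface name

    gratuitous_arp = Ether(src=MAC, dst="ff:ff:ff:ff:ff:ff") / ARP(
        op=2,                    # ARP reply ("is-at")
        hwsrc=MAC, psrc=FLOATING_ADDRESS,
        hwdst="ff:ff:ff:ff:ff:ff", pdst=FLOATING_ADDRESS,
    )

    sendp(gratuitous_arp, iface=IFACE)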

Planned outages

Our normal practice for planned network maintenance is to move the live DNS service away from the servers that will be affected by the work, so that service doesn't have to bounce around unnecessarily. Keepalived takes a few seconds to move the live server, and the standby server will have an empty cache, neither of which is good for DNS performance.

However, the idle server that is affected by the planned maintenance will go into split brain mode. When connectivity is restored, the split brain needs to be healed, and this can involve some disruption at the ethernet level which may briefly affect the live server. There may be longer disruption if connectivity is intermittent.

Avoiding disruption

Our DNS keepalived configuration uses dynamic server priorities to move live service around without reconfiguring keepalived. Reconfiguring it requires restarting it, which can cause disruptive re-elections if the servers win the restart races in the wrong order.
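
One common way to get dynamic priorities (an assumption here, not a description of our exact configuration) is to have keepalived track a small check script: keepalived runs the script periodically and adds a configured weight to the server's effective priority while the script exits successfully, so the operator can shift priorities by changing what the script reports rather than by editing keepalived.conf. A minimal sketch of such a check script, with a made-up state file path:

    #!/usr/bin/env python3
    # Check script for keepalived's vrrp_script / track_script mechanism.
    # While it exits 0, keepalived adds the configured weight to this
    # server's effective priority; exiting 1 withdraws the boost.
    import sys
    from pathlib import Path

    STATE_FILE = Path("/run/dns/keepalived-boost")   # hypothetical path

    def main() -> int:
        try:
            wanted = STATE_FILE.read_text().strip()
        except FileNotFoundError:
            return 1
        return 0 if wanted == "boost" else 1

    if __name__ == "__main__":
        sys.exit(main())

Whichever server currently holds the boost ends up with the highest effective priority and wins the election for its floating address, without keepalived being restarted.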

To prevent split brain in planned outages, we need to exclude the affected DNS server completely. However, it isn't possible to exclude a server just using dynamic priorities. Although keepalived has a FAULT state which stops it electing itself the winner, in our version this is based on the ethernet link status and can't be scripted dynamically.

So with our current setup, the only way to stop a server from going into split brain mode is to stop keepalived completely. This removes the server from all the live and test DNS VRRP clusters. It can be safely removed from and added to the clusters, without causing unwanted re-elections, if it is also configured with the lowest dynamic priorities before keepalived is stopped.
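
Put together, taking a server out for planned work is an ordered procedure: lower its dynamic priorities, give the other servers time to win their elections, and only then stop keepalived. A rough sketch, where set_lowest_priority() stands in for whatever mechanism adjusts the dynamic priorities (hypothetical, like the check-script example above) and the systemd unit name is assumed:

    # Drain a DNS server from the VRRP clusters before planned maintenance.
    import subprocess
    import time

    def set_lowest_priority() -> None:
        # Hypothetical placeholder: e.g. remove the "boost" state so this
        # server loses the priority weight on every VRRP instance.
        pass

    def drain_and_stop() -> None:
        set_lowest_priority()
        time.sleep(10)   # let the other servers win the re-elections first
        subprocess.run(["systemctl", "stop", "keepalived"], check=True)

    # drain_and_stop()   # run as root when the server is ready to be drained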

Better processes

In the past we have generally not bothered to disable keepalived during planned network outages. However, in cases (like this morning) where the network maintenance does not go well, it can disrupt the live DNS servers even when it is not expected to.

So I think as an improvement for the future we will plan to stop keepalived on DNS servers that are going to be affected by network maintenance, just in case. We have enough DNS redundancy to be able to take one or two servers completely out of service temporarily, and still have 2N resilience.

Thanks to David Carter for reporting DNS problems this morning.