An interesting bug in BIND

2018-01-12 - Progress - Tony Finch

(This item isn't really related to progress towards a bright shiny future, but since I'm blogging here I might as well include other work-related articles.)

This week I have been helping Mark Andrews and Evan Hunt to track down a bug in BIND9. The problem manifested as named occasionally failing to re-sign a DNSSEC zone; the underlying cause was access to uninitialized memory.

It was difficult to pin down, partly because there is naturally a lot of nondeterminism in uninitialized memory bugs, but there is also a lot of nondeterminism in the DNSSEC signing process, and it is time-dependent so it is hard to re-run a failure case, and normally the DNSSEC signing process is very slow - three weeks to process a zone, by default.

Timeline

  • Oct 9 - latent bug exposed

  • Nov 12 - first signing failure

    I rebuild and restart my test DNS server quite frequently, and the bug is quite rare, which explains why it took so long to appear.

  • Nov 18 - Dec 6 - Mark fixes several signing-related bugs

  • Dec 28 - another signing failure

  • Jan 2 - I try adding some debugging diagnostics, without success

  • Jan 9 - more signing failures

  • Jan 10 - I make the bug easier to reproduce

    Mark and Evan identify a likely cause

  • Jan 11 - I confirm the cause and fix

The debugging process

The incremental re-signing code in named is tied into BIND's core rbtdb data structure (the red-black tree database). This is tricky code that I don't understand, so I mostly took a black-box approach to try to reproduce it.

I started off by trying to exercise the signing code harder. I set up a test zone with the following options:

    # signatures valid for 1 day (default is 30 days)
    # re-sign 23 hours before expiry
    # (whole zone is re-signed every hour)
    sig-validity-interval 1 23;
    # restrict the size of a batch of signing to examine
    # at most 10 names and generate at most 2 signatures
    sig-signing-nodes 10;
    sig-signing-signatures 2;

I also populated the zone with about 500 records (not counting DNSSEC records) so that several records would get re-signed each minute.

This helped a bit, but I often had to wait a long time before it went wrong. I wrote a script to monitor the zone using rndc zonestatus, so I could see if the "next resign time" matches the zone's earliest expiring signature.

There was quite a lot of flailing around trying to exercise the code harder, by making the zone bigger and changing the configuration options, but I was not successful at making the bug appear on demand.

To make it churn faster, I used dnssec-signzone to construct a version of the zone in which all the signatures expire in the next few minutes:

    rndc freeze test.example
    dig axfr test.example | grep -v RRSIG |
    dnssec-signzone -e now+$((86400 - 3600 - 200)) \
            -i 3600 -j 200 \
            -f signed -o test.example /dev/stdin
    rm -f test.example test.example.jnl
    mv signed test.example
    # re-load the zone
    rndc thaw test.example
    # re-start signing
    rndc sign test.example

I also modified BIND's re-signing co-ordination code; normally each batch will re-sign any records that are due in the next 5 seconds; I reduced that to 1 second to keep batch sizes small, on the assumption that more churn would help - which it did, a little bit.

But the bug still took a random amount of time to appear, sometimes within a few minutes, sometimes it would take ages.

Finding the bug

Mark (who knows the code very well) took a bottom-up approach; he ran named under valgrind which identified an access to uninitialized memory. (I don't know what led Mark to try valgrind - whether he does it routinely or whether he tried it just for this bug.)

Evan had not been able to reproduce the bug, but once the cause was identified it became clear where it came from.

The commit on the 9th October that exposed the bug was a change to BIND's memory management code, to stop it from deliberately filling newly-allocated memory with garbage.

Before this commit, the missing initialization was hidden by the memory fill, and the byte used to fill new allocations (0xbe) happened to have the right value (zero in the bottom bit) so the signer worked correctly.

Evan builds BIND in developer mode, which enables memory filling, which stopped him from being able to reproduce it.

Verifying the fix

I changed BIND to fill memory with 0xff which (if we were right) should provoke signing failures much sooner. And it did!

Then applying the one-character fix to remove the access to uninitialized memory made the signer work properly again.

Lessons learned

BIND has a lot of infrastructure that tries to make C safer to use, for instance:

  • Run-time assertions to ensure that internal APIs are used correctly;

  • Canary elements at the start of most objects to detect memory overruns;

  • buffer and region types to prevent memory overruns;

  • A memory management system that keeps statistics on memory usage, and helps to debug memory leaks and other mistakes.

The bug was caused by failing to use buffers well, and hidden by the memory management system.

The bug occurred when initializing an rdataslab data structure, which is an in-memory serialization of a set of DNS records. The records are copied into the rdataslab in traditional C style, without using a buffer. (This is most blatant when the code manually serializes a 16 bit number instead of using isc_buffer_putuint16.) This code is particularly ancient which might explain the poor style; I think it needs refactoring for safety.

It's ironic that the bug was hidden by the memory management code - it's supposed to help expose these kinds of bug, not hide them! Nowadays, the right approach would be to link to jemalloc or some other advanced allocator, rather than writing a complicated wrapper around standard malloc. However that wasn't an option when BIND9 development started.

Conclusion

Memory bugs are painful.