The future of the IP Register database

KSK rollover project status


I have spent the last week working on DNSSEC key rollover automation in BIND. Or rather, I have been doing some cleanup and prep work. With reference to the work I listed in the previous article...


  • Stop BIND from generating SHA-1 DS and CDS records by default, per

  • Teach dnssec-checkds about CDS and CDNSKEY


  • Teach superglue to use CDS/CDNSKEY records, with similar logic to dnssec-checkds

The "similar logic" is implemented in dnssec-dsfromkey, so I don't actually have to write the code more than once. I hope this will also be useful for other people writing similar tools!

Some of my small cleanup patches have been merged into BIND. We are currently near the end of the 9.13 development cycle, so this work is going to remain out of tree for a while until after the 9.14 stable branch is created and the 9.15 development cycle starts.


So now I need to get to grips with dnssec-coverage and dnssec-keymgr.

Simple safety interlocks

The purpose of the dnssec-checkds improvements is so that it can be used as a safety check.

During a KSK rollover, there are one or two points when the DS records in the parent need to be updated. The rollover must not continue until this update has been confirmed, or the delegation can be broken.

I am using CDS and CDNSKEY records as the signal from the key management and zone signing machinery for when DS records need to change. (There's a shell-style API in dnssec-dsfromkey -p, but that is implemented by just reading these sync records, not by looking into the guts of the key management data.) I am going to call them "sync records" so I don't have to keep writing "CDS/CDNSKEY"; "sync" is also the keyword used by dnssec-settime for controlling these records.

Key timing in BIND

The dnssec-keygen and dnssec-settime commands (which are used by dnssec-keymgr) schedule when changes to a key will happen.

There are parameters related to adding a key: when it is published in the zone, when it becomes actively used for signing, etc. And there are parameters related to removing a key: when it becomes inactive for signing, when it is deleted from the zone.

There are also timing parameters for publishing and deleting sync records. These sync times are the only timing parameters that say when we must update the delegation.

What can break?

The point of the safety interlock is to prevent any breaking key changes from being scheduled until after a delegation change has been confirmed. So what key timing events need to be forbidden from being scheduled after a sync timing event?

Events related to removing a key are particularly dangerous. There are some cases where it is OK to remove a key prematurely, if the DS record change is also about removing that key, and there is another working key and DS record throughout. But it seems simpler and safer to forbid all removal-related events from being scheduled after a sync event.

However, events related to adding a key can also lead to nonsense. If we blindly schedule creation of new keys in advance, without verifying that they are also being properly removed, then the zone can accumulate a ridiculous number of DNSKEY records. This has been observed in the wild surprisingly frequently.

A simple rule

There must be no KSK changes of any kind scheduled after the next sync event.

This rule applies regardless of the flavour of rollover (double DS, double KSK, algorithm rollover, etc.)

Applying this rule to BIND

Whereas for ZSKs, dnssec-coverage ensures rollovers are planned for some fixed period into the future, for KSKs, it must check correctness up to the next sync event, then ensure nothing will occur after that point.

In dnssec-keymgr, the logic should be:

  • If the current time is before the next sync event, ensure there is key coverage until that time and no further.

  • If the current time is after all KSK events, use dnssec-checkds to verify the delegation is in sync.

  • If dnssec-checkds reports an inconsistency and we are within some sync interval dictated by the rollover policy, do nothing while we wait for the delegation update automation to work.

  • If dnssec-checkds reports an inconsistency and the sync interval has passed, report an error because operator intervention is required to fix the failed automation.

  • If dnssec-checkds reports everything is in sync, schedule keys up to the next sync event. The timing needs to be relative to this point in time, since any delegation update delays can make it unsafe to schedule relative to the last sync event.


At the moment I am still not familiar with the internals of dnssec-coverage and dnssec-keymgr so there's a risk that I might have to re-think these plans. But I expect this simple safety rule will be a solid anchor that can be applied to most DNSSEC key management scenarios. (However I have not thought hard enough about recovery from breakage or compromise.)

DNSSEC key rollover automation with BIND


I'm currently working on filling in the missing functionality in BIND that is needed for automatic KSK rollovers. (ZSK rollovers are already automated.) All these parts exist; but they have gaps and don't yet work together.

The basic setup that will be necessary on the child is:

  • Write a policy configuration for dnssec-keymgr.

  • Write a cron job to run dnssec-keymgr at a suitable interval. If the parent does not run dnssec-cds then this cron job should also run superglue or some other program to push updates to the parent.

The KSK rollover process will be driven by dnssec-keymgr, but it will not talk directly to superglue or dnssec-cds, which make the necessary changes. In fact it can't talk to dnssec-cds because that is outside the child's control.

So, as specified in RFC 7344, the child will advertise the desired state of its delegation using CDS and CDNSKEY records. These are read by dnssec-cds or superglue to update the parent. superglue will be loosely coupled, and able to work with any DNSSEC key management softare that publishes CDS records.

The state of the keys in the child is controlled by the timing parameters in the key files, which are updated by dnssec-keymgr as determined by the policy configuration. At the moment it generates keys to cover some period into the future. For KSKs, I think it will make more sense to generate keys up to the next DS change, then stop until dnssec-checkds confirms the parent has implemented the change, before continuing. This is a bit different from the ZSK coverage model, but future coverage for KSKs can't be guaranteed because coverage depends on future interactions with an external system which cannot be assumed to work as planned.

Required work

  • Teach dnssec-checkds about CDS and CDNSKEY

  • Teach dnssec-keymgr to set "sync" timers in key files, and to invoke dnssec-checkds to avoid breaking delegations.

  • Teach dnssec-coverage to agree with dnssec-keymgr about sensible key configuration.

  • Teach superglue to use CDS/CDNSKEY records, with similar logic to dnssec-checkds

  • Stop BIND from generating SHA-1 DS and CDS records by default, per draft-ietf-dnsop-algorithm-update

IPv6 prefixes and LAN names


I have added a note to the ipreg schema wishlist that it should be possible for COs to change LAN names associated with IPv6 prefixes.

A note on prepared transactions


Some further refinements of the API behind shopping-cart style prepared transactions:

On the server side, the prepared transaction is a JSON-RPC request blob which can be updated with HTTP PUT or PATCH. Ideally the server should be able to verify that the result of the PATCH is a valid JSON-RPC blob so that it doesn't later try to perform an invalid request. I am planning to do API validity checks using JSON schema.

This design allows the prepared transaction storage to be just a simple JSON blob store, ignorant of what the blob is for except that it has to match a given schema. (I'm not super keen on nanoservices so I'll just use a table in the ipreg database to store it, but in principle there can be some nice decoupling here.)

It also suggests a more principled API design: An immediate transaction (typically requested by an API client) might look like the following (based on JSON-RPC version 1.0 system.multicall syntax):

    { jsonrpc: "2.0", id: 0,
      method: "rpc.transaction",
      params: [ { jsonrpc: "2.0", id: 1, method: ... },
                { jsonrpc: "2.0", id: 2, method: ... }, ... ] }

When a prepared transaction is requested (typically by the browser UI) it will look like:

    { jsonrpc: "2.0", id: 0,
      method: "rpc.transaction",
      params: { prepared: "#" } }

The "#" is a relative URI referring to the blob stored on the JSON-RPC endpoint (managed by the HTTP methods other than POST) - but it could in principle be any URI. (Tho this needs some thinking about SSRF security!) And I haven't yet decided if I should allow an arbitrary JSON pointer in the fragment identifier :-)

If we bring back rpc.multicall (JSON-RPC changed the reserved prefix from system. to rpc.) we gain support for prepared non-transactional batches. The native batch request format becomes a special case abbreviation of an in-line rpc.multicall request.

DNS server QA traffic


Yesterday I enabled serve-stale on our recursive DNS servers, and after a few hours one of them crashed messily. The automatic failover setup handled the crash reasonably well, and I disabled serve-stale to avoid any more crashes.

How did this crash slip through our QA processes?

Test server

My test server is the recursive resolver for my workstations, and the primary master for my personal zones. It runs a recent development snapshot of BIND. I use it to try out new features, often months before they are included in a release, and I help to shake out the bugs.

In this case I was relatively late enabling serve-stale so I was only running it for five weeks before enabling it in production.

It's hard to tell whether a longer test at this stage would have exposed the bug, because there are relatively few junk queries on my test server.


Usually when I roll out a new version of BIND, I will pre-heat the cache of an upgraded standby server before bringing it into production. This involves making about a million queries against the server based on a cache dump from a live server. This also serves as a basic smoke test that the upgrade is OK.

I didn't do a pre-heat before enabling serve-stale because it was just a config change that can be done without affecting service.

But it isn't clear that a pre-heat would have exposed this bug because the crash required a particular pattern of failing queries, and the cache dump did not contain the exact problem query (though it does contain some closely related ones).

Possible improvements?

An alternative might be to use live traffic as test data, instead of a static dump. A bit of code could read a dnstap feed on a live server, and replay the queries against another server. There are two useful modes:

  • test traffic: replay incoming (recursive client-facing) queries; this reproduces the current live full query load on another server for testing, in a way that is likely to have reproduced yesterday's crash.

  • continuous warming: replay outgoing (iterative Internet-facing) queries; these are queries used to refill the cache, so they are relatively low volume, and suitable for keeping a standby server's cache populated.

There are a few cases where researchers have expressed interest in DNS query data, of either of the above types. In order to satisfy them we would need to be able to split a full dnatap feed so that recipients only get the data they want.

This live DNS replay idea needs a similar dnstap splitter.

Transactions and JSON-RPC


The /update API endpoint that I outlined turns out to be basically JSON-RPC 2.0, so it seems to be worth making the new IP Register API follow that spec exactly.

However, there are a couple of difficulties wrt transactions.

The current not-an-API list_ops page runs each requested action in a separate transaction. It should be possible to make similar multi-transation batch requests with the new API, but my previous API outline did not support this.

A JSON-RPC batch request is a JSON array of request objects, i.e. the same syntax as I previously described for /update transactions, except that JSON-RPC batches are not transactional. This is good for preserving list_ops functionality but it loses one of the key points of the new API.

There is a simple way to fix this problem, based on a fairly well-known idea. XML-RPC doesn't have batch requests like JSON-RPC, but they were retro-fitted by defining a system.multicall method which takes an array of requests and returns an array of responses.

We can define transactional JSON-RPC requests in the same style, like this:

    { "jsonrpc": "2.0",
      "id": 0,
      "method": "transaction",
      "params": [ {
          "jsonrpc": "2.0",
          "id": 1,
          "method": "foo",
          "params": { ... }
        }, {
          "jsonrpc": "2.0",
          "id": 2,
          "method": "bar",
          "params": { ... }
        } ]

If the transaction succeeds, the outer response contains a "result" array of successful response objects, exactly one for each member of the request params array, in any order.

If the transaction fails, the outer response contains an "error" object, which has "code" and "message" members indicating a transaction failure, and an "error" member which is an array of response objects. This will contain at least one failure response; it may contain success responses (for actions which were rolled back); some responses may be missing.

Edited to add: I've described some more refinements to this idea

User interface sketch


The current IP Register user interface closely follows the database schema: you choose an object type (i.e. a table) and then you can perform whatever search/create/update/delete operations you want. This is annoying when I am looking for an object and I don't know its type, so I often end up grepping the DNS or the textual database dumps instead.

I want the new user interface to be search-oriented. The best existing example within the UIS is Lookup. The home page is mostly a search box, which takes you to a search results page, which in turn has links to per-object pages, which in turn are thoroughly hyperlinked.

Read more ...

ANAME vs aname


The IETF dnsop working group are currently discussing a draft specification for an ANAME RR type. The basic idea is that an ANAME is like a CNAME, except it only works for A and AAAA IP address queries, and it can coexist with other records such as SOA (at a zone apex) or MX.

I'm following the ANAME work with great interest because it will make certain configuration problems much simpler for us. I have made some extensive ANAME review comments.

An ANAME is rather different from what the IP Register database calls an aname object. An aname is a name for a set of existing IP addresses, which can be an arbitrary subset of the combined addresses of multiple boxes or vboxes, whereas an ANAME copies all the addresses from exactly one target name.

There is more about the general problem of aliases in the IP Register database in one of the items I posted in December. I am still unsure how the new aliasing model might work; perhaps it will become more clear when I have a better idea about how the existing aname implementation and its limitations.

High-level API design


This is just to record my thoughts about the overall shape of the IP Register API; the details are still to be determined, but see my previous notes on the data model and look at the old user interface for an idea of the actions that need to be available.

Read more ...

IP Register schema wishlist


Here are some criticisms of the IP Register database schema and some thoughts on how we might change it.

There is a lot of infrastructure work to do before I am in a position to make changes - principally, porting from Oracle to PostgreSQL, and developing a test suite so I can make changes with confidence.

Still, it's worth writing down my thoughts so far, so colleagues can see what I have in mind, and so we have some concrete ideas to discuss.

I expect to add to this list as thoughts arise.

Read more ...

Authentication and access control


The IP Register database is an application hosted on Jackdaw, which is a platform based on Oracle and Apache mod_perl.

IP Register access control

Jackdaw and Raven handle authentication, so the IP Register database only needs to concern itself with access control. It does this using views defined with check option, as is briefly described in the database overview and visible in the SQL view DDL.

There are three levels of access to the database:

  • the registrar table contains privileged users (i.e. the UIS network systems team) who have read/write access to everything via the views with the all_ prefix.

  • the areader table contains semi-privileged users (i.e. certain other UIS staff) who have read-only access to everything via the views with the ra_ prefix.

  • the mzone_co table contains normal users (i.e. computer officers in other institutions) who have read-write access to their mzone(s) via the views with the my_ prefix.

Apart from a few special cases, all the underlying tables in the database are available in all three sets of views.

IP Register user identification

The first part of the view definitions is where the IP Register database schema is tied to the authenticated user. There are two kinds of connection: either a web connection authenticated via Raven, or a direct sqlplus connection authenticated with an Oracle password.

SQL users are identified by Oracle's user function; Raven users are obtained from the sys_context() function, which we will now examine more closely.

Porting to PostgreSQL

We are fortunate that support for create view with check option was added to PostgreSQL by our colleague Dean Rasheed.

The sys_context() function is a bit more interesting.

The Jackdaw API

Jackdaw's mod_perl-based API is called WebDBI, documented at

There's some discussion of authentication and database connections at and but it is incomplete or out of date; in particular it doesn't mention Raven (and I think basic auth support has been removed).

The interesting part is the description of sessions. Each web server process makes one persistent connection to Oracle which is re-used for many HTTP requests. How is one database connection securely shared between different authenticated users, without giving the web server enormously privileged access to the database?

Jackdaw authentication - perl

Instead of mod_ucam_webauth, WebDBI has its own implementation of the Raven protocol - see jackdaw:/usr/local/src/httpd/

This mod_perl code does not do all of the work; instead it calls stored procedures to complete the authentication. On initial login it calls raven_auth.create_raven_session() and for a returning user with a cookie it calls raven_auth.use_raven_session().

Jackdaw authentication - SQL

These raven_auth stored procedures set the authenticated user that is retrieved by the sys_context() call in the IP Register views - see jackdaw:/usr/local/src/httpd/raven_auth/.

Most of the logic is written in PL/SQL, but there is also an external procedure written in C which does the core cryptography - see jackdaw:/usr/local/oracle/extproc/RavenExtproc.c.

Porting to PostgreSQL - reprise

On the whole I like Jackdaw's approach to preventing the web server from having too much privilege, so I would like to keep it, though in a simplified form.

As far as I know, PostgreSQL doesn't have anything quite like sys_context() with its security properties, though you can get similar functionality using PL/Perl.

However, in the future I want more heavy-weight sessions that have more server-side context, in particular the "shopping cart" pending transaction.

So I think a better way might be to have a privileged session table, keyed by the user's cookie and containing their username and jsonb session data, etc. This table is accessed via security definer functions, with something like Jackdaw's create_raven_session(), plus functions for getting the logged-in user (to replace sys_context()) and for manipulating the jsonb session data.

We can provide ambient access to the cookie using the set session command at the start of each web request, so the auth functions can retrieve it using the current_setting() function.

Streaming replication from PostgreSQL to the DNS


This entry is backdated - I'm writing this one year after I made this experimental prototype.

Our current DNS update mechanism runs as an hourly batch job. It would be nice to make DNS changes happen as soon as possible.

user interface matters

Instant DNS updates have tricky implications for the user interface.

At the moment it's possible to make changes to the database in between batch runs, knowing that broken intermediate states don't matter, and with plenty of time to check the changes and make sure the result will be OK.

If the DNS is updated immediately, we need a way for users to be able to prepare a set of inter-related changes, and submit them to the database as a single transaction.

(Aside: I vaguely imagine something like a shopping-cart UI that's available for collecting more complicated changes, though it should be possible to submit simple updates without a ceremonial transaction.)

This kind of UI change is necessary even is we simply run the current batch process more frequently. So we can't reasonalbly deploy this without a lot of front-end work.

back-end experiments

Ideally I would like to keep the process of exporting the database to the DNS and DHCP servers as a purely back-end matter; the front-end user interface should only be a database client.

So, assuming we have a better user interface, we would like to be able to get instant DNS updates by improvements to the back end without any help from the front end.

PostgreSQL has a very tempting replication feature called "logical decoding", which takes a replication stream and turns it into a series of database transactions. You can write a logical decoding plugin which emits these transactions in whatever format you want.

With logical decoding, we can (with a bit of programming) treat the DNS as a PostgreSQL replication target, with a script that looks something like pg_recvlogical | nsupdate.

I wrote a prototype along these lines, which is published at

status of this prototype

The plugin itself works in a fairly satisfactory manner.

However it needs a wrapper script to massage transactions before they are fed into nsupdate, mainly to split up very large transactions that cannot fit in a single UPDATE request.

The remaining difficult work is related to starting, stopping, and pausing replication without losing transactions. In particular, during initial deployment we need to be able to pause replication and verify that the replicated updates are faithfully reproducing what the batch update would have done. We can use the same pause/batch/resume mechanism to update the parts of the DNS that are not maintained in the database.

At the moment we are not doing any more work in this area until the other prerequisites are in place.