On Tuesday this week, name resolution seemed slow and occasionally unreliable. When somebody from the Help Desk mentioned this to me, I logged into the server that runs the named daemon and sent it a SIGHUP signal to make it reread its configuration. That had an effect: the server stopped answering queries altogether. The program did write out many messages saying that it could not find addresses for particular root servers. A BIND server that cannot retrieve information from the root servers is of no use for resolving names outside its own network and, I discovered, may be so busy trying and failing that it can do little else. Yet there was no reason the program should have had trouble finding the root servers: the root hints file was fine.
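For anyone who hasn't done this, "sending named a SIGHUP" amounts to reading the daemon's process ID from its pid file and signalling that process, which is what `kill -HUP $(cat /var/run/named.pid)` does from the shell. Below is a minimal Python sketch of that idea. The pid file path is an assumption: its location varies by system and by the `pid-file` option in named.conf, and the helper names are hypothetical.

```python
import os
import signal

# Assumed location; check the "pid-file" option in named.conf on your system.
PID_FILE = "/var/run/named.pid"


def read_pid(path: str = PID_FILE) -> int:
    """Return the process ID stored (as the first token) in a pid file."""
    with open(path) as f:
        return int(f.read().split()[0])


def signal_named(sig: int = signal.SIGHUP, path: str = PID_FILE) -> int:
    """Send `sig` to the process whose PID is in `path`; return that PID.

    SIGHUP asks named to reread its configuration. Signal 0 sends nothing
    and only checks that the process exists (handy for testing).
    """
    pid = read_pid(path)
    os.kill(pid, sig)
    return pid
```

Calling `signal_named()` with no arguments sends SIGHUP. Be careful with other signals: on older BIND releases SIGINT writes named_dump.db, but BIND 9 shuts down on SIGINT and dumps its cache through `rndc dumpdb` instead, so check which version you are running first.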
A DNS (Domain Name System) failure quickly stops work in many parts of a network. Email does not go out, and users cannot connect to web sites. I was one of those users, so the usual recourse of searching Google to make sense of apparently senseless error messages didn't work. After repeated restarts of the named daemon and the caching daemon, we got back to a fairly stable condition in which the name service would answer correctly, if not on the first try, then on the second.
That lasted until I tried another restart on Wednesday morning. The daemon again reported that it could not find the root servers, and restarting the named and caching daemons no longer helped. A SIGINT to the named daemon produced a dump of its state in named_dump.db, but that didn't tell me much. Eventually a BSD-oriented blog suggested that the forwarders in my configuration file could be the problem. I commented them out, restarted, and life returned to something like normal.
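One way to check whether a forwarder is the problem is to ask it a question directly and see whether it answers. The sketch below uses only the Python standard library: it builds a minimal DNS query for the root zone's NS records, following the message format in RFC 1035, sends it over UDP with the recursion-desired bit set (as a forwarding server would), and reports the response code and answer count. This is a diagnostic idea of mine, not the fix from the blog mentioned above, and the function names are hypothetical.

```python
import random
import socket
import struct


def build_query(qname: str = ".", qtype: int = 2, rd: bool = True) -> tuple[int, bytes]:
    """Build a minimal DNS query message (RFC 1035, section 4.1).

    qtype 2 is NS; the question class is always IN (1). Returns (id, message).
    """
    qid = random.randint(0, 0xFFFF)
    flags = 0x0100 if rd else 0  # RD (recursion desired) bit
    # Header: ID, flags, QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", qid, flags, 1, 0, 0, 0)
    # Encode the name as length-prefixed labels ending in a zero byte;
    # the root name "." becomes just b"\x00".
    labels = [label for label in qname.strip(".").split(".") if label]
    name = b"".join(bytes([len(label)]) + label.encode("ascii") for label in labels) + b"\x00"
    question = name + struct.pack("!HH", qtype, 1)
    return qid, header + question


def probe(server: str, port: int = 53, timeout: float = 3.0) -> tuple[int, int]:
    """Ask `server` for the root NS set; return (RCODE, answer count).

    Raises OSError (which includes socket timeouts) if nothing comes back.
    """
    qid, msg = build_query()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.sendto(msg, (server, port))
        data, _ = s.recvfrom(4096)
    rid, flags, _qdcount, ancount = struct.unpack("!HHHH", data[:8])
    if rid != qid:
        raise ValueError("response ID does not match query ID")
    return flags & 0x000F, ancount  # RCODE is the low four bits of the flags
```

Run `probe()` against each forwarder listed in named.conf (for example `probe("192.0.2.1")`, where the address is a placeholder). A healthy recursive server should return `(0, n)` with n greater than zero. A timeout, or a nonzero RCODE such as 2 (SERVFAIL), points at that forwarder.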
Clearly I need to get better at reading named_dump.db, and I need to learn more about what the forwarders in the configuration actually do.
I did notice a few things in named_dump.db that I don't ordinarily think about. There are many curious domain names out there, for one. A few, copied at random, are