* * * * *

                         A comedic series of setbacks

[The events described herein actually happened yesterday, but technically
spilled over into today, so there you go. —Ed]

Well, that was certainly pleasant.

What was supposed to be a simple upgrade dominoed into a full scale fiasco.

But first, a bit of setup.

At The Company, we have seven name servers. Two are used by all the computers
here to resolve DNS (Domain Name System) queries only; no configuration
changes are required on these two machines, so they are not part of this
story.

Four of the name servers are authoritative name servers for our domains.
These machines will only respond to queries on the domains we host; all other
queries (like recursive DNS queries) are ignored.

To make things easier on us, the remaining name server actually hosts all the
zone files and pushes them out to the four authoritative name servers (in
effect, the four authoritative name servers are slaves of this one server,
but the outside world will never see this server). Therefore, we can make
changes to the zone files on one server and have the changes automatically
pushed out.

That still leaves the problem of new zones being added (which is a sore point
with me with regard to bind). When a new zone is added, not only does the
configuration of the one master server need changing, but the configuration
files of the four authoritative (aka (Also Known As) “slave”) servers also
need changing. While I have a script that will generate all five
configuration files, we still need to copy four of the configuration files
out, one to each of the authoritative servers.
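
As a rough illustration of what such a generator has to emit, here is a
minimal sketch in shell. The function names, file paths and the 192.0.2.1
master address are all made up for the example; it is not the actual script.
For each zone, the master gets a "type master" stanza and the authoritative
servers get a matching "type slave" stanza pointing back at the master.

```shell
#!/bin/sh
# Sketch only: names, paths, and the master address are hypothetical.

# Emit the named.conf stanza the hidden master needs for one zone.
master_stanza() {
    zone=$1
    printf 'zone "%s" {\n\ttype master;\n\tfile "master/%s";\n};\n' \
        "$zone" "$zone"
}

# Emit the stanza each authoritative (slave) server needs for one zone.
slave_stanza() {
    zone=$1
    master_ip=$2
    printf 'zone "%s" {\n\ttype slave;\n\tmasters { %s; };\n\tfile "slave/%s";\n};\n' \
        "$zone" "$master_ip" "$zone"
}

# Read zone names (one per line) on stdin and write both fragments.
generate_configs() {
    master_ip=$1
    outdir=$2
    : > "$outdir/named.master.zones"
    : > "$outdir/named.slave.zones"
    while read -r zone; do
        [ -n "$zone" ] || continue
        master_stanza "$zone" >> "$outdir/named.master.zones"
        slave_stanza "$zone" "$master_ip" >> "$outdir/named.slave.zones"
    done
}
```

Something like "printf 'example.com\n' | generate_configs 192.0.2.1 /tmp/out"
would then write one fragment for the master and one shared by the slaves.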

So I wanted to automate the copying of the new configuration files and
restarting the name servers on all the authoritative name servers when the
configuration files are created. And to do that, I needed to set up a trust
mechanism so that the server that has all the zones can copy a configuration
file and restart the nameservers without intervention. Easy enough to do with
ssh.
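
The copy-and-restart step might look something like the sketch below. It
assumes key-based root logins are already set up, and the host names, the
/etc/named.conf destination and the named init script path are guesses for
the sake of the example, not what is actually deployed.

```shell
#!/bin/sh
# Sketch only: hosts, paths, and the init script location are assumptions.

# Overridable so the commands can be dry-run (e.g. SCP="echo scp").
SCP=${SCP:-scp}
SSH=${SSH:-ssh}

# Copy one config file to one server, then restart its name server.
# Stops at the first failing step so a bad copy never triggers a restart.
push_config() {
    conf=$1
    host=$2
    $SCP "$conf" "root@$host:/etc/named.conf" || return 1
    $SSH "root@$host" /etc/rc.d/init.d/named restart || return 1
}

# Push the same config to every listed server and report any failures.
push_all() {
    conf=$1
    shift
    failed=0
    for host in "$@"; do
        if ! push_config "$conf" "$host"; then
            echo "push to $host failed" >&2
            failed=1
        fi
    done
    return "$failed"
}
```

Setting SCP="echo scp" and SSH="echo ssh" before calling push_all prints the
commands instead of running them, which is a cheap way to check the host list
before touching a live name server.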

But there were some … interoperability issues between the various machines
with respect to their various instances of ssh (basically, scp (secure copy)
didn't work due to protocol differences). Easily solved by installing the
latest version of OpenSSH [1] on each machine. Well, actually, installing the
latest version of zlib [2], then OpenSSL [3] and then OpenSSH.

This master server, the one with all the zones, didn't need this upgrade, so
I didn't bother with that. The four authoritative name servers, however,
needed the upgrades. Now, I should mention at this point that the four
authoritative name servers are all Cobalt RaQs—sure they're pretty old, but
at 1U (rack Unit, 1.75″) high and with low power consumption, they're fine for
doing the dedicated task of resolving DNS queries.

The upgrade went smoothly on three of the machines—pretty much:

> # cd zlib-1.2.3
> # ./configure
> # make
> # make install
> # cd ../openssl-0.9.8a
> # ./config
> # make
> # make install
> # cd ../openssh-4.3p1
> # ./configure
> # make
> # make install
> # /etc/rc.d/init.d/sshd stop
> # /etc/rc.d/init.d/sshd start
> #
>

On the fourth machine (which happened to be the primary of our authoritative
name servers) the make install of OpenSSH failed. Which I didn't notice.

Oops.

The result of which was a borked program that refused to run, and no backup
of the working version.

Oops.
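
In hindsight, both oopses are cheap to guard against. A minimal sketch,
assuming the install step is a single command whose exit status can be
trusted; the function name and the sshd path in the usage note are
hypothetical.

```shell
#!/bin/sh
# Sketch only: the function name and arguments are made up for illustration.

# Back up a working binary, run an install command, and put the backup
# back in place if the install command reports failure.
install_with_backup() {
    binary=$1
    shift
    backup="$binary.bak"
    cp -p "$binary" "$backup" || return 1
    if "$@"; then
        echo "install ok, previous version kept as $backup"
    else
        status=$?
        echo "install failed (exit $status), restoring $binary" >&2
        cp -p "$backup" "$binary"
        return "$status"
    fi
}
```

Used as, say, "install_with_backup /usr/local/sbin/sshd make install", a
failed install leaves the old binary where it was and makes the failure hard
to miss.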

Somehow, I ended up being logged out of the machine. And without a working
sshd there was no way I could log back into the machine.

Well, not easily.

You see, the Cobalt RaQs don't have video or keyboard ports. They're designed
as servers—they don't really need such devices. They do, however, have a
serial port you can log in through.

So I hook up a serial cable from a nearby server to the RaQ in question and
that's when I got hit with Murphy's Law [4] yet again—the serial login was
disabled.

Hrm.

Okay. Take the machine out of the rack, take the drive out of the machine,
hook it up to my workstation, change it so one can log in through the serial
port, put the drive back in, power up the machine and log in through the
serial port.
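
For the curious, "change it so one can log in through the serial port" on a
SysV-init Linux usually means adding a getty entry for the serial line to
/etc/inittab. The sketch below assumes that layout; the exact entry (agetty,
9600 baud, runlevels 2345) is a typical example and not necessarily what the
RaQ wants, and the mount point in the usage note is hypothetical.

```shell
#!/bin/sh
# Sketch only: the inittab entry is a typical SysV-init example, not the
# line actually used on the RaQ; adjust getty, speed, and runlevels.

# Append a serial getty entry to an inittab file unless an uncommented
# line already mentions ttyS0.
enable_serial_login() {
    inittab=$1
    if grep -q '^[^#].*ttyS0' "$inittab"; then
        echo "serial login already configured in $inittab"
        return 0
    fi
    echo 'S0:2345:respawn:/sbin/agetty -L 9600 ttyS0 vt100' >> "$inittab"
}
```

With the drive mounted on the workstation at, say, /mnt/raq, that would be
"enable_serial_login /mnt/raq/etc/inittab". On many systems of that era, root
logins over serial also need ttyS0 listed in /etc/securetty.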

So I start to take the machine out of the rack when I get hit with Murphy's
Law for a third time—one of the screws is stripped, so therefore I can't get
it out of the rack.

Okay, now what?

I know that Linux (which is what runs on this Cobalt RaQ) can support a
serial console. Maybe I can boot into single user mode and go from there.

Nice idea, but apparently the Linux kernel for these boxes doesn't support the
serial console (as incredible as that may seem). Yes, I can see the shell
prompt in single user mode, but everything I type just goes into the bit
bucket (and this I try several times, with different arguments to the Linux
kernel to try to get it to use a serial console). And each time I do this, I
end up having to shut the machine down, which leaves the file system in an
inconsistent state, requiring the use of fsck to fix.

Okay, so I really need to get the machine out of the rack. But how to do
that? I'm looking at the situation when I get an idea: I'll attack the
problem from a different angle. Literally! The Cobalt RaQ has two “wings”
(one on either side) which are attached by screws, and it's these “wings”
which are then screwed into the rack. I can get access to the screws holding
the wing in place. So, I effectively remove the wing from the RaQ and it
slides right out.

Then it's to my workstation. Open up the Cobalt RaQ, remove the drive, attach
said drive to the external USB (Universal Serial Bus) drive case, turn it on,
run fsck on the drive, edit the configuration to allow logins from the serial
port, umount the drive, power it down, remove it from the USB drive, put it
back into the RaQ, power it on and—

—have it fail to boot.

You see, when I “fixed” the drive using fsck on my workstation, it marked the
drive as being a newer version of the filesystem. Which the fsck on the
Cobalt RaQ doesn't support (as part of the boot up sequence, it automatically
checks the drives using fsck).
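
One way to catch this sort of thing before the drive goes back into the box
is to record what the filesystem says about itself before and after the
repair. The sketch below assumes an ext2 filesystem (this post doesn't say
which it was) and the e2fsprogs tool dumpe2fs; the device name in the
comments is hypothetical.

```shell
#!/bin/sh
# Sketch only: assumes ext2 and e2fsprogs; /dev/sdb1 is a made-up device.

# Pull the revision and feature lines out of `dumpe2fs -h` output on stdin.
fs_summary() {
    grep -E '^Filesystem (revision #|features):'
}

# Compare two saved summaries; succeed only if nothing changed.
fs_unchanged() {
    cmp -s "$1" "$2"
}

# Intended use (not run here):
#   dumpe2fs -h /dev/sdb1 2>/dev/null | fs_summary > before.txt
#   e2fsck -f /dev/sdb1
#   dumpe2fs -h /dev/sdb1 2>/dev/null | fs_summary > after.txt
#   fs_unchanged before.txt after.txt || diff before.txt after.txt
```

If the two summaries differ, the older fsck on the other machine may well
refuse the drive, and it's better to find that out at the workstation.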

Murphy strikes again.

Okay, attach the drive to my workstation, copy over a newer version of fsck,
move the drive back to the RaQ and power up—

—only to have it fail yet again. Apparently, the old version of fsck used
options that the new version of fsck doesn't like. So back to the
workstation, modify the startup scripts to remove the options fsck is
bitching about, try again, move the drive back to the workstation because I
apparently edited the wrong startup scripts and try again only to find out I
mucked it up again, so back to the workstation …

It was about half an hour of moving the drive back and forth before I got the
RaQ to finally finish booting and to the point where I could log in
successfully through the serial port.

Start the OpenSSH install from scratch.

> # tar xzvf ../archive/openssh-4.3p1.tar.gz
> # cd openssh-4.3p1
> # ./configure
>

Only now configure failed!

Huh?

I check, and gcc failed with an internal error!

I try a quick C program and yup, gcc is totally borked now.
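
That kind of quick check is easy to script. A minimal sketch, assuming
mktemp is available; the function name is made up, and the compiler to test
is passed in as an argument.

```shell
#!/bin/sh
# Sketch only: a smoke test for a C compiler, nothing more.

# Compile and run a trivial C program with the given compiler command.
# Returns 0 only if both the compile and the resulting binary succeed.
check_compiler() {
    cc=$1
    tmp=$(mktemp -d) || return 1
    printf 'int main(void) { return 0; }\n' > "$tmp/t.c"
    if "$cc" -o "$tmp/t" "$tmp/t.c" && "$tmp/t"; then
        rc=0
    else
        rc=1
    fi
    rm -rf "$tmp"
    return "$rc"
}
```

Running "check_compiler gcc || echo 'gcc is borked'" before a long configure
run says right away whether the toolchain is the problem.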

At this point, the only thing left is to reinstall the operating system.

Now, the installation procedure for the Cobalt RaQ 3s and 4s requires another
PC, which you boot using a special CD (Compact Disc). The PC in question must
have a single network port and one (1) CD drive. Anything else will confuse
the installation CD. Once this CD is booted, you then force the Cobalt RaQ to
do a netboot. This PC will see the netboot request, and feed it an
installation program which will install Linux on the Cobalt RaQ.

Because of the requirements of the installation CD, I have to use P's
computer, as it's the one that fits them. But P's computer is on its last
legs, sounding much like a dying diesel engine. But it's not dead yet.

Only it hung during the installation, having difficulty reading the CD.

I try it again. Same thing.

I turn off P's computer for half an hour. You know, let it cool down. Try it
again. Same thing.

Smirk then suggested another computer in the office.

Same thing—it hangs.

It was then I remembered something from my past experiences with installing
Cobalt RaQs: don't use the Cobalt RaQ 3 installation disks! They don't work.
Use the Cobalt RaQ 4 installation disks instead (even if you are installing
on a Cobalt RaQ 3, which I was).

That worked.

So now I had a fresh install of the operating system.

But no ssh.

Now, how to get files to the box … okay, the Cobalt RaQ has an ftp client.
So, compile an FTP server on my workstation. Then use ftp to transfer zlib,
openssl, openssh and bind to the Cobalt RaQ. Spend the next couple of hours
compiling.

Finally had it back up and running and could finish the job I started some
eight hours previously.

Blarg.

[1] http://www.openssh.org/
[2] http://www.zlib.net/
[3] http://www.openssl.org/
[4] http://en.wikipedia.org/wiki/Murphy's_law
