The Case of the Non-Mounted File System

* * * * *

I felt like I was in an episode of House [1].

I found myself at The Data Center, waiting for one of our customers, R, to
show up so I could let him in (he forgot the access code). While there, I was
attempting to extricate a KVM (Keyboard, Video, Mouse) cable he could use
when I pulled the wrong cable and unplugged a power strip.

The upshot: I took down some of R's equipment that wasn't having problems.

Sigh.

R showed up, and we checked on the equipment that had experienced the
unplanned power outage. One of his Linux boxes was in trouble. It was running
Asterisk [2] and it had the most amusing problem: it kept core dumping on an
illegal instruction and upon crashing, would restart itself [3].

But in troubleshooting that problem, it became rather apparent something else
was terribly wrong:

> GenericUnixRootPrompt# df
> Filesystem 1K-blocks Used Available Use% Mounted on
> GenericUnixRootPrompt#
>

Nothing mounted, but I could still see files. fdisk showed two partitions,
/dev/hda1 and /dev/hda2. fsck worked fine on /dev/hda1 but failed on
/dev/hda2 since it didn't know what type of filesystem was on it. Odder
still, /dev/hda1 was the boot partition, containing only the kernel and
related files required for the initial operating system boot, and yet here I
was, in a shell, running Unix commands like fsck and fdisk and more.

Yet fsck and even mount had no idea what type of filesystem was on /dev/hda2.

Yet it had to be the root filesystem, which I was currently using, because
/dev/hda1 didn't have fsck, mount or more, much less /bin/bash.

Worse still, what I did have, including /tmp, was in “read-only” mode.
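
For what it's worth, on Linux the kernel keeps its own list of mounts in
/proc/mounts, which does not depend on any file on the (possibly read-only)
root filesystem, so it is one place to look when df comes back empty. What
follows is only a sketch in Python, assuming a Linux system with /proc
mounted; the function name root_mount is made up for illustration and is not
something that was run on R's box.

```python
def root_mount(mounts_text):
    """Return (device, fstype, options) for whatever is mounted at '/'.

    mounts_text is the content of /proc/mounts. Each line has the form
    "device mountpoint fstype options dump pass". If several entries
    claim '/', the last one wins, since later mounts cover earlier ones.
    Returns None if nothing is listed for '/'.
    """
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == "/":
            # options are comma separated, e.g. "ro" or "rw,noatime"
            found = (fields[0], fields[2], fields[3].split(","))
    return found
```

Feeding it the text of /proc/mounts should give the device and filesystem
type behind /, and an "ro" in the options list would match the read-only
behavior described above.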

The Asterisk crashing problem would have to wait.

I was able to get the box on the network and back up everything to another
system. While that was chugging along (it took about an hour) I realized that
the system was somehow mounting /dev/hda2, otherwise there'd be nothing to
back up. Checking /etc/fstab didn't help much:

> GenericUnixRootPrompt# more /etc/fstab
> # This file is edited by fstab-sync - see 'man fstab-sync' for details
> /dev/hdb1 /media/cdrom auto user,noauto 0 0
> GenericUnixRootPrompt#
>

I then checked /boot/grub/grub.conf (since something was being mounted as the
root filesystem) and found that the root partition wasn't /dev/hda2 but
something like /dev/VolGroup00/LogGroup00. Using that, I was able to check and
remount the filesystem as read/write. I was then able to add that to
/etc/fstab, reboot the system and have it come up fine, thus saving R from
having to nuke-n-pave the system. How /etc/fstab ended up without the root
filesystem is something I don't know (but I suspect it may have been trying
to update that file when the power was cut—hey, it's as good a theory as
anything), but at least the system was back up and running.
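
A device path of the form /dev/VolGroup00/... is an LVM logical volume, which
presumably is why fsck and mount could not find a filesystem directly on
/dev/hda2. The lookup described above (find the root= argument on the kernel
line of grub.conf, then turn it into an /etc/fstab entry) can be sketched in
Python. This is an illustration under assumptions, not what was typed that
day: both function names are made up, the filesystem type is left as a
parameter because the post never says what it was, and "defaults 1 1" is just
the conventional set of options, dump and fsck-pass fields for a root
filesystem.

```python
def root_device_from_grub(grub_conf_text):
    """Return the value of root= from the first kernel line that has one.

    Only lines whose first word is "kernel" are examined, so GRUB's own
    "root (hd0,0)" line is not mistaken for the kernel argument.
    Returns None if no kernel line carries a root= argument.
    """
    for line in grub_conf_text.splitlines():
        words = line.split()
        if words and words[0] == "kernel":
            for word in words[1:]:
                if word.startswith("root="):
                    return word[len("root="):]
    return None


def fstab_root_line(device, fstype):
    """Build an /etc/fstab entry that mounts device at '/'.

    Fields: device, mount point, type, options, dump, fsck pass.
    fstype must be supplied by the caller; it is not guessed here.
    """
    return f"{device} / {fstype} defaults 1 1"
```

The remount itself is still a job for mount (with the remount,rw options);
the sketch only covers finding the device name and writing the missing line.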

That just left the little problem of Asterisk continuously dumping core on an
illegal instruction. A recompile of the program (since R and I thought maybe
the executable was corrupted) didn't solve the problem. A compile of the
latest version didn't solve it either, but we did notice that there were a
few modules for Asterisk installed that don't come with the default install
of Asterisk. And one of those modules had pentium4-sse3 in the name.

I checked the box—it was a Pentium IV with SSE2, not a Pentium IV with SSE3.

That would definitely explain the crashing.
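
On Linux, one way to make this kind of check is to read the flags line of
/proc/cpuinfo, where SSE3 is reported as "pni" (Prescott New Instructions)
rather than "sse3". A small Python sketch, assuming an x86 Linux box; the
function names cpu_flags and has_sse3 are invented for this example.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from /proc/cpuinfo content.

    Uses the first "flags" line found; returns an empty set if there
    is none.
    """
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def has_sse3(cpuinfo_text):
    """True if the CPU advertises SSE3, which Linux lists as 'pni'.

    Matching whole flags (not substrings) keeps the unrelated 'ssse3'
    flag from being counted as SSE3.
    """
    return "pni" in cpu_flags(cpuinfo_text)
```

A module built for SSE3 running on a CPU where has_sse3 comes back false is
exactly the setup for an illegal instruction.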

It seems that R hired someone to install a particular codec for Asterisk and
they grabbed the wrong version (or rather, the version for the wrong
processor), and the only reason Asterisk hadn't crashed before was that the
module hadn't actually been loaded into Asterisk. Well, until the reboot that
is. We removed that module and Asterisk started up fine.

And then it was time to turn to the problem that R had come to The Data
Center to investigate …

[1] http://en.wikipedia.org/wiki/House_(TV_series)
[2] http://www.asterisk.org/
[3] gopher://gopher.conman.org/0Phlog:2007/03/09.1
