* * * * *
RAID problems
Hi. My name is Agent Conner. This is my partner, Agent Grosberg. [1] This is
our story. (insert Dragnet Theme here)
At 2130 we received the call from John the paper millionaire of a dotcom. He
has a problem. A computer problem. A major computer problem and he's calling
in the experts. That's us.
It seems that there is a problem with his RAID system (Mark, I need details
on the RAID system). Upon investigation it seems that the hardware is fine.
It's the software that is a problem. Or rather, the operating system has a
problem that leads to a corrupt file system.
> Rule 1. Just because you have RAID doesn't mean your data can't get lost or
> corrupted.
>
The operating system in question is Microsoft Windows NT 4.0 Service Pack 3.
There's a reason he's at Service Pack 3—it works with his RAID system, and
that was hard enough to get running. His entire dotcom runs under NT. All his
data, his critical data, relies upon Microsoft Windows NT to be stable.
> Rule 2. No Fortune 100 Company uses Microsoft Windows NT for financial or
> critical applications. None.
>
> Corollary 2: Microsoft is a Fortune 100 Company.
>
From our investigation we were able to ascertain that Microsoft Windows NT has
a problem with filesystems that contain over four million files. John the
paper millionaire of a dotcom has a filesystem with over four million files.
John's data is slowly being corrupted.
> Rule 3. See Rule 2.
>
John the paper millionaire of a dotcom now knows the difficulty of using
Microsoft Windows NT for a critical application. But that still doesn't help
him.
Any attempt to delete, copy, move or rename a corrupted file fails with a
modal dialog box popping up informing the user that the operating system
cannot delete, copy, move or rename said file. You have to click “OK” to make
it go away.
> Rule 4. Any software that requires user intervention can't be used in a
> server capacity.
>
The backup program John uses has failed multiple times in the face of said
files. Therefore it is proving difficult to get a reliable backup of the four
million plus files that John needs to run his business. Microsoft does have a
patch available for said bug, but the time frame required to run CHKDSK is
unacceptable, possibly taking up to four days to run.
> Rule 5. Any backup software that cannot run in the face of errors (even if
> told to ignore said file and carry on) should not be used in a server
> capacity.
>
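Rule 5 boils down to three steps: note the failure, skip the file, keep going. Here is a minimal sketch of that behavior in Python. It is not what John's backup program or GNU tar does internally; the function name `backup_tree` and its `copy` parameter are made up for illustration, and it assumes a plain file-by-file copy to a destination directory is acceptable.

```python
import os
import shutil


def backup_tree(src, dst, copy=shutil.copy2):
    """Copy every file under src into dst, skipping files that fail.

    Returns (copied, failed). `copied` is a list of source paths and
    `failed` is a list of (source path, error message) pairs.
    `copy` is a hypothetical hook so the error path can be exercised
    without owning a corrupt disk.
    """
    copied, failed = [], []
    for root, _dirs, files in os.walk(src):
        # Mirror the directory layout under dst.
        rel = os.path.relpath(root, src)
        target_dir = os.path.normpath(os.path.join(dst, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in files:
            source = os.path.join(root, name)
            try:
                copy(source, os.path.join(target_dir, name))
                copied.append(source)
            except OSError as err:
                # Record the failure and carry on. No dialog box,
                # no human required to click "OK".
                failed.append((source, str(err)))
    return copied, failed
```

Calling `backup_tree("D:/data", "E:/backup")` (both paths are placeholders) would return the list of files that made it and the list of files that didn't, which is the report an operator actually wants the next morning.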
We did manage to test the GNU tar program under Microsoft Windows NT and it
carried on, ignoring the corrupt files. But there doesn't seem to be a way to
actually reference the tape backup unit from the command line, and there is
not enough free space to back up onto disk. And the number of corrupt files
seems to be relatively few, about a hundred.
But since you can't delete, move, copy or rename the files, it's hard to work
around them. Another method would be to put the RAID system into read-only
mode and make a backup of it: swap drives in and out of the hot-swappable
RAID system to build a backup set of drives with the data on it, set up a
separate system with said RAID backup, and go from there. But we have to see
what John's bosses say to that (John became a paper millionaire of a dotcom
by having his dotcom bought out).
The case is still open …
[1]
http://www.conman.org/people/myg/