[09] WHAT SHOULD I DO IF A SYSTEM CRASHES OR LOCKS UP?

Hopefully this will never happen to you, but if you experience
'lock ups' or 'freezes', please follow these steps to reduce the
risk of losing your own data.

Also, it is important to note that you do not have a direct connection
to SDF and are most likely hopping through 10 or more networks to
get to SDF. You can use ping and traceroute to measure lag between
your computer and SDF. So, the lag you experience on SDF is subjective:
it depends on your own network path, and it is very important for you
to understand that.
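
As a rough sketch of the ping and traceroute check mentioned above,
the small shell function below runs both against one host. The name
measure_lag and the host argument are made up for illustration; this
FAQ entry does not say which host name to test, so use the one you
normally connect to.

```shell
#!/bin/sh
# measure_lag HOST
# HOST is a placeholder: pass whatever host name you normally use to
# reach SDF (it is not given in this FAQ entry).
measure_lag() {
    host="$1"
    # Send 5 echo requests; the summary shows round-trip times and loss.
    ping -c 5 "$host"
    # List each network hop between you and the host, with timings.
    traceroute "$host"
}

# Example, run from your own computer rather than from SDF:
#   measure_lag your.sdf.hostname
```

High round-trip times or packet loss at a hop far from SDF suggest
that the lag is in the network path and not on SDF itself.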

Typically a lockup will occur when you are trying to access a
file that is resident on the fileserver. For instance, say you
are trying to cat a file and instead of seeing the contents you
get either nothing or a message similar to:

ol1:/sys: not responding

Be patient. The fileserver will recover shortly and your task
will be completed. You will probably see:

ol1:/sys: is alive again

which means your request will actually begin to be processed.

During the hang, you can press ^T (CTRL-T) to display the
status of your job. For instance:

load: 2.04 cmd: tail 12966 [select] 0.00u 0.00s 0% 808k

[select] is the current state of process ID 12966, which
is the 'tail' program. If the system is waiting on actual
disk I/O, you'll probably see [biowait]. During a hang
you may see either [nfsrcvlk] (Network File System receive lock)
or [vnlock] (virtual node lock). The system will usually
recover from these, but if either state is prolonged it can
indicate a serious resource problem on the NFS client.
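
If you want to read the state out of a ^T line you have pasted
somewhere, the sketch below shows one way to do it in plain sh. The
function names wait_state and describe_state are hypothetical, and the
descriptions simply restate the states named above; any other state is
reported without interpretation.

```shell
#!/bin/sh
# Print the text inside the square brackets of a ^T status line.
wait_state() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Map a state name to the meaning given in this FAQ entry.
describe_state() {
    case "$1" in
        biowait)         echo "waiting on disk I/O" ;;
        nfsrcvlk|vnlock) echo "possible hang: wait it out" ;;
        *)               echo "other state: $1" ;;
    esac
}

# The sample line from this FAQ entry.
line='load: 2.04 cmd: tail 12966 [select] 0.00u 0.00s 0% 808k'
describe_state "$(wait_state "$line")"
```

This only interprets text you already have; it does not query the
system itself.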

In the event that the fileserver becomes unavailable, it is
important that you do not become impatient and interrupt, quit
or suspend your jobs (^C, ^\ or ^Z) but rather, wait them out.
If you are patient, your chances of losing data will be
significantly reduced. The fileserver will usually respond
within a few seconds and rarely takes longer. If the problem
is on the NFS client (vnlock for more than, say, 20 seconds),
that particular host will most likely need to be reset.

More on this: SDF is pushing NetBSD to its limits and we are
currently (2003-2004) doing quite a bit of investigation with
the uvm/vfs/vnode code developers to help NetBSD become scalable
in high-usage situations such as the loads we experience on SDF.
Solutions we find will be incorporated into the public code.