[12] WHEN ARE THE SYSTEM MAINTENANCE WINDOWS? WHY THE LOW UPTIME?

Typically, the SDF Public Access UNIX System is available to its
members and, in some cases, the general public 24 hours a day,
7 days a week, 365 days a year, 10 years a decade, 25 years a
quarter century ... and so on.

That being said, there are unforeseen issues that can cause the
system to become unavailable:

1. Hard Disk Crash - We have several spare drives, some of
them already plugged in and ready to be used. In the
best case scenario, no maintenance window is required.

2. Fire - In the case of fire, all SDF machines must be shut
down unless the fire is an isolated occurrence.

3. Natural Disaster - In the spring (Apr-May) we are affected
by lightning strikes in our area due to heavy thunderstorms.
In the best case scenario, the UPS systems filter the spikes
and dips, which allows SDF to run uninterrupted.

4. Software Bug - These do crop up from time to time and are
usually related to system updates. On SDF we typically
let the public access machines lag behind NetBSD
development so that we can test new releases in our lab
before subjecting the userbase to 'new bugs'.

5. Routine and Scheduled Maintenance - Please read below.

6. Hardware Component Failure - We have many spare machines,
some completely cabled up and ready to go at the flick of
a remote command. If an SDF client host becomes completely
unrecoverable, a spare can be put into operation within
minutes. Keep in mind that while all of your personal files
are hosted on the file server, the /tmp directory is exclusive
to each SDF client host.
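
Because /tmp is local to each client host, scratch files that need to
survive a switch to a spare host are better kept under your home
directory. The following is a minimal Python sketch of that idea. The
helper name and the ~/scratch location are hypothetical, chosen only
for illustration; they are not an SDF convention.

```python
import os
import tempfile


def make_scratch_dir(base=None):
    """Create a private scratch directory outside of /tmp.

    `base` defaults to a hypothetical ~/scratch directory, which sits
    in the home directory rather than on the host-local /tmp.
    """
    if base is None:
        base = os.path.join(os.path.expanduser("~"), "scratch")
    os.makedirs(base, exist_ok=True)
    # mkdtemp creates a uniquely named directory readable only by you.
    return tempfile.mkdtemp(prefix="work-", dir=base)
```

Calling make_scratch_dir() returns the path of a new directory under
~/scratch. Remove it when you are done, since nothing cleans it up
automatically.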

ROUTINE AND SCHEDULED MAINTENANCE

There is a weekly maintenance window on Sunday mornings from 02:00
to 03:00. This window is not always used, and when it is, it is used
very briefly. Five minutes prior to a shutdown or runlevel transition,
all logged-in members will be notified on their terminals. If you see
this message alerting you to system maintenance, you should save all
open files and prepare to log out.
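
As an illustration of the weekly schedule, the Python sketch below
computes the Sunday 02:00 to 03:00 window that is in progress or
coming up next. The function name is hypothetical, and the sketch
assumes the time passed in is already in the system's local time,
since the time zone is not stated here.

```python
from datetime import datetime, timedelta


def next_maintenance_window(now):
    """Return (start, end) of the Sunday 02:00-03:00 window that is
    in progress at `now` or, failing that, the next one to come."""
    # datetime.weekday(): Monday is 0, Sunday is 6.
    days_until_sunday = (6 - now.weekday()) % 7
    sunday = now + timedelta(days=days_until_sunday)
    start = sunday.replace(hour=2, minute=0, second=0, microsecond=0)
    end = start + timedelta(hours=1)
    if now >= end:
        # Today is Sunday and the window has already closed.
        start += timedelta(days=7)
        end += timedelta(days=7)
    return start, end
```

If a shutdown were to happen right at the start of the window, the
5-minute terminal notice would appear at start - timedelta(minutes=5).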

Scheduled maintenance is always announced several days in advance on
the bboard in the <ANNOUNCE> board. If that maintenance window
requires extended time (basically anything over 5 to 10 minutes), the
/etc/motd file (displayed at login) will note the details of the event.

Scheduled maintenance is really only used when hardware upgrades have
to take place. In most cases, software updates can occur while the
systems are up and available.

WHY THE LOW UPTIME?

Uptime is relative. What we're after is 'high availability'. This
means that our goal is to have the servers answering at least 99.9%
of the time. In its 20+ years of service, SDF has been able to meet
this goal. The most uptime you'll see on any given server will be
about 3 to 4 weeks. After about 3 weeks, maintenance becomes necessary.
A reboot clears buffers, caches and other inconsistencies that can
build up as the systems run after a cold or warm boot. Rather than
waiting for the system to fail due to a kernel panic or a hang, we
perform a warm boot, which takes roughly 5 minutes or less, during
the weekly maintenance window. Keep in mind, this doesn't occur
weekly but usually after 3 to 4 weeks of continuous uptime.
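
To see why a short reboot every few weeks is compatible with the
99.9% goal, the arithmetic can be sketched in Python. The figures
used are the ones quoted above (a reboot of about 5 minutes roughly
every 3 weeks); the function names are hypothetical, and the sketch
ignores any other downtime.

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10080


def availability(downtime_minutes, period_minutes):
    """Fraction of the period during which the system is up."""
    return 1.0 - downtime_minutes / period_minutes


def downtime_budget(target, period_minutes):
    """Minutes of downtime allowed in the period at a given target."""
    return (1.0 - target) * period_minutes


three_weeks = 3 * MINUTES_PER_WEEK  # 30240 minutes

# One 5-minute warm boot in a 3-week period.
reboot_availability = availability(5, three_weeks)

# Downtime that a 99.9% target allows over the same period.
budget = downtime_budget(0.999, three_weeks)

print(f"availability with one reboot: {reboot_availability:.4%}")
print(f"99.9% budget over 3 weeks:    {budget:.2f} minutes")
```

Under these assumptions a 5-minute reboot uses about a sixth of the
roughly 30 minutes of downtime that 99.9% allows over 3 weeks.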

Why is this necessary? (aka "My box runs for years under my desk".)
We too have very low-usage, non-public NetBSD systems that run for
years without requiring a reboot. However, SDF is extremely high
volume with sophisticated NFS, NIS and VNODE caching. While these do
not cause problems under light loads, with 40,000 active users they
become an issue. Again, our goal is high availability, which doesn't
necessarily have to translate into long uptimes.