Software-RAID HOWTO
 Linas Vepstas, [email protected]
 v0.54, 21 November 1998

 RAID stands for ''Redundant Array of Inexpensive Disks'', and is meant
 to be a way of creating a fast and reliable disk-drive subsystem out
 of individual disks.  RAID can guard against disk failure, and can
 also improve performance over that of a single disk drive.  This docu-
 ment is a tutorial/HOWTO/FAQ for users of the Linux MD kernel exten-
 sion, the associated tools, and their use.  The MD extension imple-
 ments RAID-0 (striping), RAID-1 (mirroring), RAID-4 and RAID-5 in
 software. That is, with MD, no special hardware or disk controllers
 are required to get many of the benefits of RAID.
 ______________________________________________________________________

 Table of Contents


 1. Introduction

 2. Understanding RAID

 3. Setup & Installation Considerations

 4. Error Recovery

 5. Troubleshooting Install Problems

 6. Supported Hardware & Software

 7. Modifying an Existing Installation

 8. Performance, Tools & General Bone-headed Questions

 9. High Availability RAID

 10. Questions Waiting for Answers

 11. Wish List of Enhancements to MD and Related Software



 ______________________________________________________________________


    Preamble
       This document is copyrighted and GPL'ed by Linas Vepstas
       ([email protected]).  Permission to use, copy, distribute this
       document for any purpose is hereby granted, provided that the
       author's / editor's name and this notice appear in all copies
       and/or supporting documents; and that an unmodified version of
       this document is made freely available.  This document is
       distributed in the hope that it will be useful, but WITHOUT ANY
       WARRANTY, either expressed or implied.  While every effort has
       been taken to ensure the accuracy of the information documented
       herein, the author / editor / maintainer assumes NO
       RESPONSIBILITY for any errors, or for any damages, direct or
       consequential, as a result of the use of the information
       documented herein.


       RAID, although designed to improve system reliability by adding
       redundancy, can also lead to a false sense of security and
       confidence when used improperly.  This false confidence can lead
       to even greater disasters.  In particular, note that RAID is
       designed to protect against *disk* failures, and not against
       *power* failures or *operator* mistakes.  Power failures, buggy
       development kernels, or operator/admin errors can lead to
       damaged data that is not recoverable!  RAID is *not* a
       substitute for proper backup of your system.  Know what you are
       doing, test, be knowledgeable and aware!

 1.  Introduction


 1. QQ: What is RAID?

      AA: RAID stands for "Redundant Array of Inexpensive Disks",
      and is meant to be a way of creating a fast and reliable
      disk-drive subsystem out of individual disks.  In the PC
      world, "I" has come to stand for "Independent", where mar-
      keting forces continue to differentiate IDE and SCSI.  In
      its original meaning, "I" meant "Inexpensive as compared to
      refrigerator-sized mainframe 3380 DASD", monster drives
      which made nice houses look cheap, and diamond rings look
      like trinkets.



 2. QQ: What is this document?

      AA: This document is a tutorial/HOWTO/FAQ for users of the
      Linux MD kernel extension, the associated tools, and their
      use.  The MD extension implements RAID-0 (striping), RAID-1
      (mirroring), RAID-4 and RAID-5 in software.   That is, with
      MD, no special hardware or disk controllers are required to
      get many of the benefits of RAID.


      This document is NOT an introduction to RAID; you must find
      this elsewhere.




 3. QQ: What levels of RAID does the Linux kernel implement?

      AA: Striping (RAID-0) and linear concatenation are a part of
      the stock 2.x series of kernels.  This code is of production
      quality; it is well understood and well maintained.  It is
      being used in some very large USENET news servers.


      RAID-1, RAID-4 & RAID-5 are a part of the 2.1.63 and greater
      kernels.  For earlier 2.0.x and 2.1.x kernels, patches exist
      that will provide this function.  Don't feel obligated to
      upgrade to 2.1.63; upgrading the kernel is hard; it is
      *much* easier to patch an earlier kernel.  Most of the RAID
      user community is running 2.0.x kernels, and that's where
      most of the historic RAID development has focused.   The
      current snapshots should be considered near-production
      quality; that is, there are no known bugs but there are some
      rough edges and untested system setups.  There are a large
      number of people using Software RAID in a production
      environment.


      RAID-1 hot reconstruction has been recently introduced
      (August 1997) and should be considered alpha quality.
      RAID-5 hot reconstruction will be alpha quality any day now.


 A word of caution about the 2.1.x development kernels: these
 are less than stable in a variety of ways.  Some of the
 newer disk controllers (e.g. the Promise Ultra's) are
 supported only in the 2.1.x kernels.  However, the 2.1.x
 kernels have seen frequent changes in the block device
 driver, in the DMA and interrupt code, in the PCI, IDE and
 SCSI code, and in the disk controller drivers.  The
 combination of these factors, coupled with cheapo hard drives
 and/or low-quality ribbon cables, can lead to considerable
 heartbreak.   The ckraid tool, as well as fsck and mount, puts
 considerable stress on the RAID subsystem.  This can lead to
 hard lockups during boot, where even the magic alt-SysReq
 key sequence won't save the day.  Use caution with the 2.1.x
 kernels, and expect trouble.  Or stick to the 2.0.34 kernel.
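
 A quick way to check which RAID personalities your running kernel
 actually provides is to look at /proc/mdstat, the same status file
 used elsewhere in this HOWTO.  The exact wording varies between
 kernel versions, so treat the output below only as a rough guide:

      cat /proc/mdstat
      # The ``Personalities :'' line lists the RAID personalities the
      # running kernel knows about (compiled in or loaded as modules).
      # If the file does not exist, the kernel has no MD support at all.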




 4. QQ: I'm running an older kernel. Where do I get patches?

      AA: Software RAID-0 and linear mode are a stock part of all
      current Linux kernels.  Patches for Software RAID-1,4,5 are
      available from
      <http://luthien.nuclecu.unam.mx/~miguel/raid>.  See also the
      quasi-mirror <ftp://linux.kernel.org/pub/linux/daemons/raid/>
      for patches, tools and other goodies.



 5. QQ: Are there other Linux RAID references?

      AA:

      +o  Generic RAID overview:
         <http://www.dpt.com/uraiddoc.html>.

      +o  General Linux RAID options:
         <http://linas.org/linux/raid.html>.

      +o  Latest version of this document:
         <http://linas.org/linux/Software-RAID/Software-
         RAID.html>.

      +o  Linux-RAID mailing list archive:
         <http://www.linuxhq.com/lnxlists/>.

      +o  Linux Software RAID Home Page:
         <http://luthien.nuclecu.unam.mx/~miguel/raid>.

      +o  Linux Software RAID tools:
         <ftp://linux.kernel.org/pub/linux/daemons/raid/>.

      +o  Setting up linear/striped Software RAID:
         <http://www.ssc.com/lg/issue17/raid.html>.

      +o  Bootable RAID mini-HOWTO:
         <ftp://ftp.bizsystems.com/pub/raid/bootable-raid>.

      +o  Root RAID HOWTO: <ftp://ftp.bizsystems.com/pub/raid/Root-
         RAID-HOWTO>.

      +o  Linux RAID-Geschichten:
         <http://www.infodrom.north.de/~joey/Linux/raid/>.



 6. QQ: Who do I blame for this document?

      AA: Linas Vepstas slapped this thing together.  However, most
      of the information, and some of the words were supplied by

      +o  Bradley Ward Allen <[email protected]>

      +o  Luca Berra <[email protected]>

      +o  Brian Candler <[email protected]>

      +o  Bohumil Chalupa <[email protected]>

      +o  Rob Hagopian <[email protected]>

      +o  Anton Hristozov <[email protected]>

      +o  Miguel de Icaza <[email protected]>

      +o  Marco Meloni <[email protected]>

      +o  Ingo Molnar <[email protected]>

      +o  Alvin Oga <[email protected]>

      +o  Gadi Oxman <[email protected]>

      +o  Vaughan Pratt <[email protected]>

      +o  Steven A. Reisman <[email protected]>

      +o  Michael Robinton <[email protected]>

      +o  Martin Schulze <[email protected]>

      +o  Geoff Thompson <[email protected]>

      +o  Edward Welbon <[email protected]>

      +o  Rod Wilkens <[email protected]>

      +o  Johan Wiltink <[email protected]>

      +o  Leonard N. Zubkoff <[email protected]>

      +o  Marc ZYNGIER <[email protected]>


         Copyrights

      +o  Copyright (C) 1994-96 Marc ZYNGIER

      +o  Copyright (C) 1997 Gadi Oxman, Ingo Molnar, Miguel de
         Icaza

      +o  Copyright (C) 1997, 1998 Linas Vepstas

      +o  By copyright law, additional copyrights are implicitly
         held by the contributors listed above.

         Thanks all for being there!





 2.  Understanding RAID


 1. QQ: What is RAID?  Why would I ever use it?

      AA: RAID is a way of combining multiple disk drives into a
      single entity to improve performance and/or reliability.
      There are a variety of different types and implementations
      of RAID, each with its own advantages and disadvantages.
      For example, by putting a copy of the same data on two disks
      (called disk mirroring, or RAID level 1), read performance
      can be improved by reading alternately from each disk in the
      mirror.  On average, each disk is less busy, as it is han-
      dling only 1/2 the reads (for two disks), or 1/3 (for three
      disks), etc.  In addition, a mirror can improve reliability:
      if one disk fails, the other disk(s) have a copy of the
      data.  Different ways of combining the disks into one,
      referred to as RAID levels, can provide greater storage
      efficiency than simple mirroring, or can alter latency
      (access-time) performance, or throughput (transfer rate)
      performance, for reading or writing, while still retaining
      redundancy that is useful for guarding against failures.

      Although RAID can protect against disk failure, it does not
      protect against operator and administrator (human) error, or
      against loss due to programming bugs (possibly due to bugs in
      the RAID software itself).  The net abounds with tragic tales
      of system administrators who have bungled a RAID installation,
      and have lost all of their data.  RAID is not a substitute for
      frequent, regularly scheduled backup.

      RAID can be implemented in hardware, in the form of special
      disk controllers, or in software, as a kernel module that is
      layered in between the low-level disk driver, and the file
      system which sits above it.  RAID hardware is always a "disk
      controller", that is, a device to which one can cable up the
      disk drives. Usually it comes in the form of an adapter card
      that will plug into an ISA/EISA/PCI/S-Bus/MicroChannel slot.
      However, some RAID controllers are in the form of a box that
      connects into the cable in between the usual system disk
      controller, and the disk drives.  Small ones may fit into a
      drive bay; large ones may be built into a storage cabinet
      with its own drive bays and power supply.  The latest RAID
      hardware used with the latest & fastest CPU will usually
      provide the best overall performance, although at a
      significant price.  This is because most RAID controllers
      come with on-board DSP's and memory cache that can off-load
      a considerable amount of processing from the main CPU, as
      well as allow high transfer rates into the large controller
      cache.  Old RAID hardware can act as a "de-accelerator" when
      used with newer CPU's: yesterday's fancy DSP and cache can
      act as a bottleneck, and its performance is often beaten by
      pure-software RAID and new but otherwise plain, run-of-the-
      mill disk controllers.  RAID hardware can offer an advantage
      over pure-software RAID, if it can make use of disk-spindle
      synchronization and its knowledge of the disk-platter
      position with regard to the disk head, and the desired disk-
      block.  However, most modern (low-cost) disk drives do not
      offer this information and level of control anyway, and
      thus, most RAID hardware does not take advantage of it.
      RAID hardware is usually not compatible across different
      brands, makes and models: if a RAID controller fails, it
      must be replaced by another controller of the same type.  As
      of this writing (June 1998), a broad variety of hardware
      controllers will operate under Linux; however, none of them
      currently come with configuration and management utilities
 that run under Linux.

 Software-RAID is a set of kernel modules, together with
 management utilities that implement RAID purely in software,
 and require no extraordinary hardware.  The Linux RAID
 subsystem is implemented as a layer in the kernel that sits
 above the low-level disk drivers (for IDE, SCSI and Paraport
 drives), and the block-device interface.  The filesystem, be
 it ext2fs, DOS-FAT, or other, sits above the block-device
 interface.  Software-RAID, by its very software nature,
 tends to be more flexible than a hardware solution.  The
 downside is that it of course requires more CPU cycles and
 power to run well than a comparable hardware system.  Of
 course, the cost can't be beat.  Software RAID has one
 further important distinguishing feature: it operates on a
 partition-by-partition basis, where a number of individual
 disk partitions are ganged together to create a RAID
 partition.  This is in contrast to most hardware RAID
 solutions, which gang together entire disk drives into an
 array.  With hardware, the fact that there is a RAID array
 is transparent to the operating system, which tends to
 simplify management.  With software, there are far more
 configuration options and choices, tending to complicate
 matters.

 As of this writing (June 1998), the administration of RAID
 under Linux is far from trivial, and is best attempted by
 experienced system administrators.  The theory of operation
 is complex.  The system tools require modification to
 startup scripts.  And recovery from disk failure is non-
 trivial, and prone to human error.   RAID is not for the
 novice, and any benefits it may bring to reliability and
 performance can be easily outweighed by the extra
 complexity.  Indeed, modern disk drives are incredibly
 reliable and modern CPU's and controllers are quite
 powerful.  You might more easily obtain the desired
 reliability and performance levels by purchasing higher-
 quality and/or faster hardware.



 2. QQ: What are RAID levels?  Why so many? What distinguishes them?

      AA: The different RAID levels have different performance,
      redundancy, storage capacity, reliability and cost charac-
      teristics.   Most, but not all levels of RAID offer redun-
      dancy against disk failure.  Of those that offer redundancy,
      RAID-1 and RAID-5 are the most popular.  RAID-1 offers bet-
      ter performance, while RAID-5 provides for more efficient
      use of the available storage space.  However, tuning for
      performance is an entirely different matter, as performance
      depends strongly on a large variety of factors, from the
      type of application, to the sizes of stripes, blocks, and
      files.  The more difficult aspects of performance tuning are
      deferred to a later section of this HOWTO.

      The following describes the different RAID levels in the
      context of the Linux software RAID implementation.


      +o  RAID-linear is a simple concatenation of partitions to
         create a larger virtual partition.  It is handy if you
         have a number of small drives, and wish to create a single,
         large partition.  This concatenation offers no
         redundancy, and in fact decreases the overall
         reliability: if any one disk fails, the combined
    partition will fail.



 +o  RAID-1 is also referred to as "mirroring".  Two (or more)
     partitions, all of the same size, each store an exact
     copy of all data, disk-block by disk-block.  Mirroring
     gives strong protection against disk failure: if one disk
     fails, there is another with an exact copy of the
     same data. Mirroring can also help improve performance in
    I/O-laden systems, as read requests can be divided up
    between several disks.   Unfortunately, mirroring is also
    the least efficient in terms of storage: two mirrored
    partitions can store no more data than a single
    partition.



 +o  Striping is the underlying concept behind all of the
    other RAID levels.  A stripe is a contiguous sequence of
    disk blocks.  A stripe may be as short as a single disk
    block, or may consist of thousands.  The RAID drivers
    split up their component disk partitions into stripes;
    the different RAID levels differ in how they organize the
    stripes, and what data they put in them. The interplay
    between the size of the stripes, the typical size of
    files in the file system, and their location on the disk
    is what determines the overall performance of the RAID
    subsystem.



 +o  RAID-0 is much like RAID-linear, except that the
    component partitions are divided into stripes and then
    interleaved.  Like RAID-linear, the result is a single
    larger virtual partition.  Also like RAID-linear, it
    offers no redundancy, and therefore decreases overall
    reliability: a single disk failure will knock out the
    whole thing.  RAID-0 is often claimed to improve
    performance over the simpler RAID-linear.  However, this
     may or may not be true, depending on the characteristics
     of the file system, the typical size of the file as
    compared to the size of the stripe, and the type of
    workload.  The ext2fs file system already scatters files
    throughout a partition, in an effort to minimize
    fragmentation. Thus, at the simplest level, any given
    access may go to one of several disks, and thus, the
    interleaving of stripes across multiple disks offers no
    apparent additional advantage. However, there are
    performance differences, and they are data, workload, and
    stripe-size dependent.



 +o  RAID-4 interleaves stripes like RAID-0, but it requires
    an additional partition to store parity information.  The
    parity is used to offer redundancy: if any one of the
    disks fail, the data on the remaining disks can be used
    to reconstruct the data that was on the failed disk.
    Given N data disks, and one parity disk, the parity
    stripe is computed by taking one stripe from each of the
     data disks, and XOR'ing them together.  Thus, the storage
     capacity of an (N+1)-disk RAID-4 array is N, which is a
     lot better than mirroring (N+1) drives, and is almost as
     good as a RAID-0 setup for large N.  Note that for N=1,
     where there is one data drive, and one parity drive,
     RAID-4 is a lot like mirroring, in that each of the two
     disks is a copy of each other.  However, RAID-4 does NOT
     offer the read-performance of mirroring, and offers
     considerably degraded write performance. In brief, this
     is because updating the parity requires a read of the old
     parity, before the new parity can be calculated and
     written out.  In an environment with lots of writes, the
     parity disk can become a bottleneck, as each write must
     access the parity disk.  (A small worked example of the
     XOR parity arithmetic follows at the end of this list.)



 +o  RAID-5 avoids the write-bottleneck of RAID-4 by
    alternately storing the parity stripe on each of the
    drives.  However, write performance is still not as good
    as for mirroring, as the parity stripe must still be read
    and XOR'ed before it is written.  Read performance is
    also not as good as it is for mirroring, as, after all,
    there is only one copy of the data, not two or more.
     RAID-5's principal advantage over mirroring is that it
    offers redundancy and protection against single-drive
    failure, while offering far more storage capacity  when
    used with three or more drives.



 +o  RAID-2 and RAID-3 are seldom used anymore, and to some
     degree have been made obsolete by modern disk
    technology.  RAID-2 is similar to RAID-4, but stores ECC
    information instead of parity.  Since all modern disk
    drives incorporate ECC under the covers, this offers
    little additional protection.  RAID-2 can offer greater
    data consistency if power is lost during a write;
    however, battery backup and a clean shutdown can offer
    the same benefits.  RAID-3 is similar to RAID-4, except
    that it uses the smallest possible stripe size. As a
    result, any given read will involve all disks, making
    overlapping I/O requests difficult/impossible. In order
    to avoid delay due to rotational latency, RAID-3 requires
    that all disk drive spindles be synchronized. Most modern
    disk drives lack spindle-synchronization ability, or, if
    capable of it, lack the needed connectors, cables, and
    manufacturer documentation.  Neither RAID-2 nor RAID-3
    are supported by the Linux Software-RAID drivers.



 +o  Other RAID levels have been defined by various
    researchers and vendors.  Many of these represent the
    layering of one type of raid on top of another.  Some
    require special hardware, and others are protected by
    patent. There is no commonly accepted naming scheme for
     these other levels. Sometimes the advantages of these
    other systems are minor, or at least not apparent until
    the system is highly stressed.  Except for the layering
    of RAID-1 over RAID-0/linear, Linux Software RAID does
    not support any of the other variations.
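
 To make the parity arithmetic described under RAID-4 concrete, here
 is a tiny, self-contained shell sketch.  The byte values are made up
 purely for illustration; it simply shows that the parity stripe is
 the XOR of the data stripes, and that a lost stripe can be rebuilt
 by XOR'ing the survivors with the parity:

      # Three data "stripes" (single bytes, purely for illustration)
      d1=0x6c d2=0x0f d3=0x35

      # The parity stripe is the XOR of the data stripes
      parity=$(( d1 ^ d2 ^ d3 ))
      printf 'parity           = 0x%02x\n' "$parity"

      # Pretend the disk holding d2 died: rebuild it from the survivors
      printf 'reconstructed d2 = 0x%02x\n' $(( d1 ^ d3 ^ parity ))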



 3.  Setup & Installation Considerations


 1. QQ: What is the best way to configure Software RAID?


 AA: I keep rediscovering that file-system planning is one of
 the more difficult Unix configuration tasks.  To answer your
 question, I can describe what we did.

 We planned the following setup:

 +o  two EIDE disks, 2.1.gig each.


      disk partition mount pt.  size    device
        1      1       /        300M   /dev/hda1
        1      2       swap      64M   /dev/hda2
        1      3       /home    800M   /dev/hda3
        1      4       /var     900M   /dev/hda4

        2      1       /root    300M   /dev/hdc1
        2      2       swap      64M   /dev/hdc2
        2      3       /home    800M   /dev/hdc3
        2      4       /var     900M   /dev/hdc4





 +o  Each disk is on a separate controller (& ribbon cable).
    The theory is that a controller failure and/or ribbon
    failure won't disable both disks.  Also, we might
    possibly get a performance boost from parallel operations
    over two controllers/cables.

 +o  Install the Linux kernel on the root (/) partition
    /dev/hda1.  Mark this partition as bootable.

 +o  /dev/hdc1 will contain a ``cold'' copy of /dev/hda1. This
    is NOT a raid copy, just a plain old copy-copy. It's
    there just in case the first disk fails; we can use a
    rescue disk, mark /dev/hdc1 as bootable, and use that to
    keep going without having to reinstall the system.  You
    may even want to put /dev/hdc1's copy of the kernel into
    LILO to simplify booting in case of failure.

    The theory here is that in case of severe failure, I can
    still boot the system without worrying about raid
    superblock-corruption or other raid failure modes &
    gotchas that I don't understand.

 +o  /dev/hda3 and /dev/hdc3 will be mirrored as /dev/md0.

 +o  /dev/hda4 and /dev/hdc4 will be mirrored as /dev/md1.

 +o  we picked /var and /home to be mirrored, and in separate
    partitions, using the following logic:

 +o  / (the root partition) will contain relatively static,
    non-changing data: for all practical purposes, it will be
    read-only without actually being marked & mounted read-
    only.

 +o  /home will contain ''slowly'' changing data.

 +o  /var will contain rapidly changing data, including mail
    spools, database contents and web server logs.

    The idea behind using multiple, distinct partitions is
     that if, for some bizarre reason, whether it is human
    error, power loss, or an operating system gone wild,
    corruption is limited to one partition.  In one typical
    case, power is lost while the system is writing to disk.
    This will almost certainly lead to a corrupted
    filesystem, which will be repaired by fsck during the
     next boot.  Although fsck does its best to make the
    repairs without creating additional damage during those
    repairs, it can be comforting to know that any such
    damage has been limited to one partition.  In another
    typical case, the sysadmin makes a mistake during rescue
    operations, leading to erased or destroyed data.
    Partitions can help limit the repercussions of the
    operator's errors.

 +o  Other reasonable choices for partitions might be /usr or
    /opt.  In fact, /opt and /home make great choices for
    RAID-5 partitions, if we had more disks.  A word of
     caution: DO NOT put /usr in a RAID-5 partition.  If a
    serious fault occurs, you may find that you cannot mount
    /usr, and that you want some of the tools on it (e.g. the
    networking tools, or the compiler.)  With RAID-1, if a
    fault has occurred, and you can't get RAID to work, you
    can at least mount one of the two mirrors.  You can't do
    this with any of the other RAID levels (RAID-5, striping,
    or linear append).



    So, to complete the answer to the question:

 +o  install the OS on disk 1, partition 1.  do NOT mount any
    of the other partitions.

 +o  install RAID per instructions.

 +o  configure md0 and md1 (a command sketch follows at the
     end of this list).

 +o  convince yourself that you know what to do in case of a
    disk failure!  Discover sysadmin mistakes now, and not
    during an actual crisis.  Experiment!  (we turned off
    power during disk activity -- this proved to be ugly but
    informative).

 +o  do some ugly mount/copy/unmount/rename/reboot scheme to
     move /var over to /dev/md1.  Done carefully, this is
    not dangerous.

 +o  enjoy!
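
 As a minimal sketch of the ``configure md0 and md1'' step above,
 using the md-style tools named throughout this HOWTO and the
 partition layout from the table.  The configuration-file syntax and
 exact tool invocations depend on your raidtools version, so treat
 this only as an outline:

      # One-time initialization of each mirror set with mkraid and
      # its config file (one config file per array; see later sections):
      #     mkraid <your-raid1-config-file>

      # Assemble and start the /home mirror: /dev/hda3 + /dev/hdc3 -> /dev/md0
      mdadd /dev/md0 /dev/hda3 /dev/hdc3
      mdrun -p1 /dev/md0

      # Assemble and start the /var mirror: /dev/hda4 + /dev/hdc4 -> /dev/md1
      mdadd /dev/md1 /dev/hda4 /dev/hdc4
      mdrun -p1 /dev/md1

      # Put filesystems on the new md devices
      mke2fs /dev/md0
      mke2fs /dev/md1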




 2. QQ: What is the difference between the mdadd, mdrun, etc. commands,
    and the raidadd, raidrun commands?

      AA: The names of the tools have changed as of the 0.5 release
      of the raidtools package.  The md naming convention was used
      in the 0.43 and older versions, while raid is used in 0.5
      and newer versions.



 3. QQ: I want to run RAID-linear/RAID-0 in the stock 2.0.34 kernel.  I
    don't want to apply the raid patches, since these are not needed
    for RAID-0/linear.  Where can I get the raid-tools to manage this?


 AA: This is a tough question, indeed, as the newest raid
 tools package needs to have the RAID-1,4,5 kernel patches
 installed in order to compile.  I am not aware of any pre-
 compiled, binary version of the raid tools that is available
 at this time.  However, experiments show that the raid-tools
 binaries, when compiled against kernel 2.1.100, seem to work
 just fine in creating a RAID-0/linear partition under
 2.0.34.  A brave soul has asked for these, and I've
 temporarily placed the binaries mdadd, mdcreate, etc. at
 <http://linas.org/linux/Software-RAID/>.  You must get the man
 pages, etc. from the usual raid-tools package.




 4. QQ: Can I stripe/mirror the root partition (/)?  Why can't I boot
    Linux directly from the md disks?


      AA: Both LILO and Loadlin need a non-striped/non-mirrored par-
      tition to read the kernel image from.  If you want to
      stripe/mirror the root partition (/), then you'll want to
      create a non-striped/non-mirrored partition to hold the ker-
      nel(s).  Typically, this partition is named /boot.  Then you
      either use the initial ramdisk support (initrd), or patches
      from Harald Hoyer <[email protected]> that allow a striped
      partition to be used as the root device.  (These patches are
      now a standard part of recent 2.1.x kernels.)


      There are several approaches that can be used.  One approach
      is documented in detail in the Bootable RAID mini-HOWTO:
      <ftp://ftp.bizsystems.com/pub/raid/bootable-raid>.


      Alternately, use mkinitrd to build the ramdisk image, see
      below.


      Edward Welbon <[email protected]> writes:

      +o  ... all that is needed is a script to manage the boot
         setup.  To mount an md filesystem as root, the main thing
         is to build an initial file system image that has the
         needed modules and md tools to start md.  I have a simple
         script that does this.


      +o  For boot media, I have a small cheap SCSI disk (170MB; I
         got it used for $20).  This disk runs on an AHA1452, but
         it could just as well be an inexpensive IDE disk on the
         native IDE.  The disk need not be very fast since it is
         mainly for boot.


      +o  This disk has a small file system which contains the
         kernel and the file system image for initrd.  The initial
         file system image has just enough stuff to allow me to
         load the raid SCSI device driver module and start the
         raid partition that will become root.  I then do an


      echo 0x900 > /proc/sys/kernel/real-root-dev



 (0x900 is for /dev/md0) and exit linuxrc.  The boot proceeds
 normally from there.  (A skeletal linuxrc along these lines is
 sketched just after this list.)


 +o  I have built most support as a module except for the
    AHA1452 driver that brings in the initrd filesystem.  So
    I have a fairly small kernel. The method is perfectly
    reliable, I have been doing this since before 2.1.26 and
    have never had a problem that I could not easily recover
    from.  The file systems even survived several 2.1.4[45]
    hard crashes with no real problems.


 +o  At one time I had partitioned the raid disks so that the
     initial cylinders of the first raid disk held the kernel
     and the initial cylinders of the second raid disk held
     the initial file system image; instead, I made the initial
     cylinders of the raid disks swap, since they are the
     fastest cylinders (why waste them on boot?).


 +o  The nice thing about having an inexpensive device
    dedicated to boot is that it is easy to boot from and can
    also serve as a rescue disk if necessary. If you are
    interested, you can take a look at the script that builds
    my initial ram disk image and then runs LILO.

      <http://www.realtime.net/~welbon/initrd.md.tar.gz>


 It is current enough to show the picture.  It isn't espe-
 cially pretty and it could certainly build a much smaller
 filesystem image for the initial ram disk.  It would be easy
 to make it more efficient.  But it uses LILO as is.  If
 you make any improvements, please forward a copy to me. 8-)
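
 A skeletal linuxrc along the lines Edward describes might look like
 the following.  The module names, device names and paths are only
 placeholders for whatever your own initrd actually contains, so
 treat this as a sketch rather than a drop-in script:

      #!/bin/sh
      # skeletal /linuxrc for an initrd that assembles an md root
      /sbin/insmod /lib/scsi-controller.o   # your low-level disk driver (placeholder)
      /sbin/insmod /lib/raid1.o             # md personality for the root array
      /sbin/mdadd /dev/md0 /dev/sda1 /dev/sdb1
      /sbin/mdrun -p1 /dev/md0
      # 0x900 is major 9, minor 0, i.e. /dev/md0
      echo 0x900 > /proc/sys/kernel/real-root-dev
      exit 0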



 5. QQ: I have heard that I can run mirroring over striping. Is this
    true?  Can I run mirroring over the loopback device?

      AA: Yes, but not the reverse.  That is, you can put a stripe
      over several disks, and then build a mirror on top of this.
      However, striping cannot be put on top of mirroring.


      A brief technical explanation is that the linear and stripe
      personalities use the ll_rw_blk routine for access.  The
      ll_rw_blk routine maps disk devices and  sectors, not
      blocks.  Block devices can be layered one on top of the
      other; but devices that do raw, low-level disk accesses,
      such as ll_rw_blk, cannot.


      Currently (November 1997) RAID cannot be run over the
      loopback devices, although this should be fixed shortly.



 6. QQ: I have two small disks and three larger disks.  Can I
    concatenate the two smaller disks with RAID-0, and then create a
    RAID-5 out of that and the larger disks?

      AA: Currently (November 1997), for a RAID-5 array, no.  Cur-
      rently, one can do this only for a RAID-1 on top of the con-
      catenated drives.
 7. QQ: What is the difference between RAID-1 and RAID-5 for a two-disk
    configuration (i.e. the difference between a RAID-1 array  built
    out of two disks, and a RAID-5 array built out of two disks)?


      AA: There is no difference in storage capacity.  Nor can
      disks be added to either array to increase capacity (see the
      question below for details).


      RAID-1 offers a performance advantage for reads: the RAID-1
      driver uses distributed-read technology to simultaneously
      read two sectors, one from each drive, thus doubling read
      performance.


      The RAID-5 driver, although it contains many optimizations,
      does not currently (September 1997) realize that the parity
      disk is actually a mirrored copy of the data disk.  Thus, it
      serializes data reads.




 8. QQ: How can I guard against a two-disk failure?


      AA: Some of the RAID algorithms do guard against multiple
      disk failures, but these are not currently implemented for
      Linux.  However, the Linux Software RAID can guard against
      multiple disk failures by layering an array on top of an
      array.  For example, nine disks can be used to create three
      raid-5 arrays.  Then these three arrays can in turn be
      hooked together into a single RAID-5 array on top.  In fact,
      this kind of a configuration will guard against a three-disk
      failure.  Note that a large amount of disk space is
      ''wasted'' on the redundancy information.



          For an NxN raid-5 array,
          N=3, 5 out of 9 disks are used for parity (=55%)
          N=4, 7 out of 16 disks
          N=5, 9 out of 25 disks
          ...
          N=9, 17 out of 81 disks (=~20%)





 In general, an MxN array will use M+N-1 disks for parity.
 The least amount of space is "wasted" when M=N.


 Another alternative is to create a RAID-1 array with three
 disks.  Note that since all three disks contain identical
 data, 2/3 of the space is ''wasted''.




 9. QQ: I'd like to understand  how it'd be possible to have something
    like fsck: if the partition hasn't been cleanly unmounted, fsck
    runs and fixes the filesystem by itself more than 90% of the time.
    Since the machine is capable of fixing it by itself with ckraid
    --fix, why not make it automatic?



      AA: This can be done by adding lines like the following to
      /etc/rc.d/rc.sysinit:

          mdadd /dev/md0 /dev/hda1 /dev/hdc1 || {
              ckraid --fix /etc/raid.usr.conf
              mdadd /dev/md0 /dev/hda1 /dev/hdc1
          }



      or

          mdrun -p1 /dev/md0
          if [ $? -gt 0 ] ; then
                  ckraid --fix /etc/raid1.conf
                  mdrun -p1 /dev/md0
          fi



      Before presenting a more complete and reliable script, let's
      review the theory of operation.

      Gadi Oxman writes: In an unclean shutdown, Linux might be in
      one of the following states:

      +o  The in-memory disk cache was in sync with the RAID set
         when the unclean shutdown occurred; no data was lost.

      +o  The in-memory disk cache was newer than the RAID set
         contents when the crash occurred; this results in a
         corrupted filesystem and potentially in data loss.

         This state can be further divided into the following two
         states:


      +o  Linux was writing data when the unclean shutdown
         occurred.

      +o  Linux was not writing data when the crash occurred.


         Suppose we were using a RAID-1 array. In (2a), it might
         happen that before the crash, a small number of data
         blocks were successfully written only to some of the
         mirrors, so that on the next reboot, the mirrors will no
         longer contain the same data.

         If we were to ignore the mirror differences, the
         raidtools-0.36.3 read-balancing code might choose to read
         the above data blocks from any of the mirrors, which will
         result in inconsistent behavior (for example, the output
         of e2fsck -n /dev/md0 can differ from run to run).


         Since RAID doesn't protect against unclean shutdowns,
         usually there isn't any ''obviously correct'' way to fix
         the mirror differences and the filesystem corruption.

         For example, by default ckraid --fix will choose the
         first operational mirror and update the other mirrors
    with its contents.  However, depending on the exact
    timing at the crash, the data on another mirror might be
    more recent, and we might want to use it as the source
    mirror instead, or perhaps use another method for
    recovery.

    The following script provides one of the more robust
    boot-up sequences.  In particular, it guards against
    long, repeated ckraid's in the presence of uncooperative
    disks, controllers, or controller device drivers.  Modify
    it to reflect your config, and copy it to rc.raid.init.
    Then invoke rc.raid.init after the root partition has
    been fsck'ed and mounted rw, but before the remaining
    partitions are fsck'ed.  Make sure the current directory
    is in the search path.

        mdadd /dev/md0 /dev/hda1 /dev/hdc1 || {
            rm -f /fastboot             # force an fsck to occur
            ckraid --fix /etc/raid.usr.conf
            mdadd /dev/md0 /dev/hda1 /dev/hdc1
        }
        # if a crash occurs later in the boot process,
        # we at least want to leave this md in a clean state.
        /sbin/mdstop /dev/md0

        mdadd /dev/md1 /dev/hda2 /dev/hdc2 || {
            rm -f /fastboot             # force an fsck to occur
            ckraid --fix /etc/raid.home.conf
            mdadd /dev/md1 /dev/hda2 /dev/hdc2
        }
        # if a crash occurs later in the boot process,
        # we at least want to leave this md in a clean state.
        /sbin/mdstop /dev/md1

        mdadd /dev/md0 /dev/hda1 /dev/hdc1
        mdrun -p1 /dev/md0
        if [ $? -gt 0 ] ; then
            rm -f /fastboot             # force an fsck to occur
            ckraid --fix /etc/raid.usr.conf
            mdrun -p1 /dev/md0
        fi
        # if a crash occurs later in the boot process,
        # we at least want to leave this md in a clean state.
        /sbin/mdstop /dev/md0

        mdadd /dev/md1 /dev/hda2 /dev/hdc2
        mdrun -p1 /dev/md1
        if [ $? -gt 0 ] ; then
            rm -f /fastboot             # force an fsck to occur
            ckraid --fix /etc/raid.home.conf
            mdrun -p1 /dev/md1
        fi
        # if a crash occurs later in the boot process,
        # we at least want to leave this md in a clean state.
        /sbin/mdstop /dev/md1

        # OK, just blast through the md commands now.  If there were
        # errors, the above checks should have fixed things up.
        /sbin/mdadd /dev/md0 /dev/hda1 /dev/hdc1
        /sbin/mdrun -p1 /dev/md0

        /sbin/mdadd /dev/md1 /dev/hda2 /dev/hdc2
        /sbin/mdrun -p1 /dev/md1



 In addition to the above, you'll want to create a
 rc.raid.halt which should look like the following:

     /sbin/mdstop /dev/md0
     /sbin/mdstop /dev/md1



 Be sure to modify both rc.sysinit and init.d/halt to include
 this everywhere that filesystems get unmounted before a
 halt/reboot.  (Note that rc.sysinit unmounts and reboots if
 fsck returned with an error.)




 10.
    QQ: Can I set up one-half of a RAID-1 mirror with the one disk I
    have now, and then later get the other disk and just drop it in?


      AA: With the current tools, no, not in any easy way.  In par-
      ticular, you cannot just copy the contents of one disk onto
      another, and then pair them up.  This is because the RAID
      drivers use a glob of space at the end of the partition to
      store the superblock.  This decreases the amount of space
      available to the file system slightly; if you just naively
      try to force a RAID-1 arrangement onto a partition with an
      existing filesystem, the raid superblock will overwrite a
      portion of the file system and mangle data.  Since the
      ext2fs filesystem scatters files randomly throughout the
      partition (in order to avoid fragmentation), there is a very
      good chance that some file will land at the very end of a
      partition long before the disk is full.


      If you are clever, I suppose you can calculate how much room
      the RAID superblock will need, and make your filesystem
      slightly smaller, leaving room for it when you add it later.
      But then, if you are this clever, you should also be able to
      modify the tools to do this automatically for you.  (The
      tools are not terribly complex).


      Note: A careful reader has pointed out that the following
      trick may work; I have not tried or verified this: Do the
      mkraid with /dev/null as one of the devices.  Then mdadd -r
      with only the single, true disk (do not mdadd /dev/null).
      The mkraid should have successfully built the raid array,
      while the mdadd step just forces the system to run in
      "degraded" mode, as if one of the disks had failed.



 4.  Error Recovery


 1. QQ: I have a RAID-1 (mirroring) setup, and lost power while there
    was disk activity.  Now what do I do?


      AA: The redundancy of RAID levels is designed to protect
      against a disk failure, not against a power failure.

      There are several ways to recover from this situation.

 +o  Method (1): Use the raid tools.  These can be used to
    sync the raid arrays.  They do not fix file-system
    damage; after the raid arrays are sync'ed, then the file-
    system still has to be fixed with fsck.  Raid arrays can
    be checked with ckraid /etc/raid1.conf (for RAID-1, else,
    /etc/raid5.conf, etc.)

    Calling ckraid /etc/raid1.conf --fix will pick one of the
    disks in the array (usually the first), and use that as
    the master copy, and copy its blocks to the others in the
    mirror.

    To designate which of the disks should be used as the
    master, you can use the --force-source flag: for example,
    ckraid /etc/raid1.conf --fix --force-source /dev/hdc3

    The ckraid command can be safely run without the --fix
    option to verify the inactive RAID array without making
    any changes.  When you are comfortable with the proposed
    changes, supply the --fix  option.


 +o  Method (2): Paranoid, time-consuming, not much better
     than the first way.  Let's assume a two-disk RAID-1 array,
    consisting of partitions /dev/hda3 and /dev/hdc3.  You
    can try the following:

    a. fsck /dev/hda3

    b. fsck /dev/hdc3

    c. decide which of the two partitions had fewer errors,
       or were more easily recovered, or recovered the data
       that you wanted.  Pick one, either one, to be your new
       ``master'' copy.  Say you picked /dev/hdc3.

    d. dd if=/dev/hdc3 of=/dev/hda3

    e. mkraid raid1.conf -f --only-superblock


    Instead of the last two steps, you can instead run ckraid
    /etc/raid1.conf --fix --force-source /dev/hdc3 which
    should be a bit faster.

 +o  Method (3): Lazy man's version of above.  If you don't
     want to wait for long fsck's to complete, it is perfectly
     fine to skip the first three steps above, and move
     directly to the last two steps.  Just be sure to run fsck
     /dev/md0 after you are done.  Method (3) is actually just
     method (1) in disguise.  (The combined command sequence is
     sketched at the end of this answer.)


    In any case, the above steps will only sync up the raid
    arrays.  The file system probably needs fixing as well:
    for this, fsck needs to be run on the active, unmounted
    md device.


    With a three-disk RAID-1 array, there are more
    possibilities, such as using two disks to ''vote'' a
    majority answer.  Tools to automate this do not currently
    (September 97) exist.
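
     Putting method (3) together into a single hedged sequence, with
     the same devices and config file used in the examples above
     (adjust the names to your own setup):

         # pick /dev/hdc3 as the master copy and resync the mirror from it
         ckraid /etc/raid1.conf --fix --force-source /dev/hdc3

         # bring the array back up and repair the filesystem on top of it
         mdadd /dev/md0 /dev/hda3 /dev/hdc3
         mdrun -p1 /dev/md0
         fsck /dev/md0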



 2. QQ: I have a RAID-4 or a RAID-5 (parity) setup, and lost power while
    there was disk activity.  Now what do I do?


      AA: The redundancy of RAID levels is designed to protect
      against a disk failure, not against a power failure.

      Since the disks in a RAID-4 or RAID-5 array do not contain a
      file system that fsck can read, there are fewer repair
      options.  You cannot use fsck to do preliminary checking
      and/or repair; you must use ckraid first.


      The ckraid command can be safely run without the --fix
      option to verify the inactive RAID array without making any
      changes.  When you are comfortable with the proposed
      changes, supply the --fix option.


      If you wish, you can try designating one of the disks as a
      ''failed disk''.  Do this with the --suggest-failed-disk-
      mask flag.

      Only one bit should be set in the flag: RAID-5 cannot
      recover two failed disks.  The mask is a binary bit mask:
      thus:

          0x1 == first disk
          0x2 == second disk
          0x4 == third disk
          0x8 == fourth disk, etc.




      Alternately, you can choose to modify the parity sectors, by
      using the --suggest-fix-parity flag.  This will recompute
      the parity from the other sectors.


      The flags --suggest-failed-disk-mask and --suggest-fix-parity
      can be safely used for verification. No changes are made if
      the --fix flag is not specified.  Thus, you can experiment
      with different possible repair schemes.
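
      As a concrete, hedged example of this dry-run/repair cycle (the
      config file and mask value are illustrative, and the exact flag
      placement may vary between raidtools versions):

          # dry run: what would ckraid do if the second disk were
          # treated as the failed one?
          ckraid /etc/raid5.conf --suggest-failed-disk-mask 0x2

          # dry run: what would recomputing the parity sectors do?
          ckraid /etc/raid5.conf --suggest-fix-parity

          # once the proposed changes look right, commit them
          ckraid --fix /etc/raid5.conf --suggest-fix-parity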




 3. QQ: My RAID-1 device, /dev/md0 consists of two hard drive
    partitions: /dev/hda3 and /dev/hdc3.  Recently, the disk with
    /dev/hdc3 failed, and was replaced with a new disk.  My best
    friend, who doesn't understand RAID, said that the correct thing to
    do now is to ''dd if=/dev/hda3 of=/dev/hdc3''.  I tried this, but
    things still don't work.


      AA: You should keep your best friend away from your computer.
      Fortunately, no serious damage has been done.  You can
      recover from this by running:


      mkraid raid1.conf -f --only-superblock




 By using dd, two identical copies of the partition were cre-
 ated. This is almost correct, except that the RAID-1 kernel
 extension expects the RAID superblocks to be different.
 Thus, when you try to reactivate RAID, the software will
 notice the problem, and deactivate one of the two parti-
 tions.  By re-creating the superblock, you should have a
 fully usable system.



 4. QQ: My version of mkraid doesn't have a --only-superblock flag.
    What do I do?

      AA: The newer tools drop support for this flag, replacing it
      with the --force-resync flag.  It has been reported that the
      following sequence appears to work with the latest tools and
      software:


        umount /web (where /dev/md0 was mounted on)
        raidstop /dev/md0
        mkraid /dev/md0 --force-resync --really-force
        raidstart /dev/md0





 After doing this, a cat /proc/mdstat should report resync in
 progress, and one should be able to mount /dev/md0 at this
 point.



 5. QQ: My RAID-1 device, /dev/md0 consists of two hard drive
    partitions: /dev/hda3 and /dev/hdc3.  My best (girl?)friend, who
    doesn't understand RAID, ran fsck on /dev/hda3 while I wasn't
    looking, and now the RAID won't work. What should I do?


      AA: You should re-examine your concept of ``best friend''.
      In general, fsck should never be run on the individual par-
      titions that compose a RAID array.  Assuming that neither of
      the partitions are/were heavily damaged, no data loss has
      occurred, and the RAID-1 device can be recovered as follows:

         a. make a backup of the file system on /dev/hda3

         b. dd if=/dev/hda3 of=/dev/hdc3

         c. mkraid raid1.conf -f --only-superblock

      This should leave you with a working disk mirror.



 6. QQ: Why does the above work as a recovery procedure?

      AA: Because each of the component partitions in a RAID-1 mir-
      ror is a perfectly valid copy of the file system.  In a
      pinch, mirroring can be disabled, and one of the partitions
      can be mounted and safely run as an ordinary, non-RAID file
      system.  When you are ready to restart using RAID-1, then
      unmount the partition, and follow the above instructions to
      restore the mirror.   Note that the above works ONLY for
      RAID-1, and not for any of the other levels.
 It may make you feel more comfortable to reverse the
 direction of the copy above: copy from the disk that was
 untouched to the one that was.  Just be sure to fsck the
 final md.



 7. QQ: I am confused by the above questions, but am not yet bailing
    out.  Is it safe to run fsck /dev/md0 ?


      AA: Yes, it is safe to run fsck on the md devices.  In fact,
      this is the only safe place to run fsck.



 8. QQ: If a disk is slowly failing, will it be obvious which one it is?
    I am concerned that it won't be, and this confusion could lead to
    some dangerous decisions by a sysadmin.


      AA: Once a disk fails, an error code will be returned from
      the low level driver to the RAID driver.  The RAID driver
      will mark it as ``bad'' in the RAID superblocks of the
      ``good'' disks (so we will later know which mirrors are good
      and which aren't), and continue RAID operation on the
      remaining operational mirrors.


      This, of course, assumes that the disk and the low level
      driver can detect a read/write error, and will not silently
      corrupt data, for example. This is true of current drives
      (error detection schemes are being used internally), and is
      the basis of RAID operation.



 9. QQ: What about hot-repair?


      AA: Work is underway to complete ``hot reconstruction''.
      With this feature, one can add several ``spare'' disks to
      the RAID set (be it level 1 or 4/5), and once a disk fails,
      it will be reconstructed on one of the spare disks in run
      time, without ever needing to shut down the array.


      However, to use this feature, the spare disk must have been
      declared at boot time, or it must be hot-added, which
      requires the use of special cabinets and connectors that
      allow a disk to be added while the electrical power is on.


      As of October 97, there is a beta version of MD that allows:

      +o  RAID 1 and 5 reconstruction on spare drives

      +o  RAID-5 parity reconstruction after an unclean shutdown

      +o  spare disk to be hot-added to an already running RAID 1
         or 4/5 array

         Automatic reconstruction is currently (Dec 97) disabled by
         default, due to the preliminary nature of this work.  It can
         be enabled by changing the
         value of SUPPORT_RECONSTRUCTION in include/linux/md.h.
    If spare drives were configured into the array when it
    was created and kernel-based reconstruction is enabled,
    the spare drive will already contain a RAID superblock
    (written by mkraid), and the kernel will reconstruct its
    contents automatically (without needing the usual mdstop,
    replace drive, ckraid, mdrun steps).


    If you are not running automatic reconstruction, and have
    not configured a hot-spare disk, the procedure described
    by Gadi Oxman <[email protected]> is recommended:

 +o  Currently, once the failed disk is removed, the RAID set
     will be running in degraded mode. To restore full
    operation mode, you need to:

 +o  stop the array (mdstop /dev/md0)

 +o  replace the failed drive

 +o  run ckraid raid.conf to reconstruct its contents

 +o  run the array again (mdadd, mdrun).

    At this point, the array will be running with all the
    drives, and again protects against a failure of a single
    drive.

     Currently, it is not possible to assign a single hot-spare
     disk to several arrays.   Each array requires its own
     hot-spare.  (The manual procedure above is spelled out as a
     command sketch below.)
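
     Spelled out with placeholder device names (substitute your own
     partitions and config file), Gadi's manual procedure looks
     roughly like this:

         mdstop /dev/md0                  # take the degraded array down
         # ... physically replace the failed drive and repartition it ...
         ckraid --fix /etc/raid5.conf     # reconstruct the new disk's contents
         mdadd /dev/md0 /dev/disk1 /dev/disk2 /dev/disk3
         mdrun -p5 /dev/md0               # back to full redundancy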



 10.
    QQ: I would like to have an audible alarm for ``you schmuck, one
    disk in the mirror is down'', so that the novice sysadmin knows
    that there is a problem.


      AA: The kernel is logging the event with a ``KERN_ALERT''
      priority in syslog.  There are several software packages
      that will monitor the syslog files, and beep the PC speaker,
      call a pager, send e-mail, etc. automatically.
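
      If you have nothing fancier at hand, even a crude log watcher
      will do.  This sketch assumes syslog writes kernel messages to
      /var/log/messages; the match patterns are guesses, so check your
      own logs for the exact wording the md driver uses:

          #!/bin/sh
          # crude alarm: beep and mail root when md/raid errors appear
          tail -f /var/log/messages |
          while read line ; do
              case "$line" in
                  *md*error*|*raid*error*|*mirror*error*)
                      printf '\a'                  # beep the console
                      echo "$line" | mail -s "RAID alert" root
                      ;;
              esac
          done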



 11.
    QQ: How do I run RAID-5 in degraded mode (with one disk failed, and
    not yet replaced)?


      AA: Gadi Oxman <[email protected]> writes: Normally, to
      run a RAID-5 set of n drives you have to:


      mdadd /dev/md0 /dev/disk1 ... /dev/disk(n)
      mdrun -p5 /dev/md0





 Even if one of the disks has failed, you still have to mdadd
 it as you would in a normal setup.  (Perhaps /dev/null can be
 used in place of the failed disk; this is untested, so watch
 out.)  The array will then be active in degraded mode with (n - 1)
 drives.  If ``mdrun'' fails, the kernel has noticed an error
 (for example, several faulty drives, or an unclean shut-
 down).  Use ``dmesg'' to display the kernel error messages
 from ``mdrun''.  If the raid-5 set is corrupted due to a
 power loss, rather than a disk crash, one can try to recover
 by creating a new RAID superblock:


      mkraid -f --only-superblock raid5.conf





 A RAID array doesn't provide protection against a power
 failure or a kernel crash, and can't guarantee correct
 recovery.  Rebuilding the superblock will simply cause the
 system to ignore the condition by marking all the drives as
 ``OK'', as if nothing happened.



 12.
    QQ: How does RAID-5 work when a disk fails?


      AA: The typical operating scenario is as follows:

      +o  A RAID-5 array is active.

      +o  One drive fails while the array is active.

      +o  The drive firmware and the low-level Linux
         disk/controller drivers detect the failure and report an
         error code to the MD driver.

      +o  The MD driver continues to provide an error-free /dev/md0
         device to the higher levels of the kernel (with a
         performance degradation) by using the remaining
         operational drives.

      +o  The sysadmin can umount /dev/md0 and mdstop /dev/md0 as
         usual.

      +o  If the failed drive is not replaced, the sysadmin can
         still start the array in degraded mode as usual, by
         running mdadd and mdrun.



 13.
    QQ:

      AA:



 14.
    QQ: Why is there no question 13?


      AA: If you are concerned about RAID, High Availability, and
      UPS, then it's probably a good idea to be superstitious as
      well.  It can't hurt, can it?

 15.
    QQ: I just replaced a failed disk in a RAID-5 array.  After
    rebuilding the array, fsck is reporting many, many errors.  Is this
    normal?


      AA: No. And, unless you ran fsck in "verify only; do not
      update" mode, its quite possible that you have corrupted
      your data.  Unfortunately, a not-uncommon scenario is one of
      accidentally changing the disk order in a RAID-5 array,
      after replacing a hard drive.  Although the RAID superblock
      stores the proper order, not all tools use this information.
      In particular, the current version of ckraid will use the
      information specified with the -f flag (typically, the file
      /etc/raid5.conf) instead of the data in the superblock.  If
      the specified order is incorrect, then the replaced disk
      will be reconstructed incorrectly.   The symptom of this
      kind of mistake seems to be heavy & numerous fsck errors.


      And, in case you are wondering, yyeess, someone lost aallll of
      their data by making this mistake.   Making a tape backup of
      aallll data before reconfiguring a RAID array is ssttrroonnggllyy
      rreeccoommmmeennddeedd.



 16.
    QQ: The QuickStart says that mdstop is just to make sure that the
    disks are sync'ed. Is this REALLY necessary? Isn't unmounting the
    file systems enough?


      AA: The command mdstop /dev/md0 will:

      +o  mark it ''clean''. This allows us to detect unclean
         shutdowns, for example due to a power failure or a kernel
         crash.

      +o  sync the array. This is less important after unmounting a
         filesystem, but is important if /dev/md0 is accessed
         directly rather than through a filesystem (for example,
         by e2fsck).





 55..  TTrroouubblleesshhoooottiinngg IInnssttaallll PPrroobblleemmss


 1. QQ: What is the current best known-stable patch for RAID in the
    2.0.x series kernels?


      AA: As of 18 Sept 1997, it is "2.0.30 + pre-9 2.0.31 + Werner
      Fink's swapping patch + the alpha RAID patch".  As of Novem-
      ber 1997, it is 2.0.31 + ... !?



 2. QQ: The RAID patches will not install cleanly for me.  What's wrong?

      AA: Make sure that /usr/include/linux is a symbolic link to
      /usr/src/linux/include/linux.

 Make sure that the new files raid5.c, etc.  have been copied
 to their correct locations.  Sometimes the patch command
 will not create new files.  Try the -f flag on patch.



 3. QQ: While compiling raidtools 0.42, compilation stops trying to
    include <pthread.h> but it doesn't exist in my system.  How do I
    fix this?


      AA: raidtools-0.42 requires linuxthreads-0.6 from:
      <ftp://ftp.inria.fr/INRIA/Projects/cristal/Xavier.Leroy>
      Alternately, use glibc v2.0.



 4. QQ: I get the message: mdrun -a /dev/md0: Invalid argument


      AA: Use mkraid to initialize the RAID set prior to the first
      use.  mkraid ensures that the RAID array is initially in a
      consistent state by erasing the RAID partitions. In addi-
      tion, mkraid will create the RAID superblocks.
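
      For example, a minimal sketch (the config file name and the
      member partitions are placeholders that must match your own
      setup):


          mkraid /etc/raid5.conf
          mdadd /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1
          mdrun -p5 /dev/md0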



 5. QQ: I get the message: mdrun -a /dev/md0: Invalid argument.  The
    setup was:

 +o  raid build as a kernel module

 +o  normal install procedure followed ... mdcreate, mdadd, etc.

 +o  cat /proc/mdstat shows

        Personalities :
        read_ahead not set
        md0 : inactive sda1 sdb1 6313482 blocks
        md1 : inactive
        md2 : inactive
        md3 : inactive




 +o  mdrun -a generates the error message /dev/md0: Invalid argument



      AA: Try lsmod (or, alternately, cat /proc/modules) to see if
      the raid modules are loaded.  If they are not, you can load
      them explicitly with the modprobe raid1 or modprobe raid5
      command.  Alternately, if you are using the autoloader and
      expected kerneld to load them but it didn't, this is probably
      because your loader is missing the info to load the modules.
      Edit /etc/conf.modules and add the following lines:


          alias md-personality-3 raid1
          alias md-personality-4 raid5





 6. QQ: While doing mdadd -a I get the error: /dev/md0: No such file or
    directory.  Indeed, there seems to be no /dev/md0 anywhere.  Now
    what do I do?


      AA: The raid-tools package will create these devices when you
      run make install as root.  Alternately, you can do the fol-
      lowing:

          cd /dev
          ./MAKEDEV md







 7. QQ: After creating a raid array on /dev/md0, I try to mount it and
    get the following error:
     mount: wrong fs type, bad option, bad superblock on /dev/md0, or
    too many mounted file systems. What's wrong?

      AA: You need to create a file system on /dev/md0 before you
      can mount it.  Use mke2fs.
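
      For example (a sketch; the mount point is arbitrary, and see the
      block-size discussion in the Performance section before choosing
      the -b value):


          mke2fs -b 4096 /dev/md0
          mkdir /raid
          mount -t ext2 /dev/md0 /raid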




 8. QQ: Truxton Fulton wrote:

      On my Linux 2.0.30 system, while doing a mkraid for a RAID-1
      device, during the clearing of the two individual parti-
      tions, I got "Cannot allocate free page" errors appearing on
      the console, and "Unable to handle kernel paging request at
      virtual address ..." errors in the system log.  At this
      time, the system became quite unusable, but it appears to
      recover after a while.  The operation appears to have com-
      pleted with no other errors, and I am successfully using my
      RAID-1 device.  The errors are disconcerting though.  Any
      ideas?




      AA: This was a well-known bug in the 2.0.30 kernels.  It is
      fixed in the 2.0.31 kernel; alternately, fall back to
      2.0.29.



 9. QQ: I'm not able to mdrun a RAID-1, RAID-4 or RAID-5 device.  If I
    try to mdrun a mdadd'ed device I get the message ''invalid raid
    superblock magic''.


      AA: Make sure that you've run the mkraid part of the install
      procedure.



 10.
    QQ: When I access /dev/md0, the kernel spits out a lot of errors
    like md0: device not running, giving up !  and I/O error.... I've
    successfully added my devices to the virtual device.

      AA: To be usable, the device must be running. Use mdrun -px
      /dev/md0 where x is l for linear, 0 for RAID-0 or 1 for
      RAID-1, etc.



 11.
    QQ: I've created a linear md-dev with 2 devices.  cat /proc/mdstat
    shows the total size of the device, but df only shows the size of
    the first physical device.


      AA: You must mkfs your new md-dev before using it the first
      time, so that the filesystem will cover the whole device.



 12.
    QQ: I've set up /etc/mdtab using mdcreate, I've mdadd'ed, mdrun and
    fsck'ed my two /dev/mdX partitions.  Everything looks okay before a
    reboot.  As soon as I reboot, I get an fsck error on both
    partitions: fsck.ext2: Attempt to read block from filesystem
    resulted in short read while trying to open /dev/md0.  Why?! How
    do I fix it?!


      AA: During the boot process, the RAID partitions must be
      started before they can be fsck'ed.  This must be done in
      one of the boot scripts.  For some distributions, fsck is
      called from /etc/rc.d/rc.S, for others, it is called from
      /etc/rc.d/rc.sysinit. Change this file to run mdadd -ar *before*
      fsck -A is executed.  Better yet, it is suggested that
      ckraid be run if mdadd returns with an error.  How to do
      this is discussed in greater detail in question 14 of the
      section ''Error Recovery''.
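
      A minimal sketch of this ordering (the script name and the exact
      fsck invocation differ between distributions, and the config
      file name is a placeholder):


          # in /etc/rc.d/rc.sysinit (or rc.S), before the check of
          # the local file systems:
          mdadd -ar
          if [ $? -ne 0 ]; then
              # mdadd failed: an unclean shutdown is likely, so
              # repair the array and try again
              ckraid --fix /etc/raid5.conf
              mdadd -ar
          fi
          fsck -A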



 13.
    QQ: I get the message invalid raid superblock magic while trying to
    run an array which consists of partitions which are bigger than
    4GB.


      AA: This bug is now fixed. (September 97)  Make sure you have
      the latest raid code.



 14.
    QQ: I get the message Warning: could not write 8 blocks in inode
    table starting at 2097175 while trying to run mke2fs on a partition
    which is larger than 2GB.


      AA: This seems to be a problem with mke2fs (November 97).  A
      temporary work-around is to get the mke2fs code, and add
      #undef HAVE_LLSEEK to e2fsprogs-1.10/lib/ext2fs/llseek.c
      just before the first #ifdef HAVE_LLSEEK and recompile
      mke2fs.



 15.
    QQ: ckraid currently isn't able to read /etc/mdtab

      AA: The RAID0/linear configuration file format used in
      /etc/mdtab is obsolete, although it will be supported for a
      while more.  The current, up-to-date config files are
      named /etc/raid1.conf, etc.



 16.
    QQ: The personality modules (raid1.o) are not loaded automatically;
    they have to be manually modprobe'd before mdrun. How can this be
    fixed?


      AA: To autoload the modules, we can add the following to
      /etc/conf.modules:

          alias md-personality-3 raid1
          alias md-personality-4 raid5






 17.
    QQ: I've mdadd'ed 13 devices, and now I'm trying to mdrun -p5
    /dev/md0 and get the message: /dev/md0: Invalid argument


      AA: The default configuration for software RAID is 8 real
      devices. Edit linux/md.h, change #define MAX_REAL 8 to a
      larger number, and rebuild the kernel.



 18.
    QQ: I can't make md work with partitions on our latest SPARCstation
    5.  I suspect that this has something to do with disk-labels.


      AA: Sun disk-labels sit in the first 1K of a partition.  For
      RAID-1, the Sun disk-label is not an issue since ext2fs will
      skip the label on every mirror.  For other raid levels (0,
      linear and 4/5), this appears to be a problem; it has not
      yet (Dec 97) been addressed.



 66..  SSuuppppoorrtteedd HHaarrddwwaarree && SSooffttwwaarree


 1. QQ: I have SCSI adapter brand XYZ (with or without several
    channels), and disk brand(s) PQR and LMN, will these work with md
    to create a linear/striped/mirrored personality?


      AA: Yes!  Software RAID will work with any disk controller
      (IDE or SCSI) and any disks.  The disks do not have to be
      identical, nor do the controllers.  For example, a RAID mir-
      ror can be created with one half the mirror being a SCSI
      disk, and the other an IDE disk.  The disks do not even have
      to be the same size.  There are no restrictions on the mix-
      ing & matching of disks and controllers.


      This is because Software RAID works with disk partitions,
 not with the raw disks themselves.  The only recommendation
 is that for RAID levels 1 and 5, the disk partitions that
 are used as part of the same set be the same size. If the
 partitions used to make up the RAID 1 or 5 array are not the
 same size, then the excess space in the larger partitions is
 wasted (not used).



 2. QQ: I have a twin channel BT-952, and the box states that it
    supports hardware RAID 0, 1 and 0+1.   I have made a RAID set with
    two drives, the card apparently recognizes them when it's doing
    its BIOS startup routine. I've been reading in the driver source
    code, but found no reference to the hardware RAID support.  Anybody
    out there working on that?


      AA: The Mylex/BusLogic FlashPoint boards with RAIDPlus are
      actually software RAID, not hardware RAID at all.  RAIDPlus
      is only supported on Windows 95 and Windows NT, not on Net-
      ware or any of the Unix platforms.  Aside from booting and
      configuration, the RAID support is actually in the OS
      drivers.


      While in theory Linux support for RAIDPlus is possible, the
      implementation of RAID-0/1/4/5 in the Linux kernel is much
      more flexible and should have superior performance, so
      there's little reason to support RAIDPlus directly.



 3. QQ: I want to run RAID with an SMP box.  Is  RAID SMP-safe?

      AA: "I think so" is the best answer available at the time I
      write this (April 98).  A number of users report that they
      have been using RAID with SMP for nearly a year, without
      problems.  However, as of April 98 (circa kernel 2.1.9x),
      the following problems have been noted on the mailing list:

      +o  Adaptec AIC7xxx SCSI drivers are not SMP safe (General
         note: Adaptec adapters have a long & lengthy history of
         problems & flakiness in general.  Although they seem to
         be the most easily available, widespread and cheapest
         SCSI adapters, they should be avoided.  After factoring
         for time lost, frustration, and corrupted data, Adaptecs
         will prove to be the costliest mistake you'll ever make.
         That said, if you have SMP problems with 2.1.88, try the
         patch ftp://ftp.bero-
         online.ml.org/pub/linux/aic7xxx-5.0.7-linux21.tar.gz I am
         not sure if this patch has been pulled into later 2.1.x
         kernels.  For further info, take a look at the mail
         archives for March 98 at
         http://www.linuxhq.com/lnxlists/linux-raid/lr_9803_01/ As
         usual, due to the rapidly changing nature of the latest
         experimental 2.1.x kernels, the problems described in
         these mailing lists may or may not have been fixed by the
         time you read this. Caveat Emptor.)



      +o  IO-APIC with RAID-0 on SMP has been reported to crash in
         2.1.90



 77..  MMooddiiffyyiinngg aann EExxiissttiinngg IInnssttaallllaattiioonn


 1. QQ: Are linear MD's expandable?  Can a new hard-drive/partition be
    added, and the size of the existing file system expanded?


      AA: Miguel de Icaza <[email protected]> writes:

      I changed the ext2fs code to be aware of multiple-devices
      instead of the regular one device per file system assump-
      tion.


      So, when you want to extend a file system, you run a utility
      program that makes the appropriate changes on the new device
      (your extra partition) and then you just tell the system to
      extend the fs using the specified device.


      You can extend a file system with new devices at system
      operation time, no need to bring the system down (and
      whenever I get some extra time, you will be able to remove
      devices from the ext2 volume set, again without even having
      to go to single-user mode or any hack like that).


      You can get the patch for 2.1.x kernel from my web page:

      <http://www.nuclecu.unam.mx/~miguel/ext2-volume>





 2. QQ: Can I add disks to a RAID-5 array?


      AA: Currently, (September 1997) no, not without erasing all
      data. A conversion utility to allow this does not yet exist.
      The problem is that the actual structure and layout of a
      RAID-5 array depends on the number of disks in the array.

      Of course, one can add drives by backing up the array to
      tape, deleting all data, creating a new array, and restoring
      from tape.



 3. QQ: What would happen to my RAID1/RAID0 sets if I shift one of the
    drives from being /dev/hdb to /dev/hdc?

    Because of cabling/case size/stupidity issues, I had to make my
    RAID sets on the same IDE controller (/dev/hda and /dev/hdb). Now
    that I've fixed some stuff, I want to move /dev/hdb to /dev/hdc.

    What would happen if I just change the /etc/mdtab and
    /etc/raid1.conf files to reflect the new location?

      AA: For RAID-0/linear, one must be careful to specify the
      drives in exactly the same order. Thus, in the above exam-
      ple, if the original config is




 mdadd /dev/md0 /dev/hda /dev/hdb





 Then the new config *must* be


      mdadd /dev/md0 /dev/hda /dev/hdc






 For RAID-1/4/5, the drive's ''RAID number'' is stored in its
 RAID superblock, and therefore the order in which the disks
 are specified is not important.

 RAID-0/linear does not have a superblock due to its older
 design, and the desire to maintain backwards compatibility
 with this older design.



 4. QQ: Can I convert a two-disk RAID-1 mirror to a three-disk RAID-5
    array?


      AA: Yes.  Michael at BizSystems has come up with a clever,
      sneaky way of doing this.  However, like virtually all
      manipulations of RAID arrays once they have data on them, it
      is dangerous and prone to human error.  MMaakkee aa bbaacckkuupp bbeeffoorree
      yyoouu ssttaarrtt.


 I will make the following assumptions:
 ---------------------------------------------
 disks
 original: hda - hdc
 raid1 partitions hda3 - hdc3
 array name /dev/md0

 new hda - hdc - hdd
 raid5 partitions hda3 - hdc3 - hdd3
 array name: /dev/md1

 You must substitute the appropriate disk and partition numbers for
 your system configuration. This will hold true for all config file
 examples.
 --------------------------------------------
 DO A BACKUP BEFORE YOU DO ANYTHING
 1) recompile kernel to include both raid1 and raid5
 2) install new kernel and verify that raid personalities are present
 3) disable the redundant partition on the raid 1 array. If this is a
  root mounted partition (mine was) you must be more careful.

  Reboot the kernel without starting raid devices or boot from rescue
  system ( raid tools must be available )

  start non-redundant raid1
 mdadd -r -p1 /dev/md0 /dev/hda3

 4) configure raid5 but with a 'funny' config file; note that there is
   no hda3 entry and hdc3 is repeated.  This trick is needed because
   the raid tools will not otherwise let you build an array with a
   missing disk.
 -------------------------------
 # raid-5 configuration
 raiddev                 /dev/md1
 raid-level              5
 nr-raid-disks           3
 chunk-size              32

 # Parity placement algorithm
 parity-algorithm        left-symmetric

 # Spare disks for hot reconstruction
 nr-spare-disks          0

 device                  /dev/hdc3
 raid-disk               0

 device                  /dev/hdc3
 raid-disk               1

 device                  /dev/hdd3
 raid-disk               2
 ---------------------------------------
  mkraid /etc/raid5.conf
 5) activate the raid5 array in non-redundant mode

 mdadd -r -p5 -c32k /dev/md1 /dev/hdc3 /dev/hdd3

 6) make a file system on the array

 mke2fs -b {blocksize} /dev/md1

 A blocksize of 4096, rather than the default 1024, is recommended by
 some.  This improves the memory utilization for the kernel raid
 routines and matches the blocksize to the page size.  I compromised
 and used 2048 since I have a relatively high number of small files on
 my system.

 7) mount the two raid devices somewhere

 mount -t ext2 /dev/md0 mnt0
 mount -t ext2 /dev/md1 mnt1

 8) move the data

 cp -a mnt0/. mnt1

 9) verify that the data sets are identical
 10) stop both arrays
 11) correct the information for the raid5.conf file
   change /dev/md1 to /dev/md0
   change the first disk to read /dev/hda3

 12) upgrade the new array to full redundant status
  (THIS DESTROYS REMAINING raid1 INFORMATION)

 ckraid --fix /etc/raid5.conf








 88..  PPeerrffoorrmmaannccee,, TToooollss && GGeenneerraall BBoonnee--hheeaaddeedd QQuueessttiioonnss


 1. QQ: I've created a RAID-0 device on /dev/sda2 and /dev/sda3. The
    device is a lot slower than a single partition. Isn't md a pile of
    junk?

      AA: To have a RAID-0 device running at full speed, you must
      have partitions from different disks.  Besides, putting the
      two halves of the mirror on the same disk fails to give you
      any protection whatsoever against disk failure.



 2. QQ: What's the use of having RAID-linear when RAID-0 will do the
    same thing, but provide higher performance?

      AA: It's not obvious that RAID-0 will always provide better
      performance; in fact, in some cases, it could make things
      worse.  The ext2fs file system scatters files all over a
      partition, and it attempts to keep all of the blocks of a
      file contiguous, basically in an attempt to prevent fragmen-
      tation.  Thus, ext2fs behaves "as if" there were a (vari-
      able-sized) stripe per file.  If there are several disks
      concatenated into a single RAID-linear, this will result in
      files being statistically distributed on each of the disks.
      Thus, at least for ext2fs, RAID-linear will behave a lot
      like RAID-0 with large stripe sizes.  Conversely, RAID-0
      with small stripe sizes can cause excessive disk activity
      leading to severely degraded performance if several large
      files are accessed simultaneously.

      In many cases, RAID-0 can be an obvious win. For example,
      imagine a large database file.  Since ext2fs attempts to
      cluster together all of the blocks of a file, chances are
      good that it will end up on only one drive if RAID-linear is
      used, but will get chopped into lots of stripes if RAID-0 is
      used.  Now imagine a number of (kernel) threads all trying
      to randomly access this database.  Under RAID-linear, all
 accesses would go to one disk, which would not be as
 efficient as the parallel accesses that RAID-0 entails.



 3. QQ: How does RAID-0 handle a situation where the different stripe
    partitions are different sizes?  Are the stripes uniformly
    distributed?


      AA: To understand this, let's look at an example with three
      partitions; one that is 50MB, one 90MB and one 125MB.

      Let's call D0 the 50MB disk, D1 the 90MB disk and D2 the
      125MB disk.  When you start the device, the driver calcu-
      lates 'strip zones'.  In this case, it finds 3 zones,
      defined like this:


                  Z0 : (D0/D1/D2) 3 x 50 = 150MB  total in this zone
                  Z1 : (D1/D2)  2 x 40 = 80MB total in this zone
                  Z2 : (D2) 125-50-40 = 35MB total in this zone.




      You can see that the total size of the zones is the size of
      the virtual device, but, depending on the zone, the striping
      is different.  Z2 is rather inefficient, since there's only
      one disk.

      Since ext2fs and most other Unix file systems distribute
      files all over the disk, you have a  35/265 = 13% chance
      that a file will end up on Z2, and not get any of the bene-
      fits of striping.

      (DOS tries to fill a disk from beginning to end, and thus,
      the oldest files would end up on Z0.  However, this strategy
      leads to severe filesystem fragmentation, which is why no
      one besides DOS does it this way.)



 4. QQ: I have some Brand X hard disks and a Brand Y controller, and am
    considering using md.  Does it significantly increase the
    throughput?  Is the performance really noticeable?


      AA: The answer depends on the configuration that you use.


         LLiinnuuxx MMDD RRAAIIDD--00 aanndd RRAAIIDD--lliinneeaarr ppeerrffoorrmmaannccee::
            If the system is heavily loaded with lots of I/O,
            statistically, some of it will go to one disk, and
            some to the others.  Thus, performance will improve
            over a single large disk.   The actual improvement
            depends a lot on the actual data, stripe sizes, and
            other factors.   In a system with low I/O usage, the
            performance is equal to that of a single disk.



         LLiinnuuxx MMDD RRAAIIDD--11 ((mmiirrrroorriinngg)) rreeaadd ppeerrffoorrmmaannccee::
            MD implements read balancing. That is, the  RAID-1
            code will alternate between each of the (two or more)
            disks in the mirror, making alternate reads to each.
       In a low-I/O situation, this won't change performance
       at all: you will have to wait for one disk to complete
       the read.  But, with two disks in a high-I/O
       environment, this could as much as double the read
       performance, since reads can be issued to each of the
       disks in parallel.  For N disks in the mirror, this
       could improve performance N-fold.


    LLiinnuuxx MMDD RRAAIIDD--11 ((mmiirrrroorriinngg)) wwrriittee ppeerrffoorrmmaannccee::
       Must wait for the write to occur to all of the disks
       in the mirror.  This is because a copy of the data
       must be written to each of the disks in the mirror.
       Thus, performance will be roughly equal to the write
       performance to a single disk.


    LLiinnuuxx MMDD RRAAIIDD--44//55 rreeaadd ppeerrffoorrmmaannccee::
       Statistically, a given block can be on any one of a
       number of disk drives, and thus RAID-4/5 read
       performance is a lot like that for RAID-0.  It will
       depend on the data, the stripe size, and the
       application.  It will not be as good as the read
       performance of a mirrored array.


    LLiinnuuxx MMDD RRAAIIDD--44//55 wwrriittee ppeerrffoorrmmaannccee::
       This will in general be considerably slower than that
       for a single disk.  This is because the parity must be
       written out to one drive as well as the data to
       another.  However, in order to compute the new parity,
       the old parity and the old data must be read first.
       The old data, new data and old parity must all be
       XOR'ed together to determine the new parity: this
       requires considerable CPU cycles in addition to the
       numerous disk accesses.



 5. QQ: What RAID configuration should I use for optimal performance?

      AA: Is the goal to maximize throughput, or to minimize
      latency?  There is no easy answer, as there are many factors
      that affect performance:


      +o  operating system  - will one process/thread, or many be
         performing disk access?

      +o  application       - is it accessing data in a sequential
         fashion, or random access?

      +o  file system       - clusters files or spreads them out
         (the ext2fs clusters together the blocks of a file, and
         spreads out files)

      +o  disk driver       - number of blocks to read ahead (this
         is a tunable parameter)

      +o  CEC hardware      - one drive controller, or many?

      +o  hd controller     - able to queue multiple requests or
         not?  Does it provide a cache?

      +o  hard drive        - buffer cache memory size -- is it big
         enough to handle the write sizes and rate you want?
 +o  physical platters - blocks per cylinder -- accessing
    blocks on different cylinders will lead to seeks.



 6. QQ: What is the optimal RAID-5 configuration for performance?

      AA: Since RAID-5 experiences an I/O load that is equally dis-
      tributed across several drives, the best performance will be
      obtained when the RAID set is balanced by using identical
      drives, identical controllers,  and the same (low) number of
      drives on each controller.

      Note, however, that using identical components will raise
      the probability of multiple simultaneous failures, for exam-
      ple due to a sudden jolt or drop, overheating, or a power
      surge during an electrical storm. Mixing brands and models
      helps reduce this risk.



 7. QQ: What is the optimal block size for a RAID-4/5 array?


      AA: When using the current (November 1997) RAID-4/5 implemen-
      tation, it is strongly recommended that the file system be
      created with mke2fs -b 4096 instead of the default 1024 byte
      filesystem block size.


      This is because the current RAID-5 implementation allocates
      one 4K memory page per disk block; if a disk block were just
      1K in size, then 75% of the memory which RAID-5 is
      allocating for pending I/O would not be used.  If the disk
      block size matches the memory page size, then the driver can
      (potentially) use all of the page.  Thus, for a filesystem
      with a 4096 block size as opposed to a 1024 byte block size,
      the RAID driver will potentially queue 4 times as much
      pending I/O to the low level drivers without allocating
      additional memory.


      NNoottee: the above remarks do NOT apply to the Software
      RAID-0/1/linear driver.


      NNoottee:: the statements about 4K memory page size apply to the
      Intel x86 architecture.   The page size on Alpha, Sparc, and
      other CPUs is different; I believe it is 8K on
      Alpha/Sparc (????).  Adjust the above figures accordingly.


      NNoottee:: if your file system has a lot of small files (files
      less than 10KBytes in size), a considerable fraction of the
      disk space might be wasted.  This is because the file system
      allocates disk space in multiples of the block size.
      Allocating large blocks for small files clearly results in a
      waste of disk space: thus, you may want to stick to small
      block sizes, get a larger effective storage capacity, and
      not worry about the "wasted" memory due to the block-
      size/page-size mismatch.


      NNoottee:: most ''typical'' systems do not have that many small
      files.  That is, although there might be thousands of small
      files, this would lead to only some 10 to 100MB wasted
 space, which is probably an acceptable tradeoff for
 performance on a multi-gigabyte disk.

 However, for news servers, there might be tens or hundreds
 of thousands of small files.  In such cases, the smaller
 block size, and thus the improved storage capacity, may be
 more important than the more efficient I/O scheduling.


 NNoottee:: there exists an experimental file system for Linux
 which packs small files and file chunks onto a single block.
 It apparently has some very positive performance
 implications when the average file size is much smaller than
 the block size.


 Note: Future versions may implement schemes that obsolete
 the above discussion. However, this is difficult to
 implement, since dynamic run-time allocation can lead to
 dead-locks; the current implementation performs a static
 pre-allocation.



 8. QQ: How does the chunk size (stripe size) influence the speed of my
    RAID-0, RAID-4 or RAID-5 device?


      AA: The chunk size is the amount of data contiguous on the
      virtual device that is also contiguous on the physical
      device.  In this HOWTO, "chunk" and "stripe" refer to the
      same thing: what is commonly called the "stripe" in other
      RAID documentation is called the "chunk" in the MD man
      pages.  Stripes or chunks apply only to RAID 0, 4 and 5,
      since stripes are not used in mirroring (RAID-1) and simple
      concatenation (RAID-linear).  The stripe size affects both
      read and write latency (delay), throughput (bandwidth), and
      contention between independent operations (ability to simul-
      taneously service overlapping I/O requests).

      Assuming the use of the ext2fs file system, and the current
      kernel policies about read-ahead, large stripe sizes are
      almost always better than small stripe sizes, and stripe
      sizes from about a fourth to a full disk cylinder in size
      may be best.  To understand this claim, let us consider the
      effects of large stripes on small files, and small stripes
      on large files.  The stripe size does not affect the read
      performance of small files:  For an array of N drives, the
      file has a 1/N probability of being entirely within one
      stripe on any one of the drives.  Thus, both the read
      latency and bandwidth will be comparable to that of a single
      drive.  Assuming that the small files are statistically well
      distributed around the filesystem, (and, with the ext2fs
      file system, they should be), roughly N times more
      overlapping, concurrent reads should be possible without
      significant collision between them.  Conversely, if very
      small stripes are used, and a large file is read
      sequentially, then a read will be issued to all of the disks in
      the array.  For the read of a single large file, the
      latency will almost double, as the probability of a block
      being 3/4'ths of a revolution or farther away will increase.
      Note, however, the trade-off: the bandwidth could improve
      almost N-fold for reading a single, large file, as N drives
      can be reading simultaneously (that is, if read-ahead is
      used so that all of the disks are kept active).  But there
      is another, counter-acting trade-off:  if all of the drives
 are already busy reading one file, then attempting to read a
 second or third file at the same time will cause significant
 contention, ruining performance as the disk ladder
 algorithms lead to seeks all over the platter.  Thus,  large
 stripes will almost always lead to the best performance. The
 sole exception is the case where one is streaming a single,
 large file at a time, and one requires the top possible
 bandwidth, and one is also using a good read-ahead
 algorithm, in which case small stripes are desired.


 Note that this HOWTO previously recommended small stripe
 sizes for news spools or other systems with lots of small
 files. This was bad advice, and here's why:  news spools
 contain not only many small files, but also large summary
 files, as well as large directories.  If the summary file is
 larger than the stripe size, reading it will cause many
 disks to be accessed, slowing things down as each disk
 performs a seek.  Similarly, the current ext2fs file system
 searches directories in a linear, sequential fashion.  Thus,
 to find a given file or inode, on average half of the
 directory will be read. If this directory is spread across
 several stripes (several disks), the directory read (e.g.
 due to the ls command) could get very slow. Thanks to Steven
 A. Reisman <[email protected]> for this correction.  Steve
 also adds:

      I found that using a 256k stripe gives much better perfor-
      mance.  I suspect that the optimum size would be the size of
      a disk cylinder (or maybe the size of the disk drive's sec-
      tor cache).  However, disks nowadays have recording zones
      with different sector counts (and sector caches vary among
      different disk models).  There's no way to guarantee stripes
      won't cross a cylinder boundary.




 The tools accept the stripe size specified in KBytes.  You'll want to
 specify a multiple of the page size for your CPU (4KB on the x86).
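
 For example, a 256 KB chunk could be requested on the mdadd command
 line, following the pattern shown in the RAID-5 conversion recipe
 above (the device names here are placeholders):


      mdadd -r -p5 -c256k /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1


 or, equivalently, with a chunk-size line of 256 in the raid5.conf-
 style config file.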




 9. QQ: What is the correct stride factor to use when creating the
    ext2fs file system on the RAID partition?  By stride, I mean the -R
    flag on the mke2fs command:

    mke2fs -b 4096 -R stride=nnn  ...



 What should the value of nnn be?

      AA: The -R stride flag is used to tell the file system about
      the size of the RAID stripes.  Since only RAID-0,4 and 5 use
      stripes, and RAID-1 (mirroring) and RAID-linear do not, this
      flag is applicable only for RAID-0,4,5.

      Knowledge of the size of a stripe allows mke2fs to allocate
      the block and inode bitmaps so that they don't all end up on
      the same physical drive.  An unknown contributor wrote:

      I noticed last spring that one drive in a pair always had a
      larger I/O count, and tracked it down to these meta-data
      blocks.  Ted added the -R stride= option in response to my
 explanation and request for a workaround.


 For a 4KB block file system, with stripe size 256KB, one would use -R
 stride=64.
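
 In other words, for that case one would run (the device name is a
 placeholder):


      mke2fs -b 4096 -R stride=64 /dev/md0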

 If you don't trust the -R flag, you can get a similar effect in a
 different way.   Steven A. Reisman <[email protected]> writes:

      Another consideration is the filesystem used on the RAID-0
      device.  The ext2 filesystem allocates 8192 blocks per
      group.  Each group has its own set of inodes.  If there are
      2, 4 or 8 drives, these inodes cluster on the first disk.
      I've distributed the inodes across all drives by telling
      mke2fs to allocate only 7932 blocks per group.


 Some mke2fs man pages do not describe the [-g blocks-per-group] flag used
 in this operation.
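
 A sketch of that alternative (again, the device name is a
 placeholder):


      mke2fs -g 7932 /dev/md0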



 10.
    QQ: Where can I put the md commands in the startup scripts, so that
    everything will start automatically at boot time?


      AA: Rod Wilkens <[email protected]> writes:

      What I did is put ``mdadd -ar'' in the
      ``/etc/rc.d/rc.sysinit'' right after the kernel loads the
      modules, and before the ``fsck'' disk check.  This way, you
      can put the ``/dev/md?'' device in the ``/etc/fstab''. Then
      I put the ``mdstop -a'' right after the ``umount -a''
      unmounting the disks, in the ``/etc/rc.d/init.d/halt'' file.


 For raid-5, you will want to look at the return code for mdadd, and if
 it failed, do a


      ckraid --fix /etc/raid5.conf





 to repair any damage.
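
 A sketch of the shutdown side described above (the exact name of the
 halt script differs between distributions):


      # in /etc/rc.d/init.d/halt, right after the file systems
      # are unmounted:
      umount -a
      mdstop -a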



 11.
    QQ: I was wondering if it's possible to set up striping with more
    than 2 devices in md0? This is for a news server, and I have 9
    drives... Needless to say I need much more than two.  Is this
    possible?


      AA: Yes.  One way to do this is sketched below.
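
      A minimal sketch, using placeholder device names for the nine
      drives.  Note that the stock kernel limits an MD array to 8
      real devices (see the MAX_REAL question in the section
      ''Troubleshooting Install Problems''), so the kernel may need
      that limit raised first:


          mdadd /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 \
                /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1
          mdrun -p0 /dev/md0
          mke2fs /dev/md0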



 12.
    QQ: When is Software RAID superior to Hardware RAID?


 AA: Normally, Hardware RAID is considered superior to Soft-
 ware RAID, because hardware controllers often have a large
 cache, and can do a better job of scheduling operations in
 parallel.  However, integrated Software RAID can (and does)
 gain certain advantages from being close to the operating
 system.


 For example, ... ummm. Opaque description of caching of
 reconstructed blocks in buffer cache elided ...


 On a dual PPro SMP system, it has been reported that
 Software-RAID performance exceeds the performance of a well-
 known hardware-RAID board vendor by a factor of 2 to 5.


 Software RAID is also a very interesting option for high-
 availability redundant server systems.  In such a
 configuration, two CPUs are attached to one set of SCSI
 disks.  If one server crashes or fails to respond, then the
 other server can mdadd, mdrun and mount the software RAID
 array, and take over operations.  This sort of dual-ended
 operation is not always possible with many hardware RAID
 controllers, because of the state configuration that the
 hardware controllers maintain.



 13.
    QQ: If I upgrade my version of raidtools, will it have trouble
    manipulating older raid arrays?  In short, should I recreate my
    RAID arrays when upgrading the raid utilities?


      AA: No, not unless the major version number changes.  An MD
      version x.y.z consists of three sub-versions:

           x:      Major version.
           y:      Minor version.
           z:      Patchlevel version.




      Version x1.y1.z1 of the RAID driver supports a RAID array
      with version x2.y2.z2 in case (x1 == x2) and (y1 >= y2).

      Different patchlevel (z) versions for the same (x.y) version
      are designed to be mostly compatible.


      The minor version number is increased whenever the RAID
      array layout is changed in a way which is incompatible with
      older versions of the driver. New versions of the driver
      will maintain compatibility with older RAID arrays.

      The major version number will be increased if it will no
      longer make sense to support old RAID arrays in the new
      kernel code.


      For RAID-1, it's not likely that either the disk layout or the
      superblock structure will change anytime soon.  Almost any
      optimization or new feature (reconstruction, multithreaded
      tools, hot-plug, etc.) will not affect the physical layout.
 14.
    QQ: The command mdstop /dev/md0 says that the device is busy.


      AA: There's a process that has a file open on /dev/md0, or
      /dev/md0 is still mounted.  Terminate the process or umount
      /dev/md0.



 15.
    QQ: Are there performance tools?

      AA: There is a new utility called iotrace in the
      linux/iotrace directory.  It reads /proc/io-trace and
      analyses/plots its output.  If you feel your system's block IO
      performance is too low, just look at the iotrace output.



 16.
    QQ: I was reading the RAID source, and saw the value SPEED_LIMIT
    defined as 1024K/sec.  What does this mean?  Does this limit
    performance?


      AA: SPEED_LIMIT is used to limit RAID reconstruction speed
      during automatic reconstruction.  Basically, automatic
      reconstruction allows you to e2fsck and mount immediately
      after an unclean shutdown, without first running ckraid.
      Automatic reconstruction is also used after a failed hard
      drive has been replaced.


      In order to avoid overwhelming the system while
      reconstruction is occurring, the reconstruction thread
      monitors the reconstruction speed and slows it down if it's
      too fast.  The 1M/sec limit was arbitrarily chosen as a
      reasonable rate which allows the reconstruction to finish
      reasonably rapidly, while creating only a light load on the
      system so that other processes are not interfered with.



 17.
    QQ: What about ''spindle synchronization'' or ''disk
    synchronization''?

      AA: Spindle synchronization is used to keep multiple hard
      drives spinning at exactly the same speed, so that their
      disk platters are always perfectly aligned.  This is used by
      some hardware controllers to better organize disk writes.
      However, for software RAID, this information is not used,
      and spindle synchronization might even hurt performance.



 18.
    QQ: How can I set up swap spaces using raid 0?  Wouldn't striped
    swap areas over 4+ drives be really fast?

      AA: Leonard N. Zubkoff replies: It is really fast, but you
      don't need to use MD to get striped swap.  The kernel auto-
      matically stripes across equal priority swap spaces.  For
      example, the following entries from /etc/fstab stripe swap
      space across five drives in three groups:
 /dev/sdg1       swap    swap    pri=3
 /dev/sdk1       swap    swap    pri=3
 /dev/sdd1       swap    swap    pri=3
 /dev/sdh1       swap    swap    pri=3
 /dev/sdl1       swap    swap    pri=3
 /dev/sdg2       swap    swap    pri=2
 /dev/sdk2       swap    swap    pri=2
 /dev/sdd2       swap    swap    pri=2
 /dev/sdh2       swap    swap    pri=2
 /dev/sdl2       swap    swap    pri=2
 /dev/sdg3       swap    swap    pri=1
 /dev/sdk3       swap    swap    pri=1
 /dev/sdd3       swap    swap    pri=1
 /dev/sdh3       swap    swap    pri=1
 /dev/sdl3       swap    swap    pri=1





 19.
    QQ: I want to maximize performance.  Should I use multiple
    controllers?

      AA: In many cases, the answer is yes.  Using several con-
      trollers to perform disk access in parallel will improve
      performance.  However, the actual improvement depends on
      your actual configuration.  For example, it has been
      reported (Vaughan Pratt, January 98) that a single 4.3GB
      Cheetah attached to an Adaptec 2940UW can achieve a rate of
      14MB/sec (without using RAID).  Installing two disks on one
      controller, and using a RAID-0 configuration results in a
      measured performance of 27 MB/sec.


      Note that the 2940UW controller is an "Ultra-Wide" SCSI
      controller, capable of a theoretical burst rate of 40MB/sec,
      and so the above measurements are not surprising.  However,
      a slower controller attached to two fast disks would be the
      bottleneck.  Note also, that most out-board SCSI enclosures
      (e.g. the kind with hot-pluggable trays) cannot be run at
      the 40MB/sec rate, due to cabling and electrical noise
      problems.


      If you are designing a multiple controller system, remember
      that most disks and controllers typically run at 70-85% of
      their rated max speeds.


      Note also that using one controller per disk can reduce the
      likelihood of system outage due to a controller or cable
      failure (In theory -- only if the device driver for the
      controller can gracefully handle a broken controller. Not
      all SCSI device drivers seem to be able to handle such a
      situation without panicking or otherwise locking up).


 99..  HHiigghh AAvvaaiillaabbiilliittyy RRAAIIDD


 1. QQ: RAID can help protect me against data loss.  But how can I also
    ensure that the system is up as long as possible, and not prone to
    breakdown?  Ideally, I want a system that is up 24 hours a day, 7
    days a week, 365 days a year.

      AA: High-Availability is difficult and expensive.  The harder
      you try to make a system be fault tolerant, the harder and
      more expensive it gets.   The following hints, tips, ideas
      and unsubstantiated rumors may help you with this quest.

      +o  IDE disks can fail in such a way that the failed disk on
         an IDE ribbon can also prevent the good disk on the same
         ribbon from responding, thus making it look as if two
         disks have failed.   Since RAID does not protect against
         two-disk failures, one should either put only one disk on
         an IDE cable, or if there are two disks, they should
         belong to different RAID sets.

      +o  SCSI disks can fail in such a way that the failed disk on
         a SCSI chain can prevent any device on the chain from
         being accessed.  The failure mode involves a short of the
         common (shared) device ready pin; since this pin is
         shared, no arbitration can occur until the short is
         removed.  Thus, no two disks on the same SCSI chain
         should belong to the same  RAID array.

      +o  Similar remarks apply to the disk controllers.  Don't
         load up the channels on one controller; use multiple
         controllers.

      +o  Don't use the same brand or model number for all of the
         disks.  It is not uncommon for severe electrical storms
         to take out two or more disks.  (Yes, we all use surge
         suppressors, but these are not perfect either).   Heat &
         poor ventilation of the disk enclosure are other disk
         killers.  Cheap disks often run hot.  Using different
         brands of disk & controller decreases the likelihood that
         whatever took out one disk (heat, physical shock,
         vibration, electrical surge) will also damage the others
         on the same date.

      +o  To guard against controller or CPU failure, it should be
         possible to build a SCSI disk enclosure that is "twin-
         tailed": i.e. is connected to two computers.  One
         computer will mount the file-systems read-write, while
         the second computer will mount them read-only, and act as
         a hot spare.  When the hot-spare is able to determine
         that the master has failed (e.g.  through a watchdog), it
         will cut the power to the master (to make sure that it's
         really off), and then fsck & remount read-write.   If
         anyone gets this working, let me know.

      +o  Always use a UPS, and perform clean shutdowns.  Although
         an unclean shutdown may not damage the disks, running
         ckraid on even small-ish arrays is painfully slow.   You
         want to avoid running ckraid as much as possible.  Or you
         can hack on the kernel and get the hot-reconstruction
         code debugged ...

      +o  SCSI cables are well-known to be very temperamental
         creatures, and prone to cause all sorts of problems.  Use
         the highest quality cabling that you can find for sale.
          Use e.g. bubble-wrap to make sure that ribbon cables do
          not get too close to one another and cross-talk.
         Rigorously observe cable-length restrictions.

      +o  Take a look at SSA (Serial Storage Architecture).
         Although it is rather expensive, it is rumored to be less
         prone to the failure modes that SCSI exhibits.


 +o  Enjoy yourself, it's later than you think.


 1100..  QQuueessttiioonnss WWaaiittiinngg ffoorr AAnnsswweerrss


 1. QQ: If, for cost reasons, I try to mirror a slow disk with a fast
    disk, is the S/W smart enough to balance the reads accordingly or
    will it all slow down to the speed of the slowest?


 2. QQ: For testing the raw disk throughput... is there a character
    device for raw reads/raw writes instead of /dev/sdaxx that we can
    use to measure performance on the raid drives?  Is there a GUI-
    based tool to use to watch the disk throughput?


 1111..  WWiisshh LLiisstt ooff EEnnhhaanncceemmeennttss ttoo MMDD aanndd RReellaatteedd SSooffttwwaarree

 Bradley Ward Allen <[email protected]> wrote:

      Ideas include:

      +o  Boot-up parameters to tell the kernel which devices are
         to be MD devices (no more ``mdadd'')

      +o  Making MD transparent to ``mount''/``umount'' such that
         there is no ``mdrun'' and ``mdstop''

      +o  Integrating ``ckraid'' entirely into the kernel, and
         letting it run as needed

         (So far, all I've done is suggest getting rid of the
         tools and putting them into the kernel; that's how I feel
         about it, this is a filesystem, not a toy.)

      +o  Deal with arrays that can easily survive N disks going
         out simultaneously or at separate moments, where N is a
         whole number > 0 settable by the administrator

      +o  Handle kernel freezes, power outages, and other abrupt
         shutdowns better

      +o  Don't disable a whole disk if only parts of it have
         failed, e.g., if the sector errors are confined to less
         than 50% of access over the attempts of 20 dissimilar
         requests, then it continues just ignoring those sectors
         of that particular disk.

      +o  Bad sectors:

      +o  A mechanism for saving which sectors are bad, someplace
         onto the disk.

      +o  If there is a generalized mechanism for marking degraded
         bad blocks that upper filesystem levels can recognize,
         use that. Program it if not.

      +o  Perhaps alternatively a mechanism for telling the upper
         layer that the size of the disk got smaller, even
         arranging for the upper layer to move out stuff from the
          areas being eliminated.  This would help with degraded
         blocks as well.

      +o  Failing the above ideas, keeping a small (admin settable)
         amount of space aside for bad blocks (distributed evenly
    across disk?), and using them (nearby if possible)
    instead of the bad blocks when it does happen.  Of
    course, this is inefficient.  Furthermore, the kernel
    ought to log every time the RAID array starts each bad
    sector and what is being done about it with a ``crit''
    level warning, just to get the administrator to realize
    that his disk has a piece of dust burrowing into it (or a
    head with platter sickness).

 +o  Software-switchable disks:

    ````ddiissaabbllee tthhiiss ddiisskk''''
       would block until kernel has completed making sure
       there is no data on the disk being shut down that is
       needed (e.g., to complete an XOR/ECC/other error
       correction), then release the disk from use (so it
       could be removed, etc.);

    ````eennaabbllee tthhiiss ddiisskk''''
       would mkraid a new disk if appropriate and then start
       using it for ECC/whatever operations, enlarging the
       RAID5 array as it goes;

    ````rreessiizzee aarrrraayy''''
       would respecify the total number of disks and the
       number of redundant disks, and the result would often
        be to resize the array; where no data loss
       would result, doing this as needed would be nice, but
       I have a hard time figuring out how it would do that;
       in any case, a mode where it would block (for possibly
       hours (kernel ought to log something every ten seconds
       if so)) would be necessary;

    ````eennaabbllee tthhiiss ddiisskk wwhhiillee ssaavviinngg ddaattaa''''
       which would save the data on a disk as-is and move it
       to the RAID5 system as needed, so that a horrific save
       and restore would not have to happen every time
       someone brings up a RAID5 system (instead, it may be
       simpler to only save one partition instead of two, it
       might fit onto the first as a gzip'd file even);
       finally,

    ````rree--eennaabbllee ddiisskk''''
       would be an operator's hint to the OS to try out a
       previously failed disk (it would simply call disable
       then enable, I suppose).


 Other ideas off the net:


      +o  finalrd analog to initrd, to simplify root raid.

      +o  a read-only raid mode, to simplify the above

      +o  Mark the RAID set as clean whenever there are no "half
         writes" done. -- That is, whenever there are no write
         transactions that were committed on one disk but still
         unfinished on another disk.

         Add a "write inactivity" timeout (to avoid frequent seeks
         to the RAID superblock when the RAID set is relatively
         busy).