RAID: Can It Fail? If It Does, Is Data Recovery Possible?

Originally, as envisaged in 1987 by Patterson, Gibson and Katz of the University of California, Berkeley, the acronym RAID stood for a “Redundant Array of Inexpensive Disks”. In short, a larger number of smaller, cheaper disks could be used in place of a single, much more expensive large hard disk, or even to create a disk that was larger than any then available.

They went a stage further and proposed a variety of options that would not only give a big disk at a lower cost, but could also improve performance or increase reliability at the same time. The options for improved reliability were needed partly because using multiple disks reduces the Mean Time Between Failures (MTBF): divide the MTBF of one drive by the number of drives in the array and, in theory, a RAID without redundancy will fail sooner than a single disk.
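The division described above can be sketched in a few lines of Python. This is only a rough illustration, assuming identical drives that fail independently; the function name and the 1,000,000 hour figure are made up for the example and are not taken from any drive specification.

```python
def array_mtbf_hours(drive_mtbf_hours: float, drive_count: int) -> float:
    """Approximate MTBF until the first drive failure in an array.

    Assumes identical drives failing independently, and an array
    with no redundancy, so the first drive failure is an array failure.
    """
    if drive_count < 1:
        raise ValueError("drive_count must be at least 1")
    return drive_mtbf_hours / drive_count


# Hypothetical per-drive MTBF of 1,000,000 hours.
for count in (1, 4, 8):
    print(count, "drive(s):", array_mtbf_hours(1_000_000, count), "hours")
```

The more drives in the array, the sooner the first failure can be expected, which is why the redundant RAID levels were proposed alongside the striped ones.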

Today RAID is usually described as a “Redundant Array of Independent Disks”. Technology has moved on, and even the most costly disks are not particularly expensive.

Six standard levels of RAID, numbered 0 to 5, are usually described, some geared towards performance and others towards improved fault tolerance. The original paper defined levels 1 to 5. RAID 0 was added later and has no redundancy or fault tolerance, so it might not truly be considered RAID.

RAID 0 – Striped and not really “RAID”

RAID 0 provides capacity and speed but not redundancy. Data is striped across the drives, with all of the performance benefits that gives, but if one drive fails the RAID is dead, just as if a single hard disk drive had failed.

This is good for transient storage where performance matters but the data is either non-critical or a copy is also kept elsewhere. Other RAID levels are more suited for critical systems where backups might not be up-to-the-minute, or down-time is undesirable.
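To make the idea of striping concrete, here is a minimal sketch of how a logical block number might be mapped to a disk in a RAID 0. The function name and parameters are illustrative assumptions. Real controllers differ in strip size and in how they number the disks.

```python
def locate_block(logical_block: int, disk_count: int, blocks_per_strip: int):
    """Return (disk index, block offset on that disk) for a RAID 0 logical block.

    Strips are assumed to be dealt out to the disks in simple round-robin
    order: strip 0 on disk 0, strip 1 on disk 1, and so on.
    """
    strip_index, offset_in_strip = divmod(logical_block, blocks_per_strip)
    stripe_row, disk = divmod(strip_index, disk_count)
    return disk, stripe_row * blocks_per_strip + offset_in_strip


# Four disks, two blocks per strip: consecutive strips land on consecutive disks.
for block in range(10):
    print(block, locate_block(block, disk_count=4, blocks_per_strip=2))
```

Because neighbouring strips sit on different drives, a large read can be serviced by all of the drives at once. It also means that every drive holds a slice of almost every large file.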

RAID 1 – Mirroring

RAID 1 is often used for the boot devices in servers or for critical data where reliability requirements are paramount. Usually 2 hard disk drives are used and any data written to one disk is also written to the other.

If one drive fails, the system can switch to single-drive operation. The failed drive is then replaced and the data copied to the replacement drive to rebuild the mirror.

RAID 2

RAID 2 introduced error correction code generation to compensate for drives that did not have their own error detection. No such drives exist now, and none have for a long time, so RAID 2 is not really used anywhere.

RAID 3 – Dedicated Parity

RAID 3 uses striping down to the byte level, which adds a hardware overhead for no apparent benefit. It also introduces “parity”, or error correction data, on a separate drive, so an additional hard disk is needed that gives greater security but no additional space.

RAID 4 – Dedicated Parity

RAID 4 stripes to the block level, and like RAID 3 stores parity information on a dedicated drive.
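A short sketch may help show why a single parity drive is enough to survive the loss of any one disk. It assumes simple XOR parity, which is the usual scheme for these levels. The helper name and the sample data are invented for the example.

```python
from functools import reduce


def xor_strips(strips):
    """XOR a list of equal-length byte strings together, byte by byte."""
    if len({len(s) for s in strips}) != 1:
        raise ValueError("need at least one strip, all of equal length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*strips))


# Three data strips, one per data disk (made-up contents).
data = [b"RAID", b"demo", b"data"]

# The parity strip is the XOR of all the data strips.
parity = xor_strips(data)

# Pretend the disk holding data[1] has failed: XOR the survivors with the parity.
rebuilt = xor_strips([data[0], data[2], parity])
print(rebuilt == data[1])
```

The same calculation works whichever disk is lost, including the parity disk itself, but it cannot recover from two failed disks at once.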

RAID 5 – The most common format

RAID 5 stripes at the block level but does not use a single dedicated drive for storing parity. Instead, parity is interspersed with the data: each stripe holds a run of data strips and one parity strip, and the disk holding the parity strip changes from one stripe to the next.

This could mean, for example, that in a 3-disk RAID 5 there are data strips on disks 0 and 1 followed by a parity strip on disk 2. For the next stripe the data is on disks 0 and 2 with the parity on disk 1, then data on disks 1 and 2 with parity on disk 0.
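The rotation in that 3-disk example can be sketched as follows. The formula is an assumption chosen to reproduce the layout described above. Real controllers use several different rotation schemes, so this should not be read as the layout of any particular product.

```python
def parity_disk(stripe_row: int, disk_count: int) -> int:
    """Disk holding the parity strip for a given stripe.

    Assumed rotation: parity starts on the last disk and moves
    one disk to the left with each successive stripe.
    """
    return (disk_count - 1 - stripe_row) % disk_count


def stripe_layout(stripe_row: int, disk_count: int):
    """Return a list with 'P' for the parity disk and 'D' for data disks."""
    parity = parity_disk(stripe_row, disk_count)
    return ["P" if disk == parity else "D" for disk in range(disk_count)]


for row in range(4):
    print(row, stripe_layout(row, 3))
# 0 ['D', 'D', 'P']
# 1 ['D', 'P', 'D']
# 2 ['P', 'D', 'D']
# 3 ['D', 'D', 'P']
```

Knowing, or working out, the rotation pattern is one of the things needed when a RAID 5 has to be reassembled by hand from images of its member disks.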

RAID 5 is generally faster for smaller reads, so it is eminently suitable for server systems shared by large numbers of users creating smaller data files or accessing smaller amounts of data each time. For other applications, however, RAID 4 will outperform RAID 5 quite considerably.

Beyond RAID 5?

Advances on RAID 5 do exist, though in general these take RAID 5 techniques and enhance them, for example by mirroring two RAID 5 arrays or by storing two sets of parity.

RAID data recovery

It might be imagined that, with all of this fault tolerance, hard drive recovery would not be a requirement, but things will still go wrong.

With all RAID levels, logical corruption (damage to the file system) has just as devastating an effect as with a single hard disk. You might have a robustly stored file system, but it is a robustly stored and corrupted file system.

With RAID 0, the failure of one disk is terminal for the RAID. If data cannot be recovered from the failed disk then a percentage of the data is lost for good, and since RAID 0 uses data striping this could be like losing 1 MB of data out of every 4 MB. The chances of that leaving any major files intact are low. Among smaller files, those smaller than the combined size of one strip from each working drive, some will fortunately be intact. For larger files (e.g. Exchange or SQL databases) there will be considerable data loss and structural damage, and low-level work will be required to salvage any useful data from them.
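As a rough illustration of why small files sometimes survive and large ones do not, the sketch below checks whether a file touches the failed disk. It assumes the file is stored contiguously and that strips are dealt out round-robin. The function name, the 64 KiB strip size and the file positions are all hypothetical.

```python
def file_survives(start_byte: int, length: int, strip_size: int,
                  disk_count: int, failed_disk: int) -> bool:
    """True if no byte of a contiguous file lies on the failed RAID 0 disk."""
    if length <= 0:
        return True
    first_strip = start_byte // strip_size
    last_strip = (start_byte + length - 1) // strip_size
    # A file spanning at least one strip per disk must touch every disk.
    if last_strip - first_strip + 1 >= disk_count:
        return False
    return all(strip % disk_count != failed_disk
               for strip in range(first_strip, last_strip + 1))


STRIP = 64 * 1024  # hypothetical 64 KiB strip size

# Four disks with disk 2 lost: one strip in every four is gone.
print(file_survives(0, 100_000, STRIP, 4, failed_disk=2))          # small file on disks 0 and 1
print(file_survives(0, 200_000, STRIP, 4, failed_disk=2))          # spans all four disks
print(file_survives(2 * STRIP, 10, STRIP, 4, failed_disk=2))       # tiny file, but on the failed disk
```

In practice files are often fragmented, and the file system's own structures may sit on the failed disk, so real-world results are usually worse than this simple model suggests.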

 
