A new approach toward data protection for high-density disk drives

April 26, 2011

Erasure coding isn’t an entirely new concept, but it is a new approach to data protection in commercial storage systems. It’s specifically useful for protecting data stored on high-density (multi-terabyte) disk drives, where RAID technology has particular limitations (see the prior blog posting).

Erasure codes are a form of forward error-correction technology that has been used in a variety of technical applications for years. The basic idea is that data is broken into multiple packets (with a bit of additional, redundant information), sent to a receiver, and then reassembled on the receiving side. The key is that the receiver can reassemble the data even if some of the packets are lost in transmission (that is, the receiver has only a subset of the packets that were sent). This made erasure codes a perfect fit for deep space transmissions.
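To make the idea concrete, here is a minimal sketch in Python of the simplest possible erasure code: the data is split into k packets and one extra XOR parity packet is added, so the receiver can rebuild the data if any single packet is lost. This is an illustration only, not how any particular product works; the function names (encode_with_parity, recover) are made up for this example, and real systems use stronger codes that survive more than one loss.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length packets."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode_with_parity(data: bytes, k: int) -> List[bytes]:
    """Split data into k equal-size packets and append one XOR parity packet."""
    size = -(-len(data) // k)              # ceiling division
    padded = data.ljust(size * k, b"\0")   # pad so every packet has the same size
    packets = [padded[i * size:(i + 1) * size] for i in range(k)]
    parity = reduce(xor_bytes, packets)    # the "bit of additional information"
    return packets + [parity]


def recover(received: List[Optional[bytes]], length: int) -> bytes:
    """Rebuild the original data when at most one packet is missing (None)."""
    missing = [i for i, p in enumerate(received) if p is None]
    if len(missing) > 1:
        raise ValueError("single parity can recover only one lost packet")
    packets = list(received)
    if missing:
        # XOR of all surviving packets equals the missing one.
        present = [p for p in packets if p is not None]
        packets[missing[0]] = reduce(xor_bytes, present)
    # Drop the parity packet and the padding.
    return b"".join(packets[:-1])[:length]


message = b"erasure codes tolerate lost packets"
sent = encode_with_parity(message, k=4)
sent[2] = None                             # one packet is lost in transit
print(recover(sent, len(message)) == message)
```

The receiver never asks for a retransmission; it works with whatever subset of packets arrived, which is the property that matters when a round trip is slow or impossible.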

The notion of using erasure codes for storage media came about with the advent of CDs, DVDs and Blu-ray discs, where the media should remain playable even with scratches or other damage to the recording surface. The most common algorithm for erasure coding in these applications is Reed-Solomon coding, which was developed at MIT Lincoln Laboratory in the 1960s.

A few years ago, when disk drive capacities approached the 250GB and 500GB mark, most of the storage manufacturers realized that they needed to do something about the increasing probability of data loss on high-density drives. As explained in our previous posting, as drive densities increase, RAID rebuild times increase. This opens a window of exposure to data loss, due to additional drive failures during the rebuild (or media errors such as unrecoverable read errors). The idea of adding a second level of protection to RAID-5, in the form of what is commonly known as RAID-6, came about to protect data against two simultaneous disk drive failures rather than one. In some vendor implementations, the second level of protection (the second “parity”) is implemented using Reed-Solomon techniques. This can create confusion, since many writings therefore lump RAID-6 in with erasure coding.

The use of pure erasure-coding algorithms enables protection well beyond the two drive failures tolerated in RAID-6. In fact, some implementations allow essentially unlimited levels of data protection, and a few even allow the user (or storage administrator) to specify the level of protection as a policy (for example, the data should survive 4 failures out of 16 disks, or 10 failures out of 20 disks).
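A policy such as "survive 4 failures out of 16 disks" can be illustrated with a small Reed-Solomon-style sketch in Python. It rests on several assumptions that are ours and not any vendor's design: the policy is read as 12 data symbols plus 4 parity symbols, arithmetic is done modulo the prime 257 rather than in the binary fields production codes use, and the names rs_encode, rs_decode and storage_overhead are hypothetical. The point is only to show that any 12 of the 16 symbols are enough to rebuild the data, and what that protection costs in raw capacity.

```python
P = 257  # smallest prime above 255, so every byte value is a field element


def interpolate(points, x):
    """Evaluate at x the unique polynomial of degree < len(points)
    passing through the given (x, y) points, with arithmetic mod P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, P - 2, P) is the modular inverse of den (Fermat's little theorem).
        total = (total + yi * num * pow(den, P - 2, P)) % P
    return total


def rs_encode(data, n):
    """Return n shares (index, value): the k data symbols unchanged,
    followed by n - k parity symbols taken from the same polynomial."""
    k = len(data)
    points = list(enumerate(data))
    return points + [(x, interpolate(points, x)) for x in range(k, n)]


def rs_decode(shares, k):
    """Recover the k data symbols from any k surviving shares."""
    if len(shares) < k:
        raise ValueError("too many shares lost to recover the data")
    points = shares[:k]
    return [interpolate(points, x) for x in range(k)]


def storage_overhead(disks, failures):
    """Raw capacity consumed per unit of user data for a given policy."""
    return disks / (disks - failures)


data = list(b"hello world!")                   # 12 data symbols, one per disk
shares = rs_encode(data, 16)                   # 16 disks in total
survivors = [s for s in shares if s[0] not in (0, 5, 11, 14)]  # 4 disks fail
print(rs_decode(survivors, 12) == data)
print(storage_overhead(16, 4), storage_overhead(20, 10))
```

Under this reading, the 4-of-16 policy stores about 1.33 bytes of raw capacity per byte of user data, while the 10-of-20 policy stores 2 bytes per byte, the same raw cost as keeping two full copies but with tolerance for ten failures instead of one.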

A handful of storage systems have now been commercialized, based either on traditional Reed-Solomon techniques or on a newer generation of erasure-coding variants such as Tornado codes or online codes. Each of these has advantages, including protecting data against multiple component failures (disk drives, network links, power failures, etc.), unrecoverable read errors and bit rot. Many provide this protection in systems that automatically heal the data in the event of such a component failure.

These systems can take advantage of the opportunity created by high-density disk drives: storing data very affordably (large disk drives usually provide the lowest cost per gigabyte) while still delivering very high storage reliability and durability.