Hard Disk Failure Rates
This week we had an unfortunate series of coincidences that led us to take a close look at our backup and restore practices.
To give you a little high-level insight into operations at Deep Core Data, we run a number of virtualized environments for different customers. In these environments, we take regularly scheduled image backups of all the virtual hard drives. Once the backups are taken, they are pulled to staging servers, where safe copies of the data are held. From there, we schedule them for transfer to an off-site backup location. These staging servers are basically enormous stacks of hard disks combined into RAID arrays. Some use RAID 5, some use RAID 6.
Now the technical details of RAID 5 and RAID 6 aren’t particularly relevant here, besides one key fact: a RAID 5 set of hard drives can lose 1 drive without losing data, while a RAID 6 set can survive 2 drives failing. RAID 6 needs at least four drives in the array, so it’s only possible on machines with at least that many hard drives.
On Thursday night last week, I got an alert from our monitoring system that a drive had failed on one of our storage servers. That’s not especially uncommon, but it does require us to take action. These drives are rated by their manufacturers as having a mean time between failures (MTBF) of about 100,000 hours, or about 11 years. The practical upshot is that, give or take a few factors, each drive has about a one-in-100,000 chance of failing in any given hour. The following morning, I went into our data center and found we had lost a second drive and the array was compromised. That was a big pain and resulted in a lot of time-consuming work to bring the array back online, but it got me thinking about drive survivability.
Now, my math suggests that there was only around a 1 in 8,333 chance of a second drive failure in the 12 hours after the first drive failure. Was I just that unlucky? I did some digging.
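For anyone who wants to check that figure, here is a rough sketch of the arithmetic in Python. It assumes a constant failure rate (an exponential lifetime model), which is a simplification, and it looks at a single drive in isolation. The function name and constants are mine, chosen purely for illustration.

```python
import math

HOURS_PER_YEAR = 24 * 365


def failure_probability(mtbf_hours: float, window_hours: float) -> float:
    """Chance that one drive fails within the window.

    Assumes a constant failure rate, so drive lifetimes follow an
    exponential distribution with mean equal to the MTBF.
    """
    return 1.0 - math.exp(-window_hours / mtbf_hours)


rated_mtbf = 100_000  # hours, the manufacturer rating quoted above
window = 12           # hours between the first and second failure

p = failure_probability(rated_mtbf, window)
print(f"Rated MTBF in years: {rated_mtbf / HOURS_PER_YEAR:.1f}")
print(f"Chance of one drive failing in {window} hours: about 1 in {1 / p:,.0f}")
```

For windows this short, the result is nearly identical to simply dividing 12 by 100,000.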
It turns out that the MTBF figures manufacturers publish for their hard disks are… optimistic. I always rather suspected as much, but I didn’t realize the degree to which their optimism borders on blue sky. At least one major manufacturer’s drives have a failure rate of over 24% after 3 years, suggesting they fail at around 9-10% per year. The practical upshot is that the drives don’t really have an MTBF of 100,000 hours, but something more akin to 53,000 hours. So we were unlucky, but not lottery-odds unlucky, especially considering the sheer number of drives spinning away in our environment.
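To see how the number of drives changes the picture, here is a small Python sketch that estimates the chance of at least one drive failing in a 12-hour window, using the 53,000-hour figure. It assumes a constant failure rate and that drives fail independently of one another, which real arrays may not satisfy. The drive counts are hypothetical and are not our actual fleet size.

```python
import math


def any_drive_fails(mtbf_hours: float, window_hours: float, drive_count: int) -> float:
    """Chance that at least one of drive_count drives fails within the window.

    Assumes a constant failure rate and independent failures.
    """
    per_drive_survival = math.exp(-window_hours / mtbf_hours)
    return 1.0 - per_drive_survival ** drive_count


observed_mtbf = 53_000  # hours, the more realistic figure from the text
window = 12             # hours

# Hypothetical drive counts, for illustration only.
for drive_count in (1, 10, 100):
    p = any_drive_fails(observed_mtbf, window, drive_count)
    print(f"{drive_count:>3} drives: about 1 in {1 / p:,.0f}")
```

The odds shorten roughly in proportion to the number of drives, which is why a large environment sees back-to-back failures more often than the single-drive math suggests.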
We’re insulated from the risk of failing drives by the sheer quantity of drives we use and by separate, fully redundant storage systems that make sure any given data loss isn’t catastrophic. That said, many of our customers, current and future, use small NAS devices both at work and at home, and it’s important to understand that a RAID array isn’t foolproof protection; even RAID arrays need to be backed up.
If you’re lying awake at night worrying about what will happen if your business’s storage fails, give us a call. We’re ready to help.