The Math on Hard Drive Failure

Nick2253

Wizard
Joined
Apr 21, 2014
Messages
1,633
Since I haven't seen any discussion on this issue (well, anything more than speculation), I thought I'd go ahead and do some math on exactly what the likelihood of HDD failure (and, by extension, array failure) really is.

I'll be using the data from Google's HDD study (http://research.google.com/archive/disk_failures.pdf), the Microsoft/U. Virginia study (http://www.cs.virginia.edu/~gurumurthi/papers/acmtos13.pdf), and Microsoft Research (http://arxiv.org/ftp/cs/papers/0701/0701166.pdf). Like any good hard drive manufacturer, I'll treat 1 TB as 1 trillion bytes.

Now, on to the data.

The annual failure rate (AFR) for the average consumer hard drive (unsegregated by age, temperature, workload, etc.) is approximately 4-6%. This is corroborated by both the Google and Microsoft studies, as well as other general studies cited in the Microsoft paper. Certain subgroups of hard drives have failure rates below 2%, and certain subgroups have failure rates above 15%.

Here's a data table that lists the AFR for an array of "# drives" size with "# redun" redundant disks, based on various single-disk AFRs. (For example, a 6-drive RAID10 array would have 6 drives and 3 redundant drives, and if those drives had a 5% AFR, the array would have a 0.009% failure rate.)

Array AFR based on number of drives, number of redundant drives, and drive AFR.

[Table image: HDD failure rates.PNG]


What we are calculating here is the likelihood that (# redun + 1) or more drives fail. Mathematically, that looks like this (let d = # drives, r = # redun, and a = AFR):

P(\text{array failure}) = \sum_{k=r+1}^{d} \binom{d}{k}\, a^{k} (1-a)^{d-k}
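For anyone who wants to plug in their own numbers, here's a minimal Python sketch of that formula (the function name and example values are mine, and just for illustration):

```python
from math import comb

def array_afr(d, r, a):
    """Probability that more than r of the d drives fail in a year,
    given a per-drive annual failure rate a (simple binomial model)."""
    return sum(comb(d, k) * a**k * (1 - a)**(d - k) for k in range(r + 1, d + 1))

# The RAID10 example from above: 6 drives, 3 redundant, 5% per-drive AFR
print(f"{array_afr(6, 3, 0.05):.3%}")  # ~0.009%, matching the example above
```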


However, disk failures aren't the only cause of array failures. We also have to consider the dreaded unrecoverable read error (URE), also known as the bit error rate (BER).

Now, we've all read cyberjock's article on how RAID5 died in 2009. However, the math in that article isn't quite correct, because unrecoverable read errors are not *independent* events. In other words, if you have one URE, you are almost always guaranteed a second one. In large part, this is driven by the fact that data is not stored on a hard drive in bits, but rather in blocks (or sectors).

The causes of UREs are diverse and, apart from the internals of the hard drive, include disk controllers, host adapters, cables, electromagnetic noise, etc., so it behooves us to refer to an empirical study on failure rates.

In a 2005 Microsoft study, 1.4 PB of data were read, and only 3 unrecoverable read *events* were recorded. Though each of these events caused multiple bits of data to be lost, only three events occurred.

The table below shows the data for these three theories: that UREs are independent, based on manufacturers' specifications; that UREs are dependent, based on sector failure; and the empirical data from Microsoft.

URE probability for different data read amounts

[Table image: URE.PNG]


The top row shows bits read per URE (which is the inverse of how it's sometimes reported). What we are calculating here is the likelihood that one or more UREs occur, or 1 - P(no URE). Let s = drive size in bits and U = bits per URE:

P(\text{URE}) = 1 - \left(1 - \frac{1}{U}\right)^{s}


Obviously, UREs cannot be fully independent based on manufacturers' specifications. Anecdotally, during my hard drive burn-in, I wrote and read 4 passes of data to my six 4TB hard drives, which comes to 96TB of data. If UREs were independent, I would have had a 99.95% chance of a URE, yet I did not encounter one.
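As a quick check of that 99.95% figure, here's a minimal Python sketch of the P(URE) formula above. The log1p/expm1 form is just a numerically safer way of writing 1 - (1 - 1/U)^s; the function and variable names are mine.

```python
from math import expm1, log1p

def p_ure(bits_read, bits_per_ure):
    """Probability of at least one URE while reading bits_read bits, assuming
    each bit independently fails with probability 1/bits_per_ure.
    Algebraically identical to 1 - (1 - 1/U)**s, but avoids rounding loss."""
    return -expm1(bits_read * log1p(-1 / bits_per_ure))

bits_read = 96e12 * 8  # the 96 TB burn-in above, in bits
print(f"{p_ure(bits_read, 1e14):.2%}")  # manufacturer spec of 10^14 bits/URE -> ~99.95%
```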

UREs also cannot be dependent based on sectors. Even with 96TB, the likelihood of a URE under this model would be less than a septillionth of a percent. However, empirically, UREs do happen.

Which brings us to the empirical model of 3 failures per 1.4 PB, which translates to a URE rate of one URE per 3.73 * 10^15 bits read (1.4 PB is about 1.12 * 10^16 bits, divided by 3 events). For my 96TB, this translates to a 15.68% likelihood of a URE. Anecdotally, for modern drives, this rate may still overestimate failures, though I have yet to find an empirical study looking for UREs on modern HDDs. Since we're looking for a failure rate, let's be slightly pessimistic and assume that this is in fact the empirical URE rate.

Now, we can combine the HDD failure rate with the URE rate to get a more accurate picture of how protected (or not) our data is.

There are 5 dimensions involved here (drive size, array size, redundant disks, URE rate, and drive AFR), so I've simplified some of the dimensions for usability. Using the methods here, you should be able to calculate your array failure rate for your own parameters.

Array AFR based on various factors

[Table image: combined failure rate.PNG]


Across the top is the URE rate in bits per URE, and the next row is the AFR for the individual hard drive. Mathematically, I'm calculating:

P(\text{failure}) = 1 - \left[\bigl(1 - P(\text{URE})\bigr)\sum_{k=0}^{r}\binom{d}{k} a^{k}(1-a)^{d-k} \;+\; P(\text{URE})\sum_{k=0}^{r-1}\binom{d}{k} a^{k}(1-a)^{d-k}\right]


The equation above can be simplified quite a bit, but I left it in this form because it is more intuitive. What we have is the same P(URE) from before, and our new P(failure) is 1 - P(success), where P(success) is the probability that we don't have a URE and have 0 to r drive failures, plus the probability that we do have a URE and have 0 to r-1 drive failures.
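Here's a minimal Python sketch of that combined calculation, reusing the pieces above. One assumption on my part: I evaluate P(URE) over a single drive's worth of data; the table may use a different read amount, so treat the printed number as illustrative only.

```python
from math import comb, expm1, log1p

def p_ure(bits_read, bits_per_ure):
    """P(at least one URE) while reading bits_read bits."""
    return -expm1(bits_read * log1p(-1 / bits_per_ure))

def combined_afr(d, r, a, drive_bits, bits_per_ure):
    """1 - P(success), where success is either no URE and at most r drive
    failures, or a URE and at most r - 1 drive failures."""
    def p_at_most(k_max):
        return sum(comb(d, k) * a**k * (1 - a)**(d - k) for k in range(k_max + 1))
    pu = p_ure(drive_bits, bits_per_ure)
    return 1 - ((1 - pu) * p_at_most(r) + pu * p_at_most(r - 1))

# Illustrative inputs (my assumption): 6 x 4 TB drives, RAIDZ2 (r = 2),
# 5% per-drive AFR, empirical URE rate of 3.73e15 bits per URE
print(f"{combined_afr(6, 2, 0.05, 4e12 * 8, 3.73e15):.3%}")
```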

Results

What we find is that, even under our empirical URE rate, RAID5/RAIDZ1 arrays are still incredibly risky. The AFR for these arrays can be as bad as 5%.

It's also important to note that drive AFR matters a great deal for array AFR. Decreasing your drive AFR by a small amount (e.g. better cooling, minimizing vibration, replacing older drives) pays large dividends. For example, decreasing the AFR from 7% to 3% on a six-drive RAIDZ2 array decreases the AFR for the array by roughly 10x.
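As a rough check of that claim, here's the drive-failure-only sketch from earlier applied to both AFRs (note this ignores the URE term, so the exact ratio will differ slightly from the combined table):

```python
from math import comb

def array_afr(d, r, a):
    """P(more than r of d drives fail) under the binomial model from above."""
    return sum(comb(d, k) * a**k * (1 - a)**(d - k) for k in range(r + 1, d + 1))

# 6-drive RAIDZ2 (r = 2) at 7% vs. 3% per-drive AFR
high, low = array_afr(6, 2, 0.07), array_afr(6, 2, 0.03)
print(f"7% drive AFR: {high:.3%}, 3% drive AFR: {low:.3%}, ratio: {high / low:.1f}x")
# Roughly an order of magnitude, in line with the ~10x figure above.
```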

I hope this has been educational for you all. If you have any other questions, let me know.
 

crisman

Explorer
Joined
Feb 8, 2012
Messages
97
Hi Nick2253,

I found your post very interesting, but there are other relevant factors we should know about regarding drive failure. I recommend reading this study by Blackblaze, a cloud storage company, where you and all of us can learn about drive failure rates across different makes and models.

Look at the following link: http://blog.backblaze.com/category/storage-pod/

Happy sharing!! ;)
 

Nick2253

Wizard
Joined
Apr 21, 2014
Messages
1,633
Hi Nick2253,

I found your post very interesting, but there are other relevant factors we should know about regarding drive failure. I recommend reading this study by Blackblaze, a cloud storage company, where you and all of us can learn about drive failure rates across different makes and models.

Look at the following link: http://blog.backblaze.com/category/storage-pod/

Happy sharing!! ;)

Actually, I intentionally did not include the Backblaze study. I question the data collected by Backblaze, because it actually goes against a large body of work on hard drive failure, including huge studies conducted by Microsoft, Google, and a number of universities. Of note, Backblaze found no correlation between temperature and hard drive failure, which disagrees with almost all existing studies about hard drive failure, not to mention the much larger body of work on electronic failures in general and motor failures, specifically.

In addition, Backblaze uses a proprietary drive cage that subjects their drives to a large amount of vibration, and their methods of procuring drives are somewhat esoteric. Neither of these factors is likely to apply to the average user here, whereas the standard hard drive mounting used by Google, Microsoft, and others is much more likely to match what users here use.

For all we know, Backblaze's study gives us a statistical look at how hard drives fail when subjected to large amounts of vibration, but without additional information, it's difficult to treat it as a legitimate "study" rather than a media grab.
 

crisman

Explorer
Joined
Feb 8, 2012
Messages
97
You could be right, but most of us don't have datacenters at home. I've seen lots of people here build FreeNAS servers without even the minimum precautions for good airflow through their chassis to avoid high temperatures, and with even less attention to vibration, because they put the chassis on top of desks and other unsuitable supports. The Blackblaze study also helps us understand how different hard drive models behave under those conditions.
 

Nick2253

Wizard
Joined
Apr 21, 2014
Messages
1,633
You could be right, but most of us don't have datacenters at home. I've seen lots of people here build FreeNAS servers without even the minimum precautions for good airflow through their chassis to avoid high temperatures, and with even less attention to vibration, because they put the chassis on top of desks and other unsuitable supports. The Blackblaze study also helps us understand how different hard drive models behave under those conditions.

Again, Backblaze's data is inconsistent with almost all research data collected on hard drives. The fact that they break the data out by hard drive model doesn't make it any more valid. Furthermore, Backblaze clearly favors Seagate hard drives, as you can read from their storage blog, but they also show Seagate drives with the highest failure rate. Is this because they put Seagate drives in the most demanding conditions because they believe them to be the most reliable? Hence the problem with this so-called "study": they don't give us any details of their methodology other than "drives go in pods and run." We have no indication that they addressed possible systematic biases on their part, and furthermore we have no detailed data to verify their analyses.

Their "statistical" analysis for hard drive temperature involved them looking at the average temperature for different models of drive, and concluded that temperature doesn't impact drives. This is simply bad statistics. Average anything across a large enough sample size and you'll erase any statistically significant correlations.

Backblaze (and it's Backblaze, not Blackblaze) has produced some interesting anecdotal data, but they've clearly demonstrated that it's nothing more than anecdotal, and not scientifically rigorous.
 

crisman

Explorer
Joined
Feb 8, 2012
Messages
97
Well, your opinion may be valid. The fact that they used more Seagate drives is because they are cheaper ("we buy the least expensive drives that will work"); the share of other brands is much lower, but it can still be used as a sample.
 

Nick2253

Wizard
Joined
Apr 21, 2014
Messages
1,633
I guess I don't know why you're so gung-ho about this Backblaze thing. There are countless other studies to pull from that are actually proper, published, peer-reviewed studies.

The sampling methods used here definitely introduce systemic bias. In their signature graphic, for example, they chose to segregate drives by size, which actually measures drive reliability as a function of age (since bigger drives are newer and therefore less likely to fail), compounded by the different sample sizes and purchase volumes for the different drive sizes. To do any kind of analysis like that, you need to actually account for other factors, showing what's going on with all other factors being equal. It's Statistics and Sampling 101. But they don't. And that means you can't draw any meaningful conclusion from the data.

Study after study indicates that things like vibration, temperature, and age have a huge impact on drive reliability. Backblaze basically sweeps all of that under the rug with their poor statistical methodology. Throughout their "study" they are comparing apples and oranges: drives of different ages, drives used in different enclosure types, a non-uniform distribution of brands and models. There are legitimate statistical ways to account for these discrepancies. But they don't do it.

If you have other questions about the Backblaze data, I'm going to refer you to two fantastic articles about problems with the data:

http://www.tweaktown.com/articles/6...bility-myth-the-real-story-covered/index.html
http://www.enterprisestorageforum.c...ng-a-disk-drive-how-not-to-do-research-1.html

I understand the appeal of the Backblaze data: it's the first time that any major company has released brand v. brand data on HDDs. But Backblaze has muddied their own waters by failing to account for any of the other variables here, which are essential to understanding what's actually going on. I'm not saying there's no difference. In fact, differences in reliability between vintages of HDDs are well known: http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=1285441&url=http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1285441.

The problem here is that HDD models and vintages change faster than the data available to analyze them. Sure, I could update the table above, but part of why I generalized the data to disk AFR is that it allows comparisons based on different AFRs. If you run your drives in a high-temperature environment, well, you have a higher AFR. I could create mathematical formulae for different conditions, but it's difficult to truly quantify those conditions.
 

crisman

Explorer
Joined
Feb 8, 2012
Messages
97
No! No! I'm not so gung-ho about the Backblaze thing; I think your opinion and your math study are really interesting.
I just found the statistics from their study very curious, and as I said before, it's just a "sampling" of how different brands and models of hard drives behave in their environments.
 

Nick2253

Wizard
Joined
Apr 21, 2014
Messages
1,633
OK, whew ;)

One of my biggest pet peeves is when poor data gets picked up as "fact." I really wish that Backblaze had provided better data, because I think they have a lot of useful insight. I mean, on the vibration issue, Backblaze could have provided a ton of data on exactly what vibration failure looks like (and, therefore, the importance or unimportance of properly securing your drive). But they basically threw their data into Excel, made a pretty chart, and were like "voila!", data. And then a small part of me died on the inside :)
 

crisman

Explorer
Joined
Feb 8, 2012
Messages
97
Maybe in the future . . . I can almost assure you that they've already done some homework (like you ;) ), have seen other people's comments about their methodology, and will be more rigorous in their studies.
 

Weeblie

Cadet
Joined
Jan 22, 2014
Messages
5
The Annualized Failure Rate numbers for arrays are kind of misleading since you haven't taken into consideration that people usually replace and resilver failed drives. :)

A triple-failure event with insufficient time to create additional replicas is, for all intents and purposes, not going to happen. Most cloud platforms operate under that form of redundancy (at least before the data is erasure-coded for final storage). I think it would be more valuable if you also added "array rebuild speed" into the picture.
 

crisman

Explorer
Joined
Feb 8, 2012
Messages
97
I'm not sure if that's possible to achieve, since it depends on the whole hardware build (processor, memory, type of controller, ...).
 

no_connection

Patron
Joined
Dec 15, 2013
Messages
480
Are UREs treated as recoverable when you have redundant data?

What is considered good per the above statistics? Or good enough?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
First, when Nick2253 said "Certain subgroups of hard drives have failure rates below 2%, and certain subgroups have failure rates above 15%," that should have been the flag that "all this math is so subjective as to almost not even really matter," because that's almost a full order of magnitude of difference. That's not useful at all if drives don't have a reliable failure rate.

You could be right, but most of us don't have datacenters at home. I've seen lots of people here build FreeNAS servers without even the minimum precautions for good airflow through their chassis to avoid high temperatures, and with even less attention to vibration, because they put the chassis on top of desks and other unsuitable supports. The Blackblaze study also helps us understand how different hard drive models behave under those conditions.

Yeah, and you aren't exactly doing what Backblaze does either, so if you aren't doing what Google does, and you aren't doing what Backblaze does, why should you trust either one? And more importantly: what the heck is so wrong with you that you'd spend all this time to set up a ZFS-based system only to completely f*** yourself by not following standardized, expected, proper engineering techniques with your server? Sounds to me like you built the whole vehicle, then realized you forgot the engine, so you drop in whatever fits (oh look, my lawn mower engine) and then wonder why nothing works right. Either you do it right or you do it wrong. And if you do it wrong, realize that there's no comprehensive data for "how to do it wrong in the method you are using".

The Backblaze article is a media grab. As far as I'm concerned, Nick2253 has it right. The really crappy thing is Backblaze had some interesting ideas with their first design. But they looked like fools by building servers with SATA port multipliers. Anyone that knows *anything* about file servers knows you do not use those... EVER. So who the hell did they hire that was so incompetent and unskilled as to not know better? Then you add in their second gen, where they didn't fix it. Then they finally fixed it with their third gen about 2 months ago. But then, instead of trying to get some face back, they published an article that I can only describe as just slightly above what I put in the toilet every morning. If their business model works for them, great. I'm happy for them.

But I'd never ever trust Backblaze for a hardware solution nor would I ever use their data in any capacity unless I was planning to build a bonafide Backblaze system.
 

crisman

Explorer
Joined
Feb 8, 2012
Messages
97
Well, it seems that you and Nick2253 are saying that I'm defending Backblaze; that's not the case!

The only thing I found interesting in their study (whether it's correct or not) was the correlation that can be drawn between hard disk types and failure rates. I'm not saying that their method or the technology they use to manufacture "hard drive boxes" is the correct one, no way!
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Oh crisman, I didn't think you were defending Backblaze. You might like them and you might even use them. But I try not to make assumptions about things because this forum is well known to blow assumptions out of the water.

To me, I don't even care about Backblaze's results because, as far as I'm concerned, their methodology is so f'd up their "results" don't mean anything.
 

Sir.Robin

Guru
Joined
Apr 14, 2012
Messages
554
Hehe... I got some stats for you from work.

SAN1:
Xyratex. 11 shelves/12 disks a shelf. Seagate ES 7200rpm, 750GB.
5-10 dead drives a year.
Phased out now.

SAN2:
IBM (NetApp). 9 shelves/14 drives a shelf. Mostly Seagate/Hitachi 300-450GB 15K.
2-4 drives dead a year.

SAN3:
Dot Hill. 2 shelves/12 drives a shelf. Seagate ES2, 2TB.
Running since 2011. 1 drive dead.

SAN4:
Dot Hill. 4 shelves/12 drives a shelf. Seagate ES2(?), 4TB.
Running since summer 2013. 3 drives died in the first 3 months.

SAN5:
IBM (NetApp) 24 drives a shelf.
First 3 shelves. Been running a year now. 3 drives dead after the first 4 months. After that: none.
Second 3 shelves. Been running a month. 0 drives dead so far.

Here's the killer:
On SAN2, I also have 5 SATA shelves/14 drives a shelf. 1TB Seagate. All from 2009.
Drives dead so far: 0. :)
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
And here are my stats:

24 WD Greens. Average one failure per year for 4 years. ;)
 

crisman

Explorer
Joined
Feb 8, 2012
Messages
97
Actually, I can't say that I like or dislike these guys; I didn't know about them until two weeks ago when I found this article.
 

Weeblie

Cadet
Joined
Jan 22, 2014
Messages
5
I'm not sure if that's possible to achieve, since it depends on the whole hardware build (processor, memory, type of controller, ...).

It doesn't have to be an exact science. An estimate of about a week for a home user to notice the failure, order a replacement drive, and have it resilvered should probably be pretty close to the truth. :)

The results from Microsoft, Google, and other big datacenter owners must in any case be taken with a grain of salt. Cloud storage providers typically stress their drives far more than what you see in SMB or home scenarios. 24/7 sequential writes mixed with random reads put a lot of mechanical wear and tear on the poor drives.
 