The Math on Hard Drive Failure

solarisguy

Guru
Joined
Apr 4, 2014
Messages
1,125
Very nice math!

These are some factors to consider when establishing an average time to resilver in home usage:
  • Do you notice a failed drive immediately? (holidays, illness, weekend trips, sleep, work etc.)
  • Do you keep a spare at home? Do you purchase the next spare immediately after being notified upon a failure?
  • Although it happens more often in a business setting, it can happen at home too: a critical transfer that cannot be interrupted, so you have to wait until it completes ;)
 

solarisguy

Guru
Joined
Apr 4, 2014
Messages
1,125
My three sentences about Backblaze. Regardless of their methodology, it would be surprising to see a correlation with temperature, since their drives were essentially all operating between 21°C and 31°C. Had the average temperature varied from 21°C to 55°C, there would surely be something. See the University of Virginia study for the explanation.

I think some sort of shortened table should become a sticky in the hardware forum. If I may, I would recommend limiting the rows to recommended RAID configurations. With fewer rows it would be easier to read.

I am seeing published AFR values that are smaller than the ones in the columns. So I would suggest that the sticky version have columns corresponding to the most often encountered AFR values. I have found the following published AFRs (no values for desktop/home models):
  • AFR 0.437% = MTBF 2 Mh → Savvio 15K.3, Ultrastar 7K3000, Ultrastar 7K4000, Ultrastar He6, WD Xe
  • AFR 0.55% = MTBF 1.6 Mh → Seagate Video 3.5 HD
  • AFR 0.624% = MTBF 1.4 Mh → Seagate Enterprise Capacity 3.5 HDD, WD Re SAS
  • AFR 0.727% = MTBF 1.2 Mh → WD Re SATA, WD Se (in a 1-5 bay NAS)
  • AFR 0.8% = MTBF 1.1 Mh → WD Black (or better; value assigned based on an old Seagate KB entry stating the Annualized Failure Rate is less than 0.8%)
  • AFR 0.872% = MTBF 1 Mh → WD Se (in general usage), WD AV
  • AFR 0.969% = MTBF 0.9 Mh → Seagate Surveillance HDD (AFR < 1%)
  • AFR 1.09% = MTBF 0.8 Mh → HGST MegaScale DC 4000.B, Seagate Terascale HDD/Constellation CS
One would assume that drives with a very short warranty and/or intended for only occasional usage would have an AFR higher than 1%. I have no idea how to guess...

---------------------
Mh = 1 000 000 hours
AFR = 1 - exp(-8760/MTBF), MTBF in hours; to have AFR in %, the result must be multiplied by 100
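
For anyone who wants to check or extend the table, here is a minimal sketch in Python of the conversion in the footnote (the helper names are my own; 8760 is the number of hours in a year):

```python
# Quick check of the AFR <-> MTBF conversion in the footnote above.
# AFR = 1 - exp(-8760 / MTBF), with MTBF in hours.
import math

def afr_from_mtbf(mtbf_hours: float) -> float:
    """Annualized failure rate (as a fraction) from MTBF in hours."""
    return 1.0 - math.exp(-8760.0 / mtbf_hours)

def mtbf_from_afr(afr_fraction: float) -> float:
    """MTBF in hours from an annualized failure rate given as a fraction."""
    return -8760.0 / math.log(1.0 - afr_fraction)

for mtbf_mh in (2.0, 1.6, 1.4, 1.2, 1.1, 1.0, 0.9, 0.8):
    mtbf = mtbf_mh * 1_000_000
    print(f"MTBF {mtbf_mh} Mh -> AFR {100 * afr_from_mtbf(mtbf):.3f} %")
```

Running it reproduces the values in the list, e.g. MTBF 2 Mh → 0.437% and MTBF 0.8 Mh → 1.089%.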
 

Z300M

Guru
Joined
Sep 9, 2011
Messages
882
Hehe... I got some stats for you from work.

SAN1:
Xyratex. 11 shelves/12 disks per shelf. Seagate ES 7200 rpm, 750 GB.
5-10 dead drives a year.
Phased out now.

SAN2:
IBM (NetApp). 9 shelves/14 drives per shelf. Mostly Seagate/Hitachi 300-450 GB 15K.
2-4 drives dead a year.

SAN3:
Dot Hill. 2 shelves/12 drives per shelf. Seagate ES2, 2 TB.
Running since 2011. 1 drive dead.

SAN4:
Dot Hill. 4 shelves/12 drives per shelf. Seagate ES2(?), 4 TB.
Running since summer 2013. 3 drives died in the first 3 months.

SAN5:
IBM (NetApp). 24 drives per shelf.
First 3 shelves: been running a year now; 3 drives dead in the first 4 months, none since.
Second 3 shelves: been running a month; 0 drives dead so far.

Here's the killer:
On SAN2, I also have 5 SATA shelves/14 drives per shelf, 1 TB Seagate, all from 2009.
Drives dead so far: 0. :)
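
For comparison with the AFR figures earlier in the thread, here is a rough back-of-the-envelope conversion of those counts into observed annual failure rates (only for the SANs where failures are given per year; drive counts are just shelves times drives per shelf, and this ignores drive age entirely):

```python
# Rough observed AFR = failures per year / drive count, straight from the numbers above.
sans = {
    "SAN1 (Xyratex, 750 GB ES)":        (11 * 12, (5, 10)),
    "SAN2 (IBM/NetApp, 15K)":           (9 * 14,  (2, 4)),
    "SAN2 SATA shelves (1 TB Seagate)": (5 * 14,  (0, 0)),
}

for name, (drives, (lo, hi)) in sans.items():
    lo_pct, hi_pct = 100 * lo / drives, 100 * hi / drives
    print(f"{name}: {drives} drives, observed AFR ~{lo_pct:.1f}-{hi_pct:.1f} %")
# SAN1: 132 drives, ~3.8-7.6 %
# SAN2: 126 drives, ~1.6-3.2 %
# SAN2 SATA shelves: 70 drives, 0 %
```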
Is the environment (e.g., cooling and vibration) the same for all these SANs?
 

Sir.Robin

Guru
Joined
Apr 14, 2012
Messages
554
Is the environment (e.g., cooling and vibration) the same for all these SANs?

Yes.


Sent from my mobile using Tapatalk
 

NASbox

Guru
Joined
May 8, 2012
Messages
650
It doesn't have to be an exact science. An estimate that it takes a home user about a week to notice the failure, order a replacement drive and have it resilvered should probably be pretty close to the truth. :)

Really? With a FreeNAS box, just turn on the email alerts and you know almost instantly. I think I've lost one drive in about 4-5 years, and I had it replaced in about a day and a half, including resilver time.

I suspect that you can skew the odds in your favor by making sure that no two disks are from the same lot, and that you burn them in for differing amounts of time.

I'm also wondering if there is a way to work around errors in different parts of the disk. If drive 1 has an error at block A, and drive 2 has an error at block B, in a 3-drive RAIDZ1 array, that is technically 2 failures, but if the disks are still spinning there are still good copies of all the data. Add a disk to the pool for a replacement operation and get the data for A from drive 2 and the data for B from drive 1.
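
As a toy illustration of that idea (purely a sketch of the block-level view, not how ZFS's actual reconstruction logic works): a stripe is only lost if more drives than the parity level have a bad spot at that same stripe.

```python
# Toy model: a stripe survives if the number of drives with a bad sector
# at that stripe does not exceed the parity level (1 for RAIDZ1).
def lost_stripes(bad_sectors, n_stripes, parity=1):
    lost = []
    for stripe in range(n_stripes):
        failures = sum(1 for drive in bad_sectors if stripe in drive)
        if failures > parity:
            lost.append(stripe)
    return lost

# 3-drive RAIDZ1: drive 1 bad at block A (=10), drive 2 bad at block B (=20).
print(lost_stripes([{10}, {20}, set()], n_stripes=100))  # [] -> everything recoverable
# Same stripe bad on two drives -> that stripe is gone:
print(lost_stripes([{10}, {10}, set()], n_stripes=100))  # [10]
```

With drive 1 bad at A and drive 2 bad at B, no stripe has more than one failure, so every block is still reconstructible; only overlapping bad spots lose data.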
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Really? With a FreeNAS box, just turn on the email alerts and you know almost instantly.
You revived a thread that had been dead for over three and a half years to post that, and you didn't even read what you quoted? The estimate is a week to "notice the failure, order a replacement drive (and wait for it to arrive) and have it resilvered." Setting up email alerts helps a lot with the "notice the failure" part, but doesn't do anything for the rest. If the drive isn't already on hand, a week is probably a short estimate, because it doesn't account for also needing to burn in and test the drive before putting it into the pool.

Of course, if you have a tested spare on hand, the time drops dramatically--but most people won't, I bet.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
But they looked like fools by building servers with SATA port multipliers. Anyone that knows *anything* about file servers knows you do not use those.. EVER. So who the hell did they hire that was so incompetent and unskilled to not know better?
I think that one of the founders of the company got personally invested in the technology. For about the same money they spend building these "Frankenstein" systems, they could buy a Supermicro chassis that would hold just as many drives and be much better than building their own. They have no excuse for what they are doing.
Of course, if you have a tested spare on hand, the time drops dramatically--but most people won't, I bet.
I try to remember to suggest that when I'm making build recommendations. I think it is a good idea, so much so that I do it myself.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Since I haven't seen any discussion on this issue (well, anything more than speculation), I thought I'd go ahead and do some math on exactly what is the likelihood of HDD failure (and, by extension, array failure).
I like the write-up. Very good information, thanks for sharing. I notice it is a little dated, though; would you be willing to update it with more recent drive models?
 

Nick2253

Wizard
Joined
Apr 21, 2014
Messages
1,633
I like the write-up. Very good information, thanks for sharing. I notice it is a little dated, though; would you be willing to update it with more recent drive models?
Thanks Chris! I'm glad you found it informative.

I keep coming back to this article over the years, and I've wanted to try to update it with more information, but there just isn't that much information available. Google's study is still the best out there. Backblaze is definitely getting better in their reporting, and I've thought about trying to take their raw data and spin it into something a little more rigorous. However, I still feel that the systematic bias in their workload and environment would severely limit the kinds of conclusions you can draw from the data, especially with regard to the typical FreeNAS use case.

The other thought I have mulled over is crowdsourcing the data collection. What better way to determine failure rates during typical FreeNAS use than to have the data come directly from FreeNAS users? However, I've been trying to figure out how to get accurate, ongoing data without artificially biasing the crowdsourced data with self-selected participants. It's not an easy problem. In a perfect world, I'd love to set up some kind of automatic reporting script that users could put on their server(s), but how can such a script know when a drive actually fails? At best, the drive would just disappear, and we'd be left wondering whether it failed or was just removed.
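
For what it's worth, what I have in mind is something like the rough sketch below (Python; the state-file location and the naive `zpool status` parsing are placeholders, and it only prints changes locally rather than phoning anything home). It still can't tell a dead drive from one that was pulled on purpose, which is exactly the problem:

```python
#!/usr/bin/env python3
# Rough sketch of a periodic reporting script: snapshot zpool member states,
# diff against the previous snapshot, and flag anything that changed or vanished.
import json, re, subprocess
from pathlib import Path

STATE_FILE = Path("/var/tmp/drive_report_state.json")  # hypothetical location
DEVICE_RE = re.compile(r"^\s+(\S+)\s+(ONLINE|DEGRADED|FAULTED|UNAVAIL|OFFLINE|REMOVED)\b")

def current_states():
    """Map zpool member -> state, parsed naively from `zpool status` output."""
    out = subprocess.run(["zpool", "status"], capture_output=True, text=True).stdout
    return {m.group(1): m.group(2) for line in out.splitlines()
            if (m := DEVICE_RE.match(line))}

def diff_and_report():
    previous = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    now = current_states()
    for dev, state in now.items():
        if previous.get(dev, "ONLINE") != state:
            print(f"state change: {dev} {previous.get(dev)} -> {state}")
    for dev in set(previous) - set(now):
        # Here is the ambiguity: a missing device could be a dead drive
        # or one the user simply pulled or replaced on purpose.
        print(f"device disappeared: {dev}")
    STATE_FILE.write_text(json.dumps(now))

if __name__ == "__main__":
    diff_and_report()
```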

If you have any thoughts or feedback on solving the problem, I'm very interested.
 

isch

Cadet
Joined
Apr 13, 2019
Messages
1
@Nick2253 Interesting data! I have lately been searching for analyses of URE/UBER in an attempt to weigh the implied failure probabilities before choosing between ZFS and traditional RAID for my next build. Kudos for putting this work in, as it's the best I've found. Would you be willing to post your spreadsheet, or expand upon what you've posted to include disks of higher capacity?
 