Assessing the Potential for Data Loss

Davvo · Jul 8, 2023

Davvo submitted a new resource:

How to Calculate the Probability of Data Loss - Disk failures, UREs and the imprecision of calculations.

This guide was written to be read from top to bottom without jumps, with the intent of spreading awareness to both new and experienced users; the author encourages of this document welcomes and encourages any contributions. It's assumed the understanding of the concepts explained in the following resources:

ZFS Introduction
...

Read more about this resource...

Davvo · Jul 8, 2023

Consider this a revised draft, in the following days I will polish/add things.
Please help with the naming. Every contribution is accepted.

Redcoat · Jul 8, 2023

Davvo said:
Please help with the naming.

How about "Assessing the Potential for Data Loss" as a title?

samarium · Jul 8, 2023

Most people while phase out as soon as they see the formulas.

Maybe you should have a table up front before the formulas scare people off?

And other people will be put of by a table of numbers too, but there has to be a limit of the hand holding I guess.

Answers I would be looking for:

P(pool loss)? single, double, triple disk failure vs pool geometry mirror, raidz1 (i know, but people will want to know, then ignore it), raidz2, raidz3.
mirror double/triple, raidz2 2/4/6/8D+2P, raidz3 similar I guess, haven't considered z3 much I guess. Too many would be too dense
Not sure there is enough out there about draid to add something useful?
How does multiple vdevs affect things. I know it depends, and is complicated, but everyone wants single number.
Combining P(disks failure) and P(ure) somehow, again to come up with a single number to rank options.
How does reduced duration of mirror resilver vs raidz* affect P(pool loss) or P(data loss)
How does sending a disk for RMA and waiting for replacement, which increases exposure time, ie a week or 2 or 3 or 4, affect P(data/pool) loss? Might convince people to keep cold spare?

Looks difficult, maybe impossible, not even sure if they are good questions.

Reading thru though, it reinforces my plan of 1 HDD pool, ~10TB 3x mirror, replacing one disk a year with best available price/capacity modulo features(esp URE), keep new disk as cold spare.

It seems I might have to consider 4x mirror sooner than I was expecting, given 1−(1−10^−15)^(80%*10*10^12*8) ~= 6.19%
Even reducing to 40% capacity, only brings down to 3.13%

Or should I be thinking P(lose 2 disks from 3 disk mirror vdev)=P(lose)=1.19% and P(ure)=6.19%, so

P(data loss) = 1.19% * 6.19% = 0.000738 = 0.0738% or should should I be doing that differently, stats not being my forte.
P(pool loss) should be much lower since meta data is duplicated at least, and should be recoverable vs 1 URE, and I could set ncopies=2 for critical datasets.

Interestingly, P(ure with 10^-15) seems to be roughly linear vs data, double disk size => 12%, halve data load to 40% => 3%, not what I was expecting in a exponent, but I guess (1-10^-15) is very close to one.

Using 10^-14 is however entirely different, base=47%, then 72% and 27% so P(data loss, 10^-14, 3xM 10TB@80%) = 47%* 1.19% ~= 10%

Davvo · Jul 9, 2023

samarium said:
P(pool loss)? single, double, triple disk failure vs pool geometry mirror, raidz1 (i know, but people will want to know, then ignore it), raidz2, raidz3.

mirror double/triple, raidz2 2/4/6/8D+2P, raidz3 similar I guess, haven't considered z3 much I guess. Too many would be too dense

I tought that leaving the formulas and explainations on how to use them was enough: this way, everyone can do its own calculations based on his layouts.

samarium said:
Not sure there is enough out there about draid to add something useful?

Haven't looked into the math of it; not being an option in TN (afaik) made me forget about it... though the URE calculation shouldn't be much different.

samarium said:
How does multiple vdevs affect things. I know it depends, and is complicated, but everyone wants single number.

Combining P(disks failure) and P(ure) somehow, again to come up with a single number to rank options.

I have addressed both of those points, I will try to clear things a bit to make them stand out more. There isn't however much point in a single number, because a low RAIDZ1 P(LOSS) might contain a 90% P(URE) that doesn't stand out in a comparision by single numbers; the point of this resource is to inform users that pool resiliency doesn't depend on a single variable.

samarium said:
How does reduced duration of mirror resilver vs raidz* affect P(pool loss) or P(data loss)

How does sending a disk for RMA and waiting for replacement, which increases exposure time, ie a week or 2 or 3 or 4, affect P(data/pool) loss? Might convince people to keep cold spare?

Will expand upon those points, thank you.

samarium said:
Reading thru though, it reinforces my plan of 1 HDD pool, ~10TB 3x mirror, replacing one disk a year with best available price/capacity modulo features(esp URE), keep new disk as cold spare.

It seems I might have to consider 4x mirror sooner than I was expecting, given 1−(1−10^−15)^(80%*10*10^12*8) ~= 6.19%
Even reducing to 40% capacity, only brings down to 3.13%

If I understand correctly, your plan is to have a single 3-way mirror composed by 10TB disks with an URE of 1e-15, right?
The P(2 out of 3 drives failing) is ~ 0.3% and the URE during such an operation is ~ 6%: the P(LOSS) is ~ 0.02%. I don't think you should consider 4-way mirrors, especially if you regularly monitor your drives either manually or with tools such as the multi report script. To further mitigate UREs, you can also keep the drive being replaced attached until the new one finishes resilvering.
The discrepancy in the official values and the actual URE events is also not to be dismissed imho.

Anyway, I use a simple excel file to make all the calculations. Maybe it could be worth sharing it.

samarium · Jul 9, 2023

Thanks. Seems I skipped reading a bit and just pulled the wrong number from your page for 2 disks lost, 2 of 6 is different from 2 of 3.
So P(lose 2/3) = 0.26%, or 0.3% as you say, not 1.19%. 4way is for the future to think about, although I'm hoping flash will kill rust by then.
I monitor pools with smartmontools, and get emails, and scrub regularly, and have email alerting, and nagios, so I should get notified, and I have a cold spare. I always prefer to replace in place to maintain as much redundancy data as possible.

Calculation was a little intereting, first choice is usually bc, but it rejected the large exponent. Linux desktop calculator works fine tho.

Actual UREs would be good, I agree. Regular disclosure like backblaze would be great, but we aren't likely to see that.

Davvo · Jul 9, 2023

samarium said:
Regular disclosure like backblaze would be great, but we aren't likely to see that.

Sadly their data is mostly unusable.

Davvo · Jul 9, 2023

Davvo updated Assessing the Potential for Data Loss with a new update entry:

Added paragraph explaining the meaning of different notation use in URE values.

Probably for marketing reasons, different manufacturers use different notations to exprime their drives' URE value; comparing WD and Seagate shows that the former uses 1...

Read the rest of this update entry...

morganL · Jul 11, 2023

@Davvo

"Assuming the probability of any single drive failing is p, the VDEV size/width is n, and the number of drives that fail simultaneously in that VDEV is X, then:
P(X) = C(n,X) * (p)^X * (1-p)^(n-X)"

The formula seems to be missing the analysis for resilver time.. The failure is only simultaneous if the second failure happens before the 1st resilver is completed. Larger drives need more resilver time and hence more protection.

Davvo · Jul 12, 2023

morganL said:
The formula seems to be missing the analysis for resilver time.. The failure is only simultaneous if the second failure happens before the 1st resilver is completed. Larger drives need more resilver time and hence more protection.

You are right, it's not directly included: the formula uses an AFR value to calculate the chance of X disks simultaneously dying and it's an approximation, while the issue regarding resilvering time and stress is addressed later in the resource; the need of greater protection for bigger drives is not explicitly stated, but there is emphasis on the resilvering time: will expand on this asap, thank you for pointing it out.

As far as I know there is no study or publication that can quantify how the resilvering process influences the AFR of the surviving drives: all we can say is that (a) it'a stressful operation that can possibly increase the chance of a drive failure, (b) the process in a RAIDZ layout is harsher on the drives and takes longer to complete compared to a mirror layout, and (c) until the process is completed the pool/vdev is in a degraded state that depending on the remaining parity exposes it to risk.

As such, from my understanding it's not possible to integrate such evaluations in the formula withouth greatly increasing its complexity: for the purpose of that calculation the probability of a second drive failure during the resilvering of another is the probability of both drives failing at the same time, because reaslistically the resilvering time doesn't influence the AFR in a significant way. Granted it's an imprecision, but not a decisive one overall.

I want to specify that this is not my area of expertise and I might be awfully wrong; if anyone with greater competences wants to chime in please feel free to do so, be it directly or by linking to publications regarding these arguments.

morganL · Jul 12, 2023

Davvo said:
You are right, it's not directly included: the formula uses an AFR value to calculate the chance of X disks simultaneously dying and it's an approximation, while the issue regarding resilvering time and stress is addressed later in the resource; the need of greater protection for bigger drives is not explicitly stated, but there is emphasis on the resilvering time: will expand on this asap, thank you for pointing it out.

As far as I know there is no study or publication that can quantify how the resilvering process influences the AFR of the surviving drives: all we can say is that (a) it'a stressful operation that can possibly increase the chance of a drive failure, (b) the process in a RAIDZ layout is harsher on the drives and takes longer to complete compared to a mirror layout, and (c) until the process is completed the pool/vdev is in a degraded state that depending on the remaining parity exposes it to risk.

As such, from my understanding it's not possible to integrate such evaluations in the formula withouth greatly increasing its complexity: for the purpose of that calculation the probability of a second drive failure during the resilvering of another is the probability of both drives failing at the same time, because reaslistically the resilvering time doesn't influence the AFR in a significant way. Granted it's an imprecision, but not a decisive one overall.

I want to specify that this is not my area of expertise and I might be awfully wrong; if anyone with greater competences wants to chime in please feel free to do so, be it directly or by linking to publications regarding these arguments.

Disagree.

If the resilver time is less than a week..... the you can do the calculations based on weekly failure rates.

Much less chance of 2 "simultaneous" failures in a week. ... even if there are more weeks.

Constantin · Jul 14, 2023

The stated MTBFs etc. for drives are mostly meaningless for the average SOHO user - they only become relevant IF you are a very large data center user, have a lot of failures, and hence enough economic "damage" to make stated MTBFs even actionable. The rest of us replace the drive, ideally while still under warranty. To me, the realities of the above (and thank you for the resource as well as the excellent / interesting articles you linked to) are yet another reason not to buy new HDDs. Take the discount and the automatic "batch scrambler bonus", burn in a few cold spares, and call it a day.

What might be worth mentioning in this context as well is the significant difference by drive type re: wear endurance and hence expected life by application. Optane for SLOGs, datacenter Intel S or P series SSDs vs. consumer grade stuff for sVDEVs, etc. For example, a lot of folk continue to be confused why their glowing-LED-encrusted series of M2 sticks only last a few months in a SLOG application or why some consumer-grade SSDs have crashed and burned badly re: reliability in a TrueNAS setup.

Johnny Fartpants · Jul 14, 2023

Has the use of hot-spares been factored in also as resilver would be triggered immediately and require no wait time for sys-admin to replace and trigger?

Davvo · Jul 14, 2023

Exploring the following publication. @morganL @Johnny Fartpants

https://www.researchgate.net/publication/234812118_Mean_time_to_meaningless_MTTDL_Markov_models_and_storage_system_reliability

And the following article.

ZFS data protection comparison

ZFS now offers triple-parity raidz3. Conceptually, raidz3 is an N+3 parity protection scheme. Today, there are few, if any, other implementa...

blog.richardelling.com

morganL · Jul 14, 2023

Davvo said:
Exploring the following publication. @morganL @Johnny Fartpants

https://www.researchgate.net/publication/234812118_Mean_time_to_meaningless_MTTDL_Markov_models_and_storage_system_reliability

And the following article.

ZFS data protection comparison

ZFS now offers triple-parity raidz3. Conceptually, raidz3 is an N+3 parity protection scheme. Today, there are few, if any, other implementa...

blog.richardelling.com

Richard Elling was a very expert guy.

Yes, this algorithm is a better approximation. MTTR is the key statistic.
MTTR needs to me measured in years.. same unit of time as MTBF.
So a 1 week resliver = 0.02 MTTR if there is a hot spare.

For non-protected schemes (dynamic striping, RAID-0)

MTTDL[1] = MTBF / N

For single parity schemes (2-way mirror, raidz, RAID-1,RAID-5):

MTTDL[1] = MTBF2 / (N * (N-1) * MTTR)

For double parity schemes (3-way mirror, raidz2, RAID-6):

MTTDL[1] = MTBF3 / (N * (N-1) * (N-2) * MTTR2)

For triple parity schemes (4-way mirror, raidz3):

MTTDL[1] = MTBF4 / (N * (N-1) * (N-2) * (N-3) * MTTR3)

Davvo · Jul 14, 2023

morganL said:
Richard Elling was a very expert guy.

Yes, this algorithm is a better approximation. MTTR is the key statistic.
MTTR needs to me measured in years.. same unit of time as MTBF.
So a 1 week resliver = 0.02 MTTR if there is a hot spare.

I assume the MTTR is gonna be (7*24)/(365*24) for a single disk. However, I have doubts about the usefulness of such calculation; quoting the study previously linked:

Both calculations for MTTDL and BHL result in reliability metrics that are essentially meaningless. Even in a RAID4 system with the threat of sector errors (2.6% chance when reading an entire disk) both metrics produce numbers that are well beyond the lifetime of most existing systems. In addition, both metrics produce results that are not comparable between systems that differ in terms of technology and scale.

morganL · Jul 14, 2023

Davvo said:
I assume the MTTR is gonna be (7*24)/(365*24) for a single disk. However, I have doubts about the usefulness of such calculation; quoting the study previously linked:

resilver times depend on the size and speed of the drive.... but the MTTR calculation is a good approximation.

Yes, I'd agree that the MTTDL numbers gove very low numbers with dual parity. Fire, flood and earthquake become more significant. But that is the gaol.. eliminate simple drive failures causing data loss. When drive failures cause pool failures, the costs can be very high both in terms of information and people time to resolve.

Constantin · Jul 14, 2023

morganL said:
When drive failures cause pool failures, the costs can be very high both in terms of information and people time to resolve.

Exactly. Rebuilding pools, associated permissions, etc potentially takes a very long time. This is why I was so cross re: being told that my pool should be destroyed on multiple occasions even though it likely had nothing more than a loose power connector on the backplane.

ICPete · Oct 31, 2023

Probably for marketing reasons, different manufacturers use different notations to exprime their drives' URE value; comparing WD and Seagate shows that the former uses...

Sorry to resurrect an old thread, but I'm studying all the threads I can find on this subject.
I have to say, my overwhelming reaction when seeing Seagate's datasheet parameter listed as "1 in 10E15", is that someone either left out a decimal point, or was simply confused about proper usage of scientific notation. Because while it's technically true that 10E15 = 1E16 (or as you reported, 10E14 = 1E15), I think it's more likely they actually meant to write "1 in 1.0E15" or "1 in 1E15" or possibly "1 in 10^15". I've witnessed multiple occasions that someone (even an engineer who should know better) confuses the "E" with the "^" symbol, therefore writing "10E15" when they really mean "10^15". In the end, of course, we can't know for sure, unless someone can contact Seagate and get to the bottom of the issue.

Davvo · Nov 1, 2023

ICPete said:
I have to say, my overwhelming reaction when seeing Seagate's datasheet parameter listed as "1 in 10E15", is that someone either left out a decimal point, or was simply confused about proper usage of scientific notation.

Highly unlikely, all of their documents have this kind of formatting... and I find very strange for such a mistake to happen on this scale with none realizing.

Important Announcement for the TrueNAS Community.

Assessing the Potential for Data Loss

MVP

MVP

MVP

Contributor

MVP

Contributor

MVP

MVP

Captain Morgan

MVP

Captain Morgan

Vampire Pig

Guru

MVP

Captain Morgan

MVP

Captain Morgan

Vampire Pig

Cadet

MVP