Assessing the Potential for Data Loss

This guide is meant to be read from top to bottom without skipping, with the intent of spreading awareness among both new and experienced users; the author assumes familiarity with the concepts explained in the following resources:
DISK FAILURE RATE

Assuming the probability of any single drive failing is p, the VDEV size/width is n, and the number of drives that fail simultaneously in that VDEV is X, then:
P(X) = C(n,X) * (p)^X * (1-p)^(n-X)
See Wikipedia for an explanation of what a combination C(n,X) is.

This formula allows us to calculate the probability of exactly X drives failing at the same time in our VDEV.
Now, assuming that p = 0.03 (3%) and n = 2 (the VDEV is a 2-way mirror), we get the following numbers:
X    P(X)      %
0    0.9409    94.09
1    0.0582    5.82
2    0.0009    0.09


This means we have a 94.09% probability of losing neither disk, a 5.82% probability of losing exactly one disk, and a 0.09% probability of simultaneously losing both disks. Because we are using a 2-way mirror, we encounter data loss only when both drives organize to go on strike together: as such, the data loss probability of our VDEV is 0.09%.
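
For readers who want to reproduce these figures, here is a minimal Python sketch of the binomial formula above, using the same assumed values (p = 0.03, n = 2); the printed percentages match the table within rounding.

from math import comb

def p_exactly(n, x, p=0.03):
    # Probability that exactly x of the n drives in a VDEV fail during the year,
    # i.e. C(n,x) * p^x * (1-p)^(n-x).
    return comb(n, x) * p**x * (1 - p)**(n - x)

for x in range(3):
    print(x, f"{p_exactly(2, x):.2%}")   # 94.09%, 5.82%, 0.09%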

The data loss probability of a POOL composed of more than a single VDEV is, to a good approximation, the individual VDEV data loss probability multiplied by the number of VDEVs (strictly it is 1 - (1 - P)^N, which for small P is very close to N * P); assuming a POOL composed of 3 VDEVs in 2-way mirror, we get 3 * 0.09% ≈ 0.27%.

If we consider instead a single VDEV composed of 6 disks in RAIDZ2, we get the following equation:
P(X) = C(6,X) * (0.03)^X * (1-0.03)^(6-X)
n = 6 because the VDEV has six drives

X    P(X)      %
0    0.8329    83.29
1    0.1546    15.46
2    0.0119    1.19
3    0.0005    0.05


This means we have an 83.29% probability of losing no disks, a 15.46% probability of losing a single disk, a 1.19% probability of simultaneously losing two disks, and a 0.05% probability of simultaneously losing three disks. Because we are using RAIDZ2 (2 parity drives), we encounter data loss only when three disks organize to go on strike together: the data loss probability of our VDEV is 0.05%.
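
Continuing the Python sketch above (reusing the p_exactly function defined there), the same calculation reproduces the RAIDZ2 table and the pool-level comparison:

for x in range(4):
    print(x, f"{p_exactly(6, x):.2%}")   # ~83.3%, 15.46%, ~1.2%, 0.05%
print(f"{3 * p_exactly(2, 2):.2%}")      # ~0.27%: pool of 3 mirror VDEVs
print(f"{p_exactly(6, 3):.2%}")          # ~0.05%: single 6-wide RAIDZ2 VDEV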

Compared to the first example (3 mirror VDEVs), this POOL's data loss probability is roughly five times lower while using the same number of drives, and the difference grows larger as the POOL expands, as shown in the following graphs.

[Graphs comparing the data loss probability of mirror and RAIDZ layouts as the POOL grows]

Granted, performance-wise the RAIDZ VDEV is going to be left in the dust; this also applies to stressful operations such as resilvering and scrubbing:
A resilver and a scrub are very similar in that they walk the pool in a virtually identical manner. [...] However, when resilvering, or even when just repairing checksum errors during normal read operations, you are also doing an additional write operation and some other stuff. [...]

For a single disk, writing a single sector shouldn't be terribly hard. [...]

However, for a single SMR disk, writing a single sector involves rewriting the entire shingle, and we already know that this can get very hard on pools, even outside of a scrub operation, if more than a small number of rewrites are involved. This is what led to the original kerfuffle about SMR disks: people had pools that were failing to resilver, even if they had RAIDZ2 or RAIDZ3 protection.

Worse, for even a CMR disk, the sustained write activity increases stress particularly on the target (drive being replaced), increasing temperatures. It is not just a function of reading the existing data sectors and verifying the parity sectors. It is reading the block's sectors, back-calculating the missing data or parity sectors, and then writing that out to the replaced disk. This is more work than just reading all the disks. Reading is relatively trite and some of it is mitigated by drive and host caching. Writing semirandom sectors to rebuild ZFS blocks typically requires a seek for each ZFS block, which may be harder on the drive being written to. More work equals more heat.

Finally, resilvers on mirrors are somewhat easier than RAIDZ because you might only be involving two or three disks (meaning only two or three disks are running warm). RAIDZ, on the other hand, involves each disk in the vdev, and because the process is slower due to the nature of RAIDZ, all the component drives run busier, for longer, get warmer, and it just isn't really a great thing for them. [...]​
[...] With respect to the risk of a further mechanical failure during the resilver, and possibly caused by strain from the resilver, one may argue that resilvering a mirror is less stressful than resilvering a raidz# array (simple huge sequential read of the surviving drive vs. mixed read/write workload with parity data). [...]​
[...] The fact that the probability of other drives' failure in a RAIDZ goes up for all the other drives involved vs just the sibling drive due to two confounding variables:
- Resilvering puts load on all the other n-1 drives in the vdev (which is typically the entire pool) vs just the siblings in the striped mirrors.
- Resilvering time is orders of magnitudes longer putting a lot more load on the rest of the surviving drives for a lot longer.

[...] In a 6-wide RAIDZ1 (can't speak for RAID5), for each block resilvered, a block has to be read from each of the surviving drive putting FIVE times the I/O load vs a 6-drive mirrors where you only need to read a block from the sibling drive. You can clearly observe this from the significant difference in resilver times between the two topologies even if you don't use the pool at all while degraded. This is also why your degraded RAIDZ pool performance is slow as snail when you're resilvering. On the other hand, a 6-drive striped mirrors should still perform fairly well while resilvering. [...]​
[...] resilvers can be orders of magnitudes faster and generate much less I/O demand/load in mirrors than in RAIDZ since you're only loading 1 other drive in the vdev vs however many remaining drives in the RAIDZ vdev.

Suppose a 4-drive mirrors vs 4-drive RAIDZ1. A resilver in the mirrors would just load 1 vs 3 other drives in RAIDZ because for each block resilvered, a block also has to be read from each of the remaining drives. Now scale that up to 6. The mirrors stay at 1, while the RAIDZ now loads 5 drives instead of 3. That is now 5 times the I/O demand. Add more drives and you can see how RAIDZ array resilvers also scales linearly everytime you add another drive to the vdev.​

About the resilvering process and the sequential resilver introduced in OpenZFS 2.0: Video and Slides.

Those partial quotes are extracts from posts in a much wider discussion; what matters to us is that RAIDZ (in any of its configurations) is hit harder by such operations.
THE URE IMPACT

Mechanical failures are not the only way spinning rust can harm our data: today's drives use ECC to make sure they read back the correct data, and when that fails you get a URE, or Unrecoverable Read Error: something has happened that causes the read of a sector to fail in a way the drive cannot fix. Quoting this answer on SuperUser:
In the latter case the drive does not normally return any contents whatsoever; it just returns a status indicating the error. This is because it is not possible to know which bits are suspect, let alone what their values should be. Therefore the entire sector (ECC bits and all) is untrustable. It is impossible to determine which part of the bad sector is bad, let alone what its contents should be. The ECC is a "gestalt" that is calculated across the entire sector content, and if it doesn't match, it's the entire sector that isn't matched.​


Probably for marketing reasons, different manufacturers use different notations to express their drives' URE value; comparing WD and Seagate shows that the former uses 1 in 10^14 and the latter 1 per 10E14: the difference is massive, because the first notation means 1e-14 while the second, read as E notation (10E14 = 10 × 10^14 = 10^15), actually means 1e-15, a whole order of magnitude smaller! It has however been stated that the value has to be read as 1e-14.

Most drive manufacturers report a URE rate of one event every 1e14 bits read (i.e., 1e-14 per bit) for consumer-grade disks and one every 1e15 bits (1e-15 per bit) for enterprise/pro ones... but how does this influence our POOL?

In order to find out, we need to go back to our single 2-way mirror VDEV: assuming we are using 6TB drives and that our pool is 80% full, let's calculate the probability of an URE during resilver.
P(URE) = 1 - (1-URE)^bit_read

where bit_read is the data to resilver expressed in bits: in this case 6 * 8e12 * 0.8 bits.
8e12 is the equivalent in bits of a single TB (10^12 bytes * 8 bits per byte)

Running the numbers tells us the probability of encountering a URE during the resilver of our VDEV is 32%; if we use a drive with an URE of 1e-15, said probability drops to just 4%.
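
The URE formula can be added to the same Python sketch; the drive size (6 TB), fill level (80%), and URE rates are the values assumed above, and expm1/log1p are used only to keep the floating-point arithmetic accurate for such tiny per-bit probabilities.

from math import expm1, log1p

def p_ure(ure_rate, tb_to_read):
    # Probability of at least one URE while reading tb_to_read terabytes,
    # i.e. 1 - (1 - ure_rate)^bits_read, computed in a numerically stable way.
    bits_read = tb_to_read * 8e12          # 1 TB = 10^12 bytes = 8e12 bits
    return -expm1(bits_read * log1p(-ure_rate))

print(f"{p_ure(1e-14, 6 * 0.8):.0%}")      # ~32% for the degraded 2-way mirror
print(f"{p_ure(1e-15, 6 * 0.8):.0%}")      # ~4%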

Switching to a 6-wide RAIDZ2 with no parity disks remaining (having lost two out of its six drives), the bit_read value is the per-drive figure 6 * 8e12 * 0.8 multiplied by 4, since all four surviving drives must be read to rebuild.

Running the numbers with a URE of 1e-14 gives a chilling 78% probability of scrambling for backups and coffee; using drives with a URE of 1e-15, the probability drops to 14%.
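
The same p_ure function from the sketch above, applied to the fully degraded 6-wide RAIDZ2 (reading the four surviving 6 TB drives at 80% full), reproduces these numbers:

print(f"{p_ure(1e-14, 4 * 6 * 0.8):.0%}")  # ~78%
print(f"{p_ure(1e-15, 4 * 6 * 0.8):.0%}")  # ~14%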
REASONABLE CRITICISM

Scary numbers, aren't they? While everything we have seen so far is correct, it's time to address a few issues that arise from the same root: calculations are only as valid as the parameters used.

The proper name of what we previously called the probability of any single drive failing (the p value in the first formula) is the Annualized Failure Rate: the AFR is the estimated probability that a disk will fail during a full year of use, derived from the relation between the mean time between failures (MTBF) and the number of hours a device runs per year.
AFR = 8766 / MTBF (8766 being the average number of hours in a year; 8760 is sometimes used instead)
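
As a purely illustrative example (the MTBF figure is hypothetical, not taken from any datasheet): a quoted MTBF of 1,000,000 hours corresponds to an AFR of roughly 8766 / 1,000,000 ≈ 0.9%, while the 3% AFR used throughout this resource would correspond to an MTBF of roughly 8766 / 0.03 ≈ 292,000 hours.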

Here we need to make a few points clear: first, both AFR and MTBF as given by vendors are population statistics that cannot predict the behaviour of an individual disk; second, the AFR will increase towards and beyond the end of the drive's service life (read: warranty). The 3% used in the calculations so far is an estimated mean from an analysis of over 5 years of replacement logs for a large sample of drives.

[...] Their first major result was that the real-world annualized failure rate (average percentage of disks failing per year) was much higher than the manufacturer's estimate - an average of 3% vs. the estimated 0.5 - 0.9%. Disk manufacturers obviously can't test disks for a year before shipping them, so they stress test disks in high-temperature, high-vibration, high-workload environments, and use data from previous models to estimate MTTF. Only one set of disks had a real-world failure rate less than the estimated failure rate, and one set of disks had a 13.5% annualized failure rate!

In another surprise, they debunked the "bathtub model" of disk failure rates. In this theory, disks experience a higher "infant mortality" initial rate of failure, then settle down for a few years of low failure rate, and then begin to wear out and fail. The graph of the probability vs. time looks like a bathtub, flat in the middle and sloping up at the ends. Instead, the real-world failure rate began low and steadily increased over the years. Disks don't have a sweet spot of low failure rate.

Failures within a batch of disks were strongly correlated over both short and long time periods. If a disk had failed in a batch, then there was a significant probability of a second failure up to at least 2 years later. If one disk in your batch has just gone, you are more likely to have another disk failure in the same batch. [...]

Now for the most surprising result. In Google's population of cheap ATA disks, high temperature was negatively correlated with failure! [...]​

This is a partial quote from an excellent article that I strongly recommend reading, written by Valerie Henson back in 2007.

Before moving on I want to point out the following analysis of WD's RED PRO low endurance ratings (dated March 2022); it would merit a few lines of its own, but this resource is already awfully long, so go take a look.

Regarding the URE numbers, the elephant in the room is that despite calculations foretelling catastrophic rates during resilvering/rebuilding (a point intimately linked with the "RAID5/Z1 is dead" argument), very few such events seem to actually happen.


I haven't been able to find reliable information about how the OEMs test UREs and what legal constraints govern this process: we don't know values such as the variance or standard deviation between the reported values and the actual ones, yet we are observing huge discrepancies; having a scientific background, this annoys me greatly.
Question aside, it is important to keep in mind that manufacturers increase the URE probability for consumer-class drives for marketing reasons (sell more enterprise-class drives), therefore even consumer-class HDDs are expected to achieve 1E-15 URE/bit read. [...] And real enterprise drives have an even higher reliability [...]​

Personally I can't endorse this statement without reliable data, and I can't suggest the use of a 22TB drive with a declared URE of 1e-14; what's certain is that the lack of independent and reliable testing (scientific method and a large enough population), in conjunction with what is perhaps a grey area in regulations, is hurting us consumers.
THE SAFETY MINDSET

The point of this resource is to expand the growing safety mindset that this community constantly brings to the table and shares, providing the necessary tools to identify the dangers of data corruption.

The formula that ties the two main parameters we have analyzed is the following.
P(LOSS) = P(DISK) * P(URE) * N

where N is the number of VDEVs that compose our POOL and P(DISK) is the probability of losing all the parity drives of a single VDEV (one drive in a 2-way mirror, two drives in a RAIDZ2), leaving it with no redundancy.
6-disk layout, 80% full    P(LOSS) with URE = 1e-14    P(LOSS) with URE = 1e-15
3 mirror VDEVs             5.6%                        0.7%
1 RAIDZ2 VDEV              0.9%                        0.2%

Not such high numbers now, right? That's because, as high as the calculated P(URE) is, it really matters only when no parity is left in our VDEVs.
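
Putting the two Python sketches together (reusing p_exactly and p_ure, with the same assumptions: 3% AFR, 6 TB drives, 80% full) reproduces the table above:

# 3 x 2-way mirror VDEVs: lose 1 of the 2 drives, then hit a URE on the survivor.
print(f"{3 * p_exactly(2, 1) * p_ure(1e-14, 6 * 0.8):.1%}")   # ~5.6%
print(f"{3 * p_exactly(2, 1) * p_ure(1e-15, 6 * 0.8):.1%}")   # ~0.7%
# 1 RAIDZ2 VDEV: lose 2 of the 6 drives, then hit a URE on one of the 4 survivors.
print(f"{p_exactly(6, 2) * p_ure(1e-14, 4 * 6 * 0.8):.1%}")   # ~0.9%
print(f"{p_exactly(6, 2) * p_ure(1e-15, 4 * 6 * 0.8):.1%}")   # ~0.2%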

As we have seen in this resource, numbers are not absolute: almost every value considered in this document is the result of population statistics and averages. Keeping in mind that defective drives exist is of primary importance when designing our POOLs.
[...] it is a known issue that ZFS arrays that are resilvering usually also increase in temperature, sometimes by more than 10'C. For many pools, this is the only time they experience thermal stresses, so being unnecessarily dismissive of a real threat is not helpful. Resilvering represents one of those operations where the potential for something to go wrong certainly exists, so at a minimum you should consider whether it would be prudent to take additional steps to mitigate the risk. This might be as simple as blowing a fan on your NAS during a resilver.​

As we understood at the beginning of this resource, resilvering is a critical time for any VDEV, and mirrors handle this stressful process more gracefully than any RAIDZ: they reduce the time needed to complete the process, and the larger the drives, the greater the improvement. Especially on elderly drives, this reduces the risk of faults popping up (from internal component failures to UREs).

Other vital points that we must consider in order to evaluate the risks to our VDEVs are our monitoring capabilities and our response time.
[...] Think about it this way: You are already running in a degraded mode when a resilver is going on, so the risk of further degradation is more serious than when you're just doing a scrub and reading all the data. Scrubbing a RAIDZ2 that has no errors is pretty safe. However, yank one of those disks out, simulating a disk failure, and you suddenly have something that resembles a RAIDZ1 in terms of redundancy. Now take a hammer and start tapping on one of the other drives to represent an already marginal drive. How certain are you that the pool will survive? 90%? 95%? Fine. You have every right to decide on whatever level of resiliency floats yer boat. But it's important to understand that while your system is degraded, it is DEGRADED in multiple ways -- slower performance, less resiliency. It's easy to think "oh but it's RAIDZ2" and pretend this isn't a thing. But fate has this tendency to pick on the unprepared. If you go full on RAIDZ3 with a warm spare, you'll probably never see a disk fail. That's no fun for fate. It's the goober who decides to rely on RAIDZ1, where a disk fails, and then just one more bad thing happens during resilver, and the pool consistency is now in an indeterminate state.​


Proper (frequent and regular, by script or manually) monitoring greatly enhances the resiliency of any VDEV, allowing us to either intervene preemptively or prepare for a critical situation, reducing our response time: how quickly we address a critical event (i.e., a faulty drive) is as important as the parity level of our VDEVs; the faster we are able to bring a POOL out of a degraded state, the safer our data is.

In the end risk acceptance is subjective; data loss being painful is not.