Can my current system handle my new hard drives?

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
Ahh, I missed that. I was just pointing out that there is nothing strictly "wrong" with using 2 HBAs, even if expanders do make a lot of sense.
 

EvanVanVan

Patron
Joined
Feb 1, 2014
Messages
211
I did the math on the fly (based on this)... considering a URE rate of 1e-15 and a drive failure rate of 3%, at 80% full (14TB) the pool's probability of data loss should be no lower than 55%.

And WD's RED PRO disks have a URE rate of 1e-14. Source.
Hi, I've almost struggled my way through this but would like confirmation on what these numbers actually mean... I'm trying to decide whether 8x18TB in RAIDZ3 is safe enough.

For the first part where X = the number of drives failing simultaneously, for one, two and three drives respectively I came up with the following %s:
1688211597778.png
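
For readers who cannot see the attachment, here is a minimal sketch of how figures of this kind can be computed. It assumes the 3% per-drive failure rate from the quoted post, independent failures, and a plain binomial model; the function name is made up for illustration, and the results may not match the screenshot exactly.

```python
from math import comb


def prob_exactly_k_failures(n_drives: int, k: int, p_fail: float) -> float:
    """Binomial probability that exactly k of n_drives fail.

    Assumes every drive fails independently with probability p_fail.
    """
    return comb(n_drives, k) * p_fail**k * (1 - p_fail) ** (n_drives - k)


N_DRIVES = 8   # the 8x18TB pool discussed in this post
P_FAIL = 0.03  # assumed per-drive failure rate (3%, from the quoted post)

for k in (1, 2, 3):
    p = prob_exactly_k_failures(N_DRIVES, k, P_FAIL)
    print(f"P(X = {k}) = {p:.4%}")
```

The model treats failures as independent, which is optimistic for drives from the same batch running in the same chassis.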


Which all look reasonable, and RAIDZ3 would seem pretty safe.... But then I get to the Pr(URE) part and for my 18TB URE=1e-14 drives, @ 80% full this number seems super high:
1688211758863.png


Thoughts? Should I be good with RAIDZ3?

Thank you
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
That number seems super high because it is: are you sure your 18TB disks have a URE rate of 1e-14 and not 1e-15? All the drives that big I have seen have the latter value; WD is crap. It might not seem like a big difference, but it changes everything (from 69% to 11%): this value is the probability of a single disk experiencing a URE during a resilver, and in the worst-case scenario (when you have no parity left) you have 5 disks to read.
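
A minimal sketch of where those two percentages come from, assuming each bit read fails independently at the rated URE probability and that a resilver reads every used bit of the disk (18TB at 80% full). The function name and parameters are illustrative, not taken from any tool.

```python
from math import expm1, log1p


def prob_ure(capacity_tb: float, fill: float, ure_rate: float, n_disks: int = 1) -> float:
    """Probability of at least one URE while reading the used space of n_disks.

    Assumes independent per-bit errors at the rated ure_rate (a simplification).
    """
    bits_read = capacity_tb * 1e12 * 8 * fill * n_disks
    # Equivalent to 1 - (1 - ure_rate) ** bits_read, but numerically stable
    # for very small ure_rate and very large bits_read.
    return -expm1(bits_read * log1p(-ure_rate))


for rate in (1e-14, 1e-15):
    single = prob_ure(18, 0.8, rate)
    five = prob_ure(18, 0.8, rate, n_disks=5)
    print(f"URE rate {rate:g}: one disk {single:.1%}, five disks {five:.1%}")
```

The single-disk results land close to the 69% and 11% quoted above; the five-disk figure is the no-parity worst case, where every remaining disk has to be read without error.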

I would not use 18TB disks that have a URE rate of 1e-14 no matter what the pool structure is, but aside from that, if you want resiliency then yes, RAIDZ3 is your best shot.

However, while the probability of a URE is high, the probability of 3 drives dying at the same time is, as you correctly calculated, minuscule: as such, a URE shouldn't be that impactful as long as you have parity (which is why you can mitigate that risk by increasing the dataset copies value at the expense of usable space).

I don't have the calculations to back the statement in the previous paragraph. Maybe I will post something if I get the time and the will to do them, but I am confident enough in my statement. I still stand by my opinion that no production system focused on storage and resiliency should use 1e-14 drives, but as always, risk acceptance is a matter of personal choice.

Have I correctly addressed your doubts?
 

EvanVanVan

Patron
Joined
Feb 1, 2014
Messages
211
Thank you. Yes, that all makes sense and I'll look into dataset copies, thank you for that suggestion.

I am stuck with these drives, but would you mind going to a little more depth about what a URE really causes? I googled it quickly but didn't find any ELI5 answers. Lol

Most of my files are non-critical 25-50GB+ video files for Plex. I also have computer backups, etc., again most of which are non-critical. Finally, I have thousands of smaller business documents and files; these are the most critical, but for the most part I have backups of everything on a Synology NAS.

If a URE occurs, in a video file for instance, do I lose the entire file? Or do a couple of seconds glitch during playback?


Will a URE cause a single office document to corrupt or hundreds of documents at once?

I'm just trying to gauge the risk exposure level.

Thank you
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Thank you. Yes, that all makes sense and I'll look into dataset copies, thank you for that suggestion.
Because you suffer a 1/2 or 2/3 available-space penalty with copies=2 or copies=3 (for this aspect only, think of it as a single vdev made of multiple mirrors living inside a single drive), I don't think it's worth it for anything but the most critical applications, and I'm talking about really extreme cases (definitely not home use). The URE issue is a problem (and a big one) only when you don't have parity anymore: if you let 3 drives die in a RAIDZ3 configuration, you are basically back to the old "RAID5/Z1 is dead" reasoning. But again, I don't have numbers and it's just a logical conclusion.

EDIT: since it is a dataset property, it might be worth it for critical office documents; definitely not for large video files. Also, maybe it's more correct to refer to it as write amplification rather than a space penalty.
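
The 1/2 and 2/3 figures follow directly from the copies value. A small sketch of the arithmetic, ignoring metadata and compression; the helper name is made up for illustration:

```python
from fractions import Fraction


def space_penalty(copies: int) -> Fraction:
    """Fraction of a dataset's raw space consumed by the extra copies.

    ZFS accepts copies=1, 2 or 3; every block is written `copies` times,
    so only 1/copies of the space holds unique data.
    """
    if copies not in (1, 2, 3):
        raise ValueError("copies must be 1, 2 or 3")
    return 1 - Fraction(1, copies)


for c in (1, 2, 3):
    print(f"copies={c}: space penalty {space_penalty(c)}")
```

Because the property is set per dataset, the penalty only applies to the datasets where it is raised, and only to data written after the change.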

Today's drives use ECC to make sure they return the correct data, and when that fails you get a URE, or Unrecoverable Read Error: something has caused the read of a sector to fail in a way the drive cannot fix. Quoting this answer on su:
In the latter case the drive does not normally return any contents whatsoever; it just returns a status indicating the error. This is because it is not possible to know which bits are suspect, let alone what their values should be. Therefore the entire sector (ECC bits and all) is untrustable. It is impossible to determine which part of the bad sector is bad, let alone what its contents should be. The ECC is a "gestalt" that is calculated across the entire sector content, and if it doesn't match, it's the entire sector that isn't matched.
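
To illustrate the "gestalt" point in that quote, here is a toy sketch. It uses CRC32 as a stand-in for the drive's real ECC (actual drives use far stronger codes that can also correct some errors), and the 4096-byte sector size and helper name are assumptions for the example. The check can say that the sector is bad, but not which bit is bad.

```python
import zlib

SECTOR_SIZE = 4096  # assumed sector size in bytes


def sector_is_trustworthy(data: bytes, stored_checksum: int) -> bool:
    """Return True only if the whole sector still matches its checksum."""
    return zlib.crc32(data) == stored_checksum


# A sector as originally written, plus the checksum stored alongside it.
original = bytes(range(256)) * (SECTOR_SIZE // 256)
checksum = zlib.crc32(original)

# Simulate damage: flip a single bit somewhere in the sector.
damaged = bytearray(original)
damaged[1234] ^= 0b0000_1000

print(sector_is_trustworthy(original, checksum))        # intact sector
print(sector_is_trustworthy(bytes(damaged), checksum))  # damaged sector
```

The mismatch carries no information about where the damage is, which is why the drive reports an error for the entire sector instead of returning partially correct data.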

I don't feel confident enough in my ZFS knowledge to tell you the implications at the file system level, but without parity on TN you should either get an error saying something like "these files have been corrupted, restore them from a backup or delete them" or see your pool go offline: basically, you experience data loss (even though you might be able to recover the files and find out that only a small part is corrupted/glitched).

ZFS wizards and experts on data recovery might give you a more in-depth answer regarding this.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
A heads up: I have written a resource that goes in depth on the topics we touched on in this thread, hopefully clarifying them even further.
 

EvanVanVan

Patron
Joined
Feb 1, 2014
Messages
211
Cool, ty. I'll check it out.
 