RAIDZ3 optimal width (9, 10 or 11 drives) for data protection - resilvering time

wojciech

Dabbler
Joined
Mar 12, 2023
Messages
17
Hello,

I need to decide between a 9-, 10-, or 11-drive RAIDZ3 vdev (I believe anything narrower than 9 drives is not recommended and is also not very space-efficient).
This will be a single vdev in the system.
Drives: WD GOLD 10 TB.
I plan to use the default (128K) recordsize.
My data consist of roughly 8 million small files (typically 1-10 MB in size) which are not compressible
(more than 90% of the files, both by count and by volume, are already heavily compressed formats such as JPEG XL or image-based PDFs that internally contain JPEGs).

My first priority is data protection, hence RAIDZ3,
but the overall drive space/cost is also a consideration (hence this question about the vdev width).

The advantages of a 9-drive vdev are:
- losing 3 drives from a pool of 10 or 11 is more likely than from a pool of 9;
- resilvering a 10- or 11-drive vdev would take longer than a 9-drive vdev.

I'm wondering how big the difference in resilvering time would be between 9-, 10-, and 11-drive vdevs.

The advantage of a 10- or 11-drive vdev is that a smaller percentage of drives is 'lost' to parity,
so the usable space per dollar is better while the pool is still quite secure.

So far, I have found the following recommendations:
- for a 9-drive vdev: sources A and B
- for a 10-drive vdev: source C (mentioned in the context of a large 240-drive array, so not applicable to my use case)
- for an 11-drive vdev: source C (only with a very small recordsize (4K or 8K), so again not my use case, as I'll be using the default 128K recordsize)

Are there any other issues I should consider?
Many thanks in advance,
Wojciech

source A.
A RAIDZ-3 configuration maximizes disk space and offers excellent availability because it can withstand 3 disk failures. Create a triple-parity RAID-Z (raidz3) configuration at 9 disks (6+3).

source B.

RAIDZ Configuration Requirements and Recommendations

  • Start a triple-parity RAIDZ (raidz3) configuration at 9 disks (6+3)
  • (N+P) with P = 1 (raidz), 2 (raidz2), or 3 (raidz3) and N equals 2, 4, or 6
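To make source B's rule concrete, enumerating (N+P) over the allowed values gives the recommended widths (a trivial sketch in Python):

```python
# Source B's rule: P parity drives plus N data drives,
# with N restricted to 2, 4, or 6.
for p, name in [(1, "raidz"), (2, "raidz2"), (3, "raidz3")]:
    widths = [n + p for n in (2, 4, 6)]
    print(f"{name} (P={p}): recommended widths {widths}")
# raidz (P=1): recommended widths [3, 5, 7]
# raidz2 (P=2): recommended widths [4, 6, 8]
# raidz3 (P=3): recommended widths [5, 7, 9]
```

So under this rule, a 9-wide raidz3 (6+3) is the largest recommended configuration.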
source C.
 

MrGuvernment

Patron
Joined
Jun 15, 2017
Messages
268
- losing 3 drives from a pool of 10 or 11 is more likely than from a pool of 9.

Yes and no. If you buy all the drives from the same place at the same time and one fails, the chances are higher that others will also fail during the rebuild (faulty batch). Many other factors play into it as well.
 

firesyde424

Contributor
Joined
Mar 5, 2019
Messages
155
My first priority is data protection, hence RAIDZ3,
but the overall drive space/cost is also a consideration (hence this question about the vdev width).

The advantages of a 9-drive vdev are:
- losing 3 drives from a pool of 10 or 11 is more likely than from a pool of 9;
- resilvering a 10- or 11-drive vdev would take longer than a 9-drive vdev.

I'm wondering how big the difference in resilvering time would be between 9-, 10-, and 11-drive vdevs.

The advantage of a 10- or 11-drive vdev is that a smaller percentage of drives is 'lost' to parity,
so the usable space per dollar is better while the pool is still quite secure.

So far, I have found the following recommendations:
- for a 9-drive vdev: sources A and B
- for a 10-drive vdev: source C (mentioned in the context of a large 240-drive array, so not applicable to my use case)
- for an 11-drive vdev: source C (only with a very small recordsize (4K or 8K), so again not my use case, as I'll be using the default 128K recordsize)
Depends on where you get the drives. If you are absolutely intent on data protection, I would suggest you source your drives from different vendors (purchase 4 drives from Amazon, 4 drives from Newegg, etc.). As another poster stated, if all your drives are purchased at the same time from the same vendor, they will likely be from the same lot number, built at the same time, at the same factory. Once you have an issue, the chances are much greater that you will have other issues, especially during the resilver.

As far as resilver times go, I am not an expert here, but given the wide variety of systems we deploy, vdev size doesn't appear to be a factor in resilver times. I could certainly be wrong, but the bottleneck in most resilver situations is very likely the drive being resilvered. I would expect a vdev of 9 drives to behave identically to a 10- or 11-drive vdev in terms of resilver performance. As an example, we have several fairly old 60-bay JBODs running TrueNAS CORE, with varying vdev sizes depending on age and purpose. These systems are installed with 4, 8, and 12 TB drives. We have several running 4 TB drives in a 6 x 10 config, but also a couple running a 5 x 12 config. We do not typically see any difference in our resilver times, which run about 12 hours for our 4 TB drives and 18 hours for our 12 TB drives, depending on how heavily the system is being used at the time.

Where vdev size DOES matter, in my experience, is in the CPU overhead required for parity calculations. I think that's a valid consideration for you, given that you are talking about storing a lot of small files. Remember that you need to take into account the increased CPU load during a resilver, and a smaller vdev size will benefit you here.
 

wojciech

Dabbler
Joined
Mar 12, 2023
Messages
17
Yes and no. If you buy all the drives from the same place at the same time and one fails, the chances are higher that others will also fail during the rebuild (faulty batch). Many other factors play into it as well.
Thanks, right, I'm getting them one by one from different sources with different production dates.
 

wojciech

Dabbler
Joined
Mar 12, 2023
Messages
17
Depends on where you get the drives. If you are absolutely intent on data protection, I would suggest you source your drives from different vendors (purchase 4 drives from Amazon, 4 drives from Newegg, etc.). As another poster stated, if all your drives are purchased at the same time from the same vendor, they will likely be from the same lot number, built at the same time, at the same factory. Once you have an issue, the chances are much greater that you will have other issues, especially during the resilver.
Thanks, right, I'm getting them one by one from different sources with different production dates.

As far as resilver times go, I am not an expert here, but given the wide variety of systems we deploy, vdev size doesn't appear to be a factor in resilver times. I could certainly be wrong, but the bottleneck in most resilver situations is very likely the drive being resilvered. I would expect a vdev of 9 drives to behave identically to a 10- or 11-drive vdev in terms of resilver performance. As an example, we have several fairly old 60-bay JBODs running TrueNAS CORE, with varying vdev sizes depending on age and purpose. These systems are installed with 4, 8, and 12 TB drives. We have several running 4 TB drives in a 6 x 10 config, but also a couple running a 5 x 12 config. We do not typically see any difference in our resilver times, which run about 12 hours for our 4 TB drives and 18 hours for our 12 TB drives, depending on how heavily the system is being used at the time.
Thanks, that's helpful!

Where vdev size DOES matter, in my experience, is in the CPU overhead required for parity calculations. I think that's a valid consideration for you, given that you are talking about storing a lot of small files. Remember that you need to take into account the increased CPU load during a resilver, and a smaller vdev size will benefit you here.
I see. The CPU used is an Intel Xeon Silver 4310, 2.1 GHz (12C/24T).
I'll be using only a single vdev, and there is nothing else running on the CPU, so it will be free to do just resilvering when needed.
I guess I should be fine, right?
 
Joined
Jun 15, 2022
Messages
674
RAIDZ3 isn't really a "3-2-1 backup strategy"; things can still go all stiff (like when a power surge pops an RF filter cap in the PSU and dumps line voltage onto the 12-volt rail).

If I remember correctly, there are a few special-case 12-drive vdevs working fine, though they store large contiguous files with very few deletes.
 

wojciech

Dabbler
Joined
Mar 12, 2023
Messages
17
RAIDZ3 isn't really a "3-2-1 backup strategy"; things can still go all stiff (like when a power surge pops an RF filter cap in the PSU and dumps line voltage onto the 12-volt rail).

If I remember correctly, there are a few special-case 12-drive vdevs working fine, though they store large contiguous files with very few deletes.
Thanks, sure, it's always good to keep backups in mind.
May I ask how wide a vdev you typically use with RAIDZ3?
 
Joined
Jun 15, 2022
Messages
674
Thanks, sure, it's always good to keep backups in mind.
May I ask how wide a vdev you typically use with RAIDZ3?
Depends on the usage. File size, number of files, rewrite frequency, drive size, data transfer rate, volume access metrics: everything is taken into consideration during the planning phase, and then the real world throws it all in the circular bin and says "start over."

The one constant seems to be that excellent hardware, properly configured for the workload, can move an incredible amount of data.
 

wojciech

Dabbler
Joined
Mar 12, 2023
Messages
17
Depends on the usage. File size, number of files, rewrite frequency, drive size, data transfer rate, volume access metrics: everything is taken into consideration during the planning phase, and then the real world throws it all in the circular bin and says "start over."

The one constant seems to be that excellent hardware, properly configured for the workload, can move an incredible amount of data.

Right, thanks for your reply!
One last thing: could you please give an estimate of the practical range for the number of HDDs you would personally use in a single RAIDZ3 vdev? What would be the minimum and maximum, i.e. ideally never fewer than X and never more than Y?
Does a range of 9-11 sound appropriate to you, or would you consider extending it to, say, 8-12 HDDs depending on the purpose?
 

PhilD13

Patron
Joined
Sep 18, 2020
Messages
203
I have found the information and calculators on this WintelGuy site to be very useful. You can play around with numbers and configurations and get a good idea of cost/space/reliability. Then you can decide what YOU are most comfortable with. Just always be sure you have recovery options because, as someone else said, the real world will throw everything into the circular bin and say "start over" or "look what I did to your server while you were asleep".


Reliability calculator

Capacity calculator
 
Joined
Jun 15, 2022
Messages
674
Every usage situation is so different that it's quite a guess to say what might work; 5 to 8 drives seems typical, and 9+ seems to work fine for non-demanding workloads.

A Production Environment that has to be running 24/7/365 is far different from a home system that's mostly idle except when someone is accessing recipes or streaming a movie, and even then a family of 9 that likes a lot of entertainment will access it differently than an academic family of 4. @PhilD13 provides a link to a useful tool; the best you can do is lay out a good plan and hope you're right. You'll probably do "okay," and when something goes wrong you'll find out how the system reacts under duress.

I purposely set up test environments, cause failures, and try to recover the systems under test. Not only does that show how the system will react, but it's also good training for myself, as I'll have actually experienced what a particular type of failure looks like in the real world (without real-world time constraints on fixing the problem or losing valuable data). That's actually why they pay me the ridiculous salary they do: their systems are rock-solid reliable and I keep working to improve them (which is why I'm here learning from others). I expect at some point the company could be sold and my position under-appreciated, but that'll be someone else's problem.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
I recommend reading either of these versions of similar material from the same author (a key contributor to the OpenZFS project).


But if you couldn't be bothered, here's the key:
Matt Ahrens said:
TL;DR: Choose a RAID-Z stripe width based on your IOPS needs and the amount of space you are willing to devote to parity information. If you need more IOPS, use fewer disks per stripe. If you need more usable space, use more disks per stripe. Trying to optimize your RAID-Z stripe width based on exact numbers is irrelevant in nearly all cases.

The article does lean toward noting that RAIDZ3 can work well at 11 disks wide if you must maximize capacity.
 

wojciech

Dabbler
Joined
Mar 12, 2023
Messages
17
Many thanks for everyone's replies!
As suggested, I calculated the practical usable ZFS storage capacity using
https://wintelguy.com/zfs-calc.pl with the following input:

RAID type: RAID-Z3
Drive sector size: 4KiB (I assume this is the correct value for a 512e HDD, i.e. a 512-emulation drive. Could someone please confirm this?)
Number of drives per RAID group: varied (see the table below)
Number of RAID groups: 1
ZFS record size: 1
Take into account: 20% free space limit

#drives --> practical usable storage capacity

8 drives --> 33.14 TiB
9 drives --> 40.12 TiB (6.98 TiB more than the previous #drives)
10 drives --> 48.27 TiB (8.15 TiB more than the previous #drives)
11 drives --> 58.18 TiB (9.91 TiB more than the previous #drives)
12 drives --> 63.42 TiB (5.24 TiB more than the previous #drives)
13 drives --> 68.65 TiB (5.25 TiB more than the previous #drives)
14 drives --> 73.89 TiB (5.24 TiB more than the previous #drives)

Therefore, 11 drives looks like the sweet spot between reliability and capacity per dollar.
Using more than 11 drives doesn't seem to make sense, as the capacity gains are small
and reliability falls.
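For anyone curious where these numbers come from: they can be reproduced almost exactly from the RAIDZ allocation rules, since ZFS adds P parity sectors per stripe of up to (drives - P) data sectors and pads each allocation to a multiple of P+1 sectors. Below is a rough sketch (assuming 4 KiB sectors, 128 KiB records, and the 20% free-space reserve; it ignores metadata and slop space, so it lands slightly above the calculator). It also shows why going past 11 drives gains so little here: from 11 to 14 drives the parity-plus-padding ratio for a 128 KiB record stays the same, so extra drives only add raw capacity.

```python
import math

P = 3                      # parity level (RAIDZ3)
SECTOR = 4 * 1024          # 4 KiB sectors (ashift=12)
RECORD = 128 * 1024        # default 128 KiB recordsize
DRIVE_TIB = 10e12 / 2**40  # a 10 TB drive expressed in TiB

data_sectors = RECORD // SECTOR  # 32 data sectors per record

for n in range(8, 15):
    # P parity sectors for every stripe of up to (n - P) data sectors
    parity = math.ceil(data_sectors / (n - P)) * P
    # ZFS pads each RAIDZ allocation to a multiple of P + 1 sectors
    alloc = math.ceil((data_sectors + parity) / (P + 1)) * (P + 1)
    usable = n * DRIVE_TIB * (data_sectors / alloc) * 0.8  # keep 20% free
    print(f"{n:2d} drives -> ~{usable:.2f} TiB usable")
```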

Many thanks,
Wojciech
 

wojciech

Dabbler
Joined
Mar 12, 2023
Messages
17
I have found the information and calculators on this WintelGuy site to be very useful. You can play around with numbers and configurations and get a good idea of cost/space/reliability. Then you can decide what YOU are most comfortable with. Just always be sure you have recovery options because, as someone else said, the real world will throw everything into the circular bin and say "start over" or "look what I did to your server while you were asleep".


Reliability calculator

Capacity calculator

Thanks! One quick question: the links you provided for the reliability calculator and the capacity calculator are the same.
Could you please check if that's what you meant?
 

wojciech

Dabbler
Joined
Mar 12, 2023
Messages
17
Just wanted to add one more source here (not sure how reliable it is):
https://calomel.org/zfs_raid_speed_capacity.html,
which mentions this:

The current rule of thumb when making a ZFS raid is:
  • MIRROR (raid1) used with two(2) to four(4) disks or more.
  • RAIDZ-1 (raid5) used with five(5) disks or more.
  • RAIDZ-2 (raid6) used with six(6) disks or more.
  • RAIDZ-3 (raid7) used with eleven(11) disks or more.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222

Generally, the more drives you have, the higher the chance of a drive failure... but it's relative. If you want space and a resilient pool, a 12-wide RAIDZ3 vdev is your best shot (dropping one or two drives only marginally increases resiliency).

I plan to use the default (128K) recordsize.
My data consist of roughly 8 million small files (typically 1-10 MB in size) which are not compressible
Why not 1M recordsize then?
 

wojciech

Dabbler
Joined
Mar 12, 2023
Messages
17

Generally, the more drives you have the higher the chance of a drive failure... but it's relative. If you want space and a resilient pool, a 12-way RAIDZ3 VDEV is your best shot (dropping one or two drives marginally increases the resiliency).
Thanks, I'll probably create 2 separate 11-drive-wide vdevs. This seems like the best safety/$ ratio.
Why not 1M recordsize then?
That's a good point about increasing the recordsize, thanks!
I'll use 512 KB: while most of my files are over 1 MB in size, I still have lots of files in the 512 KB - 1 MB range.
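For what it's worth, extending the same allocation math from my earlier post (a sketch, assuming 4 KiB sectors; the block sizes are just illustrative) suggests that on an 11-wide RAIDZ3 any block of 128 KiB or more allocates with the same ~73% space efficiency, so the choice between 512K and 1M is more about I/O granularity than capacity. Note also that files smaller than the recordsize are stored as a single block roughly the size of the file, so a larger recordsize shouldn't waste space on the smaller files.

```python
import math

P, WIDTH, SECTOR = 3, 11, 4 * 1024  # RAIDZ3, 11 drives, 4 KiB sectors

def raidz_efficiency(block_bytes):
    """Fraction of allocated sectors that hold data, for one block."""
    data = math.ceil(block_bytes / SECTOR)
    parity = math.ceil(data / (WIDTH - P)) * P              # parity per stripe
    alloc = math.ceil((data + parity) / (P + 1)) * (P + 1)  # pad to P+1 sectors
    return data / alloc

for kib in (4, 16, 128, 512, 1024):
    print(f"{kib:5d} KiB block -> {raidz_efficiency(kib * 1024):.0%} efficient")
```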
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222

firesyde424

Contributor
Joined
Mar 5, 2019
Messages
155
I see. The CPU used is an Intel Xeon Silver 4310, 2.1 GHz (12C/24T).
I'll be using only a single vdev, and there is nothing else running on the CPU, so it will be free to do just resilvering when needed.
I guess I should be fine, right?
You should be fine with a single vdev and this CPU.

Something I've done in the past, before putting a system into service, is to create a bunch of random data on the pool, forcibly fail a drive (either by removing it via the GUI or by physically pulling it), then replace it with a spare and see how the system behaves. That way I can make sure it performs the way I expect and that there are sufficient resources available to TrueNAS.
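If you want to script that kind of burn-in, here's a minimal sketch (the pool name "tank" and the device names are placeholders for your system; the zpool subcommands are the standard offline/replace/status; only run this against a pool with disposable data):

```python
import subprocess

def zpool(*args):
    """Run a zpool subcommand and print its output."""
    result = subprocess.run(["zpool", *args], capture_output=True, text=True)
    print(result.stdout or result.stderr)

zpool("status", "tank")                  # baseline pool health
zpool("offline", "tank", "da5")          # simulate a failed drive
zpool("replace", "tank", "da5", "da12")  # resilver onto a spare
zpool("status", "tank")                  # watch resilver progress
```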
 