Pool Layout Question

Scharbag

Guru
Joined
Feb 1, 2012
Messages
620
I have decided to change things up and get rid of some old drives; some are going on 8 years of in-service time, and a lot of them are 5400 RPM SMR Seagates.

I have decided to go with 4 20TB CMR drives in a mirrored pool (2x2 mirrored vDevs) and I just wanted to check in with the community before I do so.

My current spinning rust pools are all RaidZ2. I plan on replacing my production pool (2x6 RaidZ2 made up of 6TB (mostly CMR) and 4TB (all SMR) drives) with the above mirrored pool. After that, I will reconfigure the old production drives into a new backup pool so that I can get rid of my current 2x9 RaidZ2 backup pool made up of older 4TB and 3TB SMR drives. The overall goal is to remove 18 spinning drives from my system due to their age as well as their energy use and heat production. I will likely go with a wider RaidZ2 on the backup pool as I care not about its speed. I may even go RaidZ1 (hurts my soul to even say that). We will see what I come up with...

My thoughts are that mirrored vDevs make it much easier to expand in the future. Adding 2 more drives is easy. Yes, it is space inefficient, but with large drives I do not much care about space efficiency.
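For anyone curious, the expansion really is a one-liner; something roughly like this (pool and device names are just examples, not my actual layout):

zpool create tank mirror da0 da1 mirror da2 da3
zpool add tank mirror da4 da5

The second command tacks a third mirror vDev onto the pool whenever I need more space.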

Currently my production pool has 2@16GB (resized with camcontrol) SSDs as a ZIL/SLOG. I plan on moving those to the new mirrored production pool. The use case for my production pool is bulk storage as well as an iSCSI target for ESXi VMs. I also have another, faster SSD-based pool for some other VMs using iSCSI.
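On the SLOG move, it should just be a remove from the old pool and an add to the new one, roughly like this assuming I keep them as a mirrored log (pool, vDev and device names are examples; zpool status shows the real log vDev name):

zpool remove oldpool mirror-2
zpool add newpool log mirror ada4 ada5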

All this is running on SAS 2008 HBA infrastructure (see sig).

Anyway, any comments are welcome as I do not want to make a fatal error as I upgrade my system.

Thank you,
 
Last edited:

DigitalMinimalist

Contributor
Joined
Jul 24, 2022
Messages
162
How important is transfer speed?

Best Options for 4 drives:
Speed: RAID10
Reliability/Integrity: RAIDZ2

I prefer RAID10 with proper backup
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
My thoughts are that mirrored vDevs make it much easier to expand in the future. Adding 2 more drives is easy.
Correct… but in the event of a drive failure, there would be no redundancy for up to 20 TB of data on the remaining drive, and the pool would be at risk for a significant time.
Basically, I fear that the "RAID5 is dead" motto now applies to 2-way mirrors. For safety, I would consider 3-way mirrors (if performance trumps efficiency) or raidz2 with such large drives.

Currently my production pool has 2@16TB (CAM Control) SSDs in a ZIL.
Er… 16 TB SSDs for a SLOG which only needs about 10 GB worth of space?
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
My thoughts are that mirrored vDevs make it much easier to expand in the future.
Correct… but in the event of a drive failure, there would be no redundancy for up to 20 TB of data on the remaining drive, and the pool would be at risk for a significant time.
Basically, I fear that the "RAID5 is dead" motto now applies to 2-way mirrors. For safety, I would consider 3-way mirrors (if performance trumps efficiency) or raidz2 with such large drives.
I would have to agree that with drives larger than, say, 10TB, and certainly 20TB, the risk of an unrecoverable read error during disk replacement starts to get high. Thus, 3-way mirrors or RAID-Z2.

Of course, if you have good backups or end up being lucky, then no problem.

From what I can tell about the statistics on UREs, if you do lose a file and have good backups, it might be as simple as restoring that one file. That is one very nice feature of ZFS compared to regular RAID schemes with conventional file systems on top. If you have a regular RAID failure, you might have to restore the whole file system, just to be sure you got rid of the corrupted data.
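And ZFS will tell you exactly which files got hit, so you know precisely what to restore (pool name here is just an example):

zpool status -v tank

The -v flag lists the paths of any files with permanent errors.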
 

Scharbag

Guru
Joined
Feb 1, 2012
Messages
620
Correct… but in the event of a drive failure, there would be no redundancy for up to 20 TB of data on the remaining drive, and the pool would be at risk for a significant time.
Basically, I fear that the "RAID5 is dead" motto now applies to 2-way mirrors. For safety, I would consider 3-way mirrors (if performance trumps efficiency) or raidz2 with such large drives.


Er… 16 TB SSDs for a SLOG which only needs about 10 GB worth of space?
I have backups of everything so I am not too worried about mirrors.

And, 16GB for ZIL (fixed above). Lol.
 

Scharbag

Guru
Joined
Feb 1, 2012
Messages
620
I would have to agree that with drives larger than, say, 10TB, and certainly 20TB, the risk of an unrecoverable read error during disk replacement starts to get high. Thus, 3-way mirrors or RAID-Z2.

Of course, if you have good backups or end up being lucky, then no problem.

From what I can tell about the statistics on UREs, if you do lose a file and have good backups, it might be as simple as restoring that one file. That is one very nice feature of ZFS compared to regular RAID schemes with conventional file systems on top. If you have a regular RAID failure, you might have to restore the whole file system, just to be sure you got rid of the corrupted data.
My system has 100% backups to a separate pool. I do not have physical diversity, but for stuff I cannot get back, it is all backed up to the cloud too.

I considered RaidZ2, but adding space needs 4 more disks. With backups, I am comfortable with mirrored pairs. If I get freaked out, I can always attach another drive to the vDev and make it a 3-way mirror.
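For the record, going from a 2-way to a 3-way mirror is just an attach (pool and device names are examples):

zpool attach tank da0 da8

where da0 is an existing member of the mirror vDev and da8 is the new disk.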

We run a lot of RAID 10 systems at work. We have never had a problem, even though we do get single-disk failures from time to time.

Thanks for the response.
 

Scharbag

Guru
Joined
Feb 1, 2012
Messages
620
How important is transfer speed?

Best Options for 4 drives:
Speed: RAID10
Reliability/Integrity: RAIDZ2

I prefer RAID10 with proper backup
My system is a home system with a gigabit network. Transfer speed is not critical. Mostly just streaming to multiple users.
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
My thoughts are that mirrored vDevs make it much easier to expand in the future.

Agree.

Adding 2 more drives is easy.

Yes, but auto-expand is also easier and faster with only 2 drives to upgrade before getting the new space online.

I would have to agree that with drives larger than, say, 10TB, and certainly 20TB, the risk of an unrecoverable read error during disk replacement starts to get high. Thus, 3-way mirrors or RAID-Z2.

I add my voice to that: Raid-Z2 will offer better protection.

I considered RaidZ2 but adding space needs 4 more disks.

Yes, but auto-expand is still possible and 4 drives is not too bad...
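The auto-expand path looks roughly like this (pool and device names are examples):

zpool set autoexpand=on tank
zpool replace tank da0 da8    # repeat for each drive, letting each resilver finish
zpool online -e tank da8      # only needed if autoexpand was off during the swap

Once every drive in the vDev has been replaced, the extra capacity shows up.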

So I would go with the Raid-Z2, but I would also apply an important rule when designing such a system: put all the required space in place on day 1, plus some extra, and then some more. It is way better to have all the space from day 1 instead of adding more space here and there. If you are already thinking about the need to expand soon, that suggests you should go for more space right away.
 

Scharbag

Guru
Joined
Feb 1, 2012
Messages
620
What model of SSD are you using here? I see them referred to as "Intel" in your signature but the model is important.
Likely not the right kind. Intel 730s DC 3510 Series. I think this is OK for SLOG.
 
Last edited:

Scharbag

Guru
Joined
Feb 1, 2012
Messages
620
Agree.



Yes, but auto-expand is also easier and faster with only 2 drives to upgrade before getting the new space online.



I add my voice to that: Raid-Z2 will offer better protection.



Yes, but auto-expand is still possible and 4 drives is not too bad...

So I would go with the Raid-Z2, but I would also apply an important rule when designing such a system: put all the required space in place on day 1, plus some extra, and then some more. It is way better to have all the space from day 1 instead of adding more space here and there. If you are already thinking about the need to expand soon, that suggests you should go for more space right away.
Expanding with 4 new drives always leaves you with 4 extra drives. I have the bays, but I do not need that much space.
 

Scharbag

Guru
Joined
Feb 1, 2012
Messages
620
After thinking on this, perhaps I will make the new 20TB disks my backup pool and get rid of my 18 oldest drives... And it lets me get rid of an entire storage shelf.

May even go RaidZ1 with 4 20TB disks cause it is a backup pool. But I will likely stick with mirrors. But doing this still leaves me with one SMR disk in my production pool (wow, I was sure I had more). So far, it has not fookd me but it would be nice to get rid of it... I also have 1 SMR disk in my backup pool that needs to go too.

The choices are real with ZFS!!
 
Last edited:

Alex_K

Explorer
Joined
Sep 4, 2016
Messages
64
If speed is not of the essence and you are considering expansion, then maybe 5x 20TB in RAID-Z2 would be a good measure of efficiency all around, if you stick with 20TB?

Though IMO the rebuild time for a reasonably filled 20TB HDD in any array would be too long - as would restoring from backup if the rebuild fails.

Considering that HDD sizes grow much faster than their speeds, restoring these big-disk pools is going to require special design considerations in the future. RAID-Z3 is one option to reduce the probability of that scenario, but not to prevent it.

We may need not just space to back up the data, but a whole reserve solution that can run in its place if the main pool fails for an extended period of time, when this scenario happens:

1. The first disk fails.
2. During the rebuild, more disks fail (possibly after a week-plus rebuild process during which pool performance for services is severely reduced).
3. Enough disks fail that the pool has to be recreated and restored from backup, which could take a month or more of moving data.

So either the data per disk should be reasonably limited, or a whole different design should be considered. The main+backup concept may be outdated; perhaps two mains instead, with some space reserved on both to back up each other?
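For scale, a rough back-of-envelope: assuming ~200 MB/s average sequential throughput and a best-case resilver, 20 TB / 200 MB/s is about 100,000 seconds, or roughly 28 hours of non-stop reading per replaced disk - and a busy, fragmented pool can take several times that.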
 
Last edited:

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Likely not the right kind. Intel 730s DC 3510 Series. I think this is OK for SLOG.

They're certainly okay for SATA SLOGs and put up some decent results:


Keep an eye on the endurance levels as they're designed more for read-heavy use than write-intensive work, but if you're happy with the performance and you have a separate fast SSD pool for your major workloads, there's no reason not to keep them where they are. Barring the Intel S3700/S3710 series, there isn't really a drop-in upgrade that would be both easy and inexpensive - you've got to go NVMe or new SAS.
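A quick way to keep an eye on the wear (device name is an example; attribute names vary a bit by vendor):

smartctl -A /dev/ada4

On the Intel SATA DC drives you're mostly watching Media_Wearout_Indicator, which counts down from 100, plus the host-writes attribute for the running total.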
 

Scharbag

Guru
Joined
Feb 1, 2012
Messages
620
If speed is not of essence and you do consider expansion, then maybe 5x20TB RAID-Z2 would be good measure of efficiency all around if you stick with 20TB?

Though IMO rebuild time on reasonably filled 20TB HDD in any array would be too long. And restoring from backup if rebuild fails.

Considering size of HDDs grow much faster then their speed, these (big disk) pool restoration going to require special design considerations in the future. RAID-Z3 is one option to reduce probability of that scenario, but not to prevent it.

We may need not just a space to backup data in case backup fails, but whole reserve solution which can run in place if main pool fails for extended period of time when this scenario happen

1. 1st disk fail
2. during rebuild more disks fail (possibly after week+ long rebuild process while pool performance for services is severely reduced)
3. enough disks fail so we have to recreate pool and restore from backup
this could take month+ of moving data

So either data per disk should be reasonably limited or whole different desings should be considered. Like, main+backup concept may be outdated. Rather two mains with some space reserved on both to backup each other?
I hear ya - 20TB drives seem ridiculous but I could not help myself as the price was right.

Given my use case is storage of media that can be replaced, and it will only really affect my personal enjoyment, I am not as concerned as a business would be about downtime. Hell, I run all of my stuff on ESXi so I already live a little on the edge. :)

I think that I will likely use the 20TB drives to create a new backup pool (mirrors) and then get rid of the 18 old 3TB and 4TB drives. I can also shuffle a couple of things around to get rid of ALL SMR drives from my system (I was thrilled to confirm that I only have 2 of those in my system). I think I need 4 more 6TB drives for my production pool to replace the remaining 4TB drives, which will grow the pool by 8TB. Given I have 9TB free today, that should last me for a while.

Cheers,
 

Scharbag

Guru
Joined
Feb 1, 2012
Messages
620
They're certainly okay for SATA SLOGs and put up some decent results:


Keep an eye on the endurance levels as they're designed more for read-heavy use than write-intensive work, but if you're happy with the performance and you have a separate fast SSD pool for your major workloads, there's no reason not to keep them where they are. Barring the Intel S3700/S3710 series, there isn't really a drop-in upgrade that would be both easy and inexpensive - you've got to go NVMe or new SAS.
I did see that link. I have been happy with the S3700 series. As for endurance, I used camcontrol to size them to 16GB from 256GB so, if I understand things properly, that should greatly increase their endurance. My system is not terribly busy, but I will keep an eye on them.
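For anyone curious, the resize was done through camcontrol's HPA support, roughly like this (device name is an example; check man camcontrol for the exact flags on your version):

camcontrol hpa ada1 -s 33554432 -y

33,554,432 sectors x 512 bytes works out to 16 GiB visible, and the controller keeps the remaining ~240GB as spare area to wear-level across.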

Once I can get DDR4 RAM from work (retiring older servers) I will update my system and likely go with NVMe SLOG at that time. I really want an X10 or X11 SuperMicro MB but I have over a TB of DDR3 ram in bags at home due to old servers from work, so I cannot justify going to DDR4 at this time. On that note, if anyone needs DDR3 ECC Registered server RAM (8GB and 32GB), shoot me a PM.

Cheers,
 

nabsltd

Contributor
Joined
Jul 1, 2022
Messages
133
As for endurance, I used camcontrol to size them to 16GB from 256GB so, if I understand things properly, that should greatly increase their endurance.
Reducing the total drive size doesn't help with endurance if you weren't going to fill up the drive. As long as the SSD controller can find enough space to keep the virtual map using different cells, it will essentially cycle through the entire flash, then repeat. So, each 256GB of write reduces every cell's endurance by 1.

16GB of SLOG would be over 50 seconds of write at 300MB/sec (which is likely beyond the limit of those drives). SLOG will never hold that much data, so it doesn't matter if you reduce the size or not, since there will always be lots of less used cells to use for writes. For some older controllers that aren't as good at handling the issue, reducing the size would signal that the rest of the drive could never be written with data.

Endurance takes a nosedive when the drive is mostly full, since the controller has to either write to the same cells over and over, or move data from existing cells to wear level.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Reducing the total drive size doesn't help with endurance if you weren't going to fill up the drive. As long as the SSD controller can find enough space to keep the virtual map using different cells, it will essentially cycle through the entire flash, then repeat. So, each 256GB of write reduces every cell's endurance by 1.

In theory it doesn't, but according to some Intel testing, it actually does, because of the reduction in write amplification (because even in the short lifespan of SLOG data, the SSD is still trying to rearrange and clear NAND pages in firmware)

PDF warning:


The DWPD (Drive Writes Per Day) rating on a DC S4600 roughly tripled by doing heavy overprovisioning (58%)

16GB of SLOG would be over 50 seconds of write at 300MB/sec (which is likely beyond the limit of those drives). SLOG will never hold that much data, so it doesn't matter if you reduce the size or not, since there will always be lots of less used cells to use for writes. For some older controllers that aren't as good at handling the issue, reducing the size would signal that the rest of the drive could never be written with data.

Endurance takes a nosedive when the drive is mostly full, since the controller has to either write to the same cells over and over, or move data from existing cells to wear level.

In this case, I believe @Scharbag has a virtual TrueNAS, so its "line rate" is whatever speed the vSwitch can shove data - that's only about 16 seconds at 10Gbps, and 4 seconds at 40Gbps. (Assuming that the dirty_data_max has been bumped to the appropriate 16G value.)

I did some testing back many years ago and saw that while there was roughly the same average performance with assigning the "full drive" vs using HPA/overprovisioning, the duration that the drive could deliver the sustained write speeds was slightly longer when overprovisioned.
 

Scharbag

Guru
Joined
Feb 1, 2012
Messages
620
In theory it doesn't, but according to some Intel testing, it actually does, because of the reduction in write amplification (because even in the short lifespan of SLOG data, the SSD is still trying to rearrange and clear NAND pages in firmware)

PDF warning:


The DWPD (Drive Writes Per Day) rating on a DC S4600 roughly tripled by doing heavy overprovisioning (58%)



In this case, I believe @Scharbag has a virtual TrueNAS, so its "line rate" is whatever speed the vSwitch can shove data - that's only about 16 seconds at 10Gbps, and 4 seconds at 40Gbps. (Assuming that the dirty_data_max has been bumped to the appropriate 16G value.)

I did some testing back many years ago and saw that while there was roughly the same average performance with assigning the "full drive" vs using HPA/overprovisioning, the duration that the drive could deliver the sustained write speeds was slightly longer when overprovisioned.
Yeah, my system is running a 10G vSwitch for iSCSI. Again, my system is not that busy and all of the data consumers are running on 1G physical (or mostly wireless). The VM drives are the only thing that actually uses the vSwitch so I am sure I am not loading the SLOG overly much. I do run most of my VMs on my production pool as I found the performance is fine. I reserve my SSD pool, which is small, for only my most heavily loaded VMs as well as all of my jails.

Thank you for all the great info. These conversations just continue to point out how little I know about this stuff. :) And I will be looking into the "dirty_data_max" setting tonight as I have no idea what it is currently set to.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
And I will be looking into the "dirty_data_max" setting tonight as I have no idea what it is currently set to.

Default is 4G - here are a few of my posts from the past to get you started on some digging.



Second post has a handy dtrace script that will tell you in real-time how much you're actually using.
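If you just want to check the current value and test a bump by hand, it's a plain sysctl on CORE (the value shown is 16 GiB; make it persistent as a sysctl tunable in the GUI):

sysctl vfs.zfs.dirty_data_max
sysctl vfs.zfs.dirty_data_max=17179869184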
 