Z2 or Z3 for SSD pool?

NugentS · Aug 21, 2023

Patrick_3000 said:
I question the relevance of the current hard drives I use in a thread regarding the RAID level for SSD pool I plan to switch to. However, for what it's worth, my current drives are 12 TB Seagate Ironwolf Pro.

And the make and model of the new drives? Really what I was after

Patrick_3000 · Aug 21, 2023

NugentS said:
And the make and model of the new drives? Really what I was after

I haven't bought them yet, except that one will be a TEAMGROUP T-Force CARDEA A440 Pro Aluminum Heatsink 4TB with DRAM SLC Cache 3D TLC NAND NVMe PCIe Gen4 x4 M.2 2280, because I have it lying around.

The other four (or five if I do Z3) will probably be a couple of low end TEAMGROUPs, like a Gen 3 Teamgroup that sells in the $165 price range, one or two low end Crucial Gen 3's which are around $190, and possibly one or two Western Digital, like WD Blue or Black.

NickF · Aug 21, 2023

Patrick_3000 said:
You have mostly just taken this thread as an opprtunity to expound your philosophy which is in favor of SATA over NVME. I disagree with you on this point. SSD, even consumer grade QLC, is probably more reliable than HDD, which people have been using and discussing in this forum uncritically for years. That's essentially a guess, but it's an educated guess based on what I know about SSDs, what I've seen online, and the fact that there are no moving parts.

Moreover, NVME SSD, even consumer grade, is sure to be faster than anything SATA. As one example, on my 10gbe ethernet network, I routinely max out the network speed or come close (like 1000 or 1100 MB/sec) when transferring consumer-grade NVME SSD to NVME SSD (in a transfer between a Linux desktop system to a Windows desktop system), but I only get something like half that (around 500 MB/sec) when transferring SATA SSD to SATA SSD.

We are in a new era when it comes to SSD. Prices have dropped significantly in the past year, and going forward, SSD, particularly M.2 NVME SSD, are going to be in a lot of places where they weren't previously. SATA SSD, not so much given its maintenance hassle, larger size, and higher power consumption.

The lack of need for cable maintenance alone makes NVME SSD a no-brainer in my opinion. I currently have three NAS hard drives (Seagate Ironwolf Pro) in SCALE, and just a few weeks ago, one showed as offline, but when I opened the case and looked, it was a loose and failing SATA cable, which I had to replace. That problem would still occur with SATA SSD.

We can agree to disagree here. Also, FWIW, I am historically a proponent of NVME. I just disagree that it makes sense for your use case. Also I think you are spending alot of your energy concerned with cable management inside of a system that you will likely spend very little time back inside of. It's a strange thing to fixate on. What happens when the M.2 slot on your motherboard has bad pins and your drive doesn't work properly any longer? At least cables are replaceable. I've seen stranger failure-modes.

I've spent a fair time over the past 2 years playing with ZFS and NVME. This is NOT me saying NVME is bad. This is me saying NVME for storing documents is a waste of NVME.

TrueNAS Scale NVME Performance Scaling

Over the past year or so I have been obsessively exploring various aspects of ZFS performance, from large SATA arrays with multiple HBA cards, to testing NVME performance. In my previous testing I was leveraging castoff enterprise servers that were Westmere, Sandy Bridge and Ivy Bridge based...

www.truenas.com

Patrick_3000 · Aug 21, 2023

NickF said:
We can agree to disagree here. Also, FWIW, I am historically a proponent of NVME. I just disagree that it makes sense for your use case. Also I think you are spending alot of your energy concerned with cable management inside of a system that you will likely spend very little time back inside of. It's a strange thing to fixate on. What happens when the M.2 slot on your motherboard has bad pins and your drive doesn't work properly any longer? At least cables are replaceable. I've seen stranger failure-modes.

I've spent a fair time over the past 2 years playing with ZFS and NVME. This is NOT me saying NVME is bad. This is me saying NVME for storing documents is a waste of NVME.

TrueNAS Scale NVME Performance Scaling

Over the past year or so I have been obsessively exploring various aspects of ZFS performance, from large SATA arrays with multiple HBA cards, to testing NVME performance. In my previous testing I was leveraging castoff enterprise servers that were Westmere, Sandy Bridge and Ivy Bridge based...

www.truenas.com

Thanks for your input.

Regarding your point "this is me saying NVME for storing documents is a waste of NVME," here is some more context. Sometimes those documents, including relatively large (50 GB or 100 GB) files or folders of files, need to be copied to or from another computer, which is often though not always one of the VMs running on SCALE and stored in the same pool as the documents, and then manipulated (either edited or compressed or extracted with 7-zip).

Currently, transfers from the SCALE server to the VM in the same pool, which currently is an HDD pool, are rather slow: like around 100 MB/sec even though the SCALE server and the VM each have a separate 10gbe network adapter. In addition, manipulating large files or folders in the VM is relatively slow since the pool is on HDD.

One of the advantages--perhaps the main advantage other than cable management--of migrating the pool to SSD is to speed up transfers between the SCALE server and VMs running on the server and in the same pool and speed up operations on the VM. I suspect that these transfers and VM operations will be substantially faster with an NVME pool compared to a SATA pool, in addition to the cable management advantages which, as noted, I consider somewhat significant.

sfatula · Aug 21, 2023

Patrick_3000 said:
Yes, but the stuff is rather critical, especially the documents for my spouse's home-based business and for my job, and a bunch of other personal data. Also contains the virtual disks for two VMs.

I would actually go Z1 as rebuild times will be so fast. Here's the thing - raidz level is for availability, but are by no means a backup. Backups are for emergencies or other events.

NickF · Aug 21, 2023

Patrick_3000 said:
Thanks for your input.

Regarding your point "this is me saying NVME for storing documents is a waste of NVME," here is some more context. Sometimes those documents, including relatively large (50 GB or 100 GB) files or folders of files, need to be copied to or from another computer, which is often though not always one of the VMs running on SCALE and stored in the same pool as the documents, and then manipulated (either edited or compressed or extracted with 7-zip).

Currently, transfers from the SCALE server to the VM in the same pool, which currently is an HDD pool, are rather slow: like around 100 MB/sec even though the SCALE server and the VM each have a separate 10gbe network adapter. In addition, maniipulating large files or folders in the VM is relatively slow since the pool is on HDD.

See. This is why I asked about context and use case. Spouses at home business wasn't a give away for me, so the context helps.

Given that, I'm far less hesitant about the hardware given something a little more stressful than a document dump. For a low power home virtualization server, NVME is not a bad direction to go. For lots of parallel operations it's the bets choice.

My previous points regarding CPU overhead still applies here though. Z2/Z3 with compression on your going to peg your CPU. Which is unfortunate, because virtual machines tend to compress very well.

But for THAT workload, I'd encourage you to design a pool with more than 1 VDEV. Here's some pretty good assumptions you can start with:

We're going to bottleneck on memory bandwidth in the system we have.
We should consider disabling compression to save CPU cycles for KVM.
Disabling compression should save RAM cycles and save space, freeing up cycles should help incoming TXG aggregation, thus better ZIL, thus better write performance. Let’s not forget ARC caching.
We should use a RaidZ-1 topology with two VDEVs of 3 Drives to save CPU cycles on parity for KVM
NVME likes to parallelize operations. More VDEVs should help keep the drives fed, closer to their potential.
NVME likes workloads greater than QD1 to perform at their best. Virtualization should help keep queues full.
We should migrate VM data to a smaller volblocksize than the default used on hard drives. Somewhere between 4-32k you will have a sweetspot. There's a decrease in compression efficiency with smaller volblocksize.

There are alot of ZFS tunables and kernel level things to try to tweak NVME performance. You should not touch any of these unless you understand what we've done thusfar and are cool with "BEEWARE THERE BE MONSTERS"

I wouldn’t even make any other changes beyond one or two ARC tunables to be fair. Poll queuing would be next stop but…do we really want to mess with kernel level parameters ?

Patrick_3000 · Aug 21, 2023

NickF said:
We should use a RaidZ-1 topology with two VDEVs of 3 Drives to save CPU cycles on parity for KVM

Thanks for your input. I've been using Truenas or its predecessor, Freenas, since 2016 and have never had the need for multiple VDEVs. So, I'm not very familiar with them, and forgive me if I'm misunderstanding this, but with two VDEVs, couldn't two drive failures take down the entire pool if they're the wrong drives? That is, if two corresponding drives in each VDEV failed, the pool would be unrecoverable, right? That to me is a major downside.

With RAID Z2 (five SSDs) in one VDEV, the pool could always survive two drive failures, and with RAID Z3 (six SSDs), it could always survive three drive failures. That would make two VDEVs less robust, despite its speed advantage, which I hadn't realized was the case and I appreciate learning.

Sure, I could probably survive a failure of the pool and restore the pool from my backup SCALE server, but the process would be disruptive for my entire household due to downtime, time-consuming on my part, and risky because, though unlikely, something could go wrong with the restore process and I could lose data.

I will think about it, but at least tentatively, a single VDEV with the greater fault tolerance of Z2 or Z3 seems preferable, despite the speed advantage of two VDEVs

NickF · Aug 21, 2023

Your understanding seems correct. If two drives in the same VDEV fail you lose the pool. You can however lose a single drive from each VDEV. Having only a single z2 solves that problem.

But it’s at the cost of half the IOPs and a lot more CPU overhead. This is a steep cost, and a particularly steep cost for NVME.

We can assume the perceived slowdown is greater on NVME than with regular SSDs. That’s because we have shifted the bottleneck away from an HBA and put it squarely on the host CPU. Which will cause your virtual machine performance to suffer.

Raid is not a backup. You already seem to understand this. Even in the enterprise a two vdev raidz1 would be a valid option for the right workload.

Take my advise seriously.
When you come back for help because your system runs crappy…you either going to hear other folks say similar things or not hear much at all…the collective knowledge base in the community doesn’t have a lot of data on nvme.

At least not nearly as much as on SATA and SAS. In some sense NVME has been around a while. In others it’s still an infant.

sfatula · Aug 21, 2023

Yeah, I think there might be something under 1% additional risk for raidz1 compared to raidz2 with fast devices list this, just my guess not based on hard data, it's going to be minimal. The cost of that additional "security" is steep indeed.

Once you have 2+ vdevs, that's where a hot spare can be useful as well. But if he wants the penalty of raidz2 and a single vdev, and it makes him feel safer, go for it. The only thing that makes me feel safe is multiple off site backups.

NickF · Aug 21, 2023

Yeah I don't have any hard data to share here - I am explaining only as I understand ZFS and with my anecdotal experience. But I also agree that the increased risk is negligible. 2/3 data to parity ratio is good enough reason for me to recommend RaidZ1 for this type of application. Else mirrors. But OP's mind is already likely made up in choosing Z2/Z3.

Patrick_3000 · Aug 21, 2023

Thanks, everyone, for your thoughts. I may not agree with what everyone is saying, but the discussion has helped clarify my thinking.

To sum up, it sounds like I certainly don't need RAID Z3, since it's likely overkill from a data integrity standpoint and slower even than RAID Z2 and with a greater hardware cost.

A couple of commenters are advocating two VDEVs of RAID Z1, which would be the fastest approach. However, that's riskier than RAID Z2 since two SSD failures could destroy the pool.

I don't need blinding fast speed. If I could get somewhere between 500 MB/sec and 1000 MB/sec transfers to and from the SCALE server on my 10 gbe network and decent speed when running large jobs on VMs with virtual drives on the pool, I'd be happy. My current HDD pool, of course, gets nowhere near that.

I'm more worried about pool failure. I've got a backup SCALE server on-site and multiple off-site backups, both cloud and physical, but I'd rather avoid the hassle, downtime, and risk of ever having to rely on restoring from backups.

What I think I'll do is start out with a 5-SSD RAID Z2 pool and see what type of speeds I get. If speed is a problem, then In a controlled way, I'll rebuild the pool with 6 SSDs and two VDEvs.

If anyone has additional thoughts or insights, feel free to share them.

sfatula · Aug 21, 2023

No matter what you do, I'd suggest keeping a spare on hand. If you are worried about a couple minute rebuild (my last one took 10 seconds), then you should have a spare drive on hand IMHO.

Patrick_3000 · Aug 21, 2023

sfatula said:
No matter what you do, I'd suggest keeping a spare on hand. If you are worried about a couple minute rebuild (my last one took 10 seconds), then you should have a spare drive on hand IMHO.

Thanks for your feedback that resilvering is only likely to take a few minutes. That's helpful. I've had to resilver with hard drives a couple of times over the years, and the process has generally taken a day or two. I knew it would be faster with SSD, but I didn't realize how much faster.

A 12 TB pool, containing around 6 TB of data, spread across five 4-TB SSDs in a RAID Z2 configuration. That's what I'll have. I'm not super-familiar with the technical details of ZFS RAID implementation, but if an SSD fails and is replaced, resilvering would have to involve moving several terabytes of data to the new SSD.

My quick calculations suggest that with NVME SSD transfers of somewhere between 1000 MB/sec and 5000 MB/sec, which seems like a reasonable range for consumer grade SSDs in my setup, that would be between 1 GB/sec and 5 GB/sec, which means that transferring 1 TB could take something like 1000 seconds (around 17 minutes) on the high end and 200 seconds (around 3 minutes) on the low end.

Bottom line: as an extremely rough approximation, resilvering my pool could take somewhere between about 10 minutes and two hours.

Given that rough approximation and your feedback on resilvering with SSD, I'm comfortable with RAID Z2. I'll look into the hot spare idea and am leaning toward doing it, although I live in a city and could probably go out and get an SSD at a retailer relatively quickly if I needed to.

NickF · Aug 22, 2023

You are making an assumption that NVME drives will perform in alignment with your expectations. They probably should be in that range, sure.

Given the set of circumstances, I want to make my point clear. This system needs to be in BALANCE. Using NVME in this way on the platform you have chosen will work. But you are going to run into system-level bottlenecks that are going to cause your VM workloads to suffer, soley because you chose NVME. System tuning and sacrifices will have to be made.

Just because the performance needs you have are not great, doesn't mean that the tuning and logic I have posed is not relevant. If you were ONLY using the system as a data target, those concerns wouldn't really matter. Given that you are running other workloads - they ABSOLUTELY do. When we are talking about NVME...Every ZFS design decision will have potential system-level performance impact that transcends the storage subsystem. So in using compression and RAID Z2, as examples, you are robbing from Peter to pay Paul. Those CPU and RAM cycles are now used up in the storage subsystem before they can ever be used by KVM.

Important Announcement for the TrueNAS Community.

Z2 or Z3 for SSD pool?

NugentS

MVP

Patrick_3000

Contributor

NickF

Guru

TrueNAS Scale NVME Performance Scaling

Patrick_3000

Contributor

TrueNAS Scale NVME Performance Scaling

sfatula

Guru

NickF

Guru

Patrick_3000

Contributor

NickF

Guru

sfatula

Guru

NickF

Guru

Patrick_3000

Contributor

sfatula

Guru

Patrick_3000

Contributor

NickF

Guru

Similar threads

Important Announcement for the TrueNAS Community.

Z2 or Z3 for SSD pool?

MVP

Contributor

Guru

Contributor

Guru

Guru

Contributor

Guru

Guru

Guru

Contributor

Guru

Contributor

Guru

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Z2 or Z3 for SSD pool?"

Similar threads