Expanding zpool / performance tuning

ray201

Cadet
Joined
Nov 28, 2023
Messages
4
Hello Peeps,

About a year ago, I built a high-end TrueNAS Core server (running 13.0-STABLE) on a Dell R730XD with 12x 3.5" bays and 2x 2.5" slots for the TrueNAS system volume.

More HW setup details:

12x Seagate IronWolf 16TB
Dual 4x NVMe PCIe adapter cards, with two NVMe drives assigned to each of cache, metadata, and log. This means the pool has two identical NVMe drives for cache and for metadata, split across different adapters (x16 slot with bifurcation) for redundancy. Not sure if this is best practice, or if I should use a single NVMe drive each for cache, metadata, and log?

What happens if a pool loses an NVMe drive assigned to metadata, cache, or log? My challenge is that all the NVMe drives are internal (non-hotswap), and I want as much uptime as possible.

There is also 512GB of ECC RAM in the server (768GB max).
The pool is mainly used for VMs (VMware) over iSCSI.

Now all disk bays are populated, and I'm looking to expand my existing pool.
I was thinking of purchasing a JBOD (I've been looking at the NetApp DS4246, 12Gbps) with an LSI HBA (IT mode); the JBOD holds 24x 3.5" bays.

IronWolf 16TBs are close to impossible to find in any store, and the best buy in my part of the world at the moment is the Seagate 18TB Exos.
I see from my existing pool that my drives are divided into two groups of 6 drives (screenshot).

[screenshot: pool layout]


If I add 12-24 18TB Exos HDDs to my existing pool, will I get to use those extra TBs, or are they lost? Should I stick with 16TBs?
Can I also have 1-2 hot spares (18TB) in the NetApp JBOD and have them cover the internal 16TBs as well? I take it that I have to expand the pool with an even number of disks?

Am I right in believing that the slowest/crappiest drive will be the bottleneck in the pool?

Any recommendations on how I should expand this pool? I'm not keen on building a new pool, as I don't have enough NVMe drives or free PCIe slots for cache, log, etc.

Thanks for your time
Kind Regards - Reidar
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
What happens if a pool loses an NVMe drive assigned to metadata,
You lose the whole pool, irrecoverably.
What happens if a pool loses (...) cache or log
Pool performance slows down until the situation is corrected.

Dual 4x NVMe PCIe adapter cards, with two NVMe drives assigned to each of cache, metadata, and log. This means the pool has two identical NVMe drives for cache and for metadata, split across different adapters (x16 slot with bifurcation) for redundancy. Not sure if this is best practice, or if I should use a single NVMe drive each for cache, metadata, and log?
It sounds like you have a total of 6 SSDs, two for each of the roles. That part is mostly fine. Mostly because:
  1. For the special vdev (metadata vdev), you have a lot less redundancy (1) than for the rest of the pool (3). You almost certainly want to make this at least a three-way mirror.
  2. It's not clear that your SSDs are adequate for SLOG. If you're using SSDs that do not have power loss protection, you are effectively risking that the writes still pending in the ZIL will end up corrupted by the SSD as it loses power, making the SLOG completely pointless. What SSDs are these?
  3. It's not clear that you even benefit from an SLOG in the first place. What's your workload?
  4. It's not clear that you benefit from two SSDs' worth of L2ARC. Again, what's your workload? (A quick way to check is sketched below.)
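If you want to sanity-check how hard those devices are actually working, a quick look from a shell on the box will show it. A minimal sketch, assuming your pool is named tank (substitute your real pool name):

# Per-vdev activity, including the log and cache devices, refreshed every 5 seconds
zpool iostat -v tank 5

If the log and cache lines sit near zero under your normal workload, those devices aren't buying you much.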
I see from my existing pool that my drives are divided into two groups of 6 drives (screenshot).
It's an oddball, but very safe and extremely conservative pool layout. A bit inefficient for my tastes, but it can take a lot of bad disks with good luck (and even up to three with extreme bad luck). Too bad the metadata vdev ruins that...
If I add 12-24 18TB Exos HDDs to my existing pool, will I get to use those extra TBs, or are they lost? Should I stick with 16TBs?
If you add additional vdevs with 18 TB disks, you get to use them fully.
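You can verify this yourself once the new vdevs are in. A minimal sketch, again assuming the pool is named tank:

# Per-vdev capacity breakdown - each vdev reports its own size,
# so vdevs built from 18 TB disks are not shrunk to match the 16 TB ones
zpool list -v tank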
Can I also have 1-2 hot spares (18TB) in the NetApp JBOD and have them cover the internal 16TBs as well?
Yes, of course, though two hot spares on top of all this redundancy are really getting into overkill territory. I'd definitely make sure the backup strategy is rock-solid before worrying about more disk failures on this machine - apart from the metadata vdev, of course.
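For reference, hot spares are added at the pool level and cover every vdev, and a spare only needs to be at least as large as the disk it replaces, so an 18 TB spare can stand in for a 16 TB drive. A minimal sketch with hypothetical device names:

# Add two pool-wide hot spares
zpool add tank spare da24 da25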
I take it that I have to expand the pool with an even number of disks?
No, though that does not mean you can do things willy-nilly. For maximum OCD, you can add additional 6-wide RAIDZ3 vdevs. If you're comfortable with a greater risk from disk failures, you can add, say, 10-wide RAIDZ3 vdevs and more than double the added space with just four more disks relative to the 6-wide scenario.
Real example: I run 12-wide RAIDZ2 vdevs at work. They're a bit wide, but that width was chosen to fit a typical chassis, and it's fine for my very sequential sort of workload. It also works because the guys on site can replace a disk with a cold spare within a day or two if there's no emergency, and faster if there is one. If it were a server sitting away from any qualified hands, I would've gone for RAIDZ3 plus a hot spare or two.
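To put numbers on the two options: with 18 TB disks, a 6-wide RAIDZ3 vdev nets 3 x 18 = 54 TB of raw usable space, while a 10-wide one nets 7 x 18 = 126 TB - more than double for four extra disks. A minimal sketch with hypothetical device names (always double-check which disks you're grabbing before running zpool add):

# Option A: another 6-wide RAIDZ3 vdev (3 data + 3 parity, 54 TB before overhead)
zpool add tank raidz3 da12 da13 da14 da15 da16 da17
# Option B: a 10-wide RAIDZ3 vdev (7 data + 3 parity, 126 TB before overhead)
zpool add tank raidz3 da12 da13 da14 da15 da16 da17 da18 da19 da20 da21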
Am I right in believing that the slowest/crappiest drive will be the bottleneck in the pool?
Sort of. It gets complicated, but that's not a bad starting point for your analysis.
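If you ever want to look for a straggler in practice, FreeBSD's gstat is the quick way; one disk pinned at a much higher %busy than its siblings during a scrub or heavy load is the classic symptom:

# Per-disk load, physical devices only
gstat -p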
 

ray201

Cadet
Joined
Nov 28, 2023
Messages
4
Hi Eric, and thanks a bunch for your quick reply - much appreciated!

It sounds like you have a total of 6 SSDs, two for each of the roles. That part is mostly fine. Mostly because:
  1. For the special vdev (metadata vdev), you have a lot less redundancy (1) than for the rest of the pool (3). You almost certainly want to make this at least a three-way mirror.
  2. It's not clear that your SSDs are adequate for SLOG. If you're using SSDs that do not have power loss protection, you are effectively risking that the writes still pending in the ZIL will end up corrupted by the SSD as it loses power, making the SLOG completely pointless. What SSDs are these?
  3. It's not clear that you even benefit from an SLOG in the first place. What's your workload?
  4. It's not clear that you benefit from two SSDs' worth of L2ARC. Again, what's your workload?
There are no SSDs, only SATA and M.2 NVMe drives. I actually have 8 NVMe drives, but 2 of them are not in use yet. Call them cold spares if you want :)
The NVMe drives are 512GB to 1TB Samsung Pro PCIe 3.0, with 4,000-5,000 MB/s R/W. To avoid corruption in case of a power loss, the server is connected to an APC UPS, which will send a shutdown command to TrueNAS before cutting power. Uptime is 512 days, but I guess I just jinxed it :)
At the moment there is still 141GB of RAM free; the ARC uses about 300GB. Below is a screenshot from the ZFS report:

[screenshot: ZFS ARC report]


My workload is about 75 virtual machines over 40Gbit iSCSI: a mix of web servers and smaller DBs. The hypervisor is VMware.




It's an oddball, but very safe and extremely conservative pool layout. A bit inefficient for my tastes, but it can take a lot of bad disks with good luck (and even up to three with extreme bad luck). Too bad the metadata vdev ruins that...
Yeah, I've got 3 drives of failure tolerance on the data vdevs and 2 on meta...

Yes, of course, though two hot spares on top of all this redundancy are really getting into overkill territory. I'd definitely make sure the backup strategy is rock-solid before worrying about more disk failures on this machine - apart from the metadata vdev, of course.
Yeah, you're right. My paranoia seems to have gotten the better of me :) I've got a second NAS with somewhat lower specs for sync/replication.
The VMs are also backed up with their own backup software, in addition to ZFS replication.
No, though that does not mean you can do things willy-nilly. For maximum OCD, you can add additional 6-wide RAIDZ3 vdevs. If you're comfortable with a greater risk from disk failures, you can add, say, 10-wide RAIDZ3 vdevs and more than double the added space with just four more disks relative to the 6-wide scenario.
Using different vdev sizes won't affect performance? Is there any advantage to keeping them identical, other than pleasing my paranoia?

Again, thanks for your time
//Reidar
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
There are no SSDs
NVMe disks are SSDs.
The NVMe drives are 512GB to 1TB Samsung Pro PCIe 3.0, with 4,000-5,000 MB/s R/W
You mean Samsung 970 Pro or similar? Those would not be adequate for SLOG, but fine for the other roles.
My workload is about 75 virtual machines over 40Gbit iSCSI: a mix of web servers and smaller DBs. The hypervisor is VMware.
Ok, that justifies SLOG and possibly L2ARC, but you have so much RAM that the L2ARC is mostly idling.
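For anyone wanting to confirm this on their own system: TrueNAS Core ships an ARC reporting tool, and the raw counters are exposed via sysctl on FreeBSD. A minimal sketch:

# Full ARC/L2ARC statistics report (named arc_summary or arc_summary.py depending on version)
arc_summary
# Raw L2ARC hit/miss counters
sysctl kstat.zfs.misc.arcstats.l2_hits kstat.zfs.misc.arcstats.l2_misses

A tiny l2_hits figure next to a large ARC hit count means the L2ARC is indeed mostly idling.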
Yeah, I've got 3 drives of failure tolerance on the data vdevs and 2 on meta...
No, only one on the metadata vdev.
Using different vdev sizes won't affect performance? Is there any advantage to keeping them identical, other than pleasing my paranoia?
The rule of thumb is "don't overdo it and don't be silly". Mixing similar but not-quite-identical vdevs is fine. All combinations of vdevs are expected to work, but the weirder they get, the less likely it is they're a good idea.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
...
Yeah, I've got 3 drives of failure tolerance on the data vdevs and 2 on meta...
...
The suggested rule for Special / Metadata vDevs is to have the same amount of disk failure tolerance as the data vDevs. The reason is that loss of the Special / Metadata vDev means loss of the pool, unrecoverably.

As Eric pointed out, a 2-way Mirror Special / Metadata vDev does not have the same redundancy as RAID-Z3. This is what we mean:

RAID-Z1, 1 disk of redundancy - 2-way Mirror for Special / Metadata vDev (1 disk of redundancy)
RAID-Z2, 2 disks of redundancy - 3-way Mirror for Special / Metadata vDev (2 disks of redundancy)
RAID-Z3, 3 disks of redundancy - 4-way Mirror for Special / Metadata vDev (3 disks of redundancy)

Now, perhaps a 4-way Mirror is a bit overkill. But if you are paranoid, at least have a 3-way Mirror for your Special / Metadata vDev.
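Going from a 2-way to a 3-way Mirror doesn't require rebuilding anything: attaching a new device to an existing member of the special mirror widens it in place. A minimal sketch with hypothetical pool and FreeBSD device names:

# Attach a third NVMe device to the special mirror that currently contains nvd0
zpool attach tank nvd0 nvd2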
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
iSCSI would do much better on mirrors than on RAIDZ3. So if you were to rebuild, a stripe of 3-way mirrors for the HDDs and a 3-way mirror for special would be the way to go.
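For illustration only, a sketch of that layout with hypothetical device names - the twelve HDDs as four 3-way mirrors, plus a 3-way NVMe special mirror:

# Stripe of 3-way HDD mirrors plus a 3-way NVMe special mirror
zpool create tank \
  mirror da0 da1 da2 \
  mirror da3 da4 da5 \
  mirror da6 da7 da8 \
  mirror da9 da10 da11 \
  special mirror nvd0 nvd1 nvd2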
 

ray201

Cadet
Joined
Nov 28, 2023
Messages
4
You mean Samsung 970 Pro or similar? Those would not be adequate for SLOG, but fine for the other roles.
Yes, today it's got 2x 512GB Samsung Pro NVMe drives, one on each PCIe adapter card. Might be a dumb question, but why are they not adequate? Any suggestions for a replacement? When I installed them, I was focusing on MB/s.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Yes, today it's got 2x 512GB Samsung Pro NVMe drives, one on each PCIe adapter card. Might be a dumb question, but why are they not adequate? Any suggestions for a replacement? When I installed them, I was focusing on MB/s.
No Power Loss Protection. There are a few M.2 SSDs that do have that feature, but they're on the large side and may not fit. I think the Micron 7400 is available in M.2 with PLP, and there may be others.
 