Expanding zpool / performance tuning

ray201

Cadet
Joined
Nov 28, 2023
Messages
4
Hello Peeps,

About a year ago, I built a high-end TrueNAS Core server (running 13.0-STABLE) on a Dell R730XD with 12x 3.5" bays and 2x 2.5" slots for the TrueNAS system volume.

More HW setup details:

12x Seagate IronWolf 16TB
Dual 4x NVMe PCIe adapter cards, with two NVMe drives assigned to each of cache, metadata, and log. This means the pool has two identical NVMe drives for cache and for metadata, split across different adapters (x16 slot with bifurcation) for redundancy. Not sure if this is best practice, or if I should use a single NVMe drive each for cache, metadata, and log?

What happens if a pool loses an NVMe drive assigned to metadata, cache, or log? My challenge is that all the NVMe drives are internal (non-hotswap), and I want as much uptime as possible.

There is also 512GB of ECC RAM in the server (768GB max).
The pool is mainly used for VMs (VMware) over iSCSI.

Now all disk bays are populated, and I'm looking to expand my existing pool.
I was thinking of purchasing a JBOD (I've been looking at the NetApp DS4246, 12Gbps) with an LSI HBA (IT mode); the JBOD holds 24x 3.5" bays.

IronWolf 16TBs are close to impossible to find in any store, and the best buy in my part of the world at the moment is the Seagate 18TB Exos.
I see from my existing pool that my drives are divided into two groups of 6 drives (screenshot).

[screenshot: pool layout]


If I add 12-24 18TB Exos HDDs to my existing pool, will I get to use those extra TBs, or are they lost? Should I stick with 16TBs?
Can I also have 1-2 hot spares (18TB) in the NetApp JBOD and have them cover the internal 16TBs as well? I take it that I have to expand the pool with an even number of disks?

Am I right in believing that the slowest/crappiest drive will be the bottleneck in the pool?

Any recommendations on how I should expand this pool? I'm not keen on building a new pool, as I don't have enough NVMe drives or free PCIe slots for cache, log, etc.

Thanks for your time
Kind Regards - Reidar
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
What happens if a pool loses an NVMe drive assigned to metadata,
You lose the whole pool, irrecoverably.
What happens if a pool loses (...) cache or log
Pool performance slows down until the situation is corrected.

Dual 4x NVMe PCIe adapter cards, with two NVMe drives assigned to each of cache, metadata, and log. This means the pool has two identical NVMe drives for cache and for metadata, split across different adapters (x16 slot with bifurcation) for redundancy. Not sure if this is best practice, or if I should use a single NVMe drive each for cache, metadata, and log?
It sounds like you have a total of 6 SSDs, two for each of the roles. That part is mostly fine. Mostly because:
  1. For the special vdev (metadata vdev), you have a lot less redundancy (1) than for the rest of the pool (3). You almost certainly want to make this at least a three-way mirror.
  2. It's not clear that your SSDs are adequate for SLOG. If you're using SSDs that do not have power loss protection, you are effectively risking that the writes still pending in the ZIL will end up corrupted by the SSD as it loses power, making the SLOG completely pointless. What SSDs are these?
  3. It's not clear that you even benefit from an SLOG in the first place. What's your workload?
  4. It's not clear that you benefit from two SSDs' worth of L2ARC. Again, what's your workload? (A quick way to check is sketched below.)
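If you want to sanity-check how hard those devices are actually working, a quick look from a shell on the box will show it. A minimal sketch, assuming your pool is named tank (substitute your real pool name):

# Per-vdev activity, including the log and cache devices, refreshed every 5 seconds
zpool iostat -v tank 5

If the log and cache lines sit near zero under your normal workload, those devices aren't buying you much.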
I see from my existing pool that my drives are divided into two groups of 6 drives (screenshot).
It's an oddball, but very safe and extremely conservative pool layout. A bit inefficient for my tastes, but it can take a lot of bad disks with good luck (and even up to three with extreme bad luck). Too bad the metadata vdev ruins that...
If I add 12-24 18TB Exos HDDs to my existing pool, will I get to use those extra TBs, or are they lost? Should I stick with 16TBs?
If you add additional vdevs with 18 TB disks, you get to use them fully.
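You can verify this yourself once the new vdevs are in. A minimal sketch, again assuming the pool is named tank:

# Per-vdev capacity breakdown - each vdev reports its own size,
# so vdevs built from 18 TB disks are not shrunk to match the 16 TB ones
zpool list -v tank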
Can I also have 1-2 hot spares (18TB) in the NetApp JBOD and have them cover the internal 16TBs as well?
Yes, of course, though two hot spares on top of all this redundancy are really getting into overkill territory. I'd definitely make sure the backup strategy is rock-solid before worrying about more disk failures on this machine - apart from the metadata vdev, of course.
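For reference, hot spares are added at the pool level and cover every vdev, and a spare only needs to be at least as large as the disk it replaces, so an 18 TB spare can stand in for a 16 TB drive. A minimal sketch with hypothetical device names:

# Add two pool-wide hot spares
zpool add tank spare da24 da25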
I take it that I have to expand the pool with an even number of disks?
No, though that does not mean you can do things willy-nilly. For maximum OCD, you can add additional 6-wide RAIDZ3 vdevs. If you're comfortable with a greater risk from disk failures, you can add, say, 10-wide RAIDZ3 vdevs and more than double the added space with just four more disks relative to the 6-wide scenario.
Real example: I run 12-wide RAIDZ2 vdevs at work. They're a bit wide, but that width was chosen to fit a typical chassis, and it's fine for my very sequential sort of workload. It also works because the guys on site can replace a disk with a cold spare within a day or two if there's no emergency, and faster if there is one. If it were a server sitting away from any qualified hands, I would've gone for RAIDZ3 plus a hot spare or two.
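To put numbers on the two options: with 18 TB disks, a 6-wide RAIDZ3 vdev nets 3 x 18 = 54 TB of raw usable space, while a 10-wide one nets 7 x 18 = 126 TB - more than double for four extra disks. A minimal sketch with hypothetical device names (always double-check which disks you're grabbing before running zpool add):

# Option A: another 6-wide RAIDZ3 vdev (3 data + 3 parity, 54 TB before overhead)
zpool add tank raidz3 da12 da13 da14 da15 da16 da17
# Option B: a 10-wide RAIDZ3 vdev (7 data + 3 parity, 126 TB before overhead)
zpool add tank raidz3 da12 da13 da14 da15 da16 da17 da18 da19 da20 da21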
Am I right in believing that the slowest/crappiest drive will be the bottleneck in the pool?
Sort of. It gets complicated, but that's not a bad starting point for your analysis.
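If you ever want to look for a straggler in practice, FreeBSD's gstat is the quick way; one disk pinned at a much higher %busy than its siblings during a scrub or heavy load is the classic symptom:

# Per-disk load, physical devices only
gstat -p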
 

ray201

Cadet
Joined
Nov 28, 2023
Messages
4
Hi Eric, and thanks a bunch for your quick reply - much appreciated!

It sounds like you have a total of 6 SSDs, two for each of the roles. That part is mostly fine. Mostly because:
  1. For the special vdev (metadata vdev), you have a lot less redundancy (1) than for the rest of the pool (3). You almost certainly want to make this at least a three-way mirror.
  2. It's not clear that your SSDs are adequate for SLOG. If you're using SSDs that do not have power loss protection, you are effectively risking that the writes still pending in the ZIL will end up corrupted by the SSD as it loses power, making the SLOG completely pointless. What SSDs are these?
  3. It's not clear that you even benefit from an SLOG in the first place. What's your workload?
  4. It's not clear that you benefit from two SSDs' worth of L2ARC. Again, what's your workload?
There are no SSDs, only SATA and M.2 NVMe drives. I actually have 8 NVMe drives, but 2 of them are not in use yet. Call them cold spares if you want :)
The NVMe drives are 512GB to 1TB Samsung Pro PCIe 3.0, with 4,000-5,000 MB/s R/W. To avoid corruption in case of a power loss, the server is connected to an APC UPS, which will send a shutdown command to TrueNAS before cutting power. Uptime is 512 days, but I guess I just jinxed it :)
At the moment there is still 141GB of RAM free; the ARC uses about 300GB. Below is a screenshot from the ZFS report:

[screenshot: ZFS ARC report]


My workload is about 75 virtual machines over 40Gbit iSCSI: a mix of web servers and smaller DBs. The hypervisor is VMware.




It's an oddball, but very safe and extremely conservative pool layout. A bit inefficient for my tastes, but it can take a lot of bad disks with good luck (and even up to three with extreme bad luck). Too bad the metadata vdev ruins that...
Yeah, I've got 3 drives of failure tolerance on the data vdevs and 2 on meta...

Yes, of course, though two hot spares on top of all this redundancy are really getting into overkill territory. I'd definitely make sure the backup strategy is rock-solid before worrying about more disk failures on this machine - apart from the metadata vdev, of course.
Yeah, you're right. My paranoia seems to have gotten the better of me :) I've got a second NAS with somewhat lower specs for sync/replication.
The VMs are also backed up with their own backup software, in addition to ZFS replication.
No, though that does not mean you can do things willy-nilly. For maximum OCD, you can add additional 6-wide RAIDZ3 vdevs. If you're comfortable with a greater risk from disk failures, you can add, say, 10-wide RAIDZ3 vdevs and more than double the added space with just four more disks relative to the 6-wide scenario.
Using different vdev sizes won't affect performance? Is there any advantage to keeping them identical, other than pleasing my paranoia?

Again, thanks for your time
//Reidar
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
There are no SSDs
NVMe disks are SSDs.
The NVMe drives are 512GB to 1TB Samsung Pro PCIe 3.0, with 4,000-5,000 MB/s R/W
You mean Samsung 970 Pro or similar? Those would not be adequate for SLOG, but fine for the other roles.
My workload is about 75 virtual machines over 40Gbit iSCSI: a mix of web servers and smaller DBs. The hypervisor is VMware.
Ok, that justifies SLOG and possibly L2ARC, but you have so much RAM that the L2ARC is mostly idling.
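For anyone wanting to confirm this on their own system: TrueNAS Core ships an ARC reporting tool, and the raw counters are exposed via sysctl on FreeBSD. A minimal sketch:

# Full ARC/L2ARC statistics report (named arc_summary or arc_summary.py depending on version)
arc_summary
# Raw L2ARC hit/miss counters
sysctl kstat.zfs.misc.arcstats.l2_hits kstat.zfs.misc.arcstats.l2_misses

A tiny l2_hits figure next to a large ARC hit count means the L2ARC is indeed mostly idling.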
Yeah, I've got 3 drives of failure tolerance on the data vdevs and 2 on meta...
No, only one on the metadata vdev.
Using different vdev sizes won't affect performance? Is there any advantage to keeping them identical, other than pleasing my paranoia?
The rule of thumb is "don't overdo it and don't be silly". Mixing similar but not-quite-identical vdevs is fine. All combinations of vdevs are expected to work, but the weirder they get, the less likely it is they're a good idea.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
...
Yeah, I've got 3 drives of failure tolerance on the data vdevs and 2 on meta...
...
The suggested rule for Special / Metadata vDevs is to have the same amount of disk failure tolerance as the data vDevs. The reason is that loss of the Special / Metadata vDev means loss of the pool, unrecoverably.

As Eric pointed out, a 2-way Mirror Special / Metadata vDev does not have the same redundancy as RAID-Z3. This is what we mean:

RAID-Z1, 1 disk of redundancy - 2-way Mirror for Special / Metadata vDev (1 disk of redundancy)
RAID-Z2, 2 disks of redundancy - 3-way Mirror for Special / Metadata vDev (2 disks of redundancy)
RAID-Z3, 3 disks of redundancy - 4-way Mirror for Special / Metadata vDev (3 disks of redundancy)

Now, perhaps a 4-way Mirror is a bit overkill. But if you are paranoid, at least have a 3-way Mirror for your Special / Metadata vDev.
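Going from a 2-way to a 3-way Mirror doesn't require rebuilding anything: attaching a new device to an existing member of the special mirror widens it in place. A minimal sketch with hypothetical pool and FreeBSD device names:

# Attach a third NVMe device to the special mirror that currently contains nvd0
zpool attach tank nvd0 nvd2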
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
iSCSI would do much better on mirrors than on RAIDZ3. So if you were to rebuild, a stripe of 3-way mirrors for the HDDs and a 3-way mirror for special would be the way to go.
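For illustration only, a sketch of that layout with hypothetical device names - the twelve HDDs as four 3-way mirrors, plus a 3-way NVMe special mirror:

# Stripe of 3-way HDD mirrors plus a 3-way NVMe special mirror
zpool create tank \
  mirror da0 da1 da2 \
  mirror da3 da4 da5 \
  mirror da6 da7 da8 \
  mirror da9 da10 da11 \
  special mirror nvd0 nvd1 nvd2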
 

ray201

Cadet
Joined
Nov 28, 2023
Messages
4
You mean Samsung 970 Pro or similar? Those would not be adequate for SLOG, but fine for the other roles.
Yes, today it's got 2x 512GB Samsung Pro NVMe drives, one on each PCIe adapter card. Might be a dumb question, but why are they not adequate? Any suggestions for a replacement? When I installed them, I was focusing on MB/s.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Yes, today it's got 2x 512GB Samsung Pro NVMe drives, one on each PCIe adapter card. Might be a dumb question, but why are they not adequate? Any suggestions for a replacement? When I installed them, I was focusing on MB/s.
No Power Loss Protection. There are a few M.2 SSDs that do have that feature, but they're on the large side and may not fit. I think the Micron 7400 is available in M.2 with PLP, and there may be others.
 