Special Allocation Class vdev & ESXi

rungekutta

Contributor
Joined
May 11, 2016
Messages
146
Hi,
I’m intrigued by the new special allocation class vdevs (catchy name!) but have some questions for this great community if anyone can chip in...

- First of all: my TrueNAS is used to store VMs hosted on ESXi. It also acts as a media server, backup server for all desktops and laptops, etc. The pool is a 6-disk RAIDZ2 with an SSD SLOG. Performance has been OK, although not fantastic when several VMs are doing heavy I/O at the same time. In any case I enjoy tinkering and would like to see if I can push it further.

- ESXi keeps the VMs in vmdk files on an NFS share. Record size on that dataset is 16k.

- First question: since ESXi runs the VM's file system inside a flat (vmdk) file on the NAS, presumably it constantly reads and writes in small chunks at different locations across the file. And given the 16k record size, presumably the files (many gigabytes, obviously) are split evenly into many 16k blocks, as TrueNAS/ZFS has no understanding of the contents of the file. So related to that - presumably the "small file blocks" feature of the special vdev would be of little use here, as all blocks are 16k, so I would either have to set the threshold to catch all of them (in effect moving all VM storage to the special vdev), or none? Am I thinking about this in the right way?

- And a second question, somewhat related: do you think metadata storage on an SSD special vdev would be helpful here? Or would the metadata typically be small enough to fit in RAM (ARC) anyway? (Currently a 6x3TB pool and 32GB RAM in my case.)

In short, I can totally see the strength of the special allocation class vdev in a use case where the ZFS file system is used directly by the host OS and in situations requiring small-file I/O (e.g. compiling stuff). I just don't really understand if the same benefits can be reached for hosted VMs via NFS.
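(For concreteness, a rough sketch of what I mean - "tank/vmstore" is just a placeholder for my NFS dataset, and the arc_summary tool may be named slightly differently depending on TrueNAS version:)

zfs get recordsize,special_small_blocks tank/vmstore
# with recordsize=16K, a threshold of 16K would catch every block, i.e.:
# zfs set special_small_blocks=16K tank/vmstore   <- all newly written VM data lands on the special vdev
arc_summary | grep -i meta   # rough look at how much metadata the ARC holds today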

Thoughts, or direct experience?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
Before we dig too deep, understand that RAIDZ2 with block storage runs relatively poorly. See "The path to success for block storage" for some details on that, including the thread linked therein about the mirrors vs. RAIDZ performance characteristics. I understand you're trying to run a pool for multiple uses here (media + backups are fine for RAIDZ) but just be aware that this might be where the buck stops if you're chasing VMFS performance.

- First question: since ESXi runs the VM's file system inside a flat (vmdk) file on the NAS, presumably it constantly reads and writes in small chunks at different locations across the file. And given the 16k record size, presumably the files (many gigabytes, obviously) are split evenly into many 16k blocks, as TrueNAS/ZFS has no understanding of the contents of the file. So related to that - presumably the "small file blocks" feature of the special vdev would be of little use here, as all blocks are 16k, so I would either have to set the threshold to catch all of them (in effect moving all VM storage to the special vdev), or none? Am I thinking about this in the right way?

Correct on everything. The VMDK is being split into records of at most 16K each, so the best you could do is keep the media and backup files from landing on the special_small_blocks devices. Not worth running here.
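To put numbers on that (dataset name is a placeholder): blocks at or below the special_small_blocks threshold get allocated on the special vdev, so on a 16K-recordsize dataset your only real choices are:

zfs set special_small_blocks=0 tank/vmstore    # default: metadata only, no file data on the special vdev
zfs set special_small_blocks=16K tank/vmstore  # every 16K record, i.e. the entire VMDK, on the special vdev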

- And a second question, somewhat related: do you think metadata storage on an SSD special vdev would be helpful here? Or would the metadata typically be small enough to fit in RAM (ARC) anyway? (Currently a 6x3TB pool and 32GB RAM in my case.)

This will likely help more than you think; while metadata reads will already be served from your RAM (ARC), the writes to your metadata are likely being choked off pretty hard by both the RAIDZ2 vdev configuration and the fact that "small random writes" don't play nice with HDDs generally. Since ZFS works with metadata on a per-record basis, not a per-file basis, you'll still reap all the benefits from having those updates hit fast NAND as opposed to slow HDD.
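If you want to watch that happen once a special vdev is in place, per-vdev I/O stats will show the metadata writes landing on the SSDs rather than the HDDs (pool name is a placeholder):

zpool iostat -v tank 5   # refresh every 5 seconds, one line per vdev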

However, since you're using RAIDZ2 for your data vdevs, understand that adding a special/meta vdev is currently irreversible. You can only remove a top-level vdev if all top-level vdevs are mirrors, so you're stuck once you add them. We want to be certain you're adding the right SSDs for the right purpose.
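When the time comes, it's worth a dry run first. Device names below are placeholders, and because the pool is RAIDZ2 you'll likely be warned about a mismatched replication level, which needs -f to override:

zpool add -n tank special mirror nvd0 nvd1   # preview the resulting layout, changes nothing
zpool add -f tank special mirror nvd0 nvd1   # commit once the preview looks right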

What make/model is your SLOG SSD, and what is your proposed meta SSD? You should be looking for a "SLOG-lite" style of drive; it doesn't need the write endurance, but it should still have good random and mixed I/O performance relative to your data vdevs. Not hard to do when your data vdevs are HDDs, but you don't want to get cheap DRAM-less TLC or QLC drives.
 

rungekutta

Contributor
Joined
May 11, 2016
Messages
146
Thanks very much for your reply. It was very informative in itself, and it's also reassuring to know I wasn't completely barking up the wrong tree... As for pool layout - it being a general-purpose setup, I just couldn't justify the space utilization overhead of striped mirrors while at the same time being vulnerable to losing the whole pool if the wrong combination of two disks breaks at once.

To answer your questions - the SLOG currently is a Samsung SM863 on SATA. My plan for a special vdev would be a PCIe NVMe adapter card and, for example, 2x Samsung 970 EVO in a mirror.

The motherboard is a SuperMicro X11SSM-F. I'm out of ports as well as space in the case, as I'm currently using a SATA DOM for boot, the abovementioned SSD for SLOG, and 6x spinning HDDs (which I'm currently replacing one by one in order to gain more space). So I'm thinking a PCIe NVMe card would be an easy and relatively cost-effective way of adding more SSDs.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
Fair point on the configuration as RAIDZ2, I just wanted to make sure you understood the balance in capacity vs. performance. If VMware performance isn't high on the priority list relative to usable space for media/files, then go right ahead. The use of special vdevs for metadata will help this as well.

To answer your questions - the SLOG currently is a Samsung SM863 on SATA. My plan for a special vdev would be a PCIe NVMe adapter card and, for example, 2x Samsung 970 EVO in a mirror.

The SM863 is a good SATA SLOG but isn't screaming fast compared to the newer SAS/NVMe options (obligatory Optane plug). In pure write workloads you might still have a bottleneck here, but regular mixed reads/writes would still be able to benefit from meta SSDs. The 970 EVOs are quite fast, but keep an eye on their wear leveling/usage over time.
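For the wear point, something like the following is enough to keep an eye on things (device path is a placeholder); watch "Percentage Used" and "Data Units Written" in the NVMe health section of the output:

smartctl -a /dev/nvme0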

The motherboard is a SuperMicro X11SSM-F. I'm out of ports as well as space in the case, as I'm currently using a SATA DOM for boot, the abovementioned SSD for SLOG, and 6x spinning HDDs (which I'm currently replacing one by one in order to gain more space). So I'm thinking a PCIe NVMe card would be an easy and relatively cost-effective way of adding more SSDs.

The SuperMicro should let you use a PCIe adapter with bifurcation, since that board should support the "multiple PCIe devices in one parent slot" logic. Try to avoid add-in cards that use a "PLX" or "switch" chip on them, as the latency takes a hit from all of the PCIe traffic having to shuffle through that.

Again though, we're talking about "reduced SSD performance", which is still orders of magnitude ahead of "full speed HDD" when it comes to random I/O. I've yet to benchmark the impact of metadata vdevs on RAIDZ2 data specifically for VMFS workloads, but they only mitigate the metadata side of things. The underlying characteristics of RAIDZ2 vdevs for the data itself, space amplification, etc. - there's no working around that.
 

rungekutta

Contributor
Joined
May 11, 2016
Messages
146
Thanks. That all sounds great. I'll be trying this out later. Just need to decide if I should get a dual adapter for a pair of NVMe SSDs in a special vdev mirror, or if I should get a quad card and throw in yet another NVMe SSD for L2ARC for kicks. It doesn't seem like motherboard/BIOS compatibility with the quad cards is a given though.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
Thanks. That all sounds great. I'll be trying this out later. Just need to decide if I should get a dual adapter for a pair of NVMe SSDs in a special vdev mirror, or if I should get a quad card and throw in yet another NVMe SSD for L2ARC for kicks. It doesn't seem like motherboard/BIOS compatibility with the quad cards is a given though.

Not sure about a quad card, but your board specifically calls out the AOC-SLG3-2M2 as supported for bifurcation in BIOS update 2.1a, so it will allow an x8 slot to split into x4x4 for sure. Not certain if it will allow x2x2x2x2, but normally SuperMicro limits their quad cards to x16 slots, and while your board's slot is x16 physically, it's only x8 electrically. So unfortunately I think you're limited to two there.

L2ARC isn't significantly harmed by being on the SATA bus though. You could get a USB-to-SATA adaptor and hang your SATA DOM off of that internally, which would free up a port. (Or just buy two 2-way cards, if you've got the slots!)
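And unlike a special vdev, a cache device is safe to experiment with - it can be added and removed at will (pool and device names are placeholders):

zpool add tank cache ada3    # L2ARC on a spare SATA device
zpool remove tank ada3       # and back out again whenever you like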
 

rungekutta

Contributor
Joined
May 11, 2016
Messages
146
Absolutely right; the manual confirms the x16 slot on the X11SSM is in fact wired for x8, and out of the three x8 slots, only one is wired for x8 and the other two only for x4! I had no idea. But that puts x16 quad NVMe cards out of the question for sure.

I guess another option would be to partition the two NVMe drives into two partitions each, build a mirror special vdev out of one partition from each drive, and a striped L2ARC across the other two. I would have to think about write endurance on the drives then. Bad idea...?
 

rungekutta

Contributor
Joined
May 11, 2016
Messages
146
Oh. Or one x4 PCIe NVMe card for a single NVMe SSD as L2ARC, alongside the AOC-SLG3-2M2 card in the x8 slot with mirrored NVMe SSDs for the special vdev.
 

Rand

Guru
Joined
Dec 30, 2013
Messages
906
Ignoring the specific use case in this thread (RAIDZ),

has it ever been determined whether special vdevs are useful for (NFS-based) VM storage pools?
I have tried reading up a bit, but this was the only thread I found targeting this specific question, and it derailed a bit so it didn't actually answer it.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,702
partition the two NVMe drives into two partitions each, build a mirror special vdev out of one partition from each drive, and a striped L2ARC across the other two. I would have to think about write endurance on the drives then. Bad idea...?
Be aware of the risky path you're taking here...

The metadata VDEV and other special VDEVs aren't like L2ARC and SLOG: they are integral to the pool, and if they're lost, your pool will be lost with them.

I'm not saying you can't or that it's not feasible to do, just that you're talking about CLI-only commands to manage something that can kill your pool. Know your stuff before going that way.
 

rungekutta

Contributor
Joined
May 11, 2016
Messages
146
Thanks for the warning, and yes, I've pretty much abandoned that idea... As and when (or if) I do this metadata vdev, I'll do it with a mirror of NVMe drives and without partition trickery.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
I have a question... Since I added a special vdev, I have not been able to see that any data is being written to it and it does not appear to have changed the performance of the pool. I think I am missing something in the setup. Is there a guide? What should I be looking at?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,702
zpool list -v should show the usage.
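Something like this (pool name is a placeholder); the special mirror appears as its own line with its own ALLOC/FREE columns, so you can see whether anything has landed on it yet:

zpool list -v tank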

You will likely need to rewrite a bunch of data to see it starting to kick in.

It probably won't help for cases that aren't metadata-heavy (bigger files), but it will tend to be useful where you have large numbers of small files.

I'm not sure about how useful it can be for block storage.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Metadata is handled at the record level, so it still gets leveraged for block data. As mentioned though it won't see any new data until you do writes or rewrites.
I created a new pool with the special vdev for metadata and filled it by doing a ZFS send and receive. Could it be that the metadata was not written to the special vdev because I used send and receive instead of a "normal" file copy?
 