Seeking help with deteriorating performance of TrueNAS as a vSphere storage backend

iigx

Cadet
Joined
Aug 21, 2021
Messages
4
I have a vSphere cluster that uses TrueNAS SCALE with iSCSI as the storage backend, and I usually have 90 to 150 Windows virtual machines running in this vSphere cluster.
Before this problem occurred, it took about 2 minutes to clone a 60 GB VM in vCenter. However, a few days ago, the time to clone a VM from the same template increased to 6 minutes. I checked the network and ran an SSD performance test tool against a VM's disk, but all the results were normal.
The TrueNAS SCALE system has been running for 75 days; restarting the vSphere cluster and TrueNAS SCALE did not solve the problem.
Hardware configuration of TrueNAS-SCALE-22.02.4:
CPU: 1x Intel Xeon Silver 4210R
RAM: 128 GB ECC
Disk: 1x 500 GB EVO as the TrueNAS system disk, 23x 4 TB EVO as the data pool
Data Pool: Type=FileSystem, Compression=lz4, Dedup=Off

If you have encountered similar problems, or have optimization or diagnostic suggestions, please help me. Thank you.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
As the ZFS pool fills up, the SSDs also fill up. SSDs with no free (pre-erased) capacity are much slower than fresh ones.

By using RAIDZ3 (is that 23 wide?), the I/O to each drive ends up being very small, which makes the drives slower as well.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Which is why we still recommend sticking with the 50-60% pool-fill guideline, even for SSDs.
Even if the pool is only 50% full.... the SSDs can be 90% full.
Daily TRIM can help.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Even if the pool is only 50% full.... the SSDs can be 90% full.
Daily TRIM can help.

The free page pool can be running low, yes, but this is usually only a problem if you're relying solely on garbage collection. The best design is to use TRIM automatically, which ZFS supports, though if your TRIM implementation sucks, ZFS also supports manual TRIM from the command line. Automatic TRIM makes the best use of the controller's resources, because page erasing is generally a background process, and it tends to get you a larger free page pool because it prompts the controller the moment a block is freed. Modern controllers supposedly benefit. Whether or not a controller's firmware actually checks the status of the other blocks on the flash page is, of course, another matter. In theory this process should prevent a situation where the SSD is 90% dirty while the pool is only 50% full.
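
For anyone who wants to check where they stand: autotrim is just a pool property and a manual TRIM is a one-liner. The pool name "tank" below is a stand-in for your own:
Code:
zpool get autotrim tank       # is automatic TRIM enabled?
zpool set autotrim=on tank    # enable it
zpool trim tank               # or kick off a one-shot manual TRIM
zpool status -t tank          # watch TRIM progress per vdev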

That was the point of an argument I once had with Jordan about overprovisioning of SLOG devices, actually leaving the partition size down to a reasonable size so that the controller's mapper mapped unpartitioned LBA's to the zero block. This means that it is easier for the controller to recognize, even via basic garbage collection, which dirty pages were no longer used, and would therefore erase them and return them to the free page pool in a more expeditious fashion, meaning you could blow stuff out to SLOG with less of a chance of hitting the SLOG device's write speed capacity. I suppose this concept isn't relevant with Optane or NVDIMM as they lack the flash erasure performance tax.
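
To make that concrete, here is a rough sketch of the partition approach on SCALE; the device, partition size, and pool name are all made up, and leaving the rest of the device unpartitioned is the whole point:
Code:
parted -s /dev/nvme0n1 mklabel gpt
parted -s /dev/nvme0n1 mkpart slog 1MiB 17GiB   # small SLOG partition; the rest stays unpartitioned
zpool add tank log /dev/nvme0n1p1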
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
That was the point of an argument I once had with Jordan about overprovisioning of SLOG devices, actually leaving the partition size down to a reasonable size so that the controller's mapper mapped unpartitioned LBA's to the zero block. This means that it is easier for the controller to recognize, even via basic garbage collection, which dirty pages were no longer used, and would therefore erase them and return them to the free page pool in a more expeditious fashion, meaning you could blow stuff out to SLOG with less of a chance of hitting the SLOG device's write speed capacity. I suppose this concept isn't relevant with Optane or NVDIMM as they lack the flash erasure performance tax.
As in Hubbard? And were you advocating not to overprovision SATA SSDs at all?
Or just not to do so overly aggressively?



Also @iigx, can you go to the shell and type
Code:
zpool list -v


And share the output with the class here?
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Well, there's your problem. How many vdevs?

Please read and review "The path to success for block storage", paying particular attention to points 2), 3), and 4).
The mirror recommendation is solid for HDDs.

SSDs cost more and perform better... we find up to a 5-wide RAIDZ1 (5wZ1) is a pretty good match for SSDs. The SSDs have to be reliable... not random junk.
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
have 90 to 150 Windows virtual machines running in this vSphere cluster.
SSDs cost more and perform better... we find up to a 5-wide RAIDZ1 (5wZ1) is a pretty good match for SSDs. The SSDs have to be reliable... not random junk
I agree that topology is a good one, given an appropriate number of vdevs for the workload. In this case, 4 vdevs of that topology with a few spares would be substantially faster than a single RAIDZ3 vdev, likely yielding a greater than 4x IOPS improvement over the current design. However, just because SSDs are faster doesn't automatically make a RAIDZ-vs-mirrors recommendation more or less valid. 100+ VMs may very well benefit from the performance differences (in IOPS and latency) between mirrors and RAIDZ1.
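
Purely as an illustration (device names are placeholders, and on TrueNAS you would normally build this through the GUI rather than the shell), 23 drives could be laid out as four 5-wide RAIDZ1 vdevs plus three spares:
Code:
zpool create tank \
  raidz1 sda sdb sdc sdd sde \
  raidz1 sdf sdg sdh sdi sdj \
  raidz1 sdk sdl sdm sdn sdo \
  raidz1 sdp sdq sdr sds sdt \
  spare sdu sdv sdw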

There may be other factors at play here, record size being one example. Having the wrong record size on RAIDZ hurts a bit more than it does with mirrors. OP didn't mention what they have set, which leads me to believe they are probably using the default 128K record size, which is not ideal for this workload. Then there is also sync vs. async.
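
A quick way to check from the shell; the dataset name tank/vmware is just a guess, so substitute your actual pool/dataset:
Code:
zfs get recordsize,sync,compression tank/vmware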

Part of the reason I want to see the output of zpool list is so we can check the fragmentation level, which, if high, usually corresponds with a record size that is too large.

Also, all of these recommendations are destructive actions that will require OP to move all of the data off and back on again. I think we can all agree that OP should rip off the bandaid and consider re-engineering the pool, but there will be a loss of storage space and an outage associated with it no matter which path they choose. We may be able to get the performance a little better, but it's never going to be great with the pool designed as it is.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
As in Hubbard? And were you advocating not to overprovision SATA SSDs at all?

I was forcefully advocating to overprovision. I got blown off big time, I believe directly by Jordan, with a "come back when you have statistics to prove this."

It wouldn't be the first time that they refused to implement something I suggested and then did it later on.

I'm actually still curious where the incorrect term "overprovision" came from. Typically when you undersize things, such as limiting a circuit capable of 1 Gbps down to 900 Mbps, this is referred to as "underprovisioning" in the industry. The incorrect "overprovision" seemed to pop up around 2015-2016 and may have come from the gamers. Either way, I talked about X-provisioning often enough.

I probably have a bug number for underprovisioning on the FreeNAS bug tracker along with a blowoff closure from Jordan.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
SSDs work by having a controller chip that can rapidly manage mapping an incoming request for an LBA to a particular flash page and block. In many cases, a write to an SSD involves pulling a page out of the free page pool, copying data from the old page to the new page, integrating the LBA block data into that page, and then updating the block pointers to point to the new page and blocks.

I would guess that NVMe namespaces are implemented by some mapping mechanism that gives each namespace a separate LBA-to-flash-location map. If that is the case, this suggests that the namespaces are quite likely pulling from the same free pool of pages. It would be a lot of work to implement multiple free page pools and I cannot think of a benefit other than providing fair access to demands for new space inside each namespace.

If we assume this to be correct, then the use of a NVMe namespace of limited size would mean that the remaining (out-of-range) LBA's point to the pseudo "zero block" and are not allocated to any actual flash page/block.
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
If that is the case, this suggests that the namespaces are quite likely pulling from the same free pool of pages.
So, for the application of a single device as a SLOG, we're likely not going to see much of a difference, partitions vs namespaces?

What may be interesting is potentially adding multiple (3 or 4) "overprovisioned" 4 GB namespaces to a single NVMe drive and then striping those "devices" as a SLOG.
Micron's marketing numbers suggest that, at the very least, it may be beneficial if the workload can take advantage of it.
[Micron chart: performance by namespace count]
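
If anyone wants to experiment, nvme-cli can carve out namespaces along these lines. Everything here is a placeholder (device path, controller ID, LBA format, and the 4 GiB size expressed in 512-byte blocks), and the drive has to actually support namespace management:
Code:
# assumes LBA format 0 = 512-byte sectors; 4 GiB = 8388608 blocks
nvme create-ns /dev/nvme0 --nsze=8388608 --ncap=8388608 --flbas=0
# attach the namespace ID reported above to controller 0, then rescan
nvme attach-ns /dev/nvme0 --namespace-id=2 --controllers=0
nvme ns-rescan /dev/nvme0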
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
So, for the application of a single device as a SLOG, we're likely not going to see much of a difference, partitions vs namespaces?

I am hard-pressed to imagine a way in which the implementations end up behaving significantly differently. It would mostly be a matter of who is implementing the "space" mapping. For partitions, the host adds an offset to the LBA to reach a particular portion of the SSD's LBA map. For namespaces, the controller is doing something to quickly transform that, likely just an addition of some sort. So it's just a matter of who does the adding, I suspect.

What may be interesting is potentially adding multiple (3 or 4) "overprovisioned" 4 GB namespaces to a single NVMe drive and then striping those "devices" as a SLOG.

I see what you're saying with respect to the increased TPS at NS={2,3,4,...} but I am skeptical. Plus it's read traffic, while SLOG is exclusively write traffic.

This really gets to questions of how ZFS works with a SLOG device. I knew more about this ten years ago, when I went in to self-solve the issues related to bug #1531 and learned a lot about the whole ZFS write process. So I'm going to toss out some ideas, about which I might be right or wrong.

First, striping namespaces on a single NVMe device makes the assumption that you can make two namespaces somehow go faster than a single namespace. I agree the graph appears to suggest that. But I'm wondering if this is just due to latency effects, the way that we obsess about queue depth on SAS. I'd like to understand the mechanics of what's going on in the graph. Because the effect evaporates after NS=5, it's clear that some inefficiency in the system is being exploited and transformed into small extra amounts of performance.

If you had TWO NVMe devices, you might be able to make one namespace on one striped with one namespace on the other go faster. But I am skeptical even there. At some point you may hit a synchronization penalty, because striping inherently wants to read both sides of a block at about the same time. Even if you take advantage of NVMe's huge command queues (up to 64K queues of 64K entries each), you are not guaranteed to be able to match speeds and stay in sync.

I would like to understand why Micron claims handing a single NVMe device two transactions to complete to a single namespace is going to be less efficient than handing a single NVMe device one transaction each to two namespaces. It might not be much more efficient, but even just processing the switch between namespaces likely takes a tiny bit of time. Or are they just cramming NCQ up to the gills with entries and finding that 128k queues to two namespaces is slightly more efficient than 64k queues to one?

I have a hard time seeing how a single NVMe device handed one transaction each to two namespaces could complete that operation FASTER than two transactions to a single namespace unless there was concurrency available in the controller or some other inefficiency to be squeezed out.

That's the controller-level analysis I have in my head. Correct me anywhere you please.

The next bit is ZFS, and is a bit more nebulous. I do not recall ZFS being able to handle concurrency in the ZIL. zil_commit is designed as single-threaded and causes other calls to zil_commit to block.

This probably made sense years ago when the ZIL mechanism was designed; I don't think anyone was thinking of the possibility of parallel log devices.

Now, theoretically, if you did have two log devices (and please note I am NOT saying "striped" here; let's just go with a POSIX-satisfying theoretical design), then with one writer client you can probably only use one logger write thread as well, because the latency in the operation is primarily the stuff happening in the SSD controller and flash. The messaging back and forth between the SSD controller and the CPU happens at NVMe speed (hella fast). But if you had two writer clients, having a second logger write thread allows you to ingest at up to 2x the speed because you can write to both devices. So that's potentially workable.

The problem with striping is that you get into a lot of finicky management of trying to keep the write pace even between the two devices, and you are going to lose some potential speed due to the need to block when one device is a bit slower. Better to firehose data down to each device at whatever speed can be managed. It's still sync writes but you can do the log commits in parallel. That is also scalable to multiple devices (beyond 2).
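
For what it's worth, ZFS will already accept more than one non-mirrored log vdev; whether the ZIL actually extracts parallelism from them is exactly the open question above. Something like this (hypothetical pool and device names) adds two independent log vdevs rather than a mirror:
Code:
zpool add tank log /dev/nvme0n1 /dev/nvme1n1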
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
Still reading your response, but I figured we could take the rest of this convo here. I just tested exactly this.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Regarding multiple namespaces, it could be as simple as the controller having multiple cores it wants to keep busy and someone then came up with the idea of dumping the work of issuing reads in parallel onto the OS.
Honestly feels dodgy.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
My own opinion of the "over-provisioning" naming is that it likely came from over-provisioning of the spare space. All modern drives, SSD and HDD, have spare sectors available for sparing out faulty blocks.

I never really bought into partition-based over-provisioning of spare space. It just did not seem like it would work perfectly. Using the SATA protocol to over-provision, however, does seem like it would do the correct thing:
Code:
NAME
       hdparm - get/set SATA/IDE device parameters
...
       -N     Get/set max visible number of sectors, also known as the Host Protected Area setting. Without a
              parameter, -N displays the current setting, which is reported as two values: the first gives the
              current max sectors setting, and the second shows the native (real) hardware limit for the disk.
              The difference between these two values indicates how many sectors of the disk are currently hidden
              from the operating system, in the form of a Host Protected Area (HPA).
...
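
In practice that looks something like the following; the device name and sector count are examples only, and the leading "p" makes the new limit persist across power cycles:
Code:
hdparm -N /dev/sdX                 # show current vs. native max sector count
hdparm -N p6250000000 /dev/sdX     # expose only this many sectors; the rest becomes an HPA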

As a side note, we are getting close to having spare storage in DDR5 memory. With built-in ECC, independent of CPU-to-memory ECC, in theory the memory modules can start sparing out bad cell lines. I'm not sure how I feel about this: higher density but with some flaws needing sparing, or lower density and no sparing.


Back to the original poster: we do need to know the pool layout. RAID-Z3 and a total of 23 drives is not enough information. More than 10 to 12 drives in a RAID-Zx vdev tends not to perform well, especially when fragmentation gets high.
 