Seeking help with deteriorating performance of TrueNAS as a vSphere storage backend

iigx

Cadet
Joined
Aug 21, 2021
Messages
4
I have a vSphere cluster that uses TrueNAS SCALE with iSCSI as the storage backend, and I usually have 90 to 150 Windows virtual machines running in this vSphere cluster.
Before this problem occurred, it took about 2 minutes to clone a 60 GB VM in vCenter. However, a few days ago, the time to clone a VM from the same template increased to 6 minutes. I checked the network and ran an SSD performance test tool against a VM's disk, but all the results were normal.
The TrueNAS SCALE system has been running for 75 days; restarting the vSphere cluster and TrueNAS SCALE did not solve the problem.
Hardware configuration of TrueNAS-SCALE-22.02.4:
CPU: 1x Intel Xeon Silver 4210R
RAM: 128 GB ECC
Disk: 1x 500 GB EVO as the TrueNAS system disk, 23x 4 TB EVO as the data pool
Data Pool: Type=FileSystem, Compression=lz4, Dedup=Off

If you have encountered similar problems, or have optimization or diagnostic suggestions, please help me. Thank you.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
As the ZFS pool fills up, the SSDs also fill up. SSDs with no free (pre-erased) capacity are much slower than fresh ones.

By using RAIDZ3 (is that 23 wide?), the I/O to each drive ends up being very small, which makes the drives slower as well.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Which is why we still recommend sticking with the 50-60% pool-fill guideline, even for SSDs.
Even if the pool is only 50% full.... the SSDs can be 90% full.
Daily TRIM can help.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Even if the pool is only 50% full.... the SSDs can be 90% full.
Daily TRIM can help.

The free page pool can be running low, yes, but this is usually only a problem if you're relying solely on garbage collection. The best design is to use TRIM automatically, which ZFS supports, though if your TRIM implementation sucks, ZFS also supports manual TRIM from the command line. Automatic TRIM makes the best use of the controller's resources, because page erasing is generally a background process, and it tends to get you a larger free page pool because it prompts the controller the moment a block is freed. Modern controllers supposedly benefit. Whether or not a controller's firmware actually checks the status of the other blocks on the flash page is, of course, another matter. In theory this process should prevent a situation where the SSD is 90% dirty while the pool is only 50% full.
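
For anyone who wants to check where they stand: autotrim is just a pool property and a manual TRIM is a one-liner. The pool name "tank" below is a stand-in for your own:
Code:
zpool get autotrim tank       # is automatic TRIM enabled?
zpool set autotrim=on tank    # enable it
zpool trim tank               # or kick off a one-shot manual TRIM
zpool status -t tank          # watch TRIM progress per vdev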

That was the point of an argument I once had with Jordan about overprovisioning of SLOG devices, actually leaving the partition size down to a reasonable size so that the controller's mapper mapped unpartitioned LBA's to the zero block. This means that it is easier for the controller to recognize, even via basic garbage collection, which dirty pages were no longer used, and would therefore erase them and return them to the free page pool in a more expeditious fashion, meaning you could blow stuff out to SLOG with less of a chance of hitting the SLOG device's write speed capacity. I suppose this concept isn't relevant with Optane or NVDIMM as they lack the flash erasure performance tax.
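
To make that concrete, here is a rough sketch of the partition approach on SCALE; the device, partition size, and pool name are all made up, and leaving the rest of the device unpartitioned is the whole point:
Code:
parted -s /dev/nvme0n1 mklabel gpt
parted -s /dev/nvme0n1 mkpart slog 1MiB 17GiB   # small SLOG partition; the rest stays unpartitioned
zpool add tank log /dev/nvme0n1p1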
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
That was the point of an argument I once had with Jordan about overprovisioning of SLOG devices, actually leaving the partition size down to a reasonable size so that the controller's mapper mapped unpartitioned LBA's to the zero block. This means that it is easier for the controller to recognize, even via basic garbage collection, which dirty pages were no longer used, and would therefore erase them and return them to the free page pool in a more expeditious fashion, meaning you could blow stuff out to SLOG with less of a chance of hitting the SLOG device's write speed capacity. I suppose this concept isn't relevant with Optane or NVDIMM as they lack the flash erasure performance tax.
As in Hubbard? And were you advocating not to overprovision SATA SSDs at all?
Or just not to do so overly aggressively?



Also @iigx, can you go to the shell and type
Code:
zpool list -v


And share the output with the class here?
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Well, there's your problem. How many vdevs?

Please read and review "The path to success for block storage", paying particular attention to points 2), 3), and 4).
The mirror recommendation is solid for HDDs.

SSDs cost more and perform better... we find up to a 5-wide RAIDZ1 (5wZ1) is a pretty good match for SSDs. The SSDs have to be reliable... not random junk.
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
have 90 to 150 Windows virtual machines running in this vSphere cluster.
SSDs cost more and perform better... we find up to a 5-wide RAIDZ1 (5wZ1) is a pretty good match for SSDs. The SSDs have to be reliable... not random junk
I agree that topology is a good one, given an appropriate number of vdevs for the workload. In this case, 4 vdevs of that topology with a few spares would be substantially faster than a single RAIDZ3 vdev, likely yielding a greater than 4x IOPS improvement over the current design. However, just because SSDs are faster doesn't automatically make a RAIDZ-vs-mirrors recommendation more or less valid. 100+ VMs may very well benefit from the performance differences (in IOPS and latency) between mirrors and RAIDZ1.
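
Purely as an illustration (device names are placeholders, and on TrueNAS you would normally build this through the GUI rather than the shell), 23 drives could be laid out as four 5-wide RAIDZ1 vdevs plus three spares:
Code:
zpool create tank \
  raidz1 sda sdb sdc sdd sde \
  raidz1 sdf sdg sdh sdi sdj \
  raidz1 sdk sdl sdm sdn sdo \
  raidz1 sdp sdq sdr sds sdt \
  spare sdu sdv sdw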

There may be other factors at play here, record size being one example. Having the wrong record size on RAIDZ hurts a bit more than it does with mirrors. OP didn't mention what they have set, which leads me to believe they are probably using the default 128K record size, which is not ideal for this workload. Then there is also sync vs. async.
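
A quick way to check from the shell; the dataset name tank/vmware is just a guess, so substitute your actual pool/dataset:
Code:
zfs get recordsize,sync,compression tank/vmware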

Part of the reason I want to see the output of zpool list is so we can check the fragmentation level, which, if high, usually corresponds with a record size that is too large.

Also, all of these recommendations are destructive actions that will require OP to move all of the data off and back on again. I think we can all agree that OP should rip off the bandaid and consider re-engineering the pool, but there will be a loss of storage space and an outage associated with it no matter which path they choose. We may be able to get the performance a little better, but it's never going to be great with the pool designed as it is.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
As in Hubbard? And were you advocating not to overprovision SATA SSDs at all?

I was forcefully advocating to overprovision. I got blown off big time, I believe directly by Jordan, with a "come back when you have statistics to prove this."

It wouldn't be the first time that they refused to implement something I suggested and then did it later on.

I'm actually still curious where the incorrect term "overprovision" came from. Typically when you undersize things, such as limiting a circuit capable of 1 Gbps down to 900 Mbps, this is referred to as "underprovisioning" in the industry. The incorrect "overprovision" seemed to pop up around 2015-2016 and may have come from the gamers. Either way, I talked about X-provisioning often enough.

I probably have a bug number for underprovisioning on the FreeNAS bug tracker along with a blowoff closure from Jordan.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
SSDs work by having a controller chip that can rapidly manage mapping an incoming request for an LBA to a particular flash page and block. In many cases, a write to an SSD involves pulling a page out of the free page pool, copying data from the old page to the new page, integrating the LBA block data into that page, and then updating the block pointers to point to the new page and blocks.

I would guess that NVMe namespaces are implemented by some mapping mechanism that gives each namespace a separate LBA-to-flash-location map. If that is the case, this suggests that the namespaces are quite likely pulling from the same free pool of pages. It would be a lot of work to implement multiple free page pools and I cannot think of a benefit other than providing fair access to demands for new space inside each namespace.

If we assume this to be correct, then the use of a NVMe namespace of limited size would mean that the remaining (out-of-range) LBA's point to the pseudo "zero block" and are not allocated to any actual flash page/block.
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
If that is the case, this suggests that the namespaces are quite likely pulling from the same free pool of pages.
So, for the application of a single device as a SLOG, we're likely not going to see much of a difference, partitions vs namespaces?

What may be interesting is potentially adding multiple (3 or 4) "overprovisioned" 4 GB namespaces to a single NVMe drive and then striping those "devices" as a SLOG.
Micron's marketing numbers suggest that, at the very least, it may be beneficial if the workload can take advantage of it.
[Micron chart: performance by namespace count]
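
If anyone wants to experiment, nvme-cli can carve out namespaces along these lines. Everything here is a placeholder (device path, controller ID, LBA format, and the 4 GiB size expressed in 512-byte blocks), and the drive has to actually support namespace management:
Code:
# assumes LBA format 0 = 512-byte sectors; 4 GiB = 8388608 blocks
nvme create-ns /dev/nvme0 --nsze=8388608 --ncap=8388608 --flbas=0
# attach the namespace ID reported above to controller 0, then rescan
nvme attach-ns /dev/nvme0 --namespace-id=2 --controllers=0
nvme ns-rescan /dev/nvme0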
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
So, for the application of a single device as a SLOG, we're likely not going to see much of a difference, partitions vs namespaces?

I am hard-pressed to imagine a way in which the implementations end up behaving significantly differently. It would mostly be a matter of who is implementing the "space" mapping. For partitions, the host adds an offset to the LBA to reach a particular portion of the SSD's LBA map. For namespaces, the controller is doing something to quickly transform that, likely just an addition of some sort. So it's just a matter of who does the adding, I suspect.

What may be interesting is potentially adding multiple (3 or 4) "overprovisioned" 4 GB namespaces to a single NVMe drive and then striping those "devices" as a SLOG.

I see what you're saying with respect to the increased TPS at NS={2,3,4,...} but I am skeptical. Plus it's read traffic, while SLOG is exclusively write traffic.

This really gets to questions of how ZFS works with a SLOG device. I knew more about this ten years ago, when I went in to self-solve the issues related to bug #1531 and learned a lot about the whole ZFS write process. So I'm going to toss out some ideas, about which I might be right or wrong.

First, striping namespaces on a single NVMe device makes the assumption that you can make two namespaces somehow go faster than a single namespace. I agree the graph appears to suggest that. But I'm wondering if this is just due to latency effects, the way that we obsess about queue depth on SAS. I'd like to understand the mechanics of what's going on in the graph. Because the effect evaporates after NS=5, it's clear that some inefficiency in the system is being exploited and transformed into small extra amounts of performance.

If you had TWO NVMe devices, you might be able to make one namespace on one striped with one namespace on the other go faster. But I am skeptical even there. At some point you may hit a synchronization penalty, because striping inherently wants to read both sides of a block at about the same time. Even if you take advantage of NVMe's huge command queues (up to 64K queues of 64K entries each), you are not guaranteed to be able to match speeds and stay in sync.

I would like to understand why Micron claims handing a single NVMe device two transactions to complete to a single namespace is going to be less efficient than handing a single NVMe device one transaction each to two namespaces. It might not be much more efficient, but even just processing the switch between namespaces likely takes a tiny bit of time. Or are they just cramming NCQ up to the gills with entries and finding that 128k queues to two namespaces is slightly more efficient than 64k queues to one?

I have a hard time seeing how a single NVMe device handed one transaction each to two namespaces could complete that operation FASTER than two transactions to a single namespace unless there was concurrency available in the controller or some other inefficiency to be squeezed out.

That's the controller-level analysis I have in my head. Correct me anywhere you please.

The next bit is ZFS, and is a bit more nebulous. I do not recall ZFS being able to handle concurrency in the ZIL. zil_commit is designed as single-threaded and causes other calls to zil_commit to block.

This probably made sense years ago when the ZIL mechanism was designed; I don't think anyone was thinking of the possibility of parallel log devices.

Now, theoretically, if you did have two log devices (and please note I am NOT saying "striped" here; let's just go with a POSIX-satisfying theoretical design), then with one writer client you can probably only use one logger write thread as well, because the latency in the operation is primarily the stuff happening in the SSD controller and flash. The messaging back and forth between the SSD controller and the CPU happens at NVMe speed (hella fast). But if you had two writer clients, having a second logger write thread allows you to ingest at up to 2x the speed because you can write to both devices. So that's potentially workable.

The problem with striping is that you get into a lot of finicky management of trying to keep the write pace even between the two devices, and you are going to lose some potential speed due to the need to block when one device is a bit slower. Better to firehose data down to each device at whatever speed can be managed. It's still sync writes but you can do the log commits in parallel. That is also scalable to multiple devices (beyond 2).
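
For what it's worth, ZFS will already accept more than one non-mirrored log vdev; whether the ZIL actually extracts parallelism from them is exactly the open question above. Something like this (hypothetical pool and device names) adds two independent log vdevs rather than a mirror:
Code:
zpool add tank log /dev/nvme0n1 /dev/nvme1n1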
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
Still reading your response, but I figured we could take the rest of this convo here. I just tested exactly this.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Regarding multiple namespaces, it could be as simple as the controller having multiple cores it wants to keep busy and someone then came up with the idea of dumping the work of issuing reads in parallel onto the OS.
Honestly feels dodgy.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
My own opinion of the "over-provisioning" naming is that it likely came from over-provisioning of the spare space. All modern drives, SSD and HDD, have spare sectors available for sparing out faulty blocks.

I never really bought into partition-based over-provisioning of spare space. It just did not seem like it would work perfectly. Using the SATA protocol to over-provision, however, does seem like it would do the correct thing:
Code:
NAME
       hdparm - get/set SATA/IDE device parameters
...
       -N     Get/set max visible number of sectors, also known as the Host Protected Area setting. Without a
              parameter, -N displays the current setting, which is reported as two values: the first gives the
              current max sectors setting, and the second shows the native (real) hardware limit for the disk.
              The difference between these two values indicates how many sectors of the disk are currently hidden
              from the operating system, in the form of a Host Protected Area (HPA).
...
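
In practice that looks something like the following; the device name and sector count are examples only, and the leading "p" makes the new limit persist across power cycles:
Code:
hdparm -N /dev/sdX                 # show current vs. native max sector count
hdparm -N p6250000000 /dev/sdX     # expose only this many sectors; the rest becomes an HPA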

As a side note, we are getting close to having spare storage in DDR5 memory. With built-in ECC, independent of CPU-to-memory ECC, in theory the memory modules can start sparing out bad cell lines. I'm not sure how I feel about this: higher density but with some flaws needing sparing, or lower density and no sparing.


Back to the original poster: we do need to know the pool layout. RAID-Z3 and a total of 23 drives is not enough information. More than 10 to 12 drives in a RAID-Zx vdev tends not to perform well, especially when fragmentation gets high.
 