80% max utilization - myth or reality?

ericsmith881

Dabbler
Joined
Mar 26, 2021
Messages
29
Well, iSCSI isn't expected to work at all well on RAIDZ, at least in my opinion, so this starts off from a bad place and simply confirms that the starting point was a bad place, which we knew. This is discussed at a basic level in the post about RAIDZ vs. mirrors linked above by @Jailer ...

True on the iSCSI/RAIDZ setup. Do you feel I'm getting everything that's possible performance-wise out of the iSCSI as it's set up? With the exception of jumbo frames, that is? It's not the ultimate performance I'm concerned with. It's the rapid degradation of performance that makes it unusable.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Hard to say. It seems very likely that you can get more out of it, as is usually the case with ZFS, but you have to really understand the ins and outs, and you're starting with a chain that has a very weak link to begin with.
 

ericsmith881

Dabbler
Joined
Mar 26, 2021
Messages
29
I guess the question is: why is iSCSI performance 30%-50% worse than SMB? This shows up even when starting with all RAM available for cache in either scenario, a situation where I would expect source- or network-limited results until the cache fills up. CPU isn't pegging. NICs aren't pegging. I can't understand why iSCSI isn't running like a bat out of hell until the write cache is full.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I guess the question is: why is iSCSI performance 30%-50% worse than SMB? This shows up even when starting with all RAM available for cache in either scenario, a situation where I would expect source- or network-limited results until the cache fills up. CPU isn't pegging. NICs aren't pegging. I can't understand why iSCSI isn't running like a bat out of hell until the write cache is full.

Because ZFS doesn't understand the context of your requests.

https://www.truenas.com/community/r...quires-more-resources-for-the-same-result.41/

When you are sequentially reading a native file on ZFS, there's a very good opportunity for ZFS to recognize the activity and deal with it on that basis. When you are doing it over iSCSI, you are merely asking ZFS for SAN blocks, and there's no way for it to understand the context of what is going on. There's also no particular reason for the blocks to be contiguous (because some other filesystem has laid them out), and there's also a lot of reason to think that the blocks will be fragmented on the ZFS side, because ZFS is a copy-on-write filesystem. So when you create the initial virtual disk for iSCSI, it's a bunch of contiguous blocks of zeroes, but then when you start writing to it, ZFS has to allocate new space elsewhere, so there are seeks and stuff involved.
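
To make that concrete, here's a toy Python sketch -- this is not ZFS's actual allocator, just an illustration of the copy-on-write effect: the zvol starts out contiguous, but every guest rewrite lands in freshly allocated space, so a logically sequential region ends up physically scattered.

```python
# Toy illustration of why rewrites fragment a copy-on-write volume.
# NOT how ZFS allocates space; it only shows the effect of never
# overwriting in place.

import random

ZVOL_BLOCKS = 16                                  # logical blocks in the "virtual disk"
logical_to_physical = list(range(ZVOL_BLOCKS))    # initial layout: contiguous zeroes
next_free = ZVOL_BLOCKS                           # rewrites get brand-new physical blocks

def overwrite(logical_block):
    """A guest rewrite never updates in place; it is allocated elsewhere."""
    global next_free
    logical_to_physical[logical_block] = next_free
    next_free += 1

# The guest filesystem rewrites random blocks, as a VM or database would.
for _ in range(32):
    overwrite(random.randrange(ZVOL_BLOCKS))

print(logical_to_physical)
# A logically sequential read of blocks 0..15 now jumps all over the
# physical layout -- i.e. seeks on spinning disks.
```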

Your question also appears to misunderstand how cache works. Caching only happens for data that has already been accessed (or through read-ahead). iSCSI loses out on read-ahead because ZFS has no idea "what's next."
 

ericsmith881

Dabbler
Joined
Mar 26, 2021
Messages
29
Because ZFS doesn't understand the context of your requests.

https://www.truenas.com/community/r...quires-more-resources-for-the-same-result.41/

When you are sequentially reading a native file on ZFS, there's a very good opportunity for ZFS to recognize the activity and deal with it on that basis. When you are doing it over iSCSI, you are merely asking ZFS for SAN blocks, and there's no way for it to understand the context of what is going on. There's also no particular reason for the blocks to be contiguous (because some other filesystem has laid them out), and there's also a lot of reason to think that the blocks will be fragmented on the ZFS side, because ZFS is a copy-on-write filesystem. So when you create the initial virtual disk for iSCSI, it's a bunch of contiguous blocks of zeroes, but then when you start writing to it, ZFS has to allocate new space elsewhere, so there are seeks and stuff involved.

Yeah this is what I suspected. Thanks for confirming it. Sounds like heavy iSCSI write loads for a datastore are not a good fit for ZFS?

Your question also appears to misunderstand how cache works. Caching only happens for data that has already been accessed (or through read-ahead). iSCSI loses out on read-ahead because ZFS has no idea "what's next."

Aren't writes cached pending commit to disk? I assumed all writes were stored in RAM and spooled to disk (i.e. "write back") as quickly as the disk could take it, with I/O performance falling off after the RAM is full.
 

ericsmith881

Dabbler
Joined
Mar 26, 2021
Messages
29

I read the above article and it neatly lays out the problem. Now it all makes sense, and more or less cements the idea that using ZFS as an iSCSI SAN for a VMware datastore is probably not the best idea.

Which now raises another question: given that NFS is more of a file-based setup, would you expect better performance there? It kind of sounds like it should but I haven't tested it yet.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Yeah this is what I suspected. Thanks for confirming it. Sounds like heavy iSCSI write loads for a datastore are not a good fit for ZFS?

They're fine, but they have to be resourced appropriately, which is something most people are unwilling to do.

Aren't writes cached pending commit to disk? I assumed all writes were stored in RAM and spooled to disk (i.e. "write back") as quickly as the disk could take it, with I/O performance falling off after the RAM is full.

Well, it's limited to two transaction groups. The other thing is that you are heavily dependent on the pool's full percentage and fragmentation. ZFS likes to allocate space contiguously, so if there is lots of free space, write speeds will be good; otherwise, it has to scan for free blocks, which can be an intensive process.
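
As a rough illustration of what "fast until the buffer fills" looks like, here is a toy model only -- the throughput and buffer figures are made up (loosely echoing the numbers in this thread), and the real ZFS write throttle with transaction groups and dirty-data limits is more nuanced:

```python
# Toy model: ingest faster than the disks can drain, with a bounded
# write buffer. All numbers are illustrative, not measurements.

ingest_rate = 600        # MB/s offered by the client
drain_rate = 150         # MB/s the fragmented pool can actually sustain
buffer_cap = 8_000       # MB of in-flight write data allowed (invented figure)

buffered = 0.0
accepted_total = 0.0
for second in range(1, 601):
    # Accept only as much as fits in the buffer this second.
    room = buffer_cap - buffered + drain_rate
    accepted = min(ingest_rate, room)
    accepted_total += accepted
    buffered = max(0.0, min(buffer_cap, buffered + accepted - drain_rate))
    if second in (10, 60, 600):
        print(f"t={second:4d}s  average accepted rate = {accepted_total / second:6.1f} MB/s")
```

The average accepted rate starts at the client's speed and settles toward what the pool can drain, which is the falloff pattern being described.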

I read the above article and it neatly lays out the problem. Now it all makes sense, and more or less cements the idea that using ZFS as an iSCSI SAN for a VMware datastore is probably not the best idea.

Well, it *can* be totally awesome. The thing is that you have to give it lots of resources. If you keep your occupancy rates low, like say 10-25%, and have enough RAM and L2ARC to cache your working set, you will have a tough time distinguishing your hard drive based pool from an SSD datastore. You end up with lots of fragmentation, an unavoidable thing on a CoW filesystem, but the low occupancy tends to keep writes zippy, and the large ARC/L2ARC keeps important reads zippy.

The problem here is that you need to burn resources to do this. For example, a NAS with 24 x 2TB HDD, to maintain redundancy in the face of a disk failure, needs 3-way mirrors, so you can have eight 2TB three-way mirrors, for a 16TB pool. However, if you follow the sizing I've suggested, you can really only use 1.6TB-4TB of it and get massively awesome write performance. If you couple that with 256GB of RAM and 1TB L2ARC, you will get great read performance too.
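
Putting that example into numbers (simple arithmetic only; it ignores ZFS slop space, metadata, and padding):

```python
# Back-of-envelope for the example above: 24 x 2TB drives in 3-way
# mirrors, with a 10-25% target occupancy as suggested.

drives = 24
drive_tb = 2
mirror_width = 3

vdevs = drives // mirror_width            # 8 three-way mirror vdevs
pool_tb = vdevs * drive_tb                # 16 TB of pool capacity

for occupancy in (0.10, 0.25):
    print(f"{occupancy:.0%} occupancy -> about {pool_tb * occupancy:.1f} TB usable")
# 10% occupancy -> about 1.6 TB usable
# 25% occupancy -> about 4.0 TB usable
```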

This is all just compsci trickery, trading one thing for another. As SSDs have gotten cheaper, it may not make quite as much sense to throw large amounts of hard disk at gaining sufficient free space to tackle the CoW space allocation problem -- but SSDs have their own issues too.

Which now raises another question: given that NFS is more of a file-based setup, would you expect better performance there? It kind of sounds like it should but I haven't tested it yet.

You still have the fundamental problem of ZFS being a CoW filesystem. If you use NFS for VM virtual disks, you still have a lot of complexity and block rewrites going on within the files, but at least ZFS understands some of the moving bits a bit better, which can be a plus. On the other hand, iSCSI supports UNMAP (think: TRIM), which gives ZFS much of the same ability to understand what data is no longer needed -- one of the important variables when storing random-access data like VMs or databases.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Should not the caller, whoever is using iSCSI, have its own read-ahead?

Yes. However, as you get farther away from the disks, there's more latency involved. The best ("fastest") read-ahead happens when you have something immediately adjacent to the disks that can predict what will need to be read next, not something that is connected over a much slower network connection.
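
A toy calculation of why that matters -- the latencies here are invented for illustration, and real initiators hide some of this with queue depth, but the shape of the result holds:

```python
# Toy numbers: read-ahead next to the disks vs. read-ahead done by the
# iSCSI initiator across the network. Latencies are illustrative only.

block_kb = 128
blocks = 10_000                      # ~1.25 GB sequential read
disk_stream_ms_per_block = 0.6       # assumed time to pull a block off a streaming disk

# Case 1: prefetch at the server. The next block is already in cache by
# the time the initiator asks, so (idealized) the network round trip
# fully overlaps with disk work.
server_prefetch_s = blocks * disk_stream_ms_per_block / 1000

# Case 2: read-ahead only at the initiator, one outstanding request at a
# time. Each block pays a network round trip on top of the disk time.
network_rtt_ms = 0.5
initiator_only_s = blocks * (disk_stream_ms_per_block + network_rtt_ms) / 1000

mb = blocks * block_kb / 1024
for name, secs in [("prefetch at the disks", server_prefetch_s),
                   ("read-ahead across the network", initiator_only_s)]:
    print(f"{name}: {mb / secs:.0f} MB/s effective")
```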
 

ericsmith881

Dabbler
Joined
Mar 26, 2021
Messages
29
They're fine, but they have to be resourced appropriately, which is something most people are unwilling to do.


I don't know if I'd go so far as to say "unwilling" so much as wondering about the cost/benefit analysis. At some point you have to ask how much better off you are using massively underutilized spinning rust instead of just going with a pure flash array or a tiered-storage flash/HDD hybrid. Obviously there's no "perfect" solution and everything has a tradeoff.

This is all just compsci trickery, trading one thing for another. As SSDs have gotten cheaper, it may not make quite as much sense to throw large amounts of hard disk at gaining sufficient free space to tackle the CoW space allocation problem -- but SSDs have their own issues too.


My main concern about performance here isn't the ultimate performance of the array but its average performance across the workload. The high-resource performance is indeed spectacular. It's the rapid and spectacular falloff in iSCSI performance that makes it a non-starter for my workload. I'd gladly trade some of the top end to make the low end...well...less low. With 24 CPU cores and 120GB of RAM, I figured it would have plenty of resources to still perform decently under my iSCSI workload. I hoped I'd end up with performance more akin to a low-end enterprise iSCSI SAN but obviously I need to rethink it. I desperately want some cost-effective shared storage for vMotion but so far, this isn't it or at least isn't with the hardware I'm able to throw at it. There are limits to just how ridiculous even I'm willing to get for what is essentially a home media server.

None of this is to say the product is useless or I'm giving up. Far from it. Understanding the benefits and limitations of ZFS motivates me to find new ways to apply it. Even as a NAS, it's a better use of my hardware than sitting on a shelf gathering dust.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I don't know if I'd go so far as to say "unwilling" so much as wondering about the cost/benefit analysis. At some point you have to ask how much better off you are using massively underutilized spinning rust instead of just going with a pure flash array or a tiered-storage flash/HDD hybrid. Obviously there's no "perfect" solution and everything has a tradeoff.

The price differential between rust and silicon has changed over the last 15 years. Unfortunately, the physical limitations of the hard drive haven't; if anything they've gotten relatively worse, in that you still get roughly the same 200 seeks per second, but now on 18TB HDDs, so filling those drives with random accesses is less practical than it used to be. This has, of course, pushed things in favor of flash, which wasn't really an option ten years ago and was very expensive five years ago.
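
To put a number on that (back-of-envelope only; the 200 seeks/second and 18TB figures are from above, the 16KB random I/O size is my assumption):

```python
# How impractical random access on a big spinning disk has become.

seeks_per_second = 200               # roughly what a 7200rpm HDD can do
drive_tb = 18
io_size_kb = 16                      # assumed small random writes (VM/database-style I/O)

drive_kb = drive_tb * 1e9            # decimal TB -> KB
ios_needed = drive_kb / io_size_kb
seconds = ios_needed / seeks_per_second
print(f"~{seconds / 86400:.0f} days to fill the drive with random {io_size_kb}KB writes")
# -> roughly 65 days at 200 IOPS
```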

My main concern about performance here isn't the ultimate performance of the array but its average performance across the workload. The high-resource performance is indeed spectacular. It's the rapid and spectacular falloff in iSCSI performance that makes it a non-starter for my workload.

That's a classic symptom of being under-resourced.

I'd gladly trade some of the top end to make the low end...well...less low. With 24 CPU cores and 120GB of RAM, I figured it would have plenty of resources to still perform decently under my iSCSI workload.

Much more CPU than needed, probably less RAM than needed.

I hoped I'd end up with performance more akin to a low-end enterprise iSCSI SAN but obviously I need to rethink it. I desperately want some cost-effective shared storage for vMotion but so far, this isn't it or at least isn't with the hardware I'm able to throw at it. There are limits to just how ridiculous even I'm willing to get for what is essentially a home media server.

Speaking as someone who has worked with low- and mid-tier enterprise iSCSI SANs, I can say that it isn't that hard to get ZFS to outperform them handily ... but the thing is, if you pay mid five figures for that enterprise SAN, you're going to need to pay a good fraction of that to get comparable performance out of ZFS too.
 

ericsmith881

Dabbler
Joined
Mar 26, 2021
Messages
29
That's a classic symptom of being under-resourced.

Much more CPU than needed, probably less RAM than needed.


I just bought two quad-core Xeons with much higher clocks (3.6GHz vs. 2.9GHz) for this R710 off eBay for about $40. I'm going to swap out the hex-core CPUs for them and see if that makes any noticeable difference. The core count is, as you said, ridiculous overkill for TrueNAS, but then again this server was originally a VM host. Unfortunately I can't add any more RAM to the R710 without junking my existing DIMMs and going with higher-capacity ones. I'm not willing to make that investment in such an old piece of hardware, especially when it likely won't yield a big enough improvement.

Speaking as someone who has worked with low- and mid-tier enterprise iSCSI SANs, I can say that it isn't that hard to get ZFS to outperform them handily ... but the thing is, if you pay mid five figures for that enterprise SAN, you're going to need to pay a good fraction of that to get comparable performance out of ZFS too.


The benchmark I'm comparing this against is having this same 11x 8TB SATA array as a DAS vs. TrueNAS, both as VMware datastores. As a DAS, I could reliably get almost 300MB/s for large sequential writes and it maintained that performance throughout 50+TB of writes. TrueNAS in iSCSI config gives me almost 600MB/s initially but degrades to under 150MB/s after only 20+TB of writes (I didn't want to see just how awful it got after that).

TrueNAS over SMB gives me at least 600MB/s to start with (likely limited by my source 12x 6TB SAS DAS array), which degrades to around 450MB/s towards the end of 50+TB of writes. It's an improvement over the DAS, but one of the main goals was to get shared storage working, and SMB doesn't give me that. I guess I can live without it, but I'm questioning whether the increased performance of TrueNAS is worth the extra power consumption of an entire R710 for nothing more than non-shared storage.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
My main concern about performance here isn't the ultimate performance of the array but its average performance across the workload.
For mission-critical workloads my experience has always been that neither the top nor the average performance is relevant. For a given KPI, it is always the lowest value that determines whether or not the SLA is met.
 

ericsmith881

Dabbler
Joined
Mar 26, 2021
Messages
29
For mission-critical workloads my experience has always been that neither the top nor the average performance is relevant. For a given KPI, it is always the lowest value that determines whether or not the SLA is met.

Based on that metric, TrueNAS is a non-starter, then, due to its extreme falloff in performance for iSCSI workloads like mine. Yes, it can be overcome by extreme overprovisioning, but that's not the point. The good news is that my home media collection isn't under an SLA, and if it were, I'd have the budget to mitigate the performance issues I'm experiencing.
 