80% max utilization - myth or reality?

ericsmith881

Dabbler
Joined
Mar 26, 2021
Messages
29
I've seen this discussed elsewhere and it's heavily hinted at in TrueNAS that exceeding 80% utilization of a pool is a Very Bad Idea. But is it? I hear performance gets really bad if you go beyond this usage, but are we talking 80% is fine but 81% it falls over and dies? Or does it just get progressively worse the further past 80% you go?

I ask because I have an 11x 8TB SATA array in an MD1000 SAS enclosure, currently provisioned as RAID-Z. The idea of leaving 16TB of space "unused" seems absurdly wasteful. If I bumped it to 90%, what kind of performance impact should I expect? It's intended to function as an iSCSI target for my VMware cluster, holding a single VMDK that will have a home media library. This library is composed exclusively of very large files (20GB-80GB on average). Disk activity will be very large sequential reads and writes (more reads than writes, but writes are important if I need to vMotion things back and forth from my other storage arrays).
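For reference, here's the back-of-the-envelope math behind that 16TB figure: a sketch assuming decimal TB, one drive's worth of RAID-Z1 parity, and ignoring ZFS metadata/slop overhead.

```shell
# Rough usable-space math for an 11x 8TB RAIDZ1 pool.
# Assumptions: decimal TB, one parity drive, metadata overhead ignored.
DRIVES=11; SIZE_TB=8
RAW=$((DRIVES * SIZE_TB))                 # 88 TB of raw disk
USABLE=$(( (DRIVES - 1) * SIZE_TB ))      # 80 TB after RAIDZ1 parity
AT80=$(( USABLE * 80 / 100 ))             # 64 TB usable at the 80% guideline
HELD_BACK=$(( USABLE - AT80 ))            # 16 TB deliberately left unused
echo "raw=${RAW}TB usable=${USABLE}TB at80=${AT80}TB held_back=${HELD_BACK}TB"
```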

My TrueNAS server is:

Dell PowerEdge R710
2x X5670 2.93GHz hex-core Xeon CPU
128GB RAM
LSI 9201-16e 4-port SAS HBA
Dual Broadcom 10GbE NIC

Any thoughts, comments, or observations are appreciated.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
This has been discussed many times in great detail here and the search function will be your friend.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I've seen this discussed elsewhere and it's heavily hinted at in TrueNAS that exceeding 80% utilization of a pool is a Very Bad Idea. But is it? I hear performance gets really bad if you go beyond this usage, but are we talking 80% is fine but 81% it falls over and dies? Or does it just get progressively worse the further past 80% you go?

I ask because I have an 11x 8TB SATA array in an MD1000 SAS enclosure, currently provisioned as RAID-Z. The idea of leaving 16TB of space "unused" seems absurdly wasteful. If I bumped it to 90%, what kind of performance impact should I expect? It's intended to function as an iSCSI target for my VMware cluster, holding a single VMDK that will have a home media library. This library is composed exclusively of very large files (20GB-80GB on average). Disk activity will be very large sequential reads and writes (more reads than writes, but writes are important if I need to vMotion things back and forth from my other storage arrays).

My TrueNAS server is:

Dell PowerEdge R710
2x X5670 2.93GHz hex-core Xeon CPU
128GB RAM
LSI 9201-16e 4-port SAS HBA
Dual Broadcom 10GbE NIC

Any thoughts, comments, or observations are appreciated.

Naw, you *can* go right on up to about 95%, but with RAIDZ, you are creating a stressy environment. As in, if you get a lot of fragmentation, your system will eventually feel like you're writing to floppy diskettes. No I am *not* kidding.

Plus, with RAIDZ, the parity overhead is likely to be very large, probably much greater than you expect.
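To illustrate the kind of overhead at play, here's a rough sketch of RAIDZ1 allocation for one small block, assuming 4K sectors (ashift=12) and the rule that RAIDZ allocations are padded to a multiple of parity+1 sectors:

```shell
# One 16K zvol block on RAIDZ1 with 4K sectors (assumed ashift=12).
DATA=4                              # 16K / 4K = 4 data sectors
PARITY=1                            # RAIDZ1 adds one parity sector here
TOTAL=$((DATA + PARITY))            # 5 sectors before padding
MULT=$((PARITY + 1))                # allocations padded to multiples of 2
PADDED=$(( (TOTAL + MULT - 1) / MULT * MULT ))  # 6 sectors actually allocated
echo "${PADDED} sectors allocated for ${DATA} sectors of data"
```

That's 50% overhead instead of the nominal 1-in-11 you might expect from the drive count, which is why small-block workloads on RAIDZ eat far more space than people anticipate.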

If you want to maintain SSD-like write speeds for VM block storage, you want to be somewhere in probably the 20-40% space USED (this is correct) range. Up to 50-60% is about as far as it should be pushed for block storage. You *can* go beyond this, as I said, up to 95% if you really want, but performance will begin to tank as random write cycles bap up your fragmentation. Your only solution at that point will be to empty a large portion of your pool (like back down to maybe less than 20% occupancy) and then copy stuff back. People are generally resistant to doing this, and generally expect their NAS to be fast at all times, which is why we have stuff like this.

Try to remember that ZFS creates speed through compsci tricks... in this case it can make disks perform much faster than disks normally perform, at the expense of space, because it is trading one thing for another. If you are unwilling to make the Faustian bargain of excessive resources to get amazing speed, then you get to keep your resources but it will eventually suck.
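If you want to watch this happen on a live pool, capacity and free-space fragmentation are both one command away (the pool name below is a placeholder):

```shell
# CAP is the percent of the pool allocated; FRAG is free-space
# fragmentation, which is the number that climbs as erase/rewrite
# cycles chew up the pool. "tank" is a placeholder pool name.
zpool list -o name,size,alloc,free,frag,cap tank
```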
 

ericsmith881

Dabbler
Joined
Mar 26, 2021
Messages
29
This has been discussed many times in great detail here and the search function will be your friend.

Yes, I'm quite well aware of that. I'm also well aware of what a search function does, how to use it, and what the results are. I stated in the very first sentence that I've seen the other discussions on this subject. I asked because my use case is rather different from what those other articles discuss, namely very large files using large sequential reads and writes on an iSCSI target of a VMware datastore using an unusually large array (80+TB). The majority of the articles cover general file usage with a mix of large/small files and lots of random I/O on smaller arrays. That is not necessarily applicable to my situation, so I asked the question. Referring someone to the search function without trying to understand why the question was asked is not helpful. Whether you intended it or not, it comes across as snide and condescending.

If all you offer is "use the search function" you might do well to simply not comment in the first place.
 

ericsmith881

Dabbler
Joined
Mar 26, 2021
Messages
29
Naw, you *can* go right on up to about 95%, but with RAIDZ, you are creating a stressy environment. As in, if you get a lot of fragmentation, your system will eventually feel like you're writing to floppy diskettes. No I am *not* kidding.

Plus, with RAIDZ, the parity overhead is likely to be very large, probably much greater than you expect.

If you want to maintain SSD-like write speeds for VM block storage, you want to be somewhere in probably the 20-40% space USED (this is correct) range. Up to 50-60% is about as far as it should be pushed for block storage. You *can* go beyond this, as I said, up to 95% if you really want, but performance will begin to tank as random write cycles bap up your fragmentation. Your only solution at that point will be to empty a large portion of your pool (like back down to maybe less than 20% occupancy) and then copy stuff back. People are generally resistant to doing this, and generally expect their NAS to be fast at all times, which is why we have stuff like this.

Try to remember that ZFS creates speed through compsci tricks... in this case it can make disks perform much faster than disks normally perform, at the expense of space, because it is trading one thing for another. If you are unwilling to make the Faustian bargain of excessive resources to get amazing speed, then you get to keep your resources but it will eventually suck.

Interesting. I'm not expecting SSD-like write speeds in this scenario, just performance at least as good as I was getting when this array was configured as a DAS. Random reads/writes are virtually non-existent so sequential speed is paramount. I configured this as a RAID-Z pool (80% usage) last night and started playing with it. I vMotion'd a 4TB VMDK to it and got about 450MB/s on average, with the low being about 300MB/s and the high being around 900MB/s. I consider that very good performance for a SATA array. I'll try bumping that to 90% and see what happens.

EDIT: Regarding the performance hit due to parity, is that due to disk overhead or CPU? Because in my setup, the CPU rarely goes over 15% usage. I've literally got 24 logical CPUs to play with, so unless the overhead is disk-bound, it shouldn't matter. Further, according to the article helpfully referenced above, RAID-Z is ideal for large sequential workloads. Not as ideal as a mirror, of course, but much easier on wasted space.
 
Last edited:

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
It's intended to function as an iSCSI target for my VMware cluster, holding a single VDMK that will have a home media library. This library is composed exclusively of very large files (20GB-80GB on average). Disk activity will be very large sequential reads and writes (more reads than writes but writes are important if I need to vMotion things back and forth from my other storage arrays).
Question: the "large media file" workflow really, REALLY works better when presented through NAS protocols (SMB/NFS) - what's the driving force to add the layers of VMDK/VMFS in the middle?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Interesting. I'm not expecting SSD-like write speeds in this scenario, just performance at least as good as I was getting when this array was configured as a DAS. Random reads/writes are virtually non-existent so sequential speed is paramount. I configured this as a RAID-Z pool (80% usage) last night and started playing with it. I vMotion'd a 4TB VMDK to it and got about 450MB/s on average, with the low being about 300MB/s and the high being around 900MB/s. I consider that very good performance for a SATA array. I'll try bumping that to 90% and see what happens.

You'll get about the same. However, over time, as erase/write cycles increase, it will slow down, dramatically, like falling off a cliff, and you will get to a point where it feels like you're writing to floppy disk. This is why most "testing" or "benchmarking" of ZFS is bovine excrement; people make the mistake of thinking that the sweet performance they're getting will be there later. It's always fast when it's fresh, it's the stuff that happens down the road that is problematic. The only real question is how long it takes to get there. If you are storing mostly huge chunks of sequential data, it could potentially take much longer to get there than normal VM storage, where smaller blocks are constantly being written.
 

ericsmith881

Dabbler
Joined
Mar 26, 2021
Messages
29
One item of note for VMware users: I noted a substantial performance increase when enabling Storage I/O control for this iSCSI target. I'd be curious if anyone else has noted the same?
 

ericsmith881

Dabbler
Joined
Mar 26, 2021
Messages
29
Question: the "large media file" workflow really, REALLY works better when presented through NAS protocols (SMB/NFS) - what's the driving force to add the layers of VMDK/VMFS in the middle?

It's mainly rooted in flexibility, being able to vMotion things around quickly and easily without having to concern myself with reconfiguring anything. I have no doubt performance would be better with SMB/NFS as you suggest, but ultimate performance isn't the driving force. I want something that's roughly as fast as (or faster than) a DAS datastore on a PERC H830 RAID controller but with the ability to use it for whatever VMs I might need. I'm willing to give up some performance if I gain flexibility, but I also don't want performance so awful it's not worth it.

EDIT: Also, things like snapshots and using it as a backup target are definitely nice to have.
 

ericsmith881

Dabbler
Joined
Mar 26, 2021
Messages
29
If you are storing mostly huge chunks of sequential data, it could potentially take much longer to get there than normal VM storage, where smaller blocks are constantly being written.

That was my thinking as well. I'm going to try filling it up with more VMDK's, move some stuff around, play with it for a bit and see if I can spot any impactful degradation. Probably spend a few days on it observing. Then I'll bump it up to 90% and do the same. If I get any interesting results I'll post them here.

At this point I'm largely toying around, figuring out what works before I put it into production. The whole project came about because I had this old R710 gathering dust in my basement. Years ago it was my main VMware server and I hated seeing it sitting around doing nothing.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
It's mainly rooted in flexibility, being able to vMotion things around quickly and easily without having to concern myself with reconfiguring anything. I have no doubt performance would be better with SMB/NFS as you suggest, but ultimate performance isn't the driving force. I want something that's roughly as fast as (or faster than) a DAS datastore on a PERC H830 RAID controller but with the ability to use it for whatever VMs I might need. I'm willing to give up some performance if I gain flexibility, but I also don't want performance so awful it's not worth it.

EDIT: Also, things like snapshots and using it as a backup target are definitely nice to have.
Ah. If it's not exclusively for the media storage, then certainly you can rig it up as mirrors and cut chunks of storage at a time. I've got some thoughts on how to set this up, but I need to get to an actual keyboard. Short version would be to use a large pool of mirror vdevs, and cut small (sparse) ZVOLs out of it for VMware use (with sync=always for safety), while leaving the media share on an SMB or NFS presented dataset that can run async for speed.
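A minimal sketch of that layout (pool name, device names, and sizes below are all placeholders, not recommendations for specific values):

```shell
# Pool of mirror vdevs; "tank" and the daN device names are placeholders.
zpool create tank mirror da0 da1 mirror da2 da3 mirror da4 da5

# Small sparse (-s) ZVOL carved out for VMware block storage,
# with sync writes forced for safety.
zfs create -s -V 2T tank/vmstore
zfs set sync=always tank/vmstore

# Media library as a plain dataset, shared over SMB/NFS; it can stay
# async (the default sync=standard) for speed.
zfs create -o atime=off tank/media
```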
 

ericsmith881

Dabbler
Joined
Mar 26, 2021
Messages
29
Ah. If it's not exclusively for the media storage, then certainly you can rig it up as mirrors and cut chunks of storage at a time. I've got some thoughts on how to set this up, but I need to get to an actual keyboard. Short version would be to use a large pool of mirror vdevs, and cut small (sparse) ZVOLs out of it for VMware use (with sync=always for safety), while leaving the media share on an SMB or NFS presented dataset that can run async for speed.

My only concern there is what if I need to take down my TrueNAS array for extended maintenance? That would make my media library unavailable for the duration. However, if it's in a VMDK on a datastore, I can vMotion it to one of my other arrays (I have three total, two DAS plus this TrueNAS, all of which are about 80TB) and not have any actual downtime. That's the flexibility I'm after. How would your proposal accommodate this use case?

EDIT: Full disclosure, I have two VMware hosts, one R620 and one R630. The R620 has a PowerVault MD1000 with SATA drives. The R630 has a PowerVault MD1200 with SAS drives. Both are set up as DAS using PERC's right now, both about 80TB in size. The TrueNAS R710 is my attempt to get some of the vMotion flexibility of shared storage which I've lacked in my home setup for years. My background is enterprise IT so I've been spoiled for a long time having big, fast iSCSI and FC arrays to play with.

EDIT EDIT: Yes, I know it's insane to have 240TB of storage at home. I love to tinker, what can I say? ;) Plus it's nice to be able to experiment with stuff like this as I don't have a "play lab" like it at work.
 
Last edited:

ericsmith881

Dabbler
Joined
Mar 26, 2021
Messages
29
I know synthetic benchmarks don't mean that much but figured they're better than nothing. Here's the 12x 8TB SAS array, set up as a DAS on a PERC H830, RAID6. Disk is formatted NTFS with 2M block size:
[attachment: Screenshot_1.png]

Here's the 11x 8TB SATA (RAID-Z) array running on the TrueNAS. Disk formatted NTFS with 4k block size:
[attachment: Screenshot_3.png]

Here's a disk on same array but formatted NTFS 2M block size:
[attachment: Screenshot_4.png]
 

ericsmith881

Dabbler
Joined
Mar 26, 2021
Messages
29
Well, you can definitely say performance degrades noticeably as you fill up the array:
[attachment: 1618360115185.png]


This is copying data from my 12x 8TB SAS array (DAS on a PERC H830) to my 11x 8TB SATA array (TrueNAS, RAID-Z, 87% provisioned). It started off being limited to what the SAS array could push, but after about 12 hours the performance of the TrueNAS fell off to where it became the bottleneck.

This is the second test I performed. The first was at 80% provisioned space with the same hardware. TrueNAS performance also fell off over time, but it took a little bit longer and wasn't quite as steep. Performance towards the end is still decent for a SATA array but less than half what it was giving when the array was relatively unpopulated.

This bears out what was said earlier: if you want the best performance out of your array, don't fill it up. It does, however, feel very strange to intentionally leave a ton of space unused...ON PURPOSE.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
This bears out what was said earlier: if you want the best performance out of your array, don't fill it up. It does, however, feel very strange to intentionally leave a ton of space unused...ON PURPOSE.

Thanks for posting that. I say this stuff quite frequently and people are often skeptical. The situation only gets worse with time and erase/rewrite cycles. Just a reality.

There is an upside in that getting larger hard drives is much less of an incremental expense than buying more drives or more drive bays. This is why I like to encourage people to find the point at which the cost-per-TB of the created pool, including the server itself, is lowest. This usually means that you end up buying drives that are not themselves the cheapest-per-TB ...
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
My only concern there is what if I need to take down my TrueNAS array for extended maintenance? That would make my media library unavailable for the duration. However, if it's in a VMDK on a datastore, I can vMotion it to one of my other arrays (I have three total, two DAS plus this TrueNAS, all of which are about 80TB) and not have any actual downtime. That's the flexibility I'm after. How would your proposal accommodate this use case?

EDIT: Full disclosure, I have two VMware hosts, one R620 and one R630. The R620 has a PowerVault MD1000 with SATA drives. The R630 has a PowerVault MD1200 with SAS drives. Both are set up as DAS using PERC's right now, both about 80TB in size. The TrueNAS R710 is my attempt to get some of the vMotion flexibility of shared storage which I've lacked in my home setup for years. My background is enterprise IT so I've been spoiled for a long time having big, fast iSCSI and FC arrays to play with.

EDIT EDIT: Yes, I know it's insane to have 240TB of storage at home. I love to tinker, what can I say? ;) Plus it's nice to be able to experiment with stuff like this as I don't have a "play lab" like it at work.

If you're in a scenario where the media library being unavailable would be a challenge or a hardship, and you've got the luxury of many TBs of available space (nice setup, by the way) - then by all means run it as a VMDK, but there's some potential gotchas to balance out the ease of migration. In case of a planned downtime you're still looking at having to svMotion that VMDK to a new piece of storage; it's going to take a while, and I don't know that you're really gaining anything there from being able to "right-click, Migrate, choose datastore" versus just doing a "zfs send" or even copying the files themselves via rsync/scp to a separate SMB share. Then just repoint the Plex server or mount the new path under the same drive letter, and magically it's leveraging your other chunk of storage.

You've also seen the synthetic benchmark impact of this where performance degrades as you fill up the space already though. As @jgreco mentioned, it may take longer to degrade (in terms of "time") if the modifications happen at the larger file-level erasure and changes vs. the smaller block-level that happens with VMFS. Thus the suggestion of keeping the media files on file-level datasets, and cutting smaller ZVOLs "as needed" for live VMs.

Another note is that storing the files on VMDK+VMFS introduces another potential failure point via async writes. I noticed you don't have an SLOG device, and VMware likes to constantly be writing little heartbeats to datastores. If they aren't perfectly in order after a crash or sudden stop, it might decide "I don't feel like mounting that datastore" or "that VMDK is corrupt" and then all your files are inaccessible. It's much easier to recover things on a file-by-file basis rather than have to fight through force-mounting/resignature VMFS/etc.

One item of note for VMware users: I noted a substantial performance increase when enabling Storage I/O control for this iSCSI target. I'd be curious if anyone else has noted the same?

Under heavy workload contention yes, SIOC is good to avoid the "noisy neighbor" problem of VMs. Question, have you also set up MPIO and round-robin path usage for your iSCSI setup?

I don't think there's a specifically "bad answer" - it will all work, just as a tinkerer and guy who works with this stuff all day, I'm trying to help you find the "best answer" as well.

Cheers!
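For reference, switching an iSCSI device to round-robin from the ESXi shell looks roughly like this (the naa identifier below is a made-up placeholder; substitute your device's actual ID):

```shell
# Set the path selection policy to round-robin for one device.
esxcli storage nmp device set \
  --device naa.600000000000000000000000deadbeef --psp VMW_PSP_RR

# Optionally rotate paths every I/O instead of the default 1000 IOPS.
esxcli storage nmp psp roundrobin deviceconfig set \
  --device naa.600000000000000000000000deadbeef --type iops --iops 1
```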
 

ericsmith881

Dabbler
Joined
Mar 26, 2021
Messages
29
If you're in a scenario where the media library being unavailable would be a challenge or a hardship, and you've got the luxury of many TBs of available space (nice setup, by the way) - then by all means run it as a VMDK, but there's some potential gotchas to balance out the ease of migration. In case of a planned downtime you're still looking at having to svMotion that VMDK to a new piece of storage; it's going to take a while, and I don't know that you're really gaining anything there from being able to "right-click, Migrate, choose datastore" versus just doing a "zfs send" or even copying the files themselves via rsync/scp to a separate SMB share. Then just repoint the Plex server or mount the new path under the same drive letter, and magically it's leveraging your other chunk of storage.

You've also seen the synthetic benchmark impact of this where performance degrades as you fill up the space already though. As @jgreco mentioned, it may take longer to degrade (in terms of "time") if the modifications happen at the larger file-level erasure and changes vs. the smaller block-level that happens with VMFS. Thus the suggestion of keeping the media files on file-level datasets, and cutting smaller ZVOLs "as needed" for live VMs.

Another note is that storing the files on VMDK+VMFS introduces another potential failure point via async writes. I noticed you don't have an SLOG device, and VMware likes to constantly be writing little heartbeats to datastores. If they aren't perfectly in order after a crash or sudden stop, it might decide "I don't feel like mounting that datastore" or "that VMDK is corrupt" and then all your files are inaccessible. It's much easier to recover things on a file-by-file basis rather than have to fight through force-mounting/resignature VMFS/etc.



Under heavy workload contention yes, SIOC is good to avoid the "noisy neighbor" problem of VMs. Question, have you also set up MPIO and round-robin path usage for your iSCSI setup?

I don't think there's a specifically "bad answer" - it will all work, just as a tinkerer and guy who works with this stuff all day, I'm trying to help you find the "best answer" as well.

Cheers!

I've been doing a ton of testing over the past week or so. Points and observations:
  • Overall, SMB performance is almost twice as fast as iSCSI performance, which is disappointing but in line with earlier comments. The problem is (apparently) exacerbated by thin-provisioned VMDKs. I'm guessing this is due to smaller block writes by iSCSI along with the VMDK growth creating fragmentation for ZFS? Is there any tuning that can be done to improve iSCSI performance in this scenario? To put it in perspective, SMB gives me 500MB/s-600MB/s sustained for over 40TB of writes on a 68TB RAIDZ pool, whereas iSCSI on that same pool (with a 55TB zvol) peaks at around 250MB/s-300MB/s and degrades to barely above 100MB/s after writing only 20TB. It's that bad.
  • In addition to SMB being faster than iSCSI overall, iSCSI performance drops off much faster and much worse than SMB. Again, I assume this is due to fragmentation? This is observed at 80% utilization and (obviously) gets much worse much faster if I try 90% and 95% utilization.
  • I haven't tested NFS yet. Any ideas on whether it's worth exploring? I soured on NFS datastores years ago but would be willing to revisit if it's worthwhile.
  • I considered round-robin for the iSCSI multipathing but decided against it. I'm nowhere near saturating my 10GbE NICs (I have four total, two on the TrueNAS and two on the VMware host I'm testing). Unless there's some performance benefit to using round-robin in a situation where you're not saturating any single NIC, I don't think it's worth the trouble for this setup.
  • Same goes for Jumbo Frames. I have CPU to spare on the host and TrueNAS. The switch is more than capable of handling the pps I'm pushing through it. The added headache of JF isn't worth the 1%-3% performance difference if you ask me.
  • I had hoped to use this as a big, moderately-fast datastore, but the data is leaning towards this being a bad idea. I then considered using it as an iSCSI backup target (I use Veeam), and while the performance falloff over time is not as dramatic, overall throughput is still notably inferior to SMB. Using it as an SMB share and doing batch robocopies of my data to it gives excellent performance but basically rules out using it as a datastore unless I section it up...and if I do, each piece ends up too small to be useful as both a NAS and a datastore.
So, it looks like SMB is the solution for now unless anyone has any better ideas?
[attachment: 1619279324859.png]


Breakdown of the above: the first test was a thick-provisioned VMDK on an iSCSI datastore with default block sizes all around. The second test was SMB with atime left on by mistake and the default record size. The third test was SMB with the default record size and atime disabled. The fourth test was a thin-provisioned VMDK on an iSCSI datastore, default block sizes. The last test I just started this morning: SMB with a 1MB record size and atime disabled, and I'm getting 600MB/s on average thus far. Notice how the SMB tests exhibit basically ZERO degradation over time?
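For anyone wanting to reproduce that last configuration, the two dataset settings involved look like this (the dataset name is a placeholder):

```shell
# Large records suit big sequential media files; atime=off skips an
# access-time metadata update on every read. "tank/media" is a
# placeholder dataset name.
zfs set recordsize=1M tank/media
zfs set atime=off tank/media
```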
 
Last edited:

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Well, iSCSI isn't expected to work at all well on RAIDZ, at least in my opinion, so this starts off from a bad place and simply confirms that the starting point was a bad place, which we knew. This is discussed at a basic level in the post about RAIDZ vs. mirrors linked above by @Jailer ...
 