How to avoid disk overload?

2twisty

Contributor
Joined
Mar 18, 2020
Messages
145
Had an uncommon event tonight. I had a metric crapton of torrents all download and complete in the space of about an hour. Now my write speed to the hard disks has tanked below 700 KB/s while the array tries to deal with the insane amount of data I threw at it all at once.

This is an abnormal situation, since I rarely have so many torrents complete all at once.

Here's the "setup."

Pool is 9 × 6 TB HDDs in RaidZ2. (I know, mirrors are better for VMs, but normally my VMs are pretty quiet IO-wise, so I opted for more space with Z2.)

A VM on a separate host runs the torrent client. Storage for the VM is via an NFS mount on the server. The VM downloads to its "local" (but really remote) disk. When a torrent completes, it moves the completed file from the "local" storage to an SMB share on the same server that hosts the VM storage.

VM host is connected to NAS by 10G network. All writes to NFS and SMB are sync=always since I don't want to risk data loss from async.

So, data comes in from the internet and is sent to the NAS inside the VHD file. When a torrent completes, the client reads from the VHD and writes via SMB to the same pool, then deletes the data from the VHD.

Eventually, everything came to nearly a screeching halt.

Aside from the classic Billy Crystal Joke ("Patient: Doctor! It hurts when I do this with my arm. Doctor: Then don't do that with your arm!") answer, is there anything I can do to cause the network to throttle sooner so that the disks have a chance to catch up before my write speed completely plummets?

Granted, this is a rare occurrence, but I'd like to understand how this storm happened and what can be done to mitigate it.
 

2twisty

Contributor
Joined
Mar 18, 2020
Messages
145
My temporary mitigation strategy was to pause the VM running the torrent client so the disks could catch up a bit. Since I can't throttle below 1 Gb/s on my switch, this was the only way I could see to slow the data ingress down. Obviously, this will work to restore function for now, but it would be good to have a better way to throttle activity before the disks go bonkers.
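One software-side option I might try next time (assuming the qBittorrent VM is Linux and eth0 is its NIC toward the NAS; untested sketch, and the numbers are arbitrary) is a token-bucket shaper on the VM itself, so I can throttle below line rate without touching the switch:

Code:
# cap egress from the VM to roughly 200 Mbit/s while the pool catches up
tc qdisc add dev eth0 root tbf rate 200mbit burst 256kb latency 400ms
# remove the cap once things settle down
tc qdisc del dev eth0 root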
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
Eventually, everything came to nearly a screeching halt.
What exactly happened?

Are your VMs hosted in an NFS mount on the TrueNAS on the same pool?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Pool is 9 × 6 TB HDDs in RaidZ2. (I know, mirrors are better for VMs, but normally my VMs are pretty quiet IO-wise, so I opted for more space with Z2.)

And you went way above the recommended 30-40% utilization for the pool, caused a bunch of fragmentation, and now it hurts bad? Stuff like torrents is effectively database/block storage due to its I/O patterns, so you should not use RAIDZ for this.

There is no "you get to opt for more space with Z2" choice; compsci is a cruel thing, and once you screw up your pool, it becomes very hard to un-screw it up.

1) Evacuate all the data from the pool to someplace safe.

2) Pull apart your pool and "opt" for mirrors.

3) Restore all the data. Probably put the torrent data in a size-limited dataset.

4) Religiously maintain utilization less than maybe 40% on your new mirror pool.

Once you spam up a RAIDZ2 with a bunch of VM block storage, you cause a hell of a lot of fragmentation that is going to impact the pool until such time as you remove all the data and then bring it back. Your other option is to replace the drives with much larger drives and then DON'T USE THE NEW SPACE.
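For reference, step 2 looks roughly like this once every byte is verified safe elsewhere (pool and device names are placeholders; adjust to your own layout):

Code:
# destroy the old RAIDZ2 pool -- only after the data is confirmed safe elsewhere
zpool destroy tank
# four 2-way mirror vdevs striped together, ninth disk kept as a hot spare
zpool create tank mirror da0 da1 mirror da2 da3 mirror da4 da5 mirror da6 da7 spare da8
# size-limited dataset for the torrent scratch data (step 3); the quota is just an example
zfs create -o quota=500G tank/torrents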
 

2twisty

Contributor
Joined
Mar 18, 2020
Messages
145
Well, thankfully, this condition is rare....and this is a homelab, so I am able to tolerate things that would be absolutely unacceptable in a production environment.

I plan to eventually separate my data into 2 pools (mirror and RaidZ2) but sadly, funding is an issue these days and I have to try to make do.

With the drives in mirror, I don't have enough space to house all my data, let alone keep to <40% utilization.

ZFS is wonderful, but she sure is a detailed taskmaster, lol.

I was hoping to see if there were any mitigation strategies (that don't involve buying a lot more storage right now) to help reduce the chance of this recurring.

My total block storage for VMs (including backups via Xen Orchestra) is < 1 TB. I don't see my block storage needs growing heavily in the foreseeable future, so I could probably get by with an array size of about 4 TB.

What is the wisdom on mirrors? What's the best geometry? Are more spindles more better? Pure mirrors or striped mirrors? If I move that data to flash vs rust, does that change the geometry recommendation at all?

I do have a complete backup of all my data, replicated to another ZFS box every hour, so my data is safe, and when it comes time for a rebuild, it won't be quite so bad since I have the copy already.
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
Without jumping into the next steps and future ideas, you never answered my questions above. I'd be happy to give you some advice, but I genuinely don't know what the issue you are having is. While, certainly, we can chalk up a general lack of performance to RAIDZ2, there may be a path forward for you to work around the issue.
 

2twisty

Contributor
Joined
Mar 18, 2020
Messages
145
My apologies. I actually mentioned that they were on the same pool in my original post, but it's kinda buried in the weeds toward the end.

The issue (which is very rare in my setup, but annoying AF when it happens) is when I had a ton of torrents all complete at once and qBittorrent was trying to move them from the "incomplete" directory (NFS-mounted block storage on RaidZ2 pool) to the "completed" directory (SMB-mounted share attached to qBittorrent host, also on the same RaidZ2 pool).

This results in a double traversal of the data over the network (from the pool to the VM over NFS block storage, then back to the SMB share), but it's a 10G network, so that's not really the bottleneck here.

This also results in simultaneous reading of the block storage and writing to the SMB share, which taxes the hell out of the drives, fetching the data and then immediately writing it back to the same disks.

Normally, this isn't an issue because it might be trying to do that with 1-2 files at a time, and the disk performance is still acceptable.

What happened was I had about 20 multi-GB files all trying to do that move operation simultaneously, and there's no way to tell qBittorrent to take it easy. This resulted in 20 or so simultaneous random I/O streams, and my disks slowed down to less than 1 MB/s each. And even though that was spread across 9 drives (2 are "parity," so really 7, I guess), getting 7-ish MB/s of total write performance meant it was going to take HOURS to complete the write operation. Meanwhile, my VMs are like "HEEELLOOOOOO DISK?"

I was able to bring the situation under control by killing (and then restoring about a minute later) the network connection to the qBittorrent host, which stopped the move operation and made qB decide it wanted to check the integrity of the downloads. This allowed me to manually do the check and then move operations sequentially. It took me a while to do all that, but it was certainly better than waiting for the pool to recover while my VMs timed out or were uselessly slow.

Like I said initially, this is a rare event that so many large torrents would complete at once, but I'm not always going to be around to mitigate it, nor might it be possible to do manually like what worked in this case.

I do plan to split my block storage to a different pool eventually, but that's not in this HomeLab's budget right now. So if there's a way to tune some behavior to reduce the chance / impact of this kind of thing in the future, I'm all ears.

I am still waiting for a response from @jgreco about the "best" geometry for a block storage pool, and whether there is a difference in recommendation based on whether that pool will be rust or flash.

Thanks, guys, for all your help. Learning the subtleties of ZFS is actually fun for me, and while I can appreciate the "just do it the official way" advice, it doesn't help me learn / understand the underlying workings of ZFS. Which is, of course, one of the primary raisons d'etre of my homelab.
 

Whattteva

Wizard
Joined
Mar 5, 2013
Messages
1,824
I am still waiting for a response from @jgreco about the "best" geometry for a block storage pool, and whether there is a difference in recommendation based on whether that pool will be rust or flash.
Regardless of rust or flash, if you're going for IOPS, the best is always mirrors.

Also, for SSDs, don't go for cheap QLC SSDs. They can be even worse than spinning rust once they run out of cache. Aim for MLC or TLC at least.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I am still waiting for a response from @jgreco about the "best" geometry for a block storage pool, and whether there is a difference in recommendation based on whether that pool will be rust or flash.

You don't really get much in the way of choices for geometry. You have two basic top-level choices: no redundancy (a one-wide "mirror," a.k.a. just a disk) or redundancy via N-wide mirrors. You can then have one or multiple vdevs of those, in which case ZFS interleaves access. We normally advise against the non-redundant options. You can pick something that fits your expected IOPS profile. Mirrors are much better at IOPS than RAIDZ, and of course SSD is much better at IOPS than HDD. But you can structure your workload in a variety of ways. If you have a bunch of ARC and L2ARC, for example, and a read-optimized workload, you can end up serving ALL reads from ARC+L2ARC and only doing writes to your pool.
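(For reference, hanging an L2ARC device off an existing pool is a one-liner; the pool and device names here are just placeholders:)

Code:
# add a cache (L2ARC) device to an existing pool
zpool add tank cache nvd0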
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
9 × 6 TB HDDs in RaidZ2.

...

All writes to NFS and SMB are sync=always since I don't want to risk data loss from async.

This is a pretty strong "anti-pattern" - you've got HDDs in RAIDZ, and I haven't seen mention of an SLOG.

While I understand the concerns about async writes, since your workflow of writing to the SMB share is a "copy" from your VM, I'd recommend returning to sync=standard on that dataset. I'm not sure what the conversion is from metric to regular craptons, but even a handful of competing sequential workloads turns into "effectively fully random" pretty fast, and HDDs are bound by those pesky laws of physics to be particularly poor at handling them.
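If you do switch it back, it's just a dataset property; something along these lines, where the dataset name is a placeholder for whatever backs your SMB share:

Code:
# see what the share is currently set to
zfs get sync tank/media
# honor each request's own sync semantics instead of forcing every write to be synchronous
zfs set sync=standard tank/media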
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
As others have said, RAIDZ + HDD + sync=always is bad for performance even under normal conditions, and more so now that your free space seems to be scarce and fragmented.

On my (much smaller) system I run the torrent jail (CORE system) on a single SSD (I know, but hey, I have 4 drives in total including the boot drive), and once a download completes the torrent client automatically moves the files to the HDD storage pool. I do this to avoid fragmentation; another way to reduce it would be to always download torrents sequentially (though if you download more than one torrent at a time you will still get fragmentation). This is easy to do in jails thanks to mount points; I'm not aware of a similar approach with VMs, but maybe a network share would work.

What you can do now to address the (probable, since I haven't seen any numbers yet) fragmentation is delete files until the pool has more free space than the largest single file in it (with a good margin so you don't kill performance), and then run this rebalancing script: it basically defragments your pool by copying and rewriting every file (given your situation it might take a long time).

Obviously, stop the VM and any other activity on the pool while executing this operation.
EDIT: make sure to read the GitHub page before running it, since there are a few things to be aware of (i.e. snapshot behaviour). Regarding disk geometry, the following resource might interest you.

EDIT2: Hmm, it appears I have misinterpreted the situation a bit; it's more about IOPS than used space.
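If you want to put numbers on both, they're quick to check from the shell (pool name is a placeholder):

Code:
# free-space fragmentation and capacity
zpool list -o name,size,allocated,free,fragmentation,capacity tank
# per-vdev IOPS and bandwidth, refreshing every 5 seconds
zpool iostat -v tank 5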
 
Last edited:

NickF

Guru
Joined
Jun 12, 2014
Messages
763
The issue (which is very rare in my setup, but annoying AF when it happens) is when I had a ton of torrents all complete at once and qBittorrent was trying to move them from the "incomplete" directory (NFS-mounted block storage on RaidZ2 pool) to the "completed" directory (SMB-mounted share attached to qBittorrent host, also on the same RaidZ2 pool).

This results in a double traversal of the data over the network (from the pool to the VM over NFS block storage, then back to the SMB share), but it's a 10G network, so that's not really the bottleneck here.

This also results in simultaneous reading of the block storage and writing to the SMB share, which taxes the hell out of the drives, fetching the data and then immediately writing it back to the same disks.

Normally, this isn't an issue because it might be trying to do that with 1-2 files at a time, and the disk performance is still acceptable.
I am glad that you understand the data flow problem here. While I get that you are using 10G, there's still going to be a latency penalty when you queue up a lot of I/O. Just because you aren't bandwidth-limited doesn't mean latency isn't a problem. Couple that with the I/O problem with RAIDZ2 and I think that sums up most of the issue described.

I do plan to split my block storage to a different pool eventually, but that's not in this HomeLab's budget right now. So If there's a way to tune some behavior to reduce the chance / impact of this kind of thing in the future, I'm all ears.
The easy answer: host your VM on local storage on an SSD and only let the torrent traffic traverse the network connection. Then, if nothing else, your OS will be stable and downloads will finish at whatever rate the pool can keep up with. The torrent box itself can't be more than 100 GiB. I'm sure I have a 120 GB SATA SSD kicking around I can ship to you if you want it. If you insist on running the VM over the network on the TN, make a pool of 1 SSD for now.

Also worth noting: it's probably better to use iSCSI vs NFSv3, if for no other reason than NFSv3 uses UDP by default. Letting the TCP stack do the heavy lifting of making sure your VM's data gets where it needs to go is a step in the right direction. Beware of CTL_DATAMOVE errors in /var/log/messages, though, if you are going to keep the VM on the RaidZ2. The system will barf all over itself and spam the logs complaining about long wait times for I/O to be written... and excessive re-transmissions will make the problem worse.
 
Last edited:

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Also worth noting. It's probably better to use ISCSI vs NFS, if for no other reason than NFS uses UDP by default.

This is doubly incorrect.

NFSv4 requires TCP (see RFC7530 IIRC) and the default protocol has been TCP since at least NFSv3 (see mount_nfs(8)):

Code:
             tcp     Use TCP transport.  This is the default option, as it
                     provides for increased reliability on both LAN and WAN
                     configurations compared to UDP.  Some old NFS servers do
                     not support this method; UDP mounts may be required for
                     interoperability.


You would also never want to use iSCSI instead of NFS because iSCSI sucks in many contexts where TCP NFS works fine, such as high latency/high loss networks, where you would get constant iSCSI timeouts and reconnects. NFS will happily do either UDP or TCP, but TCP is the default.
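If you want to take the guesswork out entirely, you can also pin the version and transport in the mount options; a rough example for a Linux client (server path and mountpoint are placeholders, and the option spelling differs slightly on FreeBSD):

Code:
# explicitly request NFSv4.1 over TCP
mount -t nfs -o vers=4.1,proto=tcp nas.example.lan:/mnt/tank/vmstore /mnt/vmstore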
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
I apologize - NFSv3 :P Sorry for overly generalizing...

I'll edit my post.

FWIW NFSv4 is not the default option in TN
[screenshot: TrueNAS NFS service settings]


So, if nothing else OP has two options here, NFSv4 or iSCSI :)


Also FWIW, the additional error handling of the SCSI protocol is heavier than NFS for sure, either v3 or v4. But for good reason from a data integrity standpoint. As an example: https://en.wikipedia.org/wiki/SCSI_check_condition
 
Last edited:

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
FWIW NFSv4 is not the default option in TN

Not clear on what your point is. NFSv3 defaults to TCP as well. NFSv4 merely forces it. NFSv2 (RFC 1094) is the only NFS that is UDP-only for transport.
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
Not clear on what your point is. NFSv3 defaults to TCP as well. NFSv4 merely forces it. NFSv2 (RFC 1094) is the only NFS that is UDP-only for transport.
Code:
nickf@anna:/etc/netplan$ sudo mount -t nfs 10.69.40.8:/mnt/optane_vm/test /var/test
Created symlink /run/systemd/system/remote-fs.target.wants/rpc-statd.service → /lib/systemd/system/rpc-statd.service.




Well I'll be damned. You are correct. You can ignore port 514, the server in question is actually my syslog server.

Code:
root@prod[~]# netstat -aln | grep 10.69.60.13
tcp        0      0 10.69.40.8:2049         10.69.60.13:952         ESTABLISHED
udp        0      0 10.69.40.8:514          10.69.60.13:5116        ESTABLISHED


But the point remains: if OP wants to not have to worry about his VMs crashing, he'd likely have a better go with iSCSI, given the suboptimal design of his pool. That is, if he doesn't host the VM on local storage, which I still think is the best solution to the problem given the constraints he has outlined.
 
Last edited:

2twisty

Contributor
Joined
Mar 18, 2020
Messages
145
FWIW, I am using NFSv4. I wanted to use NFS so that the block data is stored in VHD files that can be easily manipulated if need be.

One of the main points of my TN setup is ZFS and replication to my second TN box, so storing on a local SSD on the VM host is not an option.

Also, the amount of data that was downloaded and caused this mess was about 250 GB all at once, lol. So a 120 wouldn't cut it. I had to make the virtual disk 300 GB because I would often run out of space on the VHD when several partially completed downloads were consuming all that space.

Adding a single SSD to the TN box sounds like an acceptable idea, except that there is no fault tolerance, and I would be concerned that a single (SATA) SSD would not be able to keep up with the IOPS. Is a single SATA SSD going to have superior IOPS to the 9-disk array?

Maybe since I would not be hammering it so hard with simultaneous read and write.

I do have a 250 GB SATA SSD that I could toss in the TN box -- I think I have an open (and connected) 2.5" sled in my system.

After I complete this IT project that is coming up (whenever they get off their asses and order the FSCKing equipment), I am likely going to spring for a used SuperMicro server with 12 bays. Since I am using "consumer grade" multi-disk enclosures and SFF-8087 breakout cables, I suspect some of the other (not directly related to this issue) problems I've been having are because of that. This will give me a few more drive bays that I can use for some SSD storage.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Is a single SATA SSD going to have superior IOPS to the 9-disk array?

In a pure disk scenario, with a decent SSD, absolutely yes. IOPS numbers for a spinning disk are optimistically in the low hundreds (100-200) whereas even a "slow" SSD should be delivering a few thousand of them.
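If you want to sanity-check that on your own hardware, a quick 4K random-write run with fio against a scratch dataset on each pool makes a reasonable comparison (path and sizes are placeholders, and it writes real data, so don't point it at anything you care about):

Code:
fio --name=randwrite --directory=/mnt/tank/fio-test --rw=randwrite \
    --bs=4k --size=2g --runtime=60 --time_based --iodepth=16 \
    --ioengine=posixaio --group_reporting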
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
I had a ton of torrents all complete at once and qBittorrent was trying to move them from the "incomplete" directory (NFS-mounted block storage on RaidZ2 pool) to the "completed" directory (SMB-mounted share attached to qBittorrent host, also on the same RaidZ2 pool).
This is the main culprit IMO.
ZFS and BitTorrent's space allocation mix particularly poorly, and it is compounded by many streams, Z2, a full pool, etc.

I'd suggest you add an SSD for "incomplete" downloads. Set the number of concurrent downloads to something that will fit on the SSD. Then move finished downloads to long-term storage.

The problem is that once you overfill or badly fragment your pool, there is no simple fix like just deleting data to get back to where you were. You'll have to make space and then rewrite the data again. Thus there is an incentive to feed files into long-term storage thoughtfully.
 