very slow performance while running comparison test

ricka777

Cadet
Joined
Oct 6, 2020
Messages
8
Hey,

I have a TrueNAS RC0 machine with 32 GB of ECC RAM. It is generally working great. I noticed something odd and I am wondering if it is a tuning issue.
The machine had two pools, an old one and a new one I migrated data to. After the migration, mostly for kicks and a little bit of testing, I used WinMerge to compare the contents of each pool. The test consists of checking the file names and dates, and randomly examining some of the data ("quick contents" comparison) for each file.

The test seemed to be going as fast as anyone could expect, running at a speed near 1 Gb/s. What I found was that if I opened any other file while the test was running, the response time was horrible. Clicking a video (even a short one) might take almost a minute to open when it used to be instantaneous.

I believe the issue is related to caching, as it took a while for things to go back to normal after I paused the test. Also, if I clicked around to different points in a video for a minute or two, that video would eventually become responsive even with the test running.

I know more memory is better, but I really expected 32 GB to be plenty. I also have an M.2 drive that I am not certain is being utilized at all for any kind of cache. Is there something I can adjust to help?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Can you post your system specifications and pool layout? (More information is better than less.)

Generally speaking, though, the workload you're running is effectively random reads (the "compare the contents" part specifically), and if you're asking a bunch of mechanical HDDs to do this, it can rapidly choke out the available I/O capability of spinning disks. Your system and pool configuration will affect how badly it chokes as well (mirrors will do better than RAIDZ, and an SSD used as a metadata cache may help the name/date check, but not the data comparison).
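To get a feel for why mirrors handle this better, here's a rough back-of-envelope sketch. The ~200 IOPS per disk figure and the "one wide RAIDZ vdev behaves like one disk" simplification are my own rough assumptions, not measurements of your system:

```python
# Back-of-envelope random-read IOPS estimate.
# DISK_IOPS is an assumed ballpark for a 7200 rpm HDD, not a measured value.
DISK_IOPS = 200

def pool_random_read_iops(n_disks, layout):
    """Estimate small-random-read IOPS for a pool of n_disks.

    'raidz'  : a single wide RAIDZ vdev delivers roughly the random-read
               IOPS of one disk, regardless of width.
    'mirror' : every disk in a mirror can serve reads independently, so
               the pool scales roughly with the total disk count.
    """
    if layout == "raidz":
        return DISK_IOPS
    if layout == "mirror":
        return n_disks * DISK_IOPS
    raise ValueError(f"unknown layout: {layout}")

print(pool_random_read_iops(12, "raidz"))   # one 12-wide RAIDZ vdev
print(pool_random_read_iops(12, "mirror"))  # 12 disks in mirrors
```

Even with these crude numbers, the gap between the two layouts under random reads is an order of magnitude, which is why the layout choice matters so much for this kind of workload.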
 

ricka777

Cadet
Joined
Oct 6, 2020
Messages
8
I'll try to provide what you want to know; if I fall short, please assume I just don't understand rather than that I'm trying to avoid the question.

I have 12x 8TB IronWolf drives in a single pool, ZFS RAID with three-drive fault tolerance. 8 are connected to the LSI device, the non-RAID version that everyone recommends. The other 4 are connected to the motherboard SATA ports. It is an ASUS X99-WS/IPMI workstation board with a Xeon processor.

I had a second pool of 4x 4TB IronWolf drives, which I was migrating from, attached to a second LSI card (same model). This pool is being removed along with the card.

I guess I am surprised that the actions of one user (or one piece of software) could so severely incapacitate responsiveness. So I am wondering what I can do to avoid any similar situation in the future. I did nothing with the M.2 SSD other than point the installer at it, so if anything needs to be done there, I haven't taken those steps yet.

Thanks.

PS. LSI SAS 9207-8i
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I have 12x 8TB IronWolf drives in a single pool, ZFS RAID with three-drive fault tolerance. 8 are connected to the LSI device, the non-RAID version that everyone recommends. The other 4 are connected to the motherboard SATA ports. It is an ASUS X99-WS/IPMI workstation board with a Xeon processor.

I had a second pool of 4x 4TB IronWolf drives, which I was migrating from, attached to a second LSI card (same model). This pool is being removed along with the card.

From the description, your new pool is a RAIDZ3, and RAIDZ by design does not handle random I/O well compared to mirror vdevs. IronWolf drives are also guaranteed to be non-SMR, so you won't face that problem, but this is still a random-read workload you're putting on the pool.

Since the M.2 drive was used for installation, it can't be used for L2ARC or other "caching" purposes.

The workload you're running does sound like it will be extremely IOPS-throttled. Run the command gstat -p and check whether your disks are showing particularly high %busy values, then report back with a sample of the reads/sec figures.
 

ricka777

Cadet
Joined
Oct 6, 2020
Messages
8
This is interesting, thank you. It sounds like it might benefit me to rebuild my system at some point, using a different SSD for the installation and the fast one as an L2ARC cache.

I was verifying that my data had moved between the two pools, and I have since removed the second pool, so the only way I could attempt to re-create this would be to duplicate the data on the remaining pool and run the test again. I will make a note of your suggestion, and should it come up again, I can use that command to see if anything interesting shows up.

I did, for the heck of it, point it at the same data for source and destination, and I am seeing 50-60% busy, occasionally higher. Reads per second vary from 285 to 400. One drive does seem to show consistently lower busy numbers, around 6-8% (instead of 50-60%). Not sure if that is helpful.

(For my own reference later, the drive with the low %busy is da0.)

Okay, I lied: 11 of the 12 drives are IronWolf. I forgot that one of them (da0) is an ATA HGST HDN728080AL. This is the one with the low %busy.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
A low %busy would indicate the drive is faster (or at least spending less time with outstanding commands) than the others.

You're likely at the limit of what your drives can physically deliver: if each of those 285-400 reads required a head movement, that would mean you're getting 2.5-3.5 ms average seek time, which for a 7200 rpm spinning disk is phenomenal.
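The arithmetic there is just the reciprocal of the read rate; as a quick sanity check (a trivial sketch, nothing system-specific):

```python
# If every read costs one head movement, the average service time
# per read is simply 1 / reads_per_sec.
for reads_per_sec in (285, 400):
    service_ms = 1000 / reads_per_sec
    print(f"{reads_per_sec} reads/s -> {service_ms:.1f} ms average per read")
# 285 reads/s works out to about 3.5 ms, 400 reads/s to 2.5 ms.
```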

The problem is "how to stop one workload from stomping all over another" - this "noisy neighbor" problem isn't something new to or specific to ZFS/FreeNAS, it's just an ugly reality of shared resources.
 

ricka777

Cadet
Joined
Oct 6, 2020
Messages
8
I guess running WinMerge against a large amount of data with the "quick contents" compare method is just a bad idea if you want to use your NAS with other applications at the same time. Fortunately, I hope I won't need to verify a migration that large any time soon.

Thank you for your help.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I guess running WinMerge against a large amount of data with the "quick contents" compare method is just a bad idea if you want to use your NAS with other applications at the same time. Fortunately, I hope I won't need to verify a migration that large any time soon.

Thank you for your help.

I suspect the "Quick Contents" compare, because it's actually performing a bunch of random reads of your data. You're inadvertently putting your pool through a grueling "random I/O" performance test. ;)

You might actually come out ahead if you used something like Get-FileHash in PowerShell and hashed first one pool and then the other, or one directory at a time.
 