Specific workload resulting in very bad ZFS performance - system crawls

airflow

Contributor
Joined
May 29, 2014
Messages
111
Hello! I have been a happy user of TrueNAS for many years and am typically very satisfied with its performance and features.

Currently I'm struggling with heavy performance issues whose cause I cannot pinpoint. Perhaps some of you can share insight on how to troubleshoot this and find out what the limiting factor is - or even fix it by changing some setting.

Server specs:
CPU: Intel(R) Core(TM) i3-4130T CPU @ 2.90GHz
16GB ECC RAM
1 pool with 6 HDDs, multiple copies (focus on availability)
1 pool with 1 NVMe drive (focus on performance)
Current version: TrueNAS-13.0-U2

The workload runs in a jail; the jail itself and its working data are kept on the fast NVMe drive. It's basically a database which needs to run a "pruning" job to reduce its size. This workload is presumably very I/O intensive.

The workload is taking a lot of time - more than expected and much more than is typically seen in similar setups. While this specific job is running, the system as a whole performs very badly: other jails do not respond properly, just starting a simple shell may take many seconds, the GUI lags, and performance monitoring in the GUI is basically broken (it no longer records data, or only in bits and pieces). See screenshot:
[screenshot attachment: 1663228438331.png]


Normally I have a clear understanding of what the limiting factor of a specific workload is. Not in this case. The CPU is idling, memory utilization is low/normal, and swap is unused. Yes, the NVMe drive might be hammered with IOPS (I cannot check in the GUI because of the missing data, see above), but even if the job is I/O heavy, I don't expect my complete system to come to a crawl because of that.

What I already tried:
* Turn SYNC=OFF for the dataset containing the IOPS-heavy workload. Didn't change anything. I did it while the job was running - do I need to re-mount the dataset or reboot? Should that make a difference?
* The system dataset is on the performance drive (so on the same drive as the workload). Should I move it to another pool? Could that explain why the GUI/management is sluggish?

Thanks for any advice
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
What model is the motherboard?
 

c77dk

Patron
Joined
Nov 27, 2019
Messages
468
On the commandline you can use
Code:
gstat -f '^(da|nvd)[0-9]+$'
to see how busy your disks are
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
Turn SYNC=OFF for the dataset containing the IOPS-heavy workload. Didn't change anything. I did it while the job was running - do I need to re-mount the dataset or reboot? Should that make a difference?
The correct setting is sync=disabled... if that's what you set, it applies to all new transactions from that time forward.
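If you want to double-check from the shell, something like this should do it (the dataset path here is just an example, use your own):
Code:
# verify / set the property - replace tank/jails/db with your actual dataset
zfs get sync tank/jails/db
zfs set sync=disabled tank/jails/db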

The system dataset is on the performance drive (so on the same drive as the workload). Should I move it to another pool? Could that explain why the GUI/management is sluggish?
It's easy enough to move... depending on your boot media, maybe the boot pool is an OK alternative since you're already on a single point of failure with the NVME.
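If you want to see where the system dataset currently lives before moving it in the GUI (System > System Dataset), this should show it - I'm going from memory on the middleware call, so treat it as an assumption:
Code:
# prints the current system dataset configuration, including the pool it sits on (API name assumed)
midclt call systemdataset.config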
 

airflow

Contributor
Joined
May 29, 2014
Messages
111
The correct setting is sync=disabled... if that's what you set, it applies to all new transactions from that time forward.


It's easy enough to move... depending on your boot media, maybe the boot pool is an OK alternative since you're already on a single point of failure with the NVME.

sync=disabled - yes, that's what it's set to. Sorry for the mix-up of terms before. I did set it via the GUI. Since you say it should become active immediately for subsequent transactions, I conclude it had no effect at all.

Regarding moving the system dataset - you write that this is possible. Sure, but do you think it's a promising approach to fixing the issue? Well, I will just try it and report whether it makes any difference.

Apart from my two desperate ideas - what do you think about this problem? Any other ideas for troubleshooting?
 

airflow

Contributor
Joined
May 29, 2014
Messages
111
On the commandline you can use
Code:
gstat -f '^(da|nvd)[0-9]+$'
to see how busy your disks are
I just tried this. To my understanding, this shows that the disks are actually nearly idle:

Code:
L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    0      0      0      0    0.0      0      0    0.0    0.0| nvd0
    0      0      0      0    0.0      0      0    0.0    0.0| da0
    0      0      0      0    0.0      0      0    0.0    0.0| da1


Something else is overwhelming my system…
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
Regarding moving the system dataset - you write that this is possible. Sure, but do you think it's a promising approach to fixing the issue? Well, I will just try it and report whether it makes any difference.
easily done = easy to test and put it back if no improvement.

I think it's your best/easiest option to take the next step in resolving the issue.

If that's not it, I think we're back to looking at where the system is bottlenecking.
 

airflow

Contributor
Joined
May 29, 2014
Messages
111
easily done = easy to test and put it back if no improvement.

I think it's your best/easiest option to take the next step in resolving the issue.

If that's not it, I think we're back to looking at where the system is bottlenecking.

OK. What I don't like about this test is that, in my experience, it will discard all my system monitoring history. But I will try it nevertheless.

In the meantime I had another idea: could it be related to the fact that the process causing this load is running in a jail? So some jail-specific issue? For example, some rate-limits applied on processes within jails? Perhaps I can rule that out by moving the process out of the jail into the main OS of TrueNAS (not recommended, not supported, I know - but just out of desperation). What do you think about that?
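If some jail resource limit (rctl) were in play, I guess something like the following would show it - assuming resource accounting is even enabled on TrueNAS, and with the jail name being just a placeholder:
Code:
# list any rctl rules in effect and show resource usage for the jail (jail name is a placeholder)
# note: needs kern.racct.enable=1, otherwise rctl just reports that racct is disabled
rctl
rctl -hu jail:mydbjail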
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
So some jail-specific issue? For example, some rate-limits applied on processes within jails?
I don't see how that's causing the host to perform slowly if the jail process is being limited. There are no "limits" by default.

Perhaps I can rule that out by moving the process out of the jail into the main OS of TrueNAS (not recommended, not supported, I know - but just out of desperation). What do you think about that?
A really bad idea that will prove nothing and possibly break your install depending on how you would do that.
 
Last edited:
Joined
Jan 27, 2020
Messages
577
The system dataset is on the performance drive (so on the same drive as the workload). Should I move it to another pool? Could that explain why the GUI/management is sluggish?
Seems pretty obvious to me. Hogging the system drive with I/O-heavy operations leads to a badly responding system!?
 

mav@

iXsystems
iXsystems
Joined
Sep 29, 2011
Messages
1,428
If it affects the whole system, I would take a look at CPU and memory usage, starting with `top -SHIz` on the host.
 

airflow

Contributor
Joined
May 29, 2014
Messages
111
OK, I have now tried the following things to find the cause of the performance problems I'm facing when executing the DB pruning job.

1) Changing the location of the system dataset from the NVMe device to a) the boot pool (dual USB sticks) and b) the HDD pool consisting of 6 standard HDDs.

In both variants the situation was unchanged - the job takes forever, gets slower and slower, and brings the system to a crawl, which makes the GUI sluggish and prevents the reporting/monitoring section of TrueNAS from actually recording data. CPU and memory utilization are low, and IOPS on the NVMe device holding the workload are low enough not to explain the behaviour (see the gstat output posted earlier).

2) During my investigation I learned about the phenomenon of "read amplification", which can happen if the record size of my DB dataset is too large (default = 128K). To find out whether this could explain my situation, I deleted the dataset with the DB and recreated it with a record size of 16K. I copied the DB to this new dataset and started the pruning job.
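For reference, the recordsize part was roughly the following (the dataset name is just an example). As far as I understand, recordsize only affects newly written blocks, which is why I recreated the dataset and copied the DB over instead of simply changing the property in place:
Code:
# create a new dataset with 16K records and verify - dataset name is an example
zfs create -o recordsize=16K nvme/db16k
zfs get recordsize nvme/db16k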

Sadly, this test also didn't show any changed behaviour or improvement of the situation.

Do you have any better ideas than my own? Some method of finding out what the system is actually doing while struggling with this one job? Any helpful advice is appreciated.
 

airflow

Contributor
Joined
May 29, 2014
Messages
111
If it affects the whole system, I would take a look at CPU and memory usage, starting with `top -SHIz` on the host.
Hello mav@, thanks for your reply.

I can certainly do that. I already checked the output of "top" and "htop" multiple times during troubleshooting, but the output looked OK to me, as CPU utilization was low on all cores. What should I look for specifically in the output of "top -SHIz"? Should I post the output of the command here while I'm facing the described problems?

Would you say it could be relevant that the job is running within a jail as opposed to on the host system?
 

airflow

Contributor
Joined
May 29, 2014
Messages
111
What vendor and model of device are you using? Is there sufficient airflow for cooling?

This one:

Code:
Geom name: nvd0
Providers:
1. Name: nvd0
   Mediasize: 2000275111936 (1.8T)
   Sectorsize: 512
   Stripesize: 131072
   Stripeoffset: 0
   Mode: r1w1e3
   descr: WD_BLACK AN1500
   lunid: 0050433501000001
   ident: WUBT21330775
   rotationrate: 0
   fwsectors: 0
   fwheads: 0


Sufficient airflow for cooling: good point, the device gets rather hot. But when I check the smartctl output, the temperature is below the warning level. I couldn't find any way to tell via SMART whether thermal throttling has kicked in. Is there one? Even if there were, I wouldn't expect thermal throttling of the device to bring down the whole system.
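For the record, this is roughly how I'm checking the temperature - I'm not sure the AN1500 (or rather its Marvell controller) even exposes the thermal-throttle counters, and the device node may differ on other systems:
Code:
# NVMe SMART data: current temperature, warning/critical time counters and, if supported, thermal throttle transitions
smartctl -a /dev/nvme0 | grep -Ei 'temperature|thermal'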
 

mrpasc

Dabbler
Joined
Oct 10, 2020
Messages
42
First: this WD Black AN1500 *might* cause trouble, because it is a bit of a strange device. Basically it is a "RAID 0" of two WD SN730 drives behind a Marvell 88NR2241 RAID controller in an enclosure. It's intended for Windows gaming machines with only PCIe gen3, to reach (nearly) PCIe gen4 speeds. Even if tests with generic workloads looked good, it could be that this Marvell controller behaves "strangely" with real-world workloads.

Second: besides any hardware or configuration issues with your storage: are you sure the "prune job" itself isn't the problem? With badly designed databases and/or scripts/programs you can bring down even *much* more powerful systems (full table scans because of bad/corrupted/missing indexes, and much more…)
 

c77dk

Patron
Joined
Nov 27, 2019
Messages
468
Just to be sure, since I haven't seen it in the thread: You haven't enabled deduplication?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
WD_BLACK AN1500

This could be an issue, as @mrpasc points out - heat seems to be a frequent complaint in user reviews. The throttle point is reportedly 70°C - has your drive reached or exceeded that temperature during a heavy workload?

I would suggest opening a couple of SSH sessions remotely (not via the webGUI's "shell" option) and running both top -SHIz as suggested by @mav@ and gstat -dp to see how hard your NVMe device is being hit.

Just to be sure, since I haven't seen it in the thread: You haven't enabled deduplication?

Good catch. @airflow, can you confirm the dedup status of your volume? I don't suspect it's on, as it isn't enabled by default.
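You can confirm it quickly from a shell as well (pool name here is just an example):
Code:
# dedup property per dataset plus the pool-wide dedup ratio - replace "nvmepool" with your pool name
zfs get -r dedup nvmepool
zpool get dedupratio nvmepool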
 

airflow

Contributor
Joined
May 29, 2014
Messages
111
Guys (and girls?), I'm happy to report that I could finally finish the workload successfully on this system. Super fast (in under 4 hours, which is very good - in my previous tests I let it run for several days in some instances), with no performance impact on the rest of the system, recording of system data working, etc. - just like it should be! :smile:

And what made the difference? Actually, my desperate idea to let the task run on the host system instead of within the jail (which @sretalla judged as "will prove nothing"). In case anyone doubts my findings: I invite you to a Zoom call and I'll show you.

Regarding the other recent input in this thread: yes, the WD Black AN1500 has a somewhat strange architecture. It also has poor thermal design (I read that in some reviews of the hardware - it runs rather hot). My choice of hardware was very limited in my case, because my system has no free SATA port left and also no M.2 slot. Deduplication: I'm aware that this can have an impact, and it's absolutely not a fit for this workload, so I left it disabled. And the temperature topic: this was, to my mind, the most "promising" idea left to look into. In some previous test runs the temperature of the device was indeed slightly over 70°C; in the successful run it was slightly under 70°C. BUT: in some of the recent unsuccessful runs the temperature was also under 70°C. So based on my test results, I would rule out the thermal throttling theory.

Anyhow, thanks to everyone contributing and trying to help!
 