Specific workload resulting in very bad ZFS performance - system crawls

airflow

Contributor
Joined
May 29, 2014
Messages
111
Hello! I have been a happy user of TrueNAS for many years and am typically very satisfied with its performance and features.

Currently I'm struggling with heavy performance issues whose cause I cannot pinpoint. Perhaps some of you can share insight on how to troubleshoot this and find out what the limiting factor is - or even fix it by changing some setting.

Server specs:
CPU: Intel(R) Core(TM) i3-4130T CPU @ 2.90GHz
16GB ECC RAM
1 pool with 6 HDDs, multiple copies (focus on availability)
1 pool with 1 NVMe drive (focus on performance)
Current version: TrueNAS-13.0-U2

The workload runs in a jail; the jail itself and its working data are kept on the fast NVMe drive. It's basically a database which needs to run a "pruning" job to reduce its size. This workload is presumably very I/O intensive.

The workload is taking a lot of time - more than expected and much more than is typically seen in similar setups. While this specific job is running, the system as a whole performs very badly: other jails do not respond properly, just starting a simple shell may take many seconds, the GUI lags, and performance monitoring in the GUI is basically broken (it no longer records data, or only in bits and pieces). See screenshot:
[screenshot attachment: 1663228438331.png]


Normally I have a clear understanding of what the limiting factor of a specific workload is. Not in this case. The CPU is idling, memory utilization is low/normal, and swap is unused. Yes, the NVMe drive might be hammered with IOPS (I cannot check in the GUI because of the missing data, see above), but even if the job is I/O heavy, I don't expect my complete system to come to a crawl because of that.

What I already tried:
* Turn SYNC=OFF for the dataset containing the IOPS-heavy workload. Didn't change anything. I did it while the job was running - do I need to re-mount the dataset or reboot? Should that make a difference?
* The system dataset is on the performance drive (so on the same drive as the workload). Should I move it to another pool? Could that explain why the GUI/management is sluggish?

Thanks for any advice
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
What model is the motherboard?
 

c77dk

Patron
Joined
Nov 27, 2019
Messages
468
On the commandline you can use
Code:
gstat -f '^(da|nvd)[0-9]+$'
to see how busy your disks are
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
Turn SYNC=OFF for the dataset containing the IOPS-heavy workload. Didn't change anything. I did it while the job was running - do I need to re-mount the dataset or reboot? Should that make a difference?
The correct setting is sync=disabled... if that's what you set, it applies to all new transactions from that time forward.
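If you want to double-check from the shell, something like this should do it (the dataset path here is just an example, use your own):
Code:
# verify / set the property - replace tank/jails/db with your actual dataset
zfs get sync tank/jails/db
zfs set sync=disabled tank/jails/db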

The system dataset is on the performance drive (so on the same drive as the workload). Should I move it to another pool? Could that explain why the GUI/management is sluggish?
It's easy enough to move... depending on your boot media, maybe the boot pool is an OK alternative since you're already on a single point of failure with the NVME.
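If you want to see where the system dataset currently lives before moving it in the GUI (System > System Dataset), this should show it - I'm going from memory on the middleware call, so treat it as an assumption:
Code:
# prints the current system dataset configuration, including the pool it sits on (API name assumed)
midclt call systemdataset.config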
 

airflow

Contributor
Joined
May 29, 2014
Messages
111
The correct setting is sync=disabled... if that's what you set, it applies to all new transactions from that time forward.


It's easy enough to move... depending on your boot media, maybe the boot pool is an OK alternative since you're already on a single point of failure with the NVME.

sync=disabled - yes, that's what it's set to. Sorry for the mix-up of terms before. I did set it via the GUI. Since you say it should become active immediately for subsequent transactions, I conclude it had no effect at all.

Regarding moving the system dataset - you write that this is possible. Sure, but do you think it's a promising approach to fixing the issue? Well, I will just try it and report whether it makes any difference.

Apart from my two desperate ideas - what do you think about this problem? Any other ideas for troubleshooting?
 

airflow

Contributor
Joined
May 29, 2014
Messages
111
On the commandline you can use
Code:
gstat -f '^(da|nvd)[0-9]+$'
to see how busy your disks are
I just tried this. To my understanding, this shows that the disks are actually nearly idle:

Code:
L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    0      0      0      0    0.0      0      0    0.0    0.0| nvd0
    0      0      0      0    0.0      0      0    0.0    0.0| da0
    0      0      0      0    0.0      0      0    0.0    0.0| da1


Something else is overwhelming my system…
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
Regarding moving the system dataset - you write that this is possible. Sure, but do you think it's a promising approach to fixing the issue? Well, I will just try it and report whether it makes any difference.
easily done = easy to test and put it back if no improvement.

I think it's your best/easiest option to take the next step in resolving the issue.

If that's not it, I think we're back to looking at where the system is bottlenecking.
 

airflow

Contributor
Joined
May 29, 2014
Messages
111
easily done = easy to test and put it back if no improvement.

I think it's your best/easiest option to take the next step in resolving the issue.

If that's not it, I think we're back to looking at where the system is bottlenecking.

OK. What I don't like about this test is that, in my experience, it will discard all my system monitoring history. But I will try it nevertheless.

In the meantime I had another idea: could it be related to the fact that the process causing this load is running in a jail? So some jail-specific issue? For example, some rate-limits applied on processes within jails? Perhaps I can rule that out by moving the process out of the jail into the main OS of TrueNAS (not recommended, not supported, I know - but just out of desperation). What do you think about that?
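If some jail resource limit (rctl) were in play, I guess something like the following would show it - assuming resource accounting is even enabled on TrueNAS, and with the jail name being just a placeholder:
Code:
# list any rctl rules in effect and show resource usage for the jail (jail name is a placeholder)
# note: needs kern.racct.enable=1, otherwise rctl just reports that racct is disabled
rctl
rctl -hu jail:mydbjail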
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
So some jail-specific issue? For example, some rate-limits applied on processes within jails?
I don't see how that's causing the host to perform slowly if the jail process is being limited. There are no "limits" by default.

Perhaps I can rule that out by moving the process out of the jail into the main OS of TrueNAS (not recommended, not supported, I know - but just out of desperation). What do you think about that?
A really bad idea that will prove nothing and possibly break your install depending on how you would do that.
 
Last edited:
Joined
Jan 27, 2020
Messages
577
The system dataset is on the performance drive (so on the same drive as the workload). Should I move it to another pool? Could that explain why the GUI/management is sluggish?
Seems pretty obvious to me. Hogging the system drive with I/O-heavy operations leads to a badly responding system!?
 

mav@

iXsystems
iXsystems
Joined
Sep 29, 2011
Messages
1,428
If it affects the whole system, I would take a look at CPU and memory usage, starting with `top -SHIz` on the host.
 

airflow

Contributor
Joined
May 29, 2014
Messages
111
OK, I have now tried the following things to find the cause of the performance problems I'm facing when executing the DB pruning job.

1) Changing the location of the system dataset from the NVMe device to a) the boot pool (dual USB sticks) and b) the HDD pool consisting of 6 standard HDDs.

In both variants the situation was unchanged - the job takes forever, gets slower and slower, and brings the system to a crawl, which makes the GUI sluggish and prevents the reporting/monitoring section of TrueNAS from actually recording data. CPU and memory utilization are low, and IOPS on the NVMe device holding the workload are low enough not to explain the behaviour (see the gstat output posted earlier).

2) During my investigation I learned about the phenomenon of "read amplification", which can happen if the record size of my DB dataset is too large (default = 128K). To find out whether this could explain my situation, I deleted the dataset with the DB and recreated it with a record size of 16K. I copied the DB to this new dataset and started the pruning job.
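For reference, the recordsize part was roughly the following (the dataset name is just an example). As far as I understand, recordsize only affects newly written blocks, which is why I recreated the dataset and copied the DB over instead of simply changing the property in place:
Code:
# create a new dataset with 16K records and verify - dataset name is an example
zfs create -o recordsize=16K nvme/db16k
zfs get recordsize nvme/db16k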

Sadly, this test also didn't show any changed behaviour or improvement of the situation.

Do you have any better ideas than my own? Some method of finding out what the system is actually doing while struggling with this one job? Any helpful advice is appreciated.
 

airflow

Contributor
Joined
May 29, 2014
Messages
111
If it affects the whole system, I would take a look at CPU and memory usage, starting with `top -SHIz` on the host.
Hello mav@, thanks for your reply.

I can certainly do that. I already checked the output of "top" and "htop" multiple times during troubleshooting, but the output looked OK to me, as CPU utilization was low on all cores. What should I look for specifically in the output of "top -SHIz"? Should I post the output of the command here while I'm facing the described problems?

Would you say it could be relevant that the job is running within a jail as opposed to on the host system?
 

airflow

Contributor
Joined
May 29, 2014
Messages
111
What vendor and model of device are you using? Is there sufficient airflow for cooling?

This one:

Code:
Geom name: nvd0
Providers:
1. Name: nvd0
   Mediasize: 2000275111936 (1.8T)
   Sectorsize: 512
   Stripesize: 131072
   Stripeoffset: 0
   Mode: r1w1e3
   descr: WD_BLACK AN1500
   lunid: 0050433501000001
   ident: WUBT21330775
   rotationrate: 0
   fwsectors: 0
   fwheads: 0


Sufficient airflow for cooling: good point, the device gets rather hot. But when I check the smartctl output, the temperature is below the warning level. I couldn't find any way to tell via SMART whether thermal throttling has kicked in. Is there one? Even if there were, I wouldn't expect thermal throttling of the device to bring down the whole system.
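For the record, this is roughly how I'm checking the temperature - I'm not sure the AN1500 (or rather its Marvell controller) even exposes the thermal-throttle counters, and the device node may differ on other systems:
Code:
# NVMe SMART data: current temperature, warning/critical time counters and, if supported, thermal throttle transitions
smartctl -a /dev/nvme0 | grep -Ei 'temperature|thermal'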
 

mrpasc

Dabbler
Joined
Oct 10, 2020
Messages
42
First: this WD Black AN1500 *might* cause trouble, because it is a bit of a strange device. Basically it is a "RAID 0" of two WD SN730 drives behind a Marvell 88NR2241 RAID controller in an enclosure. It's intended for Windows gaming machines with only PCIe gen3, to reach (nearly) PCIe gen4 speeds. Even if tests with generic workloads looked good, it could be that this Marvell controller behaves "strangely" with real-world workloads.

Second: besides any hardware or configuration issues with your storage: are you sure the "prune job" itself isn't the problem? With badly designed databases and/or scripts/programs you can bring down even *much* more powerful systems (full table scans because of bad/corrupted/missing indexes, and much more…)
 

c77dk

Patron
Joined
Nov 27, 2019
Messages
468
Just to be sure, since I haven't seen it in the thread: You haven't enabled deduplication?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
WD_BLACK AN1500

This could be an issue, as @mrpasc points out - heat seems to be a frequent complaint in user reviews. The throttle point is reportedly 70°C - has your drive reached or exceeded that temperature during a heavy workload?

I would suggest opening a couple of SSH sessions remotely (not via the webGUI's "shell" option) and running both top -SHIz as suggested by @mav@ and gstat -dp to see how hard your NVMe device is being hit.

Just to be sure, since I haven't seen it in the thread: You haven't enabled deduplication?

Good catch. @airflow, can you confirm the dedup status of your volume? I don't suspect it's on, as it isn't enabled by default.
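You can confirm it quickly from a shell as well (pool name here is just an example):
Code:
# dedup property per dataset plus the pool-wide dedup ratio - replace "nvmepool" with your pool name
zfs get -r dedup nvmepool
zpool get dedupratio nvmepool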
 

airflow

Contributor
Joined
May 29, 2014
Messages
111
Guys (and girls?), I'm happy to report that I could finally finish the workload successfully on this system. Super fast (in under 4 hours, which is very good - in my previous tests I let it run for several days in some instances), with no performance impact on the rest of the system, recording of system data working, etc. - just like it should be! :smile:

And what made the difference? Actually, my desperate idea to let the task run on the host system instead of within the jail (which @sretalla judged as "will prove nothing"). In case anyone doubts my findings: I invite you to a Zoom call and I'll show you.

Regarding the other recent input in this thread: yes, the WD Black AN1500 has a somewhat strange architecture. It also has poor thermal design (I read that in some reviews of the hardware - it runs rather hot). My choice of hardware was very limited in my case, because my system has no free SATA port left and also no M.2 slot. Deduplication: I'm aware that this can have an impact, and it's absolutely not a fit for this workload, so I left it disabled. And the temperature topic: this was, to my mind, the most "promising" idea left to look into. In some previous test runs the temperature of the device was indeed slightly over 70°C; in the successful run it was slightly under 70°C. BUT: in some of the recent unsuccessful runs the temperature was also under 70°C. So based on my test results, I would rule out the thermal throttling theory.

Anyhow, thanks to everyone contributing and trying to help!
 