TrueNAS Core Lockup/Slowdown

Shockkota

Cadet
Joined
Sep 11, 2022
Messages
4
Good afternoon all,

I recently set up a new server and moved my TrueNAS pool from an ESXi VM hosted on an R720 to an XCP-ng VM on a custom tower server (see my signature). The install and setup went fine, but I cannot get the server to stay online. After a while (pointing Plex at a share makes this happen faster), all of the shares become inaccessible. On top of that, parts of the web UI break and SSH access becomes slow. When the server is up and functioning, everything seems to work fine: file transfers run at acceptable speeds, and scrubs and SMART tests all appear fine.

The first time I noticed this, I checked on the server and found the console filled with:
Oct 6 21:12:53 truenas 1 2022-10-06T21:12:53.121955-07:00 truenas.local collectd 1993 - - plugin_dispatch_values: Low water mark reached. Dropping 100% of metrics.

I have been looking through the logs to try to figure out what is going on, but have not been able to. I was hoping someone here might have better insight. I have attached the only log files I could find with events during the lockups.
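Because each collectd warning carries a timestamp, one way to see when the trouble starts (and whether it lines up with Plex activity) is to bucket those warnings by hour. This is a minimal Python sketch, assuming the attached console.txt uses the same line format as the message quoted above; the function name `count_by_hour` is hypothetical and not part of TrueNAS or collectd.

```python
import re
from collections import Counter

# Matches the ISO-8601 timestamp embedded in each console line, e.g.
# "2022-10-06T21:12:53.121955-07:00", capturing only the date and hour.
# (Assumption: every relevant line carries a timestamp in this form.)
TIMESTAMP = re.compile(r"\b(\d{4}-\d{2}-\d{2})T(\d{2}):\d{2}:\d{2}")


def count_by_hour(lines, needle="Low water mark reached"):
    """Count lines containing `needle`, bucketed by 'YYYY-MM-DD HH'.

    Hypothetical helper for illustration; lines without a matching
    timestamp are skipped rather than guessed at.
    """
    counts = Counter()
    for line in lines:
        if needle not in line:
            continue
        match = TIMESTAMP.search(line)
        if match:
            date, hour = match.groups()
            counts[f"{date} {hour}"] += 1
    return counts
```

To use it, open the log with `open("console.txt", encoding="utf-8", errors="replace")`, pass the file object to `count_by_hour`, and print the sorted items; hours where the count jumps are the ones worth comparing against the other attached logs.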

As a final note, when this happens the system also takes HOURS to shut down.

Thanks for looking.
 

Attachments

  • console.txt
    170.6 KB
  • daemon.txt
    3.7 MB
  • middlewared.txt
    89 KB
  • smbd_log.txt
    5.4 KB

Shockkota

Cadet
Joined
Sep 11, 2022
Messages
4
Bump. I have been considering trying a reinstall, but wanted to get input here on the problem first. I really miss having my NAS online.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Shockkota said:
I recently set up a new server and moved my TrueNAS pool from an ESXi VM hosted on an R720 to an XCP-ng VM on a custom tower server (see my signature). The install and setup went fine, but I cannot get the server to stay online.

Yup. Why'd you do this? I've written careful instructions on how to properly virtualize TrueNAS.

 

Shockkota

Cadet
Joined
Sep 11, 2022
Messages
4
jgreco said:
Yup. Why'd you do this? I've written careful instructions on how to properly virtualize TrueNAS.

I'm a bit confused by the question. Are you asking why I virtualised or why I switched hypervisors?

I'm working my way through that thread now and will continue to do so. I wish I had run across it before setting this up, but after not having issues on ESXi I was not expecting problems. I can say that I did pass through the LSI card, the same as I had done in the ESXi build.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Shockkota said:
I'm a bit confused by the question. Are you asking why I virtualised or why I switched hypervisors?

Or the third question: why you switched from something that is generally known to work well (ESXi) to something that no one uses (XCP-ng). It reads like one of those "I aimed my shotgun at my foot and pulled the trigger, now my foot hurts, how do I stop it" kind of issues.

The reason I so carefully set out a guide to how to properly virtualize FreeNAS is because when you don't follow the guide, your chances of success rapidly diminish. This isn't just some crappy webserver VM where you can pick whatever size, virtual disk format and controller, whatever network controller, etc., that you want and have a reasonable expectation of it working. There are actually a bunch of considerations, many of which can partially or completely derail you. The admonition to use a hypervisor suitable to the task is the very first point I make in the article because it is literally the most common issue. It really isn't a suggestion. It's the thing you need to do if you don't enjoy beating your head against your server.

It's okay to despise ESXi. It's a weird product, but it is also the clear Cadillac of hypervisors. It isn't easy to learn and it isn't always easy to use.
 

Shockkota

Cadet
Joined
Sep 11, 2022
Messages
4
I wouldn't say I despise ESXi. I've used it both at home and at work. However, after the Broadcom deal a bit ago, and with some long-term concerns about it, I moved most of my lab environment to XCP, aside from the NAS. After recently moving from a house to a studio apartment, I have been eliminating my rack-mount equipment; hence the rebuild, and the decision to bring everything over to XCP so it all runs on one machine instead of multiple servers. I did do some research before the build. TrueNAS was one of the reasons I decided not to go AMD; I just made the mistake of not factoring in potential hypervisor issues.

My biggest concern at the moment is less that I'm having issues and more that I cannot find proof/logs of the cause. It makes it hard to troubleshoot other than by just waiting to see if it stays up. If I were just losing sight of the drives, a pass-through error would make more sense, but the fact that the TrueNAS OS itself seems to slow to a crawl when this happens has me confused.

I do want to fix the problem, but I am also hoping to understand what's going on.

Worst-case scenario, I'll convert the current box into a dedicated TrueNAS machine and build another server for my VMs, assuming XCP is the underlying issue.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Shockkota said:
However, after the Broadcom deal a bit ago, and with some long-term concerns about it

Fair enough. Dell was perhaps somewhat less evil.

Shockkota said:
My biggest concern at the moment is less that I'm having issues and more that I cannot find proof/logs of the cause.

Most of the things that tend to go off the rails with virtualized TrueNAS do not leave clear tracks. Slowing down of the guest OS tends to suggest either scheduler issues or perhaps interrupt issues.

Shockkota said:
It makes it hard to troubleshoot other than by just waiting to see if it stays up.

People often wonder why I advocate for long burn-in periods using representative workloads. It basically comes down to the nature of these incomprehensibly complex modern systems: with so many cogs all working together, it is often easier to see what happens and then tinker with it as needed. From a certain point of view, this is terrible engineering, especially if you come from a microcontroller or RTOS background. But there is often little visibility into the issues buried within that complexity.
 