SOLVED TrueNAS 12.0-U8.1 system freezes at seemingly random intervals (requires reboot)

Joined
Aug 7, 2018
Messages
7
I have a small TrueNAS box I've been using for a few years as a backup, and ever since I first set it up it will randomly freeze, with the time between failures ranging from a few hours to upwards of a month. I've tried viewing the console directly on the machine using a monitor and keyboard plugged into it, but the console itself is frozen as well, which implies to me the entire machine is freezing, not just the network interface.

The last time this occurred was last night. I had the system running a large cloud sync task when I went to bed. It is unclear whether it finished before freezing. I discovered it frozen at around 7:00 today, and the system was rebooted at 07:34:09. The last thing I see in the alert history is a scrub of the pool "freenas-boot" that completed at 3:47:28, and /var/log/messages doesn't seem to have anything relevant (last few lines before the reboot):

Code:
Aug  9 19:37:34 nas 1 2022-08-09T19:37:34.004020-04:00 nas.local ntpd 1173 - - ----------------------------------------------------
Aug  9 19:37:34 nas 1 2022-08-09T19:37:34.004023-04:00 nas.local ntpd 1173 - - ntp-4 is maintained by Network Time Foundation,
Aug  9 19:37:34 nas 1 2022-08-09T19:37:34.004026-04:00 nas.local ntpd 1173 - - Inc. (NTF), a non-profit 501(c)(3) public-benefit
Aug  9 19:37:34 nas 1 2022-08-09T19:37:34.004032-04:00 nas.local ntpd 1173 - - corporation.  Support and training for ntp-4 are
Aug  9 19:37:34 nas 1 2022-08-09T19:37:34.004038-04:00 nas.local ntpd 1173 - - available at https://www.nwtime.org/support
Aug  9 19:37:34 nas 1 2022-08-09T19:37:34.004042-04:00 nas.local ntpd 1173 - - ----------------------------------------------------
Aug  9 19:37:34 nas Security policy loaded: MAC/ntpd (mac_ntpd)
Aug  9 19:37:34 nas 1 2022-08-09T19:37:34.753083-04:00 nas.local daemon 1248 - - _secure_path: /nonexistent/.login_conf is not owned by uid 65534
Aug  9 19:47:58 nas syslog-ng[1659]: syslog-ng starting up; version='3.29.1'
Aug  9 19:47:58 nas kernel: pid 982 (syslog-ng), jid 0, uid 0: exited on signal 6 (core dumped)
Aug 10 00:00:00 nas syslog-ng[1659]: Configuration reload request received, reloading configuration;
Aug 10 00:00:00 nas syslog-ng[1659]: Configuration reload finished;
Aug 11 00:00:00 nas syslog-ng[1659]: Configuration reload request received, reloading configuration;
Aug 11 00:00:00 nas syslog-ng[1659]: Configuration reload finished;
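
With a log this quiet, one thing that may help is looking for the largest gaps between consecutive timestamps, since a hard freeze usually shows up as a long silence followed by boot messages. Below is a minimal Python sketch of that idea. It assumes the classic 15-character syslog prefix seen above, and it assumes the year is supplied by hand because the prefix does not carry one. The helper name `largest_gaps` and the `SAMPLE` list are illustrative only; they are not part of TrueNAS.

```python
from datetime import datetime

# A few lines copied from the excerpt above, used here only as sample input.
SAMPLE = [
    "Aug  9 19:37:34 nas Security policy loaded: MAC/ntpd (mac_ntpd)",
    "Aug  9 19:47:58 nas syslog-ng[1659]: syslog-ng starting up; version='3.29.1'",
    "Aug 10 00:00:00 nas syslog-ng[1659]: Configuration reload finished;",
    "Aug 11 00:00:00 nas syslog-ng[1659]: Configuration reload finished;",
]


def largest_gaps(lines, year=2022, top=3):
    """Return up to `top` (gap, line_before, line_after) tuples, largest gap first.

    Lines whose first 15 characters are not a syslog timestamp are skipped.
    The year is an assumption passed in by the caller.
    """
    stamps = []
    for line in lines:
        try:
            # Classic syslog prefix, e.g. "Aug  9 19:37:34".
            ts = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # continuation lines or fragments without a timestamp
        stamps.append((ts, line.rstrip()))

    # Pair each timestamped line with the next one and measure the silence.
    gaps = [(b[0] - a[0], a[1], b[1]) for a, b in zip(stamps, stamps[1:])]
    gaps.sort(key=lambda g: g[0], reverse=True)
    return gaps[:top]


for gap, before, after in largest_gaps(SAMPLE):
    print(gap, "|", before[:15], "->", after[:15])
```

To run it against the real file, pass `open("/var/log/messages", errors="replace")` instead of `SAMPLE`. A gap that lines up with the time the box was found frozen narrows down when it stopped, even if the log says nothing about why.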


I bought an Intel network card about a year ago when I suspected the issue was the motherboard's network interface failing, but that didn't solve the problem. I am still using that network card (see specs below).

System specs:
OS Version: TrueNAS-12.0-U8.1
Motherboard: ASRock H110M-ITX
Power Supply: EVGA 500B 500W
CPU: Intel Pentium G4400
Memory: x1 Crucial CT8G4FGS8213.M8FH 8GB DDR4-2133
Network card: Intel EXPI9301CTBLK Gigabit Desktop Adapter
OS Drive: Sandisk SDCZ43 16GB USB 3.0
Storage Drives: x2 WD Red WD30EFRX 3TB

Please let me know what would be useful to help debug this. I am relatively comfortable working with Linux, but I have limited experience with FreeBSD systems. I'm almost at the point where I'm going to try replacing the motherboard and CPU, but I would like to make one last attempt to fix it in software if possible.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
You have only 8 GB RAM. The minimum recommended is 16 GB. This could just be a simple case of memory starvation. Your board is also a desktop board, not a server board, and doesn't support running your RAM in ECC mode.

Also, make sure your coolers and power supply aren't choked with dust. These lockups could also be due to overheating.
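
If you want numbers rather than a visual inspection, per-core temperatures can be read from `sysctl dev.cpu` on FreeBSD-based TrueNAS. Here is a rough Python sketch that pulls them out of that output. It assumes the `coretemp` driver is loaded, so that lines of the form `dev.cpu.N.temperature: 41.0C` are present. The function name `core_temperatures` and the sample text are made up for illustration.

```python
import re

# Hypothetical sample of `sysctl dev.cpu` output, for illustration only.
SAMPLE_SYSCTL = """\
dev.cpu.0.temperature: 41.0C
dev.cpu.0.freq: 3300
dev.cpu.1.temperature: 43.0C
"""

# Matches lines like "dev.cpu.0.temperature: 41.0C".
_TEMP_RE = re.compile(r"^dev\.cpu\.(\d+)\.temperature:\s*([\d.]+)C", re.MULTILINE)


def core_temperatures(sysctl_output):
    """Map core index -> temperature in degrees Celsius."""
    return {int(core): float(temp) for core, temp in _TEMP_RE.findall(sysctl_output)}


print(core_temperatures(SAMPLE_SYSCTL))
```

Logging this every minute or so during a heavy task would show whether temperatures climb before a lockup.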
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Look at Reporting->Memory, and look at the swap utilization. If you see any swap utilization, your system is experiencing significant memory pressure, and is running out of RAM to run everything.
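
The same check can be done from the shell with `swapinfo -k`. Below is a small Python sketch that totals the "Used" column from that output, so it can be logged periodically while a heavy task runs. The device names and figures in the sample are invented, and the function names are only illustrative.

```python
import subprocess

# Invented sample in the column layout that `swapinfo -k` prints.
SAMPLE_SWAPINFO = """\
Device          1K-blocks     Used    Avail Capacity
/dev/ada0p1       2097152        0  2097152     0%
/dev/ada1p1       2097152    10240  2086912     0%
Total             4194304    10240  4184064     0%
"""


def swap_used_kib(swapinfo_output):
    """Sum the 'Used' column (KiB) over all swap devices.

    The header row and the 'Total' row are skipped so usage is not
    counted twice.
    """
    used = 0
    for line in swapinfo_output.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[0] in ("Device", "Total"):
            continue
        used += int(fields[2])
    return used


def read_swapinfo():
    """Run `swapinfo -k` on the NAS and return its text output."""
    result = subprocess.run(
        ["swapinfo", "-k"], capture_output=True, text=True, check=True
    )
    return result.stdout


print(swap_used_kib(SAMPLE_SWAPINFO))
```

On the NAS itself, `swap_used_kib(read_swapinfo())` returns the live figure. Anything consistently above zero points the same way as a non-empty swap graph in the UI.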
 
Joined
Aug 7, 2018
Messages
7
I wasn't aware 16 GB was the minimum recommended for TrueNAS. I'll go ahead and increase the RAM to 16 GB then.

I also wasn't aware this motherboard didn't support ECC; it was my understanding that was gated only by the RAM itself. I might consider upgrading the motherboard then, even if increasing the RAM fixes the issue.

I will note that this issue has been happening since the system was brand new, so I doubt cooling is the issue. I will, nonetheless, clean the thing out when I replace the RAM; it is no doubt quite dusty by now.

I'll re-run the cloud sync task and see what the swap utilization is while that's running.
 

homer27081990

Patron
Joined
Aug 9, 2022
Messages
321
This mobo (like most low-end ASRock mobos) has notoriously bad power delivery and problems with memory in XMP mode. Samuel Tai is probably right, but in case it is neither RAM nor thermals, you should also consider:
  1. Checking for PSU degradation (if it is not behind a UPS, anything goes).
  2. Falling back from XMP to stock RAM speeds in your BIOS; it could be that a specific chip does not play well when accessed.
  3. Putting a GPU in and disabling integrated graphics, because low-end integrated desktop graphics chips are notoriously bad these days (or disabling graphics altogether).
  4. Getting a UPS if you do not have one, just for the power stabilization. A line undervoltage on a 500 W non-redundant, desktop-quality PSU could easily under-volt your system momentarily, causing crashes or freezes but not a restart.
 
Joined
Aug 7, 2018
Messages
7
Hmm. If I were to upgrade the motherboard what would you recommend?
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
A very common mistake when building a TrueNAS system is to assume the build process is just like any other PC build. It's not. You're building a server. Use server-grade components.

For example, instead of using the consumer-grade ASRock boards, try the server-grade ASRock Rack boards. Click the blue button in my signature for an example. I've had this running solid since 2015, with occasional replacements as power supplies or backplanes wear out.
 

homer27081990

Patron
Joined
Aug 9, 2022
Messages
321
The best option is to listen to Samuel Tai. If your financial situation is tight (as mine is) and you are (as you have stated) using this only for enthusiast reasons, you could try finding a refurbished server. I got an HP DL380P Gen 8.1 (dual Xeons, 12 cores total, 16 GB of RDIMM RAM, dual PSUs) for under 200€ (I live in Greece). You might have a few learning adventures configuring it, but otherwise it should be fine.
 
Joined
Aug 7, 2018
Messages
7
Alright, I'm going to go ahead and replace the motherboard, RAM (the current RAM doesn't support ECC), and processor. It isn't as cheap or simple a solution as I would like, but if the current hardware is of suspect reliability, that's going to limit its usefulness as a backup server.

New hardware:
Motherboard: ASRock Rack E3C246D2I
Processor: Intel Xeon E-2124
Memory: x2 Crucial CT8G4WFS824A 8GB DDR4-2400 w/ ECC

If that looks good, I'll get back to you in ~2 weeks as to whether or not that solved the problem.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Looks good, but you should also consider replacing your power supply with a Seasonic.
 

homer27081990

Patron
Joined
Aug 9, 2022
Messages
321
For your use case, you would spend a lot less on refurbished server-grade hardware, and such hardware would certainly be suitable for TrueNAS. The prices may surprise you: a lot of companies upgraded recently and a whole lot of others went under, so supply far exceeds demand. Replacing a mobo and CPU of this caliber in a desktop could cost around $180-200. For that money (or $20-30 more) you could have server-grade hardware with redundant PSUs (meaning two of them), four or more Gbit Ethernet ports, remote lights-out management (you can even start the server remotely when it is off), advanced ECC RAM, two multi-core CPUs of Ivy Bridge generation or later, integrated drive bays with hot-swap capability, the possibility of adding an external enclosure with a lot more bays, and more. You can also buy a refurbished workstation that comes in tower format, doesn't make all that noise, and does not need a rack (although my server didn't fit in the rack and is currently on top of a coffee table :) ).
 

homer27081990

Patron
Joined
Aug 9, 2022
Messages
321
If you can swing it, though, go for it!
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Probably late, but you might want to look at this guide.
 
Joined
Aug 7, 2018
Messages
7
I've been sufficiently enlightened as to the need for a better PSU. I figure I might as well throw in a small M.2 drive for the boot pool, so here's the new upgrade list:

Motherboard: ASRock Rack E3C246D2I
Processor: Intel Xeon E-2124
Memory: x2 Crucial CT8G4WFS824A 8GB DDR4-2400 w/ ECC
Power Supply: Seasonic Prime Fanless PX-500
OS Drive: WD SDAPMUW-256G
 

homer27081990

Patron
Joined
Aug 9, 2022
Messages
321
[Attached image: 1660246123523.png]

Don't forget this (^) for the mobo
 
Joined
Aug 7, 2018
Messages
7
I'm going ahead and marking this "solved".

All the parts finally came in, and I rebuilt the NAS. It's been running for about 14 days now without any issues. While this isn't necessarily proof that the issue has been fixed, I'm feeling relatively confident that it is.

Some things I discovered during the process:
  1. The cloud sync task's estimated time remaining turned out to be less of an estimate than I thought it was. It took the NAS nearly a week to complete the initial sync task. Subsequent weekly syncs are only taking around an hour to complete, but it is somewhat odd that the initial one went so slowly. I'm attributing the slow upload speed to the fact that it needs to poll the server to determine whether files were modified and needs to encrypt files before uploading them.
  2. I got bait-and-switched on the country of origin of one of my two RAM sticks, and that led to a week-long delay during which I only had one RAM stick. I decided to go ahead and attempt the cloud sync anyway, and it did not encounter any issues during that process. I halted it to install the other RAM stick when it arrived, and after resuming the sync it completed as expected. This implies to me that RAM starvation was not the cause of the issue. It is still possible that the non-ECC RAM was the issue, though that seems somewhat unlikely, since I doubt I'm experiencing that many SEUs.
Unfortunately, because nearly the entire system was replaced, it isn't really possible to determine the ultimate cause of the issue. Still, it is certainly nice to have a system that is robust enough that I can trust it will still be running when I get home.
 