SOLVED: System shuts down within 10min

JewJitsu11B

Cadet
Joined
May 9, 2022
Messages
9
***Solution***: I disabled Precision Boost Overdrive, Core Performance Boost (in both locations listed below), Global C-State Control, PSS Support, and D.O.C.P.

Specific paths are as follows:

For Core Performance Boost, Global C-State Control:
- Advanced/AMD CBS/CPU Common Options

For Precision Boost Overdrive:
- Advanced/AMD overclocking (click accept)/Precision Boost Overdrive

For PSS Support:
- Advanced/CPU Configuration

D.O.C.P. (it’s under Ai Overclock Tuner), Performance Enhancer, and Core Performance Boost (seemingly not the same Core Performance Boost as listed earlier) are all on the main Extreme Tweaker page. Select default or disabled as applicable.

***End of Solution section***

***Beginning of Problem/Issue Section***
***Executive Summary***

After upgrading to TrueNAS Core 13 (the most stable version available last month) and setting up SMART scans and scrubs, my system began shutting down frequently enough to render it useless. It cannot make it through a 1TB backup. It just crashed again at 10min uptime. It also keeps refreshing (stating it’s connecting to the server or whatever).
I apologize for the lack of meaningful troubleshooting on my part. I’m rated total disability and in grad school while working 25hrs/wk, and I have next to zero knowledge of this OS and its command line protocols/commands. I’m trying my best to figure this out and I really need some help. (I’m reasonably tech literate, but my knowledge is all Windows-based and mostly non-command-line.)

*****Detailed explanations and information****
Hardware:
Dark Hero VII mobo
3800x cpu with wraith cooler
512gb SSD for boot drive (forget make)
64GB Ripjaws V DDR4 3600 CL16 (speed and timing may be off)
2x8TB striped Seagate IronWolf NAS HDD
1x Noctua 200mm fan
Cooler Master HAF EVO case (the “test bench” box case).
Random old GPU (I forget the specifics)
Intel x540 T2 RJ45
Corsair 750 gold (hx I think. I bought it new when I built the system in May)
(Was stable for a month after last hardware change.)

Temps are well within safe operational limits.

Software:
TrueNAS Core 13.0-U2 (was stable with TrueNAS Core 11 or 12; I forget which one I used and would have to look).

Connection to PC:
Cat6E cable and SMB share (I think SMB is how I can access my MNT/User vdev folder. Or maybe SMB was how I tried to install a RealTek NIC driver, regardless it was stable for months after that).

Protocols/routines/etc:
S.M.A.R.T. Tool scan (short 5x/wk, long 1x/wk)
HDD Scrubbing (weekly).
None of the aforementioned scans are done on the same day.
File compression is

Snapshot: I think I did the snapshot properly prior to upgrading. I definitely attempted it. I probably should’ve tested it.

Workload: 2x SSD backups from Win11 on a monthly full, biweekly diff, and 2x/daily incremental backup schedule.

Behavior:
System boots fine and everything seemingly runs normally. Every day or several times/day it shuts off. It cannot make it through a backup write (which normally takes around an hour for a 1TB drive) without shutting down.
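For a sense of scale, the figures above (a 1TB backup in roughly an hour, a crash at 10 minutes of uptime) can be turned into rough numbers. This is only a back-of-the-envelope sketch: it assumes 1TB means 10^12 bytes, a constant write rate, and a backup that starts right at boot, none of which is stated in the post. The function names are made up for illustration.

```python
def average_rate_mb_per_s(size_bytes, duration_s):
    """Average throughput in decimal megabytes per second."""
    return size_bytes / duration_s / 1e6


def bytes_written_before_crash(size_bytes, full_duration_s, uptime_s):
    """Bytes written if the backup ran at a constant rate for the whole uptime."""
    fraction = min(uptime_s / full_duration_s, 1.0)
    return size_bytes * fraction


BACKUP_SIZE = 1e12       # assumption: "1TB" taken as 10**12 bytes
FULL_DURATION = 60 * 60  # "around an hour", in seconds
CRASH_UPTIME = 10 * 60   # crash at 10 minutes of uptime, in seconds

rate = average_rate_mb_per_s(BACKUP_SIZE, FULL_DURATION)
done = bytes_written_before_crash(BACKUP_SIZE, FULL_DURATION, CRASH_UPTIME)

print(f"average rate: {rate:.0f} MB/s")
print(f"written before crash: {done / 1e9:.0f} GB of {BACKUP_SIZE / 1e9:.0f} GB")
```

Under those assumptions the backup averages about 278 MB/s, and at most about 167 GB of the 1TB would be written before a shutdown at the 10 minute mark.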

I don’t know how to access the crash logs (my command line skills are near nonexistent, especially in TrueNAS).
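For anyone in the same position, here is a minimal sketch of one way to look for clues in a system log without much command-line work. It rests on assumptions: TrueNAS Core is FreeBSD-based, where general system messages are normally written to /var/log/messages, but the keyword list below is only a guess at what might be interesting, and the function names are hypothetical. A sudden power-off may also leave nothing in the log at all, so an empty result does not rule out a hardware problem.

```python
from pathlib import Path

# Assumed keywords; adjust to whatever your system actually logs.
KEYWORDS = ("panic", "mca", "shutdown", "reboot", "fatal")


def find_suspect_lines(lines, keywords=KEYWORDS):
    """Return (line_number, text) pairs whose text contains any keyword."""
    hits = []
    for number, text in enumerate(lines, start=1):
        lowered = text.lower()  # case-insensitive match
        if any(word in lowered for word in keywords):
            hits.append((number, text.rstrip("\n")))
    return hits


def scan_log(path="/var/log/messages"):
    """Read a log file and return suspect lines; the default path is an assumption."""
    content = Path(path).read_text(errors="replace")
    return find_suspect_lines(content.splitlines())
```

Calling scan_log() from a Python prompt on the NAS and printing each returned pair gives a short list of lines to paste into a forum post, which is usually easier for others to interpret than a description from memory.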

I do not even know where to begin and, as I’m currently in grad school with a part time job and significant physical disabilities that limit how long I can sit at a desk, I simply don’t have the ability to try troubleshooting an OS I know next to nothing about. I’m sorry to have to ask for help from the ground up, but that’s my reality. Any help would be greatly appreciated.

Worst case, is there a way to reinstall the OS without having to redo everything, like recreating the connection from scratch? (I downloaded what I think is a config file prior to upgrading.)
 

Attachments

  • C4CB29BA-8280-4737-B05C-F6FB0DACF132.jpeg
  • CA6B93D0-49FF-48EB-9080-DB21906EC111.jpeg
  • 9BFA1317-22DC-4EC4-9731-75CD2F3436F5.jpeg
  • 5340EB1A-FD06-43E6-B18B-E3FD745B4A93.jpeg

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Try disabling any overclocking in your BIOS. Also, as this is a Ryzen board, try setting the known BIOS stability fixes for Ryzen:
  • Disable Cool-n-Quiet
  • Disable C6 states
 

JewJitsu11B

Cadet
Joined
May 9, 2022
Messages
9
Any other specs or data that might help, just ask. I’m currently monitoring uptime with Macrium shut down (no backup writes or other activity. Just idling). Currently exceeded the 10min that I got while my system was trying to write an incremental backup.
 

JewJitsu11B

Cadet
Joined
May 9, 2022
Messages
9
Thanks fellow Sam! I’ll check that out.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
In particular, make sure your RAM is running only at factory timings, to minimize the chances of data errors, as your RAM isn't ECC. Likewise, use stock voltages for everything. A very common error when building a TrueNAS system is assuming it's like any other PC build. It's not; you're building a server, and should use server-grade components. If you can't use server-grade components, then you should configure your system for maximum stability.
 

JewJitsu11B

Cadet
Joined
May 9, 2022
Messages
9
That makes sense. The mobo and cpu were free. So… ::shrug::. But would the hardware be the cause of the new instability, as it hasn’t changed? Though I suppose new software settings could push the hardware differently and result in new instability. (I’m not doubting your advice, just working through the logic openly so people can correct me if I’m wrong. Trying to make sure I understand the “why” behind the “what” of your suggestions for future troubleshooting.)

Thank you again. I’ll try this and then try to run backups and see what happens.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
13 has a newer kernel than 12 or 11, so yes, it would beat on hardware differently. What would've been stable in 12 could absolutely not be in 13. The only thing you can do to compensate is to configure the hardware as conservatively as possible.
 

JewJitsu11B

Cadet
Joined
May 9, 2022
Messages
9
Not sure if any of this is relevant (see pic). But yeah, running conservatively makes sense. Not like I need 4.7Ghz and 3600 MT/s to just write data to a hard drive. Lol
 

Attachments

  • 94DBBD83-895A-4893-88BB-2FC6EBE79DA3.jpeg

JewJitsu11B

Cadet
Joined
May 9, 2022
Messages
9
@Samuel Tai Thank you so much for your help! You’re a godsend!

It seems stable. I dunno if pasting a 100GB file while concurrently running a backup of my boot drive is a proper stress test or not, but it used all cores, nearly maxed out RAM usage for ZFS cache, and bottlenecked on the striped HDDs’ max write speed of ≈490MiB/s, with a peak network receive speed of 4.8Gb/s during the 100GB file transfer. So I’ll call it a win!
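The two speeds quoted above use different units (MiB/s for the disks, Gb/s for the network), so a quick conversion helps when comparing them. The sketch below is just arithmetic under stated assumptions: 1 MiB is 1024^2 bytes, 1 Gb is 10^9 bits, and "100GB" is taken as 10^11 bytes. The helper names are made up for illustration.

```python
MIB = 1024 ** 2  # bytes per mebibyte


def mib_per_s_to_gbit_per_s(mib_per_s):
    """Convert MiB/s to decimal gigabits per second."""
    return mib_per_s * MIB * 8 / 1e9


def transfer_seconds(size_bytes, rate_mib_per_s):
    """Time to move size_bytes at a steady rate given in MiB/s."""
    return size_bytes / (rate_mib_per_s * MIB)


DISK_RATE_MIB = 490    # reported striped write speed
NETWORK_PEAK_GBIT = 4.8  # reported peak network receive speed
FILE_SIZE = 100e9      # assumption: "100GB" taken as 10**11 bytes

disk_gbit = mib_per_s_to_gbit_per_s(DISK_RATE_MIB)
seconds = transfer_seconds(FILE_SIZE, DISK_RATE_MIB)

print(f"disk write rate: {disk_gbit:.2f} Gb/s (network peak {NETWORK_PEAK_GBIT} Gb/s)")
print(f"100GB at the disk rate: about {seconds / 60:.1f} minutes")
```

490MiB/s works out to roughly 4.11Gb/s, which sits below the 4.8Gb/s network peak and fits the observation that the disks, not the link, were the limit. At that rate the 100GB file would take a little over three minutes.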

Next month I will be converting to RAIDZ1 and hopefully mirroring my boot drive, so here’s to hoping I won’t be back here w/questions lol.

Thank you again!
 

mervincm

Contributor
Joined
Mar 21, 2014
Messages
157
Do you have a fan on that NIC? Older dual-port 10G NICs are known to run hot, especially with Cat6 cable, and are designed to run in a case with significant airflow. It could be overheating.
 

JewJitsu11B

Cadet
Joined
May 9, 2022
Messages
9
The temps were fine. It seems that disabling the precision boost overdrive, PSS support, and DOCP remedied the problem. If I get bored I might re-enable them one by one to see what the offending functionality or combination of functionalities was.
That being said, I didn’t know the old NICs run hot. That could be useful info in the future, so thank you.
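The idea of re-enabling settings one by one can be written down as a simple test plan. This is a sketch only: the setting names come from the solution in the OP, while the function names and the notion of a "trial" (re-enable, run a backup, watch for a shutdown) are assumptions for illustration. Nothing here touches the BIOS; it just lists what to try.

```python
from itertools import combinations

# Settings disabled in the OP's solution.
SETTINGS = [
    "Precision Boost Overdrive",
    "Core Performance Boost",
    "Global C-State Control",
    "PSS Support",
    "D.O.C.P.",
]


def one_at_a_time_plan(settings):
    """Return (re_enabled, still_disabled) pairs, one setting re-enabled per trial."""
    plan = []
    for candidate in settings:
        still_disabled = [s for s in settings if s != candidate]
        plan.append((candidate, still_disabled))
    return plan


def pair_plan(settings):
    """Return every pair of settings, for when no single setting reproduces the crash."""
    return list(combinations(settings, 2))


for trial, (enabled, disabled) in enumerate(one_at_a_time_plan(SETTINGS), start=1):
    print(f"trial {trial}: re-enable {enabled}; keep disabled: {', '.join(disabled)}")
```

Five settings give five single-setting trials. If none of them brings the shutdowns back, the cause may be a combination, and pair_plan(SETTINGS) lists the ten pairs to try next.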
 

JewJitsu11B

Cadet
Joined
May 9, 2022
Messages
9
For what it’s worth, the Cool-n-Quiet and C6 state settings Samuel Tai suggested earlier are no longer applicable under those names, at least with the ASUS UEFI. The following are the settings I disabled to resolve my issue (I added them to the OP). Feel free to LMK if I erroneously/unnecessarily disabled anything so I can edit my OP accordingly.

***Solution***: I disabled Precision Boost Overdrive, Core Performance Boost (in both locations listed below), Global C-State Control, PSS Support, and D.O.C.P.

Specific paths are as follows:

For Core Performance Boost, Global C-State Control:
- Advanced/AMD CBS/CPU Common Options

For Precision Boost Overdrive:
- Advanced/AMD overclocking (click accept)/Precision Boost Overdrive

For PSS Support:
- Advanced/CPU Configuration

D.O.C.P. (it’s under Ai Overclock Tuner), Performance Enhancer, and Core Performance Boost (seemingly not the same Core Performance Boost as listed earlier) are all on the main Extreme Tweaker page. Select default or disabled as applicable.

***End of Solution section***
 

mervincm

Contributor
Joined
Mar 21, 2014
Messages
157
In general, 10 gig NICs consume more power than 1 gig NICs, dual port more than single port, older cards more than newer (Intel-based at least), and, most significantly, RJ45 Cat6 copper connections more than SFP+ based options. Lastly, put a card designed for server airflow in a desktop case designed for low noise ... it's not at all uncommon for them to overheat. I place a fan over mine despite the fact that I use SFP+ links.

PBO can be very tough to tune all the faults out of; it took me months off and on before I gave up on it for my 5900x :)
 

JewJitsu11B

Cadet
Joined
May 9, 2022
Messages
9
Makes me wonder if I should turn it off on my 5950x and X570.
 