How to diagnose strange crashes

aseimel

Dabbler
Joined
Apr 8, 2023
Messages
16
Hello everyone,

I experience strange crashes on my system at very irregular intervals, and I am simply unable to diagnose them (see the crash screen in the attached image). I am quite new to TrueNAS but have been building Linux servers for a while.


My server is built from a repurposed workstation that has been running fine for over a year, and it has been running fine for two weeks and suddenly is producing these strange crashes.

My hardware
  • Installed bare metal
  • Asrock B560 Pro4 (the workstation used a different motherboard)
  • 96 GB ram
  • Intel i9 11900 (non-k)
  • PSU: Flexguru 300w
  • One mirrored pool for frequently used data with two WD BLACK SN850
  • One RAIDZ1 pool with one 8 TB HGST Deskstar and two 8 TB WD Ultrastar DC HC510
  • I use the onboard sata controllers
I am confident that this is not an overheating issue, as I get 60 °C at most. The system always reboots without problems and operates just as before.
I do get a warning on boot that I am not sure about: "IPVS: rr: TCP: [ ip...:port] no destination available". The containers on the respective ports work fine.

One possible issue could be that I am currently using the onboard gigabit port and really pushing its bandwidth. Could that cause this behavior? I am waiting for a Mellanox ConnectX-3 at the moment.

I cannot detect any hardware or software flaw and am a bit lost as to how to diagnose this correctly.

Thank you!
 

Attachments

  • IMG_0045.jpg

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Your power supply seems very small for the required load.

 

aseimel

Dabbler
Joined
Apr 8, 2023
Messages
16
It should be more than sufficient given that the CPU has a 65w TDP and all drives should at the absolute maximum not draw more than 100w.
 

aseimel

Dabbler
Joined
Apr 8, 2023
Messages
16
Your power supply seems very small for the required load.

OK, I have another (unfortunately very loud) 500 W PSU. I'll try it and see whether that fixes the problem.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
It should be more than sufficient given that the CPU has a 65w TDP and all drives should at the absolute maximum not draw more than 100w.

Only if you have just one drive. Since this doesn't appear to be the case, you are mistaken about your power requirements. Your mainboard, memory, fans, etc., also all take additional power that you do not seem to have accounted for. Having a system that worked fine at one time but has since developed strange crashing issues is a classic sign that the PSU may be undersized or failing. Additionally, as PSU components age, they lose their ability to provide maximum power, which is why I typically emphasize derating as described in the PSU sizing guide.

Your "FlexGURU" power supply is a label of Fortron Source/Sparkle Power (collectively, "FSP"), a notorious maker of budget PC power supplies, many of which are known to fail after their expected three-to-five year lifespan. You should definitely consider this a potentially failing supply. Consider spending a bit on a high quality PSU with long life components if you value your data.
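To make the sizing argument concrete, here is a rough power-budget sketch. Every wattage in it is an illustrative assumption, not a measurement of this system and not a figure from the PSU sizing guide: the CPU is budgeted at an assumed short-term turbo limit rather than its 65 W TDP, each 3.5" drive at an assumed spin-up peak, and the PSU label is derated by an assumed factor of 0.8. The boot SSDs are left out for simplicity.

```python
# Rough PSU budget check. All wattages are illustrative assumptions,
# not measurements of the system discussed in this thread.
ASSUMED_PEAK_WATTS = {
    "cpu_turbo": 224.0,             # assumed short-term turbo limit, well above the 65 W TDP
    "motherboard_ram_fans": 60.0,   # assumed combined draw of board, memory and fans
    "hdd_spinup_each": 30.0,        # assumed peak per 3.5" drive while spinning up
    "nvme_each": 9.0,               # assumed peak per NVMe SSD
}


def peak_load_watts(hdd_count, nvme_count, parts=ASSUMED_PEAK_WATTS):
    """Sum the assumed worst-case draw of all components."""
    return (
        parts["cpu_turbo"]
        + parts["motherboard_ram_fans"]
        + hdd_count * parts["hdd_spinup_each"]
        + nvme_count * parts["nvme_each"]
    )


def usable_watts(label_watts, derating=0.8):
    """Derate the label rating to leave headroom for aging (0.8 is an assumed factor)."""
    return label_watts * derating


def psu_is_sufficient(label_watts, load_watts, derating=0.8):
    """True if the assumed peak load fits inside the derated PSU rating."""
    return load_watts <= usable_watts(label_watts, derating)


# Three hard drives and two NVMe SSDs, as listed in the first post.
load = peak_load_watts(hdd_count=3, nvme_count=2)
for label in (300, 500, 700):
    verdict = "OK" if psu_is_sufficient(label, load) else "too small"
    print(f"{label} W PSU vs. {load:.0f} W assumed peak load: {verdict}")
```

With these assumed numbers the 300 W unit falls short even though the CPU TDP plus an average drive draw looks comfortable on paper. Swap in measured or datasheet values for a real estimate.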
 

aseimel

Dabbler
Joined
Apr 8, 2023
Messages
16
Only if you have just one drive. Since this doesn't appear to be the case, you are mistaken about your power requirements. Your mainboard, memory, fans, etc., also all take additional power that you do not seem to have accounted for. Having a system that worked fine at one time but has since developed strange crashing issues is a classic sign that the PSU may be undersized or failing. Additionally, as PSU components age, they lose their ability to provide maximum power, which is why I typically emphasize derating as described in the PSU sizing guide..

Your "FlexGURU" power supply is a label of Fortron Source/Sparkle Power (collectively, "FSP"), a notorious maker of budget PC power supplies, many of which are known to fail after their expected three-to-five year lifespan. You should definitely consider this a potentially failing supply. Consider spending a bit on a high quality PSU with long life components if you value your data.
OK, lesson learned. You guys are right, my math was off: I only CPU stress tested the 300 W PSU, and I totally underestimated the power draw from the drives. I wanted to get a new PSU anyway; this one was just supposed to be good enough for the next six months or so until I add a GPU.
 

aseimel

Dabbler
Joined
Apr 8, 2023
Messages
16
OK, I got a new 700 W PSU and it still happens. I am a bit lost as to what could be the cause of this.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Are you overclocking? TrueNAS is very intolerant of overclocking. If your RAM has XMP profiles, switch them to standard timings.
 

aseimel

Dabbler
Joined
Apr 8, 2023
Messages
16
My XMP setting was set to auto, and I have now set it to "Profile 1" (these are the only options in the BIOS). Let's hope that fixes it. I guess that is what you get if you use hardware not intended for server use.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Make sure there's also no CPU overclocking.
 

aseimel

Dabbler
Joined
Apr 8, 2023
Messages
16
Yes, I turned that off, i.e. I never had it turned on. I also turned off Turbo Boost, which I will not need. Thank you for all the feedback.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
I would recommend that you run the CPU Stress Test for several hours (even overnight) to heat saturate the motherboard and see if that passes. If that works, run a MemTest86 on the system for a good week.

You didn't list the boot drive, please do that and how it's connected. If it's a USB Flash drive, I'd replace it for the hell of it and see if that fixes the issue. Where is your system dataset located? On the pool or on the boot drive?

What SCALE version were you running when the last failure occurred? Are you on the current version?

Last question: What was the system doing when it crashed, if you know? You made a statement above that you were pushing the 1 Gbit connection hard and were curious whether that caused the failure. I doubt it; you would just saturate the connection. It shouldn't cause problems unless there was a motherboard issue.

Best of luck to you.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
@OP You mentioned 60 degrees earlier.
Are you talking disk temps or system temps? If disk temps, that's way too hot.
 

aseimel

Dabbler
Joined
Apr 8, 2023
Messages
16
OK, that didn't fix it either, but at least this time I caught the error message, and it does seem to be a kernel panic.
@OP You mentioned 60 degrees earlier
Are you talking disk temps - or system temps? If disk temps thats way too hot
No, I am only talking about CPU temps; I wouldn't actually know how to measure disk temps. The problem still persists with the new RAM settings. This time I saw the error message in time, and it is a kernel panic; I just do not seem to be able to find the cause. I have since swapped my network card for a Mellanox ConnectX-3.

I would recommend that you run the CPU Stress Test for several hours (even overnight) to heat saturate the motherboard and see if that passes. If that works, run a MemTest86 on the system for a good week.

You didn't list the boot drive, please do that and how it's connected. If it's a USB Flash drive, I'd replace it for the hell of it and see if that fixes the issue. Where is your system dataset located? On the pool or on the boot drive?

What Scale version are you running when the last failure occurred? Are you on the current version?

Last question: What was the system doing when it crashed? If you know. You made a statement above that you were pushing the 1Gbit connection hard and curious if that caused the failure. I doubt it, you would just saturate the connection, it shouldn't cause problems unless there was a motherboard issue.

Best of luck to you.
I have a pair of ZFS-mirrored 32 GB SSDs as the boot drive.
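On the question of how to measure disk temperatures: drives report them through SMART, which the `smartctl` tool from smartmontools can read with `smartctl -A <device>`. The sketch below pulls the temperature out of that text output. The device path is a placeholder, the attribute names cover only the common cases, and some drives encode the raw value differently, so treat the parser as an assumption to check against your own drives' output.

```python
import re
import subprocess

# Common ATA attribute names that carry the temperature; other drives may differ.
ATA_TEMPERATURE_ATTRIBUTES = ("Temperature_Celsius", "Airflow_Temperature_Cel")


def parse_temperature_celsius(smartctl_output):
    """Return the temperature in Celsius found in `smartctl -A` text, or None."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # ATA/SATA attribute table: ID, name, flag, value, worst, thresh,
        # type, updated, when_failed, then the raw value in the tenth column.
        if (
            len(fields) >= 10
            and fields[1] in ATA_TEMPERATURE_ATTRIBUTES
            and fields[9].isdigit()
        ):
            return int(fields[9])
        # NVMe health log prints a line such as "Temperature:   36 Celsius".
        match = re.match(r"\s*Temperature:\s+(\d+)\s+Celsius", line)
        if match:
            return int(match.group(1))
    return None


def read_temperature_celsius(device):
    """Run smartctl on a device (needs root); the device path is supplied by the caller."""
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True
    )
    return parse_temperature_celsius(result.stdout)


# Illustrative sample lines, not output from the system in this thread.
sample_ata = (
    "194 Temperature_Celsius     0x0022   118   098   000"
    "    Old_age   Always       -       32 (Min/Max 20/45)"
)
sample_nvme = "Temperature:                        36 Celsius"
print(parse_temperature_celsius(sample_ata))
print(parse_temperature_celsius(sample_nvme))
```

On a live system, a call such as `read_temperature_celsius("/dev/sda")` would query a real drive; `/dev/sda` is only an example path.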
 

aseimel

Dabbler
Joined
Apr 8, 2023
Messages
16
645894E5-29F5-458C-A274-0FF9BA6E594D.jpeg


This time around the errors are more legible. Could that mean that a container is causing the panic?
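Rather than photographing the screen, one option is to search saved kernel log text for the usual crash markers. The sketch below assumes you have such text available, for example from `journalctl -k -b -1` if the journal from the previous boot was kept. That is an assumption, and a hard panic often never reaches the disk. The marker list is a small hand-picked set, not an exhaustive one, and the sample log is made up for illustration.

```python
# Strings that commonly appear in Linux kernel crash output.
# This list is a hand-picked assumption, not a complete catalogue.
PANIC_MARKERS = (
    "Kernel panic",
    "Oops",
    "BUG:",
    "general protection fault",
    "Hardware Error",  # printed as part of machine check (mce) reports
    "Call Trace",
)


def find_suspicious_lines(log_text, markers=PANIC_MARKERS):
    """Return (line_number, line) pairs for lines containing any marker."""
    lowered_markers = [marker.lower() for marker in markers]
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        lowered_line = line.lower()
        if any(marker in lowered_line for marker in lowered_markers):
            hits.append((number, line.strip()))
    return hits


# Made-up sample log for illustration only.
sample_log = """\
kernel: usb 1-1: new high-speed USB device number 2
kernel: mce: [Hardware Error]: Machine check events logged
kernel: general protection fault, probably for non-canonical address
kernel: Call Trace:
kernel: Kernel panic - not syncing: Fatal exception
"""

for number, line in find_suspicious_lines(sample_log):
    print(f"{number}: {line}")
```

The process name shown in a panic is only the task that was running when the kernel failed, so a container process appearing there does not by itself prove the container caused it. Machine check or memory-related lines would point more toward hardware.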
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Try turning off HyperThreading. Your CPU may be too new for TrueNAS, and you'll need to make it look like an earlier CPU.
 
Joined
Jun 15, 2022
Messages
674
Check for swelling power filtering capacitors around the CPU. <--click the link
FEA9ZMYGE05CTLG[1].jpg FLMJI3DGE05CTLS[1].jpg FX7ARH1GE04Z4IV[1].jpg

Also, the install is suspect when done on a failing PSU with noisy power. Think of the results in a music recording studio if the singer were being hit by a boxer while recording, except that the inside of a computer is better compared to the noise levels of a restaurant kitchen.
 

aseimel

Dabbler
Joined
Apr 8, 2023
Messages
16
Try turning off HyperThreading. Your CPU may be too new for TrueNAS, and you'll need to make it look like an earlier CPU.
Unfortunately, I still get crashes. Would this also apply to the C-state settings? I guess I will just try with everything turned off.
 

aseimel

Dabbler
Joined
Apr 8, 2023
Messages
16
I think I finally got it. Turning XMP off actually still left the RAM voltage at 1.35 V; setting it to 1.2 V seems to have fixed it. I guess I'll have to go step by step turning things like Hyper-Threading back on.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
If your RAM voltage was in question, run MemTest86 and tweak the voltage if you need to in order to find a stable system.
and it has been running fine for two weeks and suddenly is producing these strange crashes
This is what has me concerned. It was working fine and suddenly it is not. Can you relate this to anything you have done? Also, did you do a proper burn-in of the system just to ensure everything is really working as it should? I highly recommend MemTest86 or MemTest86+; pick your poison.
 