TrueNAS Core Unscheduled System Reboots

buzzhussman (Cadet; joined Apr 20, 2023; 4 messages)

I've been trying to figure out why I'm getting these unscheduled system reboots. The reboots happen at random: not every day, and not necessarily when the system is under load. I'm using this as a backup NAS for Veeam, and the backups only run once a week, on Sunday from 1 to 3 am. The reboot has never happened in that window.

I've looked through many threads on this forum about the same thing, and the only suggestion I found was to turn off SMART on all the disks. I did that and the reboots seemed to stop for a few days, but then they started again. I checked the crash log folder and there is nothing in it, which leads me to believe the system isn't crashing. The only clue I could find is what looks like a NIC problem. In the messages log I find:
truenas kernel: bceX: link state changed to DOWN
Then a few seconds later:
truenas kernel: bceX: link state changed to UP
X is the NIC number. There are four (bce0 through bce3), but I am only using bce0.
These are QLogic NetXtreme II BCM5709 1000Base-T NICs.
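A minimal Python sketch for pulling those link-state events out of /var/log/messages and measuring how long each outage lasted, so they can be lined up against the reboot times. It assumes the standard BSD syslog prefix (`Apr 20 11:19:42 truenas kernel: ...`), which was trimmed from the lines quoted above. `link_flaps` and the sample lines are hypothetical.

```python
import re
from datetime import datetime

# Assumed format: BSD syslog prefix, then the kernel message quoted in the post.
LINK_RE = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d\d:\d\d:\d\d) \S+ kernel: "
    r"(?P<ifname>\w+): link state changed to (?P<state>UP|DOWN)"
)


def link_flaps(lines, year=2023):
    """Return (interface, went_down_at, seconds_down) for each DOWN -> UP pair."""
    down_at = {}
    flaps = []
    for line in lines:
        m = LINK_RE.match(line)
        if m is None:
            continue  # not a link-state line
        # syslog lines carry no year, so the caller supplies one
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        ifname, state = m["ifname"], m["state"]
        if state == "DOWN":
            down_at[ifname] = ts
        elif ifname in down_at:
            start = down_at.pop(ifname)
            flaps.append((ifname, start, (ts - start).total_seconds()))
    return flaps


# Hypothetical sample lines in the assumed format.
sample = [
    "Apr 20 11:19:40 truenas kernel: bce0: link state changed to DOWN",
    "Apr 20 11:19:43 truenas kernel: bce0: link state changed to UP",
]
print(link_flaps(sample))  # one bce0 outage lasting 3.0 seconds
```

On the NAS itself, pass an open file instead of the sample list, e.g. `link_flaps(open("/var/log/messages", errors="replace"))`.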

I had this chassis running ESXi for a while and had zero problems. Only when I switched to TrueNAS Core did these issues start happening.

Are there other logs I can check to get a better idea? Has anyone ever seen this before and found a workaround/fix?
 
WI_Hedgehog (joined Jun 15, 2022; 674 messages)
There are many potential causes, although a stressed PSU delivering power with excessive ripple is a common one. What's your power supply rated for, and how much power are you drawing from it?

That rig was first released in 2009. Did you check the CPU filter capacitors for swelling?

Also see:
 

jgreco (Resident Grinch; joined May 29, 2011; 18,680 messages)
buzzhussman said:
"Re-using a Dell PowerEdge 710 chassis
Intel Xeon X5570"

You should check to see if there is anything in the logs that would indicate a hardware issue at the time of the restart. Check /var/log/messages and /var/run/dmesg.boot to see if anything looks amiss. As these systems age, you run into issues such as needing to reseat components like DIMMs and CPUs. Use an electronics-grade contact cleaner such as MG Chemicals or Puretronics. The heat sink compound or thermal pads also have a limited service life and are candidates for cleaning and repasting; Arctic Silver ArctiClean and Arctic Silver 5 are relatively easy to use.
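A rough sketch of that log check in Python: it prints every line in the two files that contains one of a few keywords. The keyword list is a hypothetical starting point (the first three strings come from messages quoted in this thread), and `scan_lines` / `scan_logs` are illustrative names, not TrueNAS tools.

```python
from pathlib import Path

# Hypothetical starting list; extend it with whatever your hardware logs.
PATTERNS = ("link state changed", "FATAL", "mfi0", "error")


def scan_lines(lines, patterns=PATTERNS):
    """Yield (line_number, line) for lines containing any pattern (case-insensitive)."""
    lowered = [p.lower() for p in patterns]
    for lineno, line in enumerate(lines, 1):
        text = line.lower()
        if any(p in text for p in lowered):
            yield lineno, line.rstrip("\n")


def scan_logs(paths=("/var/log/messages", "/var/run/dmesg.boot")):
    """Print matches from each log file that exists on this machine."""
    for path in map(Path, paths):
        if not path.is_file():
            continue  # skip files that are missing on this system
        with path.open(errors="replace") as fh:
            for lineno, line in scan_lines(fh):
                print(f"{path}:{lineno}: {line}")
```

Running `scan_logs()` right after an unexpected restart keeps the output short enough to compare timestamps by eye.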

An X5570-based system is a Nehalem system, which places it at about 13 or 14 years of age. While it is very possible for a well-built server to last that long, this one is potentially reaching the end of its service life, and it should be considered suspect. In addition to the filter capacitors that @WI_Hedgehog suggested, you should check that all fans are in good condition and providing adequate cooling, and if you can put a scope on the power rails, checking for ripple was also a good suggestion. PSUs in these servers are designed for harsh environments and high temperatures, but after more than a decade they may no longer cope with the loads they handled when "young". You don't need a $500-$1000 oscilloscope for this work; power supply problems are generally evident even on a $50 USB oscilloscope.
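Once a cheap USB scope has captured a rail (most can export samples as a list or CSV of voltages), the figure to look at is peak-to-peak ripple. A minimal sketch: the sample values are made up, and the 120 mV limit is the ATX desktop guideline for +12 V, used here only as an illustrative threshold; a Dell server PSU's own specification may differ.

```python
def ripple_mvpp(samples_v):
    """Peak-to-peak ripple in millivolts from a list of voltage samples (volts)."""
    if not samples_v:
        raise ValueError("need at least one sample")
    return (max(samples_v) - min(samples_v)) * 1000.0


def check_rail(name, samples_v, limit_mvpp):
    """Return a one-line verdict comparing measured ripple with a chosen limit."""
    ripple = ripple_mvpp(samples_v)
    verdict = "OK" if ripple <= limit_mvpp else "TOO HIGH"
    return f"{name}: {ripple:.1f} mV p-p (limit {limit_mvpp} mV) {verdict}"


# Hypothetical capture from the +12 V rail.
print(check_rail("+12V", [12.02, 11.97, 12.05, 11.99], 120))
```

Capturing under load matters more than at idle, since an aging supply usually shows its ripple when it is working hardest.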

Absent any actual damage to components (bad caps, parts exploding, parts knocked off the board, burned to a crisp, etc.), I would say a lot of the gear that comes in here with problems just needs some periodic TLC.
 

buzzhussman (Cadet; joined Apr 20, 2023; 4 messages)
WI_Hedgehog said:
"There are many potential causes, although a stressed PSU outputting current with ripple is common. What's your power supply rated for and how much power are you drawing from it?

That rig was first released in 2009. Did you check the CPU filter capacitors for swelling?"
As far as I can tell, there are no issues with the hardware. If there were a hardware issue, I would see it in the DRAC, but everything is good in there. No swelling of any capacitors. I have dual 870W PSUs. The system uses about 160W at idle and about 400W under load.
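Putting those numbers side by side shows how hard each supply is actually working. A minimal sketch, assuming the two supplies split the load evenly; depending on configuration, one supply may carry most of it, so the single-PSU figure is the worst case. `psu_load` is a hypothetical helper.

```python
def psu_load(draw_w, rating_w, sharing=1):
    """Fraction of one PSU's rating in use if `sharing` supplies split the draw evenly."""
    return draw_w / sharing / rating_w


RATING_W = 870  # per supply, from the post
for label, draw in (("idle", 160), ("load", 400)):
    both = psu_load(draw, RATING_W, sharing=2)
    one = psu_load(draw, RATING_W)  # e.g. the other supply has failed
    print(f"{label}: {both:.0%} per PSU when shared, {one:.0%} on a single PSU")
```

This prints roughly 9% and 18% at idle and 23% and 46% under load. By wattage alone neither supply is near its rating; that says nothing about ripple or aging components, which is what the replies above are concerned with.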
jgreco said:
"Check /var/log/messages and /var/run/dmesg.boot to see if anything looks amiss."
Interesting. I didn't know about /var/run/dmesg.boot. I checked it and saw this:
mfi0: 46225 (735304782s/0x0008/FATAL) - Battery needs replacement - SOH Bad
If I'm not mistaken, that's the RAID card battery. That could possibly be the issue. But...I'm not gonna replace it. That's just too much work for this thing. I'm just gonna tell them the system is old and needs to be replaced. No more re-using old garbage to try and squeeze more life out of it. Thanks for the input!
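The controller event line can be decoded to see when the battery fault was logged. A minimal sketch: the field layout (sequence number, controller timestamp in seconds, locale, class, description) is inferred from this single line, and the 2000-01-01 epoch for the controller clock is an assumption about LSI/MegaRAID firmware, not something stated in the thread. `decode_mfi_event` is a hypothetical helper.

```python
import re
from datetime import datetime, timedelta

# Layout inferred from the one quoted line; treat it as an assumption.
MFI_EVT = re.compile(
    r"^(?P<dev>mfi\d+): (?P<seq>\d+) "
    r"\((?P<secs>\d+)s/0x(?P<locale>[0-9a-fA-F]+)/(?P<cls>\w+)\) - (?P<desc>.*)$"
)
# Assumption: the controller counts seconds from 2000-01-01 00:00 (controller clock, no timezone).
MEGARAID_EPOCH = datetime(2000, 1, 1)


def decode_mfi_event(line):
    """Return the event fields as a dict, or None if the line doesn't match."""
    m = MFI_EVT.match(line.strip())
    if m is None:
        return None
    return {
        "device": m["dev"],
        "seq": int(m["seq"]),
        "when": MEGARAID_EPOCH + timedelta(seconds=int(m["secs"])),
        "locale": int(m["locale"], 16),
        "class": m["cls"],
        "description": m["desc"],
    }


evt = decode_mfi_event(
    "mfi0: 46225 (735304782s/0x0008/FATAL) - Battery needs replacement - SOH Bad"
)
print(evt["when"], evt["class"], evt["description"])
# 2023-04-20 11:19:42 FATAL Battery needs replacement - SOH Bad
```

The decoded date falls on the same day the OP joined the forum, which is at least consistent with the epoch assumption; it only reflects the controller's own clock.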
 

jgreco (Resident Grinch; joined May 29, 2011; 18,680 messages)

Ahh, there's the problem.


You must remove the RAID controller; RAID controllers are not for use with ZFS. This needs to be an HBA. You will want to back up the contents of your pool before making the swap.
 

HoneyBadger (actually does care; Administrator, Moderator, iXsystems; joined Feb 6, 2014; 5,112 messages)
buzzhussman said:
"But...I'm not gonna replace it. That's just too much work for this thing. I'm just gonna tell them the system is old and needs to be replaced. No more re-using old garbage to try and squeeze more life out of it. Thanks for the input!"
As mentioned by @jgreco - a RAID controller shouldn't be used with TrueNAS. We're more than happy to help you with an updated system design - feel free to DM me if you're interested in exploring the official iXsystems appliance lineup as well. :smile:
 

buzzhussman (Cadet; joined Apr 20, 2023; 4 messages)
jgreco said:
"You must remove the RAID controller; RAID controllers are not for use with ZFS. This needs to be an HBA."
I kinda got around it by jury-rigging it: I put each of the disks individually into RAID0 so the system would sort of recognize them as independent disks and not an array. The ZFS system yelled at me when I first installed them as a RAID5 array. It was one of those "do your best with what you have" situations. But again, that's too much work for this old system. I've already set in motion the purchase of a new NAS. I'll circle back to y'all once the new system is in place. Thanks again!
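One way to see whether ZFS is looking at real disks or at controller volumes is to check the device names. A minimal sketch, assuming the mfi(4) driver seen in the dmesg line above, which presents each RAID volume (including a single-disk RAID0) as an `mfidN` device; disks on a plain HBA would normally appear under other names such as `daN`. `sysctl -n kern.disks` is a standard FreeBSD sysctl; the function names are hypothetical.

```python
import subprocess


def classify_disks(kern_disks):
    """Split `sysctl -n kern.disks` output into mfi(4) RAID volumes and other disks."""
    raid, other = [], []
    for name in sorted(kern_disks.split()):
        # mfidN is a logical drive exported by the mfi RAID driver, even when it
        # is a single-disk RAID0; ZFS sees the volume, not the physical disk
        (raid if name.startswith("mfid") else other).append(name)
    return raid, other


def current_kern_disks():
    """Read the disk list on a FreeBSD-based system such as TrueNAS CORE."""
    result = subprocess.run(
        ["sysctl", "-n", "kern.disks"], capture_output=True, text=True, check=True
    )
    return result.stdout


# Hypothetical sample output.
print(classify_disks("mfid1 mfid0 ada0"))  # (['mfid0', 'mfid1'], ['ada0'])
```

On the NAS itself, `classify_disks(current_kern_disks())` gives the real list; any `mfid` entries in the pool are the RAID-controller volumes the replies warn about.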
 
WI_Hedgehog (joined Jun 15, 2022; 674 messages)
buzzhussman said:
"No more re-using old garbage to try and squeeze more life out of it. Thanks for the input!"
That is geriatric. Even I tend to not touch them unless they land on my doorstep.

buzzhussman said:
"mfi0 [...] I've already set in motion the purchase of a new NAS. I'll circle back to y'all once the new system is in place. Thanks again!"
Ohhhh, um, we've seen this before... Maybe, before buying the race car and running it into the wall, you should look at the hardware requirements and such...
 