Server randomly Crashing 12.0u7

CB_H1

Cadet
Joined
Jan 12, 2022
Messages
4
Hey All,

We seem to have a strange issue and I'm not sure what is going on. The system randomly locks up and crashes, and only a hard reboot will fix it. We found errors on the server saying it couldn't ping the NTP pool, so we deleted all entries and changed to a different NTP server. However, it has crashed again.
We were running a syslog server (log included), but the only thing in it that seems out of place is a reload of the config.
Hope someone can help.

Hardware Information:
MB: ASUS B250M-K
CPU: Intel(R) Core(TM) i5-7600K CPU @ 3.80GHz
RAM: 8GB
Hard Drives: 5 in total
Disk 0: SanDisk SD8SN8U-128G-1006 NVME - Boot Pool
Disk 1: ST4000VM000-1F3168 4TB - 8TB Pool
Disk 2: ST4000VM000-1F3168 4TB - 8TB Pool
Disk 3: ST2000DX001-1CM164 2TB - 4TB Pool
Disk 4: ST2000DX001-1CM164 2TB - 4TB Pool
Raid: 2 Raid 0 sets for 8TB and 4TB and no raid for boot.
Hard Drive Controller: Onboard
Network - Onboard

Software Information:
TrueNAS-12.0-STABLE - Release Train for TrueNAS 12.0 [release]

Attached:
SYSLOG
DEBUG

Thank you for your time and help.
 

Attachments

  • SYSLOG True NAS.xlsx
    44.7 KB · Views: 136
  • debug-BACKUP-20220112131334.tgz
    450.9 KB · Views: 117

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Random lockups and crashes of the kind you describe represent hardware issues. There is no software fix, no bug that is causing these. Removing NTP servers certainly isn't going to do jack.

MB: ASUS B250M-K

A common issue with consumer grade desktop boards is that they are made for consumer grade desktop PCs, operated 8 hours a day for three years before the next refresh cycle.

That's a six year old mainboard, so it's about two or three years beyond its expected lifetime. This particular board seemed to have a high run of early fatalities, as noted at NewEgg and elsewhere, so expecting long life out of it may be optimistic.

One of the reasons we discourage the use of these kinds of boards is that server-grade mainboards, designed for years of operation 24/7 in warmer data center environments, are actually designed to last, and often run seven, ten, even more years. This is largely just choosing higher quality components.

Check for bulging or blown capacitors on your mainboard. The capacitor plague is the classic example of cheap components dooming a PC. The "classic" example was from before the time of the board in question, but PC manufacturers are under constant pressure to keep their prices low.

Take the system back to its burn-in phase, which is the best place to weed out these problems. Run memtest86 for two weeks, and then cpuburn for another week. If you cannot successfully complete those simple tests, FreeNAS/TrueNAS will *never* be able to run in a stable fashion on your system. If you cannot remediate those problems, perhaps by reseating DIMMs or resocketing the CPU, you will ultimately need to trash the board. Also, if you think the word "weeks" was an error or a typo, think again. You really need long-term stability of the platform, and you have already demonstrated that yours is not stable.

Do note that some steps, such as resocketing the CPU, can appear to "fix" issues such as overheating CPUs due to aged thermal compound. Be certain that you are doing such remediations properly, by reapplying thermal paste under the CPU cooler, etc.


Occasionally other problems such as a failing PSU can cause instability. Use the following guide if you wish to try a replacement.

 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Raid: 2 Raid 0 sets for 8TB and 4TB and no raid for boot.
Hard Drive Controller: Onboard
Do you mean that you're using the onboard HDD controller to do RAID 0 with the 8TB disks and with the 4TB disks (giving TrueNAS only 2 disks of 16TB and 8TB respectively)?

If that's what you're doing, you need to read this:

Also you need to think about why you don't want redundancy in your pools (if you don't care about keeping the data you have on it, I guess that's fine).

Currently, you have no protection against disk failure and no way to correct (and maybe even not detect) errors in your pools. (unless I misunderstood you)
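
To make the trade-off concrete, here is a rough sketch of the raw capacity arithmetic for the two pools described in the first post (2 x 4TB and 2 x 2TB), comparing a stripe with a two-way mirror. It ignores ZFS overhead, and the function names are made up for illustration only.

```python
def stripe_capacity_tb(disk_sizes_tb):
    """Raw capacity of a striped (RAID 0 style) pool: the disk sizes add up."""
    return sum(disk_sizes_tb)


def mirror_capacity_tb(disk_sizes_tb):
    """Raw capacity of a single mirror vdev: limited by the smallest disk."""
    return min(disk_sizes_tb)


def failures_tolerated(layout, n_disks):
    """Number of disk failures the pool survives without losing data."""
    if layout == "stripe":
        return 0  # losing any one disk loses the whole pool
    if layout == "mirror":
        return n_disks - 1  # one surviving copy is enough
    raise ValueError(f"unknown layout: {layout}")


# Disk sizes taken from the hardware list in the first post.
pools = {"8TB Pool": [4, 4], "4TB Pool": [2, 2]}

for name, disks in pools.items():
    print(
        name,
        "stripe:", stripe_capacity_tb(disks), "TB,",
        failures_tolerated("stripe", len(disks)), "failures tolerated;",
        "mirror:", mirror_capacity_tb(disks), "TB,",
        failures_tolerated("mirror", len(disks)), "failures tolerated",
    )
```

In other words, the stripe doubles the space but a single failed disk takes the whole pool with it, and without a second copy ZFS can report checksum errors on file data but has nothing to repair it from.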

MB: ASUS B250M-K

Network - Onboard
This Motherboard has a Realtek NIC.

The Realtek drivers have been known to cause kernel panics under heavy load (particularly if hardware offload has not been disabled).

Adding an Intel NIC may help.
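
As a rough way to see which offload features are currently on, the sketch below parses the `options=` line that FreeBSD's `ifconfig` prints for an interface and lists the matching disable flags (`-rxcsum`, `-txcsum`, `-tso4`, `-tso6`, `-lro`). The sample text is a made-up illustration shaped like `ifconfig re0` output, not taken from the system in this thread, and the helper names are hypothetical.

```python
import re

# FreeBSD interface capability names mapped to the ifconfig flag that disables each.
DISABLE_FLAG = {
    "RXCSUM": "-rxcsum",
    "TXCSUM": "-txcsum",
    "TSO4": "-tso4",
    "TSO6": "-tso6",
    "LRO": "-lro",
}

# Matches the interface "options=<hex><CAP,CAP,...>" line, not the "nd6 options=" line.
OPTIONS_RE = re.compile(r"^\s*options=[0-9a-fA-F]+<([^>]*)>", re.MULTILINE)


def enabled_offloads(ifconfig_output):
    """Return the offload capabilities listed as enabled in ifconfig output."""
    match = OPTIONS_RE.search(ifconfig_output)
    if not match:
        return []
    capabilities = match.group(1).split(",")
    return [cap for cap in capabilities if cap in DISABLE_FLAG]


def disable_flags(ifconfig_output):
    """Build the ifconfig flag string that would turn the enabled offloads off."""
    return " ".join(DISABLE_FLAG[cap] for cap in enabled_offloads(ifconfig_output))


# Hypothetical sample, for illustration only (values are placeholders).
SAMPLE = (
    "re0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500\n"
    "\toptions=8209b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,WOL_MAGIC,LINKSTATE>\n"
    "\tnd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>\n"
)

print(disable_flags(SAMPLE))
```

The resulting flag string is the sort of thing that would be passed to `ifconfig re0 ...` or, presumably, entered in the interface options in the TrueNAS UI so it persists across reboots.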

Also pay attention to the great advice that @jgreco has added while I was typing this...
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
If that's what you're doing, you need to read this:

That might be the wrong resource. I'm not sure we even have anything that covers the folly of not having redundancy in a ZFS pool. You're allowed to not have redundancy, of course. But it's a warning worth paying attention to.

The Realtek drivers have been known to cause kernel panics under heavy load (particularly if hardware offload has not been disabled).

Is that still a thing? I know it was, years ago, but I have no recollection of it in recent years. Usually the Realteks just suck horribly if you have the bad luck to get one of the useless chips, or moderately acceptably at a low performance level for the rest of them. Bill Paul's sarcastic classic comments on Realtek in the rl driver are sadly missing from the re driver source, but presumably only because the work was sponsored...
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
That might be the wrong resource. I'm not sure we even have anything that covers the folly of not having redundancy in a ZFS pool.
It's a starting point... using onboard RAID is problem 1... delivering a RAID 0 volume with it is problem 2.
 

CB_H1

Cadet
Joined
Jan 12, 2022
Messages
4
Hi jgreco:
Regarding burn-in testing, this system was fully tested and working. The CPU was also reseated and new thermal paste was applied before TrueNAS was installed. The bit I forgot to mention is that TrueNAS ran fine for about 2 months before this issue appeared.

Hi sretalla,
Regarding the disks: they are all standalone, and we used TrueNAS to add the drives as two separate pools of 8TB and 4TB, effectively creating the RAID.
Redundancy isn't really an issue for this system; it's more about total space. So while I agree with you that we should normally have redundancy, it is not needed here.
I hadn't disabled hardware offload, so I will do that and see how it goes.

Thanks for all your replies.
I know the setup is basic and budget. The main purpose of the NAS is to take backups from devices over the internet as a secondary backup location, which is why redundancy isn't really an issue.
 

CB_H1

Cadet
Joined
Jan 12, 2022
Messages
4
Just a note to say the NIC dropped again, but this time I caught the errors in syslog:

10.0.5.210 - - [12/Jan/2022:20:07:26 +0000] "GET /ui/15.511faf61ea571a4e8ae3.js HTTP/1.1" 304 0 "http://10.0.5.10/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36"

re0: link state changed to DOWN

I/O error occurred while writing; fd='22', error='No route to host (65)'

Syslog connection broken; fd='22', server='AF_INET(10.0.5.210:514)', time_reopen='60'

I/O error occurred while writing; fd='22', error='No route to host (65)'

10.0.5.210 (iqn.1991-05.com.microsoft:backup-svr): no ping reply (NOP-Out) after 5 seconds; dropping the connection

re0: link state changed to UP

Not sure what this means.
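
Read in order, the excerpt above appears to show the `re0` interface losing link, the syslog forwarder and the iSCSI session to 10.0.5.210 failing while it is down, and then the link coming back. A minimal sketch for counting such link flaps in a saved log might look like the following; the log lines are copied from the excerpt, while the function names are made up for illustration.

```python
import re

# Matches kernel messages such as "re0: link state changed to DOWN".
LINK_RE = re.compile(r"^(?P<ifname>\w+): link state changed to (?P<state>UP|DOWN)$")


def link_events(lines):
    """Extract (interface, state) pairs from log lines, in order."""
    events = []
    for line in lines:
        match = LINK_RE.match(line.strip())
        if match:
            events.append((match.group("ifname"), match.group("state")))
    return events


def count_flaps(events):
    """Count DOWN-then-UP transitions per interface."""
    flaps = {}
    currently_down = set()
    for ifname, state in events:
        if state == "DOWN":
            currently_down.add(ifname)
        elif ifname in currently_down:
            currently_down.discard(ifname)
            flaps[ifname] = flaps.get(ifname, 0) + 1
    return flaps


# Lines taken from the syslog excerpt in this post.
LOG = [
    "re0: link state changed to DOWN",
    "I/O error occurred while writing; fd='22', error='No route to host (65)'",
    "Syslog connection broken; fd='22', server='AF_INET(10.0.5.210:514)', time_reopen='60'",
    "I/O error occurred while writing; fd='22', error='No route to host (65)'",
    "re0: link state changed to UP",
]

print(count_flaps(link_events(LOG)))
```

A steadily growing flap count for `re0` would point at the NIC, its driver, the cable, or the switch port rather than at the storage side.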
 

CB_H1

Cadet
Joined
Jan 12, 2022
Messages
4
Thanks very much, sretalla. Changing the NIC has resolved all the issues we were facing.
 