Server Keeps Hard Crashing 12.0u7

TidalWave

Explorer
Joined
Mar 6, 2019
Messages
51
Hey Guys,

We are in a bit of a pickle. We have had TrueNAS 12.0-U7 running just fine for months. Recently, however, the server keeps crashing when we access data or run a scrub. We have a 90-drive Supermicro JBOD chassis connected via SAS cable to a Supermicro head unit server.

Three weeks ago we had a drive fail, but when I tried to replace the failed drive with a new one, the web GUI wouldn't accept the disk. I tried to wipe the disk using the GUI and it wouldn't wipe either. So I ran a command like this:

zpool replace tank2 /dev/gptid/00e45156-c66d-11eb-acb4-3cecef1011b7 /dev/multipath/disk56

That actually worked and the pool began resilvering. Once it finished I ran a scrub, which also went well. People used the server for days and it was fine.
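After a manual `zpool replace` like the one above, the pool's own status report is a quick way to confirm that the resilver and the follow-up scrub finished cleanly. The following is only a minimal sketch: the pool name tank2 is taken from the command above, and `pool_summary` is a hypothetical helper name, not part of ZFS. It filters text it is given and does not touch the pool.

```shell
# pool_summary: hypothetical helper (not part of ZFS).
# Reads `zpool status` text on stdin and keeps only the
# state, scan and errors lines.
pool_summary() {
  grep -E '^[[:space:]]*(state|scan|errors):'
}

# On the live system (pool name taken from the post):
#   zpool status -v tank2 | pool_summary
```

The scan report can wrap onto continuation lines that this filter drops, so the full `zpool status -v` output is still worth reading when anything looks off.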

However, a week later the server started to crash randomly. We are not sure why; there are no errors on the JBOD controller. The TrueNAS web GUI just stops working and the middleware service crashes.

Sometimes we get an error message saying 30 drives popped offline; other times we get Python error messages when the server crashes.

In either case I have to power cycle both the JBOD and the head unit. On reboot, the TrueNAS server sees the pool, imports it, and then runs some txg reallocations. After about 30 minutes the server finishes booting and the pool returns.

However when people start accessing the data the server crashes again.
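When many drives drop offline at once, the kernel log may show whether the disks detached or timed out on the SAS path before the middleware failed. A rough sketch of a log filter follows. `sas_events` is a hypothetical helper name, and the pattern list is an assumption about what such messages typically contain on a FreeBSD-based system, not an official or exhaustive set.

```shell
# sas_events: hypothetical helper. Reads kernel log text on stdin and
# keeps lines that often accompany disks dropping off a SAS bus.
# The pattern list is an assumption, not an exhaustive or official set.
sas_events() {
  grep -Ei 'cam status|timeout|detached|lost device'
}

# On the live system, for example:
#   dmesg | sas_events
#   sas_events < /var/log/messages
```

Because a power cycle clears the in-memory kernel buffer, the on-disk log is the more likely place to find messages from before a crash.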

So I thought that adding the drive from the command line might somehow have corrupted the pool. We offlined disk56 and inserted a brand new HDD. These are all 16TB SAS drives, by the way. We wiped the disk and then resilvered the new disk using the web GUI, and that worked. The resilver finished and the pool shows healthy.

However, when I run a scrub, the server crashes again at about 5%, giving the Python error codes in the attached picture.

We are going to try swapping the JBOD chassis tomorrow, but I'm curious to know if anyone has any other ideas on why our server keeps crashing and giving the python error codes about middleware.

-Tidal
 

Attachments

  • MiddleWare Not running.png (239.5 KB)
  • 2nd crash PythonCrashCode.png (685.7 KB)
  • 1st Crash Error Message.png (506.8 KB)
  • txg records changing on bootup..png (626.8 KB)
  • DataSet Scanning on Boot up.png (929.4 KB)
Joined
Jan 9, 2022
Messages
7
Hi

Where is your 'system set' located? Can it be on the (once) faulty pool? Have you tried moving it to a different one?
 

TidalWave

Explorer
Joined
Mar 6, 2019
Messages
51
Do you mean my OS? I have two pools the boot pool and then the tank(data). I'm not familiar with the system set.
 

Jessep

Patron
Joined
Aug 19, 2018
Messages
379
truenas_forum_temp.PNG
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The Python errors look to me like the middleware isn't running, but that's just a guess. I don't see what's causing the middleware to crash.

Given that you have a setup a lot closer to what iXsystems sells than many of the users around here have, the developers might take an extra interest in a problem report from you, and may be able to provide some additional guidance. If there's a bug or issue causing the middleware to crash on large systems after a disk replacement, that's potentially noteworthy.
 
Joined
Jan 9, 2022
Messages
7
The System Dataset is what Jessep pointed out in the screenshot. If it resides on the faulty pool, I suggest moving it to the boot pool to see if that helps.
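If the web GUI is unavailable because the middleware is down, the dataset listing can still hint at where the System Dataset lives. This is a minimal sketch under one assumption: that the System Dataset appears as a dataset named `.system` directly under a pool. `system_dataset_pool` is a hypothetical helper name.

```shell
# system_dataset_pool: hypothetical helper. Reads dataset names on stdin
# (one per line) and prints every pool that has a top-level ".system"
# dataset. Assumes the System Dataset is named "<pool>/.system".
system_dataset_pool() {
  awk -F/ 'NF == 2 && $2 == ".system" { print $1 }'
}

# On the live system:
#   zfs list -H -o name | system_dataset_pool
```

If more than one pool is printed, the listing alone is ambiguous and the System Dataset setting in the GUI is the better source.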
 

TidalWave

Explorer
Joined
Mar 6, 2019
Messages
51
The system dataset is on the Tank. Should it be on the Boot drive?
 

Attachments

  • SystemDataset.png (216.1 KB)

ddaenen1

Patron
Joined
Nov 25, 2019
Messages
318
Suggest you try moving it to a different pool (Boot drive is an option) and watch how it behaves.
This is something you could try temporarily to verify whether it fixes the issue. I would not recommend it as a permanent option unless your boot pool is redundant. Otherwise a boot disk crash would kill any possibility of recovering your installation afterwards.
 
Joined
Jan 9, 2022
Messages
7
An installation can be recovered from a backup of the system configuration. Everyone should keep their system's configuration backups up to date as well as their data backups, regardless of where the system dataset resides.
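The usual way to back up the configuration is the Save Config option in the web GUI. Where a scripted copy is wanted as well, something like the sketch below could work. It assumes the configuration database is `/data/freenas-v1.db` and the password secret seed is `/data/pwenc_secret`; `backup_config` and the destination path in the usage comment are hypothetical. A plain copy of a database that is in use may not be fully consistent, so this is a supplement to the GUI export rather than a replacement.

```shell
# backup_config: hypothetical helper. Copies the config database, and the
# secret seed if present, from SRC_DIR into DEST_DIR with a timestamp.
# Prints the path of the copied database on success.
backup_config() {
  src_dir=$1
  dest_dir=$2
  stamp=$(date +%Y%m%d-%H%M%S)
  mkdir -p "$dest_dir" || return 1
  cp -p "$src_dir/freenas-v1.db" "$dest_dir/freenas-v1-$stamp.db" || return 1
  # The secret seed is needed to decrypt stored passwords after a restore.
  if [ -f "$src_dir/pwenc_secret" ]; then
    cp -p "$src_dir/pwenc_secret" "$dest_dir/pwenc_secret-$stamp" || return 1
  fi
  echo "$dest_dir/freenas-v1-$stamp.db"
}

# On the live system (destination path is a made-up example):
#   backup_config /data /mnt/otherpool/config-backups
```

Keeping the copies somewhere other than the boot device or the pool under suspicion is what makes them useful after a failure.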
 