Unrecoverable crash on Nightly 21.08-MASTER-20210812-172919

Diff

Dabbler
Joined
May 9, 2020
Messages
33
System

Dell R730xd, CPU: 2 x Intel(R) Xeon(R) CPU E5-2650L v3 @ 1.80GHz
Memory: 128GB ECC, Dell HBA330 mini (IT mode)
Disks / Pools:
Mirror (boot)
2 x 250GB 12Gb/s SAS SSD (LENOVO HUSMR1625ASS20E)
RAID-Z2
12 x 1.9TB 12Gb/s SAS SSD (SAMSUNG MZILS1T9HCHP 003)
HOT SPARES
2 x 1.9TB 12Gb/s SAS SSD (SAMSUNG MZILS1T9HCHP 003)
SLOG cache
2 x 512GB NVMe SSD

I had been running my server on nightly builds since early 21.06, but a few days ago the server suddenly crashed (I am still investigating the root cause). At that moment it was on nightly build 21.08-MASTER-20210812-172919. After powering the server back on, the Z2 pool was gone.

To be honest, I was surprised: the pool has 12 disks in RAID-Z2, so it is supposed to survive the loss of up to 2 disks, and there were also 2 hot spares, so I was quite confident it would survive any crash... I was mistaken.

I tried to bring the Z2 pool online manually with -f. It came online in a degraded state, reporting that a resilver was in progress, but judging by iotop and zpool iostat, it sat there for hours without any visible activity.
After a day I lost patience :mad: and, since I saw no signs of disk I/O, decided to try a reboot. It did not help.

I tried updating to the most recent nightly build, 21.08-MASTER-20210830-23-2921, which did not help either. After another round of reboots, it has now spent over 2.5 days trying to import the ZFS pools and does not finish booting at all.

I loaded the ISO for 21.08 BETA.1 and tried to boot and upgrade, which failed.

With this, I assume I will have to say goodbye to the state of 20 VMs hosted on an NFS share off that TrueNAS Scale box :oops: and RIP to my curated ISOs and some app installers for Windows VMs.

If anybody from iXsystems is interested in collecting logs and investigating for future improvements, I can grab some and share them before I wipe the system in the next few days.

Here are some screenshots with errors:

rpviewer-4.png


rpviewer-5.png


rpviewer-6.png


Filed a ticket: https://jira.ixsystems.com/browse/NAS-112174
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Upgrading to SCALE 21.08 would not lose data on the ZFS pool. However, for some reason your Z2 VDEV seems to have almost failed, which implies a potential for data loss. We should focus on recovering the Z2 VDEV.

Theory would suggest getting the system up and running with a stable version (e.g. 21.08), then trying to import the pool. However, my experience with these types of recoveries is non-existent, so see what others say.
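
For anyone following along, here is a rough sketch of what "trying to import the pool" could look like once a stable system is up. It only builds the `zpool import` command line and does not run anything. The pool name `tank` and the `/mnt` altroot are placeholders, not values from this thread, and importing read-only first is my assumption of a cautious starting point (it avoids writing to an already damaged pool), not an official recovery procedure.

```python
from typing import List, Optional


def build_import_command(pool: Optional[str] = None,
                         read_only: bool = True,
                         force: bool = False,
                         altroot: Optional[str] = None) -> List[str]:
    """Return an argv list for `zpool import`; nothing is executed here."""
    cmd = ["zpool", "import"]
    if pool is None:
        # With no pool name, `zpool import` only lists importable pools.
        return cmd
    if read_only:
        # Assumption: start read-only so the import does not write to the pool.
        cmd += ["-o", "readonly=on"]
    if force:
        # -f forces the import even if the pool appears to be in use elsewhere.
        cmd.append("-f")
    if altroot:
        # -R mounts the pool's datasets under an alternate root directory.
        cmd += ["-R", altroot]
    cmd.append(pool)
    return cmd


if __name__ == "__main__":
    # "tank" and "/mnt" are hypothetical placeholders.
    print(" ".join(build_import_command()))
    print(" ".join(build_import_command("tank", force=True, altroot="/mnt")))
```

Running the bare listing form first shows which pools and devices ZFS can actually see before anything is forced.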

If your system failed initially due to a hardware issue, then it might be necessary to import the pool on another system. Any evidence of a hardware fault?
 

Diff

If your system failed initially due to a hardware issue, then it might be necessary to import the pool on another system. Any evidence of a hardware fault?

Yes, I suspect it was hardware related. The server powered off suddenly. In the log I saw these entries:

pe730xd_event_log.png



After that, the server initially did not see the 2 x 250GB SSDs in the backplane. I had to pull the server out of the rack and unplug and re-plug the cables. That helped, and the server started to boot.

One of the drives (slot 7), a 1.9TB SAS SSD, started to report a size of 28GB instead of 1788.50GB in Dell iDRAC
Over time another drive (slot 11) started to report a warning; if I understood correctly, it indicates a potential failure in the near future
Today one more drive (slot 13) is also reporting 28GB instead of 1788.50GB
pe730xd_disks.png


Considering I got these SSDs on eBay a year and a half ago for a very reasonable price, and they were manufactured back in 2015, I expected them to fail. This is why I counted on Z2 (up to 2 drives could fail) plus 2 hot spares, but that did not help, so there is probably something more going on.
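
To put numbers on that reasoning, here is a small sketch of the failure margin. It assumes all three affected slots (7, 11, 13) are members of the RAID-Z2 vdev rather than hot spares, which I have not confirmed, and it treats a drive reporting the wrong capacity as failed. The `Drive` class, the function names and the 5% tolerance are made up for illustration. Hot spares only help if a resilver onto them finishes before further members drop out, so they are not counted as extra parity here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Drive:
    slot: int
    reported_gb: float
    warning: bool = False


def suspect_drives(drives: List[Drive], expected_gb: float,
                   tolerance: float = 0.05) -> List[int]:
    """Slots whose reported capacity is off by more than `tolerance` (a fraction)."""
    return [d.slot for d in drives
            if abs(d.reported_gb - expected_gb) > tolerance * expected_gb]


def raidz_margin(parity: int, failed: int) -> int:
    """Further failures a RAID-Z vdev can absorb; a negative value means data loss."""
    return parity - failed


if __name__ == "__main__":
    expected = 1788.50
    # Slots 7 and 13 report 28GB; slot 11 only shows a predictive warning.
    drives = [Drive(7, 28.0), Drive(11, expected, warning=True), Drive(13, 28.0)]
    bad = suspect_drives(drives, expected)
    at_risk = [d.slot for d in drives if d.warning and d.slot not in bad]
    print("wrong capacity:", bad)
    print("margin now:", raidz_margin(2, len(bad)))
    print("margin if warned drives also fail:",
          raidz_margin(2, len(bad) + len(at_risk)))
```

Under those assumptions, two drives with the wrong capacity already use up all of the Z2 parity, and one more failure would exceed it.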

Here is the label side of a drive, if that helps:

pe730xd_pm1633.png


Another point: I flashed the H330 to HBA330 firmware to enable IT mode. Maybe that flash was not fully correct, even though the card reports itself as an HBA330 Mini now.
Maybe I should just buy an H730 Mini or H730P Mini, which are supposed to support switching to HBA mode without needing to be re-flashed.

pe730xd_hba330.png
 