TrueNAS Help Needed ASAP Pool is OFFLINE

swezey

Dabbler
Joined
Feb 17, 2022
Messages
21
So I came in this AM and none of the network drives were accessible. I traced the problem to the pool in TrueNAS being offline, and then found all the messages TrueNAS had been sending to my email last night about the situation:

At 5:08 PM yesterday:

New alert:
* Pool VIPER_AFA state is UNAVAIL: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.
The following devices are not healthy:
  • Disk STEC S842E400M2 STM000187ABF is UNAVAIL
  • Disk SanDisk LT0400MO 42654968 is FAULTED
  • Disk SanDisk LT0400MO 42655780 is FAULTED
  • Disk SanDisk LT0400MO 42653224 is FAULTED
  • Disk SanDisk LT0400MO 42660276 is FAULTED
  • Disk SanDisk LT0400MO 42652372 is FAULTED

At 6:09 PM this:

New alert:
* Pool VIPER_AFA state is UNAVAIL: One or more devices are faulted in response to persistent errors. There are insufficient replicas for the pool to continue functioning.
The following devices are not healthy:
  • Disk STEC S842E400M2 STM000187ABF is UNAVAIL
  • Disk SanDisk LT0400MO 42654968 is FAULTED
  • Disk SanDisk LT0400MO 42655780 is FAULTED
  • Disk SanDisk LT0400MO 42653224 is FAULTED
  • Disk SanDisk LT0400MO 42660276 is FAULTED
  • Disk SanDisk LT0400MO 42652372 is FAULTED

At 6:29 this:

New alert:
* Pool VIPER_AFA state is UNAVAIL: One or more devices are faulted in response to persistent errors. There are insufficient replicas for the pool to continue functioning.
The following devices are not healthy:
  • Disk STEC S842E400M2 STM000187ABF is UNAVAIL
  • Disk SanDisk LT0400MO 42653820 is DEGRADED
  • Disk SanDisk LT0400MO 42653312 is DEGRADED
  • Disk SanDisk LT0400MO 42654968 is FAULTED
  • Disk SanDisk LT0400MO 42659844 is DEGRADED
  • Disk SanDisk LT0400MO 42653456 is DEGRADED
  • Disk SanDisk LT0400MO 42654368 is DEGRADED
  • Disk SanDisk LT0400MO 42655780 is FAULTED
  • Disk SanDisk LT0400MO 42654764 is DEGRADED
  • Disk SanDisk LT0400MO 42656792 is DEGRADED
  • Disk SanDisk LT0400MO 42653224 is FAULTED
  • Disk SanDisk LT0400MO 42652144 is DEGRADED
  • Disk SanDisk LT0400MO 42660276 is FAULTED
  • Disk SanDisk LT0400MO 42652372 is FAULTED

At 9:03 this:

New alert:
* Pool VIPER_AFA state is UNAVAIL: One or more devices are faulted in response to persistent errors. There are insufficient replicas for the pool to continue functioning.
The following devices are not healthy:
  • Disk STEC S842E400M2 STM000187ABF is UNAVAIL
  • Disk SanDisk LT0400MO 42653820 is DEGRADED
  • Disk SanDisk LT0400MO 42653312 is DEGRADED
  • Disk SanDisk LT0400MO 42654968 is FAULTED
  • Disk SanDisk LT0400MO 42659844 is DEGRADED
  • Disk SanDisk LT0400MO 42653456 is DEGRADED
  • Disk SanDisk LT0400MO 42654368 is FAULTED
  • Disk SanDisk LT0400MO 42655780 is FAULTED
  • Disk SanDisk LT0400MO 42654764 is FAULTED
  • Disk SanDisk LT0400MO 42656792 is DEGRADED
  • Disk SanDisk LT0400MO 42653224 is FAULTED
  • Disk SanDisk LT0400MO 42652144 is DEGRADED
  • Disk SanDisk LT0400MO 42660276 is FAULTED
  • Disk SanDisk LT0400MO 42652372 is FAULTED

By 9:10 this:

New alert:
* Pool VIPER_AFA state is UNAVAIL: One or more devices are faulted in response to persistent errors. There are insufficient replicas for the pool to continue functioning.
The following devices are not healthy:
  • Disk STEC S842E400M2 STM000187ABF is UNAVAIL
  • Disk SanDisk LT0400MO 42653820 is DEGRADED
  • Disk SanDisk LT0400MO 42653312 is FAULTED
  • Disk SanDisk LT0400MO 42654968 is FAULTED
  • Disk SanDisk LT0400MO 42659844 is DEGRADED
  • Disk SanDisk LT0400MO 42653456 is DEGRADED
  • Disk SanDisk LT0400MO 42654368 is FAULTED
  • Disk SanDisk LT0400MO 42655780 is FAULTED
  • Disk SanDisk LT0400MO 42654764 is FAULTED
  • Disk SanDisk LT0400MO 42656792 is DEGRADED
  • Disk SanDisk LT0400MO 42653224 is FAULTED
  • Disk SanDisk LT0400MO 42652144 is FAULTED
  • Disk SanDisk LT0400MO 42657568 is FAULTED
  • Disk SanDisk LT0400MO 42654316 is FAULTED
  • Disk SanDisk LT0400MO 42654540 is FAULTED
  • Disk SanDisk LT0400MO 42652260 is FAULTED
  • Disk SanDisk LT0400MO 42660276 is FAULTED
  • Disk SanDisk LT0400MO 42652372 is FAULTED

* Device: /dev/da0, failed to read SMART values.
* Device: /dev/da0, Read SMART Self-Test Log Failed.
* Device: /dev/da4, failed to read SMART values.
* Device: /dev/da4, Read SMART Self-Test Log Failed.

At 9:11 this:

These alerts have been cleared:
* Device: /dev/da4, failed to read SMART values.
* Device: /dev/da4, Read SMART Self-Test Log Failed.

Current alerts:
* Snapshot Task For Dataset "VIPER_AFA/viperiscsiafa" failed: cannot create snapshot 'VIPER_AFA/viperiscsiafa@auto-2023-01-22_00-00': out of space
no snapshots were created..
* Pool VIPER_AFA state is UNAVAIL: One or more devices are faulted in response to persistent errors. There are insufficient replicas for the pool to continue functioning.
The following devices are not healthy:
  • Disk STEC S842E400M2 STM000187ABF is UNAVAIL
  • Disk SanDisk LT0400MO 42653820 is DEGRADED
  • Disk SanDisk LT0400MO 42653312 is FAULTED
  • Disk SanDisk LT0400MO 42654968 is FAULTED
  • Disk SanDisk LT0400MO 42659844 is DEGRADED
  • Disk SanDisk LT0400MO 42653456 is DEGRADED
  • Disk SanDisk LT0400MO 42654368 is FAULTED
  • Disk SanDisk LT0400MO 42655780 is FAULTED
  • Disk SanDisk LT0400MO 42654764 is FAULTED
  • Disk SanDisk LT0400MO 42656792 is DEGRADED
  • Disk SanDisk LT0400MO 42653224 is FAULTED
  • Disk SanDisk LT0400MO 42652144 is FAULTED
  • Disk SanDisk LT0400MO 42657568 is FAULTED
  • Disk SanDisk LT0400MO 42654316 is FAULTED
  • Disk SanDisk LT0400MO 42654540 is FAULTED
  • Disk SanDisk LT0400MO 42652260 is FAULTED
  • Disk SanDisk LT0400MO 42660276 is FAULTED
  • Disk SanDisk LT0400MO 42652372 is FAULTED

* Device: /dev/da0, failed to read SMART values.
* Device: /dev/da0, Read SMART Self-Test Log Failed.

At MIDNIGHT:
Current alerts:
* Device: /dev/da0, failed to read SMART values.
* Device: /dev/da0, Read SMART Self-Test Log Failed.
* Snapshot Task For Dataset "VIPER_AFA/viperiscsiafa" failed: cannot create snapshot 'VIPER_AFA/viperiscsiafa@auto-2023-01-23_00-00': out of space
no snapshots were created..
* Pool VIPER_AFA state is UNAVAIL: One or more devices are faulted in response to persistent errors. There are insufficient replicas for the pool to continue functioning.
The following devices are not healthy:
  • Disk STEC S842E400M2 STM000187ABF is UNAVAIL
  • Disk SanDisk LT0400MO 42653820 is DEGRADED
  • Disk SanDisk LT0400MO 42653312 is FAULTED
  • Disk SanDisk LT0400MO 42654968 is FAULTED
  • Disk SanDisk LT0400MO 42659844 is DEGRADED
  • Disk SanDisk LT0400MO 42653456 is DEGRADED
  • Disk SanDisk LT0400MO 42654368 is FAULTED
  • Disk SanDisk LT0400MO 42655780 is FAULTED
  • Disk SanDisk LT0400MO 42654764 is FAULTED
  • Disk SanDisk LT0400MO 42656792 is DEGRADED
  • Disk SanDisk LT0400MO 42653224 is FAULTED
  • Disk SanDisk LT0400MO 42652144 is FAULTED
  • Disk SanDisk LT0400MO 42657568 is FAULTED
  • Disk SanDisk LT0400MO 42654316 is FAULTED
  • Disk SanDisk LT0400MO 42654540 is FAULTED
  • Disk SanDisk LT0400MO 42652260 is FAULTED
  • Disk SanDisk LT0400MO 42660276 is FAULTED
  • Disk SanDisk LT0400MO 42652372 is FAULTED
Well, you get the idea. Status is now VIPER_AFA(System Dataset Pool) OFFLINE
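
For anyone trying to follow how the failure spread from one alert email to the next, here is a rough sketch that tallies disk states from the pasted alert text. It assumes every disk line looks like the ones above ("Disk <model> <serial> is <STATE>"); the function name and the shortened sample are made up for illustration and are not part of TrueNAS.

```python
import re
from collections import Counter

# Matches alert lines such as "  • Disk SanDisk LT0400MO 42654968 is FAULTED".
# Assumption: the serial number is the last token before " is <STATE>".
DISK_LINE = re.compile(
    r"\bDisk\s+(?P<model>.+?)\s+(?P<serial>\S+)\s+is\s+(?P<state>[A-Z]+)\s*$"
)

def tally_disk_states(alert_text):
    """Return {serial: state} for every disk line in one alert email."""
    states = {}
    for line in alert_text.splitlines():
        match = DISK_LINE.search(line)
        if match:
            states[match.group("serial")] = match.group("state")
    return states

# Shortened sample copied from the alerts above.
sample = """\
The following devices are not healthy:
  • Disk STEC S842E400M2 STM000187ABF is UNAVAIL
  • Disk SanDisk LT0400MO 42654968 is FAULTED
  • Disk SanDisk LT0400MO 42653820 is DEGRADED
"""

states = tally_disk_states(sample)
counts = Counter(states.values())
print(counts)
```

Running it over each email in turn and comparing the results shows which serial numbers moved from DEGRADED to FAULTED between alerts.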

I have to be honest that I do not know as much about TrueNAS as I should. How the heck do I go about troubleshooting this? I am filling in at my company for the empty IT role and we are of course totally down. I can't believe all these drives failed at once!

The hardware is a Supermicro 4U X10DRH-iT 72x 2.5" 2xE5-2667v3 64GB SAS9300-8i 12Gbps SAS3 Server. It's actually a former Ciara storage box. It's been running fine with no issues for a year. Can someone please just tell me kind of where to look? Is this really a drive issue? Something else? I admit I know less about this than I should. Thank you!!
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
short answer: your pool is dead. restore from backups.

long answer:
you need to check your cabling/storage controller/backplane to ensure they are working correctly.

normally, you could try putting the pool into another system but....72 drives will make that difficult.

additionally, if these disks are all about the same age, they might have reached their endurance limit. restore ...from backups!
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
You should supply the output of these 2 commands (only one will give usable output):
zpool status
zpool import

Also, the output of SMART from one of the faulted SanDisk LT0400MO using;
smartctl -x /dev/XXX
where XXX is the drive name.
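
If it helps to gather everything in one pass, here is a minimal sketch that collects the output of the commands above. The device names /dev/da0 and /dev/da4 are taken from the SMART alerts in the first post and are only placeholders; which device names the faulted SanDisk drives actually have is not stated in this thread. The helper names are hypothetical.

```python
import shutil
import subprocess

def build_diag_commands(devices):
    """Build the argument lists for the commands requested above."""
    commands = [["zpool", "status"], ["zpool", "import"]]
    commands += [["smartctl", "-x", dev] for dev in devices]
    return commands

def run_diag(commands, timeout=60):
    """Run each command and return {command string: combined output}."""
    results = {}
    for cmd in commands:
        key = " ".join(cmd)
        if shutil.which(cmd[0]) is None:
            # Tool is not installed on this host, so record that and move on.
            results[key] = "(tool not found on this host)"
            continue
        try:
            proc = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout
            )
        except subprocess.TimeoutExpired:
            results[key] = "(timed out)"
            continue
        results[key] = proc.stdout + proc.stderr
    return results

# Placeholder device names from the alerts; adjust to the real faulted drives.
commands = build_diag_commands(["/dev/da0", "/dev/da4"])
```

Calling `run_diag(commands)` on the NAS itself (as root) and pasting the resulting text here would cover everything asked for above.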

Next, it is possible that the SAS controller (SAS9300-8i 12Gbps) has a thermal problem. Overheating SAS controllers have been the source of several problems we have seen in the forums.

As @artlessknave suggested, it is possible that if your disks are the same age, they have reached their endurance limit. However, these appear to be enterprise-type SSDs, so that is less likely than a cable, power supply, or overheating SAS controller problem.
 

swezey

Dabbler
Joined
Feb 17, 2022
Messages
21
Folks, thanks for your replies... really MUCH appreciated! We found that the problem was the 40,000-hour bug (never heard of it? Me neither... Google it - you will be SHOCKED!). 17 out of 72 drives failed in short order and took out the pool (yeah, I was not expecting that kind of redundancy, but nor was I expecting that kind of simultaneous failure!). Good news: Restored *ALMOST* all the data from backups. Bad news: Exposed some flaws in our backup strategy. Good news: We can now fix the holes in the backup strategy and make it even better.

Thanks again for posting your comments and ideas, again much appreciation to everyone!

Bill
 

Whattteva

Wizard
Joined
Mar 5, 2013
Messages
1,824
Reminds me of Y2K all over again lol.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Wow
3.7-4.5 years and they fail. There's planned obsolescence for you. Hopefully still in warranty - but not a nice way to test backups
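
For reference, the arithmetic behind the upper figure is simple enough to sketch. The 40,000-hour threshold comes from the post above; the helper names and the example power-on reading are invented for illustration, and the code assumes the drive is powered on continuously.

```python
HOURS_PER_YEAR = 24 * 365.25  # average year length, including leap years

def hours_to_years(hours):
    """Convert continuous power-on hours to years."""
    return hours / HOURS_PER_YEAR

def hours_until_threshold(power_on_hours, threshold=40_000):
    """Hours left before a drive reaches the threshold (0 if already past it)."""
    return max(threshold - power_on_hours, 0)

threshold_years = hours_to_years(40_000)
print(f"40,000 hours is about {threshold_years:.2f} years of continuous use")

# Hypothetical drive that reports 39,500 power-on hours.
remaining = hours_until_threshold(39_500)
print(f"{remaining} hours left, roughly {remaining / 24:.0f} days")
```

Drives bought together and powered on together reach the threshold within hours of each other, which fits the way the alerts piled up in a single evening.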
 

swezey

Dabbler
Joined
Feb 17, 2022
Messages
21
Unfortunately, no warranty. We're a small company and can't afford that enterprise stuff new, but jeez - certainly not a flaw we expected. That's why we had 72 drives - PLENTY of fault tolerance, but not 17 drives' worth. As soon as one VDEV (all RAIDZ3 BTW) failed, BOOM, the POOL was toast. Oh well. You live, you learn. (Thank you A. Morissette) :cool:
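
To spell out why 17 failures were fatal even with RAIDZ3 everywhere: each RAIDZ3 vdev survives at most 3 failed disks, and the pool is lost as soon as any single vdev is lost. A small sketch of that rule follows. The six-vdev layout and the way the 17 failures are spread across vdevs are purely hypothetical, since the actual layout was not posted.

```python
# Number of failed disks each RAIDZ level can tolerate per vdev.
RAIDZ_PARITY = {"raidz1": 1, "raidz2": 2, "raidz3": 3}

def vdev_survives(level, failed_disks):
    """A RAIDZ vdev survives while failures do not exceed its parity count."""
    return failed_disks <= RAIDZ_PARITY[level]

def pool_survives(vdevs):
    """A pool survives only if every one of its vdevs survives."""
    return all(vdev_survives(level, failed) for level, failed in vdevs)

# Hypothetical spreads of 17 failed disks across six RAIDZ3 vdevs.
lucky = [("raidz3", n) for n in (3, 3, 3, 3, 3, 2)]
unlucky = [("raidz3", n) for n in (4, 3, 3, 3, 2, 2)]

print(pool_survives(lucky))    # every vdev is at or under 3 failures
print(pool_survives(unlucky))  # one vdev has 4 failures, so the pool is gone
```

So the total count matters less than where the failures land: a fourth failure in any one vdev is enough.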
 