Pool Degraded False Report?

indivision

Guru
Joined
Jan 4, 2013
Messages
806
After I applied the latest update (to TrueNAS-SCALE-22.12.4), one of my pools shows as degraded.

The alert message is:

Pool "name" state is DEGRADED: One or more devices could not be used because the label is missing or invalid. Sufficient replicas exist for the pool to continue functioning in a degraded state.
The following devices are not healthy:
  • Disk WDC_"blabla" is UNAVAIL


However, I see no failed SMART tests. I'm wondering whether this drive is actually healthy and TrueNAS incorrectly kicked it out of the pool?

What steps can I take to troubleshoot the disk's health further? And, if it is healthy, how do I re-add it to the pool? (I have replacement drives ready, but I don't want to use them if this drive really is healthy...)

Thank you for any help!
 

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
That's bad, actually. Normally that happens when a pool was created with identifiers that can change between reboots, like sda, sdb, etc. I'd like to see the full output of zpool status -x from the command line. The other possibility is a hardware issue.
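For reference, a quick sketch of the difference between the two identifier views (both are standard zpool flags):

# show only pools with problems; a healthy system just prints "all pools are healthy"
zpool status -x

# pool members as ZFS recorded them (TrueNAS normally uses GPT partition UUIDs, which are stable)
zpool status

# same layout, resolved to the kernel names (sda, sdb, ...) that can move between reboots
zpool status -L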
 

indivision

Guru
Joined
Jan 4, 2013
Messages
806
That's bad, actually. Normally that happens when a pool was created with identifiers that can change between reboots, like sda, sdb, etc. I'd like to see the full output of zpool status -x from the command line. The other possibility is a hardware issue.

Sorry for the delay. I was sick for a few days...

I ended up just swapping the drive with a new one since it was still under warranty. So, it's going back to WD now.

I never set any identifiers manually; that is all done by TrueNAS as far as I know.

Could it be that I have some legacy identifiers in place from upgrading older systems over time? I tried running the zpool command from the shell so I could share the output, but it says command not found. Do I need to enable that somewhere?
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
try "sudo zpool status -x"
 

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
So, there are no degraded pools now. I presume you resilvered?

Glad you are feeling better now.
 

indivision

Guru
Joined
Jan 4, 2013
Messages
806
So, there are no degraded pools now. I presume you resilvered?

Glad you are feeling better now.

Thank you.

No degraded pools. Yes, I resilvered with the new replacement drive.

I am still concerned that I could have something misconfigured regarding the identifiers. (This same thing happened with another drive a few months ago.)

Is there a way to check how that is set up in my system?
 

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
Maybe post results of these:

zpool status
zpool status -L
sas2flash -list

You could end up with wrong drive IDs if you messed around with various command-line utilities to copy over partition tables, edited them, etc. I doubt you did this.
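If you want to double-check how the pool members are identified, here is one way to map them back to physical disks (a sketch using standard Linux tools; the columns are stock lsblk options):

# list every disk/partition with its partition UUID, model and serial,
# so an ID shown by zpool status can be matched to a physical drive
lsblk -o NAME,PARTUUID,MODEL,SERIAL,SIZE

# or look an ID up directly via the by-partuuid symlinks
ls -l /dev/disk/by-partuuid/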

I love the meshify case btw, have one too!

I presume your LSI SAS9211-8i is flashed to IT mode? Is the firmware current?

Did this server go from CORE to SCALE, or was it a fresh install?
 

indivision

Guru
Joined
Jan 4, 2013
Messages
806
Maybe post results of these:

zpool status
zpool status -L
sas2flash -list

pool: boot-pool
state: ONLINE
status: Some supported and requested features are not enabled on the pool.
The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
the pool may no longer be accessible by software that does not support
the features. See zpool-features(7) for details.
scan: scrub repaired 0B in 00:00:24 with 0 errors on Wed Oct 18 03:45:25 2023
config:

NAME STATE READ WRITE CKSUM
boot-pool ONLINE 0 0 0
  sdp3 ONLINE 0 0 0

errors: No known data errors

pool: megatron
state: ONLINE
scan: scrub repaired 0B in 01:08:53 with 0 errors on Sun Sep 24 01:08:57 2023
config:

NAME STATE READ WRITE CKSUM
megatron ONLINE 0 0 0
  raidz2-0 ONLINE 0 0 0
    sdd2 ONLINE 0 0 0
    sdi2 ONLINE 0 0 0
    sdc2 ONLINE 0 0 0
    sdk2 ONLINE 0 0 0
    sdl2 ONLINE 0 0 0
    sdj2 ONLINE 0 0 0

errors: No known data errors

pool: optimus
state: ONLINE
scan: scrub repaired 0B in 04:30:39 with 0 errors on Sun Oct 15 08:30:54 2023
config:

NAME STATE READ WRITE CKSUM
optimus ONLINE 0 0 0
  raidz2-0 ONLINE 0 0 0
    sdb2 ONLINE 0 0 0
    sda2 ONLINE 0 0 0
    sde2 ONLINE 0 0 0
    sdh2 ONLINE 0 0 0
    sdf2 ONLINE 0 0 0
    sdg2 ONLINE 0 0 0

errors: No known data errors

pool: ramjet
state: ONLINE
scan: scrub repaired 0B in 00:26:57 with 0 errors on Mon Oct 16 04:27:01 2023
config:

NAME STATE READ WRITE CKSUM
ramjet ONLINE 0 0 0
  mirror-0 ONLINE 0 0 0
    sdn2 ONLINE 0 0 0
    sdm2 ONLINE 0 0 0

errors: No known data errors

pool: warpath
state: ONLINE
scan: scrub repaired 0B in 00:01:36 with 0 errors on Sun Sep 17 00:01:37 2023
config:

NAME STATE READ WRITE CKSUM
warpath ONLINE 0 0 0
  sdo2 ONLINE 0 0 0

errors: No known data errors

LSI Corporation SAS2 Flash Utility
Version 20.00.00.00 (2014.09.18)
Copyright (c) 2008-2014 LSI Corporation. All rights reserved

Adapter Selected is a LSI SAS: SAS2008(B2)

Controller Number : 0
Controller : SAS2008(B2)
PCI Address : 00:01:00:00
SAS Address : 500605b-0-013c-a580
NVDATA Version (Default) : 14.01.00.08
NVDATA Version (Persistent) : 14.01.00.08
Firmware Product ID : 0x2213 (IT)
Firmware Version : 20.00.07.00
NVDATA Vendor : LSI
NVDATA Product ID : SAS9211-8i
BIOS Version : 07.39.02.00
UEFI BSD Version : N/A
FCODE Version : N/A
Board Name : SAS9211-8i
Board Assembly : N/A
Board Tracer Number : N/A

Finished Processing Commands Successfully.
Exiting SAS2Flash.

You could end up with wrong drive IDs if you messed around with various command-line utilities to copy over partition tables, edited them, etc. I doubt you did this.

Definitely not. My shell use is pretty much limited to file maintenance and permissions-type work. I believe heavyscript is the only outside utility I've added.

I love the meshify case btw, have one too!

It's great, right?! I think I've had 6-7 Fractal Design cases now. Meshify is perfect for non-rack NAS imo. Even with a lot of drives you can fit a lot of cooling in there.

I presume your LSI SAS9211-8i is flashed to IT mode? Is the firmware current?

It is flashed to IT mode, and I updated the firmware when I bought it a few years ago, but I haven't checked whether there has been an update since. I can't seem to find a link to where they offer the latest...?

Did this server go from CORE to SCALE, or was it a fresh install?

Good question. I switched to SCALE almost right when it came out, so it was a while ago now and my memory is hazy. I'm 90% sure that I did a CORE-to-SCALE migration initially, but there were lingering issues from that, so I rebuilt from a fresh install. I may even have done a second fresh install and rebuild later.

But there is a 10% chance that I'm not remembering that correctly... Any way to check that from the CLI?
 

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
Firmware is good; I've seen older versions corrupt drives. My concern is that it sounds like you've had a couple of label errors. But you didn't post the results of the first command, plain zpool status.

You have a UPS?

Which disk is the bad one? What is the result of smartctl -a /dev/sd?, where ? is the correct disk?
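Something like this, for example (a sketch; substitute whichever device actually maps to that drive's serial number):

# find which sdX currently maps to the drive's serial number
lsblk -o NAME,MODEL,SERIAL

# full SMART attributes and self-test log for that device
sudo smartctl -a /dev/sdX

# optionally start a long self-test and check the result later with -a
sudo smartctl -t long /dev/sdX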
 

indivision

Guru
Joined
Jan 4, 2013
Messages
806
Firmware is good; I've seen older versions corrupt drives. My concern is that it sounds like you've had a couple of label errors. But you didn't post the results of the first command, plain zpool status.

pool: boot-pool
state: ONLINE
status: Some supported and requested features are not enabled on the pool.
The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
the pool may no longer be accessible by software that does not support
the features. See zpool-features(7) for details.
scan: scrub repaired 0B in 00:00:24 with 0 errors on Wed Oct 18 03:45:25 2023
config:

NAME STATE READ WRITE CKSUM
boot-pool ONLINE 0 0 0
  sdp3 ONLINE 0 0 0

errors: No known data errors

pool: megatron
state: ONLINE
scan: scrub repaired 0B in 01:08:53 with 0 errors on Sun Sep 24 01:08:57 2023
config:

NAME STATE READ WRITE CKSUM
megatron ONLINE 0 0 0
  raidz2-0 ONLINE 0 0 0
    1a36fdbb-87a7-400a-843a-96b5b9c8037e ONLINE 0 0 0
    3b4adc52-afd8-4f1b-8454-c68c7fd268fc ONLINE 0 0 0
    c8ac76a3-d77f-4210-865d-d7d441c45de6 ONLINE 0 0 0
    3a7e46b8-f8be-4def-a408-2e8b831c9069 ONLINE 0 0 0
    20f0a7f2-bec7-4fcd-a8f1-1045193926b1 ONLINE 0 0 0
    def8425a-3394-4e99-8179-f49afd3a975e ONLINE 0 0 0

errors: No known data errors

pool: optimus
state: ONLINE
scan: scrub repaired 0B in 04:30:39 with 0 errors on Sun Oct 15 08:30:54 2023
config:

NAME STATE READ WRITE CKSUM
optimus ONLINE 0 0 0
  raidz2-0 ONLINE 0 0 0
    3ae8f823-cc71-11eb-ab04-3cecef437968 ONLINE 0 0 0
    4239eb88-cbd1-11eb-b412-3cecef437968 ONLINE 0 0 0
    8878b974-b5be-4069-a538-a1a77762b5c2 ONLINE 0 0 0
    1652d52c-3be4-40a4-8b9e-034c6ba48c34 ONLINE 0 0 0
    24bddbf9-5aae-4589-ac0d-249a214e7de9 ONLINE 0 0 0
    83818862-6dbe-436f-accd-70d1c2892d89 ONLINE 0 0 0

errors: No known data errors

pool: ramjet
state: ONLINE
scan: scrub repaired 0B in 00:26:57 with 0 errors on Mon Oct 16 04:27:01 2023
config:

NAME STATE READ WRITE CKSUM
ramjet ONLINE 0 0 0
  mirror-0 ONLINE 0 0 0
    6eaade7a-e7af-4f06-9e07-02fcd8a7456a ONLINE 0 0 0
    530c2921-a13f-474d-8426-579bf1f1db91 ONLINE 0 0 0

errors: No known data errors

pool: warpath
state: ONLINE
scan: scrub repaired 0B in 00:01:36 with 0 errors on Sun Sep 17 00:01:37 2023
config:

NAME STATE READ WRITE CKSUM
warpath ONLINE 0 0 0
  07c19263-a63e-4adc-b4e7-dcd4352051fe ONLINE 0 0 0

errors: No known data errors

You have a UPS?

I don't. I probably should, but I haven't really had any noticeable power issues. I'm using a 1000W PSU.

Which disk is the bad one? What is the result of smartctl -a /dev/sd?, where ? is the correct disk?

The bad disk is no longer installed; I replaced it with a new one. Or do you mean which one is the replacement, so I can troubleshoot its label, etc.?

The replacement drive is sdg on optimus.

Here is another odd side note: I have a pair of small Optane NVMe drives that aren't assigned to any pool at the moment, but at some point TrueNAS started showing only one of them. So the other may have died? I should probably remove both since they aren't being used...
 

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
The UPS is just to make sure you don't suffer an unlucky power failure that leaves you with pool issues from an abnormal shutdown. I'll check the rest tonight, have errands to do, lol. In the meantime, do the boot messages show both NVMe drives? If they don't, then of course TrueNAS won't either. You can check with dmesg.
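For example (a sketch; nvme list needs the nvme-cli tool, which may or may not be installed):

# kernel messages mentioning NVMe devices detected at boot
sudo dmesg | grep -i nvme

# device nodes the kernel actually created
ls /dev/nvme*

# controller/namespace listing, if nvme-cli is available
sudo nvme list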

Also, per the post below (and I noticed you don't show it in your system hardware list), what PSU do you have?
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,829
FWIW, I had some similar wonkiness when my PSU started to go. Mind you, it’s a 500W+ capable beastie that was brought to its knees with a 125W plug load after 6 years of faithful, cool service. Nothing spectacular re warning, just unhappiness with the pool where disks were dropping out.

Once the PSU was replaced, all drives were back online happily and the pool was considered healthy again after a resilver. As part of this experience, TrueNAS (ever helpfully) suggested destroying the pool and starting over. So you got off easy!!! :smile:
 

indivision

Guru
Joined
Jan 4, 2013
Messages
806
The UPS is just to make sure you don't suffer an unlucky power failure that leaves you with pool issues from an abnormal shutdown. I'll check the rest tonight, have errands to do, lol. In the meantime, do the boot messages show both NVMe drives? If they don't, then of course TrueNAS won't either. You can check with dmesg.
dmesg only lists the one NVMe drive. Is it possible that I've hit some kind of maximum controller capacity with the number of drives?
Also, per the post below (and I noticed you don't show it in your system hardware list), what PSU do you have?
It's a Corsair RM1000X 80+ Gold 1000W.
FWIW, I had some similar wonkiness when my PSU started to go. Mind you, it’s a 500W+ capable beastie that was brought to its knees with a 125W plug load after 6 years of faithful, cool service. Nothing spectacular re warning, just unhappiness with the pool where disks were dropping out.

Once the PSU was replaced, all drives were back online happily and the pool was considered healthy again after a resilver. As part of this experience, TrueNAS (ever helpfully) suggested destroying the pool and starting over. So you got off easy!!! :smile:
Hm. I hate to imagine it's the PSU; it's relatively new... I have a PSU tester, but it will be a pain to take things apart to test.
 

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
You might want to check your motherboard manual. You can disable the M.2 port in the BIOS; not saying you did, but you might as well check. The PSU is great!
 