Pool Degraded False Report?

indivision

Guru
Joined
Jan 4, 2013
Messages
806
After I applied the latest update (to TrueNAS-SCALE-22.12.4), one of my pools shows as degraded.

The alert message is:

Pool "name" state is DEGRADED: One or more devices could not be used because the label is missing or invalid. Sufficient replicas exist for the pool to continue functioning in a degraded state.
The following devices are not healthy:
  • Disk WDC_"blabla" is UNAVAIL


However, I see no failed SMART tests. I'm wondering whether this drive is actually healthy and TrueNAS incorrectly kicked it out of the pool?

What steps can I take to troubleshoot the disk's health further? And, if it is healthy, how do I re-add it to the pool? (I have replacement drives ready, but I don't want to use them if this drive really is healthy...)

Thank you for any help!
 

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
That's bad, actually. Normally that happens when a pool was created with identifiers that can change between reboots, like sda, sdb, etc. I'd like to see the full output of zpool status -x from the command line. The other possibility is a hardware issue.
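For reference, a quick sketch of the difference between the two identifier views (both are standard zpool flags):

# show only pools with problems; a healthy system just prints "all pools are healthy"
zpool status -x

# pool members as ZFS recorded them (TrueNAS normally uses GPT partition UUIDs, which are stable)
zpool status

# same layout, resolved to the kernel names (sda, sdb, ...) that can move between reboots
zpool status -L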
 

indivision

Guru
Joined
Jan 4, 2013
Messages
806
That's bad, actually. Normally that happens when a pool was created with identifiers that can change between reboots, like sda, sdb, etc. I'd like to see the full output of zpool status -x from the command line. The other possibility is a hardware issue.

Sorry for the delay. I was sick for a few days...

I ended up just swapping the drive with a new one since it was still under warranty. So, it's going back to WD now.

I never set any identifiers manually; that is all done by TrueNAS as far as I know.

Could it be that I have some legacy identifiers in place from upgrading older systems over time? I tried running the zpool command from the shell so I could share the output, but it says command not found. Do I need to enable that somewhere?
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
try "sudo zpool status -x"
 

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
So, there are no degraded pools now. I presume you resilvered?

Glad you are feeling better now.
 

indivision

Guru
Joined
Jan 4, 2013
Messages
806
So, there are no degraded pools now. I presume you resilvered?

Glad you are feeling better now.

Thank you.

No degraded pools. Yes, I resilvered with the new replacement drive.

I am still concerned that I could have something misconfigured regarding the identifiers. (This same thing happened with another drive a few months ago.)

Is there a way to check how that is set up in my system?
 

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
Maybe post results of these:

zpool status
zpool status -L
sas2flash -list

You could end up with wrong drive IDs if you messed around with various command-line utilities to copy over partition tables, edited them, etc. I doubt you did this.
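If you want to double-check how the pool members are identified, here is one way to map them back to physical disks (a sketch using standard Linux tools; the columns are stock lsblk options):

# list every disk/partition with its partition UUID, model and serial,
# so an ID shown by zpool status can be matched to a physical drive
lsblk -o NAME,PARTUUID,MODEL,SERIAL,SIZE

# or look an ID up directly via the by-partuuid symlinks
ls -l /dev/disk/by-partuuid/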

I love the meshify case btw, have one too!

I presume your LSI SAS9211-8i is flashed to IT mode? Is the firmware current?

Did this server go from CORE to SCALE, or was it a fresh install?
 

indivision

Guru
Joined
Jan 4, 2013
Messages
806
Maybe post results of these:

zpool status
zpool status -L
sas2flash -list

pool: boot-pool
state: ONLINE
status: Some supported and requested features are not enabled on the pool.
The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
the pool may no longer be accessible by software that does not support
the features. See zpool-features(7) for details.
scan: scrub repaired 0B in 00:00:24 with 0 errors on Wed Oct 18 03:45:25 2023
config:

NAME STATE READ WRITE CKSUM
boot-pool ONLINE 0 0 0
  sdp3 ONLINE 0 0 0

errors: No known data errors

pool: megatron
state: ONLINE
scan: scrub repaired 0B in 01:08:53 with 0 errors on Sun Sep 24 01:08:57 2023
config:

NAME STATE READ WRITE CKSUM
megatron ONLINE 0 0 0
  raidz2-0 ONLINE 0 0 0
    sdd2 ONLINE 0 0 0
    sdi2 ONLINE 0 0 0
    sdc2 ONLINE 0 0 0
    sdk2 ONLINE 0 0 0
    sdl2 ONLINE 0 0 0
    sdj2 ONLINE 0 0 0

errors: No known data errors

pool: optimus
state: ONLINE
scan: scrub repaired 0B in 04:30:39 with 0 errors on Sun Oct 15 08:30:54 2023
config:

NAME STATE READ WRITE CKSUM
optimus ONLINE 0 0 0
  raidz2-0 ONLINE 0 0 0
    sdb2 ONLINE 0 0 0
    sda2 ONLINE 0 0 0
    sde2 ONLINE 0 0 0
    sdh2 ONLINE 0 0 0
    sdf2 ONLINE 0 0 0
    sdg2 ONLINE 0 0 0

errors: No known data errors

pool: ramjet
state: ONLINE
scan: scrub repaired 0B in 00:26:57 with 0 errors on Mon Oct 16 04:27:01 2023
config:

NAME STATE READ WRITE CKSUM
ramjet ONLINE 0 0 0
  mirror-0 ONLINE 0 0 0
    sdn2 ONLINE 0 0 0
    sdm2 ONLINE 0 0 0

errors: No known data errors

pool: warpath
state: ONLINE
scan: scrub repaired 0B in 00:01:36 with 0 errors on Sun Sep 17 00:01:37 2023
config:

NAME STATE READ WRITE CKSUM
warpath ONLINE 0 0 0
  sdo2 ONLINE 0 0 0

errors: No known data errors

LSI Corporation SAS2 Flash Utility
Version 20.00.00.00 (2014.09.18)
Copyright (c) 2008-2014 LSI Corporation. All rights reserved

Adapter Selected is a LSI SAS: SAS2008(B2)

Controller Number : 0
Controller : SAS2008(B2)
PCI Address : 00:01:00:00
SAS Address : 500605b-0-013c-a580
NVDATA Version (Default) : 14.01.00.08
NVDATA Version (Persistent) : 14.01.00.08
Firmware Product ID : 0x2213 (IT)
Firmware Version : 20.00.07.00
NVDATA Vendor : LSI
NVDATA Product ID : SAS9211-8i
BIOS Version : 07.39.02.00
UEFI BSD Version : N/A
FCODE Version : N/A
Board Name : SAS9211-8i
Board Assembly : N/A
Board Tracer Number : N/A

Finished Processing Commands Successfully.
Exiting SAS2Flash.

You could end up with wrong drive IDs if you messed around with various command-line utilities to copy over partition tables, edited them, etc. I doubt you did this.

Definitely not. My shell use is pretty much limited to file maintenance and permissions-type work. I believe heavyscript is the only outside utility I've added.

I love the meshify case btw, have one too!

It's great, right?! I think I've had 6-7 Fractal Design cases now. Meshify is perfect for non-rack NAS imo. Even with a lot of drives you can fit a lot of cooling in there.

I presume your LSI SAS9211-8i is flashed to IT mode? Is the firmware current?

It is flashed to IT mode, and I updated the firmware when I bought it a few years ago, but I haven't checked whether there has been an update since. I can't seem to find a link to where they offer the latest...?

Did this server go from CORE to SCALE, or was it a fresh install?

Good question. I switched to SCALE almost right when it came out, so it was a while ago now and my memory is hazy. I'm 90% sure that I did a CORE-to-SCALE migration initially, but there were lingering issues from that, so I rebuilt from a fresh install. I may even have done a second fresh install and rebuild later.

But there is a 10% chance that I'm not remembering that correctly... Any way to check that from the CLI?
 

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
Firmware is good; I've seen older versions corrupt drives. My concern is that it sounds like you've had a couple of label errors. But you didn't post the results of the first command, plain zpool status.

You have a UPS?

Which disk is the bad one? What is the result of smartctl -a /dev/sd?, where ? is the correct disk?
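Something like this, for example (a sketch; substitute whichever device actually maps to that drive's serial number):

# find which sdX currently maps to the drive's serial number
lsblk -o NAME,MODEL,SERIAL

# full SMART attributes and self-test log for that device
sudo smartctl -a /dev/sdX

# optionally start a long self-test and check the result later with -a
sudo smartctl -t long /dev/sdX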
 

indivision

Guru
Joined
Jan 4, 2013
Messages
806
Firmware is good; I've seen older versions corrupt drives. My concern is that it sounds like you've had a couple of label errors. But you didn't post the results of the first command, plain zpool status.

pool: boot-pool
state: ONLINE
status: Some supported and requested features are not enabled on the pool.
The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
the pool may no longer be accessible by software that does not support
the features. See zpool-features(7) for details.
scan: scrub repaired 0B in 00:00:24 with 0 errors on Wed Oct 18 03:45:25 2023
config:

NAME STATE READ WRITE CKSUM
boot-pool ONLINE 0 0 0
  sdp3 ONLINE 0 0 0

errors: No known data errors

pool: megatron
state: ONLINE
scan: scrub repaired 0B in 01:08:53 with 0 errors on Sun Sep 24 01:08:57 2023
config:

NAME STATE READ WRITE CKSUM
megatron ONLINE 0 0 0
  raidz2-0 ONLINE 0 0 0
    1a36fdbb-87a7-400a-843a-96b5b9c8037e ONLINE 0 0 0
    3b4adc52-afd8-4f1b-8454-c68c7fd268fc ONLINE 0 0 0
    c8ac76a3-d77f-4210-865d-d7d441c45de6 ONLINE 0 0 0
    3a7e46b8-f8be-4def-a408-2e8b831c9069 ONLINE 0 0 0
    20f0a7f2-bec7-4fcd-a8f1-1045193926b1 ONLINE 0 0 0
    def8425a-3394-4e99-8179-f49afd3a975e ONLINE 0 0 0

errors: No known data errors

pool: optimus
state: ONLINE
scan: scrub repaired 0B in 04:30:39 with 0 errors on Sun Oct 15 08:30:54 2023
config:

NAME STATE READ WRITE CKSUM
optimus ONLINE 0 0 0
  raidz2-0 ONLINE 0 0 0
    3ae8f823-cc71-11eb-ab04-3cecef437968 ONLINE 0 0 0
    4239eb88-cbd1-11eb-b412-3cecef437968 ONLINE 0 0 0
    8878b974-b5be-4069-a538-a1a77762b5c2 ONLINE 0 0 0
    1652d52c-3be4-40a4-8b9e-034c6ba48c34 ONLINE 0 0 0
    24bddbf9-5aae-4589-ac0d-249a214e7de9 ONLINE 0 0 0
    83818862-6dbe-436f-accd-70d1c2892d89 ONLINE 0 0 0

errors: No known data errors

pool: ramjet
state: ONLINE
scan: scrub repaired 0B in 00:26:57 with 0 errors on Mon Oct 16 04:27:01 2023
config:

NAME STATE READ WRITE CKSUM
ramjet ONLINE 0 0 0
  mirror-0 ONLINE 0 0 0
    6eaade7a-e7af-4f06-9e07-02fcd8a7456a ONLINE 0 0 0
    530c2921-a13f-474d-8426-579bf1f1db91 ONLINE 0 0 0

errors: No known data errors

pool: warpath
state: ONLINE
scan: scrub repaired 0B in 00:01:36 with 0 errors on Sun Sep 17 00:01:37 2023
config:

NAME STATE READ WRITE CKSUM
warpath ONLINE 0 0 0
  07c19263-a63e-4adc-b4e7-dcd4352051fe ONLINE 0 0 0

errors: No known data errors

You have a UPS?

I don't. I probably should, but I haven't really had any noticeable power issues. I'm using a 1000W PSU.

Which disk is the bad one? What is the result of smartctl -a /dev/sd?, where ? is the correct disk?

The bad disk is no longer installed; I replaced it with a new one. Or do you mean which one is the replacement, so I can troubleshoot its label, etc.?

The replacement drive is sdg on optimus.

Here is another odd side note: I have a pair of small Optane NVMe drives that aren't assigned to any pool at the moment, but at some point TrueNAS started showing only one of them. So the other may have died? I should probably remove both since they aren't being used...
 

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
The UPS is just to make sure you don't suffer an unlucky power failure that leaves you with pool issues from an abnormal shutdown. I'll check the rest tonight, have errands to do, lol. In the meantime, do the boot messages show both NVMe drives? If they don't, then of course TrueNAS won't either. You can check with dmesg.
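For example (a sketch; nvme list needs the nvme-cli tool, which may or may not be installed):

# kernel messages mentioning NVMe devices detected at boot
sudo dmesg | grep -i nvme

# device nodes the kernel actually created
ls /dev/nvme*

# controller/namespace listing, if nvme-cli is available
sudo nvme list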

Also, per the post below (and I noticed you don't show it in your system hardware list), what PSU do you have?
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,829
FWIW, I had some similar wonkiness when my PSU started to go. Mind you, it’s a 500W+ capable beastie that was brought to its knees with a 125W plug load after 6 years of faithful, cool service. Nothing spectacular re warning, just unhappiness with the pool where disks were dropping out.

Once the PSU was replaced, all drives were back online happily and the pool was considered healthy again after a resilver. As part of this experience, TrueNAS (ever helpfully) suggested destroying the pool and starting over. So you got off easy!!! :smile:
 

indivision

Guru
Joined
Jan 4, 2013
Messages
806
The UPS is just to make sure you don't suffer an unlucky power failure that leaves you with pool issues from an abnormal shutdown. I'll check the rest tonight, have errands to do, lol. In the meantime, do the boot messages show both NVMe drives? If they don't, then of course TrueNAS won't either. You can check with dmesg.
dmesg only lists the one NVMe drive. Is it possible that I've hit some kind of maximum controller capacity with the number of drives?
Also, per the post below (and I noticed you don't show it in your system hardware list), what PSU do you have?
It's a Corsair RM1000X 80+ Gold 1000W.
FWIW, I had some similar wonkiness when my PSU started to go. Mind you, it’s a 500W+ capable beastie that was brought to its knees with a 125W plug load after 6 years of faithful, cool service. Nothing spectacular re warning, just unhappiness with the pool where disks were dropping out.

Once the PSU was replaced, all drives were back online happily and the pool was considered healthy again after a resilver. As part of this experience, TrueNAS (ever helpfully) suggested destroying the pool and starting over. So you got off easy!!! :smile:
Hm. I hate to imagine it's the PSU; it's relatively new... I have a PSU tester, but it will be a pain to take things apart to test.
 

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
You might want to check your motherboard manual. You can disable the M.2 port in the BIOS; not saying you did, but you might as well check. The PSU is great!
 