Degraded pool, is this really a problem?

thany · Sep 26, 2022

I got a set of 4 brand-new SSDs and created a pool. And it's immediately degraded. Every type of SMART test turns up green.

Here's my zpool status:

Code:

  pool: SSD
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
config:

        NAME                                      STATE     READ WRITE CKSUM
        SSD                                       DEGRADED     0     0     0
          raidz1-0                                DEGRADED     0     0     0
            9636c9db-39f1-466b-b238-973bfee2a415  ONLINE       0     0     0
            57ba6532-4e94-42c2-a513-14e6dc8127ab  ONLINE       0     0     0
            05df70c4-4f3f-4939-a597-f0a31d67a635  ONLINE       0     0     0
            cdc73920-be07-4cc3-a996-0511ae1df527  FAULTED      0    28     0  too many errors

errors: No known data errors

And smartctl of the offending drive:

Code:

smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.131+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     WD Blue / Red / Green SSDs
Device Model:     WDC  WDS500G1R0A-68A4W0
Serial Number:    214703A00055
LU WWN Device Id: 5 001b44 8bc38af2b
Firmware Version: 411000WR
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 1.5 Gb/s)
Local Time is:    Mon Sep 26 17:50:30 2022 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x02) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x11) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  10) minutes.

SMART Attributes Data Structure revision number: 4
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   ---    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   ---    Old_age   Always       -       10
 12 Power_Cycle_Count       0x0032   100   100   ---    Old_age   Always       -       8
165 Block_Erase_Count       0x0032   100   100   ---    Old_age   Always       -       65536
166 Minimum_PE_Cycles_TLC   0x0032   100   100   ---    Old_age   Always       -       0
167 Max_Bad_Blocks_per_Die  0x0032   100   100   ---    Old_age   Always       -       130
168 Maximum_PE_Cycles_TLC   0x0032   100   100   ---    Old_age   Always       -       0
169 Total_Bad_Blocks        0x0032   100   100   ---    Old_age   Always       -       283
170 Grown_Bad_Blocks        0x0032   100   100   ---    Old_age   Always       -       0
171 Program_Fail_Count      0x0032   100   100   ---    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   ---    Old_age   Always       -       0
173 Average_PE_Cycles_TLC   0x0032   100   100   ---    Old_age   Always       -       0
174 Unexpected_Power_Loss   0x0032   100   100   ---    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   ---    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   ---    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   ---    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   075   030   ---    Old_age   Always       -       25 (Min/Max 21/30)
199 UDMA_CRC_Error_Count    0x0032   100   100   ---    Old_age   Always       -       11
230 Media_Wearout_Indicator 0x0032   001   001   ---    Old_age   Always       -       0x000000000000
232 Available_Reservd_Space 0x0033   100   100   004    Pre-fail  Always       -       100
233 NAND_GB_Written_TLC     0x0032   100   100   ---    Old_age   Always       -       0
234 NAND_GB_Written_SLC     0x0032   100   100   ---    Old_age   Always       -       3
241 Host_Writes_GiB         0x0030   253   253   ---    Old_age   Offline      -       3
242 Host_Reads_GiB          0x0030   253   253   ---    Old_age   Offline      -       6
244 Temp_Throttle_Status    0x0032   000   100   ---    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Conveyance offline  Completed without error       00%         0         -
# 2  Extended offline    Completed without error       00%         9         -

Selective Self-tests/Logging not supported

It looks to me like the reserved space is a little bit too low. I'm not sure if this is a real actual problem, nor if it's correct for TrueNAS to "trip over" this pre-fault. Isn't reserved space supposed to vary from drive to drive anyway?

So the question is: is it safe to use this drive, and if so, how do I tell TrueNAS to go and use it as normal?

thany · Sep 26, 2022

Another interesting point that I thought about, is how do I find out which of the 4 drive bays this drive is in? Can I make it blink the activity LED or something?

HoneyBadger · Sep 26, 2022

While it is possible for the drive to be DOA, I would try reseating or replacing the cable leading to this particular SSD first.

thany · Sep 26, 2022

All four are on the same SAS cable.

Glorious1 · Sep 26, 2022

thany said:
Another interesting point that I thought about, is how do I find out which of the 4 drive bays this drive is in? Can I make it blink the activity LED or something?

There should be an easier way, but . . . You have the gptid. So do this:
glabel status: gives gptid and geom name
cat /var/log/dmesg.today | grep Serial (or better, WebGUI > Storage > Disks): gives geom name and serial number, which you can find on the physical drive

thany · Oct 4, 2022

Already got the serial number. It's buried in the GUI as well. But the serial number is written *on* the disk, so I would still have to take them all out to check them one by one. And sure, I could label them in a way that I can see serial numbers from the outside. But blinking the status LED would be a super simple and effective way to identify a drive.

This problem is exactly why blade servers usually have an ID LED, which can be blinked or something to identify a blade from the management software. Especially when dealing with RAID systemes (I'm back to truenas :)) port numbers are unpredictable, /dev/sdX codes seem volatile (subject to be rearranged even when a disk gets added), and cabling could be wired out of order during hardware maintenance. All valid scenarios by which a simple & effective way to just quickly ID a drive would be greatly preferable to taking them all out.

Glorious1 · Oct 4, 2022

Sure, I was just answering the question based on the OS we have, rather than one with features we wish it had. Since I built my server, I maintain a table with all the drive information, including slot number. I print it out and put it under the server, so any time I need to open it I can immediately pinpoint a drive.

thany · Oct 4, 2022

The simplest solution is an analogue one, it seems :)

HoneyBadger · Oct 4, 2022

If you have a compatible SAS backplane, you can try using sesutil locate daX to blink the LED. You'd still have to translate the gptid to the device first though.

NugentS · Oct 4, 2022

Glorious1 said:
Sure, I was just answering the question based on the OS we have, rather than one with features we wish it had. Since I built my server, I maintain a table with all the drive information, including slot number. I print it out and put it under the server, so any time I need to open it I can immediately pinpoint a drive.

Just how I do it. I have a diagram indicating whichg drive model, serial number and other details is inserted into which slot.

So far I have kept it up to date

Johnny Fartpants · Oct 4, 2022

Or better still just glabel your drives with a suitable identifier before you create your pool then create the pool via the CLI using the glabel names. Export the pool from the CLI and import it via the Web UI and away you go forever.

Then zpool status looks like this.

NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
label/1_W51P_2_0 ONLINE 0 0 0
label/2_U61P_2_1 ONLINE 0 0 0
label/3_KN1P_2_2 ONLINE 0 0 0
label/4_KK1P_2_3 ONLINE 0 0 0
label/5_S1P6_2_4 ONLINE 0 0 0
label/6_BU1P_2_5 ONLINE 0 0 0
label/7_PZ1P_2_6 ONLINE 0 0 0

PS: Don't forget to first build a pool via the webui of all your disks. It doesn’t matter what type of pool as we just want TrueNAS to partition the drives for swap and data and also set alignment.

Detach (export) the pool via the webui but don’t mark disks as new otherwise this will wipe all of the partitions etc.

Now it’s time to label your disks glabel label –v 1_50PV_2_0 /dev/da0p2

Note that only p2 needs naming as this is our data partition and p1 will become swap.

Create a new zpool from the command line using the label names.

zpool create –f tank raidz2 label/1_50PV_2_0 label/2_ZHZV_2_1 label/3_52WV_2_2 label/4_TL7V_2_3

rvassar · Oct 5, 2022

Just a thought.... Swap two drives and see if it follows the drive or sticks to a slot.

Important Announcement for the TrueNAS Community.

Degraded pool, is this really a problem?

thany

Dabbler

thany

Dabbler

HoneyBadger

actually does care

thany

Dabbler

Glorious1

Guru

thany

Dabbler

Glorious1

Guru

thany

Dabbler

HoneyBadger

actually does care

NugentS

MVP

Johnny Fartpants

Guru

rvassar

Guru

Similar threads

Important Announcement for the TrueNAS Community.

Degraded pool, is this really a problem?

Dabbler

Dabbler

actually does care

Dabbler

Guru

Dabbler

Guru

Dabbler

actually does care

MVP

Guru

Guru

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Degraded pool, is this really a problem?"

Similar threads