Got alert that drive is UNAVAIL, but it's not

Grinas

Contributor
Joined
May 4, 2017
Messages
174
I just got the following alert.

Code:
 * Pool fourtb state is DEGRADED: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.
The following devices are not healthy:
    * Disk TOSHIBA_HDWT840 92K2S0M9SRZH is UNAVAIL
    * Disk WDC_WD30EFRX-68EUZN0 WD-WCC4N6SS9F16 is FAULTED
Current alerts:
  * New ZFS version or feature flags are available for pool 'ssd1tb'. Upgrading pools is a one-time process that can prevent rolling the system back to an earlier TrueNAS version. It is recommended to read the TrueNAS release notes and confirm you need the new ZFS feature flags before upgrading a pool.
  * Pool fourtb state is DEGRADED: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.
The following devices are not healthy:
    * Disk TOSHIBA_HDWT840 92K2S0M9SRZH is UNAVAIL
    * Disk WDC_WD30EFRX-68EUZN0 WD-WCC4N6SS9F16 is FAULTED



The disk TOSHIBA_HDWT840 92K2S0M9SRZH shows as UNAVAIL in the UI but shows as ONLINE when I expand the spare. I am currently waiting for the results of the long SMART test on it, but can anyone explain why this would occur? A bad cable or something, I assume?

I believe this drive was set up as a spare for the pool in case a drive failed. Would that explain why it's UNAVAIL?

The drive is new, and it wouldn't be the first time I had a new drive fail within a month.


Screenshot 2023-09-14 at 7.43.15 a.m..png


Screenshot 2023-09-14 at 7.42.37 a.m..png


Screenshot 2023-09-14 at 7.42.23 a.m..png


Code:
smartctl -a /dev/sdd
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.107+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     TOSHIBA HDWT840
Serial Number:    92K2S0M9SRZH
LU WWN Device Id: 5 000039 bc561bc4f
Firmware Version: KQ0C0L
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Zoned Device:     Device managed zones
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Sep 14 08:08:52 2023 IST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (  120) seconds.
Offline data collection
capabilities:              (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 498) minutes.
SCT capabilities:            (0x003d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       7057
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       3
  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   050    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   050    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       401
 10 Spin_Retry_Count        0x0033   100   100   030    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       2
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       1
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       4823
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       34 (Min/Max 21/34)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   253   000    Old_age   Always       -       0
220 Disk_Shift              0x0002   100   100   000    Old_age   Always       -       0
222 Loaded_Hours            0x0032   100   100   000    Old_age   Always       -       57
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
224 Load_Friction           0x0022   100   100   000    Old_age   Always       -       0
226 Load-in_Time            0x0026   100   100   000    Old_age   Always       -       860
240 Head_Flying_Hours       0x0001   100   100   001    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       401         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
sdd is good. It has replaced sdf, which has faulted.
It's sdf you need to look at.
 

Grinas

Yeah, I see that one is dead, and it's an old drive, but that still doesn't explain why sdd was reported as unavailable when it is available.

Could there be something wrong with the RAID controller, or is something else on the way out?
 

NugentS

I don't know if this is the case, but have you rebooted? The sda/sdb/sdc names can and will change between reboots, depending on the order in which TrueNAS finds the drives.
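
Since the sdX letters can move around, it may be safer to match disks by serial number, as the alert itself does. A minimal Python sketch, assuming the output of `smartctl -a` has been saved as text; the function name `smart_identity` is made up for illustration:

```python
import re

def smart_identity(smartctl_text):
    """Return (model, serial) parsed from the information section of `smartctl -a` output."""
    fields = {}
    for line in smartctl_text.splitlines():
        match = re.match(r"^(Device Model|Serial Number):\s+(.*\S)\s*$", line)
        if match:
            fields[match.group(1)] = match.group(2)
    # Either value is None if the corresponding line was not found.
    return fields.get("Device Model"), fields.get("Serial Number")

# Excerpt copied from the sdd output posted earlier in this thread.
sample = """\
Device Model:     TOSHIBA HDWT840
Serial Number:    92K2S0M9SRZH
Firmware Version: KQ0C0L
"""
print(smart_identity(sample))
```

Comparing the serial returned here with the one named in the alert shows which physical disk a given sdX currently points at.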
 

Grinas

It was rebooted yesterday, but I only got the alert this morning.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
It might just be a quirk of the GUI's reporting.

ZFS would state a SPARE drive in use as "UNAVAIL" for NEW uses in the "Spare" section. So, until drive "sdf" is replaced, "sdd" is just busy (i.e. unavailable for new sparing purposes).
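
For what it's worth, on the command line `zpool status` normally labels a spare that has been pulled into service as INUSE under the `spares` heading, which reads a bit more clearly than UNAVAIL. A rough Python sketch for pulling those states out of saved `zpool status` text follows. The function name `spare_states` is made up, and the sample text is hand-written and heavily trimmed for illustration; it is not output from the system in this thread.

```python
def spare_states(zpool_status_text):
    """Map each device listed under the 'spares' heading to its reported state."""
    states = {}
    in_spares = False
    for line in zpool_status_text.splitlines():
        parts = line.split()
        if not parts:
            in_spares = False  # a blank line ends the config listing
            continue
        if parts == ["spares"]:
            in_spares = True
            continue
        if in_spares and len(parts) >= 2:
            states[parts[0]] = parts[1]
    return states

# Illustrative sample only: layout and columns are assumptions, not real output.
sample = """\
config:

        NAME         STATE
        fourtb       DEGRADED
          spare-1    DEGRADED
            sdf      FAULTED
            sdd      ONLINE
        spares
          sdd        INUSE     currently in use

errors: No known data errors
"""
print(spare_states(sample))
```

A spare that shows as INUSE here while also appearing as ONLINE inside a `spare-N` group would match the "busy, not broken" reading above.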
 

Grinas

After the bad drive is replaced by the spare drive in the pool, should it not still show the bad drive as faulted, or at least remove the bad drive from the list? The bad drive is passing SMART tests.

I would have thought the spare drive that replaced the bad drive would now show under the VDEV and not still as a spare. Is this normal?

Screenshot 2023-09-18 at 3.25.08 p.m..png



Code:
smartctl -a /dev/sdf
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.107+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD30EFRX-68EUZN0
Serial Number:    WD-WCC4N6SS9F16
LU WWN Device Id: 5 0014ee 20e541427
Firmware Version: 82.00A82
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Sep 18 15:20:46 2023 IST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (40020) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 401) minutes.
Conveyance self-test routine
recommended polling time:      (   5) minutes.
SCT capabilities:            (0x703d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       1
  3 Spin_Up_Time            0x0027   184   178   021    Pre-fail  Always       -       5783
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       270
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   027   027   000    Old_age   Always       -       53963
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       270
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       255
193 Load_Cycle_Count        0x0032   193   193   000    Old_age   Always       -       22845
194 Temperature_Celsius     0x0022   112   100   000    Old_age   Always       -       38
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     53397         -
# 2  Extended offline    Completed without error       00%     53201         -
# 3  Conveyance offline  Completed without error       00%     53174         -
# 4  Short offline       Completed without error       00%     52767         -
# 5  Short offline       Completed without error       00%     52755         -
# 6  Short offline       Completed without error       00%     52599         -
# 7  Short offline       Completed without error       00%     52587         -
# 8  Short offline       Completed without error       00%     52431         -
# 9  Short offline       Completed without error       00%     52419         -
#10  Short offline       Completed without error       00%     52264         -
#11  Short offline       Completed without error       00%     52252         -
#12  Short offline       Completed without error       00%     52048         -
#13  Short offline       Completed without error       00%     52036         -
#14  Short offline       Completed without error       00%     51880         -
#15  Short offline       Completed without error       00%     51868         -
#16  Short offline       Completed without error       00%     51712         -
#17  Short offline       Completed without error       00%     51700         -
#18  Short offline       Completed without error       00%     51548         -
#19  Short offline       Completed without error       00%     51536         -
#20  Short offline       Completed without error       00%     51308         -
#21  Short offline       Completed without error       00%     51296         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 

Arwen

No, a ZFS Spare remains a Spare (in use) and takes over the function of the failed drive.

The intent here is to allow you to replace your failed drive and, once the replacement has resilvered, the Spare drive returns to available status. Part of the reason for this is that Spares can be shared between pools. So, if you have a pool with 6TB disks and another with 10TB disks, using a Spare of 10TB for both pools works. But you don't necessarily want a 10TB disk to become a permanent part of your 6TB pool.

If you want to make the Spare a permanent replacement, you can do so, and then the faulted disk disappears from your pool, as does the temporary Spare Mirror vDev. And if your Spares list has no more spares, then it too disappears.
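
In plain ZFS terms, making the Spare permanent comes down to detaching the faulted disk with `zpool detach`. A small Python sketch that only builds and prints the command rather than running it; `detach_cmd` is a hypothetical helper, the device name must be whatever `zpool status` shows for that disk (TrueNAS often lists partition IDs rather than sdX names), and on TrueNAS the GUI is usually the safer route:

```python
import shlex

def detach_cmd(pool, device):
    """Build (but do not run) a `zpool detach` command line as an argument list."""
    return ["zpool", "detach", pool, device]

# Keep the spare permanently: detach the faulted disk from the spare group.
# "fourtb" is the pool named in the alert; "sdf" stands in for the faulted
# member's name as shown by `zpool status`, which may differ on a real system.
print(shlex.join(detach_cmd("fourtb", "sdf")))
```

Printing the command first leaves room to double-check the pool and device names before anything is changed.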
 
Joined
Oct 22, 2019
Messages
3,641
I'm just going to say it, I don't care at this point (and this type of confusion happens often, not just with TrueNAS, but ZFS in general.)

Developers and engineers need to let real-world users write descriptions and design layouts.

In what world does this make any intuitive sense?
spare-me-please.png


Can they not understand why this confuses people?

And why does "sdd" (under the "Spare" category) display as "Unavailable"? Why not word it as "In Use" or "Active"?


The way things are presented in a ZFS pool's vdev topology when spares are involved reminds me of this xkcd comic:
tar.png
 

Grinas

@Arwen OK, that does make sense, but it's not very intuitive what's going on.

At the moment it looks like I have 2 spares. If it were not for the alert, I would not know which drive is even problematic. I don't even know now whether the sdf drive is actually in use or whether the spare has taken over.

Also, if I look at Storage, it shows a warning symbol, but it shows sda as the drive failing a SMART test. If I look at the most recent SMART test result for that drive, it shows it passed. I never got an alert for this failed SMART test result.

Screenshot 2023-09-18 at 7.42.37 p.m..png


Screenshot 2023-09-18 at 7.49.19 p.m..png

Should I be replacing both sda and sdf even though they are not showing as degraded?

Code:
smartctl -a /dev/sda
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.107+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD30EFRX-68EUZN0
Serial Number:    WD-WCC4N0XP648R
LU WWN Device Id: 5 0014ee 2640bcb11
Firmware Version: 82.00A82
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Sep 18 19:44:01 2023 IST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (40080) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 402) minutes.
Conveyance self-test routine
recommended polling time:      (   5) minutes.
SCT capabilities:            (0x703d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       4
  3 Spin_Up_Time            0x0027   185   180   021    Pre-fail  Always       -       5725
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       240
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   038   038   000    Old_age   Always       -       45531
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       240
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       239
193 Load_Cycle_Count        0x0032   199   199   000    Old_age   Always       -       5405
194 Temperature_Celsius     0x0022   113   097   000    Old_age   Always       -       37
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     44953         310455104
# 2  Extended offline    Completed without error       00%     44765         -
# 3  Conveyance offline  Completed without error       00%     44738         -
# 4  Extended offline    Completed without error       00%     35590         -
# 5  Extended offline    Completed without error       00%     32877         -
# 6  Extended offline    Completed without error       00%     32853         -
# 7  Extended offline    Completed without error       00%     32783         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 

Arwen

The "sda" disk has a "Load_Cycle_Count" of 5405, which seems like too many. Not sure what to do about that. If you look at the SMART Extended offline test, it lists a bad block. But, since you have no reallocated blocks, it would appear that block is not in use.
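
For anyone who wants to pull those two figures out of the smartctl text rather than eyeballing them, here is a rough Python sketch. The helper names `flagged_self_tests` and `load_cycles_per_hour` are made up for illustration, and the numbers are copied from the outputs posted in this thread. It lists self-test entries whose status is anything other than "Completed without error", and divides Load_Cycle_Count by Power_On_Hours so drives of different ages can be compared:

```python
import re

def flagged_self_tests(log_text):
    """Return (test type, lifetime hours, first-error LBA) for every self-test
    log entry whose status is not 'Completed without error'."""
    flagged = []
    for line in log_text.splitlines():
        if not line.startswith("#"):
            continue
        # Columns in the smartctl self-test log are separated by 2+ spaces.
        cols = re.split(r"\s{2,}", line.strip())
        if len(cols) != 6:
            continue
        _num, test, status, _remaining, hours, lba = cols
        if status != "Completed without error":
            flagged.append((test, int(hours), None if lba == "-" else int(lba)))
    return flagged

def load_cycles_per_hour(load_cycle_count, power_on_hours):
    """Raw Load_Cycle_Count divided by raw Power_On_Hours."""
    return load_cycle_count / power_on_hours

# First three self-test log lines from the sda output in this thread.
sda_log = """\
# 1  Extended offline    Completed: read failure       90%     44953         310455104
# 2  Extended offline    Completed without error       00%     44765         -
# 3  Conveyance offline  Completed without error       00%     44738         -
"""
print(flagged_self_tests(sda_log))

# (Load_Cycle_Count, Power_On_Hours) raw values as posted for each drive.
for name, cycles, hours in [("sdd", 4823, 401), ("sdf", 22845, 53963), ("sda", 5405, 45531)]:
    print(name, round(load_cycles_per_hour(cycles, hours), 2))
```

On the figures posted here, sda comes out with the lowest load-cycle rate of the three drives and the new sdd with the highest, so the raw count on its own may not say much.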

If it were my disk, I would hold off replacing "sda" for a while. You have a spare (once you replace "sdf").

The GUI not showing "sdf" as failed is more of a GUI issue. If the Spare took over and you can't find another cause, replace it.
 

Grinas

It's not clear in the UI whether the spare has taken over. It looks like I have two spares.
 

Arwen

No, the GUI is not showing you 2 spares. It is the way ZFS displays a spare replacement while it is in effect. Yes, it may be confusing to people new to ZFS, but it is what it is.

Ideally the GUI would be clearer, as that is written by iXsystems. For example, the part where it shows 2 disks under SPARE, I personally would use a variation on the wording, perhaps "SPARING" or "SPARE IN USE". Then under the "Spare" vDev, I would not have "UNAVAIL" but again "SPARE IN USE".
 