Bad/Failed Drive?

rmccullough

Patron
Joined
May 17, 2018
Messages
269
I suspect I have a failed drive, but want to be sure before I "replace" it with a hot backup.

I received an alert the other day:
* Pool Tank state is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.
The following devices are not healthy:
  • Disk 14681939649493252335 is FAULTED
So I logged into the shell and ran a couple of commands:
Code:
root@freenas[~]# zpool status
  pool: Tank
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 2.63M in 04:50:57 with 0 errors on Sat Jan  1 04:51:00 2022
config:

        NAME                                            STATE     READ WRITE CKSUM
        Tank                                            DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/3aceec98-a7d1-11e8-b311-0cc47a303600  ONLINE       0     0     0
            gptid/850946ad-bd0e-11e8-9a6f-0cc47a303600  ONLINE       0     0     0
            gptid/3fe310b6-a7d1-11e8-b311-0cc47a303600  FAULTED    147     0     0  too many errors
            gptid/40f281b9-a7d1-11e8-b311-0cc47a303600  ONLINE       0     0     0
            gptid/447abee3-a7d1-11e8-b311-0cc47a303600  ONLINE       0     0     0
            gptid/458f64b0-a7d1-11e8-b311-0cc47a303600  ONLINE       0     0     0
            gptid/491ca6b0-a7d1-11e8-b311-0cc47a303600  ONLINE       0     0     0
            gptid/4a41f4ea-a7d1-11e8-b311-0cc47a303600  ONLINE       0     0     0
            gptid/4dcf9801-a7d1-11e8-b311-0cc47a303600  ONLINE       0     0     0

errors: No known data errors
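The status output identifies the faulted member only by its gptid, while smartctl needs a device node such as /dev/da1. A minimal sketch of how the two could be matched up, assuming a FreeBSD-based FreeNAS shell: the helper name `faulted_vdevs` is hypothetical, and the partition name in the comment (da1p2) is an assumed example, not something shown in the thread.

```shell
#!/bin/sh
# Print the name of every pool member whose STATE column reads FAULTED.
# Reads `zpool status` output on stdin (hypothetical helper, not a ZFS tool).
faulted_vdevs() {
    awk '$2 == "FAULTED" { print $1 }'
}

# Usage on the live system:
#   zpool status Tank | faulted_vdevs
# The gptid can then be looked up in the GEOM label table, which lists the
# partition that carries it (for example da1p2, an assumed name):
#   glabel status | grep 3fe310b6
```

Checking the serial number reported by smartctl against the label on the physical drive is a sensible extra step before pulling anything.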


I read some posts here that also suggested getting the output of smartctl:
Code:
root@freenas[~]# smartctl -a /dev/da1
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p10 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HUS724020ALS641
Revision:             MS04
Compliance:           SPC-4
User Capacity:        2,000,398,934,016 bytes [2.00 TB]
Logical block size:   512 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca06d0c350c
Serial number:        P5G6R3RV
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Sat Jan  1 10:30:55 2022 MST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     23 C
Drive Trip Temperature:        55 C

Accumulated power on time, hours:minutes 29488:15
Manufactured in week 06 of year 2015
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  26
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  1255
Elements in grown defect list: 571

Vendor (Seagate Cache) information
  Blocks sent to initiator = 22796998390317056

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:      12348      600         0     12948    4737783      52691.712           4
write:         0        0         0         0     547287      28130.329           0
verify:        0        0         0         0     171204          0.000           0

Non-medium error count:        0

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -   29382                 - [-   -    -]
# 2  Background short  Completed                   -   29214                 - [-   -    -]
# 3  Background long   Completed                   -   29050                 - [-   -    -]
# 4  Background short  Completed                   -   28878                 - [-   -    -]
# 5  Background short  Completed                   -   28662                 - [-   -    -]
# 6  Background short  Completed                   -   28494                 - [-   -    -]
# 7  Background long   Completed                   -   28330                 - [-   -    -]
# 8  Background short  Completed                   -   28157                 - [-   -    -]
# 9  Background short  Completed                   -   27917                 - [-   -    -]
#10  Background short  Completed                   -   27749                 - [-   -    -]
#11  Background long   Completed                   -   27588                 - [-   -    -]
#12  Background short  Completed                   -   27412                 - [-   -    -]
#13  Background short  Completed                   -   27197                 - [-   -    -]
#14  Background short  Completed                   -   27029                 - [-   -    -]
#15  Background long   Completed                   -   26865                 - [-   -    -]
#16  Background short  Completed                   -   26693                 - [-   -    -]
#17  Background short  Completed                   -   26453                 - [-   -    -]
#18  Background short  Completed                   -   26285                 - [-   -    -]
#19  Background long   Completed                   -   26122                 - [-   -    -]
#20  Background short  Completed                   -   25949                 - [-   -    -]

Long (extended) Self-test duration: 6 seconds [0.1 minutes]


I hate to just assume the drive is bad, but it looks like it is. Anything else I should try before I start the replacement process?

Is the "Replacement" section of the documentation the right process to follow for replacing the disk?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
smartctl confirms that some read errors were not corrected internally by the drive (the error counter log shows 4 total uncorrected read errors), so it makes sense to take the drive out of the pool and run something like badblocks on it separately before trusting it again (if you can).

That link is to the right part of the documentation.
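For anyone following along, the two figures worth watching in the smartctl output above are the grown defect list and the uncorrected read errors. A minimal sketch that pulls them out of `smartctl -a` output for a SAS drive; the function name `smart_summary` is hypothetical, and the field positions are assumed from the output format posted in this thread.

```shell
#!/bin/sh
# Summarize two fields from `smartctl -a` output supplied on stdin:
# the grown defect count and the total uncorrected read errors
# (the last column of the "read:" row in the error counter log).
smart_summary() {
    awk '
        /^Elements in grown defect list:/ { defects = $NF }
        /^read:/                          { uncorrected = $NF }
        END { printf "grown_defects=%s read_uncorrected=%s\n", defects, uncorrected }
    '
}

# Usage on the live system (device name taken from the thread):
#   smartctl -a /dev/da1 | smart_summary
```

Running this periodically and comparing the numbers gives a quick view of whether the drive is still accumulating defects.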
 

rmccullough

Thanks for the response @sretalla. Can you explain "badblocks" a bit more?

Here is an updated smartctl output after the long test completed:
Code:
root@freenas[~]# smartctl -a /dev/da1
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p10 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HUS724020ALS641
Revision:             MS04
Compliance:           SPC-4
User Capacity:        2,000,398,934,016 bytes [2.00 TB]
Logical block size:   512 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca06d0c350c
Serial number:        P5G6R3RV
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Wed Jan 12 10:57:35 2022 MST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     25 C
Drive Trip Temperature:        55 C

Accumulated power on time, hours:minutes 29752:42
Manufactured in week 06 of year 2015
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  26
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  1266
Elements in grown defect list: 629

Vendor (Seagate Cache) information
  Blocks sent to initiator = 22796998390317056

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:      12348      600         0     12948    4778515      52691.712           4
write:         0        0         0         0     547287      28130.329           0
verify:        0        0         0         0     202181          0.000           0

Non-medium error count:        0

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Failed in segment -->       7   29737         428249648 [0x3 0x5d 0x1]
# 2  Background short  Completed                   -   29622                 - [-   -    -]
# 3  Background short  Completed                   -   29382                 - [-   -    -]
# 4  Background short  Completed                   -   29214                 - [-   -    -]
# 5  Background long   Completed                   -   29050                 - [-   -    -]
# 6  Background short  Completed                   -   28878                 - [-   -    -]
# 7  Background short  Completed                   -   28662                 - [-   -    -]
# 8  Background short  Completed                   -   28494                 - [-   -    -]
# 9  Background long   Completed                   -   28330                 - [-   -    -]
#10  Background short  Completed                   -   28157                 - [-   -    -]
#11  Background short  Completed                   -   27917                 - [-   -    -]
#12  Background short  Completed                   -   27749                 - [-   -    -]
#13  Background long   Completed                   -   27588                 - [-   -    -]
#14  Background short  Completed                   -   27412                 - [-   -    -]
#15  Background short  Completed                   -   27197                 - [-   -    -]
#16  Background short  Completed                   -   27029                 - [-   -    -]
#17  Background long   Completed                   -   26865                 - [-   -    -]
#18  Background short  Completed                   -   26693                 - [-   -    -]
#19  Background short  Completed                   -   26453                 - [-   -    -]
#20  Background short  Completed                   -   26285                 - [-   -    -]

Long (extended) Self-test duration: 6 seconds [0.1 minutes]
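Compared with the first run, the grown defect list went from 571 to 629 elements, and the newest long test failed with sense key 0x3, which is the SCSI code for a medium error. A minimal sketch for picking failed entries out of the self-test log; the helper name `failed_selftests` is hypothetical, and the column layout is assumed from the output above.

```shell
#!/bin/sh
# Print lifetime hours, first failing LBA and sense data for every failed
# entry in the "SMART Self-test log" section of `smartctl -a` output (stdin).
failed_selftests() {
    awk '/^# *[0-9]+ / && /Failed in segment/ {
        # Fields are counted from the end of the line because the test
        # number is split differently for "# 1" and "#10".
        printf "hours=%s lba=%s sense=%s %s %s\n", $(NF-4), $(NF-3), $(NF-2), $(NF-1), $NF
    }'
}

# Usage on the live system (device name taken from the thread):
#   smartctl -a /dev/da1 | failed_selftests
```

An empty result means the log contains no failed tests; here it would report the failure at 29737 hours.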
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
Explanations for the badblocks test are here.
Note this is a DESTRUCTIVE test, and that it takes DAYS to complete.
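As a rough illustration only, a destructive badblocks run usually combines the write-mode flag (-w) with progress output (-s). The sketch below only prints the command for review rather than running it; the helper name `badblocks_cmd` and the 4096-byte block size are assumptions, and the linked explanation remains the reference for the actual procedure.

```shell
#!/bin/sh
# Build (but do not run) a destructive badblocks command line for review.
# $1: device node of a disk that is no longer part of any pool.
# -b 4096 : test block size in bytes (an assumed value)
# -w      : write-mode test, which overwrites every block on the disk
# -s      : show progress while the test runs
badblocks_cmd() {
    printf 'badblocks -b 4096 -ws %s\n' "$1"
}

# Example, using the device name from the thread:
#   badblocks_cmd /dev/da1
```

Printing the command first leaves room to double-check the device name, since running it against a disk that is still in the pool would destroy that disk's data.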
 

rmccullough

I had some time to try the replacement process for this failed disk. I started by following the steps from the Replacement documentation.

I am trying to move the disk "Offline", but it does not seem to complete. I click the triple-dot menu, and select "Offline". From the prompt, I click the "Confirm" checkbox and then click the "OFFLINE" button. I get the spinner for a few seconds, then the spinner disappears and the disk still shows as "FAULTED".

I am seeing this added in dmesg:
Code:
GEOM_ELI: Device mirror/swap2.eli destroyed.
GEOM_MIRROR: Device swap2: provider destroyed.
GEOM_MIRROR: Device swap2 destroyed.
GEOM_ELI: Device mirror/swap1.eli destroyed.
GEOM_MIRROR: Device swap1: provider destroyed.
GEOM_MIRROR: Device swap1 destroyed.
GEOM_ELI: Device mirror/swap0.eli destroyed.
GEOM_MIRROR: Device swap0: provider destroyed.
GEOM_MIRROR: Device swap0 destroyed.
GEOM_MIRROR: Device mirror/swap0 launched (3/3).
GEOM_MIRROR: Device mirror/swap1 launched (3/3).
GEOM_MIRROR: Device mirror/swap2 launched (3/3).
GEOM_ELI: Device mirror/swap0.eli created.
GEOM_ELI: Encryption: AES-XTS 128
GEOM_ELI:     Crypto: hardware
GEOM_ELI: Device mirror/swap1.eli created.
GEOM_ELI: Encryption: AES-XTS 128
GEOM_ELI:     Crypto: hardware
GEOM_ELI: Device mirror/swap2.eli created.
GEOM_ELI: Encryption: AES-XTS 128
GEOM_ELI:     Crypto: hardware


And I get something similar in my IPMI console:
[Attached image: failed_to_offline_console.png]


I am going to try the scrub to see if it helps, but wanted to proactively ask if there is something I am missing.
 

sretalla

Faulted is a state that equates to offline, so maybe you're already where you want to be.
 