ZFS Drive failure

KevDog

Patron
Joined
Nov 26, 2016
Messages
462
Hi, I'm running an eight-disk ZFS RAIDZ2 array with one hot spare. The system notified me of a drive failure.

I just want to make sure I'm interpreting results accurately. I believe the offending drive is da5:

Code:
 
pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 16.8G in 00:22:23 with 0 errors on Fri Sep  2 08:50:42 2022
config:

    NAME                                              STATE     READ WRITE CKSUM
    tank                                              ONLINE       0     0     0
      raidz2-0                                        ONLINE       0     0     0
        gptid/16f8514f-6a3d-11ea-83c6-0cc47a84a594    ONLINE       0     0     0
        gptid/2eff3431-d2f0-11e6-8e60-0cc47a84a594    ONLINE       0     0     0
        gptid/2fad6079-d2f0-11e6-8e60-0cc47a84a594    ONLINE       0     0     0
        gptid/96f13243-a9e5-11ea-ab76-0cc47a84a594    ONLINE       0     0     0
        gptid/310fd248-d2f0-11e6-8e60-0cc47a84a594    ONLINE       0     0     0
        spare-5                                       ONLINE       0     0     0
          gptid/31c62952-d2f0-11e6-8e60-0cc47a84a594  ONLINE       0     1     0
          gptid/a203d5ad-49e3-11ea-9739-0cc47a84a594  ONLINE       0     0     0
        gptid/32845d1c-d2f0-11e6-8e60-0cc47a84a594    ONLINE       0     0     0
        gptid/3338ea10-d2f0-11e6-8e60-0cc47a84a594    ONLINE       0     0     0
    cache
      gptid/752998f6-094f-11ec-9a66-0cc47a84a594      ONLINE       0     0     0
    spares
      gptid/a203d5ad-49e3-11ea-9739-0cc47a84a594      INUSE     currently in use

errors: No known data errors


Code:
# glabel status
                                      Name  Status  Components
gptid/622dd1d9-cdae-11e6-be8a-0cc47a84a594     N/A  ada0p1
gptid/0df883e1-d080-11e6-8e60-0cc47a84a594     N/A  ada1p1
gptid/a203d5ad-49e3-11ea-9739-0cc47a84a594     N/A  ada3p2
gptid/752998f6-094f-11ec-9a66-0cc47a84a594     N/A  ada2p1
gptid/16f8514f-6a3d-11ea-83c6-0cc47a84a594     N/A  da0p2
gptid/96f13243-a9e5-11ea-ab76-0cc47a84a594     N/A  da3p2
gptid/310fd248-d2f0-11e6-8e60-0cc47a84a594     N/A  da4p2
gptid/2fad6079-d2f0-11e6-8e60-0cc47a84a594     N/A  da2p2
gptid/2eff3431-d2f0-11e6-8e60-0cc47a84a594     N/A  da1p2
gptid/31c62952-d2f0-11e6-8e60-0cc47a84a594     N/A  da5p2
gptid/3338ea10-d2f0-11e6-8e60-0cc47a84a594     N/A  da7p2
gptid/32845d1c-d2f0-11e6-8e60-0cc47a84a594     N/A  da6p2


Code:
# sas3ircu 0 DISPLAY
...
...
Device is a Hard disk
  Enclosure #                             : 1
  Slot #                                  : 5
  SAS Address                             : 4433221-1-0500-0000
  State                                   : Ready (RDY)
  Size (in MB)/(in sectors)               : 5723166/11721045167
  Manufacturer                            : ATA
  Model Number                            : WDC WD60EFRX-68L
  Firmware Revision                       : 0A82
  Serial No                               : WDWXL1H16RYYAK
  Unit Serial No(VPD)                     : WD-WXL1H16RYYAK
  GUID                                    : N/A
  Protocol                                : SATA
  Drive Type                              : SATA_HDD

...
...


I believe gptid/31c62952-d2f0-11e6-8e60-0cc47a84a594 (da5) is experiencing a write failure, going by the WRITE column of the zpool status table, and that the spare was then activated to take over. I'm running a smartctl long test on the offending drive at the moment. Nothing showed up on the short smartctl test of the drive. I just wanted to ensure I'm interpreting the results correctly.
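
For anyone wanting to do this cross-reference without eyeballing the tables, here is a minimal Python sketch. It assumes you have already captured the text of `zpool status` and `glabel status` (the sample strings below are trimmed copies of the output in this post). The function name `failing_devices` is made up for illustration, and the parser only understands plain integer counters, so abbreviated counts would be skipped.

```python
# Trimmed samples of the command output shown in this thread.
ZPOOL_SAMPLE = """\
    NAME                                              STATE     READ WRITE CKSUM
    tank                                              ONLINE       0     0     0
      raidz2-0                                        ONLINE       0     0     0
        gptid/310fd248-d2f0-11e6-8e60-0cc47a84a594    ONLINE       0     0     0
        spare-5                                       ONLINE       0     0     0
          gptid/31c62952-d2f0-11e6-8e60-0cc47a84a594  ONLINE       0     1     0
          gptid/a203d5ad-49e3-11ea-9739-0cc47a84a594  ONLINE       0     0     0
    spares
      gptid/a203d5ad-49e3-11ea-9739-0cc47a84a594      INUSE     currently in use
"""

GLABEL_SAMPLE = """\
                                      Name  Status  Components
gptid/a203d5ad-49e3-11ea-9739-0cc47a84a594     N/A  ada3p2
gptid/310fd248-d2f0-11e6-8e60-0cc47a84a594     N/A  da4p2
gptid/31c62952-d2f0-11e6-8e60-0cc47a84a594     N/A  da5p2
"""


def failing_devices(zpool_status, glabel_status):
    """Return (gptid, partition, read, write, cksum) for each gptid vdev
    whose READ/WRITE/CKSUM counters are not all zero."""
    # Build a label -> partition map from the three-column glabel table.
    label_to_dev = {}
    for line in glabel_status.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0].startswith("gptid/"):
            label_to_dev[parts[0]] = parts[2]

    results = []
    for line in zpool_status.splitlines():
        parts = line.split()
        # Config rows have five columns: NAME STATE READ WRITE CKSUM.
        # Rows with non-integer counters (e.g. the "spares" entry) are skipped.
        if (len(parts) == 5 and parts[0].startswith("gptid/")
                and all(p.isdigit() for p in parts[2:])):
            read, write, cksum = (int(p) for p in parts[2:])
            if read or write or cksum:
                dev = label_to_dev.get(parts[0], "unknown")
                results.append((parts[0], dev, read, write, cksum))
    return results


if __name__ == "__main__":
    for row in failing_devices(ZPOOL_SAMPLE, GLABEL_SAMPLE):
        print(row)
```

Run against the samples, this flags only the 31c62952 label, mapped to da5p2, with a single write error. That matches reading the tables by hand.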
 

KevDog

Running the smartctl long test, I do indeed get a read failure:

Code:
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       10%     46393         2889891760
# 2  Extended offline    Interrupted (host reset)      50%     46378         -
# 3  Short captive       Interrupted (host reset)      90%     46372         -
# 4  Short captive       Interrupted (host reset)      90%     46372         -
# 5  Short offline       Completed without error       00%     46372         -
# 6  Short offline       Completed without error       00%     46314         -
# 7  Short offline       Completed without error       00%     46146         -
# 8  Short offline       Completed without error       00%     45978         -
# 9  Short offline       Completed without error       00%     45810         -
#10  Short offline       Completed without error       00%     45642         -
#11  Short offline       Completed without error       00%     45474         -
#12  Short offline       Completed without error       00%     45307         -
#13  Short offline       Completed without error       00%     45139         -
#14  Short offline       Completed without error       00%     44975         -
#15  Short offline       Completed without error       00%     44807         -
#16  Short offline       Completed without error       00%     44639         -
#17  Short offline       Completed without error       00%     44471         -
#18  Short offline       Completed without error       00%     44303         -
#19  Short offline       Completed without error       00%     44136         -
#20  Short offline       Completed without error       00%     43968         -
#21  Short offline       Completed without error       00%     43800         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


So my question is: is this a sign of a failing drive (probably), or is it something I could fix by, for example, reformatting the drive and letting ZFS rebuild the array?
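
As a rough aid for reading a self-test log like the one above, the following Python sketch picks out the rows that report an LBA of first error. It assumes the column layout shown in this post, with columns separated by two or more spaces, and it treats a numeric `LBA_of_first_error` as the sign of a failed test. The name `failed_self_tests` is hypothetical.

```python
import re

# Sample rows copied from the self-test log in this thread.
SELFTEST_SAMPLE = """\
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       10%     46393         2889891760
# 2  Extended offline    Interrupted (host reset)      50%     46378         -
# 5  Short offline       Completed without error       00%     46372         -
#10  Short offline       Completed without error       00%     45642         -
"""

# Columns: number, description, status, remaining %, lifetime hours, LBA.
ROW = re.compile(
    r"^#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$"
)


def failed_self_tests(log_text):
    """Return (num, description, status, lifetime_hours, lba) for every
    log row whose LBA_of_first_error column holds a number."""
    failures = []
    for line in log_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header lines or anything not shaped like a log row
        num, desc, status, _remaining, hours, lba = match.groups()
        if lba.isdigit():  # "-" means no error LBA was recorded
            failures.append((int(num), desc, status, int(hours), int(lba)))
    return failures


if __name__ == "__main__":
    for entry in failed_self_tests(SELFTEST_SAMPLE):
        print(entry)
```

On the sample rows this reports only test #1, the extended offline test that ended with a read failure at LBA 2889891760 after 46393 power-on hours.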
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Your drive is failing. The zpool status shows a write error on that disk, and SMART shows a read failure during a self-test. Please run the replacement procedure with a new, burned-in drive.
 

KevDog

Hey thanks.
 