One disk in RAID Z2 pool has failed both long and short SMART tests - should I replace?

testfire10

Dabbler
Joined
Aug 1, 2021
Messages
46
Hi everyone. I have a pool with 6 disks in RAID Z2. In the last few days, one of the disks has failed both the long and short SMART tests. Below, I've given the results from some of the smartctl commands I've run. My question is: should I replace this disk now? Has it actually failed? Are there other commands I should run to ascertain its condition?

Code:
12/29/2021
truenas# smartctl -i -H -A /dev/da0
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p11 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              H7280B520SUN8.0T
Revision:             A3Y1
Compliance:           SPC-4
User Capacity:        7,865,536,647,168 bytes [7.86 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
Formatted with type 1 protection
8 bytes of protection information per logical block
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca25257042c
Serial number:        001809SJVJJN        7SHJVJJN
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Wed Dec 29 11:38:04 2021 PST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned <not available>
Power on minutes since format <not available>
Current Drive Temperature:     21 C
Drive Trip Temperature:        85 C

Accumulated power on time, hours:minutes 28470:47
Manufactured in week 09 of year 2018
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  37
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  1200
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 14309183256002560

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0     6697         0      6697    7597776    4499917.559           0
write:         0        0         0         0    2092002     191683.219           0
verify:        0        0         0         0     203765          0.007           0

Non-medium error count:        0

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Failed in segment -->       3   28387                 - [0x1 0xb 0x96]
# 2  Background long   Failed in segment -->       3   28298                 - [0x1 0xb 0x96]
# 3  Background short  Failed in segment -->       3   28298                 - [0x1 0xb 0x96]
# 4  Background short  Failed in segment -->       3   28219                 - [0x1 0xb 0x96]
# 5  Background short  Completed                   -   28051                 - [-   -    -]
# 6  Background short  Completed                   -   27883                 - [-   -    -]
# 7  Background short  Completed                   -   27715                 - [-   -    -]
# 8  Background short  Completed                   -   27547                 - [-   -    -]
# 9  Background short  Completed                   -   27379                 - [-   -    -]
#10  Background short  Completed                   -   27210                 - [-   -    -]
#11  Background short  Completed                   -   27042                 - [-   -    -]
#12  Background short  Completed                   -   26874                 - [-   -    -]
#13  Background short  Completed                   -   26706                 - [-   -    -]
#14  Background short  Completed                   -   26538                 - [-   -    -]
#15  Background short  Completed                   -   26370                 - [-   -    -]
#16  Background short  Completed                   -   26202                 - [-   -    -]
#17  Background short  Completed                   -   26034                 - [-   -    -]
#18  Background short  Completed                   -   25866                 - [-   -    -]
#19  Background short  Completed                   -   25699                 - [-   -    -]
#20  Background short  Completed                   -   25531                 - [-   -    -]

Long (extended) Self-test duration: 59291 seconds [988.2 minutes]
 

Alecmascot

Guru
Joined
Mar 18, 2014
Messages
1,177
Since your pool is RAIDZ2, I would monitor the corrected error counts for a few days to see whether they are growing, and how fast, and then make the call.
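
For anyone who wants to track those counters without eyeballing the output every day, here is a minimal Python sketch. It assumes the SCSI/SAS "Error counter log" layout shown in the first post; the field names, function names, and `SAMPLE` variable are made up for illustration and are not part of smartctl. It only parses text that has already been captured (for example, by saving the output of `smartctl -A /dev/da0` once a day), so how snapshots are collected and stored is left open.

```python
import re

# Column order of the "Error counter log" rows in the smartctl output above.
# These key names are illustrative labels, not smartctl identifiers.
FIELDS = (
    "ecc_fast",
    "ecc_delayed",
    "rereads_rewrites",
    "total_corrected",
    "algorithm_invocations",
    "gigabytes_processed",
    "total_uncorrected",
)

ROW = re.compile(r"^(read|write|verify):\s+(.+)$")


def parse_error_counters(smartctl_output):
    """Return {'read': {...}, 'write': {...}, 'verify': {...}} from smartctl text."""
    counters = {}
    for line in smartctl_output.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue
        values = match.group(2).split()
        if len(values) != len(FIELDS):
            continue  # unexpected layout, skip rather than guess
        counters[match.group(1)] = dict(zip(FIELDS, map(float, values)))
    return counters


def growth(earlier, later, op="read"):
    """Difference in corrected/uncorrected totals between two snapshots."""
    return {
        name: later[op][name] - earlier[op][name]
        for name in ("total_corrected", "total_uncorrected")
    }


# Rows copied from the output in the first post.
SAMPLE = """\
read:          0     6697         0      6697    7597776    4499917.559           0
write:         0        0         0         0    2092002     191683.219           0
verify:        0        0         0         0     203765          0.007           0
"""

snapshot = parse_error_counters(SAMPLE)
print(snapshot["read"]["total_corrected"], snapshot["read"]["total_uncorrected"])
```

Calling `growth(yesterday, today)` on two parsed snapshots gives the change in corrected and uncorrected read errors. What counts as "growing too fast" is a judgment call that this sketch does not try to make.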
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
IMO, failing SMART self-tests, particularly consistently failing tests, are grounds for immediate disk replacement. A handful of offline/uncorrectable sectors won't do it for me, if the number is low (i.e., single digits) and stable. But if the disk is failing its own internal self-tests, and doing so consistently, it's a dead disk spinning.
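
To make "consistently failing" concrete, here is a rough Python sketch that counts how many of the most recent self-test entries failed in a row. It assumes the SCSI self-test log layout from the first post; the regular expression and the function names are my own illustration, not anything smartctl provides, and other drives or smartctl versions may format the log differently.

```python
import re

# Matches rows such as:
# "# 1  Background short  Failed in segment -->       3   28387   - [0x1 0xb 0x96]"
ENTRY = re.compile(r"^#\s*(\d+)\s+(\S+\s+\S+)\s+(.+?)\s+(\d+|-)\s+(\d+)\s+")


def parse_selftest_log(text):
    """Return a list of dicts, one per self-test log row."""
    entries = []
    for line in text.splitlines():
        match = ENTRY.match(line.strip())
        if match is None:
            continue  # header lines and blanks do not start with '#'
        entries.append({
            "num": int(match.group(1)),
            "test": " ".join(match.group(2).split()),
            "status": match.group(3).strip(),
            "lifetime_hours": int(match.group(5)),
        })
    return entries


def recent_consecutive_failures(entries):
    """Count failed tests from the newest entry (#1) until the first pass."""
    count = 0
    for entry in sorted(entries, key=lambda e: e["num"]):
        if not entry["status"].startswith("Failed"):
            break
        count += 1
    return count


# First six rows copied from the self-test log in the first post.
SAMPLE_LOG = """\
# 1  Background short  Failed in segment -->       3   28387                 - [0x1 0xb 0x96]
# 2  Background long   Failed in segment -->       3   28298                 - [0x1 0xb 0x96]
# 3  Background short  Failed in segment -->       3   28298                 - [0x1 0xb 0x96]
# 4  Background short  Failed in segment -->       3   28219                 - [0x1 0xb 0x96]
# 5  Background short  Completed                   -   28051                 - [-   -    -]
# 6  Background short  Completed                   -   27883                 - [-   -    -]
"""

entries = parse_selftest_log(SAMPLE_LOG)
print(recent_consecutive_failures(entries))
```

On the log posted above, the four newest entries are failures and everything older completed, which is the pattern described here as consistent failure.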
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
smartctl -a /dev/da0 might explain what's going on. But what's the point of running SMART tests if you don't trust the results and fail to act accordingly? Failed tests should mean "replace as soon as possible".
How are the remaining disks doing?
 

testfire10

Dabbler
Joined
Aug 1, 2021
Messages
46
I see the value in the SMART tests. It's not a matter of 'trusting' the results; it's more a matter of interpreting them to understand what they mean. The remaining disks are doing OK (in terms of SMART tests, i.e. no failures). I've since added a hot spare, and have purchased a backup disk to replace this one in the event it does fail.
 