So I've been looking around the forum (and Google) and I've seen some old threads relating to this, but I'm unsure if they're still relevant. From what I've read, SMART data isn't very accurate on SAS drives. Is that still the case?
Basically my setup is a TrueNAS SCALE VM running under ESXi with 4 cores allocated to it. My issue is that 4 of the 5 drives failed a SMART long test started from the GUI. One of those drives is faulted, but the other 3 all report an error like this one in the SMART data:
# 1 Background long Failed in segment --> 6 65535 6417431600 [0x3 0x16 0x0]
Since every drive failed in the same segment, does that indicate an issue with my server rather than with the drives, or is it a coincidence and they're all on their way out?
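For reference, the bracketed triplet at the end of a failed self-test line is the SCSI sense key, additional sense code (ASC) and qualifier (ASCQ), and the number before it is the LBA of the first error. Below is a minimal sketch of decoding such a line, assuming the standard SCSI code assignments. The lookup tables cover only the codes that appear in this thread, and the function name `decode_selftest_line` is made up for illustration.

```python
import re

# Only the codes seen in this thread; anything else decodes as "unknown".
SENSE_KEYS = {0x3: "MEDIUM ERROR"}
ASC_ASCQ = {
    (0x11, 0x00): "UNRECOVERED READ ERROR",
    (0x16, 0x00): "DATA SYNCHRONIZATION MARK ERROR",
}

# LBA of the first error, followed by the [SK ASC ASQ] triplet in hex.
_FAILED = re.compile(
    r"(\d+)\s+\[(0x[0-9a-fA-F]+)\s+(0x[0-9a-fA-F]+)\s+(0x[0-9a-fA-F]+)\]"
)

def decode_selftest_line(line):
    """Return LBA and decoded sense data for a failed self-test line, else None."""
    m = _FAILED.search(line)
    if m is None:
        return None  # e.g. a "Completed" line, which shows "- [-   -    -]"
    sk, asc, ascq = (int(m.group(i), 16) for i in (2, 3, 4))
    return {
        "lba": int(m.group(1)),
        "sense_key": SENSE_KEYS.get(sk, "unknown"),
        "additional_sense": ASC_ASCQ.get((asc, ascq), "unknown"),
    }

line = "# 1 Background long Failed in segment --> 6 65535 6417431600 [0x3 0x16 0x0]"
print(decode_selftest_line(line))
```

Both codes fall under the medium error sense key, which points at the platters of the individual drive rather than at the transport. The differing first-error LBAs in the full logs are also worth comparing before concluding that every drive failed in the same place.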
Full results of the SMART tests
Code:
=== START OF INFORMATION SECTION ===
Vendor:               WD
Product:              WD4001FYYG-01SL3
Revision:             VRA9
Compliance:           SPC-4
User Capacity:        4,000,787,030,016 bytes [4.00 TB]
Logical block size:   512 bytes
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x50000c0f01e055a4
Serial number:        WMC1F1678693
Device type:          disk
Transport protocol:   SAS (SPL-4)
Local Time is:        Thu Feb 22 21:50:49 2024 EST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Current Drive Temperature:     40 C
Drive Trip Temperature:        40 C
Accumulated power on time, hours:minutes 81963:00
Manufactured in week 48 of year 2013
Specified cycle count over device lifetime:  1048576
Accumulated start-stop cycles:  59
Specified load-unload count over device lifetime:  1114112
Accumulated load-unload cycles:  560
Elements in grown defect list: 53
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:    3017697    12304     24877   3030001      12340     338644.706          35
write:   6776911   115439    115543   6892350     115443     113104.558           0
verify:        7        0         0         7          0          1.175           0
Non-medium error count:     2265
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Failed in segment -->       6   65535        6856968932 [0x3 0x11 0x0]
# 2  Background long   Aborted (device reset ?)    -   65535                 - [-   -    -]
# 3  Background long   Failed in segment -->       6   65535        6856968932 [0x3 0x11 0x0]
# 4  Background short  Completed                   -   65535                 - [-   -    -]
# 5  Background short  Aborted (by user command)   -      58                 - [-   -    -]
Long (extended) Self-test duration: 31120 seconds [8.6 hours]
admin@truenas[~]$ sudo smartctl -a /dev/sdc |more
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.63-production+truenas] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor:               WD
Product:              WD4001FYYG-01SL3
Revision:             VRA9
Compliance:           SPC-4
User Capacity:        4,000,787,030,016 bytes [4.00 TB]
Logical block size:   512 bytes
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x50000c0f0129b61c
Serial number:        WMC1F1527782
Device type:          disk
Transport protocol:   SAS (SPL-4)
Local Time is:        Thu Feb 22 21:51:00 2024 EST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Current Drive Temperature:     39 C
Drive Trip Temperature:        40 C
Accumulated power on time, hours:minutes 81970:58
Manufactured in week 49 of year 2013
Specified cycle count over device lifetime:  1048576
Accumulated start-stop cycles:  56
Specified load-unload count over device lifetime:  1114112
Accumulated load-unload cycles:  531
Elements in grown defect list: 0
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:    6493553    32327     35315   6525880      32327     332248.186           0
write:   8674701    32322     32351   8707023      32322     114587.892           0
verify:        0        0         0         0          0          1.220           0
Non-medium error count:     2353
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                   -   65535                 - [-   -    -]
# 2  Background short  Completed                   -   65535                 - [-   -    -]
# 3  Background short  Aborted (by user command)   -      67                 - [-   -    -]
Long (extended) Self-test duration: 31120 seconds [8.6 hours]
admin@truenas[~]$ sudo smartctl -a /dev/sdd |more
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.63-production+truenas] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor:               WD
Product:              WD4001FYYG-01SL3
Revision:             VRA9
Compliance:           SPC-4
User Capacity:        4,000,787,030,016 bytes [4.00 TB]
Logical block size:   512 bytes
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x50000c0f01e139d0
Serial number:        WMC1F1742086
Device type:          disk
Transport protocol:   SAS (SPL-4)
Local Time is:        Thu Feb 22 21:51:14 2024 EST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Current Drive Temperature:     36 C
Drive Trip Temperature:        40 C
Accumulated power on time, hours:minutes 81957:12
Manufactured in week 50 of year 2013
Specified cycle count over device lifetime:  1048576
Accumulated start-stop cycles:  55
Specified load-unload count over device lifetime:  1114112
Accumulated load-unload cycles:  428
Elements in grown defect list: 2
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:    5135024    54621     99850   5189645      54649     329897.037          24
write:   8025664   206852    206960   8232516     206852     120589.987           0
verify:        0        0         0         0          0          0.000           0
Non-medium error count:     2404
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Failed in segment -->       6   65535        6417431600 [0x3 0x16 0x0]
# 2  Background short  Completed                   -   65535                 - [-   -    -]
# 3  Background short  Aborted (by user command)   -      52                 - [-   -    -]
Long (extended) Self-test duration: 31120 seconds [8.6 hours]
admin@truenas[~]$ sudo smartctl -a /dev/sde |more
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.63-production+truenas] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor:               WD
Product:              WD4001FYYG-01SL3
Revision:             VRA9
Compliance:           SPC-4
User Capacity:        4,000,787,030,016 bytes [4.00 TB]
Logical block size:   512 bytes
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x50000c0f01e05540
Serial number:        WMC1F1663425
Device type:          disk
Transport protocol:   SAS (SPL-4)
Local Time is:        Thu Feb 22 21:51:31 2024 EST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Current Drive Temperature:     40 C
Drive Trip Temperature:        40 C
Accumulated power on time, hours:minutes 81702:42
Manufactured in week 48 of year 2013
Specified cycle count over device lifetime:  1048576
Accumulated start-stop cycles:  87
Specified load-unload count over device lifetime:  1114112
Accumulated load-unload cycles:  73
Elements in grown defect list: 0
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:    5095331    49657     70531   5144988      49662     344067.051           5
write:   7298915   507957    507971   7806872     507957     117777.114           0
verify:        2        0         0         2          0          0.000           0
Non-medium error count:     2216
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Failed in segment -->       6   65535         789824151 [0x3 0x11 0x0]
# 2  Background short  Completed                   -   65535                 - [-   -    -]
# 3  Background short  Aborted (by user command)   -      72                 - [-   -    -]
Long (extended) Self-test duration: 31120 seconds [8.6 hours]
admin@truenas[~]$ sudo smartctl -a /dev/sdf |more
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.63-production+truenas] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor:               WD
Product:              WD4001FYYG-01SL3
Revision:             VRA9
Compliance:           SPC-4
User Capacity:        4,000,787,030,016 bytes [4.00 TB]
Logical block size:   512 bytes
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x50000c0f01e04e0c
Serial number:        WMC1F1680307
Device type:          disk
Transport protocol:   SAS (SPL-4)
Local Time is:        Thu Feb 22 21:52:08 2024 EST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Current Drive Temperature:     39 C
Drive Trip Temperature:        40 C
Accumulated power on time, hours:minutes 81963:56
Manufactured in week 48 of year 2013
Specified cycle count over device lifetime:  1048576
Accumulated start-stop cycles:  58
Specified load-unload count over device lifetime:  1114112
Accumulated load-unload cycles:  577
Elements in grown defect list: 6
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:    4036385    77202    301945   4113587      77317     322281.851         113
write:   7615701   517023    517040   8132724     517025     121439.462           4
verify:        5      164      3202       169        168          0.000           0
Non-medium error count:     1912
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Failed in segment -->       6   65535         491530903 [0x3 0x11 0x0]
# 2  Background short  Completed                   -   65535                 - [-   -    -]
# 3  Background short  Aborted (by user command)   -      59                 - [-   -    -]
Long (extended) Self-test duration: 31120 seconds [8.6 hours]
As you can see, these are old, worn-out drives. They were a freebie from a friend who got them with a server he purchased. He said they were untested and that I should assume they wouldn't even spin up, but if they did, I could use them to get my server up and running. The plan is to buy better drives one or two at a time and replace these with something newer. The last one is showing as faulted under the storage dashboard after a scrub, so clearly that's the first to replace. Assuming that the SMART information is correct and 4/5 are bad, are any of them worse than the others? Nothing on the drives is without a backup, just a few things to test a Plex server, so if 2 drives fail before I get new drives then so be it. I'd just rather avoid that if at all possible.
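One rough way to compare the drives is to pull the serial number, the grown defect count and the total uncorrected errors out of each `smartctl -a` report and sort on them. The sketch below does that for two trimmed excerpts of the output above. The sort order (uncorrected errors first, grown defects second) is only an assumed heuristic, not an official health metric, and the names `summarize` and `rank` are made up for illustration.

```python
import re

# Trimmed excerpts from two of the reports in this post.
REPORT_FAULTED = """\
Serial number:        WMC1F1680307
Elements in grown defect list: 6
read:    4036385    77202    301945   4113587      77317     322281.851         113
write:   7615701   517023    517040   8132724     517025     121439.462           4
verify:        5      164      3202       169        168          0.000           0
"""

REPORT_CLEAN = """\
Serial number:        WMC1F1527782
Elements in grown defect list: 0
read:    6493553    32327     35315   6525880      32327     332248.186           0
write:   8674701    32322     32351   8707023      32322     114587.892           0
verify:        0        0         0         0          0          1.220           0
"""

def summarize(report):
    """Extract a few health indicators from one SAS smartctl report."""
    serial = re.search(r"^Serial number:\s+(\S+)", report, re.M).group(1)
    grown = int(
        re.search(r"^Elements in grown defect list:\s+(\d+)", report, re.M).group(1)
    )
    uncorrected = 0
    for op in ("read", "write", "verify"):
        row = re.search(rf"^{op}:\s+(.*)$", report, re.M)
        if row:
            # The last column of each row is "Total uncorrected errors".
            uncorrected += int(row.group(1).split()[-1])
    return {"serial": serial, "grown_defects": grown, "uncorrected": uncorrected}

def rank(reports):
    """Sort summaries worst-first under the assumed heuristic."""
    return sorted(
        (summarize(r) for r in reports),
        key=lambda s: (s["uncorrected"], s["grown_defects"]),
        reverse=True,
    )

for s in rank([REPORT_CLEAN, REPORT_FAULTED]):
    print(s["serial"], s["uncorrected"], s["grown_defects"])
```

Feeding all five full reports through the same functions would give a worst-first list to replace from, with the caveat that these counters are cumulative over the drive's life and say nothing about how recently the errors occurred.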
FWIW I also installed Scrutiny from TrueCharts, and it shows all the drives, including the faulted one, as passing. I'm assuming that isn't accurate.
Relevant Hardware
- Asus 2U server ESC4000 G3
- Intel Xeon E5-2620 v4
- 16GB ECC RAM
- 5x 4TB WD4001FYYG-01SL3 in RAIDZ1 connected to a PERC H330 and passed through to TrueNAS.
 
			