I've been looking around the forum (and Google) and found some old threads about this, but I'm unsure whether they're still relevant. From what I've read, SMART data isn't very accurate on SAS drives. Is that still the case?
My setup is TrueNAS SCALE running as a VM under ESXi, with 4 cores allocated to the VM. My issue is that 4 of the 5 drives failed a SMART long test started from the GUI. One drive is faulted, but the other 3 failing drives also stopped in the same segment (segment 6), with a self-test log entry like this one:
# 1 Background long Failed in segment --> 6 65535 6417431600 [0x3 0x16 0x0]
Since every failing drive stopped in the same segment, does that point to a problem with my server rather than the drives, or is it a coincidence and they're all on their way out?
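For reference, the bracketed triplet at the end of that log line is SCSI sense data: sense key, additional sense code (ASC), and additional sense code qualifier (which smartctl labels ASQ). Below is a minimal Python sketch for decoding such a line. The function name `decode_selftest_line` and the two lookup tables are made up for illustration, and the tables only cover the codes that appear in this thread; the regular expression is written against the lines quoted here and may not match every smartctl version.

```python
import re

# Partial lookup tables: only the values seen in this thread.
# The complete lists are defined by the SCSI standards (SPC).
SENSE_KEYS = {0x3: "MEDIUM ERROR"}
ASC_ASCQ = {
    (0x11, 0x00): "UNRECOVERED READ ERROR",
    (0x16, 0x00): "DATA SYNCHRONIZATION MARK ERROR",
}

# Matches a failed entry of the SCSI "SMART Self-test log" as quoted above.
LINE_RE = re.compile(
    r"#\s*(?P<num>\d+)\s+Background\s+(?P<kind>long|short)\s+"
    r"Failed in segment\s+-->\s+(?P<segment>\d+)\s+(?P<hours>\d+)\s+"
    r"(?P<lba>\d+)\s+"
    r"\[(?P<sk>0x[0-9a-fA-F]+)\s+(?P<asc>0x[0-9a-fA-F]+)\s+(?P<ascq>0x[0-9a-fA-F]+)\]"
)


def decode_selftest_line(line):
    """Return a dict describing a failed self-test entry, or None if no match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    sk = int(m.group("sk"), 16)
    asc = int(m.group("asc"), 16)
    ascq = int(m.group("ascq"), 16)
    return {
        "test": m.group("kind"),
        "segment": int(m.group("segment")),
        "lba": int(m.group("lba")),
        # Fall back to the raw hex value for codes missing from the tables.
        "sense_key": SENSE_KEYS.get(sk, hex(sk)),
        "additional_sense": ASC_ASCQ.get((asc, ascq), f"{asc:#x}/{ascq:#x}"),
    }


if __name__ == "__main__":
    sample = "# 1 Background long Failed in segment --> 6 65535 6417431600 [0x3 0x16 0x0]"
    print(decode_selftest_line(sample))
```

Both codes in my logs carry sense key 0x3 (medium error), and each failure reports its own first-error LBA.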
Full results of the SMART tests
Code:
=== START OF INFORMATION SECTION ===
Vendor: WD
Product: WD4001FYYG-01SL3
Revision: VRA9
Compliance: SPC-4
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Logical block size: 512 bytes
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x50000c0f01e055a4
Serial number: WMC1F1678693
Device type: disk
Transport protocol: SAS (SPL-4)
Local Time is: Thu Feb 22 21:50:49 2024 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Current Drive Temperature: 40 C
Drive Trip Temperature: 40 C
Accumulated power on time, hours:minutes 81963:00
Manufactured in week 48 of year 2013
Specified cycle count over device lifetime: 1048576
Accumulated start-stop cycles: 59
Specified load-unload count over device lifetime: 1114112
Accumulated load-unload cycles: 560
Elements in grown defect list: 53
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:    3017697    12304     24877   3030001      12340     338644.706          35
write:   6776911   115439    115543   6892350     115443     113104.558           0
verify:        7        0         0         7          0          1.175           0
Non-medium error count: 2265
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Failed in segment -->       6   65535        6856968932 [0x3 0x11 0x0]
# 2  Background long   Aborted (device reset ?)    -   65535                 - [-   -    -]
# 3  Background long   Failed in segment -->       6   65535        6856968932 [0x3 0x11 0x0]
# 4  Background short  Completed                   -   65535                 - [-   -    -]
# 5  Background short  Aborted (by user command)   -      58                 - [-   -    -]
Long (extended) Self-test duration: 31120 seconds [8.6 hours]
admin@truenas[~]$ sudo smartctl -a /dev/sdc |more
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.63-production+truenas] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: WD
Product: WD4001FYYG-01SL3
Revision: VRA9
Compliance: SPC-4
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Logical block size: 512 bytes
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x50000c0f0129b61c
Serial number: WMC1F1527782
Device type: disk
Transport protocol: SAS (SPL-4)
Local Time is: Thu Feb 22 21:51:00 2024 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Current Drive Temperature: 39 C
Drive Trip Temperature: 40 C
Accumulated power on time, hours:minutes 81970:58
Manufactured in week 49 of year 2013
Specified cycle count over device lifetime: 1048576
Accumulated start-stop cycles: 56
Specified load-unload count over device lifetime: 1114112
Accumulated load-unload cycles: 531
Elements in grown defect list: 0
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:    6493553    32327     35315   6525880      32327     332248.186           0
write:   8674701    32322     32351   8707023      32322     114587.892           0
verify:        0        0         0         0          0          1.220           0
Non-medium error count: 2353
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                   -   65535                 - [-   -    -]
# 2  Background short  Completed                   -   65535                 - [-   -    -]
# 3  Background short  Aborted (by user command)   -      67                 - [-   -    -]
Long (extended) Self-test duration: 31120 seconds [8.6 hours]
admin@truenas[~]$ sudo smartctl -a /dev/sdd |more
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.63-production+truenas] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: WD
Product: WD4001FYYG-01SL3
Revision: VRA9
Compliance: SPC-4
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Logical block size: 512 bytes
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x50000c0f01e139d0
Serial number: WMC1F1742086
Device type: disk
Transport protocol: SAS (SPL-4)
Local Time is: Thu Feb 22 21:51:14 2024 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Current Drive Temperature: 36 C
Drive Trip Temperature: 40 C
Accumulated power on time, hours:minutes 81957:12
Manufactured in week 50 of year 2013
Specified cycle count over device lifetime: 1048576
Accumulated start-stop cycles: 55
Specified load-unload count over device lifetime: 1114112
Accumulated load-unload cycles: 428
Elements in grown defect list: 2
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:    5135024    54621     99850   5189645      54649     329897.037          24
write:   8025664   206852    206960   8232516     206852     120589.987           0
verify:        0        0         0         0          0          0.000           0
Non-medium error count: 2404
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Failed in segment -->       6   65535        6417431600 [0x3 0x16 0x0]
# 2  Background short  Completed                   -   65535                 - [-   -    -]
# 3  Background short  Aborted (by user command)   -      52                 - [-   -    -]
Long (extended) Self-test duration: 31120 seconds [8.6 hours]
admin@truenas[~]$ sudo smartctl -a /dev/sde |more
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.63-production+truenas] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: WD
Product: WD4001FYYG-01SL3
Revision: VRA9
Compliance: SPC-4
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Logical block size: 512 bytes
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x50000c0f01e05540
Serial number: WMC1F1663425
Device type: disk
Transport protocol: SAS (SPL-4)
Local Time is: Thu Feb 22 21:51:31 2024 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Current Drive Temperature: 40 C
Drive Trip Temperature: 40 C
Accumulated power on time, hours:minutes 81702:42
Manufactured in week 48 of year 2013
Specified cycle count over device lifetime: 1048576
Accumulated start-stop cycles: 87
Specified load-unload count over device lifetime: 1114112
Accumulated load-unload cycles: 73
Elements in grown defect list: 0
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:    5095331    49657     70531   5144988      49662     344067.051           5
write:   7298915   507957    507971   7806872     507957     117777.114           0
verify:        2        0         0         2          0          0.000           0
Non-medium error count: 2216
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Failed in segment -->       6   65535         789824151 [0x3 0x11 0x0]
# 2  Background short  Completed                   -   65535                 - [-   -    -]
# 3  Background short  Aborted (by user command)   -      72                 - [-   -    -]
Long (extended) Self-test duration: 31120 seconds [8.6 hours]
admin@truenas[~]$ sudo smartctl -a /dev/sdf |more
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.63-production+truenas] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: WD
Product: WD4001FYYG-01SL3
Revision: VRA9
Compliance: SPC-4
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Logical block size: 512 bytes
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x50000c0f01e04e0c
Serial number: WMC1F1680307
Device type: disk
Transport protocol: SAS (SPL-4)
Local Time is: Thu Feb 22 21:52:08 2024 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Current Drive Temperature: 39 C
Drive Trip Temperature: 40 C
Accumulated power on time, hours:minutes 81963:56
Manufactured in week 48 of year 2013
Specified cycle count over device lifetime: 1048576
Accumulated start-stop cycles: 58
Specified load-unload count over device lifetime: 1114112
Accumulated load-unload cycles: 577
Elements in grown defect list: 6
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:    4036385    77202    301945   4113587      77317     322281.851         113
write:   7615701   517023    517040   8132724     517025     121439.462           4
verify:        5      164      3202       169        168          0.000           0
Non-medium error count: 1912
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Failed in segment -->       6   65535         491530903 [0x3 0x11 0x0]
# 2  Background short  Completed                   -   65535                 - [-   -    -]
# 3  Background short  Aborted (by user command)   -      59                 - [-   -    -]
Long (extended) Self-test duration: 31120 seconds [8.6 hours]
As you can see, these are old, worn-out drives. They were a freebie from a friend who got them with a server he purchased. He said they were untested and that I should assume they wouldn't even spin up, but that if they did, I could use them to get my server up and running. The plan is to replace them with newer drives, bought one or two at a time. The last one is showing as faulted under the storage dashboard after a scrub, so clearly that's the first to replace. Assuming the SMART information is correct and 4 of the 5 are bad, are any of them worse than the others? Everything on the drives is backed up; they only hold a few things for testing a Plex server, so if 2 drives fail before I get new ones, so be it. I'd just rather avoid that if at all possible.
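To frame that question, here is a rough way to order the drives using the figures from the smartctl output above. It is only a sketch: the `Drive` class, the `rank_worst_first` helper, and the ordering rule (failed long test first, then more uncorrected errors, then more grown defects) are my own assumptions for illustration, not an official health score.

```python
from dataclasses import dataclass


@dataclass
class Drive:
    serial: str
    grown_defects: int      # "Elements in grown defect list"
    uncorrected: int        # read + write + verify "Total uncorrected errors"
    long_test_passed: bool  # result of the most recent long self-test


# Values copied by hand from the smartctl output in this post.
DRIVES = [
    Drive("WMC1F1678693", grown_defects=53, uncorrected=35, long_test_passed=False),
    Drive("WMC1F1527782", grown_defects=0, uncorrected=0, long_test_passed=True),
    Drive("WMC1F1742086", grown_defects=2, uncorrected=24, long_test_passed=False),
    Drive("WMC1F1663425", grown_defects=0, uncorrected=5, long_test_passed=False),
    Drive("WMC1F1680307", grown_defects=6, uncorrected=113 + 4, long_test_passed=False),
]


def rank_worst_first(drives):
    """Sort drives so the most suspect ones come first (assumed heuristic)."""
    return sorted(
        drives,
        # False sorts before True, so failed long tests lead the list;
        # negated counters put larger error counts ahead of smaller ones.
        key=lambda d: (d.long_test_passed, -d.uncorrected, -d.grown_defects),
    )


if __name__ == "__main__":
    for d in rank_worst_first(DRIVES):
        status = "passed" if d.long_test_passed else "FAILED"
        print(f"{d.serial}  long test {status:6}  "
              f"uncorrected={d.uncorrected:4}  grown defects={d.grown_defects:3}")
```

By this rule the faulted drive (WMC1F1680307) comes out first and the one that completed its long test (WMC1F1527782) last, but a different weighting, for example putting grown defects ahead of uncorrected errors, would reorder the middle of the list.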
FWIW, I also installed Scrutiny from TrueCharts, and it shows all the drives, including the faulted one, as passing. I'm assuming that isn't accurate.
Relevant Hardware
- Asus 2U server ESC4000 G3
- Intel Xeon E5-2620 v4
- 16GB ECC RAM
- 5x 4TB WD4001FYYG-01SL3 in RAIDZ1 connected to a PERC H330 and passed through to TrueNAS.