I've been looking around the forum (and Google) and have seen some old threads on this, but I'm unsure whether they're still relevant. From what I've read, SMART data isn't very accurate on SAS drives. Is that still the case?
Basically, my setup is a TrueNAS Scale VM running under ESXi with 4 cores allocated to it. My issue is that 4 of the 5 drives failed a SMART long test run from the GUI. One drive is faulted, but the other 3 all have an error like this one in the SMART data:
# 1 Background long Failed in segment --> 6 65535 6417431600 [0x3 0x16 0x0]
Since every drive failed at the same segment, does that indicate there is an issue with my server and not the drives, or is it a coincidence and they're all on their way out?
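To make the comparison concrete, here is a minimal sketch that decodes the failed self-test entries from the smartctl output in this post. The sense meanings are the standard SCSI ones for the values that appear here (sense key 0x3, ASC/ASCQ 0x11/0x0 and 0x16/0x0); the names `FAILED_ENTRIES`, `ENTRY_RE`, and `decode_entry` are made up for illustration and are not part of smartmontools. It only shows what the log says; it does not prove where the fault lies.

```python
import re

# Meanings for the SCSI sense values seen in this post only; any other
# code is reported as "unknown" rather than guessed.
SENSE_KEYS = {0x3: "MEDIUM ERROR"}
ASC_ASCQ = {
    (0x11, 0x00): "UNRECOVERED READ ERROR",
    (0x16, 0x00): "DATA SYNCHRONIZATION MARK ERROR",
}

# Failed long-test entries copied from the smartctl output in this post,
# keyed by drive serial number.
FAILED_ENTRIES = {
    "WMC1F1678693": "# 1 Background long Failed in segment --> 6 65535 6856968932 [0x3 0x11 0x0]",
    "WMC1F1742086": "# 1 Background long Failed in segment --> 6 65535 6417431600 [0x3 0x16 0x0]",
    "WMC1F1663425": "# 1 Background long Failed in segment --> 6 65535 789824151 [0x3 0x11 0x0]",
    "WMC1F1680307": "# 1 Background long Failed in segment --> 6 65535 491530903 [0x3 0x11 0x0]",
}

# Columns after the arrow: segment number, lifetime hours, first failing LBA,
# then [sense key, ASC, ASCQ] in hex.
ENTRY_RE = re.compile(
    r"Failed in segment -->\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"\[(0x[0-9a-fA-F]+)\s+(0x[0-9a-fA-F]+)\s+(0x[0-9a-fA-F]+)\]"
)


def decode_entry(line):
    """Return segment, first failing LBA and decoded sense data, or None."""
    match = ENTRY_RE.search(line)
    if match is None:
        return None
    segment = int(match.group(1))
    lba = int(match.group(3))
    sense_key, asc, ascq = (int(match.group(i), 16) for i in (4, 5, 6))
    return {
        "segment": segment,
        "lba": lba,
        "sense_key": SENSE_KEYS.get(sense_key, "unknown"),
        "asc_ascq": ASC_ASCQ.get((asc, ascq), "unknown"),
    }


decoded = {serial: decode_entry(line) for serial, line in FAILED_ENTRIES.items()}

for serial, entry in decoded.items():
    print(f"{serial}: segment {entry['segment']}, LBA {entry['lba']}, "
          f"{entry['sense_key']} / {entry['asc_ascq']}")

# Same segment on every drive, but is it the same LBA?
segments = {entry["segment"] for entry in decoded.values()}
lbas = {entry["lba"] for entry in decoded.values()}
print(f"distinct segments: {len(segments)}, distinct LBAs: {len(lbas)}")
```

Run against these four entries, the segment number matches on every drive but each first failing LBA is different, and three entries decode to an unrecovered read error while one decodes to a data synchronization mark error.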
Full results of the SMART tests
Code:
=== START OF INFORMATION SECTION ===
Vendor:               WD
Product:              WD4001FYYG-01SL3
Revision:             VRA9
Compliance:           SPC-4
User Capacity:        4,000,787,030,016 bytes [4.00 TB]
Logical block size:   512 bytes
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x50000c0f01e055a4
Serial number:        WMC1F1678693
Device type:          disk
Transport protocol:   SAS (SPL-4)
Local Time is:        Thu Feb 22 21:50:49 2024 EST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     40 C
Drive Trip Temperature:        40 C

Accumulated power on time, hours:minutes 81963:00
Manufactured in week 48 of year 2013
Specified cycle count over device lifetime:  1048576
Accumulated start-stop cycles:  59
Specified load-unload count over device lifetime:  1114112
Accumulated load-unload cycles:  560
Elements in grown defect list: 53

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:    3017697    12304     24877   3030001      12340     338644.706          35
write:   6776911   115439    115543   6892350     115443     113104.558           0
verify:        7        0         0         7          0          1.175           0

Non-medium error count:     2265

SMART Self-test log
Num  Test              Status                    segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                                 number   (hours)
# 1  Background long   Failed in segment -->        6      65535     6856968932 [0x3 0x11 0x0]
# 2  Background long   Aborted (device reset ?)     -      65535              - [- - -]
# 3  Background long   Failed in segment -->        6      65535     6856968932 [0x3 0x11 0x0]
# 4  Background short  Completed                    -      65535              - [- - -]
# 5  Background short  Aborted (by user command)    -         58              - [- - -]

Long (extended) Self-test duration: 31120 seconds [8.6 hours]

admin@truenas[~]$ sudo smartctl -a /dev/sdc |more
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.63-production+truenas] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               WD
Product:              WD4001FYYG-01SL3
Revision:             VRA9
Compliance:           SPC-4
User Capacity:        4,000,787,030,016 bytes [4.00 TB]
Logical block size:   512 bytes
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x50000c0f0129b61c
Serial number:        WMC1F1527782
Device type:          disk
Transport protocol:   SAS (SPL-4)
Local Time is:        Thu Feb 22 21:51:00 2024 EST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     39 C
Drive Trip Temperature:        40 C

Accumulated power on time, hours:minutes 81970:58
Manufactured in week 49 of year 2013
Specified cycle count over device lifetime:  1048576
Accumulated start-stop cycles:  56
Specified load-unload count over device lifetime:  1114112
Accumulated load-unload cycles:  531
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:    6493553    32327     35315   6525880      32327     332248.186           0
write:   8674701    32322     32351   8707023      32322     114587.892           0
verify:        0        0         0         0          0          1.220           0

Non-medium error count:     2353

SMART Self-test log
Num  Test              Status                    segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                                 number   (hours)
# 1  Background long   Completed                    -      65535              - [- - -]
# 2  Background short  Completed                    -      65535              - [- - -]
# 3  Background short  Aborted (by user command)    -         67              - [- - -]

Long (extended) Self-test duration: 31120 seconds [8.6 hours]

admin@truenas[~]$ sudo smartctl -a /dev/sdd |more
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.63-production+truenas] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               WD
Product:              WD4001FYYG-01SL3
Revision:             VRA9
Compliance:           SPC-4
User Capacity:        4,000,787,030,016 bytes [4.00 TB]
Logical block size:   512 bytes
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x50000c0f01e139d0
Serial number:        WMC1F1742086
Device type:          disk
Transport protocol:   SAS (SPL-4)
Local Time is:        Thu Feb 22 21:51:14 2024 EST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     36 C
Drive Trip Temperature:        40 C

Accumulated power on time, hours:minutes 81957:12
Manufactured in week 50 of year 2013
Specified cycle count over device lifetime:  1048576
Accumulated start-stop cycles:  55
Specified load-unload count over device lifetime:  1114112
Accumulated load-unload cycles:  428
Elements in grown defect list: 2

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:    5135024    54621     99850   5189645      54649     329897.037          24
write:   8025664   206852    206960   8232516     206852     120589.987           0
verify:        0        0         0         0          0          0.000           0

Non-medium error count:     2404

SMART Self-test log
Num  Test              Status                    segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                                 number   (hours)
# 1  Background long   Failed in segment -->        6      65535     6417431600 [0x3 0x16 0x0]
# 2  Background short  Completed                    -      65535              - [- - -]
# 3  Background short  Aborted (by user command)    -         52              - [- - -]

Long (extended) Self-test duration: 31120 seconds [8.6 hours]

admin@truenas[~]$ sudo smartctl -a /dev/sde |more
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.63-production+truenas] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               WD
Product:              WD4001FYYG-01SL3
Revision:             VRA9
Compliance:           SPC-4
User Capacity:        4,000,787,030,016 bytes [4.00 TB]
Logical block size:   512 bytes
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x50000c0f01e05540
Serial number:        WMC1F1663425
Device type:          disk
Transport protocol:   SAS (SPL-4)
Local Time is:        Thu Feb 22 21:51:31 2024 EST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     40 C
Drive Trip Temperature:        40 C

Accumulated power on time, hours:minutes 81702:42
Manufactured in week 48 of year 2013
Specified cycle count over device lifetime:  1048576
Accumulated start-stop cycles:  87
Specified load-unload count over device lifetime:  1114112
Accumulated load-unload cycles:  73
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:    5095331    49657     70531   5144988      49662     344067.051           5
write:   7298915   507957    507971   7806872     507957     117777.114           0
verify:        2        0         0         2          0          0.000           0

Non-medium error count:     2216

SMART Self-test log
Num  Test              Status                    segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                                 number   (hours)
# 1  Background long   Failed in segment -->        6      65535      789824151 [0x3 0x11 0x0]
# 2  Background short  Completed                    -      65535              - [- - -]
# 3  Background short  Aborted (by user command)    -         72              - [- - -]

Long (extended) Self-test duration: 31120 seconds [8.6 hours]

admin@truenas[~]$ sudo smartctl -a /dev/sdf |more
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.63-production+truenas] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               WD
Product:              WD4001FYYG-01SL3
Revision:             VRA9
Compliance:           SPC-4
User Capacity:        4,000,787,030,016 bytes [4.00 TB]
Logical block size:   512 bytes
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x50000c0f01e04e0c
Serial number:        WMC1F1680307
Device type:          disk
Transport protocol:   SAS (SPL-4)
Local Time is:        Thu Feb 22 21:52:08 2024 EST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     39 C
Drive Trip Temperature:        40 C

Accumulated power on time, hours:minutes 81963:56
Manufactured in week 48 of year 2013
Specified cycle count over device lifetime:  1048576
Accumulated start-stop cycles:  58
Specified load-unload count over device lifetime:  1114112
Accumulated load-unload cycles:  577
Elements in grown defect list: 6

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:    4036385    77202    301945   4113587      77317     322281.851         113
write:   7615701   517023    517040   8132724     517025     121439.462           4
verify:        5      164      3202       169        168          0.000           0

Non-medium error count:     1912

SMART Self-test log
Num  Test              Status                    segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                                 number   (hours)
# 1  Background long   Failed in segment -->        6      65535      491530903 [0x3 0x11 0x0]
# 2  Background short  Completed                    -      65535              - [- - -]
# 3  Background short  Aborted (by user command)    -         59              - [- - -]

Long (extended) Self-test duration: 31120 seconds [8.6 hours]
As you can see, these are old, worn-out drives. They were a freebie from a friend who got them with a server he purchased. He said they were untested and that I should assume they wouldn't even spin up, but if they did I could use them to get my server up and running. The plan is to buy better drives one or two at a time to replace these. The last one is showing as faulted under the storage dashboard after a scrub, so clearly that's the first to replace. Assuming that the SMART information is correct and 4/5 are bad, are any of them worse than the others? Nothing that isn't backed up is on the drives, just a few things to test a Plex server, so if 2 drives fail before I get new ones, then so be it. I'd just rather avoid that if at all possible.
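As a rough way to frame the "which is worse" question, here is a small sketch that sorts the five drives using numbers copied from the smartctl output in this post: grown defect list size, total uncorrected read and write errors, and whether the long test failed. The `Drive` class, `worst_first`, and the sort order itself (uncorrected errors first, then grown defects) are assumptions picked for illustration, not a vendor or smartmontools metric.

```python
from dataclasses import dataclass


@dataclass
class Drive:
    serial: str
    grown_defects: int       # "Elements in grown defect list"
    uncorrected_read: int    # "Total uncorrected errors", read row
    uncorrected_write: int   # "Total uncorrected errors", write row
    long_test_failed: bool   # most recent "Background long" result

    @property
    def uncorrected_total(self):
        return self.uncorrected_read + self.uncorrected_write


# Values copied from the smartctl output in this post.
DRIVES = [
    Drive("WMC1F1678693", 53, 35, 0, True),
    Drive("WMC1F1527782", 0, 0, 0, False),
    Drive("WMC1F1742086", 2, 24, 0, True),
    Drive("WMC1F1663425", 0, 5, 0, True),
    Drive("WMC1F1680307", 6, 113, 4, True),
]


def worst_first(drives):
    """Sort by uncorrected errors, then grown defects, then test failure.

    This ordering is an arbitrary heuristic chosen for the example.
    """
    return sorted(
        drives,
        key=lambda d: (d.uncorrected_total, d.grown_defects, d.long_test_failed),
        reverse=True,
    )


ranked = worst_first(DRIVES)
for position, drive in enumerate(ranked, start=1):
    status = "failed" if drive.long_test_failed else "passed"
    print(f"{position}. {drive.serial}: {drive.uncorrected_total} uncorrected, "
          f"{drive.grown_defects} grown defects, long test {status}")
```

With this ordering, WMC1F1680307 (the last drive listed) comes out worst and WMC1F1527782, the only one that completed the long test, comes out best. A different weighting, for example grown defects first, would swap some of the middle positions.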
FWIW, I also installed Scrutiny from TrueCharts, and it shows all the drives, including the faulted one, as passing. I'm assuming that isn't accurate.
Relevant Hardware
- Asus 2U server ESC4000 G3
- Intel Xeon E5-2620 v4
- 16GB ECC RAM
- 5x 4TB WD4001FYYG-01SL3 in RAIDZ1 connected to a PERC H330 and passed through to TrueNAS.