As the title states, I'm having a really odd issue. All 6 of the new drives I added are in a degraded or failed state, and I'm not sure what went wrong here. They all passed SMART tests and the SMART data seemed healthy to me. Any advice?
Hey! Sorry, this is my post here. I'm running a PowerEdge R530 with a PERC H310 flashed to IT mode. I'm also running an LSI 9207-8e connected to a NetApp DS4246. In this pool I have 3 vdevs, one of which is giving errors after adding 6 HGST drives that passed all SMART tests before I threw them in. I'm not sure what information you'd need, but I can add anything else that would be of help. Screenshots, etc. Thank you!
What disks are these, where did you get them from, how are they connected... There's not much to go on in your post.
ZPool Status for the pool with issues
Let's start with zpool status and smartctl -x /dev/sdX, the latter for each of the affected disks.
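Running smartctl against each affected disk needs the /dev/sdX name, while TrueNAS SCALE lists pool members by partition UUID. A minimal Python sketch of one way to get from one to the other, assuming the usual udev symlinks under /dev/disk/by-partuuid and sdX-style device names (the function names here are made up for illustration):

```python
import os
import re

def resolve_partuuid(partuuid, base="/dev/disk/by-partuuid"):
    """Follow the by-partuuid symlink to the partition node, e.g. /dev/sdab1."""
    return os.path.realpath(os.path.join(base, partuuid))

def whole_disk(partition_node):
    """Strip the partition number from an sdX-style node (/dev/sdab1 -> /dev/sdab).

    Anything that does not look like sdX<number> is returned unchanged.
    """
    match = re.fullmatch(r"(.*/sd[a-z]+)\d+", partition_node)
    return match.group(1) if match else partition_node

def smartctl_command(partuuid, base="/dev/disk/by-partuuid"):
    """Build (but do not run) the smartctl invocation for one pool member."""
    return ["smartctl", "-x", whole_disk(resolve_partuuid(partuuid, base))]
```

The command list can be handed to subprocess.run as root. NVMe-style names (nvme0n1p1) are not handled by this sketch.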
  pool: Studio
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub canceled on Tue Dec 5 13:32:12 2023
config:

        NAME                                        STATE     READ WRITE CKSUM
        Studio                                      DEGRADED     0     0     0
          raidz1-0                                  ONLINE       0     0     0
            bf6894b2-4e90-47fe-be4c-e6be8ed41ec7    ONLINE       0     0     0
            90fce70e-b82b-4c68-a78f-adb830b12e68    ONLINE       0     0     0
            4ad6ec78-2967-4017-b372-927a7dca1e47    ONLINE       0     0     0
            58bd3274-a91b-4a33-a711-b2495b80e462    ONLINE       0     0     0
            426d713c-1171-4557-90bf-96b9da727041    ONLINE       0     0     0
          raidz1-1                                  ONLINE       0     0     0
            f41d8a1d-f052-456d-9343-5280a3192c26    ONLINE       0     0     0
            2363d5b3-89f9-4586-9291-e2781962e79e    ONLINE       0     0     0
            db0f3488-3475-43cf-89c8-3cd27b960e41    ONLINE       0     0     0
            aaf62f30-53c7-4944-8fec-048303c45def    ONLINE       0     0     0
            148ff0cd-c3d9-4534-aafb-d955871f9319    ONLINE       0     0     0
          raidz1-2                                  DEGRADED     0 1.94K     0
            6c8f8d99-40ee-4b14-9dc3-5bef3546b758    DEGRADED     0   963     1  too many errors
            18f69766-3fa0-4458-9225-6b0323730b66    DEGRADED     0 1.07K     1  too many errors
            96000c13-61ec-4278-832c-51bd90527dea    DEGRADED     0   788     0  too many errors
            spare-3                                 UNAVAIL      0    74     0  insufficient replicas
              b57fca1f-0bd9-42a0-aca8-56aa73553b90  FAULTED      0    64     0  too many errors
              77f58081-decd-4d38-b4c1-6139d0a1d5aa  FAULTED      0    75     0  too many errors
            2c7f1196-5415-4ac3-bd19-f6f7aa979442    DEGRADED     0   847     0  too many errors
        spares
          77f58081-decd-4d38-b4c1-6139d0a1d5aa      INUSE     currently in use

errors: No known data errors
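To see at a glance which members carry errors and of what kind, the config table can be filtered down to rows with non-zero counters. This is only a rough Python sketch: the regular expression, the function names, and the reading of the K/M/G suffixes as approximate multiples of 1024 are assumptions, and the sample is a shortened excerpt of the output above.

```python
import re

# Assumption: abbreviated counters such as 1.94K are approximate multiples of 1024.
SUFFIXES = {"K": 1024, "M": 1024**2, "G": 1024**3}

# name, STATE, READ, WRITE, CKSUM on an indented config row
ROW = re.compile(r"^\s+(\S+)\s+([A-Z]+)\s+(\S+)\s+(\S+)\s+(\S+)")

def parse_count(token):
    """Convert a counter such as '963' or '1.94K' to an approximate integer."""
    if token[-1] in SUFFIXES:
        return int(float(token[:-1]) * SUFFIXES[token[-1]])
    return int(token)

def members_with_errors(status_text):
    """Return (name, state, read, write, cksum) for rows with any non-zero counter."""
    rows = []
    for line in status_text.splitlines():
        m = ROW.match(line)
        if not m or m.group(1) == "NAME":
            continue
        try:
            read, write, cksum = (parse_count(t) for t in m.group(3, 4, 5))
        except ValueError:
            continue  # not a counter row, e.g. the "INUSE currently in use" spare line
        if read or write or cksum:
            rows.append((m.group(1), m.group(2), read, write, cksum))
    return rows

SAMPLE = """
        NAME                                      STATE     READ WRITE CKSUM
        Studio                                    DEGRADED     0     0     0
          raidz1-0                                ONLINE       0     0     0
          raidz1-2                                DEGRADED     0 1.94K     0
            6c8f8d99-40ee-4b14-9dc3-5bef3546b758  DEGRADED     0   963     1  too many errors
            b57fca1f-0bd9-42a0-aca8-56aa73553b90  FAULTED      0    64     0  too many errors
"""

for row in members_with_errors(SAMPLE):
    print(row)
```

On the full output, every row this returns sits under raidz1-2 and the non-zero counters are almost all in the WRITE column.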
smartctl -x /dev/sdab
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.131+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HUH721008AL4200
Revision:             A3Z4
Compliance:           SPC-4
User Capacity:        8,001,563,222,016 bytes [8.00 TB]
Logical block size:   4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca27d0272c0
Serial number:        2SG1ARYF
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Tue Dec 5 15:46:33 2023 EST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
Read Cache is:        Enabled
Writeback Cache is:   Disabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned <not available>
Power on minutes since format <not available>
Current Drive Temperature:     35 C
Drive Trip Temperature:        85 C

Manufactured in week 21 of year 2019
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  89
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  1504
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 51113588672167936

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        3         0         3    3212507      14955.371           0
write:         0        0         0         0    3894982     232944.141           0
verify:        0        0         0         0      11774          0.000           0

Non-medium error count:        0

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -   34125                 - [- - -]
# 2  Background long   Completed                   -   34096                 - [- - -]
# 3  Background short  Completed                   -   34053                 - [- - -]
# 4  Background short  Completed                   -   34005                 - [- - -]
# 5  Background short  Completed                   -   33957                 - [- - -]
# 6  Background short  Completed                   -   33885                 - [- - -]
# 7  Background long   Completed                   -   33866                 - [- - -]
# 8  Background short  Completed                   -   33848                 - [- - -]

Long (extended) Self-test duration: 60239 seconds [1004.0 minutes]

Background scan results log
  Status: waiting until BMS interval timer expires
    Accumulated power on time, hours:minutes 34165:02 [2049902 minutes]
    Number of background scans performed: 27,  scan progress: 0.00%
    Number of background medium scans performed: 27

Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 2
  number of phys = 1
  phy identifier = 0
    attached device type: SAS or SATA device
    attached reason: unknown
    reason: unknown
    negotiated logical link rate: phy enabled; 12 Gbps
    attached initiator port: ssp=1 stp=1 smp=1
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x5000cca27d0272c1
    attached SAS address = 0x5d09466071559e05
    attached phy identifier = 5
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 0
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 0
     Running disparity error count: 0
     Loss of dword synchronization count: 0
     Phy reset problem count: 0
relative target port id = 2
  generation code = 2
  number of phys = 1
  phy identifier = 1
    attached device type: no device attached
    attached reason: unknown
    reason: power on
    negotiated logical link rate: phy enabled; unknown
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x5000cca27d0272c2
    attached SAS address = 0x0
    attached phy identifier = 0
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 0
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 0
     Running disparity error count: 0
     Loss of dword synchronization count: 0
     Phy reset problem count: 0
smartctl -x /dev/sdy
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.131+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HUH721008AL4200
Revision:             A3Z4
Compliance:           SPC-4
User Capacity:        8,001,563,222,016 bytes [8.00 TB]
Logical block size:   4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca27d0649f4
Serial number:        2SG3G6TF
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Tue Dec 5 15:46:52 2023 EST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
Read Cache is:        Enabled
Writeback Cache is:   Disabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned <not available>
Power on minutes since format <not available>
Current Drive Temperature:     42 C
Drive Trip Temperature:        85 C

Manufactured in week 23 of year 2019
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  92
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  1509
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 54442452785823744

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0      341         0       341    2998578      14958.617           0
write:         0        0         0         0    2576965     272073.797           0
verify:        0        0         0         0       6047          0.000           0

Non-medium error count:        0

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -   34134                 - [- - -]
# 2  Background long   Completed                   -   34102                 - [- - -]
# 3  Background short  Completed                   -   34062                 - [- - -]
# 4  Background short  Completed                   -   34014                 - [- - -]
# 5  Background short  Completed                   -   33966                 - [- - -]
# 6  Background short  Completed                   -   33894                 - [- - -]
# 7  Background long   Completed                   -   33876                 - [- - -]
# 8  Background short  Completed                   -   33857                 - [- - -]

Long (extended) Self-test duration: 65426 seconds [1090.4 minutes]

Background scan results log
  Status: waiting until BMS interval timer expires
    Accumulated power on time, hours:minutes 34173:51 [2050431 minutes]
    Number of background scans performed: 40,  scan progress: 0.00%
    Number of background medium scans performed: 40

Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 1
  number of phys = 1
  phy identifier = 0
    attached device type: expander device
    attached reason: power on
    reason: unknown
    negotiated logical link rate: phy enabled; 6 Gbps
    attached initiator port: ssp=0 stp=0 smp=1
    attached target port: ssp=0 stp=0 smp=1
    SAS address = 0x5000cca27d0649f5
    attached SAS address = 0x500a09800638aeff
    attached phy identifier = 26
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 0
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 0
     Running disparity error count: 0
     Loss of dword synchronization count: 0
     Phy reset problem count: 0
relative target port id = 2
  generation code = 1
  number of phys = 1
  phy identifier = 1
    attached device type: no device attached
    attached reason: unknown
    reason: power on
    negotiated logical link rate: phy enabled; unknown
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x5000cca27d0649f6
    attached SAS address = 0x0
    attached phy identifier = 0
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 0
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 0
     Running disparity error count: 0
     Loss of dword synchronization count: 0
     Phy reset problem count: 0
They're being run through two separate HBA cards currently. The one connected at 6 Gbps is connected through the external LSI9208 I have connected to a DS4246. The 12 Gbps drive is connected to the internal perc hba310 connected to the server's backplane.
Well, the SMART reports on those disks are rather unhelpful and vague, but they do say the disks have been in near-continuous use for the past four years. I also see that one disk is connected at 12 Gb/s and the other only at 6 Gb/s. Since most of the errors reported are write errors and the disks aren't themselves complaining, you should share the details of the HBA(s) and expanders you're using. You were either sold crap drives salvaged after being replaced for having crapped out, or you have a bizarre SAS connectivity issue.
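The link-rate difference can be read straight out of the "Protocol Specific port log page" section of each smartctl -x report, which makes it easier to compare all six disks rather than two. A hedged Python sketch follows; the function names are invented for illustration and the sample is a trimmed excerpt of the /dev/sdy report above.

```python
# Fields of interest in the SAS port log page, mapped to short dict keys.
FIELDS = {
    "attached device type": "attached",
    "negotiated logical link rate": "link_rate",
}

def sas_ports(smartctl_text):
    """Return one dict per 'relative target port id' block in smartctl -x output."""
    ports = []
    for raw in smartctl_text.splitlines():
        line = raw.strip()
        if line.startswith("relative target port id"):
            ports.append({})
        elif ports and ":" in line:
            key, _, value = line.partition(":")
            if key.strip() in FIELDS:
                ports[-1][FIELDS[key.strip()]] = value.strip()
    return ports

def active_links(smartctl_text):
    """List (attached device type, link rate) for ports that have something attached."""
    links = []
    for port in sas_ports(smartctl_text):
        attached = port.get("attached", "")
        if attached and attached != "no device attached":
            rate = port.get("link_rate", "").split(";")[-1].strip()
            links.append((attached, rate))
    return links

SAMPLE = """
Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 1
  number of phys = 1
  phy identifier = 0
    attached device type: expander device
    attached reason: power on
    negotiated logical link rate: phy enabled; 6 Gbps
    attached SAS address = 0x500a09800638aeff
relative target port id = 2
  phy identifier = 1
    attached device type: no device attached
    negotiated logical link rate: phy enabled; unknown
"""

print(active_links(SAMPLE))
```

Collecting this for every disk in raidz1-2 would show whether the failing members share one path (same attached device type and link rate) or are spread across both HBAs.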
Yeah, no dice. Same issue. I'm currently away from the server, but I'll be able to check in on it tomorrow. I'll try switching out my LSI HBA, double-check cables, and switch where the drives are in the chassis to see if the issue follows.
Best to do so, yes.