As the title states, I'm having a really odd issue. All 6 of the new drives I added are in a degraded or failed state, and I'm not sure what went wrong. They all passed SMART tests and the SMART data seemed healthy to me. Any advice?
Hey! Sorry, this is my post here. I'm running a PowerEdge R530 with a PERC H310 flashed to IT mode. I'm also running an LSI 9207-8e connected to a NetApp DS4246. In this pool I have 3 vdevs, one of which is giving errors after I added 6 HGST drives that passed all SMART tests before I threw them in. I'm not sure what information you'd need, but I can add anything else that would be of help. Screenshots, etc. Thank you!

What disks are these, where did you get them from, how are they connected... There's not much to go on in your post.
Let's start with zpool status and smartctl -x /dev/sdX, the latter for each of the affected disks.

zpool status for the pool with issues:
  pool: Studio
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub canceled on Tue Dec 5 13:32:12 2023
config:

        NAME                                        STATE     READ WRITE CKSUM
        Studio                                      DEGRADED     0     0     0
          raidz1-0                                  ONLINE       0     0     0
            bf6894b2-4e90-47fe-be4c-e6be8ed41ec7    ONLINE       0     0     0
            90fce70e-b82b-4c68-a78f-adb830b12e68    ONLINE       0     0     0
            4ad6ec78-2967-4017-b372-927a7dca1e47    ONLINE       0     0     0
            58bd3274-a91b-4a33-a711-b2495b80e462    ONLINE       0     0     0
            426d713c-1171-4557-90bf-96b9da727041    ONLINE       0     0     0
          raidz1-1                                  ONLINE       0     0     0
            f41d8a1d-f052-456d-9343-5280a3192c26    ONLINE       0     0     0
            2363d5b3-89f9-4586-9291-e2781962e79e    ONLINE       0     0     0
            db0f3488-3475-43cf-89c8-3cd27b960e41    ONLINE       0     0     0
            aaf62f30-53c7-4944-8fec-048303c45def    ONLINE       0     0     0
            148ff0cd-c3d9-4534-aafb-d955871f9319    ONLINE       0     0     0
          raidz1-2                                  DEGRADED     0 1.94K     0
            6c8f8d99-40ee-4b14-9dc3-5bef3546b758    DEGRADED     0   963     1  too many errors
            18f69766-3fa0-4458-9225-6b0323730b66    DEGRADED     0 1.07K     1  too many errors
            96000c13-61ec-4278-832c-51bd90527dea    DEGRADED     0   788     0  too many errors
            spare-3                                 UNAVAIL      0    74     0  insufficient replicas
              b57fca1f-0bd9-42a0-aca8-56aa73553b90  FAULTED      0    64     0  too many errors
              77f58081-decd-4d38-b4c1-6139d0a1d5aa  FAULTED      0    75     0  too many errors
            2c7f1196-5415-4ac3-bd19-f6f7aa979442    DEGRADED     0   847     0  too many errors
        spares
          77f58081-decd-4d38-b4c1-6139d0a1d5aa      INUSE     currently in use

errors: No known data errors

smartctl -x /dev/sdab
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.131+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: HGST
Product: HUH721008AL4200
Revision: A3Z4
Compliance: SPC-4
User Capacity: 8,001,563,222,016 bytes [8.00 TB]
Logical block size: 4096 bytes
LU is fully provisioned
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x5000cca27d0272c0
Serial number: 2SG1ARYF
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Tue Dec 5 15:46:33 2023 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled
Read Cache is: Enabled
Writeback Cache is: Disabled
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned <not available>
Power on minutes since format <not available>
Current Drive Temperature: 35 C
Drive Trip Temperature: 85 C
Manufactured in week 21 of year 2019
Specified cycle count over device lifetime: 50000
Accumulated start-stop cycles: 89
Specified load-unload count over device lifetime: 600000
Accumulated load-unload cycles: 1504
Elements in grown defect list: 0
Vendor (Seagate Cache) information
Blocks sent to initiator = 51113588672167936
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        3         0         3    3212507      14955.371           0
write:         0        0         0         0    3894982     232944.141           0
verify:        0        0         0         0      11774          0.000           0
Non-medium error count: 0
SMART Self-test log
Num  Test              Status       segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                    number   (hours)
# 1  Background short  Completed          -     34125              - [-   -    -]
# 2  Background long   Completed          -     34096              - [-   -    -]
# 3  Background short  Completed          -     34053              - [-   -    -]
# 4  Background short  Completed          -     34005              - [-   -    -]
# 5  Background short  Completed          -     33957              - [-   -    -]
# 6  Background short  Completed          -     33885              - [-   -    -]
# 7  Background long   Completed          -     33866              - [-   -    -]
# 8  Background short  Completed          -     33848              - [-   -    -]
Long (extended) Self-test duration: 60239 seconds [1004.0 minutes]
Background scan results log
Status: waiting until BMS interval timer expires
Accumulated power on time, hours:minutes 34165:02 [2049902 minutes]
Number of background scans performed: 27, scan progress: 0.00%
Number of background medium scans performed: 27
Protocol Specific port log page for SAS SSP
relative target port id = 1
generation code = 2
number of phys = 1
phy identifier = 0
attached device type: SAS or SATA device
attached reason: unknown
reason: unknown
negotiated logical link rate: phy enabled; 12 Gbps
attached initiator port: ssp=1 stp=1 smp=1
attached target port: ssp=0 stp=0 smp=0
SAS address = 0x5000cca27d0272c1
attached SAS address = 0x5d09466071559e05
attached phy identifier = 5
Invalid DWORD count = 0
Running disparity error count = 0
Loss of DWORD synchronization = 0
Phy reset problem = 0
Phy event descriptors:
Invalid word count: 0
Running disparity error count: 0
Loss of dword synchronization count: 0
Phy reset problem count: 0
relative target port id = 2
generation code = 2
number of phys = 1
phy identifier = 1
attached device type: no device attached
attached reason: unknown
reason: power on
negotiated logical link rate: phy enabled; unknown
attached initiator port: ssp=0 stp=0 smp=0
attached target port: ssp=0 stp=0 smp=0
SAS address = 0x5000cca27d0272c2
attached SAS address = 0x0
attached phy identifier = 0
Invalid DWORD count = 0
Running disparity error count = 0
Loss of DWORD synchronization = 0
Phy reset problem = 0
Phy event descriptors:
Invalid word count: 0
Running disparity error count: 0
Loss of dword synchronization count: 0
Phy reset problem count: 0
smartctl -x /dev/sdy
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.131+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: HGST
Product: HUH721008AL4200
Revision: A3Z4
Compliance: SPC-4
User Capacity: 8,001,563,222,016 bytes [8.00 TB]
Logical block size: 4096 bytes
LU is fully provisioned
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x5000cca27d0649f4
Serial number: 2SG3G6TF
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Tue Dec 5 15:46:52 2023 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled
Read Cache is: Enabled
Writeback Cache is: Disabled
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned <not available>
Power on minutes since format <not available>
Current Drive Temperature: 42 C
Drive Trip Temperature: 85 C
Manufactured in week 23 of year 2019
Specified cycle count over device lifetime: 50000
Accumulated start-stop cycles: 92
Specified load-unload count over device lifetime: 600000
Accumulated load-unload cycles: 1509
Elements in grown defect list: 0
Vendor (Seagate Cache) information
Blocks sent to initiator = 54442452785823744
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0      341         0       341    2998578      14958.617           0
write:         0        0         0         0    2576965     272073.797           0
verify:        0        0         0         0       6047          0.000           0
Non-medium error count: 0
SMART Self-test log
Num  Test              Status       segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                    number   (hours)
# 1  Background short  Completed          -     34134              - [-   -    -]
# 2  Background long   Completed          -     34102              - [-   -    -]
# 3  Background short  Completed          -     34062              - [-   -    -]
# 4  Background short  Completed          -     34014              - [-   -    -]
# 5  Background short  Completed          -     33966              - [-   -    -]
# 6  Background short  Completed          -     33894              - [-   -    -]
# 7  Background long   Completed          -     33876              - [-   -    -]
# 8  Background short  Completed          -     33857              - [-   -    -]
Long (extended) Self-test duration: 65426 seconds [1090.4 minutes]
Background scan results log
Status: waiting until BMS interval timer expires
Accumulated power on time, hours:minutes 34173:51 [2050431 minutes]
Number of background scans performed: 40, scan progress: 0.00%
Number of background medium scans performed: 40
Protocol Specific port log page for SAS SSP
relative target port id = 1
generation code = 1
number of phys = 1
phy identifier = 0
attached device type: expander device
attached reason: power on
reason: unknown
negotiated logical link rate: phy enabled; 6 Gbps
attached initiator port: ssp=0 stp=0 smp=1
attached target port: ssp=0 stp=0 smp=1
SAS address = 0x5000cca27d0649f5
attached SAS address = 0x500a09800638aeff
attached phy identifier = 26
Invalid DWORD count = 0
Running disparity error count = 0
Loss of DWORD synchronization = 0
Phy reset problem = 0
Phy event descriptors:
Invalid word count: 0
Running disparity error count: 0
Loss of dword synchronization count: 0
Phy reset problem count: 0
relative target port id = 2
generation code = 1
number of phys = 1
phy identifier = 1
attached device type: no device attached
attached reason: unknown
reason: power on
negotiated logical link rate: phy enabled; unknown
attached initiator port: ssp=0 stp=0 smp=0
attached target port: ssp=0 stp=0 smp=0
SAS address = 0x5000cca27d0649f6
attached SAS address = 0x0
attached phy identifier = 0
Invalid DWORD count = 0
Running disparity error count = 0
Loss of DWORD synchronization = 0
Phy reset problem = 0
Phy event descriptors:
Invalid word count: 0
Running disparity error count: 0
Loss of dword synchronization count: 0
Phy reset problem count: 0
They're currently being run through two separate HBA cards. The drive connected at 6 Gb/s goes through the external LSI9208, which is connected to a DS4246. The 12 Gb/s drive is connected to the internal perc hba310, which is connected to the server's backplane.

Well, the SMART reports on those disks are rather unhelpful and vague, but they do say the disks have been in near-continuous use for the past four years. I also see that one disk is connected at 12 Gb/s and the other only at 6 Gb/s. Since most of the errors reported are write errors and the disks themselves aren't complaining, you should share the details of the HBA(s) and expanders you're using. Either you were sold crap drives salvaged after being replaced for having crapped out, or you have a bizarre SAS connectivity issue.
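
For anyone wanting to compare these two points (power-on time and negotiated link rate) across many disks without reading each report by eye, here is a minimal Python sketch. It only parses text that was already saved from smartctl -x; the function names and the SAMPLE string are made up for illustration, and it assumes the SAS report layout shown above, where each phy lists an "attached device type" line followed by a "negotiated logical link rate" line.

```python
import re

# A few lines copied from the /dev/sdy report above, used as test input.
SAMPLE = """\
Accumulated power on time, hours:minutes 34173:51 [2050431 minutes]
attached device type: expander device
negotiated logical link rate: phy enabled; 6 Gbps
attached device type: no device attached
negotiated logical link rate: phy enabled; unknown
"""

def power_on_years(smart_text):
    """Return power-on time in years, or None if the line is missing."""
    m = re.search(r"Accumulated power on time, hours:minutes\s+(\d+):(\d+)",
                  smart_text)
    if m is None:
        return None
    hours = int(m.group(1)) + int(m.group(2)) / 60
    return hours / (24 * 365.25)

def attached_links(smart_text):
    """Return (attached device type, link rate) for each phy that has a device.

    Assumption: the two kinds of lines appear once per phy and in the same
    order, so they can be paired up positionally.
    """
    types = re.findall(r"attached device type:\s*(.+)", smart_text)
    rates = re.findall(r"negotiated logical link rate:\s*(.+)", smart_text)
    links = []
    for dev_type, rate in zip(types, rates):
        if dev_type.strip() == "no device attached":
            continue
        # Keep only the part after "phy enabled;", e.g. "6 Gbps".
        links.append((dev_type.strip(), rate.split(";")[-1].strip()))
    return links

if __name__ == "__main__":
    print(f"power-on time: {power_on_years(SAMPLE):.1f} years")
    for dev_type, rate in attached_links(SAMPLE):
        print(f"attached to {dev_type} at {rate}")
```

Running it over a saved report from each affected disk should make it quick to see whether the errors line up with one HBA, one expander, or one link speed.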
Yeah, no dice. Same issue. I'm currently away from the server, but I'll be able to check in on it tomorrow. I'll try switching out my LSI HBA, double-check the cables, and change where the drives sit in the chassis to see if the issue follows.

Best to do so, yes.
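
To tell whether the errors follow the drives or stay with a slot after the swap, it helps to record the per-device counters before and after. Below is a rough Python sketch that pulls the READ/WRITE/CKSUM columns out of saved zpool status text. The function names and SAMPLE are illustrative only. It assumes abbreviated counters such as 1.94K use a multiplier of 1024, so the converted values are approximate; if your OpenZFS version supports zpool status -p, that should print exact numbers instead.

```python
import re

# Values allowed in the STATE column of a config row.
STATES = {"ONLINE", "DEGRADED", "FAULTED", "UNAVAIL", "OFFLINE", "REMOVED"}
# Assumption: abbreviated counters are powers of 1024 (results are approximate).
SUFFIXES = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

# A few rows copied from the zpool status output earlier in the thread.
SAMPLE = """\
NAME STATE READ WRITE CKSUM
raidz1-2 DEGRADED 0 1.94K 0
6c8f8d99-40ee-4b14-9dc3-5bef3546b758 DEGRADED 0 963 1 too many errors
b57fca1f-0bd9-42a0-aca8-56aa73553b90 FAULTED 0 64 0 too many errors
77f58081-decd-4d38-b4c1-6139d0a1d5aa INUSE currently in use
"""

def parse_count(field):
    """Convert a counter such as '963' or '1.94K' to an integer."""
    m = re.fullmatch(r"(\d+(?:\.\d+)?)([KMGT]?)", field)
    if m is None:
        raise ValueError(f"not a counter: {field!r}")
    number, suffix = m.groups()
    return round(float(number) * SUFFIXES.get(suffix, 1))

def device_errors(status_text):
    """Return {name: (state, read, write, cksum)} for each config row."""
    rows = {}
    for line in status_text.splitlines():
        parts = line.split()
        # Skip headers, spare listings, and anything without three counters.
        if len(parts) < 5 or parts[1] not in STATES:
            continue
        try:
            counts = tuple(parse_count(p) for p in parts[2:5])
        except ValueError:
            continue
        rows[parts[0]] = (parts[1], *counts)
    return rows

def write_errors(status_text):
    """Return (name, write count) pairs with errors, highest count first."""
    found = [(name, row[2]) for name, row in device_errors(status_text).items()
             if row[2] > 0]
    return sorted(found, key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    for name, count in write_errors(SAMPLE):
        print(f"{count:>8}  {name}")
```

Saving one snapshot before moving the drives and one after (following a zpool clear and some write activity) gives two tables to compare: if the counts climb again on the same GUIDs in new slots, the drives are suspect; if they climb on whichever drives now sit behind the same HBA or cable, the path is.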