10 out of 12 Drives Failing?

mallen

Cadet
Joined
Jan 8, 2023
Messages
3
Hi all,

I'm rather new to TrueNAS and am trying to make sense of this. My host is a Dell T620 with 12x3.5" bays and an H710p flashed to IT mode. I have 12x3TB HGST drives in it and am running VMware 7.03 on the metal, passing the drives through to TrueNAS Core as a guest. S.M.A.R.T. is throwing errors on 10 out of 12 drives, stating that the error count has increased from 3 to 4. I've included the output of two drives as a sample below. The self-test will occasionally complete but, as you can see, it usually reports that it failed in the second segment.

root@truenas[~]# smartctl -x /dev/da1
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: HGST
Product: HUS724030ALS640
Revision: A124
Compliance: SPC-4
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Logical block size: 512 bytes
LU is resource provisioned, LBPRZ=0
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x5000cca027c64a90
Serial number: P8KJ1N4W
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Sun Jan 8 15:28:23 2023 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled
Read Cache is: Enabled
Writeback Cache is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature: 35 C
Drive Trip Temperature: 85 C

Manufactured in week 13 of year 2014
Specified cycle count over device lifetime: 50000
Accumulated start-stop cycles: 66
Specified load-unload count over device lifetime: 600000
Accumulated load-unload cycles: 1869
Elements in grown defect list: 0

Vendor (Seagate Cache) information
Blocks sent to initiator = 5169733250842624

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:     569346        3         0    569349    101716216    1397605.623           0
write:         0        0         0         0        76480      20048.961           0
verify:     3420        0         0      3420         1256        230.685           0

Non-medium error count: 0

Self-test execution status: 100% of test remaining
SMART Self-test log
Num  Test              Status                     segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                                  number   (hours)
# 1  Background short  Self test in progress ...     2        NOW                - [-   -    -]
# 2  Background short  Failed in second segment      2      46322                - [0x4 0x3e 0x3]
# 3  Background short  Failed in second segment      2      46322                - [0x4 0x3e 0x3]
# 4  Background short  Failed in second segment      2      46321                - [0x4 0x3e 0x3]
# 5  Background short  Failed in second segment      2      46320                - [0x4 0x3e 0x3]
# 6  Background short  Failed in second segment      2          0                - [0x4 0x3e 0x3]

Long (extended) Self-test duration: 29637 seconds [493.9 minutes]

Background scan results log
Status: scan is active
Accumulated power on time, hours:minutes 46323:56 [2779436 minutes]
Number of background scans performed: 252, scan progress: 33.04%
Number of background medium scans performed: 252

Protocol Specific port log page for SAS SSP
relative target port id = 1
generation code = 1
number of phys = 1
phy identifier = 0
attached device type: expander device
attached reason: SMP phy control function
reason: unknown
negotiated logical link rate: phy enabled; 6 Gbps
attached initiator port: ssp=0 stp=0 smp=0
attached target port: ssp=0 stp=0 smp=1
SAS address = 0x5000cca027c64a91
attached SAS address = 0x500056b37789abff
attached phy identifier = 4
Invalid DWORD count = 12
Running disparity error count = 12
Loss of DWORD synchronization = 2
Phy reset problem = 0
Phy event descriptors:
Invalid word count: 12
Running disparity error count: 12
Loss of dword synchronization count: 2
Phy reset problem count: 0
relative target port id = 2
generation code = 1
number of phys = 1
phy identifier = 1
attached device type: no device attached
attached reason: unknown
reason: power on
negotiated logical link rate: phy enabled; unknown
attached initiator port: ssp=0 stp=0 smp=0
attached target port: ssp=0 stp=0 smp=0
SAS address = 0x5000cca027c64a92
attached SAS address = 0x0
attached phy identifier = 0
Invalid DWORD count = 0
Running disparity error count = 0
Loss of DWORD synchronization = 0
Phy reset problem = 0
Phy event descriptors:
Invalid word count: 0
Running disparity error count: 0
Loss of dword synchronization count: 0
Phy reset problem count: 0

root@truenas[~]# smartctl -x /dev/da2
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: HGST
Product: HUS724030ALS640
Revision: A124
Compliance: SPC-4
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Logical block size: 512 bytes
LU is resource provisioned, LBPRZ=0
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x5000cca027c6b62c
Serial number: P8KJ8U9W
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Sun Jan 8 15:29:57 2023 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled
Read Cache is: Enabled
Writeback Cache is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature: 35 C
Drive Trip Temperature: 85 C

Manufactured in week 13 of year 2014
Specified cycle count over device lifetime: 50000
Accumulated start-stop cycles: 68
Specified load-unload count over device lifetime: 600000
Accumulated load-unload cycles: 1868
Elements in grown defect list: 0

Vendor (Seagate Cache) information
Blocks sent to initiator = 6048071880278016

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:     453038        0         0    453038     86568375    1452301.097           0
write:         0        0         0         0       104536      23293.369           0
verify:     1711        0         0      1711          584        178.038           0

Non-medium error count: 0

SMART Self-test log
Num  Test              Status                     segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                                  number   (hours)
# 1  Background short  Failed in second segment      2      46284                - [0x4 0x3e 0x3]
# 2  Background short  Failed in second segment      2      46283                - [0x4 0x3e 0x3]
# 3  Background short  Failed in second segment      2      46283                - [0x4 0x3e 0x3]
# 4  Background short  Failed in second segment      2      46282                - [0x4 0x3e 0x3]
# 5  Background short  Failed in second segment      2      46281                - [0x4 0x3e 0x3]
# 6  Background short  Failed in second segment      2          0                - [0x4 0x3e 0x3]

Long (extended) Self-test duration: 29637 seconds [493.9 minutes]

Background scan results log
Status: scan is active
Accumulated power on time, hours:minutes 46284:47 [2777087 minutes]
Number of background scans performed: 239, scan progress: 33.25%
Number of background medium scans performed: 239

Protocol Specific port log page for SAS SSP
relative target port id = 1
generation code = 1
number of phys = 1
phy identifier = 0
attached device type: expander device
attached reason: SMP phy control function
reason: unknown
negotiated logical link rate: phy enabled; 6 Gbps
attached initiator port: ssp=0 stp=0 smp=0
attached target port: ssp=0 stp=0 smp=1
SAS address = 0x5000cca027c6b62d
attached SAS address = 0x500056b37789abff
attached phy identifier = 5
Invalid DWORD count = 8
Running disparity error count = 8
Loss of DWORD synchronization = 2
Phy reset problem = 0
Phy event descriptors:
Invalid word count: 8
Running disparity error count: 8
Loss of dword synchronization count: 2
Phy reset problem count: 0
relative target port id = 2
generation code = 1
number of phys = 1
phy identifier = 1
attached device type: no device attached
attached reason: unknown
reason: power on
negotiated logical link rate: phy enabled; unknown
attached initiator port: ssp=0 stp=0 smp=0
attached target port: ssp=0 stp=0 smp=0
SAS address = 0x5000cca027c6b62e
attached SAS address = 0x0
attached phy identifier = 0
Invalid DWORD count = 0
Running disparity error count = 0
Loss of DWORD synchronization = 0
Phy reset problem = 0


Any ideas? I use this as cold storage only, and it runs maybe a few hours a month.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Please use [CODE][/CODE] tags to make command output easier to read.

What's the output of zpool status?
I would run a long SMART test on each disk.

Those are some old drives; it's possible they are all dying at once.
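
As a rough sketch of the suggestion above, a long test could be started on every disk in one go. The device names /dev/da0 to /dev/da11 are an assumption (only da1 and da2 appear in the posted output), and `long_test_commands` and `start_long_tests` are hypothetical helper names. The script defaults to a dry run that only prints the commands.

```python
import subprocess


def long_test_commands(devices):
    """Build one `smartctl -t long` command per device."""
    return [["smartctl", "-t", "long", dev] for dev in devices]


def start_long_tests(devices, dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    commands = long_test_commands(devices)
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=False: keep going even if one drive refuses the test
            subprocess.run(cmd, check=False)
    return commands


if __name__ == "__main__":
    # Assumption: the 12 disks show up as /dev/da0 .. /dev/da11.
    disks = [f"/dev/da{i}" for i in range(12)]
    start_long_tests(disks, dry_run=True)
```

Each drive runs its self-test internally, so the tests can proceed on all disks at the same time. Afterwards, `smartctl -l selftest /dev/da1` shows the self-test log for a single disk.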
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Did you pass the individual drives through to the TrueNAS Core guest, or did you pass the whole H710p controller (flashed to IT mode) to the guest?

It matters, as it is generally more successful for VMs to pass the whole disk controller to the guest than just the drives.
 

Alecmascot

Guru
Joined
Mar 18, 2014
Messages
1,177
I believe that error,
"Background short Failed in second segment 2", is a bug in the drive firmware.
I have it intermittently on my bare-metal Core SAS drives and read that somewhere!
I will try to find it again.
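
For reference, the bracketed triple at the end of each failed row is the SCSI sense key, ASC and ASCQ that the drive logged. The sketch below decodes just the one combination seen in this thread; the lookup tables are deliberately tiny, and `decode_sense` is a hypothetical helper name. It only restates what the drive reported and says nothing about whether firmware or hardware is the cause.

```python
import re

# Only the codes that appear in the posted logs are listed here.
SENSE_KEYS = {0x4: "HARDWARE ERROR"}
ASC_ASCQ = {(0x3E, 0x03): "LOGICAL UNIT FAILED SELF-TEST"}

# Matches the trailing "[SK ASC ASQ]" triple, e.g. "[0x4 0x3e 0x3]".
TRIPLE = re.compile(
    r"\[(0x[0-9a-fA-F]+)\s+(0x[0-9a-fA-F]+)\s+(0x[0-9a-fA-F]+)\]"
)


def decode_sense(line):
    """Return a readable description of the sense triple, or None if absent."""
    match = TRIPLE.search(line)
    if match is None:
        return None  # e.g. a row still in progress shows "[- - -]"
    sk, asc, ascq = (int(group, 16) for group in match.groups())
    key = SENSE_KEYS.get(sk, f"sense key {sk:#x}")
    detail = ASC_ASCQ.get((asc, ascq), f"ASC/ASCQ {asc:#04x}/{ascq:#04x}")
    return f"{key}: {detail}"


sample = "# 2 Background short Failed in second segment 2 46322 - [0x4 0x3e 0x3]"
print(decode_sense(sample))  # HARDWARE ERROR: LOGICAL UNIT FAILED SELF-TEST
```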
 

Alecmascot

Guru
Joined
Mar 18, 2014
Messages
1,177

Post #5

At the end of the day I just lived with it.
 

mallen

Cadet
Joined
Jan 8, 2023
Messages
3
Arwen said:
Did you pass the individual drives through to the TrueNAS Core guest, or did you pass the whole H710p controller (flashed to IT mode) to the guest?

It matters, as it is generally more successful for VMs to pass the whole disk controller to the guest than just the drives.

I did pass the whole controller through.
 

mallen

Cadet
Joined
Jan 8, 2023
Messages
3
Davvo said:
Please use [CODE][/CODE] tags to make command output easier to read.

What's the output of zpool status?
I would run a long SMART test on each disk.

Those are some old drives; it's possible they are all dying at once.

I guess I could try that. Any idea how long it would take?

Otherwise, I guess I'll just go with what others have hinted at and live with it if I can't update the firmware on the drives.
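
On the question of how long a long test takes: the posted smartctl output already carries the drive's own estimate in the "Long (extended) Self-test duration" line. A minimal sketch that pulls it out and converts it to hours, minutes and seconds follows; `long_test_duration` is a hypothetical helper name, and the figure is the drive's estimate, so real runs may take longer if the disk is busy.

```python
import re


def long_test_duration(smart_output):
    """Return (hours, minutes, seconds) from smartctl text, or None if absent."""
    match = re.search(
        r"Long \(extended\) Self-test duration:\s*(\d+)\s*seconds", smart_output
    )
    if match is None:
        return None
    total_seconds = int(match.group(1))
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return hours, minutes, seconds


line = "Long (extended) Self-test duration: 29637 seconds [493.9 minutes]"
print(long_test_duration(line))  # (8, 13, 57)
```

So these drives estimate a little over eight hours per long test.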
 
