Lots of Drive Failures ST8000NM0095

stlscott (Cadet, joined Mar 6, 2023, 1 message)
The Basics:
Dell R730XD Server
Latest BIOS/Firmware
PERC H730 Mini with firmware version 25.5.9.0001
Configured in HBA mode
10x 8 TB SAS ST8000NM0095 drives in two 5-drive pools
1x 140 GB 10k SAS (boot volume)
1x 500 GB SATA SSD for cache
128 GB RAM
2x Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
TrueNAS version: TrueNAS-13.0-U3.1

The 10 drives came from a working DL380 G8 that had no issues but needed to be repurposed.
Since moving them to the R730 and running TrueNAS, 5 of the drives have thrown checksum errors. I removed those drives and ran Seagate tools on them: no errors. I replaced 3 of the 5, only to have 1 of the replacements show checksum errors as well. Some drive logs are below. It seems TrueNAS is catching "read checksum errors" that other systems never reported with the same drives. Having 5 of the 10 drives go bad in 2-3 months seems extremely unlikely. The last one below (da9) is a replacement installed Friday of last week and is already showing 1 checksum error. SMART shows everything is good, the SMART self-tests (both long and short) all pass, and Seagate tools say the drive is optimal with no errors. Why does TrueNAS keep showing these checksum errors? I have 3 identical servers (same hardware, same drives, same firmware), and all 3 keep having drive checksum errors.
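
ZFS reports checksum errors per device in the CKSUM column of `zpool status`, so one way to compare the three servers is to tally that column. Below is a minimal sketch, not something taken from these machines: the function name `devices_with_cksum_errors` and the `SAMPLE_STATUS` text (pool name, device names, counts) are hypothetical and only illustrate the usual table layout.

```python
# Hypothetical `zpool status` excerpt, for illustration only.
# It is NOT output from the servers described in this post.
SAMPLE_STATUS = """\
  pool: tank
 state: ONLINE
config:

\tNAME        STATE     READ WRITE CKSUM
\ttank        ONLINE       0     0     0
\t  raidz1-0  ONLINE       0     0     0
\t    da0     ONLINE       0     0     0
\t    da7     ONLINE       0     0     3
\t    da8     ONLINE       0     0     1
\tcache
\t  da10      ONLINE       0     0     0

errors: No known data errors
"""


def devices_with_cksum_errors(status_text):
    """Return {row_name: cksum_value} for table rows whose CKSUM is not "0".

    Pool and vdev rows are included too if their CKSUM column is non-zero.
    Values are kept as strings because zpool may abbreviate large counts.
    """
    flagged = {}
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        # The device table starts right after this header row.
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_table = True
            continue
        # The "errors:" line follows the table for each pool.
        if fields and fields[0] == "errors:":
            in_table = False
            continue
        # Section labels such as "cache" have fewer than 5 fields and are skipped.
        if in_table and len(fields) >= 5 and fields[4] != "0":
            flagged[fields[0]] = fields[4]
    return flagged


print(devices_with_cksum_errors(SAMPLE_STATUS))  # {'da7': '3', 'da8': '1'}
```

On a live system the text could come from `subprocess.run(["zpool", "status"], capture_output=True, text=True).stdout`. Running the same tally on each server would show whether the errors follow particular drives or particular bays.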

root@truenas03[~]# smartctl -a /dev/da7
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p2 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: SEAGATE
Product: ST8000NM0095
Revision: KT01
Compliance: SPC-4
User Capacity: 8,001,563,222,016 bytes [8.00 TB]
Logical block size: 4096 bytes
LU is fully provisioned
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x5000c50084ae87bb
Serial number: ZA10GNKT0000R619X582
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Mon Mar 6 08:54:27 2023 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Grown defects during certification = 0
Total blocks reassigned during format = 0
Total new blocks reassigned = 0
Power on minutes since format = 95200
Current Drive Temperature: 40 C
Drive Trip Temperature: 60 C

Accumulated power on time, hours:minutes 50497:51
Manufactured in week 17 of year 2016
Specified cycle count over device lifetime: 10000
Accumulated start-stop cycles: 327
Specified load-unload count over device lifetime: 300000
Accumulated load-unload cycles: 11691
Elements in grown defect list: 0

Vendor (Seagate Cache) information
Blocks sent to initiator = 720551429
Blocks received from initiator = 1054667707
Blocks read from cache and sent to initiator = 2299902
Number of read and write commands whose size <= segment size = 22922888
Number of read and write commands whose size > segment size = 265011

Vendor (Seagate/Hitachi) factory information
number of hours powered up = 50497.85
number of minutes until next internal SMART test = 36

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   1435742088        0         0  1435742088          0       2951.379           0
write:           0        0         0           0          0       4333.616           0

Non-medium error count: 2281


[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -   50465                 - [-   -    -]
# 2  Background long   Completed                   -   50446                 - [-   -    -]
# 3  Background short  Completed                   -   50297                 - [-   -    -]
# 4  Background short  Completed                   -   50129                 - [-   -    -]
# 5  Background short  Completed                   -   49961                 - [-   -    -]
# 6  Background short  Completed                   -   49793                 - [-   -    -]
# 7  Background short  Completed                   -   49625                 - [-   -    -]
# 8  Background short  Completed                   -   49456                 - [-   -    -]



Long (extended) Self-test duration: 47220 seconds [787.0 minutes]


root@truenas03[~]# smartctl -a /dev/da8
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p2 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: SEAGATE
Product: ST8000NM0095
Revision: KT02
Compliance: SPC-4
User Capacity: 8,001,563,222,016 bytes [8.00 TB]
Logical block size: 4096 bytes
LU is fully provisioned
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x5000c500adc46b4b
Serial number: ZA1F5H1N0000C93866LA
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Mon Mar 6 09:35:28 2023 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned <not available>
Power on minutes since format <not available>
Current Drive Temperature: 38 C
Drive Trip Temperature: 60 C

Accumulated power on time, hours:minutes 18797:04
Manufactured in week 22 of year 2019
Specified cycle count over device lifetime: 10000
Accumulated start-stop cycles: 135
Specified load-unload count over device lifetime: 300000
Accumulated load-unload cycles: 2530
Elements in grown defect list: 0

Vendor (Seagate Cache) information
Blocks sent to initiator = 731390971
Blocks received from initiator = 932448958
Blocks read from cache and sent to initiator = 1978622
Number of read and write commands whose size <= segment size = 22832494
Number of read and write commands whose size > segment size = 260584

Vendor (Seagate/Hitachi) factory information
number of hours powered up = 18797.07
number of minutes until next internal SMART test = 56

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   1461503400        0         0  1461503400          0       2995.777           0
write:           0        0         0           0          0       3822.316           0

Non-medium error count: 245


[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -   18763                 - [-   -    -]
# 2  Background short  Completed                   -   18595                 - [-   -    -]
# 3  Background short  Completed                   -   18427                 - [-   -    -]
# 4  Background short  Completed                   -   18259                 - [-   -    -]
# 5  Background short  Completed                   -   18091                 - [-   -    -]
# 6  Background short  Completed                   -   17923                 - [-   -    -]
# 7  Background short  Completed                   -   17755                 - [-   -    -]



Long (extended) Self-test duration: 47220 seconds [787.0 minutes]

root@truenas03[~]# smartctl -a /dev/da9
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p2 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: HP
Product: EG0146FARTR
Revision: HPDA
Compliance: SPC-3
User Capacity: 146,815,737,856 bytes [146 GB]
Logical block size: 512 bytes
Rotation Rate: 10025 rpm
Form Factor: 2.5 inches
Logical Unit id: 0x500000e111b59dc0
Serial number: D0A1P9801STE0936
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Mon Mar 6 09:36:33 2023 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature: 58 C
Drive Trip Temperature: 65 C

Accumulated power on time, hours:minutes 108221:02
Manufactured in week 36 of year 2009
Specified cycle count over device lifetime: 50000
Accumulated start-stop cycles: 80
Elements in grown defect list: 1

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:            0        1         0           0          0     647130.139           1
write:           0        0         0           0          0      25875.755           0

Non-medium error count: 509

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -   65535                 - [-   -    -]
# 2  Background short  Completed                   -   65535                 - [-   -    -]
# 3  Background short  Completed                   -   65535                 - [-   -    -]
# 4  Background short  Completed                   -   65535                 - [-   -    -]
# 5  Background short  Completed                   -   65535                 - [-   -    -]
# 6  Background short  Completed                   -   65535                 - [-   -    -]
# 7  Background short  Completed                   -   65535                 - [-   -    -]
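
The three smartctl dumps above are hard to compare by eye. Here is a minimal sketch, assuming the SCSI/SAS text layout that smartctl 7.2 prints above, that pulls out the grown defect count, the non-medium error count, and the total uncorrected read/write errors from one dump. The function name `summarize_smart` is made up for illustration; `SAMPLE` is excerpted from the da9 output above.

```python
import re

# Excerpt of the `smartctl -a /dev/da9` output shown above.
SAMPLE = """\
Elements in grown defect list: 1

Error counter log:
read: 0 1 0 0 0 647130.139 1
write: 0 0 0 0 0 25875.755 0

Non-medium error count: 509
"""


def summarize_smart(text):
    """Extract a few health counters from `smartctl -a` output of a SAS drive.

    Fields that are missing from the text are simply left out of the result.
    """
    summary = {}

    match = re.search(r"Elements in grown defect list:\s*(\d+)", text)
    if match:
        summary["grown_defects"] = int(match.group(1))

    match = re.search(r"Non-medium error count:\s*(\d+)", text)
    if match:
        summary["non_medium_errors"] = int(match.group(1))

    # In the error counter log, the last column of the "read:" and "write:"
    # rows is the total number of uncorrected errors.
    for op in ("read", "write"):
        match = re.search(rf"^\s*{op}:\s+(.+)$", text, re.MULTILINE)
        if match:
            summary[f"{op}_uncorrected"] = int(match.group(1).split()[-1])

    return summary


print(summarize_smart(SAMPLE))
```

Feeding it the da7 and da8 dumps the same way gives non-medium error counts of 2281 and 245 with zero uncorrected errors, which makes it easier to line the drives up side by side.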
 