There may be an issue on Debian (and FreeBSD 12.x) where large drives randomly reset, especially when there is heavy I/O or other activity on the drives.
Running any kind of heavy I/O against the 18TB drives I have connected to a Supermicro BPN-SAS3-743A backplane, wired through to an LSI 9400-8i HBA, eventually results in the drives resetting at random. This happens even with the drives not assigned to any ZFS pool, and whether I run the load from the shell inside the GUI or from a plain shell session. It eventually hits every drive, across both of the separate SFF-8643 cables feeding the backplane's two SFF-8643 ports, and sometimes multiple drives reset at the exact same moment while others keep chugging along with whatever heavy I/O they were doing.
I can trigger this either by running badblocks on each drive (using: badblocks -c 1024 -w -s -v -e 1 -b 65536 /dev/sdX), or even just by running a SMART extended/long test.
Eventually, sometimes after only minutes and sometimes after many hours, the drives reset and even spin down (according to the shell logs). Sometimes they reset in batches, while others keep chugging along only to reset individually later. This has made completing any SMART extended test impossible. badblocks fails out with too many bad blocks on multiple drives at nearly the exact same moment, yet consecutive badblocks passes never report bad blocks in the same areas. The SMART test simply shows "aborted, drive reset?" as the result.
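For reference, the per-drive commands I run can be sketched as a small dry-run helper (drive names below are placeholders; piping its output to sh would actually execute the destructive write test, so be careful):

```shell
# Print the reproduction commands for each given drive WITHOUT running them.
# Note: badblocks -w is DESTRUCTIVE and overwrites the entire drive.
repro_cmds() {
  for dev in "$@"; do
    echo "badblocks -c 1024 -w -s -v -e 1 -b 65536 /dev/$dev"
    echo "smartctl -t long /dev/$dev"
  done
}

# Example with placeholder drive names:
repro_cmds sda sdb
```

Either workload, on its own, is enough to eventually provoke the resets.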
And while it isn't Debian, I've found what looks to be a nearly identical issue others are having, reported on the FreeBSD bug tracker: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=224496
My setup:
TrueNAS Scale 22.02.0.1
AMD Threadripper 1920X
ASRock X399 Taichi
128GB (8x16GB) Crucial CT8G4WFD824A Unbuffered ECC
AVAGO/LSI 9400-8i SAS3408 12Gbps HBA Adapter
Supermicro BPN-SAS3-743A 8-Port SAS3/SAS2/SATA 12Gbps Backplane
8 x Seagate Exos X18 18TB HDD ST18000NM004J SAS 12Gbps 512e/4Kn
2 x Crucial 120GB SSD
2 x Crucial 1TB SSD
2 x Western Digital 960GB NVME
Supermicro 4U case w/2000W redundant power supply
The server is connected to a large APC data-center battery system and line conditioner, in an HVAC-controlled room. All hard drives are on the newest firmware and formatted to 4K sectors, both logical and native. The controller has the newest firmware, both the regular and legacy ROMs, and has been flashed to SATA/SAS-only mode (dropping the NVMe multi/tri-mode option that the new 9400-series cards support).
My plan was to swap the HBA for an older LSI 9305-16i, replace the two SFF-8643-to-SFF-8643 cables between the HBA and the backplane for good measure, add two different SFF-8643-to-SFF-8482 cables that bypass the backplane entirely, move four of the existing Seagate 18TB drives onto those backplane-bypass connections, and add four new WD Ultrastar DC HC550 (WUH721818AL5204) drives into the mix (some using the backplane, some not). That should reveal whether this is a compatibility issue or bug affecting all large drives, or only certain large drives, on an LSI controller, the mpr driver, and/or this backplane.
If none of that works, or it doesn't eliminate all the potential points of failure, I'm left with nothing but the subpar workarounds reported in the thread I linked: using the onboard SATA ports instead of the LSI controller, disabling the NCQ function on the LSI controller, or setting up an L2ARC cache (I might also try a metadata cache to see whether that sidesteps the issue). Either way, it appears this may be a bug involving larger drives in tandem with an LSI HBA, a certain backplane, etc. In that thread, everyone who downgraded to FreeBSD 11.x no longer had the issue on the exact same hardware, so this may be a SAS mpr/mps driver issue present on both FreeBSD and Debian.
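On the NCQ workaround: as far as I know there's no single mpt3sas toggle for it on Linux, but dropping each drive's queue depth to 1 via sysfs effectively disables queued commands. A sketch of what I'd try (the sysfs root is parameterized so it can be dry-run against a copy; the setting does not persist across reboots):

```shell
# Effectively disable NCQ by forcing queue_depth to 1 for every sd* disk.
# Pass an alternate sysfs root for testing; defaults to the real /sys/block.
disable_ncq() {
  root=${1:-/sys/block}
  for qd in "$root"/sd*/device/queue_depth; do
    [ -e "$qd" ] || continue   # skip when the glob matched nothing
    echo 1 > "$qd"
  done
}
```

Run `disable_ncq` as root after boot; a udev rule would be needed to make it stick.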
Condensed logs when one drive errors out:
sd 0:0:0:0: device_unblock and setting to running, handle(0x000d)
mpt3sas_cm0: log_info(0x31110e05): originator(PL), code(0x11), sub_code(0x0e05)
mpt3sas_cm0: log_info(0x31110e05): originator(PL), code(0x11), sub_code(0x0e05)
~
~
~
~
sd 0:0:0:0: Power-on or device reset occurred
.......ready
sd 0:0:6:0: device_block, handle(0x000f)
sd 0:0:9:0: device_block, handle(0x0012)
sd 0:0:10:0: device_block, handle(0x0014)
mpt3sas_cm0: log_info(0x3112010c): originator(PL), code(0x12), sub_code(0x010c)
sd 0:0:9:0: device_unblock and setting to running, handle(0x0012)
sd 0:0:6:0: device_unblock and setting to running, handle(0x000f)
sd 0:0:10:0: device_unblock and setting to running, handle(0x0014)
sd 0:0:9:0: Power-on or device reset occurred
sd 0:0:6:0: Power-on or device reset occurred
sd 0:0:10:0: Power-on or device reset occurred
scsi_io_completion_action: 5 callbacks suppressed
sd 0:0:10:0: [sdd] tag#5532 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=2s
sd 0:0:10:0: [sdd] tag#5532 Sense Key : Not Ready [current] [descriptor]
sd 0:0:10:0: [sdd] tag#5532 Add. Sense: Logical unit not ready, additional power granted
sd 0:0:10:0: [sdd] tag#5532 CDB: Write(16) 8a 00 00 00 00 00 5c 75 7a 12 00 00 01 40 00 00
print_req_error: 5 callbacks suppressed
blk_update_request: I/O error, dev sdd, sector 12409622672 op 0x1:(WRITE) flags 0xc800 phys_seg 1 prio class 0
sd 0:0:10:0: [sdd] tag#5533 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=2s
sd 0:0:10:0: [sdd] tag#5533 Sense Key : Not Ready [current] [descriptor]
sd 0:0:10:0: [sdd] tag#5533 Add. Sense: Logical unit not ready, additional power use not yet granted
sd 0:0:10:0: [sdd] tag#5533 CDB: Write(16) 8a 00 00 00 00 00 5c 75 76 52 00 00 01 40 00 00
blk_update_request: I/O error, dev sdd, sector 12409614992 op 0x1:(WRITE) flags 0xc800 phys_seg 1 prio class 0
~
~
~
~
sd 0:0:10:0: [sdd] Spinning up disk...
.
sd 0:0:3:0: device_block, handle(0x0013)
mpt3sas_cm0: log_info(0x3112010c): originator(PL), code(0x12), sub_code(0x010c)
.
sd 0:0:3:0: device_unblock and setting to running, handle(0x0013)
.
sd 0:0:3:0: Power-on or device reset occurred
.................ready