There may be an issue on Debian (and FreeBSD 12.x) where large drives randomly reset, especially when there is heavy I/O or other activity on the drives.
Running any kind of heavy I/O against the 18TB drives I have connected to a Supermicro BPN-SAS3-743A backplane, wired through to an LSI 9400-8i HBA, eventually results in the drives resetting at random. This happens even with the drives not assigned to any ZFS pool, and whether I run the load from the shell inside the GUI or from a plain shell session. It eventually hits every drive, across both of the separate SFF-8643 cables feeding the backplane's two SFF-8643 ports, and sometimes multiple drives reset at the exact same moment while others keep chugging along with whatever heavy I/O they were doing.
I can trigger this either by running badblocks on each drive (using: badblocks -c 1024 -w -s -v -e 1 -b 65536 /dev/sdX), or even just by running a SMART extended/long test.
Eventually, sometimes after only minutes and sometimes after many hours, the drives reset and even spin down (according to the shell logs). Sometimes they reset in batches, while others keep chugging along only to reset individually later. This has made completing any SMART extended test impossible. badblocks fails out with too many bad blocks on multiple drives at nearly the exact same moment, yet consecutive badblocks passes never report bad blocks in the same areas. The SMART test simply shows "aborted, drive reset?" as the result.
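For reference, the per-drive commands I run can be sketched as a small dry-run helper (drive names below are placeholders; piping its output to sh would actually execute the destructive write test, so be careful):

```shell
# Print the reproduction commands for each given drive WITHOUT running them.
# Note: badblocks -w is DESTRUCTIVE and overwrites the entire drive.
repro_cmds() {
  for dev in "$@"; do
    echo "badblocks -c 1024 -w -s -v -e 1 -b 65536 /dev/$dev"
    echo "smartctl -t long /dev/$dev"
  done
}

# Example with placeholder drive names:
repro_cmds sda sdb
```

Either workload, on its own, is enough to eventually provoke the resets.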
And while it isn't Debian, I've found what looks to be a nearly identical issue others are having, reported on the FreeBSD bug tracker: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=224496
My setup:
TrueNAS Scale 22.02.0.1
AMD Threadripper 1920X
ASRock X399 Taichi
128GB (8x16GB) Crucial CT8G4WFD824A Unbuffered ECC
AVAGO/LSI 9400-8i SAS3408 12Gbps HBA Adapter
Supermicro BPN-SAS3-743A 8-Port SAS3/SAS2/SATA 12Gbps Backplane
8 x Seagate Exos X18 18TB HDD ST18000NM004J SAS 12Gbps 512e/4Kn
2 x Crucial 120GB SSD
2 x Crucial 1TB SSD
2 x Western Digital 960GB NVME
Supermicro 4U case w/2000W redundant power supply
The server is connected to a large APC data-center battery system and line conditioner, in an HVAC-controlled room. All hard drives are on the newest firmware and formatted to 4K sectors, both logical and native. The controller has the newest firmware, both the regular and legacy ROMs, and has been flashed to SATA/SAS-only mode (dropping the NVMe multi/tri-mode option that the new 9400-series cards support).
My plan was to swap the HBA for an older LSI 9305-16i, replace the two SFF-8643-to-SFF-8643 cables between the HBA and the backplane for good measure, add two different SFF-8643-to-SFF-8482 cables that bypass the backplane entirely, move four of the existing Seagate 18TB drives onto those backplane-bypass connections, and add four new WD Ultrastar DC HC550 (WUH721818AL5204) drives into the mix (some using the backplane, some not). That should reveal whether this is a compatibility issue or bug affecting all large drives, or only certain large drives, on an LSI controller, the mpr driver, and/or this backplane.
If none of that works, or it doesn't eliminate all the potential points of failure, I'm left with nothing but the subpar workarounds reported in the thread I linked: using the onboard SATA ports instead of the LSI controller, disabling the NCQ function on the LSI controller, or setting up an L2ARC cache (I might also try a metadata cache to see whether that sidesteps the issue). Either way, it appears this may be a bug involving larger drives in tandem with an LSI HBA, a certain backplane, etc. In that thread, everyone who downgraded to FreeBSD 11.x no longer had the issue on the exact same hardware, so this may be a SAS mpr/mps driver issue present on both FreeBSD and Debian.
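On the NCQ workaround: as far as I know there's no single mpt3sas toggle for it on Linux, but dropping each drive's queue depth to 1 via sysfs effectively disables queued commands. A sketch of what I'd try (the sysfs root is parameterized so it can be dry-run against a copy; the setting does not persist across reboots):

```shell
# Effectively disable NCQ by forcing queue_depth to 1 for every sd* disk.
# Pass an alternate sysfs root for testing; defaults to the real /sys/block.
disable_ncq() {
  root=${1:-/sys/block}
  for qd in "$root"/sd*/device/queue_depth; do
    [ -e "$qd" ] || continue   # skip when the glob matched nothing
    echo 1 > "$qd"
  done
}
```

Run `disable_ncq` as root after boot; a udev rule would be needed to make it stick.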
Condensed logs when one drive errors out:
sd 0:0:0:0: device_unblock and setting to running, handle(0x000d)
mpt3sas_cm0: log_info(0x31110e05): originator(PL), code(0x11), sub_code(0x0e05)
mpt3sas_cm0: log_info(0x31110e05): originator(PL), code(0x11), sub_code(0x0e05)
~
~
~
~
sd 0:0:0:0: Power-on or device reset occurred
.......ready
sd 0:0:6:0: device_block, handle(0x000f)
sd 0:0:9:0: device_block, handle(0x0012)
sd 0:0:10:0: device_block, handle(0x0014)
mpt3sas_cm0: log_info(0x3112010c): originator(PL), code(0x12), sub_code(0x010c)
sd 0:0:9:0: device_unblock and setting to running, handle(0x0012)
sd 0:0:6:0: device_unblock and setting to running, handle(0x000f)
sd 0:0:10:0: device_unblock and setting to running, handle(0x0014)
sd 0:0:9:0: Power-on or device reset occurred
sd 0:0:6:0: Power-on or device reset occurred
sd 0:0:10:0: Power-on or device reset occurred
scsi_io_completion_action: 5 callbacks suppressed
sd 0:0:10:0: [sdd] tag#5532 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=2s
sd 0:0:10:0: [sdd] tag#5532 Sense Key : Not Ready [current] [descriptor]
sd 0:0:10:0: [sdd] tag#5532 Add. Sense: Logical unit not ready, additional power granted
sd 0:0:10:0: [sdd] tag#5532 CDB: Write(16) 8a 00 00 00 00 00 5c 75 7a 12 00 00 01 40 00 00
print_req_error: 5 callbacks suppressed
blk_update_request: I/O error, dev sdd, sector 12409622672 op 0x1:(WRITE) flags 0xc800 phys_seg 1 prio class 0
sd 0:0:10:0: [sdd] tag#5533 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=2s
sd 0:0:10:0: [sdd] tag#5533 Sense Key : Not Ready [current] [descriptor]
sd 0:0:10:0: [sdd] tag#5533 Add. Sense: Logical unit not ready, additional power use not yet granted
sd 0:0:10:0: [sdd] tag#5533 CDB: Write(16) 8a 00 00 00 00 00 5c 75 76 52 00 00 01 40 00 00
blk_update_request: I/O error, dev sdd, sector 12409614992 op 0x1:(WRITE) flags 0xc800 phys_seg 1 prio class 0
~
~
~
~
sd 0:0:10:0: [sdd] Spinning up disk...
.
sd 0:0:3:0: device_block, handle(0x0013)
mpt3sas_cm0: log_info(0x3112010c): originator(PL), code(0x12), sub_code(0x010c)
.
sd 0:0:3:0: device_unblock and setting to running, handle(0x0013)
.
sd 0:0:3:0: Power-on or device reset occurred
.................ready