LSI 2308/MPS driver issue on 12.0

Joined
Jun 5, 2021
Messages
1
I think I may have discovered an issue with the mps driver in TrueNAS 12.0 and would appreciate any help.

I'm putting together a NAS running two LSI 2308 (LSI 9207-8e) HBAs flashed with IT mode firmware 20.00.07.00. When running "badblocks -wvs" on various drives on both controllers, I'm seeing tons of read and write errors. I can reproduce this on 12.0-U4, U3.1, and BETA; however, I don't encounter any errors running the same experiment on FreeNAS 11.3-U5 or Ubuntu 20.04.

I'm including what I think is the relevant info from dmesg below.

Code:
mps0: <Avago Technologies (LSI) SAS2308> port 0x7000-0x70ff mem 0xc5e40000-0xc5e4ffff,0xc5e00000-0xc5e3ffff irq 32 at device 0.0 numa-domain 0 on pci5
mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
mps0: IOCCapabilities: 5a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc>
mps1: <Avago Technologies (LSI) SAS2308> port 0xb000-0xb0ff mem 0xe0e40000-0xe0e4ffff,0xe0e00000-0xe0e3ffff irq 40 at device 0.0 numa-domain 0 on pci7
mps1: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
mps1: IOCCapabilities: 5a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc>
...
mps1: Controller reported scsi ioc terminated tgt 4 SMID 2125 loginfo 31110d00
(da10:mps1:0:4:0): WRITE(10). CDB: 2a 00 00 42 dd 00 00 00 80 00
(da10:mps1:0:4:0): CAM status: CCB request completed with an error
(da10:mps1:0:4:0): Retrying command, 3 more tries remain
(da10:mps1:0:4:0): WRITE(10). CDB: 2a 00 00 42 dd 00 00 00 80 00
(da10:mps1:0:4:0): CAM status: SCSI Status Error
(da10:mps1:0:4:0): SCSI status: Check Condition
(da10:mps1:0:4:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da10:mps1:0:4:0): Retrying command (per sense data)
(da10:mps1:0:4:0): WRITE(10). CDB: 2a 00 00 42 dd 00 00 00 80 00
(da10:mps1:0:4:0): CAM status: SCSI Status Error
(da10:mps1:0:4:0): SCSI status: Check Condition
(da10:mps1:0:4:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da10:mps1:0:4:0): Retrying command (per sense data)
(da10:mps1:0:4:0): WRITE(10). CDB: 2a 00 00 42 dd 00 00 00 80 00
(da10:mps1:0:4:0): CAM status: SCSI Status Error
(da10:mps1:0:4:0): SCSI status: Check Condition
(da10:mps1:0:4:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da10:mps1:0:4:0): Retrying command (per sense data)
(da10:mps1:0:4:0): WRITE(10). CDB: 2a 00 00 42 dd 00 00 00 80 00
(da10:mps1:0:4:0): CAM status: SCSI Status Error
(da10:mps1:0:4:0): SCSI status: Check Condition
(da10:mps1:0:4:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da10:mps1:0:4:0): Error 5, Retries exhausted
(da10:mps1:0:4:0): WRITE(10). CDB: 2a 00 00 42 dd 00 00 00 02 00
(da10:mps1:0:4:0): CAM status: SCSI Status Error
(da10:mps1:0:4:0): SCSI status: Check Condition
(da10:mps1:0:4:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da10:mps1:0:4:0): Retrying command (per sense data)
...
(da10:mps1:0:4:0): WRITE(10). CDB: 2a 00 00 42 dd 1a 00 00 02 00
(da10:mps1:0:4:0): CAM status: SCSI Status Error
(da10:mps1:0:4:0): SCSI status: Check Condition
(da10:mps1:0:4:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da10:mps1:0:4:0): Retrying command (per sense data)
(da10:mps1:0:4:0): WRITE(10). CDB: 2a 00 00 44 43 80 00 00 80 00
(da10:mps1:0:4:0): CAM status: SCSI Status Error
(da10:mps1:0:4:0): SCSI status: Check Condition
(da10:mps1:0:4:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da10:mps1:0:4:0): Retrying command (per sense data)
...
mps0: Controller reported scsi ioc terminated tgt 11 SMID 825 loginfo 31110d00
mps0: (da6:mps0:0:11:0): WRITE(10). CDB: 2a 00 09 0e ff 00 00 00 80 00
Controller reported scsi ioc terminated tgt 12 SMID 834 loginfo 31110d00
(da7:mps0:0:12:0): WRITE(10). CDB: 2a 00 07 3f 49 00 00 00 80 00
(da7:mps0:0:12:0): CAM status: CCB request completed with an error
(da6:mps0:0:11:0): CAM status: CCB request completed with an error
(da6:mps0:0:11:0): Retrying command, 3 more tries remain
(da6:mps0:0:11:0): WRITE(10). CDB: 2a 00 09 0e ff 00 00 00 80 00
(da7:mps0:0:12:0): Retrying command, 3 more tries remain
(da7:mps0:0:12:0): WRITE(10). CDB: 2a 00 07 3f 49 00 00 00 80 00
(da6:mps0:0:11:0): CAM status: SCSI Status Error
(da6:mps0:0:11:0): SCSI status: Check Condition
(da6:mps0:0:11:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:11:0): Retrying command (per sense data)
(da6:mps0:0:11:0): WRITE(10). CDB: 2a 00 09 0e ff 00 00 00 80 00
(da7:mps0:0:12:0): CAM status: SCSI Status Error
(da7:mps0:0:12:0): SCSI status: Check Condition
(da7:mps0:0:12:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:11:0): CAM status: SCSI Status Error
(da6:mps0:0:11:0): SCSI status: Check Condition
(da6:mps0:0:11:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:11:0): Retrying command (per sense data)
...
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Any chance you can try running only one controller as a test? I'm surprised to see a potential regression in the mps driver slip through this long.
 

JerRatt

Dabbler
Joined
May 17, 2022
Messages
14
I'm having the same or similar issue. It possibly slipped through this long because it may require a newer controller (2308 or newer) together with very large hard drives. The OP didn't say how large his drives were, but I'm willing to bet they are over 10TB.

There may be an issue with large drives experiencing random device resets, especially under heavy I/O or activity, on Debian-based systems or FreeBSD 12.x.

Running any kind of heavy I/O on the 18TB drives that I have connected to a Supermicro BPN-SAS3-743A backplane, wired through to an LSI 9400-8i HBA, eventually results in the drives resetting randomly. This happens even without the drives being assigned to any ZFS pool, and whether running from the shell within the GUI or from the shell itself. It eventually happens on all drives, across two separate SFF8643 cables into a backplane with two separate SFF8643 ports, and sometimes multiple drives reset at the exact same time while others continue chugging along with whatever heavy I/O they were doing.

To cause this to happen, I can either run badblocks on each drive (using: badblocks -c 1024 -w -s -v -e 1 -b 65536 /dev/sdX) or just run a SMART extended/long test.
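For reference, a small wrapper of the kind I mean, which brackets each run with timestamps so reset times in dmesg can be lined up against the badblocks window (a sketch only; the device and log path are placeholders for your setup):

```shell
# Run the same destructive badblocks pass on one drive, bracketing it
# with epoch timestamps so dmesg reset times can be correlated with the
# run window. Sketch only; device and log path are placeholders.
run_badblocks() {
  local dev="$1" log="$2"
  date +%s >>"$log"        # start marker
  badblocks -c 1024 -w -s -v -e 1 -b 65536 "$dev" >>"$log" 2>&1
  date +%s >>"$log"        # end marker
}

# Usage (destructive! wipes the drive):
#   run_badblocks /dev/sdX /root/bb-sdX.log
```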

Eventually, sometimes within minutes and sometimes many hours later, all the drives will reset, even spin down (according to the shell logs). Sometimes the drives reset in batches, while others continue chugging along only to reset individually later. It has made completing any kind of SMART extended test impossible. Badblocks will fail out on multiple hard drives at nearly the exact same moment, reporting too many bad blocks, yet consecutive badblocks scans won't report bad blocks in the same areas. The SMART test will just show "aborted, drive reset?" as the result.

And while it isn't Debian, I've found what looks to be a near identical issue others are having on the FreeBSD forums: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=224496

My setup:
TrueNAS Scale 22.02.0.1
AMD Threadripper 1920X
ASRock X399 Taichi
128GB (8x16GB) Crucial CT8G4WFD824A Unbuffered ECC
AVAGO/LSI 9400-8i SAS3408 12Gbps HBA Adapter
Supermicro BPN-SAS3-743A 8-Port SAS3/SAS2/SATA 12Gbps Backplane
8 x Seagate Exos X18 18TB HDD ST18000NM004J SAS 12Gbps 512e/4Kn
2 x Crucial 120GB SSD
2 x Crucial 1TB SSD
2 x Western Digital 960GB NVME
Supermicro 4U case w/2000W redundant power supply

The server is connected to a large APC data-center battery system and power conditioner, in an HVAC-controlled area. All hard drives have the newest firmware and are formatted with 4K sectors, both logical and native. The controller has the newest firmware, both regular and legacy ROMs, and is flashed to SATA/SAS-only mode (dropping the NVMe multi/tri-mode option that the new 9400-series cards support).

My plan was to replace the HBA with an older LSI 9305-16i, replace the two SFF8643-SFF8643 cables going from the HBA to the backplane just for good measure, install two different SFF8643-SFF8482 cables that bypass the backplane entirely, move four of the existing Seagate 18TB drives onto those SFF8643-SFF8482 connections, and add four new WD Ultrastar DC HC550 (WUH721818AL5204) drives into the mix (some using the backplane, some not). That should reveal whether this is a compatibility/bug issue with all large drives or only certain large drives on an LSI controller, the mpr driver, and/or this backplane.

If none of that works, or doesn't eliminate all the potential points of failure, I'm left with nothing but the subpar workarounds that have been reported in the thread I linked, such as using the onboard SATA ports instead of the LSI controller, disabling NCQ on the LSI controller, or setting up an L2ARC cache (or I might try a metadata cache to see if that circumvents the issue as well). Either way, it appears this may be a bug involving larger drives used in tandem with an LSI HBA, a certain backplane, etc. In that thread, everyone who downgraded to the 11.x version of FreeBSD no longer had the issue on the exact same system, so this may be a SAS mpr/mps driver issue present on both FreeBSD and Debian.
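On a Linux/SCALE system, the NCQ workaround mentioned above amounts to dropping a disk's queue depth to 1 via sysfs. A minimal sketch (the device name is an example, and the sysfs root is parameterized here purely so the helper can be exercised without real hardware):

```shell
# Effectively disable NCQ on one Linux disk by setting its queue depth
# to 1. Sketch only; "sdd" is an example device name, and the sysfs
# root defaults to /sys but can be pointed at a fake tree for testing.
set_queue_depth() {
  local dev="$1" depth="$2" root="${3:-/sys}"
  echo "$depth" > "$root/block/$dev/device/queue_depth"
}

# Usage (as root, against the real sysfs):
#   set_queue_depth sdd 1
```

Note this setting does not persist across reboots; it would need a udev rule or startup script to stick.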




Condensed logs when one drive errors out:

Code:

sd 0:0:0:0: device_unblock and setting to running, handle(0x000d)
mpt3sas_cm0: log_info(0x31110e05): originator(PL), code(0x11), sub_code(0x0e05)
mpt3sas_cm0: log_info(0x31110e05): originator(PL), code(0x11), sub_code(0x0e05)
~
~
~
~
sd 0:0:0:0: Power-on or device reset occurred
.......ready
sd 0:0:6:0: device_block, handle(0x000f)
sd 0:0:9:0: device_block, handle(0x0012)
sd 0:0:10:0: device_block, handle(0x0014)
mpt3sas_cm0: log_info(0x3112010c): originator(PL), code(0x12), sub_code(0x010c)
sd 0:0:9:0: device_unblock and setting to running, handle(0x0012)
sd 0:0:6:0: device_unblock and setting to running, handle(0x000f)
sd 0:0:10:0: device_unblock and setting to running, handle(0x0014)
sd 0:0:9:0: Power-on or device reset occurred
sd 0:0:6:0: Power-on or device reset occurred
sd 0:0:10:0: Power-on or device reset occurred
scsi_io_completion_action: 5 callbacks suppressed
sd 0:0:10:0: [sdd] tag#5532 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=2s
sd 0:0:10:0: [sdd] tag#5532 Sense Key : Not Ready [current] [descriptor]
sd 0:0:10:0: [sdd] tag#5532 Add. Sense: Logical unit not ready, additional power granted
sd 0:0:10:0: [sdd] tag#5532 CDB: Write(16) 8a 00 00 00 00 00 5c 75 7a 12 00 00 01 40 00 00
print_req_error: 5 callbacks suppressed
blk_update_request: I/O error, dev sdd, sector 12409622672 op 0x1:(WRITE) flags 0xc800 phys_seg 1 prio class 0
sd 0:0:10:0: [sdd] tag#5533 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=2s
sd 0:0:10:0: [sdd] tag#5533 Sense Key : Not Ready [current] [descriptor]
sd 0:0:10:0: [sdd] tag#5533 Add. Sense: Logical unit not ready, additional power use not yet granted
sd 0:0:10:0: [sdd] tag#5533 CDB: Write(16) 8a 00 00 00 00 00 5c 75 76 52 00 00 01 40 00 00
blk_update_request: I/O error, dev sdd, sector 12409614992 op 0x1:(WRITE) flags 0xc800 phys_seg 1 prio class 0
~
~
~
~
sd 0:0:10:0: [sdd] Spinning up disk...
.
sd 0:0:3:0: device_block, handle(0x0013)
mpt3sas_cm0: log_info(0x3112010c): originator(PL), code(0x12), sub_code(0x010c)
.
sd 0:0:3:0: device_unblock and setting to running, handle(0x0013)
.
sd 0:0:3:0: Power-on or device reset occurred
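For anyone comparing notes, a quick way to tally these events from a saved dmesg capture (a sketch; the filename is a placeholder):

```shell
# Tally device resets and mpt3sas log_info codes from a saved dmesg
# capture, to see which codes dominate and how often resets occur.
# Sketch only; pass your own capture file as the argument.
summarize_resets() {
  grep -c 'Power-on or device reset occurred' "$1"
  grep -oE 'log_info\(0x[0-9a-f]+\)' "$1" | sort | uniq -c | sort -rn
}

# Usage:
#   dmesg > /tmp/dmesg.txt && summarize_resets /tmp/dmesg.txt
```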
 

kingtool

Dabbler
Joined
Sep 11, 2022
Messages
16
Although I'm not currently using TrueNAS on my machine, the above post is very similar to my situation: disks under load plus a running extended SMART test just randomly reset. And right now, it's only affecting my largest disks (18TB). I have isolated individual HBAs and cables to no avail. I am still swapping things around, but for the sake of data I can say that my cables are fine and seated, and the problem travels from one HBA to the next, so it's tied to the backplane/disks/disk firmware/disk type. The next thing I need to do is move the 18T disks to a different backplane and see if it moves there too.

The above logs look almost exactly like my own, including the specific error code masks thrown by mpt3sas when a reset occurs.

Some info on my setup:

I'm running Debian testing. I have an Epyc 7402P on a Supermicro H12SSL-CT.

I have a pool made up of five 11-disk vdevs (5x11). Disks vary in size with the smallest vdevs made up of 12T disks, and the newest/largest made of 18TB disks. Disks are attached on an SC846 (front 24 bays) to one on-board SAS3008 HBA, then a SC847-JBOD (front 24) to another HBA (9300-8e), and finally the rear 22 bays of the same to a third HBA (also 9300-8e). All of the HBAs are on 16.00.12.00, a firmware that was posted to these forums that was supposedly meant to fix the large numbers of resets that kept occurring.

I've had this issue come and go over time. I believe in addition to the spurious resets we've all seen, you can trigger resets many other ways, including the following two rakes I have stepped on repeatedly:

1. Overheating on the HBAs. These are passively cooled, with very little information or documentation from LSI/Broadcom on the maximum operating temperature. Supermicro's default SC846 layout is also terrible and provides almost no airflow to cards close to the outer edge of the chassis. If you don't add a fan to the card or a PCIe slot fan next to it, or some kind of ducting to redirect midplane fans to the cards, the card will overheat and the *card itself* will hard reset under load, completely taking out every disk attached to it and making zfs crap its pants.

2. If you decide you want to monitor the above temperatures, the only place you can get the data is from storcli. Until fairly recently, you could not retrieve the ROC temperature from storcli without running "storcli64 /call show all"; running this command frequently enough will eventually start blocking and cause the adapter to hang and reset disks. I have no idea why this happens, but it's reproducible. Recently, I think somewhere around storcli 7.2106, they finally added "show temperature" as an option which no longer forces the utility to scan every disk and do whatever else it does, so it may be possible to have netdata/prometheus/whatever regularly fetch the temperatures for alerting.
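If it helps anyone, the lighter-weight poll can be wrapped like this (a sketch: it assumes a storcli64 new enough to have "show temperature", i.e. roughly 7.2106+, and the exact output wording varies by version, so the parsing here is an assumption to adjust):

```shell
# Read just the ROC temperature via storcli's "show temperature",
# avoiding the heavy "/call show all" scan. Assumes storcli >= ~7.2106;
# the parsing is an assumption, since output wording varies by version.
roc_temp() {
  local ctrl="${1:-/c0}"
  storcli64 "$ctrl" show temperature \
    | grep -i 'temperature' | grep -oE '[0-9]+' | tail -1
}

# Usage: feed roc_temp into netdata/prometheus on a timer, e.g.:
#   roc_temp /c0
```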

I sure hope there is some progress or resolution! I will continue to attempt to isolate the issue on my hardware, but it just keeps seeming to be these poorly-supported adapters. I am tempted to buy a 9500-series HBA just to see if it solves the problem, but I hate to give Broadcom more money when they aren't going to really provide support for individual consumers either way.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
2. If you decide you want to monitor the above temperatures, the only place you can get the data is from storcli. Until fairly recently, you could not retrieve the ROC temperature from storcli without running "storcli64 /call show all"; running this command frequently enough will eventually start blocking and cause the adapter to hang and reset disks. I have no idea why this happens, but it's reproducible. Recently, I think somewhere around storcli 7.2106, they finally added "show temperature" as an option which no longer forces the utility to scan every disk and do whatever else it does, so it may be possible to have netdata/prometheus/whatever regularly fetch the temperatures for alerting.
I think the FreeBSD driver for SAS3 cards recently-ish added support for a less-terrible option for reading the temperature sensor. I don't remember if this also applied to SAS2.5 cards (i.e. SAS2308) and am pretty sure SAS2.0 (SAS2008 and similar) don't support it.
I sure hope there is some progress or resolution! I will continue to attempt to isolate the issue on my hardware, but it just keeps seeming to be these poorly-supported adapters. I am tempted to buy a 9500-series HBA just to see if it solves the problem, but I hate to give Broadcom more money when they aren't going to really provide support for individual consumers either way.
It's been hard to get some clarity on the status of firmware and drivers for the later SAS3 stuff (LSI SAS 94xx and 95xx).

There's been a ton of SAS3008s in use with zero issues, so I somewhat suspect these resets might involve some newer disks with features that were not in common use back when the bulk of the development took place.

@mav@, do you know of any recent changes to the Broadcom SAS3 landscape that may be relevant here?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I assume those reporting this issue have looked at the related threads where users have been reporting the command timeouts? Largely Seagate-related, but probably worth reading through.


The link is to a user who reports the issue "solved" after effectively disabling NCQ on the drives (although, as mentioned previously, this is a subpar fix).


@JerRatt if you're still here do you have any feedback to report on if the WD HC550 drives also suffer from the same reset alerts?
 

kingtool

Dabbler
Joined
Sep 11, 2022
Messages
16
As an addendum to my post since I couldn’t edit quite yet due to my newness: the repro I have right now is on 18TB HC550’s. The rest of the disks are:

vdev 1,2,3: 11x HGST HUH721212ALN600
vdev 4: 11x Seagate Exos ST16000NM001G
vdev 5: 11x WD HC550 WUH721818ALE6L4

Those HC550’s are on the PCGNW232 (R232) firmware. There is a newer firmware available but I have not been able to find a detailed change log and didn’t want to upgrade for no reason.

Also, the backplane in question is the rear Supermicro 22-disk type with dual expanders. I don't have the exact model number on hand, but it would be a bit different from the model mentioned above, which was a direct-attach model.
 

JerRatt

Dabbler
Joined
May 17, 2022
Messages
14
@JerRatt if you're still here do you have any feedback to report on if the WD HC550 drives also suffer from the same reset alerts?

After months of dealing with this, I'm fairly sure I've narrowed it down to the Seagate drives, at least in my scenario.

While MPS issues, NCQ, HBA temperature, or the way Seagate reports SMART data (it's a bit different from most standard drives) might cause similar symptoms for all of the above, I've had zero issues whatsoever after removing Seagate fully from the setup and instead going with WD's DC HC550 drives.

It seems a lot of different issues can cause the same symptoms, so I'm betting all of us in this thread could have different causes, which is why one person may be having this issue across many different drive brands while I'm only having it on one brand. So I wouldn't discredit what others have posted as being a likely cause for them (HBA temps, NCQ, Seagate SMART reporting variance, MPS driver issues), but for me I can absolutely verify that the Seagate drives I had have a fundamental flaw or firmware issue that causes my symptoms.

I can verify that temps on the HBA under load are very low in storcli, and I have a dedicated fan with a 3D-printed shroud on the card that directs airflow over the heatsink. I'm literally on the 4th server with this issue, servers with wildly different configurations from the others, including different HBAs, cables, and backplanes (or outright skipping the backplane and using breakout cables, for process of elimination), and the issue ONLY follows the Seagate drives. So now I've switched to using 15 of the Western Digital DC HC550 drives, with a mix of the R232 and R680 firmware (my understanding is it's the same firmware, one for retail and the other for OEM drives), and I've yet to have any issue at all.

I've contacted Seagate about this issue and they fully reject even acknowledging it, so it's unlikely a firmware fix (if one is even possible) will ever come. But what's worse is that they DO seem to know about the issue: the moment I bring it up under any support account with them, they immediately go into a "provide the invoice; no, that's not an authorized reseller; and now your entire warranty is void" routine (despite the drives actually being bought from authorized resellers), and they even seem to ban my email, address, and support accounts from submitting any other warranty claims, since they know it's me if I use any of that same information on a new account. It's absolutely criminal.

They don't act this way if you submit a warranty or tech support request for different drive models using a different email, phone number, and address. That implies to me that they know about this issue, and that a massive recall and refund for these drives would be needed, which they don't want to do.

I'm literally out $10k of worthless drives for this type of need.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Where's an ambulance chaser when you need one...
 

kingtool

Dabbler
Joined
Sep 11, 2022
Messages
16
After months of dealing with this, I'm fairly sure I've narrowed it down to the Seagate drives, at least in my scenario.

While MPS issues, NCQ, HBA temperature, or the way Seagate reports SMART data (it's a bit different from most standard drives) might cause similar symptoms for all of the above, I've had zero issues whatsoever after removing Seagate fully from the setup and instead going with WD's DC HC550 drives.

It seems a lot of different issues can cause the same symptoms, so I'm betting all of us in this thread could have different causes, which is why one person may be having this issue across many different drive brands while I'm only having it on one brand. So I wouldn't discredit what others have posted as being a likely cause for them (HBA temps, NCQ, Seagate SMART reporting variance, MPS driver issues), but for me I can absolutely verify that the Seagate drives I had have a fundamental flaw or firmware issue that causes my symptoms.

I'm literally out $10k of worthless drives for this type of need.
I wouldn't be so quick to blame Seagate. In my situation I have a batch of Seagate Exos disks (16TB) that aren't showing the issue, but my HC550 18TB's are.

To be clear, my issue wasn't heat; I was just saying it's another source of resets that I've had to track down in the past. I think you and I have identical issues, or at least a ton of overlap. My cards are also cooled with blower fans I stuck on them, and they're in a cold dedicated room now.

I'm going to swap my HC550s to the front backplane on the other chassis just to see if it follows the disks. As you know it's so random it might be a few days before I see resets again, but I'll see if I can accelerate it by running another scrub.
 

kingtool

Dabbler
Joined
Sep 11, 2022
Messages
16
So now I've switched to using 15 of the Western Digital DC HC550 drives, with a mix of the R232 and R680 firmware (my understanding is it's the same firmware, one for retail and the other for OEM drives), and I've yet to have any issue at all.
On this, there is a useless but existent changelog indicating that R680 is in fact a newer firmware for retail SATA disks. See attached; the only info you get is:

Description of Change:
- Ongoing Improvements and Bug Fixes
- Detailed Change List Available Upon Request
 

Attachments

  • WD Ultrastar DC HC550 Firmware R680 PCN.pdf
    208.7 KB · Views: 299

JerRatt

Dabbler
Joined
May 17, 2022
Messages
14
I wouldn't be so quick to blame Seagate. In my situation I have a batch of Seagate Exos disks (16TB) that aren't showing the issue, but my HC550 18TB's are.

To be clear my issue wasn't heat, I was just saying it's another source for resets that I've had to track down in the past. I think you and I have identical issues, or at least a ton of overlap. My cards are also cooled with blower fans I stuck on them and they're in a cold dedicated room now.

I'm going to swap my HC550s to the front backplane on the other chassis just to see if it follows the disks. As you know it's so random it might be a few days before I see resets again, but I'll see if I can accelerate it by running another scrub.

I would, for my situation, because it certainly is Seagate causing my issue. I've been isolating it for months and months, across 4 wildly different server configurations, and it follows only the Seagate drives. The response from Seagate support was revealing as well: they seem to know about the issue but won't acknowledge it, and they break their own warranty terms the moment it comes up.

For your issue, I suspect a different cause altogether. I do know backplanes and cables can produce similar symptoms: some of the older Supermicro backplanes have a firmware version that causes this, and there is no firmware upgrade for those models. I also once made the mistake of not seeing that a special cable was needed from the HBA, listed on a really well-hidden qualification list from LSI that I can't seem to find anymore; that produced similar symptoms in a totally different setup a while back.

If you have access to some, you could get breakout cables to fully bypass your backplane, at least to test whether the issue remains on those drives. It's one step I eventually took to eliminate the possibility of backplane compatibility issues altogether.

For my issue, I could reproduce the symptoms nearly instantly just by initiating a copy of a large file to the pool of disks; if it takes a few days for yours to show up, that may be something else entirely.
 

kingtool

Dabbler
Joined
Sep 11, 2022
Messages
16
Bumping this thread with more data, and my own experience.

So first off, I had two issues. The first was apparently my cabling; by replacing all my cables, I eliminated the spurious resets on my HC550s.

Second, I can reliably and consistently reproduce a full reset of the Seagate X16-series disks almost every time an extended SMART test is run. I have smartd run them at the same time every week, and the dmesg output fully tracks with these tests hitting 90% or so, every time. If the disk resets during any activity on my zpool, especially during a scrub, it shows up as errors in ZFS that need to be cleared/fixed on the next scrub.

These are currently the only spurious resets I see anymore. I do not know if there is any relation to the issues here, but I wanted to mention it. I know that @JerRatt can reproduce with an extended SMART test, but also with badblocks/other high disk activity. I can say that I do not see these resets from the high activity induced by scrubs, only from the extended SMART tests that get to ~90% and reset the disks. It is plausible the scrub isn't beating the disks up enough and maybe I could repro using badblocks -- I'll keep this in mind.
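For what it's worth, the weekly schedule described above looks something like this in smartd.conf, and staggering the long tests per-drive at least keeps every disk from being mid-test at the same moment (a sketch using smartd's -s regex syntax; device names and hours are placeholders):

```
# /etc/smartd.conf sketch (smartd -s schedule regex: T/MM/DD/d/HH).
# Stagger long (L) self-tests across different nights so all disks
# are never mid-test simultaneously. Device names are placeholders.
/dev/sda -a -s L/../../6/03    # long test Saturdays at 03:00
/dev/sdb -a -s L/../../7/03    # long test Sundays at 03:00
```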

Also, on a *completely separate* server with all X18 disks I have the exact same issue: extended SMART tests never complete and cause bus resets, every time.

I'm giving up running extended SMART tests in an automated way on Seagate disks. Scrubs get me some/most of the way there, anyway. But it's annoying.
 