LSI 9305 errors filling log

Submarine · Aug 9, 2023

I have the following:
Dell T420
2x IOM6 DS4246 (QSFP 8436)
LSI 9201-16e (Firmware Version : 20.00.06.00) (SAS8088)
Upgraded to LSI 9305-16e (Firmware Version : 16.00.12.00) (SAS8644)
Collection of SATA SSDs from enterprise environments in one shelf, HDDs in the other
- INTEL_SSDSC2BG400G4R
- MK0400GCTZA
- MK0800GCTZB

I started experiencing problems running both disk shelves on the LSI 9201 HBA: whenever I rebooted my TrueNAS machine, the SSD pools wouldn't come online with the system, so I'd have to export and import the pool, after which things seemed to work okay. There was no problem with the pools themselves; all tests came back fine.
I upgraded to the 9305 card and have just the SSD disk shelf plugged into it. My TrueNAS server now comes online properly, with all drives and pools, without any export/import workaround. All is well!

Except now I'm getting errors in my console and my log file:
mpt3sas_cm0: log_info(0x31110e05): originator(PL), code(0x11), sub_code(0x0e05)

In my searching, I found the lsi_decode_loginfo Python script, which gave me the following results:

Code:

./lsi_decode_loginfo.py 0x31110e05
Value           31110E05h
Type:           30000000h       SAS
Origin:         01000000h       PL
Code:           00110000h       PL_LOGINFO_CODE_RESET See Sub-Codes below (PL_LOGINFO_SUB_CODE)
Sub Code:       00000E00h       PL_LOGINFO_SUB_CODE_DISCOVERY_SATA_ERR
Unparsed        00000005h
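
For anyone who wants to split these values without the script, here is a minimal Python sketch of the bit layout implied by the output above (4 bits type, 4 bits originator, 8 bits code, 16 bits sub-code). The function names and the originator table are my own illustration, not part of lsi_decode_loginfo, and the 0 = IOP / 2 = IR entries are an assumption; only 1 = PL is confirmed by the output above.

```python
import re

# Originator names: only 1 -> "PL" is confirmed by the decoder output above;
# the other two entries are an assumption.
ORIGINATORS = {0x0: "IOP", 0x1: "PL", 0x2: "IR"}

# Matches the hex value inside "log_info(0x...)" in a kernel log line.
LOGINFO_RE = re.compile(r"log_info\((0x[0-9a-fA-F]+)\)")


def decode_loginfo(value):
    """Split a 32-bit log_info value into type, originator, code and sub-code."""
    return {
        "type": (value >> 28) & 0xF,  # 3 corresponds to "SAS" in the output above
        "originator": ORIGINATORS.get((value >> 24) & 0xF, "unknown"),
        "code": (value >> 16) & 0xFF,
        "sub_code": value & 0xFFFF,  # the kernel prints all 16 bits as sub_code
    }


def decode_line(line):
    """Decode the log_info value found in a kernel log line, or return None."""
    match = LOGINFO_RE.search(line)
    return decode_loginfo(int(match.group(1), 16)) if match else None


if __name__ == "__main__":
    line = ("mpt3sas_cm0: log_info(0x31110e05): "
            "originator(PL), code(0x11), sub_code(0x0e05)")
    for name, field in decode_line(line).items():
        print(name, hex(field) if isinstance(field, int) else field)
```

The script reports the sub-code as 0E00h plus an unparsed 05h, while the kernel message shows the full 16 bits (0x0e05); the sketch follows the kernel's view.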

I've read various things on the forum about TRIM on SSDs causing issues, but TRIM is not enabled at all on these pools. I'm really not sure where else to turn or what else to try. So, here I am.

Please let me know if there's any other information I can provide to help troubleshoot / diagnose further.

dasaint · Aug 19, 2023

Hey there,

I have been getting the same thing on my system, with the same error code. The one thing you and I have in common is the shelf!

My Setup is the following

SuperMicro X12 w/AMD EPYC 16 Core CPU
2x LSI SAS 3008 9300-8i Controller (also got same errors with a 9400-16i)
2x IOM6 DS4246 (QSFP 8436)
Raid Z2 - 10 Drive 12TB SAS HGST
Raid Z2 - 10 Drive 12TB SAS HGST
Raid Z2 - 8 Drive 10TB SAS HGST
Mirror VDEV - 4x 3.2TB SAS SSDs

I recently purchased the Compellent backplane controllers (still SAS 6Gbps), which use SAS 6Gb to SAS 12Gb converters rather than the SAS 12Gb to QSFP cable. I think I might replace the IOM6s and see whether the error goes away, since the only thing we have in common is the shelf!

Submarine · Aug 21, 2023

Interesting!

I was trying to troubleshoot, so I went through all 24 drives and added interposers into the mix, at which point no drives showed up at all. So I took the 24 interposers back out, a few at a time, and after going through all 24 drives I'm no longer getting the error. I think the new card was able to communicate faster than the 3 Gbps card, and I must have had a drive that wasn't fully seated.

By going through and making sure every drive didn't complain when being installed, I cleared the issue myself.

Hopefully that helps!

HeXXy · Sep 8, 2023

I'm having this same problem. I also have a 9305 chip on a 9400-16e card. I'm getting this log message once per second.

Submarine · Sep 8, 2023

What kind of shelf are you interfacing with? Have you tried fully reseating all the drives? I went through slowly to see if there was one drive in particular throwing the errors... now I don't have any errors. :D

HeXXy · Sep 8, 2023

Submarine said:
What kind of shelf are you interfacing with? Have you tried fully reseating all the drives? I went through slowly to see if there was one drive in particular throwing the errors... now I don't have any errors. :D

This is now reliably reproducible. Every time I power cycle the DS4246 after rebooting the TrueNAS SCALE system, the error goes away. After rebooting the host, the error comes back. Given that only one of the two disk shelves is causing problems, it's likely a firmware bug between the shelf and the controller. When the driver loads, it tells the card to reset. Odds are the card is not resetting properly, or the shelf has a bug in its reset routine.

The problematic DS4246 has a different product_rev (0200) (cat sas_expander/expander-10:2/product_rev) than my other one (0191) that has always behaved nicely. I'm wondering if it's running a newer firmware version on the IOM6 that has a bug in the reset routine that affects SATA device discovery.
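
To compare revisions across all expanders at once rather than cat-ing each one, something like the following sketch could work. It assumes the expander entries live under /sys/class/sas_expander (the path above is relative, so that base directory is my assumption), and the function name is hypothetical.

```python
from pathlib import Path


def expander_revisions(base="/sys/class/sas_expander"):
    """Return {expander_name: product_rev} for every expander found under base.

    The default base path is an assumption; pass another directory if your
    system exposes the expanders elsewhere.
    """
    revisions = {}
    root = Path(base)
    if not root.is_dir():
        return revisions  # no SAS expanders visible (or a different sysfs layout)
    for entry in sorted(root.iterdir()):
        rev_file = entry / "product_rev"
        if rev_file.is_file():
            revisions[entry.name] = rev_file.read_text().strip()
    return revisions


if __name__ == "__main__":
    for name, rev in expander_revisions().items():
        print(f"{name}: product_rev {rev}")
```

On a setup like mine this should list one shelf at 0191 and the other at 0200, which makes it easy to check whether other people hitting the error are also on the newer revision.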

The only way for me to prevent this error is to boot the system, export the pool, power cycle the DS4246, and import the pool. No errors until the next host reboot. This is certainly a problem with reset sequencing.

Here's the boot-up initialization of the non-problematic DS4246

Code:

[    6.416837] mpt3sas_cm1: handle(0x1b) sas_address(0x500a09800268fbbe) port_type(0x1)
[   12.531651] scsi 10:0:3:0: Enclosure         NETAPP   DS424IOM6        0191 PQ: 0 ANSI: 5
[   12.535446] scsi 10:0:3:0: set ignore_delay_remove for handle(0x001b)
[   12.539220] scsi 10:0:3:0: SES: handle(0x001b), sas_addr(0x500a09800268fbbe), phy(36), device_name(0x0000000000000000)
[   12.546813] scsi 10:0:3:0: enclosure logical id (0x50050cc10201f4de), slot(0)
[   12.550626] scsi 10:0:3:0: enclosure level(0x0000), connector name( C0  )
[   12.554390] scsi 10:0:3:0: qdepth(64), tagged(1), scsi_level(6), cmd_que(1)
[   12.567702]  end_device-10:0:3: add: handle(0x001b), sas_addr(0x500a09800268fbbe)
[   16.503492] ses 10:0:3:0: Attached Enclosure device
[   21.755730] ses 10:0:3:0: Attached scsi generic sg7 type 13

Here's the boot-up initialization of the problematic DS4246

Code:

[    6.836886] mpt3sas_cm1: handle(0x35) sas_address(0x500a09800702b47e) port_type(0x1)
[   15.544049] scsi 10:0:28:0: Enclosure         NETAPP   DS424IOM6        0200 PQ: 0 ANSI: 5
[   15.548020] scsi 10:0:28:0: set ignore_delay_remove for handle(0x0035)
[   15.551921] scsi 10:0:28:0: SES: handle(0x0035), sas_addr(0x500a09800702b47e), phy(36), device_name(0x0000000000000000)
[   15.559676] scsi 10:0:28:0: enclosure logical id (0x500a0980073757c2), slot(0)
[   15.563621] scsi 10:0:28:0: enclosure level(0x0000), connector name( C2  )
[   15.567513] scsi 10:0:28:0: qdepth(64), tagged(1), scsi_level(6), cmd_que(1)
[   15.572684] scsi 10:0:28:0: Power-on or device reset occurred
[   15.584543]  end_device-10:1:24: add: handle(0x0035), sas_addr(0x500a09800702b47e)
[   16.505228] ses 10:0:28:0: Attached Enclosure device
[   21.936120] ses 10:0:28:0: Attached scsi generic sg32 type 13

Why is a reset occurring? No other SCSI devices in the boot-up sequence show that line. And why does the end_device location (10:1:24) differ from the address in the other kernel log lines (10:0:28:0)? On the good shelf they match (10:0:3 and 10:0:3:0).
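
For longer logs, a rough way to surface lines like that reset message is to strip the timestamps and device-specific numbers from both excerpts and list the messages that appear for only one shelf. This is a minimal Python sketch; the helper names are hypothetical and the normalization rules are just my guess at what is safe to ignore.

```python
import re

# Leading kernel timestamp such as "[   15.572684] ".
TIMESTAMP_RE = re.compile(r"^\[\s*\d+\.\d+\]\s*")
# Hex values, colon-joined addresses (10:0:28:0) and any other digit runs.
VARIABLE_RE = re.compile(r"0x[0-9a-fA-F]+|\d+(?::\d+)*")


def normalize(line):
    """Reduce a kernel log line to a template with numbers replaced by '#'."""
    line = TIMESTAMP_RE.sub("", line.strip())
    line = VARIABLE_RE.sub("#", line)
    return " ".join(line.split())  # collapse runs of whitespace


def extra_messages(reference, suspect):
    """Return lines of suspect whose template never occurs in reference."""
    known = {normalize(l) for l in reference.splitlines() if l.strip()}
    return [l.strip() for l in suspect.splitlines()
            if l.strip() and normalize(l) not in known]


if __name__ == "__main__":
    good = """
[   12.554390] scsi 10:0:3:0: qdepth(64), tagged(1), scsi_level(6), cmd_que(1)
[   12.567702]  end_device-10:0:3: add: handle(0x001b), sas_addr(0x500a09800268fbbe)
[   16.503492] ses 10:0:3:0: Attached Enclosure device
"""
    bad = """
[   15.567513] scsi 10:0:28:0: qdepth(64), tagged(1), scsi_level(6), cmd_que(1)
[   15.572684] scsi 10:0:28:0: Power-on or device reset occurred
[   15.584543]  end_device-10:1:24: add: handle(0x0035), sas_addr(0x500a09800702b47e)
[   16.505228] ses 10:0:28:0: Attached Enclosure device
"""
    for line in extra_messages(good, bad):
        print(line)
```

Because every address is replaced by a placeholder, this only catches extra message types; it deliberately hides differences such as the end_device path, which still have to be compared by eye.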

mealan · Dec 29, 2023

Were you able to figure this out? I'm not sure whether it was a reboot or an upgrade from 12.21.1 to 12.21.4.2, but I now get these every second as well. I'm also on an LSI SAS 9300-8e 12Gb/s.
Firmware 16.00.12.00

craptaincrunch · Feb 1, 2024

I have these errors also, but I only see them on boot.

Since this thread shows up when I search my hardware (DS4246) and that inscrutable kernel message, I'd like to add a bit that I ran across in another place that might be causing some unreliability:

Do you have both / all available PSUs plugged in on your DS4246? The manual indicates they're not redundant - you need 2 for disks <10k RPM, including just SSDs, and all 4 for 10k disks.
