LSI9300-16i leads to IO-faults in VM (onboard controller doesn't)

blacklight (Cadet, joined Oct 17, 2023, 1 message)
Hello TrueNAS Community,

I am currently in the middle of a server build (the software part :P ) and I have significant problems with TrueNAS (TrueNAS-12.0-U8.1) running in a VM on Unraid.
I have already seen that virtualizing TrueNAS divides opinion in the community, but since this is a private home-lab build, I decided to go with a VM to keep more options open on one machine. I am also 90% sure that my problem is not VM related (explanation follows).

Since the hypervisor (Unraid) runs alongside the TrueNAS VM, I bound all the controllers the VM depends on to VFIO and handed their IOMMU groups over to the VM. So far so good: I was able to boot TrueNAS normally (same version as on a second, bare-metal machine). After that I wanted to replicate my old NAS to the new VM with the Replication Task tool in the UI. That worked great until the LSI HBA started throwing IO_faults (0x40005854). This made the machine unresponsive, to the point where the UI froze completely. I could still log in, but nothing was reporting. The bare-metal machine kept working fine, so I doubt it is a connection issue. The only other thing to mention is that the replication task failed.
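
For anyone who wants to double-check the passthrough layout described above, here is a minimal sketch that lists IOMMU groups and the PCI devices in each. It assumes a Linux host that exposes the standard sysfs tree at /sys/kernel/iommu_groups; the function name `list_iommu_groups` is made up for illustration and is not part of any Unraid or TrueNAS tooling.

```python
from pathlib import Path


def list_iommu_groups(root="/sys/kernel/iommu_groups"):
    """Return {group_number: [pci_address, ...]} read from a sysfs-style tree."""
    groups = {}
    root_path = Path(root)
    if not root_path.is_dir():
        # IOMMU disabled, or not a Linux host: nothing to report.
        return groups
    for group_dir in root_path.iterdir():
        devices_dir = group_dir / "devices"
        if not group_dir.name.isdigit() or not devices_dir.is_dir():
            continue
        # Each entry under devices/ is named after a PCI address.
        groups[int(group_dir.name)] = sorted(p.name for p in devices_dir.iterdir())
    return dict(sorted(groups.items()))


if __name__ == "__main__":
    for group, devices in list_iommu_groups().items():
        print(f"IOMMU group {group}: {', '.join(devices)}")
```

If a device that should go to the VM shares a group with something the host still needs, the whole group normally has to be passed through together, so this is a quick way to see whether the split is as clean as intended.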

After researching for hours I decided to set up the VM from scratch, only to encounter the same problem again, even when idle, after a longer period.
So I researched for hours again and found this guide for flashing the LSI firmware onto the HBA:
https://www.truenas.com/community/resources/lsi-9300-xx-firmware-update.145/

So I flashed the HBA as explained in the post, and the process got stuck.
After waiting for 30 minutes I decided to kill the VM. After the restart I checked the status and saw that the correct firmware, 16.00.12.00, was reported. At this point it was only an assumption that the flash had worked, based on the fact that the 8 connected HDDs were displayed in TrueNAS and I could create a RAID 6 vdev without problems. I then started the replication task again, only to encounter the same problem.

After more hours of research I found this tutorial
https://www.truenas.com/community/threads/remotely-flash-lsi-9300-with-freedos-live-cd.101770/
but decided to ask the experts in this forum first, because this guide is quite a good amount of work and I doubt that it will solve the problem.

Besides the failing transfer to the HDDs, I started a second replication task to an SSD pool that is attached to the motherboard's onboard controller. The result: NO IO faults in the VNC console.

Does anybody have an idea what I could do to get the HBA to work inside the VM, or at least how to troubleshoot it correctly? The problem is driving me crazy and I actually need the storage ASAP. Please let me know if you need additional information. Should I try to flash the card from FreeDOS? I am afraid of bricking it.



A few things to know:

- I have no direct physical access to the machine, because the server is located in my home in Germany and I currently live abroad.
- I access the Unraid machine and the BIOS over a KVM, and everything else (all VMs) over VNC.
- I already found one faulty adapter I had tinkered with (an M.2 key A/E to 2x SATA adapter), which Unraid reported immediately. But nothing was reported about the LSI HBA when I bound the controller to Unraid instead of VFIO (of course this is not a validation, just a first indication). The thing is that I really love TrueNAS and want to stick with it. The second machine has been running for over a year now and I am pretty happy with it; its hardware is just very limited.
- I attached a picture of the faults in the TrueNAS Console
io_fault.png

- I want to avoid booting TrueNAS on bare metal, because the last time I did that, nearly all drives were reported as bad/parity damaged.
So I don't know whether switching OSs on the machine is easy in terms of accessing the drives ... the last time I did it for testing it was a mess, BUT it is an option ....
- Could the difference in IOMMU groups between the mobo controller and the LSI HBA really cause problems? At least it is not performance related, because the SSDs work fine.

- The specs of the machine:
  - MOBO: ASUS Pro WS W680-ACE
  - CPU: i9 13900KS
  - RAM: 128 GiB DDR5 single-bit ECC (32 GB allocated to the TrueNAS VM)
  - HBA: LSI SAS 9300-16i (in a Gen 3 PCIe x4 slot, one of the lower ones, I think the one at the bottom)
  - IOMMU of HBA: the LSI has two groups; one is bound to VFIO, the other to Unraid.
    --> HBA storage, Unraid group: 2x SATA SSDs and a SATA DOM
    --> HBA storage, VFIO group: 7x 4 TB IronWolfs and a DOM
  - MOBO storage: 6x SATA SSDs and 1x 4 TB IronWolf
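
Since the HBA shows up as two controllers with different bindings, it can help to confirm which kernel driver each one is actually attached to on the host. The following is a rough sketch under the assumption of a Linux host with the usual /sys/bus/pci/devices layout; `bound_driver` is a hypothetical helper name, and the PCI addresses in the usage part are placeholders, not the real addresses of this machine.

```python
import os


def bound_driver(pci_address, root="/sys/bus/pci/devices"):
    """Return the name of the kernel driver bound to a PCI device, or None."""
    driver_link = os.path.join(root, pci_address, "driver")
    if not os.path.islink(driver_link):
        # No "driver" symlink means no driver is currently bound.
        return None
    # The symlink points at the driver directory; its last component is the name.
    return os.path.basename(os.readlink(driver_link))


if __name__ == "__main__":
    # Placeholder addresses (assumption); replace with the ones from the host.
    for address in ("0000:03:00.0", "0000:05:00.0"):
        print(address, "->", bound_driver(address))
```

On a setup like the one described, the controller handed to the VM would be expected to report vfio-pci, while the one kept by the host would report its regular storage driver.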

Looking forward to any ideas about troubleshooting, because for now I am out of ideas on how to continue, other than switching to Unraid as the NAS.
 

Tigersharke (BOfH in User's clothing; Administrator, Moderator; joined May 18, 2016; 893 messages)
For the old thread:
"Never found a fix in TrueNAS 12, BUT upgrading it to TrueNAS 13 seems to have solved the problem. I never saw the IO fault message again after the update."
"I have severe problems with virtualization rn ..."
Regards
Blacklight

-- so in a nutshell, problems persist even though it had seemed solved, and further discussion of this issue is at the new forums:
 