SOLVED Heat-related hang under ESXi-6.7

Calochortus

Dabbler
Joined
Feb 18, 2019
Messages
10
I'm running FreeNAS in a VM (specs below) and have had an issue with ESXi hanging or otherwise becoming unresponsive. This was happening "randomly" after several days of uptime, with nothing (that I could find) in the logs after rebooting. There's no IPMI (it's a desktop board), so I couldn't scrape data that way. The only thing I could see was that the drive light seemed to be stuck on all or some of the time after a hang. I ran lots of RAM tests, etc., but the drive light made me wonder about the HBA and PCIe settings.

I've since "fixed" this by changing the ASPM BIOS setting from "Auto" to "Disabled", and I've now had months of uptime. But is this the "right" fix? Is disabling ASPM a common thing for VM passthrough? I've found basically no mention of this sort of thing anywhere, so I thought I'd ask and record it for posterity.

Hardware is a Supermicro X10SAT (BIOS 3.2) + E3-1245 running ESXi 6.7.

HBA hardware is an LSI SAS 9207-4i4e (no other PCIe cards).

ESXi sees an LSI / Symbios LSI2308_2, passed through to the FreeNAS VM.

FreeNAS sees:
pci3: <ACPI PCI bus> on pcib3
mps0: <Avago Technologies (LSI) SAS2308> port 0x4000-0x40ff mem 0xfd5f0000-0xfd5fffff,0xfd580000-0xfd5bffff irq 18 at device 0.0 on pci3
mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
mps0: IOCCapabilities: 5285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
 

Calochortus

Well, it seems like this may have been a 6.7U1 bug. I enabled SMBIOS logging and caught one crash, but after that I updated ESXi to U2 (2019-06-20 release) and things have been stable since, with ASPM back on "Auto".
 

Calochortus

Well, I posted too soon. 45 days of uptime and then it locked up. The SMBIOS log reports:

Smbios 0x0A Bus00(DevFnE0)

So a PCI error at device 0xE0? Sadly, that doesn't seem to match anything listed by ESXi or FreeBSD.
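One possible reading of that log entry: if the "DevFn" field uses the standard PCI devfn packing (device number in the upper 5 bits, function number in the lower 3), then 0xE0 is not a device number by itself. It would decode to device 0x1c, function 0 on bus 0, which is the kind of address that would show up as 0:28:0 or 00:1c.0 in a PCI listing. That the SMBIOS event log uses this packing is an assumption here, and the helper names below are made up for illustration. A minimal sketch of the arithmetic:

```python
def decode_devfn(devfn: int) -> tuple:
    """Split a packed PCI devfn byte into (device, function).

    Assumes the standard PCI layout: bits 7..3 = device, bits 2..0 = function.
    """
    if not 0 <= devfn <= 0xFF:
        raise ValueError("devfn must fit in one byte")
    return devfn >> 3, devfn & 0x07


def format_bdf(bus: int, devfn: int) -> str:
    """Render bus + packed devfn in the usual bus:device.function hex form."""
    device, function = decode_devfn(devfn)
    return f"{bus:02x}:{device:02x}.{function}"


# Values taken from the log line "Smbios 0x0A Bus00(DevFnE0)"
print(format_bdf(0x00, 0xE0))  # 00:1c.0
```

If that reading is right, the thing to look for in the ESXi or FreeBSD device list is whatever sits at bus 0, device 28 (0x1c), function 0, and which card hangs off it, rather than a device literally numbered 0xE0.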
 

Calochortus

Well, I *think* I figured out this problem. It is probably the LSI 9207 HBA overheating and throwing an error, either directly (reported through the BIOS) or on the link to/from the Supermicro CSE-M35TQB. I've had a 92mm fan blowing on the HBA for a few weeks with no additional errors, and have now tried screwing a 20mm fan directly to the heatsink.

Fingers crossed.
 