iSCSI random freeze with ESX

atakacs

Explorer
Joined
Apr 23, 2012
Messages
92
Hello

I have two SuperMicro AS-1114S-WN10RT servers, both fully patched to the latest BIOS.

One runs ESX 6.7U3 (downgraded from the infamous 7…), the other runs TrueNAS (latest build) on “bare metal”. 256 GB RAM, 10x 2 TB NVMe SSDs in one pool. TrueNAS is connected via a 10 Gb direct link and presents iSCSI storage to ESX.

Setup was pretty straightforward and things are mostly working OK, except that we are seeing these kinds of errors at what appear to be random intervals (a few times per day):

Screenshot 2021-12-10 at 07.20.37.png


Obviously this translates into a “freeze” of one or more VMs. Thankfully there has been no data loss or corruption so far, but this is clearly not acceptable.

The pings shown are from NAS to ESX and from ESX to NAS. The link seems solid and stable.

I’ve gone through the forum and found a few similar issues but no clear resolution except “update your drivers” (which I have…). One suggestion was to disable speed negotiation on the NIC, which I have done (fixed at 10 Gb).

Any suggestions on how to diagnose or solve this would be most welcome. And if anyone has a similar setup, I would definitely be interested to hear from you!
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Yeah, iSCSI likes to time out and reconnect after 5 seconds, so hardware issues are always problematic. ESXi handles a low rate of these sorts of disconnects without catastrophic consequences in most cases, but you still want to get to the bottom of it.

This has nothing to do with speed negotiation of the ethernets. 10GBASE-T is not particularly reliable in my opinion though.

This is the LSI HBA throwing a timeout because it didn't complete an operation with a drive in the expected timeframe.

I'm not quite clear how what appears to be an old LSI 6Gbps SAS HBA (MPT driver) ended up in an Aplus 1U server. Are you driving an external shelf of disks or something like that? Please take a few moments to provide a much more detailed summary of your setup...

For MPT, please make sure that you are using firmware 20.00.07.00 on the HBA. You are responsible for making certain that the running firmware matches what the driver expects. The driver and firmware work together in concert in order to make HBA-attached drives work, and if you do not have 20.00.07.00 on the HBA, then that by itself could be the problem.

If this is happening with a specific drive, check cabling and/or replace the drive.

There's meaning that can be decoded from the hex data in the error, but I don't even like to dig into that for money, much less for free on a forum, and it often doesn't reveal anything that is directly relevant anyways.
 

atakacs

Many thanks for your input.

jgreco said: "This is the LSI HBA throwing a timeout because it didn't complete an operation with a drive in the expected timeframe."

OK - understood.

jgreco said: "I'm not quite clear how what appears to be an old LSI 6Gbps SAS HBA (MPT driver) ended up in an Aplus 1U server."

That's a very good point and indeed surprising - let me investigate.

jgreco said: "For MPT, please make sure that you are using firmware 20.00.07.00 on the HBA. You are responsible for making certain that the running firmware matches what the driver expects. The driver and firmware work together in concert in order to make HBA-attached drives work, and if you do not have 20.00.07.00 on the HBA, then that by itself could be the problem."

Sorry, I'm not sure I follow you here - what exactly am I supposed to check?
 

jgreco

When you are booting up, the card's BIOS will report the firmware version during POST. Some cards do not have their BIOS ROM flashed; I think that's a bad idea, because the BIOS ROM lets you safely check on versions and connectivity outside of the UNIX environment. But opinions vary.

When the OS boots, it will also report it in the file /var/run/dmesg.boot, such as

mps0: <Avago Technologies (LSI) SAS2008> port 0x4000-0x40ff mem 0xfd3f0000-0xfd3fffff,0xfd380000-0xfd3bffff irq 18 at device 0.0 on pci3
mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
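If you would rather not eyeball the boot log, the check can be scripted. The following is only a rough Python sketch, assuming the version line always has the "mpsN: Firmware: X, Driver: Y" shape shown above; the function names are made up for illustration, and the expected version defaults to the 20.00.07.00 mentioned earlier in this thread.

```python
import re

# Matches the boot line shown above, e.g.
# "mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd"
# Assumption: other driver versions print the line in the same shape.
VERSION_LINE = re.compile(
    r"^(mps\d+): Firmware: ([\d.]+), Driver: (\S+)", re.MULTILINE
)


def find_hba_versions(dmesg_text):
    """Return (device, firmware, driver) tuples found in dmesg text."""
    return VERSION_LINE.findall(dmesg_text)


def firmware_mismatches(dmesg_text, expected="20.00.07.00"):
    """Return (device, firmware) pairs whose firmware is not `expected`."""
    return [
        (dev, fw)
        for dev, fw, _driver in find_hba_versions(dmesg_text)
        if fw != expected
    ]
```

To use it, read /var/run/dmesg.boot into a string and pass it to firmware_mismatches(); an empty list means every mps controller found reports the expected firmware.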

You can also check at the FreeNAS CLI.

# mpsutil show adapter
mps0 Adapter:
Board Name: SAS9211-8i
Board Assembly:
Chip Name: LSISAS2008
Chip Revision: ALL
BIOS Revision: 7.31.00.00
Firmware Revision: 20.00.07.00
Integrated RAID: no

PhyNum  CtlrHandle  DevHandle  Disabled  Speed  Min  Max  Device
0       0003        000b       N         6.0    1.5  6.0  SAS Initiator
1       0006        000e       N         6.0    1.5  6.0  SAS Initiator
2       0008        0010       N         6.0    1.5  6.0  SAS Initiator
3       0002        000a       N         6.0    1.5  6.0  SAS Initiator
4       0005        000d       N         6.0    1.5  6.0  SAS Initiator
5       0007        000f       N         6.0    1.5  6.0  SAS Initiator
6       0001        0009       N         6.0    1.5  6.0  SAS Initiator
7       0004        000c       N         6.0    1.5  6.0  SAS Initiator
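The phy table is also worth a glance when chasing cabling problems. As a rough sketch only, the Python below parses output in the layout shown above and lists phys whose negotiated Speed is below their Max. The function names are hypothetical, and treating "Speed below Max" as a hint of a marginal link is my assumption, not something mpsutil reports. Rows with no attached device (fewer columns) are simply skipped.

```python
def parse_phys(output):
    """Parse the PhyNum table of `mpsutil show adapter` output into dicts.

    Assumes whitespace-separated columns in the order shown above:
    PhyNum CtlrHandle DevHandle Disabled Speed Min Max Device
    """
    phys = []
    in_table = False
    for line in output.splitlines():
        fields = line.split()
        if fields[:1] == ["PhyNum"]:
            in_table = True  # header row found; data rows follow
            continue
        # Only fully populated rows are parsed; empty phys are ignored.
        if in_table and len(fields) >= 8 and fields[0].isdigit():
            phys.append({
                "phy": int(fields[0]),
                "speed": float(fields[4]),
                "max": float(fields[6]),
                "device": " ".join(fields[7:]),
            })
    return phys


def degraded_phys(output):
    """Return phy numbers whose negotiated speed is below the maximum."""
    return [p["phy"] for p in parse_phys(output) if p["speed"] < p["max"]]
```

Feeding it the output above would return an empty list, since every phy shows 6.0 for both Speed and Max.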
 