Hi, I'm having a really frustrating issue where iSCSI connections to my ESXi hosts are being dropped and the hosts are locking up.
NAS Hardware
2x TrueNAS CORE (12.0-U8.1) systems.
System 1 (QNAP TVS-671A):
AMD Ryzen Embedded V1500B
32GB RAM
Storage for iSCSI:
2x Samsung PM983 960GB Enterprise M.2 PCIe NVMe SSD (presented as 2x block devices, not mirroring)
Dual 10 Gbps NIC (1 interface for home/backup, 1 interface for iSCSI only). 9k MTU is set.
System 2 (custom built)
AMD Ryzen 5 5600G
32GB RAM
Storage for iSCSI:
2x Sabrent 1TB NVMe (these are consumer class, presented as 2x block devices, not mirroring)
Dual 10 Gbps NIC (1 interface for home/backup, 1 interface for iSCSI only). 9k MTU is set.
System 1 is the primary system, and I storage vMotion VMs from it to the second system for maintenance etc. System 1 also runs Plex and regular home SMB storage in a separate pool which ESXi does not touch. I also have a replication task that replicates the ZFS data from system 1 to system 2.
Network
UniFi US-16-XG
Jumbo frames enabled.
ESXi hosts
Intel(R) Xeon(R) W-1290 CPU
128GB RAM
Dual 10 Gbps NIC (1 for VM/network traffic, 1 for iSCSI only)
Configured with a Distributed Switch, 9k MTU is set
VMK for iSCSI set with 9k MTU.
I did some reading today and saw a few posts mentioning potential issues with jumbo frames, so I spent a bit of time dropping the VMK interfaces and the TrueNAS interfaces to an MTU of 1500. I still get the same issue (as well as quite a performance drop).
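For checking whether a given MTU really holds end to end, the usual approach is a ping with fragmentation disallowed and the largest payload that still fits in one frame. A minimal sketch of that arithmetic, assuming plain IPv4 with no IP options (the function name is only illustrative):

```python
# Header sizes for IPv4 without options and for an ICMP echo request.
IPV4_HEADER = 20
ICMP_HEADER = 8


def max_ping_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


for mtu in (1500, 9000):
    print(f"MTU {mtu}: largest unfragmented ping payload = {max_ping_payload(mtu)} bytes")
```

On the ESXi side that payload size would go into something like `vmkping -d -s 8972 -I vmk1 <TrueNAS iSCSI IP>`, where `vmk1` is only a guess at the iSCSI VMkernel interface name. If that ping fails while a smaller size works, some hop on the path is not passing 9k frames.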
When I do a heavy data move between the two TrueNAS systems (i.e. a 50+ GB storage vMotion), one or both of the ESXi hosts can lock up. Networking drops, the console doesn't respond and the VMs are restarted by vSphere HA on the other host.
In vmkernel.log on the ESXi hosts I see (this may be unrelated and a bit of a red herring):
Code:
2022-04-18T16:50:15.559Z cpu5:2097655)ScsiDeviceIO: 4161: Cmd(0x45b8c1c15fc8) 0x83, CmdSN 0x2b64b from world 2120192 to dev "naa.6589cfc0000006e2cb0aae8b5f1d86f0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xa 0x8 0x4
2022-04-18T16:50:15.559Z cpu5:2097655)ScsiDeviceIO: 4161: Cmd(0x45b8c1d3acc8) 0x83, CmdSN 0x2b64c from world 2120192 to dev "naa.6589cfc0000006e2cb0aae8b5f1d86f0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xa 0x8 0x4
2022-04-18T16:50:15.559Z cpu5:2097655)ScsiDeviceIO: 4161: Cmd(0x45b8c1d69ac8) 0x83, CmdSN 0x2b64d from world 2120192 to dev "naa.6589cfc0000006e2cb0aae8b5f1d86f0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xa 0x8 0x4
2022-04-18T16:50:15.559Z cpu5:2097655)NMP: nmp_ThrottleLogForDevice:3798: last error status from device naa.6589cfc0000006e2cb0aae8b5f1d86f0 repeated 1280 times
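For anyone who wants to pick those entries apart, here is a minimal sketch that extracts the opcode, device status and sense triplet from a `ScsiDeviceIO` line. The lookup tables cover only the values seen above, and the names in them are my reading of the standard SCSI code tables, so treat them as an assumption to verify rather than a definitive decode. The helper names are illustrative.

```python
import re

# Lookup tables holding only the values that appear in the log above.
# The names are an assumed reading of the SCSI code tables; verify before relying on them.
OPCODES = {0x83: "EXTENDED COPY"}
DEVICE_STATUS = {0x2: "CHECK CONDITION"}
SENSE_KEYS = {0xA: "COPY ABORTED"}
ASC_ASCQ = {(0x08, 0x04): "UNREACHABLE COPY TARGET"}

# Matches the "Cmd(...) <opcode>, ... to dev "<naa>" failed H: D: P: Valid sense data: k asc ascq" shape.
LINE_RE = re.compile(
    r'Cmd\(0x[0-9a-f]+\) (?P<op>0x[0-9a-f]+), .*? to dev "(?P<dev>[^"]+)" failed '
    r'H:(?P<h>0x[0-9a-f]+) D:(?P<d>0x[0-9a-f]+) P:(?P<p>0x[0-9a-f]+) '
    r'Valid sense data: (?P<key>0x[0-9a-f]+) (?P<asc>0x[0-9a-f]+) (?P<ascq>0x[0-9a-f]+)'
)


def decode(line):
    """Return a dict describing one ScsiDeviceIO failure line, or None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    op, d, key, asc, ascq = (int(m[g], 16) for g in ("op", "d", "key", "asc", "ascq"))
    return {
        "device": m["dev"],
        "opcode": OPCODES.get(op, hex(op)),
        "device_status": DEVICE_STATUS.get(d, hex(d)),
        "sense_key": SENSE_KEYS.get(key, hex(key)),
        "asc_ascq": ASC_ASCQ.get((asc, ascq), f"{asc:#04x}/{ascq:#04x}"),
    }


sample = (
    '2022-04-18T16:50:15.559Z cpu5:2097655)ScsiDeviceIO: 4161: Cmd(0x45b8c1c15fc8) 0x83, '
    'CmdSN 0x2b64b from world 2120192 to dev "naa.6589cfc0000006e2cb0aae8b5f1d86f0" '
    'failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xa 0x8 0x4'
)
print(decode(sample))
```

If that reading is right, these entries would be the host attempting an offloaded copy to a target the array cannot reach, which would fit the suspicion that they are a red herring. That is an inference from the codes, not something confirmed here.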
In /var/log/messages on TrueNAS I see:
Code:
Apr 18 19:33:55 lando WARNING: 172.16.5.11 (iqn.1998-01.com.vmware:3440-1.chris.local:7130290:64): no ping reply (NOP-Out) after 5 seconds; dropping connection
The IP is that of one of the ESXi hosts.
I haven't started packet sniffing yet and I haven't set any advanced parameters. I'll look at that tomorrow (it's late in the UK now). In the meantime, any pointers would be more than welcome.