ESXi iSCSI - no ping reply (NOP-Out) after 5 seconds; dropping connection

ChrisD.

Dabbler
Joined
Apr 18, 2022
Messages
26
Hi, I'm having a really frustrating issue where iSCSI connections to my ESXi hosts are being dropped and the hosts are locking up.

NAS Hardware
2x TrueNAS core (12.0-U8.1) systems.

System 1 (QNAP TVS-671A):
AMD Ryzen Embedded V1500B
32GB RAM
Storage for iSCSI:
2x Samsung PM983 960GB Enterprise M.2 PCIe NVMe SSD (presented as 2x block devices, not mirroring)
Dual 10 Gbps NIC (1 interface for home/backup, 1 interface for iSCSI only) 9k MTU is set.

System 2 (custom built)
AMD Ryzen 5 5600G
32GB RAM
Storage for iSCSI:
2x Sabrent 1TB NVMe (these are consumer class, presented as 2x block devices, not mirroring)
Dual 10 Gbps NIC (1 interface for home/backup, 1 interface for iSCSI only). 9k MTU is set.

System 1 is the primary system and I storage vMotion VMs from it to the second system for maintenance etc. System 1 also does Plex and regular home-based SMB storage in a pool which ESXi does not touch. I also have a replication task to replicate the ZFS data from system 1 to system 2.

Network
UniFi US-16-XG
Jumbo frames enabled.

ESXi hosts
Intel(R) Xeon(R) W-1290 CPU
128GB RAM
Dual 10 Gbps NIC (1 for VM/network traffic, 1 for iSCSI only)
Configured with a Distributed Switch, 9k MTU is set
VMK for iSCSI set with 9k MTU.

I did some reading today and saw a few posts mentioning potential issues with jumbo frames, so I spent a bit of time dropping the VMK interfaces and the interfaces in TrueNAS to 1500 MTU. I still get the same issue (as well as quite a performance drop).

When I do a heavy data move between the two TrueNAS systems (i.e., a 50+ GB storage vMotion), one or both of the hosts can lock up. Networking drops, the console doesn't respond, and the VMs are rebooted by vSphere HA onto the other host.

In vmkernel.log on the ESXi hosts I see (this may be unrelated and a bit of a red herring):

Code:
2022-04-18T16:50:15.559Z cpu5:2097655)ScsiDeviceIO: 4161: Cmd(0x45b8c1c15fc8) 0x83, CmdSN 0x2b64b from world 2120192 to dev "naa.6589cfc0000006e2cb0aae8b5f1d86f0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xa 0x8 0x4         
2022-04-18T16:50:15.559Z cpu5:2097655)ScsiDeviceIO: 4161: Cmd(0x45b8c1d3acc8) 0x83, CmdSN 0x2b64c from world 2120192 to dev "naa.6589cfc0000006e2cb0aae8b5f1d86f0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xa 0x8 0x4         
2022-04-18T16:50:15.559Z cpu5:2097655)ScsiDeviceIO: 4161: Cmd(0x45b8c1d69ac8) 0x83, CmdSN 0x2b64d from world 2120192 to dev "naa.6589cfc0000006e2cb0aae8b5f1d86f0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xa 0x8 0x4         
2022-04-18T16:50:15.559Z cpu5:2097655)NMP: nmp_ThrottleLogForDevice:3798: last error status from device naa.6589cfc0000006e2cb0aae8b5f1d86f0 repeated 1280 times 


In /var/log/messages on TrueNAS I see:

Code:
Apr 18 19:33:55 lando WARNING: 172.16.5.11 (iqn.1998-01.com.vmware:3440-1.chris.local:7130290:64): no ping reply (NOP-Out) after 5 seconds; dropping connection


The IP address belongs to one of the ESXi hosts.

I've yet to start packet sniffing and I haven't set any advanced parameters; I'll look at that tomorrow (it's late UK time now). In the meantime, any pointers are more than welcome.
 

ChrisD.

Dabbler
Joined
Apr 18, 2022
Messages
26
I added
Code:
kern.cam.ctl.iscsi.ping_timeout = 0
under Tunables. While this has solved the "no ping reply (NOP-Out) after 5 seconds; dropping connection" issue, I'm now seeing:

Apr 18 23:41:05 lando WARNING: 172.16.5.11 (iqn.1998-01.com.vmware:3440-1.chris.local:7130290:64): connection error; dropping connection
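
For reference, the equivalent from the TrueNAS CORE shell (just to confirm the tunable actually took effect) would be something like:

Code:
# Check the live value of the CTL iSCSI ping timeout
sysctl kern.cam.ctl.iscsi.ping_timeout
# Or set it at runtime, equivalent to the UI tunable
sysctl kern.cam.ctl.iscsi.ping_timeout=0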
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Yeah, disabling ping timeout isn't a fix for most iSCSI issues.

Neither of your systems is ideal for iSCSI, which I don't typically recommend with less than 64GB of RAM, but with SSD perhaps it is possible for sheer speed to cover up for that "sin".

Did you do burn-in testing on the SSD pool to make sure that it is communicating fine with the TrueNAS host? While unlikely, it's not unheard-of for there to be trouble, especially since you're using a Ryzen setup. Ryzen also requires some other BIOS tweaks in some cases to prevent lockups.

Perhaps @mav@ will come around and offer some reading as to the sense codes. My guess is that there's a good chance that the Sabrents aren't keeping up, fill up their cache, and then start responding slowly. This is much more likely to be the case if you've attempted to use every byte of space on them, rather than leaving a large (25-50%) amount of free space. SSD write speeds can be like faceplanting into a brick wall if you don't have lots of pages in the free page pool.

Please see

 

ChrisD.

Dabbler
Joined
Apr 18, 2022
Messages
26
Thanks.

Which BIOS settings are required for Ryzen?

I don't have the NVMe drives in any kind of ZFS pool; they are presented individually as block devices. My understanding is that the high RAM requirement with iSCSI applies when using ZFS pools, not when using the drives as pure block devices. I know this is terrible as there is zero redundancy, but I understand the risk and accept it. I have daily backups of the VMs to a ZFS pool, which are replicated to the second TrueNAS and subsequently into B2. Out of the 4 drives, only 2 are used at any given time and they have between 20% and 50% free space.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I don't know what BIOS settings are required for Ryzen. It's an exercise in frustration for those who choose that path.

As far as iSCSI goes, correct: the ZFS memory requirements for iSCSI do not apply if you are using devices directly; my bad for assuming there. This is highly suggestive of the NVMe devices throttling and stalling I/O, possibly due to overheating. Most consumer-grade NVMe isn't that great at sustained workloads. That's too bad, because ZFS-side issues might be worked out, but it's harder to fix inherently broken devices, if that turns out to be the problem here.
 

ChrisD.

Dabbler
Joined
Apr 18, 2022
Messages
26
I'm not a storage expert, but an overheating NVMe tends to slow down/throttle from my experience. What's happening here is that the iSCSI connection is dropping; if it were a hardware issue, wouldn't that show in /var/log/messages or somewhere in the TrueNAS UI?

Prior to giving QNAP the boot, I had the NVMes presented from QuTS (albeit as part of a RAID 1 mirror) and I didn't have such issues.

I'll monitor the temperature of the drives during a large transfer and see if that could be the issue.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Which BIOS settings are required for Ryzen?
Definitely disable C6 sleep state and Cool-N-Quiet.

I'm suspecting it's the drives seizing up as @jgreco postulates. What's the exact model number of the Sabrent units, and can you pull SMART data from them to check temperature/peak temperature?

Using them as direct block devices does eliminate the write amplification of ZFS, but you also lose the compression/ARC benefits, as well as ZFS's more gradual write-throttle curve, which could potentially smooth this out versus the binary "running|stopped" behaviour of the hardware.

Valid sense data: 0xa 0x8 0x4
"COPY ABORTED - UNREACHABLE COPY TARGET"

Check your VAAI stats on a host via esxtop (press u, then f, then o inside) to see if it's trying to use XCOPY to clone between the two units by any chance.
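
If esxtop is a pain to read, something like this from the host's shell should also show whether the device reports the VAAI primitives as supported (the naa ID below is the one from your vmkernel.log; output format varies by ESXi build):

Code:
# Show VAAI primitive status (clone/XCOPY, ATS, zero, delete) for the device
esxcli storage core device vaai status get -d naa.6589cfc0000006e2cb0aae8b5f1d86f0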
 

ChrisD.

Dabbler
Joined
Apr 18, 2022
Messages
26
Thanks, I'll check out those BIOS settings. I don't have a monitor connected to either unit (or a GPU in one!) so it may take me a short while.

The exact model is SB-ROCKET-1TB. However VM data is stored on the Samsung PM983 enterprise drives 98% of the time. And I see the disconnects at seemingly random times when I am not doing a storage vMotion to the Sabrents.

Disk temperature over the previous 24h (where I have had issues) is attached. The more erratic ones are the Samsungs, but a high of 55°C (to me at least) is not concerning.

[attached: disk temperature graphs for both NAS units]


esxtop output during a ~50 GB svMotion:
[attached: esxtop screenshot showing VAAI stats]


Thanks for all the help so far and I am open to suggestions. I don't mind as an example putting the NVMe drives into a mirror but then I'm aware I probably don't have enough RAM for that.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I'm not a storage expert, but an overheating NVMe tends to slow down/throttle from my experience.

So this gets back to SSD fundamentals, which involve uncomfortable truths of various varieties.

If it recovers, it is probably some combination of thermal and/or free page starvation. VM activity tends to be very stressy on SSD's due to the random nature of the workload.

If it is happening under stressy activities such as vmotion but works fine normally, that also points in that direction.

But the other thing to remember is that gear such as Sabrent's, which, hey, yes, they make some nice/handy/useful stuff, is nevertheless aimed at PC's and gamers, which means that firmware bugs which may get beaten out in datacenter-grade Intel and Kioxia SSD's aren't even noticed in the generic consumer-grade controllers, and never fixed.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Thanks, I'll check out those BIOS settings. I don't have a monitor connected to either unit (or a GPU in one!) so it may take me a short while.

The exact model is SB-ROCKET-1TB. However VM data is stored on the Samsung PM983 enterprise drives 98% of the time. And I see the disconnects at seemingly random times when I am not doing a storage vMotion to the Sabrents.

Disk temperature over the previous 24h (where I have had issues) is attached. The more erratic ones are the Samsungs, but a high of 55°C (to me at least) is not concerning.

Do the disconnects when not svMotioning also result in freezes? Any significant amounts of packet drop/discard being detected at the switch? Blowing out the buffers in the switch or getting weird packet queueing maybe? I'm not really familiar with the Unifi unit you've got there.

The writes will most likely be async if you're just presenting the block devices, so it could quite easily be flooding the device queue, or sending writes endlessly and never getting a backoff.
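
If you want to see whether that's happening, watching the TrueNAS side while a transfer runs should show it; something like gstat will give per-device queue depth and busy percentage (just a rough way to eyeball it):

Code:
# Physical providers only; watch the L(q) (queue length) and %busy columns
gstat -p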

esxtop output during a ~50 GB svMotion:
[attached: esxtop screenshot showing VAAI stats]

Thanks for all the help so far and I am open to suggestions. I don't mind as an example putting the NVMe drives into a mirror but then I'm aware I probably don't have enough RAM for that.
You've got nonzero values in CLONE_F so that indicates it's trying (and failing) to use the XCOPY primitive somewhere.
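
As a test only, you could also temporarily turn XCOPY off on one host so the data mover falls back to a software copy, and see whether the drops stop; something along these lines (remember to turn it back on afterwards):

Code:
# Check the current value (1 = hardware-accelerated move / XCOPY enabled)
esxcli system settings advanced list -o /DataMover/HardwareAcceleratedMove
# Temporarily disable XCOPY on this host for testing
esxcli system settings advanced set -o /DataMover/HardwareAcceleratedMove -i 0
# Re-enable once done
esxcli system settings advanced set -o /DataMover/HardwareAcceleratedMove -i 1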
 

ChrisD.

Dabbler
Joined
Apr 18, 2022
Messages
26
Do the disconnects when not svMotioning also result in freezes? Any significant amounts of packet drop/discard being detected at the switch? Blowing out the buffers in the switch or getting weird packet queueing maybe? I'm not really familiar with the Unifi unit you've got there.
Yes, I get issues when not doing a storage vMotion. There's no particular pattern to it, but now I think about it, some have been overnight when Veeam runs (though nowhere near as often as during a full TrueNAS-to-TrueNAS migration). I can't see anything untoward on the switch; I'll keep looking.

You've got nonzero values in CLONE_F so that indicates it's trying (and failing) to use the XCOPY primitive somewhere.
Good spot, it's been years since I've delved into the depths of esxtop. Are there any logs on TrueNAS which I could correlate when doing the storage vMotion?
 

ChrisD.

Dabbler
Joined
Apr 18, 2022
Messages
26
But the other thing to remember is that gear such as Sabrent's, which, hey, yes, they make some nice/handy/useful stuff, is nevertheless aimed at PC's and gamers, which means that firmware bugs which may get beaten out in datacenter-grade Intel and Kioxia SSD's aren't even noticed in the generic consumer-grade controllers, and never fixed.
I agree, but the issue happens on VMware HCL verified enterprise Samsung drives as well, when the Sabrents are sitting idle.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The VMware HCL for SSD's only applies to gear directly attached to your hypervisor. It is unable to guarantee that there aren't incompatibilities between the enterprise Samsung drives and your TrueNAS host. I mean, I'd think it less likely, but you're seeing *some* sort of issue, so it isn't out of the question.
 

ChrisD.

Dabbler
Joined
Apr 18, 2022
Messages
26
The VMware HCL for SSD's only applies to gear directly attached to your hypervisor. It is unable to guarantee that there aren't incompatibilities between the enterprise Samsung drives and your TrueNAS host. I mean, I'd think it less likely, but you're seeing *some* sort of issue, so it isn't out of the question.
I know that; what I'm saying is that it's an enterprise/datacenter-level drive, and it should not be suffering issues serving iSCSI at 10 Gbps. Whilst I do have 'consumer' grade drives in the backup NAS, I see issues when VMs are residing entirely on the Samsungs.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Yes, I get issues when not doing a storage vMotion. There's no particular pattern to it, but now I think about it, some have been overnight when Veeam runs (though nowhere near as often as during a full TrueNAS-to-TrueNAS migration). I can't see anything untoward on the switch; I'll keep looking.

An overnight Veeam run would also put a lot of traffic on the switches and the storage.

Mapping the block device directly as the extent might be doing this, as I mentioned. TrueNAS is on the VMware HCL, but only in the certified configuration of iSCSI through zvols and NFS through datasets. This reminds me of behavior I saw with the old SCSI target module and the legacy write throttle, which was much more binary.

If this only happens during a period of heavy I/O, and you can reproduce it this way, perhaps try disabling delayed ACK on a single LUN (or a single host) and see if this resolves it.
https://kb.vmware.com/s/article/1002598
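
You can check what the host currently has for delayed ACK from the shell before changing anything (the vmhba name below is a placeholder; the disable itself is done per the KB via the client UI):

Code:
# Find the software iSCSI adapter name
esxcli iscsi adapter list
# Dump the adapter's parameters and look for the DelayedAck entry
esxcli iscsi adapter param get -A vmhba64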


Good spot, it's been years since I've delved into the depths of esxtop. Are there any logs on TrueNAS which I could correlate when doing the storage vMotion?
There's a sysctl tunable kern.cam.ctl.debug which you could set to 1 (although this might need to be done at boot-time, which would make it a loader.conf tunable instead of a sysctl) that should cause it to log commands with errors. See if it's actually receiving SCSI commands it can't support/handle.
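
On CORE that would be something along these lines from the shell; it's chatty, so set it back to 0 once you've captured what you need:

Code:
# Enable CTL error logging at runtime
sysctl kern.cam.ctl.debug=1
# Watch for logged SCSI errors while reproducing the problem
tail -f /var/log/messages
# Turn it back off afterwards
sysctl kern.cam.ctl.debug=0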
 

ChrisD.

Dabbler
Joined
Apr 18, 2022
Messages
26
If this only happens during a period of heavy I/O, and you can reproduce it this way, perhaps try disabling delayed ACK on a single LUN (or a single host) and see if this resolves it.

https://kb.vmware.com/s/article/1002598
There's a sysctl tunable kern.cam.ctl.debug which you could set to 1 (although this might need to be done at boot-time, which would make it a loader.conf tunable instead of a sysctl) that should cause it to log commands with errors. See if it's actually receiving SCSI commands it can't support/handle.
Thanks for the suggestions, I will do some further testing and report back.
 

ChrisD.

Dabbler
Joined
Apr 18, 2022
Messages
26
@HoneyBadger I finally had time to do some testing. Disabling delayed ACK stopped the iSCSI sense messages in vmkernel.log; however, I still had nonzero values in CLONE_F. Eventually the iSCSI connection dropped to both hosts, even when migrating a single 20 GB VM.

I then set kern.cam.ctl.debug to 1 (it's a sysctl tunable, by the way) and I get pages and pages of:

Code:
Apr 20 16:55:29 bertie (1:3:0/0): EXTENDED COPY. CDB: 83 00 00 00 00 00 00 00 00 00 00 00 00 6c 00 00  Tag: 0x28a98d/1
Apr 20 16:55:29 bertie (1:3:0/0): CTL Status: SCSI Error
Apr 20 16:55:29 bertie (1:3:0/0): SCSI Status: Check Condition
Apr 20 16:55:29 bertie (1:3:0/0): SCSI sense: COPY ABORTED asc:8,4 (Unreachable copy target)
Apr 20 16:55:29 bertie (1:3:0/0): EXTENDED COPY. CDB: 83 00 00 00 00 00 00 00 00 00 00 00 00 6c 00 00  Tag: 0x28a98e/1
Apr 20 16:55:29 bertie (1:3:0/0): CTL Status: SCSI Error
Apr 20 16:55:29 bertie (1:3:0/0): SCSI Status: Check Condition
Apr 20 16:55:29 bertie (1:3:0/0): SCSI sense: COPY ABORTED asc:8,4 (Unreachable copy target)
Apr 20 16:55:29 bertie (1:3:0/0): EXTENDED COPY. CDB: 83 00 00 00 00 00 00 00 00 00 00 00 00 6c 00 00  Tag: 0x28a98f/1
Apr 20 16:55:29 bertie (1:3:0/0): CTL Status: SCSI Error
Apr 20 16:55:29 bertie (1:3:0/0): SCSI Status: Check Condition
Apr 20 16:55:29 bertie (1:3:0/0): SCSI sense: COPY ABORTED asc:8,4 (Unreachable copy target)
Apr 20 16:55:29 bertie (1:3:0/0): EXTENDED COPY. CDB: 83 00 00 00 00 00 00 00 00 00 00 00 00 6c 00 00  Tag: 0x28a990/1
Apr 20 16:55:29 bertie (1:3:0/0): CTL Status: SCSI Error
Apr 20 16:55:29 bertie (1:3:0/0): SCSI Status: Check Condition
Apr 20 16:55:29 bertie (1:3:0/0): SCSI sense: COPY ABORTED asc:8,4 (Unreachable copy target)
Apr 20 16:55:29 bertie (1:3:0/0): EXTENDED COPY. CDB: 83 00 00 00 00 00 00 00 00 00 00 00 00 6c 00 00  Tag: 0x28a991/1
Apr 20 16:55:29 bertie (1:3:0/0): CTL Status: SCSI Error
Apr 20 16:55:29 bertie (1:3:0/0): SCSI Status: Check Condition
Apr 20 16:55:29 bertie (1:3:0/0): SCSI sense: COPY ABORTED asc:8,4 (Unreachable copy target)
Apr 20 16:55:29 bertie (1:3:0/0): EXTENDED COPY. CDB: 83 00 00 00 00 00 00 00 00 00 00 00 00 6c 00 00  Tag: 0x28a992/1
Apr 20 16:55:29 bertie (1:3:0/0): CTL Status: SCSI Error
Apr 20 16:55:29 bertie (1:3:0/0): SCSI Status: Check Condition
Apr 20 16:55:29 bertie (1:3:0/0): SCSI sense: COPY ABORTED asc:8,4 (Unreachable copy target)
Apr 20 16:55:29 bertie (1:3:0/0): EXTENDED COPY. CDB: 83 00 00 00 00 00 00 00 00 00 00 00 00 6c 00 00  Tag: 0x28a993/1
Apr 20 16:55:29 bertie (1:3:0/0): CTL Status: SCSI Error
Apr 20 16:55:29 bertie (1:3:0/0): SCSI Status: Check Condition
Apr 20 16:55:29 bertie (1:3:0/0): SCSI sense: COPY ABORTED asc:8,4 (Unreachable copy target)


This is in /var/log/messages on the receiving TrueNAS server. Something's definitely not right.

'Bertie' is the backup NAS, and I've already changed the LUN on it to be a mirrored pair rather than individual block devices, to no avail.

I'm fairly certain that I had no such issues using QNAP, but then again I didn't have two NAS's to test with.

I'm at the stage of wanting to try out OMV or similar just to rule something else out. And although I'm fairly confident my switch is fine (I've checked the stats) I do have another switch I could try out.

Bottom line: I just want it fixed, and I'd much prefer to stick with TrueNAS.

I'm going to create a LUN on each NAS which is in the main data pool and test again. I'm not entirely sure what this will achieve but I'm keen to see if it will have the same behaviour.

I'm also beginning to wonder if this is perhaps some FreeBSD-related issue, and whether I should give TrueNAS Scale a try. I don't particularly need the extra features that Scale offers, but I'm led to believe that Debian has better driver support.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I'm going to create a LUN on each NAS which is in the main data pool and test again. I'm not entirely sure what this will achieve but I'm keen to see if it will have the same behaviour.
Let me know if this changes anything. This kind of behavior in previous setups has been down to the vdevs not being able to keep up with the network pipe - and it's entirely possible that a sustained 10Gbps is too much for these drives to cope with.
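
A quick way to sanity-check the raw sustained throughput of the drives locally, outside of iSCSI, is a read-only benchmark against the device itself (the device name below is a placeholder; check yours with nvmecontrol devlist or in the UI):

Code:
# Read-only seek/transfer benchmark; does not write to the device
diskinfo -t /dev/nvd0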

The other side of this coin is that you're likely doing the equivalent of using iSCSI with sync=standard currently, which means you're at a bit of risk for data loss. Enabling sync writes (sync=always) will slam the brakes on the writes, limiting them to what your SSD can actually commit in a more immediate fashion.
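
If you do end up putting the extents on zvols, forcing sync writes is a one-liner per zvol (the dataset path below is hypothetical):

Code:
# Force synchronous semantics on the zvol backing the iSCSI extent
zfs set sync=always tank/vmware-zvol
# Confirm the current setting
zfs get sync tank/vmware-zvol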
 

ChrisD.

Dabbler
Joined
Apr 18, 2022
Messages
26
Evening @HoneyBadger.

Apologies it's taken me a while; newborn and family duties have taken up time today.

Since 'bertie' is my backup NAS, and I wanted to see what Scale looks like, I upgraded it to Scale this evening. I then tried again: same nonzero values in CLONE_F and iSCSI sense errors in vmkernel.log.

I then created a simple 100 GB LUN from the main ZFS pool (4x 10TB RAIDZ - nothing special) and did a vMotion to that.

I've also tried setting iSCSI traffic to 1 Gbps on the VDS; again, same result.

Even doing a VAAI internal move from the NVMe LUN to the ZFS LUN, I get the same problem.

Edited to say that the move to spinning-rust-based LUNs gives different sense data:

2022-04-21T21:16:53.057Z cpu0:2097703)ScsiDeviceIO: 4161: Cmd(0x45b940778108) 0x83, CmdSN 0xdbee6 from world 2100023 to dev "naa.6589cfc00000046a4a4239e406ba6caf" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x26 0x0
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I actually had to throw down with this ping-timeout error myself last night as well. Mine turned out to be a firmware/hardware-level incompatibility with jumbo frames on the TrueNAS hardware (probably because the NIC is shared with the OOB management module, and no amount of convincing would get it to leave that hardware completely untouched).

I assume a vmkping -d -s 8972 -I <vmkernel interface> <your iSCSI target IP> returns good? (8972 bytes of payload plus ICMP/IP headers fills a 9000-byte frame, and -d sets don't-fragment, so you know jumbo frames actually pass end to end.)

This is still making me think of a network issue.

Is there any anti-DoS feature or "port security" enabled on the Unifi?
Also kind of a blunt instrument, but can you enable flow control on an iSCSI interface on the Unifi for testing?
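
And from the TrueNAS side, a couple of quick checks (the target IP is one of your ESXi iSCSI addresses) to confirm the jumbo path end-to-end and look for NIC-level errors or drops:

Code:
# 8972 bytes of payload + 8 (ICMP) + 20 (IP) = 9000; -D sets don't-fragment on FreeBSD
ping -D -s 8972 172.16.5.11
# Look for non-zero Ierrs/Idrop/Oerrs counters on the 10G interface
netstat -i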
 