VMware, iSCSI, dropped connections and lockups

2twisty

Contributor
Joined
Mar 18, 2020
Messages
145
I'm not sure where to begin..

Right now I can't even log into my TN server to get version information, but I'm on TN Core 12.xx. Not sure which rev.

This is a live production system and can't just be rebooted or wiped, etc., without major inconvenience to management -- they're already mad enough that we are having the problem I outline below...

HP DL385 Gen10 Plus chassis
32-core EPYC
256GB ECC RAM
4x 10Gb NICs (all for iSCSI)
4x 1Gb NICs (1 used for management)

2 pools:

Tier1: 6x 1TB NVMe in striped mirrors
Tier2: 6x 8TB SAS12 HDD in striped mirrors, with a 1TB NVMe SLOG and a 1TB NVMe cache (L2ARC)

I can't tell you pool utilization at the moment because the server won't respond and I can't get the web GUI up. I'll get more details as soon as I can, but suffice it to say that both pools are WAY past 50%.


The Problem:

Any time we write a large chunk of data to Tier2, we wind up in an oscillating condition where iSCSI drops all the connections to the VMs. Then, after a couple minutes, it reconnects, and then a couple minutes later it disconnects again. This goes on and on and on until TN is done writing (or whatever it is doing) with the data.

After our backup started and took its snapshot (Veeam), I wrote 400GB to the array. Now VMware is trying to consolidate the snapshot and has entered the oscillation. Since VMware depends on iSCSI to consolidate the snapshot, the disconnect/reconnect business means it is literally taking ALL DAY to complete the consolidation.

This is causing my main file server VM and my Exchange VM (among others) to hang for minutes at a time. It's a major disruption to business continuity, and we are afraid that if we force a reboot to kill the oscillation we may lose data -- so we wait.

The management is not very happy with us right now, and understandably so.

What information can I provide (as soon as I can get the GUI up) to help you help me:

1) Get the oscillation to stop so we can restore normal operations
2) Repair the misconfiguration that is allowing this to happen in the first place

Any help is greatly appreciated.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
What information can I provide (as soon as I can get the GUI up) to help you help me?
"HP DL385 Gen10 plus chassis"

Questions coming:

What is the HBA in use here? (If it's a RAID card, problem.)

What are the models of the "1TB NVMe" SSDs, and is it the same model across the data pool/SLOG/L2ARC?

What is the model of HDD? I don't suspect this is a pain point.

What is the sync setting on the pool/ZVOLs assigned to VMware?

What model of switch is in the middle, and please describe network topology (IP addressing/subnets/any overlap) and any specific policies in place on vSphere.

What I'm betting on: your network is way, WAY too fast for your vdevs, sync is standard (which is "off" in iSCSI defaults) and it's swamping your transaction groups. If you have a RAID card with cache in the middle it's getting absolutely blown out by the surge of I/O.
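
If you can get a shell on the box, the sync and atime settings are quick to check; a minimal sketch using the pool names from your post (your zvol names will differ):

Code:
zfs get -r sync,atime Tier1 Tier2    # shows each setting and whether it is local, inherited, or default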
 

kspare

Guru
Joined
Feb 19, 2015
Messages
508
The only thing I would add: if you are insistent on running TN, ditch iSCSI and run NFS.
 

hescominsoon

Patron
Joined
Jul 27, 2016
Messages
456
The only thing I would add: if you are insistent on running TN, ditch iSCSI and run NFS.
I have several TN and TNC servers all running iSCSI under heavy load without issues. Windows doesn't like NFS much; iSCSI works better with Windows, which is what the hosts use (Hyper-V).
 

kspare

Guru
Joined
Feb 19, 2015
Messages
508
Fair. I assumed you were running ESXi.
 

hescominsoon

Patron
Joined
Jul 27, 2016
Messages
456
I'm not sure where to begin..

Right now I can't even log into my TN server to get version information. [...]

Any help is greatly appreciated.
Do you have SSH enabled? If so, you can log into the machine and restart the middleware with service middlewared restart. See if that gets the web interface back online.
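
If SSH does respond, the restart itself is just this (a minimal sketch):

Code:
service middlewared restart    # restart the middleware behind the web UI
service middlewared status     # confirm it came back up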
 

2twisty

Contributor
Joined
Mar 18, 2020
Messages
145
OK -- I'll try to answer all these questions:

Questions coming:

What is the HBA in use here? (If it's a RAID card, problem.)

The HBAs are not RAID cards. We have 2x HP H240 HBAs.

What are the models of the "1TB NVMe" SSDs, and is it the same model across the data pool/SLOG/L2ARC?

All 8 NVMe SSDs are identical and are installed in M.2 adapter cards (2 per card), ASUS Hyper M.2 PCIe Gen3.

(When I went to fetch the model numbers from TN, it was "sleeping" again and I can't access the GUI. I believe they are all Sabrent)

6 of these are striped and mirrored into a pool named Tier1; the other 2 are used for SLOG and cache on Tier2.


What is the model of HDD? I don't suspect this is a pain point.

6x HGST He8 8TB SAS12 drives in striped mirrors in pool Tier2


What is the sync setting on the pool/ZVOLs assigned to VMware?

Standard, and atime is on

What model of switch is in the middle, and please describe network topology (IP addressing/subnets/any overlap) and any specific policies in place on vSphere.

I will have to ask our network guy to answer this part in full detail. My understanding is this: iSCSI is on a separate subnet/VLAN and on separate 10G switches.

What I'm betting on: your network is way, WAY too fast for your vdevs, sync is standard (which is "off" in iSCSI defaults) and it's swamping your transaction groups. If you have a RAID card with cache in the middle it's getting absolutely blown out by the surge of I/O.

That's what we think, as well. However, we don't know what to do about it.

I have a TN server at home in my homelab, and last night I tried to replicate this failure and was unable to. I can write all day long at almost 200MB/s to and from the TN server over the iSCSI connections from my VMware host and VMs. I even tried fully allocating my pool to see if running it at 95% utilization had an effect. Nope. Happy camper.

Which leads me to ZFS fragmentation, which is probably a separate post after I do some research... but in short: if the array has been over 50% utilization for a while and the fragmentation gets bad, I assume that it could cause (or contribute to) this problem. And since there is no REAL way to defrag ZFS, just deleting or moving data to get back to 50% won't necessarily help, since the data is already fragmented. Right? Be on the lookout for a more complete post about this once I read and understand more and can formulate my thoughts.
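
When the GUI (or SSH) cooperates again, both pools' capacity and fragmentation show up in one command; a quick sketch:

Code:
zpool list -o name,size,alloc,free,frag,cap Tier1 Tier2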
 

2twisty

Contributor
Joined
Mar 18, 2020
Messages
145
Do you have SSH enabled? If so, you can log into the machine and restart the middleware with service middlewared restart. See if that gets the web interface back online.
Yes, SSH is enabled.

When the GUI goes to "sleep," SSH will allow login, but as soon as I execute any binary, SSH locks up. I can do ls, cd, and cat all I want on the drive, but if I try to restart middlewared or even run something like top, the SSH session locks up.

SSH will return to normal at the same time that the GUI does.

So, restarting middlewared doesn't work for me.
 

2twisty

Contributor
Joined
Mar 18, 2020
Messages
145
The only thing I would add, if you are insistent on running TN, ditch iscsi and run nfs.
We are running ESXi 7 (we also have an old host on 6.5 as a dev box).

Why is NFS better than iSCSI for VMware? What's involved in converting? Creating an NFS datastore and then vMotion?
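
From what I've read so far, the ESXi side of an NFS datastore is a one-liner once a dataset is exported over NFS; something like this (the host IP and dataset path below are made up, not our real config):

Code:
esxcli storage nfs add -H 10.0.0.10 -s /mnt/Tier2/vmware-nfs -v tier2-nfs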
 

2twisty

Contributor
Joined
Mar 18, 2020
Messages
145
HEY! The GUI woke up! (it does this several times per day)

NVMe SSDs are 1TB Sabrent. TN only says "sabrent" under model number, so I can't be more specific.

Tier2 disks are HGST HUH728080AL4200
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The HBAs are not RAID cards. We have 2x HP H240 HBAs.

My recollection is that the H240 is not an LSI HBA and is based on some other chipset. This means that you are in largely uncharted waters, and you would be well advised to swap them out for a known-compatible LSI HBA running the correct version of IT firmware.

The "goes to sleep" thing you're describing sounds very much like a ZFS write lockup. ZFS can only have two outstanding transaction groups for writing, and if the I/O subsystem is slow or hangs, then ZFS will hang, indefinitely, basically forever, waiting for the transaction group to complete.

So what can happen here is that an unsupported RAID or HBA card can maybe take ten seconds to time out a transaction in progress, but when I say "transaction", I mean a conventional "IOPS", which might be 64KB of data. A ZFS transaction group can be a gigabyte of data, so what ends up happening is that your HBA gets presented with 16,000 I/O operations, each one of which takes maybe ten seconds to fail, so the ZFS transaction group that SHOULD have completed in well under five seconds might instead take 160,000 seconds (nearly two days) to complete.
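
If you want to see the knobs involved, the transaction group tunables are visible via sysctl on Core; a sketch (exact names can vary a bit between FreeBSD/OpenZFS releases):

Code:
sysctl vfs.zfs.txg.timeout       # seconds between forced txg syncs (default 5)
sysctl vfs.zfs.dirty_data_max    # cap on dirty (unwritten) data before ZFS throttles writers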

Now of course I don't know exactly what your craHPpy HBA is actually doing or what its actual timeout characteristics are or what might be happening underneath the sheets or what specific bad thing is happening.

But I do know that the reason we hammer home "LSI HBA" so hard is that the LSI HBAs, and specifically only the LSI HBAs with the correct IT firmware version, are known to actually handle these situations correctly.

The next argument I tend to hear when having this unhappy conversation with people is "but I have other machines where it works" or "it used to work" or stuff along those lines. But that's not the point. There are actually LOTS of HBAs and RAID cards that work swimmingly well under normal conditions. The issue here is what happens under ADVERSE conditions, such as a failing disk, SCSI bus errors, and other stuff like that. The LSI HBAs have proven to handle these conditions correctly as well, while almost everything else cocks it up somehow.
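
For reference, once a proper LSI card is in the box, the driver utilities in FreeBSD's base system will report the model and firmware so you can confirm it's the expected IT firmware version; a sketch (which one applies depends on whether the card attaches via the mps or mpr driver):

Code:
mpsutil show adapter    # SAS2-generation cards (mps driver)
mprutil show adapter    # SAS3-generation cards (mpr driver)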
 

2twisty

Contributor
Joined
Mar 18, 2020
Messages
145
Let me clarify the "sleep" thing:

There are 2 problems that happen with this server: gui/ssh "sleep" and iSCSI disconnect/reconnect oscillation. They are not connected to each other; one can happen without the other.

When I refer to GUI sleep, all my shares and iSCSI are still functioning perfectly; I just can't access the GUI or do anything useful via SSH until it "wakes up."

The disconnect oscillation is the bigger concern, since it takes out my VMs, and the users NOTICE that. Only the IT staff notice the GUI sleep issue.

I have done some searching on the H240, and while you're right that it appears not to be LSI-based (I thought it was when we ordered it, GRRRRR), many people report no issues with them on TrueNAS Core, while others do report problems. I do have the card in HBA mode.

I suspect this is due to transactions piling up, but I don't know how to verify that.

My next service window would be this weekend, and I am trying to figure out what (and how many) config changes I should make at a time to try to resolve this.

I think the most important thing would be to force sync=always and to turn off atime.
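
What that looks like from the shell, as a sketch -- the zvol name below is hypothetical, and the same settings can be applied per-dataset in the GUI:

Code:
zfs set sync=always Tier2/vmware-zvol    # hypothetical zvol name; forces sync writes through the SLOG
zfs set atime=off Tier2                  # atime only applies to filesystems/datasets, not zvols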

Creating the oscillation condition is actually pretty reproducible, since all I have to do is write a bunch of data to set it off.

Do you guys think this is a reasonable first step?

What before/after metrics can I use to see if there has been a positive effect (aside from "it's not doing that annoying thing anymore?")
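
A couple of shell-level numbers I plan to capture during a large write, both before and after the change (a sketch):

Code:
zpool iostat -v Tier2 5    # per-vdev bandwidth and IOPS at 5-second intervals
gstat -p                   # per-disk busy % and latency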
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Speculation about known-problematic situations is frustrating at best. All the lab stuff we have here is for gear that's expected to work, and since iXsystems doesn't pay me anything, any off-road adventures are something I have to fund out of my company's pockets, which usually doesn't happen unless we happen to have those exact things lying around, a server already open at one of the bays on the shop bench, and time to play.

ZFS is used for both the OS boot pool and for the data storage pool. If the non-LSI HBA is being used as a controller for the boot pool, it is trivial to imagine that there could be independent freezes. Even in the case where the boot pool is attached some other way, I have seen cases where a badly broken ZFS pool seemed to impact other operations on the host in unexpected ways, but that isn't a situation that I try to fully understand... I generally move to remediation immediately, and then don't need to worry about the "why". I'm sure some developer somewhere might be able to speculate as to the specifics of the behaviour. It probably has something to do with ZFS cache allocations and all that.

I fully appreciate that this isn't the answer that you'd prefer, but it's all I've got, sorry.
 

2twisty

Contributor
Joined
Mar 18, 2020
Messages
145
I am not a BSD guy.

The boot pool is served by the HP HBA. Where might I find error messages in which log?

When the GUI goes to sleep, I see nothing in the console messages.

When we have the disconnect/reconnect oscillation I do get lots of messages like this:

Code:
ctl_datamove: tag 0x279a63ca on (3:3:0) aborted
(3:3:0/0): WRITE(16). CDB: 8a 00 00 00 00 02 e8 ef b1 80 00 00 00 80 00 00
(1:3:0/0): COMPARE AND WRITE. CDB: 89 00 00 00 00 00 00 00 aa c0 00 00 00 01 00 00
(3:3:0/0): Tag: 0x279a63cb, type 1
(1:3:0/0): Tag: 0x4c958c57, type 1
(1:3:0/0): ctl_process_done: 93 seconds
(3:3:0/0): ctl_datamove: 93 seconds
ctl_datamove: tag 0x279a63cb on (3:3:0) aborted
(1:3:0/0): WRITE(10). CDB: 2a 00 3c ab 71 40 00 00 30 00
(1:3:0/0): Tag: 0x4c958c58, type 1
(1:3:0/0): ctl_datamove: 92 seconds
ctl_datamove: tag 0x4c958c58 on (1:3:0) aborted
(3:3:0/0): Tag: 0x279a63c8, type 1
(3:3:0/0): ctl_datamove: 93 seconds
ctl_datamove: tag 0x279a63c8 on (3:3:0) aborted
(3:3:0/0): WRITE(16). CDB: 8a 00 00 00 00 01 77 ad 0b e0 00 00 00 90 00 00
(3:3:0/0): Tag: 0x279a63cd, type 1
(3:3:0/0): ctl_datamove: 92 seconds
ctl_datamove: tag 0x279a63cd on (3:3:0) aborted
(1:3:0/0): READ(10). CDB: 28 00 8e 9e 37 60 00 01 00 00
(3:3:0/0): COMPARE AND WRITE. CDB: 89 00 00 00 00 00 00 00 ae 88 00 00 00 01 00 00
(1:3:0/0): Tag: 0x4c958c59, type 1
(3:3:0/0): Tag: 0x279a63ce, type 1
(3:3:0/0): ctl_datamove: 92 seconds
ctl_datamove: tag 0x279a63ce on (3:3:0) aborted
(0:3:0/0): COMPARE AND WRITE. CDB: 89 00 00 00 00 00 00 00 ae 80 00 00 00 01 00 00
(1:3:0/0): ctl_datamove: 92 seconds
(0:3:0/0): Tag: 0x46ece42, type 1
(0:3:0/0): ctl_datamove: 92 seconds
ctl_datamove: tag 0x46ece42 on (0:3:0) aborted
ctl_datamove: tag 0x279a63cf on (3:3:0) aborted
ctl_datamove: tag 0x4c958c59 on (1:3:0) aborted
ctl_datamove: tag 0x4c958c5b on (1:3:0) aborted
ctl_datamove: tag 0x4c958c5c on (1:3:0) aborted
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I am not a BSD guy.

The boot pool is served by the HP HBA. Where might I find error messages in which log?

Not having your HP HBA, damned if I know. It might show up in the messages file. You'll have to look around. This is one of the problems with doing something different.
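
Something along these lines is where I'd start looking (a sketch; adjust the pattern to taste):

Code:
tail -n 500 /var/log/messages | egrep -i 'ctl|timeout|error'
dmesg | tail -n 100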

When the GUI goes to sleep, I see nothing in the console messages.

When we have the disconnect/reconnect oscillation I do get lots of messages like this:

Code:
ctl_datamove: tag 0x279a63ca on (3:3:0) aborted
[...]
ctl_datamove: tag 0x4c958c5c on (1:3:0) aborted

Well, that's pretty clear: replace the HBA with a supported one. This may not actually fix your issue, because it could well be a failing drive, bad cable, backplane issue, etc., but 92-second timeouts are not going to be happy for iSCSI.
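
A quick way to rule out an obviously dying disk while you wait on the HBA swap, for what it's worth (a sketch; repeat for each da device):

Code:
smartctl -a /dev/da0 | egrep -i 'health|defect|reallocat|pending|error'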
 

2twisty

Contributor
Joined
Mar 18, 2020
Messages
145
What is strange about this is that it only occurs when I write large volumes of data. If I trickle it in, the server runs fine with no errors.
 

hescominsoon

Patron
Joined
Jul 27, 2016
Messages
456
I'll defer to others about the HBA, although the drive adapters are where I would look first. An adapter that cannot handle the load can suddenly just stop working, which can lead to issues with the HBA because the drives seem to disappear.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
What is strange about this is that it only occurs when I write large volumes of data. If I trickle it in, the server runs fine with no errors.

So basically what you're saying is that when you flood the controller with I/O (smashing into host and HBA buffer limits, stressing the HBA CPU, causing I/O to pend, and hitting all sorts of compsci edge cases), you find yourself losing I/O.

I literally warn about all of this stuff in the linked article above. There's nothing strange about it. It takes a driver and firmware for the HBA that handle everything 100.00000% right, because ZFS likes to place crushing I/O loads on the devices; it has all these ideas about creating monster transaction groups in order to create the magic performance wins ZFS is well known for. If the HBA firmware or device drivers aren't up to it, you start losing out, and because ZFS plays its game in a monster-giant-big way, you lose out monster-giant-big-horribly.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Also, if you are using the same SSD type for your SLOG as your Tier1 drives, they might not be suitable as SLOG devices.

But I suspect the primary issue is the HBA.
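
If you want to sanity-check the Sabrent as a SLOG, FreeBSD's diskinfo has a synchronous-write latency test; note that it writes to the device, so only point it at a disk that is not part of any pool (the device name below is hypothetical):

Code:
# DESTRUCTIVE to data on the target device -- only run against an unused disk
diskinfo -wS /dev/nvd7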
 