VMware, iSCSI, dropped connections and lockups

2twisty · Jul 20, 2021

I have requested a maintenance window this weekend to gather the performance metrics that you've asked for.

Are there any other metrics/logs you'd like to see when the problem is occurring? Since it's basically once a week I can get an opportunity to do this kinda thing, I want to make sure I get all the data that we want to see.

Also, neither of those "dtrace one-liners" in that blog worked. They just sat there and never produce any data, so it's likely that some parameter has changed. I'll look at the link you sent, but I'm already in over my head on ZFS, and back onto the steeper part of the learning curve.

2twisty · Jul 20, 2021

Another question, and I think I know the answer.................

Would switching to NFS eliminate this? I'm thinking no, since it's happening at the ZFS level, not the network protocol level.

hescominsoon · Jul 20, 2021

2twisty said:
Another question, and I think I know the answer.................

Would switching to NFS eliminate this? I'm thinking no, since it's happening at the ZFS level, not the network protocol level.

i doubt nfs is going to help. This seems like a case of throwing too much data at the system than it can handle...

hescominsoon · Jul 20, 2021

2twisty said:
Been reading this:

Tuning the OpenZFS write throttle

However, it seems that we are pushing data INTO the san faster than it can write it -- if we adjust these parameters, will that tell our ESX to "slow the hell down" or will ESX keep pushing data at the max link speed (with round robin, that's 20GB/s per host, and we have 3, but only 2 are prod)

So I am going to test whether dropping the SAN down to a single 10GB link improves this. However, that's not acceptable long term since it leaves us without redundancy. I am hoping that our 10GB switches may be able to do some traffic shaping, but that's for our network guy to deal with -- I'm the server/linux guy at our operation.

try one as a test but you can setup a dual link for redundancy...

HoneyBadger · Jul 20, 2021

2twisty said:
Possibly. Our backup server also runs TN, but it's getting pretty full. I could drop a linux box in somewhere with a big single disk and move things one VM at a time and then back to kill the dedupe....

Of course, during that move, the data would be at risk, but I'd have a fresh backup before I did that anyway.

Ideally, I could move ALL the data off and rebuild the pool, to also reduce fragmentation.

Sadly, several of our VMDKs are HUGE, so doing one vmdk at a time is also going to be challenging.

Hmm.. Maybe I could drop a drive in the VM host as local storage, and vmotion a VM back and forth as well....

I'd definitely have a pair of drives at the very least. While you might have backups, relying on them should be a "last resort" kind of thing.

Not sure if I've asked this already - are you using sparse ZVOLs? How about VMFS6, and are your VMDKs thick or thin? Curious to see if you're having your UNMAP commands reach the array and free up the unused blocks. SSH into a vSphere host, run esxtop then go to page u, add fields with f then select VAAI statistics with o and look for the value in the DELETE column. If it's non-zero it means you're sending UNMAP commands to the device.

The way I'd do this, if I didn't have enough room to move everything off, would be to set up a new (sparse ZVOL-backed, VMFS6) datastore with deduplication off. svMotion from the "dedup datastore" to the "pivot storage" on the Linux or another TrueNAS machine (or the local datastore) while converting to thin, and then svMotion back to the "new datastore." Bear in mind that the svMotion off will still trigger writes (dedup table updates) so start with a small VMDK and watch the performance numbers. The svMotion on will of course trigger writes for data but not ddt updates.

2twisty said:
Another question, and I think I know the answer.................

Would switching to NFS eliminate this? I'm thinking no, since it's happening at the ZFS level, not the network protocol level.

I know @kspare and some other users had a whole raft of bizarre performance issues with iSCSI - but seemingly not to this degree, and only starting with TN12. Did this problem predate TrueNAS 12 ie: was it there on FreeNAS 11.3?

kspare · Jul 20, 2021

i run 4 tn and fn boxes. 1tb of ram, 24 4tb drives in mirrors, 4801x for slog. I ran into a ton of issues with iscsi. It’s 100% stable on nfs. we run hundreds of jsers on hundreds of terminal servers with ease. iscsi was a nightmare. Seriously. Try nfs3. It will work.

2twisty · Jul 20, 2021

HoneyBadger said:
I'd definitely have a pair of drives at the very least. While you might have backups, relying on them should be a "last resort" kind of thing.

Not sure if I've asked this already - are you using sparse ZVOLs? How about VMFS6, and are your VMDKs thick or thin? Curious to see if you're having your UNMAP commands reach the array and free up the unused blocks. SSH into a vSphere host, run esxtop then go to page u, add fields with f then select VAAI statistics with o and look for the value in the DELETE column. If it's non-zero it means you're sending UNMAP commands to the device.

The way I'd do this, if I didn't have enough room to move everything off, would be to set up a new (sparse ZVOL-backed, VMFS6) datastore with deduplication off. svMotion from the "dedup datastore" to the "pivot storage" on the Linux or another TrueNAS machine (or the local datastore) while converting to thin, and then svMotion back to the "new datastore." Bear in mind that the svMotion off will still trigger writes (dedup table updates) so start with a small VMDK and watch the performance numbers. The svMotion on will of course trigger writes for data but not ddt updates.

I know @kspare and some other users had a whole raft of bizarre performance issues with iSCSI - but seemingly not to this degree, and only starting with TN12. Did this problem predate TrueNAS 12 ie: was it there on FreeNAS 11.3?

zVols are NOT sparse, I think. When I edit an existing ZVOL, it doesn't show me whether it's sparse or not, but by default I would have NOT selected sparse because of the warning about overprovisioning -- in general I try not to overprovision.

Most of the VMDKs are thick. We have a few that are thin. VMFS6 throughout.

ESXTOP gives me just screens and screens of garbage.

Code:

acglp.com\PCPU Power State(PCPU 55)\%C-State C1","\\trinity.acglp.com\PCPU Power State(PCPU 55)\%C-State C2","\\trinity.acglp.com\PCPU Power State(PCPU 55)\%P-State P0","\\trinity.acglp.com\PCPU Power State(PCPU 55)\%P-State P1","\\trinity.acglp.com\PCPU Power State(PCPU 55)\%P-State P2","\\trinity.acglp.com\PCPU Power State(PCPU 56)\%C-State C0","\\trinity.acglp.com\PCPU Power State(PCPU 56)\%C-State C1","\\trinity.acglp.com\PCPU Power State(PCPU 56)\%C-State C2","\\trinity.acglp.com\PCPU Power State(PCPU 56)\%P-State P0","\\trinity.acglp.com\PCPU Power State(PCPU 56)\%P-State P1","\\trinity.acglp.com\PCPU Power State(PCPU 56)\%P-State P2","\\trinity.acglp.com\PCPU Power State(PCPU 57)\%C-State C0","\\trinity.acglp.com\PCPU Power State(PCPU 57)\%C-State C1","\\trinity.acglp.com\PCPU Power State(PCPU 57)\%C-State C2","\\trinity.acglp.com\PCPU Power State(PCPU 57)\%P-State P0","\\trinity.acglp.com\PCPU Power State(PCPU 57)\%P-State P1","\\trinity.acglp.com\PCPU Power State(PCPU 57)\%P-State P2","\\trinity.acglp.com\PCPU Power State(PCPU 58)\%C-State C0","\\trinity.acglp.com\PCPU Power State(PCPU 58)\%C-State C1","\\trinity.acglp.com\PCPU Power State(PCPU 58)\%C-State C2","\\trinity.acglp.com\PCPU Power State(PCPU 58)\%P-State P0","\\trinity.acglp.com\PCPU Power State(PCPU 58)\%P-State P1","\\trinity.acglp.com\PCPU Power State(PCPU 58)\%P-State P2","\\trinity.acglp.com\PCPU Power State(PCPU 59)\%C-State C0","\\trinity.acglp.com\PCPU Power State(PCPU 59)\%C-State C1","\\trinity.acglp.com\PCPU Power State(PCPU 59)\%C-State C2","\\trinity.acglp.com\PCPU Power State(PCPU 59)\%P-State P0","\\trinity.acglp.com\PCPU Power State(PCPU 59)\%P-State P1","\\trinity.acglp.com\PCPU Power State(PCPU 59)\%P-State P2","\\trinity.acglp.com\PCPU Power State(PCPU 60)\%C-State C0","\\trinity.acglp.com\PCPU Power State(PCPU 60)\%C-State C1","\\trinity.acglp.com\PCPU Power State(PCPU 60)\%C-State C2","\\trinity.acglp.com\PCPU Power State(PCPU 60)\%P-State P0","\\trinity.acglp.com\PCPU Power State(PCPU 60)\%P-State P1","\\trinity.acglp.com\PCPU Power State(PCPU 60)\%P-State P2","\\trinity.acglp.com\PCPU Power State(PCPU 61)\%C-State C0","\\trinity.acglp.com\PCPU Power State(PCPU 61)\%C-State C1","\\trinity.acglp.com\PCPU Power State(PCPU 61)\%C-State C2","\\trinity.acglp.com\PCPU Power State(PCPU 61)\%P-State P0","\\trinity.acglp.com\PCPU Power State(PCPU 61)\%P-State P1","\\trinity.acglp.com\PCPU Power State(PCPU 61)\%P-State P2","\\trinity.acglp.com\PCPU Power State(PCPU 62)\%C-State C0","\\trinity.acglp.com\PCPU Power State(PCPU 62)\%C-State C1","\\trinity.acglp.com\PCPU Power State(PCPU 62)\%C-State C2","\\trinity.acglp.com\PCPU Power State(PCPU 62)\%P-State P0","\\trinity.acglp.com\PCPU Power State(PCPU 62)\%P-State P1","\\trinity.acglp.com\PCPU Power State(PCPU 62)\%P-State P2","\\trinity.acglp.com\PCPU Power State(PCPU 63)\%C-State C0","\\trinity.acglp.com\PCPU Power State(PCPU 63)\%C-State C1","\\trinity.acglp.com\PCPU Power State(PCPU 63)\%C-State C2","\\trinity.acglp.com\PCPU Power State(PCPU 63)\%P-State P0","\\trinity.acglp.com\PCPU Power State(PCPU 63)\%P-State P1","\\trinity.acglp.com\PCPU Power State(PCPU 63)\%P-State P2","\\trinity.acglp.com\RDMA Device(vmrdma0:qedrntv:Active)\Name","\\trinity.acglp.com\RDMA Device(vmrdma0:qedrntv:Active)\Team Uplink","\\trinity.acglp.com\RDMA Device(vmrdma0:qedrntv:Active)\Packets Transmitted/sec","\\trinity.acglp.com\RDMA Device(vmrdma0:qedrntv:Active)\Mega Bits Transmitted/sec","\\trinity.acglp.com\RDMA Device(vmrdma0:qedrntv:Active)\Packets Received/sec","\\trinity.acglp.com\RDMA Device(vmrdma0:qedrntv:Active)\Mega Bits Received/sec","\\trinity.acglp.com\RDMA Device(vmrdma0:qedrntv:Active)\Queue Pairs Allocated","\\trinity.acglp.com\RDMA Device(vmrdma0:qedrntv:Active)\Memory Regions Allocated","\\trinity.acglp.com\RDMA Device(vmrdma1:qedrntv:Active)\Name","\\trinity.acglp.com\RDMA Device(vmrdma1:qedrntv:Active)\Team Uplink","\\trinity.acglp.com\RDMA Device(vmrdma1:qedrntv:Active)\Packets Transmitted/sec","\\trinity.acglp.com\RDMA Device(vmrdma1:qedrntv:Active)\Mega Bits Transmitted/sec","\\trinity.acglp.com\RDMA Device(vmrdma1:qedrntv:Active)\Packets Received/sec","\\trinity.acglp.com\RDMA Device(vmrdma1:qedrntv:Active)\Mega Bits Received/sec","\\trinity.acglp.com\RDMA Device(vmrdma1:qedrntv:Active)\Queue Pairs Allocated","\\trinity.acglp.com\RDMA Device(vmrdma1:qedrntv:Active)\Memory Regions Allocated","\\trinity.acglp.com\RDMA Device(vmrdma2:qedrntv:Active)\Name","\\trinity.acglp.com\RDMA Device(vmrdma2:qedrntv:Active)\Team Uplink","\\trinity.acglp.com\RDMA Device(vmrdma2:qedrntv:Active)\Packets Transmitted/sec","\\trinity.acglp.com\RDMA Device(vmrdma2:qedrntv:Active)\Mega Bits Transmitted/sec","\\trinity.acglp.com\RDMA Device(vmrdma2:qedrntv:Active)\Packets Received/sec","\\trinity.acglp.com\RDMA Device(vmrdma2:qedrntv:Active)\Mega Bits Received/sec","\\trinity.acglp.com\RDMA Device(vmrdma2:qedrntv:Active)\Queue Pairs Allocated","\\trinity.acglp.com\RDMA Device(vmrdma2:qedrntv:Active)\Memory Regions Allocated","\\trinity.acglp.com\RDMA Device(vmrdma3:qedrntv:Active)\Name","\\trinity.acglp.com\RDMA Device(vmrdma3:qedrntv:Active)\Team Uplink","\\trinity.acglp.com\RDMA Device(vmrdma3:qedrntv:Active)\Packets Transmitted/sec","\\trinity.acglp.com\RDMA Device(vmrdma3:qedrntv:Active)\Mega Bits Transmitted/sec","\\trinity.acglp.com\RDMA Device(vmrdma3:qedrntv:Active)\Packets Received/sec","\\trinity.acglp.com\RDMA Device(vmrdma3:qedrntv:Active)\Mega Bits Received/sec","\\tr

2twisty · Jul 20, 2021

OK. Tried again with PuTTY. That worked, Strange that the SSH client in Ubuntu 20.04 on WSL couldn't handle it.

DELETE is about 8000 for each iscsi datastore -- not sure which is which since it only lists the naa.xxxxxx number instead of its friendly name.

So what does this mean?

2twisty · Jul 20, 2021

Also, there IS enough space on the pool to do these moves, but it would take the pool way past the 67% utilization it's at now. I'd think it would be better to not do that to avoid fragmentation, right?

2twisty · Jul 20, 2021

So, you're saying that both ESX and the ZVol should be sparse/thin, right? Does that offer a performance gain or just allow for overprovisioning?

kspare · Jul 20, 2021

if you run nfs you don’t need a zvol. just a dataset.

hescominsoon · Jul 20, 2021

2twisty said:
So, you're saying that both ESX and the ZVol should be sparse/thin, right? Does that offer a performance gain or just allow for overprovisioning?

it allows for overprovisioning mainly. of course if they are "thick" then the boundaries are already set and it doesn't have to write the expansions of the filesystem.

Stux · Jul 20, 2021

2twisty said:
Is there a way to maintain my 40GBs connections and have TrueNAS do the throttling based on disk performance? I figured that TrueNAS would tell VMware to slow the hell down if it couldn't keep up.

Or is that because it dumps to RAM and TN can accept the data at that rate, but not write it? Can we tell TN to slow down?

The reason we have multiple links is for redundancy. We have ESX set to Round Robin (each host has 2x 10G links to the switch). We tried turning round robin off and it didn't affect it, but from what folks are saying, a single 10GB link from the host to TN would be enough to saturate it.

Sadly, our VMWare licence does not give us the throttling options in iSCSI (gotta love VMWare's licensing model), so any throttling will need to be done on the TN side.

Ideas?

I've recently transitioned my main server from 2x 8-way RaidZ2 to a 8x mirrors. I did this as part of our ugprade to 10be as it wasn't coming close to being able to saturate 10gbe. I'm adding a few more mirrors for capacity and performance, but it seems to have done the trick raising performance from circa 100-200MB/s which was fine in 1gbe land to closer to 1GB/s. We were seeing about 20MB/s per disk, which for an aged pool seems about right. These are NAS grade disks, ie 5500rpm class.

Anyway, afaik, the way ZFS throttles writes is that it stops accepting them once the TXGs back up. This results in dropped iSCSI connections.

I'd be surprised if 3 disks worth of writes could keep up with 10gbe. You need to be able to sink a GB/s. And actually you're running a multiple of that.

Perhaps there are tunables etc that can help

2twisty · Jul 21, 2021

I am surprised that ZFS just stops accepting data rather than sending a "hold on, I'm busy" signal, and just drops the connection.

So, I either need to add more disks to increase the write speed or I need to slow down the network. Lol. We deliberately built this on 10G for max speed. I never dreamed that too much speed would cause this!

I will test dropping down to a single 10G link to see if that results in any improvement. Of course, that is only a test, because we want our redundancy.

Since we aren't licensed in VMWare to throttle the iSCSI on the VMWare side, are there any tunables I can tweak to affect this? I really don't want to cripple my speed, since most of the time our traffic into the TN box is bursty and the 10G works fine -- only when I try to write a large chunk of data does it overrun everything.

hescominsoon · Jul 21, 2021

2twisty said:
I am surprised that ZFS just stops accepting data rather than sending a "hold on, I'm busy" signal, and just drops the connection.

So, I either need to add more disks to increase the write speed or I need to slow down the network. Lol. We deliberately built this on 10G for max speed. I never dreamed that too much speed would cause this!

I will test dropping down to a single 10G link to see if that results in any improvement. Of course, that is only a test, because we want our redundancy.

Since we aren't licensed in VMWare to throttle the iSCSI on the VMWare side, are there any tunables I can tweak to affect this? I really don't want to cripple my speed, since most of the time our traffic into the TN box is bursty and the 10G works fine -- only when I try to write a large chunk of data does it overrun everything.

zfs tries to send a delay..but you are pouding it so fast there's too many delays that get built up and then the cumulative effects of lal the transacton group delays times out the connection.

2twisty · Jul 21, 2021

Can you guys think of any tunables I can use to throttle the network connection?

kspare · Jul 21, 2021

2twisty said:
I am surprised that ZFS just stops accepting data rather than sending a "hold on, I'm busy" signal, and just drops the connection.

So, I either need to add more disks to increase the write speed or I need to slow down the network. Lol. We deliberately built this on 10G for max speed. I never dreamed that too much speed would cause this!

I will test dropping down to a single 10G link to see if that results in any improvement. Of course, that is only a test, because we want our redundancy.

Since we aren't licensed in VMWare to throttle the iSCSI on the VMWare side, are there any tunables I can tweak to affect this? I really don't want to cripple my speed, since most of the time our traffic into the TN box is bursty and the 10G works fine -- only when I try to write a large chunk of data does it overrun everything.

Try. NFS! Why are you so hung up on iscsi?

hescominsoon · Jul 21, 2021

kspare said:
Try. NFS! Why are you so hung up on iscsi?

nfs would not solve the underlying issue..which is too much data coming into the box. nfs is not a panacea.

kspare · Jul 21, 2021

you know this for fact without trying it? I run more load than you do, and iscsi failed me too..nfs is flawless….take the advice how you will.

hescominsoon · Jul 21, 2021

kspare said:
you know this for fact without trying it? I run more load than you do, and iscsi failed me too..nfs is flawless….take the advice how you will.

what is in my signature is not the only systems i have under my control. You have no idea what loads i run on what machines I have. To say somehting is flawless is shortsighted and arrogant to say the least.

Important Announcement for the TrueNAS Community.

VMware, iSCSI, dropped connections and lockups

Contributor

Contributor

Patron

Patron

actually does care

Guru

Contributor

Contributor

Contributor

Contributor

Guru

Patron

MVP

Contributor

Patron

Contributor

Guru

Patron

Guru

Patron

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "VMware, iSCSI, dropped connections and lockups"

Similar threads