VMware, iSCSI, dropped connections and lockups

kspare

Guru
Joined
Feb 19, 2015
Messages
508
Lol. Ok. I've been down this road. I'm trying to give you a fix. Keep suffering and wasting time. Doesn't bother me. You can't easily try it though.
 

2twisty

Contributor
Joined
Mar 18, 2020
Messages
145
Lol. Ok. I've been down this road. I'm trying to give you a fix. Keep suffering and wasting time. Doesn't bother me. You can't easily try it though.

Thanks for the suggestion, and we may still explore that. Our management specifies iSCSI, so we have to go "all the way" down that road before I could convince them to switch everything over to NFS.

I appreciate and will consider ALL suggestions.
 

2twisty

Contributor
Joined
Mar 18, 2020
Messages
145
OK -- so... I will be doing some testing this weekend, and hope to prove that we are just pushing data into the array too damn fast for the 3-disk write bandwidth we have.

So -- if I wanted to build a disk array for bulk storage, how many drives would I need to absorb that much IO?

Thinking about a disk shelf with a bunch of 4TB drives in it, all mirrored, etc. Is this doable or am I stuck with either slowing down the IO or converting to all flash (not an option due to cost)?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
OK -- so... I will be doing some testing this weekend, and hope to prove that we are just pushing data into the array too damn fast for the 3-disk write bandwidth we have.

So -- if I wanted to build a disk array for bulk storage, how many drives would I need to absorb that much IO?

Thinking about a disk shelf with a bunch of 4TB drives in it, all mirrored, etc. Is this doable or am I stuck with either slowing down the IO or converting to all flash (not an option due to cost)?

Is Tier2/HDD just "bulk storage" being used as a backup target, or are there live VMs running on it? For a backup target, you might be able to handle that with a couple dozen disks - to keep up with 40Gbps for live VMs on spindles, you're likely going to need a 42U rack full of JBODs to even remotely stand a chance.

I do have to wonder if your current SLOG device is doing something like hanging up during TRIM (since the SLOG workload under normal operation is "write, delete, repeat"), which causes it to be unable to take the new writes. Check the ms/d and delete bandwidth during the workload.
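
A hedged sketch of how to watch that from the FreeNAS shell while the test runs (the one-second interval is just an example; look at the row for your SLOG device):
Code:
# -d adds delete (BIO_DELETE/TRIM) statistics to the display; ms/d is the
# average delete latency and the extra kBps column is delete bandwidth.
gstat -dp -I 1s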

Can you change the VMware PSP off of round-robin to a fixed-path or MRU policy? This would cap you at 10Gbps and let you see whether ZFS can fail more gracefully at that rate. The throttle slope might just be kicking in too gently to put the brakes on a combined 40GbE before the system puts itself into a no-win situation with the amount of dirty data.
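
If you want to try it, a sketch of the per-host change with esxcli (the device identifier naa.xxxxxxxx is a placeholder for whatever the first command reports for the FreeNAS LUN):
Code:
# List datastore devices and their current path selection policy.
esxcli storage nmp device list
# Pin the LUN to a single path (or use VMW_PSP_MRU for most-recently-used).
esxcli storage nmp device set --device naa.xxxxxxxx --psp VMW_PSP_FIXED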
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
OK folks, here's the update, and it's not good:
I have 6 drives in a 3-way striped mirror, so I have the write performance of 3 drives.

You misspelled "2 drives". If you are using three way mirrors, you have two vdevs, so you have the IOPS capacity of two drives, and the write characteristics of two drives.

  • 12Gb SAS means max transfer of 1.5GB/s
  • 1.5GB/s * 3 drives = 4.5GB/s
  • 4.5GB per 1s TXG * 2 TXGs = 9GB (rounded up to 10GB)

That's okay for estimating the max SLOG you need, but don't forget that you need guaranteed performance for iSCSI, or else iSCSI freaks out and resets connections and everything goes to hell. If you have the IOPS capacity of two vdevs, you really only have about 300-400 write IOPS available to you, which could be as little as 150-205KBytes/sec (512 bytes * 300-400) in the pessimistic case where you are doing maximum seeking.
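
As a back-of-the-envelope sketch of that pessimistic floor (two vdevs at roughly 150-200 write IOPS each, 512-byte I/Os):
Code:
echo $(( 2 * 150 * 512 ))   # 153600 bytes/sec, ~150 KBytes/sec
echo $(( 2 * 200 * 512 ))   # 204800 bytes/sec, ~205 KBytes/sec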

Tested again. Marginal improvement, still fails.

So I thought, "OK, 12Gb/s is the MAX transfer rate, not likely these drives will achieve that in actual write speed," so I dropped the SLOG to 5GB.

Dropping SLOG size will not help, at least not for any reason I can think of.

Basically, having read this thread on and off for a while now, it feels like you have super-high expectations of what a hard drive is going to provide. ZFS can sometimes deliver on that, which is probably sabotaging the process of setting real-world expectations here. When ZFS has very low pool occupancy rates, say 10-15%, writes are often contiguous, so you can in fact get performance that is closer to the sequential write speeds of your HDD's, even for random write loads. However, as you start to fill those drives, your ability to find contiguous free space will diminish, and rewrite performance will suffer dramatically.

But at the end of the day, trying to aim a 40GbE at six hard drives is like trying to water a potted plant with a firehose.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
So -- if I wanted to build a disk array for bulk storage, how many drives would I need to absorb that much IO?

The hypothetical worst case situation here is that a single vdev might only be able to handle 150 IOPS of 512-byte I/O's. That is 77KBytes/sec or 614Kbps. To sustain 40Gbps, you would need 65,146 vdevs, or roughly 130,000 mirrored hard drives. As an expectation-setting exercise, I hope you get the point being made.

In practice, of course, it shouldn't be anywhere near that bad. So let's calculate the most optimistic case, just for comparison. A single modern HDD should be able to sustain 150MByte/sec, or 1.2Gbit/sec, in sequential writes. So you would need 34 vdevs, meaning 68 mirrored hard drives, or about a hundred three-way mirrored hard drives, to have a chance of fully saturating a 40Gbps link. Anything less is simply not going to be able to keep up.
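
The same arithmetic as a quick sketch (the exact vdev count shifts a little depending on rounding):
Code:
# Worst case: 150 IOPS x 512 bytes x 8 bits per vdev, against 40Gbit/sec.
echo $(( 150 * 512 * 8 ))           # 614400 bit/sec per vdev
echo $(( 40000000000 / 614400 ))    # 65104 vdevs (65,146 if you round to 614Kbps)
# Best case: ~1.2Gbit/sec sequential per vdev.
echo $(( 40000 / 1200 ))            # prints 33; really 33.3, round up to 34 vdevs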

Thinking about a disk shelf with a bunch of 4TB drives in it, all mirrored, etc. Is this doable or am I stuck with either slowing down the IO or converting to all flash (not an option due to cost)?

You're better off thinking about a disk shelf with a bunch of larger hard drives in it, and then not using most of the space. One of the biggest write accelerators for ZFS is when it can easily find contiguous free space.
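
A rough way to keep an eye on that headroom is the CAP and FRAG columns of zpool list ("tank" below is a placeholder pool name):
Code:
# CAP is pool occupancy, FRAG is free-space fragmentation; the higher both
# climb, the harder it gets for ZFS to find contiguous free space for writes.
zpool list tank
zpool list -v tank    # the same figures, broken out per vdev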
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
you need guaranteed performance for iSCSI, or else iSCSI freaks out and resets connections and everything goes to hell.

This is part of the reason that the "use NFS" advice holds some water; vSphere sends NFS writes as synchronous, whereas the VMware iSCSI initiator sends its writes asynchronously, which is why you need to force sync=always from the ZFS side.
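
On the ZFS side that difference comes down to the sync property; a hedged sketch, with placeholder dataset/zvol names:
Code:
# iSCSI zvol behind a VMware datastore: the initiator writes async, so force sync.
zfs set sync=always tank/vmware-zvol
# NFS dataset for a VMware datastore: vSphere already issues sync writes,
# so the default behaviour is enough.
zfs set sync=standard tank/vmware-nfs
zfs get sync tank/vmware-zvol tank/vmware-nfs   # verify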

But at the end of the day, trying to aim a 40GbE at six hard drives is like trying to water a potted plant with a firehose.
I'm stealing this metaphor. This is mine now.
 

2twisty

Contributor
Joined
Mar 18, 2020
Messages
145
OK. We conducted some tests on Saturday and, as Murphy's Law would have it, we were unable to reproduce the issue, even with round robin turned on and pushing data into that server as hard as we could. Of course.

We did get it to "fail" twice, but the failure only lasted a few seconds, so there was no way to test against it.

However, since I was monitoring both the dirty data in RAM and the SLOG, I started to see a pattern.

When we were beating the hell out of the array, we saw both the read and write performance to the disks drop dramatically. We were getting read rates of 20-40M, and write rates of 100-200.

When I am doing write-only, I see numbers closer to 500-600.
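
(For anyone following along, per-vdev read/write rates like these can be watched while the test runs; "tank" is a placeholder pool name.)
Code:
zpool iostat -v tank 5    # per-vdev bandwidth and IOPS, refreshed every 5 seconds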

I believe we are suffering from a lack of IOPS, which is what was suggested earlier in this thread. I know that I can't ever really expect to saturate 40Gb onto disks, but I was thinking that we might be able to mitigate the problem by adding a bunch more disks in an external disk shelf. I was thinking about adding 12 more disks, to effectively triple the IOPS we have now.

Do you think this is enough? Is it overkill?

To answer a previous question, these are live VMs on these disks -- not just bulk storage. Our main file server's data and all our Exchange data is stored on these disks, and we can't afford a flash solution of that capacity.
 

Mlovelace

Guru
Joined
Aug 19, 2014
Messages
1,111
I was thinking about adding 12 more disks, to effectively triple the IOPS we have now.
I would be looking to add 12 more vdevs (24 disks in striped mirrors) to that tier 2 pool. Supermicro has several JBOD enclosures for adding additional spinners to your server. I don't know what your budget looks like, but I would get the 45-disk JBOD so you can grow as needed.
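
On FreeNAS you would normally grow the pool through the GUI, but just to show the shape of that layout, the raw command would look something like this (pool and device names are placeholders; double-check them before running):
Code:
# Twelve more two-way mirror vdevs (24 disks) from the JBOD shelf.
zpool add tank \
  mirror da8  da9   mirror da10 da11  mirror da12 da13 \
  mirror da14 da15  mirror da16 da17  mirror da18 da19 \
  mirror da20 da21  mirror da22 da23  mirror da24 da25 \
  mirror da26 da27  mirror da28 da29  mirror da30 da31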
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
When we were beating the hell out of the array, we saw both the read and write performance to the disks drop dramatically. We were getting read rates of 20-40M, and write rates of 100-200.

When I am doing write-only, I see numbers closer to 500-600.

I believe we are suffering from a lack of IOPS, which is what was suggested earlier in this thread. I know that I can't ever really expect to saturate 40Gb onto disks, but I was thinking that we might be able to mitigate the problem by adding a bunch more disks in an external disk shelf. I was thinking about adding 12 more disks, to effectively triple the IOPS we have now.

Having worked with non-ZFS storage arrays, I can tell you that one of the things storage guys tend to plan for is a relatively low number of IOPS from hard drives. ZFS can be better in properly designed circumstances, such as when you can get your read working set into ARC/L2ARC, or when you maintain low occupancy rates to accelerate writes. These two things are key, as outlined in The path to success for block storage.

It is important to note that, as @Mlovelace indicates, the sheer number of vdevs you have plays an important role, but you can also gain massive write performance benefits from using 8TB or 12TB HDD's instead of 4TB HDD's AND NOT FILLING THEM, and adding lots of ARC and L2ARC to accelerate read speeds. Many VM workloads include highly localized areas of the VM disks which are actually accessed on a regular basis (the "working set"), so if you maintain sufficient L2ARC to cache most blocks that are read on a daily or more frequent basis, you get this really awesome behaviour from ZFS where most reads are fulfilled from ARC/L2ARC, while the pool is able to sustain much higher levels of writes.
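
A hedged sketch of the L2ARC side of that (the cache device name and pool name are placeholders, and on FreeNAS this is normally done from the GUI):
Code:
# Attach an NVMe device as L2ARC, then watch how much of the read working
# set is being served from ARC/L2ARC.
zpool add tank cache nvd1
arc_summary | less    # arc_summary.py on older FreeNAS builds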
 

2twisty

Contributor
Joined
Mar 18, 2020
Messages
145
I was thinking about grabbing a 24-slot used NetApp shelf. I can get them inexpensively, and while only 6Gb SAS, I think that might work for us.
 

2twisty

Contributor
Joined
Mar 18, 2020
Messages
145
We are using 8TB drives now, and that's what I would use in the disk shelf I am thinking about. The added vdevs would increase our capacity by huge amounts and allow us to not fill them.

The question is, would 12 more disks in striped mirrors added to the existing 6 be enough?

I know this is a Coke vs Pepsi question, but would switching to NFS help since it's not "block" storage?
 

hescominsoon

Patron
Joined
Jul 27, 2016
Messages
456
This is part of the reason that the "use NFS" advice holds some water; vSphere sends NFS writes as synchronous, whereas the VMware iSCSI initiator sends its writes asynchronously, which is why you need to force sync=always from the ZFS side.


But at the end of the day, trying to aim a 40GbE at six hard drives is like trying to water a potted plant with a firehose.
I'm stealing this metaphor. This is mine now.

As am I. I am also putting it into my notable quotables archive.. :)
 

hescominsoon

Patron
Joined
Jul 27, 2016
Messages
456
We are using 8TB drives now, and that's what I would use in the disk shelf I am thinking about. The added vdevs would increase our capacity by huge amounts and allow us to not fill them.

The question is, would 12 more disks in striped mirrors added to the existing 6 be enough?
For 40 gigabits? I doubt it. 10? Should be. 20? Maybe.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
For 40 gigabits? I doubt it. 10? Should be. 20? Maybe.

10Gbps? With nine HDD vdevs? Very, very, very sketchy. It totally depends on what you're actually expecting out of the thing. I would think you'd probably be fine at 1Gbps, except that I know I could break it if given a free hand to place a torturous workload on it. If you aren't popping surprise stressy write workloads on it that break the ZFS write throttle, 10Gbps could be fine, but I guarantee it to be breakable if you get even moderately aggressive.
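
For anyone who wants to see where that throttle sits on their own box, the knobs are visible read-only via sysctl; the names below are the FreeBSD spellings and vary somewhat across ZFS versions, so treat this as a sketch:
Code:
sysctl vfs.zfs.dirty_data_max            # cap on outstanding dirty data
sysctl vfs.zfs.delay_min_dirty_percent   # dirty level where write delays begin
sysctl vfs.zfs.delay_scale               # how steeply the artificial delay ramps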
 

hescominsoon

Patron
Joined
Jul 27, 2016
Messages
456
10Gbps? With nine HDD vdevs? Very, very, very sketchy. It totally depends on what you're actually expecting out of the thing. I would think you'd probably be fine at 1Gbps, except that I know I could break it if given a free hand to place a torturous workload on it. If you aren't popping surprise stressy write workloads on it that break the ZFS write throttle, 10Gbps could be fine, but I guarantee it to be breakable if you get even moderately aggressive.
I sit corrected.. :)
 

hescominsoon

Patron
Joined
Jul 27, 2016
Messages
456
The hypothetical worst case situation here is that a single vdev might only be able to handle 150 IOPS of 512-byte I/O's. That is 77KBytes/sec or 614Kbps. To sustain 40Gbps, you would need 65,146 vdevs, or roughly 130,000 mirrored hard drives. As an expectation-setting exercise, I hope you get the point being made.

In practice, of course, it shouldn't be anywhere near that bad. So let's calculate the most optimistic case, just for comparison. A single modern HDD should be able to sustain 150MByte/sec, or 1.2Gbit/sec, in sequential writes. So you would need 34 vdevs, meaning 68 mirrored hard drives, or about a hundred three-way mirrored hard drives, to have a chance of fully saturating a 40Gbps link. Anything less is simply not going to be able to keep up.



You're better off thinking about a disk shelf with a bunch of larger hard drives in it, and then not using most of the space. One of the biggest write accelerators for ZFS is when it can easily find contiguous free space.

We are using 8TB drives now, and that's what I would use in the disk shelf I am thinking about. The added vdevs would increase our capacity by huge amounts and allow us to not fill them.

The question is, would 12 more disks in striped mirrors added to the existing 6 be enough?

I know this is a Coke vs Pepsi question, but would switching to NFS help since it's not "block" storage?
As jgreco said above in the first quoted section: no, you simply do not have enough hardware to keep up. You are going to need quite a few more vdevs to have a prayer.
NFS might help a little bit, but you are still trying to push a firehose's worth of data through a straw.. NFS won't solve that problem.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I know this is a Coke vs Pepsi question, but would switching to NFS help since it's not "block" storage?

NFS used for block storage purposes is absolutely block storage. It can actually be suboptimal block storage if you're not careful, because if you design the NFS share for other-than-block-storage, like, let's say RAIDZ or large ZFS block sizes, you can actually thoroughly screw yourself.
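
As a hedged sketch of what "designed for block storage" might look like on the ZFS side (names and the exact recordsize are illustrative, not a recommendation):
Code:
# Mirrors rather than RAIDZ for the pool itself, plus a smallish recordsize
# on the dataset holding the VMDKs; 16K is only a starting point and the
# right value depends on the guest workload.
zfs create -o recordsize=16K -o compression=lz4 -o atime=off tank/vmware-nfs
zfs set sync=standard tank/vmware-nfs   # vSphere NFS writes are already synchronous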

It's not all bad, though. A primary upside to NFS is that it doesn't have the iSCSI 5-second freakout. Also, some people don't like their VM files being locked up in VMFS, but that's more of a preference than a technical merit.

Properly done, NFS and iSCSI should not be horribly different, though some of the forum regulars like @kspare have had different experiences at large scale which bear consideration as well.
 