VMware, iSCSI, dropped connections and lockups

2twisty

Contributor
Joined
Mar 18, 2020
Messages
145
OK folks, here's the update, and it's not good:

We replaced the controllers and cables with LSI 9300-4e4i cards in IT mode. The array came right back up, and we started a large data copy. It still fails with dropped iSCSI connections, ctl_datamove aborted messages, and WRITE(x) errors with timeouts over 100s. Performance was greatly improved, in that it took a lot longer and much heavier "abuse" to trigger the failure, but the problem persists.

After googling a bit, I decided to change my transaction group timeout to 1s, since many people recommend that for iSCSI and block storage setups. Tested again; it still failed, with only a marginal improvement.

More googling. I decided to underprovision the SLOG to 10GB. I have 256GB of RAM, and the conventional thinking would be to size the SLOG at 64GB (1/8 of RAM * 2 transaction groups), but I saw other threads that talked about sizing the SLOG based on how fast the drives can actually write, which made a lot of sense to me:

  • I have 6 drives in three striped 2-way mirrors, so I have the write performance of 3 drives.
  • 12Gb/s SAS means a max transfer rate of 1.5GB/s per drive.
  • 1.5GB/s * 3 drives = 4.5GB/s
  • 4.5GB/s in a 1s TXG * 2 TXGs = 9GB, rounded up to 10GB (quick sanity check below)
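A quick sanity check of that math (this assumes the drives could actually sustain their full 12Gb/s link speed, which in practice they won't):

Code:
# 3 drives * (12Gb/s / 8 bits per byte) GB/s * 1s TXG * 2 TXGs
awk 'BEGIN { printf "%.0f GB\n", 3 * (12 / 8) * 1 * 2 }'   # -> 9 GB, rounded up to 10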

Tested again. Marginal improvement, still fails.

So I thought, "OK, 12Gb/s is the MAX transfer rate; these drives aren't likely to hit that in actual write speed," so I dropped the SLOG to 5GB.

At this point, I thought I had it fixed. I beat on the array with a large file copy and lots of IO (read and write) for about 40 minutes and it looked good..... So I left my file copy running (I really did need to move that data) and headed home, feeling cautiously optimistic.

When I got home and checked on it -- it had failed again.

I was out of time in my maintenance window to try anything else, and frankly, I was out of ideas.

Management is ready to chuck TrueNAS, and I really don't want to do that. What do I need to do next to track this down? Would moving to NFS from iSCSI change anything (I'm thinking no)?

Are there any logs / config files / other info I can post here to help you guys help me figure this out?

Thanks again for all your help. I really want to make this work.
 

kspare

Guru
Joined
Feb 19, 2015
Messages
507
Try NFS? I had nothing but problems with VMware and iSCSI. NFS is rock solid.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
What do I need to do next to track this down? Would moving to NFS from iSCSI change anything (I'm thinking no)?

Are there any logs / config files / other info I can post here to help you guys help me figure this out?

Pulling in an earlier response because I didn't see if you responded to this with details. Make that script and see if you're choking your pool. Have a couple of SSH sessions up side-by-side, one running the dtrace script and another showing gstat info. Your network is 4x10Gb, which is enough to put the hurt on an all-flash pool, let alone six spinning disks.

"TL;DR - what do?"

Let's verify that you're choking the pool by swamping transactions. Open an SSH session, use vi or nano to create a file called dirty.d and paste this text in there. You could also do it on a client system and then copy it to your pool, but you'll then need to get to that same pool directory.

Code:
txg-syncing
{
        this->dp = (dsl_pool_t *)arg0;
}

txg-syncing
/this->dp->dp_spa->spa_name == $$1/
{
        printf("%4dMB of %4dMB used", this->dp->dp_dirty_total / 1024 / 1024,
            `zfs_dirty_data_max / 1024 / 1024);
}


Then from a shell run dtrace -s dirty.d YourPool and wait - e.g. dtrace -s dirty.d Tier2.

You'll see a bunch of lines that look like the following:

Code:
dtrace: script 'dirty.d' matched 2 probes
CPU     ID                    FUNCTION:NAME
  4  56342                 none:txg-syncing   62MB of 4096MB used
  4  56342                 none:txg-syncing   64MB of 4096MB used
  5  56342                 none:txg-syncing   64MB of 4096MB used


Start up a big copy and watch the numbers in the "X MB of Y MB used" area. If you're regularly using more than 60% of the dirty data maximum, you're getting write-throttled. What I expect, based on your "stalls," is that X will rapidly approach Y and hang out roughly a few dozen MB below it, or perhaps even crash violently right into the limit. Open a second SSH window and run gstat -dp - look at the ops/s column as well as the write-focused ones (w/s, kBps, ms/w) for the HDDs.
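If the full gstat device list is too noisy, you can narrow it down; the refresh interval and the filter regex here are just examples, so adjust the pattern to match your actual da numbers:

Code:
# 1-second refresh, physical providers only (-p), delete stats included (-d),
# filtered to whole disks named daN (adjust the regex for your system)
gstat -dp -I 1s -f '^da[0-9]+$'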

My prediction is that your network (at 10Gbps or higher) is able to blast 4GB down the pipe in an awful hurry, while your drives can take maybe 150MB/s of sustained writes given fragmentation and other demands.

Note that the SLOG partition sizing doesn't impact the amount of dirty data ZFS will accept; those are controlled by separate tunables. You'll have to fiddle with those on their own, but let's not touch them until/unless we suspect they will change things; let's first see the "out of the box" numbers, both during regular traffic and during heavier use.
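If you just want to read (not change) the relevant knobs, something like this will print them on TrueNAS CORE; these are the usual FreeBSD sysctl names, but verify them on your release before relying on them:

Code:
# Read-only look at the dirty data / write throttle tunables
sysctl vfs.zfs.dirty_data_max
sysctl vfs.zfs.delay_min_dirty_percent
sysctl vfs.zfs.txg.timeout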

Try NFS? I had nothing but problems with VMware and iSCSI. NFS is rock solid.

At this stage it's worth considering (since you saw major benefits in your environment), but it's a non-trivial move logistically. At least with 4x10Gb connections you can peel two of them off and keep redundancy across both IP storage protocols. I'm still thinking that 40Gbps is way too much for that pool to handle; I've got an 8-spinner setup that isn't even close to saturating a single 10GbE link when doing VM work.
 
Last edited:

2twisty

Contributor
Joined
Mar 18, 2020
Messages
145
Pulling in an earlier response because I didn't see if you responded to this with details. Make that script and see if you're choking your pool. Have a couple of SSH sessions up side-by-side, one running the dtrace script and another showing gstat info. Your network is 4x10Gb, which is enough to put the hurt on an all-flash pool, let alone six spinning disks.

Will running this dirty.d script cause any data to be written, or is it just a monitor? I can't run anything that might trigger the issue during business hours / normal operation.

I will have to get clearance from management to deliberately trigger the event, and they may not want me to do that until the weekend when there is plenty of recovery time.

As for tweaking tunables, they have already been tweaked, and again, I can't make changes like that (setting the TXG timeout back to 5s and re-expanding the SLOG to the full 1TB) until the weekend.
 

2twisty

Contributor
Joined
Mar 18, 2020
Messages
145
I am running gstat -dp now (without the dtrace running) and I am seeing ops/s between 50 and 89, w/s of 40-70, kBps of 300-3000 (with a few spikes to 9000), and ms/w of 0.2 to 0.8 (with spikes up to 1.5) on the hard drives.

Is that normal? If not, what should I be seeing during "normal" operation without the dtrace?
 

2twisty

Contributor
Joined
Mar 18, 2020
Messages
145
And the numbers for the SLOG device are:

ops/s: 20-80 with spikes to 150 or more
w/s: 20-120 with spikes to 150
kBps: 400-4000 with spikes to 10000
ms/w: 0.1-0.4 with spikes to 0.8
 

2twisty

Contributor
Joined
Mar 18, 2020
Messages
145
So, I YOLO'd the dtrace, figuring it was safe. In normal operation I am seeing very low numbers.

Code:
14  74639                 none:txg-syncing   10MB of 4096MB used
31  74639                 none:txg-syncing    2MB of 4096MB used
  6  74639                 none:txg-syncing    0MB of 4096MB used
25  74639                 none:txg-syncing    3MB of 4096MB used
29  74639                 none:txg-syncing    2MB of 4096MB used
25  74639                 none:txg-syncing    1MB of 4096MB used
  5  74639                 none:txg-syncing    9MB of 4096MB used
27  74639                 none:txg-syncing    1MB of 4096MB used
20  74639                 none:txg-syncing    3MB of 4096MB used
20  74639                 none:txg-syncing   11MB of 4096MB used
  0  74639                 none:txg-syncing    2MB of 4096MB used
  2  74639                 none:txg-syncing    1MB of 4096MB used
27  74639                 none:txg-syncing    2MB of 4096MB used
25  74639                 none:txg-syncing    4MB of 4096MB used
18  74639                 none:txg-syncing    7MB of 4096MB used
14  74639                 none:txg-syncing    2MB of 4096MB used
  0  74639                 none:txg-syncing    2MB of 4096MB used
  7  74639                 none:txg-syncing    3MB of 4096MB used
24  74639                 none:txg-syncing    2MB of 4096MB used
12  74639                 none:txg-syncing   18MB of 4096MB used
21  74639                 none:txg-syncing    7MB of 4096MB used
27  74639                 none:txg-syncing    5MB of 4096MB used
26  74639                 none:txg-syncing    6MB of 4096MB used
30  74639                 none:txg-syncing    3MB of 4096MB used
28  74639                 none:txg-syncing    4MB of 4096MB used
21  74639                 none:txg-syncing    3MB of 4096MB used


I will likely have to wait for the weekend to deliberately trigger this, since it can take up to an hour or two sometimes for the system to settle back down to normal.
 
Last edited:

hescominsoon

Patron
Joined
Jul 27, 2016
Messages
449
At 40 gigabits you are most likely overwhelming the SLOG and most assuredly the spinners. Did you try going from the 40 gigabits combined down to 20 gigabits combined? The disks simply cannot keep up. As the previous poster mentioned, he has an 8-spinner setup that doesn't come close to saturating even a single 10-gig connection, much less 4. Frankly, I would go with mirrored vdevs instead of a RAIDZ array; at that point, with the SLOG, 20 gigabit MIGHT be doable. 40 gig, as noted above, is a stretch even for all flash.
 

2twisty

Contributor
Joined
Mar 18, 2020
Messages
145
ALSO: I am having some checksum errors on one of my drives that I don't think are related to this, since this problem has been present since we built the system a year ago, and the checksum errors only started about 2 months ago.

But, in the interest of people having the full scoop on what's going on with our system, please see this thread: Checksum errors on pool drive causing "unhealthy" state
 

2twisty

Contributor
Joined
Mar 18, 2020
Messages
145
Frankly, I would go with mirrored vdevs instead of a RAIDZ array.

Both my flash and disk pools are striped mirrors, three 2-way mirror vdevs each (~RAID10):

Code:
 pool: Tier1
 state: ONLINE
  scan: scrub repaired 0B in 00:12:42 with 0 errors on Sun Jun  6 00:12:43 2021
config:

        NAME                                            STATE     READ WRITE CKSUM
        Tier1                                           ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/a7f2b211-20e1-11eb-b324-6805cac4f4f6  ONLINE       0     0     0
            gptid/a8183a88-20e1-11eb-b324-6805cac4f4f6  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            gptid/a7d66da9-20e1-11eb-b324-6805cac4f4f6  ONLINE       0     0     0
            gptid/a8511702-20e1-11eb-b324-6805cac4f4f6  ONLINE       0     0     0
          mirror-2                                      ONLINE       0     0     0
            gptid/a81f169f-20e1-11eb-b324-6805cac4f4f6  ONLINE       0     0     0
            gptid/a8130a5b-20e1-11eb-b324-6805cac4f4f6  ONLINE       0     0     0

errors: No known data errors

  pool: Tier2
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 2.09T in 1 days 08:19:05 with 0 errors on Mon Jul 19 17:07:56 2021
config:

        NAME                                            STATE     READ WRITE CKSUM
        Tier2                                           ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/2342098f-0b57-11eb-8e08-6805cac4f4f6  ONLINE       0     0     0
            gptid/235b6319-0b57-11eb-8e08-6805cac4f4f6  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            gptid/239e11a4-0b57-11eb-8e08-6805cac4f4f6  ONLINE       0     0     0
            gptid/23a6691a-0b57-11eb-8e08-6805cac4f4f6  ONLINE       0     0     0
          mirror-2                                      ONLINE       0     0     0
            gptid/2406d819-0b57-11eb-8e08-6805cac4f4f6  ONLINE       0     0     0
            da5                                         ONLINE       0     0    15
        logs
          gpt/Tier2-SLOG                                ONLINE       0     0     0
        cache
          gptid/211c6069-0b57-11eb-8e08-6805cac4f4f6    ONLINE       0     0     0

errors: No known data errors

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:24 with 0 errors on Sat Jul 17 03:45:24 2021
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            da0p2   ONLINE       0     0     0
            da4p2   ONLINE       0     0     0

errors: No known data errors
 

hescominsoon

Patron
Joined
Jul 27, 2016
Messages
449
ALSO: I am having some checksum errors on one of my drives that I don't think are related to this, since this problem has been present since we built the system a year ago, and the checksum errors only started about 2 months ago.

But, in the interest of people having the full scoop on what's going on with our system, please see this thread: Checksum errors on pool drive causing "unhealthy" state
Well, before anything else, you need to get that problem nailed down. ZFS will do everything it can to preserve data, and if a drive is erroring out, that's going to cause issues, especially with the loads you are asking it to handle.
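A quick look at that drive's error counters would be a good start; da5 is taken from your zpool status output above, and the exact attribute names in the output will differ between SATA and SAS drives:

Code:
# Pull the SMART/error data for the suspect drive (device name from zpool status; adjust as needed)
smartctl -a /dev/da5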
 

2twisty

Contributor
Joined
Mar 18, 2020
Messages
145
Is there a way to maintain my 40Gb of connections and have TrueNAS do the throttling based on disk performance? I figured that TrueNAS would tell VMware to slow the hell down if it couldn't keep up.

Or is that because it dumps to RAM and TN can accept the data at that rate, but not write it? Can we tell TN to slow down?

The reason we have multiple links is for redundancy. We have ESX set to Round Robin (each host has 2x 10Gb links to the switch). We tried turning Round Robin off and it didn't affect anything, but from what folks are saying, even a single 10Gb link from a host to TN would be enough to saturate it.
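(For reference, the current path selection policy can be confirmed from each ESXi host's shell with the command below; I'm just reading the config, not changing it:)

Code:
# On an ESXi host: list devices and their current path selection policy
# (look for "Path Selection Policy: VMW_PSP_RR" under each iSCSI device)
esxcli storage nmp device list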

Sadly, our VMware license does not give us the throttling options in iSCSI (gotta love VMware's licensing model), so any throttling will need to be done on the TN side.

Ideas?
 

2twisty

Contributor
Joined
Mar 18, 2020
Messages
145
Well, before anything else, you need to get that problem nailed down. ZFS will do everything it can to preserve data, and if a drive is erroring out, that's going to cause issues, especially with the loads you are asking it to handle.

Agreed; however, the checksum issue is 2 months old and the issue in this thread is a year old. They didn't start at the same time.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
As you've already found out, the dtrace script is just monitoring; it doesn't add any writes of its own.

Based on gstat your disks still seem somewhat busy (40-70 writes per second) when the network is "doing nothing" so to speak - this could be the DDT getting pruned down. Does zpool status -D show reductions compared to your previous stats?

Code:
Tier1 (SSD) dedup: DDT entries 34188575, size 829B on disk, 267B in core
Tier2 (HDD) dedup: DDT entries 318453153, size 685B on disk, 221B in core


Is there a way to maintain my 40Gb of connections and have TrueNAS do the throttling based on disk performance? I figured that TrueNAS would tell VMware to slow the hell down if it couldn't keep up.

TrueNAS is already doing the throttling, in a sense: once you get past 60% of the dirty data limit, it applies a gradually increasing delay to incoming writes. The closer to 100% you get, the longer that delay is, and if you actually reach "full" it won't accept more until there's free space in that dirty data area. I discussed it a bit in this thread, where the workload is SMB, and you can see the "throttle" come into action there in the reduction of copy speed over time.
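If you want a feel for how sharply that delay ramps up, the OpenZFS sources describe it as roughly delay = zfs_delay_scale * (dirty - min) / (max - dirty). Here's a quick sketch using the stock defaults and your 4096MB dirty data limit; the numbers are illustrative, not measured:

Code:
# Approximate per-transaction delay as dirty data climbs past the 60% threshold
awk 'BEGIN {
  max = 4096;            # zfs_dirty_data_max in MB (matches the 4096MB shown by dirty.d)
  min = max * 60 / 100;  # zfs_delay_min_dirty_percent default of 60%
  scale = 500000;        # zfs_delay_scale default, in nanoseconds
  for (pct = 70; pct <= 95; pct += 5) {
    dirty = max * pct / 100;
    printf "%d%% dirty -> ~%.0f us of delay\n", pct, scale * (dirty - min) / (max - dirty) / 1000;
  }
}'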


The challenge is that you'll need to be dtrace-ing when the problem is happening, or at least when you know there will be some increased workload on the pool. Grab a screenshot and/or tee the output to a file for later reading as well.
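Something as simple as this will keep a copy on the pool host while you watch it live (the file name and path are arbitrary):

Code:
# Watch the dirty.d output and save it for later review
dtrace -s dirty.d Tier2 | tee /var/tmp/dirty.log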
 

2twisty

Contributor
Joined
Mar 18, 2020
Messages
145
Been reading this:

Tuning the OpenZFS write throttle

However, it seems that we are pushing data INTO the SAN faster than it can write it -- if we adjust these parameters, will that tell ESX to "slow the hell down," or will ESX keep pushing data at the max link speed? (With round robin, that's 20Gb/s per host, and we have 3 hosts, though only 2 are production.)

So I am going to test whether dropping the SAN down to a single 10Gb link improves this. However, that's not acceptable long term, since it leaves us without redundancy. I am hoping that our 10Gb switches may be able to do some traffic shaping, but that's for our network guy to deal with -- I'm the server/Linux guy at our operation.
 

2twisty

Contributor
Joined
Mar 18, 2020
Messages
145
As you've already found out, the dtrace script is just monitoring; it doesn't add any writes of its own.

Based on gstat your disks still seem somewhat busy (40-70 writes per second) when the network is "doing nothing" so to speak - this could be the DDT getting pruned down. Does zpool status -D show reductions compared to your previous stats?

Code:
Tier1 (SSD) dedup: DDT entries 34188575, size 829B on disk, 267B in core
Tier2 (HDD) dedup: DDT entries 318453153, size 685B on disk, 221B in core


As of right now:
Code:
Tier1: dedup: DDT entries 9419092, size 2.94K on disk, 972B in core
Tier2: dedup: DDT entries 312735882, size 700B on disk, 226B in core
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110

That's Adam Leventhal's page that I got the dtrace script from, and it's extremely valuable reading for anyone wanting to learn more about the throttle behaviour and to appreciate how much better the "new" throttle is than the "old" one. There are also some dtrace scripts there that will identify how long a txg takes to commit (worth running here as well), some one-liners, and histogram generation for your write behaviour. It's full of valuable information, although OpenZFS 2.0 changed a few of the defaults, so double-check that they haven't changed at the URL below.
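For convenience, the txg duration script from that article looks roughly like this; I'm reconstructing it from memory, so grab the exact version from the article, and note that it assumes the txg-synced probe is available on your build:

Code:
/* duration.d - reconstructed from the linked article; verify probe names on your build */
txg-syncing
/((dsl_pool_t *)arg0)->dp_spa->spa_name == $$1/
{
        start = timestamp;
}

txg-synced
/start && ((dsl_pool_t *)arg0)->dp_spa->spa_name == $$1/
{
        this->d = timestamp - start;
        printf("sync took %d.%02d seconds", this->d / 1000000000,
            this->d / 10000000 % 100);
}


Run it the same way as dirty.d, e.g. dtrace -s duration.d Tier2.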

 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
As of right now:
Code:
Tier1: dedup: DDT entries 9419092, size 2.94K on disk, 972B in core
Tier2: dedup: DDT entries 312735882, size 700B on disk, 226B in core

Tier1's DDT got chopped pretty handily (reduced to 27.5% of its original entries, hopefully soon to vanish), but Tier2 is still at 98% of its original size, so that's going to keep hurting you.
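For a sense of scale, if the whole remaining Tier2 table had to sit in RAM at the reported in-core size, the back-of-the-envelope math from your numbers above looks like this (actual ARC residency varies, so treat it as a rough upper bound):

Code:
# entries x in-core bytes per entry, from the zpool status -D output above
awk 'BEGIN { printf "%.0f GiB\n", 312735882 * 226 / (1024 * 1024 * 1024) }'   # ~66 GiB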

I know this sounds crazy, but is it possible to make a third pool (or even another system?) with more disks (and dedup off of course) and start svMotioning VMs there? It could be repurposed as a backup or replication target later.
 

2twisty

Contributor
Joined
Mar 18, 2020
Messages
145
Possibly. Our backup server also runs TN, but it's getting pretty full. I could drop a Linux box in somewhere with a big single disk, move things over one VM at a time, and then move them back to kill the dedupe.

Of course, during that move, the data would be at risk, but I'd have a fresh backup before I did that anyway.

Ideally, I could move ALL the data off and rebuild the pool, to also reduce fragmentation.
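(For reference, I can grab the current fragmentation and dedup ratio before and after any move straight from zpool list; the property list here is just the set I care about:)

Code:
zpool list -o name,size,allocated,free,fragmentation,capacity,dedupratio Tier2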

Sadly, several of our VMDKs are HUGE, so doing one VMDK at a time is also going to be challenging.

Hmm... maybe I could drop a drive into the VM host as local storage and svMotion a VM back and forth as well.
 