VMware, iSCSI, dropped connections and lockups

HoneyBadger · Jun 24, 2021

2twisty said:
The HBAs are not RAID cards. We have 2x HP H240 HBAs.

Unfortunately as you've found those aren't LSI - it doesn't mean they immediately fail, but as @jgreco mentioned, the ciss driver doesn't have the same billions of running hours and combined testing behind it. Swapping to LSI (HP H220 works) might be the quickest way to test if this is the trouble spot.

2twisty said:
All 8 NVMe SSDs are identical, and are installed in M.2 adapter cards (2x per card) Asus HyperM.2 PCIe Gen3

(When I went to fetch the model numbers from TN, it was "sleeping" again and I can't access the GUI. I believe they are all Sabrent)

6 of these are striped and mirrored into a pool named Tier1, the other 2 are userd for SLOG and Cache on Tier2

Exact model numbers are probably going to be important here, but 99% chance they aren't good candidates for SLOG. The use of sync=standard means you aren't actually leveraging the SLOG for iSCSI and everything is just whizzing by it. If you enable sync=always (which you should, at some point, for safety) you should expect to take a major hit in performance vs. just being cached in RAM. You may need to invest in an Optane or other "SLOG-fast" NVMe device.

2twisty said:
6x HGST HE8 8tb SAS12 drives in striped mirror in pool Tier2

Should be fine, capacity/fragmentation might be more of a concern.

2twisty said:
Standard, and atime is on

atime not really relevant in a block-only world but the sync=standard means that you've got the opportunity to fill your outstanding transactions up way faster than your pool can handle.

2twisty said:
I will have to ask our network guy to answer this part in full detail . My understanding is this: iSCSI is on separate subnet/vlan, and on separate 10G switches.

It's highly unlikely that HDDs will be able to sustain ingest on a single 10Gbps link let alone four of them.

2twisty said:
That's what we think, as well. However, we don't know what to do about it.

I have a TN server at home in my homelab, and last night I tried to replicate this failure and was unable to. I can write all day long at almost 200MB/s to and from the TN server over the iSCSI connections from my VMware host and VMs. I even tried fully allocating my pool to see if running it at 95% utilization had an effect. Nope. Happy camper.

Home specs? The 200MB/s number makes me think you've got a pair of 1Gbe lines in MPIO (same as one of my machines) and that's generally at the point where an 8-drive HDD mirror pool can comfortable hold its own almost indefinitely.

2twisty said:
Which leads me to ZFS Fragmentation, which is probably a separate post after I do some research...but in short: if the array has been over 50% utilization for a while and the fragmentation gets bad, I assume that it could cause (or contribute) to this problem. Ans since there is no REAL way to defrag ZFS, just deleting or moving data to get back to 50% won't necessarily help since the data is already fragmented. Right? Be on the lookout for a more complete post about this once I read and understand more and can formulate my thoughts....

Fragmentation shouldn't kill your SSD pool obviously but it might be an aggravating factor in your HDD pool being too slow to cope with the ingest.

"TL;DR - what do?"

Let's verify that you're choking the pool by swamping transactions. Open an SSH session, use vi or nano to create a file called dirty.d and paste this text in there. You could also do it on a client system and then copy it to your pool, but you'll then need to get to that same pool directory.

Code:

txg-syncing
{
        this->dp = (dsl_pool_t *)arg0;
}

txg-syncing
/this->dp->dp_spa->spa_name == $$1/
{
        printf("%4dMB of %4dMB used", this->dp->dp_dirty_total / 1024 / 1024,
            `zfs_dirty_data_max / 1024 / 1024);
}

Then from a shell do dtrace -s dirty.d YourPool and wait - eg dtrace -s dirty.d Tier2.

You'll see a bunch of lines that look like the following:

Code:

dtrace: script 'dirty.d' matched 2 probes
CPU     ID                    FUNCTION:NAME
  4  56342                 none:txg-syncing   62MB of 4096MB used
  4  56342                 none:txg-syncing   64MB of 4096MB used
  5  56342                 none:txg-syncing   64MB of 4096MB used

Start up a big copy and watch the numbers in the "X MB of Y MB used" area. If you look like you're using more than 60% of your outstanding data, you're getting write-throttled. What I expect based on your "stalls" is that X will rapidly approach Y and hang out roughly a few dozen MB's below it or perhaps even crash violently right into the limit. Open a second SSH window and run gstat -dp - look at the columns representing ops/s as well as the write-focused ones w/s kBps ms/w for the HDDs.

Predictions is that your network (at 10Gbps or higher) is able to blast 4GBs down the pipe in an awful hurry, and your drives are able to take maybe 150MB/s sustained writes given fragmentation and other demands.

2twisty · Jun 24, 2021

OK -- so before I go throwing money at an HBA -- is there any way to absolutely PROVE that the HBA is the issue?

Also, Would this HBA (flashed to IT mode) be appropriate? We want the 4i4e variant so we can add external expansion later if needed:

https://www.newegg.com/lsi-9300-4i4e-sata-sas/p/N82E16816118221

Also, just in case it matters, we plan to run 2 of these HBAs (which we have now) so that one half of the mirrors are on one HBA and the other half are on the second to provide fault tolerance against an HBA failing.

jgreco · Jun 24, 2021

2twisty said:
OK -- so before I go throwing money at an HBA -- is there any way to absolutely PROVE that the HBA is the issue?

"PROVE"? No. But you can use the forum search to find out that other users who have shown up with the H240 or other CISS driver based cards have had horrible experiences, swapped to LSI IT, and then been fine. I wasn't remembering that it was a CISS-based card or I might have been a bit nastier about the card's heritage ...

I just did a quick forum search on "CISS" and ... couldn't find anything happy-sounding.

I know, that's not super-scientific, it clearly isn't proof of anything, but sometimes you just gotta go with what works.

Also, Would this HBA (flashed to IT mode) be appropriate? We want the 4i4e variant so we can add external expansion later if needed:

https://www.newegg.com/lsi-9300-4i4e-sata-sas/p/N82E16816118221

Also, just in case it matters, we plan to run 2 of these HBAs (which we have now) so that one half of the mirrors are on one HBA and the other half are on the second to provide fault tolerance against an HBA failing.

In my experience, the choice of connectors is irrelevant. It is fine to use a 9300 card, and having drives split between controllers is fine as long as the topology is rational, which you seem to understand.

2twisty · Jun 24, 2021

HoneyBadger -- Thanks -- I will have to try that test this weekend when users won't squawk if I make the server mad.

As for my homelab, the VMWare server (12 core threadripper,256gb ram) has a dedicated 10GB link to the san (Ryzen 1600, 32GB ram) and those copy speeds quoted were from pulling data from the san and writing back to it, so those numbers are a bit skewed. I will have to try a copy from an external source to the san to get accurate numbers. For the purposes of trying to replicate the issue, it was a valid test, since we can make the SAN at the office fail by moving data between VMs that have their backend storage on that same iscsi target.

As for the number of 10GB links from the SAN: we have multiples for fault tolerance. ESXi is set for MPIO, and those 4 links go into 2 separate switches, etc. The way its configured, we could lose one of the 10gb (dual sfp+) or one of the switches and still not lose connectivity or performance.

2twisty · Jun 24, 2021

Stux said:
Also, if you are using the same SSD type for your slog as your tier 1 drives they might not be suitable as a SLOG drive.

but I suspect the primary issue is the HBA.

Yes, both the SLOG and L2ARC drives on the Tier2 Disk pool are the same 1TB Sabrent NVMe drives that make up the Tier1 pool.

So, grabbing an Optane would be the more recommended approach for the SLOG? What size? Is using one of the 1TB drives for the L2ARC acceptable or do we need to swap that out, as well? If so, what size?

The SSDs and the PCIe cards we have them mounted in (Asus Hyper M.2) are PCIe Gen3, but the motherboard slots are gen4 -- at the time we built this, the Gen4 cards were completely unavailable. would going to a Gen4 card help with the optane enough to justify the cost?

HoneyBadger · Jun 24, 2021

I just saw your other thread here.

ZFS Fragmentation Performance issue mitigation?

I know this comes up all the time, and I am hoping that there is newer wisdom that what the older threads say about this..... My pools are all used for block storage. I have 1 NVMe-only pool that has 91% utilization, 50% free space frag my disk pool is 67% utilized with 68% free space frag and...

www.truenas.com

2twisty said:

You have non-zero values in the DEDUP column. If you're using deduplication, turn it off right now and start making plans to shuffle data around and eliminate it.

2twisty · Jun 24, 2021

Why is dedupe so bad? I know it uses a ton of RAM -- I have 256GB in that server for that reason.

HoneyBadger · Jun 24, 2021

2twisty said:
Why is dedupe so bad? I know it uses a ton of RAM -- I have 256GB in that server for that reason.

Short answer: "RAM helps the reads, you're bottlenecking on writes."

Long answer:

Bit of background in a gross oversimplification - ZFS handles deduplication by storing a hash of each record in a deduplication table (ddt). If two records are identical (all 4K, 8K, 16K or more) then it "bumps up the counter" for that hash value.

Those tables can get really big, and they need to be traversed when each potentially new record is written to a dataset with deduplication enabled. Hence the "you need gobs of RAM" - this lets that multi-GB table reside in memory and be looked through quickly.

But the part that bites you, and what people often overlook, is that table needs to persist on disk. Even if every record written is just "bump up the counter" on an existing record - every one of those counters needs to be updated, on the physical underlying disk. And if that underlying disk is a spinning platter, it's going to generate a flurry of very small (4K) I/O that ends up being effectively random. And spinning platters suck at delivering random I/O.

The TrueNAS 12/OpenZFS 2.0 "special vdev" class tries to mitigate this by letting you pull the metadata/ddt out to separate physical devices - SSDs, for their significantly better random I/O characteristics.

Disabled deduplication will probably solve a lot of the immediate "lockup" issues because it will no longer be updating those tables on every write. You'll still have the RAM footprint from the existing tables though, but with 256GB you're likely okay. Drop a zpool status -D and find the line talking about "DDT size" to paste here, if you can.

2twisty · Jun 24, 2021

Tier1 (SSD)

Code:

dedup: DDT entries 34188575, size 829B on disk, 267B in core

Tier2 (Disk)

Code:

dedup: DDT entries 318453153, size 685B on disk, 221B in core

HoneyBadger · Jun 24, 2021

Based on the "in core" numbers that's about 8.7GB of RAM for Tier1, and 65.5GB for Tier2 - 74.2GB of RAM total.

I believe all the TN12 revisions are using the OpenZFS arc_meta_limit_percent threshold of "75% of ARC can be metadata" so it's not being ejected to disk, but that's still an awful lot of memory getting burnt there, and it's going to take a non-trivial amount of time to compare hashes against it all.

I'd suggest turning it off temporarily and seeing if the issue is resolved. I'm entirely expecting it to be, at which point you need to plan for expanding the underlying capacity to account for deduplication being removed entirely (adding a new ZVOL/datastore with dedup off and svMotioning everything there will do it, but just be judicious about how much you move at a time. Moving data out of a volume with dedup one will still cause updates/deletes to the DDT.)

HoneyBadger · Jun 25, 2021

2twisty said:
So, grabbing an Optane would be the more recommended approach for the SLOG? What size? Is using one of the 1TB drives for the L2ARC acceptable or do we need to swap that out, as well? If so, what size?

Bumping this to reply to this part:

Yes, an Optane card is probably the best available option here for SLOG. Assuming you want to continue leveraging the multiple M.2 slots and can't dedicate a PCIe slot, the best option is the Optane DC P4801X 200GB card - the 100GB isn't much lower down the performance tier though if cost is a concern. Your carrier board can support the M.2 22110 (110mm) cards just fine.

Getting a proper SLOG and enabling it with sync=always should be on the same level of priority after sorting out the deduplication issue, as right now your system has a potential for data loss if you suffer an unexpected hardware fault or power outage. VMware has sent the data over iSCSI and thinks it's safe, but your TrueNAS machine is still holding it in RAM only. An internal PSU or HBA failure would mean that data just goes "poof."

1TB for L2ARC is fine especially with 256GB of RAM.

2twisty · Jul 2, 2021

We have made some adjustments to settings on this server, and plan to conduct a test (hopefully this weekend) to verify that the problem continues. I believe that the problem will present itself, but we want to be sure before we move forward.

If the problem persists, we will be ordering a couple 9300-4e4i cards and an Optane for SLOG.

Thank you all for your guidance!

HoneyBadger · Jul 2, 2021

2twisty said:
We have made some adjustments to settings on this server

I do have to ask the direct question - "was disabling deduplication one of those adjustments?" :)

The Optane P4801X 100G is fairly inexpensive now (under USD$200) and could be a good SLOG option.

2twisty · Jul 16, 2021

Yes, we turned off dedupe and set sync to always. The problem persists.

So, we are installing the LSI 9300s tomorrow. I pray this resolves it.

2twisty · Jul 16, 2021

Boss wants to hold off on replacing the SLOG with the Optane for now. I will probably just underprovision the existing 1TB drive so that it has so much unused space that it won't fail for a long time.

If I provision the SLOG to 64GB, I'll have TONS left over for wear leveling.

HoneyBadger · Jul 16, 2021

2twisty said:
Yes, we turned off dedupe and set sync to always. The problem persists.

So, we are installing the LSI 9300s tomorrow. I pray this resolves it.

Dedup off is good, and should mean that net-new writes don't have to be processed - updates to existing records though (chunks of that VMFS) are still going to cause activity to the HDD-based DDT for a bit. The issue should hopefully lessen over time but there's 65GB of DDT on that second pool that needs to get cleaned up. But it could still be the ciss driver choking when you try to flush a big group.

Dedup backed by spinning disk is always a bad situation to be in.

sync=always likely means you are bottlenecking on your SLOG device. If you look at something like gstat -dp do you see your NVMe SLOG device being extremely busy?

2twisty said:
Boss wants to hold off on replacing the SLOG with the Optane for now. I will probably just underprovision the existing 1TB drive so that it has so much unused space that it won't fail for a long time.

If I provision the SLOG to 64GB, I'll have TONS left over for wear leveling.

As long as the current SLOG can provide the throughput needed, the negative impact would be on the endurance. The Sabrent 1TB (assuming consumer) NVMe will not handle the abuse the same way Optane will (wear leveling will help here, but enterprise Optane drives measure their lifespan in petabytes) so just keep an eye on the SMART stats especially w.r.t endurance and total writes.

Please keep us posted.

hescominsoon · Jul 16, 2021

2twisty said:
Boss wants to hold off on replacing the SLOG with the Optane for now. I will probably just underprovision the existing 1TB drive so that it has so much unused space that it won't fail for a long time.

If I provision the SLOG to 64GB, I'll have TONS left over for wear leveling.

with how cheap that optane is..forget underprovisioning..just get the optane and be done with it(use the right hardware for the job). it's on amazon for less than 300 bucks right now.

jgreco · Jul 17, 2021

2twisty said:
Boss wants to hold off on replacing the SLOG with the Optane for now. I will probably just underprovision the existing 1TB drive so that it has so much unused space that it won't fail for a long time.

If I provision the SLOG to 64GB, I'll have TONS left over for wear leveling.

If your SLOG device is a consumer-grade SSD without power loss protection, you are basically just kidding yourself here, you're better off removing the SLOG and disabling sync writes entirely. The purpose of SLOG is to guarantee sync writes, and you need Optane or power loss protection or something like that.

HoneyBadger · Jul 17, 2021

If I can be pedantic here; SLOG only technically requires "PLP for data at rest" to satisfy the ZFS requirements. But it likely won't be fast enough to serve as a viable SLOG unless it also carries "PLP for data in flight" - most consumer drives do the former, enterprise the latter.

What you really want to avoid is any drive that lies about its cache. Generally speaking that hasn't been an issue since 2015ish. I do have an OCZ device that is guilty of lying though. Needless to say it's not used as an SLOG, it's in a Windows machine (and one that I care very little about.)

Details get expounded upon in this thread:

SLOG and power loss protection

Hi guys, Looking for clarity on something... So I understand that in the case of synchronous writes, the SLOG (ZIL) keeps an on-disk copy of the data that will be written to the pool in the next transaction group. As opposed to asynchronous, where between the 5s writes to the pool, ZFS keeps...

www.truenas.com

TL;DR most consumer drives aren't fast enough because they lack in-flight PLP, but they aren't always inherently unsafe, Optane is the current king of the ring for generally-available solutions. If you want faster, you're playing with specialty NVRAM devices.

jgreco · Jul 17, 2021

HoneyBadger said:
If I can be pedantic here; SLOG only technically requires "PLP for data at rest" to satisfy the ZFS requirements.

If you're going to be pedantic, be fully pedantic. Don't halfarse it. The required guarantee is that upon confirmation of a write request, future reads are guaranteed to return that data.

Trying to create "tiers" of PLP is a losing game, because drive manufacturers change things, often without notice. An SSD is either designed to provide this guarantee, or it is not, regardless of whether it kinda-sorta-usually-does-the-rightish-thing, which can easily break at the next controller firmware update.

Important Announcement for the TrueNAS Community.

VMware, iSCSI, dropped connections and lockups

actually does care

Contributor

Resident Grinch

Contributor

Contributor

actually does care

Contributor

actually does care

Contributor

actually does care

actually does care

Contributor

actually does care

Contributor

Contributor

actually does care

Patron

Resident Grinch

actually does care

Resident Grinch

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "VMware, iSCSI, dropped connections and lockups"

Similar threads