The HBAs are not RAID cards. We have 2x HP H240 HBAs.
Unfortunately, as you've found, those aren't LSI. That doesn't mean they immediately fail, but as @jgreco mentioned, the ciss driver doesn't have the same billions of running hours and combined testing behind it. Swapping to LSI (an HP H220 works) might be the quickest way to test whether this is the trouble spot.
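If you want to confirm the controllers really are attached via ciss before swapping any hardware, a quick check from a shell (assuming TrueNAS CORE / FreeBSD; device names will vary) is:
Code:
# Confirm the HP controllers were claimed by the ciss driver at boot
dmesg | grep -i ciss

# List the disks and controllers CAM can see
camcontrol devlist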
All 8 NVMe SSDs are identical, and are installed in M.2 adapter cards (2x per card, Asus Hyper M.2 PCIe Gen3).
(When I went to fetch the model numbers from TN, it was "sleeping" again and I can't access the GUI. I believe they are all Sabrent.)
6 of these are striped and mirrored into a pool named Tier1; the other 2 are used for SLOG and Cache on Tier2.
Exact model numbers are probably going to be important here, but there's a 99% chance they aren't good candidates for SLOG. The use of sync=standard means you aren't actually leveraging the SLOG for iSCSI; everything is just whizzing by it. If you enable sync=always (which you should, at some point, for safety) you should expect to take a major hit in performance vs. just being cached in RAM. You may need to invest in an Optane or other "SLOG-fast" NVMe device.
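Flipping that on later is a one-liner; a minimal sketch, assuming your iSCSI extent is backed by a zvol named Tier2/iscsi-lun0 (hypothetical name, substitute your own):
Code:
# Force writes to hit stable storage (the SLOG, if present) before being acknowledged
zfs set sync=always Tier2/iscsi-lun0

# Confirm the new value
zfs get sync Tier2/iscsi-lun0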
6x HGST HE8 8TB SAS12 drives in striped mirror in pool Tier2
Should be fine; capacity/fragmentation might be more of a concern.
Standard, and atime is on
atime isn't really relevant in a block-only world, but the sync=standard means that you've got the opportunity to fill your outstanding transactions up way faster than your pool can handle.
I will have to ask our network guy to answer this part in full detail. My understanding is this: iSCSI is on a separate subnet/VLAN, and on separate 10G switches.
It's highly unlikely that HDDs will be able to sustain ingest from a single 10Gbps link, let alone four of them.
That's what we think, as well. However, we don't know what to do about it.
I have a TN server at home in my homelab, and last night I tried to replicate this failure and was unable to. I can write all day long at almost 200MB/s to and from the TN server over the iSCSI connections from my VMware host and VMs. I even tried fully allocating my pool to see if running it at 95% utilization had an effect. Nope. Happy camper.
Home specs? The 200MB/s number makes me think you've got a pair of 1GbE lines in MPIO (same as one of my machines), and that's generally at the point where an 8-drive HDD mirror pool can comfortably hold its own almost indefinitely.
Which leads me to ZFS fragmentation, which is probably a separate post after I do some research... but in short: if the array has been over 50% utilization for a while and the fragmentation gets bad, I assume that it could cause (or contribute to) this problem. And since there is no REAL way to defrag ZFS, just deleting or moving data to get back under 50% won't necessarily help, since the data is already fragmented. Right? Be on the lookout for a more complete post about this once I read and understand more and can formulate my thoughts...
Fragmentation shouldn't kill your SSD pool, obviously, but it might be an aggravating factor in your HDD pool being too slow to cope with the ingest.
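A quick way to check where the pool stands; note that FRAG reports free-space fragmentation, not file fragmentation, but on a busy HDD pool a high value alongside high CAP is the warning sign:
Code:
# Capacity over ~50% plus high fragmentation together spell trouble for HDD write speed
zpool list -o name,size,allocated,free,fragmentation,capacity Tier2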
"TL;DR - what do?"
Let's verify that you're choking the pool by swamping transactions. Open an SSH session and use vi or nano to create a file called dirty.d, then paste the text below into it. You could also create it on a client system and copy it to your pool, but you'll then need to cd to that same pool directory.
Code:
/* Fires each time a transaction group (txg) begins syncing */
txg-syncing
{
        this->dp = (dsl_pool_t *)arg0;
}

/* Only report for the pool named in the first script argument */
txg-syncing
/this->dp->dp_spa->spa_name == $$1/
{
        printf("%4dMB of %4dMB used", this->dp->dp_dirty_total / 1024 / 1024,
            `zfs_dirty_data_max / 1024 / 1024);
}
Then from a shell, run dtrace -s dirty.d YourPool and wait; e.g. dtrace -s dirty.d Tier2. You'll see a bunch of lines that look like the following:
Code:
dtrace: script 'dirty.d' matched 2 probes
CPU     ID                    FUNCTION:NAME
  4  56342                 none:txg-syncing   62MB of 4096MB used
  4  56342                 none:txg-syncing   64MB of 4096MB used
  5  56342                 none:txg-syncing   64MB of 4096MB used
Start up a big copy and watch the numbers in the "X MB of Y MB used" area. If it looks like you're using more than 60% of your dirty data maximum, you're getting write-throttled. What I expect, based on your "stalls", is that X will rapidly approach Y and hang out roughly a few dozen MB below it, or perhaps even crash violently right into the limit. Open a second SSH window and run gstat -dp, and look at the columns representing ops/s as well as the write-focused ones (w/s, kBps, ms/w) for the HDDs. My prediction is that your network (at 10Gbps or higher) is able to blast 4GB down the pipe in an awful hurry, while your drives can take maybe 150MB/s sustained writes given fragmentation and other demands.
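If gstat's layout isn't familiar, zpool iostat shows the same pressure broken out per vdev; a sketch, assuming the pool in question is Tier2:
Code:
# Per-vdev read/write IOPS and bandwidth, refreshed every second
zpool iostat -v Tier2 1

Running the numbers: 10Gbps is roughly 1.2GB/s of ingest, so a 4GB dirty-data ceiling draining at ~150MB/s fills in only a few seconds, and then everything stalls until the HDDs catch up.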