Large-File Transfers -- Peaks to 600MB/s (5-10s) ... drops to 0KB/s & peaks again after pausing.

no_connection · Aug 31, 2020

TrumanHW said:
Does a scrub show you individual drive performance...?

It should show up in the drive graph the same as a read. I would expect read speed to be pretty good, but keep in mind that a scrub is kind of chronological, not a sequential read of the drive. So many small files will be limited by IOPS, but any large file should spike to a good portion of the drive's read speed.

c77dk · Aug 31, 2020

Just one question after reading another thread: are you using deduplication ?

TrumanHW · Sep 1, 2020

c77dk said:
Just one question after reading another thread: are you using deduplication ?

YES! Is that the cause..? OMG!!

Do I disable that feature..?

c77dk · Sep 1, 2020

TrumanHW said:
YES! Is that the cause..? OMG!!

Do I disable that feature..?

@Stilez is the one to diagnose weird things like this

The work done to dig into problems like this one and finding solutions is impressive:
This sounds so much like what you experience

Baffling Performance issues with large zfs pool

ARC isn't hitting its max size, so unless you've recently rebooted there's probably downward pressure on it. Cache hits are almost entirely (98.74%) metadata which makes me think it's constantly looking up the DDT. You've got about 6.5G of metadata in your ARC, and another 6.7G of ghost...

www.ixsystems.com

From what I've understood there's two solutions: disable deduplication, or get some insanely fast SSDs and hop on TrueNAS CORE -BETA train with special vdevs
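For anyone landing here later, a minimal sketch of how you might check which datasets have dedup switched on. It only parses text: it assumes you have already captured the output of `zfs get -r -H -o name,value dedup <pool>` (tab-separated, no header), and the pool and dataset names in the sample are made up for illustration.

```python
def datasets_with_dedup(zfs_get_output: str) -> list[str]:
    """Return dataset names whose dedup property is anything other than 'off'.

    Expects the tab-separated output of:
        zfs get -r -H -o name,value dedup <pool>
    """
    enabled = []
    for line in zfs_get_output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, value = line.split("\t")[:2]
        # Values such as 'on', 'verify' or 'sha256,verify' all mean dedup is active.
        if value.strip() != "off":
            enabled.append(name)
    return enabled


# Hypothetical sample output; 'tank' and its datasets are invented names.
sample = "tank\toff\ntank/media\ton\ntank/backups\tsha256,verify\n"
print(datasets_with_dedup(sample))
```

Turning it off is then `zfs set dedup=off <dataset>` for each one. Bear in mind that this only affects new writes: blocks that were already deduplicated stay in the dedup table until they are rewritten or deleted.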

TrumanHW · Sep 1, 2020

YOU FUCKING FIXED IT! YOU DID!

What's your paypal..?

TrumanHW · Sep 1, 2020

You're the freaking MAN!!! HOLY SHIZ!!!

Do you know I've been trying to figure this shiz out for THREE EFFING YEARS !?
If you'll give me your paypal ... I'd like to thank you that way also.
Just for taking the time to mention that to me or I might have missed it.
(same for @Stilez ... whom I will make the same offer of... I wish I were in a financial position to give even more.)
THANK YOU!!!!! Thank you! thank you!!

c77dk · Sep 1, 2020

TrumanHW said:
YOU ****ING FIXED IT! YOU DID!

Hehe, just found the post in the forum

This is one of the things I love about opensource - people helping and learning. I've never used deduplication, but now at least I've learned a little about it, and could help point you in the right direction.

c77dk · Sep 1, 2020

TrumanHW said:
If you'll give me your paypal ... I'd like to thank you that way also.
Just for taking the time to mention that to me or I might have missed it.
(same for @Stilez ... whom I will make the same offer of... I wish I were in a financial position to give even more.)
THANK YOU!!!!! Thank you! thank you!!

just as @Stilez , I'm happy just to help you solve this

If you really mean you want to give some money, please give a small donation to the OpenZFS project ( https://openzfs.org/wiki/Main_Page ), or the FreeBSD Foundation ( https://freebsdfoundation.org/donate/ ). Both projects are integral to FreeNAS/TrueNAS

Stilez · Sep 1, 2020

c77dk said:
Just one question after reading another thread: are you using deduplication ?

That was the ten thousand dollar question. And - what a surprise. 10G networking too.

(A 1G link provides a natural 120MB/sec throttle, which at least greatly slows the buildup of the issue, maybe enough that it never arises for many intermittent workloads, and may also be just slow enough for the server to keep up on an ongoing basis anyway, depending on the server hardware. It doesn't stand a chance with anything much faster, as the pipes refill so quickly. See the other linked post for details if you just found this thread. Also be aware it will happen in local file transfers within the server too, not just networked. But those don't disconnect, so there's less to fall over).
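The throttle figure is just link arithmetic. A rough sketch, where the 0.96 efficiency factor is my own assumption, picked only so that 1G lands on the ~120MB/sec quoted above; real protocol overhead varies.

```python
def link_mb_per_sec(gbit_per_sec: float, efficiency: float = 0.96) -> float:
    """Approximate usable throughput of a network link in MB/s.

    gbit_per_sec: nominal line rate (1 for 1G, 10 for 10G).
    efficiency:   assumed fraction left after protocol overhead (a guess).
    """
    raw_mb_per_sec = gbit_per_sec * 1000 / 8  # bits -> bytes
    return raw_mb_per_sec * efficiency


for rate in (1, 10):
    print(f"{rate}G link: about {link_mb_per_sec(rate):.0f} MB/s")
```

So a 10G client can push roughly ten times as much data at the pool per second as a 1G client, which is why the stall shows up so much sooner.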

As a taster of what's possible, after upgrading to special vdevs, I'm getting 300+ MB/sec across SMB, without issues, on 1 - 3 TB file writes, to my deduped pool, compared to stalling before. But the SSDs are pulling huge amounts of 4K mixed RW to do that, and the server still needs quite a lot of RAM (less so, because there's less penalty to pulling metadata+dedup data off Optane; it's as fast as the fastest L2ARC anyway, only battery-backed RAM is faster).... so it will all depend on your individual hardware and uses.

I'm using Optane for it, but it's expensive. At minimum, even good enthusiast SSDs (Samsung Pro NVMe) should stand a very good chance of handling it well enough to make it manageable. Pro because you need a good controller and a rock solid SSD for this demand, and you don't want your special vdev slowed down too badly when its internal cache is full and it falls back to the native speed of the controller+NVRAM. NVMe because I don't know what you'll get, but I get 4k random reads at 250k IOPS for long enough to point a camera at it, and the best SATA tops out around 100k. As of 2020, if you aren't sure, then get Samsung (not older PM and NEVER QVO) or Intel, even 2nd hand, nothing else. Just remember special vdevs need to be redundant, too; their data is fundamental to the pool being usable. And whatever the webUI says and whatever the rest of your pool is, make them mirrored, *never* raidz. Override the webUI if needed to do it.

Shows one other thing. When you think you have a problem, there's usually others who do, too. Open source and community ftw

TrumanHW · Sep 1, 2020

c77dk said:
just as @Stilez , I'm happy just to help you solve this If you really mean you want to give some money, please give a small donation to the OpenZFS project ( https://openzfs.org/wiki/Main_Page ), or the FreeBSD Foundation ( https://freebsdfoundation.org/donate/ ). Both projects are integral to FreeNAS/TrueNAS

I'll make a donation to OpenZFS project forthwith and if there's a field to reference you as the motivating cause, I'll do that as well. Again, THANK YOU.

TrumanHW · Sep 1, 2020

Stilez said:
Be aware it'll happen via local file transfers within the server too -- not just networked.

Exactly. I'd tried emphasizing just that to another person: that a 'DISK IMPORT' task was slower..! than even 10GbE... Which seemed diagnostic in my experience, but I lacked the acumen and feature knowledge to point at anything.

Stilez said:
Special vdevs, of 1-3 TB file-writes, to my deduped pool ... get ~300+ MB/sec via SMB vs. stalling before.

...special vdevs?
Like, nested arrays..?
(What I think of as RAID 50/51/55 or 60/61/65 ...etc..?)

I don't stand enough to gain from DeDupe.

FreeNAS should add some info as a tooltip ... or idiots like me will tick checkboxes 'willy-nilly' thinking:
"sounds good if it's 'free' ...why not..?"
In reality it'd take using OPTANES to get only 150% of a spinning drive's performance!..?

Dedupe literally increases the cost per performant-gigabyte by a factor of 50+
...ignoring the added costs of hardware to support an NVMe-based system.

.. granted, NOT if I were using 1GbE ... or hey, not at all at 100Mb. lol. But still ...
Features // options which likely degrade performance might benefit from being behind an 'expert' field.

Stilez said:
...only battery backed RAM is faster...

Battery backed RAM..? Is this related to dedupe as well or is it something to evaluate in lieu of Optane..? Speaking of Optane ... which model would you advise (given that value is a factor): 900p, 905p or P4800X / P4801X..?

Stilez said:
I'm using Optane for it, but it's expensive. At minimum, good SSDs (Samsung Pro NVMe)

I actually have 4x Samsung PM983 3.84TB SSDs ... but not a good HBA for it yet. I want to get either 4 more + an Optane for high-performance storage... I thought I'd be able to get the HighPoint SSD7120 to run in HBA mode, but I haven't experimented with that yet so I'm not confident about that.

What's the least expensive x16 controller (for 4x x4 NVMe devices) you know of..? And do you think the Samsung PM983 drives are okay to use in a RAIDz2 config..? Or is that exactly what you were talking about in the next section re: special VDevs needing redundancy..? (As I'm still a little unclear on whether that's nested arrays for Dedupe or just a generalized admonishment)

Stilez said:
Just remember special vdevs need to be redundant, too, their data is fundamental to the pool being usable.

Again... special VDevs...? Which data..? The dedupe table..? If so, I see that. Makes sense.

Stilez said:
Whatever the webUI says and whatever the rest of your pool is, make them mirrored *never* raidz.

I assume this also is something I'll become acclimated to once using TN 12..?

Stilez said:
Override the webUI if needed to do it.

Is this re: DeDupe..?

Stilez said:
Shows one other thing. When you think you have a problem, there's usually others who do, too.

Aside from assuming I had a bad HD ... some used the trick Doctors use when they have an out...
Kinda like telling fat people their back would get better if they lost weight.

It's virtually unfalsifiable because few patients have the willpower to test the doctor's claim, but the doctor just wants to get paid.

Over the last 3 years I've had some pretty ridiculous suggestions ... despite having this problem on 3 very different systems and 3 otherwise identical systems! lol. The most recent of which was, "Maybe your CPU's not fast enough" ... re: the Xeon E5 @ 40MB/s ... lol.

I also do data recovery, so the idea that I couldn't tell a drive was bad, on 5 different systems ... in which 3 of them are virtually identical..? Just statistically ridiculous.
I KNEW it was a setting ... it's just that there are so many settings...(UI has more choices than a 747) that I couldn't prove it!

It took a TRUE GENIUS to do it!. :) :)

Stilez · Sep 1, 2020

TrumanHW said:
Snip

A lot of questions there, hope this helps.

Special vdevs:
You understand (I hope) that ZFS, like many advanced filing systems, keeps an awful lot of data about how stuff is organised, internally. Pointers, indexes, search trees, tables, internal logs, additional data like ACLs and ownership associated with files, free space, dedup table if that's in use... stuff like that. That's collectively the pool's "metadata" (data ABOUT the data you're storing in the pool). It tends to be small, random, and often changes/updates. You also know (I hope) that ZFS pools are created from redundant sets of disks - vdevs - and you just add more of them if needed. ZFS spreads data (including metadata) across all the vdevs in the pool, and across the disks in each vdev, for efficiency and data protection. That means usually metadata is scattered on those HDDs as well. There is no way to control which data goes on which disks - you give ZFS a bunch of disks and their vdev structure, and ZFS will use them as it "thinks" best within its programming and tunables.

A couple of exceptions exist. Historically, you can dedicate some disks to read cache (L2ARC) or write log (ZIL/SLOG) and that'll be respected. That means you can get SSDs for those high performance tasks, without needing a pure SSD pool.

As of TrueNAS-12, you can also dedicate some disks to metadata as well, and that'll be respected too, meaning those can be given smaller dedicated SSDs as well. Those are called "special vdevs" - perhaps " Small data vdevs" would be a better name. Same result in the end. Big user files on HDD, and read cache, write log, metadata, DDT and (optionally) tiny files, on SSD for speed/efficiency.
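If you end up doing this from the command line rather than the web UI, here is a hedged sketch of what the command looks like. It only builds the argument list and never runs it. The pool name `tank` and the device names `nvd0`/`nvd1` are placeholders, and the helper itself is hypothetical. It insists on a mirror, in line with the mirrored-never-raidz advice earlier in the thread, and defaults to `zpool add -n`, which prints the resulting layout without changing the pool.

```python
def special_vdev_command(pool: str, devices: list[str], dry_run: bool = True) -> list[str]:
    """Build (but do not run) a `zpool add` command for a mirrored special vdev."""
    if len(devices) < 2:
        # A single-disk special vdev has no redundancy; losing it loses the pool.
        raise ValueError("a special vdev should be a mirror of at least two devices")
    cmd = ["zpool", "add"]
    if dry_run:
        cmd.append("-n")  # show what would happen, make no changes
    cmd += [pool, "special", "mirror", *devices]
    return cmd


# Placeholder pool and device names, for illustration only.
print(" ".join(special_vdev_command("tank", ["nvd0", "nvd1"])))
```

Check the dry-run output carefully before running the real thing: a vdev added to a pool is not something to count on being able to remove again.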

Dedup:
To be fair, FreeNAS/TrueNAS/ZFS commentators have *always* said to avoid dedup, but also to be fair they've said "unless you have the hardware for it". The problem is, this specific issue just never got appreciated or noticed. So that was the advice. It's not an expert feature though. If you were on 1G you probably wouldn't have the problem. The combination of 10G + dedup hasn't really ever been appreciated - maybe because domestic 10G until very recently was so rare.

Dedup's use cases are *extreme* data size reduction. In my case, for example, the maths goes like this (roughly)

40 TB of data now. Say 80 TB in a while. ZFS likes to run with quite a lot of free space, so that 80 TB should still ideally only fill about 60% of the pool. So about 133 TB of pool capacity. I like to run 3-way mirrors. That's about 400 TB raw disk space. Double it because backup server. 800 TB raw capacity in theory between them. 0.8 PB for a home server and backup? Ridiculous. Using say enterprise 8 TB disks? That's 100 disks at £200 each. And connectors/backplanes. And power costs.
Enable dedup? Now 13 TB, 1/3 of the size. But my future writes will also dedup more (more likely to already have copies in the existing 40 TB), so my deduped size won't double in 5 years. It might go up by 50%, say 18 TB deduped. At 60% full and 3-way mirrors, that's 90 TB, or 11 disks. Maximum 22 disks including a 2nd backup server.
This is when and why one uses dedup. Not to just shave off a few tens of percent. When storage cost and scale is actually prohibitive otherwise.
Other use cases are limited disk size, or limited bandwidth (historically one could cut data sent by a huge amount if replicating or backing up/restoring)
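The capacity sums above can be written out as a small sketch, if you want to plug in your own numbers. The function and parameter names are mine; the defaults (60% fill, 3-way mirrors, 2 servers, 8 TB disks) are the figures from this post. It returns a fractional disk count on purpose, so you can see how rough the rounding is.

```python
def raw_capacity(data_tb: int, fill_percent: int = 60, mirror_width: int = 3,
                 servers: int = 2, disk_tb: int = 8) -> tuple[float, float]:
    """Return (total raw TB across all servers, number of disks, unrounded).

    data_tb:      data actually stored (after dedup, if enabled)
    fill_percent: how full the pool should ideally be
    mirror_width: copies of each block (3 = 3-way mirror)
    servers:      main server plus backups
    disk_tb:      size of each disk
    """
    # Multiply first, divide last, so the round figures stay exact.
    raw_tb = data_tb * 100 * mirror_width * servers / fill_percent
    return raw_tb, raw_tb / disk_tb


for label, data_tb in (("no dedup", 80), ("with dedup", 18)):
    raw_tb, disks = raw_capacity(data_tb)
    print(f"{label}: {raw_tb:.0f} TB raw, about {disks:.1f} x 8 TB disks")
```

With 80 TB that comes to 800 TB and 100 disks; with 18 TB deduped it is 180 TB and somewhere around 22 disks, which is the whole argument for dedup at this scale.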

The point is, this isn't because ZFS is wasteful. It's inherent in my data size, my choice of mirroring with 2-failure tolerance per set as protection, and the fact that disks die/fail, so redundancy is needed across 2 servers in order to never need to worry about it. I could be using anything - ZFS, BTRFS, Windows Server - and I'd still probably need comparable disk use for that level of safety.

Cost and SSD choices:
Being fair again, I don't think you *need* Optane. Good pro-am SSDs could be enough. I'm just not in a mood to buy stuff only to have to replace it, and the data isn't out there to confirm what I need or to predict my best uses and choices. I know I'm a very heavy user when busy, so I'd rather spend than risk issues. That's me though.

But yes, good quality storage costs. Fast storage costs. Whatever you do, it can be cheap, fast, or reliable. Pick any 2. Cheap and fast won't be reliable, cheap and reliable will skip premium fast components, and fast and reliable will cost. It's not cheap running a file server. I happen to like the benefits, and the peace of mind is huge.

Specific other questions:
Battery-backed RAM - Because SSDs are tricky to make fast (they need a lot of tricks to work efficiently!), some people have made PCIe cards with ordinary volatile RAM - but with a battery on board, to either stop the RAM losing its data, or to provide the 30 seconds needed to back it up to separate on-board NVRAM before power fades (it'll be restored when main power returns). In effect, an SSD with the speed and durability of ordinary main motherboard RAM modules. Less common now, usually expensive too ;-)

4x4 controllers? Look at things like Asus Hyper thing, 4 way M.2 slots on a card. But whatever you get, there's a twist you need to be aware of. When you put multiple PCIe devices on one slot, you actually need the slot to handle them as independent devices, not as one device. If the motherboard can do it, then the feature you need in its firmware is called "bifurcation". That means the motherboard can be told in BIOS to treat an 8x slot as 4x + 4x, or a 16x slot as four 4x or an 8x + two 4x, or whatever. If the motherboard can't, you need a PCIe card that can handle 2 or 4 M.2 devices but make them all visible to the BIOS. Usually that means an expensive card with a PLX chip on board. Yes, these exist, but are silly prices sometimes.

An alternative option is an HBA that can take U.2 NVMe devices. But these kinds of SSDs truly do want 4 PCIe lanes, because they can use them. Not like SATA or SAS, where you can squeeze 10 devices onto 3 lanes or whatever and they'll all get enough bandwidth. NVMe SSDs really *can* use all their lanes in some cases. The problem is, however you fiddle it, a CPU/board only has so many lanes - and it's not a big number. So your 4 way U.2 HBA either needs 16 PCIe3 lanes, or if it's squeezing them into 8 lanes, you won't get full bandwidth off some good SSDs. So be aware and think about that aspect. HBAs that can handle U.2/M.2/NVMe are newish, but Broadcom's 9400 range I think is an example. Ask others for help in this area. I could be wrong.
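To put rough numbers on the lane squeeze, a back-of-envelope sketch. The per-lane figure is the standard PCIe 3.0 rate (8 GT/s with 128b/130b encoding, so a bit under 1 GB/s per lane). The even split between devices is a simplifying assumption of mine: it only applies when every SSD is busy at the same moment, and real cards and switches won't behave this neatly.

```python
# PCIe 3.0: 8 GT/s per lane, 128b/130b encoding, 8 bits per byte.
LANE_MB_PER_SEC = 8.0e9 * 128 / 130 / 8 / 1e6  # roughly 985 MB/s


def per_device_mb_per_sec(uplink_lanes: int, devices: int,
                          lanes_per_device: int = 4) -> float:
    """Worst-case bandwidth per SSD when all devices are busy at once.

    uplink_lanes:     lanes between the card/HBA and the CPU
    devices:          number of NVMe SSDs behind it
    lanes_per_device: lanes each SSD could use on its own (x4 for NVMe)
    """
    device_max = lanes_per_device * LANE_MB_PER_SEC
    fair_share = uplink_lanes * LANE_MB_PER_SEC / devices  # assumed even split
    return min(device_max, fair_share)


for lanes in (16, 8):
    share = per_device_mb_per_sec(lanes, devices=4)
    print(f"4 SSDs behind x{lanes}: about {share:.0f} MB/s each when all are busy")
```

So four x4 SSDs behind an x8 uplink each get about half of what they could use, but only when they are all flat out together.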

Your best bet for the Samsung is probably cheap single or dual M.2 adapters, depending what slots your board has and any support for bifurcation.

ZFS pools - the DISKS don't really "nest". Datasets can, but that's the logical structure within the pool's files and data, not the pool's actual design. The disks are in vdevs, and vdevs make up the pool; that's as deep as vdev and disk nesting goes.

The 12-BETA web UI warns against using mirrors on some vdevs and raidz on others. But SLOG/ZIL, special (metadata) and probably L2ARC should all be non-raidz if present. If the web UI complains, it's wrong advice, and hopefully fixed by the final release. Check the box marked "ignore this warning" and do mirrors for those anyway.

I wouldn't say it's "just a setting". You run a very uncommon setup - a 10G+ LAN, with dedup. It's that combination that is the issue. If the problem has arisen commercially I've never heard of it. Or they solved it other ways. But it's not "just" a setting. You could equally say it's because 10G is a bad choice for networking. Switching down to 1G would fix it, too, probably. Or not having all-SSDs. The point is, you can fix this numerous ways, because it's an oversight in older ZFS about "if A and B and C you can get this kinda pathological behaviour emerging. Preventing or disabling any one of A, B, C in older ZFS will prevent the issue."

I think I prefer to think of myself as driven by necessity. I didn't have an easy option to not look for it. Glad it helped!

Constantin · Sep 2, 2020

Hi Stilez,
Thank you for that very informative post. FreeNAS has a ton of tools built into one package. But there is a steep learning curve re: its features and dedup is one of them. Per the forum conversations I looked up at the time, most users didn't have a use case, it was a memory hog, it should be limited to specific data sets that benefit from it explicitly, and so on.

I remember puzzling eons ago over whether or not to enable dedup on my sole pool and, having read the various warnings, deciding not to go there. My use case is very different from yours or the OP's, and I have actually run various de-dup programs manually on my Mac to limit the number of duplicate files on the server.

I look forward to special VDEVs in the future, I plan on splitting the three S3610 into two partitions each - one for Metadata, the other for small files - and then mirroring them 3-ways to let me sleep at night. The extant metadata-only L2ARC will then become general-purpose... some day, it may even become persistent... in the meantime, I really notice the performance delta as the L2ARC cache gets hot when I back up my pool to an external array via rsync.

Stilez · Sep 2, 2020

Constantin said:
I plan on splitting the three S3710 into two partitions each - one for Metadata, the other for small files - and then mirroring them 3-ways to let me sleep at night. The extant metadata-only L2ARC will then become general-purpose... some day, it may even become persistent... in the meantime, I really notice the performance delta as the L2ARC cache gets hot when I back up my pool to an external array via rsync.

That's almost exactly what I've done, except I don't have small files, and may need SLOG/ZIL in future, so my 2nd partitions are used that way not small files. Yes my L2ARC got retired as well - it couldn't do anything that the RAM + special vdevs couldn't do equally or better at that point, as its main aim was to speed up metadata.

Constantin · Sep 2, 2020

The only reason to keep the L2ARC is being able to launch frequently-used files faster, no? With 1TB of storage capacity, I want to see if my L2ARC may not significantly speed up file responsiveness for things like the iTunes index / xml file.

In the meantime, I’ve consolidated a lot of my smaller files into larger zip or image files. Eliminating as many dormant small files as possible should then allow the server to store more of the remaining small files on the SSDs instead of the HDDs.

Deduping stuff is painfully slow / bad on the Mac, as the apps I’ve tried out for this purpose, like Gemini, have significant improvement potential. But once the dedup is done, I’ll hopefully have one consistent dataset with everything sorted, etc.

Stilez · Sep 3, 2020

I just spotted another 4 way M.2 PCIe card online.

M.2 X16 TO 4X NVME PCIE3.0 GEN3 X16 TO 4*NVME RAID CARD Expansion VROC CARD 4XX4 | eBay

www.ebay.co.uk

Cheap, but note as with most such cards it will need bios/firmware bifurcation support.

Constantin · Sep 3, 2020

Those cards have a lot of potential - buy a few, distribute NVME modules across them (so no single card failure hoses the pool) and you could have dedicated NVME mirrored special VDEV for small files, another for metadata, and so on. Granted, the motherboard has to have enough slots, which mine doesn't. That's where Xeon and AMD systems can really shine.

Stilez · Sep 3, 2020

Constantin said:
Those cards have a lot of potential - buy a few, distribute NVME modules across them (so no single card failure hoses the pool) and you could have dedicated NVME mirrored special VDEV for small files, another for metadata, and so on. Granted, the motherboard has to have enough slots, which mine doesn't. That's where Xeon and AMD systems can really shine.

That's probably wasteful or pointless. Mostly, if the SSDs have similar performance and running out of space/adding more SSDs isn't a worry, there won't be much point in replacing, say, 2 pairs of NVMe mirrors used for metadata+small files with one mirror for metadata and one for small files. You're preventing ZFS pulling metadata off 4 SSDs at once or striping it to 2 vdevs. The point is to speed up small IOs, and segregating small files from metadata on similar devices probably won't add much by itself, as a rule. Create 2 special vdevs and let both be used for metadata+small, if that's the scenario.
