Pushing large multi-gigabyte files TO the NAS over SMB causes the transfer to slow and then fail.

SpiritWolf

Dabbler
Joined
Oct 21, 2022
Messages
29
Pushing large multi-gigabyte files TO the NAS over SMB (via the 2.5GbE interface (Intel) or the single-gigabit interface (Asus/Realtek)) causes a large transfer to slow and then drop off.

At the start of a transfer, pushing from a Mac or PC, I see ~300 MB/s. After about a minute it drops to 100, then 70-80 MB/s, then down to about 1 MB/s and eventually into kilobytes. Often it disconnects.
Different motherboards (another Asus) and different Ethernet connections, both fast and standard gigabit, have been tried; they all fail.

Test file: AntMan 4K rip, 55 GB (compressed as a zip).
Other, uncompressed files matter less but will still exhibit the problem.

Smaller files seem to do better.
The transfer will come back up to speed IF smaller files are manually copied afterward AND the connection doesn't drop.

Happens with both PC and Mac, whether using 2.5GbE from the *original* tested motherboard (Asus onboard Realtek), SFP+ 10GbE to the UniFi switch, a combination of the two, or a Macintosh 2.5GbE dongle.
The ONLY things not changed out are the RAM sticks and the processor. Even the switches have been changed.


When the connection fails, I can't reconnect without restarting CIFS/SMB or rebooting. Typically a reboot is needed, and it can take many minutes.
The deprecated AFP protocol will work... but weakly.

———————————————————


TrueNAS 13.0-U2

Motherboard: Asus B550-Plus AC-HES w/ current BIOS (tweaks perhaps?)
CPU: Ryzen 5 5600X, 6 cores
Memory: 64 GB ECC TrueColor 3200

NICs:
10Gb SFP+ PCIe NIC (comparable to Intel X520-DA1, Intel 82599EN chip, single SFP+ port, PCIe x8)
...and
10Gtech 10GBase-T SFP+ transceiver (RJ-45, Cat 6a module) into the 2.5GbE switch (Mac, PC, and TrueNAS each have a 2.5GbE connection, with a UniFi 48-port switch for general distribution)
...and
Asus onboard Realtek gigabit Ethernet (attached to the UniFi 48-port 1st-gen switch, but not always; sometimes only the 2.5GbE connection is used)

HBA Card:
LSI 9207-8i PCIe 3.0 6Gb/s HBA, firmware P20, IT mode, 2x SFF-8087
(Attached to a SODOLA 8-port unmanaged 2.5G switch, 8 x 2.5GBASE-T ports)


Drives: 5x Exos 18 TB in RAID-Z2

GPU: Radeon (used just to see output when something is wrong; not usually installed)
PSU: Corsair RMX Series (2021), RM750x, 750 Watt, Gold, Fully Modular Power Supply

testparm -s output attached
 

Attachments

  • testparam-s.txt
Joined
Jan 7, 2015
Messages
1,150
I've had this very issue a long time ago and it's maddening, but I think it ended up being a faulty disk causing the ZFS cache to fill up and not empty. In fact, I found my post from 2015 on this. But we have seen plenty of this over the years, and it's almost certainly a caching issue of some sort. A lot of this was flying around when the WD Red debacle was happening with the shady SMR disks. People did not realize that the disks were crap until large data moves were made: resilvering, scrubbing, large data transfers, etc. Exos are enterprise disks so I doubt that's the issue, but I would test them all thoroughly anyway.
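
If you want to hammer on them, something along these lines from a shell is a reasonable starting point (a sketch only; da0 through da4 are placeholder device names, substitute whatever your disks actually show up as):

[code]
# One-shot SMART health and error-log summary for a drive
smartctl -a /dev/da0

# Start a long (full-surface) self-test; it runs in the background on the drive
smartctl -t long /dev/da0

# Come back hours later and read the self-test log
smartctl -l selftest /dev/da0
[/code]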
 

SpiritWolf

Dabbler
Joined
Oct 21, 2022
Messages
29
I've had this very issue a long time ago and it's maddening, but I think it ended up being a faulty disk causing the ZFS cache to fill up and not empty. In fact, I found my post from 2015 on this. But we have seen plenty of this over the years, and it's almost certainly a caching issue of some sort. A lot of this was flying around when the WD Red debacle was happening with the shady SMR disks. People did not realize that the disks were crap until large data moves were made: resilvering, scrubbing, large data transfers, etc. Exos are enterprise disks so I doubt that's the issue, but I would test them all thoroughly anyway.
Ok... duly noted. Now, as I'm a relative tyro on this, how do I follow up and check? I haven't had any SMART notifications, so how would I track this down? At the moment, anything pushed to the device is fine for the first few gigabytes and then slows to a trickle (KiB!).
 

SpiritWolf

Dabbler
Joined
Oct 21, 2022
Messages
29
Hmm.. Being so new, I've made PLENTY of mistakes.
I was still getting this:
[screenshot: transfer speed graph]


The data to the server was good for about 5 gigs and then would peter out into never-never land for about 20 seconds or so, then do another ramp.


I also did a "zpool iostat -v" and the results were these:
[screenshot: zpool iostat -v output]


Going further with my research, the gstat output was VERY informative:
[screenshot: gstat output]


and my pool is getting hit HARD with ~20ms latency.

That's where I think I'm getting killed. I'm not sure quite how to interpret this, but I'm learning.
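
For reference, what I ran was roughly the following (the refresh intervals are just what I happened to pick):

[code]
# Per-vdev throughput, refreshing every 2 seconds while a transfer runs
zpool iostat -v 2

# Per-disk busy % and latency (ms/r and ms/w columns), refreshing every second
gstat -p -I 1s
[/code]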

Would a couple of Optane drives as a SLOG solve this issue?

Thanx ahead of time :)
 


jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
Would a couple of Optane drives as a SLOG solve this issue?

No, why would it? If you are laboring under some incorrect understanding that the SLOG is some sort of cache, please disabuse yourself of that notion. Check out


and please accept my assurances that if you turn off sync writes, your pool is going to be as fast as it can be; adding a SLOG will ALWAYS slow a pool down when compared to a pool that has sync writes disabled.

and my pool is getting hit HARD with ~20ms latency.

Yes. So what? A common hard drive is able to do about 50-150 IOPS (seeks per second). At 20 ms per seek, 50 seeks take a full second, i.e. about 50 IOPS, so your drives may just be performing on the lower end of that curve. RAIDZ is never that performant, so this doesn't come as a shock to me.
 

SpiritWolf

Dabbler
Joined
Oct 21, 2022
Messages
29

No, why would it? If you are laboring under some incorrect understanding that the SLOG is some sort of cache, please disabuse yourself of that notion. Check out

I'm reading it... slow going for a newb. But consider me disabused.
and please accept my assurances that if you turn off sync writes, your pool is going to be as fast as it can be; adding a SLOG will ALWAYS slow a pool down when compared to a pool that has sync writes disabled.
Ok...that's easy to understand. So what are the downsides of disabling sync writes? And how is it done?
Yes. So what? A common hard drive is able to do about 50-150 IOPS (seeks per second). At 20 ms per seek, 50 seeks take a full second, i.e. about 50 IOPS, so your drives may just be performing on the lower end of that curve. RAIDZ is never that performant, so this doesn't come as a shock to me.
Allow it to be a shock to me. That's what happens when one learns, Mr. Grinch. :)

What would allow me to be data-safe AND get the performance out of my hardware? What tweaks would you recommend? I suppose I can always return the dual P-series Optane units I've asked for...
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
So what are the downsides of disabling sync writes? And how is it done?

There's a per-dataset flag you can set, like "zfs set sync=disabled [yourdatasethere]"
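
Roughly like this from a shell, assuming a pool called "tank" with a dataset called "media" (substitute your own names):

[code]
# Check the current setting (standard / always / disabled)
zfs get sync tank/media

# Turn sync writes off for that dataset only
zfs set sync=disabled tank/media

# Put it back later if you change your mind
zfs set sync=standard tank/media
[/code]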

So sync writes are important. Maybe. Sometimes. Not always.

Consider the case where you are JPMorgan Chase Bank. Picked only because they're the largest US bank, no negative inferences should be made.

You have stored your customer ledger on a ZFS based filesystem. It's the middle of the night, and ACH transactions are flowing in from the Fed. You get a transaction that says "Deposit $1000 to SpiritWolf from WolfEmployerCo". So you go into your database, read the current balance, record the transaction, add $1000 to the balance, but just as you're writing that back to disk, the system crashes, or a goober hits RESET, or power goes out, or a tornado hits the building.

So with ZFS, your write cache is in memory, and that update to the disk vanishes, along with your paycheck or reimbursement, even though it shows up as a line in your ledger.

The purpose of the ZIL (ZFS Intent Log) is to make a record of the write prior to having it committed to the main pool. When the system comes back up, events in the ZIL are then replayed to make sure they got committed to the pool. This offers programmers a guarantee that if the system returned from the write() syscall successfully, the data was ACTUALLY stored and can be relied upon to be retrieved correctly. You want for your bank balance to be reliably stored and retrieved correctly.

But the in-pool ZIL is slow as hell. So we have this option for a device called SLOG. It can keep track of the sync writes at a very fast clip, nearly the speed of your SLOG device. So if you need sync writes, SLOG is a speed accelerator (but still not a cache).

This can also come into play if you are running VM's on another machine and using the NAS for storage.

Allow it to be a shock to me. That's what happens when one learns, Mr. Grinch. :)

Well, to be honest, I wasn't real happy with that line either. It feels a little too slow, especially since it is unlikely that the IOPS are quite that low. If we look at the line for da2 in your screen shot, that shows 108 reads per second at 20ms, but also significant writes per second. The problem is that a lot of these reads and writes can be sequential in nature, so if you asked to read sectors 6, 7, 8, 9, and 10, that would only be one seek to get there but five reads. That's five IOPS by some definitions, though I prefer to think of an IO as a seek event (since that dominates performance on HDD's).

What would allow me to be data-safe AND get the performance out of my hardware?

Depends on what you're doing, unfortunately. With RAIDZ in play, be sure you are trying to store large files, not small files. That will make a big difference. Adding more memory might be advisable if you are finding read speeds to be poor, you can check the output of arc_summary to get a better idea there. If that's still not enough read speed, main memory ARC can be augmented with L2ARC (flash).
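
For a quick look at how the ARC is doing, something like this (tool names vary slightly between versions; on some releases they are arc_summary.py and arcstat.py):

[code]
# ARC size, hit rate, and MRU/MFU breakdown
arc_summary

# Live ARC hit-rate stats, refreshing every 5 seconds
arcstat 5
[/code]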
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,702
So what are the downsides of disabling sync writes? And how is it done?

Maybe useful to also answer that and expand on the explanation already provided: the type of transactions you're running is important.

If you're copying around large media files, then sync writes aren't really helping you a lot, since if there's a cut/crash in the middle of the transfer, you can just do it again when the system comes back, so the guarantee that every block is actually on the disk before allowing the request for the next one to go ahead is just slowing things down in your case.

For database or bank transactions where individual blocks can be very important in a constantly changing file, sync writes matter a lot.
 
Joined
Jan 27, 2020
Messages
577
So for fast storage (NVMe mirror) of VMs that are running lots of databases, where does that leave me? Sync on or off? And I'm neither a bank nor a business...
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,702
lots of databases
You're going to need Sync... or a bunch of crossed fingers and limbs hoping for no failures or power cuts... or an acceptance of potential data loss in case of some kind of failure. Otherwise it's a SLOG for you if you don't want really bad performance.
 
Joined
Jan 27, 2020
Messages
577
You're going to need Sync... or a bunch of crossed fingers and limbs hoping for no failures or power cuts... or an acceptance of potential data loss in case of some kind of failure. Otherwise it's a SLOG for you if you don't want really bad performance.
Would a mirror of consumer SATA SSDs help for SLOG, or is this just too much of a downgrade in I/O performance?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,702

SpiritWolf

Dabbler
Joined
Oct 21, 2022
Messages
29
There's a per-dataset flag you can set, like "zfs set sync=disabled [yourdatasethere]"

So sync writes are important. Maybe. Sometimes. Not always.

Consider the case where you are JPMorgan Chase Bank. Picked only because they're the largest US bank, no negative inferences should be made.

You have stored your customer ledger on a ZFS based filesystem. It's the middle of the night, and ACH transactions are flowing in from the Fed. You get a transaction that says "Deposit $1000 to SpiritWolf from WolfEmployerCo". So you go into your database, read the current balance, record the transaction, add $1000 to the balance, but just as you're writing that back to disk, the system crashes, or a goober hits RESET, or power goes out, or a tornado hits the building.

So with ZFS, your write cache is in memory, and that update to the disk vanishes, along with your paycheck or reimbursement, even though it shows up as a line in your ledger.

The purpose of the ZIL (ZFS Intent Log) is to make a record of the write prior to having it committed to the main pool. When the system comes back up, events in the ZIL are then replayed to make sure they got committed to the pool. This offers programmers a guarantee that if the system returned from the write() syscall successfully, the data was ACTUALLY stored and can be relied upon to be retrieved correctly. You want for your bank balance to be reliably stored and retrieved correctly.
Ok, that is easy enough to understand. I got that... minus the place where I'd enter "zfs set sync=disabled [yourdatasethere]".

Presuming I understand correctly, if the pool has multiple datasets, I can individually tell each one whether or not to use sync writes.

But the in-pool ZIL is slow as hell. So we have this option for a device called SLOG. It can keep track of the sync writes at a very fast clip, nearly the speed of your SLOG device. So if you need sync writes, SLOG is a speed accelerator (but still not a cache).

This can also come into play if you are running VM's on another machine and using the NAS for storage.
So an SLOG is a way to move the ZIL off the main rotating pool of rust to something faster, say an SSD that won't burn out with all the writes, say the pair of Intel Optane 905P Series 960GB devices that are on their way.
At least, so I grok. Not a cache (didn't think so actually), but an "expediter", so-to-speak.
Well, to be honest, I wasn't real happy with that line either. It feels a little too slow, especially since it is unlikely that the IOPS are quite that low. If we look at the line for da2 in your screen shot, that shows 108 reads per second at 20ms, but also significant writes per second. The problem is that a lot of these reads and writes can be sequential in nature, so if you asked to read sectors 6, 7, 8, 9, and 10, that would only be one seek to get there but five reads. That's five IOPS by some definitions, though I prefer to think of an IO as a seek event (since that dominates performance on HDD's).



Depends on what you're doing, unfortunately. With RAIDZ in play, be sure you are trying to store large files, not small files. That will make a big difference. Adding more memory might be advisable if you are finding read speeds to be poor, you can check the output of arc_summary to get a better idea there. If that's still not enough read speed, main memory ARC can be augmented with L2ARC (flash).
The unit is holding my Mac and PC backups and a Plex server, plus storage for my mom's old 16mm film conversions.
---------
And that was the other question: using one of the drives as an L2ARC, as you suggest, may be useful.

I had thought 64 gigs of RAM would suffice with this box; seems I was overoptimistic especially with de-dup and such.

As writes are the bugaboo here for the moment, anything I can do to speed them up would be a "good thing" (Pat. Pending). The box sounds like an incipient earthquake, and in my ignorance I set up the 5 drives as three for data and two for parity.

My reasoning was that drives are so large these days that if one failed in service, I'd still have redundancy and time to get another; I could suffer as many as two failures and still be whole. Resilvering a new drive would take long enough that another drive in the pool could actually fail in the meantime, which would kill my data if I had only a single parity drive.
 

SpiritWolf

Dabbler
Joined
Oct 21, 2022
Messages
29
Maybe useful to also answer that and expand on the explanation already provided: the type of transactions you're running is important.

If you're copying around large media files, then sync writes aren't really helping you a lot, since if there's a cut/crash in the middle of the transfer, you can just do it again when the system comes back, so the guarantee that every block is actually on the disk before allowing the request for the next one to go ahead is just slowing things down in your case.

For database or bank transactions where individual blocks can be very important in a constantly changing file, sync writes matter a lot.
To reiterate from another post: PC and Mac backups, a Plex server, and general file copies.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,903
I had thought 64 gigs of RAM would suffice with this box; seems I was overoptimistic especially with de-dup and such.
De-duplication is tricky at best and changes sizing, esp. for RAM. Very, very few people here use it, because the use-cases are fewer than one might think.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
especially with de-dup

Oh. Dedup. That explains it. You're way short on RAM and it's busily pushing massive quantities of IOPS trying to make it all work out.

Check out the Resource by @Stilez on this topic.

 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,107
So an SLOG is a way to move the ZIL off the main rotating pool of rust to something faster, say an SSD that won't burn out with all the writes, say the pair of Intel Optane 905P Series 960GB devices that are on their way.
One is enough, and 960 GB is WAY oversized since it only needs to hold 5 seconds of input.

I had thought 64 gigs of RAM would suffice with this box; seems I was overoptimistic especially with de-dup and such.
Bump that to 128 GB for dedup.
Otherwise this is a prime case for a (metadata-only) persistent L2ARC to hold the dedup table. With a bit of command-line sorcery, you may actually use two partitions on a single 905P as SLOG and L2ARC devices. (But your use case rather warrants "sync=disabled" and no SLOG, and 960 GB is too big an L2ARC with only 64 GB of RAM.)
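
For what it's worth, that partition "sorcery" would look roughly like this from a shell. This is only a sketch: the device name (nvd0), pool name (tank), and partition sizes are placeholders, and the GUI won't manage partitioned devices for you.

[code]
# Partition the Optane: a small SLOG slice and a larger L2ARC slice
gpart create -s gpt nvd0
gpart add -t freebsd-zfs -s 16G -l slog0 nvd0
gpart add -t freebsd-zfs -s 200G -l l2arc0 nvd0

# Attach them to the pool as log and cache devices
zpool add tank log gpt/slog0
zpool add tank cache gpt/l2arc0

# Restrict the L2ARC to metadata (which is where the dedup table lives);
# persistent L2ARC is on by default in recent OpenZFS releases
zfs set secondarycache=metadata tank
[/code]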
Best overall would be to disable dedup and move the data to a new dataset, to remove dedup entirely.
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,107
So for fast storage (nvme mirror) of VMs, that are running lots of databases, where does it leave me? Sync on or off? And I'm no bank nor a business...
Presumably "sync=always", because lost database transactions could not be replayed, and a qualified NVMe SLOG, because a SLOG only makes sense if it's faster than a stripe of the data drives (where the in-pool ZIL lives) and if it has PLP. Mirroring is NOT required.
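
Setting that is a one-liner per dataset or zvol, e.g. (the name here is a placeholder):

[code]
# Treat every write to the VM storage as synchronous
zfs set sync=always tank/vm-storage
[/code]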
A European seller on the ServeTheHome forum has 100 GB Optane DC P4801X drives (M.2 22110). Radian RMS-200 cards can be found on ebay.co.uk. Grab one of these devices as your SLOG—whichever fits best in your hardware.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
Bump that to 128 GB for dedup.

Quite possibly more. I had calculated 192GB "back of envelope" but hesitated committing to it because dedup is very sensitive to factors such as the recordsize, and I believe @Stilez discusses it in his article. You need a DDT record for each unique block stored. With 5x18TB HDD, you should be prepared for a world where you could need 270GB, calculated as three data drives times 18TB times 5GB per TB. See how things get crazy fast.
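
Back of the envelope, using the rough ~5 GB of dedup table per TB of stored data rule of thumb:

[code]
# 3 data drives x 18 TB each x ~5 GB of DDT per TB
echo $((3 * 18 * 5))   # prints 270 (GB)
[/code]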
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
especially with de-dup

Everyone's jumped all over this one and linked to the excellent paperwork from @Stilez - let's pull the raw numbers out with the following and find out what the "cost" and "return on investment" has been for dedup.

[code]
zpool list
zpool status -D poolname
[/code]

Output in [code][/code] tags please?
 