10G Starts Maxed & Then Slows

StoreMore

Dabbler
Joined
Dec 13, 2018
Messages
39
My FreeNAS will start a transfer at 1GB/s, then slow down to roughly 500MB/s after 10 seconds or so. It will copy indefinitely at that 500MB/s plateau. How can I figure out what is actually happening, and what is a good resolution to keep the transfers closer to 1GB/s?

I understand it may be on the slower side for batches of small files, but these are all large files in the 10-100GB range.

My FreeNAS box has 8 WD Reds in a Z2 config, 64 GB of RAM, and a 6-core processor. I have 2 Optane (905P) drives installed in the NVMe slots, but I don't have them activated for anything yet.

Thanks!
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
My FreeNAS box has 8 WD Reds in a Z2 config

There's your answer. To me, a sustained 500MB/s read on a single raidz2 vdev sounds plenty fast.
 

StoreMore

Dabbler
Joined
Dec 13, 2018
Messages
39
What model are those WD Reds? Hopefully not EFAX SMR, or slowing to 500MB/s will be the least of your worries.

They are the 10TB models so they shouldn't be SMR.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
It sounds a bit like filling a cached write buffer on the client at the start, then writing at the speed of the disks after that... what do you have on the other end of that 10G connection?
 

StoreMore

Dabbler
Joined
Dec 13, 2018
Messages
39
There's your answer. To me, a sustained 500MB/s read on a single raidz2 vdev sounds plenty fast.

I guess fast is relative. When you see 1G speeds for a quick bit and then it drops to half that, it seems slow. I am really just after making sure that I have fully tuned the system. Is there anything else I can or should do to get the max out of the hardware that I have? The HDs should each be able to sustain 150MB/s, which should get me to about 900MB/s when I multiply by the six data drives (dropping 2 for the Z2). I don't think a ZIL/SLOG will make a difference, but is there something else that needs to be tuned? Is there any other way to monitor or log the transfers that would help in the diagnosis?
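For clarity, that 900MB/s estimate is just the per-disk figure multiplied out:

```shell
# Theoretical best-case sequential throughput for an 8-wide raidz2:
# 6 data drives (8 minus 2 parity) at ~150 MB/s each.
echo "$(( 150 * (8 - 2) )) MB/s"   # 900 MB/s
```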

It sounds a bit like filling a cached write buffer on the client at the start, then writing at the speed of the disks after that... what do you have on the other end of that 10G connection?

My desktop sits on the other side. It's a Ryzen 3 system with an Optane 900P system drive, and the secondary drive the files were being copied from is a Gen4 Sabrent Rocket NVMe. It uses the built-in Aquantia 10G NIC on the X570 Creator motherboard.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Hard drives only sustain 150MBytes/sec if there is straightline (sequential) access happening. Any seeks at all will drop that speed dramatically (in worst case to less than 1MByte/sec).

So if you are writing, it's ballpark reasonable for a not-too-full not-too-fragmented newish pool. After the pool is fuller and has experienced fragmentation, you will see slower write speeds.

Check to see if the disks are actually running at a healthy capacity using gstat. If the tool is reporting that the disks are very busy (70%+) you are really just hitting actual performance limits.
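For reference, a typical way to watch that live while a copy runs (`-p` limits output to physical providers, `-I` sets the refresh interval; both are standard gstat flags on FreeBSD/FreeNAS):

```shell
# Watch per-disk activity during a transfer; the %busy column on the
# right is the figure to check against that 70%+ threshold.
gstat -p -I 1s
```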
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
You could try copying to NUL or a ram disk on the PC.
 

StoreMore

Dabbler
Joined
Dec 13, 2018
Messages
39
You could try copying to NUL or a ram disk on the PC.

Trying a local copy from the data folder (where all the copies originate from - gen4 nvme) to my operating system drive temp folder (optane 900p) I get 1.4-1.5GB/s sustained.

Hard drives only sustain 150MBytes/sec if there is straightline (sequential) access happening. Any seeks at all will drop that speed dramatically (in worst case to less than 1MByte/sec).

So if you are writing, it's ballpark reasonable for a not-too-full not-too-fragmented newish pool. After the pool is fuller and has experienced fragmentation, you will see slower write speeds.

Check to see if the disks are actually running at a healthy capacity using gstat. If the tool is reporting that the disks are very busy (70%+) you are really just hitting actual performance limits.

I'll check gstat and report back once I have time to generate a new fileset.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
Trying a local copy from the data folder (where all the copies originate from - gen4 nvme) to my operating system drive temp folder (optane 900p) I get 1.4-1.5GB/s sustained.
OK. did I misunderstand the direction you were talking about then? (PC to FreeNAS?)

Then in that case, it's exactly the story @jgreco is mentioning... you can fake it with SLOG for maximum 30GB (5 seconds), but then you're back to spindle speeds and IOPS to catch up.
 
Last edited:

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Those streaming reads/writes are not actually what make a system seem fast. It's more about IOPS when it comes to actually using the system vs just benchmarking the system. My 3 vdev 24 drive raidz2 system can do 1.7GB/s streaming reads but drops down to about 193MB/s when it comes to doing real work.
You should use iozone to really test the streaming vs random performance. The random read and random write will tell a more accurate performance metric. Here is an example off my system, there are several different options you can pass to iozone to get different numbers.
Code:
iozone -i 0 -i 1 -i 2 -s 150g -t 1
        Iozone: Performance Test of File I/O
                Version $Revision: 3.487 $
                Compiled for 64 bit mode.
                Build: freebsd

        Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
                     Al Slater, Scott Rhine, Mike Wisner, Ken Goss
                     Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
                     Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
                     Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone,
                     Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root,
                     Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer,
                     Vangel Bojaxhi, Ben England, Vikentsi Lapa,
                     Alexey Skidanov, Sudhir Kumar.

        Run began: Thu May  7 13:00:53 2020

        File size set to 157286400 kB
        Command line used: iozone -i 0 -i 1 -i 2 -s 150g -t 1
        Output is in kBytes/sec
        Time Resolution = 0.000001 seconds.
        Processor cache size set to 1024 kBytes.
        Processor cache line size set to 32 bytes.
        File stride size set to 17 * record size.
        Throughput test with 1 process
        Each process writes a 157286400 kByte file in 4 kByte records

        Children see throughput for  1 initial writers  =  732941.12 kB/sec
        Parent sees throughput for  1 initial writers   =  726624.73 kB/sec
        Min throughput per process                      =  732941.12 kB/sec
        Max throughput per process                      =  732941.12 kB/sec
        Avg throughput per process                      =  732941.12 kB/sec
        Min xfer                                        = 157286400.00 kB

        Children see throughput for  1 rewriters        =  752973.31 kB/sec
        Parent sees throughput for  1 rewriters         =  743161.17 kB/sec
        Min throughput per process                      =  752973.31 kB/sec
        Max throughput per process                      =  752973.31 kB/sec
        Avg throughput per process                      =  752973.31 kB/sec
        Min xfer                                        = 157286400.00 kB

        Children see throughput for  1 readers          = 1787787.88 kB/sec
        Parent sees throughput for  1 readers           = 1787760.13 kB/sec
        Min throughput per process                      = 1787787.88 kB/sec
        Max throughput per process                      = 1787787.88 kB/sec
        Avg throughput per process                      = 1787787.88 kB/sec
        Min xfer                                        = 157286400.00 kB

        Children see throughput for 1 re-readers        = 1753536.25 kB/sec
        Parent sees throughput for 1 re-readers         = 1753511.14 kB/sec
        Min throughput per process                      = 1753536.25 kB/sec
        Max throughput per process                      = 1753536.25 kB/sec
        Avg throughput per process                      = 1753536.25 kB/sec
        Min xfer                                        = 157286400.00 kB

        Children see throughput for 1 random readers    =  193732.53 kB/sec
        Parent sees throughput for 1 random readers     =  193725.93 kB/sec
        Min throughput per process                      =  193732.53 kB/sec
        Max throughput per process                      =  193732.53 kB/sec
        Avg throughput per process                      =  193732.53 kB/sec
        Min xfer                                        = 157286400.00 kB

        Children see throughput for 1 random writers    =   78160.55 kB/sec
        Parent sees throughput for 1 random writers     =   78130.06 kB/sec
        Min throughput per process                      =   78160.55 kB/sec
        Max throughput per process                      =   78160.55 kB/sec
        Avg throughput per process                      =   78160.55 kB/sec
        Min xfer                                        = 157286400.00 kB
 

StoreMore

Dabbler
Joined
Dec 13, 2018
Messages
39
Hard drives only sustain 150MBytes/sec if there is straightline (sequential) access happening. Any seeks at all will drop that speed dramatically (in worst case to less than 1MByte/sec).

So if you are writing, it's ballpark reasonable for a not-too-full not-too-fragmented newish pool. After the pool is fuller and has experienced fragmentation, you will see slower write speeds.

Check to see if the disks are actually running at a healthy capacity using gstat. If the tool is reporting that the disks are very busy (70%+) you are really just hitting actual performance limits.

It is a new pool for now. I ran gstat and for the majority of the time the values are 0 (screenshot attached). Every now and then the drives go to 1-3.

NO WRITE:
NoWrite.png


WRITE (70GB being written to disk pool):
Write.png


OK. did I misunderstand the direction you were talking about then? (PC to FreeNAS?)

Then in that case, it's exactly the story @jgreco is mentioning... you can fake it with ARC for maximum 30GB (10 seconds), but then you're back to spindle speeds and IOPS to catch up.

Yes, this is exactly what I am after. For this ARC faking (30GB, 10 seconds), is that a tunable figure that is set somewhere? Or is it automatically calculated based on my RAM size of 64GB? If I doubled my RAM, would those figures double (60GB, 20 seconds)?

Alternatively, is there a way to add a cache drive in front (not ZIL/SLOG) that receives all data ahead of the HD pool at fast SSD speeds (limited by my 10G network, of course) and then flushes it to the pool at the speed of the mechanical drives in the background?
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
There is no write cache. You can do read cache.
More RAM would give you a larger ARC, which you can tune with vfs.zfs.arc_max.
Even more ARC can be had as L2ARC.
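If anyone wants to pin that, it's set as a loader tunable (value in bytes; the 48GiB figure below is purely an illustration, not a recommendation - size it to your workload):

```shell
# /boot/loader.conf (or System -> Tunables in the FreeNAS UI, type "loader").
# Cap ARC at 48 GiB -- illustrative value only.
vfs.zfs.arc_max="51539607552"
```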
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
WRITE (70GBs being written to disk pool):
View attachment 38376

Yep, you're simply hitting the vdev speed limit here. Because you're writing asynchronously, your first bunch of writes can land in RAM, and then the rate slowly has to taper off until it matches how fast your drives can go.

You can band-aid the issue by increasing the dirty data maximum for a longer "burst period" at the start, at the cost of RAM, but the only way to get faster sustained long-term, is to have more vdevs or faster drives.
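As a sketch of that band-aid (tunable names as on OpenZFS under FreeBSD; the 8GiB value is purely illustrative, and this spends RAM to buy a longer burst, nothing more):

```shell
# Allow up to 8 GiB of dirty data before writes are delayed/halted.
# Illustrative only -- trades RAM for a longer initial burst.
sysctl vfs.zfs.dirty_data_max=8589934592
# The auto-calc ceiling is loader-only; raise it in /boot/loader.conf
# if you want the setting to survive a reboot:
# vfs.zfs.dirty_data_max_max="8589934592"
```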
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
you can fake it with ARC for maximum 30GB (10 seconds), but then you're back to spindle speeds and IOPS to catch up.

There is no write cache.

Ok guys, let's focus here: wrong information is NOT helpful to the OP and increases confusion. I realize the discussion is already confusing. ;-)

If we are talking writes, this is not an ARC issue and there sure as hell is a write cache of sorts.

Yes, this is exactly what I am after. For this ARC faking (30GB, 10 seconds) is that a tuneable figure that is set somewhere? Or is it automatically calculated based on my RAM size of 64GB? If I double my ram would those figures double (60GB, 20 seconds)?

What you're seeing is the transaction group time limit. Additionally, you are probably doing this on a cold system so the transaction group write throttle hasn't yet adapted to the pool.

When you write data, it does not go to disk, it goes into a "transaction group" which has a certain size (dependent on RAM, total transaction group space is limited to 1/8th of RAM) or a temporal limit of five seconds (this is adjustable but it isn't the fix you are looking for) or a throttle rate (which is learned by ZFS depending on your pool performance). You write into a txg until one of these limits is hit, at which point it closes.

When a transaction group closes, it starts to be flushed to disk and a new transaction group opens, and your writes start getting deposited into that new txg. This is an oversimplification for this discussion but good enough. So what can happen is that you start writing insane quantities of data, and you can do so for up to 5 seconds, then that transaction group starts to get flushed to disk, and a new one opens, and you get five more seconds. This can soak up to 1/8th the RAM in your system. And that's the ten seconds.

The problem is that if you have 256GB of RAM, you might then have 32GB of txgs cached to head out to your pool. One txg is actively being written and the other is pending. But there ARE NO MORE (and to avoid your inevitable question, WILL NOT BE MORE). So ZFS stops accepting further writes aimed at the pool, and waits for that up-to-16GB in-progress txg to finish writing to the pool. When that happens, that txg is fully flushed and then a new one opens up, the one that HAD been pending moves into the flushing-to-disk status, and things start moving again.

But let's say you have 256GB of RAM and your pool is a single mirror pair of conventional HDD's that write at 100MBytes/sec. Something bad happens. You have 16GB of data to write at 100MBytes/sec from a transaction group that was only open for five seconds, but it is going to take 160 seconds to write. This means all writes to your pool freeze for almost THREE WHOLE MINUTES while that txg flushes.
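The arithmetic in that worst case is simple enough to sanity-check:

```shell
# Worst-case flush time for the example above: a 16GB txg draining
# to a mirror pair that sustains 100 MBytes/sec.
TXG_MB=$(( 16 * 1024 ))   # 16 GB expressed in MB
POOL_MBS=100              # sustained pool write speed, MB/s
echo "Flush takes $(( TXG_MB / POOL_MBS )) seconds"   # 163 seconds
```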

To combat this, ZFS has a write throttle that tries hard to learn the performance characteristics of your pool, and will limit the size of a txg further to make sure that it's likely to be able to flush within around five seconds.

But ZFS needs time to observe the pool under stressy write conditions to learn the performance characteristics of the pool. So when you first boot the NAS, it is totally possible to achieve AWESOME WRITE SPEEDS for a LARGE amount of data for the first try at copying a large file to the NAS, then this scales back for the second and third attempt as the write throttle gets smarter about your pool.

Your way to get faster writes isn't to try to tune this feature and subvert it, because under the sheets, the hardware simply cannot sustain what you are asking, so you will always end up hitting that limit. You can get faster performance by adding more vdev's, up to a point, and that's the real "fix". You can also artificially REDUCE the maximum potential size of a txg to make sure that you never get into a situation where I/O stalls, because for many applications a txg stall is a terrible thing.

Alternatively, is there a way to add a cache drive in front (not zil, slog) that receives all data in front of the HD pool at the fast SSD speeds (limited by my 10G network of course) and then flushes it to the pool at the speed of the mechanical drives in the background?

Not at this time.
 

no_connection

Patron
Joined
Dec 15, 2013
Messages
480
If you really need a speedy offload of data for some reason you can have a manual write cache on SSD and a script that moves the files after.
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
If we are talking writes, this is not an ARC issue and there sure as hell is a write cache of sorts.

I think "if we are talking writes" was the open question. Whether this was a read or write test was up in the air for a while.

Point taken on async writes.

Curious question, and without meaning to derail the thread: What happened to Nexenta's Write-Back Cache implementation of 2015? It didn't make it into OpenZFS: Not the right fit / performance? Not funded? Licensing worries? Morphed into special-alloc vdev? (https://www.youtube.com/watch?v=MkdrnG7GwdE)
 
Last edited:

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Ok guys let's focus here, the wrong information is NOT helpful to the OP and increases confusion. I realize the discussion is already confusing. ;-)

If we are talking writes, this is not an ARC issue and there sure as hell is a write cache of sorts.

I don't mean to be rude, but this entire post is referencing the legacy write throttle - which thankfully is a thing of the past, along with those brutally long and 100% "you aren't writing anything" stalls. So let's look at what's changed and why the new OpenZFS write throttle sucks a whole lot less. Fair warning; I'm going to refer to the default settings of tunables a lot here. These can be changed, but shouldn't unless you know what you're doing. (And there are more beyond these, too, of course.)

vfs.zfs.delay_min_dirty_percent: The limit of outstanding dirty data before transactions are delayed (default 60% of dirty_data_max)
vfs.zfs.dirty_data_sync: Force a txg if the number of dirty buffer bytes exceed this value (default 64M)
vfs.zfs.dirty_data_max_percent: The percent of physical memory used to auto calculate dirty_data_max (default 10% of system RAM)
vfs.zfs.dirty_data_max_max: The absolute cap on dirty_data_max when auto calculating (default 4G)
vfs.zfs.dirty_data_max: The maximum amount of dirty data in bytes after which new writes are halted until space becomes available (the lower of the percentage of system RAM or the absolute cap in dirty_data_max_max)
vfs.zfs.txg.timeout: The maximum age of a transaction group (default 5s)

The maximum transaction group space is now set to 10% of RAM or a maximum of 4GB, whichever is lower (based on the default tunables), with a maximum age of 5 seconds. However, there's also a threshold of 64M; once that threshold is reached, the transaction group is closed off. So the initial transaction group that results from the burst of writes will be 64M; the next one will be much larger - however much data gets dropped into the system in as long as it takes your vdevs to write 64M.

Where the brakes start getting hit is at the 60% of max threshold - so with default 4GB limits, that's 2.4G of data - at that point, ZFS starts to introduce delays before acknowledging writes. First in nanoseconds, then microseconds - eventually reaching 100ms of delay. But it's a gradual slope that scales up exponentially as you get closer to dirty_data_max of 4G. Even under a maximum load you won't ever hit the maximum, because at 100ms of artificial delay, you only get 10 IOPS from the perspective of an outside system.
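With the default tunables, the point where the delays begin is easy to work out (same numbers as above: 60% of a 4GiB dirty_data_max):

```shell
# Delay threshold = delay_min_dirty_percent of dirty_data_max
DIRTY_MAX_MB=4096   # default cap: 4 GiB
DELAY_PCT=60        # default vfs.zfs.delay_min_dirty_percent
echo "Throttling begins at $(( DIRTY_MAX_MB * DELAY_PCT / 100 )) MB"  # 2457 MB, ~2.4G
```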

There isn't a "learning" process over any longer period of time than it takes to fill up your dirty_data_max - and crucially, it "forgets" it just as fast. 0M of dirty data? It'll gobble it up at full speed. If you only have 2G to write? You'll absorb it all into a RAM-based group at line speed and never throttle. If there's enough time between this 2G write burst and the next for your dirty_data to be "clean" again, you could use a pair of 5400rpm drives as backing vdevs and never know the difference. But if you do sustained writes? You'll feel it.

But regardless of all the improvements in the write throttle, the bottom line remains the same:

Your way to get faster writes isn't to try to tune this feature and subvert it, because under the sheets, the hardware simply cannot sustain what you are asking, so you will always end up hitting that limit. You can get faster performance by adding more vdev's, up to a point, and that's the real "fix".

"More vdevs, faster vdevs."
 
Last edited:

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I don't mean to be rude, but this entire post is referencing the legacy write throttle - which thankfully is a thing of the past, along with those brutally long and 100% "you aren't writing anything" stalls.

It's still generally applicable and easier to explain what's going on that way. Trying to make sense out of what you just wrote, for a beginner, not to be rude, is pretty rough. What you wrote is fine for someone already familiar with ZFS.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
For a simple metaphor, picture a bucket with a tap at the bottom and a hose at the top.

Your network connection is the hose, and the tap at the bottom is your collection of vdevs.

If the hose can fill the bucket faster than the tap can drain it, and you don't do something, you could overflow the bucket. To prevent that, ZFS applies a "write throttle" - squeezing the hose, gradually more and more.

Eventually, you reach a rate of flow where the water coming in is roughly equal to the water going out.
 