10G Starts Maxed & Then Slows

StoreMore

Dabbler
Joined
Dec 13, 2018
Messages
39
My FreeNAS will start a transfer at 1GB/s, then slow down to roughly 500MB/s after 10 seconds or so. It will copy indefinitely at that 500MB/s plateau. How can I figure out what is actually happening, and what is a good resolution to keep the transfers closer to 1GB/s?

I understand it may be on the slower side for batches of small files, but these are all large files in the 10-100GB range.

My FreeNAS box has 8 WD Reds in a Z2 config, 64 GB of RAM, and a 6-core processor. I have 2 Optane (905P) drives installed in the NVMe slots, but I don't have them activated for anything yet.

Thanks!
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
My FreeNAS box has 8 WD Reds in a Z2 config

There's your answer. To me, a sustained 500MB/s read on a single raidz2 vdev sounds plenty fast.
 

StoreMore

Dabbler
Joined
Dec 13, 2018
Messages
39
What model are those WD Reds? Hopefully not EFAX SMR, or slowing to 500MB/s will be the least of your worries.

They are the 10TB models so they shouldn't be SMR.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
It sounds a bit like filling a cached write buffer on the client at the start, then writing at the speed of the disks after that... what do you have on the other end of that 10G connection?
 

StoreMore

Dabbler
Joined
Dec 13, 2018
Messages
39
There's your answer. To me, a sustained 500MB/s read on a single raidz2 vdev sounds plenty fast.

I guess fast is relative. When you see 1G speeds for a quick bit and then it drops to half that, it seems slow. I am really just after making sure that I have fully tuned the system. Is there anything else I can or should do to get the max out of the hardware that I have? The HDs should each be able to sustain 150MB/s, which should get me to about 900MB/s when I multiply by the six data drives (dropping 2 for the Z2). I don't think a ZIL/SLOG will make a difference, but is there something else that needs to be tuned? Is there any other way to monitor or log the transfers that would help in the diagnosis?
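For clarity, that 900MB/s estimate is just the per-disk figure multiplied out:

```shell
# Theoretical best-case sequential throughput for an 8-wide raidz2:
# 6 data drives (8 minus 2 parity) at ~150 MB/s each.
echo "$(( 150 * (8 - 2) )) MB/s"   # 900 MB/s
```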

It sounds a bit like filling a cached write buffer on the client at the start, then writing at the speed of the disks after that... what do you have on the other end of that 10G connection?

My desktop sits on the other side. It's a Ryzen 3 system with an Optane 900P system drive, and the secondary drive the files were being copied from is a Gen4 Sabrent Rocket NVMe. It uses the built-in Aquantia 10G NIC on the X570 Creator motherboard.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Hard drives only sustain 150MBytes/sec if there is straightline (sequential) access happening. Any seeks at all will drop that speed dramatically (in worst case to less than 1MByte/sec).

So if you are writing, it's ballpark reasonable for a not-too-full not-too-fragmented newish pool. After the pool is fuller and has experienced fragmentation, you will see slower write speeds.

Check to see if the disks are actually running at a healthy capacity using gstat. If the tool is reporting that the disks are very busy (70%+) you are really just hitting actual performance limits.
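For reference, a typical way to watch that live while a copy runs (`-p` limits output to physical providers, `-I` sets the refresh interval; both are standard gstat flags on FreeBSD/FreeNAS):

```shell
# Watch per-disk activity during a transfer; the %busy column on the
# right is the figure to check against that 70%+ threshold.
gstat -p -I 1s
```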
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
You could try copying to NUL or a ram disk on the PC.
 

StoreMore

Dabbler
Joined
Dec 13, 2018
Messages
39
You could try copying to NUL or a ram disk on the PC.

Trying a local copy from the data folder (where all the copies originate from - gen4 nvme) to my operating system drive temp folder (optane 900p) I get 1.4-1.5GB/s sustained.

Hard drives only sustain 150MBytes/sec if there is straightline (sequential) access happening. Any seeks at all will drop that speed dramatically (in worst case to less than 1MByte/sec).

So if you are writing, it's ballpark reasonable for a not-too-full not-too-fragmented newish pool. After the pool is fuller and has experienced fragmentation, you will see slower write speeds.

Check to see if the disks are actually running at a healthy capacity using gstat. If the tool is reporting that the disks are very busy (70%+) you are really just hitting actual performance limits.

I'll check gstat and report back once I have time to generate a new fileset.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
Trying a local copy from the data folder (where all the copies originate from - gen4 nvme) to my operating system drive temp folder (optane 900p) I get 1.4-1.5GB/s sustained.
OK. did I misunderstand the direction you were talking about then? (PC to FreeNAS?)

Then in that case, it's exactly the story @jgreco is mentioning... you can fake it with SLOG for maximum 30GB (5 seconds), but then you're back to spindle speeds and IOPS to catch up.
 
Last edited:

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Those streaming reads/writes are not actually what make a system seem fast. It's more about IOPS when it comes to actually using the system vs just benchmarking the system. My 3 vdev 24 drive raidz2 system can do 1.7GB/s streaming reads but drops down to about 193MB/s when it comes to doing real work.
You should use iozone to really test the streaming vs random performance. The random read and random write will tell a more accurate performance metric. Here is an example off my system, there are several different options you can pass to iozone to get different numbers.
Code:
iozone -i 0 -i 1 -i 2 -s 150g -t 1
        Iozone: Performance Test of File I/O
                Version $Revision: 3.487 $
                Compiled for 64 bit mode.
                Build: freebsd

        Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
                     Al Slater, Scott Rhine, Mike Wisner, Ken Goss
                     Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
                     Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
                     Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone,
                     Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root,
                     Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer,
                     Vangel Bojaxhi, Ben England, Vikentsi Lapa,
                     Alexey Skidanov, Sudhir Kumar.

        Run began: Thu May  7 13:00:53 2020

        File size set to 157286400 kB
        Command line used: iozone -i 0 -i 1 -i 2 -s 150g -t 1
        Output is in kBytes/sec
        Time Resolution = 0.000001 seconds.
        Processor cache size set to 1024 kBytes.
        Processor cache line size set to 32 bytes.
        File stride size set to 17 * record size.
        Throughput test with 1 process
        Each process writes a 157286400 kByte file in 4 kByte records

        Children see throughput for  1 initial writers  =  732941.12 kB/sec
        Parent sees throughput for  1 initial writers   =  726624.73 kB/sec
        Min throughput per process                      =  732941.12 kB/sec
        Max throughput per process                      =  732941.12 kB/sec
        Avg throughput per process                      =  732941.12 kB/sec
        Min xfer                                        = 157286400.00 kB

        Children see throughput for  1 rewriters        =  752973.31 kB/sec
        Parent sees throughput for  1 rewriters         =  743161.17 kB/sec
        Min throughput per process                      =  752973.31 kB/sec
        Max throughput per process                      =  752973.31 kB/sec
        Avg throughput per process                      =  752973.31 kB/sec
        Min xfer                                        = 157286400.00 kB

        Children see throughput for  1 readers          = 1787787.88 kB/sec
        Parent sees throughput for  1 readers           = 1787760.13 kB/sec
        Min throughput per process                      = 1787787.88 kB/sec
        Max throughput per process                      = 1787787.88 kB/sec
        Avg throughput per process                      = 1787787.88 kB/sec
        Min xfer                                        = 157286400.00 kB

        Children see throughput for 1 re-readers        = 1753536.25 kB/sec
        Parent sees throughput for 1 re-readers         = 1753511.14 kB/sec
        Min throughput per process                      = 1753536.25 kB/sec
        Max throughput per process                      = 1753536.25 kB/sec
        Avg throughput per process                      = 1753536.25 kB/sec
        Min xfer                                        = 157286400.00 kB

        Children see throughput for 1 random readers    =  193732.53 kB/sec
        Parent sees throughput for 1 random readers     =  193725.93 kB/sec
        Min throughput per process                      =  193732.53 kB/sec
        Max throughput per process                      =  193732.53 kB/sec
        Avg throughput per process                      =  193732.53 kB/sec
        Min xfer                                        = 157286400.00 kB

        Children see throughput for 1 random writers    =   78160.55 kB/sec
        Parent sees throughput for 1 random writers     =   78130.06 kB/sec
        Min throughput per process                      =   78160.55 kB/sec
        Max throughput per process                      =   78160.55 kB/sec
        Avg throughput per process                      =   78160.55 kB/sec
        Min xfer                                        = 157286400.00 kB
 

StoreMore

Dabbler
Joined
Dec 13, 2018
Messages
39
Hard drives only sustain 150MBytes/sec if there is straightline (sequential) access happening. Any seeks at all will drop that speed dramatically (in worst case to less than 1MByte/sec).

So if you are writing, it's ballpark reasonable for a not-too-full not-too-fragmented newish pool. After the pool is fuller and has experienced fragmentation, you will see slower write speeds.

Check to see if the disks are actually running at a healthy capacity using gstat. If the tool is reporting that the disks are very busy (70%+) you are really just hitting actual performance limits.

It is a new pool for now. I ran gstat and for the majority of the time the values are 0 (screenshot attached). Every now and then the drives go to 1-3.

NO WRITE:
NoWrite.png


WRITE (70GB being written to disk pool):
Write.png


OK. did I misunderstand the direction you were talking about then? (PC to FreeNAS?)

Then in that case, it's exactly the story @jgreco is mentioning... you can fake it with ARC for maximum 30GB (10 seconds), but then you're back to spindle speeds and IOPS to catch up.

Yes, this is exactly what I am after. For this ARC faking (30GB, 10 seconds), is that a tunable figure that is set somewhere? Or is it automatically calculated based on my RAM size of 64GB? If I doubled my RAM, would those figures double (60GB, 20 seconds)?

Alternatively, is there a way to add a cache drive in front (not ZIL/SLOG) that receives all data ahead of the HD pool at fast SSD speeds (limited by my 10G network, of course) and then flushes it to the pool at the speed of the mechanical drives in the background?
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
There is no write cache. You can do read cache.
More RAM would give you a larger ARC, which you can tune with vfs.zfs.arc_max.
Even more ARC can be had as L2ARC.
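If anyone wants to pin that, it's set as a loader tunable (value in bytes; the 48GiB figure below is purely an illustration, not a recommendation - size it to your workload):

```shell
# /boot/loader.conf (or System -> Tunables in the FreeNAS UI, type "loader").
# Cap ARC at 48 GiB -- illustrative value only.
vfs.zfs.arc_max="51539607552"
```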
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
WRITE (70GBs being written to disk pool):
View attachment 38376

Yep, you're simply hitting the vdev speed limit here. Because you're writing asynchronously, your first bunch of writes can land in RAM, and then the rate slowly has to taper off until it matches how fast your drives can go.

You can band-aid the issue by increasing the dirty data maximum for a longer "burst period" at the start, at the cost of RAM, but the only way to get faster sustained long-term, is to have more vdevs or faster drives.
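As a sketch of that band-aid (tunable names as on OpenZFS under FreeBSD; the 8GiB value is purely illustrative, and this spends RAM to buy a longer burst, nothing more):

```shell
# Allow up to 8 GiB of dirty data before writes are delayed/halted.
# Illustrative only -- trades RAM for a longer initial burst.
sysctl vfs.zfs.dirty_data_max=8589934592
# The auto-calc ceiling is loader-only; raise it in /boot/loader.conf
# if you want the setting to survive a reboot:
# vfs.zfs.dirty_data_max_max="8589934592"
```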
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
you can fake it with ARC for maximum 30GB (10 seconds), but then you're back to spindle speeds and IOPS to catch up.

There is no write cache.

Ok guys, let's focus here: wrong information is NOT helpful to the OP and increases confusion. I realize the discussion is already confusing. ;-)

If we are talking writes, this is not an ARC issue and there sure as hell is a write cache of sorts.

Yes, this is exactly what I am after. For this ARC faking (30GB, 10 seconds) is that a tuneable figure that is set somewhere? Or is it automatically calculated based on my RAM size of 64GB? If I double my ram would those figures double (60GB, 20 seconds)?

What you're seeing is the transaction group time limit. Additionally, you are probably doing this on a cold system so the transaction group write throttle hasn't yet adapted to the pool.

When you write data, it does not go to disk, it goes into a "transaction group" which has a certain size (dependent on RAM, total transaction group space is limited to 1/8th of RAM) or a temporal limit of five seconds (this is adjustable but it isn't the fix you are looking for) or a throttle rate (which is learned by ZFS depending on your pool performance). You write into a txg until one of these limits is hit, at which point it closes.

When a transaction group closes, it starts to be flushed to disk and a new transaction group opens, and your writes start getting deposited into that new txg. This is an oversimplification for this discussion but good enough. So what can happen is that you start writing insane quantities of data, and you can do so for up to 5 seconds, then that transaction group starts to get flushed to disk, and a new one opens, and you get five more seconds. This can soak up to 1/8th the RAM in your system. And that's the ten seconds.

The problem is that if you have 256GB of RAM, you might then have 32GB of txgs cached to head out to your pool. One txg is actively being written and the other is pending. But there ARE NO MORE (and to avoid your inevitable question, WILL NOT BE MORE). So ZFS stops accepting further writes aimed at the pool, and waits for that up-to-16GB in-progress txg to finish writing to the pool. When that happens, that txg is fully flushed and then a new one opens up, the one that HAD been pending moves into the flushing-to-disk status, and things start moving again.

But let's say you have 256GB of RAM and your pool is a single mirror pair of conventional HDD's that write at 100MBytes/sec. Something bad happens. You have 16GB of data to write at 100MBytes/sec from a transaction group that was only open for five seconds, but it is going to take 160 seconds to write. This means all writes to your pool freeze for almost THREE WHOLE MINUTES while that txg flushes.
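The arithmetic in that worst case is simple enough to sanity-check:

```shell
# Worst-case flush time for the example above: a 16GB txg draining
# to a mirror pair that sustains 100 MBytes/sec.
TXG_MB=$(( 16 * 1024 ))   # 16 GB expressed in MB
POOL_MBS=100              # sustained pool write speed, MB/s
echo "Flush takes $(( TXG_MB / POOL_MBS )) seconds"   # 163 seconds
```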

To combat this, ZFS has a write throttle that tries hard to learn the performance characteristics of your pool, and will limit the size of a txg further to make sure that it's likely to be able to flush within around five seconds.

But ZFS needs time to observe the pool under stressy write conditions to learn the performance characteristics of the pool. So when you first boot the NAS, it is totally possible to achieve AWESOME WRITE SPEEDS for a LARGE amount of data for the first try at copying a large file to the NAS, then this scales back for the second and third attempt as the write throttle gets smarter about your pool.

Your way to get faster writes isn't to try to tune this feature and subvert it, because under the sheets, the hardware simply cannot sustain what you are asking, so you will always end up hitting that limit. You can get faster performance by adding more vdev's, up to a point, and that's the real "fix". You can also artificially REDUCE the maximum potential size of a txg to make sure that you never get into a situation where I/O stalls, because for many applications a txg stall is a terrible thing.

Alternatively, is there a way to add a cache drive in front (not zil, slog) that receives all data in front of the HD pool at the fast SSD speeds (limited by my 10G network of course) and then flushes it to the pool at the speed of the mechanical drives in the background?

Not at this time.
 

no_connection

Patron
Joined
Dec 15, 2013
Messages
480
If you really need a speedy offload of data for some reason you can have a manual write cache on SSD and a script that moves the files after.
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
If we are talking writes, this is not an ARC issue and there sure as hell is a write cache of sorts.

I think "if we are talking writes" was the open question. Whether this was a read or write test was up in the air for a while.

Point taken on async writes.

Curious question, and without meaning to derail the thread: What happened to Nexenta's Write-Back Cache implementation of 2015? It didn't make it into OpenZFS: Not the right fit / performance? Not funded? Licensing worries? Morphed into special-alloc vdev? (https://www.youtube.com/watch?v=MkdrnG7GwdE)
 
Last edited:

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Ok guys let's focus here, the wrong information is NOT helpful to the OP and increases confusion. I realize the discussion is already confusing. ;-)

If we are talking writes, this is not an ARC issue and there sure as hell is a write cache of sorts.

I don't mean to be rude, but this entire post is referencing the legacy write throttle - which thankfully is a thing of the past, along with those brutally long and 100% "you aren't writing anything" stalls. So let's look at what's changed and why the new OpenZFS write throttle sucks a whole lot less. Fair warning; I'm going to refer to the default settings of tunables a lot here. These can be changed, but shouldn't unless you know what you're doing. (And there are more beyond these, too, of course.)

vfs.zfs.delay_min_dirty_percent: The limit of outstanding dirty data before transactions are delayed (default 60% of dirty_data_max)
vfs.zfs.dirty_data_sync: Force a txg if the number of dirty buffer bytes exceed this value (default 64M)
vfs.zfs.dirty_data_max_percent: The percent of physical memory used to auto calculate dirty_data_max (default 10% of system RAM)
vfs.zfs.dirty_data_max_max: The absolute cap on dirty_data_max when auto calculating (default 4G)
vfs.zfs.dirty_data_max: The maximum amount of dirty data in bytes after which new writes are halted until space becomes available (the lower of the percentage of system RAM or the absolute cap in dirty_data_max_max)
vfs.zfs.txg.timeout: The maximum age of a transaction group (default 5s)

The maximum transaction group space is now set to 10% of RAM or a maximum of 4GB, whichever is lower (based on the default tunables), with a maximum age of 5 seconds. However, there's also a threshold of 64M; once that threshold is reached, the transaction group is closed off. So the initial transaction group that results from the burst of writes will be 64M; the next one will be much larger - however much data gets dropped into the system in as long as it takes your vdevs to write 64M.

Where the brakes start getting hit is at the 60% of max threshold - so with default 4GB limits, that's 2.4G of data - at that point, ZFS starts to introduce delays before acknowledging writes. First in nanoseconds, then microseconds - eventually reaching 100ms of delay. But it's a gradual slope that scales up exponentially as you get closer to dirty_data_max of 4G. Even under a maximum load you won't ever hit the maximum, because at 100ms of artificial delay, you only get 10 IOPS from the perspective of an outside system.
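With the default tunables, the point where the delays begin is easy to work out (same numbers as above: 60% of a 4GiB dirty_data_max):

```shell
# Delay threshold = delay_min_dirty_percent of dirty_data_max
DIRTY_MAX_MB=4096   # default cap: 4 GiB
DELAY_PCT=60        # default vfs.zfs.delay_min_dirty_percent
echo "Throttling begins at $(( DIRTY_MAX_MB * DELAY_PCT / 100 )) MB"  # 2457 MB, ~2.4G
```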

There isn't a "learning" process over any longer period of time than it takes to fill up your dirty_data_max - and crucially, it "forgets" it just as fast. 0M of dirty data? It'll gobble it up at full speed. If you only have 2G to write? You'll absorb it all into a RAM-based group at line speed and never throttle. If there's enough time between this 2G write burst and the next for your dirty_data to be "clean" again, you could use a pair of 5400rpm drives as backing vdevs and never know the difference. But if you do sustained writes? You'll feel it.

But regardless of all the improvements in the write throttle, the bottom line remains the same:

Your way to get faster writes isn't to try to tune this feature and subvert it, because under the sheets, the hardware simply cannot sustain what you are asking, so you will always end up hitting that limit. You can get faster performance by adding more vdev's, up to a point, and that's the real "fix".

"More vdevs, faster vdevs."
 
Last edited:

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I don't mean to be rude, but this entire post is referencing the legacy write throttle - which thankfully is a thing of the past, along with those brutally long and 100% "you aren't writing anything" stalls.

It's still generally applicable and easier to explain what's going on that way. Trying to make sense out of what you just wrote, for a beginner, not to be rude, is pretty rough. What you wrote is fine for someone already familiar with ZFS.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
For a simple metaphor, picture a bucket with a tap at the bottom and a hose at the top.

Your network connection is the hose, and the tap at the bottom is your collection of vdevs.

If the hose can fill the bucket faster than the tap can drain it, and you don't do something, you could overflow the bucket. To prevent that, ZFS applies a "write throttle" - squeezing the hose, gradually more and more.

Eventually, you reach a rate of flow where the water coming in is roughly equal to the water going out.
 