can't maintain ssh connection to perform large zfs send

Joined
Oct 22, 2019
Messages
3,641
Surely (de)compression can't be that CPU-intensive?
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
If both systems are on a trusted network, you might want to look at using netcat instead of ssh.

It works reliably and quickly. I have used it many times.
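Roughly, the pattern looks like this (just a sketch; the host name, port, and dataset names here are placeholders, not anything from this thread):

Code:
# On the receiving machine: listen on a port and feed the stream into zfs recv
nc -l 8023 | zfs recv -F backup/tank

# On the sending machine: pipe the send stream straight to that port
zfs send -R tank@migrate | nc receiver-host 8023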

Here is a link to a nice post on it:

 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
I would have thought I'd be more IO-bound than CPU-bound. But I'd at least not expect the kernel to allow the rest of the system to be crushed.
SSH encrypts the traffic, netcat doesn't, and that encryption makes the bigger difference to performance. Compression is actually a net benefit.
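If you do stick with ssh, one hedged example of putting a cheap compressor in the pipe (zstd and the dataset names are assumptions on my part, not something from this thread):

Code:
# Compress the stream before it hits ssh, decompress on the far side
zfs send tank/data@snap | zstd -3 | ssh user@backuphost "zstd -d | zfs recv -F backup/data"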
 

seanm

Guru
Joined
Jun 11, 2018
Messages
570
Chris, your replies make me think you think I'm copying over the network, but, to be clear: all the drives are in the same computer.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Chris, your replies make me think you think I'm copying over the network, but, to be clear: all the drives are in the same computer.
I may not have read all the details before responding. Still, it could be useful for you at some point.
Now that I have read more of the posts, it looks like you have it solved, except that CPU utilization goes to 100%.
 

seanm

Guru
Joined
Jun 11, 2018
Messages
570
I may not have read all the details before responding. Still, it could be useful for you at some point.
Now that I have read more of the posts, it looks like you have it solved, except that CPU utilization goes to 100%.
The thread topic is indeed obsolete.

The ssh part is solved, but I still can't move this data because of the high CPU utilization bringing the system to its knees and taking down services that must keep running.

Next weekend I'll have another shot at it...
 

freqlabs

iXsystems
iXsystems
Joined
Jul 18, 2019
Messages
50
Something you could try is putting mbuffer between the send and recv sides as a rate limiter, along the lines of
Code:
zfs send ... | mbuffer -R 100M | zfs recv ...
where 100M means 100 MBytes/sec. You'll probably need to tune it up or down depending on the rate the send normally runs at (which you can check by running mbuffer with no options).
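For a local pool-to-pool copy that might look something like this (pool, dataset, and snapshot names are placeholders; -m just gives mbuffer some RAM to smooth out bursts):

Code:
# Rough sketch: cap a local replication at ~100 MB/s with a 1 GiB buffer
zfs send -R tank/data@migrate | mbuffer -m 1G -R 100M | zfs recv -F backup/data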
 

freqlabs

iXsystems
iXsystems
Joined
Jul 18, 2019
Messages
50
The solution was to limit the ZIO thread count during the replication by tuning vfs.zfs.zio.taskq_batch_pct down to (in this case) 36% instead of the default 80%. This is necessary because each pool has its own group of threads for I/O, so the CPU limit applies to each pool separately. With multiple pools being stressed simultaneously, and each CPU-bound by slow compression algorithms, this easily saturates the CPU.

It is non-trivial to make ZFS automatically share CPU resources between multiple pools more intelligently, so this may not be implemented any time soon. For now, the manual tuning is available and effective. From zfs(4):

Code:
     zio_taskq_batch_pct=80% (uint)
             Percentage of online CPUs which will run a worker thread for I/O.
             These workers are responsible for I/O work such as compression
             and checksum calculations.  Fractional number of CPUs will be
             rounded down.

             The default value of 80% was chosen to avoid using all CPUs which
             can result in latency issues and inconsistent application
             performance, especially when slower compression and/or
             checksumming is enabled.

     zio_taskq_batch_tpq=0 (uint)
             Number of worker threads per taskq.  Lower values improve I/O
             ordering and CPU utilization, while higher reduces lock
             contention.

             If 0, generate a system-dependent value close to 6 threads per
             taskq.


On CORE these correspond to vfs.zfs.zio.taskq_batch_pct and vfs.zfs.zio.taskq_batch_tpq.
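As a rough sketch of how that could be applied on CORE (the 36% figure is the value used in this thread; since the ZIO taskqs are created when a pool is imported, a loader tunable plus a reboot, or an export/import, is likely needed for the change to take effect):

Code:
# /boot/loader.conf, or a "loader"-type tunable in the CORE GUI
# Run ZIO worker threads on 36% of online CPUs instead of the default 80%
vfs.zfs.zio.taskq_batch_pct="36"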
 