can't maintain ssh connection to perform large zfs send

Joined
Oct 22, 2019
Messages
3,641
Surely (de)compression can't be that CPU-intensive?
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
If both systems are on a trusted network, you might want to look at using netcat instead of ssh.

It works reliably and quickly. I have used it many times.
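Roughly, the pattern looks like this (just a sketch; the host name, port, and dataset names here are placeholders, not anything from this thread):

Code:
# On the receiving machine: listen on a port and feed the stream into zfs recv
nc -l 8023 | zfs recv -F backup/tank

# On the sending machine: pipe the send stream straight to that port
zfs send -R tank@migrate | nc receiver-host 8023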

Here is a link to a nice post on it:

 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
I would have thought I'd be more IO-bound than CPU-bound. But I'd at least not expect the kernel to allow the rest of the system to be crushed.
SSH encrypts the traffic, netcat doesn't, and that encryption makes the bigger difference to performance. Compression is actually a net benefit.
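If you do stick with ssh, one hedged example of putting a cheap compressor in the pipe (zstd and the dataset names are assumptions on my part, not something from this thread):

Code:
# Compress the stream before it hits ssh, decompress on the far side
zfs send tank/data@snap | zstd -3 | ssh user@backuphost "zstd -d | zfs recv -F backup/data"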
 

seanm

Guru
Joined
Jun 11, 2018
Messages
570
Chris, your replies make me think you think I'm copying over the network, but, to be clear: all the drives are in the same computer.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Chris, your replies make me think you think I'm copying over the network, but, to be clear: all the drives are in the same computer.
I may not have read all the details before responding. Still, it could be useful for you at some point.
Now that I have read more of the posts, it looks like you have it solved, except that CPU utilization goes to 100%.
 

seanm

Guru
Joined
Jun 11, 2018
Messages
570
I may not have read all the details before responding. Still, it could be useful for you at some point.
Now that I have read more of the posts, it looks like you have it solved, except that CPU utilization goes to 100%.
The thread topic is indeed obsolete.

The ssh part is solved, but I still can't move this data because of the high CPU utilization bringing the system to its knees and taking down services that must keep running.

Next weekend I'll have another shot at it...
 

freqlabs

iXsystems
iXsystems
Joined
Jul 18, 2019
Messages
50
Something you could try is putting mbuffer between the send and recv sides as a rate limiter, along the lines of
Code:
zfs send ... | mbuffer -R 100M | zfs recv ...
where 100M means 100 MBytes/sec. You'll probably need to tune it up or down depending on the rate the send normally runs at (which you can check by running mbuffer with no options).
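For a local pool-to-pool copy that might look something like this (pool, dataset, and snapshot names are placeholders; -m just gives mbuffer some RAM to smooth out bursts):

Code:
# Rough sketch: cap a local replication at ~100 MB/s with a 1 GiB buffer
zfs send -R tank/data@migrate | mbuffer -m 1G -R 100M | zfs recv -F backup/data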
 

freqlabs

iXsystems
iXsystems
Joined
Jul 18, 2019
Messages
50
The solution was to limit the ZIO thread count during the replication by tuning vfs.zfs.zio.taskq_batch_pct down to (in this case) 36% instead of the default 80%. This is necessary because each pool has its own group of threads for I/O, so the CPU limit applies to each pool separately. With multiple pools being stressed simultaneously, and each CPU-bound by slow compression algorithms, this easily saturates the CPU.

It is non-trivial to make ZFS automatically share CPU resources between multiple pools more intelligently, so this may not be implemented any time soon. For now, the manual tuning is available and effective. From zfs(4):

Code:
     zio_taskq_batch_pct=80% (uint)
             Percentage of online CPUs which will run a worker thread for I/O.
             These workers are responsible for I/O work such as compression
             and checksum calculations.  Fractional number of CPUs will be
             rounded down.

             The default value of 80% was chosen to avoid using all CPUs which
             can result in latency issues and inconsistent application
             performance, especially when slower compression and/or
             checksumming is enabled.

     zio_taskq_batch_tpq=0 (uint)
             Number of worker threads per taskq.  Lower values improve I/O
             ordering and CPU utilization, while higher reduces lock
             contention.

             If 0, generate a system-dependent value close to 6 threads per
             taskq.


On CORE these correspond to vfs.zfs.zio.taskq_batch_pct and vfs.zfs.zio.taskq_batch_tpq.
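As a rough sketch of how that could be applied on CORE (the 36% figure is the value used in this thread; since the ZIO taskqs are created when a pool is imported, a loader tunable plus a reboot, or an export/import, is likely needed for the change to take effect):

Code:
# /boot/loader.conf, or a "loader"-type tunable in the CORE GUI
# Run ZIO worker threads on 36% of online CPUs instead of the default 80%
vfs.zfs.zio.taskq_batch_pct="36"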
 