Slow replication via SSH. High CPU usage despite AES-NI

s851

Dabbler
Joined
Jan 19, 2021
Messages
12
Hi,

I'm currently running into performance issues during replication of datasets between two hyperconverged truenas boxes. I've already checked the usual topics. Here's a quick summary:
Source machine:
  • Latest Truenas running within latest ESXi 7.0
  • E5-1650 v4, 2 vCPUs for Truenas
  • Supermicro X10SRH-CLN4F, SAS3-controller passed through to TrueNAS via vt-d
  • SAS3-EL1 Backplane
  • 64 GB DDR4 ECC RAM, 32GB reserved for TrueNAS
  • Boot-Datastore running from Intel M.2 SSD
  • 6x 8TB Seagate IronWolf in Raid-z2, 1 vdev
  • Intel x520-da2 NIC connected via DAC and 10gbe Switch
Destination machine:
  • Latest Truenas running within latest ESXi 7.0
  • C3758, 2 vCPUs for Truenas
  • Supermicro A2SDi-8C+-HLN4F, SATA-controllers passed through to TrueNAS via vt-d
  • SATA3-Passthrough Backplane
  • 8GB DDR4 ECC RAM, 4GB reserved for TrueNAS
  • Boot-Datastore running from Intel M.2 SSD
  • 5x 8TB Seagate IronWolf in Raid-z1, 1 vdev
  • Intel x520-da1 NIC connected via DAC and 10gbe Switch
iperf3 confirms 10 Gbit/s in both directions with LRO and TSO enabled on both machines. SMB can also reach 10 Gbit/s from both machines to a Win10 workstation. MTU is at its default of 1500. Unfortunately, neither the ESXi layer nor the network in between is under my control. The network is shared, so I'm stuck with using encryption. At the moment, TrueNAS is the only VM running on the destination machine, but that is subject to change in the future.

Now to the problem: When I start a replication task using SSH, the transfer speed is limited to ~61 MiB/s. According to htop, CPU usage on the source machine is below 30% with several active users. On the backup side, ssh is using 95% of one core while the second core is idle. What confuses me is that the CPU supports Intel QuickAssist, or rather AES-NI, which is also detected within TrueNAS: "dmesg | grep aes" reports "aesni0: <AES-CBC,AES-CCM,AES-GCM,AES-ICM,AES-XTS,SHA-1,SHA-256> on motherboard". Cipher settings were left at their defaults during setup of the ssh connection.
Which cipher suite is used by TrueNAS? Are there any possibilities to accelerate the transfer?
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399

s851

Dabbler
Joined
Jan 19, 2021
Messages
12
Thank you for your feedback. I will add the "auxiliary parameters" line after the current replication finishes and try again.

I've set up the current ssh connection using "Tasks > Replication Tasks > Wizard > Add", set "Destination Location" to "different system" and "SSH connection" to "Create new" where I left everything at default. Cipher is "Standard (secure)". Under "System > SSH Connections", I cannot find a field to add custom ssh "-c"-parameters. Just "Cipher", which is already set to "Standard". Where can I set them?

Using ssh -v revealed that chacha20... is indeed used instead of aes256-gcm. Is there any reason to prefer chacha20 over aes-gcm even when AES-NI is available? It seems a bit odd that TrueNAS defaults to not using hardware-accelerated ciphers.
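
For anyone who wants to repeat this check, here is a minimal sketch that pulls the negotiated cipher out of the ssh -v debug output. It assumes OpenSSH's usual "debug1: kex: server->client cipher: ..." line format. The function name negotiated_cipher and the host name backup-host are placeholders, not anything TrueNAS provides.

```shell
# Print the cipher negotiated for the server->client direction,
# reading `ssh -v` debug output from stdin.
# Assumes OpenSSH's "debug1: kex: server->client cipher: NAME MAC: ..." format.
negotiated_cipher() {
    sed -n 's/^debug1: kex: server->client cipher: \([^ ]*\).*/\1/p' | head -n 1
}

# Usage (hypothetical host name; ssh -v writes its debug lines to stderr):
#   ssh -v root@backup-host true 2>&1 | negotiated_cipher
```

This only shows what an interactive ssh session negotiates. Whether the replication task ends up with the same cipher is a separate question.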
 

Kris Moore

SVP of Engineering
Administrator
Moderator
iXsystems
Joined
Nov 12, 2015
Messages
1,471
We tend to default to the strongest ciphers that both ends support. However, strongest doesn't mean fastest :)

If your network connection between systems is already secure you can disable ssh entirely (switch to netcat) to really push speeds further. I do this and run over a wireguard tunnel to reach my replication targets, no real performance penalties that way, beyond what the network can handle.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
You can also change the auxiliary options to just the AES ciphers. The GUI otherwise doesn't allow setting a specific set of ciphers in the SSH connection setup.
 

s851

Dabbler
Joined
Jan 19, 2021
Messages
12
Thanks for your help. I will set the auxiliary options to aes and give it a try. If it doesn't work, I will report back.

As far as I know, aes-gcm is not less secure than chacha20 with regard to cache/side-channel attacks as long as you run it hardware-accelerated.
Two additional thoughts:
  • In my experience, settings via custom parameters tend to break things down the road. If TrueNAS continues to deliver great improvements and new features in future releases, they are often negated by manually set parameters that are long forgotten and don't get updated automatically during maintenance/upgrades.
  • Why are best practices not configured by default? E.g., why don't you expose the cipher suite used by "Standard" in the tooltip or set the recommended auxiliary parameters as a preset out of the box? Why does the built-in replication require root access via ssh if it's against your own best practices?
tl;dr:
Thank you again for your quick help.
 

s851

Dabbler
Joined
Jan 19, 2021
Messages
12
I've tried updating the ssh settings on the source machine using "Ciphers aes128-ctr,aes192-ctr,aes256-ctr,aes128-gcm@openssh.com,aes256-gcm@openssh.com". Replication is set to pull on the destination machine. Unfortunately, when I test the new connection via the GUI shell from the destination machine, ssh defaults to the weakest of these, aes128-ctr, no matter in which order I enter the ciphers. Additionally, I cannot use "Ciphers aes256-gcm@openssh.com" on its own, as I'm using PuTTY in parallel to manage my machines and PuTTY doesn't support gcm. According to sshd_config (freebsd.org), the order of the ciphers in the parameters doesn't affect their preference. Is there any way to prioritize gcm and fall back to ctr if gcm is not supported by the client?
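
For reference, as far as the SSH transport protocol (RFC 4253) goes, the cipher chosen is the first entry in the client's list that the server also offers. That would explain why reordering the server-side list changes nothing. Below is a small sketch of that rule only. The function name pick_cipher is made up for illustration and this is not how OpenSSH implements it internally.

```shell
# pick_cipher CLIENT_LIST SERVER_LIST
# Both arguments are comma-separated cipher names, as in ssh_config/sshd_config.
# Prints the first client cipher that also appears in the server list;
# returns 1 if the two lists have nothing in common.
pick_cipher() {
    local client_list="$1" server_list="$2" cipher
    local IFS=,
    for cipher in $client_list; do
        case ",$server_list," in
            *",$cipher,"*) printf '%s\n' "$cipher"; return 0 ;;
        esac
    done
    return 1
}

# The server-side order is irrelevant; only the client's order decides:
#   pick_cipher "aes256-gcm@openssh.com,aes256-ctr" "aes128-ctr,aes256-ctr,aes256-gcm@openssh.com"
#   pick_cipher "aes128-ctr,aes256-ctr"             "aes256-gcm@openssh.com,aes256-ctr,aes128-ctr"
```

Under that rule, a client that lists gcm first (for example via ssh -c or a Ciphers line in its own ssh_config) would get gcm, while a client without gcm support would fall back to whichever ctr cipher it lists first.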
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
There's a hidden setting available via the API call midclt call ssh.update '{"weak_ciphers": ["cipher1","cipher2",etc.]}'. However, the only values the API call will currently accept are "NONE" and "AES128-CBC". I think you're going to have to open a feature request.
 

s851

Dabbler
Joined
Jan 19, 2021
Messages
12
Using "Ciphers aes256-gcm@openssh.com" on the source machine, the PULL replication task throws an error message:
"Replication "..." failed: Incompatible ssh server (no acceptable ciphers)..". It seems that ssh from the shell is using a different cipher suite than ssh for replication. Using "Ciphers aes256-gcm@openssh.com,aes256-ctr", the replication process runs and CPU utilization drops to 35%. Unfortunately, my replication speeds are still poor at ~80 MiB/s.

Which ciphers are hidden behind the option "Standard (secure)"? Is there any way beside wireshark to check which cipher the replication is using in the end? Any other ideas?
I will create some random test data and use ssh+netcat in the meantime to see if that results in better speeds. Unfortunately, that is not an option for production since the network is shared.

Side question: If the source pool is encrypted using ZFS native encryption, is the data also encrypted during transport with netcat? I guess not, but a confirmation would be really useful.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399

s851

Dabbler
Joined
Jan 19, 2021
Messages
12
According to iperf3, I had already achieved line rate just by enabling TSO and LRO before my initial post. Unfortunately, my problem still persists. Even with aes256-gcm@openssh.com as the only cipher on the source system and 35% CPU load, I can't get past 80 MiB/s replication speed. According to iozone -i 0 -i 1 -+w 1 -+y 1 -+C 1, I should be able to get write speeds above 1 GiB/s. Any ideas how I can find the bottleneck?
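
One approach that might help narrow it down is to time each stage of the pipeline on its own (zfs send alone, ssh alone) and see which one tops out near 80 MiB/s. A rough sketch follows. The helper names mib_per_sec and stage_rate are made up, and pool/dataset@snap and backup-host in the examples are placeholders.

```shell
# mib_per_sec BYTES SECONDS -> throughput in MiB/s, one decimal place.
mib_per_sec() {
    awk -v b="$1" -v s="$2" 'BEGIN { printf "%.1f\n", b / 1048576 / s }'
}

# stage_rate CMD [ARGS...]
# Runs the command, counts the bytes it writes to stdout and prints MiB/s.
# Whole-second resolution only, so let each stage run for a while.
stage_rate() {
    local start end bytes elapsed
    start=$(date +%s)
    bytes=$("$@" | wc -c)
    end=$(date +%s)
    elapsed=$((end - start))
    [ "$elapsed" -gt 0 ] || elapsed=1   # avoid dividing by zero on very short runs
    mib_per_sec "$bytes" "$elapsed"
}

# Examples (placeholders, adjust to the real dataset and host):
#   stage_rate zfs send pool/dataset@snap        # read side only, no network
#   stage_rate ssh -c aes256-gcm@openssh.com backup-host 'head -c 4294967296 /dev/zero'   # ssh only, no disks
```

If zfs send alone is already slow, the cipher is not the limiting factor. If the ssh-only run is the slow one, the encryption path is still the place to look.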
 