Suggestions on how to debug 'broken pipe' SSH error?

Fortinbras

Dabbler
Joined: May 14, 2019
Messages: 10
Hi, I've got an HP MicroServer N40L with 4 x 2TB drives set up as a mirrored 4TB pool; the box has built-in gigabit Ethernet, 16 GB of ECC RAM, and a pretty modest AMD Turion CPU. The boot disk is a dedicated SATA SSD in the optical drive bay. I've been trying to use it as a replication target (zfs receive) for a 369 GB dataset currently sourced on a macOS Mojave box (the Mac is running OpenZFS on OSX by Joergen Lundgren). The source dataset is ZFS natively encrypted, and I'm sending it in raw form with zfs send -w. This doesn't seem to be taxing the AMD Turion too greatly, since it doesn't have to process the encryption; the two cores seem to run at 60 to 70 percent utilization, with the dashboard reporting Ethernet traffic holding a little below 400 Mbit/s.
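For completeness, here is roughly how I sanity-check the source side before kicking off a raw send (these are just standard zfs property queries; the dataset name is the one from my pool, so adjust to taste):

# on the Mojave box, as root: confirm the dataset really is natively encrypted
zfs get -r encryption,keyformat,keystatus DUAL/ENCRYPTED/SHOME_BACKUP
# confirm the snapshot I'm about to send actually exists
zfs list -t snapshot -r DUAL/ENCRYPTED/SHOME_BACKUP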

Here is a typical command (and resulting broken pipe failure) issued on the Mojave box as root (ssh is using PKI):

sh-3.2# date ; zfs send -w DUAL/ENCRYPTED/SHOME_BACKUP@2013_12_user_data_SL | ssh -o IPQoS=throughput adminkurt@10.0.1.220 zfs receive N40L/REMOTELY_ENCRYPTED/HOME_BACKUP ; date
Sat Jan 2 14:59:35 PST 2021
packet_write_wait: Connection to 10.0.1.220 port 22: Broken pipe
warning: cannot send 'DUAL/ENCRYPTED/SHOME_BACKUP@2013_12_user_data_SL': signal received
Sat Jan 2 15:20:21 PST 2021
sh-3.2#

I've tried this nearly a dozen times, with various datasets, incremental snapshots and initial snapshots, with and without the dashboard open, etc. It always seems to fail after about 20 minutes. To try to debug it a bit, I ran a ping against the TrueNAS box concurrently with the zfs send through SSH; the pings were completely normal, with sub-1-millisecond responses, until, at around the 1200th ping (20 minutes!), the TrueNAS box became unresponsive for several minutes: first request timeouts, then "ping: sendto: Host is down", then some request timeouts again, then host down again, and finally, after 261 seconds, normal pings resumed. To me that looks like a crash, followed by a reboot.
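In case anyone wants to reproduce the monitoring, the concurrent ping was nothing fancier than something like this, run in a second terminal on the Mac (the log file name is arbitrary):

# timestamp each ping reply and keep a copy on disk
ping 10.0.1.220 | while read line; do
    echo "$(date '+%H:%M:%S') $line"
done | tee ~/truenas_ping.log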

I was wondering if this could be an out-of-memory problem; I noticed in the dashboard that free memory had decreased to 0.5 GB after a while. I tried limiting the ZFS L2ARC to 4 GB; this did seem to reduce the purple ZFS Cache sector to about a quarter of the total 16 GB, but the golden Services sector now occupies 11.5 GB, with free memory unchanged at 0.5 GB.
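For the record, I applied the cap from the shell; as far as I know the knob that actually governs the dashboard's "ZFS Cache" sector (the main ARC) is the vfs.zfs.arc_max sysctl, so take this as a sketch rather than gospel (to persist across reboots it would go in as a sysctl tunable in the GUI):

# on the TrueNAS box, as root: cap the ARC at 4 GiB
sysctl vfs.zfs.arc_max=4294967296
# check the value currently in effect
sysctl vfs.zfs.arc_max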

So, I'm just wondering how to debug this further. While writing this I witnessed another crash, and this time I noticed a kind of sustained "scrubbing" noise just before the dashboard became unresponsive. That makes me wonder if I have an underlying disk hardware problem; I had turned on weekly SMART checks of the short variety and had not seen anything unusual reported, so I've just changed that task to a long test of all disks that will run tonight at midnight.
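(For anyone following along, the long self-tests can also be kicked off by hand rather than waiting for the scheduled task; the device names below are just what my four data disks show up as, ada0 through ada3, so adjust as needed.)

# on the TrueNAS box, as root: start a long SMART self-test on each data disk
for d in ada0 ada1 ada2 ada3; do smartctl -t long /dev/$d; done
# hours later, review the self-test log and overall health per disk
smartctl -a /dev/ada0 | less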

The funny thing is, this worked not too long ago. I had sent across a 686 GB dataset, and then another 300+ GB in snapshots using this same ssh command.

Anyway, I would appreciate any debugging tips anyone might offer. Thanks,

Kurt
 

Fortinbras

Dabbler
Joined: May 14, 2019
Messages: 10
The extended offline tests for all four 2TB disks completed without error. I should also mention that ZFS scrubs come back clean (as reported by zpool status). I guess my next steps are to:

1. try turning off SMB access, which my wife's MacBook uses to save Time Machine backups to a zvol.
2. try sending across a relatively small test dataset, in case it is some sort of temperature-triggered hardware failure
3. try sending across a non-encrypted dataset (a rough sketch of steps 2 and 3 follows this list)
4. blow away and rebuild the pool on the TrueNAS box
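What I have in mind for steps 2 and 3 is something like the following (the dataset name is made up, and it assumes the DUAL pool root itself is not encrypted, so a child created outside DUAL/ENCRYPTED ends up unencrypted):

# on the Mojave box, as root: create a small throwaway unencrypted dataset
zfs create DUAL/TESTPLAIN
# ~1 GB of random data; adjust the path to wherever the dataset mounts
dd if=/dev/urandom of=/Volumes/DUAL/TESTPLAIN/junk.bin bs=1m count=1024
zfs snapshot DUAL/TESTPLAIN@test1
# same pipeline as before, here without -w
zfs send DUAL/TESTPLAIN@test1 | ssh -o IPQoS=throughput adminkurt@10.0.1.220 zfs receive N40L/TESTPLAIN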

The OpenZFS on OSX code that I'm running is Joergen's last stable version (1.9.4), which is not yet based on the new OpenZFS 2.0 code base; it is based on the ZFS on Linux 0.8 source. I wonder if that is relevant?
 

Fortinbras

Dabbler
Joined: May 14, 2019
Messages: 10
Here is a dump header from the most recent crash, found in /data/crash/info.4:

Dump header from device: /dev/ada3p1
Architecture: amd64
Architecture Version: 4
Dump Length: 670208
Blocksize: 512
Compression: none
Dumptime: Sun Jan 3 09:35:47 2021
Hostname: truenas.local
Magic: FreeBSD Text Dump
Version String: FreeBSD 12.2-RELEASE-p2 663e6b09467(HEAD) TRUENAS
Panic String: VERIFY3(0 == spa_do_crypt_abd(B_TRUE, spa, &zio->io_bookmark, BP_GET_TYPE(bp), BP_GET_DEDUP(bp), BP_SHOULD_BYTESWAP(bp), salt, iv, mac, psize, zio->io_abd, eabd, &no_crypt))
Dump Parity: 461044030
Bounds: 4
Dump Status: good


I had been thinking that this had to be a hardware problem; now I'm not so sure. I guess I'll try sending over a large unencrypted dataset and see what happens.
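For anyone who wants to poke at one of these dumps themselves: the header above came out of /data/crash, and as far as I can tell these are the standard FreeBSD text dumps, so something along these lines pulls out the interesting bits (the .4 suffix and member names are just what I see on my box):

# on the TrueNAS box, as root
ls -l /data/crash
cat /data/crash/info.4                       # the header shown above
tar -tvf /data/crash/textdump.tar.4          # list what the matching text dump contains
tar -xOf /data/crash/textdump.tar.4 panic.txt msgbuf.txt | less   # panic string plus kernel message buffer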

Edit: when I do a search for "freebsd panic spa_do_crypt_abd" it comes up with a reference to a bug-tracker entry at jira.ixsystems.com, but I can't seem to open the ticket NAS-107636 itself:

iX Bug Tracker (Jira)
https://jira.ixsystems.com/projects/NAS/issues/NAS-108627?filter=allopenissues
NAS-107636 Kernel panic in spa_do_crypt_abd() during replication (12.0-BETA2.1 -> 12.0-RC1 regression)
 

Fortinbras

Dabbler
Joined: May 14, 2019
Messages: 10
Well, after some more testing I have to conclude that it is a problem with encryption. The root cause could be on the sending end or on the TrueNAS receiving end. I was able to send several of the exact same 250-300 GB datasets and snapshots over to the TrueNAS box, so long as I didn't send them with the -w (raw) flag.
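Concretely, what works is the same pipeline with the raw flag dropped. My understanding is that a non-raw send needs the key loaded on the sending side and decrypts the stream as it goes (SSH still protects it on the wire), and whether it lands encrypted on the N40L then depends on the target dataset's own encryption settings, so treat this as a sketch:

# on the Mojave box, as root: non-raw send of the same snapshot
zfs send DUAL/ENCRYPTED/SHOME_BACKUP@2013_12_user_data_SL | ssh -o IPQoS=throughput adminkurt@10.0.1.220 zfs receive N40L/REMOTELY_ENCRYPTED/HOME_BACKUP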
 

Kris Heslop

Dabbler
Joined: Feb 7, 2016
Messages: 42
Thank you for sharing your journey and discovery. It appears you were on your own here. While it doesn't answer my question, it was good to read and added to my understanding of possible causes.
 