Fortinbras
Dabbler
- Joined: May 14, 2019
- Messages: 10
Hi, I've got an HP MicroServer N40L with 4 x 2TB drives set up as a mirrored 4TB pool; the box has built-in gigabit Ethernet, 16 GB of ECC RAM, and a fairly modest AMD Turion CPU. The boot disk is a dedicated SATA SSD in the optical drive bay. I've been trying to use it as a replication target (zfs receive) for a 369 GB dataset currently sourced on a macOS Mojave box (the Mac is running OpenZFS on OS X by Jörgen Lundman). The source dataset is natively ZFS-encrypted, and I'm sending it in raw form with zfs send -w. This doesn't seem to be taxing the Turion too greatly, since it doesn't have to process the encryption; the two cores run at 60 to 70 percent utilization, and the dashboard reports Ethernet traffic at a little below 400 Mbit/s.
Here is a typical command (and resulting broken pipe failure) issued on the Mojave box as root (ssh is using PKI):
sh-3.2# date ; zfs send -w DUAL/ENCRYPTED/SHOME_BACKUP@2013_12_user_data_SL | ssh -o IPQoS=throughput adminkurt@10.0.1.220 zfs receive N40L/REMOTELY_ENCRYPTED/HOME_BACKUP ; date
Sat Jan 2 14:59:35 PST 2021
packet_write_wait: Connection to 10.0.1.220 port 22: Broken pipe
warning: cannot send 'DUAL/ENCRYPTED/SHOME_BACKUP@2013_12_user_data_SL': signal received
Sat Jan 2 15:20:21 PST 2021
sh-3.2#
I've tried this nearly a dozen times, with various datasets, incremental snapshots and initial snapshots, while monitoring the dashboard or not, etc. It always seems to fail after about 20 minutes. To debug it a bit, I ran a ping against the TrueNAS box concurrently with the zfs send over SSH; the pings were completely normal, with sub-millisecond responses, until around the 1200th ping (20 minutes!), when the TrueNAS box became unresponsive for several minutes: first request timeouts, then "ping: sendto: Host is down", then some more request timeouts, then host down again, and finally, after 261 seconds, normal pings resumed. To me that looks like a crash followed by a reboot.
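In case it helps anyone reproduce this, the concurrent ping was just a timestamped loop along these lines, run in a second terminal on the Mac (10.0.1.220 is the N40L):

```shell
# Timestamp each ping reply so any dropout can be lined up against
# the zfs send's timeline. Run alongside the send in another shell.
ping 10.0.1.220 | while IFS= read -r line; do
    printf '%s %s\n' "$(date '+%H:%M:%S')" "$line"
done
```

That's how I could see the exact minute the box stopped answering.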
I was wondering if this could be an out-of-memory problem; I noticed in the dashboard that free memory had decreased to 0.5 GB after a while. I tried limiting the ZFS ARC to 4GB; this did seem to reduce the purple ZFS Cache sector to one quarter of the total 16 GB, but the golden Services sector now occupies 11.5 GB, with free memory unchanged at 0.5 GB.
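For reference, I applied that cap as a loader tunable (System → Tunables in the TrueNAS UI, variable vfs.zfs.arc_max); the value is a byte count, so 4 GB works out as:

```shell
# vfs.zfs.arc_max takes bytes; 4 GiB is:
echo $((4 * 1024 * 1024 * 1024))    # prints 4294967296
```

I'm going from memory on the exact variable name, so take that part with a grain of salt.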
So, I'm just wondering how to debug this further. While writing this I witnessed another crash, and this time I noticed a sustained "scrubbing" noise just before the dashboard became unresponsive. That makes me wonder if I have an underlying disk hardware problem; I had weekly short SMART tests enabled and hadn't seen anything unusual reported by them, so I've now changed that task to a LONG test of all disks, which will run tonight at midnight.
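If anyone wants to run the same thing by hand rather than waiting for the scheduled task, it amounts to something like this from a shell on the NAS (the ada0..ada3 device names are my guess at FreeBSD-style names; substitute the real ones from your system):

```shell
# Queue a long SMART self-test on each data disk; the test runs on the
# drive itself in the background (a 2TB disk typically takes a few hours).
for disk in ada0 ada1 ada2 ada3; do
    smartctl -t long /dev/"$disk"
done

# Once the tests have had time to finish, read back the self-test logs:
for disk in ada0 ada1 ada2 ada3; do
    smartctl -l selftest /dev/"$disk"
done
```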
The funny thing is, this worked not long ago: I had sent across a 686 GB dataset, and then another 300+ GB of snapshots, using this same ssh command.
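For completeness, those earlier incremental sends took roughly this shape (the snapshot names below are placeholders, not the real ones):

```shell
# Raw (-w) incremental send from an earlier snapshot to a later one,
# piped over ssh to the N40L, same as the full send shown above.
zfs send -w -i DUAL/ENCRYPTED/SHOME_BACKUP@earlier_snap \
    DUAL/ENCRYPTED/SHOME_BACKUP@later_snap \
  | ssh -o IPQoS=throughput adminkurt@10.0.1.220 \
      zfs receive N40L/REMOTELY_ENCRYPTED/HOME_BACKUP
```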
Anyway, I would appreciate any debugging tips anyone might offer. Thanks,
Kurt