Replication from scratch failing after 4.19 TiB

Rednalreden

Cadet
Joined
Apr 28, 2020
Messages
1
Hi all,

Thanks for taking the time to read my question.

I've been running a FreeNAS server for over 4 years now with the following specs:
Main NAS (was running 11.1-U7; in an attempt to fix my issue it is now running 11.3-U2)
Fractal Design Define R5
Supermicro mATX S1151 X11SSL-CF
Intel Xeon E3-1225v5 (4c/4t @ 3.3GHz)
Samsung ECC UDIMM DDR4-2400 64GB
Samsung 850 Pro 256GB SATA6G 2.5" 7mm
Samsung 850 Pro 512GB SATA6G 2.5" 7mm
7x Seagate IronWolf 8TB SATA6G 7200rpm 3.5" in Raid-Z2 with a hot spare
Intel X550-T2 NIC

It uses the following (relevant) datasets:
data/Algemeen (13.52 TiB used)
data/Gitlab (zvol 5.36 TiB used)
data/Simon (4.13 TiB used)

Recently the main NAS has had some reliability issues (I suspect the onboard HBA, since all the errors come from disks attached to it), so I decided to build a local backup server from hardware I had available, with the following specs:
Backup NAS (running 11.3-U2)
Fractal Design Define R5
ASRock x570 Phantom Gaming 4
AMD 3600 (8c/16T)
64GB non-ecc memory
Intel X550-T2 NIC
2x Kingston A2000 nvme ssd in mirror
5x Seagate IronWolf 8TB SATA6G 7200rpm 3.5" in Raid-Z (28 TiB usable)

Network situation
Both of the machines are running on a 10 Gbit connection to a switch and I decided to run a dedicated cable from the Main nas to the Backup nas to not slow down the other traffic when replicating large quantities of data. The dedicated NIC port has a static IP configured on both machines and I'm using that connection to replicate.
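As a sanity check on the dedicated link itself, raw throughput between the two static IPs can be measured with iperf3. This is only a sketch: the addresses below are made up (substitute the static IPs actually configured), and the commands are assembled and echoed so they can be reviewed before running.

```shell
# Hypothetical static addresses on the dedicated link; substitute your own.
MAIN_IP="192.168.2.1"
BACKUP_IP="192.168.2.2"

# On the backup NAS, start an iperf3 server bound to the dedicated interface:
CMD_SERVER="iperf3 -s -B $BACKUP_IP"
# On the main NAS, run a 30-second throughput test against it:
CMD_CLIENT="iperf3 -c $BACKUP_IP -B $MAIN_IP -t 30"

echo "$CMD_SERVER"
echo "$CMD_CLIENT"
```

If the link sustains near 10 Gbit for the full test, the cable and NICs are unlikely to be the cause of a failure that always stops at the same byte count.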

Replication Task - first attempt
So after getting the backup NAS up and running, I created an SSH connection from the backup NAS (running 11.3-U2) to my main NAS (running 11.1-U7 at the time) and tried a PULL of data/Algemeen (13.5 TiB used) into the newly created data/Algemeen dataset on the backup NAS. Unfortunately this ran for a few hours and then failed with the following error:
Input/output error
cannot receive resume stream: checksum mismatch or incomplete stream.
Partially received snapshot is saved.
A resuming stream can be generated on the sending system by running:
zfs send -t 1-d97ba723e-f8-789c636064000310a500c4ec50360710e72765a526973030fc568c07abc1904f4b2b4e2d01c930c1e5d990e4932a4b528b81b441abfb2d05164cfd25f9e9a599290c0cbb5e6adc3d147be4a207923c27583e2f313795812125b12451df31273d35373535cf213731af3431473711cad735323032d03530d135b2883734d7353485d8c3cd80f057727e6e41516a71717e36031c0000a47d2918.

Running the zfs send command from the main NAS did not have a positive outcome either; the replication remained in a failed state.
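For reference, a minimal sketch of what resuming by hand looks like. The receiver address and the root@ login are assumptions, and the token below is truncated; the full token printed in the error message has to be used. The pipeline is echoed rather than executed so it can be checked first.

```shell
# Truncated resume token; paste the full token from the error message.
TOKEN="1-d97ba723e-f8-789c..."
BACKUP_IP="192.168.2.2"   # hypothetical dedicated-link address

# On the main (sending) NAS: resume the interrupted stream with `zfs send -t`
# and pipe it into a resumable receive (-s) on the backup NAS.
CMD="zfs send -t $TOKEN | ssh root@$BACKUP_IP 'zfs receive -s data/Algemeen'"
echo "$CMD"
```

The -s flag on the receive side is what keeps a partially received snapshot around and generates a new resume token if the transfer is interrupted again.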

Thinking maybe it was a fluke (this being the first time I'm using replication), I retried a few times but kept getting the error after a couple of hours each time.

Replication Task - second attempt
Looking around for people with the same issue, I stumbled on a post saying they were having replication problems after upgrading to FreeNAS 11.3. At this point I decided to update my main NAS to FreeNAS 11.3-U2 and try a PUSH replication task instead. I was hopeful that this would resolve the issue, and that if I ran into it again the zfs send command might yield better results. Unfortunately this wasn't the case: it yielded the exact same error. I again tried replicating the data/Algemeen dataset, and after a few hours it failed once more, with just over 4.19 TiB used on the remote disks.

This was the moment I noticed it was failing at the same point, 4.19 TiB, and I looked further into the issue. I read a post suggesting that snapshots from an older FreeNAS version might cause problems on 11.3. So I created a brand-new manual snapshot, recreated the replication task to replicate only that snapshot, and created a second dataset on the backup NAS so I could compare. After running the task it failed again at 4.19 TiB.
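The manual-snapshot experiment above can be sketched as follows. The snapshot name, receiver address, and target dataset name are made up for illustration; the commands are echoed for review rather than run.

```shell
SNAP="data/Algemeen@manual-test"   # hypothetical snapshot name
BACKUP_IP="192.168.2.2"            # hypothetical dedicated-link address

# On the main NAS: take a fresh snapshot, then send only that snapshot
# (a full, non-incremental stream) into a second dataset on the backup NAS.
CMD_SNAP="zfs snapshot $SNAP"
CMD_SEND="zfs send $SNAP | ssh root@$BACKUP_IP 'zfs receive -s data/Algemeen2'"

echo "$CMD_SNAP"
echo "$CMD_SEND"
```

Sending a freshly created snapshot this way rules out any history carried over from snapshots taken under the older FreeNAS version.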

Since I also have a slightly smaller dataset of 4.13 TiB (data/Simon), I tried replicating that one as well. It transferred without any issues, so the problem only seems to happen with larger datasets. I'm running out of ideas at this point. Does anybody have experience with replicating larger datasets and recognize this issue?

Thanks again for taking the time to read my issue.
Kind regards,
Maurice
 

Attachments

  • 9663.txt (4.8 KB)