First replication task always fails at around 66% of 4.6TB

fritoss007

Cadet
Joined
Jan 26, 2019
Messages
6
Hi, I'm trying to replace our current FreeNAS, which was built on a hardware RAID controller, with a new system that complies (I think) with all the best practices I know. In order to swap out our system I need to move the data around... I've been able to use snapshot replication for our smaller datasets, but our big one (around 14TB, filled to around 4.6TB) won't replicate past 66%; it has failed the last 5 times I tried. What am I missing?

old system on FreeNAS 9.10.2
new system on FreeNAS 11.2-RELEASE

Is it possible to do the first snapshot replication manually with some kind of "resume" function? Reaching 66% of 4.6TB over gigabit Ethernet takes a few hours each time I try something different... not to mention I can only do this at full gigabit over the weekend while the office is closed, otherwise the disk usage goes wild on that old hardware RAID system.

Thanks in advance for any help!
 

dlavigne

Guest
Resumable replication isn't coming until 12...

On the 9.10.2 system, is Fast selected for the Encryption Cipher? Also, trying plzip for the Compression will reduce the size to its smallest (but will also be slower, so a bit of a gamble).
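
If you end up doing the send by hand instead of through the UI, roughly the same idea applies by compressing the stream in the pipe. A minimal sketch only, assuming plzip is available on both boxes and using placeholder pool/dataset names and IP:
Code:
# compress the stream on the sender, decompress on the receiver
zfs send -v tank/bigset@snap1 | plzip | ssh 10.0.0.2 "plzip -d | zfs recv -F backup/bigset"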
 

saurav

Contributor
Joined
Jul 29, 2012
Messages
139
When this happened to me once, I found that I had quotas set on the destination dataset. The pool had enough space, but the quota on the destination was smaller than the source dataset's size. Relaxing the quota restriction fixed the issue for me.
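
If you want to rule that out on your setup, checking the space-related properties on the destination dataset from the CLI should show it; a quick sketch with a placeholder dataset name:
Code:
# a quota/refquota/reservation smaller than the source's used size can make the receive fail mid-stream
zfs get quota,refquota,reservation,refreservation,available backup/bigset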
 

fritoss007

Cadet
Joined
Jan 26, 2019
Messages
6
Thanks saurav, this was almost a good call :) but I have no quotas on the remote FreeNAS whatsoever.

Are there any logs available to see what's going on around that famous 66% of my replication? Maybe I could find an exit code with errors?

When I use "zfs send / zfs recv" to manually replicate a snapshot to the remote FreeNAS, it goes all the way to the end... so size should not be an issue, I guess?

PS. Maybe I should change my title... because now it's more like 4.8TB :) :)
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
Resuming replication has been available since FreeNAS 10 or 11. It requires the use of a resume token and must be done over the CLI.
It is not supported by FreeNAS 9, however.
Are you doing recursive replication?
Use the -vv option on both send and receive to get output in the CLI.
Make sure you use "screen" in the CLI to run your replication.
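
A rough sketch of what that looks like, assuming both ends support resume tokens and using placeholder dataset names and IP:
Code:
# run the transfer inside screen so it survives a dropped SSH session
screen -S repl

# initial attempt: -s on the receive keeps the partial state if the stream dies
zfs send -v tank/bigset@snap1 | ssh 10.0.0.2 "zfs recv -s -F -v backup/bigset"

# after an interruption, read the token on the DESTINATION box...
zfs get -H -o value receive_resume_token backup/bigset

# ...then resume from the SOURCE box with that token
zfs send -v -t <token> | ssh 10.0.0.2 "zfs recv -s -v backup/bigset"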
 
Joined
Jul 3, 2015
Messages
926
I second what @saurav says. In my experience, the only time replication fails is when the destination doesn't have enough space, be it because of quotas or simply not enough room on the destination pool. Sometimes you also find incremental snapshots have run out of sync and therefore fail, however FreeNAS is pretty good at automatically sorting this out.

How many snapshots are you trying to send? Could you disable the replication, delete the dataset and corresponding snapshots on the destination side, and then try again from scratch?
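
One quick way to see whether the two sides still share a common snapshot is to list them on both boxes and compare; a sketch with placeholder dataset names:
Code:
# run on the source and on the destination, then compare the snapshot names in common
zfs list -t snapshot -r -o name,used,creation tank/bigset
zfs list -t snapshot -r -o name,used,creation backup/bigset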
 

fritoss007

Cadet
Joined
Jan 26, 2019
Messages
6
Alright! Thanks guys... it might be a space issue... so @saurav was partly right; at least he (she) pointed me in the right direction. Yesterday evening I ran this command:
Code:
zfs send -R -P -v Stripe/users@auto-20190127.2334-2w | ssh 10.74.0.15 zfs recv -v -s -F RaidZ3/users


The destination RaidZ3/users already exists, and here is the output of that verbose (-v) command:
Code:
cannot receive new filesystem stream: out of space
warning: cannot send 'Stripe/users@auto-20190127.2334-2w': signal received



Well, now I need help on how the hell to decode the "Used/Available" columns in the "Volumes" tab... here is what I have on the source and on the receiving server:

Source:
[screenshot: Volumes tab used/available on the source]

Destination:
[screenshot: Volumes tab used/available on the destination]

And this is the snapshot I want to move around:
[screenshot: snapshot list]
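
The same numbers might be easier to read from the CLI than from the Volumes tab; something like this (run on each box, using the pool names from the command above) breaks "used" down into data, snapshots, reservations and children:
Code:
zfs list -o space -r Stripe
zfs list -o space -r RaidZ3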
 
Joined
Jul 3, 2015
Messages
926
I wonder if you are sending an incremental and the destination snapshot has expired, so it's having to send the data all over again, all 14.8TB of it. I would be inclined to delete the users dataset on the destination and start afresh.
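
If you go that route, the CLI version is a one-liner, but it is destructive, so double-check you are on the destination box and that the name is right:
Code:
# removes the dataset, its children and all of its snapshots on the DESTINATION
zfs destroy -r RaidZ3/users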
 
Joined
Jul 3, 2015
Messages
926
PS: why are you sending via the CLI and not the UI?
 
Joined
Jul 3, 2015
Messages
926
So your users dataset is 14.8TB in size, but your snapshot is referring to 4.7TB. Something's not quite right; it looks like your snapshots are out of sync. Start again.
 

fritoss007

Cadet
Joined
Jan 26, 2019
Messages
6
I'm sending from the CLI just to get some kind of feedback on what was going on while troubleshooting.

On this dataset, there is a "Mapped RAW LUN" drive from a virtual ESXi Windows 2012 R2 based file server pointing into it. The data size reported by this Windows server is exactly the same as the snapshot size (around 4.7TB), but the virtual raw disk and the partition on this server are 10TB. We shrank it from 14.3TB to 10TB just before Christmas, because the new server is on RAIDZ3 as opposed to HW RAID5 on the source FreeNAS, and we lose usable space going from HW RAID5 to RAIDZ3. The shrink was done following best practices... I think :)

It shows 14.4TB on the main screen but 10T in the dataset settings window:
[screenshot: dataset size in the UI]
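
It might also be worth looking at the source dataset's space properties from the CLI; referenced vs. used plus the snapshot and reservation numbers usually explain that kind of gap:
Code:
zfs get used,referenced,usedbysnapshots,refreservation,quota Stripe/users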
 

fritoss007

Cadet
Joined
Jan 26, 2019
Messages
6
Here's an update: it worked... well, kind of!

I deleted every sub-dataset and started the transfer manually again, and this morning it went well past the 3.7TB where it failed yesterday!

But sadly I had to stop the transfer this morning at 7:50... so I'm going to delete my dataset again and use the GUI to set up a snapshot replication this time, but with some throttling so I can run it 24/7... I'll let you know guys!
 