"out of space" error during replication

Status
Not open for further replies.

macmac1

Dabbler
Joined
Apr 9, 2014
Messages
17
Hello,

I've run into a strange problem with replication between two FreeNAS boxes (both running FreeNAS-9.2.1.5-RELEASE-x64 (80c1d35)).

I have zvol replication configured between two FreeNAS boxes. The zvol I replicate is used by VMware over iSCSI.

Everything was working just fine until, one day, a large amount of data was copied to the volume I replicate (it grew from roughly 150 GB to 1.5 TB within a day).

Snapshots continued to be taken, but replication seemed stuck at a recent snapshot.

I tried several things: interrupting the replication, disabling it, re-enabling it. Every time, the replication failed with an "out of space" error.

Finally, I cleared all snapshots from both the source and destination boxes.
Now I have only one snapshot at SOURCE. FreeNAS tries to replicate it, and after a few hours I get a message like this in syslog:

autorepl.py: [tools.autorepl:380] Replication of zfsVol1/zfsVolume1@auto-20140625.1506-1w failed with cannot receive new filesystem stream: out of space

The source volume has a capacity of 2 TB and about 1.8 TB is occupied.

I already tried to increase the target volume size with the command:

zfs set volsize=5T zfsVol1/zfsVolume1

I don't know whether this is enough to actually increase the volume size; in any case, it had no effect.
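
For reference, a check like the following should show whether the volsize change actually took effect and how much of the zvol's space usage comes from its reservation (these are standard ZFS properties; which of them matters for the "out of space" condition is only my guess):

# Show the zvol's nominal size and where its space accounting comes from.
# Run on both the SOURCE and DEST boxes; adjust the dataset name as needed.
zfs get volsize,refreservation,used,referenced,available zfsVol1/zfsVolume1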

After a while, the replication is retried and the same error happens over and over.

Here is some output from my configuration:

SOURCE box:

# zfs list -o space
NAME                                         AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
zfsVol1                                      1.76T  8.61T         0    337K              0      8.61T
zfsVol1/.system                              1.76T   117M         0    244K              0       117M
zfsVol1/.system/cores                        1.76T  7.83M         0   7.83M              0          0
zfsVol1/.system/samba4                       1.76T  6.44M         0   6.44M              0          0
zfsVol1/.system/syslog                       1.76T   102M         0    102M              0          0
zfsVol1/jails                                1.76T  5.43G         0    692K              0      5.43G
zfsVol1/jails/.warden-template-pluginjail    1.76T   805M      116K    805M              0          0
zfsVol1/jails/bacula-sd_1                    1.76T   271M         0    271M              0          0
zfsVol1/jails/btsync_1                       1.76T   247M         0    247M              0          0
zfsVol1/jails/owncloud_1                     1.76T   677M         0    677M              0          0
zfsVol1/jails/web                            1.76T  3.47G         0   3.47G              0          0
zfsVol1/zfsDataset1                          1.76T  3.53G         0   3.53G              0          0
zfsVol1/zfsDataset2                          1.76T  49.0M         0   49.0M              0          0
zfsVol1/zfsVolume1                           3.69T  4.47T     23.3G   2.52T          1.93T          0
zfsVol1/zfsVolume2                           3.83T  2.06T         0    118M          2.06T          0
zfsVol1/zfsVolume3                           3.82T  2.06T         0   6.89G          2.06T          0


Snapshots at SOURCE:

# zfs list -t snapshot
NAME                                              USED  AVAIL  REFER  MOUNTPOINT
zfsVol1/jails/.warden-template-pluginjail@clean   116K      -   805M  -
zfsVol1/zfsVolume1@auto-20140625.1506-1w         23.3G      -  2.41T  -


Some "ps" output while replication in progresss:

# ps -aux|grep "zfs: sending"
root 5384 5.3 0.0 37716 2536 ?? S 1:16PM 0:43.65 zfs: sending zfsVol1/zfsVolume1@auto-20140625.1506-1w (4%: 76513601712/1814601840216) (zfs)
root 7545 0.0 0.0 16268 1864 0 S+ 1:45PM 0:00.00 grep zfs: sending



# ps -aux -w -w | grep 192.168.200.204
root 5388 38.8 0.0 51652 5552 ?? S 1:16PM 15:55.92 /usr/bin/ssh -c arcfour256,arcfour128,blowfish-cbc,aes128-ctr,aes192-ctr,aes256-ctr -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -l root -p 22 192.168.200.204 /sbin/zfs receive -F -d zfsVol1 && echo Succeeded
root 5385 0.0 0.0 14492 1804 ?? I 1:16PM 0:00.00 /bin/sh -c /bin/dd obs=1m 2> /dev/null | /bin/dd obs=1m 2> /dev/null | /usr/bin/ssh -c arcfour256,arcfour128,blowfish-cbc,aes128-ctr,aes192-ctr,aes256-ctr -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -l root -p 22 192.168.200.204 "/sbin/zfs receive -F -d zfsVol1 && echo Succeeded"
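
If it helps with debugging, the same transfer can presumably be reproduced by hand from the pieces visible in the "ps" output above (host address, key path and dataset names are the ones from my setup; this is just the GUI's pipeline without the dd buffering):

# Send the full snapshot to DEST over the same SSH connection the GUI uses,
# so any receive-side error shows up directly on the console.
zfs send zfsVol1/zfsVolume1@auto-20140625.1506-1w | \
  ssh -i /data/ssh/replication -l root 192.168.200.204 \
  "/sbin/zfs receive -F -d zfsVol1"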


DEST box:

# zfs list -o space
NAME                    AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
zfsVol1                 1020G  9.37T         0    326K              0      9.37T
zfsVol1/.system         1020G  16.2M         0    244K              0      16.0M
zfsVol1/.system/cores   1020G  7.83M         0   7.83M              0          0
zfsVol1/.system/samba4  1020G  3.81M         0   3.81M              0          0
zfsVol1/.system/syslog  1020G  4.32M         0   4.32M              0          0
zfsVol1/zfsDataset1     1020G  3.49G         0   3.49G              0          0
zfsVol1/zfsDataset2     1020G  22.5M         0   22.5M              0          0
zfsVol1/zfsVolume1      3.87T  5.24T         0   2.28T          2.88T      89.3G
zfsVol1/zfsVolume2      3.06T  2.06T         0    118M          2.06T          0
zfsVol1/zfsVolume3      3.05T  2.06T         0   6.89G          2.06T          0


No snapshots:

# zfs list -t snapshot
no datasets available

Replication data stream being received:

# ps -aux | grep receive
root 52323 8.4 0.0 37584 2268 ?? S 1:16PM 4:05.83 /sbin/zfs receive -F -d zfsVol1
root 52321 0.0 0.0 48300 5068 ?? Is 1:16PM 0:00.01 csh -c /sbin/zfs receive -F -d zfsVol1 && echo Succeeded

And now the strangest part to me: after replication fails with the "out of space" error, there are still no snapshots visible at DEST, but zfsVol1/zfsVolume1 shows greater values for used space.

Here is the output of "zfs list" at DEST:

Before replication:

zfsVol1/zfsVolume1 3.87T 5.24T 2.28T -

After failed replication:

zfsVol1/zfsVolume1 6.00T 3.12T 2.28T -

Same at SOURCE:

before replication:

zfsVol1/zfsVolume1 3.69T 4.47T 2.52T -

after failed replication:

zfsVol1/zfsVolume1 4.61T 3.69T 2.52T -


It looks like the failed replication somehow consumes space at both SOURCE and DEST.
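
In case it is relevant, a breakdown like the following (standard ZFS properties, the same buckets "zfs list -o space" shows) should reveal which part of USED actually grows after a failed attempt:

# Break USED down into snapshots, live data, refreservation and children.
# Run on both SOURCE and DEST, before and after a failed replication attempt.
zfs get used,usedbysnapshots,usedbydataset,usedbyrefreservation,usedbychildren zfsVol1/zfsVolume1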

The zfsVolume1 volume actually holds about 1.8 TB of data (the result of "du" on the VMware server that uses the volume through iSCSI).

How can I solve this? Why do the volumes seem to grow after a failed replication? How can I clear the used space?
I'm not a ZFS expert; maybe it is just a matter of running some kind of "purge" command?
Please help.
 

macmac1

Dabbler
Joined
Apr 9, 2014
Messages
17
In case somebody else runs into this: I finally solved it by checking the "Initialize remote side for once. (May cause data loss on remote side!)" check-box.
Once that was done, the first replication succeeded. I then unchecked the box, and replication has been working fine ever since.
I have no idea whether that alone would have been enough, or whether my earlier actions were also necessary.
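
For anyone who prefers the command line: my assumption (I have not checked what FreeNAS does internally) is that the check-box simply discards the copy on the remote side so that the next run has to send a full stream again. Done by hand it would look roughly like this:

# DANGER: removes the replicated copy of the zvol on the DEST box, which is
# exactly the data loss the check-box label warns about.
# Assumption: the next scheduled replication then re-sends a full stream.
zfs destroy -r zfsVol1/zfsVolume1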
 