ZFS Replication: Dataset becomes unavailable on receiving end after receiving snapshot.


gunnarsson

Cadet
Joined
Sep 22, 2014
Messages
9
Hello all,

As the subject suggests, I am having problems with ZFS replication where the dataset becomes unavailable on the receiving end once a new snapshot is received. If I connect via SSH and get a directory listing, the dataset is not present or accessible.

The console shows errors along the lines of the following, one message per affected dataset every 10 seconds:

Code:
May 11 19:00:56 haugur2 collectd[8701]: statvfs(/mnt/vol1/cbdp/cbdp_photo/2014) failed: No such file or directory


If I reboot the box, all datasets are intact and up-to-date and everything seems to be working perfectly until the next snapshot is received, at which point the datasets become unavailable again until the next reboot.
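
In case it is useful for debugging, this is the sort of thing I plan to check the next time a replication finishes (dataset name taken from the error message above), to see whether ZFS still considers the dataset mounted:

Code:
# Diagnostic sketch: compare what ZFS reports with what is actually mounted,
# using the dataset name from the collectd error message above.
zfs list -r -o name,mounted,mountpoint vol1/cbdp
mount | grep vol1/cbdp
ls /mnt/vol1/cbdp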

I have been having this problem since I first set up my two boxes in September 2014 (9.2.8). Both boxes replicate data to each other, and the data is shared via CIFS as read-only on the receiving end. Currently I am running FreeNAS-9.3-STABLE-201505040117.

It is an annoying and inconvenient problem, as I would like to have the data accessible as read-only on the remote side, but since I haven't had a lot of time available to put into this, and since I know the data is getting replicated, I haven't done anything serious about it yet.

I have searched for information on the problem before but have not found enough relevant information to put me on the right track to solving it. Initially I thought it might be a bug, and if so it would presumably get a fix soon enough, but it doesn't look like a lot of other people are having this problem, so now I am leaning more towards the explanation that I have made some sort of configuration error.

Hardware: ASRock C2750D4I, 2x 8 GB ECC RAM, 3x 4 TB WD RE4 (RAID-Z1)

I would appreciate any input, and if more information is required to debug this, let me know.
 

Attachments

  • debug-haugur2-20150511185401.tar.gz
    360.4 KB

gunnarsson

Cadet
Joined
Sep 22, 2014
Messages
9
Unfortunately, no. Although it is always lingering in my brain, I can't say I've tried much, due to lack of time. I was hoping someone might come up with suggestions that would put me on the right track, but I am aware that this might just be a freak case caused by me messing something up, and thus potentially pretty difficult to figure out, so I understand it very well if people on here are not too keen on even starting to look at it :)

Initially, when I set up my boxes, I did a fair bit of fiddling around with permissions via the terminal by means of chmod instead of ACLs, and I have wondered whether something I did then may have caused this. Most of the data on the boxes was copied from Linux servers, and I was trying to retain the users and permission structure from there, which in hindsight wasn't smart. After I discovered that there was a problem I thought it was a CIFS issue, but after deleting all the CIFS shares and killing the CIFS service I realised that was not the problem. Later I created new datasets, following the documentation as closely as I possibly could, but ended up with the same results for those datasets. After reading some posts I started wondering whether manual replication has the same effect or not, but I haven't done any tests with that yet.

I imagine I may have to start from square one with my boxes when I get the time to do so. I will probably get yet another box and start with that, so I don't disrupt the replication/backup system I have in place already, since I at least know that the data is getting replicated ...it is just not possible to access it locally on the remote box.

In any case, thank you for responding, dlavigne. If you have any ideas you can throw my way...where to look for clues, etc., that would be much appreciated.
 

Sir.Robin

Guru
Joined
Apr 14, 2012
Messages
554
I might be mistaken, but I'm quite sure I have read that the filesystem will be unavailable while receiving.

Seems to me this could be your issue.
I also think that it is not supposed to work the way you are doing it, sharing the dataset being replicated on the pull side.
So Samba will be sharing a dataset which suddenly is unavailable.... ;)


 

gunnarsson

Cadet
Joined
Sep 22, 2014
Messages
9
Thanks very much for your input, Sir.Robin :)

Perhaps that is the case. However...it does not matter whether CIFS (or any sharing of the datasets in question) is on or not. The datasets become inaccessible on the remote side (after the replication has been completed) even if I try to access them on the FreeNAS machine that is carrying them, until after the next reboot (or probably restarting some process, which I haven't figured out how to do yet, or perhaps somehow re-mounting the dataset).

I would not expect the dataset to be accessible while it is being updated by the replicated snapshot on the remote machine, but I was thinking (or hoping) it should become available again after it has been updated. Maybe that's not so?
If the dataset should be available after it's been updated, then I feel like it shouldn't be a problem sharing it via CIFS or any other means as READ ONLY. I admit this is an assumption based on limited knowledge and understanding of it all.
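
If re-mounting is indeed all that's needed, I imagine something along these lines (an untested guess on my part, using the dataset name from the error message in my first post) might bring it back without a full reboot:

Code:
# Untested guess: if ZFS reports the dataset as unmounted after the receive,
# mounting it again by dataset name might make it accessible without a reboot.
zfs get mounted vol1/cbdp
zfs mount vol1/cbdp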

Perhaps I will have to resort to different ways of cloning the data between the boxes, which I find a bit sad because I really love the idea of the automated replication in FreeNAS and how it works (apart from this problem, of course). I guess I could implement the desired functionality myself when I find the time to dig deeper into this.

Many thanks!
 

gunnarsson

Cadet
Joined
Sep 22, 2014
Messages
9
Now, over three months later, I had some time to fiddle with this and actually found something out.

I went to the terminal of the machine on the receiving end of the replication and did:
Code:
zfs unmount /mnt/vol1/cbdp/


On this particular occasion the message I got back was:
Code:
cannot unmount '/mnt/vol1/cbdp': Device busy


Despite the "cannot unmount" error message, immediately after this the statvfs errors stopped and the dataset became available again, after having been inaccessible following a successful replication.

Possibly a bug? ...or at least an incomplete feature?
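
For completeness, a more explicit version of that cycle (my own guess at a cleaner workaround, by dataset name instead of mountpoint; I don't know yet whether forcing an unmount is safe while a receive is running) would be:

Code:
# Guess at a cleaner workaround: unmount and remount by dataset name.
# The -f (force) flag is my own addition and may not be safe during a receive.
zfs unmount -f vol1/cbdp
zfs mount vol1/cbdp
zfs get mounted vol1/cbdp   # should report "yes" again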
 

philiplu

Explorer
Joined
Aug 10, 2014
Messages
58

gunnarsson

Cadet
Joined
Sep 22, 2014
Messages
9
Thanks, philiplu :) , I appreciate you letting me know that you have a similar problem, and the work-around you came up with.
I haven't had time to look into this since I posted. When I have needed to access the data, I have manually unmounted/remounted it. I imagine I will simply ditch the GUI way of setting up the replication, unless a fix comes before I get the time to sort it out. Your script will help me towards that. Thanks again!
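
Since your script isn't posted in this thread, for my own notes this is the rough shape of what I have in mind (entirely untested, dataset name assumed from my setup), to be run from cron after replications finish:

Code:
#!/bin/sh
# Rough, untested sketch (not philiplu's script): remount any descendant of
# the replicated dataset that ZFS reports as unmounted. Dataset name assumed.
PARENT="vol1/cbdp"

for ds in $(zfs list -H -r -o name "$PARENT"); do
    if [ "$(zfs get -H -o value mounted "$ds")" = "no" ]; then
        echo "remounting $ds"
        zfs mount "$ds" || echo "failed to mount $ds" >&2
    fi
done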
 