philiplu
Explorer
Joined: Aug 10, 2014
Messages: 58
I'm seeing similar issues as mentioned in some other forum posts, like ZFS Replication Fails from time to time and ZFS Replication: Dataset becomes unavailable on receiving end after receiving snapshot. While replication is going on, I start seeing syslog messages like these:
Oct 1 16:30:05 Marvin collectd[42972]: statvfs(/mnt/bakset1/bu/users/user1) failed: No such file or directory
Oct 1 16:30:05 Marvin collectd[42972]: statvfs(/mnt/bakset1/bu/users/user2) failed: No such file or directory
Oct 1 16:30:05 Marvin collectd[42972]: statvfs(/mnt/bakset1/bu/users/user2) failed: No such file or directory
That's just a symptom - the real problem seems to be that zfs receive can sometimes unmount a dataset, even though zfs mount will still show the dataset as mounted.
What's interesting is that I don't see this when using the GUI to set up replication, like those other forum posts do. Instead, I'm seeing it with a script of mine that pushes an on-demand replication to a local pool; I use it to create rotating backup pools that are hot-swapped and stored off-site. The replication script tries to mimic the same zfs commands as used in GUI replication (at least, before 9.3.1 - I haven't checked to see exactly what's changed in the new replication scheme). Here are example commands I use to replicate locally:
zfs snapshot -r pool/bu@bakset1-20151001.1616
zfs send -e -R -I bakset1-20150929.1304 pool/bu@bakset1-20151001.1616 | zfs receive -v -F -d bakset1
zfs destroy -r -v pool/bu@bakset1-20150929.1304
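The three steps above (recursive snapshot, incremental send/receive, destroy the old snapshot) can be sketched as a small dry-run script. The dataset names (pool/bu, bakset1) and the dated snapshot label format are just the ones from my example; the echo prefixes keep it from touching anything until you remove them:

```shell
# Dry-run sketch of the replication sequence above. Dataset names
# (pool/bu, bakset1) and the snapshot label format match the example.
# Remove the 'echo' prefixes to actually run the commands.
SRC="pool/bu"
DST="bakset1"
PREV="${1:-bakset1-20150929.1304}"     # previous snapshot label
NEW="${DST}-$(date +%Y%m%d.%H%M)"      # new label, e.g. bakset1-20151001.1616

echo zfs snapshot -r "${SRC}@${NEW}"
echo "zfs send -e -R -I ${PREV} ${SRC}@${NEW} | zfs receive -v -F -d ${DST}"
echo zfs destroy -r -v "${SRC}@${PREV}"
```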
Once that zfs receive starts, the collectd errors start appearing. If I manually look at the tree in the PULL pool bakset1, the top-level received dataset, bakset1/bu, doesn't contain the actual nested datasets. Instead, it just has the placeholder mountpoint directories for those datasets. You can tell because the mountpoints are empty, and they all have the same modification time, from when the bakset1 destination was first created, instead of various times corresponding to the modifications of the source datasets under pool/bu. The collectd syslog errors appear because those datasets are nested two levels down, e.g. bakset1/bu/users/user1, and those mountpoint directories can't be found any longer since their parent directory is acting unmounted.
Even though the nested datasets act like they're unmounted, they still show up in the output from zfs mount.
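A quick way to cross-check that mismatch from the shell is to walk the mountpoints that zfs mount lists and see whether the directories actually exist - the same check collectd's statvfs() is tripping over. This is just a sketch; the here-doc stands in for real zfs mount output so it runs anywhere:

```shell
# Flag mountpoints that 'zfs mount' claims are mounted but that no
# longer exist on disk. The here-doc mocks 'zfs mount' output, which
# is two columns: dataset name, then mountpoint.
check_mounts() {
    while read -r dataset mntpt; do
        [ -d "$mntpt" ] || echo "MISSING: $dataset at $mntpt"
    done
}

check_mounts <<'EOF'
bakset1/bu/users/user1 /mnt/bakset1/bu/users/user1
bakset1/bu/users/user2 /mnt/bakset1/bu/users/user2
EOF
```

On the real system you'd pipe the live output in instead: zfs mount | check_mounts.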
I fix the problem by detaching and reattaching the destination pool in the GUI storage tab, but the problem reoccurs the next time I replicate.
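In case it helps anyone else hitting this: a possibly lighter-weight workaround than detaching the whole pool might be to force-remount just the affected subtree from the shell. I haven't verified that this clears the half-unmounted state, so here it is as a dry run, using the names from my example:

```shell
# Untested alternative to the GUI detach/reattach: force-unmount the
# received top-level dataset, then remount all datasets. Echoed as a
# dry run; remove the 'echo' prefixes to run it for real.
DST_TOP="bakset1/bu"
echo zfs unmount -f "$DST_TOP"
echo zfs mount -a
```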
Anyone have any idea why a zfs receive would half-way unmount a dataset?
I've attached the script I use to manually replicate, though the details there aren't important to the errors I'm seeing.