Replication failed emails, the sequel

Status
Not open for further replies.

indivision

Guru
Joined
Jan 4, 2013
Messages
806
I have replication set up between two FreeNAS machines running the latest 11-Stable.

Every time a replication finishes (once per day), I receive an email that says that replication failed. Looking at the replication task in the GUI, it says that it is "up to date". Looking on PULL, it appears that the replication went through. But, I would like to be 100% certain.

There are several threads/reports about this kind of thing happening. But, I don't see any solutions. Is this an unresolved bug?

Other examples:

https://forums.freenas.org/index.php?threads/spurious-replication-failed-emails.39110/
https://bugs.freenas.org/issues/11550
https://forums.freenas.org/index.ph...while-attempting-to-send-snapshot-auto.55349/
https://bugs.freenas.org/issues/23837
https://forums.freenas.org/index.php?threads/replication-problem-what-to-make-of-the-errors.45313/

[EDIT: One side note: I notice that the "used" amounts are different for the snapshots between PUSH and PULL. Is that normal? Or, an indication that they aren't being copied over correctly?]
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
hmm, would it not make sense to include the error listed in the failure emails? are we to assume that you get all the exact errors listed in those threads?

I believe the used amounts per snapshot will all depend on which snaps were replicated.
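if you want to confirm the snapshots themselves line up, you could compare them by name and guid on both boxes; the guid of a snapshot is preserved by replication even though "used" isn't expected to match. just a sketch, and poolname/dataset is a placeholder for your actual names:

Code:
# run both commands on PUSH and again on PULL; a correctly replicated
# snapshot keeps the same guid on both sides, while "used" can differ
zfs list -r -t snapshot -o name,used,referenced poolname/dataset
zfs get -r -t snapshot -o name,value guid poolname/dataset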

it looks like a bug report in one of those threads states that a failure message is sent when it has to create a dataset. could something be nuking datasets on pull, causing it to have to create datasets daily?
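one way to check (again just a sketch; substitute your actual replication target) would be to look at the creation dates of the datasets on PULL:

Code:
# on PULL, as root; if something keeps recreating the target datasets,
# their creation dates will be suspiciously recent
zfs get -r -o name,value creation poolname/replica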

kind of seems to be a false failure
 

rogerh

Guru
Joined
Apr 18, 2014
Messages
1,111
hmm, would it not make sense to include the error listed in the failure emails?
The standard replication failure emails I get are of two kinds with slightly different format, but neither says what the nature of the failure is. If there is a concomitant alert, that may be slightly more helpful, but it can be quite difficult to identify exactly where it is failing. For instance, my failure message on the 'send' machine is because of a non-zero status return from "zfs receive" on the 'receive' machine, but nonetheless the snapshot replication succeeds (i.e. despite the email saying it has failed). But, because of the non-zero status, stale snapshots are not deleted on receive. Some effort was needed to discover this.
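The sort of checks needed were roughly along these lines, on the 'receive' machine (the dataset name is only a placeholder):

Code:
# did the newest snapshot actually arrive, and are stale snapshots piling up?
zfs list -r -t snapshot -H -o name pool/dataset | tail -1
zfs list -r -t snapshot -H -o name pool/dataset | wc -l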

Edit: the point of this post is to show how non-specific these emails are, not to divert the thread to my old problem.
 

indivision

Guru
Joined
Jan 4, 2013
Messages
806
hmm, would it not make sense to include the error listed in the failure emails? are we to assume that you get all the exact errors listed in those threads?

I believe the used amounts per snapshot will all depend on which snaps were replicated.

it looks like a bug report in one of those threads states that a failure message is sent when it has to create a dataset. could something be nuking datasets on pull, causing it to have to create datasets daily?

kind of seems to be a false failure

Here is the error:

Code:
The replication failed for the local ZFS tank/blue/red while attempting to
apply incremental send of snapshot auto-20170709.1320-2m -> auto-20170710.1320-2m to 192.168.0.104


I guess something could be nuking datasets on pull in theory. I haven't seen any evidence of that, though; I'll check the creation dates as you suggest.

Edit: the point of this post is to show how non-specific these emails are, not to divert the thread to my old problem.

Thank you. I think it's a helpful point. From what I've read, it sounds like the non-zero exit status issue you describe matches what I'm seeing. So, maybe it's the same problem. Were you able to fix it?
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
maybe some of what's in this thread could help. they mention that if snaps get created and expire on push before they get replicated to pull, you might get errors. since you didn't post any of your snapshot or replication settings, pool sizes, dataset layout, or hardware, you will have to inspect them all yourself.
if that doesn't lead to anything, maybe you can use the script, although it seems to me like deleting rollback snaps might be counterproductive.
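a quick way to see whether replication is keeping up is to compare the newest snapshots on each side, something like this (dataset name is a placeholder):

Code:
# newest few snapshots of the replicated dataset, sorted by creation time;
# run on PUSH and again on PULL and compare
zfs list -r -t snapshot -o name,creation -s creation poolname/dataset | tail -3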

Edit: oops forgot the thread link haha
https://forums.freenas.org/index.ph...napshots-similar-to-apples-timemachine.10304/
 

indivision

Guru
Joined
Jan 4, 2013
Messages
806
maybe some of what's in this thread could help. they mention that if snaps get created and expire on push before they get replicated to pull, you might get errors. since you didn't post any of your snapshot or replication settings, pool sizes, dataset layout, or hardware, you will have to inspect them all yourself.
if that doesn't lead to anything, maybe you can use the script, although it seems to me like deleting rollback snaps might be counterproductive.

Edit: oops forgot the thread link haha
https://forums.freenas.org/index.ph...napshots-similar-to-apples-timemachine.10304/

Thank you for the link.

Isn't the "roll-up" feature from that script available through FreeNAS? It seems that you can designate how long to keep snapshots for and they are removed after that time. Or, maybe I'm missing something there.

I set the snapshots to expire in 2 months and replication to occur daily. So, I don't think they are expiring before replication...
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
ya, personally that script looks redundant, since you can do that with what's built in, but it might be worth trying to see if it bypasses your issues.

ya, that shouldn't be it, unless your data somehow takes 2 months to replicate.

at this point I can't think of anything; even that link I found was accidental.
 

rogerh

Guru
Joined
Apr 18, 2014
Messages
1,111
My error is related to zfsonlinux, so probably not relevant. But it is possible that, like me, you are seeing an error in zfs receive that does not pass a useful error message to the sending side.

The different sizes are not necessarily an error. You could choose a few random files from the two machines and copy and compare them to see if replication is likely to be working alright.
 

indivision

Guru
Joined
Jan 4, 2013
Messages
806
You could choose a few random files from the two machines and copy and compare them to see if replication is likely to be working alright.

What is a safe way to do that?

When I navigate to the target folder in PULL in CLI it doesn't show any files.
 

rogerh

Guru
Joined
Apr 18, 2014
Messages
1,111
What is a safe way to do that?

When I navigate to the target folder in PULL in CLI it doesn't show any files.
That is definitely worrying! Is the file system on the PULL system mounted? It should look exactly the same, at the dataset level and below, as the dataset you are replicating does on the PUSH side.
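You can check from the shell on PULL with something like this (the dataset name is a placeholder; use your actual target):

Code:
# shows whether the replicated datasets are actually mounted, and where
zfs get -r -o name,property,value mounted,mountpoint poolname/dataset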

It might be worth giving us (in code tags) the results of zfs list on both systems. And perhaps screenshots of your snapshot task(s) and replication task(s) in the GUI. Just so we know exactly what you are trying to do. You can of course redact confidential data if necessary, but it might confuse the issue.

Edit: I am assuming there are masses of files you can list and open in the relevant dataset(s) on the PUSH side?
 

indivision

Guru
Joined
Jan 4, 2013
Messages
806
That is definitely worrying!

I was hoping you wouldn't say that and that the files were hidden somehow as a security precaution. :D

Is the file system on the PULL system mounted? It should look exactly the same, at the dataset level and below, as the dataset you are replicating does on the PUSH side.

PULL is a near default machine. It does nothing but pull replications.

I haven't done anything extra to mount a file system. I just made one dataset for a mirrored set of drives and then set up replication according to the manual onto a dataset under the main one.

The list of snapshots in the PULL GUI looks roughly the same as the list of snapshots in the GUI on PUSH. The first replication did take a long time, so it does seem to be really transferring the data.

Could it be that the data on PULL isn't visible in CLI due to permissions?

It might be worth giving us (in code tags) the results of zfs list on both systems. And perhaps screenshots of your snapshot task(s) and replication task(s) in the GUI. Just so we know exactly what you are trying to do. You can of course redact confidential data if necessary, but it might confuse the issue.

Here are the relevant sections of zfs list.

PULL:

Code:
NAME                                USED  AVAIL  REFER  MOUNTPOINT
vanguard/replica                    536G  1.23T    88K  /mnt/vanguard/replica
vanguard/replica/vitae              536G  1.23T    88K  /mnt/vanguard/replica/vitae
vanguard/replica/vitae/indivision   153G  1.23T   153G  /mnt/vanguard/replica/vitae/indivision
vanguard/replica/vitae/vault        383G  1.23T   382G  /mnt/vanguard/replica/vitae/vault


PUSH:

Code:
NAME                       USED  AVAIL  REFER  MOUNTPOINT
optimus/vitae              545G  1.43T   112K  /mnt/optimus/vitae
optimus/vitae/indivision   155G  1.43T   155G  /mnt/optimus/vitae/indivision
optimus/vitae/vault        389G  1.43T   389G  /mnt/optimus/vitae/vault


Attached are screenshots of replication and snapshot tasks.

Edit: I am assuming there are masses of files you can list and open in the relevant dataset(s) on the PUSH side?

Yes. On PUSH I can navigate to the folders and see the files just fine.
 

Attachments

  • replication.PNG (22.6 KB)
  • snapshot.PNG (15.3 KB)

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
i have had PULL folders look blank at the CLI when I had 2 or more replications configured to go to the same dataset. they would successfully erase each other, forever, but as long as they completed there were no errors.
I can't remember the exact bad config, but it's something like:

server  source pool      server  dest pool
PUSH1   optimus/vitae    PULL    vanguard/replica
PUSH1   uber/eraser      PULL    vanguard/replica
PUSH2   pool/nuker       PULL    vanguard/replica


are PUSH and PULL the only servers set up? is there any chance that anything else is writing to PULL?
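if you want to rule out anything writing into the replica locally, one option is to make the target read-only on PULL; as far as I know zfs receive isn't affected by the readonly property, so replication keeps working:

Code:
# on PULL; blocks local writes into the replica (zfs receive still works)
zfs set readonly=on vanguard/replica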
 

indivision

Guru
Joined
Jan 4, 2013
Messages
806
i have had PULL folders look blank at the CLI when I had 2 or more replications configured to go to the same dataset. they would successfully erase each other, forever, but as long as they completed there were no errors.
I can't remember the exact bad config, but it's something like:

server  source pool      server  dest pool
PUSH1   optimus/vitae    PULL    vanguard/replica
PUSH1   uber/eraser      PULL    vanguard/replica
PUSH2   pool/nuker       PULL    vanguard/replica


are PUSH and PULL the only servers set up? is there any chance that anything else is writing to PULL?

They are the only servers. No other write tasks have been set up. Both servers have recently been set up from scratch (other than the data pools on PUSH).

Although the folder appears blank, PULL shows the right amount of drive space as being used. That makes me think the permissions on the files from PUSH are preventing my CLI user on PULL from viewing them. But I'm not sure if that is possible?
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
unix permissions can be very specific, so yes, it could be something with permissions. by default a new pool is owned by root. are you replicating with root? are you viewing on pull with root? who owns the files on push? does root own the pool on pull?
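you can check most of that from the shell in a few seconds, something like this (paths taken from your earlier zfs list):

Code:
# on PULL: who am I, and who owns the mountpoints?
id
ls -ld /mnt/vanguard/replica /mnt/vanguard/replica/vitae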
 

indivision

Guru
Joined
Jan 4, 2013
Messages
806
The new dataset "replica" was created by the replication process. Presumably by root then?

I am replicating with a task set up in the GUI. I don't see anywhere in the replication task options that indicates what account it is replicating with.

I am viewing on pull by using the GUI Shell button. I believe that uses root.

I am replicating the dataset recursively. So, there are varied permissions below the top level. But, the top-level dataset "vitae" is owned by "nobody" and a group that I set up so multiple people can access it. I switched it from root to "nobody" in an attempt to harden/protect the files further. But, maybe that is the issue?

Strangely enough, on PULL, the target dataset "replica" that was created for replication is owned by root and wheel. But, all of the datasets under it have completely blank ownership for owner and group.
 

rogerh

Guru
Joined
Apr 18, 2014
Messages
1,111
The new dataset "replica" was created by the replication process. Presumably by root then?

I am replicating with a task set up in the GUI. I don't see anywhere in the replication task options that indicates what account it is replicating with.

I am viewing on pull by using the GUI Shell button. I believe that uses root.

I am replicating the dataset recursively. So, there are varied permissions below the top level. But, the top-level dataset "vitae" is owned by "nobody" and a group that I set up so multiple people can access it. I switched it from root to "nobody" in an attempt to harden/protect the files further. But, maybe that is the issue?

Strangely enough, on PULL, the target dataset "replica" that was created for replication is owned by root and wheel. But, all of the datasets under it have completely blank ownership for owner and group.
Your data in the previous posts looks as though everything is working well; the variations in space used are well within what can happen due to various filesystem factors (i.e. I don't understand them, but they seem to happen). If only you could access the files on PULL, there would seem to be no concerns.

But the datasets on PULL having no Unix owners is something I have not come across before and can't interpret. Do both pools have the same ZFS feature flags? Could either of them have been created in an earlier edition of FreeNAS and not updated?
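You could compare the two pools with something like the following (pool names taken from your zfs list; zpool upgrade with no arguments only lists pools that are missing supported features, it changes nothing):

Code:
zpool upgrade
zpool get all optimus | grep feature@     # on PUSH
zpool get all vanguard | grep feature@    # on PULL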
 

indivision

Guru
Joined
Jan 4, 2013
Messages
806
could either of them have been created in an earlier edition of FreeNAS and not updated?

Yes. That is the case. The pools on PUSH were set up several years ago. I get an alert saying that I can upgrade the pools. But, I have not done so yet because it sounds like the new feature flags are of no use to me. So, I might as well retain the ability to go back if something goes wrong with the current train, etc.

On PULL, I completely re-did that machine just a few days ago, building the pools from the latest 11-stable. So, presumably, that has the new feature flags in place on the pool.

Do you think that could be the issue?
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
hmm. I always created the dataset on pull and assigned an owner, because if I didn't the replica didn't work as I expected.
I wonder if letting it create the parent dataset vanguard/replica could be part of your problem?
it looks like the replicated data is ~550GB (going by your zfs list); how fast is your network? could you delete everything on PULL, explicitly create vanguard/replica, and re-replicate in a reasonable time?
if not, do you have sufficient space on PULL that you could create a new dataset (e.g. vanguard/replica2) and a new replication task for the same PUSH dataset, and see if that works better?
I would be doubtful that it's feature flags, because I don't think replication copies those at all; I think that's only part of the pool itself.
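for the delete-and-recreate option, it would be roughly this; be very sure you are on PULL, because the first command destroys the existing replica:

Code:
# on PULL, as root
zfs destroy -r vanguard/replica
zfs create vanguard/replica
chown root:wheel /mnt/vanguard/replica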
 

indivision

Guru
Joined
Jan 4, 2013
Messages
806
could you delete everything on PULL, explicitly create vanguard/replica and re-replicate in a reasonable time?

Well. I tried this. But, just got the same results. Blank permissions/owner.

I also tried changing those permissions/owner via the GUI on PULL. But, they remain blank after the change attempt.
 