11.3 new replication - a curate's egg?

Adrian

Contributor
Joined
Jun 29, 2011
Messages
166
I upgraded one of my machines from 11.2-U7 to 11.3-RELEASE a few days ago and, as it looked good, upgraded the other three this weekend, then spent two long days trying to get the new replication to work. The end result is that I have reduced my systems to one machine containing data with no pre-11.3-RELEASE snapshots, a couple of partial rsync backups on external drives, partial Cloud Sync backups, and three backup/test machines with no data, which I am about to turn off until 11.3-U1 or later.

I now appreciate the wisdom of those who stay on old versions until the screaming has stopped.

The good parts of 11.3-RELEASE

For me, the 11.2-U7 to 11.3-RELEASE updates via the GUI were painless.
  • Cloud Sync to AWS still works.
  • I do not miss the legacy GUI much, as I weaned myself off it.
  • The new GUI is faster and looks quite attractive, if sparse; information density is low. I have not seen any Chrome "Aw, Snap!" errors, which idle dashboards used to trigger after a few hours.
I do not use encrypted disks.
I do not use FreeNAS plugins, jails or VMs.

The poor parts
  • Lagg/failover behavior, mentioned elsewhere.
  • SMB access and permissions problems after recreating my Windows shares. Resolved by re-entering the passwords for my root and individual users. Permissions still look very odd from Windows.
I do not intend to revert to 11.2-U7.

The bad part - the new replication system

In the guide this looks wonderful.
SSH + NETCAT (when I got it running) ran really quickly, even to my elderly AMD Athlon(tm) II Neo N36L HP Microserver.
But the problems!
  • I could not get it to include my legacy snapshots in replications.
  • An apparently successful replication run then destroyed all the legacy snapshots for the dataset.
  • Replication from source tank/dataset to remote tank/remote/system used to result in the replica landing in tank/remote/system/dataset, allowing sibling datasets to be logically placed. Now, if it works at all, the replica lands in tank/remote/system itself and incremental replications are not possible (see the layout sketch after this list).
  • After biting the bullet and destroying all my legacy snapshots, and recreating replication targets, SSH Keys, SSH Connections, Periodic Snapshot tasks and Replication tasks multiple times, try as I might I could not get it to work.
  • Seemingly invalid host keys were sometimes stored in SSH Connections, preventing connection.
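A minimal illustration of the layout difference, using hypothetical dataset names rather than anything from my systems:
Code:
# What legacy replication of tank/dataset produced on the target machine,
# keeping siblings separate under tank/remote/system:
#   tank/remote/system/dataset
#   tank/remote/system/otherdataset
# The new replication writes straight onto tank/remote/system itself.
zfs list -r -o name tank/remote/system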
Either I have missed something or the new replication system is not yet in any way fit for purpose. I hope that iXsystems will extensively test and fix this area before releasing an 11.3-Un in which they announce that replication works. Holding up 11.3-U1 would be unpopular, as it looks as if it fixes problems in other areas.
 
Joined
Jan 4, 2014
Messages
1,644
Ouch! Your post made me go and check my backup systems. Backup data is still intact and data from the replication sources are still coming across. I haven't bothered updating my replication tasks away from using the LEGACY transport and after reading your post, I'll put tinkering with replication on hold. Thanks for the balanced assessment of your 11.3 journey so far.

SMB access and permissions problems after recreating my Windows shares. Resolved by re-entering the passwords for my root and individual users. Permissions still look very odd from Windows.
I haven't created any new SMB shares since 11.3. The existing shares appear to be fine, but I do agree, share permissions do look a little odd under Windows.
 

Adrian

Contributor
Joined
Jun 29, 2011
Messages
166
Yes, legacy snapshots and replications were working fine for me too under 11.3-RELEASE.
I wish that I had left them alone.
 

Richard Durso

Explorer
Joined
Jan 30, 2014
Messages
70
  • Replication from source tank/dataset to remote tank/remote/system used to result in the replica landing in tank/remote/system/dataset, allowing sibling datasets to be logically placed. Now, if it works at all, the replica lands in tank/remote/system itself and incremental replications are not possible.
  • After biting the bullet and destroying all my legacy snapshots, and recreating replication targets, SSH Keys, SSH Connections, Periodic Snapshot tasks and Replication tasks multiple times, try as I might I could not get it to work.
Adrian, I just upgraded from 11.2-U8 to 11.3 and didn't experience the same issues with legacy replication that you had.

I did have to do some cleanup under "System > SSH Connections" to remove a bunch of duplicates created during the migration to 11.3; it looks like one duplicate was created for each migrated task, and I only need one connection object. I then cleaned up my legacy replication tasks to use my consolidated SSH Connection, getting it down to a single SSH Connection for Legacy. You can go to System > SSH Keypairs, select Edit, and copy the Public Key contents from there. On the remote host, clean up ".ssh/authorized_keys" and paste that in. For giggles I regenerated my keypair on FreeNAS and repeated these steps to use the new keypair.
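For anyone following along, a minimal sketch of the remote-host side of that step, assuming the copied public key has been saved into a temporary file called freenas_replication.pub (the filename is just for illustration):
Code:
# Append the FreeNAS public key to the account used for replication
cat freenas_replication.pub >> ~/.ssh/authorized_keys
# SSH is picky about permissions on these files
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys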

Then from the FreeNAS Shell tested the connection to remote hosts:
Code:
ssh -vv -i /data/ssh/replication <remote.host.com>

OpenSSH_7.5p1, OpenSSL 1.0.2s-freebsd  28 May 2019
debug1: Reading configuration data /etc/ssh/ssh_config
debug2: resolving "<remote.host.com>" port 22
debug2: ssh_connect_direct: needpriv 0
debug1: Connecting to <remote.host.com> [192.168.0.225] port 22.
debug1: Connection established.
...

Welcome to Ubuntu 18.04.4 LTS (GNU/Linux 4.15.0-88-generic x86_64)
...
0 packages can be updated.
0 updates are security updates.

Last login: Mon Feb 24 17:36:29 2020 from 192.168.0.250

~#


Connected, no problem. Then I created a new replication task using the legacy transport (you can select Legacy as the transport), something like the following (a push to my Ubuntu backup system):
  1. Login to FreeNAS GUI.
  2. Expand "Tasks" and select Replication Task
  3. Click [Add]
  4. Click [Advanced Replication Creation]
    1. Set name such as "main/users -> root@remote.host.com:rpool/backups/users"
    2. Set Direction to "Push"
    3. Set "Transport" to "Legacy"
    4. Set SSH Connection to respective remote SSH host
    5. Expand Source Dataset folders to select pool and dataset to replicate to Remote Server.
    6. Expand Target Dataset to select "rpool/backups/<DataSet>" on Remote Server.
    7. Enable "Recursive" if child datasets should be included.
    8. Set Snapshot Retention Policy to "Same as Source"
    9. Enable Stream Compression if that applies to data being replicated.
  5. Click [Save]
Within seconds that replication task switched state to "Running", and it can be monitored from the FreeNAS shell with "tail -f /var/log/debug.log". On the Ubuntu side I could see the dataset(s) being populated.
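A quick way to confirm arrival on the receiving side is to list the snapshots under the target dataset; the dataset name below just matches the example task above:
Code:
# On the Ubuntu backup host, list received snapshots under the target dataset
zfs list -t snapshot -r rpool/backups/users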

I've been unable to get anything with NETCAT working yet, as I can't get the required "py-libzfs" compiled on non-FreeNAS hosts.
 

Adrian

Contributor
Joined
Jun 29, 2011
Messages
166
I am glad it is working for you, Richard.
I didn't mention it, but legacy snapshots and replication seemed to be working fine for me too.
Then I tore it all down to switch to the new snapshots and replication.
Legacy presumably uses the old, well-hammered-upon code.
It is the new stuff that seems to have problems.
 

mjt5282

Contributor
Joined
Mar 19, 2013
Messages
139
Adrian, your analysis of 11.3 is accurate. I do not use the built-in replication, relying instead on rsync scripts and Syncoid. I hope you can figure out what went wrong. The jails / zpool system seems stable for me.
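For anyone who has not used it, a typical Syncoid push looks roughly like this; the dataset and host names are made up for illustration:
Code:
# Push tank/dataset and its snapshots to a remote pool over SSH
syncoid tank/dataset root@backuphost:rpool/backups/dataset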
 

Adrian

Contributor
Joined
Jun 29, 2011
Messages
166
Does 11.3-U1 address the issues you've been experiencing?
I have not yet installed 11.3-U1. I'll do so on a couple of machines when I have some spare time and test the new snapshots and replication.
 

Adrian

Contributor
Joined
Jun 29, 2011
Messages
166
I have installed 11.3-U1 on 3 local machines; deleted all snapshots, snapshot and replication tasks, SSH replication keys and connections; set up a small snapshot and replication task, which worked as expected; then set up all my local replications and have them grinding away in parallel. Looking good! With 10 TB to copy to one machine and 7 TB to another over a 1 Gb network, this will take days.
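As a rough back-of-the-envelope check, assuming the 1 Gb link is the bottleneck and ignoring overhead:
Code:
# Transfer-time estimate at ~110 MB/s of useful throughput on a 1 Gb/s link
echo $(( 10 * 1000 * 1000 / 110 / 3600 )) hours   # 10 TB -> ~25 hours
echo $((  7 * 1000 * 1000 / 110 / 3600 )) hours   #  7 TB -> ~17 hours
# Running both in parallel from the same source disks will stretch this further.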
I'll bring back my remote machine, as bringing it up to date with snapshots on a 6 TB USB external disk whenever something awful happens is too painful; I'll set it up here, power it up every month to get its replications up to date, then power it off again.
 

echelon5

Explorer
Joined
Apr 20, 2016
Messages
79
I agree. There are lots of weird things with it, and I didn't find a doc explaining the differences and/or benefits between the legacy system and the new one. There should have been a migration guide for this.
 

Adrian

Contributor
Joined
Jun 29, 2011
Messages
166
Update
The release notes for 11.3-RELEASE do say
The replication framework has been redesigned, adding new back-end systems, files, and screen options to the Replication system and Periodic Snapshot Tasks. The redesign adds these features:
  • New peers/credentials API for creating and managing credentials. The SSH Connections and SSH Keypairs screens have been added and a wizard makes it easy to generate new keypairs. Existing SFTP and SSH replication keys created in 11.2 or earlier will be automatically added as entries to SSH Keypairs during the upgrade.
  • New transport API adds netcat support, for greatly improved speed of transfer.
  • Snapshot creation has been decoupled from replication tasks, allowing replication of manually created snapshots.
  • The ability to use custom names for snapshots.
  • Configurable snapshot retention on the remote side.
  • A new replication wizard makes it easy to configure replication scenarios, including local replication and replication to systems running legacy replication (pre-11.3).
  • Replication is resumable and failed replication tasks will automatically try to resume from a previous checkpoint. Each task has its own log, which can be accessed from the State column.
Performance with netcat is greatly improved.

Well, the initial replications with 11.3-U1 have finally finished, but something (replication?) is drastically pruning source snapshots after hours rather than honouring the 8-week retention I set, down to a couple of source snapshots or even none. This often results in there being no common snapshot and the entire dataset being replicated from scratch.
I will gather some diagnostic evidence.
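For the evidence gathering, a simple way to document what retention is actually doing is to list the surviving snapshots with their creation times (the dataset name is illustrative):
Code:
# List snapshots for one dataset, oldest first, with creation timestamps
zfs list -t snapshot -o name,creation -s creation -r tank/dataset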

Update
With a dataset automatically snapshotted hourly and replicated to 2 machines, all seemed well.
I then created two manual snapshots, one with the default name manual-... and one called keep.
Replication to the first machine worked.
Replication to the second machine appeared to work, but the manual-... snapshot had vanished from the source machine! Double plus not good.
/var/log/zettarepl.log shows many snapshots being deleted yesterday evening while I was entertaining friends. I think that a bit before that I had changed the snapshot schedule for most snapshot tasks from hourly to daily, at different times. If this was the trigger for my problems: George, don't do that!
I have enabled debugging on all 11 replication tasks on my main machine and will see if things settle down. Most replications, including a 6 TB one, should kick off in the small hours tomorrow; on my 1 Gb/s network this is going to take quite a while. I will hold off submitting an incident until my replications work, or fail.
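If anyone wants to watch for the same thing, the retention activity can be followed live from the shell. The grep pattern is only a guess at the log wording, so adjust it to whatever your zettarepl.log actually prints:
Code:
# Follow the replication/retention log and highlight deletions as they happen
tail -f /var/log/zettarepl.log | grep -i --line-buffered destroy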
 

echelon5

Explorer
Joined
Apr 20, 2016
Messages
79
What about snapshots? Is there a difference between "legacy" and new? Can I just rename the snapshot task?
 

Adrian

Contributor
Joined
Jun 29, 2011
Messages
166
Experiment?
echelon5

Explorer
Joined
Apr 20, 2016
Messages
79
Experiment?

Yeah I mean, that's what I'm doing now but I had replication tasks and snapshots older than a few months that I didn't want to ruin. I believe those that aren't homelabbers are even more reluctant to experiment :)).
 

anodos

Sambassador
iXsystems
Joined
Mar 6, 2014
Messages
9,554
  • SMB access and permissions problems after recreating my Windows shares. Resolved by re-entering the passwords for my root and individual users. Permissions still look very odd from Windows.
11.3 is more strict about the contents of the passdb.tdb file (which users can modify using the pdbedit command from the CLI) in that the middleware ensures it is synchronized with the contents of the FreeNAS configuration file. This synchronization is one-way, which means that if root (which isn't normally in Samba's passdb.tdb) was made an SMB user in 11.2 or earlier through "pdbedit -a root", then its SMB access will be removed on upgrade to 11.3 (until the root password is changed in the web UI). I don't know if this was your particular problem, but it is an issue that was reported by a couple of FreeNAS users.
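A quick, unofficial way to check whether root (or any other account) is actually present in Samba's passdb is to list it from the CLI:
Code:
# List the accounts Samba knows about; "pdbedit -Lv" shows the full detail of each entry
pdbedit -L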

As far as how permissions look from Windows: in 11.3 we switched to the upstream Samba default for the "nfs4:mode" parameter. The value in 11.2 and earlier is a deprecated legacy setting that does not work properly in many situations with complex ACLs. NFSv4 ACLs have special ids ("owner@", "group@", "everyone@") as well as explicit user and group entries. owner@ refers to the owner of the file (User), group@ refers to the owning group of the file (Group), and everyone@ refers to... well... everyone. Special ids behave differently than the other types; for instance, those entries are impacted by chmod, chown, and chgrp.

Windows actually has equivalents of these: S-1-3-0 (CREATOR OWNER), S-1-3-1 (CREATOR GROUP), and S-1-1-0 (Everyone).

Take the following situation. A directory is owned by bob:smbusers and has the following permissions:
owner@ - full_set - inherit
group@ - modify_set - inherit
everyone@ - read_set - inherit
bob has a SID of S-1-5-21-1842518067-541413841-1738574118-2002
smbusers has a SID of S-1-5-21-1842518067-541413841-1738574118-1011

Legacy behavior - Windows SD (simplified):
S-1-5-21-1842518067-541413841-1738574118-2002 - full control - inherit
S-1-5-21-1842518067-541413841-1738574118-1011 - modify - inherit
S-1-1-0 - read - inherit

New behavior - Windows SD (simplified):
S-1-3-0 - full control - inherit only
S-1-5-21-1842518067-541413841-1738574118-2002 - full control - no inherit
S-1-3-1 - modify - inherit only
S-1-5-21-1842518067-541413841-1738574118-1011 - modify - no inherit
S-1-1-0 - read - inherit

The reason why owner@ and group@ are each split into two ACEs is that S-1-3-0 and S-1-3-1 never apply to the actual file in Windows (they are always inherit-only); this copies Windows behavior. We do this because it's the best way to ensure consistent behavior in enterprise environments. The change was long overdue, but it was also one that we were not comfortable making in a U-release for obvious reasons (and so it waited for 11.3).
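For anyone who wants to compare what Windows shows against the underlying NFSv4 ACL, the FreeNAS side of the example above can be inspected from the shell (the path is just an illustration); expect owner@, group@ and everyone@ entries carrying the file/directory inherit flags:
Code:
# Show the NFSv4 ACL that Samba maps into the Windows security descriptor
getfacl /mnt/tank/share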
 

Adrian

Contributor
Joined
Jun 29, 2011
Messages
166
Yeah I mean, that's what I'm doing now but I had replication tasks and snapshots older than a few months that I didn't want to ruin. I believe those that aren't homelabbers are even more reluctant to experiment :)).
Yes, those who need to keep much older snapshots, or cannot risk their snapshots being trimmed down to nothing, had best stick with legacy replication or have good backups. I am not even a homelabber, just a person with a lot of data that I want to keep very safe.

Aside from the 3 replication machines in my signature, I also back up via rsync:
  • Everything except stuff I am prepared to lose: monthly, to a flip/flop pair of pools on 6 TB USB external disks.
  • Important stuff: daily, to rsync.net (400 GB quota, 326 GB used); a sketch of that push is below.
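For completeness, a hedged sketch of what that daily rsync.net push looks like in principle; the path, user and host are placeholders, not my actual configuration:
Code:
# Mirror the important datasets to rsync.net over SSH, deleting remote files
# that no longer exist locally
rsync -az --delete /mnt/tank/important/ <user>@<host>.rsync.net:important/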
Which reminds me, I must ditch my AWS S3 storage.
 

Adrian

Contributor
Joined
Jun 29, 2011
Messages
166
After bricks, flowers :cool:

I have been sending a 180 GB snapshot to my remote machine via a VPN with an 80/20 VDSL line at each end.
With SSH+NETCAT and no encryption (other than the VPN) this was pushing at around 18 Mb/s.
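For scale, a rough back-of-the-envelope figure at that rate, ignoring overhead:
Code:
# 180 GB pushed at ~18 Mb/s: megabits / rate / seconds-per-hour
echo $(( 180 * 8 * 1000 / 18 / 3600 )) hours   # -> ~22 hours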

I have some hardware problems down there. The active NIC in the lagg flickers down then up every few minutes, sometimes taking the lagg with it, and one drive is reporting uncorrectable parity/CRC errors. I can't deal with these until I next go there. There are some intervening comms problems too.

The replication task failed outright a few times (it retries 5 times automatically?), lighting up a red ERROR status. Simply running the task again caused it to restart, and resumable receive works! This turns "you have no chance of success" into "keep at it and you'll reach the end".
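For the curious, the resumability comes from ZFS resumable send/receive. This is not what the middleware literally runs, just a minimal sketch of the underlying mechanism, with illustrative dataset and host names:
Code:
# An interrupted receive (started with -s) leaves a resume token on the target:
zfs get -H -o value receive_resume_token tank/remote/system/dataset
# The sender can pick up from that point instead of starting from scratch:
zfs send -t <token> | ssh backuphost zfs receive -s tank/remote/system/dataset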
 

echelon5

Explorer
Joined
Apr 20, 2016
Messages
79
After bricks, flowers :cool:

I have been sending a 180 GB snapshot to my remote machine via a VPN with an 80/20 VDSL line at each end.
With SSH+NETCAT and no encryption (other than the VPN) this was pushing at around 18 Mb/s.

I have some hardware problems down there. The active NIC in the lagg flickers down then up every few minutes, sometimes taking the lagg with it, and one drive is reporting uncorrectable parity/CRC errors. I can't deal with these until I next go there. There are some intervening comms problems too.


For one of my FN machines, the MAC address for the lagg got switched to the other NIC.
 

Adrian

Contributor
Joined
Jun 29, 2011
Messages
166
Yes, that has happened to most of mine. It was confusing when it first happened, as I use DHCP but had only allocated a fixed IP to the (normally) master NIC, so the machine came up with a different IP address. I now have both NICs allocated the same address. DHCP is convenient, but using static addressing and explicitly specifying DNS and gateway addresses does have its merits.
 

Adrian

Contributor
Joined
Jun 29, 2011
Messages
166
@Vladimir Vinogradenko wrote that the snapshot deletion problem is a bug which will be fixed in 11.3-U2.

He provided a circumvention too!
More flowers! :cool:

Meanwhile as a workaround you can temporarily disable "Hold pending snapshots feature". It erroneously forces retention to delete snapshots that could not have been created by any existing periodic snapshot task (e.g. manual snapshots, or, in your case, snapshots taken at 01:00, 02:00, etc... as you've changed your periodic snapshot tasks to only run at midnight).

I have tested this and it is working well, other than manual snapshots not being replicated.

I am impressed by the speed of the turnaround of my bug report. It looks as if it has been fixed in 12, backported to 11.3-U2, the test suite upgraded, and a circumvention provided, all in a few days.
 