SOLVED Struggling with snapshots and replication

RandomPrecision

Dabbler
Joined
Apr 17, 2023
Messages
21
I have two TrueNAS systems. The primary is running SCALE 22.12.1. The secondary is intended as a pure backup server; it is running CORE 13.0-U4. (I intend to change the backup system to SCALE in the future, for consistency.)

The primary system has three zpools, each with one or more datasets(*). For each pool, I have created daily recursive snapshot tasks. As far as I can tell, this is working. I can see that the snapshots for the pools and child datasets are being created every day.

The backup system has a single huge pool (much bigger than the combined size of the three pools on the primary system). What I want is to have three datasets under this pool, each one corresponding to one of the pools on the primary system. After the snapshots are taken on the primary system, they should immediately be replicated to the backup system. Furthermore, on the initial replication (i.e. the first time it runs), all previously created snapshots should be synced as well. The intent here is to have a reliable backup: if the primary system just magically disappeared, the data (as of the last snapshot+sync) would be readily available on the backup system.

It looks like I'll need three replication tasks, one for each zpool+snapshot task. But I'm not sure exactly how to configure the replication tasks. (I'm working with just one pool for now, for simplicity.) Where I'm at now: I created a replication task and manually ran it. It ran correctly as far as I can tell. For the replication schedule, I selected "run automatically". But it never ran after the first run, even though the corresponding snapshots were being created every day. So I tried to run it manually, and it failed, saying "Replication <job name> failed: Active side: most recent snapshot of <destination dataset name> does not match incremental source."

(*) I don't think it should matter, but in the interest of full disclosure: the zpools on the primary system were created not by TrueNAS, but by Proxmox. This system was previously running Proxmox only. I moved Proxmox to a separate (physical) system. This system got a new boot disk, on which I installed TrueNAS SCALE, and imported the previously-created pools. (The backup TrueNAS Core system has always been just TrueNAS, and its pool was created by TrueNAS.)

I appreciate any feedback!
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
the first thing that comes to mind is your snapshots are expiring before your replications complete, which means that the backup never has the correct snapshots for incremental. your first snapshot, the "incremental source", has to exist at least long enough for the replication to finish and replicate more snapshots.
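A rough way to test this theory from the CLI is to compare snapshot names on both sides and look for the newest one they share; if nothing is shared, an incremental send has no base to start from. The sketch below is only an illustration: `latest_common_snapshot` is a made-up helper, and it assumes both input files hold bare snapshot names (the part after `@`) whose sort order is chronological, as with the `auto-YYYY-MM-DD_HH-MM` naming.

```shell
#!/bin/sh
# Hypothetical helper: print the newest snapshot name present in both lists.
# $1 and $2 are files with one bare snapshot name per line (the part after '@').
# Assumes names sort chronologically, e.g. auto-2023-04-22_00-00.
latest_common_snapshot() {
    src_sorted=$(mktemp)
    dst_sorted=$(mktemp)
    sort -u "$1" > "$src_sorted"
    sort -u "$2" > "$dst_sorted"
    # comm -12 keeps only the lines that appear in both sorted inputs.
    comm -12 "$src_sorted" "$dst_sorted" | tail -n 1
    rm -f "$src_sorted" "$dst_sorted"
}
```

The lists could be produced on each system with something like `zfs list -H -o name -t snapshot -r tank1 | grep '^tank1@' | sed 's/.*@//'` (with the destination dataset name on the backup side). An empty result would mean the two sides have no snapshot in common.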

otherwise, the relatively little detail given sounds correct
you would want to basically be doing
tank1>btank/tank1
tank2>btank/tank2
etc.
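
For illustration only, the mapping above corresponds roughly to one send/receive pipeline per source pool. The sketch below just prints the commands rather than running them; `print_replication_plan`, the host name and the snapshot name are placeholders, and a TrueNAS replication task sets all of this up for you through the GUI, so treat it as a picture of the layout rather than something to run as-is.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print (do not run) one send/receive pipeline
# per source pool, mapping <pool> to <backup_pool>/<pool>.
# Arguments: backup host, backup pool, snapshot name, then the source pools.
print_replication_plan() {
    host=$1; backup_pool=$2; snap=$3
    shift 3
    for pool in "$@"; do
        # -R sends the pool together with its child datasets and snapshots;
        # -F lets the receiver roll the target back to its most recent
        # snapshot before receiving.
        printf 'zfs send -R %s@%s | ssh %s zfs receive -F %s/%s\n' \
            "$pool" "$snap" "$host" "$backup_pool" "$pool"
    done
}
```

For example, `print_replication_plan backuphost btank auto-2023-04-22_00-00 tank1 tank2` prints one line for tank1 into btank/tank1 and one for tank2 into btank/tank2.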


screenshots can be helpful because there are many replication settings
 

RandomPrecision

Dabbler
Joined
Apr 17, 2023
Messages
21
Thank you for your reply, I appreciate it!

the first thing that comes to mind is your snapshots are expiring before your replications complete, which means that the backup never has the correct snapshots for incremental. your first snapshot, the "incremental source", has to exist at least long enough for the replication to finish and replicate more snapshots.

I don't believe that is the case, as I have the snapshot lifetime set to four weeks.

otherwise, the relatively little detail given sounds correct
you would want to basically be doing
tank1>btank/tank1
tank2>btank/tank2
etc.

screenshots can be helpful because there are many replication settings

Here's a bit more detail: one of the three pools on the source system is named "ssdpool", and this is the one I'm trying to replicate (for now, once I get this working as intended, I'll do it for the other two). I just created a new dataset on the backup server, "ssdrepl". I used the wizard to create a new replication task, ssdpool (Recursive) to ssdrepl. I ran it immediately after creating it, and it completed successfully.

So the next step is to see if it runs automatically every night, in accordance with the nightly snapshot schedule. I created a few new test files in some of the datasets that live under "ssdpool". I expect to see those on the backup system after replication runs (i.e. a simplistic manual validation that replication is working, in addition to what the TrueNAS GUI is telling me).

While I wait to see what happens, one question I have: when creating these replication tasks, what are the considerations for choosing "Recursive" versus "Full Filesystem Replication"? They seem very similar to me, but since there are two options, I assume there is some difference (I circled them in yellow in the screenshot).

Here's a screenshot of some of my current replication settings. I know it's not everything; if this fails or doesn't work as expected I'll post all the details, but I wanted to show something to keep the discussion going (and ask about Recursive vs Full Filesystem Replication).

Thanks again!
 

Attachments

  • ssdpool_replication_edit_01.png

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
"Recursive" versus "Full Filesystem Replication"?
recursive you can choose exceptions, full replication will attempt to make it completely identical. i've found that full just causes more problems, for example, it tries to replicate snapshots based on expectations instead of just what is present, so if you have anything excluded from snapshots, it will fail trying to replicate a nonexistent snapshot.
for example, if you are replicating the pool that contains the system dataset, you will want to exclude that, as it causes headaches with replication. same with jail/docker and VM datasets: if you wish to replicate those, it's likely better to set them up separately, as they are more than mere data.
 

RandomPrecision

Dabbler
Joined
Apr 17, 2023
Messages
21
So, indeed, the replication did not run last night, even though snapshots were generated. Here is the snapshot config:

ssdpool_snapshot_config_20230422.png


And indeed I can see in the GUI that recursive snapshots are being taken daily for this pool. I'll post the "zfs list" output from the terminal, instead of a bunch of screenshots from the GUI. Clearly I have snapshots for 2023-04-22.

Code:
root@fileserver[~]# zfs list -t snapshot | grep ssdpool | grep 2023-04-22
ssdpool@auto-2023-04-22_00-00                                                        0B      -      132K  -
ssdpool/.system@auto-2023-04-22_00-00                                               64K      -     1.34G  -
ssdpool/.system/configs-8f179c8648fc4419af075a5cf26c19f8@auto-2023-04-22_00-00      64K      -     16.3M  -
ssdpool/.system/cores@auto-2023-04-22_00-00                                          0B      -       96K  -
ssdpool/.system/ctdb_shared_vol@auto-2023-04-22_00-00                                0B      -       96K  -
ssdpool/.system/glusterd@auto-2023-04-22_00-00                                       0B      -      108K  -
ssdpool/.system/rrd-8f179c8648fc4419af075a5cf26c19f8@auto-2023-04-22_00-00         152M      -      158M  -
ssdpool/.system/samba4@auto-2023-04-22_00-00                                        72K      -     2.34M  -
ssdpool/.system/services@auto-2023-04-22_00-00                                       0B      -       96K  -
ssdpool/.system/syslog-8f179c8648fc4419af075a5cf26c19f8@auto-2023-04-22_00-00     18.3M      -      303M  -
ssdpool/.system/webui@auto-2023-04-22_00-00                                          0B      -       96K  -
ssdpool/ix-applications@auto-2023-04-22_00-00                                        0B      -      136K  -
ssdpool/ix-applications/catalogs@auto-2023-04-22_00-00                             512K      -     18.5M  -
ssdpool/ix-applications/default_volumes@auto-2023-04-22_00-00                        0B      -       96K  -
ssdpool/ix-applications/docker@auto-2023-04-22_00-00                                 0B      -     63.4M  -
ssdpool/ix-applications/k3s@auto-2023-04-22_00-00                                 11.2M      -      674M  -
ssdpool/ix-applications/k3s/kubelet@auto-2023-04-22_00-00                           96K      -      160K  -
ssdpool/ix-applications/releases@auto-2023-04-22_00-00                               0B      -       96K  -
ssdpool/septictank-pve-images@auto-2023-04-22_00-00                               2.09G      -     71.8G  -
ssdpool/ssdhome@auto-2023-04-22_00-00                                              828K      -      180G  -
ssdpool/subvol-100-disk-0@auto-2023-04-22_00-00                                      0B      -     3.69G  -
ssdpool/subvol-101-disk-0@auto-2023-04-22_00-00                                      0B      -     4.11G  -
ssdpool/subvol-102-disk-0@auto-2023-04-22_00-00                                      0B      -     87.8G  -
ssdpool/subvol-103-disk-0@auto-2023-04-22_00-00                                      0B      -     1.19G  -
ssdpool/subvol-105-disk-0@auto-2023-04-22_00-00                                      0B      -      963M  -
ssdpool/subvol-106-disk-0@auto-2023-04-22_00-00                                      0B      -     3.77G  -
ssdpool/subvol-107-disk-0@auto-2023-04-22_00-00                                      0B      -     1.55G  -
ssdpool/subvol-108-disk-0@auto-2023-04-22_00-00                                      0B      -     6.59G  -
ssdpool/subvol-109-disk-0@auto-2023-04-22_00-00                                      0B      -     2.21G  -
ssdpool/vm-104-disk-0@auto-2023-04-22_00-00                                          0B      -      417M  -
ssdpool/vm-104-disk-1@auto-2023-04-22_00-00                                          0B      -       56K  -


Here is the overview of the replication jobs I have. It shows the Last Snapshot as 2023-04-20. I expect that it should show a snapshot from 2023-04-22. Furthermore, the test files I created have not been replicated to the backup server. (Even though I'm trying to get this to work for "ssdpool", you can see where I tried to get this working for "data1pool" as well - that pool also has snapshots configured the same as ssdpool, and I confirmed they are running daily. But still the replication hasn't run since April 11.)

replication_tasks_overview_20230422.png


Here is the complete config for this replication task:

ssdpool_repl_config_1of2.png


ssdpool_repl_config_2of2.png


Any thoughts on why this does not run automatically as I expect?

Thanks again!
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
first thought. set yourself up some hourly snapshots and replication, so you aren't doing this by relying on a nightly run for testing.

2nd. i have not used the function to tie directly into snapshots for scheduling; i schedule the replication and the schema independently of the snapshots. i'd found that tying directly into the snapshots prevented me from doing things like excluding datasets from snapshots or replications. additionally, this will replicate custom snapshots that follow the schema.

3rd. sorry, i have no further ideas. it looks like that should just work. the only reasons I can think of that it doesn't should fail the replication, not have it say it was successful.

how are you checking the destination for the snapshots?
 

RandomPrecision

Dabbler
Joined
Apr 17, 2023
Messages
21
first thought. set yourself up some hourly snapshots and replication, so you aren't doing this by relying on a nightly run for testing.

I did exactly that. Even now, it's still not running. The overview of replication jobs looks exactly the same as it does in the previous screenshot: the last replicated snapshot for ssdpool is dated 2023-04-20. That should be for 2023-04-23.

Here is the bottom half of the replication config now (the other half is exactly the same). All I did was change the schedule to be hourly as suggested. I did that earlier this morning, i.e. many hours ago.

ssdpool_repl_config_20230423.png



2nd. i have not used the function to tie directly into snapshots for scheduling; i schedule the replication and the schema independently of the snapshots. i'd found that tying directly into the snapshots prevented me from doing things like excluding datasets from snapshots or replications. additionally, this will replicate custom snapshots that follow the schema.

3rd. sorry, i have no further ideas. it looks like that should just work. the only reasons I can think of that it doesn't should fail the replication, not have it say it was successful.

how are you checking the destination for the snapshots?

I am logging in via ssh to the backup server, and manually inspecting. I've been creating test files on the source system, like
Code:
# touch testfile_`date +%Y%m%d-%H%M%S`.txt


I do that in a few different datasets in ssdpool. My expectation is that, after the replication runs, I should see these files in the dataset I made for backup/replication on the backup server. But I'm not seeing them.
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
wait. are you checking for the snapshots or for the files having been mounted? there is sometimes some less intuitive stuff that goes on with the mounting at the destination and if any of those occur you will not see any files because the destination wasn't mounted. sometimes it has to be unmounted for the replication.

look for the actual snapshots, first. if there are no snapshots, there is a problem with the replication, but if the snapshots are present you need to see if it's mounted. replication problems *should* always have an error.

both of these could be very useful.

zfs list -t snapshot
zfs mount -a
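
To narrow down the "is it mounted?" half of that, one option is to ask ZFS for the `mounted` property of the destination datasets and filter for anything that is not mounted. This is just a sketch: `list_unmounted` is a made-up helper, and it assumes tab-separated `name<TAB>value` input as produced by `zfs get -H -o name,value mounted`.

```shell
#!/bin/sh
# Hypothetical helper: read "name<TAB>value" lines, as printed by
#   zfs get -H -o name,value mounted -r -t filesystem <destination dataset>
# and print the names of datasets that are not currently mounted.
list_unmounted() {
    awk -F '\t' '$2 != "yes" { print $1 }'
}
```

Usage would be along the lines of `zfs get -H -o name,value mounted -r -t filesystem <destination dataset> | list_unmounted` on the backup server. If it prints nothing, everything under the destination is mounted and the problem is more likely on the replication side.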
 

RandomPrecision

Dabbler
Joined
Apr 17, 2023
Messages
21
I was indeed naively just checking the destination dataset on the backup server, expecting it to essentially be a living clone of the source (my use case is very similar to the one described in this thread).

Now I see that the system doesn't work the way I expect. So, I am seeing snapshots on the backup system that look like they correspond to the same snapshots on the source system. But these replicated snapshots don't appear to have the latest data (or again I have a misunderstanding of how things work).

Here's what I'm doing from the CLI on the source system. I'm focusing on a single dataset within the "ssdpool" pool, "ssdhome".
Code:
# snapshots being created (as expected) on the primary/source system:
root@fileserver[~]# zfs list -t snapshot | grep ssdhome | grep '2023-04-2'
ssdpool/ssdhome@auto-2023-04-20_00-00                                             1.41M      -      180G  -
ssdpool/ssdhome@auto-2023-04-21_00-00                                             1.23M      -      180G  -
ssdpool/ssdhome@auto-2023-04-22_00-00                                             1.06M      -      180G  -
ssdpool/ssdhome@auto-2023-04-23_00-00                                             2.87M      -      185G  -
ssdpool/ssdhome@auto-2023-04-24_00-00                                             1.11M      -      186G  -

# first show the "live" dataset on the primary system, and the dated test file I created yesterday:
root@fileserver[~]# cd /mnt/ssdpool/ssdhome/matt
root@fileserver[/mnt/ssdpool/ssdhome/matt]# pwd
/mnt/ssdpool/ssdhome/matt
root@fileserver[/mnt/ssdpool/ssdhome/matt]# ls -lah testfile_20230423-184120.txt
-rw-rw-rw- 1 matt matt 55 Apr 23 18:41 testfile_20230423-184120.txt

# "ssdhome" is a dataset, and has a hidden .zfs directory with its snapshots:
root@fileserver[/mnt/ssdpool/ssdhome/matt]# cd ../.zfs/snapshot/
root@fileserver[/mnt/ssdpool/ssdhome/.zfs/snapshot]# ls -d *2023-04-24*
auto-2023-04-24_00-00
root@fileserver[/mnt/ssdpool/ssdhome/.zfs/snapshot]# cd auto-2023-04-24_00-00
root@fileserver[...e/.zfs/snapshot/auto-2023-04-24_00-00]# ls
matt
root@fileserver[...e/.zfs/snapshot/auto-2023-04-24_00-00]# cd matt
root@fileserver[...s/snapshot/auto-2023-04-24_00-00/matt]# ls -lah testfile_20230423-184120.txt
-rw-rw-rw- 1 matt matt 55 Apr 23 18:41 testfile_20230423-184120.txt


Now, I want to see that same testfile on the backup system:
Code:
# starting from a fresh boot
root@dumpster[~]# uptime
11:04AM  up 19 mins, 1 user, load averages: 0.23, 0.25, 0.20

# there are ssdpool/ssdhome snapshots:
root@dumpster[~]# zfs list -t snapshot | grep ssdrepl | grep ssdhome | grep '2023-04-2'
backup-pool/ssdrepl/ssdhome@auto-2023-04-20_00-00                                      0B      -      165G  -
backup-pool/ssdrepl/ssdhome@auto-2023-04-21_00-00                                      0B      -      165G  -
backup-pool/ssdrepl/ssdhome@auto-2023-04-22_00-00                                      0B      -      165G  -
backup-pool/ssdrepl/ssdhome@auto-2023-04-23_00-00                                      0B      -      165G  -
backup-pool/ssdrepl/ssdhome@auto-2023-04-24_00-00                                      0B      -      165G  -

# check out the mount situation
# according to the zfs-mount manpage, it looks like "zfs mount -a" should generally be run at boot
root@dumpster[~]# mount | grep ssdhome
backup-pool/ssdrepl/ssdhome on /mnt/backup-pool/ssdrepl/ssdhome (zfs, local, read-only, nfsv4acls)

# try to verify that the snapshot does indeed contain my test file:
root@dumpster[~]# cd /mnt/backup-pool/ssdrepl/ssdhome
root@dumpster[/mnt/backup-pool/ssdrepl/ssdhome]# cd .zfs/snapshot
root@dumpster[...up-pool/ssdrepl/ssdhome/.zfs/snapshot]# ls -d *2023-04-24*
auto-2023-04-24_00-00
root@dumpster[...up-pool/ssdrepl/ssdhome/.zfs/snapshot]# cd auto-2023-04-24_00-00/matt

root@dumpster[...s/snapshot/auto-2023-04-24_00-00/matt]# pwd
/mnt/backup-pool/ssdrepl/ssdhome/.zfs/snapshot/auto-2023-04-24_00-00/matt

# moment of truth...
root@dumpster[...s/snapshot/auto-2023-04-24_00-00/matt]# ls -lah testfile_20230423*
zsh: no matches found: testfile_20230423*


Any further thoughts? Am I still missing some conceptual knowledge about snapshots + replication, and how they appear on the backup system?
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Is ...ssdhome/matt a separate dataset? If yes, there's your problem. You need to snapshot and replicate that. Neither snapshots nor replication are recursive by default. You can set the recursive flag in the UI to snapshot and replicate ssdhome and all child datasets.
 

RandomPrecision

Dabbler
Joined
Apr 17, 2023
Messages
21
No, it's not a separate dataset. Also, both the snapshots and the replication are set to recursive (see above screenshots).
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
expecting it to essentially be a living clone of the source
this is basically accurate, however, it doesn't need to be mounted. while zfs is mounted automatically, replications can automatically unmount, and they might not get mounted afterwards until reboot or a manual mount. it functionally becomes a display issue.

you have enough numbers here to make my head spin. i've never delved that deeply. I would just mount the expected directory and see if the file is there. i ran into the same confusion with mounts, that's part of why it occurred to me that that might be part of your confusion.

I *think* the snapshot would be incremental. the first snapshot that sees that file would have it, but everything after would not?
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
I *think* the snapshot would be incremental. the first snapshot that sees that file would have it, but everything after would not?
Every snapshot contains all the data that was in the dataset at the time the snapshot was taken. Snapshots are incremental at the block level and ZFS - as far as snapshots and replication are concerned - does not really care about files at all. The POSIX filesystem view is another layer of abstraction implemented on top of the fundamental block level mechanics. But to repeat - all snapshots are complete with respect to the moment of their creation.
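
To make that concrete: a file created on Apr 23 should show up in the Apr 24 snapshot and in every later one (until it is deleted), not only in the first snapshot taken after it was created. A quick way to see this is to walk the hidden `.zfs/snapshot` directory of a mounted dataset. The helper below is a hypothetical sketch (`snapshots_containing` is not a ZFS command), and it assumes the dataset is mounted and its snapshot directory is reachable.

```shell
#!/bin/sh
# Hypothetical helper: list the snapshots of a mounted dataset that contain
# a given file, using the hidden .zfs/snapshot directory.
# $1 = dataset mountpoint, $2 = file path relative to the dataset root.
snapshots_containing() {
    for snapdir in "$1"/.zfs/snapshot/*/; do
        # -e tests the path inside this snapshot's read-only view.
        if [ -e "$snapdir$2" ]; then
            basename "$snapdir"
        fi
    done
}
```

For example, `snapshots_containing /mnt/ssdpool/ssdhome matt/testfile_20230423-184120.txt` on the source should list every snapshot from auto-2023-04-24_00-00 onward; running the same check against the replicated dataset's mountpoint on the backup shows whether the received snapshots agree.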
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
Every snapshot contains all the data that was in the dataset at the time the snapshot was taken.
then I have no idea why those snapshots appear to be missing files. it looks like the commands listed should have the expected results.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Me neither. Probably a matter of minutes with direct access to the system but over the forum these things are hard sometimes.
 

RandomPrecision

Dabbler
Joined
Apr 17, 2023
Messages
21
I agree, it's probably something simple/silly. I'm happy to post any more screenshots and/or terminal output. Or even some conceptual hints on what I should be looking for... I'm kind of stumped, because I don't feel like this is a wacky use-case.
 
Joined
Oct 22, 2019
Messages
3,641
To rule something out, do the source and destination snapshots share the same GUID?
Code:
zfs get -H guid ssdpool/ssdhome@auto-2023-04-24_00-00

zfs get -H guid backup-pool/ssdrepl/ssdhome@auto-2023-04-24_00-00
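
If there are many snapshots to compare, the same check can be done in bulk. A received snapshot keeps the GUID of the snapshot it was sent from, so two snapshots with the same name but different GUIDs were created independently. The sketch below is an assumption-laden illustration: `compare_snapshot_guids` is a made-up helper, it expects tab-separated `name<TAB>guid` lines, and it pairs snapshots by the name after `@`, so each input file should cover a single dataset.

```shell
#!/bin/sh
# Hypothetical helper: compare snapshot GUIDs between two listings.
# $1 and $2 are files holding "name<TAB>guid" lines, as printed by
#   zfs get -H -o name,value guid -r -t snapshot <dataset>
# on the source ($1) and on the destination ($2), for a dataset without
# children. Snapshots are paired by the name after '@'.
compare_snapshot_guids() {
    awk -F '\t' '
        { split($1, parts, "@") }
        # First file: remember the source GUID for each snapshot name.
        FILENAME == ARGV[1] { src[parts[2]] = $2; next }
        # Second file: report snapshots that also exist on the source.
        (parts[2] in src) {
            print parts[2], (src[parts[2]] == $2 ? "match" : "MISMATCH")
        }
    ' "$1" "$2"
}
```

Any line reporting MISMATCH would point at a snapshot on the backup that shares a name with a source snapshot but did not come from it.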
 

RandomPrecision

Dabbler
Joined
Apr 17, 2023
Messages
21
So I did one of the worst things you can do in debugging: changed multiple parameters at once and now it appears to be magically working. What I did went something like this:
  1. Removed the "ix-applications" dataset under ssdpool on both the source and backup systems.
  2. Doing that caused the replication to fail outright, with the error "Warning: Permanently added the ECDSA host key for IP address '<backup server ip>' to the list of known hosts. cannot receive incremental stream: most recent snapshot of backup-pool/ssdrepl does not match incremental source."
  3. At this point I thought I'd just start from the beginning, and deleted the "ssdrepl" dataset on the backup server.
  4. As I have been struggling with this a bit before I posted, there was one other dataset on the backup server named "ssdpool". I deleted that as well. (That was either a previous attempt at the snapshot replication I'm trying to do now, or maybe even an old rsync-based backup from when the source system was running Proxmox.)

I did that last night. On the source system, both the snapshots and replication are configured to run hourly. What I'm seeing on the backup system is exactly what I initially expected: the backup-pool/ssdpool dataset is current with the source system as of the last snapshot - and this is without having to do anything with mounts or snooping in the special ".zfs" directory. In other words, I'm creating dated test files on the source system, and within an hour, seeing them on the backup system. So at this moment, it appears I have a near real-time hot copy of the primary system (which is essentially what I want).

Another thing that is promising: now, on the source system, when I look at the replication tasks, under the "Last Snapshot" column, it shows a snapshot with a date and timestamp that I expect (i.e. today, at the start of the current hour). Previously, that "Last Snapshot" column always showed a snapshot older than expected. See my post #5 above, which has screenshots made on Apr 22, but the last replicated snapshot date was Apr 20. Previously, I was hoping that was just a display issue, but given that I was only seeing "old" data on the backup system, I now think the backup was in fact not running. (Presumably, it would have reported an error if it tried to run and failed; here I think it erroneously believed it did not need to run.)

Realizing I took a bit of a "shotgun approach" here, I know it's hard to pinpoint exactly what was wrong. I am guessing that having an old dataset on the backup system named "ssdpool" was somehow confusing one of the systems.

Another potential explanation, and question: on the backup system, I have a daily recursive snapshot task for backup-pool enabled. Could this be another source of confusion for the replication process? I am sending snapshots from source to backup; but if the backup also creates snapshots, is that a potential issue? And furthermore, does it even make sense to do snapshots on the backup system's pool? The backup system is used exclusively for backup, there is no other data on it.
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
changed multiple parameters at once and now it appears to be magically working.
sometimes what needed to happen was to change all those parameters. it's good that it's working. maybe you had something out of sync, although it's weird there would have been no errors.
 

RandomPrecision

Dabbler
Joined
Apr 17, 2023
Messages
21
I appreciate all the help everyone, I just marked this solved. It's been running along smoothly for a while now. Thanks again!
 