Monitoring replication & shutting down the remote server when replication is done


extera

Cadet
Joined
Apr 19, 2013
Messages
8
Hi all,

I have been busy setting up a backup solution for my FreeNAS box.
Using the documentation, I have set up ZFS replication to a remote server (same LAN).

At this time, the remote server boots up at 00:50.
Replication is set to start at 01:00 and run until 05:00.
The remote server powers down at 05:00.

I would very much like the remote FreeNAS server to power down when replication is finished.
I don't know how to trigger a remote power down, or even how to set up a scheduled one from the main FreeNAS box (PUSH).

It would be even better if I could also power on the server before replication starts.
I have also successfully tested WOL, however not from my FreeNAS box.

When replication runs, it is very hard (for me) to tell the state of the process.
I use zpool list on the remote server to see if the remote volumes are growing, but this only works the first time, and it's not an automated way of monitoring the process.

I also check the utilization of the switch port and/or NICs to see if data is still being transferred.

Do you guys know if there is a better way to monitor the replication process?
It would be great to see the status, time left, success or failure, etc.

I am running 9.1.1 on both boxes.

Thanks!
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
There is no indicator for replication. It's done when it's done. There is no easy way to monitor it. Just accept that it is how it is and it's done when it's done. CPU usage and/or network usage can definitely tell you if it's working or not, though.
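For example, you can watch per-second traffic on the replication NIC from the shell (em0 is just a placeholder for your interface name):
Code:
netstat -I em0 -w 1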


Keep in mind that the replication is NOT set to a particular time. If you choose 0100, it does NOT start at 0100. It is only guaranteed to start between 0100 and 0159. At least, this was how it was the last time I played with replication. If I'm wrong, feel free to poke me in the eye.

The danger comes from a replication task taking too long. Say you have a snapshot that takes 6 hours to replicate. It will lock you up forever, because your backup server will shut down before the replication finishes the first night, and every single night after that it will restart and try again (and of course fail again and again). So 6 months from now, when you realize you need your backups, you'll discover they haven't worked the whole time. That sucks, but that's exactly what has happened to many people here who try to do timed shutdowns.

Automated powerdowns are easy: set up a cron job that runs shutdown -p now. Again, remember my warning in the previous paragraph.
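For example, a cron job on the backup box (added through the FreeNAS GUI so it survives reboots) that powers it down at 05:00 boils down to this crontab line (the time is just the one from your schedule; adjust as needed):
Code:
# run as root at 05:00 every day on the backup (PULL) box
0 5 * * * /sbin/shutdown -p now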

Your best bet is to build an ultra-low-power system for backups. Pay attention to power consumption: use a low-power CPU and let the disks spin down. Then they wake up when you need them, and only sip power the rest of the day.
 

Dusan

Guru
Joined
Jan 29, 2013
Messages
1,165
There is no indicator for replication. It's done when it's done. There is no easy way to monitor it. Just accept that it is how it is and it's done when it's done. CPU usage and/or network usage can definitely tell you if it's working or not, though.
Yes, there is. Look for the /var/run/autorepl.pid file. If it exists, the replication is running and the file contains the PID of the replication process. The file disappears when the replication finishes.
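For example, a quick check from the shell on the PUSH box:
Code:
if [ -f /var/run/autorepl.pid ]; then
    echo "replication running, PID $(cat /var/run/autorepl.pid)"
else
    echo "no replication running"
fi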

Keep in mind that the replication is NOT set to a particular time. If you choose 0100, it does NOT start at 0100. It is only guaranteed to start between 0100 and 0159. At least, this was how it was the last time I played with replication. If I'm wrong, feel free to poke me in the eye.
Close your eye, incoming poke! :D This is how the FreeNAS autoreplication works:
  • cron runs the /usr/local/www/freenasUI/tools/autosnap.py script every minute.
  • The autosnap.py script checks if there is any snapshot scheduled for that minute and takes it.
  • autosnap.py then runs /usr/local/www/freenasUI/tools/autorepl.py (it always runs autorepl.py, even if no snapshot was taken).
  • autorepl.py checks if there are any pending replications (unreplicated snapshots that need to be replicated) and starts the replication if we are inside the allowed time window.
  • However, if a replication is already running it will not start another one. Only one replication can run at any point in time. Any new replication has to wait for the previous one to finish (it will start the next minute after the previous one finishes, but only if we are still inside the defined time window) -- it uses the /var/run/autorepl.pid file to determine if a replication is already running.
Back to your example. Let's assume that the replication is allowed to start between 0100 and 0159 and that you have a snapshot scheduled every 30 minutes. What happens? At 0100 the system will start replication of all snapshots that were created since 0159 the previous day. If the replication finishes before 0130, the 0130 snapshot will be replicated immediately after it is created. If not, it will be replicated as soon as the previous replication finishes, provided that happens before 0159. If not, the 0130 snapshot will have to wait and will be included in the next day's 0100 replication batch.
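If you want to script around this, here is a rough sketch of the startup check described above (an approximation only, not the actual autorepl.py code; the window times are the example values):
Code:
#!/bin/sh
# One replication at a time; new ones may only start inside the Begin/End window.
LOCK=/var/run/autorepl.pid
NOW=$(date +%H%M)
if [ -f "$LOCK" ]; then
    echo "A replication is already running (PID $(cat $LOCK))."
elif [ "$NOW" -ge 0100 ] && [ "$NOW" -le 0159 ]; then
    echo "Inside the window: a pending replication would start now."
else
    echo "Outside the window: pending snapshots wait for the next window."
fi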
 

Dusan

Guru
Joined
Jan 29, 2013
Messages
1,165
At this time, the remote server boots up at 00:50.
Replication is set to start at 01:00 and run until 05:00.
The remote server powers down at 05:00.
This is not how the replication time window setting works. The Begin and End time fields only specify when the replication is allowed to start. It will then run until it finishes, regardless of the End time (i.e. the replication will not be forcibly stopped if it is still running past the End time). So you should not blindly shut down the remote server at the specified time.
I would very much like the remote FreeNAS server to power down when replication is finished.
I don't know how to trigger a remote power down, or even how to set up a scheduled one from the main FreeNAS box (PUSH).
You can shut down the remote system by running a command via SSH. Example:
ssh -i private_key_file root@remote_system shutdown -p now
(If you set up your SSH keys properly, this will run the shutdown command on the remote system using the root account).
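Combining this with the /var/run/autorepl.pid check from my previous post, here is a rough sketch of a "shut down the PULL box once replication finishes" script to run on the PUSH box (the key file and hostname are placeholders):
Code:
#!/bin/sh
# Run on the PUSH box from cron, some time after the replication has started.
# Wait until no replication is running, then power down the remote (PULL) box.
while [ -f /var/run/autorepl.pid ]; do
    sleep 60
done
ssh -i private_key_file root@remote_system shutdown -p now
(Note that if this runs before the replication has actually started, the pid file will not exist yet, so give the replication some time to begin first.)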
It would be even better if I could also power on the server before replication starts.
I have also successfully tested WOL, however not from my FreeNAS box.
You can download this package: http://pkg.cdn.pcbsd.org/freenas/9.1-RELEASE/amd64/net/wol-0.7.1_2.txz, extract the binary and use it. Documentation: http://www.freebsd.org/cgi/man.cgi?manpath=freebsd-release-ports&query=wol
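Basic usage is just the MAC address of the backup box's NIC (placeholder MAC below; see the linked man page for options such as directing the magic packet to a specific broadcast address):
Code:
wol 00:11:22:33:44:55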
When replication runs, it is very hard (for me) to tell the state of the process.
I use zpool list on the remote server to see if the remote volumes are growing, but this only works the first time, and it's not an automated way of monitoring the process.

I also check the utilization of the switch port and/or NICs to see if data is still being transferred.

Do you guys know if there is a better way to monitor the replication process?
It would be great to see the status, time left, success or failure, etc.
If you use the FreeNAS autoreplication you can track the /var/run/autorepl.pid file as mentioned above. You won't get any detailed progress report though.
The other option is to write your own replication script (check autorepl.py for inspiration) and use the new zfs send -v switch, which reports live send progress (www.freebsd.org/cgi/man.cgi?manpath=freebsd-release-ports&query=zfs).
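A manual incremental send with progress reporting could look roughly like this (pool and snapshot names, key file and hostname are placeholders; -I sends all intermediate snapshots, and the -v progress output goes to stderr so it does not interfere with the stream):
Code:
zfs send -v -I main@auto-old main@auto-new | \
    ssh -i private_key_file root@remote_system zfs receive -F -d backup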
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Yes, there is. Look for the /var/run/autorepl.pid file. If it exists, the replication is running and the file contains the PID of the replication process. The file disappears when the replication finishes.

But it doesn't tell him if the replication is 5% done or 95% done. That's what the OP wants to know (I think). There is no easy way to monitor that except with the first replication task. When you are initially sending your snapshot to a new pool, you can use zpool list to see how full the destination is, so you have a "rough" idea. But after that it's anyone's guess.
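For example, for that first send you can compare the two pools by hand (pool names are placeholders):
Code:
# on the PUSH box: how much data the source pool holds
zpool list main
# on the PULL box: watch ALLOC grow towards that figure
zpool list backup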


Close your eye, incoming poke! :D This is how the FreeNAS autoreplication works:
  • cron runs the /usr/local/www/freenasUI/tools/autosnap.py script every minute.
  • The autosnap.py script checks if there is any snapshot scheduled for that minute and takes it.
  • autosnap.py then runs /usr/local/www/freenasUI/tools/autorepl.py (it always runs autorepl.py, even if no snapshot was taken).
  • autorepl.py checks if there are any pending replications (unreplicated snapshots that need to be replicated) and starts the replication if we are inside the allowed time window.
  • However, if a replication is already running it will not start another one. Only one replication can run at any point in time. Any new replication has to wait for the previous one to finish (it will start the next minute after the previous one finishes, but only if we are still inside the defined time window) -- it uses the /var/run/autorepl.pid file to determine if a replication is already running.
Back to your example. Let's assume that the replication is allowed to start between 0100 and 0159 and that you have a snapshot scheduled every 30 minutes. What happens? At 0100 the system will start replication of all snapshots that were created since 0159 the previous day. If the replication finishes before 0130, the 0130 snapshot will be replicated immediately after it is created. If not, it will be replicated as soon as the previous replication finishes, provided that happens before 0159. If not, the 0130 snapshot will have to wait and will be included in the next day's 0100 replication batch.

Thanks for the explanation. It sounds like things may have changed slightly.

Are you sure that every unreplicated snapshot will run at that one time? We had a thread from 6 or so months ago that claimed that only 1 snapshot would replicate at a time. So if his backup server was shut down for a few days and he had daily snapshot/replication, when the backup server was eventually back online it wouldn't catch up all at once. He was forced to manually push the applicable snapshots to "catch" his backup server back up to real time. This was causing major problems for him because his server was not local, and to catch them up he was forced to move his server local to redo all of the snapshots to get the FreeNAS GUI to properly detect and display the snapshots (or something like that). It had happened to him twice in like 4 months because of internet problems between the servers.
 

Dusan

Guru
Joined
Jan 29, 2013
Messages
1,165
Are you sure that every unreplicated snapshot will run at that one time? We had a thread from 6 or so months ago that claimed that only 1 snapshot would replicate at a time. So if his backup server was shut down for a few days and he had daily snapshot/replication, when the backup server was eventually back online it wouldn't catch up all at once. He was forced to manually push the applicable snapshots to "catch" his backup server back up to real time. This was causing major problems for him because his server was not local, and to catch them up he was forced to move his server local to redo all of the snapshots to get the FreeNAS GUI to properly detect and display the snapshots (or something like that). It had happened to him twice in like 4 months because of internet problems between the servers.
It works now (9.1.1). This is my backup setup: I have one main RAIDZ2 pool plus one single drive (in a removable bay) backup pool. I snapshot and replicate the important data daily from the main pool to backup. I have two backup drives. Every month I remove the current backup drive from the system, take it to an offsite location and bring back the other one. So, the newly inserted drive is now missing one month worth of snapshots, but FreeNAS has no problem detecting this -- the first replication is a big one and the drive receives all missing snapshots. This is the code that does it:
Here it figures out the latest remote snapshot: https://github.com/freenas/freenas/blob/master/gui/tools/autorepl.py#L228
Here it realizes that the latest remote snapshot is not what it expects (autorepl.py keeps track locally of the last replicated snapshot, but it's now missing on the destination drive, because I swapped them) and starts to look for a local snapshot that matches the latest remote one: https://github.com/freenas/freenas/blob/master/gui/tools/autorepl.py#L242
It finds it (https://github.com/freenas/freenas/blob/master/gui/tools/autorepl.py#L248) and now knows the range of snapshots it needs to send over.
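You can run the same check by hand; this mirrors the command autorepl.py issues (you can see it in the log in my next post). The key path is the one FreeNAS uses for replication, the hostname is a placeholder:
Code:
ssh -i /data/ssh/replication root@remote_system \
    "zfs list -Hr -o name -t snapshot -d 1 backup | tail -n 1 | cut -d@ -f2"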
 

Dusan

Guru
Joined
Jan 29, 2013
Messages
1,165
To demonstrate this I simulated the setup in a VM. I created 3 pools (main, backup, backup1) and set up the first one (main) to be snapshotted and replicated every 5 minutes. I let it run for a while replicating to backup. Then I switched the replication target to backup1, let it run for 15 minutes, and switched the replication target back to backup. backup was now missing three snapshots (1621, 1626, 1631); the latest snapshot on backup was 1616.
I enabled all logging and this is the output (cleaned up a bit, with my comments in parentheses) after the system took the next snapshot (1636) and proceeded to do the replication:
[PANEL]16:36:02 [tools.autorepl:110] Autosnap replication started
16:36:02 [tools.autorepl:111] temp log file: /tmp/repl-10656
16:36:02 [tools.autorepl:165] Checking dataset main
16:36:02 [common.pipesubr:57] Popen()ing: /sbin/zfs list -Ht snapshot -o name,freenas:state -r -d 1 main
16:36:02 [tools.autorepl:192] Snapshot: main@auto-20131025.1636-2w State: NEW (it thinks it only needs to replicate 1636)
16:36:02 [tools.autorepl:199] Snapshot main@auto-20131025.1636-2w added to wanted list
16:36:02 [tools.autorepl:192] Snapshot: main@auto-20131025.1631-2w State: LATEST (1631 is expected to be the latest one on remote)
16:36:02 [tools.autorepl:196] Snapshot main@auto-20131025.1631-2w is the recorded latest snapshot
16:36:02 [common.pipesubr:57] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -q -p 22 localhost "zfs list -Hr -o name -t snapshot -d 1 backup | tail -n 1 | cut -d@ -f2" (finding out the latest snapshot on remote)
16:36:03 [tools.autorepl:235] Can not locate expected snapshot main@auto-20131025.1616-2w, looking more carefully (huh, the latest remote snapshots seems to be 1616)
16:36:03 [common.pipesubr:57] Popen()ing: /sbin/zfs list -Ht snapshot -o name,freenas:state main@auto-20131025.1616-2w (checking if we still have 1616 locally)
16:36:03 [tools.autorepl:241] Marking main@auto-20131025.1616-2w as latest snapshot (we do, great, let's mark 1616 as the latest)
16:36:03 [common.pipesubr:71] Executing: /sbin/zfs inherit freenas:state main@auto-20131025.1631-2w
16:36:03 [common.pipesubr:71] Executing: /sbin/zfs set freenas:state=LATEST main@auto-20131025.1616-2w
16:36:03 [common.pipesubr:71] Executing: (/sbin/zfs send -R -I main@auto-20131025.1616-2w main@auto-20131025.1636-2w | /bin/dd obs=1m | /bin/dd obs=1m | /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -q -p 22 localhost "/sbin/zfs receive -F -d backup && echo Succeeded.") > /tmp/repl-10656 2>&1 (replicate all snapshots from snapshot 1616 to snapshot 1636)
16:36:03 [tools.autorepl:286] Replication result: 7+4 records in 0+1 records out 4824 bytes transferred in 0.035766 secs (134876 bytes/sec) 9+1 records in 0+1 records out 4824 bytes transferred in 0.033923 secs (142205 bytes/sec) Succeeded.
16:36:03 [common.pipesubr:57] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -q -p 22 localhost "zfs list -Hr -o name -t snapshot -d 1 backup | tail -n 1 | cut -d@ -f2" (verify that the latest remote snapshot now matches our expectation -- 1636)
16:36:03 [common.pipesubr:71] Executing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -q -p 22 localhost "/sbin/zfs inherit -r freenas:state backup"
16:36:03 [common.pipesubr:71] Executing: /sbin/zfs inherit freenas:state main@auto-20131025.1616-2w
16:36:03 [common.pipesubr:71] Executing: /sbin/zfs set freenas:state=LATEST main@auto-20131025.1636-2w (mark local snapshot 1636 as the latest replicated one)
16:36:03 [tools.autorepl:336] Autosnap replication finished[/PANEL]
 

panz

Guru
Joined
May 24, 2013
Messages
556
It works now (9.1.1). This is my backup setup: I have one main RAIDZ2 pool plus one single drive (in a removable bay) backup pool. I snapshot and replicate the important data daily from the main pool to backup. I have two backup drives. Every month I remove the current backup drive from the system, take it to an offsite location and bring back the other one. So, the newly inserted drive is now missing one month worth of snapshots, but FreeNAS has no problem detecting this -- the first replication is a big one and the drive receives all missing snapshots. This is the code that does it:
Here it figures out the latest remote snapshot: https://github.com/freenas/freenas/blob/master/gui/tools/autorepl.py#L228
Here it realizes that the latest remote snapshot is not what it expects (autorepl.py keeps track locally of the last replicated snapshot, but it's now missing on the destination drive, because I swapped them) and starts to look for a local snapshot that matches the latest remote one: https://github.com/freenas/freenas/blob/master/gui/tools/autorepl.py#L242
It finds it (https://github.com/freenas/freenas/blob/master/gui/tools/autorepl.py#L248) and now knows the range of snapshots it needs to send over.

So, both target drives (the ones you put in the removable bay) have the same destination pool name?
 

Dusan

Guru
Joined
Jan 29, 2013
Messages
1,165
So, both target drives (the ones you put in the removable bay) have the same destination pool name?
Yes, the pool is named the same on both drives (backup). The only thing that gets lost when I swap the drives is the "drive config" -- standby time, SMART extra options, ... I have a script that reinserts the correct values back into the config DB.
 

panz

Guru
Joined
May 24, 2013
Messages
556
OMG, too complicated for me ;)
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Yes, the pool is named the same on both drives (backup). The only thing that gets lost when I swap the drives is the "drive config" -- standby time, SMART extra options, ... I have a script that reinserts the correct values back into the config DB.

Care to share the script? :D
 

Dusan

Guru
Joined
Jan 29, 2013
Messages
1,165
Care to share the script? :D
Sure, here you go. My removable tray is ada1; the script sets the HDD standby for that drive to 60 minutes and adds -a to its SMART options. It then applies the standby setting, regenerates smartd.conf, and restarts smartd.
Code:
#!/bin/sh
# Put back the per-disk settings for the swapped backup drive (ada1) in the FreeNAS config DB:
# HDD standby 60 minutes, SMART extra options "-a".
sqlite3 /data/freenas-v1.db "update storage_disk set disk_hddstandby=60, disk_smartoptions='-a' where disk_name='ada1';"
# Apply the standby setting to the drive.
service ix-ataidle quietstart ada1
# Regenerate smartd.conf from the config DB, then restart smartd so it picks up the new options.
service ix-smartd quietstart
service smartd forcestop
service smartd restart
 