Detecting running replication in TrueNAS 12, autorepl.pid no longer a thing?

FF_CCSa1F

Cadet
Joined
Jan 3, 2021
Messages
7
Hello,

I've used a 2-line shell script for detecting the presence of /var/run/autorepl.pid to automatically turn off a server once replication is finished. However since updating from FreeNAS 11 to TrueNAS 12, I no longer see autorepl.pid when a replication is running, so the system fails. I can see a "replication.run" task in the web GUI task manager, but I can't figure out any way to detect this from a shell script.

Is there any replacement for autorepl.pid in TrueNAS 12? I've spent hours searching for solutions, but it's a very difficult query to put into words that a search engine will understand.

Thank you for any help and suggestions!
 

Kris Moore

SVP of Engineering
Administrator
Moderator
iXsystems
Joined
Nov 12, 2015
Messages
1,471
The newer replication agent doesn't create any pid files AFAIK. You'd probably be better off writing a query to the API to poll and see if a job is still running before shutdown.
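A rough sketch of what such a poll could look like from a shell script. It assumes the `midclt` client that ships with TrueNAS, that `core.get_jobs` accepts the query filter shown, and that an empty result prints as `[]`. None of this is confirmed in this thread, so verify it on your own system. The method name `replication.run` is the one shown in the GUI task manager; `replication_running` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: ask the middleware whether a replication.run job is in state RUNNING.
# Assumption: `midclt call core.get_jobs` accepts this filter and prints "[]"
# when no job matches.

replication_running() {
    # Return 2 if the API call itself fails, so callers can tell
    # "not running" apart from "could not ask".
    jobs=$(midclt call core.get_jobs \
        '[["method", "=", "replication.run"], ["state", "=", "RUNNING"]]') || return 2
    # Strip whitespace so "[]" and "[ ]" compare equal.
    jobs=$(printf '%s' "$jobs" | tr -d ' \n\t')
    [ -n "$jobs" ] && [ "$jobs" != "[]" ]
}
```

A shutdown script could call this once a minute and only run `shutdown -p now` when it returns 1. A return value of 2 means the state is unknown, and it is probably safer to leave the machine on in that case.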

 

FF_CCSa1F


Thank you for replying, I'll have to get some help with that.

Looking at that documentation, it's not very obvious what to look for, to someone unfamiliar with the inner workings of TrueNAS. None of the things listed under "API methods - replication" seem to be able to indicate whether or not a replication is actually presently running, but in my ignorance I might be looking at the completely wrong thing.

If someone with experience using this API has any pointers as to what queries to look at, I'd be very grateful to hear them. Alternate ways of detecting the replication status are of course still interesting as well!

Thank you!
 

_Alchemist_

Dabbler
Joined
Jan 10, 2019
Messages
23
Are there any updates to this? I would love to have my Script working as well ... :(
 

FF_CCSa1F

After spending some time trying to figure out if there's a way to get replication status out of the API, I've resigned myself to just using a dumb timer to shut down my off-line target machine after a time that "should be long enough to let a normal replication finish". This is terrible and will assuredly cause loss of data at some point, but without an update to the API or a reintroduction of autorepl.pid, I've concluded that there is no longer any way for a normal user to detect whether a replication is running.
 

_Alchemist_

I don't know if this helps but I might have found something useful:

If you run htop, all running processes are listed (as you probably already know).
When I run a replication task on my backup server, I first see a process that starts with "zfs list ...".
Then multiple processes (each with a different PID) appear with "zfs recv ..." in the name.

I think there should be a way to check if a process with "zfs send" or "zfs recv" is running and use this instead of /var/run/autorepl.pid.
Guess I have to dig a little deeper...
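A minimal sketch of that idea, assuming `pgrep` is available (it is part of both FreeBSD and Linux base tools). `zfs_transfer_running` is a hypothetical helper name, and the pattern simply mirrors the process names seen in htop above.

```shell
#!/bin/sh
# Sketch: succeed while any process whose command line contains
# "zfs send", "zfs recv" or "zfs receive" exists.
# pgrep -f matches against the full argument list, not just the process name.

zfs_transfer_running() {
    pgrep -f 'zfs (send|recv|receive)' >/dev/null 2>&1
}
```

Since a replication starts with "zfs list" and may pause between datasets, a single negative check is probably not enough. Requiring several consecutive negative checks, a minute or so apart, before shutting down would be a more cautious use of it.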
 

Attachments: zfs-1.png, zfs-2.png, zfs-3.png, zfs-4.png

_Alchemist_

I found a workaround:

For it to work, I had to change "Replication Succeeded" under "System > Alert Settings" from "INFO (default)" to "WARNING".

This way, when the replication task finishes, an e-mail is sent to my account and a log entry is written to /var/log/maillog.

I wrote a little script (my first Bash script, by the way) that looks up the current date and checks whether /var/log/maillog contains an entry with the same date:

Code:
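#!/bin/bash
# Poll /var/log/maillog until it contains an entry with today's date, then power off.
while :
do
    time=$(date +"%b %d")
    log=$(grep "$time" /var/log/maillog)
    if [ -z "$log" ]
    then
        # No mail logged today yet: wait and check again.
        sleep 30
    else
        shutdown -p now
        break
    fi
done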


If it does not find a matching log entry, the script waits 30 seconds and tries again.
If it does find a matching log entry, it shuts down the Server.

My replication task starts at 09:00; the script is run by a cron job at 09:05.


However, if any other kind of error triggers a warning and a mail is sent, the server shuts down even if the replication task is still running.
I wish I knew a better method, but this is all I could come up with on my own.
 

FF_CCSa1F

I took the time to make a slightly more fleshed-out version of Alchemist's script (tested on 12.0-U6). It adds a facility to leave the cold storage server on if the main NAS to be backed up is inaccessible for some reason. The point is that the cold storage server can turn on automatically and replace the main NAS in production after a catastrophic hardware failure while no one is on site. I am NO programmer and have no clue what I'm doing, so use this at your own risk.

This is still a terrible, terrible way of doing this, but it does kind of work. TrueNAS should honestly implement a "Shutdown upon completion" option in the GUI for setting up replications.

CHANGELOG:
2021-11-03
Changed the order in the date command from date +"%b %d" to date +"%d %b". Maillog dates are written as "03 Nov", but the date command was returning "Nov 03", so the script failed to shut the machine down. I must have been writing the script on the 20th of some month, so it managed to trigger on "20 Oct 2021", and the bug went unnoticed. I don't know if this is affected by the locale on your machine, so if you use MDY instead of DMY, you might need to rearrange the %d %b.

Code:
#!/bin/sh

#Script to turn off a "cold storage" TrueNAS server after a replication is done.
#This script should run on the server to be turned off, which should also be the server initiating the replication (PULL).
#This script requires an e-mail alert to be enabled and working for the "Replication Succeeded" event in System->Alert Settings->Tasks
#This script will also shut the server down for ANY OTHER email sent, so all other email alerts should be DISABLED.
#Uncomment echo lines for verbose operation/testing.
#You need to specify your own main NAS IP if it is not 10.1.1.1.

while :
do

#Ping main NAS to see if it's up.  -quiet, -count 1, -timeout 1 second:
        ping -q -c 1 -t 1 10.1.1.1 > /dev/null

#       Check if the exit status of the ping command equals 0:
        if [ $? -eq 0 ]

#       If ping returns 0, it means the main NAS is up, and we can proceed to check if a replication has finished by using Alchemist's magic:
        then
#               This section checks if an email has been sent today.
                time=$(date +"%d %b")
                log=$(cat /var/log/maillog | grep "$time")
                if [ -z "$log" ];
                then
#                       echo "No email today, replication must still be running, sleeping"
                        sleep 60
                else
#                       echo "Email today, main NAS pingable, my work here is done,  shutting down"
                        shutdown -p now
                        break
                fi


#       If ping returns other than 0, the main NAS is not up, and we want the backup NAS to remain on.
#       The script will turn the backup NAS off as soon as the main NAS becomes pingable again.
        else
#               echo "Ping not ok, main NAS is not available. Backup NAS remains on."
                sleep 60
        fi
done
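The date-order problem described in the changelog could be sidestepped by searching for today's date in more than one layout. This is only a sketch: `mail_logged_today` is a hypothetical helper name, and the third pattern assumes the traditional syslog style with a space-padded day ("Nov  3"), which may or may not be what your maillog uses.

```shell
#!/bin/sh
# Sketch: succeed if LOGFILE contains today's date as "Nov 03", "03 Nov"
# or "Nov  3" (space-padded day). Defaults to /var/log/maillog.

mail_logged_today() {
    logfile=${1:-/var/log/maillog}
    month_day=$(date +"%b %d")
    day_month=$(date +"%d %b")
    month_day_padded=$(date +"%b %e")
    # -q: no output, only the exit status matters.
    grep -q -e "$month_day" -e "$day_month" -e "$month_day_padded" \
        "$logfile" 2>/dev/null
}
```

It has the same weakness as the scripts above: any mail logged today counts, not only the "Replication Succeeded" alert.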
 

FF_CCSa1F

The email approach has worked terribly, so I put together a script that instead checks whether data is actually being transferred into or out of the computer we want to shut down. The script is intended to be launched after replication has started, since it will shut the server down as soon as it sees no traffic on the selected interface. This is easily done as a cron job. I did a lot of testing while writing it and it seems fairly solid, but I haven't put it in service as of posting this, so there might be bugs and your mileage may vary. I am not a programmer.

Code:
#!/bin/bash


#Script to turn off a "cold storage" TrueNAS server after a replication is done. This is a bad replacement for autorepl.pid.
#This script should run on the server to be turned off and requires root.
#This script will shut off a server if it detects <1 Kb of transfers over 10 seconds on the selected interface.
#Since interfaces are rarely completely quiet, it tends to fail randomly now and then. You can coarsely tune the sensitivity
#by choosing ratelimit 0-3, where 0 is < 1 Kb, 1 is < 1 Mb, 2 is < 1 Gb and 3 is anything else. 0 and 1 are sane.

#The script also checks if your main NAS is running, and will not shut down the back-up NAS if the main NAS is down,
#thus providing access to the back-up NAS in case of hardware failure.

#Uncomment echo lines for verbose operation/testing.

ratelimit=0

while :
do


#    Ping main NAS to see if it's up.  -quiet, -count 1, -timeout 1 second.
#    MODIFY to match your main NAS IP or set 127.0.0.1 to disable:

    ping -q -c 1 -t 1 10.1.1.1 > /dev/null

#    The following line checks if the exit status of the ping command equals 0. 
#    If ping returns 0, we know that the main NAS is up, so we want to shut down the backup NAS once it's done replicating.

    if [[ $? -eq 0 ]]
    then


#The following function checks if there is traffic on bge0. $traffic will contain the bitrate
#of the interface as reported by iftop. Awk parses the iftop output and returns the
#1-second average on $6, 10-second average on $7 and 40-second average on $8.
#Iftop is fairly slow and potentially resource intensive, so I choose to run it for 10 seconds by using -s 10
#and check $7 for the 10-second average. I do not recommend using the 1-second average, as
#replication sometimes will not be transferring data, so this might cause premature shutdowns.


        traffic=$( iftop -i bge0 -t -s 10 -n -N 2>/dev/null | awk '/send and receive/ {print ($7)}' )

        #echo "Traffic is:"
        #echo $traffic


#The following if clause parses the data returned from iftop and awk and determines how much data is being moved.
#The rate variable contains the result: 0 = b, 1 = Kb, 2 = Mb, 3 = Gb or output that could not be parsed.
#Gb has to be tested before the plain "b" pattern, which would otherwise match it and set rate=0.


        if [[ "$traffic" == *"Gb"* ]]
        then
            rate=3
        elif [[ "$traffic" == *"Mb"* ]]
        then
            rate=2
        elif [[ "$traffic" == *"Kb"* ]]
        then
            rate=1
        elif [[ "$traffic" == *"b"* ]]
        then
            rate=0
        else
            rate=3
        fi
        
#The following if clause checks if the transfer rate is below the threshold specified up top, and shuts the server down
#if it's below the threshold. A 5-minute grace period is inserted before shutdown, just in case.
#If traffic is above the threshold, the server is not shut down, and the script waits for 10 minutes before starting over.


        if (( $rate <= $ratelimit ))
        then
            echo "Main NAS pingable, traffic below limit. Replication should be done, shutting down in 5 minutes."
            echo "Traffic is:"
            echo $traffic
            sleep 300 
            shutdown -p now
            break

        else
            echo "Traffic on network, not shutting down. Re-check in 10 minutes. Rate:"
            echo $traffic
            sleep 600
        fi
    

#If ping returns other than 0, the main NAS is not up, and we want the backup NAS to remain on and accessible.
#The script will turn the backup NAS off as soon as the main NAS becomes pingable again.

    else
        echo "Ping not ok, main NAS is not available. Backup NAS remains on. Re-check in 10 minutes."
        sleep 600
    fi
done
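The unit check in the script above can also be written as a `case` statement, which makes the order of the patterns easier to see. A small sketch, assuming iftop prints rates with the suffixes b, Kb, Mb and Gb; `classify_rate` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: map an iftop rate string such as "12.3Kb" to a coarse level.
# 0 = b, 1 = Kb, 2 = Mb, 3 = Gb or anything unrecognised (treated as busy,
# so an empty or unexpected reading never triggers a shutdown).

classify_rate() {
    case "$1" in
        *Gb*) echo 3 ;;   # must come before the plain *b* pattern
        *Mb*) echo 2 ;;
        *Kb*) echo 1 ;;
        *b*)  echo 0 ;;
        *)    echo 3 ;;
    esac
}
```

Used as `rate=$(classify_rate "$traffic")`, it gives the same levels that the `ratelimit` setting is compared against.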


 

Attachment: traffic_shutdown.txt

_Alchemist_

I completely forgot about this, and I now use a slightly modified version of this script (the one based on E-Mail really was garbage):

It works by checking:
- a manually specified time window, shuts down if no replication runs
- if a replication is running by looking for a replication process (zettarepl)
- if `/var/log/zettarepl.log` has new entries via `tail -n 1`
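The log check from the list above could look roughly like this. It is a sketch only, not the script referred to in this post: `last_line` and `log_advanced` are hypothetical helper names, and it assumes a running replication keeps appending lines to the log within the chosen interval.

```shell
#!/bin/sh
# Sketch: detect activity by comparing the last line of a log file
# before and after a pause.

last_line() {
    tail -n 1 "$1" 2>/dev/null
}

# log_advanced FILE SECONDS
# Succeeds if the last line of FILE changed while we slept for SECONDS.
log_advanced() {
    before=$(last_line "$1")
    sleep "$2"
    after=$(last_line "$1")
    [ "$before" != "$after" ]
}
```

For example, `log_advanced /var/log/zettarepl.log 300` would report whether anything was logged in the last five minutes. Combining it with a process check makes a false "idle" reading less likely.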

The script can shut down either the server it is running on or the other one via ssh (as I did).

It has been working perfectly for 6 months now, combined with a cron job that powers on the "cold backup" via IPMI:
Code:
ipmitool -I lan -U BACKUP -H 172.16.10.210 -f /mnt/pool/password.txt power on

The password.txt file is readable only by root, and the BACKUP user on the Supermicro IPMI has Administrator privileges. I will try using WoL instead sometime.
 

FF_CCSa1F

Excellent, we finally have some good options for actually doing this task again!
 