Monitoring status of replication tasks

Tenou

Cadet
Joined
Jan 22, 2020
Messages
9
Hello folks,

I'm currently trying to automate my TrueNAS install (currently running 12.0-U4) so that it shuts down once it has been reliably determined that all scheduled replication tasks have finished successfully.

My plan is to start the system with an IPMI command fifteen minutes before the tasks are scheduled to run, execute a short SMART test, wait for the replication tasks to begin (they all start at the same time, at 4 AM), and shut the system down once they have finished. If a replication task fails, I'd like the option to keep the system running until an administrator can determine what the issue is.

The best I could come up with so far was to check whether this command has produced any output in the last minute:
Code:
ps -U root -axwwo lstart,command | grep 'python3 -u /tmp/zettarepl' | grep -v grep | grep -v middlewared


However, since every snapshot gets its own process and these run sequentially, I don't trust this to be reliable enough in case there is a longer gap between two jobs.

Does anyone have a more reliable, or just generally nicer, way to monitor progress and eventually the results as well? In the end, it's all stuff that can be seen in the WebUI, so it should be possible to access it from the CLI somehow as well... right?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Since replication works with snapshots, maybe you can check for the presence of the latest snapshot on the target side, or wait until a zfs diff comes back empty.
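A rough sketch of the snapshot check, in case it helps. The dataset names and the "ssh backuphost" prefix in the usage line are placeholders, not anything from your setup, and a matching snapshot name only tells you the newest snapshot exists on both sides; it says nothing about whether the task reported an error.

```shell
#!/bin/bash
# Print the name (the part after "@") of the newest snapshot of a dataset.
# Extra arguments are used as a command prefix, e.g. "ssh backuphost"
# (hypothetical host), so the same helper works for a remote target.
latest_snapshot_name() {
    local dataset=$1
    shift
    "$@" zfs list -H -t snapshot -o name -s creation -d 1 "$dataset" |
        tail -n 1 | sed 's/^.*@//'
}

# Succeed if source and target agree on the newest snapshot name.
replication_caught_up() {
    local src=$1 dst=$2
    shift 2
    local newest_src newest_dst
    newest_src=$(latest_snapshot_name "$src")
    newest_dst=$(latest_snapshot_name "$dst" "$@")
    [[ -n $newest_src && $newest_src == "$newest_dst" ]]
}
```

Usage would look something like `replication_caught_up tank/data backup/data ssh backuphost`, polled in a loop until it succeeds.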
 

sretalla

You could also see if parsing /var/log/zettarepl.log turns up something useful... It seems that after the last [replication_task_*] event you get a [retention] task, so that would be the indication that the replication is done.

So if the last event in that file is [retention], maybe you're good to go ahead with your actions.
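Something along these lines might do for the log check. It only relies on the two bracketed tags mentioned above; the rest of the line format is not assumed, and the function name is made up for illustration.

```shell
#!/bin/bash
# Succeed if the log contains at least one [replication_task_...] line and
# a [retention] line appears somewhere after the last of them.
retention_after_replication() {
    local log=${1:-/var/log/zettarepl.log}
    awk '
        /\[replication_task_/ { seen = 1; after = 0 }
        /\[retention\]/       { if (seen) after = 1 }
        END                   { exit !(seen && after) }
    ' "$log"
}
```

Unlike checking only the very last line, this still succeeds if something unrelated gets logged after the [retention] entry, but it cannot tell today's run from yesterday's unless the log is rotated or the timestamps are checked as well.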
 

Tenou

Thanks! That helped a lot. I came up with this spaghetti code calling itself a proof of concept so far:

Code:
#!/bin/bash
##################################
#Exit codes                      #
#0 = all good                    #
#3 = not in timerange for backups#
#4 = timeout waiting for backups #
##################################


# Check if we're roughly in time for the automated backups
currenttime=$(date +%H:%M)
   if [[ "$currenttime" > "03:45" ]] && [[ "$currenttime" < "04:30" ]]; then
     echo "We're in time!"
     continue
   else
     echo "No time for backups, going back to bed :("
     exit 3
   fi

#Wait for replication to begin
timeout=0
procsalive=0
while [[ $procsalive -eq 0 ]]; do
    procsalive=$(ps -U root -axwwo lstart,command | grep 'python3 -u /tmp/zettarepl' | grep -v grep | grep -v middlewared | wc -l)
    timeout=((timeout+1))
    if [[ $timeout -ge 3600 ]]
        exit 4
    fi
    sleep 1
done

# Wait for replication to finish
unset procsalive
proccount=0
zettacount=0
until [[ $proccount -ge 60 ]] && [[ $zettacount -eq 1 ]]; do
    procsalive=$(ps -U root -axwwo lstart,command | grep 'python3 -u /tmp/zettarepl' | grep -v grep | grep -v middlewared | wc -l)
    if [[ $proccount -eq 0 ]]
        proccount=((proccount+1))
    else
        proccount=0
    fi
    zettaresult=$(tail -n 1  /var/log/zettarepl.log | grep '\[retention\]' | wc -l)
    if [[ $zettaresult -eq 1 ]]
        zettacount=1
    else
        zettacount=0
    fi
    sleep 1
done

shutdown -p +180s "Backups have been finished, shutting down."


I'll run some tests with it, maybe that's already all I needed.
 

Tenou

Yup, even though the script I posted yesterday was hot garbage and didn't work in the slightest, this now works great:
Code:
#!/bin/bash
##################################
#Exit codes                      #
#0 = all good                    #
#3 = not in timerange for backups#
#4 = timeout waiting for backups #
##################################

echo "$(date +"%Y-%m-%d_%T") || The holy script's been summoned!"
# Check if we're roughly in time for the automated backups
currenttime=$(date +%H:%M)
if [[ "$currenttime" > "03:30" ]] && [[ "$currenttime" < "04:30" ]]; then
        echo "$(date +"%Y-%m-%d_%T") || We're in time!"
else
        echo "$(date +"%Y-%m-%d_%T") || No time for backups, going back to bed :("
        exit 3
fi

#Wait for replication to begin
echo "$(date +"%Y-%m-%d_%T") || Waiting for replication to begin."
timeout=0
procsalive=0
while [[ $procsalive -eq 0 ]]; do
        procsalive=$(ps -U root -axwwo lstart,command | grep 'python3 -u /tmp/zettarepl' | grep -v grep | grep -v middlewared | wc -l | awk '{print $1}')
        timeout=$((timeout+1))
        echo "$(date +"%Y-%m-%d_%T") || No job found so far (waited for: ${timeout}s)"
        if [[ $timeout -ge 3600 ]]; then
                echo "$(date +"%Y-%m-%d_%T") || No job started within an hour, I'm going back to bed."
                exit 4
        fi
        sleep 1
done
echo "$(date +"%Y-%m-%d_%T") || Yay, a replication job started! I'll wait for it to finish now."

# Wait for replication to finish
unset procsalive
proccount=0
zettacount=0
until [[ $proccount -ge 60 ]] && [[ $zettacount -eq 1 ]]; do
        procsalive=$(ps -U root -axwwo lstart,command | grep 'python3 -u /tmp/zettarepl' | grep -v grep | grep -v middlewared | wc -l | awk '{print $1}')
        if [[ $procsalive -eq 0 ]]; then
                proccount=$((proccount+1))
        else
                proccount=0
        fi
        zettaresult=$(tail -n 1  /var/log/zettarepl.log | grep '\[retention\]' | wc -l | awk '{print $1}')
        if [[ $zettaresult -eq 1 ]]; then
                zettacount=1
        else
                zettacount=0
        fi
        sleep 1
done
echo "$(date +"%Y-%m-%d_%T") || Seems as if all replication jobs finished! Let's shut this bad boy down again."

shutdown -p +180s "Backups have been finished, shutting down."

Thank you for the tip about the logfile, that was the extra hint I needed.
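For anyone who wants to reuse the "quiet for 60 checks in a row" idea from the second loop, it could probably be pulled out into a small helper like the one below. This is only a sketch: the function name and its parameters are made up, and unlike the script above it also gives up after a maximum number of polls.

```shell
#!/bin/bash
# Poll a command and succeed once it has succeeded "needed" times in a row.
# Returns 1 if that has not happened after "max_tries" polls.
# Usage: wait_until_stable NEEDED INTERVAL_SECONDS MAX_TRIES COMMAND [ARGS...]
wait_until_stable() {
    local needed=$1 interval=$2 max_tries=$3
    shift 3
    local streak=0 tries=0
    while (( tries < max_tries )); do
        if "$@"; then
            streak=$((streak + 1))
        else
            streak=0    # any failed check restarts the count
        fi
        if (( streak >= needed )); then
            return 0
        fi
        tries=$((tries + 1))
        sleep "$interval"
    done
    return 1
}
```

With a hypothetical `replication_idle` function wrapping the ps and logfile checks, the loop would shrink to `wait_until_stable 60 1 7200 replication_idle`, and a non-zero return could be used to skip the shutdown.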
 