Monitor and email an alert for ECC Memory errors on OS-level

Purpose
Monitor /var/log/messages for MCA-related messages and email them to you when they are found. MCA messages include, for example, memory ECC error reports and much more.
More details: https://en.wikipedia.org/wiki/Machine_Check_Architecture

Why
TrueNAS / FreeNAS expects people to use motherboards that monitor and report hardware issues outside of the OS (Platform First Error Handling). If your motherboard does not fully or properly support this, you will not learn about such issues, because FreeNAS / TrueNAS itself does nothing with these messages.

Audience
If you're not using a motherboard that is community-confirmed to work, then you should probably get this. My motherboard, for example, has all the settings in the IPMI to monitor the "DIMM [1234] ECC error" sensor and to send emails when triggered, but it doesn't do anything, because that part simply wasn't implemented by the vendor (because AMD refuses to supply the required information). So make sure your board is confirmed to work, or confirm it yourself.

How
Two simple bash scripts:
  • One script run by a cron job every 10 minutes (For periodic monitoring while the server is online)
  • One script run on init (This catches potentially missed MCA messages, in case the server is turned off when the above periodic cron job should run)

The scripts

periodic_mca_log_monitoring.bash
In the shell as 'root':

Code:
mkdir -p /root/bin
touch /root/bin/periodic_mca_log_monitoring.bash
chown root:wheel /root/bin/periodic_mca_log_monitoring.bash
chmod 700 /root/bin/periodic_mca_log_monitoring.bash
vi /root/bin/periodic_mca_log_monitoring.bash

contents of /root/bin/periodic_mca_log_monitoring.bash :
Code:
#!/bin/bash
declare MCA_MESSAGES
declare MCA_LOG_PATH="/var/log/MCA-messages"
declare EMAIL_ADDRESS="email@domain.com"

# Load all MCA related messages from the previous 10-minute-timeframe from /var/log/messages.0.bz2 (this is previous / rotated messages logfile)
#   If the errors occurred around the time the messages logfile was rotated, then it is important to check the previous file too.
[[ -f /var/log/messages.0.bz2 ]] && MCA_MESSAGES="$(bzcat /var/log/messages.0.bz2 | egrep "$(date -v -10M "+%b %e %H:%M" | sed 's/.$//').*$(hostname -s) MCA.*")"

# If something is in ${MCA_MESSAGES}, then append a new-line
[[ ! -z "${MCA_MESSAGES}" ]] && MCA_MESSAGES+=$'\n'

# Append all MCA related messages from the previous 10-minute-timeframe from /var/log/messages
[[ -f /var/log/messages ]] && MCA_MESSAGES+="$(cat /var/log/messages | egrep "$(date -v -10M "+%b %e %H:%M" | sed 's/.$//').*$(hostname -s) MCA.*")"

# If something is in ${MCA_MESSAGES}, then store it in ${MCA_LOG_PATH} and send an email. If the email command fails, then 'exit 1'
if [[ ! -z "${MCA_MESSAGES}" ]]; then
  echo "${MCA_MESSAGES}" >>"${MCA_LOG_PATH}"
  mail -s "TrueNAS $(hostname): Alerts" ${EMAIL_ADDRESS} <<< "MCA Errors were found in /var/log/messages:"$'\n\n'"${MCA_MESSAGES}" || exit 1
fi

# If the script completes successfully then 'exit 0'
exit 0


init_mca_log_monitoring.bash
Code:
mkdir -p /root/bin
touch /root/bin/init_mca_log_monitoring.bash
chown root:wheel /root/bin/init_mca_log_monitoring.bash
chmod 700 /root/bin/init_mca_log_monitoring.bash
vi /root/bin/init_mca_log_monitoring.bash

contents of /root/bin/init_mca_log_monitoring.bash :
Code:
#!/bin/bash
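declare MCA_MESSAGES_ALL MCA_MESSAGES_NEW
declare MCA_MESSAGE
declare MCA_LOG_PATH="/var/log/MCA-messages"                     # Must be set before it is read below
declare MCA_LOG_ENTRIES MCA_LOG_ENTRY
declare EMAIL_ADDRESS="email@domain.com"
declare OLDIFS

# Load the already processed entries, if the MCA-log-file exists
[[ -f "${MCA_LOG_PATH}" ]] && MCA_LOG_ENTRIES="$(cat "${MCA_LOG_PATH}")"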

# Load all MCA related messages from /var/log/messages.0.bz2 (this is previous / rotated messages logfile)
#   If the errors occurred around the time the messages logfile was rotated, then it is important to check the previous file too.
[[ -f /var/log/messages.0.bz2 ]] && MCA_MESSAGES_ALL="$(bzcat /var/log/messages.0.bz2 | egrep ".*$(hostname -s) MCA.*")"

# If something is in ${MCA_MESSAGES_ALL}, then append a new-line
[[ ! -z "${MCA_MESSAGES_ALL}" ]] && MCA_MESSAGES_ALL+=$'\n'

# Append all MCA related messages from /var/log/messages
[[ -f /var/log/messages ]] && MCA_MESSAGES_ALL+="$(cat /var/log/messages | egrep ".*$(hostname -s) MCA.*")"

# Only store MCA messages that are not found in the MCA-log-file, in the ${MCA_MESSAGES_NEW} variable
OLDIFS="$IFS"                                                    # Backup $IFS
IFS=$'\n'                                                        # To loop line-per-line
for MCA_MESSAGE in ${MCA_MESSAGES_ALL}; do
  for MCA_LOG_ENTRY in ${MCA_LOG_ENTRIES}; do
    [[ "${MCA_MESSAGE}" == "${MCA_LOG_ENTRY}" ]] && continue 2   # Jump out of the ${MCA_MESSAGES_ALL}-loop, if ${MCA_MESSAGE} was already in ${MCA_LOG_ENTRIES}
  done
  MCA_MESSAGES_NEW+="${MCA_MESSAGE}"$'\n'                        # If ${MCA_MESSAGE} was not found in ${MCA_LOG_ENTRIES}, then add it to ${MCA_MESSAGES_NEW}
done
MCA_MESSAGES_NEW="${MCA_MESSAGES_NEW%$'\n'}"                     # Remove trailing new-line
IFS="$OLDIFS"                                                    # Restore $IFS from backup

# If something is in ${MCA_MESSAGES_NEW}, then store it in ${MCA_LOG_PATH} and send an email. If the email command fails, then 'exit 1'
if [[ ! -z "${MCA_MESSAGES_NEW}" ]]; then
  echo "${MCA_MESSAGES_NEW}" >>"${MCA_LOG_PATH}"
  mail -s "TrueNAS $(hostname): Alerts" ${EMAIL_ADDRESS} <<< "MCA Errors were found in /var/log/messages:"$'\n\n'"${MCA_MESSAGES_NEW}" || exit 1
fi

# If the script completes successfully then 'exit 0'
exit 0


The cron job
[Screenshot: 1612703269832.png]

[Screenshot: 1607734020584.png]
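
For readers who cannot see the screenshots: this resource describes the schedule as "1-59/10 * * * *" and suggests running the script as an argument of /bin/bash. The crontab-style line in the comment below is an assumption (TrueNAS configures cron jobs through its GUI), and cron_minutes is a hypothetical helper that only illustrates which minutes that schedule selects under standard cron range/step semantics.

```shell
#!/bin/bash
# Equivalent crontab-style line (an assumption, not taken from the screenshots):
#   1-59/10 * * * * /bin/bash /root/bin/periodic_mca_log_monitoring.bash

# Hypothetical helper: list the minutes matched by the minute field "1-59/10",
# i.e. start at minute 1, step by 10, stop at or before minute 59.
cron_minutes() {
  seq 1 10 59 | paste -sd' ' -
}

cron_minutes   # prints: 1 11 21 31 41 51
```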


Init script
[Screenshot: 1612702940343.png]


Additional requirements
  • Properly configure and test your email settings in TrueNAS / FreeNAS.
    [Screenshot: 1607735566739.png]
  • In both bash scripts, replace email@domain.com by your own email address
  • Make sure you execute the scripts as bash scripts
    • Either configure the executing user to have bash as shell
    • Or run the scripts as an argument of /bin/bash. For example "/bin/bash /root/bin/periodic_mca_log_monitoring.bash"
  • You can further customize the script as you wish (if you know what you're doing)
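
If you do customize the scripts, it can help to let bash parse them before the cron job runs them. A minimal sketch, assuming bash is on the PATH; check_bash_syntax is a hypothetical helper name and the two temporary files are made-up stand-ins for a script. It relies only on "bash -n", which reads a script and reports syntax errors without executing it.

```shell
#!/bin/bash
# Hypothetical helper: parse a script with bash without executing it.
# Returns 0 if bash finds no syntax errors, non-zero otherwise.
check_bash_syntax() {
  bash -n "$1" 2>/dev/null
}

# Made-up example scripts, only for demonstration
good="$(mktemp)"
bad="$(mktemp)"
printf '%s\n' '#!/bin/bash' '[[ -n "x" ]] && echo ok' > "$good"
printf '%s\n' '#!/bin/bash' 'if true; then echo ok' > "$bad"   # missing "fi"

check_bash_syntax "$good" && echo "good: syntax OK"
check_bash_syntax "$bad"  || echo "bad: syntax error"
rm -f "$good" "$bad"
```

On the NAS itself the same check would be "bash -n /root/bin/periodic_mca_log_monitoring.bash". Note that this only catches syntax errors, not logic errors.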

How it works

periodic_mca_log_monitoring.bash
  1. The cronjob schedule "1-59/10 * * * *" = Every 10 minutes, at 1 past the 10-minute-block (for example: 01, 11, 21, ...)
  2. The /root/bin/periodic_mca_log_monitoring.bash script is executed
  3. egrep "$(date -v -10M "+%b %e %H:%M" | sed 's/.$//') ..." = The script searches for MCA messages in the previous-10-minute-block (for example: at minute 01 it will check minute 50-59 from the previous hour, at minute 11 it will check minute 00-09, at minute 21 it will check minute 10-19, etc)
  4. bzcat /var/log/messages.0.bz2 and cat /var/log/messages = It searches for them in both /var/log/messages and /var/log/messages.0.bz2 (to properly work during log rotations)
  5. egrep "... .*$(hostname -s) MCA.*" = To make sure only MCA messages are reported
  6. echo "${MCA_MESSAGES}" >>"${MCA_LOG_PATH}" = The processed messages are stored in a file
  7. mail -s "<subject>" "<youremailaddress>" <<< "<message with MCA messages>" = And the MCA messages are emailed to an address of your choice
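
The matching in steps 3 and 5 can be tried out on sample data without touching the real logs. This is a minimal sketch: block_prefix and mca_in_block are hypothetical helper names, the timestamp is passed in as a plain string instead of coming from date -v -10M, and only the first sample line is taken from the example email in this resource (the other two are made up).

```shell
#!/bin/bash
# Hypothetical helper: strip the last character of a "Mon DD HH:MM" stamp,
# which is what sed 's/.$//' does in the script.
# Example: "Dec 12 01:31" becomes "Dec 12 01:3", matching minutes 30-39.
block_prefix() {
  printf '%s\n' "$1" | sed 's/.$//'
}

# Hypothetical helper: print the MCA lines in file $3 for short hostname $2
# whose timestamp falls in the 10-minute block of stamp $1.
mca_in_block() {
  grep -E "$(block_prefix "$1").*$2 MCA.*" "$3"
}

# Sample log: line 1 is from the example email, lines 2 and 3 are made up
sample="$(mktemp)"
cat > "$sample" <<'EOF'
Dec 12 01:36:31 data MCA: Bank 18, Status 0x9c2040000000011b
Dec 12 01:41:02 data MCA: made-up line in the next block
Dec 12 01:37:10 data sshd: made-up unrelated line
EOF

# A run at 01:41 looks back 10 minutes to 01:31 and therefore checks 01:30-01:39
mca_in_block "Dec 12 01:31" data "$sample"
rm -f "$sample"
```

Only the "Bank 18" line is printed: the second line belongs to the next 10-minute block, and the third is not an MCA message.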

init_mca_log_monitoring.bash
This script is very similar to periodic_mca_log_monitoring.bash, but it isn't limited to a 10-minute block. Instead, it searches for MCA entries in all entries of the messages files. It uses the "processed messages" file (/var/log/MCA-messages) written by periodic_mca_log_monitoring.bash to determine whether an entry was already processed (and should be ignored) or was missed by periodic_mca_log_monitoring.bash and should be processed by init_mca_log_monitoring.bash instead. See the in-script comments for more details on how it does that.
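
The "already processed or not" decision can also be expressed with grep instead of a nested loop. The following is a sketch of that alternative, not the script's own method: unprocessed_lines is a hypothetical helper, and the two temporary files are made-up stand-ins for the collected MCA messages and for /var/log/MCA-messages.

```shell
#!/bin/bash
# Hypothetical helper: print the lines of file $1 (all MCA messages found)
# that do not appear as whole lines in file $2 (messages already processed).
unprocessed_lines() {
  if [ -s "$2" ]; then
    # -F fixed strings, -x whole-line match, -v invert, -f read patterns from file
    grep -Fxv -f "$2" "$1"
  else
    # No processed-messages file yet (or it is empty): everything is new
    cat "$1"
  fi
}

# Made-up stand-ins for the real files
all="$(mktemp)"
seen="$(mktemp)"
printf '%s\n' "line A" "line B" "line C" > "$all"
printf '%s\n' "line B" > "$seen"

unprocessed_lines "$all" "$seen"   # prints "line A" and "line C"
rm -f "$all" "$seen"
```

The explicit empty-file branch avoids depending on how a particular grep treats an empty pattern file.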

How it was tested
I have
  1. created a Fedora Linux VM on my TrueNAS
  2. installed memtester on this VM
  3. overclocked my ECC memory to a point where it isn't 100% stable anymore, but still boots
  4. started memtester
  5. and after about 10 minutes I received the email below
Subject: TrueNAS data.local: Alerts
Message:
MCA Errors were found in /var/log/messages:

Dec 12 01:36:31 data MCA: Bank 18, Status 0x9c2040000000011b
Dec 12 01:36:31 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Dec 12 01:36:31 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Dec 12 01:36:31 data MCA: CPU 0 COR GCACHE LG RD error
Dec 12 01:36:31 data MCA: Address 0x40000059e825c40
Dec 12 01:36:31 data MCA: Misc 0xd01b0fff01000000

Possible improvements
  • If TrueNAS / FreeNAS has its configuration available as shell variables or something similar, then I could use those to determine, for example, the email address to send to and the subject to use, so that the email looks even more like an official TrueNAS / FreeNAS email.
  • Perhaps there are other useful types of messages worth monitoring and alerting for?
  • With some minor improvements, perhaps it can be officially included in TrueNAS 12 as a simple but efficient workaround?
Author
Mastakilla

Latest updates

  1. Resource updated to make it work when bash isn't the default shell of the executing user

    My scripts assumed you were using /bin/bash as configured shell when executing the scripts. I've...
  2. Init script added to make sure downtime does not cause missed messages

    I've done a minor change to the cron script to store which MCA messages were processed I've...
  3. Possible improvement identified

    I've identified an extra possible improvement. One that is also very important (you could call...
  4. Check added to make sure the log file exists before trying to read it

    I've updated /root/bin/email_mca_log_messages.bash [[ -f /var/log/messages.0.bz2 ]] && and [[...
  5. Script code updated to properly exit with 'exit 0'

    I've made some small modifications to /root/bin/email_mca_log_messages.bash, so that it does an...

Latest reviews

making truenas finally complete ;) great work.