Purpose
Monitor /var/log/messages for MCA-related messages and email them to you when any are found. MCA messages report hardware errors detected by the CPU, for example memory ECC errors.
More details: https://en.wikipedia.org/wiki/Machine_Check_Architecture
Why
TrueNAS / FreeNAS expects people to use motherboards that monitor and report hardware issues outside of the OS (Platform First Error Handling). If your motherboard does not fully or properly support this, you will not find out about such issues, because TrueNAS / FreeNAS itself does nothing with these messages.
Audience
If you're not using a motherboard that the community has confirmed to work, then you should probably use this. My motherboard, for example, has all the settings in the IPMI to monitor the "DIMM [1234] ECC error" sensor and to send emails when it is triggered, but it doesn't do anything, because that part simply wasn't implemented by the vendor (because AMD refuses to supply the required information). So make sure your board is confirmed to work, or confirm it yourself.
How
Two simple bash scripts:
- One script run by a cron job every 10 minutes (For periodic monitoring while the server is online)
- One script run on init (This catches potentially missed MCA messages, in case the server is turned off when the above periodic cron job should run)
The scripts
periodic_mca_log_monitoring.bash
In the shell as 'root':
Code:
mkdir -p /root/bin
touch /root/bin/periodic_mca_log_monitoring.bash
chown root:wheel /root/bin/periodic_mca_log_monitoring.bash
chmod 700 /root/bin/periodic_mca_log_monitoring.bash
vi /root/bin/periodic_mca_log_monitoring.bash
contents of /root/bin/periodic_mca_log_monitoring.bash :
Code:
#!/bin/bash

declare MCA_MESSAGES
declare MCA_LOG_PATH="/var/log/MCA-messages"
declare EMAIL_ADDRESS="email@domain.com"

# Load all MCA related messages from the previous 10-minute-timeframe from /var/log/messages.0.bz2 (this is the previous / rotated messages logfile)
# If the errors occurred around the time the messages logfile was rotated, then it is important to check the previous file too.
# Note: 'date -v -10M' is FreeBSD date syntax; sed strips the last minute digit, so the pattern matches a whole 10-minute block.
[[ -f /var/log/messages.0.bz2 ]] && MCA_MESSAGES="$(bzcat /var/log/messages.0.bz2 | egrep "$(date -v -10M "+%b %e %H:%M" | sed 's/.$//').*$(hostname -s) MCA.*")"

# If something is in ${MCA_MESSAGES}, then append a new-line
[[ -n "${MCA_MESSAGES}" ]] && MCA_MESSAGES+=$'\n'

# Append all MCA related messages from the previous 10-minute-timeframe from /var/log/messages
[[ -f /var/log/messages ]] && MCA_MESSAGES+="$(cat /var/log/messages | egrep "$(date -v -10M "+%b %e %H:%M" | sed 's/.$//').*$(hostname -s) MCA.*")"

# Remove a trailing new-line (left over when only the rotated logfile contained matches)
MCA_MESSAGES="${MCA_MESSAGES%$'\n'}"

# If something is in ${MCA_MESSAGES}, then store it in ${MCA_LOG_PATH} and send an email. If the email command fails, then 'exit 1'
if [[ -n "${MCA_MESSAGES}" ]]; then
    echo "${MCA_MESSAGES}" >>"${MCA_LOG_PATH}"
    mail -s "TrueNAS $(hostname): Alerts" "${EMAIL_ADDRESS}" <<< "MCA Errors were found in /var/log/messages:"$'\n\n'"${MCA_MESSAGES}" || exit 1
fi

# If the script completes successfully then 'exit 0'
exit 0
init_mca_log_monitoring.bash
Code:
mkdir -p /root/bin
touch /root/bin/init_mca_log_monitoring.bash
chown root:wheel /root/bin/init_mca_log_monitoring.bash
chmod 700 /root/bin/init_mca_log_monitoring.bash
vi /root/bin/init_mca_log_monitoring.bash
contents of /root/bin/init_mca_log_monitoring.bash :
Code:
#!/bin/bash

declare MCA_MESSAGES_ALL MCA_MESSAGES_NEW
declare MCA_MESSAGE
declare MCA_LOG_PATH="/var/log/MCA-messages"
declare MCA_LOG_ENTRIES
declare MCA_LOG_ENTRY
declare EMAIL_ADDRESS="email@domain.com"
declare OLDIFS

# Load the already processed MCA messages.
# ${MCA_LOG_PATH} has to be declared before it is used here, and the file only exists once a first MCA message has been logged.
[[ -f "${MCA_LOG_PATH}" ]] && MCA_LOG_ENTRIES="$(cat "${MCA_LOG_PATH}")"

# Load all MCA related messages from /var/log/messages.0.bz2 (this is the previous / rotated messages logfile)
# If the errors occurred around the time the messages logfile was rotated, then it is important to check the previous file too.
[[ -f /var/log/messages.0.bz2 ]] && MCA_MESSAGES_ALL="$(bzcat /var/log/messages.0.bz2 | egrep ".*$(hostname -s) MCA.*")"

# If something is in ${MCA_MESSAGES_ALL}, then append a new-line
[[ -n "${MCA_MESSAGES_ALL}" ]] && MCA_MESSAGES_ALL+=$'\n'

# Append all MCA related messages from /var/log/messages
[[ -f /var/log/messages ]] && MCA_MESSAGES_ALL+="$(cat /var/log/messages | egrep ".*$(hostname -s) MCA.*")"

# Only store MCA messages that are not found in the MCA-log-file, in the ${MCA_MESSAGES_NEW} variable
OLDIFS="$IFS"   # Backup $IFS
IFS=$'\n'       # To loop line-per-line
for MCA_MESSAGE in ${MCA_MESSAGES_ALL}; do
    for MCA_LOG_ENTRY in ${MCA_LOG_ENTRIES}; do
        # Skip to the next ${MCA_MESSAGE}, if this one was already in ${MCA_LOG_ENTRIES}
        [[ "${MCA_MESSAGE}" == "${MCA_LOG_ENTRY}" ]] && continue 2
    done
    # If ${MCA_MESSAGE} was not found in ${MCA_LOG_ENTRIES}, then add it to ${MCA_MESSAGES_NEW}
    MCA_MESSAGES_NEW+="${MCA_MESSAGE}"$'\n'
done
MCA_MESSAGES_NEW="${MCA_MESSAGES_NEW%$'\n'}"   # Remove trailing new-line
IFS="$OLDIFS"   # Restore $IFS from backup

# If something is in ${MCA_MESSAGES_NEW}, then store it in ${MCA_LOG_PATH} and send an email. If the email command fails, then 'exit 1'
if [[ -n "${MCA_MESSAGES_NEW}" ]]; then
    echo "${MCA_MESSAGES_NEW}" >>"${MCA_LOG_PATH}"
    mail -s "TrueNAS $(hostname): Alerts" "${EMAIL_ADDRESS}" <<< "MCA Errors were found in /var/log/messages:"$'\n\n'"${MCA_MESSAGES_NEW}" || exit 1
fi

# If the script completes successfully then 'exit 0'
exit 0
The cron job
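For reference, the job's values can be written in classic crontab notation as shown below. This is only an illustration of the schedule and command mentioned elsewhere in this post; TrueNAS / FreeNAS cron jobs are normally created through the web interface, and running the job as 'root' is assumed.

```text
# minute   hour  day-of-month  month  day-of-week  command
1-59/10    *     *             *      *            /bin/bash /root/bin/periodic_mca_log_monitoring.bash
```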
Init script
Additional requirements
- Properly configure and test your email settings in TrueNAS / FreeNAS.
- In both bash scripts, replace email@domain.com with your own email address
- Make sure you execute the scripts as bash scripts:
  - Either configure the executing user to have bash as its shell
  - Or run the scripts as an argument of /bin/bash. For example "/bin/bash /root/bin/periodic_mca_log_monitoring.bash"
- You can further customize the script as you wish (if you know what you're doing)
periodic_mca_log_monitoring.bash
- The cronjob schedule "1-59/10 * * * *" = Every 10 minutes, at minute 1 of each 10-minute block (for example: 01, 11, 21, ...)
- The /root/bin/periodic_mca_log_monitoring.bash script is executed
- egrep "$(date -v -10M "+%b %e %H:%M" | sed 's/.$//') ..." = The script searches for MCA messages in the previous 10-minute block (for example: at minute 01 it checks minutes 50-59 of the previous hour, at minute 11 it checks minutes 00-09, at minute 21 it checks minutes 10-19, etc.)
- bzcat /var/log/messages.0.bz2 and cat /var/log/messages = It searches both /var/log/messages and /var/log/messages.0.bz2 (so that it also works properly around log rotations)
- egrep "... .*$(hostname -s) MCA.*" = To make sure only MCA messages are reported
- echo "${MCA_MESSAGES}" >>"${MCA_LOG_PATH}" = The processed messages are stored in a file
- mail -s "<subject>" "<youremailaddress>" <<< "<message with MCA messages>" = And the MCA messages are emailed to an address of your choice
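The block matching described above can be tried out without touching the real logs. The sketch below is only an illustration: block_prefix is a hypothetical helper (it does the same thing as the sed call in the script), and the host name "truenas" and the log lines are made up.

```shell
#!/bin/bash

# Hypothetical helper (not part of the original scripts): turn a minute-resolution
# syslog timestamp into its 10-minute-block prefix by dropping the last minute digit,
# which is what "sed 's/.$//'" does in the script.
block_prefix() {
    local stamp="$1"
    printf '%s\n' "${stamp%?}"
}

# Made-up sample log lines; the host name and message texts are assumptions.
sample_log='Nov  3 14:19:58 truenas MCA: example entry from the 14:10 block
Nov  3 14:21:07 truenas MCA: example entry from the 14:20 block
Nov  3 14:23:00 truenas kernel: unrelated message
Nov  3 14:29:59 truenas MCA: another example entry from the 14:20 block'

# A run at 14:31 looks 10 minutes back (14:21) and therefore scans 14:20-14:29.
prefix="$(block_prefix "Nov  3 14:21")"
matches="$(grep -E "${prefix}.*truenas MCA.*" <<< "${sample_log}")"

echo "prefix: ${prefix}"
echo "${matches}"
```

Only the two MCA lines from the 14:20 block are selected; the 14:19 entry belongs to the previous block and the kernel line is not an MCA message.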
init_mca_log_monitoring.bash
This script is very similar to periodic_mca_log_monitoring.bash, but it isn't limited to a 10-minute block. Instead it searches all entries of the messages files for MCA entries. It uses the processed-messages file (/var/log/MCA-messages) written by periodic_mca_log_monitoring.bash to determine whether an entry was already processed (and should be ignored) or was missed by periodic_mca_log_monitoring.bash and should be processed by init_mca_log_monitoring.bash instead. See the in-script comments for more details on how it does that.
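The "already processed?" check can also be expressed with grep instead of a nested loop. The sketch below is an alternative illustration, not the method the script itself uses; filter_new_entries is a hypothetical function name and the sample entries are made up.

```shell
#!/bin/bash

# Hypothetical helper: print the lines from stdin that do not appear verbatim
# in the processed-messages file given as the first argument.
filter_new_entries() {
    local processed_file="$1"
    if [[ -s "${processed_file}" ]]; then
        # -F fixed strings, -x whole-line match, -v invert, -f read patterns from file
        grep -F -x -v -f "${processed_file}"
    else
        # Nothing processed yet, so every entry is new
        cat
    fi
}

# Made-up sample data: entry A was already processed, entry B was missed.
processed="$(mktemp)"
printf '%s\n' 'Nov  3 14:21:07 truenas MCA: example entry A' > "${processed}"
all_entries=$'Nov  3 14:21:07 truenas MCA: example entry A\nNov  3 02:10:44 truenas MCA: example entry B'

new_entries="$(filter_new_entries "${processed}" <<< "${all_entries}")"
echo "${new_entries}"

rm -f "${processed}"
```

Here only entry B is printed, because entry A is already present in the processed-messages file.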
How it was tested
I have
- created a Fedora Linux VM on my TrueNAS
- installed memtester on this VM
- overclocked my ECC memory to a point where it isn't 100% stable anymore, but still boots
- started memtester
- and after about 10 minutes I received the alert email
Possible improvements
- If TrueNAS / FreeNAS makes its configuration available as shell variables or something similar, then I could use those to determine, for example, the email address to send to and the subject to use, so that the email looks even more like an official TrueNAS / FreeNAS email.
- Perhaps there are other useful types of messages worth monitoring and alerting for?
- With some minor improvements, perhaps it could be officially included in TrueNAS 12 as a simple but efficient workaround?