Monitor and email an alert for ECC Memory errors on OS-level

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
203
Mastakilla submitted a new resource:

Monitor and email an alert for ECC Memory errors on OS-level - Monitor and email an alert for ECC Memory errors on OS-level

Purpose
Monitor /var/log/messages for MCA-related messages and email them to you when any are found. MCA messages include, for example, ECC memory error reports, among other hardware events.
More details: https://en.wikipedia.org/wiki/Machine_Check_Architecture
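As a rough illustration of the idea described under Purpose, here is a minimal sketch that scans a log file for MCA lines and mails them only if something was found. It is not the resource's actual script: the function names, the `MCA:` search pattern, and the use of `mail -s` for delivery are all assumptions made for this example.

```shell
#!/bin/bash
# Sketch only, not the resource's script.
# Assumptions: MCA lines contain the literal text "MCA:", and a working
# "mail" command is available for delivery.

find_mca_lines() {
    local logfile="$1"
    # grep exits with 1 when nothing matches; "|| true" keeps an empty
    # result from being treated as a failure.
    grep -F 'MCA:' "$logfile" || true
}

report_mca_lines() {
    local logfile="$1" recipient="$2"
    local matches
    matches="$(find_mca_lines "$logfile")"
    # Only send an email when at least one line matched.
    if [ -n "$matches" ]; then
        printf '%s\n' "$matches" | mail -s "MCA messages on $(hostname)" "$recipient"
    fi
}
```

It could be called as `report_mca_lines /var/log/messages you@example.com`, where the address is a placeholder.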

Why
TrueNAS / FreeNAS expects people to use motherboards that support monitoring and reporting for hardware issues outside of the OS (Platform First Error Handling). In case your motherboard does not fully /...

Read more about this resource...
 

diversity

Contributor
Joined
Dec 4, 2018
Messages
128
Although triggering ECC errors as Mastakilla suggests is the best way to do it (as it can trigger both single-bit and multi-bit errors), it is also enormously time-consuming and hard to get right.

For those who are strapped for time and want a quick fix, one can short pins 2 and 5 of a memory slot. It's fast and, when done with a steady hand, safe to do. It won't give you multi-bit ECC errors though, only single-bit ones.
 

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
203
I've made some small modifications to the script itself, so that it does an 'exit 0' at the end. This should prevent the system from thinking the command has failed and writing the error below to /var/log/middlewared.log
Code:
[2020/12/13 16:31:00] (ERROR) middlewared.job.run():373 - Job <bound method accepts.<locals>.wrap.<locals>.nf of <middlewared.plugins.cron.CronJobService object at 0x81cb735e0>> failed
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/middlewared/job.py", line 361, in run
    await self.future
  File "/usr/local/lib/python3.8/site-packages/middlewared/job.py", line 399, in __run_body
    rv = await self.middleware.run_in_thread(self.method, *([self] + args))
  File "/usr/local/lib/python3.8/site-packages/middlewared/utils/run_in_thread.py", line 10, in run_in_thread
    return await self.loop.run_in_executor(self.run_in_thread_executor, functools.partial(method, *args, **kwargs))
  File "/usr/local/lib/python3.8/site-packages/middlewared/utils/io_thread_pool_executor.py", line 25, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.8/site-packages/middlewared/schema.py", line 977, in nf
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/middlewared/plugins/cron.py", line 282, in run
    raise CallError(f'CronTask "{cron_cmd}" exited with {cp.returncode} (non-zero) exit status.')
middlewared.service_exception.CallError: [EFAULT] CronTask "/root/bin/email_mca_log_messages.bash > /dev/null" exited with 1 (non-zero) exit status.
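For context on why an explicit 'exit 0' helps: a shell script exits with the status of its last command unless it ends with an explicit exit. The sketch below assumes the non-zero status came from a grep that found nothing (grep returns 1 in that case). The post does not say which command was actually responsible, and the function names and `MCA:` pattern are made up for illustration.

```shell
#!/bin/bash
# Sketch only: shows how the last command's status becomes the exit status.
# Each function runs its body in a subshell to mimic a separate script.

check_without_exit0() {
    # The subshell's status is that of grep: 1 when no line matches.
    ( grep -q 'MCA:' "$1" )
}

check_with_exit0() {
    # Same search, but the explicit "exit 0" makes the subshell report
    # success even when grep found nothing.
    ( grep -q 'MCA:' "$1"; exit 0 )
}
```

With a log file that contains no MCA lines, the first function returns 1 (which a cron wrapper would report as a failure) and the second returns 0.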
 

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
203
Mastakilla updated Monitor and email an alert for ECC Memory errors on OS-level with a new update entry:

Script code updated to properly exit with 'exit 0'

I've made some small modifications to the script itself, so that it does an 'exit 0' at the end. This should prevent the system from thinking the command has failed and writing the error below to /var/log/middlewared.log
Code:
[2020/12/13 16:31:00] (ERROR) middlewared.job.run():373 - Job <bound method accepts.<locals>.wrap.<locals>.nf of <middlewared.plugins.cron.CronJobService object at 0x81cb735e0>> failed
Traceback (most recent call last):
  File...

Read the rest of this update entry...
 

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
203
Mastakilla updated Monitor and email an alert for ECC Memory errors on OS-level with a new update entry:

Possible improvement identified

I've identified another possible improvement, and a very important one (you could even call it a bug).

The script only properly works if FreeNAS/TrueNAS remains online...

For example:
If MCA messages are logged at 16h43, the system crashes at 16h49, and FreeNAS/TrueNAS only comes back online at 17h03, it will have missed the 16h51 cron job that would have searched the 16h4x timeframe, so the MCA messages that occurred will NOT be reported!
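The gap in this example can be made concrete with a small sketch. It assumes the cron job runs every ten minutes at minute x1 and searches for the timestamp prefix of the previous ten-minute slot; the function name and the `HH:M` prefix format are hypothetical and may not match how the actual script selects its timeframe.

```shell
#!/bin/bash
# Sketch only: compute the "HH:M" prefix of the ten-minute slot that a
# run at the given hour and minute would search (the previous slot).

previous_slot_pattern() {
    # Strip one leading zero so values like 08 are not read as octal.
    local hh="${1#0}" mm="${2#0}"
    hh="${hh:-0}"; mm="${mm:-0}"
    # Go back ten minutes, wrapping around midnight.
    local total=$(( (hh * 60 + mm - 10 + 1440) % 1440 ))
    printf '%02d:%d\n' $(( total / 60 )) $(( total % 60 / 10 ))
}
```

Under these assumptions, `previous_slot_pattern 16 51` prints `16:4`, which is the run that never happened because of the crash. The first run after the 17h03 boot, at 17h11, gives `17:0`, so the 16h4x slot is never searched.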

Read the rest of this update entry...
 

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
203
To solve the above, I could try to detect "missed timeslots" after a reboot, but I'm afraid this would make the permanent monitoring script quite a bit more complicated and resource-intensive (as it would need to keep track of which timeslots it has processed).

So I'm leaning more towards keeping things simple, even if that is less perfect.

I'm thinking of a simple on-boot script that mails all MCA messages it finds on boot. That way you certainly won't miss any MCA messages (which is the most important thing to solve), but, as a downside, you might get MCA messages that you already received in an earlier email (which should not be a disaster)...


Edit:
OK, I think I figured out a way to do it without complicating things too much or making it more resource-intensive during normal operation...

Edit 2:
Done! Implemented and tested...
 
Mastakilla

Patron
Joined
Jul 18, 2019
Messages
203
Mastakilla updated Monitor and email an alert for ECC Memory errors on OS-level with a new update entry:

Resource updated to make it work when bash isn't the default shell of the executing user

My scripts assumed that /bin/bash was the configured shell of the user executing them. I've now updated the resource so that it calls /bin/bash explicitly when executing the scripts, so it also works when your user has a different shell configured.

Read the rest of this update entry...
 

diversity

Contributor
Joined
Dec 4, 2018
Messages
128
Thanks, Mastakilla, for keeping us updated.
 