
SOLVED The usefulness of ECC (if we can't assess it's working)?

Yorick

Dedicated Sage
Joined
Nov 4, 2018
Messages
1,731
I'm trying to get ECC running on the new board with a Zen 2 CPU but I'm not sure if it works. ASRockRack X470D4U with Ryzen 3 3100.

Do I need another CPU for ECC?
ECC will work with that CPU. It's the APUs that don't support ECC unless they are "PRO" models.

dmidecode can be wrong.

You want to be sure that "platform-first" reporting in UEFI is disabled. That option may have been removed in recent UEFI versions. TrueNAS 12 Core should then be able to report ECC errors to you, although official support for that feature has been put on the back burner.

The advantage of an Intel platform is that IPMI notifications work today, and aren't a "may or may not happen" feature.
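
As a rough illustration of the dmidecode check mentioned above, here is a minimal Python sketch that pulls the ECC-related fields out of `dmidecode -t memory` output. The function name `ecc_hints` and the sample text are made up for this example. As noted, dmidecode only repeats what the firmware claims, so a positive result here is a hint, not proof that ECC is actually working.

```python
import re

def ecc_hints(dmidecode_text):
    """Extract ECC-related fields from the text printed by `dmidecode -t memory`."""
    correction = re.findall(r"^\s*Error Correction Type:\s*(.+)$", dmidecode_text, re.M)
    total = [int(n) for n in re.findall(r"^\s*Total Width:\s*(\d+) bits", dmidecode_text, re.M)]
    data = [int(n) for n in re.findall(r"^\s*Data Width:\s*(\d+) bits", dmidecode_text, re.M)]
    # ECC modules carry extra check bits, so the total width exceeds the data width
    wider = [t > d for t, d in zip(total, data)]
    return {"error_correction": [c.strip() for c in correction], "ecc_width": wider}

# Hypothetical, heavily shortened sample of dmidecode output
sample = """\
Physical Memory Array
\tError Correction Type: Multi-bit ECC
Memory Device
\tTotal Width: 72 bits
\tData Width: 64 bits
"""
print(ecc_hints(sample))
```

On a real system the text would come from running dmidecode as root and passing its output to the function.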
 

diversity

Member
Joined
Dec 4, 2018
Messages
120
FreeNAS does not yet alert, though Event Log through IPMI will show it of course. FreeNAS will alert on ECC error in a future version, that's tracked here: https://jira.ixsystems.com/browse/NAS-105287
May I please draw attention to this again? The issue is marked as low priority and as far as I can tell still open.

I hope I am somehow mistaken otherwise @developers whaaaat?
 

diversity

Member
Joined
Dec 4, 2018
Messages
120
It might be political. I just can't shake that notion
 

Yorick

Dedicated Sage
Joined
Nov 4, 2018
Messages
1,731
A couple days ago I've asked Asrock Rack for an update on fixing the IPMI ECC reporting and they responded:
  1. We check ASPEED but they said ECC function need AMD BIOS support.
  2. We then check with TW AMD if they can provide BIOS code? But there is no such code.
Any further update on this? Do you get the sense that AMD/Aspeed will work together to enable IPMI reporting, or is that a non-starter?
 

diversity

Member
Joined
Dec 4, 2018
Messages
120
AMD is actively ignoring the issue despite having been reminded several times by Mastakilla and ASRock Rack themselves.

I know because I am (or rather was, as I gave up hope some time ago) involved in this, cc'd on some of the emails.
 

diversity

Member
Joined
Dec 4, 2018
Messages
120
looking at
which I marked as closed, I hope we will finally learn someday why iXsystems refuses to implement the email-on-ECC-error feature.
It looks like it is possible, given the thread above.

Mastakilla and I have demonstrated that it is perfectly possible to assess, on several levels, whether ECC detection, correction and reporting are working.
The quickest way is to just short pins 2 and 5 (this only confirms single-bit errors).
The better way is to use Mastakilla's overclocking technique (when done correctly, this also covers multi-bit errors).

Respect all for your input
 

danb35

Wizened Sage
Joined
Aug 16, 2011
Messages
11,744
which I marked as closed I hope we will finally learn someday why IXsystems refuses to implement the email on ecc error feature.
Well, you seem to care about this roughly 100x more than anyone else, but what makes you think they "refuse" to implement it? They're allowed to have their priorities--why must this be the top one? Or even a high one?
 

diversity

Member
Joined
Dec 4, 2018
Messages
120
Because ECC is needed to protect data, and if ECC fails or is about to fail, one needs to see it coming.

Why is this not top priority?

And even if it were only medium priority, which I dispute, why does version 12 still not have it?
 

diversity

Member
Joined
Dec 4, 2018
Messages
120
Well, you seem to care about this roughly 100x more than anyone else, but what makes you think they "refuse" to implement it? They're allowed to have their priorities--why must this be the top one? Or even a high one?
Have you really looked at Mastakilla's responses and those of others? There is a whole subset of people who care deeply about the usefulness of ECC.
 

danb35

Wizened Sage
Joined
Aug 16, 2011
Messages
11,744
Why is this not top priority?
I'd speculate that it's because disk data errors secondary to bad RAM are vanishingly rare, especially when ECC RAM is being used. When ECC is used, the system will correct single bit errors, and halt on double bit errors--in neither case is data compromised. IMO, this is a "nice to have", nothing more.
There is a whole subset of people who care deeply about the usefulness of ECC.
And how many of them think it should be a "top priority" for TrueNAS to email you when the system corrects a bit error?
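
To illustrate the "correct single-bit errors, detect double-bit errors" behaviour described above, here is a toy Python sketch. It is a teaching example under stated assumptions: it protects only 4 data bits with an extended Hamming code, whereas ECC DIMMs store 8 check bits per 64 data bits, and the function names `encode` and `decode` are made up here. It says nothing about what the BIOS or kernel does once an uncorrectable error has been detected.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit codeword: Hamming(7,4) plus an overall parity bit."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0 = overall parity, 1..7 = Hamming positions
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2         # makes the parity of the whole word even
    return code

def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of set positions is 0 for a valid word
    odd_parity = sum(code) % 2 == 1
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        code[syndrome] ^= 1             # single flip: the syndrome names the bad position
        status = "corrected"
    else:
        return "uncorrectable", None    # even parity but non-zero syndrome: two flips
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble

word = encode(0b1011)
single = list(word); single[5] ^= 1
double = list(word); double[5] ^= 1; double[6] ^= 1
print(decode(word), decode(single), decode(double))
```

The point of the sketch is only that a double-bit error is recognised as such rather than silently "corrected" into wrong data.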
 

diversity

Member
Joined
Dec 4, 2018
Messages
120
I'd speculate that it's because disk data errors secondary to bad RAM are vanishingly rare, especially when ECC RAM is being used. When ECC is used, the system will correct single bit errors, and halt on double bit errors--in neither case is data compromised. IMO, this is a "nice to have", nothing more.

And how many of them think it should be a "top priority" for TrueNAS to email you when the system corrects a bit error?
Max respect for your contribution. I should, however, inform you that a proper kernel will not halt on every multi-bit ECC error, only when critical RAM space is involved.
 

diversity

Member
Joined
Dec 4, 2018
Messages
120
"Nice to have" would be a nice-looking lady ringing my doorbell to tell me to go check on a memory module that is responsible for keeping my data safe and might be about to fail.

Critical is seeing that eventuality coming from miles away.

I just don't get why people can be so stubborn about this.

If ECC memory is essential to keeping data safe, and we cannot assess whether it is working or about to fail, then the whole thing is broken by design.

Even Einstein or Hawking can't blow a (black) hole in that logic.
 

diversity

Member
Joined
Dec 4, 2018
Messages
120
But in all fairness, Mastakilla and I did find a way to assess it. Now the email reporting part is still open.
 

diversity

Member
Joined
Dec 4, 2018
Messages
120
I'd speculate that it's because disk data errors secondary to bad RAM are vanishingly rare, especially when ECC RAM is being used. When ECC is used, the system will correct single bit errors, and halt on double bit errors--in neither case is data compromised. IMO, this is a "nice to have", nothing more.

And how many of them think it should be a "top priority" for TrueNAS to email you when the system corrects a bit error?
Thank you Dan! Max respect for your candid insight
 

danb35

Wizened Sage
Joined
Aug 16, 2011
Messages
11,744
I should however inform you that a proper kernel will not halt on multi bit ecc.
I'd understood this would happen at the BIOS level--perhaps I've misunderstood.
even Einstein can't blow a hole in that logic
A schoolboy could do it--two false premises yield an invalid conclusion.
 

diversity

Member
Joined
Dec 4, 2018
Messages
120
A schoolboy could do it--two false premises yield an invalid conclusion.
I remember Einstein saying that if you keep trying the same test, you shouldn't expect different results, or something along those lines.

That doesn't apply to what we are talking about, as I (and Mastakilla and others) have been relentlessly trying different things to get to a result.

Please educate me if you will, as I fail to see the two false premises you are referring to.
 

diversity

Member
Joined
Dec 4, 2018
Messages
120
And how many of them think it should be a "top priority" for TrueNAS to email you when the system corrects a bit error?
All of them (i.e. 100% of the people who actually went in deep), believe you me!
 

diversity

Member
Joined
Dec 4, 2018
Messages
120
ok guys, i could not let this rest.

I have a proof of concept for a cron that sends an email when MCA errors are detected. I already confirmed with real ecc errors that these errors end up in /var/log/messages

it's a python script (because python is installed on truenas by default)

Code:
import os
# Import smtplib for the actual sending function
import smtplib
# Import the email modules we'll need
from email.message import EmailMessage

# Run a shell command: collect all lines containing "MCA:" into a new file MCAmessages.cat
os.system("grep MCA: /var/log/messages > MCAmessages.cat")

# Count the matching lines (the with-block closes the file again)
with open("MCAmessages.cat") as f:
    lineCnt = len(f.readlines())

if lineCnt > 0:
    print("sending email")

    msg = EmailMessage()
    msg.set_content("MCA error(s) detected on TrueNAS")  # TODO: send the contents of MCAmessages.cat

    msg['Subject'] = "MCA error(s) detected on TrueNAS"
    msg['From'] = "your email goes here"
    msg['To'] = "your email goes here"

    # Send the email via our own SMTP server.
    s = smtplib.SMTP("your smtp server goes here")
    s.send_message(msg)
    s.quit()
else:  # TODO: actually do nothing so remove the else statement
    print("no lines containing MCA: substring found in messages")  # INFO: only for testing

# Remove all lines from /var/log/messages that contain MCA: to prevent getting stuck in a loop
os.system("sed -i.bak '/ MCA: /d' ./MCAmessages.cat")  # INFO: test it first on MCAmessages.cat


When I run this in the shell using python {filename.py}, I get an email. All good so far.

But when I install it via crontab -e as 0/10 0/1 * * * * * python /root/test.py #(run every 10 seconds), the cron job gets installed but I am not getting any output or emails.

Does anyone have a suggestion on how to proceed?
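
The first TODO in the script above (sending the contents of MCAmessages.cat instead of a fixed text) could be handled roughly as follows. This is only a sketch: the helper name `build_alert` and the example addresses are made up, and the resulting message would still be sent with `smtplib` exactly as in the script.

```python
from email.message import EmailMessage

def build_alert(mca_lines, sender, recipient):
    """Build an alert email whose body lists the matched MCA log lines."""
    msg = EmailMessage()
    msg["Subject"] = "MCA error(s) detected on TrueNAS"
    msg["From"] = sender
    msg["To"] = recipient
    # Strip the trailing newlines that readlines() leaves on each line
    body = "MCA errors were found in /var/log/messages:\n\n"
    body += "\n".join(line.rstrip("\n") for line in mca_lines)
    msg.set_content(body)
    return msg

# Example with a made-up log line; in the script the lines would come from MCAmessages.cat
alert = build_alert(["Nov 28 09:20:01 data MCA: example line\n"],
                    "nas@example.com", "admin@example.com")
print(alert["Subject"])
```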
 

Mastakilla

Member
Joined
Jul 18, 2019
Messages
147
Good work!

You triggered my interest and got me looking at this (I did need some diversion from my horrible stability issue :mad: ). Here is my take at it:
Code:
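#!/bin/bash
# Email any MCA errors logged to /var/log/messages during the previous hour
declare MCA_ERRORS

# Match lines stamped with the previous hour (FreeBSD date syntax), followed by the short hostname and "MCA"
MCA_ERRORS="$(grep -E "$(date -v -1H "+%b %e %H").*$(hostname -s) MCA.*" /var/log/messages)"
# Only send mail when something matched; printf expands the \n that a plain here-string would not
[[ -n "${MCA_ERRORS}" ]] && printf 'MCA Errors were found in /var/log/messages:\n%s\n' "${MCA_ERRORS}" | mail -s "TrueNAS $(hostname): Alerts" youremail@outlook.com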


I haven't tried or tested it yet, but it should be possible to get it working like this...

Specifics are:
  • One language only (bash). Yours is Python and some shell by using os.system().
  • I think yours might email you every 10 seconds once MCA errors are found, as it checks the whole /var/log/messages for MCA each time ;) I tried to solve this by only checking once per hour and only the last hour (you could probably modify this to make it more frequent)
  • My /var/log/messages contains entries like "Nov 28 09:14:19 data Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>", which aren't errors. By making the grep more specific, these should be excluded as well...
  • It will email the MCA errors for the previous hour to your email address
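
The point about excluding the CPU feature line can be sketched in Python as well. The assumption here is that genuine error lines carry the literal "MCA: " tag (the string grepped for earlier in the thread), while the feature list only contains "MCA," between commas. The second sample line is invented for illustration and the function name `mca_error_lines` is hypothetical.

```python
import re

# Assumption: real error lines contain "MCA: " as a tag, not "MCA," inside a feature list
MCA_ERROR = re.compile(r"\bMCA: ")

def mca_error_lines(lines):
    """Keep only the lines that look like MCA error reports."""
    return [line for line in lines if MCA_ERROR.search(line)]

sample = [
    "Nov 28 09:14:19 data Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,"
    "APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>",
    "Nov 28 09:20:01 data MCA: example error line",  # made-up line for illustration
]
print(mca_error_lines(sample))
```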
Flaws:
  • It may miss errors shortly before a logrotate. Not sure yet how to properly / efficiently catch these.
  • If you create your crontab like 1 second before the hour, perhaps it could happen that crontab is delayed and does the check for the "wrong" hour, so best set it somewhere not around the hour (like at ??h30m)
  • I tried to make it send emails similarly to "official" TrueNAS emails, but I didn't succeed yet in making mail send emails with a custom name in the from field. In my TrueNAS install, TrueNAS emails me with the short hostname in the from field, instead of just the email address.
  • Perhaps a variable exists for which email address to email to, so that the script doesn't need to be edited to work?
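
One possible way around the repeated-email and "wrong hour" problems, sketched here in Python, is to remember how far the log was read on the previous run instead of matching on timestamps. This is an assumption-heavy sketch, not something tested on TrueNAS: the function name `read_new_lines` and the state file are made up, and lines written between the last run and a log rotation would still be missed unless the rotated file is read too.

```python
import os

def read_new_lines(log_path, state_path):
    """Return the lines appended to log_path since the previous call."""
    try:
        with open(state_path) as f:
            offset = int(f.read().strip() or 0)
    except (FileNotFoundError, ValueError):
        offset = 0                      # first run, or unreadable state file
    if os.path.getsize(log_path) < offset:
        offset = 0                      # log shrank: assume it was rotated, start over
    with open(log_path, "rb") as log:
        log.seek(offset)
        data = log.read()
        new_offset = log.tell()
    # Remember where this run stopped so the next run only sees newer lines
    with open(state_path, "w") as f:
        f.write(str(new_offset))
    return data.decode("utf-8", errors="replace").splitlines()
```

Each run would then filter the returned lines for MCA errors and send a mail only if any are found, so the schedule no longer has to line up with the hour.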
Regarding crontab:
Sorry, I didn't really look at your crontab issue yet, but I think TrueNAS prefers that you use the GUI -> Tasks instead of the crontab command. Not sure if that is because the crontab command is special / different or crippled somehow. But I suggest you try that.

And now I'm back to figuring out what's wrong with my server :frown: Assistance is always welcome :grin:
 