Trigger ECC fault to test Notifications

Status
Not open for further replies.

malcolmputer

Explorer
Joined
Oct 28, 2013
Messages
55
I would like to try out this fancy utility that I keep hearing about that will allow me to "flip" a bit and watch to see if ECC will correct and notify me.

So, what's the name of the tool? How do I use it?

If it matters, I have a Supermicro X10SLL-F with an E3-1241 CPU and 32GB of Micron ECC DDR3.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Tool? No such tool that I know of. Intel processors are capable of injecting ECC errors, but I haven't seen any tools.
 

malcolmputer

Explorer
Joined
Oct 28, 2013
Messages
55
Ericloewe said:
Tool? No such tool that I know of. Intel processors are capable of injecting ECC errors, but I haven't seen any tools.

I have seen vague references to this feature; I was just wondering how to activate it (I assumed it would be a CLI tool I could compile and run). Ideally, I want to unmount all of my pools, then run this to create a single-bit error and see if I get notifications (so I know where to check as part of monthly PM).
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
The only solution that I know of is to read the datasheet (I'll help you out, it's volume II that you want, IIRC), figure out which registers need to be written and write an assembly program that does that. And this assumes that it's not something that's only available through the chipset, or some other complication.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
You can do a dmidecode that simulates a correctable error. It allows you to test that the system properly logs those kinds of errors and the reporting system is functional. There is no way to simulate an uncorrectable error.
 

malcolmputer

Explorer
Joined
Oct 28, 2013
Messages
55
cyberjock said:
You can do a dmidecode that simulates a correctable error. It allows you to test that the system properly logs those kinds of errors and the reporting system is functional. There is no way to simulate an uncorrectable error.

I don't see anything in the man page of dmidecode, and the only information it gives me with -t is:

Code:
        
dmidecode -t memory
# dmidecode 2.12
SMBIOS 2.7 present.

Handle 0x0022, DMI type 16, 23 bytes
Physical Memory Array
        Location: System Board Or Motherboard
        Use: System Memory
        Error Correction Type: Single-bit ECC
        Maximum Capacity: 32 GB
        Error Information Handle: Not Provided
        Number Of Devices: 4

Handle 0x0023, DMI type 17, 34 bytes
Memory Device
        Array Handle: 0x0022
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 8192 MB
        Form Factor: DIMM
        Set: None
        Locator: P1-DIMMA1
        Bank Locator: P0_Node0_Channel0_Dimm0
        Type: DDR3
        Type Detail: Synchronous
        Speed: 1333 MHz
        Manufacturer: Micron
        Serial Number: 36626F5D
        Asset Tag: 9876543210
        Part Number: 18KSF1G72AZ-1G4E1
        Rank: 2
        Configured Clock Speed: 1333 MHz

Handle 0x0025, DMI type 17, 34 bytes
Memory Device
        Array Handle: 0x0022
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 8192 MB
        Form Factor: DIMM
        Set: None
        Locator: P1-DIMMA2
        Bank Locator: P0_Node0_Channel0_Dimm1
        Type: DDR3
        Type Detail: Synchronous
        Speed: 1333 MHz
        Manufacturer: Micron
        Serial Number: 29826E65
        Asset Tag: 9876543210
        Part Number: 18KSF1G72AZ-1G4E1
        Rank: 2
        Configured Clock Speed: 1333 MHz

Handle 0x0027, DMI type 17, 34 bytes
Memory Device
        Array Handle: 0x0022
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 8192 MB
        Form Factor: DIMM
        Set: None
        Locator: P1-DIMMB1
        Bank Locator: P0_Node0_Channel1_Dimm0
        Type: DDR3
        Type Detail: Synchronous
        Speed: 1333 MHz
        Manufacturer: Micron
        Serial Number: 33546EBE
        Asset Tag: 9876543210
        Part Number: 18KSF1G72AZ-1G4E1
        Rank: 2
        Configured Clock Speed: 1333 MHz

Handle 0x0029, DMI type 17, 34 bytes
Memory Device
        Array Handle: 0x0022
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 8192 MB
        Form Factor: DIMM
        Set: None
        Locator: P1-DIMMB2
        Bank Locator: P0_Node0_Channel1_Dimm1
        Type: DDR3
        Type Detail: Synchronous
        Speed: 1333 MHz
        Manufacturer: Micron
        Serial Number: 298FF35C
        Asset Tag: 9876543210
        Part Number: 18KSF1G72AZ-1G4E1
        Rank: 2
        Configured Clock Speed: 1333 MHz

Handle 0x004A, DMI type 17, 34 bytes
Memory Device
        Array Handle: 0x0022
        Error Information Handle: Not Provided
        Total Width: 16 bits
        Data Width: 16 bits
        Size: 16 MB
        Form Factor: <OUT OF SPEC>
        Set: None
        Locator: SPI ROM
        Bank Locator: Not Specified
        Type: Flash
        Type Detail: None
        Speed: Unknown
        Manufacturer: Winbond
        Serial Number: Not Specified
        Asset Tag: Not Specified
        Part Number: 25X/Q Series
        Rank: Unknown
        Configured Clock Speed: Unknown



Which of course tells me what I already know: this is ECC RAM, and it supports single-bit ECC when installed in the correct system.
 

malcolmputer

Explorer
Joined
Oct 28, 2013
Messages
55
Ericloewe said:
The only solution that I know of is to read the datasheet (I'll help you out, it's volume II that you want, IIRC), figure out which registers need to be written and write an assembly program that does that. And this assumes that it's not something that's only available through the chipset, or some other complication.

I found a hardforum post (http://hardforum.com/showthread.php?t=1693051) about a C program that someone wrote to confirm that ECC is enabled. I compiled it in a jail, copied it out to the FreeNAS system, and ran it. It reported the following:

Code:
./ecc_check
5004-5007h: 20 20 66 3
5008-500Bh: 20 20 66 3


Which means "3: ECC active in both I/O and ECC logic"

That makes me feel happier about it being enabled, since I assume that 5004-5007h is a CPU register, so that is the CPU itself telling me it is working (it should know, I hope). It doesn't, however, tell me how my Supermicro board will report an error.
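For anyone who wants to script that reading, here is a minimal Python sketch of decoding an ecc_check output line. It rests on assumptions not confirmed in this thread: that the four values are hex bytes printed from lowest to highest address, and that the ECC mode is the low two bits of the last byte. Only the meaning of value 3 comes from the output above; the other three labels are recalled from Intel's datasheet and should be verified there. The function and table names are hypothetical.

```python
# Meaning of the two-bit ECC mode field. Only value 3 is confirmed by this
# thread; values 0-2 are an ASSUMPTION recalled from Intel's datasheet.
ECC_MODES = {
    0: "No ECC",
    1: "ECC active in I/O, ECC logic not active",
    2: "ECC disabled in I/O, ECC logic enabled",
    3: "ECC active in both I/O and ECC logic",
}


def decode_ecc_check_line(line):
    """Parse a line like '5004-5007h: 20 20 66 3'.

    Returns (register_range, mode_description).
    """
    label, _, rest = line.partition(":")
    byte_values = [int(token, 16) for token in rest.split()]
    if len(byte_values) != 4:
        raise ValueError(f"expected 4 hex bytes, got {line!r}")
    # ASSUMPTION: the last byte printed holds register bits 31:24 and the
    # ECC mode is its low two bits (register bits 25:24).
    mode = byte_values[3] & 0x3
    return label.strip(), ECC_MODES[mode]


if __name__ == "__main__":
    for sample in ("5004-5007h: 20 20 66 3", "5008-500Bh: 20 20 66 3"):
        print(decode_ecc_check_line(sample))
```

This only interprets what the register says about configuration; it does not show that an error would actually be corrected or reported.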
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Take metal filings and sprinkle them over the RAM modules until it starts coughing up errors, of course. :p
 

malcolmputer

Explorer
Joined
Oct 28, 2013
Messages
55
jgreco said:
Take metal filings and sprinkle them over the RAM modules until it starts coughing up errors, of course. :p

No, I think you misunderstood. This is the recommended Intel system. Not AMD. Repeat: not AMD. :p

But more seriously, guys, I was told Intel was the way to go because you could verify that ECC was working. How do I do that?

If the answer is assembly on the CPU registers I will embark on that route, but I would prefer some cool tool since I paid for the right hardware this time around.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
That ECC program you speak of has been mentioned many times. It has been deemed "semi-unreliable" as it can give what appear to be false positives. If used on the wrong chipset, those registers may give the impression that ECC is supported when those registers are not even used for the ECC feature.

There are plenty of ways to verify. One of the more popular is to use dmidecode -t memory and see if the RAM is listing itself as 72 bits (or 144 bits) wide. RAM should only be 64 or 128 bits wide, so the 8 extra bits are your ECC bits.

Another way with dmidecode is to run "dmidecode -t 16" and see if it says single-bit ECC.
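Both dmidecode checks can be scripted. The following is a rough Python sketch, not an established tool: the function names are hypothetical, and the sample text is trimmed from the dmidecode output posted earlier in this thread. It reads the "Error Correction Type" from the type 16 record and computes Total Width minus Data Width for each type 17 record.

```python
import re

# Trimmed from the dmidecode output posted earlier in this thread.
SAMPLE = """\
Handle 0x0022, DMI type 16, 23 bytes
Physical Memory Array
        Error Correction Type: Single-bit ECC
        Number Of Devices: 4

Handle 0x0023, DMI type 17, 34 bytes
Memory Device
        Total Width: 72 bits
        Data Width: 64 bits
        Locator: P1-DIMMA1

Handle 0x004A, DMI type 17, 34 bytes
Memory Device
        Total Width: 16 bits
        Data Width: 16 bits
        Locator: SPI ROM
"""


def parse_dmidecode(text):
    """Split dmidecode text into one dict of 'Key: value' fields per handle."""
    records = []
    current = None
    for line in text.splitlines():
        if line.startswith("Handle "):
            match = re.search(r"DMI type (\d+)", line)
            current = {"_type": int(match.group(1)) if match else None}
            records.append(current)
        elif current is not None and ":" in line:
            key, _, value = line.strip().partition(":")
            current[key.strip()] = value.strip()
    return records


def ecc_report(text):
    """Return (error_correction_type, [(locator, total, data, extra), ...])."""
    correction = None
    devices = []
    for record in parse_dmidecode(text):
        if record["_type"] == 16:
            correction = record.get("Error Correction Type")
        elif record["_type"] == 17:
            total = record.get("Total Width", "")
            data = record.get("Data Width", "")
            # Skip entries whose widths are reported as "Unknown".
            if total.endswith(" bits") and data.endswith(" bits"):
                t, d = int(total.split()[0]), int(data.split()[0])
                devices.append((record.get("Locator"), t, d, t - d))
    return correction, devices


if __name__ == "__main__":
    correction, devices = ecc_report(SAMPLE)
    print("Error Correction Type:", correction)
    for locator, total, data, extra in devices:
        print(f"{locator}: {total} total, {data} data, {extra} extra bits")
```

On a live system the text would come from running "dmidecode -t memory" as root. These values are read from the firmware's SMBIOS tables, so treat the result as a hint rather than proof.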

The last way is to initiate an ECC RAM error. You can only simulate a single-bit (and therefore correctable) error, which should be logged and alerted to you properly. Unfortunately I cannot find the command to do that for the life of me. I know I ran it on my box once and I got an IPMI event and an email. But I can't tell you what the command is. :(
 

malcolmputer

Explorer
Joined
Oct 28, 2013
Messages
55
cyberjock said:
The last way is to initiate an ECC RAM error. You can only simulate a single-bit (and therefore correctable) error, which should be logged and alerted to you properly. Unfortunately I cannot find the command to do that for the life of me. I know I ran it on my box once and I got an IPMI event and an email. But I can't tell you what the command is. :(

That's exactly what I am looking for. I have confirmed all of the other, less reliable methods listed earlier, but when my ECC RAM fails, I want to know how I am notified, and of course verify that single-bit errors are corrected without panicking the system.
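As a possible starting point for the monthly check, here is a small Python sketch that scans IPMI event log text for ECC events. It assumes pipe-delimited rows like those printed by a tool such as "ipmitool sel list", and that ECC events carry "ECC" in their sensor or description field. The sample rows are invented for illustration, not captured from a real board, and the function name is hypothetical.

```python
# Invented sample rows for illustration only; not captured from real hardware.
SAMPLE_SEL = """\
   1 | 01/05/2015 | 03:12:44 | Power Supply #0xc8 | Presence detected | Asserted
   2 | 01/07/2015 | 14:02:10 | Memory #0xff | Correctable ECC | Asserted
"""


def find_ecc_events(sel_text):
    """Return event rows (as lists of fields) that mention ECC."""
    hits = []
    for line in sel_text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 5:
            continue  # not a pipe-delimited event row
        # ASSUMPTION: fields are id, date, time, sensor, description, ...
        if any("ecc" in field.lower() for field in fields[3:]):
            hits.append(fields)
    return hits


if __name__ == "__main__":
    for event in find_ecc_events(SAMPLE_SEL):
        print(" | ".join(event))
```

This only finds events that were already logged; it does not inject an error, so it cannot confirm that a real fault would produce a log entry or an email.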
 

JoeVulture

Dabbler
Joined
Sep 8, 2013
Messages
22
cyberjock said:
There are plenty of ways to verify. One of the more popular is to use dmidecode -t memory and see if the RAM is listing itself as 72 bits (or 144 bits) wide. RAM should only be 64 or 128 bits wide, so the 8 extra bits are your ECC bits.

Another way with dmidecode is to run "dmidecode -t 16" and see if it says single-bit ECC.

Should I be concerned, then, if "dmidecode -t memory" shows 64 bits wide? I have 4x8GB ECC DIMMs, but I am only showing one:
Code:
Handle 0x0029, DMI type 17, 34 bytes
Memory Device
    Array Handle: 0x0027
    Error Information Handle: Not Provided
    Total Width: 64 bits
    Data Width: 64 bits
    Size: 8192 MB
    Form Factor: DIMM
    Set: None
    Locator: DIMM0
    Bank Locator: BANK 0
    Type: DDR3
    Type Detail: Synchronous Unbuffered (Unregistered)
    Speed: 1600 MHz
    Manufacturer: Samsung
    Serial Number: 20675673
    Asset Tag:  BANK 0 DIMM0 AssetTag
    Part Number: M391B1G73QH0-YK0
    Rank: 2
    Configured Clock Speed: 1600 MHz


However, running "dmidecode -t 16" does show that I am using Single-bit ECC:
Code:
Handle 0x0027, DMI type 16, 23 bytes
Physical Memory Array
    Location: System Board Or Motherboard
    Use: System Memory
    Error Correction Type: Single-bit ECC
    Maximum Capacity: 64 GB
    Error Information Handle: Not Provided
    Number Of Devices: 4


When running the dmidecode program, I get a message indicating that there is only partial support for SMBIOS 2.8; maybe that is the cause.

Thanks,
Joe
 

malcolmputer

Explorer
Joined
Oct 28, 2013
Messages
55
JoeVulture said:
Should I be concerned then if "dmidecode -t memory" shows 64-bits wide? I have 4x8GB ECC DIMMs, but I am only showing one:

What hardware are you running this on? Mine shows more bits for Total than Data (which makes sense because you have to have extra bits to calculate ECC).
 

JoeVulture

Dabbler
Joined
Sep 8, 2013
Messages
22
Hi malcolmputer,

This is on a FreeNAS Mini from iXsystems:
ASRock Rack C2750D4I (Avoton C2750 CPU/chipset)
4x 8GB Samsung DIMMs (two installed at the factory, two installed by me, ordered from iXsystems via Amazon - the official RAM upgrade).

-- Joe
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
cyberjock said:
There are plenty of ways to verify. One of the more popular is to use dmidecode -t memory and see if the RAM is listing itself as 72 bits (or 144 bits) wide. RAM should only be 64 or 128 bits wide, so the 8 extra bits are your ECC bits.

How do we know this isn't just what the DIMM is saying, regardless of there being an electrical/logical connection on those last bits?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Ericloewe said:
How do we know this isn't just what the DIMM is saying, regardless of there being an electrical/logical connection on those last bits?

Unless I'm mistaken, that command doesn't query the DIMMs. It's querying info from the chipset and interpreting that in language that we understand. Sure, the RAM provides inputs into those values (and obviously no RAM means no input too), but I don't believe that number is arrived at because of the DIMMs.

On my FreeNAS Mini I just ran the same commands you did and got the same results. We even have the same model of RAM. I can tell you that my RAM is definitely ECC RAM as I installed it myself. So I think in this case the most likely cause for the confusion is that dmidecode isn't interpreting the data from the Avoton chipset properly (or the Avoton chipset isn't designed to report this information).

In any case, I think it's a safe bet that you are using ECC RAM with ECC unless you've disabled ECC in the BIOS. ;)
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
There are tons of stories that revolve around ECC that are "like" this thread. The next-gen Intel server chipset (I forget the name) will have solid ECC testing (even supporting simulated memory failures, both single-bit and multi-bit). Personally, I feel like that kind of functionality should have existed a decade ago. :P
 