Memory Errors

Joined
Dec 31, 2012
Messages
8
Hi there,
I just built a FreeNAS server, and it was running flawlessly for about a week. Then this morning I noticed the following errors on the console. What was interesting to me was that the server didn't kernel panic. Is this indicative of memory failure/errors that will cause issues in the future? I'd rather not take the machine offline to do memory testing if someone has seen this before, but I don't want it to crash later on either, so I just wanted to post it here and check before taking more action.

Hardware specs:
SuperMicro server board with 32GB of Kingston RAM, and top is showing the following:
Code:
Mem: 6522M Active, 6323M Inact, 17G Wired, 1036M Cache, 206M Buf, 299M Free

The output of some config info:
Code:
[root@storage] ~# uname -a
FreeBSD storage.ccwis.com 8.3-RELEASE-p4 FreeBSD 8.3-RELEASE-p4 #0 r241984M: Wed Oct 24 00:57:10 PDT 2012     root@build.ixsystems.com:/usr/home/jpaetzel/8.3.0-RELEASE/os-base/amd64/usr/home/jpaetzel/8.3.0-RELEASE/FreeBSD/src/sys/FREENAS.amd64  amd64

[root@storage] ~# sysctl -a | egrep -i 'hw.machine|hw.model|hw.ncpu'
hw.machine: amd64
hw.model: Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz
hw.ncpu: 4


If it matters, I have the following zPool configuration of 3TB WD SATA drives:
Code:
	NAME        STATE     READ WRITE CKSUM
	CCWIS       ONLINE       0     0     0
	  raidz2-0  ONLINE       0     0     0
	    da0     ONLINE       0     0     0
	    da1     ONLINE       0     0     0
	    da2     ONLINE       0     0     0
	    da3     ONLINE       0     0     0
	    da4     ONLINE       0     0     0
	    da5     ONLINE       0     0     0
	    da6     ONLINE       0     0     0
	    da7     ONLINE       0     0     0
	  raidz2-1  ONLINE       0     0     0
	    da8     ONLINE       0     0     0
	    da9     ONLINE       0     0     0
	    da10    ONLINE       0     0     0
	    da11    ONLINE       0     0     0
	    da12    ONLINE       0     0     0
	    da13    ONLINE       0     0     0
	    da14    ONLINE       0     0     0
	    da15    ONLINE       0     0     0


Code:
Dec 29 01:13:16 storage kernel: MCA: Bank 5, Status 0xcc0000c000010090
Dec 29 01:13:16 storage kernel: MCA: Global Cap 0x0000000001000c10, Status 0x0000000000000000
Dec 29 01:13:16 storage kernel: MCA: Vendor "GenuineIntel", ID 0x206d7, APIC ID 0
Dec 29 01:13:16 storage kernel: MCA: CPU 0 COR (3) OVER RD channel 0 memory error
Dec 29 01:13:16 storage kernel: MCA: Address 0x1d8f30f00
Dec 29 01:13:16 storage kernel: MCA: Misc 0x20403ebe86
Dec 29 01:13:33 storage kernel: MCA: Bank 5, Status 0x8c00004000010090
Dec 29 01:13:33 storage kernel: MCA: Global Cap 0x0000000001000c10, Status 0x0000000000000000
Dec 29 01:13:33 storage kernel: MCA: Vendor "GenuineIntel", ID 0x206d7, APIC ID 0
Dec 29 01:13:33 storage kernel: MCA: CPU 0 COR (1) RD channel 0 memory error
Dec 29 01:13:33 storage kernel: MCA: Address 0x1d5730700
Dec 29 01:13:33 storage kernel: MCA: Misc 0x21401e9e86
Dec 29 01:13:33 storage kernel: MCA: Bank 5, Status 0x8c00004000010090
Dec 29 01:13:33 storage kernel: MCA: Global Cap 0x0000000001000c10, Status 0x0000000000000000
Dec 29 01:13:33 storage kernel: MCA: Vendor "GenuineIntel", ID 0x206d7, APIC ID 0
Dec 29 01:13:33 storage kernel: MCA: CPU 0 COR (1) RD channel 0 memory error
Dec 29 01:13:33 storage kernel: MCA: Address 0x1d5730f00
Dec 29 01:13:33 storage kernel: MCA: Misc 0x21404e4e86
Dec 29 01:16:08 storage kernel: MCA: Bank 5, Status 0x8c00004000010090
Dec 29 01:16:08 storage kernel: MCA: Global Cap 0x0000000001000c10, Status 0x0000000000000000
Dec 29 01:16:08 storage kernel: MCA: Vendor "GenuineIntel", ID 0x206d7, APIC ID 0
Dec 29 01:16:08 storage kernel: MCA: CPU 0 COR (1) RD channel 0 memory error
Dec 29 01:16:08 storage kernel: MCA: Address 0x1d5730f00
Dec 29 01:16:08 storage kernel: MCA: Misc 0x21405ede86
Dec 29 01:16:08 storage kernel: MCA: Bank 5, Status 0x8c00004000010090
Dec 29 01:16:08 storage kernel: MCA: Global Cap 0x0000000001000c10, Status 0x0000000000000000
Dec 29 01:16:08 storage kernel: MCA: Vendor "GenuineIntel", ID 0x206d7, APIC ID 0
Dec 29 01:16:08 storage kernel: MCA: CPU 0 COR (1) RD channel 0 memory error
Dec 29 01:16:08 storage kernel: MCA: Address 0x1d5730f00
Dec 29 01:16:08 storage kernel: MCA: Misc 0x21404e4e86
Dec 29 01:32:00 storage kernel: MCA: Bank 5, Status 0xcc00218000010090
Dec 29 01:32:00 storage kernel: MCA: Global Cap 0x0000000001000c10, Status 0x0000000000000000
Dec 29 01:32:00 storage kernel: MCA: Vendor "GenuineIntel", ID 0x206d7, APIC ID 0
Dec 29 01:32:00 storage kernel: MCA: CPU 0 COR (134) OVER RD channel 0 memory error
Dec 29 01:32:00 storage kernel: MCA: Address 0x1d5731f00
Dec 29 01:32:00 storage kernel: MCA: Misc 0x21406e6e86
Dec 29 01:32:00 storage kernel: MCA: Bank 8, Status 0xcc00008f000800c0
Dec 29 01:32:00 storage kernel: MCA: Global Cap 0x0000000001000c10, Status 0x0000000000000000
Dec 29 01:32:00 storage kernel: MCA: Vendor "GenuineIntel", ID 0x206d7, APIC ID 0
Dec 29 01:32:00 storage kernel: MCA: CPU 0 COR (2) OVER MS channel 0 memory error
Dec 29 01:32:00 storage kernel: MCA: Address 0x55730100
Dec 29 01:32:00 storage kernel: MCA: Misc 0x900080448841e8c
Dec 29 01:32:00 storage kernel: MCA: Bank 8, Status 0xcc00134f000800c0
Dec 29 01:32:00 storage kernel: MCA: Global Cap 0x0000000001000c10, Status 0x0000000000000000
Dec 29 01:32:00 storage kernel: MCA: Vendor "GenuineIntel", ID 0x206d7, APIC ID 0
Dec 29 01:32:00 storage kernel: MCA: CPU 0 COR (77) OVER MS channel 0 memory error
Dec 29 01:32:00 storage kernel: MCA: Address 0x55737f00
Dec 29 01:32:00 storage kernel: MCA: Misc 0x9000048c0409e8c
Dec 29 01:36:45 storage kernel: MCA: Bank 8, Status 0xcc00434f000800c0
Dec 29 01:36:45 storage kernel: MCA: Global Cap 0x0000000001000c10, Status 0x0000000000000000
Dec 29 01:36:45 storage kernel: MCA: Vendor "GenuineIntel", ID 0x206d7, APIC ID 0
Dec 29 01:36:45 storage kernel: MCA: CPU 0 COR (269) OVER MS channel 0 memory error
Dec 29 01:36:45 storage kernel: MCA: Address 0x1d8f37f00
Dec 29 01:36:45 storage kernel: MCA: Misc 0x900002001109e8c
Dec 29 02:38:04 storage kernel: MCA: Bank 5, Status 0xcc024f0000010090
Dec 29 02:38:04 storage kernel: MCA: Global Cap 0x0000000001000c10, Status 0x0000000000000000
Dec 29 02:38:04 storage kernel: MCA: Vendor "GenuineIntel", ID 0x206d7, APIC ID 0
Dec 29 02:38:04 storage kernel: MCA: CPU 0 COR (2364) OVER RD channel 0 memory error
Dec 29 02:38:04 storage kernel: MCA: Address 0x1d1fd6500
Dec 29 02:38:04 storage kernel: MCA: Misc 0x1404e4e86
Dec 29 02:38:04 storage kernel: MCA: Bank 8, Status 0xcc00440f000800c0
Dec 29 02:38:04 storage kernel: MCA: Global Cap 0x0000000001000c10, Status 0x0000000000000000
Dec 29 02:38:04 storage kernel: MCA: Vendor "GenuineIntel", ID 0x206d7, APIC ID 0
Dec 29 02:38:04 storage kernel: MCA: CPU 0 COR (272) OVER MS channel 0 memory error
Dec 29 02:38:04 storage kernel: MCA: Address 0x5d8f37f00
Dec 29 02:38:04 storage kernel: MCA: Misc 0x900244008c09e8c
Dec 29 03:38:35 storage kernel: MCA: Bank 5, Status 0xcc00db4000010090
Dec 29 03:38:35 storage kernel: MCA: Global Cap 0x0000000001000c10, Status 0x0000000000000000
Dec 29 03:38:35 storage kernel: MCA: Vendor "GenuineIntel", ID 0x206d7, APIC ID 0
Dec 29 03:38:35 storage kernel: MCA: CPU 0 COR (877) OVER RD channel 0 memory error
Dec 29 03:38:35 storage kernel: MCA: Address 0x1d1fd6500
Dec 29 03:38:35 storage kernel: MCA: Misc 0x21404e4e86
Dec 29 03:38:35 storage kernel: MCA: Bank 8, Status 0xcc00f3cf000800c0
Dec 29 03:38:35 storage kernel: MCA: Global Cap 0x0000000001000c10, Status 0x0000000000000000
Dec 29 03:38:35 storage kernel: MCA: Vendor "GenuineIntel", ID 0x206d7, APIC ID 0
Dec 29 03:38:35 storage kernel: MCA: CPU 0 COR (975) OVER MS channel 0 memory error
Dec 29 03:38:35 storage kernel: MCA: Address 0x5d8fd7f00
Dec 29 03:38:35 storage kernel: MCA: Misc 0x9000c01004c1e8c
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I would make a memtest CD and give that a whirl. Run it for at least 3 full passes.

If it tells you that you have bad RAM then you'll have to start testing 1 or 2 sticks at a time to determine which one is bad.
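
The elimination step can be thought of as a halving search. Below is a minimal sketch, assuming exactly one faulty module and that the board boots with only part of the RAM installed (check the board manual for slot population rules). The `passes` callback and the slot names are hypothetical; in practice `passes` stands for installing those sticks and running memtest by hand.

```python
def find_bad_module(modules, passes):
    """Narrow down a single faulty module by testing half of the candidates at a time.

    modules: list of module labels (hypothetical slot names).
    passes:  callable taking a list of installed modules and returning True
             if memtest ran clean with only those modules installed.
    Assumes exactly one module is faulty.
    """
    candidates = list(modules)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if passes(half):
            # The tested half is clean, so the fault is in the other half.
            candidates = candidates[len(half):]
        else:
            # The tested half failed, so the fault is in it.
            candidates = half
    return candidates[0]


if __name__ == "__main__":
    slots = ["A1", "A2", "B1", "B2"]
    # Stand-in for a real memtest run: pretend B1 is the bad stick.
    print(find_bad_module(slots, lambda installed: "B1" not in installed))
```

With four sticks this takes two test runs instead of up to four, which matters when a single pass takes hours.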

If it doesn't tell you that you have bad RAM I don't know what to say. I've never seen that error before so I don't know what advice to give. It could be that your hardware isn't compatible with FreeBSD????
 

BillyBob2

Dabbler
Joined
Feb 23, 2013
Messages
19
I am experiencing the exact same issue.

Anyone have any more info on these messages?

I have not tried memtest as of yet, but will try it now.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
That's a machine check architecture message. It's telling you which module you need to replace. As with so many things, the fun is that the location it provides doesn't consistently decode into a way to identify which slot that is on your board. You're best off running memtest and playing the elimination game. Also it looks like maybe there's a problem with two banks.
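
To check how many banks are reporting, one could tally the MCA records from a saved log. This is a rough sketch: the function name `tally_by_bank` is illustrative, the sample lines are copied from the first post, and the corrected count is read from bits 52:38 of the status value (this reproduces the numbers the kernel prints in parentheses for these samples). Summing the counts is only a rough indicator of how busy each bank is.

```python
import re
from collections import defaultdict

# Matches the "MCA: Bank N, Status 0x..." line that starts each record.
BANK_RE = re.compile(r"MCA: Bank (\d+), Status (0x[0-9a-fA-F]+)")


def tally_by_bank(lines):
    """Return {bank: {"records": n, "corrected": total}} for the given log lines."""
    totals = defaultdict(lambda: {"records": 0, "corrected": 0})
    for line in lines:
        match = BANK_RE.search(line)
        if not match:
            continue  # Skip the Address/Misc/Vendor lines and unrelated messages.
        bank = int(match.group(1))
        status = int(match.group(2), 16)
        totals[bank]["records"] += 1
        # Corrected error count field: bits 52:38 of the status register.
        totals[bank]["corrected"] += (status >> 38) & 0x7FFF
    return dict(totals)


if __name__ == "__main__":
    sample = [
        "Dec 29 01:13:16 storage kernel: MCA: Bank 5, Status 0xcc0000c000010090",
        "Dec 29 01:13:16 storage kernel: MCA: Address 0x1d8f30f00",
        "Dec 29 01:32:00 storage kernel: MCA: Bank 8, Status 0xcc00008f000800c0",
    ]
    for bank, info in sorted(tally_by_bank(sample).items()):
        print(bank, info)
```

Feeding it the contents of /var/log/messages (or the security mail) gives one line per bank.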
 

MindBender

Explorer
Joined
Oct 12, 2015
Messages
67
I just received similar messages in my daily security output mail:
Code:
MCA: Bank 7, Status 0x8c00004000010090
MCA: Global Cap 0x0000000001000c16, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x50662, APIC ID 0
MCA: CPU 0 COR (1) RD channel 0 memory error
MCA: Address 0x187b470040
MCA: Misc 0x40028286
arp: 172.17.12.10 moved from 00:23:df:de:69:b1 to 00:24:36:a0:cd:34 on igb0
MCA: Bank 7, Status 0x8c00004000010090
MCA: Global Cap 0x0000000001000c16, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x50662, APIC ID 0
MCA: CPU 0 COR (1) RD channel 0 memory error
MCA: Address 0x17fcb50000
MCA: Misc 0x1403ebe86
MCA: Bank 9, Status 0x8c000046000800c0
MCA: Global Cap 0x0000000001000c16, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x50662, APIC ID 0
MCA: CPU 0 COR (1) MS channel 0 memory error
MCA: Address 0x1788700080
MCA: Misc 0x900020202020c8c
MCA: Bank 9, Status 0xcc000186000800c0
MCA: Global Cap 0x0000000001000c16, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x50662, APIC ID 0
MCA: CPU 0 COR (6) OVER MS channel 0 memory error
MCA: Address 0x1788760080
MCA: Misc 0x900022020200c8c
MCA: Bank 9, Status 0xcc000186000800c0
MCA: Global Cap 0x0000000001000c16, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x50662, APIC ID 0
MCA: CPU 0 COR (6) OVER MS channel 0 memory error
MCA: Address 0x17887c0080
MCA: Misc 0x900002220220c8c
MCA: Bank 9, Status 0xcc05ef86000800c0
MCA: Global Cap 0x0000000001000c16, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x50662, APIC ID 0
MCA: CPU 0 COR (6078) OVER MS channel 0 memory error
MCA: Address 0x184e9f0000
MCA: Misc 0x900022020200c8c
MCA: Bank 9, Status 0xcc018686000800c0
MCA: Global Cap 0x0000000001000c16, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x50662, APIC ID 0
MCA: CPU 0 COR (1562) OVER MS channel 0 memory error
MCA: Address 0x187f8f0000
MCA: Misc 0x900002000220c8c

I too am running a SuperMicro mainboard (X10SDV-TLN4F) with 128GiB of SuperMicro-tested Samsung DDR4 RAM.

I will start testing the memory right now; Here goes my Saturday. The question I have for now: Should I be worried about my data?
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Any events in your motherboard log? Data should be fine. Sometimes you get memory errors, especially with that much RAM.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Yup, run that Memtest86 for at least 3 full passes. If you get a failure then I'd reseat the RAM and try again, or you could reseat the RAM first and then test, this would save you some time.
 

MindBender

Explorer
Joined
Oct 12, 2015
Messages
67
Any events in your motherboard log? Data should be fine. Sometimes you get memory errors, especially with that much RAM.
Unfortunately there's nothing in the BIOS log. Memory error logging is switched on, but it is configured with a threshold of 10.

Yup, run that Memtest86 for at least 3 full passes. If you get a failure then I'd reseat the RAM and try again, or you could reseat the RAM first and then test, this would save you some time.
I'm running Memtest86+ now. Unfortunately it takes a whopping 9 hours to completely test 128GiB. I'm still wondering why it's detecting only 1 core.
In a couple of days I will know more. The first test passed though, so I fear all the others will pass too, leaving just uncertainty. And memory prices are currently just a bit too high to just replace everything...
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
It should detect all the cores; however, by default it may only run on one core at a time. I change mine to use all cores, and typically that speeds up the RAM testing considerably. To use all cores (I'm trying to remember), you would open the options menu and select the option to use all cores. Sorry I'm not very clear on how to do it; I'm presently at work and cannot look that part up.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Memtest86+ is known to crash in SMP mode, hence why it runs in single core mode.
 

MindBender

Explorer
Joined
Oct 12, 2015
Messages
67
Well, one mystery's solved, another one appeared:
I was running MemTest86+ v5.01, which disables SMP by default and only has a very narrow window of opportunity right after start to enable SMP. I read that hitting F2 in the first 3 seconds should be used to enable SMP, but my version stated that would enable safe mode, so I abandoned MemTest86+. The added mystery was the first pass taking 9 hours, but the second pass not being finished after 22 hours of running.

Now I'm running MemTest86 (note the missing +) v7.4. In basic mode, because I don't like the new-fashioned GUI thingy. It reports being v4.3.7, to complete the confusion, but it is firing on all 8 cores.

The documentation warns about false positives in SMP mode, but I'll take my chances. In case of an error, I can always re-run in single core mode. For now I want to stress this system as much as I can.

Oh, and the errors in the Kernel log I reported above seem to indicate correctable read errors (http://fxr.watson.org/fxr/source/x86/x86/mca.c#L280). So probably no damage to the data occurred, assuming that no more bits than the maximum number of detectable errors got corrupted.
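
For reference, a minimal sketch of how the Status value in those log lines breaks down, following the bit layout of the machine-check status register as I understand it from the linked mca.c and Intel's documentation. The function name `decode_status` and the dictionary keys are illustrative, not part of any FreeBSD tool.

```python
# Flag bits in the upper part of the 64-bit MCi_STATUS value.
VAL = 1 << 63    # record is valid
OVER = 1 << 62   # an earlier error was overwritten (printed as "OVER")
UC = 1 << 61     # uncorrected error; when clear the kernel prints "COR"
ADDRV = 1 << 58  # the Address line is valid
MISCV = 1 << 59  # the Misc line is valid

# Memory-controller operation codes (the MMM field).
MEM_OPS = {0: "GEN", 1: "RD", 2: "WR", 3: "AC", 4: "MS"}


def decode_status(status):
    """Decode the fields the kernel prints for a memory-controller MCA record."""
    code = status & 0xFFFF  # low 16 bits: MCA error code
    # Memory-controller errors have the form 000F 0000 1MMM CCCC.
    is_memory = (code & 0xEF80) == 0x0080
    result = {
        "valid": bool(status & VAL),
        "overflow": bool(status & OVER),
        "uncorrected": bool(status & UC),
        "corrected_count": (status >> 38) & 0x7FFF,  # bits 52:38
        "memory_error": is_memory,
    }
    if is_memory:
        result["operation"] = MEM_OPS.get((code >> 4) & 0x7, "???")
        channel = code & 0xF
        result["channel"] = None if channel == 0xF else channel
    return result


if __name__ == "__main__":
    # First record from the original post: "COR (3) OVER RD channel 0 memory error".
    print(decode_status(0xCC0000C000010090))
```

The key point for data safety is the UC bit: it is clear in every record posted in this thread, which is why the kernel labels them COR. RD is a read and MS is a memory scrub.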
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
ECC is doing what it’s supposed to. You now need to resolve the issue.

It could just be dust in a socket, so start by reseating modules.

Then try to eliminate which module or slot is causing the problem.
 

MindBender

Explorer
Joined
Oct 12, 2015
Messages
67
The BIOS version has not been updated. Only the UEFI version is developed.
Aha; that's something worth knowing. Today I tried to start the UEFI version, but unfortunately it didn't play nice. I got a black screen for 20 or so seconds and after that the system rebooted. Perhaps it's got something to do with IPMI limitations. It might have been waiting for me to make a selection, in an unsupported video mode. I'll try again when I've found a monitor. Now I'm running the BIOS version again.

ECC is doing what it’s supposed to. You now need to resolve the issue.

It could just be dust in a socket, so start by reseating modules.

Then try to eliminate which module or slot is causing the problem.
It is reassuring to see ECC doing its job.

However, I'm not going to reseat anything yet. First I'd like to run more and longer tests, to quantify the problem. After that I will reseat modules, blow out dust, do some voodoo engineering and then run tests again. That will at least give me an indication of whether my tinkering has changed anything.

No luck so far, though: The test has been running for days now, without any problem so far...
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
The test is not failing because ECC corrects the errors.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Sure they're corrected, but isn't MemTest86 capable of hooking ECC events of the DDR controller and reporting them like the FreeBSD Kernel does in its log?
They sure claim it, but I see little evidence of it.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Sure they're corrected, but isn't MemTest86 capable of hooking ECC events of the DDR controller and reporting them like the FreeBSD Kernel does in its log?

The BIOS version certainly is not.
 

MindBender

Explorer
Joined
Oct 12, 2015
Messages
67
The BIOS version certainly is not.
Now that's something worth knowing.

Unfortunately my headless system doesn't seem to boot from the stick in UEFI mode. It tries, but it resets after half a minute. I can still select UEFI mode from the BIOS boot menu, but after that the IPMI screen turns black, followed by a reset half a minute later. Perhaps IPMI doesn't play nice with the GUI.

Anyway, I needed to fill out my taxes two weeks ago, so I had to boot the system to access my data. I kinda neglected to shut it down afterwards, but haven't had any MCA errors reported since. This gives me a bit of an eerie feeling: Perhaps it was a glitch. The problem doesn't seem to reproduce well for sure.
 

Radu

Dabbler
Joined
Mar 7, 2014
Messages
45
Hello,

I am encountering the same MCA Memory error.

Code:
Jun 11 22:12:25 fns MCA: Bank 7, Status 0xcc1005c000010091
Jun 11 22:12:25 fns MCA: Global Cap 0x0000000001000c1d, Status 0x0000000000000000
Jun 11 22:12:25 fns MCA: Vendor "GenuineIntel", ID 0x306e4, APIC ID 0
Jun 11 22:12:25 fns MCA: CPU 0 COR (16407) OVER RD channel 1 memory error
Jun 11 22:12:25 fns MCA: Address 0xf3bb22540
Jun 11 22:12:25 fns MCA: Misc 0x1420aca86

What is the correct memtest version/variant that I should use?
Did anyone resolve this by reseating the memory?
The error appeared after some consecutive resilverings done to replace and grow a 10-HDD vdev, from 2TB to 8TB.
Did anyone see an ECC RAM error in a memtest run (which variant/version), as described in the posts above?
 