CPU Channel 1 memory error

Status
Not open for further replies.

AltecBX

Patron
Joined
Nov 3, 2014
Messages
285
The machine has been running for a few days now. I've been transferring my media in the meantime, and I noticed this on the screen. What is this about? Should I be worried?

20ubyba.jpg
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I'd run memtest86+ and see what it says. Not sure what it'll show with ECC, though...
 

AltecBX

Patron
Joined
Nov 3, 2014
Messages
285
I ran Memtest86 v5.1 two weeks ago, before building this server. CPU and memory came out fine with no errors.
 

Starpulkka

Contributor
Joined
Apr 9, 2013
Messages
179
Oh boy, DDR4 ECC memory, this is going to be interesting.
What brand are the memory chips, who assembled the modules if it's not the same brand, and which ECC version? Did you try the pro memtest and the all-cores memory test?
Edit: it's registered memory, so there should be a memory error log in IPMI or in the BIOS. So why even test, when that clearly tells you which memory has errors...
 

AltecBX

Patron
Joined
Nov 3, 2014
Messages
285
I have (4) Crucial 16GB DDR4 (PC4-2133) CL15 DR x4 ECC Registered DIMM 288-Pin.
I tried the free version and it ran for almost 24 hrs. Yes it did run all cores.
What will the pro version check for that the free version won't?
Can you guide me through checking it in IPMI or the BIOS?
 

Starpulkka

Contributor
Joined
Apr 9, 2013
Messages
179
The pro version has new 64-bit tests. It won't help you anyway because of the old-school BIOS from the '70s, and you can disable the cache in the BIOS anyway, so there's no need for a fancy new program to do it.

You can get into the BIOS by restarting the machine and pressing the setup key. From the Supermicro manual: (Note: In most cases, the <Delete> key is used to invoke the AMI BIOS setup screen. There are a few cases when other keys are used, such as <F1>, <F2>, etc.) So you should be mashing the Delete key. Once you're in the BIOS, go to Event Logs, then Change SMBIOS Event Log Settings, check whether SMBIOS Event Log is enabled, and enable Runtime Error Logging Support as well. There's also a memory device tagging option, but I'm not sure whether it's for the BIOS or for SNMP memory trapping; I'm sure someone here knows.

Also, the memory scrub (Patrol scrub interval) is 24h by default, so changing it to 2h might speed up the error testing. Remember to change it back to 24h after you've fixed the error. (That's if the error is even in the memory; if it's in the processor or in the ECC implementation, it's a harder case to check.)

If I were you, I'd first mark the "bad" memory, then swap the memory positions and see whether the error follows the stick by copying a TB of data and then checking the ECC logs. If it follows, RMA the memory. If it doesn't, I'd give Supermicro a call. (Probably the first thing they'll say is, "Have you tried swapping it for a new one?")
 

AltecBX

Patron
Joined
Nov 3, 2014
Messages
285
So I haven't gotten around to running the pro version yet. I'm waiting until next weekend to bring the server down for a day.
But what I did do was swap the 4 sticks of memory around: A1 with A2 and B1 with B2. So far I haven't had any issues since. One thing I did notice: when the server came back on, the hard drive lights on the front bays blinked constantly for about 10-15 minutes. I noticed an alert saying it was resilvering.

Why would it do that after a memory swap?
Why wouldn't I get any more memory errors using the same memory?
 

AltecBX

Patron
Joined
Nov 3, 2014
Messages
285
I just got this message again. What to do?

mem.jpg
 

Starpulkka

Contributor
Joined
Apr 9, 2013
Messages
179
So it's channel 1 again, even though you swapped the memory sticks. So what we now know is that it's not a single memory stick error; it's something bigger. I thought you said it's not a production machine, as you just have your media on it. If it's in production, backups should have been secured and tested for months already. Also, the fact that it's resilvering is really, really, really bad.
What did the ECC memory logs say?
Which exact memory model did you put in that board, and is it on Supermicro's tested list?

So what's left is: 1. a faulty Supermicro board, 2. ECC memory that's incompatible with that board, 3. a faulty CPU memory controller, 4. something else I couldn't think of. I wouldn't be amazed if the Intel stuff were broken.

Edit: oh wait, you just moved the A sticks between A slots and the B sticks between B slots, so technically you didn't change anything, lol. OK, I just realized I got trolled, haha. Good luck with the testing =)
 

AltecBX

Patron
Joined
Nov 3, 2014
Messages
285
I'm actually doing backups of all my data every week. I'm using Karen's Replicator to do the job. Any app recommendations for backing up?

The resilvering happened once. I don't know if that was due to shuffling the memory sticks. I read that resilvering takes days and days when you have multiple terabytes. Well, I have 12x 6TB drives and it only took about 10-15 minutes that time.

The memory model is Crucial 64GB Kit (16GBx4) DDR4 2133 MT/s (PC4-2133) CL15 DR x4 ECC Registered DIMM 288-Pin CT4K16G4RFD4213

Here's another screenshot of today's error:
2d2f0xk.jpg
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
How much data is actually on there? Scrubs/resilvers are much faster on near-empty pools.
 

AltecBX

Patron
Joined
Nov 3, 2014
Messages
285
I have about 18 TB on the server now. I had about 16.5 TB during the resilvering.
 

mjws00

Guru
Joined
Jul 25, 2014
Messages
798
It only has to touch the changes, not every bit. I keep watching your memory error, wondering whether we have a bug or an actual bad DIMM. Can you locate and pull the bad one somehow and see if the problem disappears? I'm not sure exactly how to interpret channel 1, bank 7, but there has to be a way to figure it out. Not sure if you checked the IPMI logs, but ECC errors should be logged. Look into mcelog as well.

I'd be tempted to install a single DIMM at a time and run it to see if the error pops up. The current server load should be fine at 16GB; if it isn't, run two sticks. It will also give you a feel for memory requirements and differences.

Wish I could say I've seen this one, but I haven't. Normally bad RAM for me has meant an unstable box, not a polite error.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
You are getting them far too frequently to be random chance. You've got something bad in your machine. Most likely a DIMM but it could be something else like motherboard, bad power from the PSU, etc.

The idea of running a single DIMM at a time is exactly what I'd do.

ECC RAM with a single-bit error should throw polite errors exactly like the one you are seeing. Multi-bit errors should halt the system though.
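
To illustrate the difference between a corrected single-bit error and a detected multi-bit error, here is a toy sketch of a single-error-correcting, double-error-detecting (SECDED) code. It uses an extended Hamming(8,4) code on 4 data bits purely for illustration. That it resembles what this board's memory controller does is an assumption; real ECC DIMMs protect 64-bit words with 8 check bits in hardware. The function names are made up for this example.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0: overall parity, 1..7: Hamming(7,4)
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # parity over positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]   # parity over positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]   # parity over positions 4,5,6,7
    code[0] = sum(code[1:]) % 2             # make the whole word even parity
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of the positions of all set bits
    parity_odd = sum(code) % 2 == 1
    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        # Odd number of flips: treat as a single-bit error at `syndrome`
        # (0 means the overall parity bit itself) and flip it back.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


word = encode(0b1011)
word[5] ^= 1                            # one flipped bit: the "polite" case
print(decode(word))
word[6] ^= 1                            # a second flipped bit: detected only
print(decode(word))
```

The first call recovers the original data and reports a correction, which is the kind of event the console message reflects. The second call can only report that the word is bad, which is why a system would halt rather than continue with unknown data.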
 

AltecBX

Patron
Joined
Nov 3, 2014
Messages
285
I've checked all the data on my backups and it all seems to be OK. I will pull the DIMMs this week, as I have the chassis in the rack already. Is there a way for me to determine which memory stick is Bank 7?

So does the memory rule apply as follows: 1GB of memory per 1TB of occupied space? Or is it 1GB per 1TB of total space (empty or occupied)?
If it's occupied space, then I can run a single 16GB DIMM at a time without really affecting my performance.

I can't find a place to check the log. I have a browser tab on my desktop open to my IPMI's IP, which I usually use to monitor the temperatures of all components.

Wouldn't a bad power supply start showing other kinds of symptoms? I've never seen a bad power supply throw this kind of error. In any case, for diagnosing this issue, I have dual power supplies in this chassis in case one fails.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
It's better to go with total space, but for testing you'd be fine with just 16GB of RAM.
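
The two readings of the rule of thumb can be compared with a quick calculation. This is a minimal sketch, assuming the 1 GB per 1 TB ratio from the question, the 12 x 6 TB drives, and the roughly 18 TB of data mentioned earlier in the thread. Raw drive capacity stands in for "total space" because the pool layout isn't given, and the function name is made up for illustration.

```python
def recommended_ram_gb(capacity_tb, gb_per_tb=1.0):
    """Rule-of-thumb RAM estimate: gb_per_tb GB of RAM for each TB of capacity."""
    return capacity_tb * gb_per_tb


raw_total_tb = 12 * 6    # twelve 6 TB drives, raw capacity before redundancy
occupied_tb = 18         # data reported on the pool
single_dimm_gb = 16      # one stick during single-DIMM testing

by_total = recommended_ram_gb(raw_total_tb)
by_occupied = recommended_ram_gb(occupied_tb)

print(f"estimate by total space:    {by_total:.0f} GB")
print(f"estimate by occupied space: {by_occupied:.0f} GB")
print(f"single DIMM meets occupied-space estimate: {single_dimm_gb >= by_occupied}")
```

Under these assumptions a single 16 GB DIMM falls a little short of either estimate, which is tolerable for a short diagnostic run but not as a permanent configuration.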

As for how to determine the memory stick for bank 7, not sure. You'd have to contact Supermicro.

You shouldn't see any corruption (and unless you were going to open every file and check every bit against an original that cannot possibly be corrupted, you'd never be any wiser). Right now, your ECC RAM is saving you from corruption. If you hadn't used server-grade stuff and ECC RAM you'd have a pool that is damaged to some extent. So be thrilled you spent the money for proper hardware.

PSUs can do all sorts of things. A bad one might or might not manifest itself with other symptoms. Harmonic frequencies and such can have very subtle effects on certain components. In short, the RAM being bad is far more likely than the PSU, but you'll have to use process of elimination to determine the true culprit.
 

AltecBX

Patron
Joined
Nov 3, 2014
Messages
285
OK cool. I'll start with 1 DIMM in A1. I'll run 1 stick for 24 hrs at a time.
Should I test each stick in one slot, or do I have to test every stick in every slot?
{1 DIMM in 1 slot, tested for 24 hrs each, will take me 4 days}
{1 DIMM tested in all 8 slots will take me 8 days x 4 DIMMs = 32 days}
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
You're going to have to account for every variable, so 1 stick and 1 slot as well as two slots and such. It's possible that a memory stick is bad, just as it's possible that a memory slot is bad. This is where you get to test your ability to do process of elimination. ;)
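
One way to keep track of the process of elimination is to record each run and check whether the errors follow a stick or a slot. The sketch below is only an illustration: the DIMM labels (D1 to D4) and the recorded results are hypothetical, not data from this machine, and the function name is made up.

```python
from collections import defaultdict


def suspects(results):
    """results: iterable of (dimm, slot, had_error) tuples, one per test run.

    Returns (dimms, slots) that showed errors in every run they took part in.
    """
    by_dimm = defaultdict(list)
    by_slot = defaultdict(list)
    for dimm, slot, had_error in results:
        by_dimm[dimm].append(had_error)
        by_slot[slot].append(had_error)
    bad_dimms = sorted(d for d, errs in by_dimm.items() if all(errs))
    bad_slots = sorted(s for s, errs in by_slot.items() if all(errs))
    return bad_dimms, bad_slots


# Hypothetical log: every stick tested in A1, then two sticks retested in B1.
runs = [
    ("D1", "A1", False),
    ("D2", "A1", False),
    ("D3", "A1", True),
    ("D4", "A1", False),
    ("D3", "B1", True),
    ("D1", "B1", False),
]
print(suspects(runs))
```

With this made-up log the errors follow D3 across two slots while both slots also ran clean with another stick, so the stick is the suspect. A component that appears in only one run cannot be separated from its partner, which is why retesting in a second slot matters.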
 