Memory error in IPMI - CPLD. Capacity or bad mobo?

paulatmig · Nov 16, 2015

Recently, after testing and going to beta by moving our VMs onto our new FreeNAS build, we started seeing "random" reboots during (thankfully) off-hours, about one every day and a half. I checked the IPMI on our system. Normally during memory events, if it's an ECC error, I'll see the problematic DIMM listed there.

However, this time I'm seeing a CPLD error that causes the reboot, and thus no faulty DIMM information. Unfortunately, the only text is: "OEM CPLD CATTER - Asserted". That then trips a watchdog timer interrupt (on the board), which hard resets the system.

My guesses are, at this point, that I either don't have enough memory for my system or that my mobo is having problems. Given the "OEM" part of the message above, I'm thinking more that it's the FreeNAS interaction with the memory that caused the hang.

My configuration is a Supermicro 6027R-E1R12L, two CPUs, 192GB of ECC RAM, eight 800GB Intel S3700 SSDs in a RAID10, an Intel P3600 ZIL, and two Chelsio 10Gb NICs, running the latest FreeNAS 9.3.1. Those SSDs make a pool of 2.6TB, from which I carved a 1.8TB volume for VMDKs and created an iSCSI target file of 1.5TB. Currently we're using 250GB of that iSCSI storage.

-- also attached via SAS expander to a Supermicro JBOD box (just disks and the SAS controller) are twelve 2TB WD RE series (SAS) drives in another pool with 6.5TB of usable space. That's another iSCSI target for our Windows VM, which stores about 700GB of file data on it.

We've had 6 VMs on this system for a few weeks with no problems, but over the last few days we migrated another 2 (one of which had a rather large lazy-zeroed partition associated with it). That's when the reboots started.

I've got the ability to swap out that memory for 32GB LRDIMMs, but that'll be a blow to our budget that I'd like to avoid if it's just a matter of further tuning or fixing our configuration.

paulatmig · Nov 20, 2015

Okay, things seem to have settled for the time being.

I noticed the reboots began after I had transferred about 700GB of data over to one of the volumes, as well as the two VMs. Looking at the memory consumption, I see more and more memory being consumed until it reaches a certain point, where the "free memory" is about the same as the amount allocated to swap. That's when it reboots.

So I'm guessing that I just need to throw more RAM at the problem? Or is there a way to periodically decrease the memory consumption (like flushing the cache)?
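On the question of periodically reducing memory consumption: on a ZFS system, most of the "consumed" memory is usually the ARC read cache, which grows to fill available RAM by design, so steadily rising usage is not in itself a sign of a leak. Rather than flushing it, the usual lever on FreeBSD-based systems is to cap it with the `vfs.zfs.arc_max` tunable. The sketch below only does the arithmetic for such a cap. The 32 GiB of headroom is an illustrative assumption, not a recommendation from this thread, and nothing here confirms that the ARC is related to the CPLD error.

```python
GIB = 1024 ** 3


def arc_max_bytes(total_ram_gib: float, headroom_gib: float) -> int:
    """Return an ARC ceiling in bytes that leaves headroom_gib for everything else."""
    if headroom_gib < 0 or headroom_gib >= total_ram_gib:
        raise ValueError("headroom must be non-negative and smaller than total RAM")
    return int((total_ram_gib - headroom_gib) * GIB)


if __name__ == "__main__":
    # 192 GiB matches the box described above; the 32 GiB headroom is a guess.
    limit = arc_max_bytes(192, 32)
    print(f"vfs.zfs.arc_max={limit}")
```

The printed value is what would go into a loader tunable. Whether capping the ARC changes the reboot behaviour is something only testing on the affected system can show.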

dlavigne · Nov 22, 2015

IMHO, throw more RAM at it.

paulatmig · Dec 4, 2015

Looks like the system freezes every 5 days or so, forcing a reboot. The memory consumption goes right back to the levels before reboot (160GB consumed out of 192GB) and stays steady for a few days. Mysterious. I'm going to put in more memory this weekend, switching to 32GB LRDIMM modules (wow those were expensive) and boosting up to 256GB.

rsquared · Dec 4, 2015

paulatmig said:
Looks like the system freezes every 5 days or so, forcing a reboot. The memory consumption goes right back to the levels before reboot (160GB consumed out of 192GB) and stays steady for a few days. Mysterious. I'm going to put in more memory this weekend, switching to 32GB LRDIMM modules (wow those were expensive) and boosting up to 256GB.

There's another thread regarding watchdog-triggered reboots with iSCSI at https://forums.freenas.org/index.php?threads/39433/

The fix for that is in the latest stable build (11/28), so you may want to try that update before throwing more RAM at the box.


paulatmig · Dec 4, 2015

Of course I bought the RAM before the update (which I applied on the 28th). Crossing my fingers!

paulatmig · Dec 31, 2015

Well, that stayed stable for about a month and then we got another random hang. This time it seems to have been triggered by a larger data transfer, about 2TB. I'm submitting a bug report and seeing if there are any more dedicated resources available to check this out.

paulatmig · Mar 28, 2016

No dice - we sometimes get up to 20 or so days of uptime before the system hangs, and sometimes as few as 2 days. It happens during low usage and high usage, sometimes at 3:00am on the dot, sometimes at random times throughout the day. The RAM has been replaced and upgraded to 256GB using Supermicro-tested memory, and that hasn't helped. When it hangs, I can see the console get warnings that our iSCSI guests are disconnected (i.e. cannot be reached), but that's about it. Looking through the logs provides me with nothing. There's no L2ARC, just a 10GB ZIL (Intel P3600).

Best guess: processors or mainboard, and I really hope it's not the mainboard.
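Since the hangs arrive anywhere from 2 to 20 days apart, it can help to tabulate the gaps between events from the IPMI event log and check them against transfers, scrubs, or cron jobs. Below is a minimal sketch that assumes the log has been exported as pipe-delimited lines of the form `id | MM/DD/YYYY | HH:MM:SS | sensor | event`, similar to what `ipmitool sel list` prints. The sample lines are hypothetical and are not taken from this system.

```python
from datetime import datetime


def parse_sel_line(line):
    """Return (timestamp, description) for one log line, or None if it does not parse."""
    fields = [f.strip() for f in line.split("|")]
    if len(fields) < 5:
        return None
    try:
        when = datetime.strptime(f"{fields[1]} {fields[2]}", "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None  # e.g. entries without a usable timestamp
    return when, " | ".join(fields[3:])


def gaps_between(lines, keyword):
    """Return the time gaps between consecutive events whose description contains keyword."""
    times = sorted(
        parsed[0]
        for parsed in map(parse_sel_line, lines)
        if parsed and keyword.lower() in parsed[1].lower()
    )
    return [later - earlier for earlier, later in zip(times, times[1:])]


# Hypothetical sample data, for illustration only.
SAMPLE = [
    "1 | 03/01/2016 | 03:00:00 | OEM CPLD CATTER | Asserted",
    "2 | 03/03/2016 | 03:00:00 | OEM CPLD CATTER | Asserted",
    "3 | 03/10/2016 | 14:30:00 | Watchdog2 | Hard reset | Asserted",
    "4 | 03/23/2016 | 03:00:00 | OEM CPLD CATTER | Asserted",
]

if __name__ == "__main__":
    for gap in gaps_between(SAMPLE, "CATTER"):
        print(gap)
```

If the gaps cluster around a schedule (for example, the 3:00am events), that points toward a periodic job. Irregular gaps fit better with a marginal hardware fault.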
