SOLVED Advice Needed for Cursed SuperMicro TrueNAS Build

ewellinger

Cadet
Joined
Aug 4, 2017
Messages
5
Hello!

I need some advice about a build I've been working on that has been hitting A LOT of different issues, and I wanted to get the community's input on how best to proceed without just throwing money and time at the problem. TLDR: RMA'd my motherboard 3 times and still hitting issues.

The basic components of the build are as follows:
  • Motherboard: Supermicro X11SPH-nCTF
  • CPU: Intel Xeon Silver 4210 / 4214
  • Memory: 8x 32GB Crucial DDR4 2933MHz PC4-23400 ECC
  • Chassis: Supermicro CSE-826BE16-R920LPB
NOTE: There are obviously other components beyond these, but these are really the only components pertinent to the issue that I've been hitting.

At this point I'm actually on my fourth (!!) motherboard as I hit a variety of issues with the previous 3 that required RMA replacements. The motherboard is new, but the CPU(s) are refurbished chips from ServerMonkey. The RAM is also refurbished but I'm pretty confident that they are all working as I was able to do 4 passes with no errors on the full set of 8 sticks. The basic trajectory of this cursed build is as follows:
  • Motherboard 1:
    • System would not power on at all. The BMC light would turn on but for the life of me I couldn't get it to boot up. I used a multimeter and verified that the chassis PSU was working as expected so that wasn't the issue.
    • Wasn't sure whether it was the CPU or motherboard so I RMA'd the motherboard.
  • Motherboard 2:
    • System turned on, yay!
    • But I was hitting memory issues where it was not identifying one of the DIMMs. At this point my suspicions turned to the CPU. I cleaned the CPU contacts with some isopropyl alcohol and tried reseating it numerous times. The memory issue went away but then reappeared shortly into a memory test. Proceeded with RMA'ing the CPU and getting a different 4210 chip.
    • The new chip worked and I was able to test all the memory sticks for a full 4 runs over ~48 hours.
    • Unfortunately the NIC was bad and booting into TrueNAS I was unable to see it at all. Tried updating the BIOS and booting into a different OS but no dice, the adaptor wasn't recognized at all.
    • Looking back, I should have just added a network adaptor card since this was the closest I got to a working system.
  • Motherboard 3:
    • The next motherboard immediately had memory issues. I was seeing messages like "Memory Training Failure" when booting with just a single DIMM of memory. I tried moving the DIMM around (based on a request from Supermicro support) and was able to boot with just the DIMMC1 slot filled. At this point I was able to update the BIOS but that didn't help.
    • RMA'd a 3rd time. At this point I was informed that any future issues would only result in a repair and not a replacement. I also heard that a tech from Supermicro would personally QA the board before shipping it out.
  • Motherboard 4:
    • Motherboard 4 initially booted with just DIMMA1 occupied. Added in the rest of the memory sticks and then DIMMD1 couldn't be found. Cleaned the CPU, reseated it, and tried just DIMMA1 again, and now it couldn't find that one either.
    • At this point I'm convinced the CPU is bad because I know this board was inspected before being sent (the BIOS was updated to a recent version that none of the others were running).
    • In the hope of getting something working I ordered a Xeon Silver 4214 CPU and a different heatsink (the officially tested Supermicro one as opposed to a Dynatron B5).
    • This arrived yesterday and I tried progressively adding more memory to the system. No matter what I do DIMMD1 is consistently not found. I have not yet tried reseating it (generally didn't have a lot of time yesterday).
    • On the plus side it looks like the network adaptor is working, so that's something I suppose.
How would you proceed? I could pursue an additional RMA to try and get my current board repaired but I'm not sure how long that would take and my luck doesn't seem to be too good thus far.

One option I was considering was just writing off the DIMMD1 and working with the working DIMM slots. Not sure if this is a good long term approach though since I don't know if that leaves the board more prone to failure.

I'm pretty sure I didn't bend any pins on the CPU connector with the multiple reseating of the first CPU, but I guess I can't be sure of that. It's worth noting that this is not my first build so I'm not a complete n00b regarding these types of things but this is my first foray into the more enterprise geared components.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Yikes, that's a rough experience. The X11SPH-nCTF + Xeon Silver 4214 should work flawlessly - I should know, I specced one at work and it's been working without hiccups since early 2020.

If two CPUs don't work with the D channel, you either have a freak coincidence on your hands or a damaged motherboard.
I'm pretty sure I didn't bend any pins on the CPU connector with the multiple reseating of the first CPU, but I guess I can't be sure of that.
Nothing like removing the CPU and inspecting the pins. Taking a good picture tends to help. Also make sure the retaining bracket is torqued down correctly, these large sockets can be pretty sensitive apparently.

Also, I'm staying well clear of wherever you're buying stuff, that's way too much trouble for something so straightforward.
 

ewellinger

Cadet
Joined
Aug 4, 2017
Messages
5
The boards were actually drop shipped directly from Supermicro so I'm not sure how to get any closer to the source than that.

I have visually inspected the pins and I don't think there are any bent pins but I haven't yet gotten a magnifying glass to meticulously inspect all of them.

Given the luck I've had thus far I'm thinking about throwing in the towel and becoming a luddite.

I'm going to take some time tonight to continue testing to compare and contrast the two CPUs and update if I discover anything.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Important note: Be extra sure to clear the firmware settings between each step. Don't want dodgy memory training parameters to stick around and complicate things.
 

ewellinger

Cadet
Joined
Aug 4, 2017
Messages
5
What do you mean by clearing the firmware settings? Would that be resetting the BIOS / CMOS?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Yeah, I am doing all I can to push for correct nomenclature and get rid of the vestiges of IBM PC terminology that do not make any sense at all. "CMOS settings" has to be the weirdest of the bunch. I guess it must have come about because the RTC IC was the only CMOS part in the PC AT? Can't imagine any other reason.

Another candidate is "BIOS", referring to system firmware only barely related to the IBM PC BIOS or re-implementations thereof. There are probably more I'm not remembering.
 

ewellinger

Cadet
Joined
Aug 4, 2017
Messages
5
Oh my god, I don't want to jinx it but it may be working. I removed the CPU and reseated it and made sure to tighten down all the screws as much as they would possibly go and lo and behold it booted and saw all the RAM.

Going to have it do a pass with memtest just to make sure everything is looking good but this is quite the relief.
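
For anyone who wants to double-check from the OS side that every slot is actually being seen, here is a minimal sketch in Python. It assumes text in the style of `dmidecode -t memory` output (one "Memory Device" block per slot, with `Locator:` and `Size:` lines). The sample text, the function names, and the list of expected slots are all made up for illustration, so adjust the slot names to whatever your board's manual lists.

```python
import re

# Illustrative sample in the style of `dmidecode -t memory` output.
# This is NOT captured from the system in this thread.
SAMPLE = """\
Memory Device
	Locator: DIMMA1
	Size: 32 GB

Memory Device
	Locator: DIMMD1
	Size: No Module Installed
"""


def parse_dimm_slots(text):
    """Return {locator: size string, or None if the slot reports no module}."""
    slots = {}
    # Blocks are separated by blank lines.
    for block in re.split(r"\n\s*\n", text):
        if "Memory Device" not in block:
            continue
        # Anchoring at line start avoids matching "Bank Locator:" lines.
        locator = re.search(r"^\s*Locator:\s*(.+)$", block, re.MULTILINE)
        size = re.search(r"^\s*Size:\s*(.+)$", block, re.MULTILINE)
        if not locator or not size:
            continue
        value = size.group(1).strip()
        name = locator.group(1).strip()
        slots[name] = None if value.startswith("No Module") else value
    return slots


def missing_slots(text, expected):
    """List the expected slot names that are absent or report no module."""
    slots = parse_dimm_slots(text)
    return [name for name in expected if slots.get(name) is None]


if __name__ == "__main__":
    # Hypothetical expected slots; only two are used here to match SAMPLE.
    print(missing_slots(SAMPLE, ["DIMMA1", "DIMMD1"]))
```

On a real system the text would come from running `dmidecode -t memory` as root rather than from the hard-coded sample. This only confirms that the firmware reports a module in each slot; it is not a substitute for a memtest pass.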
 

ewellinger

Cadet
Joined
Aug 4, 2017
Messages
5
Marking this as resolved.

I'm off to the races and everything is looking great. Lesson learned: purchase the heatsink that is specifically called out as working with the motherboard, and make sure all the screws are tightened down as far as they will go. Unclear if the previous CPU was okay or not, but I'm done touching the current one lest it have issues seeing the memory again.
 

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
Yeah, I am doing all I can to push for correct nomenclature and get rid of the vestiges of IBM PC terminology that do not make any sense at all. "CMOS settings" has to be the weirdest of the bunch. I guess it must have come about because the RTC IC was the only CMOS part in the PC AT? Can't imagine any other reason.

Another candidate is "BIOS", referring to system firmware only barely related to the IBM PC BIOS or re-implementations thereof. There are probably more I'm not remembering.
My Supermicro X10SRA-F manual has a jumper for, you guessed it, to clear CMOS settings, and a BIOS recovery switch. It even has a BIOS restore button. Good luck! I think those terms are ingrained.

Glad you got the issue resolved, ewellinger. There are all sorts of odd hardware issues in the forums.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
At least HPE are using more accurate language--what most of us call the BIOS, they call the ROM-based setup utility or RBSU. A little more cumbersome to say, though.
 