Odd RAM issues "DDR Basic RxDqDqs Failure" ?

dropslot

Cadet
Joined
Feb 5, 2024
Messages
6
I'm not an IT pro but I had this system up and running a few months ago. I took it down, put it in storage, and recently pulled it all out of storage, set it up, and it won't boot.

I'm almost positive someone is going to tell me that all my RAM is borked but I'd love a little help with this.

It's a Supermicro 8048B-TRFT with an X10QBi board. I wish I could tell you which TrueNAS and BIOS versions it's running, but I can't remember. I can't even get into the BIOS because I'm getting memory error messages from a number of different locations whenever I fire it up.

It was working great when I took it down in June. It was all carefully packaged and moved to a storage unit until last week. Maine does get cold, and the unit was not heated, so did I freeze all my RAM? Is that a thing?

Best case, I'm hoping that something was loosened during the move that I'm not seeing, or that some minor hardware issue is compromising the initial memory test on startup. But as I work through this, I'm starting to think I may need to replace my RAM.

Any thoughts would be great. I'm not a pro, just trying to get my media server back up and running so I can edit video for some clients.
 

Attachments

  • IMG_3300.jpeg

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Quad-CPU board with memory risers? That's some seriously overengineered stuff right there, for niche applications...

In any case, start by reseating the memory risers and the DIMMs. It sort of looks like one of the risers may have shifted during the move...
 

dropslot

Cadet
Joined
Feb 5, 2024
Messages
6
Ha - yeah, way bigger than I need now. It seemed to make sense at the time, but the video world got pretty slow last year. So I'm sitting on a lot of metal. I got a deal on it but it's one of those "the car is free but the repairs are gonna cost you" kind of deals. I had big plans because we were all working remote... Long story short, yeah - agreed.

I pulled all the risers and reseated all the DIMMs.

I also replaced the BIOS battery, as I thought that might be something that could have died in the cold.

The only difference I see now is that the risers and DIMMs mentioned in the error code seem to have changed? Is that progress? Wondering if I should begin pulling all the DIMMs and move through them one by one, or maybe run memtest off a USB?

Thanks so much for your help!
 

Attachments

  • IMG_3314.jpeg

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
dropslot said: "I also replaced the BIOS battery, as I thought that might be something that could have died in the cold."
Always a good one to check. I once spent months thinking my workstation was in a death spiral, with boot problems and memory sometimes flaking out during boot. It was just a bad battery corrupting the UEFI config, including the IMC training cache!
Do reset the settings to default according to the motherboard manual, as replacing the battery may not be enough to reset back to sane defaults.

dropslot said: "maybe run memtest off a USB?"
Doesn't look like it would get that far...

dropslot said: "The only difference I see now is that the risers and DIMMs mentioned in the error code seem to have changed? Is that progress? Wondering if I should begin pulling all the DIMMs and move through them one by one."
Hard to tell if it's progress. Besides the risers and DIMMs, there are also the CPUs themselves to consider. A bit more painful to reseat, though.
 

dropslot

Cadet
Joined
Feb 5, 2024
Messages
6
So -

1) I reset the board using the JBT1 short, as per the motherboard manual. Didn't know that just removing the battery didn't do it, so thanks for that!

2) I began pulling memory risers until I could get it to move past the memory test - so I have two risers out, and I am past the opening Supermicro logo. That is progress!

And, 3) I got the IPMI up and logged in from my PC. So I have a lot more information now. But -

It's just booting to a gray screen. That's all I am getting.

My theory is this: in the move, the RAM risers, which are only loosely connected to the back of the server anyway, were jostled enough that they may not sit right, and/or their slots on the motherboard were bent or somehow damaged. And that theory was working, until now.

Is there ANYTHING I could look at in the web interface for the IPMI that might give me a clue as to what's holding me back?

I appreciate your help so much. Thank you!
 

Attachments

  • Screenshot 2024-02-07 103223.png

dropslot

Cadet
Joined
Feb 5, 2024
Messages
6
And now I can make it past the gray screen, but it's hung on POST code BF. So, still the memory, right?
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
Regarding the grey screen, this is an issue with Java (or something of the sort). I have an old X10SLI board I use for backup, and usually, when it fails to render the message screen, I open my web browser (Firefox), enter the IPMI address of the Supermicro board, and log in. If memory serves, it then presents a prompt to update Java script.
Once this is done, the "Remote Console Preview" should work properly.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Or use the HTML5 viewer, though you may need to update the BMC firmware first.
dropslot said: "And now I can make it past the gray screen, but it's hung on POST code BF. So, still the memory, right?"
Could be. I'm really not familiar with Xeon E7 systems, but I imagine they're similarly picky about how memory is populated, which adds an extra snag in the process.
 

dropslot

Cadet
Joined
Feb 5, 2024
Messages
6
Apollo said: "this is an issue with Java (or something of the sort). I have an old X10SLI board I use for backup and usually, when it fails rendering the message screen, I open my web browser (Firefox) and type the IPMI address to the Supermicro board and log in and if memory serves will present a prompt to update Java script."
Nice! So I DID manage to get the remote console preview to work - it matches the monitor I now have hooked up to the server.

@Ericloewe I saw that the memory needs to be populated in a very particular manner, and I think I did it right, but that's definitely worth checking again.

I found a YouTube video from the Art of the Server channel that suggests a BF code means the failure is happening in the second half of the CPUs, meaning, I think, in my quad setup, the RAM in P3M1 or P4M1? Does that sound accurate?

Anyway, got to this - baby steps!
 

Attachments

  • Screenshot 2024-02-07 120951.png

dropslot

Cadet
Joined
Feb 5, 2024
Messages
6
So last night I began to get desperate and just started pulling things out. Pulled CPU 4 (based on the idea that a BF POST code would be near the end of the processor line) and then just kept pulling memory risers until I was down to two, but then it booted up!

Had to get into the BIOS and reset some parameters, obviously, but now I've got that big empty metal case humming along nicely.

Of course, the tinkerer in me is still wondering what happened and how much of my RAM I can put back in. I feel like there was at least one riser, maybe more, that got bumped in the move.

Anyway, I appreciate all the help I found here. Thank you!
 
Joined
Jun 15, 2022
Messages
674
Personally, I'd pull all the memory and power on the system, which should give a stable system with an error message. Then I'd add in as little memory as possible, hopefully get a stable system, and then boot from a USB stick and run a short memory test.

Once that worked, I'd start keeping notes with pencil and paper and add memory back in slowly until it failed, then isolate what failed by slowly removing/swapping memory, remembering there may be more than one bad component (this is where good notes come in).

Eventually the system should be stable and I'd run a long MemTest.
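
For anyone who prefers to see the add-back procedure written down, here is a minimal Python sketch of the bookkeeping only. It assumes a few things that are not from this thread: the slot labels are made up, `boots_ok` is a hypothetical stand-in for physically powering on and watching POST, and a bad module is assumed to fail no matter what else is installed. It also ignores the board's memory population rules, which the motherboard manual still governs.

```python
from typing import Callable, Iterable, List, Set, Tuple


def isolate_bad_modules(
    modules: Iterable[str],
    boots_ok: Callable[[Set[str]], bool],
) -> Tuple[List[str], List[str]]:
    """Add modules back one at a time and note which ones break the boot."""
    installed: Set[str] = set()
    good: List[str] = []
    suspect: List[str] = []
    for module in modules:
        trial = installed | {module}
        if boots_ok(trial):
            # System still boots with this module added: keep it installed.
            installed = trial
            good.append(module)
        else:
            # Boot failed: leave the module out and write it down as suspect.
            suspect.append(module)
    return good, suspect


# Simulated bench. FAULTY and the labels below are invented for illustration.
FAULTY = {"P1M2-DIMMA1", "P2M1-DIMMB1"}


def simulated_boot(installed: Set[str]) -> bool:
    """Pretend the system boots only if no faulty module is installed."""
    return not (installed & FAULTY)


labels = ["P1M1-DIMMA1", "P1M2-DIMMA1", "P2M1-DIMMA1", "P2M1-DIMMB1"]
good, suspect = isolate_bad_modules(labels, simulated_boot)
print("good:", good)
print("suspect:", suspect)
```

In practice the "function call" is a power cycle, and the two lists are the pencil-and-paper notes. A module flagged as suspect may really point at a bad slot or riser, so it is worth retrying it in a known-good position before writing it off.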

About the time RAM goes bad is when heatsink compound dries up so I'd clean the old processor paste off and use new thermal compound (not off eBay). I'd do the same on the HBA chips.

Others may do things differently for good reason; this is just one starting point given the issue you're having and my personal experience.
 