- Took out 2 sticks of RAM and re-arranged and re-seated the other 2
A physical inspection of the modules (especially the pins) and the slots might help. Dust can cause issues too.
Reseating the CPU is also an option.
I did run MemTest last night. It ran for 6 hours and came back as passed.
You need to run it for at least 24 hours or maybe even 48 hours. 6 hours and one pass is not enough.
The other stress test you should run is a CPU Stress Test for at least 8 hours.
Also, just in case you might be doing this, ensure you are testing the system with all your hard drives connected to ensure the power draw on the power supply is at the maximum. If you disconnect some of the hardware then the power supply isn't pulling the same current.
And let's say those things all pass. I would then grab a copy of Debian or Ubuntu and boot it, let it run for a day or two, and see if the system crashes.
Lastly, see if you can identify what operation you might be doing when the system crashes. Is it when you are using Plex? Copying data? Nothing at all? Little things can provide big clues.
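One low-effort way to narrow down when the crashes happen is to leave a small heartbeat logger running. The following is only a sketch: the function names, the log format, and the 60-second interval are illustrative assumptions, not part of TrueNAS or any tool mentioned in this thread. It appends a timestamp and the 1-minute load average to a file and forces each line to disk, so the last line shows roughly when the system went down.

```python
import os
import time
from datetime import datetime, timezone


def write_heartbeat(path, note=""):
    """Append one timestamped line and force it to disk."""
    try:
        load = os.getloadavg()[0]  # 1-minute load average (Unix-like systems)
    except (AttributeError, OSError):
        load = float("nan")  # not available on this platform
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    line = f"{stamp}\tload1={load:.2f}\t{note}\n"
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(line)
        fh.flush()
        os.fsync(fh.fileno())  # ask the OS to persist the line before a possible crash
    return line


def last_heartbeat(path):
    """Return the final line in the log, or None if the log is empty."""
    with open(path, encoding="utf-8") as fh:
        lines = fh.read().splitlines()
    return lines[-1] if lines else None


def run(path, interval_s=60, note=""):
    """Write a heartbeat every interval_s seconds until interrupted."""
    while True:
        write_heartbeat(path, note)
        time.sleep(interval_s)
```

After a crash, `last_heartbeat("heartbeat.log")` gives the last recorded time, which can then be compared with what was running (Plex, a large copy, nothing at all).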
Best of luck to you.
Thanks for the recommendations, they sound very thorough, and I understand why enterprise scenarios may want to be this thorough. ... What would be the benefit of purposely forcing this type of proactive downtime as opposed to reactively addressing these types of issues as they arise? I don't think there is a risk of data loss in the event of these types of crashes, is there? You are recommending memory and CPU tests; if either failed, would that cause data loss?
Either tends to cause something worse: Data Corruption.
It's like Ransomware, your data is corrupt, your backups are corrupt, your stuff has been corrupted and then corrupted differently over time (even though you didn't touch it), and you weren't aware of any of this. Unlike Ransomware there's no coming back from the dead, the zombies are all you're left with. Most of us would rather not store the data, it's less trouble than trying to figure out what is bad, how bad, etc.
Mind you, I've only learned much of this recently due to the awesome contributions of TrueNAS members, so while I'll do my best to be accurate, the last few months have been a whirlwind eye-opener into how many things can and do go wrong inside computer systems and I may misstate some things.
I was thinking that ECC would prevent that... but reading more about corruption sources, the RAM is just one potential source, right?
Could it corrupt all existing data, or just data during the copy process specifically?
Is the process of just copying a large amount of data a form of stress test? Or is the issue more that this process may not identify an issue unless it was severe enough to cause a crash, whereas MemTest and a CPU test are meant to capture and report these less severe issues that would otherwise go unnoticed?
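To make the distinction concrete: a plain copy only reveals a problem if something crashes, while a copy followed by a checksum comparison can at least flag a silent mismatch. The sketch below assumes SHA-256 and uses hypothetical helper names (`sha256_of`, `copy_and_verify`). It is not a replacement for MemTest or a CPU stress test, since a freshly written file is often read back from the OS cache rather than from disk, and bad RAM can corrupt data before it is ever hashed.

```python
import hashlib
import shutil


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large files do not need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(src, dst):
    """Copy src to dst, then re-read both files and compare their digests."""
    shutil.copyfile(src, dst)
    return sha256_of(src) == sha256_of(dst)
```

A `False` result means the two files differ even though the copy itself reported no error, which is exactly the kind of issue that would otherwise go unnoticed.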
Advanced Error Correction can actually correct for multi-bit errors on certain systems. I understand it's only on Intel Xeon processors, but I could be wrong.
That's what I had previously understood also, however:
That's one of those things that's a bit of a niche feature in high-end servers and comes with relevant catches. Effectively, it mirrors memory controllers, halving memory performance and capacity in order to eke out a little bit more reliability.
It's an extreme measure that's really only useful to meet insane reliability criteria.
(emphasis added)
per DELL (linked previously):
Intel has redesigned and optimized their Advanced Error Correcting Code in 3rd Gen Xeon Scalable Processors to handle the most common failure patterns known among the major DRAM suppliers. In doing this, many of the multi-bit error patterns that were uncorrectable by previous generations of Intel Xeon Scalable Processors are now correctable by 3rd Gen Xeon SPs. This uplift will result in a significant decrease in uncorrectable memory errors. This enhancement is available on all 3rd Gen Xeon Scalable Processors. There are no memory or system configuration requirements necessary to take advantage of the improved Advanced ECC (or SDDC).
(emphasis added)
per HP (HP ProLiant ML350 G6 Server User Guide, p46):
Advanced ECC—provides the greatest memory capacity for a given DIMM size, while providing up to 4-bit error correction. This mode is the default option for this server.
Standard ECC can correct single-bit memory errors and detect multi-bit memory errors. When multi-bit errors are detected using Standard ECC, the error is signaled to the server and causes the server to halt.
Advanced ECC protects the server against some multi-bit memory errors. Advanced ECC can correct both single-bit memory errors and 4-bit memory errors if all failed bits are on the same DRAM device on the DIMM. Advanced ECC provides additional protection over Standard ECC because it is possible to correct certain memory errors that would otherwise be uncorrected and result in a server failure. The server provides notification that correctable error events have exceeded a pre-defined threshold rate. p47
Note: Mirrored, Lockstep, and Online Spare memory configurations are on p48.
p49, under Advanced ECC, states that DIMMs may be installed individually, which would indicate it is not using a RAID configuration for RAM.
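For anyone wondering what "correct single-bit, detect multi-bit" means in practice for Standard ECC, here is a toy illustration. It is an extended Hamming code protecting only 4 data bits, chosen purely for readability. Real ECC memory uses much wider codewords, and this sketch makes no claim about how HP or Intel implement Standard or Advanced ECC. The function names are made up for the example.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED codeword (list of 0/1)."""
    bits = [0] * 8
    # Data bits go in positions 3, 5, 6, 7; parity bits in 1, 2, 4; overall parity in 0.
    bits[3], bits[5], bits[6], bits[7] = [(nibble >> i) & 1 for i in range(4)]
    bits[1] = bits[3] ^ bits[5] ^ bits[7]
    bits[2] = bits[3] ^ bits[6] ^ bits[7]
    bits[4] = bits[5] ^ bits[6] ^ bits[7]
    for pos in range(1, 8):
        bits[0] ^= bits[pos]
    return bits


def decode(bits):
    """Return (status, nibble); nibble is None when the error is uncorrectable."""
    bits = list(bits)
    syndrome = 0
    overall = 0
    for pos in range(8):
        if bits[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid codeword
            overall ^= 1     # total parity is even for a valid codeword
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        bits[syndrome] ^= 1  # single-bit error: syndrome is the flipped position
        status = "corrected"
    else:
        return "uncorrectable", None  # even number of flips: detected, not fixable
    data = [bits[3], bits[5], bits[6], bits[7]]
    return status, sum(b << i for i, b in enumerate(data))
```

Flipping any one bit of `encode(n)` decodes as `("corrected", n)`, while flipping any two bits decodes as `("uncorrectable", None)`. The second case corresponds to the situation in the HP excerpt where the error is signaled and the server halts.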