Multiple ungraceful reboots and a temporarily unhealthy pool on "scrub-day"

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
203
FYI:
It seems the root cause was a broken CPU. After replacing the CPU (first temporarily with my desktop's Ryzen, later by RMAing it for a new Ryzen 3600), the problems are gone.

So the problem was not caused by TrueNAS <-> AMD incompatibility. Although perhaps not perfect, my experience with TrueNAS <-> AMD compatibility has not been a bad one.

It is a bit concerning that the corruption wasn't detected by anything at the moment it happened; only the resulting corrupted data was caught afterwards by the scrub. I also wasn't able to trigger any PCIe AER errors in Linux or Windows (I tried this using my Optane instead of the HBA), so I am not sure exactly which part of the CPU was broken, or whether it is a TrueNAS issue, a platform (AMD) issue, or perhaps a combination...
 

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
203
Thanks for your assistance, AlexGG. It was an annoying and complex issue...

I wanted to say that I can now finally start enjoying my TrueNAS server again, but I'm afraid I already ran into the next issue o_O (I think I got that one resolved too now ;) )
 

marcevan

Patron
Joined
Dec 15, 2013
Messages
432
Seeing this train wreck reminded me of my own shutdown/reboot hell from 2018: the CPU cooler went bad.

And it was not a stock cooler either.

Right now, the CPU's hottest reading is 25C; at the time of the reboots, it was 90C!

But I'm going to be doing some major surgery (new mobo, new CPU, new RAM, new NIC, new SSD for boot), so I will have your page of madness open for reference should this go pear-shaped.
 