Kernel panic on ZFS pool import


jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
^^^ That is precisely why I want to look at this, and why I'm going to offer help for free. On a 1-to-terrifying scale, this is towards the "terrifying" end of things. It also proves that those software guys who curse my name every night before bed, and who wrote code that allegedly makes it impossible for bad RAM to corrupt a zpool, didn't deliver what was promised (something I have always argued had nearly a 0% chance of working properly).

Bad RAM can always corrupt a pool. A "software guy" is generally too inexperienced with hardware to have sufficient imagination as to how this kind of thing could happen; they're almost always identifiable because they think software can solve anything.

The failure of a DIMM is rarely a flip-of-a-switch event, and as such the mayhem that may be present in the memory subsystem as senility develops isn't guaranteed to be detected in time to prevent corruption in every circumstance.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
So a few things:

1. His LSI controller firmware was at v15. He had the WebGUI warnings, but didn't know what they meant and never got around to investigating. It is very possible that all of his problems stem from the LSI firmware mismatch. That muddies the troubleshooting, but keep reading.
2. I was unable to easily mount his zpool. I got full screenshots of all of his BIOS settings as well as a FreeNAS debug. Nothing particularly eventful can be gleaned from the debug, though.
3. I am buying the bad RAM stick from him so that once it is here in the USA we can hopefully do some investigating and use it as a test platform to dig deeper into these kinds of problems. His logs listed a massive number of errors, only a small percentage of which were correctable. They were occurring so fast that, with the limited space available, the log doesn't actually go back very far.
4. He has tested his RAM in another system and verified that one of the sticks is in fact bad; the other three are fine.
5. The patrol scrub settings are not in his BIOS anywhere, and none of his settings seemed out of place. In short: nothing seemed misconfigured in his BIOS, nothing seemed out of place in the logs, and nothing seems to be wrong aside from the LSI firmware issue (already fixed) and the bad stick of RAM.

This is one of those things that's going to be "a work in progress" for a few weeks/months. This is pretty scary because ECC seems to be doing what it's supposed to do, except for the most important part: halting the system. The bad firmware alone could be the cause of the corruption, but once I have the RAM in my hands we can prove more definitively how much the RAM could have contributed.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Well, there's no guarantee that the system will halt. In the old days, ECC catastrophes would generate an NMI, causing a panic, but in a modern system, it's managed through the MCE mechanism. I've seen FreeBSD report bit corrections via MCA but I haven't looked to see what the kernel is set to do upon receipt of an uncorrectable MCE.
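
For anyone who wants to check, the counters are easy to peek at from userland. A minimal sketch, assuming a FreeBSD kernel built with MCA support where the hw.mca sysctl tree is present (exact OIDs can vary by release):

Code:
/* mca_peek.c - read FreeBSD's machine-check counters via sysctl.
 * Assumes the hw.mca tree exists (x86 kernels with MCA support).
 * Build: cc -o mca_peek mca_peek.c
 */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>

static int read_int(const char *oid, int *out)
{
    size_t len = sizeof(*out);
    return sysctlbyname(oid, out, &len, NULL, 0);
}

int main(void)
{
    int enabled, count;

    if (read_int("hw.mca.enabled", &enabled) == 0)
        printf("hw.mca.enabled: %d\n", enabled);
    else
        perror("hw.mca.enabled");

    /* number of machine-check records the kernel has collected */
    if (read_int("hw.mca.count", &count) == 0)
        printf("hw.mca.count:   %d\n", count);
    else
        perror("hw.mca.count");

    return 0;
}

If hw.mca.count is climbing, the kernel should also have logged the decoded records ("MCA: ..." lines) to the console as it collected them.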

Having a spectacularly failed module probably isn't helpful because you're not likely to be able to boot a system with it in; the moment of failure may have been a one-time window of opportunity that is lost now. However, it seems likely that a known-good DIMM could be instrumented to fail in a predictable manner.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Well, there's no guarantee that the system will halt.

That definitely seems pretty obvious. But it defeats the whole purpose of multi-bit error detection. An uncorrectable MCE should be a sign that the hardware has recognized things are totally insane and that the system can no longer trust itself to do the right thing; if the system isn't immediately stopped, you've potentially got a major problem. Granted, not all MCEs necessarily should cause a panic. But IMO (and all of the information I've read in the last 24-48 hours) says that an uncorrectable one absolutely and unequivocally should.
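
To make concrete what multi-bit detection buys you, here's a toy sketch of SECDED using the textbook Hamming(7,4) code plus an overall parity bit; real ECC DIMMs use a wider code over 64 data bits, but the behavior is the same. One flipped bit gets corrected, two flipped bits can only be detected, and that's exactly the point where the system ought to stop:

Code:
/* secded.c - toy SECDED demo: Hamming(7,4) plus an overall parity bit.
 * Build: cc -o secded secded.c
 */
#include <stdio.h>

/* Encode a 4-bit nibble: bit 0 holds the overall parity,
 * bits 1..7 hold the classic Hamming(7,4) codeword. */
static unsigned encode(unsigned d)
{
    unsigned c = 0, i, p, parity;

    /* data bits live at Hamming positions 3, 5, 6, 7 */
    c |= ((d >> 0) & 1u) << 3;
    c |= ((d >> 1) & 1u) << 5;
    c |= ((d >> 2) & 1u) << 6;
    c |= ((d >> 3) & 1u) << 7;

    /* parity bit p covers every position whose index has bit p set */
    for (p = 1; p <= 4; p <<= 1) {
        parity = 0;
        for (i = 1; i <= 7; i++)
            if (i & p)
                parity ^= (c >> i) & 1u;
        c |= parity << p;
    }

    /* overall parity over bits 1..7 goes into bit 0 */
    parity = 0;
    for (i = 1; i <= 7; i++)
        parity ^= (c >> i) & 1u;
    return c | parity;
}

/* Decode: 0 = clean, 1 = single-bit error corrected,
 * 2 = double-bit error detected but uncorrectable. */
static int decode(unsigned c, unsigned *d)
{
    unsigned syndrome = 0, all = 0, i;
    int status = 0;

    for (i = 1; i <= 7; i++)
        if ((c >> i) & 1u)
            syndrome ^= i;          /* XOR of set-bit positions */
    for (i = 0; i <= 7; i++)
        all ^= (c >> i) & 1u;       /* overall parity check */

    if (syndrome && all) {          /* one flip: syndrome names the bit */
        c ^= 1u << syndrome;
        status = 1;
    } else if (syndrome && !all) {  /* two flips: detect only */
        status = 2;
    } else if (!syndrome && all) {  /* the parity bit itself flipped */
        c ^= 1u;
        status = 1;
    }
    *d = ((c >> 3) & 1u) | (((c >> 5) & 1u) << 1) |
         (((c >> 6) & 1u) << 2) | (((c >> 7) & 1u) << 3);
    return status;
}

int main(void)
{
    unsigned word = encode(0xB), d;
    int st;

    st = decode(word ^ (1u << 6), &d);             /* flip one bit */
    printf("one flip:  status=%d data=0x%X\n", st, d);

    st = decode(word ^ (1u << 6) ^ (1u << 2), &d); /* flip two bits */
    printf("two flips: status=%d (data can't be trusted)\n", st);
    return 0;
}

Note that with two flips the syndrome is nonzero but points at the wrong bit; only the extra parity bit tells the hardware not to "correct" it. That's why an uncorrectable error leaves you with nothing trustworthy to fall back on.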

Having a spectacularly failed module probably isn't helpful because you're not likely to be able to boot a system with it in; the moment of failure may have been a one-time window of opportunity that is lost now. However, it seems likely that a known-good DIMM could be instrumented to fail in a predictable manner.

In this situation, things don't seem to be going that way. The module seems to be in pretty bad shape, but as long as it is not in the first slot of the first memory bank, the system POSTs and boots up just fine. If you do put this bad stick in that one slot, the system won't complete a POST.

There are also a lot of definitions of "spectacularly failed module". Typically, when a RAM stick fails, far more than a single memory location fails. Normally a particular path gets shorted to ground or to a voltage pin, so a whole bunch of memory blocks end up stuck in a particular state, or a bunch of bits get shorted together so they all hold whatever value was last written. (Sometimes they instead read back as zeros, because the cells are essentially tiny capacitors and excessive leakage between refreshes drains them.) That's why I didn't want to look into buying one of those memory sticks with a pushbutton to simulate a failure: the kind of failure it simulates really isn't what is commonly seen, although for simulating single-bit errors it is perfectly sufficient for studying the effects.
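
To illustrate why that matters for testing, a purely hypothetical sketch: a toy write/read-back pass over a simulated buffer where one data line in a 256-byte row is shorted low, so an entire region fails at once rather than a single bit:

Code:
/* stuckat.c - simulate a stuck-at fault across a block of "DRAM"
 * and show how a simple write/read-back pattern test catches it.
 * Purely illustrative; real memory testers walk many more patterns.
 * Build: cc -o stuckat stuckat.c
 */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define MEM_SIZE 4096

static uint8_t mem[MEM_SIZE];

/* Model a data line shorted to ground: bit 3 of every byte in one
 * 256-byte "row" reads back as 0 no matter what was written. */
static uint8_t read_cell(size_t addr)
{
    uint8_t v = mem[addr];
    if (addr >= 1024 && addr < 1280)
        v &= (uint8_t)~0x08;
    return v;
}

static size_t pattern_test(uint8_t pattern)
{
    size_t bad = 0;
    memset(mem, pattern, sizeof mem);
    for (size_t a = 0; a < MEM_SIZE; a++)
        if (read_cell(a) != pattern)
            bad++;
    return bad;
}

int main(void)
{
    /* 0x00 never exercises the stuck-at-0 bit; 0xFF and 0xAA do */
    printf("pattern 0x00: %zu bad cells\n", pattern_test(0x00));
    printf("pattern 0xFF: %zu bad cells\n", pattern_test(0xFF));
    printf("pattern 0xAA: %zu bad cells\n", pattern_test(0xAA));
    return 0;
}

The point being that the 0x00 pass sails right through while 0xFF and 0xAA light up all 256 cells at once, which is why a single quick pattern proves very little.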

Anyway, I have lots of homework to do, and no doubt once I have that bad stick of RAM in my grubby hands quite a few people will want to see the hardware. This is all just very depressing to me, though. :(
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
That definitely seems pretty obvious. But it defeats the whole purpose of multi-bit error detection. An uncorrectable MCE should be a sign that the hardware has recognized things are totally insane and that the system can no longer trust itself to do the right thing; if the system isn't immediately stopped, you've potentially got a major problem. Granted, not all MCEs necessarily should cause a panic. But IMO (and all of the information I've read in the last 24-48 hours) says that an uncorrectable one absolutely and unequivocally should.

No, the idea is that it gives you infrastructure to do better things. If you look at the history of high-end servers, they're littered with technologies such as Lockstep Memory, Extended ECC, Chipkill, Advanced ECC, etc., many of which are aimed at tolerating the catastrophic failure of a DIMM at a hardware level. As systems have become larger, with hypervisors in particular sometimes running hundreds of virtual machines, the pain of having to down a computing platform to remediate a failure like this is fairly high, and part of the point of the MCA is that the operating system can potentially pull tricks itself, such as no longer using a section of physical memory that's suddenly gone bad.

The MCA just generates a software interrupt and allows the OS to determine what happens next.
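
As a conceptual sketch of that trick (a toy free-list allocator, nothing resembling the real FreeBSD or Linux machinery), the handler just quarantines the frame the MCE pointed at so it's never handed out again:

Code:
/* retire.c - toy sketch of post-MCE page retirement: a trivial
 * physical-frame allocator that quarantines frames reported bad.
 * Not real kernel code; just the shape of the idea.
 * Build: cc -o retire retire.c
 */
#include <stdio.h>

#define NFRAMES 16

enum frame_state { FRAME_FREE, FRAME_USED, FRAME_RETIRED };
static enum frame_state frames[NFRAMES];

/* The allocator skips retired frames, so a bad one is never reused. */
static int alloc_frame(void)
{
    for (int i = 0; i < NFRAMES; i++)
        if (frames[i] == FRAME_FREE) {
            frames[i] = FRAME_USED;
            return i;
        }
    return -1; /* out of memory */
}

/* What an uncorrectable-MCE handler might do instead of panicking:
 * pull the offending frame out of circulation. */
static void on_uncorrectable_mce(int frame)
{
    frames[frame] = FRAME_RETIRED;
    printf("frame %d retired after uncorrectable error\n", frame);
}

int main(void)
{
    on_uncorrectable_mce(0);  /* hardware flags frame 0 as bad */
    int f = alloc_frame();    /* allocator skips the retired frame */
    printf("first allocation gets frame %d\n", f); /* prints 1 */
    return 0;
}

A real kernel obviously has far more to do (the frame may hold dirty file data, kernel text, etc.), which is why retirement only saves you in the lucky cases.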

In this situation, things don't seem to be going that way. The module seems to be in pretty bad shape, but as long as it is not in the first slot of the first memory bank, the system POSTs and boots up just fine. If you do put this bad stick in that one slot, the system won't complete a POST.

Has it been configured to do a full POST, or a quick POST? ("this is why we don't do a quick POST.")

There are also a lot of definitions of "spectacularly failed module". Typically, when a RAM stick fails, far more than a single memory location fails. Normally a particular path gets shorted to ground or to a voltage pin, so a whole bunch of memory blocks end up stuck in a particular state, or a bunch of bits get shorted together so they all hold whatever value was last written. (Sometimes they instead read back as zeros, because the cells are essentially tiny capacitors and excessive leakage between refreshes drains them.) That's why I didn't want to look into buying one of those memory sticks with a pushbutton to simulate a failure: the kind of failure it simulates really isn't what is commonly seen, although for simulating single-bit errors it is perfectly sufficient for studying the effects.

Anyway, I have lots of homework to do, and no doubt once I have that bad stick of RAM in my grubby hands quite a few people will want to see the hardware. This is all just very depressing to me, though. :(

Pfft. Get over it and welcome to life with computers already.
 