Kernel panic on ZFS pool import

Vossy · Dec 30, 2015

Hi

I'm having a strange problem. I get a kernel panic during boot when trying to mount the filesystem. This started two days ago when I heard the system reboot. I got an email alerting me to the fact that the pool was online but that a device had experienced an error which could result in data corruption. I ran zpool status and smartctl on each device to see what was going on (see below).

Code:

[root@freenas] /var/log# zpool status -v
  pool: freenas-boot
state: ONLINE
  scan: scrub repaired 0 in 0h2m with 0 errors on Tue Dec 29 03:47:35 2015
config:

        NAME                                          STATE     READ WRITE CKSUM
        freenas-boot                                  ONLINE       0     0     0
          gptid/24e9b3bd-592b-11e5-bcc2-0cc47a691a60  ONLINE       0     0     0

errors: No known data errors

  pool: volume1
state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 3h54m with 0 errors on Sun Nov 29 03:54:51 2015
config:

        NAME                                            STATE     READ WRITE CKSUM
        volume1                                         ONLINE       0     0    22
          raidz2-0                                      ONLINE       0     0    44
            gptid/7727335e-58ec-11e5-95ec-0cc47a691a60  ONLINE       0     0     0
            gptid/77d24c41-58ec-11e5-95ec-0cc47a691a60  ONLINE       0     0     0
            gptid/787d3011-58ec-11e5-95ec-0cc47a691a60  ONLINE       0     0     0
            gptid/792f59c3-58ec-11e5-95ec-0cc47a691a60  ONLINE       0     0     0
            gptid/79da0661-58ec-11e5-95ec-0cc47a691a60  ONLINE       0     0     0
            gptid/7a825715-58ec-11e5-95ec-0cc47a691a60  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /mnt/volume1/jails/maraschino_1/etc/spwd.db

smartctl output

I restarted the system, but it refused to boot and there were a lot of errors, the details of which escape me now. I then ran memtest, which refused to boot, so I removed all the RAM on a hunch and then ran memtest on each stick individually. As it turns out, one stick was apparently causing the problem and, after removing the offending stick of RAM, I was able to run memtest overnight on the remaining 3 sticks (4 or 5 passes, no errors).

I then unplugged all the hard drives and booted the system. It started without issue.

Satisfied (and hoping the problem was resolved) I plugged them all back in and restarted the computer. It managed to get all the way up to when it tried to mount the zfs pool. I got a nasty kernel panic message and a lot of text (I feel like there was probably more than this, but the frames of the capture didn't get anything more):

I'm at a loss for where to go now. I feel like I need to try and mount the zfs pool in read-only mode and scrub it, or something. I don't really hold high hopes for the data at this point, but I'd love to restore it if at all possible. Any help at all would be greatly appreciated. Let me know if I can provide any other information at this point.

System:
FreeNAS-9.3-STABLE-201512121950 64-bit off a SanDisk 8GB USB drive

Motherboard: Supermicro X10-SL7-F
CPU: Intel XEON E3-1231v3
RAM: 4x8GB Crucial ECC RAM (CT2CP102472BD160B)
HDD: 6x3TB WD Red in Raidz2 configuration (WD30EFRX)
Boot device is a Sandisk 8GB USB drive

On creation of the system, memtest ran without fault overnight.

Ericloewe · Dec 30, 2015

To be honest, this sort of failure just should not happen.

Let's start by seeing in what state the data is in. Try one of the following (from best option to least good):

Try to import the pool in a similar/identical server (Server-grade stuff, 8+GB of ECC RAM) with a clean install of FreeNAS
Try to import the pool in the original server with a clean install of FreeNAS
Try to import the pool in a desktop you might have (8GB+ of RAM) with a clean install of FreeNAS
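Whichever machine is used, a read-only import that mounts nothing is probably the least invasive first attempt. The following is a minimal sketch, assuming the pool name `volume1` from the status output above and `/mnt` as the alternate root; `readonly_import_cmd` and `try_import` are hypothetical helper names, and by default the script only prints the command instead of running it. There is no guarantee that a read-only import avoids the panic.

```python
import shlex
import subprocess

def readonly_import_cmd(pool, altroot="/mnt"):
    """Build a zpool command that imports `pool` read-only without mounting datasets."""
    return [
        "zpool", "import",
        "-o", "readonly=on",  # ask ZFS not to write anything to the pool
        "-N",                 # import without mounting any datasets
        "-f",                 # force: the pool was last used by another install
        "-R", altroot,        # alternate root, keeps mountpoints away from the live system
        pool,
    ]

def try_import(pool, dry_run=True):
    """Print the command; run it only when dry_run is False."""
    cmd = readonly_import_cmd(pool)
    print(shlex.join(cmd))
    if dry_run:
        return None
    return subprocess.run(cmd, capture_output=True, text=True)

# Dry run only: shows what would be executed on the FreeNAS console.
try_import("volume1")
```

If the import succeeds, the datasets can then be mounted one at a time to copy data off before anything is written to the pool.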

As for the original server, I suggest a few troubleshooting steps - please do as many of these as you can and report back:

Test the server without the "bad" DIMM and with the good ones shuffled around (including to the slot where the "bad" DIMM was)
Try out the "bad" DIMM in one of the other slots
Try out the "bad" DIMM in a different server
Repeat these procedures with a different ECC-enabled CPU

Also, does the IPMI log contain anything related to memory?

Vossy · Dec 30, 2015

Just took a quick glance at the IPMI logs. I didn't realise they existed. Would've saved me a fair bit of time... There are a ton of errors that seem to be related to the issues I was having before I found the bad RAM.

Code:

Assertion: Memory| Event = Uncorrectable ECC@DIMMA2(CPU1)
Assertion: Memory| Event = Correctable ECC@DIMMA2(CPU1)

Repeated over and over again. Other than watchdog alerts when I reset the system, that's all that I see there.
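When a log repeats like this, a tally per DIMM slot and severity makes it easier to see how often each kind of event occurred. A minimal sketch, assuming the event log has been exported as plain text with one event per line in the format shown above; `tally_ecc_events` is a hypothetical helper, not part of any IPMI tool, and the sample text is made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
#   Assertion: Memory| Event = Uncorrectable ECC@DIMMA2(CPU1)
EVENT_RE = re.compile(r"Event = (Correctable|Uncorrectable) ECC@(\w+)\((\w+)\)")

def tally_ecc_events(log_text):
    """Count ECC events per (DIMM slot, severity)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = EVENT_RE.search(line)
        if match:
            severity, dimm, _cpu = match.groups()
            counts[(dimm, severity)] += 1
    return counts

# Illustrative sample only, not the real log.
sample = """\
Assertion: Memory| Event = Uncorrectable ECC@DIMMA2(CPU1)
Assertion: Memory| Event = Correctable ECC@DIMMA2(CPU1)
Assertion: Memory| Event = Correctable ECC@DIMMA2(CPU1)
"""

for (dimm, severity), n in sorted(tally_ecc_events(sample).items()):
    print(dimm, severity, n)
```

A slot that shows both correctable and uncorrectable events, as DIMMA2 does here, is the obvious first suspect.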

I unfortunately don't have another server-grade setup. I'll create a new freenas install and see if I can score there. Fingers crossed.

Vossy · Dec 30, 2015

On my attempt to import the volume through the GUI/wizard, the console throws up quite a lot of errors, then restarts. They fly by too fast to be read, I tried to grab a screenshot.

Ericloewe · Dec 30, 2015

Vossy said:
Just took a quick glance at the IPMI logs. I didn't realise they existed. Would've saved me a fair bit of time... There are a ton of errors that seem to be related to the issues I was having before I found the bad RAM.

Code:
Assertion: Memory| Event = Uncorrectable ECC@DIMMA2(CPU1)
Assertion: Memory| Event = Correctable ECC@DIMMA2(CPU1)

Repeated over and over again. Other than watchdog alerts when I reset the system, that's all that I see there.

I unfortunately don't have another server-grade setup. I'll create a new freenas install and see if I can score there. Fingers crossed.

Ok, so we are looking at a memory subsystem failure of some sort.

What bugs me is that the system should have been halted immediately upon detection of the uncorrectable error, so a pool that causes a panic on mount does not sound good at all.

At this point, I think it's time to call @cyberjock for some help. The panic doesn't bode well for the pool's prospects, but it might be fixable, since it's probably the result of some sudden system halt.

Vossy · Dec 30, 2015

Thanks for your help so far. Yeah, I don't hold high hopes for the data, as much as I'd love for it to be recoverable.

You have any idea what would have caused this? Could that faulty stick of RAM be to blame? Is it possible for RAM to (spontaneously?) develop a fault? My worry is that, if I don't know what's caused the problem that when I set it all up again, I'll have the same issues.

cyberjock · Dec 30, 2015

Yeah... you appear to need some help bro. :(

PM me and I'll see what I can do. I'm not sure there is much I can do, but I'd like to look at your system through Teamviewer and see what I can do.

It is totally possible that bad RAM is to blame (in fact, if history is any indicator, that *is* the cause of your problems). It is also totally possible that something else is wrong.

If you use Skype or some way of chatting please include your info in the PM.

SweetAndLow · Dec 30, 2015

I would be interested in the results of this. Seems like everything was done correctly and they are still seeing problems.

cyberjock · Dec 30, 2015

SweetAndLow said:
I would be interested in the results of this. Seems like everything was done correctly and they are still seeing problems.

^^^ That is precisely why I want to look at this, and why I'm going to offer help for free. On a 1-to-terrifying scale this is towards the "terrifying" side of things. It also proves that the code those software guys who curse my name every night before bed wrote, which allegedly makes it impossible for bad RAM to corrupt a zpool, didn't do what it was supposed to (I have always argued it had nearly a 0% chance of working properly).

DrKK · Dec 30, 2015

What really has me wondering what the hell is going on is that his IPMI shows an uncorrectable ECC error, yet the system did not seem to signal an MCE and halt. ****THAT**** is super problematic. In fact, it vitiates one of the main reasons for having ECC in the first place. I sure as hell would like to see how this resolves. Probably should direct ryao from #zfs to this thread too.

Ericloewe · Dec 31, 2015

Any way you put it, something here went very wrong.

If the system did halt, how did the pool end up in this state? After the first problem, it seems that a single file being written at the time got corrupted (something in a jail), which is understandable and (to me) not unexpected - the really important data would still be safe. So, why was the pool subsequently borked?
If the system did not halt, what the hell is going on? It's reputable server hardware, the memory is validated for the board (and seems to have worked) and it has every expectation of working.

Now that I think about it, something comes to mind: A sequence of 2+ bit errors that goes undetected long enough to cause damage. I'm not familiar with the error correction scheme used in ECC RAM, so I can't comment on the likelihood of this.

Bidule0hm · Dec 31, 2015

Ericloewe said:
that goes undetected long enough

Wait, the CPU doesn't check the parity each time it reads the RAM?

Ericloewe · Dec 31, 2015

Bidule0hm said:
Wait, the CPU doesn't check the parity each time it reads the RAM?

Yes, but I don't think that detection of errors larger than two bits per 72-bit word is guaranteed. Again, I don't know what forward error correction they use, so it's just a vague possibility.

Bidule0hm · Dec 31, 2015

So it would be something like this?

1 bit error ----> detected + corrected
2 bits error ---> detected
3+ bits error -> end of the universe...

Mirfster · Dec 31, 2015

Wanting to see what comes of this myself. Hope all goes well though...

Ericloewe · Dec 31, 2015

Bidule0hm said:
So it would be something like this?

1 bit error ----> detected + corrected
2 bits error ---> detected
3+ bits error -> end of the universe...

I think it's:

3+ bits error -> not all errors will be detected

But again, I'm not sure.

jgreco · Dec 31, 2015

Ericloewe said:
I think it's:

3+ bits error -> not all errors will be detected

But again, I'm not sure.

No, that's exactly correct. The error correction can even miscorrect at that point. What you're relying on is that undetected errors will not happen in a vacuum; errors tend to show up in multiple locations at once, so even if the error at 0x12345678 isn't detected, the error at 0x12345679 will be. Single-bit errors are supposed to be correctable. Bigger failures generally just result in a panic.
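The 1-bit / 2-bit / 3+-bit behaviour can be seen in a toy single-error-correct, double-error-detect (SECDED) code. The sketch below uses an extended Hamming (8,4) code purely as an illustration; it is an assumption that this resembles what the memory controller does, since ECC DIMMs protect 64 data bits with 8 check bits and the exact code used is not known here. All function names are made up for the example.

```python
DATA_POSITIONS = (3, 5, 6, 7)   # Hamming positions that carry the 4 data bits

def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    word = [0] * 8              # index 0 = overall parity, 1..7 = Hamming(7,4)
    for bit, pos in enumerate(DATA_POSITIONS):
        word[pos] = (nibble >> bit) & 1
    for p in (1, 2, 4):         # each check bit covers positions with that bit set
        word[p] = sum(word[i] for i in range(1, 8) if i & p) % 2
    word[0] = sum(word) % 2     # overall parity makes the total number of 1s even
    return word

def decode(word):
    """Return (status, data); data is None when the error is flagged as uncorrectable."""
    word = list(word)
    syndrome = 0
    for i in range(1, 8):
        if word[i]:
            syndrome ^= i       # XOR of set positions is 0 for a valid codeword
    odd_parity = sum(word) % 2 == 1
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        # Odd number of flips: the decoder assumes exactly one and "fixes" it.
        word[syndrome] ^= 1     # syndrome 0 means the overall parity bit itself
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: an even number of flips, detect only.
        return "uncorrectable", None
    data = sum(word[pos] << bit for bit, pos in enumerate(DATA_POSITIONS))
    return status, data

def flip(word, *positions):
    """Return a copy of word with the given bit positions inverted."""
    out = list(word)
    for p in positions:
        out[p] ^= 1
    return out

original = 0b1011
codeword = encode(original)
for positions in [(), (5,), (3, 6), (1, 2, 3)]:
    status, data = decode(flip(codeword, *positions))
    print(len(positions), "bit(s) flipped ->", status, data, "(stored:", original, ")")
```

In the 3-bit case the decoder reports "corrected" but hands back different data than was stored, which is the miscorrection described above. In this tiny code that happens for every 3-bit error; that is a property of the toy code and says nothing precise about the odds on a real 72-bit word.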

DrKK · Dec 31, 2015

I just spoke with ryao in #zfs. He agrees that this is disturbing, and informed some of the openZFS developers of this.

He was wondering if "memory scrub", which SuperMicro sometimes calls "Patrol Scrub", was turned on. However, I could not recall such a setting in the X10 series of boards, and sure enough I don't see any mention in the manual. On the X9 series, it's right in there, plain as day. Reading up on what it does, obviously such memory scrubs should be de rigueur, so... yeah, wonder what's going on?

But in any case, there should have been an MCE, and the system should have come down hard. But apparently, it did not. That's what I really want an answer for.

DrKK · Dec 31, 2015

By the way, congratulations @Vossy; you have successfully gotten more than one community into a tizzy over an interesting problem. That doesn't happen often.

Ericloewe · Dec 31, 2015

DrKK said:
I just spoke with ryao in #zfs. He agrees that this is disturbing, and informed some of the openZFS developers of this.

He was wondering if "memory scrub", which SuperMicro sometimes calls "Patrol Scrub", was turned on. However, I could not recall such a setting in the X10 series of boards, and sure enough I don't see any mention in the manual. On the X9 series, it's right in there, plain as day. Reading up on what it does, obviously such memory scrubs should be de rigueur, so... yeah, wonder what's going on?

But in any case, there should have been an MCE, and the system should have come down hard. But apparently, it did not. That's what I really want an answer for.

The X11 manuals also seem to not mention that at all.

Sounds like a little utility that injects ECC errors has just become a more urgent matter.
