Kernel panic on ZFS pool import

Status
Not open for further replies.

Vossy

Cadet
Joined
Feb 2, 2013
Messages
7
Hi

I'm having a strange problem. I get a kernel panic during boot when trying to mount the filesystem. This started two days ago when I heard the system reboot. I got an email alerting me that the pool was online but that a device had experienced an error which could result in data corruption. I ran zpool status and smartctl on each device to see what was going on (see below).

Code:
[root@freenas] /var/log# zpool status -v
  pool: freenas-boot
state: ONLINE
  scan: scrub repaired 0 in 0h2m with 0 errors on Tue Dec 29 03:47:35 2015
config:

        NAME                                          STATE     READ WRITE CKSUM
        freenas-boot                                  ONLINE       0     0     0
          gptid/24e9b3bd-592b-11e5-bcc2-0cc47a691a60  ONLINE       0     0     0

errors: No known data errors

  pool: volume1
state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 3h54m with 0 errors on Sun Nov 29 03:54:51 2015
config:

        NAME                                            STATE     READ WRITE CKSUM
        volume1                                         ONLINE       0     0    22
          raidz2-0                                      ONLINE       0     0    44
            gptid/7727335e-58ec-11e5-95ec-0cc47a691a60  ONLINE       0     0     0
            gptid/77d24c41-58ec-11e5-95ec-0cc47a691a60  ONLINE       0     0     0
            gptid/787d3011-58ec-11e5-95ec-0cc47a691a60  ONLINE       0     0     0
            gptid/792f59c3-58ec-11e5-95ec-0cc47a691a60  ONLINE       0     0     0
            gptid/79da0661-58ec-11e5-95ec-0cc47a691a60  ONLINE       0     0     0
            gptid/7a825715-58ec-11e5-95ec-0cc47a691a60  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /mnt/volume1/jails/maraschino_1/etc/spwd.db


smartctl output

I restarted the system, but it refused to boot and threw a lot of errors, the details of which escape me now. I then tried to run memtest, which also refused to boot, so on a hunch I removed all the RAM and ran memtest on each stick individually. As it turns out, one stick was apparently causing the problem. After removing the offending stick of RAM, I was able to run memtest overnight on the remaining 3 sticks (4 or 5 passes, no errors).

I then unplugged all the hard drives and booted the system. It started without issue.

Satisfied (and hoping the problem was resolved), I plugged them all back in and restarted the computer. It got all the way to the point where it tried to mount the ZFS pool, then I got a nasty kernel panic message and a lot of text (I feel like there was probably more than this, but the frames of the capture didn't get anything more):

BLJsXz0.png

9JzLsXX.png


I'm at a loss for where to go now. I feel like I need to try to mount the ZFS pool in read-only mode and scrub it, or something. I don't really hold high hopes for the data at this point, but I'd love to restore it if at all possible. Any help at all would be greatly appreciated. Let me know if I can provide any other information at this point.

System:
FreeNAS-9.3-STABLE-201512121950 64-bit, running off a SanDisk 8GB USB drive

Motherboard: Supermicro X10SL7-F
CPU: Intel Xeon E3-1231 v3
RAM: 4x8GB Crucial ECC RAM (CT2CP102472BD160B)
HDD: 6x3TB WD Red (WD30EFRX) in RAIDZ2 configuration
Boot device is a SanDisk 8GB USB drive

On creation of the system, memtest ran without fault overnight.
 

Attachments

  • vlcsnap-error197.png
  • vlcsnap-error643.png

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
To be honest, this sort of failure just should not happen.

Let's start by seeing what state the data is in. Try one of the following (from best option to least good):
  1. Try to import the pool in a similar/identical server (Server-grade stuff, 8+GB of ECC RAM) with a clean install of FreeNAS
  2. Try to import the pool in the original server with a clean install of FreeNAS
  3. Try to import the pool in a desktop you might have (8GB+ of RAM) with a clean install of FreeNAS
As for the original server, I suggest a few troubleshooting steps - please do as many of these as you can and report back:
  • Test the server without the "bad" DIMM and with the good ones shuffled around (including to the slot where the "bad" DIMM was)
  • Try out the "bad" DIMM in one of the other slots
  • Try out the "bad" DIMM in a different server
  • Repeat these procedures with a different ECC-enabled CPU

Also, does the IPMI log contain anything related to memory?
 

Vossy

Cadet
Joined
Feb 2, 2013
Messages
7
Just took a quick glance at the IPMI logs. I didn't realise they existed. Would've saved me a fair bit of time... There are a ton of errors that seem to be related to the issues I was having before I found the bad RAM.
Code:
Assertion: Memory| Event = Uncorrectable ECC@DIMMA2(CPU1)
Assertion: Memory| Event = Correctable ECC@DIMMA2(CPU1)

Repeated over and over again. Other than watchdog alerts when I reset the system, that's all that I see there.
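
For anyone wanting to summarise a long event log like this, here is a minimal Python sketch that tallies ECC events per DIMM and severity. It assumes the log has been saved as plain text with lines shaped like the two above; the function name `tally_ecc_events` and the regular expression are illustrative only, and the exact wording of these entries may differ between BMC firmware versions.

```python
import re
from collections import Counter

# Pattern modelled on the two sample lines above; this is an assumption,
# not an official IPMI event log format.
EVENT_RE = re.compile(
    r"Event\s*=\s*(Uncorrectable|Correctable) ECC@(\w+)\((\w+)\)"
)


def tally_ecc_events(lines):
    """Count ECC events keyed by (DIMM slot, CPU, severity)."""
    counts = Counter()
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            severity, dimm, cpu = match.groups()
            counts[(dimm, cpu, severity)] += 1
    return counts


sample = [
    "Assertion: Memory| Event = Uncorrectable ECC@DIMMA2(CPU1)",
    "Assertion: Memory| Event = Correctable ECC@DIMMA2(CPU1)",
    "Assertion: Memory| Event = Correctable ECC@DIMMA2(CPU1)",
]

for (dimm, cpu, severity), n in sorted(tally_ecc_events(sample).items()):
    print(f"{dimm} ({cpu}): {n} x {severity}")
```

If every entry names the same slot, as here with DIMMA2, that points at one module or one slot rather than the whole memory subsystem.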

I unfortunately don't have another server-grade setup. I'll create a new FreeNAS install and see if I can score there. Fingers crossed.
 

Vossy

Cadet
Joined
Feb 2, 2013
Messages
7
When I attempt to import the volume through the GUI/wizard, the console throws up quite a lot of errors, then restarts. They fly by too fast to be read, so I tried to grab a screenshot.
WslI7Be.png
 

Attachments

  • 2015-12-31 10_29_34-Java iKVM Viewer v1.69 r13 [10.0.0.33] - Resolution 720 X 400 - FPS 32.png

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Ok, so we are looking at a memory subsystem failure of some sort.

What bugs me is that the system should have been halted immediately upon detection of the uncorrectable error, so a pool that causes a panic on mount does not sound good at all.

At this point, I think it's time to call @cyberjock for some help. The panic doesn't bode well for the pool's prospects, but it might be fixable, since it's probably the result of some sudden system halt.
 

Vossy

Cadet
Joined
Feb 2, 2013
Messages
7
Thanks for your help so far. Yeah, I don't hold high hopes for the data, as much as I'd love for it to be recoverable.

Do you have any idea what would have caused this? Could that faulty stick of RAM be to blame? Is it possible for RAM to (spontaneously?) develop a fault? My worry is that if I don't know what caused the problem, I'll have the same issues when I set it all up again.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Yeah... you appear to need some help bro. :(

PM me and I'll see what I can do. I'm not sure there is much I can do, but I'd like to look at your system through Teamviewer and see what I can do.

It is totally possible that bad RAM is to blame (in fact, if history is any indicator, that *is* the cause of your problems). It is also totally possible that something else is wrong.

If you use Skype or some way of chatting please include your info in the PM.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
I would be interested in the results of this. Seems like everything was done correctly and they are still seeing problems.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
^^^ That is precisely why I want to look at this, and why I'm going to offer help for free. On a 1-to-terrifying scale this is towards the "terrifying" side of things. It also proves that the code written by those software guys who curse my name every night before bed, which allegedly makes it impossible for bad RAM to corrupt a zpool, didn't do what it was supposed to (I have always argued it had nearly a 0% chance of working properly).
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
What really has me wondering what the hell is going on is that his IPMI shows an uncorrectable ECC error, yet the system did not seem to signal an MCE and halt. ****THAT**** is super problematic. In fact, it vitiates one of the main reasons for having ECC in the first place. I sure as hell would like to see how this resolves. Probably should direct ryao from #zfs to this thread too.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Any way you put it, something here went very wrong.
  • If the system did halt, how did the pool end up in this state? After the first problem, it seems that a single file being written at the time got corrupted (something in a jail), which is understandable and (to me) not unexpected - the really important data would still be safe. So, why was the pool subsequently borked?
  • If the system did not halt, what the hell is going on? It's reputable server hardware, the memory is validated for the board (and seems to have worked) and it has every expectation of working.
Now that I think about it, something comes to mind: A sequence of 2+ bit errors that goes undetected long enough to cause damage. I'm not familiar with the error correction scheme used in ECC RAM, so I can't comment on the likelihood of this.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Wait, the CPU doesn't check the parity each time it reads the RAM?
Yes, but I don't think that detection of errors larger than two bits per 72-bit word is guaranteed. Again, I don't know what forward error correction they use, so it's just a vague possibility.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
So it would be something like this?
  • 1 bit error ----> detected + corrected
  • 2 bits error ---> detected
  • 3+ bits error -> end of the universe...
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I think it's:

3+ bits error -> not all errors will be detected

But again, I'm not sure.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
No, that's exactly correct. The error correction can even miscorrect at that point. What you're relying on is that undetected errors will not happen in a vacuum; errors tend to show up in multiple locations at once, so even if the error at 0x12345678 isn't detected, the error at 0x12345679 will be. Single-bit errors are supposed to be correctable. Bigger failures generally just result in a panic.
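
To make the 1-bit / 2-bit / 3+-bit behaviour concrete, here is a toy Python sketch of a SECDED (single error correct, double error detect) code. It is an assumption-laden illustration: it uses a tiny extended Hamming (8,4) code, whereas ECC DIMMs protect 64 data bits with 8 check bits, and the exact code a given memory controller uses may differ. The names `encode`, `decode` and `flip` are made up for this example.

```python
DATA_POS = (3, 5, 6, 7)  # data bit positions; 1, 2, 4 hold Hamming parity


def _syndrome(word):
    """XOR of the positions (1..7) of all set bits; 0 for a valid codeword."""
    s = 0
    for pos in range(1, 8):
        if word[pos]:
            s ^= pos
    return s


def encode(nibble):
    """Encode 4 data bits into 8 bits; index 0 is the overall parity bit."""
    word = [0] * 8
    for i, pos in enumerate(DATA_POS):
        word[pos] = (nibble >> i) & 1
    s = _syndrome(word)
    for p in (1, 2, 4):
        word[p] = 1 if s & p else 0
    word[0] = sum(word) % 2  # make the total number of ones even
    return word


def decode(word):
    """Return (status, nibble); nibble is None when the error is uncorrectable."""
    s = _syndrome(word)
    odd = sum(word) % 2 == 1
    fixed = list(word)
    if odd:
        # Odd number of flipped bits: the decoder assumes exactly one and
        # "fixes" it. With 3 flips this silently produces the wrong data.
        fixed[s] ^= 1  # s == 0 means the overall parity bit itself
        status = "corrected"
    elif s != 0:
        return "uncorrectable", None  # even number of flips, detected
    else:
        status = "clean"  # also reached by some undetected 4-bit errors
    return status, sum(fixed[pos] << i for i, pos in enumerate(DATA_POS))


def flip(word, *positions):
    """Return a copy of word with the given bit positions inverted."""
    out = list(word)
    for p in positions:
        out[p] ^= 1
    return out


good = encode(0b1011)
for label, bad in [
    ("no error", good),
    ("1 bit", flip(good, 5)),
    ("2 bits", flip(good, 5, 6)),
    ("3 bits", flip(good, 3, 5, 6)),
    ("4 bits", flip(good, 0, 1, 2, 3)),
]:
    print(label, decode(bad))
```

In this toy code the three-bit case comes back labelled "corrected" with data that no longer matches the original, and the chosen four-bit case is reported as "clean". That is the miscorrection and non-detection described above, just at a much smaller word size.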
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
I just spoke with ryao in #zfs. He agrees that this is disturbing, and informed some of the openZFS developers of this.

He was wondering if "memory scrub", which Supermicro sometimes calls "Patrol Scrub", was turned on. However, I could not recall such a setting in the X10 series of boards, and sure enough I don't see any mention in the manual. On the X9 series, it's right in there, plain as day. Reading up on what it does, obviously such memory scrubs should be de rigueur, so... yeah, wonder what's going on?

But in any case, there should have been an MCE, and the system should have come down hard. But apparently, it did not. That's what I really want an answer for.
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
By the way, congratulations @Vossy; you have successfully gotten more than one community into a tizzy over an interesting problem. That doesn't happen often.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
The X11 manuals also don't seem to mention Patrol Scrub at all.

Sounds like a little utility that injects ECC errors has just become a more urgent matter.
 