Kernel Panic on zpool import

Eniqmatic · Jul 5, 2016

Hi all,

Having a bit of an issue with one of our FreeNAS boxes for the past couple of days. I'm not really any closer to sorting it (other than figuring out what causes the panic).

Essentially what happens is, the FreeNAS box starts up as normal, then within about 2 minutes of being at the main menu on the console, a kernel panic appears to occur. I can't see what the first few lines of the panic are as they scroll by way too fast. It eventually ends with a line saying "reset" and the machine reboots. The cycle then starts again.

The problem is very similar to this post from last year, except that mine actually restarts rather than sitting there, and mine also did not have any issues on its last scrub.

I've narrowed it down to when the volume is imported: I re-installed onto the USB sticks (I did this when I thought it was an OS issue) and then left the machine running for about 1 hour without it restarting. Then as soon as I run the "zpool import" command, the machine panics.

I then thought it could be a bad stick of RAM and have run some diagnostics (they were fairly quick, running for about 3.5 hours for 2 x 4GB DIMMs). I could run more intensive tests if you think it is required.

If I mount it readonly within normal mode, it does appear to be fine and does not panic.

I then found this post and tried the first suggestion there. I booted into single user mode, at first without using the zfs_recover and zfs.debug command lines (so literally just entered single user mode without changing anything). I then ran the command "zpool import -R /mnt vol1" and this worked; I did not experience any restarts in about 20 minutes. So I restarted into normal mode again and it instantly started panicking again.

Back into single user mode with the zfs_recover and zfs.debug command and I've imported the volume again as normal with "zpool import -f -R /mnt vol1" (I had to use -f this time as it said it was previously used).

This is where I currently am and not sure what to do. It's currently running a scrub but it's going to take a little while. Not really sure if this will achieve anything!

Machine specs are:
Dell PowerEdge T20 With Xeon E3-1225 v3 @ 3.20GHz
8GB of ECC RAM (we have 16GB to go in soon)
Dual Kingston USB sticks
4 x 1TB WD RED (we have 4 x 4TB to go in soon)
FreeNAS-9.10-STABLE-201606270534 (dd17351) (think this is very latest)

I have 11 other machines exactly like this one and all of them are working fine at the moment, so I'm thinking it must be corruption on the volume somewhere!

maglin · Jul 5, 2016

Have you tried reinstalling FreeNAS and importing your config? It could be a bad USB stick.

Robert Trevellyan · Jul 5, 2016

Eniqmatic said:
If I mount it readonly within normal mode, it does appear to be fine and does not panic.

Based on some recent threads, if you can get the pool to import read-only, the simplest solution is to back up the data, destroy the pool, and start over.
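For reference, a read-only import like the one suggested here can be sketched as below. This is only an illustration: the thread does not say exactly how the pool was mounted read-only, so the use of `-o readonly=on` is an assumption, and the pool name `vol1` and altroot `/mnt` are taken from the commands quoted earlier. The helper prints the command rather than running it, so it can be reviewed first.

```shell
#!/bin/sh
# Sketch only: build a read-only import command for review before running it.
# POOL and ALTROOT default to the values mentioned in this thread.
POOL="${POOL:-vol1}"
ALTROOT="${ALTROOT:-/mnt}"

# readonly_import_cmd is a hypothetical helper name, not part of FreeNAS.
readonly_import_cmd() {
    # readonly=on asks ZFS to import the pool without writing to it.
    echo "zpool import -o readonly=on -R $ALTROOT $POOL"
}

readonly_import_cmd
```

If the printed command looks right, it can be pasted into a root shell. Adding `-f` may be needed if the pool reports that it was previously in use, as happened earlier in the thread.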

Eniqmatic · Jul 6, 2016

maglin said:
Have you tried reinstalling FreeNAS and importing your config? It could be a bad USB stick.

Yes unfortunately already tried this.

Robert Trevellyan said:
Based on some recent threads, if you can get the pool to import read-only, the simplest solution is to back up the data, destroy the pool, and start over.

I already have a full backup of the data as it's replicated to another box. It's more about saving a bit of hassle, plus the curiosity to figure out what has gone wrong! Also, if a bit of hardware has failed, I can get it replaced before it happens again. Of course, if it comes to it, I may have to go down this route!

Dice · Jul 6, 2016

- Ouff.
The way I'd approach this problem is to assume a recent hardware failure.
Proceed by doing the regular checks, most notably the RAM memtest.
Since you've other hardware, I'd definitely try swapping the hard drives into another system and try importing the pool. I.e., unhook the box2 boot drive, and all of its drives first.
Once you rule out the motherboard/psu/RAM/etc from the equation, I bet some users/devs will be interested in the details of logs.

Eniqmatic · Jul 6, 2016

Dice said:
- Ouff.
The way I'd approach this problem is to assume a recent hardware failure.
Proceed by doing the regular checks, most notably the RAM memtest.
Since you've other hardware, I'd definitely try swapping the hard drives into another system and try importing the pool. I.e., unhook the box2 boot drive, and all of its drives first.
Once you rule out the motherboard/psu/RAM/etc from the equation, I bet some users/devs will be interested in the details of logs.

I'm running memtest86 as we speak so I will report back the results, if any. The scrub that ran last night went OK and reported no errors. So I rebooted into normal mode and it crashed again.

Swapping the drives into a new system is a good idea. Annoyingly, this box is in a more remote location, but I will get it back here and try as you suggested.

Thanks for your help.

Robert Trevellyan · Jul 6, 2016

Crashing on import typically implies a corrupted pool. In other threads, this tends to result from one of two things:

1. Power outage.
2. Lack of RAM.

Can you rule out #1?

You have the minimum required 8GB, and your pool isn't large. Does the system do anything except serve shared storage over CIFS? Do you have anything silly in your setup, e.g. de-duplication enabled?

Do you feel like posting a debug file?

Robert Trevellyan · Jul 6, 2016

Eniqmatic said:
I'm running memtest86 as we speak

May as well run the T20's built-in hardware diagnostics too.

Eniqmatic · Jul 6, 2016

Robert Trevellyan said:
Crashing on import typically implies a corrupted pool. In other threads, this tends to result from one of two things:

1. Power outage.
2. Lack of RAM.

Can you rule out #1?

You have the minimum required 8GB, and your pool isn't large. Does the system do anything except serve shared storage over CIFS? Do you have anything silly in your setup, e.g. de-duplication enabled?

Do you feel like posting a debug file?

I'm 99% sure we didn't have a power outage, so hopefully it wasn't that.

No de-dup on, just the usual compression. It does only CIFS shares and nothing else, very basic setup indeed.

I will post the debug file once I can retrieve it. Which file is required from /data/crash, as there are a couple of files in there?

Robert Trevellyan said:
May as well run the T20's built-in hardware diagnostics too.

That was the first diagnostic I ran, and it tested OK the first time. Memtest has almost finished pass 3 of 4, with 0 errors reported so far.

Robert Trevellyan · Jul 6, 2016

Eniqmatic said:
Which file is required

In the GUI, go to System | Advanced | Save Debug. It can take a while to generate.

Eniqmatic · Jul 6, 2016

Robert Trevellyan said:
In the GUI, go to System | Advanced | Save Debug. It can take a while to generate.

Any ideas how I can stop the pool from mounting when running "normal mode"? If I start in single user mode then I don't get the GUI and if I start in normal it crashes. Can I start normal without importing the volume?

Robert Trevellyan · Jul 6, 2016

Right now, I can't think of anything that's going to result in a useful debug file. Starting with the disks disconnected means no SMART data, no pool status or history, etc. Maybe someone else has an idea.
:(

Eniqmatic · Jul 6, 2016

Robert Trevellyan said:
Right now, I can't think of anything that's going to result in a useful debug file. Starting with the disks disconnected means no SMART data, no pool status or history, etc. Maybe someone else has an idea.
:(

There is information in the debug files located in "/data/crash" I just can't figure out how to get them off. I'll try and see.

It's just passed the 5.5 hour memtest with no errors at all.

Eniqmatic · Jul 6, 2016

OK I have the panic file which contains 6000 lines, I can't see anything at all obvious at the start or the end to show why.

Anyone have any advice as to how to go about interpreting these?

Dice · Jul 6, 2016

Post it at http://pastebin.com/ and provide a link.

Eniqmatic · Jul 6, 2016

Dice said:
Post it at http://pastebin.com/ and provide a link.

Will do, I'm just sorting through it to remove lots of lines to make it easier, as it seems to have captured several reboots in one file.

Whilst I'm doing that the line that starts the panic is this:

panic: solaris assert: zap_count(mos, dsl_dataset_phys(ds)->ds_next_clones_obj, &count) == 0 (0x2 == 0x0), file: /tank/home/nightlies/build-freenas9/_BE/trueos/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dataset.c, line: 1733

Edit: Is it me, or is the path it's looking for (/tank/home/nightlies/build-freenas9/_BE/trueos/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dataset.c) a bit odd, given that I am on the stable build, and also my volume is called "vol1" not "tank" and therefore I don't have this path? Or is this not what it appears?
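When a crash file has captured several reboots, as described above, one way to find where each panic starts is to list only the lines beginning with `panic:` together with their line numbers. The sketch below assumes the panic lines start at the beginning of a line, as in the one quoted above. The function name `list_panics` and the sample log contents are made up for illustration and are not real crash output.

```shell
#!/bin/sh
# Sketch only: print each line that starts with "panic:" with its line number.
# list_panics is a hypothetical helper name.
list_panics() {
    grep -n '^panic:' "$1"
}

# Demonstration on a small made-up log (not taken from a real crash file).
sample=$(mktemp)
cat > "$sample" <<'EOF'
first boot messages
panic: example message one
cpuid = 0
second boot messages
panic: example message two
EOF
list_panics "$sample"
rm -f "$sample"
```

The line numbers make it easier to cut the file down to a single reboot before posting it.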

Robert Trevellyan · Jul 6, 2016

Eniqmatic said:
is this not what it appears?

The message is telling us which line in which file in the source code generated the panic message. It is not related to your pool.

Robert Trevellyan · Jul 6, 2016

Have you tried zpool import -fFn?

Eniqmatic · Jul 6, 2016

Robert Trevellyan said:
Have you tried zpool import -fFn?

I've used "zpool import -f" but not uppercase F or with the "n". What does the -n do?

Robert Trevellyan · Jul 6, 2016

-F means "Recovery mode for a non-importable pool."
-n means "don't actually do it, just see if it might work"
https://www.freebsd.org/cgi/man.cgi?zpool(8)
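Putting the flags above together, a cautious sequence might look like the sketch below: first the dry run with `-n`, then, only if that reports the pool could be recovered, the real recovery import. The pool name `vol1` and altroot `/mnt` come from earlier in the thread, and the helper names are hypothetical. The commands are printed rather than executed so they can be checked first. Per the zpool man page, `-F` works by discarding the last few transactions, so the second step is not free of risk.

```shell
#!/bin/sh
# Sketch only: print the two-step recovery import for review.
POOL="${POOL:-vol1}"
ALTROOT="${ALTROOT:-/mnt}"

# Step 1: -n with -F only reports whether recovery could work; nothing is changed.
recovery_check_cmd() {
    echo "zpool import -f -F -n $POOL"
}

# Step 2: the actual recovery import, which may discard the last few transactions.
recovery_import_cmd() {
    echo "zpool import -f -F -R $ALTROOT $POOL"
}

recovery_check_cmd
recovery_import_cmd
```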
