TrueNAS panics when importing dataset

JimKusz

Dabbler
Joined
Sep 10, 2018
Messages
19
Hi all:

I've been trying to recover from this for days now and have read lots of forum posts, but so far to no avail...

I have two Supermicro 36-bay servers with Xeon CPUs, ECC RAM, and NAS HDDs (WD Red / Seagate Ironwolf). The systems have been in production for >5 years; one was recently replaced as preventative maintenance since funds were available. They run in an active/backup configuration: the main system (the newest) serves up all shares via SMB, takes snapshots on a schedule, and replicates those snapshots to the backup. The backup is the system I'm focusing on here.

It has 24GB of RAM, 16 cores, and an LSI SAS card in IT mode, with about 140TB of storage online (a bunch of 12TB disks, some 8s), arranged as 4 raidz2 vdevs of 8 disks each. As I said, it had been running smoothly for over 5 years; the hardware was originally obtained used, so I know there's some age here.

On Sunday it finished a regularly scheduled scrub and e-mailed me that all was good. During the week, snapshots are replicated to this device from the primary every few hours. I believe the first one went OK; the 2nd one caused the master to generate a "lost connectivity" e-mail. I IPMI'ed into the console and found the machine had kernel panicked. I rebooted, and it panicked again.

I spent a lot of time troubleshooting and researching this. I found some forum posts about booting with the vfs.zfs.recover tunable set and such, and I've done that. Importing with zpool import -F didn't work (still panicked). Another post suggested -FX, and that panicked as well. However, importing read-only works with no problems, and zpool status shows everything healthy in that state. Of course, a scrub can't run on a read-only pool.
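
For reference, this is roughly the sequence I ended up with (pool name changed to "tank" here, and I'm going from memory on the exact flags):

  # tunable set at the loader prompt before booting
  vfs.zfs.recover=1

  # both of these still ended in a kernel panic
  zpool import -F tank
  zpool import -FX tank

  # this one works: the pool imports and zpool status is clean
  zpool import -o readonly=on tank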

I've also tried unplugging the boot drives and doing a clean install of the latest release (the original boot disks were one or two versions back), but that failed the same way, whether importing through the GUI or from the command line, with or without the vfs.zfs tunables set.

On this system there is very little data that inherently needs preserving (it's a backup system, after all). The only things that existed solely on it were a Linux VM (bhyve) with its disk in a zvol (it's a UniFi Video server, i.e. a camera NVR, and honestly the only thing I need out of it is a config backup), plus my Amanda jail, which I'd like to recover. I copied the Amanda jail off to another system, but the resulting disk usage was 2x what it is on the FreeNAS system. As for the zvol, I haven't figured out how to mount the disk inside it to grab the file. I tried zfs sending it to the master, but after running for about 18 hours the receive end was already over twice the size of the original (10.3TB) and still going.
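
In case the details matter, the send was essentially just the following (dataset and snapshot names are approximate; I used one of the existing replication snapshots, since a read-only pool can't take new ones):

  zfs send tank/vms/unifi-video@auto-20210110 | \
    ssh master zfs receive -v backuppool/rescue/unifi-video

...and it was the receiving side that blew past 2x the source size.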

So, currently my questions are:

1) Why can I import the pool read-only with no problems and everything showing good health, but as soon as I try to mount it read-write, the kernel panics?
2) Is there a way to get into the bhyve disk image from the zvol? Or why is the zfs send over 2x the size of the original?
3) Any suggestions on recovery? Recopying the snapshots is doable, but will be a pain... even at 10Gbps networking (which it has), it typically takes about a week...

Thanks!
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
1) Why can I import the pool read-only with no problems and everything showing good health, but as soon as I try to mount it read-write, the kernel panics?
You probably have a corrupt pool there... importing read-only somehow avoids touching the corruption deeply enough to trigger the kernel panic (to state the obvious).

2) Is there a way to get into the bhyve disk image from the zvol?
You may be able to dd it to a .raw file and attach that to another VM.
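With the pool imported read-only, something along these lines should do it (the pool/dataset names are just examples):

  # the zvol shows up as a device node under /dev/zvol/<pool>/<dataset>
  dd if=/dev/zvol/tank/vms/unifi-video of=/mnt/otherpool/unifi-video.raw bs=1M conv=noerror,sync

Then attach the .raw file to a VM on another box (bhyve and qemu take raw images directly) and pull the config out from inside.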

2) Why is the zfs send over 2x the size of the original?
Size bloat may relate to the corruption. The file you get may not be a healthy zvol.

3) Any suggestions on recovery? Recopying the snapshots is doable, but will be a pain... even at 10Gbps networking (which it has), it typically takes about a week...
Depending on how desperate you are to get the data back... the usual response is to rebuild that pool and restore from backup (though I guess we wouldn't be discussing this if you had one... it is, after all, itself the backup, so why would you).
You might want to look at the SMART data (and test schedule) for the disks in that pool, since the corruption probably started on one or more of them. You may need to replace bad disks before recreating the pool.
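
A quick way to eyeball all of them from the shell (FreeBSD will list the disk device names for you):

  for d in $(sysctl -n kern.disks); do
    echo "=== ${d} ==="
    smartctl -H -A /dev/${d}   # overall health verdict plus the attribute table
  done

Reallocated_Sector_Ct, Current_Pending_Sector and Offline_Uncorrectable are the usual suspects.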

Other "recovery" options are potentially very time consuming and need additional storage to run (and a Windows box to attach/copy disk images to) with Klennet (free to test recoverability, pay to get data out). I don't think that's the droids you're looking for.
 

AlexGG

Contributor
Joined
Dec 13, 2018
Messages
171
Why can I import the pool read-only with no problems and everything showing good health, but as soon as I try to mount it read-write, the kernel panics?

The write path touches more things than the read path, and thus requires more filesystem elements to be correct. Everything shows good health because the self-testing is mostly designed to verify readability and data integrity; implementation complexity and cost rise sharply as you ask for more thorough checks. As soon as you mount it R/W, it touches something inconsistent and panics.
 

JimKusz

Dabbler
Joined
Sep 10, 2018
Messages
19
Ok...

I'm still quite confused as to what went wrong. All my disks are good, and there were no issues when it went down. It's enterprise-grade hardware on a UPS (with redundant PSUs and no known power issues around the failure time). Suddenly my filesystem is corrupt beyond rescue, with no cause... As this is an enterprise system, the expectation is that I isolate the cause and take steps to ensure it doesn't recur. However, I'm not aware of anything I could do differently (other than replacing the computer outright, given its sheer age). I've run a memory test and a CPU burn-in and haven't found any issues. I'm also not sure how to check current SMART data on the entire array (32 disks) from the command line; the box currently only boots into single-user mode, since it crashes when it imports the array on a normal boot.

I'm also very confused as to why it kernel panics... I would have expected some level of protection against reading bad data, for example marking the disk bad. After all, that's supposed to be part of what makes ZFS so stable. It can handle a disk failing (or even a couple, if they're the "correct" disks), so why should 1-2 failing disks collapse the entire array to the point that the kernel panics just trying to mount it?

So I'm still concerned that my entire storage array collapsed for unknown reasons, despite all best practices being followed, with no way to determine why it failed. In the filesystem world, that's generally what gets something labeled "unstable" or "beta"....

So now I need to move on. My OS drives are still intact, with all the settings and such just fine. However, I have no idea how to "wipe" the current storage in a way that lets the existing OS keep its settings (SSH keys and such).
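
Would something along these lines be the right idea? Totally guessing here (da0/da1 are placeholders, and I'd obviously keep well away from the boot devices):

  # with the data pool exported / not imported
  zpool labelclear -f /dev/da0
  zpool labelclear -f /dev/da1
  # ...and so on for the rest of the data disks, then recreate the pool
  # and shares from the GUI, which should leave the boot pool (and the
  # config, SSH keys, etc.) untouched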
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
I'm still quite confused as to what went wrong. All my disks are good, and there were no issues when it went down. [...]
So what version of FreeNAS? There was a rare corruption bug in 12.0-U1, I think. You can read through that bug report to see if your use case fits the scenario. Other than that, it's super strange to have the failure you had; there should be some kind of reason behind it.
 

JimKusz

Dabbler
Joined
Sep 10, 2018
Messages
19
It was running a version of 11.something when it crashed (I think it was 11.2-U7, but I could be wrong). I installed from the 12.0-U1.1 ISO in my attempts to recover, and that's probably what's booted right now. However, that is a secondary boot environment, and the original 11.x one should still exist. The failure happened while booted into 11.x.

Just checked, and I see that I do in fact have freenas-boot/ROOT/11.2-U7 as the latest boot environment in the 11 series. My recovery environment is TrueNAS-12.0-U1.1 (401ffb1d98)
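
(For anyone following along, I listed the boot environments with:

  zfs list -r freenas-boot/ROOT

since each boot environment is just a dataset under there.)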

Thanks!
 