Boot loop/Kernel Panic Mounting Pool

Orionebula

Dabbler
Joined
Nov 21, 2022
Messages
11
Hello all,

Not sure if this is the right place to post this, but hopefully I can get some help here.

My TrueNAS Core system, which had been working perfectly for months without changes or known updates, suddenly stopped working. When I checked the system it was stuck in a boot loop: it would boot, try to mount the single pool the system was responsible for, and then hit a kernel panic before rebooting. Rinse and repeat.

I cannot get the system to a point where I can definitively confirm what the pool settings were, but if I recall it was set up with 2 vdevs in ZFS2.

I could get the system to launch properly into TrueNAS by unplugging the drives associated with the pool, and TrueNAS then seems to function normally. If I hot-plug the drives back in after TrueNAS has launched, the system recognizes them, and each drive appears to be healthy according to SMART testing. However, when I attempt to import/mount the pool, the kernel panic occurs again and the system reboots.

I removed the boot drive and created a separate new Truenas installation using SCALE and also get a crash when I attempt to import the pool.

I created a separate new boot drive with Ubuntu. Ubuntu recognizes the drives and the pool name (it labeled all 8 drives with the name of the pool), but it also showed two "mountable" drives labeled bpool and rpool (which I am assuming represented the vdevs). Attempting to import the pool with "sudo zpool import -f xxxxxx" does not appear to do anything at all: I get no response, the terminal just sits there claiming to be running a task, and no system resources seem to actually be in use.

Google searching seems to suggest that this: https://www.truenas.com/community/threads/pool-import-or-system-boot-causes-kernel-panic.83370/ is the closest similar issue, but I have no idea how to actually fix it.

Suggestions?
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Please supply complete hardware details as well as the pool layout (make, model & number of disks, in what configuration). Plus, which ports are used to connect the disks to the server (motherboard SATA, what type of PCIe card, etc.).

Next, supply the output of zpool import in code tags.
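(For reference, run it as root; with no pool name the command only scans for importable pools and does not actually import anything, so it should be safe to run even against your misbehaving pool:)
Code:
zpool import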

Note that there is nothing called ZFS2. We can guess you mean RAID-Z2. It is helpful to use the normal ZFS terminology:
Terminology and Abbreviations Primer
 

Orionebula

Dabbler
Joined
Nov 21, 2022
Messages
11
Thank you for your response.

Note that there is nothing called ZFS2. We can guess you mean RAID-Z2.
My apologies, yes, RAID-Z2 is what I meant to write.

Please supply complete hardware details as well as the pool layout (make, model & number of disks, in what configuration). Plus, which ports are used to connect the disks to the server (motherboard SATA, what type of PCIe card, etc.).
CPU: Intel 4690k (Also tried on AMD 3400g)
Mobo: MSI Z97 PC Mate (Also tried on Gigabyte B450 Aorus M)
Memory: Corsair 8GBx4 2400MHz (Also tried on GSkill 8GBx2 3000MHz)
PCIe controller: LSI 9240-8i to SATA with all pool drives routed through this controller
NIC: Intel X540 T2
Boot Drive: Samsung 870 EVO SATA SSD, connected directly to the mobo on the original system (for the SCALE trial, booting from a USB drive)

While I am not 100% certain of how I originally set this up, I believe the pool was set up as follows:
Single pool with 8 HDDs, 2 vdevs each with 4 drives
Vdev1 - Seagate exos ST8000NM0055 8TB x 4 drives
Vdev2 - Seagate ironwolf ST8000VN004 8TB x 4 drives


zpool import results in:
Code:
root@truenas[~]# zpool import
   pool: pool8TBx8
     id: 130701487088135189
  state: ONLINE
status: The pool was last accessed by another system.
 action: The pool can be imported using its name or numeric identifier and
        the '-f' flag.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-EY
 config:

        pool8TBx8   ONLINE
          raidz2-0  ONLINE
            sdh2    ONLINE
            sdg2    ONLINE
            sdf2    ONLINE
            sde2    ONLINE
            sdc2    ONLINE
            sdb2    ONLINE
            sda2    ONLINE
            sdd2    ONLINE


Currently I am using the AMD system and SCALE because that combination does not boot loop and I am able to access the web GUI, but the same issue of a kernel panic occurs each time I attempt to import the pool.

Please let me know if I can provide further information, and thank you again.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
The pool seems healthy as far as the import data shows. (Occasionally we see problems at this level.)

The pool is a single RAID-Z2 vDev of 8 x 8TB disks. This should be okay as well. (8 disks in a RAID-Z2 is fine, 12 or more tends to get problematic.)

The amount of memory, either 4 x 8GB or 2 x 8GB, is normally fine for a generic ZFS pool.


Are you / were you using ZFS De-Duplication?
Do you remember about how many ZFS snapshots your pool might have?


Both of the above can cause problems. I vaguely remember someone with tens of thousands of snapshots having problems. And de-dup can require more memory on import than is available, though I would not think it would crash the server.
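
If you ever do get the pool imported, even read-only, something along these lines should answer both questions (using the pool name from your earlier output):
Code:
zfs get -r dedup pool8TBx8                  # dedup setting per dataset
zpool get dedupratio pool8TBx8              # pool-wide dedup ratio (1.00x means effectively none)
zfs list -t snapshot -r pool8TBx8 | wc -l   # rough snapshot count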
 

Orionebula

Dabbler
Joined
Nov 21, 2022
Messages
11
Are you / were you using ZFS De-Duplication?
Do you remember about how many ZFS snapshots your pool might have?

Clearly I betrayed my ignorance of my own system by not recalling how many vdevs I set up, but to the best of my knowledge the setup uses pretty standard default settings from a FreeNAS install migrated to TrueNAS Core. As far as I know, I was not using de-duplication and left whatever the default setting was.

Both of the above can cause problems. I vaguely remember someone with tens of thousands of snapshots having problems. And de-dup can require more memory on import than is available, though I would not think it would crash the server.

As for snapshots, again it should have been the default frequency, and with the system/pool being just under a year old I would be surprised if the system had made that many snapshots.

Regarding memory, the kernel panic happens in an instant on import and never registers any increase in memory usage on the system monitor (though I confess I know little about how the system is designed to check and allocate memory for these purposes).

Any thoughts on using the "clear" function on the pool, or any other mechanism for approaching this?

Thanks!
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Hmm, this is tricky, hopefully someone else can chime in.


My only suggestion at present is to attempt some of the recovery options. For example, try the below. You might need the lower case "f" too, as specified in your pool import test above:
Code:
zpool import -Fn pool8TBx8
zpool import -Fnf pool8TBx8

If that works, you can try importing your pool read-only without mounting anything:
Code:
zpool import -NFf -o readonly=on pool8TBx8

And work your way up to getting the datasets mounted:
Code:
zpool import -Ff -o readonly=on -R /mnt pool8TBx8

If you have a way to copy off the data, that would be helpful. I would not trust the pool after this, though the data itself is likely trustworthy.
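
If a read-only import ever holds, a plain file-level copy to some other disk is the least risky way to get the data off; a rough sketch, with a hypothetical destination path:
Code:
zpool import -Ff -o readonly=on -R /mnt pool8TBx8
rsync -avh --progress /mnt/pool8TBx8/ /mnt/backup-disk/pool8TBx8-copy/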
 

Orionebula

Dabbler
Joined
Nov 21, 2022
Messages
11
I appreciate your efforts.

zpool import -Fn pool8TBx8
Results in:
Code:
root@truenas[~]# zpool import -Fn pool8TBx8
cannot import 'pool8TBx8': pool was previously in use from another system.
Last accessed by  (hostid=e3bbb42f) at Tue Nov  8 08:45:46 2022
The pool can be imported, use 'zpool import -f' to import the pool.


But then using:
Code:
zpool import -f pool8TBx8
Results in a crash

zpool import -Fnf pool8TBx8
Results in a crash


Code:
zpool import -NFf -o readonly=on pool8TBx8

zpool import -Ff -o readonly=on -R /mnt pool8TBx8

Each seems to get accepted, but unfortunately I see no discernible changes in the GUI and no pools that I can interact with. Curiously, after entering either of those, if I attempt to look up the snapshots, the system crashes then too.

Where to next? Since it doesn't freak out too much in read-only mode, is there a way to then access the data?
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
I don't have many more suggestions.

If you get no crash with either of the last 2 commands, perform this from the command line:
Code:
zfs list -t all -r pool8TBx8

and paste the results here, (assuming it does not crash).

Note that the "-F" option throws away a few of the last transactions in an attempt to import pool without corruption. (Because something recently written corrupted the pool.)


There is also a ZFS debugger, zdb, which can be useful, but I have limited experience with it.
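
For what it's worth, here are a couple of read-only invocations of that debugger (zdb) that are sometimes used to poke at a pool without importing it; the device name is one of the members from your zpool import output:
Code:
zdb -e -d pool8TBx8     # walk the dataset/object tree of the exported pool
zdb -lu /dev/sda2       # dump the ZFS label and uberblocks from one pool member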
 

Orionebula

Dabbler
Joined
Nov 21, 2022
Messages
11
Tried both of the read-only imports with no crash, but when I run zfs list -t all -r pool8TBx8 I get the crash again.

Open to trying other things.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
IIRC I read a similar thread somewhere on this forum; the issue was something about the mountpoints being wrong.
Don't know if this rings a bell for someone with a better understanding of how ZFS works.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Sorry, I have no further suggestions except recovery service.
Perhaps someone else can suggest something.
 

Orionebula

Dabbler
Joined
Nov 21, 2022
Messages
11
Thank you everyone for your help so far. With regards to the question of mounting points potentially being wrong/damaged - is there anywhere I might turn to start sorting that out?
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Thank you everyone for your help so far. With regards to the question of mounting points potentially being wrong/damaged - is there anywhere I might turn to start sorting that out?
You will have to search the forum; sadly I can't remember much about that thread.
 

Orionebula

Dabbler
Joined
Nov 21, 2022
Messages
11
Alrighty, so I found this discussion: https://www.truenas.com/community/t...-next-steps-be-with-these-odd-symptoms.94697/

I am not sure, but it seems like it may be worth trying their steps of: "If the mountpoint needs changing, stop the shares/jails/VMs that need that pool and export it, then zpool import -R /mnt POOL then zpool export POOL, re-import it via GUI and restart all your services, etc."

I believe that the system will crash if I use:
Code:
zpool import -R /mnt pool8TBx8

#but I might be able to get away with trying 

zpool import -Ff -o readonly=on -R /mnt pool8TBx8  #and then
zpool export pool8TBx8


I might just go ahead and try this, but I don't know if there is a risk of data destruction with such an approach. Thoughts?
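
If it really is just a bad mountpoint property, my understanding is that it could be inspected and reset with something like the following once a writable import works (read-only would block the set); this is a sketch, not something I have tried:
Code:
zfs get -r mountpoint pool8TBx8
zfs set mountpoint=/mnt/pool8TBx8 pool8TBx8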
 

Orionebula

Dabbler
Joined
Nov 21, 2022
Messages
11
Oh, also, I was able to get the "zpool history" for my pool. It's awfully long, so I won't clutter things by posting it here unless we really need it, but it looks like the system started a scrub at around midnight and then, every few minutes for a few hours, repeatedly logged things like:

Code:
2022-11-06.00:22:50 zpool import 130701487088135189 pool8TBx8
2022-11-06.00:22:50 zpool set cachefile=/data/zfs/zpool.cache pool8TBx8
2022-11-06.00:23:09 zfs snapshot pool8TBx8/.system/samba4@wbc-1667719374

It apparently did this until it went into the boot loop with the kernel panic. No idea if this means anything or not, but it's something else to throw in here.
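
(For anyone following along, that history comes straight from the pool's built-in command log; the read-only import appears to be enough to read it:)
Code:
zpool history pool8TBx8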
 

Orionebula

Dabbler
Joined
Nov 21, 2022
Messages
11
Some further follow-up: from what I can surmise, it seems like this has been an ongoing issue for years. One example is here: https://www.truenas.com/community/threads/panic-mounting-local-filesystems.18752/

Unfortunately it seems like few if any have managed to recover/restore the pool after something like this happens.

While there does not seem to be a definitive answer as to why this happens, there do seem to be some common factors in the dozen or so incidents I have read about:
1) No UPS and a power failure close in time (my power failure was 4 days prior and the system continued working for a while afterwards, but still)
2) Using non-ECC memory

What does NOT seem to be a factor so far might include:
1) Overall size/complexity of the pool (it happens with as few as 1 or 2 HDDs), although it wouldn't surprise me if incidents increase with complexity if the issue is indeed a bit-flip
2) Amount of memory in the system, though getting down to <4GB may introduce problems
3) CPU type
4) Truenas/Freenas iteration - in fact, this does not seem to be unique to this system at all

I have read about similar issues on both FreeBSD- and Linux-based systems, and while I am clearly no programmer, it would seem that this is an issue with how ZFS handles data corruption/faults/power failures and then attempts recovery. In other words, I suspect this is a ZFS issue at heart and not a TrueNAS problem.

Perhaps it's just my ignorance and whatever is going on makes perfect sense when one knows enough background (I could be at the high point of the Dunning-Kruger curve on this), but it would seem to me that in a system taking regular snapshots, as mine was, the software should be able to look back sequentially (without going into a kernel panic) until it finds a healthy snapshot and then launch/mount/import from that, unless there is a hardware failure (which in none of these cases seemed to be the issue). I'd really like to get the thoughts of a developer on this, but have no idea how to get it in front of them.
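
For what it's worth, OpenZFS does seem to have machinery roughly along those lines: the -F import option rewinds a few transaction groups, and there are undocumented options (-X, and -T to pick a specific transaction group found with zdb) that rewind further. I have not tried them and they may not exist or behave the same on every build, so treat this as a heavily hedged sketch, read-only, with a made-up txg number:
Code:
zdb -lu /dev/sda2                                        # list uberblocks with their txg numbers
zpool import -f -N -o readonly=on -T 1234567 pool8TBx8   # hypothetical txg; undocumented option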

Part of my contention about why this is important is that TrueNAS and other services like it seem to market their software/systems on the idea that they play nice with consumer hardware, and that does not really seem to be entirely true. ECC memory and a UPS are good data hygiene practices, but I wouldn't say they are typical consumer hardware. It's one thing to have it out there that one could take a decommissioned server and re-deploy it as a home server/NAS, and another to suggest that your old gaming computer can be turned into a stable NAS. Perhaps both I and the others who have encountered this are in the vast minority, but it seems very strange to market this as a method of building in data safety and redundancy against hardware failure and then leave such a vulnerability to software corruption of a data pool. Or maybe this feeling is just me having sour grapes over not getting my pool back! :wink:

In any case, I have a data recovery program working on getting back the data I had not yet backed up. After that, if others have riskier ways of attempting to recover the pool itself (so that others in the future might have a solution), I will be game to give those ideas a try!

Thanks again to those that have tried to help so far!
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Part of my contention about why this is important is that TrueNAS and other services like it seem to market their software/systems on the idea that they play nice with consumer hardware, and that does not really seem to be entirely true. ECC memory and a UPS are good data hygiene practices, but I wouldn't say they are typical consumer hardware. It's one thing to have it out there that one could take a decommissioned server and re-deploy it as a home server/NAS, and another to suggest that your old gaming computer can be turned into a stable NAS. Perhaps both I and the others who have encountered this are in the vast minority, but it seems very strange to market this as a method of building in data safety and redundancy against hardware failure and then leave such a vulnerability to software corruption of a data pool. Or maybe this feeling is just me having sour grapes over not getting my pool back! :wink:
That's why almost everyone here strongly suggests using ECC memory (which usually entails server-grade hardware).
Although there are a few users who strongly crusade against the ECC recommendation, they are a (loud?) minority.

It's very unusual for the user base of this forum to suggest the use of consumer-grade hardware; there is plenty of bad information on YouTube and other forums about TrueNAS. This community can do little besides pointing you to the guides it puts at your service, as well as the collective wisdom of a lot of people with hands-on experience in the field (spoiler: not me).

If you look around you will find plenty of posts where the risks of not using ECC, and of using "common" or improper hardware, are emphasized; often those warnings are scoffed at and seen as a form of elitism and arrogance by other communities.
What saddens me is that despite the warnings about not using ECC, and the plenty of threads about data lost where ECC could probably have saved the day, such threads continue to increase.

I'm not, in any way, attacking or judging you; I just can't understand some people who strongly dispute that ECC is a requirement if you care about your data.
I wish you luck in your recovery; I can understand your pain.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
ZFS was specifically designed to have zero data loss on unexpected power-offs (aka crashes). The only data you can lose is data in flight (just like with any other file system). When a crash occurs during writes, either the full set of data was written and is available afterwards, or none of the in-flight data is available.

There are exceptions to this: bad hardware. For example, if:
  • A storage device lies about flushing its write cache
  • A drive re-orders writes
  • A write-cache-based hardware RAID controller is used
  • Or, potentially, non-ECC RAM has errors
When Sun Microsystems designed and tested ZFS, they did not anticipate the massive number of users on home & consumer hardware. Thus, those exceptions generally don't apply to actual server-grade hardware designed for NAS use.
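
(If anyone wants a data point on the write-cache items with consumer drives, the drive's volatile write-cache setting can at least be inspected. It won't prove whether a drive lies about flushes, but it's something; device name hypothetical, commands as on a Linux-based SCALE install:)
Code:
smartctl -g wcache /dev/sda     # report whether the drive's volatile write cache is enabled
hdparm -W /dev/sda              # same query via hdparm (SATA drives)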
 

Orionebula

Dabbler
Joined
Nov 21, 2022
Messages
11
I was able to successfully recover all of my data using the Klennet ZFS Recovery program. It's not free, but it was cheaper than the alternatives, and Alexey was responsive to the questions I had. (https://www.klennet.com/zfs-recovery/default.aspx)

Interestingly, this issue was actually covered in the last (as of this writing) blog post he wrote, which relates to issues with ZFS and power failures: https://www.klennet.com/notes/2022-06-22-zfs-and-power-failures-revisited.aspx

In sum, I think my issue was one of power failure more than anything else. I suspect the pool became corrupted after a power failure, and once TrueNAS started trying to create a snapshot, the pool broke and would no longer mount (again, this is based on the sequence of events more than any hard knowledge of the process). I suppose my frustration with this is that in all my years of "bad data practices", NTFS never failed me in this way, even on RAID setups. Perhaps this is just my playing chicken with data loss catching up with me, but then again, there's something to be said for idiot-proofing (me being the idiot) your software.

Anyway, last call for ideas on attempted recovery before I destroy (create over) the old pool just for posterity?
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Can you try a zpool clear and then a zpool import? After any other suggestions have been tried, of course.
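
A sketch of what that might look like; note that zpool clear operates on an imported pool, so the order would probably have to be import first (read-only to be safe, though clear may be refused on a read-only pool), then clear, export, and retry a normal import:
Code:
zpool import -Ff -o readonly=on -R /mnt pool8TBx8
zpool clear pool8TBx8                # may be refused while read-only
zpool export pool8TBx8
zpool import -f -R /mnt pool8TBx8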
 