Pool Import or System Boot Causes Kernel Panic

jasonsansone (Explorer · Joined Jul 18, 2019 · 79 messages)

My FreeNAS build has been rock solid since being built months ago... until last night. Something caused the NAS to panic and crash. I woke up to it in a constant loop of boot, panic, reboot, repeat. There wasn't a power loss, as the system is on redundant UPSes and my home didn't lose power. After the crash, my main pool can't be imported into a fresh config install, nor will the system boot with the original config.

System Specs:
Chassis: SuperMicro CSE-864
Motherboard: SuperMicro X9DRi-F
RAM: 8 x 16GB 1866 MHz ECC (128GB total)
NIC: NC560SFP+ for 10GbE and Intel i350 for 1GbE
Hard Drives: Shucked WD white label 5400 rpm 10TB
HBA: LSI SAS 9207-8i
CPU: 2x Intel E5-2697 v2
OS: FreeNAS 11.3-U1
Boot Pool: Mirrored Samsung SSDs
L2ARC: None
SLOG: None

The pool in question is 2 vdevs of six drives in RAIDZ2 (12 drives total, 120TB raw capacity). The pool was under heavy I/O load when it crashed, but I don't care if there is any data loss related to those in-flight writes. The storage is for media, and multiple transcodes were running when the system went down. Those files will be corrupted and will need to be re-run regardless. My primary concern is recovery of the pool.

It's also important to note that I use a replication task to clone a single NVMe drive to the main pool each night. All jails and VMs are on the NVMe, so I back it up nightly to the main pool. The system crashed during the replication task; the main pool that now cannot be imported had a zfs receive in progress.
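For reference, the nightly task boils down to an incremental snapshot send from the NVMe pool into the main pool, roughly like this (pool and snapshot names here are made up for illustration; the real task is configured through the FreeNAS replication GUI):
Code:
# rough shape of the nightly replication; source pool and snapshot names are illustrative
zfs snapshot -r nvme@nightly-2020-03-25
zfs send -R -i nvme@nightly-2020-03-24 nvme@nightly-2020-03-25 \
    | zfs recv -s -F home-main/nvme-backup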

The pool can only be mounted read-only using "zpool import -o readonly=on -fF -R /mnt home-main". A read-only import avoids the kernel panic, but the console did output "freenas savecore: /dev/ada0p3: Operation not permitted". Using the same command without "-o readonly=on", or booting normally, results in the following kernel panic backtrace:

Code:
panic: Solaris(panic): blkptr at 0xfffffe0036f5d580 has invalid TYPE 101
cpuid = 3
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe202d7c9960
vpanic() at vpanic+0x17e/frame 0xfffffe202d7c99c0
panic() at panic+0x43/frame 0xfffffe202d7c9a20
vcmn_err() at vcmn_err+0xcf/frame 0xfffffe202d7c9b50
zfs_panic_recover() at zfs_panic_recover+0x5a/frame 0xfffffe202d7c9bb0
zfs_blkptr_verify() at zfs_blkptr_verify+0x53/frame 0xfffffe202d7c9bf0
zio_read() at zio_read+0x2c/frame 0xfffffe202d7c9c30
arc_read() at arc_read+0x754/frame 0xfffffe202d7c9ce0
traverse_prefetch_metadata() at traverse_prefetch_metadata+0xbd/frame 0xfffffe202d7c9d20
traverse_visitbp() at traverse_visitbp+0x9dc/frame 0xfffffe202d7c9de0
traverse_visitbp() at traverse_visitbp+0x430/frame 0xfffffe202d7c9ea0
traverse_visitbp() at traverse_visitbp+0x430/frame 0xfffffe202d7c9f60
traverse_visitbp() at traverse_visitbp+0x430/frame 0xfffffe202d7ca020
traverse_visitbp() at traverse_visitbp+0x430/frame 0xfffffe202d7ca0e0
traverse_visitbp() at traverse_visitbp+0x430/frame 0xfffffe202d7ca1a0
traverse_dnode() at traverse_dnode+0xd3/frame 0xfffffe202d7ca210
traverse_visitbp() at traverse_visitbp+0x703/frame 0xfffffe202d7ca2d0
traverse_impl() at traverse_impl+0x317/frame 0xfffffe202d7ca3f0
traverse_dataset_destroyed() at traverse_dataset_destroyed+0x2b/frame 0xfffffe202d7ca420
bptree_iterate() at bptree_iterate+0x15f/frame 0xfffffe202d7ca570
dsl_scan_sync() at dsl_scan_sync+0x43a/frame 0xfffffe202d7ca770
spa_sync() at spa_sync+0xb67/frame 0xfffffe202d7ca9a0
txg_sync_thread() at txg_sync_thread+0x238/frame 0xfffffe202d7caa70
fork_exit() at fork_exit+0x83/frame 0xfffffe202d7caab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe202d7caab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic



There do not appear to be any physical or mechanical failures. All drives show as online and pass SMART testing. Everything was burned in before originally being deployed.
Code:
root@freenas[~]# zpool import
   pool: home-main
     id: 8732520593021902914
  state: ONLINE
action: The pool can be imported using its name or numeric identifier.
config:

    home-main                                       ONLINE
      raidz2-0                                      ONLINE
        gptid/14b5f8a8-0f34-11ea-b892-002590e49b70  ONLINE
        gptid/22f53628-0f34-11ea-b892-002590e49b70  ONLINE
        gptid/328e501f-0f34-11ea-b892-002590e49b70  ONLINE
        gptid/4168fa93-0f34-11ea-b892-002590e49b70  ONLINE
        gptid/50d48acd-0f34-11ea-b892-002590e49b70  ONLINE
        gptid/5b342562-0f34-11ea-b892-002590e49b70  ONLINE
      raidz2-1                                      ONLINE
        gptid/439e60f3-1223-11ea-b1fa-002590e49b70  ONLINE
        gptid/446b5b24-1223-11ea-b1fa-002590e49b70  ONLINE
        gptid/453c49d6-1223-11ea-b1fa-002590e49b70  ONLINE
        gptid/461a4f2a-1223-11ea-b1fa-002590e49b70  ONLINE
        gptid/46df8761-1223-11ea-b1fa-002590e49b70  ONLINE
        gptid/47a44bf9-1223-11ea-b1fa-002590e49b70  ONLINE
 

Attachments

  • debug-freenas-20200325162823.tgz (540.1 KB)

jasonsansone (Explorer · Joined Jul 18, 2019 · 79 messages)

"... causes FreeNAS to KeePass?" :) Edited the post.

Apologies. I know "KP" is an acronym and I should spell it out (kernel panic), but I also made the assumption that anyone who could assist knew what I meant ;).

I tested using "zpool import -o readonly=on -fF -R /mnt home-main". I was able to avoid the kernel panic, but the console did output "freenas savecore: /dev/ada0p3: Operation not permitted". The pool is now visible when I run "zpool list", but it isn't mounted under /mnt, so I can't check data integrity. It is at least progress.

Edit: I am stupid (likely to repeat that several times in this thread)... I was able to mount and browse the file system. It all looks good... now to figure out how to import the pool without it being read-only.
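In case it helps anyone else who lands here, the full read-only sequence that worked for me was roughly:
Code:
# force a read-only import under an alternate root
zpool import -o readonly=on -fF -R /mnt home-main
# datasets may need an explicit mount after a read-only import
zfs mount -a
zfs list -r home-main
ls /mnt/home-main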
 

ARRRRGGGGHHH (Dabbler · Joined Mar 25, 2020 · 32 messages)

After sorting out other issues, I am having a similar problem to yours.

I have 4 pools. 3 work fine. The last one causes a kernel panic on import.

I will try going read-only like you did, copy the data off, and start again.
 

jasonsansone (Explorer · Joined Jul 18, 2019 · 79 messages)

Some updates. I attempted to run "zdb -e -bcsv home-main" to verify the checksums of all metadata blocks. It errors on the next-to-last metaslab.

Code:
root@freenas[~]# zdb -e -bcsv home-main

Traversing all blocks to verify checksums and verify nothing leaked ...

loading concrete vdev 1, metaslab 108 of 109 ...
Assertion failed: (blkptr at 0x823bd39c0 has invalid TYPE 101), file (null), line 0.
zsh: abort (core dumped)  zdb -e -bcsv home-main


I can run "zdb -e" to view all of the metaslabs. The output is attached (it was too large to display as inline code).

I also ran "zpool history home-main". The relevant output confirms the system was in the middle of a zfs recv when it crashed.
Code:
2020-03-24.14:00:14  zpool scrub home-main
2020-03-25.00:02:32 zfs recv -s -F home-main/nvme-backup/WinVM_1
2020-03-25.00:03:37 zfs recv -s -F home-main/nvme-backup/WinVM_2


Unfortunately, I have no clue how to interpret that data. Is it possible to rebuild, ignore, or roll back a metaslab block? Can the type or other information be manually corrected? I am now in way over my head when it comes to ZFS internals...
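One thing I do understand from the history: since the crash happened during a resumable receive (the -s flag), there may be a partial receive state left on the target dataset. If I'm reading the man pages right, checking for and discarding it would look roughly like this (dataset names taken from the history above):
Code:
# an interrupted resumable receive leaves a resume token on the target
zfs get -r receive_resume_token home-main/nvme-backup
# if a token is present, the partial state can be discarded with
zfs receive -A home-main/nvme-backup/WinVM_2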
 

Attachments

  • metaslabs.txt (124.8 KB)

jasonsansone (Explorer · Joined Jul 18, 2019 · 79 messages)

I had the (maybe flawed) idea to try a newer version of ZFS in the hope it would be able to overcome the metadata issue without the same kernel panic, or at least shine some light on the issue... I installed Ubuntu Server 18.04.4 LTS to a spare SSD. Unfortunately, the ZoL package in the Ubuntu repo is out of date at version 0.7.5. My pool has feature flags from FreeNAS 11.3-U1 that are newer than, and unsupported by, ZoL 0.7.5, so that didn't work. My next theory was to test a nightly build of TrueNAS Core 12, f/k/a FreeNAS. That produced similar results to the current 11.3-U1 STABLE build.
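For anyone trying the same thing: the mismatch shows up without even importing. If I remember right, a bare zpool import scan reports the unsupported feature flags, and zpool upgrade -v lists what the installed ZFS build supports:
Code:
# scan for importable pools; an old ZoL build flags unsupported features here
zpool import
# list the feature flags this ZFS build knows about
zpool upgrade -v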

I am now trying to import using the -T txg rollback method. By monitoring disk IOPS, I know it is doing something, but I am unsure how long the process will take. If it has to read the entirety of every disk, I'll be here a while...
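For reference, the rollback attempt looks roughly like this (the txg number below is purely illustrative; the real one comes from the uberblock list):
Code:
# list the uberblocks (and their txg numbers) stored in a member disk's labels
zdb -lu /dev/gptid/14b5f8a8-0f34-11ea-b892-002590e49b70
# then attempt a read-only import rewound to an older txg
# (the txg number here is made up for illustration)
zpool import -o readonly=on -f -T 1234567 -R /mnt home-main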
 

soupy (Cadet · Joined Aug 3, 2020 · 5 messages)

My TrueNAS 12 build crashed last night with the same symptoms as the original poster.

Through a process of elimination, I can stop the kernel panic, restart, boot loop, repeat cycle by disconnecting all 4 drives of my RAIDZ2 pool. The system then boots, and my other two pools on different drives are unaffected.

Once booted into the GUI, I can reconnect all 4 drives and perform SMART tests, all of which come back clean.
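For anyone wanting to run the same checks from a shell, I believe the equivalent is roughly this (device names are just examples):
Code:
# start a long SMART self-test on one pool member (device name is an example)
smartctl -t long /dev/da0
# check the result and error counters once the test finishes
smartctl -a /dev/da0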

I've searched through the forums for this failure, and back in the FreeNAS days, running out of RAM and power spikes were cited as common causes.
My server has 32GB of ECC RAM, which is the max for the motherboard/CPU combo, and is connected to a UPS.

My hardware is a Dell T20 with a Xeon E3-1226 v3, 32GB ECC RAM, and 8 drives connected through an LSI 9211-8i HBA: two mirrored pools, plus the pool that failed, a 4-drive RAIDZ2 pool with 2 WD Red NAS 4TB and 2 Seagate IronWolf NAS 4TB drives. I have an APC ES-750 UPS connected to a USB port to shut down if necessary.

The server replicates once a week, which it was doing last night, but the replication should have completed before the failure, as I received an automated status email around 2AM about scrubbing the boot drives. Needless to say, I will suffer no data loss because of offsite replication, but I am willing to do some troubleshooting steps if someone more skilled than myself directs me, even just for root cause analysis.
 

yottabit (Contributor · Joined Apr 15, 2012 · 192 messages)

I just experienced this same problem. The system had been running fine for 21 days after the last upgrade. I logged in to start a VM and noticed that the other VM, which usually runs 24/7 and has run-on-boot enabled, was disabled. I couldn't enable either of them; it failed with a socket error in the UI. I found others had experienced this, and cloning the VM fixed the problem. So I cloned the VM I needed, it started right up, and I used it for many hours.

The next morning, I went to log in and shut down that VM since I was done with it, but the server was locked up. Now on boot it kernel panics during the ZFS import, after a screen full of metaslab messages. The pool consists of 2x RAID-Z1 vdevs and has been the same for years, with zero problems.

The server is remote from me, and the IPMI/iKVM sucks, so I'm going to have to go there and attempt to boot off some ISOs. I have the latest 12.0 (which it was already running), the latest 11.x, and the latest Ubuntu.

Other than attempting to mount read-only and either zfs send or rsync the data off, does anyone have any other tips for fixing this? I have seen this particular problem come up in several searches, and it looks to be a ZFS bug.
 

jgreco (Resident Grinch · Joined May 29, 2011 · 18,680 messages)

Can you describe your system in more detail? I've noticed a bit of a spike in pool corruptions in the last year or two when people are using nonstandard ways to attach disks. I'm not sure if ZFS is stressing things more, or if some device driver has gotten worse, but it has been particularly targeting people using the MFI driver, from what I can tell, although there have been others.

If you are not using AHCI SATA ports or an LSI HBA with IT mode firmware, be aware that other disk attachment solutions have been shown to increase the risk of problems.
 

yottabit (Contributor · Joined Apr 15, 2012 · 192 messages)

This thread describes my problem.

I'm using a SuperMicro board, dual Xeon, 64 GB ECC RAM.

The controllers are the built-in SATA (C600) and SAS (C606), and a PCIe M.2 add-on board.

I've been using this system configuration for many years now, without any issues at all.

Pools are:
  • vol1
    • raid-z1
      • disk1
      • disk2
      • disk3
      • disk4
      • disk5
    • raid-z1
      • disk6
      • disk7
      • disk8
      • disk9
      • disk10
  • ssd_scratch
    • disk11
    • disk12
I have been able to import the vol1 pool successfully (with the same errors), but without a kernel panic, using Ubuntu 21.04 live boot.
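For anyone wanting to replicate that, the live-boot steps were roughly this (from memory, so treat it as a sketch):
Code:
# inside the Ubuntu 21.04 live session
sudo apt update && sudo apt install -y zfsutils-linux
# read-only import under an alternate root so nothing gets written
sudo zpool import -o readonly=on -f -R /mnt vol1
sudo zfs list -r vol1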

The last thing that happened before this was that I logged in to enable a bhyve VM that I occasionally use. I noticed a different VM that I always leave running 24/7 was disabled, which was odd because it's set to start automatically on system boot. But when I tried enabling either of the VMs, I was presented with a "socket error" message in the GUI. I searched for this and found one solution was to just clone the VM. I decided to do this until I had more time to investigate. The clone worked perfectly and booted up right away. I used the VM for several hours, with pretty intense I/O (using a video editor and rendering output files). When I was done, I started a task to upload all of the assets to Google Drive, and I went to bed.

The next day (today), I noticed I could not access the server at all. I did a reboot using IPMI, and here we are. Up until this point, the server had been up 21 days since I last installed an update (12.0-U3, by the way; I did not upgrade to -U3.1).

Edit: also worth mentioning that I tried to import on FreeNAS 11.3-U5, but I was unable to since my pool has already been upgraded. I'm going to spin up a VM in Google Cloud and hopefully start a zfs send, so I have the current state of my pools backed up. I do have the important data backed up, but it would still be annoying to recover it and also lose the unimportant data, if I can avoid it.
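The plan for the cloud copy is basically a recursive send of an existing snapshot over SSH; something like this (host, target pool, and snapshot names are placeholders):
Code:
# send an existing recursive snapshot to a remote ZFS pool over SSH
# (host, target pool, and snapshot names are placeholders)
zfs send -R vol1@latest-auto | ssh me@gce-vm zfs receive -s -d backup-pool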
 

yottabit (Contributor · Joined Apr 15, 2012 · 192 messages)

Follow up:

The cause for me was an OpenZFS bug, https://github.com/openzfs/zfs/issues/11480.

Unfortunately it took a long time to be fixed, and I couldn't wait any longer, so I don't know if the fix actually resolved an existing problem or just prevented it from happening again.

I discovered I was able to boot Ubuntu Linux and mount the pool successfully, despite tons of livelist errors. Afterward, I was able to mount it in FreeBSD again temporarily, but it would eventually cause a kernel panic again. I ended up migrating from Core (FreeBSD) to SCALE (Linux) and never encountered the kernel panic again. Later I ended up rebuilding my pool anyway.

The problem was caused on my pool by cloning a zvol, and then I think later attempting to delete the clone. I think this is what triggered the first kernel panic, and then afterward I was never able to boot again (into FreeBSD) without triggering the kernel panic during import/mount of the pool.

The OpenZFS team assured me that the panic was not the result of corruption and that my data was perfectly safe. Indeed, I never encountered any corruption issues after migrating to SCALE.
 

DownSetGo (Cadet · Joined Jun 9, 2022 · 1 message)

Got the same problem.
My server had been running fine for months, then suddenly, boot loop.
"KDB: enter: panic" when importing the pool.
 

Jaxseven (Dabbler · Joined Sep 13, 2021 · 11 messages)

DownSetGo said:
Got the same problem.
My server had been running fine for months, then suddenly, boot loop.
"KDB: enter: panic" when importing the pool.
How did you solve this? I'm struggling with this exact thing. Every time I try to import the pool, full panic. I'd really appreciate the help.
 