What's Causing my Kernel Panic?

HelloWill

Dabbler
Joined
May 3, 2016
Messages
20
Our Setup
Hardware

  • 45 Drives Turbo 60 XL
  • 60 SATA drives (6 × 10-disk RAIDZ2 vdevs)
  • 4 LSI 9305-24i SAS HBAs
  • 8 SSD cache drives
  • 2 SSDs for config
  • 1 flash drive with backup config and cache
Software
  • FreeNAS 11.1 U7

BACKGROUND
We have a box with 60 hard drives running FreeNAS 11.1 U7 that has been working fine for a long time. The server is lightly used and has been running stably, with no known environmental or configuration changes recently.


BEHAVIOR WE'RE EXPERIENCING
The server stopped responding. Upon reboot, it gives a kernel panic when trying to import volumes.


STEPS TAKEN
  1. After rebooting, the same symptoms persist
  2. Ran memtest86 with no errors

KERNEL PANIC MESSAGE
Importing 17238327038330746117
txg 47837309 open pool version 5000; software version 5000/5; uts 11.1-STABLE 1101505 amd64
panic: solaris assert: offset + size <= sm->sm_start + sm->sm_size (0x64060634802000 <= 0x230000000000), file: /freenas-releng/freenas/_BE/os/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/space_map.c, line: 119
cpuid = 1
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe202386d7e0
vpanic() at vpanic+0x186/frame 0xfffffe202386d860
panic() at panic+0x43/frame 0xfffffe202386d8c0
assfail3() at assfail3+0x2c/frame 0xfffffe202386d8e0
space_map_load() at space_map_load+0x352/frame 0xfffffe202386d970
metaslab_load() at metaslab_load+0x2b/frame 0xfffffe202386d990
metaslab_preload() at metaslab_preload+0x89/frame 0xfffffe202386d9c0
taskq_run() at taskq_run+0x10/frame 0xfffffe202386d9e0
taskqueue_run_locked() at taskqueue_run_locked+0x147/frame 0xfffffe202386da40
taskqueue_thread_loop() at taskqueue_thread_loop+0xb8/frame 0xfffffe202386da70
fork_exit() at fork_exit+0x85/frame 0xfffffe202386dab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe202386dab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
[ thread pid 0 tid 101998 ]
Stopped at      kdb_enter+0x3b: movq    $0,kdb_why
db>


I've ordered another HBA card just in case, but I'm not sure where to start. What do you think?
 
Joined
Jul 3, 2015
Messages
926
Have you tried creating a new install on a USB drive and just seeing if you can import the pool on a fresh system?
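
If it boots, something along these lines from a shell would show whether the pool is even visible to the fresh system, and let you try a cautious import (read-only as a precaution; the GUID is the one from your panic message, and the friendly pool name works too):

Code:
# List pools visible to this system without importing anything
zpool import

# Attempt a read-only import by GUID, mounted under an alternate root
zpool import -o readonly=on -R /mnt 17238327038330746117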
 
Joined
Jul 3, 2015
Messages
926
Tricky one. I presume you don't have another system to move the drives over to? That would rule out a hardware issue. After that you are down the road of zpool recovery, which ideally you want to leave as your last option.
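
By zpool recovery I mean the rewind-style import options, roughly along these lines (dry run first; the real thing only as a last resort, since it discards the most recent transactions; tank is assumed as the pool name):

Code:
# Dry run: report whether rewinding to an earlier txg would let the pool import
zpool import -F -n tank

# Last resort: actually rewind and import, discarding the last few transactions
zpool import -F tank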
 
Joined
Jul 3, 2015
Messages
926
When you tried booting into your new install, could you see all 60 drives?

Also, what are the 8 SSD cache drives: SLOG, L2ARC, metadata? Striped or mirrored? (The commands below are a quick way to check.)

  • 1 flash drive with backup config and cache - what does 'and cache' mean here?
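
For reference, once you can reach a shell, something along these lines would answer both of those (FreeBSD-style device names; tank is assumed as the pool name):

Code:
# Count the CAM devices the OS can see (includes SES enclosures,
# so expect a little more than the raw disk count)
camcontrol devlist | wc -l

# Show how the SSDs are attached: SLOG devices appear under a "logs"
# heading and L2ARC devices under a "cache" heading
zpool status tank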
 

HelloWill

Dabbler
Joined
May 3, 2016
Messages
20
I was able to get the system to boot by unplugging one of the LSI HBA cards. I then purchased a used "tested" card, installed it, and got the same issue. If I unplug one of the 4 mini-SAS plugs, it will boot into FreeNAS. If I have all 4 plugged in, it won't. The weird thing is that it doesn't matter which of the 4 I unplug; as long as at least one is unplugged, the system will boot.

What do you think it could be?
 
Joined
Jul 3, 2015
Messages
926
My guess is that it's booting because it can't see a pool to import when you remove a cable, as some drives are missing. Please answer the above questions.
 

HelloWill

Dabbler
Joined
May 3, 2016
Messages
20
The new install errored out before I could see the drives. It was detecting drives during the boot process, but I'm not sure if it saw them all. Using the original installation, I booted FreeNAS with one HBA card (and its attached drives) at a time, and all drives showed up.

I believe the cache is set up as L2ARC. Is there a way I can confirm?

And ignore the flash drive cache remark; the flash drive is strictly for backup.
 

HelloWill

Dabbler
Joined
May 3, 2016
Messages
20
I left the HBA cards plugged in but unplugged all the SAS plugs. After it booted, I plugged the drives back in one by one. The drives were visible in FreeNAS, but tank was not available. We detached it and tried an import, and when it was at step 2/2, the server rebooted itself.
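
For the next attempt we're thinking of running the import from the CLI rather than the GUI, read-only and without mounting anything, roughly (assuming the pool is still named tank):

Code:
# Read-only, no mounts: avoids any writes to the pool if the import gets that far
zpool import -o readonly=on -N tank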
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Code:
Importing 17238327038330746117
txg 47837309 open pool version 5000; software version 5000/5; uts 11.1-STABLE 1101505 amd64
panic: solaris assert: offset + size <= sm->sm_start + sm->sm_size (0x64060634802000 <= 0x230000000000), file: /freenas-releng/freenas/_BE/os/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/space_map.c, line: 119
cpuid = 1
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe202386d7e0
vpanic() at vpanic+0x186/frame 0xfffffe202386d860
panic() at panic+0x43/frame 0xfffffe202386d8c0
assfail3() at assfail3+0x2c/frame 0xfffffe202386d8e0
space_map_load() at space_map_load+0x352/frame 0xfffffe202386d970
metaslab_load() at metaslab_load+0x2b/frame 0xfffffe202386d990
metaslab_preload() at metaslab_preload+0x89/frame 0xfffffe202386d9c0
taskq_run() at taskq_run+0x10/frame 0xfffffe202386d9e0
taskqueue_run_locked() at taskqueue_run_locked+0x147/frame 0xfffffe202386da40
taskqueue_thread_loop() at taskqueue_thread_loop+0xb8/frame 0xfffffe202386da70
fork_exit() at fork_exit+0x85/frame 0xfffffe202386dab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe202386dab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
[ thread pid 0 tid 101998 ]
Stopped at      kdb_enter+0x3b: movq    $0,kdb_why
db>


From the debug it looks like you might have spacemap corruption in your pool.

Can you repeat the process of booting with the SAS cables unplugged, then reconnect the cables - but this time, please run the following command from an SSH session:

zdb -c -e 17238327038330746117

If you know the "friendly name" of your pool as opposed to the numeric identifier, you can use that as well.
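
For example, assuming the pool is named tank:

Code:
# Same check, using the friendly pool name instead of the GUID
zdb -c -e tank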
 
Joined
Jul 3, 2015
Messages
926
8 drives for L2ARC sounds like a lot.
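
Once the pool is back, it might also be worth checking whether that L2ARC is even earning its keep; the ARC/L2ARC counters are exposed via sysctl on FreeNAS, roughly:

Code:
# L2ARC size and hit/miss counters from the kernel ARC statistics
sysctl kstat.zfs.misc.arcstats | grep -E 'l2_(size|hdr_size|hits|misses)'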
 

HelloWill

Dabbler
Joined
May 3, 2016
Messages
20
We were able to import the volume using Ubuntu by importing it as read-only. We will copy the data from our read-only server to a temporary server, recreate tank on our original box, and then transfer the data back, unless there's a better way.
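
Roughly what we have in mind for the copy (the hostname and paths below are placeholders; zfs send would need pre-existing snapshots, since new ones can't be created on a read-only pool):

Code:
# Pool imported read-only on the Ubuntu box; copy the datasets out over SSH
rsync -aHAX --progress /tank/ root@tempserver:/backup/tank/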
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
We were able to import the volume using Ubuntu by importing it as read-only. We will copy the data from our read-only server to a temporary server, recreate tank on our original box, and then transfer the data back, unless there's a better way.
The best solution is the one that results in no data being lost. Assuming that your plan of copying to a temporary location and recreating the pool is workable from a business perspective, it's likely your best path forward.

From a troubleshooting/engineering perspective I would of course be interested in digging into the root cause of your spacemap corruption (if that's what you're facing), but not if it will be an inconvenience.

One thing I would suggest is to upgrade your system to the latest release of TrueNAS, as there have been a significant number of improvements and bug-fixes since FreeNAS 11.
 

HelloWill

Dabbler
Joined
May 3, 2016
Messages
20
We noticed one drive had power-cycled over 7,000 times. Our best guess is that this drive might have contributed to the corruption we experienced. While we wait for the other hardware, is there a good troubleshooting path that will answer the important question of WHY this happened?
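
For reference, this is roughly how we spotted the counter (the device name is just an example):

Code:
# SMART attributes for one drive; Power_Cycle_Count is the one that stood out
smartctl -A /dev/da37 | grep -i power_cycle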

Also, with regard to the update... I assume 13 Core would be the logical upgrade target? Knowing we're on 11.1, what's the best upgrade path so we don't need to reconfigure all our AD settings and shares from scratch?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
We noticed one drive had power-cycled over 7,000 times. Our best guess is that this drive might have contributed to the corruption we experienced. While we wait for the other hardware, is there a good troubleshooting path that will answer the important question of WHY this happened?

Also, with regard to the update... I assume 13 Core would be the logical upgrade target? Knowing we're on 11.1, what's the best upgrade path so we don't need to reconfigure all our AD settings and shares from scratch?
A single drive power-cycling might have caused performance oddities, but if it was returning bad data ZFS should have booted it from the array or flagged it as FAULTED.

Are you using a multipath topology with your SAS HBAs by any chance? Dueling SCSI reservations are one common way I've seen this manifest.
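
A quick, read-only way to check from the FreeNAS shell:

Code:
# Any multipath devices configured by GEOM?
gmultipath status

# List the attached devices and which controller/bus each one hangs off
camcontrol devlist -v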

CORE 13.0-U5.1 is the latest/current and would be your direct upgrade target, although you'll need to upgrade from 11 -> 12 and then 12 -> 13 in two steps.

I'd suggest backing up your configuration (System -> General -> Save Config) for FN11 and keeping that separate, and then going through the 11 -> 12 -> 13 upgrade cycle. Your configuration should stay intact; however, don't choose the "Upgrade your pool ZFS level" until you've confirmed that everything is working properly, and your data has been copied over.
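
If you prefer doing that from the shell, the configuration is a single SQLite database you can copy off-box (the source path is the standard FreeNAS location; the destination is a placeholder):

Code:
# Copy the FreeNAS configuration database somewhere safe
scp /data/freenas-v1.db admin@backuphost:/backups/freenas-config.db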

Recreating the pool itself, though, might mean having to redo the SMB shares - do you have a significant number of them?
 