Hi everybody,
I've recently been battling some issues with random reboots of my FreeNAS system; I've had two so far this month. I've been using FreeNAS for a couple of years, and I had a pretty good working knowledge of FreeBSD (and other Unix-like systems) for years before that. I'm hoping somebody can shed some light on what may be going on here.
First, I'm running FreeNAS 11.1-U5 on the following hardware, no monitor/keyboard/mouse attached:
CPU: Xeon E3-1246 v3 (4C/8T)
Motherboard: Supermicro X10SAE
RAM: 24GB DDR3 ECC unbuffered (2x4GB Axiom modules, 2x8GB Samsung modules)
Power Supply: Supermicro-branded 500W 80 Plus Bronze, in the Supermicro tower case that houses everything
Running 2 pools: 2x256GB Crucial SSD mirror, 4x4TB WD Red in a raidz2 pool.
I think the disks are irrelevant here; in any case, all pools report healthy via zpool status.

Overall, the server is just doing light homelab-type duties: desktop backup, media sharing, etc. It's not very stressed. I'm running about 5 iocage jails (plex, sabnzbd, sonarr, duplicity, unifi controller) and one bhyve VM (debian for unifi video). CPU basically sits between 10-20% most of the time, and disk activity is very low with no sustained writes.
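For reference, the health checks I've been running amount to roughly this (a minimal sketch; the ada? glob happens to match the disks in my box, so adjust device names as needed):

Code:
zpool status -x              # prints "all pools are healthy" when everything is clean
for d in /dev/ada?; do       # loop over my disks; the glob is specific to my system
        smartctl -H "$d"     # overall SMART health self-assessment per disk
done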
Last night, around 03:02 or so, I noticed that it rebooted. I suspect a kernel panic and a watchdog reset, but I don't see any events in the IPMI event log about a watchdog reset, so I can't prove that.
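For the record, this is how I checked the BMC event log (assuming ipmitool is installed and the ipmi kernel module is loaded, so it talks to the local BMC without credentials):

Code:
ipmitool sel list            # raw system event log; no watchdog entries around 03:02
ipmitool sel elist           # same log with decoded/extended event descriptions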
From /data/crash/info.1:

Code:
root@freenas:/data/crash # cat info.1
Dump header from device: /dev/ada4p1
  Architecture: amd64
  Architecture Version: 1
  Dump Length: 659456
  Blocksize: 512
  Dumptime: Tue Aug 21 07:01:15 2018
  Hostname: freenas
  Magic: FreeBSD Text Dump
  Version String: FreeBSD 11.1-STABLE #0 r321665+e0c4ca60dfc(freenas/11.1-stable): Wed May 30 14:18:20 EDT 2018
    root@nemesis.tn.ixsystems.com:/freenas-11-releng/freenas/_BE/objs/freenas-11-releng/freenas/
  Panic String: double fault
  Dump Parity: 2460622719
  Bounds: 1
  Dump Status: good
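In case anyone wants to follow along: the text dump is just a tarball, so the files I reference below came out of it like this (paths as on my system):

Code:
cd /data/crash
tar -xzf textdump.tar.0.gz   # unpacks ddb.txt, msgbuf.txt, panic.txt, version.txt, etc.
less msgbuf.txt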
Here's a snippet of the end of the msgbuf.txt file from the resulting /data/crash/textdump.tar.0.gz:

Code:
...
VOP_RECLAIM_APV() at VOP_RECLAIM_APV+0x89/frame 0xfffffe0806a3fe20
vgonel() at vgonel+0x2a0/frame 0xfffffe0806a3fea0
vrecycle() at vrecycle+0x4d/frame 0xfffffe0806a3fed0
zfs_freebsd_inactive() at zfs_freebsd_inactive+0xd/frame 0xfffffe0806a3fee0
VOP_INACTIVE_APV() at VOP_INACTIVE_APV+0x89/frame 0xfffffe0806a3ff10
vinactive() at vinactive+0xf2/frame 0xfffffe0806a3ff70
vputx() at vputx+0x2c5/frame 0xfffffe0806a3ffd0
null_reclaim() at null_reclaim+0xf6/frame 0xfffffe0806a40030
VOP_RECLAIM_APV() at VOP_RECLAIM_APV+0x89/frame 0xfffffe0806a40060
vgonel() at vgonel+0x2a0/frame 0xfffffe0806a400e0
vnlru_free_locked() at vnlru_free_locked+0x22c/frame 0xfffffe0806a40150
getnewvnode_reserve() at getnewvnode_reserve+0x77/frame 0xfffffe0806a40180
zfs_zget() at zfs_zget+0x27/frame 0xfffffe0806a40240
zfs_dirent_lookup() at zfs_dirent_lookup+0x15d/frame 0xfffffe0806a40290
zfs_dirlook() at zfs_dirlook+0x77/frame 0xfffffe0806a402d0
zfs_lookup() at zfs_lookup+0x432/frame 0xfffffe0806a403c0
zfs_freebsd_lookup() at zfs_freebsd_lookup+0x6d/frame 0xfffffe0806a40500
VOP_CACHEDLOOKUP_APV() at VOP_CACHEDLOOKUP_APV+0x83/frame 0xfffffe0806a40530
vfs_cache_lookup() at vfs_cache_lookup+0xd6/frame 0xfffffe0806a40590
VOP_LOOKUP_APV() at VOP_LOOKUP_APV+0x83/frame 0xfffffe0806a405c0
lookup() at lookup+0x6c1/frame 0xfffffe0806a40660
namei() at namei+0x48f/frame 0xfffffe0806a40730
kern_statat() at kern_statat+0x98/frame 0xfffffe0806a408e0
sys_stat() at sys_stat+0x2d/frame 0xfffffe0806a40980
amd64_syscall() at amd64_syscall+0xa4a/frame 0xfffffe0806a40ab0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe0806a40ab0
--- syscall (188, FreeBSD ELF64, sys_stat), rip = 0x800bd5d7a, rsp = 0x7fffffffe578, rbp = 0x7fffffffe690 ---
KDB: enter: panic
The same file shows panic: double fault right above the huge crash dump. Nothing in /var/log/messages indicates any kind of problem prior to the reboot... just the configuration reload request at 00:00, and then, about 3 hours later, the standard dmesg output from FreeBSD booting.
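(The log check was nothing fancier than this, for the record; the pattern list is just what seemed relevant:)

Code:
grep -i -e panic -e watchdog -e error /var/log/messages   # nothing before the reboot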
So msgbuf.txt tells me that a kernel panic occurred, but doesn't seem to shed much light on what actually did it. However, ddb.txt seems to:

Code:
db:0:kdb.enter.default> bt
Tracing pid 15597 tid 102961 td 0xfffff800bad0f000
kdb_enter() at kdb_enter+0x3b/frame 0xfffffe07739d0d80
vpanic() at vpanic+0x1a3/frame 0xfffffe07739d0e00
panic() at panic+0x43/frame 0xfffffe07739d0e60
dblfault_handler() at dblfault_handler+0x1de/frame 0xfffffe07739d0f30
Xdblfault() at Xdblfault+0xac/frame 0xfffffe07739d0f30
--- trap 0x17, rip = 0xffffffff80480dc8, rsp = 0xfffffe0806a3d000, rbp = 0xfffffe0806a3d050 ---
zio_create() at zio_create+0x38/frame 0xfffffe0806a3d050
zio_vdev_child_io() at zio_vdev_child_io+0x105/frame 0xfffffe0806a3d100
...
cpuid        = 4
dynamic pcpu = 0xfffffe07f1cc0700
curthread    = 0xfffff800bad0f000: pid 15597 "sh"
curpcb       = 0xfffffe0806a40b80
fpcurthread  = none
idlethread   = 0xfffff800090c5000: tid 100007 "idle: cpu4"
curpmap      = 0xfffff80037270138
tssp         = 0xffffffff821b6eb0
commontssp   = 0xffffffff821b6eb0
rsp0         = 0xfffffe0806a40b80
gs32p        = 0xffffffff821bd708
ldt          = 0xffffffff821bd748
tss          = 0xffffffff821bd738
curvnet      = 0
...
db:0:kdb.enter.default> ps
  pid  ppid  pgrp   uid  state  wmesg     wchan               cmd
15652 15582 14477     0  DJ     zio->io_  0xfffff800a9a3a398  sh
15598 15588 14469     0  SJ     piperd    0xfffff80037493000  mail
15597 15588 14469     0  RJ     CPU 4                         sh
15588 15587 14469     0  SJ     wait      0xfffff801b3801000  sh
15587 14469 14469     0  SJ     wait      0xfffff801bd240588  lockf
15583 15572 14477     0  SJ     piperd    0xfffff80101fbd8e8  mail
15582 15572 14477     0  SJ     wait      0xfffff80142d24000  sh
15572 15571 14477     0  SJ     wait      0xfffff800a6d7d000  sh
15571 14477 14477     0  SJ     wait      0xfffff80182c6d000  lockf
15464 15455 14478     0  SJ     nanslp    0xffffffff81f7e3c2  sleep
15455 14980 14478     0  SJ     wait      0xfffff803eb72d000  sh
15160 15158 14475     0  SJ     piperd    0xfffff800373318e8  cat
15158 15153 14475     0  SJ     wait      0xfffff8032adfc000  sh
15157 15153 14475     0  DJ     zio->io_  0xfffff802ae1867a8  find
15153 15151 14475     0  SJ     wait      0xfffff8040a5f6000  sh
15152 15142 14475     0  SJ     piperd    0xfffff80529e555f0  mail
15151 15142 14475     0  SJ     wait      0xfffff800b01f5000  sh
15142 15141 14475     0  SJ     wait      0xfffff800ae6c4588  sh
15141 15139 14475     0  SJ     wait      0xfffff8013aace000  lockf
15139 15138 14475     0  SJ     wait      0xfffff800681b9588  sh
15138 15039 14475     0  SJ     wait      0xfffff8058eb2e588  sh
15040 15030 14475     0  SJ     piperd    0xfffff8038ff252f8  mail
15039 15030 14475     0  SJ     wait      0xfffff8006812c000  sh
15030 15029 14475     0  SJ     wait      0xfffff80142d11000  sh
15029 14475 14475     0  SJ     wait      0xfffff803d494c588  lockf
14981 14972 14478     0  SJ     piperd    0xfffff801091d0be0  mail
14980 14972 14478     0  SJ     wait      0xfffff800a3b1e000  sh
14972 14971 14478     0  SJ     wait      0xfffff8058eb2e000  sh
14971 14969 14478     0  SJ     wait      0xfffff8032adfb588  lockf
14969 14968 14478     0  SJ     wait      0xfffff80080ef9000  sh
14968 14756 14478     0  SJ     wait      0xfffff804f84ac000  sh
14758 14738 14478     0  SJ     piperd    0xfffff80101fbe2f8  mail
14756 14738 14478     0  SJ     wait      0xfffff8005862b588  sh
14738 14736 14478     0  SJ     wait      0xfffff8010d3b5000  sh
14736 14478 14478     0  SJ     wait      0xfffff8018085f588  lockf
14478 14463 14478     0  SsJ    wait      0xfffff8005862c000  sh
14477 14465 14477     0  SsJ    wait      0xfffff80058f49588  sh
14476 14467 14476     0  DsJ    zio->io_  0xfffff800d165abb8  sh
14475 14464 14475     0  SsJ    wait      0xfffff8032adfc588  sh
14469 14461 14469     0  SsJ    wait      0xfffff801bd06b588  sh
14467  3590  3590     0  SJ     piperd    0xfffff80080f13be0  cron
14465   104   104     0  SJ     piperd    0xfffff80376f68be0  cron
14464  2483  2483     0  SJ     piperd    0xfffff801091d18e8  cron
14463 84369 84369     0  SJ     piperd    0xfffff800375542f8  cron
14461 87756 87756     0  SJ     piperd    0xfffff80037493be0  cron
From here it looks like the panic came from PID 15597, running on CPU 4, which happened to be a sh process that links back through other sh processes and a lockf process to, eventually, a cron process (following the ppid chain in the ps output: 15597 sh -> 15588 sh -> 15587 lockf -> 14469 sh -> 14461 cron). I checked all my jails for crontabs that might have been running, and none of them has any cron jobs except the duplicity jail, whose job runs an hour later than the crash occurred, so I'm not sure what would have triggered this.

Could anybody shed any light here? Are there any other dumps that would be useful? I can't seem to find a smoking gun, or rule out bad hardware by replicating the crash. If it's hardware, I'd suspect memory or CPU, but it happens arbitrarily, so I don't know how I could effectively troubleshoot that.
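In case it's useful, this is roughly how I went through the jails looking for cron jobs (a sketch; it uses the jail names as reported by jls and only checks root's crontab):

Code:
for j in $(jls name); do                              # iterate over running jails
        echo "== ${j}"
        jexec "${j}" crontab -l 2>/dev/null || echo "(no crontab)"
done
crontab -l                                            # root's crontab on the host
cat /etc/crontab                                      # plus the system crontab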
Any help would be great!