Diagnosing random reboots

davbro

Cadet
Joined
Aug 21, 2018
Messages
9
Hi everybody,

I've recently been battling some issues with random reboots of my FreeNAS system; I've had two so far this month. I've been using FreeNAS for a couple of years, and I had a good working knowledge of FreeBSD (and other Unix-like systems) for years before that. I'm wondering if somebody may be able to shed some light on what's going on here.

First, I'm running FreeNAS 11.1-U5 on the following hardware, no monitor/keyboard/mouse attached:

CPU: Xeon E3-1246 v3 (4C/8T)
Motherboard: Supermicro X10SAE
RAM: 24GB DDR3 ECC unbuffered (2x4GB Axiom modules, 2x8GB Samsung modules)
Power Supply: Supermicro-branded 500W 80-plus bronze in the Supermicro tower case I've got everything in

Running 2 pools: 2x256GB Crucial SSD mirror, 4x4TB WD Red in a raidz2 pool.

I think the disks are irrelevant in this case; in any event, all pools report healthy via zpool status.

Overall, the server is just doing some light homelab-type duties for desktop backup, media sharing, etc. It's not very stressed. I'm running about 5 iocage jails (plex, sabnzbd, sonarr, duplicity, unifi controller) and one bhyve VM (debian for unifi video). CPU basically sits between 10-20% most of the time, disk activity is very low with no sustained writes.

Last night, around 03:02, I noticed that it rebooted. I suspect a kernel panic followed by a watchdog reset, but the IPMI event log shows no watchdog events, so I can't prove that.
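
For what it's worth, this is roughly how I checked the SEL; a minimal sketch, assuming the ipmi kernel driver and the ipmitool binary that FreeNAS ships:

Code:
# load the IPMI driver if it isn't already, then dump the system event log
kldload -n ipmi
ipmitool sel elist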

From /data/crash/info.1:

Code:
root@freenas:/data/crash # cat info.1
Dump header from device: /dev/ada4p1
  Architecture: amd64
  Architecture Version: 1
  Dump Length: 659456
  Blocksize: 512
  Dumptime: Tue Aug 21 07:01:15 2018
  Hostname: freenas
  Magic: FreeBSD Text Dump
  Version String: FreeBSD 11.1-STABLE #0 r321665+e0c4ca60dfc(freenas/11.1-stable): Wed May 30 14:18:20 EDT 2018
	root@nemesis.tn.ixsystems.com:/freenas-11-releng/freenas/_BE/objs/freenas-11-releng/freenas/
  Panic String: double fault
  Dump Parity: 2460622719
  Bounds: 1
  Dump Status: good

Here's a snippet of the end of the msgbuf.txt file from the resulting /data/crash/textdump.tar.0.gz file:
Code:
...
VOP_RECLAIM_APV() at VOP_RECLAIM_APV+0x89/frame 0xfffffe0806a3fe20
vgonel() at vgonel+0x2a0/frame 0xfffffe0806a3fea0
vrecycle() at vrecycle+0x4d/frame 0xfffffe0806a3fed0
zfs_freebsd_inactive() at zfs_freebsd_inactive+0xd/frame 0xfffffe0806a3fee0
VOP_INACTIVE_APV() at VOP_INACTIVE_APV+0x89/frame 0xfffffe0806a3ff10
vinactive() at vinactive+0xf2/frame 0xfffffe0806a3ff70
vputx() at vputx+0x2c5/frame 0xfffffe0806a3ffd0
null_reclaim() at null_reclaim+0xf6/frame 0xfffffe0806a40030
VOP_RECLAIM_APV() at VOP_RECLAIM_APV+0x89/frame 0xfffffe0806a40060
vgonel() at vgonel+0x2a0/frame 0xfffffe0806a400e0
vnlru_free_locked() at vnlru_free_locked+0x22c/frame 0xfffffe0806a40150
getnewvnode_reserve() at getnewvnode_reserve+0x77/frame 0xfffffe0806a40180
zfs_zget() at zfs_zget+0x27/frame 0xfffffe0806a40240
zfs_dirent_lookup() at zfs_dirent_lookup+0x15d/frame 0xfffffe0806a40290
zfs_dirlook() at zfs_dirlook+0x77/frame 0xfffffe0806a402d0
zfs_lookup() at zfs_lookup+0x432/frame 0xfffffe0806a403c0
zfs_freebsd_lookup() at zfs_freebsd_lookup+0x6d/frame 0xfffffe0806a40500
VOP_CACHEDLOOKUP_APV() at VOP_CACHEDLOOKUP_APV+0x83/frame 0xfffffe0806a40530
vfs_cache_lookup() at vfs_cache_lookup+0xd6/frame 0xfffffe0806a40590
VOP_LOOKUP_APV() at VOP_LOOKUP_APV+0x83/frame 0xfffffe0806a405c0
lookup() at lookup+0x6c1/frame 0xfffffe0806a40660
namei() at namei+0x48f/frame 0xfffffe0806a40730
kern_statat() at kern_statat+0x98/frame 0xfffffe0806a408e0
sys_stat() at sys_stat+0x2d/frame 0xfffffe0806a40980
amd64_syscall() at amd64_syscall+0xa4a/frame 0xfffffe0806a40ab0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe0806a40ab0
--- syscall (188, FreeBSD ELF64, sys_stat), rip = 0x800bd5d7a, rsp = 0x7fffffffe578, rbp = 0x7fffffffe690 ---
KDB: enter: panic

The same file shows panic: double fault right above the huge crash dump. Nothing in /var/log/messages indicates any kind of problem prior to the reboot... just the configuration reload request at 00:00 and then, about three hours later, the standard dmesg output from FreeBSD booting.
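
Side note for anyone following along: the textdump is just a tarball, so pulling the individual files out is easy. A minimal sketch, assuming the stock FreeBSD crash layout:

Code:
# extract the textdump to inspect its pieces
mkdir -p /tmp/crash
tar -xzf /data/crash/textdump.tar.0.gz -C /tmp/crash
ls /tmp/crash    # typically ddb.txt, msgbuf.txt, panic.txt, version.txt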

So msgbuf.txt tells me that a kernel panic occurred, but doesn't seem to shed much light on what actually did it. However, ddb.txt seems to:

Code:
db:0:kdb.enter.default>  bt
Tracing pid 15597 tid 102961 td 0xfffff800bad0f000
kdb_enter() at kdb_enter+0x3b/frame 0xfffffe07739d0d80
vpanic() at vpanic+0x1a3/frame 0xfffffe07739d0e00
panic() at panic+0x43/frame 0xfffffe07739d0e60
dblfault_handler() at dblfault_handler+0x1de/frame 0xfffffe07739d0f30
Xdblfault() at Xdblfault+0xac/frame 0xfffffe07739d0f30
--- trap 0x17, rip = 0xffffffff80480dc8, rsp = 0xfffffe0806a3d000, rbp = 0xfffffe0806a3d050 ---
zio_create() at zio_create+0x38/frame 0xfffffe0806a3d050
zio_vdev_child_io() at zio_vdev_child_io+0x105/frame 0xfffffe0806a3d100

...

cpuid		= 4
dynamic pcpu = 0xfffffe07f1cc0700
curthread	= 0xfffff800bad0f000: pid 15597 "sh"
curpcb	   = 0xfffffe0806a40b80
fpcurthread  = none
idlethread   = 0xfffff800090c5000: tid 100007 "idle: cpu4"
curpmap	  = 0xfffff80037270138
tssp		 = 0xffffffff821b6eb0
commontssp   = 0xffffffff821b6eb0
rsp0		 = 0xfffffe0806a40b80
gs32p		= 0xffffffff821bd708
ldt		  = 0xffffffff821bd748
tss		  = 0xffffffff821bd738
curvnet	  = 0

...

db:0:kdb.enter.default>  ps
  pid  ppid  pgrp   uid   state   wmesg		 wchan		cmd
15652 15582 14477	 0  DJ	  zio->io_ 0xfffff800a9a3a398 sh
15598 15588 14469	 0  SJ	  piperd   0xfffff80037493000 mail
15597 15588 14469	 0  RJ	  CPU 4					   sh
15588 15587 14469	 0  SJ	  wait	 0xfffff801b3801000 sh
15587 14469 14469	 0  SJ	  wait	 0xfffff801bd240588 lockf
15583 15572 14477	 0  SJ	  piperd   0xfffff80101fbd8e8 mail
15582 15572 14477	 0  SJ	  wait	 0xfffff80142d24000 sh
15572 15571 14477	 0  SJ	  wait	 0xfffff800a6d7d000 sh
15571 14477 14477	 0  SJ	  wait	 0xfffff80182c6d000 lockf
15464 15455 14478	 0  SJ	  nanslp   0xffffffff81f7e3c2 sleep
15455 14980 14478	 0  SJ	  wait	 0xfffff803eb72d000 sh
15160 15158 14475	 0  SJ	  piperd   0xfffff800373318e8 cat
15158 15153 14475	 0  SJ	  wait	 0xfffff8032adfc000 sh
15157 15153 14475	 0  DJ	  zio->io_ 0xfffff802ae1867a8 find
15153 15151 14475	 0  SJ	  wait	 0xfffff8040a5f6000 sh
15152 15142 14475	 0  SJ	  piperd   0xfffff80529e555f0 mail
15151 15142 14475	 0  SJ	  wait	 0xfffff800b01f5000 sh
15142 15141 14475	 0  SJ	  wait	 0xfffff800ae6c4588 sh
15141 15139 14475	 0  SJ	  wait	 0xfffff8013aace000 lockf
15139 15138 14475	 0  SJ	  wait	 0xfffff800681b9588 sh
15138 15039 14475	 0  SJ	  wait	 0xfffff8058eb2e588 sh
15040 15030 14475	 0  SJ	  piperd   0xfffff8038ff252f8 mail
15039 15030 14475	 0  SJ	  wait	 0xfffff8006812c000 sh
15030 15029 14475	 0  SJ	  wait	 0xfffff80142d11000 sh
15029 14475 14475	 0  SJ	  wait	 0xfffff803d494c588 lockf
14981 14972 14478	 0  SJ	  piperd   0xfffff801091d0be0 mail
14980 14972 14478	 0  SJ	  wait	 0xfffff800a3b1e000 sh
14972 14971 14478	 0  SJ	  wait	 0xfffff8058eb2e000 sh
14971 14969 14478	 0  SJ	  wait	 0xfffff8032adfb588 lockf
14969 14968 14478	 0  SJ	  wait	 0xfffff80080ef9000 sh
14968 14756 14478	 0  SJ	  wait	 0xfffff804f84ac000 sh
14758 14738 14478	 0  SJ	  piperd   0xfffff80101fbe2f8 mail
14756 14738 14478	 0  SJ	  wait	 0xfffff8005862b588 sh
14738 14736 14478	 0  SJ	  wait	 0xfffff8010d3b5000 sh
14736 14478 14478	 0  SJ	  wait	 0xfffff8018085f588 lockf
14478 14463 14478	 0  SsJ	 wait	 0xfffff8005862c000 sh
14477 14465 14477	 0  SsJ	 wait	 0xfffff80058f49588 sh
14476 14467 14476	 0  DsJ	 zio->io_ 0xfffff800d165abb8 sh
14475 14464 14475	 0  SsJ	 wait	 0xfffff8032adfc588 sh
14469 14461 14469	 0  SsJ	 wait	 0xfffff801bd06b588 sh
14467  3590  3590	 0  SJ	  piperd   0xfffff80080f13be0 cron
14465   104   104	 0  SJ	  piperd   0xfffff80376f68be0 cron
14464  2483  2483	 0  SJ	  piperd   0xfffff801091d18e8 cron
14463 84369 84369	 0  SJ	  piperd   0xfffff800375542f8 cron
14461 87756 87756	 0  SJ	  piperd   0xfffff80037493be0 cron



From here it looks like the panic came from PID 15597, running on CPU 4, which happened to be a sh process that links back to other sh processes, a lockf process, and eventually a cron process. I checked all my jails for crontabs that may have been running, and none of them have any cron jobs except the duplicity jail, whose job runs an hour after the crash occurred, so I'm not sure what would have caused it.
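
In case it's useful to anybody, this is roughly how I swept the jails; a sketch, assuming the jails are running so jls/jexec can see them:

Code:
# list root's crontab in every running jail
for j in $(jls name); do
    echo "== $j =="
    jexec "$j" crontab -l -u root 2>/dev/null
done
# the host's own system crontab is separate
cat /etc/crontab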

Could anybody shed any light here? Are there any other dumps that would be useful? I can't seem to find a smoking gun, and I can't rule out bad hardware by reproducing the crash on demand. If it's hardware, I'd suspect memory or CPU, but it happens arbitrarily, so I don't know how I could effectively troubleshoot that.

Any help would be great!
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
Have all crashes been from sh tracing back to cron? Have you tested your memory? I know it's ECC, but if you're getting multi-bit errors...
 

davbro

Cadet
Joined
Aug 21, 2018
Messages
9
Unfortunately I only have the crash dumps from the latest crash. I'll definitely correlate if/when it happens again. Out of curiosity, how can you determine it was a multi-bit error vs. a single bit? Probably beyond the knowledge I have of ECC memory.
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
davbro said:
Unfortunately I only have the crash dumps from the latest crash. I'll definitely correlate if/when it happens again. Out of curiosity, how can you determine it was a multi-bit error vs. a single bit? Probably beyond the knowledge I have of ECC memory.
I'm no expert on the subject, but most DDR3 ECC is only single-bit correcting; some is dual-bit, but that gets more expensive, as it's basically RAID 6 on a stick of RAM (same idea with overhead: N-2 chips of capacity). If it was a detected ECC error, you may have a log in your BIOS. In any case, if enough bits were stuck/flipped, it would cause all sorts of lockups and panics, or, worse, hard resets.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
davbro said:
Out of curiosity, how can you determine it was a multi-bit error vs. a single bit? Probably beyond the knowledge I have of ECC memory.

Look at the IPMI error log; it should be there.

kdragon75 said:
I'm no expert on the subject, but most DDR3 ECC is only single-bit correcting; some is dual-bit, but that gets more expensive, as it's basically RAID 6 on a stick of RAM (same idea with overhead: N-2 chips of capacity). If it was a detected ECC error, you may have a log in your BIOS. In any case, if enough bits were stuck/flipped, it would cause all sorts of lockups and panics, or, worse, hard resets.

Single-bit errors will be corrected; multi-bit errors will be detected, and the system will be hard reset.
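
If the kernel caught a machine-check event you should also see it in the logs; a quick sketch, assuming the default log locations:

Code:
# uncorrected memory errors show up as machine-check (MCA) records
egrep -i 'MCA|ECC' /var/log/messages
dmesg | egrep -i 'MCA|ECC'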
 

wblock

Documentation Engineer
Joined
Nov 14, 2014
Messages
1,506
3:00 AM is when the system runs the daily periodic jobs. Some of them do a find that puts a moderate amount of stress on the system disk, looking for permission changes. If there is a weakness in the system, that additional bit of stress can trigger a failure. What is run is controlled by /etc/periodic.conf, although there are environment variables that also affect it. TL;DR: if you can trigger the reboot by running periodic daily from the command line, that gives you a way to test with different configurations of RAM or power supply.
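
A sketch of testing it by hand (the *_enable knobs normally come from periodic.conf, so a single script run directly needs them set in the environment; 110.clean-tmps is one example of a find-based script):

Code:
# reproduce the nightly run
periodic daily
# or exercise one script at a time to narrow down the trigger
env daily_clean_tmps_enable=YES sh /etc/periodic/daily/110.clean-tmps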
 

davbro

Cadet
Joined
Aug 21, 2018
Messages
9
wblock, you're definitely right, thanks for the insight. I forgot all about /etc/crontab. It crashed again last night, at the exact same time.

This particular cron entry seems to line up with the crash, both in timing and in the fact that it's a find command.

Code:
0	   3	   *	   *	   *	   root	find /tmp/ -iname "sessionid*" -ctime +1d -delete > /dev/null 2>&1


So it's looking for anything named sessionid*, older than one day, and deleting it. This doesn't look like a stock FreeBSD cron entry; it looks FreeNAS-specific. Does anybody know what it's trying to do? It's not space saving; the files in question are about 200 bytes. If I disable it, what am I breaking? There are no comments around it.

It can't be a very demanding command; there's only one file in the directory that matches.
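
Testing it by hand is harmless if you swap -delete for -print; same match criteria, nothing gets removed:

Code:
# dry run of the cron job's criteria
find /tmp/ -iname "sessionid*" -ctime +1d -print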

I could try testing with some more RAM... but man, RAM is spendy right now. I wonder if I should just disable that cron entry until memory prices come back down... :)

Edit: Well, it could be that, or it could be one of the 4-5 find commands in /etc/periodic/daily, which runs at 03:01. Running periodic daily by hand didn't cause a crash... so it must be a combination of that plus something else occurring at that time. Weird stuff.
 

davbro

Cadet
Joined
Aug 21, 2018
Messages
9
As an update, I've done a few things.

I had previously used the autotune feature to add ZFS tuning parameters to rc.conf. I've done some reading and it sounds like this is generally deprecated outside of a handful of scenarios where there's a huge amount of RAM involved. I disabled autotune and removed all the "created by autotune" tunables and rebooted.
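
To double-check the cleanup took effect, I spot-checked the live ZFS sysctls after the reboot; a sketch, and the ARC knobs here are just the usual suspects autotune touches, not a definitive list:

Code:
# confirm the values reverted to defaults after removing the tunables
sysctl vfs.zfs.arc_max vfs.zfs.arc_min
sysctl -a | grep '^vfs.zfs.' | head -20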

So far, no more reboots @ 03:02. It's only been two nights, so I'm not willing to say that was the contributing factor, but who knows. Fingers crossed.

Also, probably unrelated: one of my older 4TB Reds seems to have some unreadable sectors according to SMART. ZFS noted 96K repaired on a scrub and one checksum error, though the pool still reports healthy. The disk was under warranty, so I'm waiting on a replacement. (Four disks in raidz2; that's probably the only configuration where I'd pick raidz2 over striped mirrors, since my IOPS needs aren't that high and I'll take the extra failure tolerance over more speed.)
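
For the record, the checks amount to something like this; adaX is a placeholder for the suspect disk:

Code:
# SMART attributes of interest on the suspect disk
smartctl -A /dev/adaX | egrep -i 'Reallocated|Pending|Uncorrect'
# per-vdev read/write/checksum counters, plus any affected files
zpool status -v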

Finally, I updated to 11.1-U6. Looks like there was a laundry list of bug fixes, some maybe tangential to the issues I was seeing, so we'll see if things calm down.

If it happens again, I'm probably going to move to some memtest86 runs and see what happens.
 

davbro

Cadet
Joined
Aug 21, 2018
Messages
9
Since I promised to follow up:

Things had been rock solid since my last post, but last night I noticed it rebooted again, right at 3:03 AM, so roughly the same time of day as before. This time it was seemingly linked to a uwsgi process.

Since my last post, I've replaced the failing drive and resilvered the array with no issues. So it looks like I'm not out of the woods. I guess memtest86 is my next step unless anybody thinks they have a magic bullet. I want to be skeptical about failing memory, since the crash happens at the same time of day every single time, and I'd expect bad memory to introduce a random element into the mix, but who knows.
 

jliu83

Cadet
Joined
Sep 6, 2019
Messages
1
Hi Davbro, did you ever resolve this?

I am having the exact same issue: at 3 AM every night, my 11.2-U5 box reboots, and it's a mystery.

Code:
periodic daily

runs just fine from the shell; it does not reboot or crash.

Code:
find /tmp/ -iname "sessionid*" -ctime +1d -delete > /dev/null 2>&1

runs just fine from the shell as well.

The crontab file is copied from the database every day, I believe. I tried moving the two culprits to a different time slot (i.e., setting them to 2 AM to see if the reboot time changes), but on every reboot the crontab file is reset to the original.

Any hints as to what you did would be really helpful.
 