Diagnosing random reboots

davbro · Aug 21, 2018

Hi everybody,

I've recently been battling some issues with random reboots of my FreeNAS system. I've had two so far this month. I've been using FreeNAS for a couple years, and I've got a pretty good knowledge of FreeBSD (and other unix-like derivatives) for years before that. I'm wondering if somebody may be able to shed a light on what may be going on here.

First, I'm running FreeNAS 11.1-U5 on the following hardware, no monitor/keyboard/mouse attached:

CPU: Xeon E3-1246 v3 (4C/8T)
Motherboard: Supermicro X10SAE
RAM: 24GB DDR3 ECC unbuffered (2x4GB Axiom modules, 2x8GB Samsung modules)
Power Supply: Supermicro-branded 500W 80-plus bronze in the Supermicro tower case I've got everything in

Running 2 pools: 2x256GB Crucial SSD mirror, 4x4TB WD Red in a raidz2 pool.

However, I think the disks in this case are irrelevant. However, all pools report healthy via zpool status.

Overall, the server is just doing some light homelab-type duties for desktop backup, media sharing, etc. It's not very stressed. I'm running about 5 iocage jails (plex, sabnzbd, sonarr, duplicity, unifi controller) and one bhyve VM (debian for unifi video). CPU basically sits between 10-20% most of the time, disk activity is very low with no sustained writes.

Last night, around 03:02 or so, I noticed that it rebooted. I suspect a kernel panic and a watchdog reset, but I don't see any events in the IPMI event log about a watchdog reset, so I can't prove that.

From /data/crash/info.1:

Here's a snippet of the end of the msgbuf.txt file from the resulting /data/crash/textdump.tar.0.gz file:

The same file indicates panic: double fault in the same file, right above the huge crash dump. Nothing /var/log/messages indicates any kind of problem prior to the reboot... just noting the configuration reload request at 00:00 and then about 3 hours later the standard dmesg output when FreeBSD boots.

So msgbuf.txt tells me that a kernel panic occurred, but doesn't seem to shed much light on what actually did it. However, ddb.txt seems to:

From here it looks like the panic came from PID 15597, running on CPU4, which happened to be a sh process that links back to other sh processes, a lockf process, and eventually a cron process. I checked all my jails for crontabs that may have been running, and none of them have any cron jobs except the duplicity jails, which runs an hour later than the crash occurred, so I'm not sure what would have caused it.

Could anybody shed any light here? Are there any other dumps that would be useful? I can't seem to find a magic bullet, or rule out bad hardware by replicating a crash. If it's hardware, I'd suspect memory or CPU, but it happens arbitrarily so I don't know how I could effectively troubleshoot that.

Any help would be great!

kdragon75 · Aug 21, 2018

Have all crashes be from sh tracing back to cron? Have you tested your memory? I know its ECC but if your getting multi bit errors...

davbro · Aug 21, 2018

Unfortunately I only have the crash dumps from the latest crash. I'll definitely correlate if/when it happens again. Out of curiosity, how can you determine it was a multi-bit error vs. a single bit? Probably beyond the knowledge I have of ECC memory.

kdragon75 · Aug 21, 2018

davbro said:
Unfortunately I only have the crash dumps from the latest crash. I'll definitely correlate if/when it happens again. Out of curiosity, how can you determine it was a multi-bit error vs. a single bit? Probably beyond the knowledge I have of ECC memory.

I'm no expert on the subject but most DDR3 ECC is only single bit correcting some is dual but that get more expensive as its basically RAID6 on a stick of ram, same idea with over head. N-2 chips of capacity. If it is from a detected ECC Error, you may have a log in your BIOS. In anycase, if enough bits were stuck/flipped it would cause all sorts of lockups and panics. Or better, hard resets.

Bidule0hm · Aug 23, 2018

davbro said:
Out of curiosity, how can you determine it was a multi-bit error vs. a single bit? Probably beyond the knowledge I have of ECC memory.

Look, at the IPMI errors log, it should be here.

kdragon75 said:
I'm no expert on the subject but most DDR3 ECC is only single bit correcting some is dual but that get more expensive as its basically RAID6 on a stick of ram, same idea with over head. N-2 chips of capacity. If it is from a detected ECC Error, you may have a log in your BIOS. In anycase, if enough bits were stuck/flipped it would cause all sorts of lockups and panics. Or better, hard resets.

Single bit will be corrected, multi bits will be detected and the system will be hard resetted.

wblock · Aug 23, 2018

3:00AM is when the system runs daily logs. Some of them do a find that puts a moderate amount of stress on the system disk looking for permissions changes. If there is a weakness in the system, that additional bit of stress can trigger a failure. What is run is controlled by /etc/periodic.conf, although there are environment variables that also affect it. TLDR: if you can trigger the reboot by running periodic daily from the command line, it will allow testing with different configurations of RAM or power supply.

davbro · Aug 24, 2018

wblock, you're definitely right, thanks for the insight. I forgot all about /etc/crontab. It crashed again last night, at the exact same time.

Looks like this particular cron entry seems to line up with the crash, in both time and the fact it's a find command.

Code:

0	   3	   *	   *	   *	   root	find /tmp/ -iname "sessionid*" -ctime +1d -delete > /dev/null 2>&1

So it's looking for anything named sessionid*, older than one day, and then deletes it. ~~This looks like it's a stock FreeBSD cron entry...~~ This looks like it's FreeNAS-specific -- does anybody know what this is trying to do? It's not space saving, the files in question are 200 bytes. If I disable it, what am I breaking? There's no comments around it.

It can't be a very involving command, there's only one file that matches that find command in the directory.

I could try and test with some more RAM... but man, RAM is spendy right now. I wonder if I just disable that cron entry until memory prices come back down... :)

Edit: Well, it could be that, or or it could be from one of the 4-5 find commands in /etc/periodic/daily, which runs at 03:01. Running periodic daily didn't seem to cause a crash... so must be a combination of that plus something else occurring at that time. Weird stuff.

davbro · Aug 26, 2018

As an update, I've done a few things.

I had previously used the autotune feature to add ZFS tuning parameters to rc.conf. I've done some reading and it sounds like this is generally deprecated outside of a handful of scenarios where there's a huge amount of RAM involved. I disabled autotune and removed all the "created by autotune" tunables and rebooted.

So far, no more reboots @ 03:02. It's only been two nights, so not willing to say that was contributing in my case, but who knows. Fingers crossed.

Also, probably unrelated, one of my "older" 4TB reds seems to have some unreadable sectors according to SMART. ZFS noting 96K repaired on a scrub, and 1 Checksum error, though pool still reports healthy. Disk was under warranty, so waiting on a replacement. This is probably unrelated. (four disks in raidz2, probably the only configuration I'd consider raidz2 in vs. striped mirrors, as my IOPS needs aren't that high and I'll take the increased failure flexibility over more speed)

Finally, I updated to 11.1-U6. Looks like there was a laundry list of bug fixes, some maybe tangential to the issues I was seeing, so we'll see if things calm down.

If it happens again, I'm probably going to move to some memtest86 runs and see what happens.

davbro · Dec 1, 2018

Since I promised to follow up:

Things had been rock solid since my last post, though last night I noticed it rebooted again, and right at 3:03am. So roughly the same time of the day it was doing it before. This time seemingly linked to a uwsgi process.

Since my last post, I've replaced a failing drive and resilvered the array with no issues. So looks like I'm not out of the weeds. Memtest86 I guess is my next step unless anybody thinks they have a magic bullet. I want to be skeptical about failing memory since the crash happens at the same time of the day every single time, and I'd think bad memory would introduce a random element into the mix, but who knows.

jliu83 · Sep 6, 2019

Hi Davbro, did you ever resolve this?

I am having the exact same issue. 3am, every night, my 11.2 U5 box reboots and it is a mystery.

periodic daily

Runs just fine from shell, does not reboot or crash

find /tmp/ -iname "sessionid*" -ctime +1d -delete > /dev/null 2>&1

Runs just fine from shell as well.

The crontab file is copied from the database everyday I believe. I tried modifying the two culprits to a different time slow (IE, set it to 2am to see if the reboot time changes), but on every reboot, the crontab file is reset to the original.

Any hints as to what you did would be really helpful.

Important Announcement for the TrueNAS Community.

Diagnosing random reboots

davbro

Cadet

kdragon75

Wizard

davbro

Cadet

kdragon75

Wizard

Bidule0hm

Server Electronics Sorcerer

wblock

Documentation Engineer

davbro

Cadet

davbro

Cadet

davbro

Cadet

jliu83

Cadet

Similar threads

Important Announcement for the TrueNAS Community.

Diagnosing random reboots

Cadet

Wizard

Cadet

Wizard

Server Electronics Sorcerer

Documentation Engineer

Cadet

Cadet

Cadet

Cadet

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Diagnosing random reboots"

Similar threads