FreeNAS 11.1 locking up roughly every 24 hours

wblock · Dec 22, 2017

For everyone reporting "lockup" problems, let's try to find something in common. Was the time consistent? Did it do it at 3:00AM each time, for instance? Did it do it while a scrub or resilver was running? Were you using the old or new GUI? Please also include the basics of your hardware, including CPU, motherboard, and RAM. Thanks!

Pointeo13 · Dec 22, 2017

wblock said:
For everyone reporting "lockup" problems, let's try to find something in common. Was the time consistent? Did it do it at 3:00AM each time, for instance? Did it do it while a scrub or resilver was running? Were you using the old or new GUI? Please also include the basics of your hardware, including CPU, motherboard, and RAM. Thanks!

Hardware:
IBM System x3650 M4
2x Intel Xeon CPU E5-2680 @ 2.70GHz
Memory: 393153MB
ZFS Pool Size: 278 TiB

I only started to have issues after a resilvered kicked in because of a replaced drive, it was stuck at 22.94% using the old GUI.

vvuk · Dec 22, 2017

wblock said:
That is... unexpected. Even if FreeNAS has a problem, it should not affect IPMI. Please post your hardware.

Sorry, should have been clearer. IPMI itself was fine; I have getty running on the redirected serial console, and I was not able to get to a login prompt when I saw the crash.

Hardware:
* Supermicro X8-series motherboard
* Intel(R) Xeon(R) CPU E31220 @ 3.10GHz
* 16GB memory
* Samsung SSD 840 EVO 120GB as boot drive on mothreboard SATA
* SAS2008-based SATA controller

Last entries from /var/log/messages -- this was basically repeating since 14:50 the previous day, nothing interesting in log otherwise. If I filter out the service:nas-health warnings (capacity for a volume is at 86%, should be under 80%) and the Samba4Alert warning, then:

15:18 the previous day, I start getting 'Alert module '<update_check.UpdateCheckAlert object at 0x8161d3d30>' failed: timed out' and '[WARN] consul: error getting server health from "nas": context deadline exceeded'.
18:31 syslong-ng core dumps due to a SIGABRT
then the Samba4Alert, UpdateCheckAlert, and 'error getting server health' continue until the last message.
syslog rotated successfully at 00:00

tail end of syslog:

Code:

Dec 21 01:01:45 nas /alert.py: [system.alert:400] Alert module '<update_check.UpdateCheckAlert object at 0x8161d3d30>' failed: timed out
Dec 21 01:02:46 nas daemon[3264]:	 2017/12/21 01:02:46 [WARN] Timed out (30s) running check '/usr/local/etc/consul-checks/freenas_health.sh'
Dec 21 01:03:31 nas daemon[3264]:	 2017/12/21 01:03:30 [WARN] consul: error getting server health from "nas": context deadline exceeded
Dec 21 01:05:16 nas daemon[3264]:	 2017/12/21 01:05:16 [WARN] Timed out (30s) running check '/usr/local/etc/consul-checks/freenas_health.sh'
Dec 21 01:06:18 nas /alert.py: [system.alert:400] Alert module '<samba4.Samba4Alert object at 0x8161c5d30>' failed: timed out
Dec 21 01:07:17 nas /alert.py: [system.alert:400] Alert module '<update_check.UpdateCheckAlert object at 0x8161d3d30>' failed: timed out
Dec 21 01:07:47 nas daemon[3264]:	 2017/12/21 01:07:47 [WARN] Timed out (30s) running check '/usr/local/etc/consul-checks/freenas_health.sh'
Dec 21 01:10:17 nas daemon[3264]:	 2017/12/21 01:10:17 [WARN] Timed out (30s) running check '/usr/local/etc/consul-checks/freenas_health.sh'
Dec 21 01:11:49 nas /alert.py: [system.alert:400] Alert module '<samba4.Samba4Alert object at 0x8161c5d30>' failed: timed out
Dec 21 01:12:26 nas daemon[3264]:	 2017/12/21 01:12:26 [WARN] consul: error getting server health from "nas": context deadline exceeded
Dec 21 01:12:47 nas daemon[3264]:	 2017/12/21 01:12:47 [WARN] Timed out (30s) running check '/usr/local/etc/consul-checks/freenas_health.sh'
Dec 21 01:13:01 nas /alert.py: [system.alert:400] Alert module '<update_check.UpdateCheckAlert object at 0x8161d3d30>' failed: timed out
Dec 21 01:15:18 nas daemon[3264]:	 2017/12/21 01:15:18 [WARN] Timed out (30s) running check '/usr/local/etc/consul-checks/freenas_health.sh'
Dec 21 01:17:11 nas /alert.py: [system.alert:400] Alert module '<samba4.Samba4Alert object at 0x8161c5d30>' failed: timed out
Dec 21 01:18:14 nas /alert.py: [system.alert:400] Alert module '<update_check.UpdateCheckAlert object at 0x8161d3d30>' failed: timed out
Dec 21 01:19:12 nas daemon[3264]:	 2017/12/21 01:19:12 [WARN] Timed out (30s) running check '/usr/local/etc/consul-checks/freenas_health.sh'

arcanoia · Dec 24, 2017

Hardware

Asrock C226 WS
Intel(R) Xeon(R) CPU E3-1245 v3 @ 3.40GHz
32GB ECC RAM
2xIBM M1015-IT FW 20.00.02.00
x1 Pool with x2 Raidz-2 vdevs (8 drives each). Total of 16 drives WDRED 4TB - Pool is now 82% full (usually 80%)
1 Winkon SLC 32GB Drive as LOG

I didn't see anything interresting in any LOG.. the System just locks, that's pretty muchh what happens.. I didn't want to risk my data, so I reverted to 11.0-U4 after the third time it locked.

Scrub was automatically running when I had the problems.. otherwise the system was pretty much idling or doing very basic tasks.

Reverting to 11.0-U4 fixed everything.. scrub finished and system is up and running for days already.

@edit: the time didn't affect it on my case.. i was always able to reproduce the lock just after some minutes / hours after rebooting. I was using the old UI

KingJ · Dec 24, 2017

arcanoia said:
Hardware

Asrock C226 WS
Intel(R) Xeon(R) CPU E3-1245 v3 @ 3.40GHz
32GB ECC RAM
2xIBM M1015-IT FW 20.00.02.00
x1 Pool with x2 Raidz-2 vdevs (8 drives each). Total of 16 drives WDRED 4TB - Pool is now 82% full (usually 80%)
1 Winkon SLC 32GB Drive as LOG

I didn't see anything interresting in any LOG.. the System just locks, that's pretty muchh what happens.. I didn't want to risk my data, so I reverted to 11.0-U4 after the third time it locked.

Scrub was automatically running when I had the problems.. otherwise the system was pretty much idling or doing very basic tasks.

Reverting to 11.0-U4 fixed everything.. scrub finished and system is up and running for days already.

This is a similar hardware configuration and failure condition to me - an automatic scrub was in progress and it hung ~22 hours after boot.

Hardware in my case is;

AS Rock Rack E3C226D2I
Intel(R) Xeon(R) CPU E3-1231 v3 @ 3.40GHz
16GB EEC RAM
1x Dell PERC H310 crossflashed as a generic SAS2008 in IT Mode
Primary Pool with 3x mirroed pairs of disks (2x pairs of WD Red 3TBs, 1x pair of Seagate Ironwolf 8TBs)
Secondary Pool 1x single SSD (for jail storage)
2x 32BG USB drives for the OS

jsylvia007 · Dec 24, 2017

Unfortunately I was remote at the time, so I really couldn't do too much troubleshooting... I'm not sure if a scrub was occurring at the time it kept freezing.

My hardware stats are below, but please keep in mind I am virtualized on ESXi 6.5.

Supermicro X9DRI-LN4F+
2x Xeon E5-2670 @ 2.6ghz 16 Cores (8-cores dedicated to FreeNAS)
192GB ECC Memory (64GB Dedicated to FreeNAS)
LSI Logic - LSI2008 - Firmware: 20.00.07.00 // NVData: 14.01.00.09 // x86-BIOS: 07.39.02.00 (Via PCI Passthrough)
Array 1: 6 x Seagate IronWolf 8TB in RAIDZ2 w/ SPCC SSD Underprovisioned to 18GB SLOG
Array 2: 4 x Seagate IronWolf 8TB in Striped Mirror w/ ADATA SP900 Underprovisioned to 17GB SLOG

deastick · Dec 26, 2017

This is pretty much exactly the same as what I got today. I never noticed the errors and the system eventually ran out of swap space. I had to do a hard reset on the box.

vvuk said:
Sorry, should have been clearer. IPMI itself was fine; I have getty running on the redirected serial console, and I was not able to get to a login prompt when I saw the crash.

Hardware:
* Supermicro X8-series motherboard
* Intel(R) Xeon(R) CPU E31220 @ 3.10GHz
* 16GB memory
* Samsung SSD 840 EVO 120GB as boot drive on mothreboard SATA
* SAS2008-based SATA controller

Last entries from /var/log/messages -- this was basically repeating since 14:50 the previous day, nothing interesting in log otherwise. If I filter out the service:nas-health warnings (capacity for a volume is at 86%, should be under 80%) and the Samba4Alert warning, then:

15:18 the previous day, I start getting 'Alert module '<update_check.UpdateCheckAlert object at 0x8161d3d30>' failed: timed out' and '[WARN] consul: error getting server health from "nas": context deadline exceeded'.
18:31 syslong-ng core dumps due to a SIGABRT
then the Samba4Alert, UpdateCheckAlert, and 'error getting server health' continue until the last message.
syslog rotated successfully at 00:00

tail end of syslog:

Code:
Dec 21 01:01:45 nas /alert.py: [system.alert:400] Alert module '<update_check.UpdateCheckAlert object at 0x8161d3d30>' failed: timed out Dec 21 01:02:46 nas daemon[3264]: 2017/12/21 01:02:46 [WARN] Timed out (30s) running check '/usr/local/etc/consul-checks/freenas_health.sh' Dec 21 01:03:31 nas daemon[3264]: 2017/12/21 01:03:30 [WARN] consul: error getting server health from "nas": context deadline exceeded Dec 21 01:05:16 nas daemon[3264]: 2017/12/21 01:05:16 [WARN] Timed out (30s) running check '/usr/local/etc/consul-checks/freenas_health.sh' Dec 21 01:06:18 nas /alert.py: [system.alert:400] Alert module '<samba4.Samba4Alert object at 0x8161c5d30>' failed: timed out Dec 21 01:07:17 nas /alert.py: [system.alert:400] Alert module '<update_check.UpdateCheckAlert object at 0x8161d3d30>' failed: timed out Dec 21 01:07:47 nas daemon[3264]: 2017/12/21 01:07:47 [WARN] Timed out (30s) running check '/usr/local/etc/consul-checks/freenas_health.sh' Dec 21 01:10:17 nas daemon[3264]: 2017/12/21 01:10:17 [WARN] Timed out (30s) running check '/usr/local/etc/consul-checks/freenas_health.sh' Dec 21 01:11:49 nas /alert.py: [system.alert:400] Alert module '<samba4.Samba4Alert object at 0x8161c5d30>' failed: timed out Dec 21 01:12:26 nas daemon[3264]: 2017/12/21 01:12:26 [WARN] consul: error getting server health from "nas": context deadline exceeded Dec 21 01:12:47 nas daemon[3264]: 2017/12/21 01:12:47 [WARN] Timed out (30s) running check '/usr/local/etc/consul-checks/freenas_health.sh' Dec 21 01:13:01 nas /alert.py: [system.alert:400] Alert module '<update_check.UpdateCheckAlert object at 0x8161d3d30>' failed: timed out Dec 21 01:15:18 nas daemon[3264]: 2017/12/21 01:15:18 [WARN] Timed out (30s) running check '/usr/local/etc/consul-checks/freenas_health.sh' Dec 21 01:17:11 nas /alert.py: [system.alert:400] Alert module '<samba4.Samba4Alert object at 0x8161c5d30>' failed: timed out Dec 21 01:18:14 nas /alert.py: [system.alert:400] Alert module '<update_check.UpdateCheckAlert object at 0x8161d3d30>' failed: timed out Dec 21 01:19:12 nas daemon[3264]: 2017/12/21 01:19:12 [WARN] Timed out (30s) running check '/usr/local/etc/consul-checks/freenas_health.sh'

JoeB · Dec 28, 2017

I'm also having the same 24hr lockup issues. Looking at ssh "top", "python 3.6" running as root is using all the CPU. Not sure if that's relevant or not? I cant figure out how to know if it's running in a stalled jail, or the FN kernel.

Edit:

My green bar at the bottom of the webGUI has this over and over and over, hundreds of times:

Code:

Dec 28 20:39:24 JOE-FREENAS daemon[2989]:	 2017/12/28 20:39:23 [WARN] Timed out (30s) running check '/usr/local/etc/consul-checks/freenas_health.sh'
Dec 28 20:41:54 JOE-FREENAS daemon[2989]:	 2017/12/28 20:41:54 [WARN] Timed out (30s) running check '/usr/local/etc/consul-checks/freenas_health.sh'
Dec 28 20:44:24 JOE-FREENAS daemon[2989]:	 2017/12/28 20:44:24 [WARN] Timed out (30s) running check '/usr/local/etc/consul-checks/freenas_health.sh'
Dec 28 20:46:54 JOE-FREENAS daemon[2989]:	 2017/12/28 20:46:54 [WARN] Timed out (30s) running check '/usr/local/etc/consul-checks/freenas_health.sh'
Dec 28 20:49:24 JOE-FREENAS daemon[2989]:	 2017/12/28 20:49:24 [WARN] Timed out (30s) running check '/usr/local/etc/consul-checks/freenas_health.sh'
Dec 28 20:51:54 JOE-FREENAS daemon[2989]:	 2017/12/28 20:51:54 [WARN] Timed out (30s) running check '/usr/local/etc/consul-checks/freenas_health.sh'
Dec 28 20:54:24 JOE-FREENAS daemon[2989]:	 2017/12/28 20:54:24 [WARN] Timed out (30s) running check '/usr/local/etc/consul-checks/freenas_health.sh'
Dec 28 20:56:54 JOE-FREENAS daemon[2989]:	 2017/12/28 20:56:54 [WARN] Timed out (30s) running check '/usr/local/etc/consul-checks/freenas_health.sh'
Dec 28 20:59:24 JOE-FREENAS daemon[2989]:	 2017/12/28 20:59:24 [WARN] Timed out (30s) running check '/usr/local/etc/consul-checks/freenas_health.sh'
Dec 28 21:01:54 JOE-FREENAS daemon[2989]:	 2017/12/28 21:01:54 [WARN] Timed out (30s) running check '/usr/local/etc/consul-checks/freenas_health.sh'
Dec 28 21:04:24 JOE-FREENAS daemon[2989]:	 2017/12/28 21:04:24 [WARN] Timed out (30s) running check '/usr/local/etc/consul-checks/freenas_health.sh'

EDIT 2:

I'm also at 87% storage. As this causes a warning, and i've unticked the warning for this, it may be relevant.
I've jexec'ed into all my jails, none seem to be using any (unusual) cpu.

EDIT 3:

I managed to kill the process like this:

kill -s ABRT 214

Then i refreshed the WebGUI and the side bar loaded up. I clicked reboot and it's doing that now.
Other times I've had to do a "hold the power button" shutdown, whatever the techie name for that is.

EDIT 4:

The reboot has completed and all is well again, for now. A note that the reboot was very quick compared to usual (1-2 mins compared to 5-10 mins), has that changed in 11.1 ?
Note the time now 9:24pm 2017-12-28 UK time.
I'll report back when it happens again.

JoeB · Dec 29, 2017

Right then, it's 11:18pm the next day, so about 26 hrs later and i've lost control of my server.

I can just about ssh into the box, the ssh replies take 30 secs or so and the key presses are not echo'ed at once, makes it hard to do much, here is the result of 'top':

Code:

28661 root	   1499  52	0  1072M   620M RUN	 0 246:59 194.15% python3.6

As you can see, python3.6 again has taken over the box.

JoeB · Dec 29, 2017

Not sure if this will help as i'm out of my depth, attached is the output from procstat -kk 28661.

EDIT: 12:15am

It now appears to have sorted itself out and is running fine again! Urmmm..

Hisma · Dec 29, 2017

there is something fundamentally flawed with the VM system as it is implemented in 11.1. I went several months in 11.0 with my ubuntu server VM running without any issues. Since upgrading to 11.1 it has randomly "stopped" several times, for no reason whatsoever. On top of that, my overall system performance seems to get dragged into the mud when the VM crashes. The system wouldn't necessarily "lock up", but the performance was so incredibly bad that the web UI was effectively useless and overall system performance was unacceptable. I have since resorted to disabling the VM, and since rebooting, performance has been stable once again.

I don't have any energy to troubleshoot this issue, as I've resorted to moving my ubuntu VM to another machine instead. Just something to bring up as I have seen a few other posts about VM instability in 11.1. These issues were not present in 11.0.

Moksh Mridul · Dec 31, 2017

I'm having the exact same issues. VM's worked fine in 11.0 but they seem to keep crashing in 11.1.

In order to try and resolve the issue, I even did a fresh install of the VM, but to no avail.

I also can't see any bug for this in the bug tracker. Or am I missing something?

Bobbiek04 · Jan 3, 2018

Just had that same issue. I mainly use my box for file serving and Plex. I reset using IPMI and now things seem normal. Here is my setup:

Intel E3-1220 v3 @ 3.10GHz
Supermicro X10SL7-F
32GB Samsung DDR3-1600 8GB1Gx72 ECC
WD Red Drives

JoeB · Jan 5, 2018

Bobbiek04 said:
Just had that same issue. I mainly use my box for file serving and Plex. I reset using IPMI and now things seem normal. Here is my setup:

Intel E3-1220 v3 @ 3.10GHz
Supermicro X10SL7-F
32GB Samsung DDR3-1600 8GB1Gx72 ECC
WD Red Drives

What is your percentage storage usage? There are a few that are seeing this when the storage is > 80%, but maybe unrelated...

jag131990 · Jan 5, 2018

I too am having this problem! Trying to roll-back now. Same symptoms, everything just goes cold, while troubleshooting the intial 24 hour period became like an hour for it to manifest again.

jag131990 · Jan 5, 2018

KingJ said:
This is a similar hardware configuration and failure condition to me - an automatic scrub was in progress and it hung ~22 hours after boot.

Hardware in my case is;

AS Rock Rack E3C226D2I
Intel(R) Xeon(R) CPU E3-1231 v3 @ 3.40GHz
16GB EEC RAM
1x Dell PERC H310 crossflashed as a generic SAS2008 in IT Mode
Primary Pool with 3x mirroed pairs of disks (2x pairs of WD Red 3TBs, 1x pair of Seagate Ironwolf 8TBs)
Secondary Pool 1x single SSD (for jail storage)
2x 32BG USB drives for the OS

Very similar/same base hardware also (32GB ECC ram).

Bobbiek04 · Jan 5, 2018

My storage pool is only about 40% full, so I doubt that is it.

Once my machine restarted, I issued the "kill -s ABRT 214" that JoeB used. It has now been up and running for 36 hours without issue.

My suspicion is something with the changing jail backend in 11.1 is the culprit. Is anyone else using a GUI setup jail from previous versions?

deastick · Jan 5, 2018

Bobbiek04 said:
My storage pool is only about 40% full, so I doubt that is it.

Once my machine restarted, I issued the "kill -s ABRT 214" that JoeB used. It has now been up and running for 36 hours without issue.

My suspicion is something with the changing jail backend in 11.1 is the culprit. Is anyone else using a GUI setup jail from previous versions?

All my jails are from 9.3 or earlier. Only created with plugins.

Sent from my SM-G920W8 using Tapatalk

jsylvia007 · Jan 5, 2018

As another point of reference, my two pools are both about 40% full and I have no jails at all and therefore no plugins.

Spanner · Jan 9, 2018

Hi everyone.

I have been a FreeNAS user for many years and generally have never worried about the server at all except to install updates and every now and again change out a faulty hard drive.

After upgrading to 11.1, I have the same problem as experienced by theaddies who started this thread. I am also by no means experienced in Freenas even to know where to start looking for the fault. I have been trying to look at the various posts to see if there is some commonality between the issues.

Symptoms.
1. Multiple reporting of the following :-

Code:

Jan  9 15:46:25 freenas daemon[2895]:	 2018/01/09 15:46:25 [WARN] agent: Check 'service:nas-health' is now warning
Jan  9 15:48:35 freenas daemon[2895]:	 2018/01/09 15:48:35 [WARN] agent: Check 'service:nas-health' is now warning
Jan  9 15:50:46 freenas daemon[2895]:	 2018/01/09 15:50:46 [WARN] agent: Check 'service:nas-health' is now warning

2. Lose access to all my shared drives.

3. Most times have still access to GUI, but more often than not, the GUI hangs.

4. Drive Capacity is at 82%. I have turned off the warning about this being over 80%.

5. In the previous posts, it was also reporting that Scrubbing had started. In looking at my email reporting, I see that scrubbing had also taken place this week, but the process was completed and reported as OK.

6. In previous posts, reporting around 24 hrs working until the server stops working. In my case, I have been getting 48hrs before it stops working.

I don't know what else to mention and I am in the hands of the FreeNAS experts to provide guidance.

Regards
Andycatho

Important Announcement for the TrueNAS Community.

FreeNAS 11.1 locking up roughly every 24 hours

Documentation Engineer

Explorer

Cadet

Cadet

Cadet

Explorer

Dabbler

Contributor

Contributor

Contributor

Attachments

Explorer

Dabbler

Dabbler

Contributor

Explorer

Explorer

Dabbler

Dabbler

Explorer

Cadet

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "FreeNAS 11.1 locking up roughly every 24 hours"

Similar threads