Highly Unstable after 'upgrade' to TrueNAS 12 RC1

seb101

Contributor
Joined
Jun 29, 2019
Messages
142
Hi folks,

I 'upgraded' my FreeNAS 11.3-U5 to TrueNAS Core 12 RC1 two days ago. Since then the system has become wildly unstable. I have experienced:

1. Whole system hangs (3 times).
2. VM hangs (all 4 VMs running on the system go unresponsive via network and serial). When I try to reboot them through the UI, the TrueNAS middleware hangs, then the whole system becomes unresponsive and requires a hard reboot. This has happened twice.

Can someone help me understand what's happening? The system was running perfectly with 11.3 for a long long time.

Where do I start looking? Happy to post log files etc..

Thanks!
 

seb101

Contributor
Joined
Jun 29, 2019
Messages
142
[Screenshot: console output captured during the hang]

I managed to grab this screenshot from the console when it happened yesterday...
 

seb101

Contributor
Joined
Jun 29, 2019
Messages
142
From the Supermicro IPMI... the hangs appear to coincide with the 'Timer Interrupt' messages... what could cause this?
[Screenshot: Supermicro IPMI event log showing 'Watchdog 2' Timer Interrupt entries]


EDIT: So this is the IPMI reporting that the host OS has hung (it did not respond within the timeout period). I do not have my watchdog set to auto-reboot the system, so it just stays hung. Still no closer to finding out why it's hanging.
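For anyone checking the same thing on their own box, the watchdog events should also be visible in the BMC's System Event Log; a minimal check, assuming ipmitool is installed and the ipmi(4) driver is loaded:
Code:
# list the IPMI System Event Log and filter for watchdog events
ipmitool sel list | grep -i watchdog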
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Your system disk has got unreadable sectors. Since it seems to be /dev/da0 - is it a USB thumbdrive?
 

seb101

Contributor
Joined
Jun 29, 2019
Messages
142
Your system disk has got unreadable sectors. Since it seems to be /dev/da0 - is it a USB thumbdrive?
Hi Patrick. That is not correct. The 'unreadable sectors' are on a data pool disk and have been there for a long time. My system disks are two mirrored SSDs on /dev/ada0 and /dev/ada1.
 

seb101

Contributor
Joined
Jun 29, 2019
Messages
142
Code:
Dump header from device: /dev/da5p1
  Architecture: amd64
  Architecture Version: 4
  Dump Length: 1611776
  Blocksize: 512
  Compression: none
  Dumptime: Thu Oct  8 11:45:25 2020
  Hostname: nas.lan
  Magic: FreeBSD Text Dump
  Version String: FreeBSD 12.2-PRERELEASE 4912790fb32(HEAD) TRUENAS
  Panic String: spin lock held too long
  Dump Parity: 2374960721
  Bounds: 0
  Dump Status: good

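For reference, a header like the one above can be read back off the dump device with savecore. A hedged one-liner, assuming /dev/da5p1 is the configured dump device as shown (exact flags can vary between FreeBSD versions):
Code:
# check the dump device for a saved crash dump and print its header verbosely
savecore -vC /dev/da5p1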

Code:
spin lock 0xffffffff81f2f280 (callout) held by 0xfffff800047c9000 (tid 100003) too long
spin lock 0xfffffe00041d1080 (sched lock 6) held by 0xfffff807bc135000 (tid 103424) too long
spin lock 0xfffffe00041d1080 (sched lock 6) held by 0xfffff807bc135000 (tid 103424) too long
spin lock 0xfffffe00041d1080 (sched lock 6) held by 0xfffff807bc135000 (tid 103424) too long
spin lock 0xfffffe00041d1080 (sched lock 6) held by 0xfffff807bc135000 (tid 103424) too long
spin lock 0xfffffe00041d1080 (sched lock 6) held by 0xfffff807bc135000 (tid 103424) too long
spin lock 0xfffffe00041d1080 (sched lock 6) held by 0xfffff807bc135000 (tid 103424) too long
spin lock 0xffffffff81f2f280 (callout) held by 0xfffff800047c9000 (tid 100003) too long
panic: spin lock held too long
cpuid = 6
time = 1602153925
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00d97c7f60
vpanic() at vpanic+0x17b/frame 0xfffffe00d97c7fb0
panic() at panic+0x43/frame 0xfffffe00d97c8010
_mtx_lock_indefinite_check() at _mtx_lock_indefinite_check+0x69/frame 0xfffffe00d97c8020
_mtx_lock_spin_cookie() at _mtx_lock_spin_cookie+0xd5/frame 0xfffffe00d97c8090
callout_lock() at callout_lock+0xb8/frame 0xfffffe00d97c80c0
callout_reset_sbt_on() at callout_reset_sbt_on+0x76/frame 0xfffffe00d97c8120
callout_schedule() at callout_schedule+0x2d/frame 0xfffffe00d97c8140
termcn_cnputc() at termcn_cnputc+0x83/frame 0xfffffe00d97c8170
cnputc() at cnputc+0x3c/frame 0xfffffe00d97c81a0
cnputs() at cnputs+0x68/frame 0xfffffe00d97c81c0
putchar() at putchar+0x140/frame 0xfffffe00d97c8240
kvprintf() at kvprintf+0x109/frame 0xfffffe00d97c8370
vprintf() at vprintf+0x80/frame 0xfffffe00d97c8440
printf() at printf+0x43/frame 0xfffffe00d97c84a0
_mtx_lock_indefinite_check() at _mtx_lock_indefinite_check+0x5b/frame 0xfffffe00d97c84b0
_mtx_lock_spin_cookie() at _mtx_lock_spin_cookie+0xd5/frame 0xfffffe00d97c8520
callout_lock() at callout_lock+0xb8/frame 0xfffffe00d97c8550
_callout_stop_safe() at _callout_stop_safe+0xa1/frame 0xfffffe00d97c85d0
sleepq_timedwait() at sleepq_timedwait+0x91/frame 0xfffffe00d97c8610
msleep_spin_sbt() at msleep_spin_sbt+0xe3/frame 0xfffffe00d97c8660
vm_run() at vm_run+0xa9c/frame 0xfffffe00d97c8750
vmmdev_ioctl() at vmmdev_ioctl+0x96f/frame 0xfffffe00d97c87e0
devfs_ioctl() at devfs_ioctl+0xb0/frame 0xfffffe00d97c8830
VOP_IOCTL_APV() at VOP_IOCTL_APV+0x7b/frame 0xfffffe00d97c8860
vn_ioctl() at vn_ioctl+0x16a/frame 0xfffffe00d97c8970
devfs_ioctl_f() at devfs_ioctl_f+0x1e/frame 0xfffffe00d97c8990
kern_ioctl() at kern_ioctl+0x2b7/frame 0xfffffe00d97c89f0
sys_ioctl() at sys_ioctl+0xfa/frame 0xfffffe00d97c8ac0
amd64_syscall() at amd64_syscall+0x387/frame 0xfffffe00d97c8bf0
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00d97c8bf0
--- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x80080fb9a, rsp = 0x7fffde9f2e68, rbp = 0x7fffde9f2f20 ---
KDB: enter: panic
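The backtrace above runs through vmmdev_ioctl()/vm_run(), i.e. a bhyve VM thread, which lines up with the VM hangs. The full textdump is probably worth attaching to a bug report; a hedged way to find it, assuming TrueNAS's usual /data/crash location and savecore's standard textdump.tar.N naming:
Code:
# savecore normally writes textdumps here on TrueNAS at the first boot after a panic
ls -l /data/crash
# a textdump is a small tar archive of debugger output (msgbuf, ps, backtraces, ...)
tar -tvf /data/crash/textdump.tar.0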
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Then we will probably need all the gory details about your hardware. You could try searching the FreeBSD bugtracker for problems introduced in RELENG_12 ...
 

seb101

Contributor
Joined
Jun 29, 2019
Messages
142
So I am questioning whether the issue with the VMs locking up is related to the spin-lock kernel panic or not. I've just had the scenario occur again: all VMs are locked up and inaccessible through the serial or VNC terminal. Trying to 'STOP' the VMs leads to a middleware lock-up/timeout, but I've managed to keep the TrueNAS frontend up and running this time.

Now, if I go to the VM status page in the UI, all the VMs show as being in the 'ERROR' state, and if I try to start/restart them I get the following error reported in the UI:
Code:
Error: Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/middlewared/main.py", line 137, in call_method
    result = await self.middleware._call(message['method'], serviceobj, methodobj, params, app=self,
  File "/usr/local/lib/python3.8/site-packages/middlewared/main.py", line 1202, in _call
    return await self.run_in_executor(prepared_call.executor, methodobj, *prepared_call.args)
  File "/usr/local/lib/python3.8/site-packages/middlewared/main.py", line 1106, in run_in_executor
    return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
  File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.8/site-packages/middlewared/schema.py", line 977, in nf
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/middlewared/plugins/vm.py", line 1511, in start
    self.ensure_libvirt_connection()
  File "/usr/local/lib/python3.8/site-packages/middlewared/plugins/vm.py", line 1494, in ensure_libvirt_connection
    raise CallError('Failed to connect to libvirt')
middlewared.service_exception.CallError: [EFAULT] Failed to connect to libvirt


Is there any more debugging I can do on this particular scenario before I reboot the whole system to bring my VMs back online?
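For anyone else stuck in the same state, a few things that may be worth capturing before rebooting. These are hedged suggestions, assuming TrueNAS 12's stock setup, where VMs run under libvirt's bhyve driver and the middleware logs to /var/log/middlewared.log (the libvirtd log path below is an assumption and depends on logging configuration):
Code:
# is the libvirt daemon still running at all?
ps aux | grep '[l]ibvirtd'
# try talking to it directly over the bhyve connection URI that TrueNAS uses
virsh -c 'bhyve:///system' list --all
# capture middleware and libvirt logs for the bug report
tail -n 100 /var/log/middlewared.log
tail -n 100 /var/log/libvirt/libvirtd.log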
 

seb101

Contributor
Joined
Jun 29, 2019
Messages
142
Folks,

Really could do with some help on this... I haven't been able to keep the system up for more than a couple of days at a time, and it also runs my CCTV security system (on a VM). I had no problems whatsoever on 11.3 for years.

What system details do you need? Most of it is in my signature already...

Board: Supermicro X10SDV-4C-TLN2F
CPU: Intel Xeon D-1521 (4-core)
RAM: Kingston 64GB ECC
HBA: LSI 9211-8i (Firmware 20.00.07.00-IT)
Boot Pool: 2x Kingston A400 120GB SSD
System/Jails/VM Pool: 2x Samsung 250GB SSD
Pool 1 (Mirror): 2x 6TB HGST Ultrastar 7k6 (6TB usable)
Pool 2 (Mirror): 2x 4TB Ironwolf NAS (4TB usable)
Pool 3 (Mirror): 2x 2TB WD RED (2TB usable)
 

Brezlord

Contributor
Joined
Jan 7, 2017
Messages
189
Make sure all settings to do with the watchdog timer are off/disabled in the Supermicro BIOS. The watchdog function does not work with FreeBSD. Also make sure memory scrub is disabled.
 

seb101

Contributor
Joined
Jun 29, 2019
Messages
142
Make sure all settings to do with the watchdog timer are off/disabled in the Supermicro BIOS. The watchdog function does not work with FreeBSD. Also make sure memory scrub is disabled.
Can confirm the watchdog is disabled. I'm not sure what you mean about disabling memory scrub, though. Where is that option: BIOS or TrueNAS?
[Screenshot: BIOS showing the watchdog setting disabled]
 

seb101

Contributor
Joined
Jun 29, 2019
Messages
142
OK. So it turns out the Supermicro boards have TWO watchdogs (which explains why it's called 'Watchdog 2' in the IPMI log). Watchdog 1 is a hardware device controlled by the BIOS, but Watchdog 2 is a feature of the IPMI BMC, made available to the OS through the IPMI interface. It is seemingly not possible to 'hard' disable the IPMI watchdog (no option exists in the config); all the source material online suggests that to 'disable' it you simply don't load the kernel module for the IPMI interface, or don't start the watchdog service.

I can now confirm that the second watchdog is indeed being recognised by TrueNAS. Running the command ipmitool mc watchdog get
gives:
Code:
root@nas[/]# ipmitool mc watchdog get
Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Power Cycle (0x03)
Pre-timeout interval:   120 seconds
Timer Expiration Flags: 0x10
Initial Countdown:      137 sec
Present Countdown:      132 sec


If I repeatedly run this command, the 'Present Countdown' ticks down to about 127 seconds, then resets to 137. This indicates the OS is correctly communicating with the IPMI watchdog, providing the expected response every 10 seconds (under normal circumstances) to prevent the watchdog from declaring an error state.
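A trivial shell loop makes this easy to watch (nothing TrueNAS-specific, just ipmitool in a loop):
Code:
# poll the IPMI watchdog countdown every 2 seconds; it should tick down, then reset
while true; do
    ipmitool mc watchdog get | grep 'Present Countdown'
    sleep 2
done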

This leaves, as far as I can tell, two possibilities:

1. The IPMI watchdog and the TrueNAS system are working as expected, the 'spin lock' errors are unrelated kernel panics, and the watchdog is correctly timing out when the system hangs due to the panic. In which case, there is a genuine bug related to the 'spin-lock' issue that needs to be addressed before Release.
2. The IPMI integration between TrueNAS and the Supermicro IPMI is broken and is, in fact, the thing causing the 'spin-lock' issues in the first place. As I have done no special configuration to enable this integration (TrueNAS/FreeBSD must have auto-enabled it), this is also a genuine bug that needs to be fixed before Release.

To test the theory in (2): does anyone know how I can manually disable the watchdog integration between FreeBSD and the IPMI?
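For reference, two candidate experiments, both hedged (the middleware may simply re-arm the timer, and unloading the driver also removes IPMI sensor access):
Code:
# ask the BMC to stop the currently running watchdog countdown
ipmitool mc watchdog off
# or unload the host-side IPMI driver so nothing can arm/pet the timer
kldunload ipmi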
 

Brezlord

Contributor
Joined
Jan 7, 2017
Messages
189
There is a jumper on the motherboard to disable the watchdog.
 

seb101

Contributor
Joined
Jun 29, 2019
Messages
142
There is a jumper on the motherboard to disable the watchdog.
Thanks for your continued help. The jumper on the motherboard controls the same watchdog as the BIOS setting (Watchdog 1). It does not affect the operation of the IPMI watchdog (Watchdog 2). In any case, I can confirm the watchdog is disabled on the motherboard (open pins).
 

Brezlord

Contributor
Joined
Jan 7, 2017
Messages
189
I had the same issue with a Supermicro board running pfSense. I had to disable the watchdog in the BIOS and on the motherboard, and disable memory scrubs. I'm not sure where it is in your BIOS, so download the manual and search for keywords.
 

seb101

Contributor
Joined
Jun 29, 2019
Messages
142
I had the same issue with a Supermicro board running pfSense. I had to disable the watchdog in the BIOS and on the motherboard, and disable memory scrubs. I'm not sure where it is in your BIOS, so download the manual and search for keywords.
Are you referring to this feature in the BIOS?

Patrol Scrub: Patrol Scrubbing is a process that allows the CPU to correct correctable memory errors detected on a memory module and send the correction to the requestor (the original source). When this item is set to Enabled, the IO hub will read and write back one cache line every 16K cycles, if there is no delay caused by internal processing. By using this method, roughly 64 GB of memory behind the IO hub will be scrubbed every day. The options are Enable and Disable.

Patrol Scrub Interval: This feature allows you to decide how many hours the system should wait before the next complete patrol scrub is performed. Use the keyboard to enter a value from 0-24. The default setting is 24.

This would seem to align with the roughly 24-hour recurrence of my issue. I will turn it off and see if I get more stability.

It still does not explain why this was not an issue under FreeNAS 11.3, unless there is an issue between the FreeBSD 12 kernel and the scrubbing mechanism that was not present in FreeBSD 11. If that were the case, I'd expect a lot more noise about the issue, as X10 Supermicro boards are very common and this is switched on by default.

In any case... I will try...
 

Brezlord

Contributor
Joined
Jan 7, 2017
Messages
189
Yes, that's it. All of my Dell and IBM servers have that feature disabled by default, but my Supermicro board, which I use for pfSense, had it enabled by default, and it was causing kernel panics and reboots. Disable memory scrub, the BIOS watchdog, and the hardware watchdog.
 

Brezlord

Contributor
Joined
Jan 7, 2017
Messages
189
I had issues with FreeBSD 11.3-STABLE (pfSense 2.4.5-RELEASE-p1).
 

seb101

Contributor
Joined
Jun 29, 2019
Messages
142
No joy on this. The same issue occurred despite disabling it.
 