New TrueNAS core build crashing and rebooting

Integer

Dabbler
Joined
Mar 19, 2018
Messages
11
I built a new TrueNAS Core machine recently, and I'm having trouble with it regularly crashing and rebooting. I first built it with a fresh install of TrueNAS-13.0-U3.1. Then I spent a week or so replicating all the datasets from my previous TrueNAS machine (currently running TrueNAS-12.0-U6.1). After the replication finished, I exported the config and the jails from the old machine, imported both on the new machine, and upgraded the jails. This is about when the crashes began.

At first I thought the jails were causing problems by being mismatched, which is one of the reasons I upgraded them, but the issues continued. I tried disabling different combinations of jails, and that seemed to affect how often the crash happens, but it finally crashed earlier today with no jails running. At the time I was stress testing it by copying a few large files over SMB, running WinDirStat over the network (measuring disk space usage by subtree), and running a scrub. It crashed within 10 minutes or so of starting those, so I'm thinking network load may have something to do with it. On the other hand, it didn't have this problem during the week of dataset replication.

I ran MemTest86 through a few rounds and it didn't find any problems. I haven't yet tried reverting to the state from before importing the config, though that's the next thing I can think of to try.

I've included all the crash dumps I've been saving off. Except for #4, they are all page faults. Most (but not all) of them have ether_nh_input in the stack. "epair_task" and "swi5: fast taskq" also seem to show up a lot, but I'm naive when it comes to analyzing BSD core dumps and might be looking in the wrong places.
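
One way to check how often a given function is the faulting frame is to pull ddb.txt out of each textdump and print the frame that follows the first "--- trap 0xc" marker. The following is a minimal sketch, not a definitive tool. It assumes the attachments sit in the current directory and that each archive has a member named exactly ddb.txt (the usual textdump layout, but worth confirming with tar -tzf). fault_frame is a hypothetical helper name.

```shell
#!/bin/sh
# fault_frame: read a DDB backtrace on stdin and print the function named on
# the line right after the first page-fault marker ("--- trap 0xc,").
fault_frame() {
    awk '
        found { sub(/\(\).*/, ""); print; exit }   # frame after the marker
        /^--- trap 0xc,/ { found = 1 }
    '
}

# Tally faulting frames across all saved textdumps (skipped if none exist).
for f in textdump.tar.*.gz; do
    [ -e "$f" ] || continue
    tar -xzOf "$f" ddb.txt 2>/dev/null | fault_frame
done | sort | uniq -c | sort -rn
```

Fed the first example stack quoted later in this thread, fault_frame prints ether_nh_input; fed the second, it prints sbcut_internal. The faulting frame alone does not identify the culprit, since a driver can hand a bad buffer up the stack, so the tally is only a starting point.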

Hardware
Motherboard: ASRock B550 PG Velocita
CPU: AMD Ryzen 5 5600G
RAM: 32GB
Hard Drives (storage pool): 6 x WD Red Pro 16TB in Raid Z1
Hard Drive (boot): Teamgroup MP33 512GB
 

Attachments

  • textdump.tar.0.gz
    60 KB · Views: 187
  • textdump.tar.1.gz
    69.3 KB · Views: 188
  • textdump.tar.2.gz
    60.2 KB · Views: 189
  • textdump.tar.3.gz
    60 KB · Views: 176
  • textdump.tar.4.gz
    59.7 KB · Views: 182
  • textdump.tar.5.gz
    60.6 KB · Views: 171
  • textdump.tar.7.gz
    64.5 KB · Views: 173
  • textdump.tar.8.gz
    64.6 KB · Views: 183
  • textdump.tar.9.gz
    59.2 KB · Views: 193
  • textdump.tar.10.gz
    58.3 KB · Views: 162

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
hardware info is missing from your post.
 

Integer

Dabbler
Joined
Mar 19, 2018
Messages
11
I've updated the initial post with hardware information. Sorry I left that out. I've done some digging, and what I'm seeing looks a lot like what is discussed at https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=267704

I'm not super conversant in BSD, so I'm not sure if pkg info shows what I have installed or what my package database knows about, but

pkg info realtek-re-kmod

shows

Code:
realtek-re-kmod-v196.04_3
Name           : realtek-re-kmod
Version        : v196.04_3
Installed on   : Fri Nov 11 09:11:59 2022 PST
Origin         : net/realtek-re-kmod
Architecture   : FreeBSD:13:amd64
Prefix         : /usr/local
Categories     : net kld
Licenses       : BSD4CLAUSE
Maintainer     : ale@FreeBSD.org
WWW            : https://www.realtek.com/en/component/zoo/category/network-interface-controllers-10-100-1000m-gigabit-ethernet-pci-express-software
Comment        : Kernel driver for Realtek PCIe Ethernet Controllers
Annotations    :
        FreeBSD_version: 1301000
        repo_type      : binary
        repository     : local
Flat size      : 5.35MiB
Description    :
Realtek PCIe FE / GBE / 2.5G / Gaming Ethernet Family Controller
kernel driver.

This is the official driver from Realtek and can be loaded instead of
the FreeBSD driver built into the GENERIC kernel if you experience
issues with it (eg. watchdog timeouts), or your card is not supported.

Supported devices:

* 2.5G Gigabit Ethernet
  - RTL8125 / RTL8125B(S)

* 10/100/1000M Gigabit Ethernet
  - RTL8111B / RTL8111C / RTL8111D / RTL8111E / RTL8111F / RTL8111G(S)
    RTL8111H(S) / RTL8118(A)(S) / RTL8119i / RTL8111L / RTL8111K
  - RTL8168B / RTL8168E / RTL8168H
  - RTL8111DP / RTL8111EP / RTL8111FP
  - RTL8411 / RTL8411B

* 10/100M Fast Ethernet

It could be I'm out of luck due to having a poor NIC. I'd still like to experiment to see if I can find a way to stabilize on this hardware before going out and replacing it. Is there any way to update a package or a port in TrueNAS?
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
It could be I'm out of luck due to having a poor NIC.
yup.
this is one of the reasons for the requirement to have the hardware, so we aren't wasting time trying to diagnose crap.
and, unfortunately, crap is what you have for a NIC, and usually boards with realtek aren't great for the rest of the components either. if they are going to cheap out on the NIC, they are hardly likely to spend on the rest of the parts.

asrock rack are usually OK, but this is a gaming board first. the realtek and other "meh" parts are fine for gaming, but not TrueNAS.

you could likely use one of the other nas systems, like unraid or OMV, with fewer issues.
you might also have more success with SCALE, as the debian base tends to have more up-to-date drivers for hardware FreeBSD doesn't bother keeping working.

(also, the pool info should have the type: stripe/mirrors/raidz1/2/3)
 

Integer

Dabbler
Joined
Mar 19, 2018
Messages
11
For the existing NIC I followed the instructions at https://www.truenas.com/docs/core/corereleasenotes/ under "known issues", but obviously that's not going to save me if I hit a driver bug. That does leave me with a few questions:

  1. Am I reading the stack traces correctly? Does this look like an issue in the NIC driver?
  2. If I unlock pkg and update the realtek-re-kmod package, does this have any reasonable chance of working? And if I do this, will it "stick", or is there some logic in TrueNAS that will restore the original version out from under me? I understand that it is not supported or intended to use package management, but I'm willing to give it a try for a single package, especially if it's a different version of something already present.
  3. Would it be a supported configuration to add a generic PCIe NIC with the Intel I225-V or the Intel 82576 chipset and disable the motherboard's built-in?

Some example stacks from the textdumps:

Code:
db:0:kdb.enter.default>  bt
Tracing pid 12 tid 100045 td 0xfffffe00e09aa740
kdb_enter() at kdb_enter+0x37/frame 0xfffffe00e06da9d0
vpanic() at vpanic+0x1b0/frame 0xfffffe00e06daa20
panic() at panic+0x43/frame 0xfffffe00e06daa80
trap_fatal() at trap_fatal+0x385/frame 0xfffffe00e06daae0
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00e06dab40
calltrap() at calltrap+0x8/frame 0xfffffe00e06dab40
--- trap 0xc, rip = 0xffffffff80c55c6c, rsp = 0xfffffe00e06dac10, rbp = 0xfffffe00e06dac60 ---
ether_nh_input() at ether_nh_input+0x1c/frame 0xfffffe00e06dac60
netisr_dispatch_src() at netisr_dispatch_src+0xb9/frame 0xfffffe00e06dacb0
ether_input() at ether_input+0x69/frame 0xfffffe00e06dad10
re_rxeof() at re_rxeof+0x2ad/frame 0xfffffe00e06dad80
re_int_task_8125() at re_int_task_8125+0xb4/frame 0xfffffe00e06dadc0
taskqueue_run_locked() at taskqueue_run_locked+0x181/frame 0xfffffe00e06dae40
taskqueue_run() at taskqueue_run+0x68/frame 0xfffffe00e06dae60
ithread_loop() at ithread_loop+0x25a/frame 0xfffffe00e06daef0
fork_exit() at fork_exit+0x7e/frame 0xfffffe00e06daf30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00e06daf30
--- trap 0x80b7b350, rip = 0xffffffff80aa32cf, rsp = 0, rbp = 0x3414000 ---
mi_startup() at mi_startup+0xdf/frame 0x3414000
db:0:kdb.enter.default>  show allpcpu
Current CPU: 4
...
pid 12 tid 100045 critnest 1 "swi5: fast taskq"



Code:
db:0:kdb.enter.default>  bt
Tracing pid 12 tid 100045 td 0xfffffe00e09aa740
kdb_enter() at kdb_enter+0x37/frame 0xfffffe00e06da650
vpanic() at vpanic+0x1b0/frame 0xfffffe00e06da6a0
panic() at panic+0x43/frame 0xfffffe00e06da700
trap_fatal() at trap_fatal+0x385/frame 0xfffffe00e06da760
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00e06da7c0
calltrap() at calltrap+0x8/frame 0xfffffe00e06da7c0
--- trap 0xc, rip = 0xffffffff80bbb558, rsp = 0xfffffe00e06da890, rbp = 0xfffffe00e06da8d0 ---
sbcut_internal() at sbcut_internal+0xa8/frame 0xfffffe00e06da8d0
tcp_do_segment() at tcp_do_segment+0x18c8/frame 0xfffffe00e06da9b0
tcp_input_with_port() at tcp_input_with_port+0xb61/frame 0xfffffe00e06daae0
tcp_input() at tcp_input+0xb/frame 0xfffffe00e06daaf0
ip_input() at ip_input+0x11f/frame 0xfffffe00e06dab80
netisr_dispatch_src() at netisr_dispatch_src+0xb9/frame 0xfffffe00e06dabd0
ether_demux() at ether_demux+0x138/frame 0xfffffe00e06dac00
ether_nh_input() at ether_nh_input+0x355/frame 0xfffffe00e06dac60
netisr_dispatch_src() at netisr_dispatch_src+0xb9/frame 0xfffffe00e06dacb0
ether_input() at ether_input+0x69/frame 0xfffffe00e06dad10
re_rxeof() at re_rxeof+0x2ad/frame 0xfffffe00e06dad80
re_int_task_8125() at re_int_task_8125+0xb4/frame 0xfffffe00e06dadc0
taskqueue_run_locked() at taskqueue_run_locked+0x181/frame 0xfffffe00e06dae40
taskqueue_run() at taskqueue_run+0x68/frame 0xfffffe00e06dae60
ithread_loop() at ithread_loop+0x25a/frame 0xfffffe00e06daef0
fork_exit() at fork_exit+0x7e/frame 0xfffffe00e06daf30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00e06daf30
--- trap 0x80b7b350, rip = 0xffffffff80aa32cf, rsp = 0, rbp = 0x3408000 ---
mi_startup() at mi_startup+0xdf/frame 0x3408000
db:0:kdb.enter.default>  show allpcpu
Current CPU: 4
...
curthread    = 0xfffffe00e09aa740: pid 12 tid 100045 critnest 1 "swi5: fast taskq"


Code:
db:0:kdb.enter.default>  bt
Tracing pid 0 tid 100206 td 0xfffffe010617c560
kdb_enter() at kdb_enter+0x37/frame 0xfffffe01064edab0
vpanic() at vpanic+0x1b0/frame 0xfffffe01064edb00
panic() at panic+0x43/frame 0xfffffe01064edb60
trap_fatal() at trap_fatal+0x385/frame 0xfffffe01064edbc0
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe01064edc20
calltrap() at calltrap+0x8/frame 0xfffffe01064edc20
--- trap 0xc, rip = 0xffffffff80c55c6c, rsp = 0xfffffe01064edcf0, rbp = 0xfffffe01064edd40 ---
ether_nh_input() at ether_nh_input+0x1c/frame 0xfffffe01064edd40
netisr_dispatch_src() at netisr_dispatch_src+0xb9/frame 0xfffffe01064edd90
ether_input() at ether_input+0x69/frame 0xfffffe01064eddf0
epair_tx_start_deferred() at epair_tx_start_deferred+0x177/frame 0xfffffe01064ede40
taskqueue_run_locked() at taskqueue_run_locked+0x181/frame 0xfffffe01064edec0
taskqueue_thread_loop() at taskqueue_thread_loop+0xc2/frame 0xfffffe01064edef0
fork_exit() at fork_exit+0x7e/frame 0xfffffe01064edf30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe01064edf30
--- trap 0x3407000, rip = 0xffffffff80aa32cf, rsp = 0, rbp = 0x3407000 ---
mi_startup() at mi_startup+0xdf/frame 0x3407000
db:0:kdb.enter.default>  show allpcpu
Current CPU: 0
...
curthread    = 0xfffffe010617c560: pid 0 tid 100206 critnest 1 "epair_task"
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
And if I do this will it "stick"
nope. truenas is an appliance and has no provisions for installing drivers. you will have to reinstall with every update; maybe every reboot depending.
Would it be a supported configuration to add a generic PCIe NIC with the Intel I225-V or the Intel 82576 chipset and disable the motherboard's built-in?
no, because the motherboard is not supported. it would get you to a less unsupported config.
6 x WD Red Pro 16TB in Raid Z1
this is really not a good idea. raidz1 on drives over 2TB is not recommended. you are over that by 8x
Am I reading the stack traces correctly? Does this look like an issue in the NIC driver?
I am honestly not sure, and frankly, not interested in investigating. I, and many in the forums, consider unsupported hardware a waste of our donated time when there are plenty of resources explaining what is supported, and importantly, why it's supported.
there are other NAS projects designed and intended to work on consumer hardware. TrueNAS is not one of them, being a storage appliance built for Enterprise hardware that the company makes available for free. it can be amazing, but you have to feed it properly.
 

trevaaar

Cadet
Joined
Aug 11, 2013
Messages
5
Just so you're clear, there are two different Realtek drivers included with TrueNAS Core 13.0-U3.1. There's the default one provided by the base FreeBSD system, and also the one in the realtek-re-kmod package. Both have known issues, which is why NICs from other manufacturers are strongly recommended here.

If the crashes you experienced were before you had set the if_re_load and if_re_name tunables, they were caused by the default FreeBSD if_re driver. If they happened after you set them, they were caused by realtek-re-kmod 196.04.
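
For reference, the two tunables named above amount to the following loader settings. This is only a sketch of the workaround as it is usually described: on TrueNAS they would normally be added as tunables of type "loader" through the UI rather than typed into a file, and the module path is an assumption based on where if_re.ko is reported to live later in this thread.

```shell
# Load the if_re module at boot...
if_re_load="YES"
# ...and take it from the realtek-re-kmod package location rather than the
# stock driver shipped with the base system.
if_re_name="/boot/modules/if_re.ko"
```

With neither tunable set, nothing points the loader at the package's module, which matches the distinction drawn above between the two drivers.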
 

Integer

Dabbler
Joined
Mar 19, 2018
Messages
11
Thanks for helping me clarify. I'm pretty confident that it's the realtek-re-kmod driver that is in use. I added the tunables early on. Before I did that the network didn't work at all. I had to bootstrap by using kldload if_re from the physical console to be able to get to the UI to add the tunables. I'm a FreeBSD novice, so to double check: does this output also mean I'm using it?

kldstat | grep if_re
13 1 0xffffffff83156000 ff4b8 if_re.ko

Does anyone know if the contents of /boot/modules were to change, would that change persist between reboots, or is there some shadow image somewhere that gets moved or mounted on boot? I understand an OS update would almost certainly change it; I'm just asking about a reboot.
 

trevaaar

Cadet
Joined
Aug 11, 2013
Messages
5
I added the tunables early on. Before I did that the network didn't work at all. I had to bootstrap by using kldload if_re from the physical console to be able to get to the UI to add the tunables.
In that case you're definitely using realtek-re-kmod. It looks like the only NIC on your motherboard is an RTL8125BG, which is not recognised by the default driver.

Does anyone know if the contents of /boot/modules were to change, would that change persist between reboots, or is there some shadow image somewhere that gets moved or mounted on boot? I understand an OS update would almost certainly change it; I'm just asking about a reboot.
I can't say for sure, but /boot seems to be a real part of the boot pool, as opposed to some other locations like /etc and /var which are a tmpfs. I do know that /root isn't overwritten on reboot, if you need somewhere to store another version of the driver to test with.
 

Integer

Dabbler
Joined
Mar 19, 2018
Messages
11
Just to update. I enabled the FreeBSD package database and upgraded realtek-re-kmod which fetched 197.00. I rebooted and it seems to have stuck with the 197.00 at /boot/modules/if_re.ko. pkg info calls it 196.04_3, but I think that's because the package database is blown away on reboot. I hashed the if_re.ko file at each step and it's the new one that is currently present. So far it's been solid; no crashes in a time far longer than 196.04_3 ever survived under similar conditions. I expect it to be blown away whenever I update TrueNAS, and I know I'm well outside supported configuration. I just want to let other people including the maintainers know what has worked for me.
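
The hash check described above can be scripted so it is easy to repeat after a reboot or an update. This is a minimal sketch, assuming a reference copy of the newer driver was saved somewhere that survives reboots (the file name under /root is hypothetical). file_sha256 and same_file are illustrative helper names, and the first one falls back from FreeBSD's sha256 to sha256sum or shasum where those are what is installed.

```shell
#!/bin/sh
# Print the SHA-256 digest of a file using whichever tool is available:
# sha256 (FreeBSD base), sha256sum (GNU coreutils) or shasum (Perl).
file_sha256() {
    if command -v sha256 >/dev/null 2>&1; then
        sha256 -q "$1"
    elif command -v sha256sum >/dev/null 2>&1; then
        sha256sum "$1" | cut -d ' ' -f 1
    else
        shasum -a 256 "$1" | cut -d ' ' -f 1
    fi
}

# Succeed only if both files exist and have identical digests.
same_file() {
    [ -f "$1" ] && [ -f "$2" ] && [ "$(file_sha256 "$1")" = "$(file_sha256 "$2")" ]
}

# Hypothetical paths: the module in use and a saved copy of the 197.00 build.
installed=/boot/modules/if_re.ko
reference=/root/if_re.ko.197.00

if same_file "$installed" "$reference"; then
    echo "if_re.ko matches the saved copy"
else
    echo "if_re.ko differs from the saved copy (or a file is missing)"
fi
```

Comparing the file itself avoids relying on pkg info, which in this case still reported the old version.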
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
blown away whenever I update TrueNAS
it also might not work at all with the next update, depending on what changes. but it's your choice.
I would highly recommend looking into replatforming. this could be a backup machine.
 

trevaaar

Cadet
Joined
Aug 11, 2013
Messages
5
Version 198.00 is available from pkg if you use the "latest" repo instead of "quarterly", but I have to agree you'll probably have a better time either adding a PCIe NIC or migrating to SCALE.
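
For anyone following along, switching pkg from the quarterly to the latest branch is normally done with a small repository override. The sketch below only writes the fragment into a scratch directory so it can be inspected. Where it has to go on a TrueNAS Core install (commonly /usr/local/etc/pkg/repos/FreeBSD.conf) and whether other repository files there also need changing are assumptions to verify on your own system. demo_dir is a hypothetical name.

```shell
#!/bin/sh
# Write a pkg repository override that points the FreeBSD repo at the
# "latest" branch instead of "quarterly". It is written to a scratch
# directory here; nothing on the live system is modified.
demo_dir=$(mktemp -d)

cat > "$demo_dir/FreeBSD.conf" <<'EOF'
FreeBSD: {
  url: "pkg+http://pkg.FreeBSD.org/${ABI}/latest",
  enabled: yes
}
EOF

# Show the result for review before copying it anywhere.
cat "$demo_dir/FreeBSD.conf"
```

Once a fragment like this is in place, pkg update -f followed by pkg upgrade realtek-re-kmod would be the usual way to pick up the newer build, with the same caveat as before that a TrueNAS update is likely to undo it.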

According to the release notes, TrueNAS Core has included the igc driver for Intel i225 NICs since 12.0-U8, if that helps your decision.
 