SOS: kernel panic Help!!

Itay1778 · Oct 11, 2022

Hi everyone, I've been having kernel panic for a few weeks. It happens every 3 days or so and I can't find a solution for it. And I'm afraid it will cause data loss at this point.
Things I know don't cause this problem.
- It's not a RAM problem, I did a memtest86 6 pass tests and they all passed without a single error.
- No problem with the drives, their SMART is working.

I checked in /data/crash/textdump (the latest one) and in the file msgbuf.txt at the end of the file:

Code:

Fatal trap 12: page fault while in kernel mode
cpuid = 3; apic id = 05
fault virtual address    = 0x0
fault code        = supervisor read data, page not present
instruction pointer    = 0x20:0xffffffff8265fcde
stack pointer            = 0x28:0xfffffe011f269750
frame pointer            = 0x28:0xfffffe011f269800
code segment        = base 0x0, limit 0xfffff, type 0x1b
            = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags    = interrupt enabled, resume, IOPL = 0
current process        = 54949 (smbd)
trap number        = 12
panic: page fault
cpuid = 3
time = 1665443007
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe011f269510
vpanic() at vpanic+0x17f/frame 0xfffffe011f269560
panic() at panic+0x43/frame 0xfffffe011f2695c0
trap_fatal() at trap_fatal+0x385/frame 0xfffffe011f269620
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe011f269680
calltrap() at calltrap+0x8/frame 0xfffffe011f269680
--- trap 0xc, rip = 0xffffffff8265fcde, rsp = 0xfffffe011f269750, rbp = 0xfffffe011f269800 ---
dnode_hold_impl() at dnode_hold_impl+0x31e/frame 0xfffffe011f269800
dmu_object_alloc_impl() at dmu_object_alloc_impl+0x23c/frame 0xfffffe011f2698b0
dmu_object_alloc_dnsize() at dmu_object_alloc_dnsize+0x1c/frame 0xfffffe011f2698e0
zfs_mknode() at zfs_mknode+0x1d3/frame 0xfffffe011f269a20
zfs_create() at zfs_create+0x389/frame 0xfffffe011f269ac0
zfs_freebsd_create() at zfs_freebsd_create+0xee/frame 0xfffffe011f269b10
VOP_CREATE_APV() at VOP_CREATE_APV+0x24/frame 0xfffffe011f269b30
uipc_bindat() at uipc_bindat+0x336/frame 0xfffffe011f269d60
sobind() at sobind+0x33/frame 0xfffffe011f269d80
kern_bindat() at kern_bindat+0xc4/frame 0xfffffe011f269dc0
sys_bind() at sys_bind+0x75/frame 0xfffffe011f269e00
amd64_syscall() at amd64_syscall+0x10c/frame 0xfffffe011f269f30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe011f269f30
--- syscall (104, FreeBSD ELF64, sys_bind), rip = 0x804bfe19a, rsp = 0x7fffffffe3d8, rbp = 0x7fffffffe490 ---
KDB: enter: panic

And according to this, it seems that smb causes a kernel panic, and every time it crashes it is always the smbd process that appears there, and from there I started to focus on the smb.
I checked all its settings (through the GUI) and everything looks fine there
I even recreated all the shares in smb (because most of them are still from FreeNAS version 10 so I thought maybe that's the problem)

Nothing fixed the problem.
I don't know what to do anymore. Everything over five years was stable and working great, I don't know what suddenly happened.
Impossible to use a file server like that.

I also mention that I have jails on the TrueNAS but they don't seem to be causing this problem.
Plex
NextCloud
qbit
nginx proxy

Specs:
CPU i3 540
RAM: 16GB
SSD Boot 120GB
3X WD Red (CMR)

I hope you can help me because I alone don't know where to look to find a solution to this.

jgreco · Oct 11, 2022

Itay1778 said:
It's not a RAM problem

Meaningless and potentially incorrect speculation.

Itay1778 said:
I did a memtest86 6 pass tests and they all passed without a single error.

Helpful, but still not really helpful.

Here's the problem. There's been debate over ECC on FreeNAS for at least a decade. One of the architectural issues with ZFS is that, due to the massive potential size of a pool, there is no tool like "fsck" or "chkdsk" that can be used to look for and correct metadata errors in the pool. ZFS is 100% absolutely reliant on the pool being in a healthy and consistent state. Once an error is introduced into the pool, there is no reliable mechanism to correct that error... it's been written to disk for everyone to stumble over in the future.

Some of us understand this to mean that if bits go bad in memory, which can happen because ZFS likes to cache lots of data in ARC, and then gets flushed out to pool, the pool is then in an inconsistent state. Other people try to handwave that obvious conclusion off as unlikely, etc. Some of us rely on ECC to help ensure that memory is much less likely to be successfully corrupted by the stray cosmic ray. For you, you really do not know that it was not a RAM problem a few months back when this started, a cosmic ray choosing to bitflip something in your memory.

So there's good, bad, and ugly here.

The ugly is that I *suspect* your "supervisor read data, page not present" is an attempt to read something that is not there; this error usually presents when there are memory problems in a system. But I believe we've also seen it when ZFS is trying to access a page it expects to be there which isn't there. This would tend to suggest a corrupt pool. No amount of memory testing TODAY will detect a corrupt pool that was corrupted LAST MONTH by a cosmic ray.

The bad is that this probably means you need to evacuate the pool (take all the data off the pool before you manage to lose more) which may be difficult or inconvenient. I am sorry.

The good is that this offers an opportunity for you to replace this system; I suggest trying to find an older Xeon mainboard of the same generation, such as a Supermicro X9SCM, with ECC memory. These are often findable on the used market for cheap. But also be aware that the expected lifespan of a system is in the range of seven or maybe up to ten years, and your Clarkdale chip is well beyond EOL or even end of support from Intel. Even with an X9SCM and Intel Xeon with ECC, failures of these systems will increase with age; the main difference is that they're much more easily detectable and much less likely to result in pool corruption.

You also have the option to go with "Report a Bug" above to see if you can get a developer's opinion. I am not an expert kernel debugger; most of my work tends towards replacing potentially faulty hardware if such is suspected. I am basing my opinion on three things I see here, which are the page not present error, which I read as an effort to reference a page by address which is not currently mapped, the call to zfs_create which should be trying to create a new directory entry, the call to zfs_mknode which looks like an attempt to create a new object in the DMU, and I don't care to go reading the code this morning to try to decipher what's going on. I leave that to the developers. But this feels in-line with the idea that there's corrupt metadata of some sort in the pool. Developers might be interested in debugging this at a more serious level than I am.

freqlabs · Oct 11, 2022

This is a NULL pointer access in the kernel, which is a bug. Please do "Report a Bug" so we can have a closer look at the issue.

Itay1778 · Oct 11, 2022

freqlabs said:
This is a NULL pointer access in the kernel, which is a bug. Please do "Report a Bug" so we can have a closer look at the issue.

I created, as I wrote there, I have no idea exactly what specific information you need to investigate this in-depth and solve it because it is obviously a serious problem for me and I want to help solve it as quickly as possible.

Jira

ixsystems.atlassian.net

Itay1778 · Oct 11, 2022

jgreco said:
Meaningless and potentially incorrect speculation.

Helpful, but still not really helpful.

Here's the problem. There's been debate over ECC on FreeNAS for at least a decade. One of the architectural issues with ZFS is that, due to the massive potential size of a pool, there is no tool like "fsck" or "chkdsk" that can be used to look for and correct metadata errors in the pool. ZFS is 100% absolutely reliant on the pool being in a healthy and consistent state. Once an error is introduced into the pool, there is no reliable mechanism to correct that error... it's been written to disk for everyone to stumble over in the future.

Some of us understand this to mean that if bits go bad in memory, which can happen because ZFS likes to cache lots of data in ARC, and then gets flushed out to pool, the pool is then in an inconsistent state. Other people try to handwave that obvious conclusion off as unlikely, etc. Some of us rely on ECC to help ensure that memory is much less likely to be successfully corrupted by the stray cosmic ray. For you, you really do not know that it was not a RAM problem a few months back when this started, a cosmic ray choosing to bitflip something in your memory.

So there's good, bad, and ugly here.

The ugly is that I *suspect* your "supervisor read data, page not present" is an attempt to read something that is not there; this error usually presents when there are memory problems in a system. But I believe we've also seen it when ZFS is trying to access a page it expects to be there which isn't there. This would tend to suggest a corrupt pool. No amount of memory testing TODAY will detect a corrupt pool that was corrupted LAST MONTH by a cosmic ray.

The bad is that this probably means you need to evacuate the pool (take all the data off the pool before you manage to lose more) which may be difficult or inconvenient. I am sorry.

The good is that this offers an opportunity for you to replace this system; I suggest trying to find an older Xeon mainboard of the same generation, such as a Supermicro X9SCM, with ECC memory. These are often findable on the used market for cheap. But also be aware that the expected lifespan of a system is in the range of seven or maybe up to ten years, and your Clarkdale chip is well beyond EOL or even end of support from Intel. Even with an X9SCM and Intel Xeon with ECC, failures of these systems will increase with age; the main difference is that they're much more easily detectable and much less likely to result in pool corruption.

You also have the option to go with "Report a Bug" above to see if you can get a developer's opinion. I am not an expert kernel debugger; most of my work tends towards replacing potentially faulty hardware if such is suspected. I am basing my opinion on three things I see here, which are the page not present error, which I read as an effort to reference a page by address which is not currently mapped, the call to zfs_create which should be trying to create a new directory entry, the call to zfs_mknode which looks like an attempt to create a new object in the DMU, and I don't care to go reading the code this morning to try to decipher what's going on. I leave that to the developers. But this feels in-line with the idea that there's corrupt metadata of some sort in the pool. Developers might be interested in debugging this at a more serious level than I am.

Thanks for the long and detailed answer. I'm now looking to upgrade the server to something newer (around 2012-2014) and even though the prices are low it's still expensive for me (I don't currently have much of a budget) in addition to that I currently have nowhere to move all the data, I have backups but everything is backed up in the cloud and downloaded It will all take months, so I really hope it's a bug and it can be solved this way. But I'm still looking for a new server and a way to get the data out to recreate the pool, although I don't know if that will help solve it because when it crashes there's nothing that accesses the files or anything like that...
But if it comes to that I'll buy a single 12TB drive and just temporarily move there to get it all out but I'm trying to avoid that right now.

jgreco · Oct 11, 2022

Itay1778 said:
so I really hope it's a bug

It's unlikely to be a bug in the sense that you're hoping for. Most of this class of "bug" turn out to be null pointer dereferences, which are generally an indicator that something is seriously wrong.

It is possible to check a pointer for null prior to dereferencing, which avoids the panic if it's handled somehow, but this is not a "fix" in that the condition that caused it to happen likely remains. Such fixes may in fact cause second order problems. In this case, if my analysis above was correct, and ZFS is trying to add a directory entry, a second order problem might be that the coder chooses to return an error up the stack, and that the attempted direntry addition doesn't get added to the directory. This might come as a shock to a userland program whose design carefully makes sure all the preconditions for a new file creation are correct for success, and regardless gets a failure. This can then cause further problems.

The upside in your case could be that if this is really a reproducible fault while writing, then there's actually a good chance that the pool is (mostly?) salvageable in a read-only mode. It could be that the current directory is damaged, and retrieval of some contents might be problematic. I'm not familiar with all the possibilities there. ZFS is rather complicated and failure modes are unpredictable.

It's also possible that someone familiar with the internal structures can zdb around and spot the particular problem you're running into. This might suggest some potential fix for your existing pool.

Jailer · Oct 11, 2022

jgreco said:
It's also possible that someone familiar with the internal structures can zdb around and spot the particular problem you're running into.

Where's @allanjude when you need him.

jgreco · Oct 11, 2022

Jailer said:
Where's @allanjude when you need him.

Probably off making misleading statements about how ZFS is wonderful without ECC memory. Which still doesn't pass the sniff test, and puts me in the awful position of having to agree with Linus Torvalds. (eyeroll)

Jailer · Oct 11, 2022

How could any ZFS "guru" advocate for non ECC memory with ZFS when ZFS is all about data integrity?

ETA: In my completely unqualified opinion I think anyone who uses ZFS without ECC memory is playing with fire.

jgreco · Oct 11, 2022

Jailer said:
How could any ZFS "guru" advocate for non ECC memory with ZFS when ZFS is all about data integrity?

ETA: In my completely unqualified opinion I think anyone who uses ZFS without ECC memory is playing with fire.

As I've previously explained elsewhere, I think it is more nuanced.

The claim I've often seen credited to Allan Jude is something to the effect that ZFS is no more susceptible to data corruption due to non-ECC than other filesystems. The original quote is a weasel-worded comment that contains a loophole big enough to drive a petabyte through. And it's this:

When your Windows NTFS filesystem gets a corruption, you run chkdsk on it, or maybe at worst you get another HDD out and copy everything you can off it onto the new disk. In a few weird cases you might not be able to recover.

When your ZFS filesystem gets a corruption committed, it is very likely going to be there for a long time. There is no fsck or chkdsk, and if the pool is damaged, you might have to copy everything off it and then back on. This would be extremely inconvenient if you had a petabyte (or more) of data on the ZFS pool.

Therefore I see it as very important to try to maintain the integrity of the ZFS pool, since there "ain't no fixin' it if it breaks". To me, this makes ECC a factor that improves the system's memory reliability by, oh, I dunno, maybe an order of magnitude. Is it perfect? Hell no. But it's much better.

HoneyBadger · Oct 11, 2022

jgreco said:
The claim I've often seen credited to Allan Jude is something to the effect that ZFS is no more susceptible to data corruption due to non-ECC than other filesystems.

If it's the one I'm thinking of, the quote should be credited to Matt Ahrens:

Ars walkthrough: Using the ZFS next-gen filesystem on Linux

If brtfs interested you, start your next-gen trip with a step-by-step guide to ZFS. Read the whole story

arstechnica.com

There's nothing special about ZFS that requires/encourages the use of ECC RAM more so than any other filesystem. If you use UFS, EXT, NTFS, btrfs, etc without ECC RAM, you are just as much at risk as if you used ZFS without ECC RAM. Actually, ZFS can mitigate this risk to some degree if you enable the unsupported ZFS_DEBUG_MODIFY flag (zfs_flags=0x10). This will checksum the data while at rest in memory, and verify it before writing to disk, thus reducing the window of vulnerability from a memory error.

I would simply say: if you love your data, use ECC RAM. Additionally, use a filesystem that checksums your data, such as ZFS.

jgreco · Oct 11, 2022

HoneyBadger said:
If it's the one I'm thinking of, the quote should be credited to Matt Ahrens:

Ok, that sounds correct. I stand corrected.

Itay1778 · Oct 17, 2022

jgreco said:
The good is that this offers an opportunity for you to replace this system;

Hi, so I decided to listen to you because it sounds like you know more than me.
I would love for you to tell me what you think about what I chose, everything was done on a very small budget.

Motherboard: https://www.supermicro.com/products/archive/motherboard/x9dri-f
CPU: Intel Xeon E5-2650 v2 X2
RAM: 32GB ECC RDIMM DDR3
Case: https://www.newegg.com/global/il-en/rosewill-rsv-l4500u-black/p/N82E16811147328
Cooling for the CPUs: https://www.coolermaster.com/catalog/coolers/cpu-air-coolers/hyper-h412r/
PSU: I haven't decided yet, but probably something high quality and cheap (from be quiet! or Corsair)

mav@ · Feb 2, 2023

Just as followup, this seems to be a case of corrupted pool, improperly handled by ZFS code. https://github.com/openzfs/zfs/pull/14454 should hopefully fix the panic, but the pool may still need to be recreated, since there can be other artifacts.

Important Announcement for the TrueNAS Community.

SOS: kernel panic Help!!

Itay1778

Patron

jgreco

Resident Grinch

freqlabs

iXsystems

Itay1778

Patron

Jira

Itay1778

Patron

jgreco

Resident Grinch

Jailer

Not strong, but bad

jgreco

Resident Grinch

Jailer

Not strong, but bad

jgreco

Resident Grinch

HoneyBadger

actually does care

Ars walkthrough: Using the ZFS next-gen filesystem on Linux

jgreco

Resident Grinch

Itay1778

Patron

mav@

iXsystems

Similar threads

Important Announcement for the TrueNAS Community.

SOS: kernel panic Help!!

Patron

Resident Grinch

iXsystems

Patron

Patron

Resident Grinch

Not strong, but bad

Resident Grinch

Not strong, but bad

Resident Grinch

actually does care

Resident Grinch

Patron

iXsystems

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "SOS: kernel panic Help!!"

Similar threads