Server crashes every hour

Ger

Dabbler
Joined
Mar 15, 2019
Messages
15
Hello everyone,
I have a problem with a TrueNAS-12.0-U5.1 installation: the server crashes every hour, as soon as the scheduled dataset snapshot tasks start. This happens without any error message, and the machine immediately reboots. The server has been running for at least 4 years; I have always applied the system updates and have never had any problems before.

This is the configuration:
HP ProLiant ML10 Gen9
Intel(R) Xeon(R) CPU E3-1225 v5 @ 3.30GHz
2x 8 GB ECC RAM
1x 240 GB SSD for boot
2x 1 TB Seagate hard drives

This is the message at the end of the msgbuf.txt file in /data/crash:
Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 04
fault virtual address = 0x48
fault code            = supervisor read data, page not present
instruction pointer   = 0x20:0xffffffff82be5dd1
stack pointer         = 0x28:0xfffffe00e0ec21b0
frame pointer         = 0x28:0xfffffe00e0ec2260
code segment          = base 0x0, limit 0xfffff, type 0x1b
                      = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags      = interrupt enabled, resume, IOPL = 0
current process       = 18 (txg_thread_enter)
trap number           = 12
panic: page fault
cpuid = 2
time = 1631681230
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00e0ec1e70
vpanic() at vpanic+0x17b/frame 0xfffffe00e0ec1ec0
panic() at panic+0x43/frame 0xfffffe00e0ec1f20
trap_fatal() at trap_fatal+0x391/frame 0xfffffe00e0ec1f80
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00e0ec1fd0
trap() at trap+0x286/frame 0xfffffe00e0ec20e0
calltrap() at calltrap+0x8/frame 0xfffffe00e0ec20e0
--- trap 0xc, rip = 0xffffffff82be5dd1, rsp = 0xfffffe00e0ec21b0, rbp = 0xfffffe00e0ec2260 ---
dsl_deadlist_remove_key() at dsl_deadlist_remove_key+0x81/frame 0xfffffe00e0ec2260
dsl_destroy_snapshot_sync_impl() at dsl_destroy_snapshot_sync_impl+0x7b6/frame 0xfffffe00e0ec2330
dsl_destroy_snapshot_sync() at dsl_destroy_snapshot_sync+0x4e/frame 0xfffffe00e0ec2370
zcp_synctask_destroy() at zcp_synctask_destroy+0xb0/frame 0xfffffe00e0ec23b0
zcp_synctask_wrapper() at zcp_synctask_wrapper+0xee/frame 0xfffffe00e0ec2400
luaD_precall() at luaD_precall+0x268/frame 0xfffffe00e0ec24d0
luaV_execute() at luaV_execute+0xf5e/frame 0xfffffe00e0ec2550
luaD_call() at luaD_call+0x1b1/frame 0xfffffe00e0ec2590
luaD_rawrunprotected() at luaD_rawrunprotected+0x53/frame 0xfffffe00e0ec2630
luaD_pcall() at luaD_pcall+0x37/frame 0xfffffe00e0ec2680
lua_pcallk() at lua_pcallk+0xa7/frame 0xfffffe00e0ec26c0
zcp_eval_impl() at zcp_eval_impl+0xbc/frame 0xfffffe00e0ec26f0
dsl_sync_task_sync() at dsl_sync_task_sync+0xb4/frame 0xfffffe00e0ec2720
dsl_pool_sync() at dsl_pool_sync+0x44b/frame 0xfffffe00e0ec27a0
spa_sync() at spa_sync+0xa50/frame 0xfffffe00e0ec29e0
txg_sync_thread() at txg_sync_thread+0x413/frame 0xfffffe00e0ec2ab0
fork_exit() at fork_exit+0x7e/frame 0xfffffe00e0ec2af0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00e0ec2af0

Can anyone help me understand where the problem lies?
Thanks
 

c77dk

Patron
Joined
Nov 27, 2019
Messages
468
Have you tested RAM/CPU since this issue started? That would be first on my list if a server started behaving like this.
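If you can't take it offline right away for a full MemTest86 pass, a userland test can at least catch gross RAM faults while the box stays up. A rough sketch, assuming memtester is installed and the machine has memory to spare (size and pass count are just examples):

memtester 4G 3                  # lock and test 4 GB of RAM for 3 passes
grep -i mca /var/log/messages   # also check the logs for machine-check/ECC events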
 

Ger

Dabbler
Joined
Mar 15, 2019
Messages
15
Thank you for your answer.
Unfortunately the server is in production at my client's site, and apart from this problem it is fully functional.
I would have to take it to the lab over a weekend to be able to test the CPU/RAM.
The strange thing is that before this problem another one occurred: the boot disk had a damaged EFI partition, which I repaired with gpart (roughly as sketched below).
After mounting the disk again and updating to the latest version of TrueNAS, I started having trouble with snapshots.
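For reference, the inspect/repair steps were something like this; ada0 is just a placeholder for the boot device, so check the gpart show output first:

gpart show ada0      # inspect the partition table for CORRUPT markers
gpart recover ada0   # rebuild damaged GPT metadata from the surviving copy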
Thanks for your help.
 
Joined
Oct 5, 2021
Messages
5
Here's an interesting one. I just had something similar happen twice in one day on two completely separate servers. When the first one started doing it, I made the DR node live, and now the DR node is doing it too. These servers replicate data between one another, but I'm assuming it's not so much about the data as about the fact that I recently created squeaky-clean new pools with encryption enabled. Does anyone have ideas as to what kind of issue we're looking at?
 
Joined
Oct 5, 2021
Messages
5
(Both servers are running 12.0-U5.1, FWIW.)

My stack trace references the following functions: vpanic, spl_panic, avl_add, dsl_livelist_iterate, bpobj_iterate_blkptrs, bpobj_iterate_impl, dsl_process_sub_livelist, spa_livelist_condense_cb, zthr_procedures
 
Joined
Oct 5, 2021
Messages
5
OK, an update for what it's worth:

I tried upgrading to 12.0-U6: no dice.
I reinstalled TrueNAS and imported the pool. Everything was fine until I attempted to unlock it; it kernel-panicked and rebooted almost immediately.
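In case it helps anyone else in this spot: a read-only import is a way to get at the data without triggering whatever sync task is panicking. A sketch, with tank as a placeholder for the real pool name:

zpool import -o readonly=on -f tank   # read-only: no destroys/syncs get replayed; -f may be needed after a reinstall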
 

Ger

Dabbler
Joined
Mar 15, 2019
Messages
15
Hello,
a few days ago I discovered that the problem occurs when, every hour, ZFS takes a snapshot of one particular dataset within the pool configured on the server.

I found this out through numerous tests: it only happens with that single dataset.

With the other configured snapshot tasks it doesn't.
It started doing this after I had to reinstall FreeNAS due to a boot disk problem and subsequently upgraded to TrueNAS.
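Given the dsl_destroy_snapshot_sync frames in the backtrace above, it looks like the crash fires when the task prunes an expired snapshot of that dataset. A manual test along these lines reproduces it (pool/dataset names are placeholders):

zfs list -t snapshot -r tank/suspect   # see which snapshots the task would prune
zfs destroy tank/suspect@old           # destroying one by hand triggers the same panic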
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
I would back up the dataset, then delete it and start again, along the lines sketched below.
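A minimal sketch of that, assuming you have a second pool (or another box) to receive into; all names are placeholders:

zfs snapshot -r tank/suspect@migrate                          # snapshot the dataset
zfs send -R tank/suspect@migrate | zfs recv backup/suspect    # copy it off
zfs destroy -r tank/suspect                                   # after verifying the copy, drop the bad dataset
zfs create tank/suspect                                       # recreate it and restore the data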
 
Joined
Oct 5, 2021
Messages
5
Thanks, but unfortunately we hit the same bug on the backup system after a few hours, and then we had two busted systems with completely dissimilar hardware. I also discovered that vanilla FreeBSD 13 had the same problem. I was finally able to mount the dataset in read-only mode using Ubuntu to recover the data. In the meantime, I also discovered that the production TrueNAS, which would let me boot into it for a few minutes before crashing (unlike the backup system, which crashed every time during boot), quit crashing after I disabled dedupe on the pool. I'm assuming there's some sort of upstream bug in FreeBSD's ZFS implementation that only manifests when dedupe is enabled.
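For reference, disabling dedupe was a single property change (pool name is a placeholder). Note that it only affects new writes; blocks already in the dedup table stay deduplicated:

zfs set dedup=off tank      # stop deduplicating new writes
zpool get dedupratio tank   # see how much existing data is still deduped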
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Unless you have north of 128 GB of RAM, you shouldn't ever enable dedupe.
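If you want to see what the dedup table is actually costing before (or after) taking that advice, zdb can dump its statistics; tank is a placeholder:

zdb -DD tank   # print dedup table (DDT) statistics and histogram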
 