TrueNAS SCALE 22.12.3.2 Restarting by itself every night

TempleHasFallen

Dabbler
Joined
Jan 27, 2022
Messages
34
So this has been driving me absolutely crazy.

My TrueNAS Scale 22.12.3.2 box has been restarting itself every night around 3 AM without any indication as to why it is happening.
It's not flagged as an "unscheduled system reboot" either; I don't get that notification once it starts back up.



The syslog is not very helpful: the last entries are an email sent for a cron job at 3:00 AM, and then syslog-ng starting back up at 3:07 AM:

Code:
Jul 16 03:00:03 freenas middlewared[7951]: sending mail to REDACTED
Content-Type: multipart/mixed
MIME-Version: 1.0
Subject: REDACTED: CronTask Run
From: FreeNAS.superhome <freenas.superhome@biochip.pw>
To: REDACTED
Date: Sun, 16 Jul 2023 00:00:01 -0000
Message-ID: <REDACTED>
Jul 16 03:07:09 freenas syslog-ng[22760]: syslog-ng starting up; version='3.28.1'
Jul 16 03:04:13 freenas kernel: microcode: microcode updated early to revision 0x49, date = 2021-08-11
Jul 16 03:04:13 freenas kernel: Linux version 5.15.107+truenas (root@tnsbuilds01.tn.ixsystems.net) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP Tue Jul 4 16:27:21 UTC 2023
Jul 16 03:04:13 freenas kernel: Command line: BOOT_IMAGE=/ROOT/22.12.3.2@/boot/vmlinuz-5.15.107+truenas root=ZFS=boot-pool/ROOT/22.12.3.2 ro libata.allow_tpm=1 amd_iommu=on iommu=pt kvm_amd.npt=1 kvm_amd.avic=1 intel_iommu=on zfsforce=1 nvme_core.multipath=N
Jul 16 03:04:13 freenas kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Jul 16 03:04:13 freenas kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Jul 16 03:04:13 freenas kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Jul 16 03:04:13 freenas kernel: x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256

At 3 AM the following are scheduled to run:
2 Cron Jobs:
- DDNS update (runs every 10 minutes)
- A custom bash script that no longer exists (the job emails a "script not found" error; I'm keeping it as a reminder to replace the script)

2 Snapshot Tasks:
- A recursive snapshot task on an NVMe pool
- A simple (non-recursive) snapshot task on an HDD pool

2 Replication Tasks:
- NVMe to HDD
- NVMe to remote

Curiously enough, the NVMe-to-HDD task reports finished (no logs available), while the NVMe-to-remote task reports an error (that one should take a while).
I'm running around 15 KVM VMs, ~20 apps, iSCSI shares, SMB shares, and NFS shares, so restarting every night is catastrophic for many reasons.

For context, all of this was configured exactly the same under 22.12.2 and caused no issues. The problems started one day after installing 22.12.3.2.

How can I troubleshoot and find out why the server is restarting?
 

samarium

Contributor
Joined
Apr 8, 2023
Messages
192
You could try booting into the old boot environment, and verifying it is stable there for a while?
Video camera/webcam aimed at the console, to catch any system messages?
Regular script running ps to show which processes are running? (Rough sketch at the end of this post.)
Are any of the VMs doing anything that might be correlated, or any of the apps? With 15 + 20 of them, possible memory/CPU issues?
Something to track memory usage? Are the reporting graphs any use? Or do you need to find something like vmstat or iostat or some other stat to generate a time series.
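
Something along these lines as a rough sketch; the log path and pool name are just placeholders, adjust to your layout and run it from cron every minute:

Code:
#!/bin/bash
# Rough sketch: append a timestamped snapshot of the busiest processes plus a
# vmstat sample, so there is a trail right up to the moment of the reboot.
# Example cron entry:  * * * * * /root/log_state.sh
LOG="/mnt/tank/debug/pre-reboot-$(date +%F).log"
{
  echo "===== $(date -Is) ====="
  ps axo pid,ppid,stat,%cpu,%mem,etime,comm --sort=-%cpu | head -n 20
  vmstat 1 2 | tail -n 1   # second sample reflects current activity
} >> "$LOG" 2>&1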
 

TempleHasFallen

Dabbler
Joined
Jan 27, 2022
Messages
34
You could try booting into the old boot environment, and verifying it is stable there for a while?
I'm trying to avoid running the old environment at all costs due to the issues I had last time with apps disappearing and it taking 3 days to be able to restore them after doing that.

Video camera/webcam aimed at the console, to catch any system messages?
I'm most likely gonna be doing a screenrecord of the console tonight to see what is happening.

Regular script running ps to show which processes are running?
I'll set something up regarding that.

Are any of the VMs doing anything that might be correlated, or any of the apps? With 15 + 20 of them, possible memory/CPU issues?

Nothing happening really that should be correlated, also I don't have any memory or CPU issues (2x E5-2680v3 (12C/24T) and 256GB DDR4-ECC RAM)

Something to track memory usage? Are the reporting graphs any use? Or do you need to find something like vmstat or iostat or some other stat to generate a time series.

Memory and CPU usage is in no way abnormal; the breaks in the memory graph are the restarts at 3 AM.
[Attachments: memory and CPU reporting graphs; the gaps line up with the 3 AM restarts]
 

samarium

Contributor
Joined
Apr 8, 2023
Messages
192
CPU/mem don't seem to be in contention. Check the ARC too.
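
If you want to spot-check it from the shell, something like this should do (assuming arc_summary is on the PATH, as it normally is with OpenZFS on SCALE):

Code:
# Summary view
arc_summary | head -n 30
# Or the raw counters: current size, target max, and memory throttle events
grep -E '^(size|c_max|memory_throttle_count)' /proc/spl/kstat/zfs/arcstats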

If it is an event on the system, maybe it is happening too fast to get caught in the reporting? But I guess we need some idea of where to focus attention.

If it is something that gets logged in the system log, then TNS seems to be configured to disable looking at the previous boot logs, ie journalctl --list-boots, but /var/log does have some data, as you have found.

Any kernel events logged in kern.log? If it is a kernel crash then probably not, and neither of us is probably up to dealing with a kernel crash dump.

Maybe you could use the netconsole kernel module to log events, maybe even an oops, and they would have a chance of getting out before the system goes down, if it is crashing.
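
A rough netconsole sketch, if you want to try it; the interface name, IPs and the receiver's MAC below are placeholders for your network, and it assumes the module is built for the TN kernel:

Code:
# Stream kernel messages over UDP to another machine, so an oops has a chance
# of escaping before the box resets. Parameter format is:
#   netconsole=<src-port>@<src-ip>/<dev>,<dst-port>@<dst-ip>/<dst-mac>
modprobe netconsole netconsole=6665@192.168.1.10/eno1,6666@192.168.1.20/aa:bb:cc:dd:ee:ff
# Make sure warnings/oopses are verbose enough to reach the console
dmesg -n 8
# On the receiving machine, just capture the UDP stream:
#   nc -ulk 6666 | tee netconsole.log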

Power glitch, some motor on the circuit kicking in around 03:00? That might be more difficult to catch. Are you using a UPS?
 
Joined
Jun 2, 2019
Messages
591
I had a similar issue on 1 of 4 TrueNAS machines. The system would randomly reboot. It turned out to be a bad memory module, which I was able to isolate with https://www.memtest86.com/
 

TempleHasFallen

Dabbler
Joined
Jan 27, 2022
Messages
34
CPU/mem don't seem to be in contention. Check the ARC too.

If it is an event on the system, maybe it is happening too fast to get caught in the reporting? But I guess we need some idea of where to focus attention.

If it is something that gets logged in the system log, then TNS seems to be configured to disable looking at the previous boot logs, ie journalctl --list-boots, but /var/log does have some data, as you have found.

Any kernel events logged in kern.log? If it is a kernel crash then probably not, and neither of us is probably up to dealing with a kernel crash dump.

Maybe you could use the netconsole kernel module to log events, maybe even an oops, and they would have a chance of getting out before the system goes down, if it is crashing.

Power glitch, some motor on the circuit kicking in around 03:00? That might be more difficult to catch. Are you using a UPS?

ARC is OK.

I'm using an On-Line UPS for the system.

I recorded the console overnight to see what is happening, and it was basically as if a reset button were pressed: no errors or any kind of shutdown, it just goes straight into initialization.

However, a second TrueNAS machine that was temporarily on the same UPS (it was only added one day before the issue started) also restarted around the same time (it came back online at 3:02:25). So it seems to be power related.

I can see a tiny fluctuation in the input voltage of the UPS at that time, but the output voltage stays the same. I've run a UPS self-test.

[Attachment: UPS voltage reporting graph]


I'm going to remove the second TrueNAS box from the UPS for now and see if the issue persists.
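
To keep an eye on the power side, I'll also log the UPS readings overnight with something like this rough sketch (it assumes the NUT service is running and the UPS is registered under the name "ups"; the log path is a placeholder):

Code:
#!/bin/bash
# Append a timestamped UPS status/voltage reading every 30 seconds so any
# power event around 03:00 shows up in the log.
while true; do
  echo "$(date -Is) status=$(upsc ups@localhost ups.status 2>&1)" \
       "in=$(upsc ups@localhost input.voltage 2>&1)" \
       "out=$(upsc ups@localhost output.voltage 2>&1)" >> /mnt/tank/debug/ups-readings.log
  sleep 30
done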


I had a similar issue on 1 of 4 TrueNAS machines. The system would randomly reboot. It turned out to be a bad memory module, which I was able to isolate with https://www.memtest86.com/

I'm not sure if this applies given that it's ECC RAM, but I'll check either way.
 

ikarlo

Dabbler
Joined
Apr 21, 2021
Messages
18
If the replicated datasets are encrypted and sent to another encrypted destination, it is probably due to the ZFS panic on receive while it tries to delete old snapshots.

The following posts can probably help:

I hope it gets resolved soon
 

TempleHasFallen

Dabbler
Joined
Jan 27, 2022
Messages
34
If the replicated datasets are encrypted and sent to another encrypted destination, it is probably due to the ZFS panic on receive while it tries to delete old snapshots.

The following posts can probably help:

I hope it gets resolved soon
OK, this is definitely happening to my backup system: whenever I try to restart the replication, as I did twice just today, the system reboots.
I recently brought the off-site backup system on-site to redo the replications from scratch, since there was a massive volume of data to be transferred. I had already noticed issues with that system (I had to manually destroy the dataset, which took way longer than it should).

However, should this also cause my sending system to panic? I'm looking into it; this seems to be the issue.

My target box is running TrueNAS Core 13.0-U5.2 and it's causing this.
 

TempleHasFallen

Dabbler
Joined
Jan 27, 2022
Messages
34
https://github.com/openzfs/zfs/issues/13445 seems to include some workarounds too, if that is indeed the issue

Are both machines that crashed receiving raw encrypted snapshots?

Might be worth shifting the schedule by +30 min to see if the crash/reboot moves +30 min as well?
I just double-checked: the main machine also receives raw encrypted snapshots, as the NVMe-to-HDD replication also crashes the system when run manually.

So this is definitely the issue. Now I just have to figure out the workaround.
 

TempleHasFallen

Dabbler
Joined
Jan 27, 2022
Messages
34
Locally trying to destroy the dataset (the one created by the replication) on Core 13.0-U5.2 also causes a crash.
 

TempleHasFallen

Dabbler
Joined
Jan 27, 2022
Messages
34
Manually destroying a dataset also seems to cause a crash. When using a recursive zfs destroy, it crashes on the first dataset (while the snapshots get destroyed normally).

Does anyone have experience with, and can guide me on, how to directly apply this ZFS patch to either Core or SCALE? https://github.com/openzfs/zfs/pull/15039
 

samarium

Contributor
Joined
Apr 8, 2023
Messages
192
I don't know anyone who has built a TN kernel. I haven't built a kernel in 10+ years, and I'm not keen to go down the rabbit hole.

The first thing I would try is booting successively older Ubuntu desktop images from the last couple of years, trying to find a version of ZFS that doesn't crash when I remove a snapshot (rough sketch at the end of this post). The pool's feature flags might not match the features supported by the Ubuntu ZFS version, so the import might not work, or you might get the same crashes as with TN. Reading the PR might give you clues as to what might work.

If I had to build a kernel I would try it on Ubuntu or Debian, as TN has its own limitations and code base that I don't know enough about, and I wouldn't be trying to integrate the built kernel into a TN install.
  • A kernel build should be relatively straightforward, but there will be lots of learning along the way.
  • Ubuntu has an advantage in that they normally build ZFS modules anyway, so some of their testing kernels might be easy to use.
  • Debian has an advantage in that it uses DKMS, so integrating the patches into the DKMS ZFS might be an easier and smaller job.
I have no idea how the Ubuntu desktop ZFS will interact with your pools, or how well tested the patch is, so you might totally corrupt your pool anyway.
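
If you do try the Ubuntu live image route, the test itself would be roughly this; the pool and snapshot names are placeholders, and if the live ZFS doesn't support some of the pool's feature flags the import may only work read-only or not at all:

Code:
# From the live session: install the ZFS userland and check which version it ships
sudo apt install -y zfsutils-linux
zfs version
# Import the pool without mounting anything, try the operation that crashes TN,
# then export cleanly
sudo zpool import -f -N tank
sudo zfs destroy tank/backup/somedataset@oldsnap
sudo zpool export tank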
 

LarsR

Guru
Joined
Oct 23, 2020
Messages
719
iX didn't build the kernel. SCALE uses a Debian LTS release as the base OS and then builds a custom middleware and GUI on top.
 

samarium

Contributor
Joined
Apr 8, 2023
Messages
192
Code:
# df /
Filesystem               1K-blocks    Used Available Use% Mounted on
boot-pool/ROOT/22.12.3.2   7872128 2941056   4931072  38% /
# grep PRETTY /etc/os-release
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
# ls -l /usr/src
total 26
drwxr-xr-x 6 root root  9 Jul  5 03:27 linux-headers-5.15.107+truenas
drwxr-xr-x 8 root root 13 Jul  5 03:29 nvidia-current-515.65.01
drwxr-xr-x 6 root root 37 Jul  5 03:33 wireguard-1.0.20210219
# dkms status
nvidia-current, 515.65.01, 5.15.107+truenas, x86_64: installed
wireguard, 1.0.20210219: added
# find /lib/modules -name zfs.ko
/lib/modules/5.15.107+truenas/extra/zfs/zfs/zfs.ko

So judging from the /usr/src linux-headers, it seems iX modified or repackaged the kernel somewhat, so maybe they have some patches that aren't generally released yet. Just looking in /lib/modules/5.15.107+truenas/extra you can see extra non-ZFS modules too.

Since /usr/src and dkms are present, you could theoretically add a patched vanilla ZFS to /usr/src and build it with dkms, but that sounds like a good recipe for problems: if you use dkms install you will be wiping out the existing installed modules, which are also required for startup since boot-pool is ZFS, so you would also need to rebuild the initramfs.

I would be doing this with a Debian install on another disk, mostly on another computer, then either figure out how to get the modules onto the base desktop USB, or do a USB install and then use the new modules there. After booting that USB you could try cleaning out the bad snapshots. If you can clean them out and then don't send any more, you don't need to hack the TN kernel.
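
For the record, the rough shape of building a patched ZFS on that separate Debian box would be something like the sketch below. The version tag, the dependency list and the cherry-pick step are illustrative only; check them against the OpenZFS Debian build docs and the PR itself, and expect conflicts when applying a master-based PR to an older branch.

Code:
# Build dependencies (roughly; see the OpenZFS "Custom Packages" docs for Debian)
apt install -y build-essential autoconf automake libtool gawk dkms \
    linux-headers-$(uname -r) libblkid-dev uuid-dev libudev-dev libssl-dev \
    zlib1g-dev libaio-dev libattr1-dev libelf-dev python3 python3-dev libffi-dev

# Vanilla ZFS source plus the fix from PR 15039
git clone https://github.com/openzfs/zfs.git
cd zfs
git checkout zfs-2.1.12                  # pick a tag close to what your pools run
git fetch origin pull/15039/head:pr-15039
git cherry-pick master..pr-15039         # just the PR's commits; resolve conflicts by hand

# Build and install modules + userland on this Debian system only
sh autogen.sh && ./configure
make -s -j"$(nproc)"
make install && depmod -a && modprobe zfs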

Only when you have a working solution on non-TN would I even think about using DKMS on TN to build new modules, and then you would have to have a plan for installing, the initramfs, boot failure/crash, rolling back, maybe managing a separate boot environment to isolate the changes while testing, and also a plan for dealing with upgrades.

Hopefully iX will release a patch soon, but quickly releasing not-fully-tested patches would result in instability for all customers, not just those affected, so it would not be something I would be in a hurry to do.

Not sure how iX is handling this; maybe if more people open JIRA tickets they will know it is affecting more people?
 
Last edited:

TempleHasFallen

Dabbler
Joined
Jan 27, 2022
Messages
34
I also can't understand how this isn't affecting more people; not being able to replicate encrypted datasets should be affecting a huge percentage of users. Maybe they just haven't updated to a broken version yet.
Hopefully there's a patch very soon.
 

samarium

Contributor
Joined
Apr 8, 2023
Messages
192
Seems like this has been an issue since at least OpenZFS 2.1.6.

But maybe most large enterprise customers are like me and avoid updating production too soon, waiting for those on the bleeding edge to bleed out first.
 

ikarlo

Dabbler
Joined
Apr 21, 2021
Messages
18
I also can't understand how this isn't affecting more people; not being able to replicate encrypted datasets should be affecting a huge percentage of users. Maybe they just haven't updated to a broken version yet.
Hopefully there's a patch very soon.

I agree with you; maybe few people use ZFS encryption today...
 

samarium

Contributor
Joined
Apr 8, 2023
Messages
192
If you proceed with building a patched ZFS, testing is probably better done inside some Linux VMs with a ZFS root, using small prepared pools that can be restored and retested.
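
For example, a throwaway file-backed pool is enough to reproduce the raw-receive-then-destroy sequence inside a VM (names are placeholders):

Code:
# Disposable pool backed by a sparse file
truncate -s 1G /tmp/vdev0.img
zpool create testpool /tmp/vdev0.img

# Encrypted source dataset, raw (-w) replicated copy, then the destroy that crashes
zfs create -o encryption=on -o keyformat=passphrase testpool/src
zfs snapshot -r testpool/src@snap1
zfs send -w testpool/src@snap1 | zfs receive testpool/copy
zfs destroy -r testpool/copy

# Reset between test runs
zpool destroy testpool && rm -f /tmp/vdev0.img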
 
Joined
Jul 17, 2023
Messages
8
Just out of curiosity, have you logged into the shell and run sudo journalctl | grep shutdown? Look for any halts or other things that would give an explanation as to why. You can also see the times by using last -x. This should identify a pattern of times, and then you can start looking for cron jobs, scripts, etc.
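
For example:

Code:
# Any clean shutdown / power-off / reboot messages in the journal?
sudo journalctl | grep -iE 'shutdown|powering off|reboot' | tail -n 20
# Reboot and shutdown history with timestamps (from wtmp)
last -x reboot shutdown | head -n 20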
 