TrueNAS SCALE 22.12.3.2 Restarting by itself every night

TempleHasFallen

Dabbler
Joined
Jan 27, 2022
Messages
34
So this has been driving me absolutely crazy.

My TrueNAS Scale 22.12.3.2 box has been restarting itself every night around 3 AM without any indication as to why it is happening.
It's not flagged as an "unscheduled system reboot" either; I don't get that notification once it starts back up.



The syslog is not very helpful: the last entries are an email sent for a cron job at 3:00 AM, and then syslog-ng starting back up at 3:07 AM:

Code:
Jul 16 03:00:03 freenas middlewared[7951]: sending mail to REDACTED
Content-Type: multipart/mixed
MIME-Version: 1.0
Subject: REDACTED: CronTask Run
From: FreeNAS.superhome <freenas.superhome@biochip.pw>
To: REDACTED
Date: Sun, 16 Jul 2023 00:00:01 -0000
Message-ID: <REDACTED>
Jul 16 03:07:09 freenas syslog-ng[22760]: syslog-ng starting up; version='3.28.1'
Jul 16 03:04:13 freenas kernel: microcode: microcode updated early to revision 0x49, date = 2021-08-11
Jul 16 03:04:13 freenas kernel: Linux version 5.15.107+truenas (root@tnsbuilds01.tn.ixsystems.net) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP Tue Jul 4 16:27:21 UTC 2023
Jul 16 03:04:13 freenas kernel: Command line: BOOT_IMAGE=/ROOT/22.12.3.2@/boot/vmlinuz-5.15.107+truenas root=ZFS=boot-pool/ROOT/22.12.3.2 ro libata.allow_tpm=1 amd_iommu=on iommu=pt kvm_amd.npt=1 kvm_amd.avic=1 intel_iommu=on zfsforce=1 nvme_core.multipath=N
Jul 16 03:04:13 freenas kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Jul 16 03:04:13 freenas kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Jul 16 03:04:13 freenas kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Jul 16 03:04:13 freenas kernel: x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256

At 3 AM the following are scheduled to run:
2 Cron Jobs:
- DDNS update (runs every 10 minutes)
- A custom bash script that no longer exists (the job emails a "script not found" error; I'm keeping it as a reminder to replace the script)

2 Snapshot Tasks:
- A recursive snapshot task on an NVMe pool
- A simple (non-recursive) snapshot task on an HDD pool

2 Replication Tasks:
- NVMe to HDD
- NVMe to remote

Curiously enough, the NVMe-to-HDD task reports finished (no logs available), while the NVMe-to-remote task reports an error (that one should take a while).
I'm running around 15 KVM VMs, ~20 apps, iSCSI shares, SMB shares, and NFS shares, so restarting every night is catastrophic for many reasons.

For context, all of this was configured exactly the same under 22.12.2 and caused no issues. The problems started one day after installing 22.12.3.2.

How can I troubleshoot and find out why the server is restarting?
 

samarium

Contributor
Joined
Apr 8, 2023
Messages
192
You could try booting into the old boot environment, and verifying it is stable there for a while?
Video camera/webcam aimed at the console, to catch any system messages?
Regular script running ps to show which processes are running? (Rough sketch at the end of this post.)
Are any of the VMs doing anything that might be correlated, or any of the apps? With 15 + 20 of them, possible memory/CPU issues?
Something to track memory usage? Are the reporting graphs any use? Or do you need to find something like vmstat or iostat or some other stat to generate a time series.
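
Something along these lines as a rough sketch; the log path and pool name are just placeholders, adjust to your layout and run it from cron every minute:

Code:
#!/bin/bash
# Rough sketch: append a timestamped snapshot of the busiest processes plus a
# vmstat sample, so there is a trail right up to the moment of the reboot.
# Example cron entry:  * * * * * /root/log_state.sh
LOG="/mnt/tank/debug/pre-reboot-$(date +%F).log"
{
  echo "===== $(date -Is) ====="
  ps axo pid,ppid,stat,%cpu,%mem,etime,comm --sort=-%cpu | head -n 20
  vmstat 1 2 | tail -n 1   # second sample reflects current activity
} >> "$LOG" 2>&1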
 

TempleHasFallen

Dabbler
Joined
Jan 27, 2022
Messages
34
You could try booting into the old boot environment, and verifying it is stable there for a while?
I'm trying to avoid running the old environment at all costs due to the issues I had last time with apps disappearing and it taking 3 days to be able to restore them after doing that.

Video camera/webcam aimed at the console, to catch any system messages?
I'm most likely gonna be doing a screenrecord of the console tonight to see what is happening.

Regular script running ps to show which processes are running?
I'll set something up regarding that.

Are any of the VMs doing anything that might be correlated, or any of the apps? With 15 + 20 of them, possible memory/CPU issues?

Nothing happening really that should be correlated, also I don't have any memory or CPU issues (2x E5-2680v3 (12C/24T) and 256GB DDR4-ECC RAM)

Something to track memory usage? Are the reporting graphs any use? Or do you need to find something like vmstat or iostat or some other stat to generate a time series.

Memory and CPU usage is in no way abnormal; the breaks in the memory graph are the restarts at 3 AM.
[Attachments: memory and CPU reporting graphs; the gaps line up with the 3 AM restarts]
 

samarium

Contributor
Joined
Apr 8, 2023
Messages
192
CPU/mem don't seem to be in contention. Check the ARC too.
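
If you want to spot-check it from the shell, something like this should do (assuming arc_summary is on the PATH, as it normally is with OpenZFS on SCALE):

Code:
# Summary view
arc_summary | head -n 30
# Or the raw counters: current size, target max, and memory throttle events
grep -E '^(size|c_max|memory_throttle_count)' /proc/spl/kstat/zfs/arcstats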

If it is an event on the system, maybe it is happening too fast to get caught in the reporting? But I guess we need some idea of where to focus attention.

If it is something that gets logged in the system log, then TNS seems to be configured to disable looking at the previous boot logs, ie journalctl --list-boots, but /var/log does have some data, as you have found.

Any kernel events logged in kern.log? If it is a kernel crash then probably not, and neither of us is probably up to dealing with a kernel crash dump.

Maybe you could use the netconsole kernel module to log events, maybe even an oops, and they would have a chance of getting out before the system goes down, if it is crashing.
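
A rough netconsole sketch, if you want to try it; the interface name, IPs and the receiver's MAC below are placeholders for your network, and it assumes the module is built for the TN kernel:

Code:
# Stream kernel messages over UDP to another machine, so an oops has a chance
# of escaping before the box resets. Parameter format is:
#   netconsole=<src-port>@<src-ip>/<dev>,<dst-port>@<dst-ip>/<dst-mac>
modprobe netconsole netconsole=6665@192.168.1.10/eno1,6666@192.168.1.20/aa:bb:cc:dd:ee:ff
# Make sure warnings/oopses are verbose enough to reach the console
dmesg -n 8
# On the receiving machine, just capture the UDP stream:
#   nc -ulk 6666 | tee netconsole.log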

Power glitch, some motor on the circuit kicking in around 03:00? That might be more difficult to catch. Are you using a UPS?
 
Joined
Jun 2, 2019
Messages
591
I had a similar issue on 1 of 4 TrueNAS machines. The system would randomly reboot. It turned out to be a bad memory module, which I was able to isolate with https://www.memtest86.com/
 

TempleHasFallen

Dabbler
Joined
Jan 27, 2022
Messages
34
CPU/mem don't seem to be in contention. Check the ARC too.

If it is an event on the system, maybe it is happening too fast to get caught in the reporting? But I guess we need some idea of where to focus attention.

If it is something that gets logged in the system log, then TNS seems to be configured to disable looking at the previous boot logs, ie journalctl --list-boots, but /var/log does have some data, as you have found.

Any kernel events logged in kern.log? If it is a kernel crash then probably not, and neither of us is probably up to dealing with a kernel crash dump.

Maybe you could use the netconsole kernel module to log events, maybe even an oops, and they would have a chance of getting out before the system goes down, if it is crashing.

Power glitch, some motor on the circuit kicking in around 03:00? That might be more difficult to catch. Are you using a UPS?

ARC is OK.

I'm using an On-Line UPS for the system.

I recorded the console overnight to see what is happening, and it was basically as if a reset button were pressed: no errors or any kind of shutdown, it just goes straight into initialization.

However, a second TrueNAS machine that was temporarily on the same UPS (it was only added one day before the issue started) also restarted around the same time (it came back online at 3:02:25). So it seems to be power related.

I can see a tiny fluctuation in the input voltage of the UPS at that time, but the output voltage stays the same. I've run a UPS self-test.

[Attachment: UPS voltage reporting graph]


I'm going to remove the second TrueNAS box from the UPS for now and see if the issue persists.
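
To keep an eye on the power side, I'll also log the UPS readings overnight with something like this rough sketch (it assumes the NUT service is running and the UPS is registered under the name "ups"; the log path is a placeholder):

Code:
#!/bin/bash
# Append a timestamped UPS status/voltage reading every 30 seconds so any
# power event around 03:00 shows up in the log.
while true; do
  echo "$(date -Is) status=$(upsc ups@localhost ups.status 2>&1)" \
       "in=$(upsc ups@localhost input.voltage 2>&1)" \
       "out=$(upsc ups@localhost output.voltage 2>&1)" >> /mnt/tank/debug/ups-readings.log
  sleep 30
done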


I had a similar issue on 1 of 4 TrueNAS machines. The system would randomly reboot. It turned out to be a bad memory module, which I was able to isolate with https://www.memtest86.com/

I'm not sure if this applies given that it's ECC RAM, but I'll check either way.
 

ikarlo

Dabbler
Joined
Apr 21, 2021
Messages
18
If the replicated datasets are encrypted and sent to another encrypted destination, it is probably due to the ZFS panic on receive while it tries to delete old snapshots.

The following posts can probably help:

I hope it gets resolved soon
 

TempleHasFallen

Dabbler
Joined
Jan 27, 2022
Messages
34
If the replicated datasets are encrypted and sent to another encrypted destination, it is probably due to the ZFS panic on receive while it tries to delete old snapshots.

The following posts can probably help:

I hope it gets resolved soon
OK, this is definitely happening to my backup system: whenever I try to restart the replication, as I did twice just today, the system reboots.
I recently brought the off-site backup system on-site to redo the replications from scratch, since there was a massive volume of data to be transferred. I had already noticed issues with that system (I had to manually destroy the dataset, which took way longer than it should).

However, should this also cause my sending system to panic? I'm looking into it; this seems to be the issue.

My target box is running TrueNAS Core 13.0-U5.2 and it's causing this.
 

TempleHasFallen

Dabbler
Joined
Jan 27, 2022
Messages
34
https://github.com/openzfs/zfs/issues/13445 seems to include some workarounds too, if that is indeed the issue

Are both machines that crashed receiving raw encrypted snapshots?

Might be worth shifting the schedule by +30 min to see if the crash/reboot moves +30 min as well?
I just double-checked: the main machine also receives raw encrypted snapshots, as the NVMe-to-HDD replication also crashes the system when run manually.

So this is definitely the issue. Now I just have to figure out the workaround.
 

TempleHasFallen

Dabbler
Joined
Jan 27, 2022
Messages
34
Locally trying to destroy the dataset (the one created by the replication) on Core 13.0-U5.2 also causes a crash.
 

TempleHasFallen

Dabbler
Joined
Jan 27, 2022
Messages
34
Manually destroying a dataset also seems to cause a crash. When using a recursive zfs destroy, it crashes on the first dataset (while the snapshots get destroyed normally).

Does anyone have experience with, and can guide me on, how to directly apply this ZFS patch to either Core or SCALE? https://github.com/openzfs/zfs/pull/15039
 

samarium

Contributor
Joined
Apr 8, 2023
Messages
192
I don't know anyone who has built a TN kernel. I haven't built a kernel in 10+ years, and I'm not keen to go down the rabbit hole.

The first thing I would try is booting successively older Ubuntu desktop images from the last couple of years, trying to find a version of ZFS that doesn't crash when I remove a snapshot (rough sketch at the end of this post). The pool's feature flags might not match the features supported by the Ubuntu ZFS version, so the import might not work, or you might get the same crashes as with TN. Reading the PR might give you clues as to what might work.

If I had to build a kernel I would try it on Ubuntu or Debian, as TN has its own limitations and code base that I don't know enough about, and I wouldn't be trying to integrate the built kernel into a TN install.
  • A kernel build should be relatively straightforward, but there will be lots of learning along the way.
  • Ubuntu has an advantage in that they normally build ZFS modules anyway, so some of their testing kernels might be easy to use.
  • Debian has an advantage in that it uses DKMS, so integrating the patches into the DKMS ZFS might be an easier and smaller job.
I have no idea how the Ubuntu desktop ZFS will interact with your pools, or how well tested the patch is, so you might totally corrupt your pool anyway.
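
If you do try the Ubuntu live image route, the test itself would be roughly this; the pool and snapshot names are placeholders, and if the live ZFS doesn't support some of the pool's feature flags the import may only work read-only or not at all:

Code:
# From the live session: install the ZFS userland and check which version it ships
sudo apt install -y zfsutils-linux
zfs version
# Import the pool without mounting anything, try the operation that crashes TN,
# then export cleanly
sudo zpool import -f -N tank
sudo zfs destroy tank/backup/somedataset@oldsnap
sudo zpool export tank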
 

LarsR

Guru
Joined
Oct 23, 2020
Messages
719
iX didn't build the kernel. SCALE uses a Debian LTS release as the base OS and then builds a custom middleware and GUI on top.
 

samarium

Contributor
Joined
Apr 8, 2023
Messages
192
Code:
# df /
Filesystem               1K-blocks    Used Available Use% Mounted on
boot-pool/ROOT/22.12.3.2   7872128 2941056   4931072  38% /
# grep PRETTY /etc/os-release
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
# ls -l /usr/src
total 26
drwxr-xr-x 6 root root  9 Jul  5 03:27 linux-headers-5.15.107+truenas
drwxr-xr-x 8 root root 13 Jul  5 03:29 nvidia-current-515.65.01
drwxr-xr-x 6 root root 37 Jul  5 03:33 wireguard-1.0.20210219
# dkms status
nvidia-current, 515.65.01, 5.15.107+truenas, x86_64: installed
wireguard, 1.0.20210219: added
# find /lib/modules -name zfs.ko
/lib/modules/5.15.107+truenas/extra/zfs/zfs/zfs.ko

So judging from the /usr/src linux-headers, it seems iX modified or repackaged the kernel somewhat, so maybe they have some patches that aren't generally released yet. Just looking in /lib/modules/5.15.107+truenas/extra you can see extra non-ZFS modules too.

Since /usr/src and dkms are present, you could theoretically add a patched vanilla ZFS to /usr/src and build it with dkms, but that sounds like a good recipe for problems: if you use dkms install you will be wiping out the existing installed modules, which are also required for startup since boot-pool is ZFS, so you would also need to rebuild the initramfs.

I would be doing this with a Debian install on another disk, mostly on another computer, then either figure out how to get the modules onto the base desktop USB, or do a USB install and then use the new modules there. After booting that USB you could try cleaning out the bad snapshots. If you can clean them out and then don't send any more, you don't need to hack the TN kernel.
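
For the record, the rough shape of building a patched ZFS on that separate Debian box would be something like the sketch below. The version tag, the dependency list and the cherry-pick step are illustrative only; check them against the OpenZFS Debian build docs and the PR itself, and expect conflicts when applying a master-based PR to an older branch.

Code:
# Build dependencies (roughly; see the OpenZFS "Custom Packages" docs for Debian)
apt install -y build-essential autoconf automake libtool gawk dkms \
    linux-headers-$(uname -r) libblkid-dev uuid-dev libudev-dev libssl-dev \
    zlib1g-dev libaio-dev libattr1-dev libelf-dev python3 python3-dev libffi-dev

# Vanilla ZFS source plus the fix from PR 15039
git clone https://github.com/openzfs/zfs.git
cd zfs
git checkout zfs-2.1.12                  # pick a tag close to what your pools run
git fetch origin pull/15039/head:pr-15039
git cherry-pick master..pr-15039         # just the PR's commits; resolve conflicts by hand

# Build and install modules + userland on this Debian system only
sh autogen.sh && ./configure
make -s -j"$(nproc)"
make install && depmod -a && modprobe zfs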

Only when you have a working solution on non-TN would I even think about using DKMS on TN to build new modules, and then you would have to have a plan for installing, the initramfs, boot failure/crash, rolling back, maybe managing a separate boot environment to isolate the changes while testing, and also a plan for dealing with upgrades.

Hopefully iX will release a patch soon, but quickly releasing not-fully-tested patches would result in instability for all customers, not just those affected, so it would not be something I would be in a hurry to do.

Not sure how iX is handling this; maybe if more people open JIRA tickets they will know it is affecting more people?
 
Last edited:

TempleHasFallen

Dabbler
Joined
Jan 27, 2022
Messages
34
I also can't understand how this isn't affecting more people; not being able to replicate encrypted datasets should be affecting a huge percentage of users. Maybe they just haven't updated to a broken version yet.
Hopefully there's a patch very soon.
 

samarium

Contributor
Joined
Apr 8, 2023
Messages
192
Seems like this has been an issue since at least OpenZFS 2.1.6.

But maybe most large enterprise customers are like me and avoid updating production too soon, waiting for those on the bleeding edge to bleed out first.
 

ikarlo

Dabbler
Joined
Apr 21, 2021
Messages
18
I also can't understand how this isn't affecting more people; not being able to replicate encrypted datasets should be affecting a huge percentage of users. Maybe they just haven't updated to a broken version yet.
Hopefully there's a patch very soon.

I agree with you; maybe few people use ZFS encryption today...
 

samarium

Contributor
Joined
Apr 8, 2023
Messages
192
If you proceed with building a patched ZFS, testing is probably better done inside some Linux VMs with a ZFS root, using small prepared pools that can be restored and retested.
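
For example, a throwaway file-backed pool is enough to reproduce the raw-receive-then-destroy sequence inside a VM (names are placeholders):

Code:
# Disposable pool backed by a sparse file
truncate -s 1G /tmp/vdev0.img
zpool create testpool /tmp/vdev0.img

# Encrypted source dataset, raw (-w) replicated copy, then the destroy that crashes
zfs create -o encryption=on -o keyformat=passphrase testpool/src
zfs snapshot -r testpool/src@snap1
zfs send -w testpool/src@snap1 | zfs receive testpool/copy
zfs destroy -r testpool/copy

# Reset between test runs
zpool destroy testpool && rm -f /tmp/vdev0.img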
 
Joined
Jul 17, 2023
Messages
8
Just out of curiosity, have you logged into the shell and run sudo journalctl | grep shutdown? Look for any halts or other things that would give an explanation as to why. You can also see the times by using last -x. This should identify a pattern of times, and then you can start looking for cron jobs, scripts, etc.
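
For example:

Code:
# Any clean shutdown / power-off / reboot messages in the journal?
sudo journalctl | grep -iE 'shutdown|powering off|reboot' | tail -n 20
# Reboot and shutdown history with timestamps (from wtmp)
last -x reboot shutdown | head -n 20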
 