TrueNAS SCALE has been crashing for 2 days, always at 0:00 - Bug?

Kienaba

Explorer
Joined
May 24, 2022
Messages
52
Hello,

I somehow only have problems with TrueNAS... first the RAM was broken, and now I think it is the software, because I had 2 crashes in the past 2 days, always at 0:00. I exported the logs and checked everything (syslog, cron log, error, VM logs, daemon log, ...): nothing at 0:00 or in the minutes before or after. There is also no temperature issue; I checked all graphs. What is the reason? And it's always at 0:00!

I think it's because of the latest TrueNAS version? Maybe a bug? I don't know what I should do now. This is so frustrating. I just want to run my TrueNAS and I don't want to touch the server every month.

This is my syslog, but there is nothing to see in it. Just some normal log lines before the crash and then the typical lines from TrueNAS booting again:

Nov 30 01:30:41 truenas systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Nov 30 01:40:41 truenas systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Nov 30 01:40:41 truenas systemd[1]: sysstat-collect.service: Deactivated successfully.
Nov 30 01:40:41 truenas systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Nov 30 01:50:41 truenas systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Nov 30 01:50:41 truenas systemd[1]: sysstat-collect.service: Deactivated successfully.
Nov 30 01:50:41 truenas systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Nov 30 01:58:01 truenas CRON[1533606]: (root) CMD (midclt call update.download > /dev/null 2>&1)
Nov 30 02:06:36 truenas syslog-ng[4043]: syslog-ng starting up; version='3.28.1'
Nov 30 02:05:48 truenas kernel: microcode: microcode updated early to revision 0x41c, date = 2022-03-24
Nov 30 02:05:48 truenas kernel: Linux version 5.15.131+truenas (root@tnsbuilds01.tn.ixsystems.net) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP Fri Oct 13 19:46:10 UTC 2023
Nov 30 02:05:48 truenas kernel: Command line: BOOT_IMAGE=/ROOT/22.12.4.2@/boot/vmlinuz-5.15.131+truenas root=ZFS=boot-pool/ROOT/22.12.4.2 ro libata.allow_tpm=1 amd_iommu=on iommu=pt kvm_amd.npt=1 kvm_amd.avic=1 intel_iommu=on zfsforce=1 nvme_core.multipath=N
Nov 30 02:05:48 truenas kernel: x86/split lock detection: #AC: crashing the kernel on kernel split_locks and warning on user-space split_locks
Nov 30 02:05:48 truenas kernel: BIOS-provided physical RAM map:

I am on TrueNAS SCALE 22.12.4.2 and I use an Intel CPU (i7-1260P) with 64GB DDR4 RAM and M.2 PCIe SSDs for system and pool with SATA connection.
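
When the log shows no error, the crash window can at least be located by looking for the longest silence between consecutive log lines. Below is a minimal Python sketch of that idea, using three lines from the excerpt above. The helper names `parse_ts` and `largest_gap` and the year 2023 are assumptions for illustration (classic syslog timestamps carry no year); this is not a TrueNAS tool.

```python
from datetime import datetime

# Three lines copied from the syslog excerpt in this post.
SAMPLE = """\
Nov 30 01:50:41 truenas systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Nov 30 01:58:01 truenas CRON[1533606]: (root) CMD (midclt call update.download > /dev/null 2>&1)
Nov 30 02:05:48 truenas kernel: microcode: microcode updated early to revision 0x41c, date = 2022-03-24
"""


def parse_ts(line, year=2023):
    # Syslog timestamps have no year, so one is assumed here.
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def largest_gap(lines):
    """Return (gap, last_line_before, first_line_after) for the widest silence."""
    # Sort first: boot-time lines are not always written in timestamp order.
    stamped = sorted((parse_ts(l), l) for l in lines if l.strip())
    best = None
    for (t0, l0), (t1, l1) in zip(stamped, stamped[1:]):
        if best is None or t1 - t0 > best[0]:
            best = (t1 - t0, l0, l1)
    return best


gap, before, after = largest_gap(SAMPLE.splitlines())
print("silence:", gap)
print("last line before:", before)
print("first line after:", after)
```

The last line before the gap is only the last thing that got written to disk, so it is a hint about timing rather than proof of the cause.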
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Unless someone else is seeing the same issues, I would assume it's related to your specific setup.

Can you document your motherboard and BIOS? The date seems to be off on Nov 30.
What VMs and Apps are you running?

You might want to update to 23.10 and see if the behavior is unchanged.
 

Kienaba

Explorer
Joined
May 24, 2022
Messages
52
Can you document your motherboard and BIOS?
I use a mini PC: GIGABYTE Brix Extreme GB-BEi7HS-1260
The date seems to be off on Nov 30.
I don't know why the logs are always 2 hours in the future. When I go to System Settings and then General, I see the correct time.
What VMs and Apps are you running?
Only VMs, both Ubuntu: one with Pi-hole, one with Nextcloud.
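
If the log clock really runs a fixed 2 hours ahead, the syslog timestamps can be shifted back to compare them with the time shown in the UI. A small Python sketch, assuming a constant 2-hour offset and the year 2023 (both are assumptions; `log_to_ui_time` is a made-up helper name):

```python
from datetime import datetime, timedelta

# Assumption: the log clock is a constant 2 hours ahead of the UI time.
LOG_OFFSET = timedelta(hours=2)


def log_to_ui_time(stamp, year=2023):
    # Syslog timestamps have no year, so one is assumed here.
    logged = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return logged - LOG_OFFSET


# Last line before the crash and first kernel line after the reboot.
print(log_to_ui_time("Nov 30 01:58:01"))  # 2023-11-29 23:58:01
print(log_to_ui_time("Nov 30 02:05:48"))  # 2023-11-30 00:05:48
```

Under that assumption the silent gap in the syslog sits right around 0:00, which matches the crash time I see.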
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Log timestamps are probably set by the BIOS clock...

The mini PC probably doesn't have ECC, but midnight is unusual for a regular crash.

You might want to turn off each VM before midnight and see if that changes the behaviour.
 

somethingweird

Contributor
Joined
Jan 27, 2022
Messages
183
The Core™ i7-1260P has 12 cores (4P+8E). Does TrueNAS SCALE/CORE support P/E cores?
 

Kienaba

Explorer
Joined
May 24, 2022
Messages
52
I mean, it worked for over a month without any problems.

Btw., it crashed again at 0:00. And I got a new error, but only while the server is booting, and I think it's an error we can ignore:
  • NTP health check failed - No Active NTP peers: [{'195.186.4.100': 'REJECT'}, {'156.106.214.52': 'REJECT'}, {'5.148.175.134': 'REJECT'}]
I will try to power off the VMs before the next day.
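
For reference, the peer list in that alert can be read mechanically. This is only a Python sketch that assumes the text after the first ": " is a Python-style list literal, as in the message above; `peer_states` is a made-up helper name and nothing here queries NTP itself.

```python
import ast

ALERT = (
    "NTP health check failed - No Active NTP peers: "
    "[{'195.186.4.100': 'REJECT'}, {'156.106.214.52': 'REJECT'}, "
    "{'5.148.175.134': 'REJECT'}]"
)


def peer_states(alert):
    # Assumption: everything after the first ": " is a list of one-entry dicts.
    _, _, payload = alert.partition(": ")
    states = {}
    for entry in ast.literal_eval(payload):
        states.update(entry)
    return states


states = peer_states(ALERT)
for peer, state in states.items():
    print(peer, state)
print("all rejected:", all(s == "REJECT" for s in states.values()))
```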
 

Kienaba

Explorer
Joined
May 24, 2022
Messages
52
I powered off all VMs before 0:00, but the crash came anyway... so there is something wrong with the TrueNAS system itself. But with no error logs? This is so strange. Maybe it has something to do with the backup jobs? Maybe one backup job is not configured well. That would explain why it always crashes at 0:00. But there is no job at 0:00. I took this screenshot at 0:07, after TrueNAS had booted again. But maybe there is a pending job? I mean a job that failed days ago and that TrueNAS now wants to run again at 0:00? Maybe an error happened in the past and TrueNAS tries to rerun the job or fix it at 0:00? Maybe someone knows whether such processes are included in TrueNAS' code?

If you have any idea, please let me know. If not, I will try the upgrade to TrueNAS 23.10. And if the error is still there after the upgrade, I will need to switch to Proxmox... :(
 

Attachments

  • 1701385607802.png

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
I don't know of anyone reporting a similar issue... so it's probably something incompatible on your local system. Did you manage to fix the local system time and date?
 

Kienaba

Explorer
Joined
May 24, 2022
Messages
52
Did you manage to fix the local system time and date?
No, because I don't think it's an issue. It has been like this since the installation. But I will do it later. I adjusted the BIOS settings a few weeks ago, so it's strange that I overlooked the wrong time. I will check whether it's the BIOS clock or not.
 

Kienaba

Explorer
Joined
May 24, 2022
Messages
52
Upgraded to 23.10 and TrueNAS is still crashing. Today I will check the BIOS and adjust the time if possible. This is my last possible solution. Otherwise I need to switch to Proxmox. It's so annoying, as everything went perfectly over the last few weeks. I haven't changed anything. And now there's an error.
 

florihupf

Cadet
Joined
Jan 13, 2022
Messages
7
I had a similar issue and it turned out to be a corrupted ZFS pool. The issue would log a stack trace containing "zfs panic" to the physically attached monitor. This panic stops all writes to the pool and results in a system hang.

Not sure if that is the issue you are having but the timing was exactly the same.
 

Kienaba

Explorer
Joined
May 24, 2022
Messages
52
I had a similar issue and it turned out to be a corrupted ZFS pool.
Is there an option in TrueNAS to fix ZFS pools automatically? Also, I get this new message after the upgrade to 23.10, but I do not know how to upgrade my ZFS pools. I guess I need to use the terminal and just enter "zpool upgrade <pool name>"?
 

Attachments

  • 1701633835151.png

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
No need to upgrade the pool. Upgrading it prevents rollback.

If it is a corrupted pool, it's worth reporting the issue. @florihupf did you report a bug or find a resolution?

Looking at your hardware, I don't think it has ECC, which makes it much more likely for corrupted data to be written to the pool.
 

Kienaba

Explorer
Joined
May 24, 2022
Messages
52
I have now corrected the BIOS clock (it was 1h behind) and updated the BIOS. The system is still crashing at 0:00 every time. But now I can see something in the reporting graphs for CPU, disk, ...: last night the system was already down at 23:43, until TrueNAS rebooted and was available again at 0:06.

I don't think it's because of the missing ECC RAM, because normal RAM does not fail that often, and it also makes no sense that the system always crashes at a specific time.

But I think I need to switch to Proxmox now; this is too much for me. No information in the logs, and I need to reboot my VMs every night. This is not cool.
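
To put a number on that outage from the reporting graphs: a small Python sketch that computes the length of a downtime window that may cross midnight. The `downtime` helper is made up for illustration and assumes the outage is shorter than 24 hours.

```python
from datetime import datetime, timedelta


def downtime(down_hhmm, up_hhmm):
    """Length of an outage given as HH:MM wall-clock times (shorter than 24h)."""
    down = datetime.strptime(down_hhmm, "%H:%M")
    up = datetime.strptime(up_hhmm, "%H:%M")
    if up < down:
        # The outage crossed midnight, so "up" belongs to the next day.
        up += timedelta(days=1)
    return up - down


print(downtime("23:43", "0:06"))  # 0:23:00
```

So the system was gone for 23 minutes and, according to the graphs, went down 17 minutes before 0:00, not exactly at 0:00.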
 

Attachments

  • 1701818641760.png

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Memory errors don't occur on a regular schedule... so lack of ECC is not what causes an issue at midnight.

Lack of ECC allows something in memory to be corrupted and then committed to storage. This can corrupt a file system. It's a rare event, but very difficult to diagnose or remedy.

It's not clear what the cause is here, but it does not appear to be a software bug that is causing issues on many systems... it is likely something in the local system. Erasing the pool/system and reloading may remove the underlying cause, regardless of whether it's Proxmox or SCALE.
 

Kienaba

Explorer
Joined
May 24, 2022
Messages
52
Hm, but I don't want to set everything up again...

Maybe some things on the TrueNAS system are broken or outdated? Is it possible to update the system? Or is that only possible when a new TrueNAS version is out?
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
The problem is that we don't know whether the file system is corrupted... a software update doesn't fix that.

If it were a software issue, we'd expect to see other users with a similar issue.

SCALE 23.10.1 should be available next week... but I don't see any bug fixes that are relevant to your issue.
 