TrueNAS Scale keeps crashing & rebooting. Tried everything but MB/CPU. Need expert help.

dazzywazzy | Cadet | Joined: Sep 20, 2022 | Messages: 5
Hi everyone.

Setup below:

  • ASRock Z790 Steel Legend WiFi DDR5 (LGA1700, ATX)
  • Intel 13th Gen Core i5-13500 (14 cores / 20 threads, up to 4.8 GHz)
  • Corsair Vengeance 64GB (2x32GB) DDR5 6000MHz CL40 RAM (white) - Memtested twice and it passed
  • 2 x Seagate IronWolf 12TB NAS HDD (ST12000VNZ008) - Mirror
  • 1 x Seagate IronWolf 4TB NAS HDD (ST4000VN008) - Used as a hot disk / temp drive for stuff I don't care about
  • 1 x Crucial P2 500GB 3D NAND NVMe PCIe M.2 SSD (CT500P2SSD8) - Boot pool
  • Realtek 2.5G LAN
  • Intel Ethernet Converged X520-SR2 network adapter + 2 x Intel E65689-001 10G transceivers
  • Only apps I currently run are Plex and Tailscale, both from TrueCharts
  • I also run Home Assistant (HASS) as a VM

I consider myself pretty handy with IT. I'd had TrueNAS Scale working fine for a couple of years, but I upgraded my hardware a couple of months ago. It seemed fine at the start, however stability has gotten worse and worse since.

Things I have tried:

  • Formatted all drives and reinstalled TrueNAS, then restored everything from backup.
  • Recreated all my datasets.
  • Ran memtest - twice. For reference, my memory is not overclocked.
  • Stopped my HASS VM - I had a suspicion that CPU passthrough could have been causing some kind of issue, however the system seems to crash even with it stopped.
  • Recreated the pool for apps.
  • Reinstalled Plex.

Even just running Plex on its own, the system will crash and reboot. This seems to happen when it's under more load than just sitting idle: the latest crash was in the middle of a Plex stream, but it has also happened randomly during data transfers from SMB shares, and sometimes even while scrubbing pools. Sometimes the system can stay up for 3 days; other times it crashes and reboots every few hours.

I caught the crash dump below, and am at my wits' end.

If anyone wants to see more logs, just let me know and I will sanitise and send/attach to this post.
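In the meantime, I can pull logs from the boot before a crash with something along these lines (assuming the systemd journal persists across reboots on SCALE):

Code:
# List the boots the journal knows about
journalctl --list-boots
# Errors and worse from the previous boot
journalctl -b -1 -p err
# Kernel messages from the previous boot, in case of a panic or MCE
journalctl -k -b -1 | tail -n 100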

ANY help on this is appreciated. I have the feeling most will likely say I need to change my MB away from a gaming board, but figured it was worth a shot to put this info out to the experts.

Code:
root@NAS[~]# 2023 Dec  3 07:22:08 NAS Process 3606 (netdata) of user 999 dumped core.


Stack trace of thread 4415:
#0  0x000055e7d77c0c4a n/a (netdata + 0xc0c4a)
#1  0x000055e7d77c15bc n/a (netdata + 0xc15bc)
#2  0x000055e7d77c251e n/a (netdata + 0xc251e)
#3  0x000055e7d7949237 n/a (netdata + 0x249237)
#4  0x000055e7d791a80e n/a (netdata + 0x21a80e)
#5  0x000055e7d77651e4 n/a (netdata + 0x651e4)
#6  0x00007f52916f0fd4 start_thread (libc.so.6 + 0x88fd4)
#7  0x00007f52917715bc __clone3 (libc.so.6 + 0x1095bc)


Stack trace of thread 4411:
#0  0x00007f5291763fff __GI___poll (libc.so.6 + 0xfbfff)
#1  0x000055e7d7760436 n/a (netdata + 0x60436)
#2  0x000055e7d790206c n/a (netdata + 0x20206c)
#3  0x000055e7d7902e61 n/a (netdata + 0x202e61)
#4  0x000055e7d77651e4 n/a (netdata + 0x651e4)
#5  0x00007f52916f0fd4 start_thread (libc.so.6 + 0x88fd4)
#6  0x00007f52917715bc __clone3 (libc.so.6 + 0x1095bc)


Stack trace of thread 4412:
#0  0x00007f5291737345 __GI___clock_nanosleep (libc.so.6 + 0xcf345)
#1  0x000055e7d7739cfc n/a (netdata + 0x39cfc)
#2  0x000055e7d7739fbb n/a (netdata + 0x39fbb)
#3  0x000055e7d77b3e89 n/a (netdata + 0xb3e89)
#4  0x000055e7d77651e4 n/a (netdata + 0x651e4)
#5  0x00007f52916f0fd4 start_thread (libc.so.6 + 0x88fd4)
#6  0x00007f52917715bc __clone3 (libc.so.6 + 0x1095bc)


Stack trace of thread 4386:
#0  0x00007f5291770c06 epoll_wait (libc.so.6 + 0x108c06)
#1  0x00007f5291986135 n/a (libuv.so.1 + 0x23135)
#2  0x00007f52919729e4 uv_run (libuv.so.1 + 0xf9e4)
#3  0x000055e7d78de34d n/a (netdata + 0x1de34d)
#4  0x00007f52916f0fd4 start_thread (libc.so.6 + 0x88fd4)
#5  0x00007f52917715bc __clone3 (libc.so.6 + 0x1095bc)


Stack trace of thread 6261:
#0  0x00007f52916edd36 __futex_abstimed_wait_common64 (libc.so.6 + 0x85d36)
#1  0x00007f52916f03f8 __pthread_cond_wait_common (libc.so.6 + 0x883f8)
#2  0x00007f5291980169 uv_cond_wait (libuv.so.1 + 0x1d169)
#3  0x00007f529196defe n/a (libuv.so.1 + 0xaefe)
#4  0x00007f52916f0fd4 start_thread (libc.so.6 + 0x88fd4)
#5  0x00007f52917715bc __clone3 (libc.so.6 + 0x1095bc)


Stack trace of thread 3606:
#0  0x00007f529173bbf2 __libc_pause (libc.so.6 + 0xd3bf2)
#1  0x000055e7d77262d5 n/a (netdata + 0x262d5)
#2  0x000055e7d77169ce n/a (netdata + 0x169ce)
#3  0x00007f529168f18a __libc_start_call_main (libc.so.6 + 0x2718a)
#4  0x00007f529168f245 __libc_start_main_impl (libc.so.6 + 0x27245)
#5  0x000055e7d77176a1 n/a (netdata + 0x176a1)


Stack trace of thread 6266:
#0  0x00007f52916edd36 __futex_abstimed_wait_common64 (libc.so.6 + 0x85d36)
#1  0x00007f52916f03f8 __pthread_cond_wait_common (libc.so.6 + 0x883f8)
#2  0x00007f5291980169 uv_cond_wait (libuv.so.1 + 0x1d169)
#3  0x00007f529196defe n/a (libuv.so.1 + 0xaefe)
#4  0x00007f52916f0fd4 start_thread (libc.so.6 + 0x88fd4)
#5  0x00007f52917715bc __clone3 (libc.so.6 + 0x1095bc)


Stack trace of thread 6267:
#0  0x00007f52916edd36 __futex_abstimed_wait_common64 (libc.so.6 + 0x85d36)
#1  0x00007f52916f03f8 __pthread_cond_wait_common (libc.so.6 + 0x883f8)
#2  0x00007f5291980169 uv_cond_wait (libuv.so.1 + 0x1d169)
#3  0x00007f529196defe n/a (libuv.so.1 + 0xaefe)
#4  0x00007f52916f0fd4 start_thread (libc.so.6 + 0x88fd4)
#5  0x00007f52917715bc __clone3 (libc.so.6 + 0x1095bc)


Stack trace of thread 4418:
#0  0x00007f529176007c __GI___libc_read (libc.so.6 + 0xf807c)
#1  0x00007f52916ea00e _IO_new_file_underflow (libc.so.6 + 0x8200e)
#2  0x00007f52916eb002 __GI__IO_default_uflow (libc.so.6 + 0x83002)
#3  0x00007f52916dee2a __GI__IO_getline_info (libc.so.6 + 0x76e2a)
#4  0x00007f52916ddf6e _IO_fgets (libc.so.6 + 0x75f6e)
#5  0x000055e7d79035f3 n/a (netdata + 0x2035f3)
#6  0x000055e7d77a468a n/a (netdata + 0xa468a)
#7  0x000055e7d779ebf6 n/a (netdata + 0x9ebf6)
#8  0x000055e7d77651e4 n/a (netdata + 0x651e4)
#9  0x00007f52916f0fd4 start_thread (libc.so.6 + 0x88fd4)
#10 0x00007f52917715bc __clone3 (libc.so.6 + 0x1095bc)


Stack trace of thread 6257:
#0  0x00007f52916edd36 __futex_abstimed_wait_common64 (libc.so.6 + 0x85d36)
#1  0x00007f52916f03f8 __pthread_cond_wait_common (libc.so.6 + 0x883f8)
#2  0x00007f5291980169 uv_cond_wait (libuv.so.1 + 0x1d169)
#3  0x00007f529196defe n/a (libuv.so.1 + 0xaefe)
#4  0x00007f52916f0fd4 start_thread (libc.so.6 + 0x88fd4)
#5  0x00007f52917715bc __clone3 (libc.so.6 + 0x1095bc)


Stack trace of thread 3658:
#0  0x00007f5291770c06 epoll_wait (libc.so.6 + 0x108c06)
#1  0x00007f5291986135 n/a (libuv.so.1 + 0x23135)
#2  0x00007f52919729e4 uv_run (libuv.so.1 + 0xf9e4)
#3  0x000055e7d7906016 n/a (netdata + 0x206016)
#4  0x00007f52916f0fd4 start_thread (libc.so.6 + 0x88fd4)
#5  0x00007f52917715bc __clone3 (libc.so.6 + 0x1095bc)


Stack trace of thread 4408:
#0  0x00007f5291737345 __GI___clock_nanosleep (libc.so.6 + 0xcf345)
#1  0x000055e7d7739cfc n/a (netdata + 0x39cfc)
#2  0x000055e7d7739fbb n/a (netdata + 0x39fbb)
#3  0x000055e7d7726a24 n/a (netdata + 0x26a24)
#4  0x000055e7d77651e4 n/a (netdata + 0x651e4)
#5  0x00007f52916f0fd4 start_thread (libc.so.6 + 0x88fd4)
#6  0x00007f52917715bc __clone3 (libc.so.6 + 0x1095bc)


Stack trace of thread 4413:
#0  0x00007f5291737345 __GI___clock_nanosleep (libc.so.6 + 0xcf345)
#1  0x000055e7d7739cfc n/a (netdata + 0x39cfc)
#2  0x000055e7d78f357d n/a (netdata + 0x1f357d)
#3  0x000055e7d77651e4 n/a (netdata + 0x651e4)
#4  0x00007f52916f0fd4 start_thread (libc.so.6 + 0x88fd4)
#5  0x00007f52917715bc __clone3 (libc.so.6 + 0x1095bc)


Stack trace of thread 6260:
#0  0x00007f52916edd36 __futex_abstimed_wait_common64 (libc.so.6 + 0x85d36)
#1  0x00007f52916f03f8 __pthread_cond_wait_common (libc.so.6 + 0x883f8)
#2  0x00007f5291980169 uv_cond_wait (libuv.so.1 + 0x1d169)
#3  0x00007f529196defe n/a (libuv.so.1 + 0xaefe)
#4  0x00007f52916f0fd4 start_thread (libc.so.6 + 0x88fd4)
#5  0x00007f52917715bc __clone3 (libc.so.6 + 0x1095bc)


Stack trace of thread 4423:
#0  0x00007f5291763fff __GI___poll (libc.so.6 + 0xfbfff)
#1  0x000055e7d7760436 n/a (netdata + 0x60436)
#2  0x000055e7d790206c n/a (netdata + 0x20206c)
#3  0x000055e7d77651e4 n/a (netdata + 0x651e4)
#4  0x00007f52916f0fd4 start_thread (libc.so.6 + 0x88fd4)
#5  0x00007f52917715bc __clone3 (libc.so.6 + 0x1095bc)


Stack trace of thread 4422:
#0  0x00007f5291763fff __GI___poll (libc.so.6 + 0xfbfff)
#1  0x000055e7d7760436 n/a (netdata + 0x60436)
#2  0x000055e7d790206c n/a (netdata + 0x20206c)
#3  0x000055e7d77651e4 n/a (netdata + 0x651e4)
#4  0x00007f52916f0fd4 start_thread (libc.so.6 + 0x88fd4)
#5  0x00007f52917715bc __clone3 (libc.so.6 + 0x1095bc)


Stack trace of thread 4407:
#0  0x00007f5291737345 __GI___clock_nanosleep (libc.so.6 + 0xcf345)
#1  0x00007f529173bc53 __GI___nanosleep (libc.so.6 + 0xd3c53)
#2  0x00007f529173bb8a __sleep (libc.so.6 + 0xd3b8a)
#3  0x000055e7d779f37e n/a (netdata + 0x9f37e)
#4  0x000055e7d77651e4 n/a (netdata + 0x651e4)
#5  0x00007f52916f0fd4 start_thread (libc.so.6 + 0x88fd4)
#6  0x00007f52917715bc __clone3 (libc.so.6 + 0x1095bc)


Stack trace of thread 4416:
#0  0x00007f5291770c06 epoll_wait (libc.so.6 + 0x108c06)
#1  0x00007f5291986135 n/a (libuv.so.1 + 0x23135)
#2  0x00007f52919729e4 uv_run (libuv.so.1 + 0xf9e4)
#3  0x000055e7d77278b6 n/a (netdata + 0x278b6)
#4  0x00007f52916f0fd4 start_thread (libc.so.6 + 0x88fd4)
#5  0x00007f52917715bc __clone3 (libc.so.6 + 0x1095bc)


Stack trace of thread 6263:
#0  0x00007f52916edd36 __futex_abstimed_wait_common64 (libc.so.6 + 0x85d36)
#1  0x00007f52916f03f8 __pthread_cond_wait_common (libc.so.6 + 0x883f8)
#2  0x00007f5291980169 uv_cond_wait (libuv.so.1 + 0x1d169)
#3  0x00007f529196defe n/a (libuv.so.1 + 0xaefe)
#4  0x00007f52916f0fd4 start_thread (libc.so.6 + 0x88fd4)
#5  0x00007f52917715bc __clone3 (libc.so.6 + 0x1095bc)


Stack trace of thread 6262:
#0  0x00007f52916edd36 __futex_abstimed_wait_common64 (libc.so.6 + 0x85d36)
#1  0x00007f52916f03f8 __pthread_cond_wait_common (libc.so.6 + 0x883f8)
#2  0x00007f5291980169 uv_cond_wait (libuv.so.1 + 0x1d169)
#3  0x00007f529196defe n/a (libuv.so.1 + 0xaefe)
#4  0x00007f52916f0fd4 start_thread (libc.so.6 + 0x88fd4)
#5  0x00007f52917715bc __clone3 (libc.so.6 + 0x1095bc)


Stack trace of thread 4417:
#0  0x00007f5291737345 __GI___clock_nanosleep (libc.so.6 + 0xcf345)
#1  0x000055e7d7739cfc n/a (netdata + 0x39cfc)
#2  0x000055e7d7739fbb n/a (netdata + 0x39fbb)
#3  0x000055e7d7
 

dazzywazzy | Cadet | Joined: Sep 20, 2022 | Messages: 5
Another reboot this morning, and I caught these errors in the logs:

Dec 4 04:27:22 NAS kernel: Error: Driver 'pcspkr' is already registered, aborting...
Dec 4 04:27:55 NAS blkmapd[3377]: open pipe file /run/rpc_pipefs/nfs/blocklayout failed: No such file or directory
Dec 4 04:27:55 NAS upsd[3417]: Fatal error: at least one UPS must be defined in ups.conf
Dec 4 04:27:55 NAS systemd[1]: Failed to start nslcd.service - LSB: LDAP connection daemon.
Dec 4 04:27:56 NAS smartd[3429]: Device: /dev/nvme0n1, number of Error Log entries increased from 3500 to 3501
Dec 4 04:27:58 NAS libvirtd[3720]: invalid argument: cannot find architecture arm
Dec 4 04:27:58 NAS haproxy[5226]: backend be_22 has no server available!
Dec 4 04:27:59 NAS haproxy[5226]: backend be_24 has no server available!

One error there points to the boot drive /dev/nvme0n1, so I ran smartctl -a /dev/nvme0n1. Everything comes back as PASSED:

=== START OF INFORMATION SECTION ===
Model Number: CT500P2SSD8
Firmware Version: P2CR012
PCI Vendor/Subsystem ID: 0xc0a9
IEEE OUI Identifier: 0x6479a7
Total NVM Capacity: 500,107,862,016 [500 GB]
Unallocated NVM Capacity: 0
Controller ID: 1
NVMe Version: 1.3
Number of Namespaces: 1
Namespace 1 Size/Capacity: 500,107,862,016 [500 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 6479a7 46f000000f
Local Time is: Mon Dec 4 08:25:06 2023 ACDT
Firmware Updates (0x12): 1 Slot, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005e): Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x0e): Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size: 64 Pages
Warning Comp. Temp. Threshold: 70 Celsius
Critical Comp. Temp. Threshold: 85 Celsius

Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 4.50W - - 0 0 0 0 0 0
1 + 2.70W - - 1 1 1 1 0 0
2 + 2.16W - - 2 2 2 2 0 0
3 - 0.0700W - - 3 3 3 3 1000 1000
4 - 0.0020W - - 4 4 4 4 5000 55000

Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 1
1 - 4096 0 0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 30 Celsius
Available Spare: 100%
Available Spare Threshold: 5%
Percentage Used: 1%
Data Units Read: 19,186,610 [9.82 TB]
Data Units Written: 20,038,660 [10.2 TB]
Host Read Commands: 156,686,730
Host Write Commands: 265,749,488
Controller Busy Time: 3,964
Power Cycles: 484
Power On Hours: 16,832
Unsafe Shutdowns: 204
Media and Data Integrity Errors: 0
Error Information Log Entries: 3,501
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0

Error Information (NVMe Log 0x01, 16 of 16 entries)
No Errors Logged
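The only oddity is the Error Information Log Entries count that smartd keeps flagging, even though the error log itself shows nothing. If it would help, I could also run a short self-test on the drive with something like the below (the capability flags above suggest it's supported):

Code:
# Kick off a short NVMe device self-test (the drive advertises Self_Test support above)
smartctl -t short /dev/nvme0n1
# Check overall health again afterwards
smartctl -H /dev/nvme0n1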

I also found this in middlewared.log in the lead-up to the crash:

[2023/12/04 03:38:15] (DEBUG) UsageService.start():62 - Scheduled next run in 66419 seconds
[2023/12/04 03:38:39] (WARNING) middlewared.__periodic_task_wrapper():1023 - Exception while calling periodic task
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 1880, in _execute_context
self.dialect.do_executemany(
File "/usr/lib/python3/dist-packages/sqlalchemy/engine/default.py", line 733, in do_executemany
cursor.executemany(statement, parameters)
sqlite3.ProgrammingError: Binding 1 has no name, but you supplied a dictionary (which has only names).

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1021, in __periodic_task_wrapper
await self._call(method_name, service_obj, method, [])
File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1352, in _call
return await self.run_in_executor(prepared_call.executor, methodobj, *prepared_call.args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1251, in run_in_executor
return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/middlewared/plugins/smb_/sharesec.py", line 327, in check_share_info_tdb
self.middleware.call_sync('smb.sharesec.synchronize_acls')
File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1420, in call_sync
return self.run_coroutine(methodobj(*prepared_call.args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1460, in run_coroutine
return fut.result()
^^^^^^^^^^^^
File "/usr/lib/python3.11/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/usr/lib/python3/dist-packages/middlewared/schema/processor.py", line 177, in nf
return await func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/middlewared/plugins/smb_/sharesec.py", line 350, in synchronize_acls
await self.middleware.call(
File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1398, in call
return await self._call(
^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1341, in _call
return await methodobj(*prepared_call.args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/middlewared/schema/processor.py", line 177, in nf
return await func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/middlewared/plugins/datastore/write.py", line 117, in update
result = await self.middleware.call(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1398, in call
return await self._call(
^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1352, in _call
return await self.run_in_executor(prepared_call.executor, methodobj, *prepared_call.args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1251, in run_in_executor
return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/middlewared/plugins/datastore/connection.py", line 106, in execute_write
result = self.connection.execute(sql, binds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 1365, in execute
return self._exec_driver_sql(
^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 1669, in _exec_driver_sql
ret = self._execute_context(
^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 1943, in _execute_context
self._handle_dbapi_exception(
File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 2124, in _handle_dbapi_exception
util.raise_(
File "/usr/lib/python3/dist-packages/sqlalchemy/util/compat.py", line 211, in raise_
raise exception
File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 1880, in _execute_context
self.dialect.do_executemany(
File "/usr/lib/python3/dist-packages/sqlalchemy/engine/default.py", line 733, in do_executemany
cursor.executemany(statement, parameters)
sqlalchemy.exc.ProgrammingError: <exception str() failed>
[2023/12/04 03:45:01] (DEBUG) PoolScrubService.__run():290 - Pool 'boot-pool' last scrub datetime.datetime(2023, 11, 28, 3, 45, 2)
[2023/12/04 04:27:34] (INFO) middlewared.__init__():850 - Starting TrueNAS-SCALE-23.10.0.1 middleware
 

Arwen | MVP | Joined: May 17, 2014 | Messages: 3,611
I would check several things:
  • Make sure your system board is not over-clocking anything - not CPU, not RAM, and certainly not PCIe (a couple of quick checks from within SCALE are sketched below).
  • For the RAM, don't use XMP (or whatever it's called). Use the CPU-specified max rate, or even something more conservative.
  • I would try temporarily replacing the NVMe that has logged faults. It may not be the cause, but 3,500 errors is a bit too many to just overlook.
  • Re-seat anything that can be re-seated.
If you need to resort to using USB for your boot-pool temporarily, make sure you install a fresh copy, and to begin with don't import your existing configuration - just run it plain and see if it crashes. To be clear, cheap USB drives tend to be, well, cheap and unreliable, but they may well hold up for a few days of testing.
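As a rough way to see what the board is actually doing, something along these lines from a SCALE shell should show the configured RAM speed and current core clocks (just a sketch; exact dmidecode field names vary):

Code:
# Speed the DIMMs are actually running at (XMP vs. stock/JEDEC)
dmidecode --type memory | grep -i "speed"
# Current core clocks, to spot any "auto" multi-core enhancement pushing past stock
grep MHz /proc/cpuinfo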
 

dazzywazzy | Cadet | Joined: Sep 20, 2022 | Messages: 5
Thanks for the ideas.

Whilst I was certain I wasn't overclocking anything or using XMP memory profiles, a little digging suggests these gaming boards can overclock certain settings by default, even if you haven't set them up or told them to. I'll give that a shot.

Will give the boot-pool drive a shot also. Might clone it to another drive and try the USB option with a fresh install.

Will likely report back in a few days.
 

dazzywazzy | Cadet | Joined: Sep 20, 2022 | Messages: 5
Adding an update here in case anyone ever searches for a similar issue.

After trying everything suggested, I still couldn't get a stable system.

About 25 pages deep into Google, everything seemed to point to memory, so I ran another Memtest. It still came back 100% fault-free, HOWEVER I did notice which DIMM slots were populated, which tweaked something for me. I looked up the DIMM slots in the MB manual and found I had my dual-channel RAM in the incorrect slots, so it was only being recognised as single-channel.

Moving the RAM to the slots specified in the MB manual instantly fixed the issue, and the system has now been up for 5 days without a problem.

So a human fault here, and just something for people to add to their checklists if they're having a similar issue (a quick way to check slot population from the OS is below).
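For anyone wanting to check this from the OS without opening the case, something along these lines should show which physical slots the sticks are sitting in (empty slots report "No Module Installed"; exact field names vary by board):

Code:
# List each DIMM slot with its size and locator
dmidecode --type 17 | grep -E "Size|Locator"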

Cheers,
 

Arwen | MVP | Joined: May 17, 2014 | Messages: 3,611
Glad it was solved.

Weird issue... 2 DIMMs in one channel and none in the other, causing crashes.
 