Is deactivating the split_lock protection a solution?
Code:
sudo sysctl kernel.split_lock_mitigate=0
Deactivation apparently did not help.
Code:
Nov 20 00:59:39 truenas1 kernel: x86/split lock detection: #AC: crashing the kernel on kernel split_locks and warning on user-space split_locks
Nov 20 08:49:38 truenas1 kernel: x86/split lock detection: #AC: crashing the kernel on kernel split_locks and warning on user-space split_locks
Since I had executed this command manually in a shell rather than via an init script, the setting did not persist across reboots and was therefore not active during the second crash.
I have now set up a post-init script with the command and will continue to monitor.
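For reference, a post-init script along these lines could re-apply the setting on every boot. The file location and the guard are my own sketch, not taken from any TrueNAS documentation; only the sysctl name comes from the command above. On SCALE it would be registered in the UI as a "Post Init" init/shutdown script.

```shell
#!/bin/sh
# Sketch of a post-init script to disable the split-lock mitigation at boot.
# Assumptions: saved somewhere persistent (e.g. under /root) and registered
# as a Post Init script in the TrueNAS UI; requires root to take effect.
NODE="/proc/sys/kernel/split_lock_mitigate"
if [ -w "$NODE" ]; then
    # Same effect as: sudo sysctl kernel.split_lock_mitigate=0
    sysctl -w kernel.split_lock_mitigate=0
else
    # Kernel too old, CPU without split-lock detection, or not running as root
    echo "$NODE not present or not writable; skipping" >&2
fi
```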
In the meantime, probably worth a bug ticket on our end so we can investigate if there's some other fix / workaround we should apply fleet wide.
Contact me if you need any information.
Edit:
I am no longer sure whether split-lock detection and a kernel crash are actually the cause of my server's occasional reboots. I now think the messages are a consequence of a reboot rather than its cause, because I have noticed that they also appear when I restart the server manually.
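One way to tell a crash from a clean restart is to look at how the previous boot's log ends; this is just a diagnostic sketch using standard journalctl options, guarded in case no persistent journal is available:

```shell
# Diagnostic sketch: distinguish a crash from a clean shutdown via the journal.
if command -v journalctl >/dev/null 2>&1; then
    # One line per recorded boot
    journalctl --list-boots --no-pager || true
    # Tail of the previous boot: a clean shutdown ends with the usual
    # shutdown-target messages, a crash simply stops mid-log.
    journalctl -b -1 -n 20 --no-pager || true
else
    echo "journalctl not available on this system" >&2
fi
```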
Edit:
I may have found the cause of the restarts, or at least I have the following suspicion:
Unfortunately, TrueNAS SCALE does not enable ASPM for the NIC. Under Unraid and Fedora, however, this works out of the box.
Code:
admin@truenas1[~]$ sudo lspci -vv | awk '/ASPM/{print $0}' RS= | grep --color -P '(^[a-z0-9:.]+|ASPM )'
00:06.0 PCI bridge: Intel Corporation 12th Gen Core Processor PCI Express x4 Controller #0 (rev 05) (prog-if 00 [Normal decode])
LnkCap: Port #5, Speed 16GT/s, Width x4, ASPM L1, Exit Latency L1 <16us
LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
00:1a.0 PCI bridge: Intel Corporation Device 7a48 (rev 11) (prog-if 00 [Normal decode])
LnkCap: Port #25, Speed 16GT/s, Width x4, ASPM L1, Exit Latency L1 <64us
LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
00:1c.0 PCI bridge: Intel Corporation Device 7a38 (rev 11) (prog-if 00 [Normal decode])
LnkCap: Port #1, Speed 8GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us
LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes, Disabled- CommClk-
00:1c.2 PCI bridge: Intel Corporation Device 7a3a (rev 11) (prog-if 00 [Normal decode])
LnkCap: Port #3, Speed 8GT/s, Width x1, ASPM L1, Exit Latency L1 <64us
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 (prog-if 02 [NVM Express])
pcilib: sysfs_read_vpd: read failed: No such device
LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <64us
LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
02:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO (prog-if 02 [NVM Express])
LnkCap: Port #0, Speed 16GT/s, Width x4, ASPM L1, Exit Latency L1 <64us
LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
04:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller (rev 05)
LnkCap: Port #0, Speed 5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s unlimited, L1 <64us
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
This leads to higher power consumption, so I enabled ASPM manually with an init script, after which power consumption dropped.
Code:
sudo sh -c "echo 1 > /sys/bus/pci/devices/0000:04:00.0/link/l1_aspm"
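To check whether such a write actually took effect, the state can be read back and compared against what lspci reports for the device. This is a verification sketch; the device address 0000:04:00.0 is taken from the output above, and the link/ sysfs attributes only exist on reasonably recent kernels with ASPM sysfs support:

```shell
# Verification sketch: read back the L1 ASPM control bit for the NIC and
# cross-check the negotiated link state. Device address assumed from lspci.
DEV="/sys/bus/pci/devices/0000:04:00.0"
if [ -r "$DEV/link/l1_aspm" ]; then
    printf 'l1_aspm=%s\n' "$(cat "$DEV/link/l1_aspm")"
    # LnkCtl should now show "ASPM L1 Enabled" instead of "ASPM Disabled"
    sudo lspci -s 04:00.0 -vv | grep -F 'LnkCtl:'
else
    echo "ASPM sysfs attribute not available for $DEV" >&2
fi
```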
I suspect that this command makes the system unstable. After the last crash I disabled the init script, and so far there has been no further reboot; I will see whether it stays that way.
On the other hand, ASPM is enabled out of the box for the same hardware on the other operating systems mentioned. Why is that?