truenas kernel: x86/split lock detection: #AC: crashing the kernel on kernel split_locks and warning on user-space split_locks

cap (Contributor · Joined Mar 17, 2016 · 122 messages)
Unfortunately, I have noticed that my TrueNAS SCALE system reboots from time to time.
I found the following entry in the logs:

Code:
admin@truenas1[/var/log]$ sudo cat /var/log/messages | grep crashing
[...]
Nov 19 13:54:37 truenas1 kernel: x86/split lock detection: #AC: crashing the kernel on kernel split_locks and warning on user-space split_locks


What does that actually mean? What is the cause, and is there a solution?



My system:

TrueNAS-SCALE-23.10.0.1
be quiet! Pure Power 12 M
Gigabyte B760M Gaming X DDR4
Intel Core i3-12100
Samsung 970 EVO Plus
Samsung 980 PRO
Crucial DIMM 32GB, DDR4-3200, CL22-22-22
Seagate Exos X16
TOSHIBA MG07ACA12TE


Edit:
Is deactivating the split_lock protection a solution?
Code:
sudo sysctl kernel.split_lock_mitigate=0
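
Before changing anything, it may be worth reading the current value back first (assuming the kernel.split_lock_mitigate sysctl exists on the shipped kernel at all; it is a fairly recent addition):
Code:
# 1 = mitigation active (default), 0 = disabled
sudo sysctl kernel.split_lock_mitigate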
 

Kris Moore (SVP of Engineering, iXsystems · Administrator, Moderator · Joined Nov 12, 2015 · 1,471 messages)
You can try disabling that to see if it helps the stability overall. In the meantime, probably worth a bug ticket on our end so we can investigate if there's some other fix / workaround we should apply fleet wide.
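
If the sysctl route does not pan out, the other knob I know of is the split_lock_detect= kernel boot parameter; a rough sketch, not TrueNAS-specific, of how to check it and what disabling it would involve:
Code:
# Show the parameters the running kernel was booted with
cat /proc/cmdline
# Disabling detection entirely would mean adding "split_lock_detect=off" to the
# kernel command line, via whatever mechanism the platform provides for boot arguments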
 

cap (Contributor · Joined Mar 17, 2016 · 122 messages)
Is deactivating the split_lock protection a solution?
Code:
sudo sysctl kernel.split_lock_mitigate=0

Deactivation apparently did not help.

Code:
Nov 20 00:59:39 truenas1 kernel: x86/split lock detection: #AC: crashing the kernel on kernel split_locks and warning on user-space split_locks
Nov 20 08:49:38 truenas1 kernel: x86/split lock detection: #AC: crashing the kernel on kernel split_locks and warning on user-space split_locks


Since I executed the command manually in the shell rather than from an init script, the setting did not persist across the first reboot and was therefore no longer active at the time of the second crash.

I have now set up a post-init script that runs the command (see the sketch below) and will continue to monitor this.
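
For completeness, the post-init script itself is only a one-liner; a sketch of what I put in it (the file name and location are my own choice):
Code:
#!/bin/sh
# /root/disable_split_lock_mitigate.sh - registered as a post-init script in the GUI
# Re-apply the sysctl after every boot, since the setting does not persist on its own
sysctl -w kernel.split_lock_mitigate=0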
Kris Moore said:
In the meantime, probably worth a bug ticket on our end so we can investigate if there's some other fix / workaround we should apply fleet wide.
Contact me if you need any information.

Edit:
I am no longer sure whether split lock detection and a kernel crash are the cause of the occasional reboots of my server. I now think the message is a consequence of a reboot rather than its cause, because I have noticed that it also appears when I restart the server manually.

Edit:

I may have found the cause of the restarts. At least I have the following suspicion:


Unfortunately, TrueNAS SCALE does not enable ASPM for the NIC. Under Unraid and Fedora, however, it is enabled out of the box.
Code:
admin@truenas1[~]$ sudo lspci -vv | awk '/ASPM/{print $0}' RS= | grep --color -P '(^[a-z0-9:.]+|ASPM )'

00:06.0 PCI bridge: Intel Corporation 12th Gen Core Processor PCI Express x4 Controller #0 (rev 05) (prog-if 00 [Normal decode])
                LnkCap: Port #5, Speed 16GT/s, Width x4, ASPM L1, Exit Latency L1 <16us
                LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
00:1a.0 PCI bridge: Intel Corporation Device 7a48 (rev 11) (prog-if 00 [Normal decode])
                LnkCap: Port #25, Speed 16GT/s, Width x4, ASPM L1, Exit Latency L1 <64us
                LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
00:1c.0 PCI bridge: Intel Corporation Device 7a38 (rev 11) (prog-if 00 [Normal decode])
                LnkCap: Port #1, Speed 8GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us
                LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes, Disabled- CommClk-
00:1c.2 PCI bridge: Intel Corporation Device 7a3a (rev 11) (prog-if 00 [Normal decode])
                LnkCap: Port #3, Speed 8GT/s, Width x1, ASPM L1, Exit Latency L1 <64us
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 (prog-if 02 [NVM Express])
pcilib: sysfs_read_vpd: read failed: No such device
                LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <64us
                LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
02:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO (prog-if 02 [NVM Express])
                LnkCap: Port #0, Speed 16GT/s, Width x4, ASPM L1, Exit Latency L1 <64us
                LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
04:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller (rev 05)
                LnkCap: Port #0, Speed 5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s unlimited, L1 <64us
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+

This leads to higher power consumption, so I enabled ASPM for the NIC manually with an init script. Power consumption is now lower.

Code:
sudo sh -c "echo 1 > /sys/bus/pci/devices/0000:04:00.0/link/l1_aspm"
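
A quick way to verify that the write actually took effect (assuming the NIC stays at address 04:00.0, as in the listing above):
Code:
# The LnkCtl line for the 2.5GbE controller should now report "ASPM L1 Enabled"
sudo lspci -vv -s 04:00.0 | grep -E 'LnkCap|LnkCtl'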

I suspect that this command is what makes the system unstable. I disabled the init script after the last crash, and so far there has been no further reboot; I will see whether it stays that way.
On the other hand, ASPM is enabled out of the box for the same hardware on the other operating systems mentioned. Why is that?
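
One thing that might explain the difference, although I have not verified it, is the default ASPM policy the kernel is running with (and what the firmware hands over); it can be compared across the operating systems like this:
Code:
# The entry in brackets is the policy currently in use (e.g. default vs. powersave)
cat /sys/module/pcie_aspm/parameters/policy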
 