Is deactivating the split_lock protection a solution?
Code:
sudo sysctl kernel.split_lock_mitigate=0
Deactivation apparently did not help.
Code:
Nov 20 00:59:39 truenas1 kernel: x86/split lock detection: #AC: crashing the kernel on kernel split_locks and warning on user-space split_locks
Nov 20 08:49:38 truenas1 kernel: x86/split lock detection: #AC: crashing the kernel on kernel split_locks and warning on user-space split_locks
Since I had executed this command manually in a shell rather than via an init script, the setting did not persist across reboots and was therefore not active during the second crash.
I have now set up a post-init script with the command and will continue to monitor.
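For reference, a post-init script along these lines could re-apply the setting on every boot. The file location and the guard are my own sketch, not taken from any TrueNAS documentation; only the sysctl name comes from the command above. On SCALE it would be registered in the UI as a "Post Init" init/shutdown script.

```shell
#!/bin/sh
# Sketch of a post-init script to disable the split-lock mitigation at boot.
# Assumptions: saved somewhere persistent (e.g. under /root) and registered
# as a Post Init script in the TrueNAS UI; requires root to take effect.
NODE="/proc/sys/kernel/split_lock_mitigate"
if [ -w "$NODE" ]; then
    # Same effect as: sudo sysctl kernel.split_lock_mitigate=0
    sysctl -w kernel.split_lock_mitigate=0
else
    # Kernel too old, CPU without split-lock detection, or not running as root
    echo "$NODE not present or not writable; skipping" >&2
fi
```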
In the meantime, probably worth a bug ticket on our end so we can investigate if there's some other fix / workaround we should apply fleet wide.
Contact me if you need any information.
Edit:
I am no longer sure whether split-lock detection and a kernel crash are actually the cause of my server's occasional reboots. I now think the messages are a consequence of a reboot rather than its cause, because I have noticed that they also appear when I restart the server manually.
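One way to tell a crash from a clean restart is to look at how the previous boot's log ends; this is just a diagnostic sketch using standard journalctl options, guarded in case no persistent journal is available:

```shell
# Diagnostic sketch: distinguish a crash from a clean shutdown via the journal.
if command -v journalctl >/dev/null 2>&1; then
    # One line per recorded boot
    journalctl --list-boots --no-pager || true
    # Tail of the previous boot: a clean shutdown ends with the usual
    # shutdown-target messages, a crash simply stops mid-log.
    journalctl -b -1 -n 20 --no-pager || true
else
    echo "journalctl not available on this system" >&2
fi
```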
Edit:
I may have found the cause of the restarts, or at least I have the following suspicion:
Unfortunately, TrueNAS SCALE does not enable ASPM for the NIC. Under Unraid and Fedora, however, this works out of the box.
Code:
admin@truenas1[~]$ sudo lspci -vv | awk '/ASPM/{print $0}' RS= | grep --color -P '(^[a-z0-9:.]+|ASPM )'
00:06.0 PCI bridge: Intel Corporation 12th Gen Core Processor PCI Express x4 Controller #0 (rev 05) (prog-if 00 [Normal decode])
LnkCap: Port #5, Speed 16GT/s, Width x4, ASPM L1, Exit Latency L1 <16us
LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
00:1a.0 PCI bridge: Intel Corporation Device 7a48 (rev 11) (prog-if 00 [Normal decode])
LnkCap: Port #25, Speed 16GT/s, Width x4, ASPM L1, Exit Latency L1 <64us
LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
00:1c.0 PCI bridge: Intel Corporation Device 7a38 (rev 11) (prog-if 00 [Normal decode])
LnkCap: Port #1, Speed 8GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us
LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes, Disabled- CommClk-
00:1c.2 PCI bridge: Intel Corporation Device 7a3a (rev 11) (prog-if 00 [Normal decode])
LnkCap: Port #3, Speed 8GT/s, Width x1, ASPM L1, Exit Latency L1 <64us
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 (prog-if 02 [NVM Express])
pcilib: sysfs_read_vpd: read failed: No such device
LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <64us
LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
02:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO (prog-if 02 [NVM Express])
LnkCap: Port #0, Speed 16GT/s, Width x4, ASPM L1, Exit Latency L1 <64us
LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
04:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller (rev 05)
LnkCap: Port #0, Speed 5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s unlimited, L1 <64us
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
This leads to higher power consumption, so I enabled ASPM manually with an init script, after which power consumption dropped.
Code:
sudo sh -c "echo 1 > /sys/bus/pci/devices/0000:04:00.0/link/l1_aspm"
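To check whether such a write actually took effect, the state can be read back and compared against what lspci reports for the device. This is a verification sketch; the device address 0000:04:00.0 is taken from the output above, and the link/ sysfs attributes only exist on reasonably recent kernels with ASPM sysfs support:

```shell
# Verification sketch: read back the L1 ASPM control bit for the NIC and
# cross-check the negotiated link state. Device address assumed from lspci.
DEV="/sys/bus/pci/devices/0000:04:00.0"
if [ -r "$DEV/link/l1_aspm" ]; then
    printf 'l1_aspm=%s\n' "$(cat "$DEV/link/l1_aspm")"
    # LnkCtl should now show "ASPM L1 Enabled" instead of "ASPM Disabled"
    sudo lspci -s 04:00.0 -vv | grep -F 'LnkCtl:'
else
    echo "ASPM sysfs attribute not available for $DEV" >&2
fi
```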
I suspect that this command makes the system unstable. After the last crash I disabled the init script, and so far there has been no further reboot; I will see whether it stays that way.
On the other hand, ASPM is enabled out of the box for the same hardware on the other operating systems mentioned. Why is that?