SOLVED Windows VM crashing whole system. Possibly caused by kernel split/lock error?

ans40

Cadet
Joined
Oct 9, 2023
Messages
4
Hi, I have a bare-metal install of TrueNAS-SCALE-22.12.3.3. I'm using a Windows VM to remotely edit large video files locally on the system so I don't have to transfer huge amounts of data across the internet. Usually when processing files (even if I limit the VM CPU to one core and one thread), the entire TrueNAS box will crash and reboot. I'm not sure why. The timing is fairly random -- sometimes it happens quickly, sometimes a file will process intensely for an hour and then the box crashes.

Early on, I thought it might be a hardware issue, so I tried:
  1. Memtest - ran a full memtest overnight, passed/successful. Also with the VM running, I usually have 20+GB RAM free.
  2. Measured power draw - the max power draw of my entire box under heavy load is about 250 watts. The PSU 12V rail is rated for 576 watts, so I'm well within spec. Not to say the PSU couldn't be the issue, but it seems unlikely.
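For reference, the headroom arithmetic from item 2 works out as follows. This is only a sketch of the steady-state numbers quoted in the post (250 W measured, 576 W rail rating); the function name is illustrative, and an average reading like this says nothing about short transients.

```python
def rail_utilization(load_watts: float, rail_rating_watts: float) -> float:
    """Return the fraction of the rail's rated capacity that is in use."""
    if rail_rating_watts <= 0:
        raise ValueError("rail rating must be positive")
    return load_watts / rail_rating_watts


measured_peak_w = 250.0  # whole-box draw under heavy load (from the post)
rail_12v_w = 576.0       # 12V rail rating (from the post)

utilization = rail_utilization(measured_peak_w, rail_12v_w)
headroom_w = rail_12v_w - measured_peak_w

print(f"12V rail utilization: {utilization:.0%}")  # 43%
print(f"Headroom: {headroom_w:.0f} W")             # 326 W
```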
Recently, I've been reading through a bunch of logs, mostly /var/log/messages. I saw that for most crashes, the last logged message involves split lock detection. I also saw that on rebooting, one of the first log lines includes some split lock info.
Example (crashed around 13:18:00, startup one minute later):
Code:
Oct  9 13:12:45 TrueNAS-Server kernel: x86/split lock detection: #AC: CPU 3/KVM/2244533 took a split_lock trap at address: 0x404756
Oct  9 13:20:04 TrueNAS-Server syslog-ng[4240]: syslog-ng starting up; version='3.28.1'
Oct  9 13:19:17 TrueNAS-Server kernel: Linux version 5.15.107+truenas (root@tnsbuilds01.tn.ixsystems.net) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP Tue Jul 25 00:05:02 UTC 2023
Oct  9 13:19:17 TrueNAS-Server kernel: Command line: BOOT_IMAGE=/ROOT/22.12.3.3@/boot/vmlinuz-5.15.107+truenas root=ZFS=boot-pool/ROOT/22.12.3.3 ro libata.allow_tpm=1 amd_iommu=on iommu=pt kvm_amd.npt=1 kvm_amd.avic=1 intel_iommu=on zfsforce=1 nvme_core.multipath=N i915.force_probe=4692
Oct  9 13:19:17 TrueNAS-Server kernel: x86/split lock detection: #AC: crashing the kernel on kernel split_locks and warning on user-space split_locks
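To see how often these traps occur and which task triggers them, the log can be scanned with a small script. This is a minimal sketch: the regular expression is modelled only on the single trap line quoted above, and the function name is made up for illustration.

```python
import re

# Pattern modelled on the trap line quoted in the post, e.g.
# "... #AC: CPU 3/KVM/2244533 took a split_lock trap at address: 0x404756"
SPLIT_LOCK_RE = re.compile(
    r"#AC: (?P<task>.+)/(?P<pid>\d+) took a split_lock trap "
    r"at address: (?P<addr>0x[0-9a-fA-F]+)"
)


def find_split_lock_traps(lines):
    """Return one dict per split_lock trap line found in `lines`."""
    traps = []
    for line in lines:
        match = SPLIT_LOCK_RE.search(line)
        if match:
            traps.append({
                "task": match["task"],
                "pid": int(match["pid"]),
                "address": match["addr"],
            })
    return traps


sample = [
    "Oct  9 13:12:45 TrueNAS-Server kernel: x86/split lock detection: "
    "#AC: CPU 3/KVM/2244533 took a split_lock trap at address: 0x404756",
    "Oct  9 13:19:17 TrueNAS-Server kernel: x86/split lock detection: "
    "#AC: crashing the kernel on kernel split_locks and warning on "
    "user-space split_locks",
]

traps = find_split_lock_traps(sample)
for trap in traps:
    print(trap["task"], trap["pid"], trap["address"])
```

On a live system the same function can be fed an open file, for example `find_split_lock_traps(open("/var/log/messages", errors="replace"))`. The task name "CPU 3/KVM" indicates the trap came from a VM vCPU thread, which the kernel treats as user space and only warns about.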


I was reading some forums (a few for Proxmox and Unraid, but I couldn't find any for TrueNAS) and saw some people mentioning that this feature can just be turned off with the kernel parameter split_lock_detect=off (documentation), but I'm not sure how to implement that because sysctl is telling me it doesn't exist in the kernel (double-check me on that).
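As far as I know, split_lock_detect is a boot-time kernel parameter rather than a sysctl, which would explain why sysctl does not list it. Whether it is set can be checked against the kernel command line instead. A minimal sketch, using the command line from the log above as sample input (the parser is illustrative and does not handle quoted values containing spaces):

```python
def parse_kernel_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        # Split on the first "=" only, so "root=ZFS=boot-pool/..." stays intact.
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


sample_cmdline = (
    "BOOT_IMAGE=/ROOT/22.12.3.3@/boot/vmlinuz-5.15.107+truenas "
    "root=ZFS=boot-pool/ROOT/22.12.3.3 ro libata.allow_tpm=1 amd_iommu=on "
    "iommu=pt kvm_amd.npt=1 kvm_amd.avic=1 intel_iommu=on zfsforce=1 "
    "nvme_core.multipath=N i915.force_probe=4692"
)

params = parse_kernel_cmdline(sample_cmdline)
if "split_lock_detect" in params:
    print("split_lock_detect =", params["split_lock_detect"])
else:
    print("split_lock_detect not set; the kernel default applies")
```

On a running system, the current command line can be read from `/proc/cmdline` and passed to the same function.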

I've also tried looking through the debug logs and I don't see any crash files or anything like that. But I'm not super familiar with logging in TrueNAS so direction here could be helpful.

This is where I'm at now. Any help with diagnosing the issue further would be great. Any common problems I'm overlooking? If split/lock isn't a red herring, could someone help me disable it and see if the problems resolve? Thanks!
 

MrGuvernment

Patron
Joined
Jun 15, 2017
Messages
268
You can add it as a custom boot kernel option I believe, similar to say:

 

ans40

Cadet
Joined
Oct 9, 2023
Messages
4
I was able to disable split lock with that code, thanks.
Code:
midclt call system.advanced.update '{"kernel_extra_options": "i915.force_probe=4692 split_lock_detect=off"}'
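Note that the command above passes the complete option string, including the i915.force_probe=4692 option that was already on the command line. Assuming the update sets kernel_extra_options to exactly the string given, any existing options have to be carried over. A small sketch of building that command safely; the helper names are hypothetical, and only the `midclt call system.advanced.update` form is taken from the post:

```python
import json
import shlex


def merge_kernel_options(existing: str, *new_options: str) -> str:
    """Combine option strings; a later option with the same name replaces the earlier one."""
    merged = {}
    for token in existing.split() + list(new_options):
        name = token.partition("=")[0]
        merged[name] = token
    return " ".join(merged.values())


def build_midclt_command(options: str) -> str:
    """Return a shell-quoted midclt command line that sets kernel_extra_options."""
    payload = json.dumps({"kernel_extra_options": options})
    return "midclt call system.advanced.update " + shlex.quote(payload)


existing = "i915.force_probe=4692"  # options already configured (from the post)
options = merge_kernel_options(existing, "split_lock_detect=off")
command = build_midclt_command(options)
print(command)
```

The script only prints the command; it does not run it. A reboot is needed before the new option appears on the kernel command line.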

Confirmed disabled in the logs:
Code:
Oct 9 17:37:00 TrueNAS-Server kernel: Command line: BOOT_IMAGE=/ROOT/23.10-RC.1@/boot/vmlinuz-6.1.50-production+truenas root=ZFS=boot-pool/ROOT/23.10-RC.1 ro libata.allow_tpm=1 amd_iommu=on iommu=pt kvm_amd.npt=1 kvm_amd.avic=1 intel_iommu=on zfsforce=1 nvme_core.multipath=N split_lock_detect=off
Oct 9 17:37:00 TrueNAS-Server kernel: x86/split lock detection: disabled

Unfortunately that did not fix the issue. I also upgraded to Cobia with no luck -- not sure where to go from here, hm.
 

ans40

Cadet
Joined
Oct 9, 2023
Messages
4
Lovely, turned out to be a power supply issue. Replaced and no issues now :)
 

Lipsum Ipsum

Dabbler
Joined
Aug 31, 2022
Messages
22
Did you have anything indicating it was the power supply or was replacing it just a guess?
What model was the old power supply?
Was 12V a single or split rail design?

I'm guessing that a simple logging script monitoring and recording voltages wouldn't be fast enough to catch whatever fluctuations were causing the restart. Even if I still threw the old supply out, the electronics hobbyist in me would do a postmortem of sorts to see if I could figure out what was really happening.

It's theoretically possible that the PS itself isn't bad, and that swapping in a new one happened to fix the real cause. For instance, an oxidized or slightly loose pin in a connector could create an intermittent connection, causing ripple on the rail. Swapping the PS could have been enough to scrape off the oxide, or the new connector's pins may have been a bit tighter, so the problem stopped happening.
 

ans40

Cadet
Joined
Oct 9, 2023
Messages
4
It was a hunch, based on how stress testing individual components revealed nothing, and there were no logs at all to suggest an incoming power-down. Just a full and immediate crash. What you're saying about loose or poor connections is possible, but I think it's unlikely because I reseated everything a couple of times and all the parts were brand new. I can't completely rule it out, though.

My BIOS had a setting turned on that automatically reboots after power loss. I thought the fact that the system was turning back on after crashing suggested the power-down was software-initiated. But now I'm pretty sure it was the PSU conking out, and the BIOS pushing for a restart after. The PSU was name brand but budget-oriented, bought against the advice of everyone with experience :) Lesson learned. It's been almost 2 weeks now with the upgraded PSU and everything's been great.
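One way to check the logs for this pattern is to look for boots that were not preceded by any shutdown message. The sketch below splits a syslog file into boot sessions at each kernel banner and flags sessions with no shutdown marker. The marker string is an assumption (syslog-ng normally logs a "shutting down" line when it stops cleanly), and a missing marker cannot distinguish power loss from a hard hang or kernel panic.

```python
BOOT_BANNER = "kernel: Linux version"
# Assumption: a clean shutdown or reboot leaves a line like this behind.
SHUTDOWN_MARKERS = ("syslog-ng shutting down",)


def split_into_boots(lines):
    """Group log lines into sessions, starting a new one at each boot banner."""
    sessions = [[]]
    for line in lines:
        if BOOT_BANNER in line:
            sessions.append([])
        sessions[-1].append(line.rstrip())
    return sessions


def unclean_sessions(lines):
    """Return sessions that ended without any shutdown marker being logged."""
    finished = split_into_boots(lines)[:-1]  # the last session is still running
    return [
        session for session in finished
        if session and not any(
            marker in line for line in session for marker in SHUTDOWN_MARKERS
        )
    ]


crash_log = [
    "Oct  9 13:12:45 TrueNAS-Server kernel: x86/split lock detection: "
    "#AC: CPU 3/KVM/2244533 took a split_lock trap at address: 0x404756",
    "Oct  9 13:20:04 TrueNAS-Server syslog-ng[4240]: syslog-ng starting up; "
    "version='3.28.1'",
    "Oct  9 13:19:17 TrueNAS-Server kernel: Linux version 5.15.107+truenas",
]

unclean = unclean_sessions(crash_log)
print(f"{len(unclean)} session(s) ended without a shutdown message")
```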
 