Stolen_Walnut
Dabbler
- Joined
 - Jan 1, 2023
 
- Messages
 - 10
 
I appear to be tripping over a boot issue in Truenas scale 22.12.0 that apparently wasn't present in 22.02.4.
High level issue description:
HW:
Behaviors seen / debugging attempted / related forum posts found:
Welp... I was literally about to give up after that & just hit post thread... Umm...........
Is this expect?
Is this an acceptable fix to roll into a perm fix for TrueNAS?
Is there a preferred method to patch this for ourselves?
Thanks in advance!
(Note: Ohhhhhhhhh, silly me... I bet it's not hung; I bet it's actually sitting on the cli to configure the network -- When I was checking for vital signs, I was plugging in an Ethernet cord and waiting to see a MAC come through on the switch, so I was thinking it hadn't finished booting......... huh... At any rate. 20+ min boot is no good, especially given that the configuration is probably going to serial... which *double checks manual* this motherboard doesn't have to my knowledge.)
	
		
			
		
		
	
			
			High level issue description:
- 10-20 seconds into the boot process (after Linux kernel launch), boot becomes sluggish:
- Boot process being output to screen stops taking milliseconds and starts taking tens of seconds for each line
 
 - System does not complete boot process. (waited 8+ hours)
- Last output line of boot process "[ 1335.985838] systemd[1]: Started Journal Service."
 - System still logs USB plug/unplug events to screen
 
 
HW:
- Motherboard make and model
- MSI MEG Z690I UNIFY
 
 - CPU make and model
- Intel Core i5-12500
 
 - RAM quantity
- G.Skill Ripjaws S5 64GB (2 x 32GB) DDR5-5200
 
 - Hard drives, quantity, model numbers, and RAID configuration, including boot drives
- 3x:
- boot: SATA SSD - Samsung 870 EVO Series 2.5" 250GB
 - no-pool-assignment NVMe: Inland Premium 2TB SSD M.2
 - no-pool-assignment NVMe: Crucial - P3 2TB Internal SSD
 
 
 - 3x:
 - Hard disk controllers
- 3x integrated/on-board SSD/NVMe controllers:
- lspci -nnn | grep -E "0106|0108"
- 00:17.0 SATA controller [0106]: Intel Corporation Device [8086:7ae2] (rev 11)
 - 09:00.0 Non-Volatile memory controller [0108]: Micron/Crucial Technology Device [c0a9:540a] (rev 01)
 - 13:00.0 Non-Volatile memory controller [0108]: Phison Electronics Corporation PS5013 E13 NVMe Controller [1987:5013] (rev 01)
 
 
 - lspci -nnn | grep -E "0106|0108"
 
 - 3x integrated/on-board SSD/NVMe controllers:
 - Network cards
- 3x:
- lspci -nnn | grep 0200
- 08:00.0 Ethernet controller [0200]: Aquantia Corp. AQC107 NBase-T/IEEE 802.3bz Ethernet Controller [AQtion] [1d6a:07b1] (rev 02)
 - 0b:00.0 Ethernet controller [0200]: Intel Corporation Ethernet Controller I225-V [8086:15f3] (rev 03)
 
 - USB based:
- Belkin USB 2.0 Ethernet Adapter F4U047
- from dmesg on plugin:
- [ 7071.443790] usb 1-2: new high-speed USB device number 11 using xhci_hcd
 - [ 7071.613305] usb 1-2: New USB device found, idVendor=0b95, idProduct=7720, bcdDevice= 0.01
 - [ 7071.623043] usb 1-2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
 - [ 7071.631132] usb 1-2: Product: AX88x72A
 - [ 7071.635473] usb 1-2: Manufacturer: ASIX Elec. Corp.
 - [ 7071.641140] usb 1-2: SerialNumber: 0DFF1F
 - [ 7072.067353] asix 1-2:1.0 eth0: register 'asix' at usb-0000:00:14.0-2, ASIX AX88772 USB 2.0 Ethernet, 00:50:b6:0d:ff:1f
 - [ 7072.082502] asix 1-2:1.0 enx0050b60dff1f: renamed from eth0
 
 
 - from dmesg on plugin:
 
 - Belkin USB 2.0 Ethernet Adapter F4U047
 
 - lspci -nnn | grep 0200
 - Note: USB based network card was used during the process
 
 - 3x:
 
Behaviors seen / debugging attempted / related forum posts found:
- Assembled PC (no pools of any kind on any device)
 - Installed windows onto SATA SSD to checkout thermals while tweaking fan curves / PL1 /PL2
- PL1 set to 65W
 - PL2 set to 95W
 - Note: I wouldn't expect these to mess with anything, but it's definitely something I went monkeying with so figured I'd mention it
 - Note: Prime95 / CinebenchR23 never broke 70C during PL1, and only broke about ~80C momentarily during PL2; I.e. I don't think there's any thermal throttling happening
 
 - Installed TrueNAS Scale 22.12.0 on the SATA SSD via flash drive without issue during the install
 - Rebooted
 - Sluggishness experienced
- Note: Didn't capture this at the time, (some logs later)
 - issue description
- About 10-20 seconds after Linux kernel launch, boot process started crawling -- each line printed to the screen would take tens of seconds after the last instead of milliseconds.
 - Note: only gave this about 5-10 mins before giving up
 
 - Note: What I remember:
- Sluggishness started a few lines after some stuff about raid (hmm.. I'm not using raid, what is this? Do I have something funny enabled in the bios? I should reset that and try again later)
 - systemd printed logs for the first time after sluggishness started
 - Figured I'd consult the forums before scrutinizing too much
 
 
 - Debug - Remove USB
- Found a forum post where people were reporting a similar issue being resolved after migrating away from a USB boot device
 - Removed ALL USB devices; Reboot
 - Sluggishness still present
 - Plugged back in USB devices
 
 - Debug - Legacy USB in BIOS
- Found a forum post of people reporting that setting Legacy USB Support in BIOS to disabled fixed their issue
 - Disabled "Legacy USB Support" in BIOS; Reboot
 - Sluggishness still present
 - Re-enabled settings in BIOS
 
 - Debug - Install previous version
- Installed TrueNAS Scale 22.02.4 on the SATA SSD via flash drive without issue during the install
 - Rebooted
 - Sluggishness resolved
- booted at a reasonable speed, got to cli, set a few network settings, pulled web interface, clicked around, rebooted, boot within ~ a min or two
 
 - Hunting through journalctl for hints -- copied some suspicious looking / colored / bold logs to come back to and start attacking once I run out of ideas...
- Jan 01 15:10:35 tardinas kernel: pmd_set_huge: Cannot satisfy [mem 0xc0000000-0xc0200000] with a huge-page mapping due to MTRR override.
 - pnp 00:03: disabling [mem 0xc0000000-0xcfffffff] because it overlaps 0000:00:02.0 BAR 9 [mem 0x00000000-0xdfffffff 64bit pre>
 - Jan 01 15:10:35 tardinas kernel: acpi PNP0C14:02: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:01)
 - Jan 01 15:10:35 tardinas kernel: acpi PNP0C14:03: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:01)
 - Jan 01 15:10:35 tardinas kernel: nvme nvme0: missing or invalid SUBNQN field.
 - Jan 01 15:10:35 tardinas kernel: nvme nvme1: missing or invalid SUBNQN field.
 - Jan 01 15:10:35 tardinas kernel: ata8.00: supports DRM functions and may not be fully accessible
 - Jan 01 15:10:35 tardinas kernel: ata8.00: supports DRM functions and may not be fully accessible
 - Jan 01 15:10:35 tardinas kernel: thunderbolt: probe of 0000:0e:00.0 failed with error -110
 - (raid6 stuff here)
 - Jan 01 15:10:35 tardinas kernel: spl: loading out-of-tree module taints kernel.
 - Jan 01 15:10:35 tardinas kernel: znvpair: module license 'CDDL' taints kernel.
 - Jan 01 15:10:35 tardinas kernel: Disabling lock debugging due to kernel taint
 - Jan 01 15:10:35 tardinas blkmapd[794]: open pipe file /run/rpc_pipefs/nfs/blocklayout failed: No such file or directory
 - Jan 01 15:10:40 tardinas systemd-modules-load[2088]: Failed to find module 'nvidia-drm'
 - Jan 01 15:10:43 tardinas kernel: Error: Driver 'pcspkr' is already registered, aborting...
 - Jan 01 15:10:55 tardinas systemd[1]: exim4-base.timer: Refusing to start, unit exim4-base.service to trigger not loaded.
 - Jan 01 15:10:55 tardinas systemd[1]: Failed to start Daily exim4-base housekeeping.
 - Jan 01 15:10:55 tardinas kernel: ipmi_si: Unable to find any System Interface(s)
 - Jan 01 15:10:55 tardinas systemd[1]: nslcd.service: Failed with result 'exit-code'.
 - Jan 01 15:10:55 tardinas systemd[1]: Failed to start LSB: LDAP connection daemon.
 - Jan 01 15:10:55 tardinas systemd[1]: openipmi.service: Failed with result 'exit-code'.
 - Jan 01 15:10:55 tardinas systemd[1]: Failed to start LSB: OpenIPMI Driver init script.
 - Jan 01 15:11:02 tardinas systemd[1]: ntp.service: Failed with result 'exit-code'.
 - Jan 01 15:11:06 tardinas avahi-daemon[2591]: XML_ParseBuffer() failed at line 1: no element found.
 - Jan 01 15:11:06 tardinas avahi-daemon[2591]: Failed to load service group file /services/DEV_INFO.service, ignoring.
 - Jan 01 15:11:06 tardinas avahi-daemon[2591]: XML_ParseBuffer() failed at line 1: no element found.
 - Jan 01 15:11:06 tardinas avahi-daemon[2591]: Failed to load service group file /services/HTTP.service, ignoring.
 - Jan 01 15:11:41 tardinas kernel: snd_hda_intel 0000:00:1f.3: Cannot probe codecs, giving up
 - Jan 01 15:15:36 tardinas systemd-xdg-autostart-generator[7326]: Not generating service for XDG autostart app-at\x2dspi\x2ddbus\x2dbus-autostart.service, erro>
 
 
 - Debug - upgrade from previous version
- With TrueNAS Scale 22.02.4 still installed, upgraded to TrueNAS Scale 22.12.0 on the SATA SSD via flash drive without issue during the install (Install in new boot environment)
- Error noticed "cannot create 'boot-pool/grub': dataset already exists"
- Seems benign
 
 - Reboot
 - Sluggishness present
 - Let sit over night (8.7 hours)
 - Boot never completed
- Last line of output was "[ 1335.xxxxxx] systemd[1]: Started Journal Service."
- Comment: So it apparently gave up after 22 mins?
 
 - Unplugged / replugged a USB device, which was logged to the screen immediately without delay
- Comment: Seems like at least some parts of the kernel are up & are responsive..
 
 
 - Last line of output was "[ 1335.xxxxxx] systemd[1]: Started Journal Service."
 - Reboot
 - Hand typed out some of the output to contrast 22.02.4 vs 22.12.0 w meld:
- 
	
		
			
		
		
	
		
		
	
	
- Notes/comments from log comparison between TrueNAS versions:
- Processes logging around sluggishness start
- ZFS was one of the last things to load before sluggishness started
 - systemd is the first thing to start outputting after sluggishness started
- Unclear from logs when systemd kicked off
 
 
 - Early boot timing difference
- ZFS loads 15 seconds earlier
 
 - Version updates between versions
- ZFS v2.1.5-1 -> v2.1.6-1
 - systemd 247.3-6 -> 247.3.7+deb11u1
 - fuse 7.32 -> 7.34
 
 - New systemd logs
- "Mounting NFSD configuration filesystem"
 - "Starting Wait on Disk Enumeration..."
 
 
 - Processes logging around sluggishness start
 
 - Notes/comments from log comparison between TrueNAS versions:
 
 - 
	
		
			
		
		
	
		
		
	
	
 - Reboot
 - From grub, pressing 'e' to edit (view) boot parameters of v22.12.0
- Notes/comments from Linux kernel boot param comparison between TrueNAS versions:
- Removed in v22.12.0: "systemd.unified_cgroup_hierarchy=0"
 - Added in v22.12.0: "nvme_core.multipath=N"
 
 
 - Notes/comments from Linux kernel boot param comparison between TrueNAS versions:
 
 - Error noticed "cannot create 'boot-pool/grub': dataset already exists"
 
 - With TrueNAS Scale 22.02.4 still installed, upgraded to TrueNAS Scale 22.12.0 on the SATA SSD via flash drive without issue during the install (Install in new boot environment)
 - Debug - kernel parameters
- Add back in systemd option
- From grub,
- press e on v22.12.0 (temporary edit)
 - To kernel params (line starting with Linux)
- Append " systemd.unified_cgroup_hierarchy=0" to end of line
 
 - press ctrl-x to launch
 
 - Sluggishness present
 - Reboot
 
 - From grub,
 - Remove nvme option
- From grub,
- press e on v22.12.0 (temporary edit)
 - To kernel params (line starting with Linux)
- Delete "nvme_core.multipath=N"
 
 - press Ctrl-x to launch
 
 - Sluggishness present
 - Reboot
 
 - From grub,
 - Add in systemd & remove nvme option
- Sluggishness present
 
 
 - Add back in systemd option
 - Debug - Pulling hardware
- ALL USB devices (already ran this test, but re-baselining...)
- Removed:
- USB Keyboard
 - USB Mouse
 - USB to Eth
 - (removed flash drive a while ago, but that's out too)
 
 - Sluggishness present
 
 - Removed:
 - PCIe card (Leaving previous HW removed)
- Removed
- PCIe: Synology 10Gb Ethernet and M.2 Adapter card E10M20-T1
- Note: NVMe drives are not populated on this. I.e., I'm removing 10G ethernet, but not the NVMe devices
 
 
 - PCIe: Synology 10Gb Ethernet and M.2 Adapter card E10M20-T1
 - Sluggishness present
 
 - Removed
 - NVMe Drive (Leaving previous HW removed)
- Removed
- NVMe: Crucial - P3 2TB Internal SSD
 
 - Sluggishness present
 
 - Removed
 - NVMe Drive (Leaving previous HW removed)
- Removed
- NVMe: Crucial - Inland Premium 2TB SSD M.2
 
 - Sluggishness present
 
 - Removed
 - SATA Drive (Leaving previous HW removed)
- Waited ~3m after Linux kernel launch
 - Removed
- SATA: Samsung 870 EVO Series 2.5" 250GB
- Llogs show ata8 down @ 209s
 
 
 - SATA: Samsung 870 EVO Series 2.5" 250GB
 - Sluggishness present
- Logs show much the same from 209s - 502s when it finally needed the drive again and complained boot-pool was missing
 
 
 - Plugged SATA Drive back in..
 - Front panel internal IO connectors (Leaving previous HW removed)
- Removed
- HD Audio
 - USB 3.2 Gen 2x2 Type-C
 - USB 3.2 Gen 1
 
 - Sluggishness present
 
 - Removed
 - So at this point, the exhaustive list of motherboard connections are:
- Power Supply
 - Processor
 - RAM
 - SATA to boot drive
 - 3x PWM fans
 - Front Panel header Power pins
 - HDMI
 
 - Uhh... I guess I could drop down 32g of ram
 - RAM (Leaving previous HW removed)
- Removed
- 1x stick 32g ram
 
 - Derp.. motherboard reset my bios since it detected change in ram... blew away secure boot disabled setting & prevented grub from launching Linux kernel.. that was going to be my next step anyways...
- Plugging in keyboard
 - In bios, Setting secure boot to Disabled
 - Removing keyboard
 
 - Sluggishness present
 - Since this is as stripped as I'm able to get my system, waiting the 20 mins to see if it is able to boot any further...
- Issue still present
- 41 mins later output shows
- [ 1331.884581] systemd[1]: Started Journal Service.
 
 - Plugged in & removed USB device & got some printing out to the screen
 
 - 41 mins later output shows
 
 - Issue still present
 
 - Removed
 
 - ALL USB devices (already ran this test, but re-baselining...)
 - Debug - Reset BIOS
- (Issue present - See last section)
 
 - Debug - BIOS - Disable integrated peripherals
- Settings > Advanced > Integrated Peripherals
- Onboard LAN Controller: changed to "Disabled"
 - Onboard CNVi Module Control: change to "Disable Integrated"
 - HD Audio Controller: changed to "Disabled"
 - Note: everything is disabled on this screen except for "VGA Detection" set to "Auto" instead of "Ignore"
 
 - Save changes & reboot
 - Sluggishness present
 
 - Settings > Advanced > Integrated Peripherals
 - Debug - BIOS - Enable Thunderbolt
- Note: something thunderbolt was being logged as an error just prior the sluggishness starting; in BIOS it was set to disabled
 - Settings > Advanced > Intel (R) Thunderbolt
- Discrete Thunderbolt(TM) Support: changed to "Enabled"
 
 - Save changes & reboot
- Note: Says there's a whole slew of changes associated -- didn't realize this was switching from integrated to (huh... wouldn't you know it...) "discrete" lol
 
 - After grub, got "error out of memory"
 - Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
 - Reset BIOS
- Re-disabled secure boot
 - Re-disabled Integrated peripherals
 
 - Back to sluggish behavior (Ya-haaa, I'm a llama again!)
 
 - Debug - Retrying 22.02.4 with 22.12.0 upgraded on top of it
- (as mentioned in #9 above, "Install in new boot environment" was selected during install)
 - Booted to grub
 - Sluggishness present
 - Still hangs on Started Journal Service (hit "Started Journal Service" at 1286; I waited until 1927, so 10m after that point)
 - Comment: Alright... I thought I had a rough idea what "in a new boot environment" meant, but I hadn't anticipated it stomping on something related to this issue... hmm
 - Comment: Luckily my dmesg capture had the boot args of the 22.02.04 version before the install on top of it..
 - Meld of before / after boot environment
- 
	
		
			
		
		
	
- Comment: I hadn't realized the install clobbered this one with the systemd & nvme deltas... huh
 - Comment: In all of the installs I've selected the 115200, not sure if that's of note... Hmm... I wouldn't think that order of consoles would matter, but I can't say I've tried that yet....
 - Oh. My. Goodness.
 - I just temp swapped the order in grub in 22.02.4, and it booted right up!
 
 
 - 
	
		
			
		
		
	
 
 - Debug - Swapping consoles in grub
- Boot to grub
 - Temp edit grub
- press 'e' on 22.12.0 option
 - On the line starting with Linux,
- Delete " console=tty1"
 - After "console=ttyS0,115200", type " console=tty1"
 
 - Press Ctrl+X
 
 - Holy heck. It worked
 - lolwut
 
 
Welp... I was literally about to give up after that & just hit post thread... Umm...........
Is this expect?
Is this an acceptable fix to roll into a perm fix for TrueNAS?
Is there a preferred method to patch this for ourselves?
Thanks in advance!
(Note: Ohhhhhhhhh, silly me... I bet it's not hung; I bet it's actually sitting on the cli to configure the network -- When I was checking for vital signs, I was plugging in an Ethernet cord and waiting to see a MAC come through on the switch, so I was thinking it hadn't finished booting......... huh... At any rate. 20+ min boot is no good, especially given that the configuration is probably going to serial... which *double checks manual* this motherboard doesn't have to my knowledge.)