SOLVED Random unscheduled restarts under ESXi, with 8 other VMs unaffected - out of ideas.

ClimbingKId

Cadet
Joined
Aug 25, 2021
Messages
6
First post here, and as a long-time home-labber outside my professional IT job, I am in awe of what TrueNAS is doing for me. Expert in a few IT areas, but definitely a novice with FreeBSD, so please treat me gently :smile:

I replaced my old virtualised box in Feb this year with a bigger and beefier whitebox solution, and after weeks of careful bench testing of each area I brought it into service. It runs 8 VMs covering pfSense, Windows Server, CCTV, Plex and of course TrueNAS 12 for NAS duties with a passed-through 3008 HBA, all virtualised on ESXi 7.0U2 - and it's been flawless, until recently....
  • Gigabyte X570 Aorus Pro
  • Ryzen 3700X
  • 2 x 32GB Kingston Server Premier KSM32ED8/32ME DDR4-3200 ECC CL22
  • Two pools: 2 x 10TB WD Gold mirrored live pool, plus 3 x 4TB backup pool
  • Avago 3008 HBA, passed through successfully
  • Onboard Intel NIC, plus a Pro/1000 dual-port card
  • ESXi 7.0U2
In the last few months I have experienced random restarts (5 now) of only the TrueNAS VM. In each case the panic has been a page fault, and in the two msgbuf.txt files I have looked at, the current process was (txg_thread_enter) in one and (smbd) in the other.
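In case it helps anyone reading along, this is a rough sketch of how I pulled the panic details out of the saved textdump - assuming the dumps land in /data/crash as they do on my TrueNAS CORE box, and that the filename matches; adjust to suit:

Code:
# See what crash data has been saved
ls /data/crash/

# Stream msgbuf.txt straight out of the textdump archive and show the panic
# plus the process that was on the CPU at the time
tar -xOf /data/crash/textdump.tar.0 msgbuf.txt | grep -B 2 -A 20 "Fatal trap"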

I have also experienced a few degraded pools, with a couple of files corrupted. In each case the drives showed no read/write errors and smartctl extended tests came back clean. I was able to scrub the pool and restore the odd affected file - and in one case the scrub itself triggered a reboot of the VM. I had no issues for the first 4 months, during which I upgraded TrueNAS from 12.0-U2 to 12.0-U3, but for the last few months I have had reboots weekly, and recently every 48 hours.
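For reference, this is roughly how I was checking whether the errors sat at pool level or at drive level (the pool name "tank" and /dev/da0 below are just placeholders for my actual pool and disks):

Code:
# Show per-vdev read/write/checksum counters and any files flagged as corrupt
zpool status -v tank

# Kick off a scrub of the pool
zpool scrub tank

# Long SMART self-test on the physical drive, then review the results
smartctl -t long /dev/da0
smartctl -a /dev/da0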

Different processes, page faults and file corruption suggest memory issues to me - and to that end I have MemTested this ECC RAM both outside and inside of ESXi, with no errors.

While I cannot absolutely rule out hardware, what I can say is that all the other VMs have run for weeks on end with no issues, no events and nothing in the ESXi logs, and I find it hard to believe I have a hardware fault repeatedly affecting only one VM.

With each restart I have gained more information, and after hours of googling I have tried many things, such as...
- Set memory shares to HIGH in ESXi, based on a forum comment
- Amended P and C states in ESXi, based on forum advice
- Reduced the number of vCPUs for the VM from 4 to 2 in ESXi
- Ran extended SMART tests on both primary drives, which passed with no errors
- Upgraded to TrueNAS-12.0-U5.1
- Upgraded to the latest version of ESXi, 7.0U1d

I am now running out of ideas. The nearest thing I can find online to this is here, where there was no resolution: https://forums.freebsd.org/threads/kernel-panic-several-times-a-day.74234/

Before I start swapping hardware (CPU/RAM/motherboard) - which seems crazy as it all tests fine and shows no issues in the other VMs - is there anything else I can try? Is there any more information I can gather? Please help me end these three months of hell and bring harmony to this house once again! :smile:


Code:
Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x10
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80a7af2a
stack pointer           = 0x28:0xfffffe00e60aea30
frame pointer           = 0x28:0xfffffe00e60aea80
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 60306 (smbd)
trap number             = 12
panic: page fault
cpuid = 0
time = 1625664646
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00e60ae6f0
vpanic() at vpanic+0x17b/frame 0xfffffe00e60ae740
panic() at panic+0x43/frame 0xfffffe00e60ae7a0
trap_fatal() at trap_fatal+0x391/frame 0xfffffe00e60ae800
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00e60ae850
trap() at trap+0x286/frame 0xfffffe00e60ae960
calltrap() at calltrap+0x8/frame 0xfffffe00e60ae960
--- trap 0xc, rip = 0xffffffff80a7af2a, rsp = 0xfffffe00e60aea30, rbp = 0xfffffe00e60aea80 ---
knote_fdclose() at knote_fdclose+0x13a/frame 0xfffffe00e60aea80
closefp() at closefp+0x42/frame 0xfffffe00e60aeac0
amd64_syscall() at amd64_syscall+0x387/frame 0xfffffe00e60aebf0
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00e60aebf0
--- syscall (6, FreeBSD ELF64, sys_close), rip = 0x80fd11c2a, rsp = 0x7fffffffd108, rbp = 0x7fffffffd120 ---
KDB: enter: panic



Many Thanks

CC
 

ClimbingKId

Cadet
Joined
Aug 25, 2021
Messages
6
Thought I would post back with my experience. I had no replies here, but this one was a bugger and the people over at the FreeBSD forums were helpful - so I'm posting back to save anyone else the stress of random restarts and panics.

In the end I kept a five-month diary of changes and things attempted, sometimes having to wait a week or so for the next panic. Not only was I seeing these occasional restarts, but also checksum errors on the pool with no read/write or checksum errors from the underlying drives.

The debug from each panic really did not help, and the page faults suggested a memory error - however, none of the other VMs under ESXi had any issues. A new FreeBSD VM running stress-ng ran for days without issue, which gave me confidence in the hardware.
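If it helps anyone, the burn-in VM was nothing fancy - roughly along these lines (package name and options quoted from memory, so double check them on your version):

Code:
# Install stress-ng from packages
pkg install stress-ng

# Hammer memory with 4 workers using ~75% of RAM for 24 hours,
# verifying the data written to memory as it goes
stress-ng --vm 4 --vm-bytes 75% --vm-method all --verify --timeout 24h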

I eventually split my TrueNAS install across two virtual machines, passing the HBA through to one and the onboard SATA controller to the other, each with its own ZFS pool. Here the fault finally followed the HBA, my LSI 9300. Despite direct fan cooling it is getting to an incredible temperature even at idle. Looks like it is failing.
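For anyone trying the same isolation trick, this is roughly how I confirmed which controller each disk (and therefore each pool) was hanging off inside each VM:

Code:
# List disks along with the bus/driver they sit behind
# (mpr0 = the LSI 9300 HBA, ahcich* = the onboard SATA ports)
camcontrol devlist -v

# Cross-reference against the disks that make up each pool
zpool status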

In hindsight I'm not sure how I could have used the panics to pin this down faster, but it looks like a failing LSI, which would make sense given the checksum failures seen at pool level.

Thanks

CC
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Oh, yeah, checksum errors on the pool are the classic sign of a high-temp LSI controller, which starts corrupting data.

This also likely corrupts data being passed back and forth to the FreeBSD host driver. Most drivers that do "communications" with a secondary controller like the one on the HBA don't really do a lot of sanity checking of the stuff coming over the private channel from the device, because the firmware on the controller is trusted and treated as though it was something like an extension of the driver. The idea of corrupt bits coming back isn't handled well in most cases, because you would need to be running sanity checks against a huge amount of interactions which SHOULD never be corrupted, which is a significant performance hit for no gain.

So when the secondary controller is overheating and throwing random bad bits, there's your source for unsanitized input into the FreeBSD driver that then corrupts something and then ends up with some weird panic. This can definitely be difficult to debug, so you should know that your tenacity in resolving this is noted. I do this stuff professionally, and when stuff like this happens, the pragmatic fix is to just start swapping stuff out, because you usually don't have months to experiment on a customer's system. We don't always come to a clear answer like this. So, hat off to you. ;-)

Sorry I missed your message the first time around. I probably would have caught the corruption-in-pool thing.
 

ClimbingKId

Cadet
Joined
Aug 25, 2021
Messages
6
No worries - and thank you for the better explanation. Yeah, there's not a ton of spare parts to throw at the problem in this home lab; that said, I do have another LSI on order :smile:

Thanks
CC
 