ClimbingKId
Cadet
- Joined: Aug 25, 2021
- Messages: 6
First post here, and as a long-time labber outside my professional IT job, I am in awe of what TrueNAS is doing for me. I am an expert in a few IT areas, but definitely a novice with FreeBSD, so please treat me gently.
I replaced my old virtualised box in Feb this year with a bigger and beefier whitebox solution and, after a few weeks of careful bench testing of each area, I brought it into service. It runs 8 VMs, covering pfSense, Windows Server, CCTV, Plex and of course TrueNAS 12 for NAS duties with a passed-through 3008 HBA, all virtualised on ESXi 7.0U2. It has been flawless, until recently....
- Gigabyte Aorus Pro X570
- Ryzen 3700X
- 2 x 32GB Kingston Server Premier KSM32ED8/32ME memory (3200MHz DDR4 ECC CL22)
- 2 pools: 2 x 10TB WD Gold drives as a mirrored live pool, and 3 x 4TB drives as a backup pool
- Avago 3008 HBA, passed through successfully
- Onboard Intel NIC, plus a Pro 1000 dual-port card
- ESXi 7.0U2
I have also experienced a few degraded pools, with a couple of corrupt files. In each case the drives showed no read/write errors and the smartctl extended tests were clean. I was able to scrub the pool and restore the odd affected file, and in one case the scrub triggered a reboot of the VM. I had no issues for the first 4 months, during which I upgraded TrueNAS from 12.0U2 to 12.0U3, but over the last few months I have had reboots weekly, and recently every 48 hours.
Panics in different processes, page faults and file corruption all suggest memory issues to me, and to that end I have run MemTest on this ECC RAM both outside and inside ESXi, with no errors.
While I cannot absolutely rule out hardware, what I can say is that all the other VMs have run for weeks on end with no issues, no events, and nothing in the ESXi logs. I find it hard to believe I have a hardware fault repeatedly affecting only one VM.
With each restart I have gained more information, and I have tried many things following hours of googling, such as:
- Set memory shares to HIGH in ESXi, based on a forum comment
- Amended P and C states in ESXi, based on forum advice
- Reduced the number of vCPUs from 4 to 2 in the ESXi VM
- Ran extended SMART tests on both primary drives, which passed with no errors
- Upgraded to TrueNAS-12.0-U5.1
- Upgraded to latest version of ESXi 7.0U1d
I am now running out of ideas, though. The nearest match I can find online is this thread, where there was no resolution: https://forums.freebsd.org/threads/kernel-panic-several-times-a-day.74234/
Before I start swapping hardware (CPU/RAM/motherboard), which seems crazy as it all tests fine and shows no issues in the other VMs, is there anything else I can try? Is there any more information I can gather? Please help me end these three months of hell and bring harmony to this house once again!
Code:
Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0x10
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80a7af2a
stack pointer = 0x28:0xfffffe00e60aea30
frame pointer = 0x28:0xfffffe00e60aea80
code segment = base 0x0, limit 0xfffff, type 0x1b
             = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 60306 (smbd)
trap number = 12
panic: page fault
cpuid = 0
time = 1625664646
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00e60ae6f0
vpanic() at vpanic+0x17b/frame 0xfffffe00e60ae740
panic() at panic+0x43/frame 0xfffffe00e60ae7a0
trap_fatal() at trap_fatal+0x391/frame 0xfffffe00e60ae800
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00e60ae850
trap() at trap+0x286/frame 0xfffffe00e60ae960
calltrap() at calltrap+0x8/frame 0xfffffe00e60ae960
--- trap 0xc, rip = 0xffffffff80a7af2a, rsp = 0xfffffe00e60aea30, rbp = 0xfffffe00e60aea80 ---
knote_fdclose() at knote_fdclose+0x13a/frame 0xfffffe00e60aea80
closefp() at closefp+0x42/frame 0xfffffe00e60aeac0
amd64_syscall() at amd64_syscall+0x387/frame 0xfffffe00e60aebf0
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00e60aebf0
--- syscall (6, FreeBSD ELF64, sys_close), rip = 0x80fd11c2a, rsp = 0x7fffffffd108, rbp = 0x7fffffffd120 ---
KDB: enter: panic
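Since the panics seem to hit different processes, one way to compare them between reboots might be to reduce each panic to a few fields: the process that was running, the fault address, the first kernel function after the trap marker, and the time. The following is only a rough sketch in Python, assuming the panic text has been copied out of the console or a saved dump by hand. The function name `summarize_panic`, the dictionary keys, and the shortened `PANIC_EXCERPT` are illustrative choices made for this sketch, not anything provided by TrueNAS or FreeBSD.

```python
import re
from datetime import datetime, timezone

# Shortened excerpt of the panic above (illustrative; paste the full text in practice).
PANIC_EXCERPT = """\
Fatal trap 12: page fault while in kernel mode
fault virtual address = 0x10
current process = 60306 (smbd)
time = 1625664646
--- trap 0xc, rip = 0xffffffff80a7af2a, rsp = 0xfffffe00e60aea30, rbp = 0xfffffe00e60aea80 ---
knote_fdclose() at knote_fdclose+0x13a/frame 0xfffffe00e60aea80
closefp() at closefp+0x42/frame 0xfffffe00e60aeac0
"""


def summarize_panic(text):
    """Pull out the fields that are worth comparing from one panic to the next."""
    summary = {}

    # "current process = 60306 (smbd)" -> pid and process name
    m = re.search(r"current process\s*=\s*(\d+)\s*\(([^)]+)\)", text)
    if m:
        summary["pid"] = int(m.group(1))
        summary["process"] = m.group(2)

    # "fault virtual address = 0x10"
    m = re.search(r"fault virtual address\s*=\s*(0x[0-9a-fA-F]+)", text)
    if m:
        summary["fault_address"] = m.group(1)

    # "time = 1625664646" is seconds since the Unix epoch; show it as UTC
    m = re.search(r"\btime\s*=\s*(\d+)", text)
    if m:
        stamp = datetime.fromtimestamp(int(m.group(1)), tz=timezone.utc)
        summary["time_utc"] = stamp.isoformat()

    # The first frame after the "--- trap ... ---" marker is where the fault happened
    m = re.search(r"--- trap .*? ---\s*(\w+)\(\)", text)
    if m:
        summary["faulting_function"] = m.group(1)

    return summary


if __name__ == "__main__":
    for key, value in summarize_panic(PANIC_EXCERPT).items():
        print(f"{key}: {value}")
```

Running this over each saved panic and lining up the summaries would show whether the fault address and faulting function are the same every time or different on each crash, which is one more data point when weighing a software cause against a hardware one.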
Many Thanks
CC