ClimbingKId (Cadet, joined Aug 25, 2021, 6 messages)
First post here. As a long-time homelabber outside my professional IT job, I am in awe of what TrueNAS is doing for me. I'm an expert in a few IT areas, but definitely a novice with FreeBSD, so please treat me gently.
In February this year I replaced my old virtualised box with a bigger, beefier whitebox build, and after several weeks of careful bench testing of each component I brought it into service. It runs 8 VMs, covering pfSense, Windows Server, CCTV, Plex and, of course, TrueNAS 12 for NAS duties with a passed-through 3008 HBA, all virtualised on ESXi 7.0U2. It has been flawless, until recently...
- Gigabyte X570 Aorus Pro
- AMD Ryzen 7 3700X
- 2x Kingston Server Premier KSM32ED8/32ME 32GB DDR4 3200MHz ECC CL22
- Two pools: a mirrored live pool of 2x 10TB WD Gold drives, and a backup pool of 3x 4TB drives
- Avago 3008 HBA, passed through successfully
- Onboard Intel NIC, plus an Intel PRO/1000 dual-port card
- ESXi 7.0U2
I have also had a few degraded pools, with a couple of corrupt files. In each case the drives showed no read/write errors and smartctl extended tests were clean. I was able to scrub the pool and restore the odd affected file, although in one case the scrub triggered a reboot of the VM. I had no issues for the first 4 months, during which I upgraded TrueNAS from 12.0-U2 to 12.0-U3, but for the last few months the VM has rebooted weekly, and recently every 48 hours.
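One detail worth separating out here: `zpool status` reports three error counters per device (READ, WRITE and CKSUM). A device can show zero READ and WRITE errors and still accumulate checksum errors, meaning data was returned but did not match its checksum. The cause can sit anywhere on the path between disk and RAM (disk, cabling, HBA, memory), so clean SMART results do not rule checksum errors out. The minimal Python sketch below picks out such "checksum-only" devices from `zpool status` text. `SAMPLE` is hypothetical output in the usual layout, not output from the pools in this post, and the parser is simplified (it stops at the first blank line after the table header and skips short section-header lines such as `logs`).

```python
"""Flag devices whose errors are checksum-only in `zpool status` output.

Minimal sketch: SAMPLE is hypothetical text in the usual `zpool status`
layout, not output from the pools described in this thread.
"""

SAMPLE = """\
  pool: tank
 state: DEGRADED
config:

\tNAME        STATE     READ WRITE CKSUM
\ttank        DEGRADED     0     0     0
\t  mirror-0  DEGRADED     0     0     0
\t    da0     ONLINE       0     0     0
\t    da1     DEGRADED     0     0    12  too many errors

errors: No known data errors
"""


def parse_config(text):
    """Return (name, state, read, write, cksum) rows from the config table.

    Counters stay as strings because zpool may abbreviate large values
    (for example "1.2K").
    """
    rows = []
    in_table = False
    for line in text.splitlines():
        fields = line.split()
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_table = True
            continue
        if not in_table:
            continue
        if not fields:
            # A blank line ends the device table.
            break
        if len(fields) < 5:
            # Section headers such as "logs" or "spares" lines: skip.
            continue
        rows.append(tuple(fields[:5]))
    return rows


def checksum_only(rows):
    """Names of devices with CKSUM errors but zero READ and WRITE errors."""
    return [name for name, _state, read, write, cksum in rows
            if read == "0" and write == "0" and cksum != "0"]


if __name__ == "__main__":
    table = parse_config(SAMPLE)
    for row in table:
        print(row)
    print("checksum-only devices:", checksum_only(table))
```

Run against real output (for example the text of `zpool status -v` saved to a file), a device listed as checksum-only would match the pattern described above: a degraded pool with no read/write errors on the drive itself.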
Panics in different processes, page faults and file corruption all suggest memory issues to me, so I have run MemTest on this ECC RAM both outside and inside ESXi, with no errors.
While I cannot absolutely rule out hardware, all the other VMs have run for weeks on end with no issues, no events and nothing in the ESXi logs, so I find it hard to believe that a hardware fault is repeatedly affecting only one VM.
Each restart has given me more information, and after hours of googling I have tried many things, such as:
- Set memory shares to HIGH in ESXi, based on a forum comment.
- Amended P-states and C-states in ESXi, based on forum advice.
- Reduced the number of vCPUs in the ESXi VM from 4 to 2.
- Ran extended SMART tests on both primary drives; both passed with no errors.
- Upgraded to TrueNAS-12.0-U5.1.
- Upgraded to the latest version of ESXi, 7.0U1d.
I am now running out of ideas. The closest match I can find online is this thread, which had no resolution: https://forums.freebsd.org/threads/kernel-panic-several-times-a-day.74234/
Before I start swapping hardware (CPU/RAM/motherboard), which seems crazy as it all tests fine and shows no issues in the other VMs, is there anything else I can try? Is there any more information I can gather? Please help me end these three months of hell and bring harmony to this house once again!
Code:
Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0x10
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80a7af2a
stack pointer = 0x28:0xfffffe00e60aea30
frame pointer = 0x28:0xfffffe00e60aea80
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 60306 (smbd)
trap number = 12
panic: page fault
cpuid = 0
time = 1625664646
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00e60ae6f0
vpanic() at vpanic+0x17b/frame 0xfffffe00e60ae740
panic() at panic+0x43/frame 0xfffffe00e60ae7a0
trap_fatal() at trap_fatal+0x391/frame 0xfffffe00e60ae800
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00e60ae850
trap() at trap+0x286/frame 0xfffffe00e60ae960
calltrap() at calltrap+0x8/frame 0xfffffe00e60ae960
--- trap 0xc, rip = 0xffffffff80a7af2a, rsp = 0xfffffe00e60aea30, rbp = 0xfffffe00e60aea80 ---
knote_fdclose() at knote_fdclose+0x13a/frame 0xfffffe00e60aea80
closefp() at closefp+0x42/frame 0xfffffe00e60aeac0
amd64_syscall() at amd64_syscall+0x387/frame 0xfffffe00e60aebf0
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00e60aebf0
--- syscall (6, FreeBSD ELF64, sys_close), rip = 0x80fd11c2a, rsp = 0x7fffffffd108, rbp = 0x7fffffffd120 ---
KDB: enter: panic
Many Thanks
CC
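The panic message above can be summarised in a few lines of Python, which helps when comparing several crashes. A common rule of thumb is that panics repeatedly faulting in the same kernel function point toward a software bug, while faults scattered across unrelated code paths are more consistent with memory corruption. This minimal sketch assumes the message has been copied as plain text; `PANIC` is a trimmed excerpt of the log in this post, and `summarise` is a hypothetical helper, not a FreeBSD tool. It makes a few things explicit: `time = 1625664646` is a Unix timestamp (2021-07-07 13:30:46 UTC), a fault virtual address of 0x10 lies in the first page of memory, which typically means a NULL structure pointer was dereferenced at a small field offset, and the first frame after the `--- trap` marker (here `knote_fdclose`, reached from smbd's `close` system call) is where the fault occurred.

```python
"""Summarise a FreeBSD kernel panic message (minimal sketch)."""

import re
from datetime import datetime, timezone

# Trimmed excerpt of the panic message posted above.
PANIC = """\
fault virtual address = 0x10
fault code = supervisor read data, page not present
current process = 60306 (smbd)
trap number = 12
panic: page fault
time = 1625664646
trap() at trap+0x286/frame 0xfffffe00e60ae960
calltrap() at calltrap+0x8/frame 0xfffffe00e60ae960
--- trap 0xc, rip = 0xffffffff80a7af2a, rsp = 0xfffffe00e60aea30, rbp = 0xfffffe00e60aea80 ---
knote_fdclose() at knote_fdclose+0x13a/frame 0xfffffe00e60aea80
closefp() at closefp+0x42/frame 0xfffffe00e60aeac0
amd64_syscall() at amd64_syscall+0x387/frame 0xfffffe00e60aebf0
"""

# Backtrace frames look like "func() at func+0xOFF/frame 0xADDR".
FRAME_RE = re.compile(r"^(\w+)\(\) at \1\+(0x[0-9a-f]+)/frame", re.MULTILINE)


def summarise(text):
    """Return the key facts of a panic message as a dict (hypothetical helper)."""
    # "key = value" lines; the lazy key splits "a = b = c" at the first " = ".
    fields = dict(re.findall(r"^([a-z ]+?) = (.+)$", text, re.MULTILINE))
    pid, proc = re.match(r"(\d+) \((.+)\)", fields["current process"]).groups()
    fault_addr = int(fields["fault virtual address"], 16)
    # Frames before the "--- trap" marker are the panic/trap handling itself;
    # the first frame after it is the code that actually faulted.
    after_trap = text.split("--- trap", 1)[-1]
    return {
        "panic": re.search(r"^panic: (.+)$", text, re.MULTILINE).group(1),
        "process": proc,
        "pid": int(pid),
        "fault_addr": fault_addr,
        # Heuristic: an address inside the first 4 KiB page is typically
        # a NULL pointer plus a small structure offset.
        "near_null": fault_addr < 4096,
        "when_utc": datetime.fromtimestamp(int(fields["time"]), tz=timezone.utc),
        # (function, offset) pairs, innermost first.
        "frames": FRAME_RE.findall(after_trap),
    }


if __name__ == "__main__":
    info = summarise(PANIC)
    print(f"{info['panic']} in {info['process']} (pid {info['pid']})")
    print(f"fault address {info['fault_addr']:#x}, near NULL: {info['near_null']}")
    print("time (UTC):", info["when_utc"].isoformat())
    print("faulting function:", info["frames"][0][0])
```

Collecting these summaries across several reboots would show whether the crashes keep landing in the same function and process or wander, which is the distinction the rule of thumb above relies on.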