TrueNAS 12 Bhyve, Pool, or Jail causing panic on boot (most of the time)

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
There's something definitely not right with at least one disk in your pool. The panic occurs when your system swaps to or from the swap partitions. The way swap works is that TrueNAS reserves (by default) a 2GiB swap partition on each disk in the pool, and then creates an encrypted mirror or stripe of mirrors out of all the swap partitions. From the msgbuf.txt, your swap is a 2-way stripe of 3-way mirrors, so you should have 4GiB of total swap.
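
If you want to see that layout for yourself, here's a quick sketch using stock FreeBSD tools (nothing TrueNAS-specific assumed):

gpart show        # each data disk should show a ~2GiB freebsd-swap partition
gmirror status    # lists the swap mirrors (swap0, swap1) and their member partitions
swapctl -lh       # shows the active (GELI-encrypted) swap devices and their sizes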

For each of the panics recorded in ddb.txt, there's a swap_pager_alloc() call in the trace.

How is your pool RGC_STORAGE constructed?
 

dfalke

Dabbler
Joined
Mar 12, 2021
Messages
31
The disks are virtually brand new, probably 6 months old, and only 2TB is in use. I have 6 disks, 6TB each, in RAIDZ2. As for how RGC_STORAGE is constructed, here are two screenshots that hopefully show what you're looking for. Let me know what else you need...
 

Attachments

  • pool status.png
    27 KB · Views: 159
  • pool pic.png
    74.7 KB · Views: 149

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
OK, it looks like your CPU doesn't have AES-NI, so the swap partitions are GELI-encrypted using software. I wonder if your HBA just isn't fast enough for this. Could you try moving all your disks to the onboard SATA ports?
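
You can check both the CPU flag and what GELI is actually using; a minimal sketch from a root shell:

grep AESNI /var/run/dmesg.boot    # the CPU's Features2 line lists AESNI if it's supported and enabled
geli list | grep Crypto           # "Crypto: hardware" means AES-NI is in use; "software" means it isn't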
 

dfalke

Dabbler
Joined
Mar 12, 2021
Messages
31
Is it possible I turned off AES-NI in the BIOS? I did try turning off some CPU settings to rule out BIOS configuration issues related to the CPU. I googled AES-NI for my CPU, and it seems like maybe it does have it? https://ark.intel.com/content/www/us/en/ark/compare.html?productIds=/65732

I will check the BIOS, restore the typical CPU settings, and re-enable the features I previously disabled.
I will switch to the SATA ports in the morning; that's a good idea to rule out controller issues.

I will post back tomorrow morning.
 

dfalke

Dabbler
Joined
Mar 12, 2021
Messages
31
I loaded the optimized BIOS defaults, because I had disabled some of the CPU features earlier to see if that would solve the issue. After loading defaults, I noticed that AES-NI was disabled by default, so I enabled it for the latest test. I believe, according to the link above, my chip does support this feature.

In addition, per your suggestion, I removed the HBA card and connected all 6 hard drives directly to the motherboard. When TrueNAS booted, I attempted to start the SMB service, and it resulted in a panic. See the latest logs...
 

Attachments

  • config.txt
    4 KB · Views: 185
  • ddb.txt
    689.6 KB · Views: 170
  • msgbuf.txt
    37.5 KB · Views: 208
  • panic.txt
    10 bytes · Views: 146
  • version.txt
    50 bytes · Views: 158

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
We're making progress; the GELI encryption for the swap partitions now shows it's using AES-NI hardware, so that performance limit is no longer a factor. The trace still shows the crash happening in a swap_pager_alloc() call. Since the logs don't indicate which specific drive's swap partition is corrupt, you'll have to try physically disconnecting one disk at a time between reboots and seeing whether the problem goes away.
 

dfalke

Dabbler
Joined
Mar 12, 2021
Messages
31
Just wanted to report back: I cycled through detaching each hard drive, booting, and starting SMB, and in every case it resulted in a panic and a reboot. At least now we know it's not related to an AES-NI bottleneck or the HBA card. What do you think I should try next?
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
First, let's map each swap partition via geom -t. Note which partition is assigned to which swap mirror.

Next, let's try blocking off a swap mirror before starting SMB. swapctl -lh should show your active swap devices. From your msgbuf.txt file, your swap devices should be /dev/mirror/swap0 and /dev/mirror/swap1. Before launching SMB, try swapoff /dev/mirror/swap0 or swapoff /dev/mirror/swap1, to see if one or the other swap device is bad.

If you get consistent crashes with one swap device, then the geom -t output will let you localize the problem to one of the 3 disks in that mirror.
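
A quick sketch of the inspection commands (take the swapoff argument verbatim from the swapctl output):

geom -t        # tree view: shows which adaXpY partitions sit under mirror/swap0 and mirror/swap1
swapctl -lh    # lists the active swap devices; use these exact names with swapoff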
 

dfalke

Dabbler
Joined
Mar 12, 2021
Messages
31
OK, I ran geom -t to get the mapping and made note of it below.

ada0 -> swap1
ada1 -> swap1
ada2 -> swap1
ada3 -> swap0
ada4 -> swap0
ada5 -> swap0

swapoff /dev/mirror/swap0.efi
start SMB -> panic/reboot

swapoff /dev/mirror/swap1.efi
start SMB -> panic/reboot

swapoff -a
start SMB -> panic/reboot

Do you still think it has to do with the swap? What if I backed up all my data to my other NAS and we re-created the pool? The only thing I would need help backing up or migrating is the VM under bhyve to my other FreeNAS server.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
I'm starting to think you might have RAM going bad. It's extremely unlikely for all 6 disks to have a bad swap partition. Have you tried booting into a rescue Linux live thumb drive to run memtest86+?

Also, you & I may have typoed the swapoffs. It should be /dev/mirror/swap0.eli and swap1.eli.
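
With the corrected names, one test cycle looks roughly like this (a sketch; re-enable each device before testing the other):

swapctl -lh                      # should list /dev/mirror/swap0.eli and /dev/mirror/swap1.eli
swapoff /dev/mirror/swap0.eli    # take one swap mirror offline
# ...start SMB and watch for a panic...
swapon /dev/mirror/swap0.eli     # bring it back before testing swap1.eli the same way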
 
Last edited:

dfalke

Dabbler
Joined
Mar 12, 2021
Messages
31
I figured out the command and confirmed swap was off each time. I will run a memory test tomorrow.
 

dfalke

Dabbler
Joined
Mar 12, 2021
Messages
31
Sorry for the delay, we are making progress!!!!!

Memory Configuration:
DIMM0A - 8GB Memory (Labeled 1)
DIMM1A - 8GB Memory (Labeled 2)
DIMM0B - 8GB Memory (Labeled 3)
DIMM1B - 8GB Memory (Labeled 4)

1) Created a MemTest86 bootable USB from PassMark (as opposed to a Linux live CD)
2) Upon booting and running the memory test, it would lock up within 3 to 5 seconds of starting. I re-created the boot device several times with different USB sticks to make sure the media wasn't the issue; it still locked up during testing.
3) Removed RAM sticks 2 & 4
4) MemTest86 booted and ran 3+ hours of testing with no errors; the remaining sticks passed.
5) Before testing the rest of the memory, I booted TrueNAS back up and started SMB. It worked!!
6) I am going to try to start the VM as a second test, then will re-seat RAM sticks 2 & 4 and run MemTest86 again.

Will post back more details later today to confirm tentative results above!!
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Excellent news!
 

dfalke

Dabbler
Joined
Mar 12, 2021
Messages
31
Thanks so much for your help!! Before trying to start the VM, I re-seated RAM sticks 2 & 4 and re-ran the memory test; it did not lock up this time and has almost completed 2 passes. I will report back full results tonight.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
You may want to save this for future reference:

 

dfalke

Dabbler
Joined
Mar 12, 2021
Messages
31
I wanted to update back: not completely out of the woods yet, but very good progress. I ran memtest86 for 6 hours and completed 4 passes; there were no errors in any of the 4 RAM modules. I got the NAS back up and running and started the VM with no issues, and it seemed stable. But..... when I started the jail, I got a reboot. I deleted the jail and re-created it, and it still reboots when I start the jail. I checked the /data/crash directory and no dump file was created. The NAS alerts to an unscheduled reboot in the GUI. Is there any way to check what the error was when the jail was started?
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Is there anything logged in /var/log/console.log? Anything else in /var/log? Nonetheless, this is encouraging progress.
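
A few things worth running right after the reboot; a sketch (exact log file names vary by release):

ls -lt /var/log | head              # newest logs first; anything written at the crash time is interesting
tail -n 50 /var/log/console.log     # console messages from just before the reset
grep -i jail /var/log/messages      # any jail-related errors that made it to syslog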
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Also, which release is your jail using?
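
On TrueNAS 12, jails are managed by iocage, so from a shell something like this should show it (a sketch):

iocage list -l    # the RELEASE column shows each jail's base version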
 

dfalke

Dabbler
Joined
Mar 12, 2021
Messages
31
I am running jail "12.2-RELEASE-p4"; that is what it says in the properties. It's a brand new jail, nothing in it. Attached are the logs from the /var/log directory; I included any files modified on or around the reboot. The alert in the NAS GUI says the reboot occurred at ~Thu Mar 18 15:11:53 2021.
 

Attachments

  • logs.zip
    258.5 KB · Views: 136