SOLVED TrueNAS VM crash after update from v12.0-U8.1 to v13.0-U5.1

poto

Dabbler
Joined
Jun 27, 2023
Messages
11
I've been running FreeNAS, then TrueNAS, virtualized for several years without issue. However, I just ran into a problem after updating the VM from 12.0-U8.1 to 13.0-U5.1.
System is as follows:

OS:
VMware ESXi, 7.0.3, 21686933
Head unit:
Supermicro X10SRL, E5-2699V4, 256gb ram, Intel 82599 10g nic
Intel NVME P3605 1.6tb vm boot storage
2xLSI 2308 hba dual-connect passthrough to TrueNAS VM
HBA's & motherboard updated to latest f/w from Supermicro website

Disk shelf:
Supermicro SC847 45-drive ext chassis
integrated SM SAS2 backplanes

VM config:
TrueNAS 12.0-U8.1
12 cpu cores, 72GB ram, 60GB provisioned boot drive, 51gb used, VMXNET3 virtual nic
pool_01 32x HGST HDN724040AL z1
(3x 9-drive vdev +1x 5-drive vdev)
pool_02 9x HGST HUH721008AL4200 z1
pool_03 4x SAMSUNG MZ7KM960 z1

This VM/disk shelf is for bi-weekly backups of the main media server. All three pools scrubbed successfully, with no errors, right before the TN OS upgrade. The VM boots OK after the upgrade and all pools show green status. As soon as a workload is added to any of the 3 pools, the VM powers off. No other changes except the TN OS upgrade. I've attached 3 debug files: one from right before the upgrade (v12.0-U8.1), one from right after the upgrade reboot (v13.0-U5.1) and one from after the unscheduled shutdown.

Any insight as to cause of shutdown would be appreciated. Quite spoiled until now, FN/TN just worked perfectly out-of-the-box. I'm thinking either some of my equipment is no longer supported, or updated OS now exposing some weakness in hardware.

TIA
 

Attachments

  • debug-TrueNAS-01-20230627173744 (12.8-U8.1, stable).tgz
    2 MB · Views: 87
  • debug-TrueNAS-01-20230628151757 (after upgrade to 13.0-U5.1).tgz
    1.9 MB · Views: 83
  • debug-TrueNAS-01-20230628152533 (upgrade to 13.0-U5.1 post crash).tgz
    1.8 MB · Views: 82

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Inside your debugs - all three - in ixdiagnose/crash/textdump.tar.last.gz/textdump.tar.last/msgbuf.txt I'm seeing the following right before the panic:

mps0: IOC Fault 0x40007e23, Resetting

The common fault that we've seen in the community from upgrading a VM to CORE 13 is a "doorbell handshake failed" error, but this is often coupled with either the older SAS2008 adapter (you have SAS2308s) or assigning RAM across NUMA nodes (you have a single socket).

Can you identify the PCIe slot that the mps0 HBA is installed in? It seems to be the only one giving you trouble - it might be worth reseating it, checking to see that it's not overheating, or otherwise might have failed.
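
For anyone wanting to repeat that check on their own debug archive, here is a minimal sketch. It assumes msgbuf.txt has already been extracted from the textdump, and that the fault lines follow the `mps0: IOC Fault 0x..., Resetting` format quoted above. The function name `count_ioc_faults` is made up for illustration, and matching the `mpr` (SAS3) driver prefix as well is an assumption.

```shell
# Count "IOC Fault" lines per controller in an extracted msgbuf.txt.
# count_ioc_faults is a hypothetical helper name, not a TrueNAS tool.
count_ioc_faults() {
    # $1: path to msgbuf.txt
    # grep -o keeps only the matched text, e.g. "mps0: IOC Fault 0x40007e23".
    # awk splits on ":" so field 1 is the controller name, then tallies it.
    grep -Eo 'mp[sr][0-9]+: IOC Fault 0x[0-9a-fA-F]+' "$1" \
        | awk -F: '{ n[$1]++ } END { for (c in n) print c, n[c] }' \
        | sort
}
```

Running `count_ioc_faults msgbuf.txt` prints one line per controller with its fault count, which makes it easy to see whether only one HBA is resetting.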
 

TheNiTz

Cadet
Joined
Dec 22, 2021
Messages
6
Good afternoon, I am seeing the same problem: my Windows VM crashes. It ran for months on end on 12, but as soon as I updated during the beta it would crash after a few minutes, and it does the same as of last night's update to 13.0-U5.1. The VM shows as running and I am able to connect via VNC, but no interaction is possible. I just restarted it and the clock is stuck at 12:15 pm while it is currently 12:24 pm. Ubuntu doesn't seem affected. Here is my log, hope it helps.

Misread the OP: I thought he was having issues with VMs, but he is actually having an issue with the TrueNAS VM itself. Will create a new thread for this.
 

Attachments

  • debug-HomeNas-20230702122131.tgz
    1.4 MB · Views: 94

poto

Dabbler
Joined
Jun 27, 2023
Messages
11
update:
- Swapped in new 9207-8e HBA after verifying firmware, bios & NVDATA identical to existing 2 parts. v13.0-U5.1 VM crashes under load regardless of which 2 hba are connected to disk shelf.
- all 3 hba are stable when only 1 hba connected to disk shelf, that is single-connect from front & rear backplanes to 1 hba
- considered possible conflict with 3008 hba in esxi host, moved v13.0-U5.1 VM & hba's to different host, same hardware but no other pci devices.
- also tested with and without boot bios installed, unstable in either case
- same result, any single hba connected to disk shelf is stable, but any combination of 2 hba's crashes under load
- rolled back boot environment to v12.0-U8.1, no problems under load. Running scrub on all 3 pools while reading & writing to pools, no problems
- attached diag file for current config, original 2 hba dual-connect to disk shelf, TrueNAS VM v12.0-U8.1

The only constant is that VM crashes when using 2 hba with boot environment v13.0-U5.1

Open to any other thoughts on possible solutions.
 

Attachments

  • debug-TrueNAS-01-20230706104107.tgz
    2.1 MB · Views: 77

TheNiTz

Cadet
Joined
Dec 22, 2021
Messages
6
What are you using for your controller, AHCI or VirtIO? I switched from AHCI to VirtIO, and likewise for Ethernet from Intel to VirtIO, and it fixed the issue.
 

poto

Dabbler
Joined
Jun 27, 2023
Messages
11
What are you using for your controller, AHCI or VirtIO? I switched from AHCI to VirtIO, and likewise for Ethernet from Intel to VirtIO, and it fixed the issue.
Thanks for the suggestion, but wrong OS. I'm troubleshooting a TrueNAS VM running under VMware ESXi.
 

smcclos

Dabbler
Joined
Jan 22, 2021
Messages
43
OK, I have been spending the last 5 days setting up my virtual TrueNAS, and was having the same problems
I've been running FreeNAS, then TrueNAS, virtualized for several years without issue. However just ran into problem after updating VM from 12.0-U8.1 to 13.0-U5.1.

I have been setting up my own TrueNAS VM for the last 5 days and tracking my saga in the thread "Virtualized FreeNAS apparently crashing", and I'm 99% sure I licked the problem.

I started new, so never ran 12.0-U8.1 virtually, but only 13.0-U5.1
My ESXi is 6.7.0-U3 because my SAS controller, LSI 9206-16e is on that HCL, not on 7.0 or 8.0

I used the following command to stress my pool:

Code:
fio --name=test --size=100g --rw=write --ioengine=posixaio --direct=1 --bs=1m


and my virtual TrueNAS was crashing and powering down until I did one thing: I reconfigured my virtual machine from 2 sockets with 4 cores each to 1 socket with 8 cores.

The system is now rock solid. I suggest you start there.
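
The fio command above writes its test file into the current working directory, so it has to be run from inside the pool being stressed. As a rough sketch, the helper below only prints an equivalent command aimed at a given dataset so it can be reviewed before running. The function name `fio_write_cmd` and the example mount point are hypothetical; `--directory` is the only fio option added to the original command.

```shell
# Print (not run) a fio sequential-write command aimed at one dataset.
# All options mirror the command above; --directory is the only addition.
fio_write_cmd() {
    # $1: dataset mount point (hypothetical example: /mnt/pool_01/scratch)
    # $2: test file size, defaulting to the 100g used above
    dir=$1
    size=${2:-100g}
    printf 'fio --name=test --directory=%s --size=%s --rw=write --ioengine=posixaio --direct=1 --bs=1m\n' "$dir" "$size"
}
```

For example, `fio_write_cmd /mnt/pool_01/scratch 10g` prints the command for a smaller 10g run; paste the output into the shell once the target path looks right.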
 

poto

Dabbler
Joined
Jun 27, 2023
Messages
11
Thanks, VM is configured with 12 cores, 1 socket.
 

poto

Dabbler
Joined
Jun 27, 2023
Messages
11
Reduced cores to 8 (1 socket), but same result. VM boots successfully, but crashes as soon as any of the 3 pools are stressed.

VM is stable with a single HBA, but crashes as soon as 2 HBAs are dual-linked to the front & rear backplanes of the disk shelf.

Next step, I think, is provisioning a fresh TrueNAS install, importing the pools & seeing if that corrects the issue.

If not, probably try LSI 9305-16e hba & new ext sas cables.
 

smcclos

Dabbler
Joined
Jan 22, 2021
Messages
43
Next step, I think, is provisioning a fresh TrueNAS install, importing the pools & seeing if that corrects the issue.
That is not a bad idea, start clean.
 

poto

Dabbler
Joined
Jun 27, 2023
Messages
11
Do you have to dual-link the HBA's?

I know it was not a concern in mine, but have you checked the temperature of the HBA adapters?
Temps are OK, fan profile set to "Heavy I/O" when TN VM active. Also, during previous testing, hba's were in 4u enclosure with extra fan over them.

Dual-link not mandatory, but reduces scrub times a lot.
 

poto

Dabbler
Joined
Jun 27, 2023
Messages
11
Clean install TrueNAS VM 13.0-U5.1. Imported 3x pools successfully. VM crashed under load. Chassis fans at 100%, no overheating.

Will try 4-port hba (9305-16e) & new cables. Update to follow after new hba/cables.
 

smcclos

Dabbler
Joined
Jan 22, 2021
Messages
43
Clean install TrueNAS VM 13.0-U5.1. Imported 3x pools successfully. VM crashed under load. Chassis fans at 100%, no overheating.

Try 13.0-U5.2 that just came out. For some reason, my virtual test TrueNAS became much more stable overnight.
 

poto

Dabbler
Joined
Jun 27, 2023
Messages
11
Received 9305-16e HBA via eBay. The new HBA shows its temp fault LED solid red, even with an extra fan. MegaRAID Storage Manager reports 61C chip temp, nowhere near overheated. Considered the possibility that the Dell HBA had some quirks interfacing with the Supermicro motherboard, so installed it in two different servers; both attempts resulted in no devices discovered. Return begun, will update when a functioning HBA is secured.
 

poto

Dabbler
Joined
Jun 27, 2023
Messages
11
Update - HBA issue resolved; the device as received had a firmware or BIOS mismatch. Once the flash was erased & re-written, it functions as expected.

New hba & cables seem to have resolved crash issues. Uptime over 8hrs, read/write stress test + data scrub successful.

Would appreciate any experts to look over attached debug file to see if there are any underlying issues.

TIA

@HoneyBadger
 

Attachments

  • debug-TrueNAS-01-20230717081403.tgz
    2.1 MB · Views: 75

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
You might want to apply the 16.00.12.00 firmware mentioned in this thread, as it's intended to address a controller reset under specific circumstances with SATA drives.

 

poto

Dabbler
Joined
Jun 27, 2023
Messages
11
I did see that notice, but hesitated because the IOC for the 9305-16e is part 3216 vs 3008.

The f/w package lists the 9300-16e, but not the 9305-16e, and I don't want to brick anything.
 
