NVMe issue on virtualized TrueNAS

Sprint

Explorer
Joined
Mar 30, 2019
Messages
72
So we boot up without issue now, but the same problem remains: after a short burst, it stops dead... after a short while, it jumps back into life, only to stop again... The dips in the attached chart line up with when we get the errors on the console.

My backup plan at this stage is to play with TrueNAS SCALE on my spare server in the garage, get my head around it, and if that proves viable, look at moving to it long term... But I really want to avoid having to send these NVMe drives back for SATA ones in the meantime. Annoyingly, I am critically short on space on my iSCSI pool, so I need new drives either way. Any ideas you may have up your sleeve, I'm all ears :)
 

Attachments

  • ScreenGrab.JPG (93.9 KB)

Sprint

Explorer
Joined
Mar 30, 2019
Messages
72
Try adding this tunable as a type of LOADER with the other settings still in play:

hw.pci.enable_msi="0"

1605732365687.png

Like this?

It got further into the 11 GB test file but still stopped twice along the way. Do I need to reboot for it to take effect?
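
Tunables of type LOADER are read by the boot loader, so they only apply after a reboot. A minimal sketch for checking what the running system actually picked up, assuming the standard FreeBSD `kenv` and `sysctl` utilities are available at the shell (`show_tunable` is a made-up helper name, not part of TrueNAS):

```shell
# Report what the boot loader passed in and what the running kernel sees
# for one tunable. show_tunable is a hypothetical helper name.
show_tunable() {
    name="$1"
    if command -v kenv >/dev/null 2>&1; then
        # kenv prints the kernel environment, which is where loader tunables land
        value=$(kenv -q "$name" 2>/dev/null) || value="(not set by the loader)"
        echo "loader:  $name=$value"
    else
        echo "loader:  kenv not found (not a FreeBSD system?)"
    fi
    # The live value, if the kernel exposes this name as a sysctl
    value=$(sysctl -n "$name" 2>/dev/null) || value="(no such sysctl)"
    echo "running: $name=$value"
}

# Usage: show_tunable hw.pci.enable_msi
```

If the loader line is missing after a reboot, the tunable was probably not saved as type LOADER.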
 

Sprint

Explorer
Joined
Mar 30, 2019
Messages
72
OK, so I did the reboot, but oddly TrueNAS has lost access to the network, and it reports "the web interface could not be accessed"... If I try to change settings at the command line, it reports that no interface could be accessed. :( Will try removing and re-adding the VM NICs.
 

Sprint

Explorer
Joined
Mar 30, 2019
Messages
72
So watching it boot, I realised not only is it booting extremely quickly, but it's failing to mount every single device, including the NVMe drives, the disks on the LSI cards, and anything attached to the NICs. Is there a way to remove that tunable, as that's the last thing we changed?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
So watching it boot, I realised not only is it booting extremely quickly, but it's failing to mount every single device, including the NVMe drives, the disks on the LSI cards, and anything attached to the NICs. Is there a way to remove that tunable, as that's the last thing we changed?
From the VM console, edit the file /boot/loader.conf.local and remove the offending line.
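
One way to do that from the console without opening an editor, sketched under the assumption that the tunable really is in that file (`remove_loader_line` is an illustrative name). It keeps a `.bak` copy of the original and rewrites the file without any line containing `NAME=`:

```shell
# Remove every line that sets a given tunable from a loader config file.
# remove_loader_line is a hypothetical helper; it saves a .bak copy first.
remove_loader_line() {
    file="$1"
    name="$2"
    [ -f "$file" ] || { echo "no such file: $file" >&2; return 1; }
    cp -p "$file" "$file.bak" || return 1
    # -F matches the text literally (dots are not wildcards); -v keeps all other lines.
    # grep exits 1 when no lines are left to print, which is acceptable here.
    grep -v -F "$name=" "$file.bak" > "$file" || [ $? -eq 1 ]
}

# Usage: remove_loader_line /boot/loader.conf.local hw.pci.enable_msi
```

Because the match includes the trailing `=`, a neighbouring setting such as `hw.pci.enable_msix` is left alone.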
 

Sprint

Explorer
Joined
Mar 30, 2019
Messages
72
1605734978389.png

The file doesn't seem to have anything in it. Am I looking at the right file? I also checked loader.conf, and while there's more in that, the hw.pci.enable_msi="0" line isn't there... It's difficult, as every few seconds an error splashes over the screen because it can't connect to the UPS. Could it be in any of these?

1605735391928.png
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Looks like they're in the config DB now.

List your tunables:

sqlite3 /data/freenas-v1.db 'SELECT * FROM system_tunable;'

The first column is the ID - disable the offender:

sqlite3 /data/freenas-v1.db 'UPDATE system_tunable SET tun_enabled = 0 WHERE id=X;'
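
If the ID is awkward to read off a noisy console, the same change can be keyed on the tunable's name instead. A sketch, assuming the column holding the name is called `tun_var` (`tun_enabled` comes from the command above; `disable_tunable` is an illustrative helper name):

```shell
# Disable a tunable in the config DB by variable name rather than by row ID.
# Assumption: the name column is tun_var; confirm with
#   sqlite3 /data/freenas-v1.db '.schema system_tunable'
disable_tunable() {
    db="$1"
    var="$2"
    # Double any single quotes so the name is safe inside the SQL string literal
    esc=$(printf '%s' "$var" | sed "s/'/''/g")
    sqlite3 "$db" "UPDATE system_tunable SET tun_enabled = 0 WHERE tun_var = '$esc';" || return 1
    # Print the affected rows so the result can be checked by eye
    sqlite3 "$db" "SELECT id, tun_var, tun_enabled FROM system_tunable WHERE tun_var = '$esc';"
}

# Usage: disable_tunable /data/freenas-v1.db hw.pci.enable_msi
```

No output from the final `SELECT` means no row matched that name, so nothing was changed.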
 

Sprint

Explorer
Joined
Mar 30, 2019
Messages
72
Looks like they're in the config DB now.

List your tunables:

sqlite3 /data/freenas-v1.db 'SELECT * FROM system_tunable;' | cut -d '|' -f 2,6 | sed s'/\|/ /g' | awk '{print $2" "$1}'

Disable the offender:

sqlite3 /data/freenas-v1.db 'UPDATE system_tunable SET tun_enabled = 0 WHERE id=X;'
I'm getting an "error: no such table: system_tunable". I had found that and tried it myself. Would the path be different as this is TrueNAS now, rather than FreeNAS?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I'm going to need to do a clean install. Clearly systems that upgraded from FN11 are acting differently from fresh TN12 installs, because I'm on 12.0-RELEASE and this is what I get:

Code:
root@freenas2:~ # sqlite3 /data/freenas-v1.db 'SELECT * FROM system_tunable;'
1|8388608|sysctl|8MB/s L2ARC Feed Rate|1|vfs.zfs.l2arc_write_max
2|25165824|sysctl|32MB L2ARC Boost Rate (8+24)|1|vfs.zfs.l2arc_write_boost
5|1|sysctl|Enable Persistent L2ARC|1|vfs.zfs.l2arc.rebuild_enabled
root@freenas2:~ # 
 

Sprint

Explorer
Joined
Mar 30, 2019
Messages
72
Ok, looks like you updated your post after I refreshed my page. When I run sqlite3 /data/freenas-v1.db 'SELECT * FROM system_tunable;' I get no result at all.

I did take a config backup at around 1pm today before I started messing around... If I just choose the "reset defaults" option, will that leave my data intact, so I can just restore that config file, and that should put me back to where I was? (Never had to do a restore before, but always practiced good due diligence and taken backups before c0cking about :smile: )

(Feel free to check my syntax. This was FreeNAS upgraded in situ to TrueNAS)
1605736821918.png
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Take a second backup of the current state (just in case) and then do the restore. Good practice making a backup.
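
With the web UI unreachable, one way to take that second backup is to copy the config DB from the console. A minimal sketch, assuming `/data/freenas-v1.db` (the file queried earlier in the thread) is the configuration to preserve, and that the destination is a directory the reset does not touch. `backup_config_db` and the destination path in the usage line are placeholders:

```shell
# Copy the config DB to a timestamped file and print the path of the copy.
# backup_config_db is a hypothetical helper name.
backup_config_db() {
    src="$1"
    destdir="$2"
    [ -f "$src" ] || { echo "config DB not found: $src" >&2; return 1; }
    mkdir -p "$destdir" || return 1
    # Timestamp the copy so repeated backups never overwrite each other
    dest="$destdir/$(basename "$src").$(date +%Y%m%d-%H%M%S)"
    cp -p "$src" "$dest" && echo "$dest"
}

# Usage (destination is a placeholder):
#   backup_config_db /data/freenas-v1.db /mnt/tank/config-backups
```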

Not sure why you don't have the config DB - I will have to do a clean install in a VM or something and see where that table has gotten to.
 

Sprint

Explorer
Joined
Mar 30, 2019
Messages
72
Ok, and we are back up and running. NVMe's have been pulled and are going back to Amazon (SATA drives ordered instead). Clearly a limitation on the ESX end. Will start playing with TrueNAS SCALE on my test rig, and in time may look to go that way and scrap ESX, then revisit the NVMe route later down the line :)

Thanks for everyone's input, but we're deeper down the rabbit hole than I'm prepared to go with ESX. It's already got a few little "hacks" to make it work, and messing with it further just risks making it more and more of a liability.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
This goes back to what I was hinting at up front. These are unfortunately complicated bits of software, and everything has to be lined up >< just so.
 

Sprint

Explorer
Joined
Mar 30, 2019
Messages
72
This goes back to what I was hinting at up front. These are unfortunately complicated bits of software, and everything has to be lined up >< just so.
Yeah, I had it fairly well balanced, but every time I updated either FreeNAS or ESX, I did it while holding my lucky rabbit's foot, stroking a black cat, touching a four-leaf clover, and making a blood sacrifice to the Gods of VMware, and I'll be honest, it's kinda wearing a bit thin, as today proves :rolleyes: So time to go away and look at options :smile: SCALE sounds promising!!
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
NVMe's have been pulled and are going back to Amazon (SATA drives ordered instead).
Are you sure PCI passthrough works for the controller they will end up attached to?
E.g. in my toy system the onboard SATA ports cannot be passed to a VM. Probably because it is one single controller and ESXi boots from that.
 

Sprint

Explorer
Joined
Mar 30, 2019
Messages
72
Are you sure PCI passthrough works for the controller they will end up attached to?
E.g. in my toy system the onboard SATA ports cannot be passed to a VM. Probably because it is one single controller and ESXi boots from that.
It appears that with NVMe cards, you pass each SSD's controller through independently; ESX sees each drive as its own entry (whereas the LSI cards and SATA controller get passed through en masse).

Having scrubbed through the VMware hardware support list, I'm fairly confident this is ESX/VMware not supporting what they would consider consumer-grade hardware, as in their eyes, this is an enterprise tool.

I'm currently playing with TrueNAS SCALE on my test server, and so far I'm impressed. I'm adding more VMs to test stability, and I can see this being the way forward, but I'll wait for it to go to Release first, so the NVMe drives will have to wait till then.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
It appears that with NVMe cards, you pass each SSD's controller through independently; ESX sees each drive as its own entry (whereas the LSI cards and SATA controller get passed through en masse).

Correct; the IOMMU passthrough happens on the PCI device level. Since your LSI HBA is a single PCI device, passing it through drags all of the connected drives with it as a unit. NVMe SSDs are individual PCI devices and have to be (get to be?) handled individually.
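
This is visible from inside the guest with `pciconf -l`, which prints one line per PCI device in the form `driverN@pci...`. A small sketch that tallies devices per driver from that output (`count_pci_by_driver` is an illustrative name). Each passed-through NVMe SSD should show up as its own `nvmeN` entry, while an HBA appears once regardless of how many disks hang off it (LSI SAS HBAs typically attach as `mps` or `mpr`):

```shell
# Count PCI devices per driver from `pciconf -l` output read on stdin.
# count_pci_by_driver is a hypothetical helper name.
count_pci_by_driver() {
    awk -F'@' '
        /@pci/ {
            drv = $1                   # e.g. "nvme0"
            sub(/[0-9]+$/, "", drv)    # strip the unit number, leaving "nvme"
            n[drv]++
        }
        END { for (d in n) print d, n[d] }
    ' | sort
}

# Usage on the FreeBSD/TrueNAS guest: pciconf -l | count_pci_by_driver
```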

Having scrubbed through the VMware hardware support list, I'm fairly confident this is ESX/VMware not supporting what they would consider consumer-grade hardware, as in their eyes, this is an enterprise tool.

It actually seems to be an issue specific to the combination of the FreeBSD NVMe driver and a virtual environment; from the looks of their bug tracker, they've been playing whack-a-mole with NVMe and drives that have wonky MSI/MSI-X behavior for months now. I wager if you make a Linux VM to test, your Samsung drives will behave there. Your Optane behaves fine under virtual FreeBSD as well, lending further support to that theory.

Sorry things didn't work out in this case, but it's good to know that things are back to a steady state and you're experimenting with SCALE.
 

Sprint

Explorer
Joined
Mar 30, 2019
Messages
72
Correct; the IOMMU passthrough happens on the PCI device level. Since your LSI HBA is a single PCI device, passing it through drags all of the connected drives with it as a unit. NVMe SSDs are individual PCI devices and have to be (get to be?) handled individually.



It actually seems to be an issue specific to the combination of the FreeBSD NVMe driver and a virtual environment; from the looks of their bug tracker, they've been playing whack-a-mole with NVMe and drives that have wonky MSI/MSI-X behavior for months now. I wager if you make a Linux VM to test, your Samsung drives will behave there. Your Optane behaves fine under virtual FreeBSD as well, lending further support to that theory.

Sorry things didn't work out in this case, but it's good to know that things are back to a steady state and you're experimenting with SCALE.

Yeah, I passed them directly into a Windows VM with no issue whatsoever, so that lends more evidence to that argument. I'll revisit NVMe storage later, hopefully once I've made the switch to TrueNAS SCALE. Question: could I take a config backup from TrueNAS CORE and import it into TrueNAS SCALE down the road? :wink:
 