Virtualized FreeNAS apparently crashing

Octopuss

Patron
Joined
Jan 4, 2019
Messages
461
I am running FreeNAS as a VM and never had a problem until very recently, when I repeatedly, yet seemingly at random, noticed all my torrents reporting missing files. I kept restarting services and the torrent server itself, only to find out that the NAS server was down.
I have absolutely no clue what's wrong as all other VMs run flawlessly.
Are there any specific logs I should check, or does anyone have any idea what might be going on?
I can provide more info about the setup, just ask (not sure what else to write right now).
 

Octopuss

Patron
Joined
Jan 4, 2019
Messages
461
Uh, damnit.
I googled the location of the ESXi VM logs, and I am not happy at all.
2020-03-07T17:55:40.569Z| vmx| I125: [msg.log.error.unrecoverable] VMware ESX unrecoverable error: (vmx)
2020-03-07T17:55:40.569Z| vmx| I125+ PCI passthru device 0000:02:00.0 caused an IOMMU fault type 6 at address 0xc0000000. Powering off the virtual machine. If the problem persists please contact the device's vendor.

Now what.
I don't think it should be related to cooling of the card (I wrote about that here), since I can keep my finger on the heatsink. The card in question is a Lenovo 03X4446 (LSI SAS9217-8i / 9207-8i).
But what could be happening? The server has been running perfectly fine for months. Of course I have recently updated FreeNAS, but I can't say whether it started before or after that. I'm inclined to say it was before.

What should I check? Any ideas?

Code:
2020-03-07T17:55:38.386Z| vmx| E105: PANIC: PCI passthru device 0000:02:00.0 caused an IOMMU fault type 6 at address 0xc0000000.  Powering off the virtual machine.  If the problem persists please contact the device's vendor. A core file is available in "/vmfs/volumes/5c59cbe9-a2ae909c-5a68-0025905f9eac/FreeNAS new/vmx-zdump.003"
2020-03-07T17:55:38.905Z| vmx| I125: Writing monitor file `vmmcores.gz`
2020-03-07T17:55:38.907Z| vmx| W115: Dumping core for vcpu-0
2020-03-07T17:55:38.907Z| vmx| I125: VMK Stack for vcpu 0 is at 0x451a18393000
2020-03-07T17:55:38.907Z| vmx| I125: Beginning monitor coredump
2020-03-07T17:55:38.979Z| mks| W115: Panic in progress... ungrabbing
2020-03-07T17:55:38.979Z| mks| I125: MKS: Release starting (Panic)
2020-03-07T17:55:38.979Z| mks| I125: MKS: Release finished (Panic)
2020-03-07T17:55:39.223Z| vmx| I125: End monitor coredump
2020-03-07T17:55:39.224Z| vmx| W115: Dumping core for vcpu-1
2020-03-07T17:55:39.224Z| vmx| I125: VMK Stack for vcpu 1 is at 0x451a15713000
2020-03-07T17:55:39.224Z| vmx| I125: Beginning monitor coredump
2020-03-07T17:55:39.537Z| vmx| I125: End monitor coredump
2020-03-07T17:55:40.569Z| vmx| I125: Printing loaded objects
2020-03-07T17:55:40.569Z| vmx| I125: [0xBF8B889000-0xBF8CA14A3C): /bin/vmx
2020-03-07T17:55:40.569Z| vmx| I125: [0xBFCD058000-0xBFCD05E630): /lib64/librt.so.1
2020-03-07T17:55:40.569Z| vmx| I125: [0xBFCD260000-0xBFCD261E90): /lib64/libdl.so.2
2020-03-07T17:55:40.569Z| vmx| I125: [0xBFCD464000-0xBFCD70088C): /lib64/libcrypto.so.1.0.2
2020-03-07T17:55:40.569Z| vmx| I125: [0xBFCD933000-0xBFCD99C6FC): /lib64/libssl.so.1.0.2
2020-03-07T17:55:40.569Z| vmx| I125: [0xBFCDBA7000-0xBFCDCBB37C): /lib64/libX11.so.6
2020-03-07T17:55:40.569Z| vmx| I125: [0xBFCDEC1000-0xBFCDED001C): /lib64/libXext.so.6
2020-03-07T17:55:40.569Z| vmx| I125: [0xBFCE0D2000-0xBFCE1B6341): /lib64/libstdc++.so.6
2020-03-07T17:55:40.569Z| vmx| I125: [0xBFCE3D5000-0xBFCE4D0B94): /lib64/libm.so.6
2020-03-07T17:55:40.569Z| vmx| I125: [0xBFCE6D2000-0xBFCE6E6BC4): /lib64/libgcc_s.so.1
2020-03-07T17:55:40.569Z| vmx| I125: [0xBFCE8E9000-0xBFCE900680): /lib64/libpthread.so.0
2020-03-07T17:55:40.569Z| vmx| I125: [0xBFCEB06000-0xBFCECAA8BC): /lib64/libc.so.6
2020-03-07T17:55:40.569Z| vmx| I125: [0xBF8CE35000-0xBF8CE549C0): /lib64/ld-linux-x86-64.so.2
2020-03-07T17:55:40.569Z| vmx| I125: [0xBFCEEB5000-0xBFCEECF634): /lib64/libxcb.so.1
2020-03-07T17:55:40.569Z| vmx| I125: [0xBFCF0D1000-0xBFCF0D295C): /lib64/libXau.so.6
2020-03-07T17:55:40.569Z| vmx| I125: [0xBFCF335000-0xBFCF4DA75C): /usr/lib64/vmware/plugin/objLib/upitObjBE.so
2020-03-07T17:55:40.569Z| vmx| I125: [0xBFCF788000-0xBFCF7A165C): /lib64/libz.so.1
2020-03-07T17:55:40.569Z| vmx| I125: [0xBFCF9A3000-0xBFCFB761F4): /usr/lib64/vmware/plugin/objLib/vsanObjBE.so
2020-03-07T17:55:40.569Z| vmx| I125: [0xBFD0081000-0xBFD008C758): /lib64/libnss_files.so.2
2020-03-07T17:55:40.569Z| vmx| I125: End printing loaded objects
2020-03-07T17:55:40.569Z| vmx| I125: Backtrace:
2020-03-07T17:55:40.569Z| vmx| I125: Backtrace[0] 000003514bc06280 rip=000000bf8bfc0d27 rbx=000000bf8bfc0820 rbp=000003514bc062a0 r12=0000000000000000 r13=0000000000000001 r14=0000000000000000 r15=0000000000000000
2020-03-07T17:55:40.569Z| vmx| I125: Backtrace[1] 000003514bc062b0 rip=000000bf8ba69420 rbx=000003514bc062d0 rbp=000003514bc067b0 r12=000000bf8ccdf8d0 r13=0000000000000001 r14=0000000000000000 r15=0000000000000000
2020-03-07T17:55:40.569Z| vmx| I125: Backtrace[2] 000003514bc067c0 rip=000000bf8bb97e89 rbx=000003514bc06800 rbp=000003514bc067d0 r12=000000bf8bb97cc0 r13=000000bf8d2e4610 r14=0000000000000000 r15=0000000000000000
2020-03-07T17:55:40.569Z| vmx| I125: Backtrace[3] 000003514bc067e0 rip=000000bf8bb0d934 rbx=000003514bc06800 rbp=000003514bc06a30 r12=000000bf8bb97cc0 r13=000000bf8d2e4610 r14=0000000000000000 r15=0000000000000000
2020-03-07T17:55:40.569Z| vmx| I125: Backtrace[4] 000003514bc06a40 rip=000000bf8ba64110 rbx=000000bfcf2d6010 rbp=000003514bc06ab0 r12=000000bfcf2d6010 r13=000000bf8d2e85f0 r14=0000000000000000 r15=0000000000000000
2020-03-07T17:55:40.569Z| vmx| I125: Backtrace[5] 000003514bc06ac0 rip=000000bf8ba6466d rbx=000003514bc06b00 rbp=000003514bc09b50 r12=000000bfcf2d6010 r13=0000000000000001 r14=0000000000000000 r15=0000000000000007
2020-03-07T17:55:40.569Z| vmx| I125: Backtrace[6] 000003514bc09b60 rip=000000bf8ba64a59 rbx=00000000000f423a rbp=000003514bc09c00 r12=0000000132afcccd r13=000000bf8d35bf20 r14=000000bf8d362160 r15=000000bfcf2d6010
2020-03-07T17:55:40.569Z| vmx| I125: Backtrace[7] 000003514bc09c10 rip=000000bf8ba6966b rbx=000000bf8ccdf900 rbp=000003514bc09c60 r12=000000bf8d1f2aa0 r13=0000000000000000 r14=000000bf8d2e4610 r15=0000000000000000
2020-03-07T17:55:40.569Z| vmx| I125: Backtrace[8] 000003514bc09c70 rip=000000bf8ba5e37c rbx=0000000000000003 rbp=000003514bc09cf0 r12=0000000000000000 r13=000000bf8c5580ae r14=000000bf8ca17040 r15=0000000000000000
2020-03-07T17:55:40.569Z| vmx| I125: Backtrace[9] 000003514bc09d00 rip=000000bfceb2797d rbx=0000000000000000 rbp=0000000000000000 r12=000000bf8ba5ed04 r13=000003514bc09dc8 r14=0000000000000000 r15=0000000000000000
2020-03-07T17:55:40.569Z| vmx| I125: Backtrace[10] 000003514bc09dc0 rip=000000bf8ba5ed2d rbx=0000000000000000 rbp=0000000000000000 r12=000000bf8ba5ed04 r13=000003514bc09dc8 r14=0000000000000000 r15=0000000000000000
2020-03-07T17:55:40.569Z| vmx| I125: Backtrace[11] 000003514bc09dc8 rip=0000000000000000 rbx=0000000000000000 rbp=0000000000000000 r12=000000bf8ba5ed04 r13=000003514bc09dc8 r14=0000000000000000 r15=0000000000000000
2020-03-07T17:55:40.569Z| vmx| I125: SymBacktrace[0] 000003514bc06280 rip=000000bf8bfc0d27 in function (null) in object /bin/vmx loaded at 000000bf8b889000
2020-03-07T17:55:40.569Z| vmx| I125: SymBacktrace[1] 000003514bc062b0 rip=000000bf8ba69420 in function (null) in object /bin/vmx loaded at 000000bf8b889000
2020-03-07T17:55:40.569Z| vmx| I125: SymBacktrace[2] 000003514bc067c0 rip=000000bf8bb97e89 in function (null) in object /bin/vmx loaded at 000000bf8b889000
2020-03-07T17:55:40.569Z| vmx| I125: SymBacktrace[3] 000003514bc067e0 rip=000000bf8bb0d934 in function (null) in object /bin/vmx loaded at 000000bf8b889000
2020-03-07T17:55:40.569Z| vmx| I125: SymBacktrace[4] 000003514bc06a40 rip=000000bf8ba64110 in function (null) in object /bin/vmx loaded at 000000bf8b889000
2020-03-07T17:55:40.569Z| vmx| I125: SymBacktrace[5] 000003514bc06ac0 rip=000000bf8ba6466d in function (null) in object /bin/vmx loaded at 000000bf8b889000
2020-03-07T17:55:40.569Z| vmx| I125: SymBacktrace[6] 000003514bc09b60 rip=000000bf8ba64a59 in function (null) in object /bin/vmx loaded at 000000bf8b889000
2020-03-07T17:55:40.569Z| vmx| I125: SymBacktrace[7] 000003514bc09c10 rip=000000bf8ba6966b in function (null) in object /bin/vmx loaded at 000000bf8b889000
2020-03-07T17:55:40.569Z| vmx| I125: SymBacktrace[8] 000003514bc09c70 rip=000000bf8ba5e37c in function main in object /bin/vmx loaded at 000000bf8b889000
2020-03-07T17:55:40.569Z| vmx| I125: SymBacktrace[9] 000003514bc09d00 rip=000000bfceb2797d in function __libc_start_main in object /lib64/libc.so.6 loaded at 000000bfceb06000
2020-03-07T17:55:40.569Z| vmx| I125: SymBacktrace[10] 000003514bc09dc0 rip=000000bf8ba5ed2d in function (null) in object /bin/vmx loaded at 000000bf8b889000
2020-03-07T17:55:40.569Z| vmx| I125: SymBacktrace[11] 000003514bc09dc8 rip=0000000000000000
2020-03-07T17:55:40.569Z| vmx| I125: Msg_Post: Error
2020-03-07T17:55:40.569Z| vmx| I125: [msg.log.error.unrecoverable] VMware ESX unrecoverable error: (vmx)
2020-03-07T17:55:40.569Z| vmx| I125+ PCI passthru device 0000:02:00.0 caused an IOMMU fault type 6 at address 0xc0000000.  Powering off the virtual machine.  If the problem persists please contact the device's vendor.
2020-03-07T17:55:40.569Z| vmx| I125: [msg.panic.haveLog] A log file is available in "/vmfs/volumes/5c59cbe9-a2ae909c-5a68-0025905f9eac/FreeNAS new/vmware.log". 
2020-03-07T17:55:40.569Z| vmx| I125: [msg.panic.requestSupport.withoutLog] You can request support. 
2020-03-07T17:55:40.569Z| vmx| I125: [msg.panic.requestSupport.vmSupport.vmx86]
2020-03-07T17:55:40.569Z| vmx| I125+ To collect data to submit to VMware technical support, run "vm-support".
2020-03-07T17:55:40.569Z| vmx| I125: [msg.panic.response] We will respond on the basis of your support entitlement.
2020-03-07T17:55:40.569Z| vmx| I125: ----------------------------------------
2020-03-07T17:55:40.570Z| vmx| I125: Exiting
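For anyone hunting these down later: the per-VM log sits next to the VM's .vmx file on the datastore, as the `msg.panic.haveLog` line above shows. A minimal sketch of pulling the interesting lines out of such a log without wading through the backtrace (the helper name and grep pattern are my own, not an ESXi tool):

```shell
# Hypothetical helper: pull IOMMU-fault and panic lines out of a vmware.log.
# Pass the log's path, e.g. /vmfs/volumes/<datastore>/<vm-name>/vmware.log
scan_vm_log() {
  grep -E 'IOMMU fault|PANIC|unrecoverable' "$1"
}
```

Run against the log above, this surfaces the `PCI passthru device 0000:02:00.0 caused an IOMMU fault type 6` lines directly.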
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Hopefully you followed the instructions as outlined in

https://www.ixsystems.com/community...ide-to-not-completely-losing-your-data.12714/

Execute your fallback plan and run FreeNAS directly on the bare metal. This will generally help isolate the fault without the complexity of having a hypervisor sitting in the way. As you noticed, hypervisors get tetchy when things aren't operating exactly right.

You haven't outlined your hardware platform so more specific advice is a bit difficult to provide.
 

Octopuss

Patron
Joined
Jan 4, 2019
Messages
461
Yeah, let's throw this €1000 server I built myself out the window, especially after it's been running flawlessly for almost a year until now. What a solution :rolleyes:
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I think you should read his post again..

Perhaps he needs a little more clue.

YOUR HYPERVISOR IS SHUTTING OFF YOUR VM.

Your options are:

1) Do as you suggest and toss out your pricey server (your suggestion, incidentally, not mine, don't put words in my mouth, I get mean about it).

2) Randomly thrash around trying random remediations.

3) Follow my suggestion and simplify the scenario by ditching the hypervisor.

See, FreeNAS can be installed on a USB thumb drive, and if you follow the design principles in the guide I linked to, you can boot a bare metal version of FreeNAS very easily, basically for the cost of a thumb drive.

Now this probably won't FIX your NAS but it will probably help to highlight what the underlying problem is. Perhaps your HBA is baked, if you didn't provide sufficient airflow over the heatsink. Perhaps the heatsink became detached. Perhaps the heatsink grease needs to be replaced. Given the error I'm really thinking your HBA is baked, but it could be other things.

Putting FreeNAS on the bare metal gives FreeBSD a better chance of continuing to run and bringing out a more familiar set of failure symptoms. It doesn't mean that you can't pull the USB key out once it's fixed and go back to a virtualized setup, once the underlying problem is fixed. I actually wrote that sticky EXACTLY FOR YOUR PROBLEM CASE.

Once you are running on bare metal, you are much more likely to get useful responses from the userbase here in the forums.

See, you walked in here and the person who wrote all the stickies about virtualization (that'd be me!) gave you a suggestion, because I developed the remediation strategy for your situation well more than half a decade ago. I do this stuff professionally. I gave you some really good advice. There's no need to get snarky and eyeroll'y. You either follow the guidance and it will probably be much easier to resolve the underlying issue, or you can randomly try random things hoping to strike upon the problem by dumb luck. And the guidance is really simple: go back to bare metal and do the problem solving there.

My strong feeling is that there's a problem with the HBA. A lesser possibility is that something has changed in the environment, BIOS update, ESXi update, FreeNAS update, firmware update, added or removed vCPU's, etc.
 

Octopuss

Patron
Joined
Jan 4, 2019
Messages
461
See, FreeNAS can be installed on a USB thumb drive, and if you follow the design principles in the guide I linked to, you can boot a bare metal version of FreeNAS very easily, basically for the cost of a thumb drive.
Oh, right, I completely forgot about this possibility.
It might not have been clear enough (to me) in your post, but I guess I understand how you meant it - as a test, not permanently. Sorry for being snarky. I'm not a native speaker and sometimes I miss the point.

The big problem is it seems to be completely random, triggered by who knows what. It only happened like four times in total in the time frame of maybe ten days or more - I don't even know for sure. And I don't have another physical machine to try this on, unfortunately.
I got a response on another forum from someone who has a few boxes and one of them did exactly this after 11.3 update.
 

Octopuss

Patron
Joined
Jan 4, 2019
Messages
461
I've restarted the server today and it has happened three times in a row again.
But I can't test this at all, because it didn't crash once since I last posted here :(
 

cooldude919

Cadet
Joined
Oct 18, 2020
Messages
1
I've restarted the server today and it has happened three times in a row again.
But I can't test this at all, because it didn't crash once since I last posted here :(
Did you ever figure this out? I've started running into this recently. I started trying to pass through a GPU to a different VM in the past few weeks and began running into this. I put everything back and took it out of the machine, and it's still there. Looking back, I went to 11.2 U8 back in May and didn't have any problems until recently.

I've tried a lot of things since then, but it still happens randomly. I was in a similar boat: this machine was built in 2017 and this didn't pop up until the last 3 weeks or so, when I started messing with things, go figure!
 

Octopuss

Patron
Joined
Jan 4, 2019
Messages
461
Oh my, I don't even remember.
I think it must have been a bug in ESXi related to the HBA. It hasn't happened for a long time now, so I presume one of the later updates fixed it. I am on ESXi 6.7, btw.
 

samuel-emrys

Contributor
Joined
Dec 14, 2018
Messages
136
And I don't have another physical machine to try this on, unfortunately.
I realise that you said you've fixed this, but for future reference I believe the suggestion was to install FreeNAS on a USB, and then boot directly from this USB to use it with your existing hardware. No additional physical hardware required.
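For the record, a sketch of that route (image name and device path below are placeholders, not something from this thread - always verify the target device first): write the FreeNAS installer image to a spare USB stick from another machine, boot the server from it, and install onto a second stick. The ESXi boot device and datastores stay untouched, so you can pull the stick afterwards and boot back into the hypervisor.

```shell
# Sketch: write a FreeNAS installer image to a USB stick.
# The image name and /dev/sdX are placeholders; the block-device check
# guards against clobbering a regular file if the path is mistyped.
write_installer() {
  local iso="$1" dev="$2"
  if [ ! -b "$dev" ]; then
    echo "refusing: $dev is not a block device" >&2
    return 1
  fi
  dd if="$iso" of="$dev" bs=4M conv=fsync
}
```

Usage would be something like `write_installer FreeNAS-installer.iso /dev/sdX`, after double-checking which device the stick is.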


I started to try and passthrough a GPU to a different VM in the past few weeks and started running into this.
Have you followed @jgreco 's suggestion of running FreeNAS bare metal to isolate the issue?
 

Octopuss

Patron
Joined
Jan 4, 2019
Messages
461
So this started happening out of the blue after two years.
I really have no idea what to do now, any ideas? Maybe buy a different HBA (any tips?)? This time the damn VM won't stay up for even a few hours; it just crashed pretty soon.

P.S. I cannot test this baremetal.
 

Octopuss

Patron
Joined
Jan 4, 2019
Messages
461
My server is clearly creepy.
The problem went away the moment I decided to start looking for a new HBA (which I have since bought but haven't installed yet).
 

sybreeder

Explorer
Joined
Aug 15, 2013
Messages
75
So this started happening out of the blue after two years.
I really have no idea what to do now, any ideas? Maybe buy a different HBA (any tips?)? This time the damn VM won't stay up for even a few hours; it just crashed pretty soon.

P.S. I cannot test this baremetal.
Best method to diagnose, for the future:
1. Memtest your RAM
2. Reseat the RAM in its sockets
3. Test CPU stability
4. Carefully reseat the CPU in its socket
5. Maybe your HBA is overheating and giving errors because of that. Replace the thermal paste (with MX-4, for example) and add a fan.
You can also check if there is a newer BIOS for the motherboard.
 

Octopuss

Patron
Joined
Jan 4, 2019
Messages
461
Best method to diagnose, for the future:
1. Memtest your RAM
2. Reseat the RAM in its sockets
3. Test CPU stability
4. Carefully reseat the CPU in its socket
5. Maybe your HBA is overheating and giving errors because of that. Replace the thermal paste (with MX-4, for example) and add a fan.
You can also check if there is a newer BIOS for the motherboard.
This is completely ridiculous. The problem appeared out of the blue, then disappeared by itself, showed up again two years later, and disappeared completely randomly, and you're telling me to reseat the CPU and other nonsense?
This sounds like one of those replies on Microsoft forums made by copypasting Indians.

Maybe you should read more than the last post in a thread? Or not post at all.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
This is completely ridiculous. The problem appeared out of the blue, then disappeared by itself, showed up again two years later, and disappeared completely randomly, and you're telling me to reseat the CPU and other nonsense?
This sounds like one of those replies on Microsoft forums made by copypasting Indians.

Maybe you should read more than the last post in a thread? Or not post at all.

Participants are reminded that community members are not paid support staff and are kindly taking time out of their day to try to offer suggestions to help with your problem. Please bear this in mind when responding to someone, even if you do not feel their input to be correct.

From my perspective, as someone who has helped at least many hundreds of people with virtualization quandaries, I might very well have suggested at LEAST repasting the CPU heatsink and checking for heat issues. Thermal paste does break down over time and needs occasional replacing. Many of the HBA's in use these days are now at about the 10 year old mark, and seasonal temperature variations in the NAS's physical environment can absolutely cause a problem to appear out of the blue, later disappear, and then show up again two years later. As someone who does hardware professionally, I respect this line of thinking because I've seen it again and again over the years.

I would also add that in addition to repasting the HBA CPU and host CPU, it would be prudent to check the power supply. If it is older than five years, do consider trying a different PSU. Low end PSU's tend to bake their innards and ruin the electrolytics and other components.
 

Octopuss

Patron
Joined
Jan 4, 2019
Messages
461
If I had at least somewhat persistent and repeated problems, then by all means. But not when something works perfectly for a year, then acts up for a few days, and then starts acting up again two years later for two days. That just doesn't add up and isn't indicative of anything more than some super bizarre random stuff.

I should have made it perfectly clear that I take good care of my hardware, but I kind of thought repasting a cooler every year, vacuuming the dust, cleaning fans, etc. was such a bare minimum that everyone was doing it. I guess I was wrong?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
If I had at least somewhat persistent and repeated problems, then by all means. But not when something works perfectly for a year, then acts up for a few days, and then starts acting up again two years later for two days. That just doesn't add up and isn't indicative of anything more than some super bizarre random stuff.

Nah, it happens. And worse yet it sometimes happens on equipment that's very far away. I've got a big hypervisor 2500 miles away that has intermittently been having troubles. Very frustrating because it crashed, mysteriously, PSOD, reset, weeks pass, crash again, PSOD, at which point my eyebrows raise, reset again, runs for quite a bit longer, and then it PSOD's again. It's very difficult to diagnose faults that are excessively intermittent but there is obviously something wrong, and here in the shop, we do have a history of benching gear, sometimes for months, before redeploying it.

I've found that yanking stuff apart and reseating it all, cables, PCIe cards, DIMM's, CPU's, etc., is actually a strategy that works. You might never get to know if it is a bit of excessive vibration from a chassis fan that causes a CPU pin to intermittently fail. But I've seen it happen and it sucks.

I could have made myself perfectly clear that I take good care about my hardware, but I kind of thought repasting a cooler every year, vacuuming the dust, cleaning fans etc. etc. was such bare minimum that everyone was doing it. I guess I was wrong?

I've spent a quarter of a century in data centers, and in that time, I've seen a few people with vacuums, one fellow with a MetroVac DataVac (and only because I gave it to him), but I'm the only person I've seen stocking shop materials to clean and repaste heatsinks. I've seen canned air now and then (real bad idea by the way). I'm sometimes hard on the gamer community for their hardware choices, but one of the things that some of them get right is that they're paying attention to things like paste and temps.
 

Octopuss

Patron
Joined
Jan 4, 2019
Messages
461
Well, in my short IT support career I was regularly dealing with shit like this, so why am I even surprised, actually...

Also, I apologise, @sybreeder.
 

Attachments

  • humusPC1.jpg (285.9 KB)
  • humusPC2.jpg (118.9 KB)
  • server2.jpg (92.7 KB)
  • server3.jpg (228.8 KB)