TrueNAS 12.x constantly crashing

exwhywhyzee

Dabbler
Joined
Sep 13, 2021
Messages
30
Hello Everyone,

This is my first post. I am new to TrueNAS; I recently installed it when I migrated over to a new server. My knowledge of both FreeBSD and TrueNAS is limited, so I simply followed online guides for the installation.

The server specs are as follows:

ESXi 6.7U3
TrueNAS 12.1

Supermicro X8DT6
2 x Intel(R) Xeon(R) CPU L5640
48 GB non-ECC DDR3 (32 GB assigned to TrueNAS)
OCZ 60GB SSD (50GB provisioned for TrueNAS)
LSI 9207e (latest v20 IT firmware)
I have a dual-port and a quad-port Intel-based NIC (I don't know the model numbers)
I also have a Radeon 5400-based card

I have two ZFS pools:
6 x HGST 4TB enterprise drives (HMS5C4040BL) in RAIDZ2
2 x WD Green 1.5TB drives, mirrored

The system constantly reboots, anywhere from every 1 to 5 hours. While I'd rather not have the system rebooting at all, the main issue is that every reboot breaks the iSCSI connection between the TrueNAS server and a Win2k16 domain controller (I use iSCSI for folder redirection). The connections are not automatically re-established; I have to re-create the iSCSI shares on the TrueNAS server and delete and re-create the target connection on the Win2k16 server.

In an attempt to resolve the issue, I upgraded to the latest firmware for the LSI HBA, reseated the cable carrying the iSCSI traffic, and even ordered some ECC DDR3 RAM, because I read that the non-ECC stuff can be problematic.
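
In case it helps anyone checking the same thing: the flashed firmware version can be confirmed from the TrueNAS shell with LSI's sas2flash utility, which ships with TrueNAS (the output format varies a bit by card):

    sas2flash -listall    # lists each LSI SAS2 HBA with its firmware (e.g. 20.00.07.00) and BIOS versions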

Any and all help will be greatly appreciated.

I've attached the output from dmesg.
 

Attachments

  • sc_sh-1.png (79.5 KB)
  • sc_sh-2.png (186.3 KB)
  • sc_sh-3.png (168.9 KB)
  • sc_sh-4.png (59.6 KB)

exwhywhyzee

Dabbler
Joined
Sep 13, 2021
Messages
30
Oh, one more addition that I forgot to mention, and which I'd have put in the original message if I were able to edit it: both pools have an Intel 120GB SSD (SSDSC2CT06) as the cache device.
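
If it helps with diagnosis, the cache vdevs show up under their own "cache" heading in the pool status output (the pool name below is just a placeholder):

    zpool status tank    # the L2ARC SSD is listed in a separate "cache" section of the vdev tree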
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
You're running TrueNAS under ESXi 6.7 on a Supermicro X8 system?

That's not expected to work. People have always had problems with PCI passthru on the X8 systems, and they often manifest similarly to what you describe, random freakouts several times a day.

Ditch the ESXi and run TrueNAS on the bare metal and it should be fine.
 

exwhywhyzee

Dabbler
Joined
Sep 13, 2021
Messages
30
Thank you for your reply.

I have to say that this is the answer I was really hoping I wouldn't get. My idea was to consolidate everything onto a single server to reduce the power footprint (along with the space footprint). For budgetary reasons, I'm quite limited in my options. I do have an Intel S5520SC that I was planning on selling, but I could swap it in for the Supermicro board. Would this be a solution, or will I experience the same issues as with this Supermicro board?
 

exwhywhyzee

Dabbler
Joined
Sep 13, 2021
Messages
30
Would it make any difference if I downgraded to TrueNAS 11? I read a couple of forum posts stating that moving from v11 to v12 resulted in the kind of error I'm seeing.

If that won't make a difference, what about swapping out ESXi for something else like Proxmox (which I was looking to do anyway)? Or is this an issue with my platform and with virtualization and pass-through in general?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
what about swapping out ESXi for something else like Proxmox
Proxmox is more of an unknown... you may find it is better, but you may not. There is far more experience out there to tell you how it works with ESXi (but in this case it's telling you the mix you have isn't good).

What are you getting from ESXi that you wouldn't be able to replace by using TrueNAS on bare-metal?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
It's hard to know for certain.

Virtualization was so dodgy and risky back in the early parts of last decade, when people were coming in with all sorts of used Nehalem and Westmere gear (or older), that I ultimately had to write a post:

https://www.truenas.com/community/t...nas-in-production-as-a-virtual-machine.12484/

Because people were doing a variety of stuff to try to accomplish what you're trying; sometimes they jumped right to the really bad stuff like RDM. In that era, basically no consumer/desktop/gaming/workstation boards implemented PCI passthru correctly, and the experience was that there was only a limited selection of server boards where it might work fine.

In particular, Supermicro started getting their shit together with the X9SCL/X9SCM and X9DRL boards, where, with the latest BIOS, properly crossflashed cards, and specific versions of ESXi 4.1, it worked pretty reliably. But even today with ESXi 6.7, very occasionally (as in, I saw one example earlier this month) there is still a very low incidence of the doorbell handshake bug, which requires a restart of the hypervisor to clear, and this on a hypervisor that has been supporting PCIe passthru for a decade now.

I would say that things got better across the board with the Haswell and newer generations (Supermicro X10+), where not only Supermicro but also Dell and HP systems seemed to have a much greater chance at success. Not 100%. But it wasn't like the massive failure rates of Nehalem/Westmere systems, or the highly constrained compatibility of Sandy/Ivy. I also don't really know what the actual problem is, though my suspicion is that the hardware itself is actually capable of it and that the BIOS simply isn't configuring it properly, because we saw a bunch of examples of similar behaviour with Supermicro X9 (Sandy/Ivy) systems not working that were later remediated via a BIOS update.

I think I've probably spent more hours and words on these forums supporting virtualization than anybody, so while I hate to give that sorta-non-technically-complete answer, it is worth noting that I've been involved in most "failed virtualization" conversations and that I've tried to correlate what I can.

Would it make any difference if I downgraded to TrueNAS 11?

Doubtful. Not only because there isn't a TrueNAS 11 (heh), but because this has been a problem going all the way back to the very first iX FreeNAS 8.0 versions. This doesn't seem to be limited to FreeBSD, as people have experienced similar problems with PCIe passthru on Linux on the same platforms. In general, FreeBSD has integrated patches that *FIX* edge cases with PCIe passthru, increasing compatibility as new versions come out, and serious hardware hackers have identified workarounds for specific issues. I mean, don't let me discourage you, but.

I do have an Intel S5520SC that I was planning on selling, but I could swap it in for the Supermicro board. Would this be a solution, or will I experience the same issues as with this Supermicro board?

That's a Nehalem board? In general, I don't recall any examples of this working, and I had a strong sense that Intel wasn't serious about supporting this. I believe VT-d was first generation on the Nehalem boards, which might explain some of the issues.

swapping out ESXi for something else like Proxmox (which I was looking to do anyway)?

Well, ah, they haven't even gotten around to officially supporting PCIe passthru yet. According to their own documentation, it is listed as:

Note: PCI passthrough is an experimental feature in Proxmox VE!

There's no reason to expect that a new entrant in the hypervisor races, one that hasn't even formally supported PCIe passthrough yet, is going to fix problems that the Cadillac of hypervisors, ESXi, with its massive engineering team and a decade of official support for the feature, hasn't been able to resolve. Possible. But unlikely.
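
For completeness, here's the general shape of that experimental setup as their wiki described it at the time; treat it as a sketch, not an endorsement, and note the PCI address and VM ID are placeholders:

    # /etc/default/grub: enable the IOMMU on Intel hardware, then run update-grub
    GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on"
    # /etc/modules: load the VFIO modules at boot
    vfio
    vfio_iommu_type1
    vfio_pci
    vfio_virqfd
    # then hand the device at PCI address 01:00.0 to VM 100
    qm set 100 -hostpci0 01:00.0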

Overall, maybe two years ago, Proxmox was basically unusable for anything but trite FreeBSD VM's. They've actually come quite a long way, and as I have some axes to grind with VMware, I would like nothing more than to see a competitive hypervisor. But I'm thinking I'd rather that be TrueNAS itself, even if it means I'll need to wait another year for it to be fully rounded out. The advantages of using TrueNAS itself as a hypervisor would be that you wouldn't have to deal with any of the VT-d or resourcing issues; you just run your other VM's on top of TrueNAS and call it a day. But I think that's still a ways off.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
While I have never used VMs under TrueNAS (or FreeNAS in my case), several people have done so. Depending on the requirements, you could try to run TrueNAS on bare metal and your other systems as VMs. And then there is of course TrueNAS Scale on the horizon, as @jgreco mentioned ...
 

exwhywhyzee

Dabbler
Joined
Sep 13, 2021
Messages
30
What are you getting from ESXi that you wouldn't be able to replace by using TrueNAS on bare-metal?

Consolidation. I wanted to keep everything on a single server.

It's hard to know for certain.
Thank you for a brilliant post. I learned quite a bit from reading about your experiences.

Depending on the requirements, you could try to run TrueNAS on bare metal and your other systems as VMs.
This is how I ran things before the attempted consolidation. I really hate to bring it back to this, but maybe I can get by with a very-low-voltage Xeon (E3-1220L v2) on a Q87, which I happen to have.
 

exwhywhyzee

Dabbler
Joined
Sep 13, 2021
Messages
30
Here's the thing: while crashes are not something I want, if there were somehow a way to resolve the Win2k16 initiator to TrueNAS target issue that I mentioned in the original post, the situation would be somewhat bearable.

Given that I'm using folder redirection, whenever I lose the connection no one can access their files, and it takes a really, really long time to get things sorted out. For some reason, even after I delete the settings and relationships, the reconnection doesn't happen right after I re-create the iSCSI settings; the connection fails for a long time before it finally succeeds.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
This is how I ran things before the attempted consolidation. I really hate to bring it back to this, but maybe I can get by with a very-low-voltage Xeon (E3-1220L v2) on a Q87, which I happen to have.
Can you elaborate which problems you had? Just out of curiosity ...
 

exwhywhyzee

Dabbler
Joined
Sep 13, 2021
Messages
30
Can you elaborate which problems you had? Just out of curiosity ...
Absolutely. I was running my NAS off an AMD FM1 CPU (I forget which one at the moment), and every time I copied files over (moving files to new locations with folder redirection, for example), all 4 cores would peak and the system would lock up. I assumed the system was very underpowered.

I started piecing together my current system so that I could transition to a single bare-metal hypervisor and consolidate everything into one piece of hardware (sans the storage array).

As an update, I upgraded the BIOS to the latest version and re-seated the RAM. The system ran for a great many hours without crashing. Alas, this wasn't meant to last: the VM crashed again a couple of hours ago, and when it came back up, I lost the iSCSI connection.

Again, if someone can help me with sorting out this iSCSI issue, I'd really appreciate it. I don't know why iSCSI just stops working after the VM crashes.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
Hm, but isn't the issue of not-enough-power solved with your new system?
 

exwhywhyzee

Dabbler
Joined
Sep 13, 2021
Messages
30
Hm, but isn't the issue of not-enough-power solved with your new system?
Exactly. This was the idea. Two L5640s were going to do the trick.

Since the BIOS update and the RAM reconfiguration, it seems I've only had one crash, but now the issue of the Win2k16 (and Win2k12r2) machines not being able to connect to the TrueNAS iSCSI target ("initiator failed to connect to the target") is front and center.

Should I create a new thread discussing this issue? I only found a post from back in 2012 that discussed a similar issue, and no resolution was given.
 

olddog9

Dabbler
Joined
Apr 23, 2021
Messages
28
exwhywhyzee,

Let's check the basic hardware. The hardware may be healthy; however, if it is NOT error free, it really doesn't matter much what is done to the software configuration.

If I understand correctly, the X8DT6 dates back to 2011. Is the power supply also 10 years old?
Old power supplies with age-degraded components are notorious for intermittently crashing systems, typically during intense CPU computation or high disk I/O (which is a symptom you posted).

Although the consolidation is convenient for operation, the extra hardware (LSI 9207, Radeon 5400, etc.) adds to the electrical load on the power supply. What is its normal and maximum amperage and/or wattage? Add up the required amperage of all the installed hardware to confirm it is lower than the maximum available from the power supply, preferably no more than 60 to 70% of the maximum amperage available.
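
As a made-up example: if the 12 V rail is rated for 60 A, the summed draw of the drives, HBA, NICs, and GPU should stay under roughly 36 to 42 A (60 to 70% of 60 A).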

If the power supply is adequate, then next run intense memory tests. And yes, USE the ECC memory.

Next, is the onboard LSI 2008 controller activated in the BIOS, AND does it have disks attached? The reason I ask is that the screenshots show SCSI targets numbered up into the teens (13, 14, 15, ... 20). I presume these are hanging off the LSI 9207 (e ... or is that an ... 8e?). Could there be any configuration conflict between the onboard controller and the 9207?
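
A quick way to see which controller each disk hangs off, using the stock FreeBSD CAM tooling that TrueNAS ships with:

    camcontrol devlist -v    # lists each scbus with its driver instance (mps0, mps1, ...) and the disks attached to it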

Also, if at all possible, use the onboard VGA port and remove the Radeon 5400. That saves a bit of electrical power and avoids a possible small risk of a hardware IRQ conflict.

Checking these items will eliminate a lot of possible troubles.
 

olddog9

Dabbler
Joined
Apr 23, 2021
Messages
28
jgreco,

I used the release date of April 4, 2011 from version 1.1a of the X8DT6 Supermicro user manual. A search of the Supermicro site did not reveal an earlier version, but I trust you are correct about an earlier 1.x version. A motherboard (with leaky capacitors) or a power supply built in 2008 or 2009 is older still and would concern me even more.

 

exwhywhyzee

Dabbler
Joined
Sep 13, 2021
Messages
30
Let's check the basic hardware. The hardware may be healthy; however, if it is NOT error free, it really doesn't matter much what is done to the software configuration.

If I understand correctly, the X8DT6 dates back to 2011. Is the power supply also 10 years old?
Old power supplies with age-degraded components are notorious for intermittently crashing systems, typically during intense CPU computation or high disk I/O (which is a symptom you posted).
Good points. While the tech is old, the board was "new", or rather unused. The power supply is 900W and new; I purchased a new (or, again, unused) Supermicro EATX case which included the gold-rated 900W supply.

Although the consolidation is convenient for operation, the extra hardware (LSI 9207, Radeon 5400, etc.) adds to the electrical load on the power supply. What is its normal and maximum amperage and/or wattage? Add up the required amperage of all the installed hardware to confirm it is lower than the maximum available from the power supply, preferably no more than 60 to 70% of the maximum amperage available.
I definitely want to pull the Radeon card out. The MB has an embedded Matrox GPU, which would be best to use instead of the Radeon.

Next, is the onboard LSI 2008 controller activated in the BIOS, AND does it have disks attached? The reason I ask is that the screenshots show SCSI targets numbered up into the teens (13, 14, 15, ... 20). I presume these are hanging off the LSI 9207 (e ... or is that an ... 8e?). Could there be any configuration conflict between the onboard controller and the 9207?
The MB doesn't have an internal SAS controller (it's the model without this feature). The 9207-8e is an added device.

If the power supply is adequate, then next run intense memory tests. And yes, USE the ECC memory.
I plugged in these modules, only to get a series of beeps. Unfortunately, these 32GB modules are incompatible with this MB. I'll need to go back to the drawing board to find some that will work.
 

exwhywhyzee

Dabbler
Joined
Sep 13, 2021
Messages
30
Just a point to raise about the loss of the iSCSI target capability on every crash.

Even though I set the iSCSI service to "start automatically", something doesn't seem to work. After every crash, I need to manually kill the ctld process and re-run it; when I do, the TrueNAS target accepts connections from the Windows Server initiators again.
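
For reference, this is roughly what I'm doing from the shell; restarting via the rc script instead of killing the PID may be cleaner (the "one" prefix forces the action even if the service isn't marked enabled in rc.conf):

    service ctld status         # check whether the iSCSI target daemon is actually running
    service ctld onerestart     # stop and restart ctld, re-reading /etc/ctl.conf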

Does anyone have any idea what's going on?
 

olddog9

Dabbler
Joined
Apr 23, 2021
Messages
28
exwhywhyzee,

A new 900 watt power supply should be 'good to go'. Although the motherboard is unused, sometimes the electrolytic capacitors still degrade beyond operating margins (they might look like little fat cans near the processor socket; look for bulging sides or leaks of a light-colored, caulk-like goo). Years back, there was a bad batch in a lot of motherboards; I remember buying new replacement capacitors from Mouser Electronics for $10 per motherboard. But of course, that may NOT be a problem with your board.

Let's turn to the SCSI disks ...

A Broadcom SAS Gen 2.5 compatibility chart shows that your Hitachi drives appear compatible with that HBA.
You might check that the drive firmware is version A5D0.

Manufacturer: HGST
Type: HDD
Protocol / Link Speed: SATA 6Gb/s
Model: MegaScale DC 4000.B, HMS5C4040BLE640
FW Version: A5D0
Capacity: 4TB
Size: 3.5"
RPM: 7200
Sector Size: 512
Self-Encrypting: No
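
A quick way to read the firmware version from the TrueNAS shell (smartctl ships with TrueNAS; substitute your actual device node for da0):

    smartctl -i /dev/da0 | grep -i firmware    # prints the drive's reported firmware revision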


That is all I can offer today; moving on to other daily tasks.
Good luck!
 