TrueNAS Crashing several times

stuartbh

Dabbler
Joined
Jan 25, 2023
Messages
10
TrueNAS users, developers, et alia:

I have a use case whereby a user has a dedicated TrueNAS SCALE server (an IBM x3650 "server grade server" with 48GB of RAM, 2 quad-core Xeon processors, and 8 SAS drives) that serves NFS mounts to several smaller servers (NUCs) running ProxMox in a cluster. If the TrueNAS server crashes, the VMs can crash (they are non-critical VMs), because the TrueNAS server is currently the only file server servicing the ProxMox cluster of NUCs. I am beginning to think there might be wisdom in replacing the TrueNAS server with a Ceph environment, seeing how the TrueNAS SCALE environment went belly up some 6 or so times in one day, and Ceph is a more distributed system that in most cases survives any one server crashing. However, for the good of the TrueNAS SCALE community I'd like to know what steps are recommended so I can verify whether this was a hardware-based problem versus some bug that was run into and is worthy of being remediated.

Generally speaking, I know how to troubleshoot Linux when it crashes, but I know TrueNAS (CORE and SCALE) are viewed as appliances and are heavily modified, which is why I am posting this question.

Stuart
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Welcome to the forums.

Sorry to hear you're having trouble. Please take a few moments to review the Forum Rules, conveniently linked at the top of every page in red, and pay particular attention to the section on how to formulate a useful problem report, especially including a detailed description of your hardware.

It sounds like you are using TrueNAS to serve block storage for VM's. I would note that SCALE is not particularly good for this; on a 48GB RAM system you only have 24GB of ARC, which is much smaller than the 64GB of ARC recommended for block storage. The X3650 is old enough that I would suspect that it has an incompatible storage controller in it as well, so some more hardware details are in order here. If it's something like an IBM ServeRAID-8k that's not going to be usable.
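
If you want to confirm what you're actually getting for ARC, something along these lines will show it (a rough sketch; field names can vary a bit between releases):

# current ARC size and the configured ceiling, in bytes
grep -E '^(size|c_max)\s' /proc/spl/kstat/zfs/arcstats

# the module parameter that caps it (0 means the built-in default, which on SCALE today is half of RAM)
cat /sys/module/zfs/parameters/zfs_arc_max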



Finally, that server platform is from 2006, so it is nearing the 20 year old mark. It is quite possible that it is developing problems, which is not unusual once a server passes ten to fifteen years of age.
 

stuartbh

Dabbler
Joined
Jan 25, 2023
Messages
10
Welcome to the forums.

Sorry to hear you're having trouble. Please take a few moments to review the Forum Rules, conveniently linked at the top of every page in red, and pay particular attention to the section on how to formulate a useful problem report, especially including a detailed description of your hardware.

I will work on this later tonight.

It sounds like you are using TrueNAS to serve block storage for VM's. I would note that SCALE is not particularly good for this; on a 48GB RAM system you only have 24GB of ARC, which is much smaller than the 64GB of ARC recommended for block storage. The X3650 is old enough that I would suspect that it has an incompatible storage controller in it as well, so some more hardware details are in order here. If it's something like an IBM ServeRAID-8k that's not going to be usable.

It is a ServeRAID-8k but it has been configured with proper firmware so as to pass the drives directly through to Linux without any attempt to control or RAID the drives. Thus, ZFS receives the drives, and I placed some datasets on them which were thereafter shared out via NFS. Those NFS shares are then mounted by ProxMox in its environment, where it writes the VMs out as qcow2 files on the NFS share.
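
Roughly, the Proxmox side of that looks like the following (a sketch from memory; the storage name, IP, and export path below are just placeholders for my actual ones):

# add the TrueNAS NFS export as shared storage on the cluster
pvesm add nfs truenas-nfs --server 192.168.1.50 --export /mnt/tank/proxmox --content images,iso

# Proxmox then stores each VM disk on that share as a qcow2 file, e.g.
#   /mnt/pve/truenas-nfs/images/101/vm-101-disk-0.qcow2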




I'll read the above-cited links, but I think I already have a good idea of what they will say. Still, I try to be open-minded, since I never know when I'll learn something new.

Finally, that server platform is from 2006, so it is nearing the 20 year old mark. It is quite possible that it is developing problems, which is not unusual once a server passes ten to fifteen years of age.

I did run the IBM diagnostics on the system; it passed all of those tests and did not find a single hardware problem (I even had it run the RAM tests as well). That said, I know everyone loves to blame old hardware first, but I really do wish to look at the diagnostic analysis and see what Linux/TrueNAS thinks has gone wrong. If it is hardware, then so be it!
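
For what it is worth, after the next crash my plan is to start with the previous boot's logs and any crash dumps, roughly along these lines (assuming the journal is persisted across reboots):

# errors from the boot that crashed
journalctl -b -1 -p err

# any machine-check / hardware events the kernel logged
journalctl -k -b -1 | grep -iE 'mce|machine check|hardware error'

# kernel crash dumps, if kdump happens to be configured
ls -l /var/crash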

Thanks for your insights.

Stuart
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
It is a ServeRAID-8k but it has been configured with proper firmware so as to pass the drives directly through to Linux without any attempt to control or RAID the drives.

Please re-read the other resource I quoted. Unless your ServeRAID-8k is an LSI 2008 or 2308 chipset with HBA IT firmware loaded on it, it is not considered acceptable. You need to particularly read points 3), 4), 4a), 5) and 10). The resource is not a menu you can pick and choose the points you like. It may seem like it's working today, but after many years, no one has demonstrated that anything but my shortlist of acceptable disk attachment choices is stable.
 

stuartbh

Dabbler
Joined
Jan 25, 2023
Messages
10
Jgreco,

Let us not presume facts not in evidence. You seem to think I had read only certain parts of the recommended reading you provided when I replied; this is not true. I had not yet read ANY of it when I replied (and said as much in my reply, namely that I would read it tonight). I have since read the entirety of both documents (not to say I fully understood and absorbed all of it), and I did so before drafting this epistle.

Fascinating; you seem to be presuming I am attempting to argue with you or dispute your experience or knowledge, and this is simply not the case. I am interested in doing what I can to maximize the capabilities and usefulness of the hardware I have. Thus, if changing the HBA's firmware gives me greater function with TrueNAS, I am fully open to doing so. If changing the controller does, then that is a worthy consideration to ponder as well. Perhaps my hardware is just not compatible with TrueNAS and I would be better served considering OpenMediaVault with ZFS or ProxMox with Ceph. I am trying to figure out what will give me a functional way to leverage the hardware I have.

Here is the information I got from lspci -vvv on the controller (dmidecode did not show any firmware versions, but I am continuing to research how I might find that info without needing to reboot and read the banners as they print during the boot-up sequence).

04:00.0 RAID bus controller: Adaptec AAC-RAID (Rocket) (rev 02)
Subsystem: IBM ServeRAID 8k/8k-l8
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 17
Region 0: Memory at c9e00000 (64-bit, non-prefetchable) [size=2M]

Region 2: Memory at c7fe0000 (64-bit, prefetchable) [size=128K]
Region 4: I/O ports at 5000
Expansion ROM at c8000000 [virtual] [disabled] [size=32K]

Capabilities: <access denied>
Kernel driver in use: aacraid
Kernel modules: aacraid


By the way, I am able to access the devices using smartctl and can see the type of each device as well.
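
In case it helps, the two approaches I am looking at for reading the controller firmware level without a reboot are roughly these (no guarantee either works on this particular card):

# the aacraid driver usually logs the firmware build when it loads
dmesg | grep -i aacraid

# Adaptec's arcconf utility, if it is installed, can report it directly
arcconf getconfig 1 AD | grep -i firmware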

Thanks in advance for your time and consideration regarding the instant matter before us.

Stuart
 
Last edited:

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Perhaps my hardware is just not compatible with TrueNAS and I would be better served considering OpenMediaVault with ZFS or ProxMox with Ceph
To be clearer on that, it's ZFS that you're going to be incompatible with, not TrueNAS specifically... it's just that TrueNAS uses ZFS exclusively, so not working with ZFS means not working with TrueNAS.

You can use OMV without ZFS, so maybe do that instead.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Let us not presume facts not in evidence. You seem to think I had read only certain parts of the recommended reading you provided when I replied; this is not true. I had not yet read ANY of it when I replied (and said as much in my reply, namely that I would read it tonight). I have since read the entirety of both documents (not to say I fully understood and absorbed all of it), and I did so before drafting this epistle.

Fascinating; you seem to be presuming I am attempting to argue with you or dispute your experience or knowledge, and this is simply not the case. I am interested in doing what I can to maximize the capabilities and usefulness of the hardware I have.

I'm really not interested in arguing this with you. I've been doing support here for more than a decade, and there is a classic pattern where users show up and argue that their RAID controller is just fine; your response of "It is a ServeRAID-8k but it has been configured with proper firmware so as to pass the drives directly through to Linux without any attempt to control or RAID the drives." feels like a classic entrance to a line of argument where users cover their ears and scream LA LA LA LA LA as loud as they can. If you weren't doing that, fine.

My perspective is that I'm trying to save you problems down the road. The very worst outcome we see in these forums is when someone shows up and says their pool won't mount, and then we triage that and find they've made a fatal hardware selection that has ruined their pool and data. That's the reason that the RAID/HBA article exists.

As a small-l not-batshit-crazy libertarian, I respect your right to do whatever the hell you want to do, but I reserve the right to make sure you're making an informed decision before you go down that path. Having heard it all hundreds of times before, it gets a bit dreary. We entered this thread with a description of a server system crashing multiple times a day, where the only thing obviously wrong was the use of the Adaptec/ServeRAID controller which is known to be problematic, so the answer I *have* is that this component is the likely issue. If yours is a PCIe card product version, it looks like you could probably just swap it for a Dell PERC H200 or IBM ServeRAID M1015 crossflashed to IT mode, which are highly compatible cards for TrueNAS, and sometimes can be found as cheap as $30 on eBay. It looks like there is also a "zero channel" add-on module version of your RAID controller. This might be a little trickier.
 

stuartbh

Dabbler
Joined
Jan 25, 2023
Messages
10
To be clearer on that, it's ZFS that you're going to be incompatible with, not TrueNAS specifically... it's just that TrueNAS uses ZFS exclusively, so not working with ZFS means not working with TrueNAS.

You can use OMV without ZFS, so maybe do that instead.

I may have been a bit inarticulate, but yes that is what I meant (compatibility with ZFS). I like TrueNAS (it was obviously my first choice as it stands installed), so if I can use it effectively that is my preference.

Stuart
 

stuartbh

Dabbler
Joined
Jan 25, 2023
Messages
10
I'm really not interested in arguing this with you. I've been doing support here for more than a decade, and there is a classic pattern where users show up and argue that their RAID controller is just fine; your response of "It is a ServeRAID-8k but it has been configured with proper firmware so as to pass the drives directly through to Linux without any attempt to control or RAID the drives." feels like a classic entrance to a line of argument where users cover their ears and scream LA LA LA LA LA as loud as they can. If you weren't doing that, fine.

I apologize if it came off that way; my intention was to express that I honestly believed I had considered this concern and thought my controller was configured and functioning in an acceptable manner. I would prefer to be told I am wrong and how to fix something than to use the "LA LA LA" method of dealing with the problem! :)

My perspective is that I'm trying to save you problems down the road. The very worst outcome we see in these forums is when someone shows up and says their pool won't mount, and then we triage that and find they've made a fatal hardware selection that has ruined their pool and data. That's the reason that the RAID/HBA article exists.

I appreciate that and would prefer that outcome myself!

As a small-l not-batshit-crazy libertarian, I respect your right to do whatever the hell you want to do, but I reserve the right to make sure you're making an informed decision before you go down that path. Having heard it all hundreds of times before, it gets a bit dreary. We entered this thread with a description of a server system crashing multiple times a day, where the only thing obviously wrong was the use of the Adaptec/ServeRAID controller which is known to be problematic, so the answer I *have* is that this component is the likely issue. If yours is a PCIe card product version, it looks like you could probably just swap it for a Dell PERC H200 or IBM ServeRAID M1015 crossflashed to IT mode, which are highly compatible cards for TrueNAS, and sometimes can be found as cheap as $30 on eBay. It looks like there is also a "zero channel" add-on module version of your RAID controller. This might be a little trickier.

I am going to open the x3650 7979 system up later and make a physical inspection of what is in there HBA-wise; it's been a while since I did that. I also have an x3650 M3 that I am getting ready to update the firmware in, so I will look inside it as well. I see M1015 controllers on eBay for under $30, so I might just grab one this week after I make my physical inspection of what I have. I have a few x3650 7979 systems lying around, so I am going to see what HBAs I have here, as perhaps one is an M1015. Either way, once I do have an M1015 in hand, is there a quick "how to" you can recommend I follow for upgrading the firmware to IT mode so that it will work with maximum compatibility with TrueNAS (ZFS)? The M3 has 16 2.5" slots in it, so eventually I plan to use that system for TrueNAS.

What are your thoughts on using TrueNAS running as a virtual machine? I had originally installed TrueNAS bare metal but then realized its support for running KVM and its virtualization GUI were still being developed and improved with SCALE, so I am more willing to run TrueNAS as a VM with hardware pass through now.

Thank you for all your time and expertise.

Stuart
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Either way, once I do have an M1015 in hand, is there a quick "how to" you can recommend I follow for upgrading the firmware to IT mode so that it will work with maximum compatibility with TrueNAS (ZFS)?

Second post has the link to the official "how-to"... can't call it "quick" though, due to the size of that post.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
I respect your right to do whatever the hell you want to do, but I reserve the right to make sure you're making an informed decision before you go down that path.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I apologize if it came off that way,

No worries, no need to apologize.

What are your thoughts on using TrueNAS running as a virtual machine?

It can be done, carefully, but I have significant doubts about a platform as old as yours. An eyeballing around the 'net suggests that your box is either Westmere or Nehalem. While these CPU's theoretically support Intel® Virtualization Technology for Directed I/O (VT-d), this is really a "first generation" CPU for that feature and the general experience has been that ESXi and Proxmox cannot reliably use VT-d on these older platforms. Supermicro support of VT-d on Sandy and Ivy Bridge is pretty solid, and most boards seem to work once you get to Haswell. I suspect that this has to do more with the firmware on the boards that set up the chipset than some actual inability to support VT-d; VT-d was a rather esoteric function when it first came out, but now it is commonly used even on non-server boards.
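
A quick way to sanity-check whether the IOMMU is actually being set up on a box like that (rough sketch; output varies by kernel and firmware):

# kernel messages showing whether DMAR/VT-d was found and enabled
dmesg | grep -iE 'DMAR|IOMMU'

# if remapping is really active, there will be populated IOMMU groups here
ls /sys/kernel/iommu_groups/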

 

stuartbh

Dabbler
Joined
Jan 25, 2023
Messages
10
Jgreco, et alia:

Alright, it has been quite a while since I have had time to put on this project but here is where I am at with it thus far:

I have an x3650 M3 with an M1015 in the server, and it now has all ECC RAM as well (about 172GB worth). I am going to cross-flash the M1015, but I need to do some research on how I can back up the firmware and such that is on the M1015 currently (I plan to do some reading on the Broadcom website). I do not care about configuration data, as the controller currently has no configuration set up on it that I care about.
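
From what I have read so far, the backup portion looks roughly like this (an outline only, pieced together from the usual write-ups; the exact tools and flags are from memory, and I will verify them against the official how-to before touching anything):

# from a DOS/EFI boot environment with the LSI tools on hand:
sas2flsh -listall                   # confirm the controller is seen and note its SAS address
megarec -readsbr 0 sbr_backup.bin   # back up the original SBR before erasing anything
megarec -cleanflash 0               # only after the backup: wipe the IBM firmware
# ...then flash the IT firmware (2118it.bin) and restore the SAS address per the guide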

My eventual plan is to run ProxMox with TrueNAS running in a VM with the M1015 passed through to TrueNAS. That said, I realize that doing so on this older server may be less than optimal regarding VT-d support, as you pointed out, though I suspect that if anyone got it right early on, IBM did.
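
As I understand it, the passthrough itself would look something like this on the Proxmox host (a sketch; the VM ID and PCI address here are only examples):

# enable the IOMMU on the host kernel command line (Intel platform), then reboot:
#   intel_iommu=on iommu=pt

# find the M1015's PCI address
lspci | grep -i lsi

# hand the whole controller to the TrueNAS VM (VM ID 100 is just an example)
qm set 100 -hostpci0 0000:04:00.0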

I have an x3650 (7979 or "M1") that has only a motherboard-based SAS controller in it, and as such it does not support drives over 2TB (and I recently came into a 3TB SAS drive). I was going to make the 7979 into a test server (for erasing drives, running SMART long tests, and such) and ordered an M5015 for it (in retrospect maybe I should have gotten an M1015, but I was not thinking about running ZFS on it at the time). I did read the following article concerning the M5110 being cross-flashed, and I am wondering whether that article might be applicable to an M5015?

M5015 = SAS2108 = LSI9260-8i
M5110 = SAS2208 = LSI9265-8i

Thanks in advance to everyone.

Stuart
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
I always feel vicariously bad when I see people trying to argue with the grinch...
ProxMox with TrueNAS
proxmox + truenas is generally considered inferior, far less tested, and prone to seemingly random problems.

this hardware is OLD. like "your laptop is a month old? well that's great, if you could use a nice, heavy paperweight" old.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
proxmox + truenas is generally considered inferior, far less tested, and prone to seemingly random problems.
I agree totally, but I do wonder if a lot of the "seemingly random problems" are really just related to the frighteningly high number of YouTube videos instructing people to do disk passthrough using the by-id path in the mistaken assumption that this somehow avoids the problem of having an unsupported controller in between ZFS and the disk.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I agree totally, but I do wonder if a lot of the "seemingly random problems" are really just related to the frighteningly high number of YouTube videos instructing people to do disk passthrough using the by-id path in the mistaken assumption that this somehow avoids the problem of having an unsupported controller in between ZFS and the disk.

The other thing I've seen is that people seem to show up with ancient (like Nehalem or Westmere ancient) systems and then their PCIe passthru doesn't quite work right, leading them to try other bad ways (including what you mention) to "make it go". ESXi has the charming benefit that it is (at least on recent versions) much more likely to be very bitchy about it if you try to boot it on ancient hardware. Proxmox seems to just "do it anyways", which may not be the thing you really want to happen. I've got some late-2007-era Opteron 2346's that Proxmox seemed to work fine on (but does not support PCIe passthru as far as I know), and of course it's always nice to recycle free hardware, but if a system just isn't up to it, "do it anyways by some bassackward method" is not a healthy policy if you care about your data.

The reason I wrote the original virtualization guide was because there were a lot of people trying janky ESXi tricks at the time and then having it go wrong. Much as I despise VMware, it's a decade ahead of Proxmox in terms of stability, reliability, and compatibility, and while I am quietly cheering on Proxmox, I don't see it taking over the big on-site farms anytime soon. You just don't tow a fifth wheel RV with a Ford Maverick. It might look like a truck and maybe can be used for some light duty truck-like work, but it just doesn't compare to an F-350 Super Duty in what it can do.
 