TrueNAS Scale (and CORE before it) keeps restarting on random intervals

MrMakuc

Dabbler
Joined
Nov 5, 2023
Messages
14
Hello,

I'm running TrueNAS SCALE now (latest version). I was running CORE before it with the same issues, which is why I tried to "upgrade" (no offense, that's what the automated process in the GUI calls it) to SCALE.

I'm running it on Proxmox 8.
  • Motherboard make and model: Gigabyte B760M Gaming X DDR4
  • CPU make and model: Intel i5-13600K
  • RAM quantity: 64 GB (non-ECC), Corsair Vengeance LPX 64GB (2x32GB) DDR4 3200MHz C16
  • Hard drives, quantity, model numbers, and RAID configuration, including boot drives: 3x Seagate Exos X20 20 TB, running in RAIDZ1
  • Hard disk controllers: MZHOU PCI-E SATA Expansion Card 10 SATA ports, PCIe passthrough to TrueNAS VM
  • Network cards: Onboard
Now for the problems. The first round I encountered was when I first populated the pool with 20 TB of data copied from my various external drives (or better said, when I first checked the reports after populating it; too noob at first, trusting it from the get-go). I had A LOT of checksum errors and 5 files with permanent errors. No biggie: I replaced those files, and scrubbing the pool fixed the other errors. But this time I set up e-mail notifications and began looking at things (still on TrueNAS CORE at this point).
  • The VM would automatically reboot approximately once per day. Proxmox still lists its original uptime, but TrueNAS itself says it booted at a different point in time. Sometimes the interval is longer, sometimes shorter.
  • When I upload ~100 GB of new data, checksum errors are back, in dozens, with possibly one permanent corruption.
  • TrueNAS thinks memory is ECC...no idea why or how to fix it.
I replaced the cables. All SMART tests are PASSED (long and short).
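
For reference, this is roughly how I've been checking things from the shell (the drive names and the pool name "tank" below are just placeholders for my actual setup):

    # run a long SMART self-test on each data disk, then review the result
    smartctl -t long /dev/sda          # repeat for /dev/sdb and /dev/sdc
    smartctl -a /dev/sda               # "PASSED" plus the raw error counters

    # scrub the pool and watch the per-device READ/WRITE/CKSUM counters
    zpool scrub tank
    zpool status -v tank               # also lists files with permanent errors, if any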

But for now, let's focus on the reboots (unless you want to tackle everything at once?). Proxmox doesn't catch anything. Where should I look next?

Everything was new hardware. I ran 4 passes of memtest when it was assembled, with no errors.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Hard disk controllers: MZHOU PCI-E SATA Expansion Card 10 SATA ports, PCIe passthrough to TrueNAS VM
Oh god, that's terrible... Cheap SATA controller plus a SATA port multiplier... Not a recipe for success in any way and it may be part of the problem if the boot device is attached to that thing.
When I upload ~100 GB of new data, checksum errors are back, in dozens, with possibly one permanent corruption.
Yeah, I suspect the SATA controller may be particularly nasty...
TrueNAS thinks memory is ECC...no idea why or how to fix it.
That's cosmetic, but something the hypervisor is responsible for. TrueNAS can only report what it sees...
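If you're curious what the guest is actually being told, a quick check from inside the TrueNAS VM (just a sketch) is to look at what the virtual firmware exposes via SMBIOS/DMI:

    # inside the VM: what the (virtual) DMI tables claim about the memory
    dmidecode -t memory | grep -i "error correction"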
I've just come across this post: proxmox with TrueNAS, which mentions possible problems with Proxmox. I'm guessing the first recommendation would be to go barebones?
Definitely. @jgreco's Resources also show other less extreme options.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Definitely. @jgreco's Resources also show other less extreme options.

Wow, that's lazy! Heh.


ESXi is still the king of hypervisors, though Proxmox has been okay for many folks. If you have recent vintage virtualization-capable server hardware (say post-2015 or so) it's likely to be just fine, but still test and burn-in your solution. Older hardware (pre-X9 for ESXi) has problems with PCIe passthru sometimes, and X11 seems to be the boundary line for Proxmox PCIe passthru. This doesn't mean all configurations fail, it's just where I've noticed the problems/success ratio seems to switch over from likely-to-fail and becomes likely-to-work. Non-server boards are more of a crapshoot as many do not have properly functioning BIOS support for PCIe passthru.
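
If you want to sanity-check passthru on the hypervisor side before trusting it with a pool, a quick look on a Proxmox host (sketch only, assuming a stock PVE kernel) goes something like:

    # on the Proxmox host: confirm the IOMMU actually came up
    dmesg | grep -iE 'dmar|iommu'

    # list the IOMMU groups; the controller you pass through should sit in its own group
    find /sys/kernel/iommu_groups/ -type l | sort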



etc
 

MrMakuc

Dabbler
Joined
Nov 5, 2023
Messages
14
Oh god, that's terrible... Cheap SATA controller plus a SATA port multiplier... Not a recipe for success in any way and it may be part of the problem if the boot device is attached to that thing.
So I was correct in assuming that non-ECC memory shouldn't be producing so many errors if everything is working properly (I should also mention that I haven't turned on the XMP profile to run the RAM at full speed, both for idle power consumption and for better stability; once everything works as it should, I'll start playing around with those settings). Thanks.

The Proxmox boot device is an older Samsung 840 EVO SSD (500 GB) attached to the onboard SATA controller. The TrueNAS boot device is a virtual disk on that drive, so I guess this shouldn't be the reason for the reboots? Unless the SATA expansion card can mess things up this much...
Wow, that's lazy! Heh.


ESXi is still the king of hypervisors, though Proxmox has been okay for many folks. If you have recent vintage virtualization-capable server hardware (say post-2015 or so) it's likely to be just fine, but still test and burn-in your solution. Older hardware (pre-X9 for ESXi) has problems with PCIe passthru sometimes, and X11 seems to be the boundary line for Proxmox PCIe passthru. This doesn't mean all configurations fail, it's just where I've noticed the problems/success ratio seems to switch over from likely-to-fail and becomes likely-to-work. Non-server boards are more of a crapshoot as many do not have properly functioning BIOS support for PCIe passthru.



etc
Due to idle power consumption I decided to try running consumer hardware. I chose a middle-ground processor with ECC support in case it becomes necessary down the road, or to reuse it as my desktop/workstation if this doesn't play out okay. So yes, it's 13th gen, which should tick the box for hardware new enough to support this stuff.

Another thing to mention: I've set up CPU affinity so the VM only ever uses the same type of cores, having read that TrueNAS uses an older kernel which didn't yet have a "proper" implementation of E-cores and P-cores.
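
(For reference, I did that with Proxmox's per-VM affinity setting, roughly like this; the VM ID and core range are just examples for my box, and I believe the option needs PVE 7.3 or newer:)

    # pin the TrueNAS VM (ID 100 here) to the P-core threads only; on a 13600K those are 0-11
    qm set 100 --affinity 0-11
    qm config 100 | grep affinity      # verify the setting took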

I'm guessing that for a barebones installation I should disable the E-cores first, to make sure everything works as it should, right? Should I also remove the SATA expansion card to rule it out as a suspect (at least until I confirm everything works when uploading files)? My motherboard does have 4 onboard SATA ports, so that should be enough.

---

Another problem I'm having with TrueNAS SCALE (not present in CORE) is the UPS service. I keep disabling it (and yes, the checkbox to run at startup is also unchecked) and after each reboot the service starts up on its own again (though at the moment this is what gives me a heads-up that it was rebooted, unlike CORE, which actually notified me about reboots). How can I enable that notification and permanently disable the UPS service?
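
(If anyone knows a cleaner way, I'd also be happy to try toggling it from the middleware CLI instead of the GUI; something along these lines is my guess, though I have no idea whether it will stick across reboots any better:)

    # query the UPS service entry (note its "id"), then try disabling auto-start via the middleware
    midclt call service.query '[["service","=","ups"]]'
    midclt call service.update <id-from-query> '{"enable": false}'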
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
I've just come across this post: proxmox with TrueNAS, which mentions possible problems with Proxmox. I'm guessing the first recommendation would be to go barebones?
First recommendation, which cannot be skipped, is to ditch the SATA card and get a LSI 9200/9300 SAS HBA. (Or attach drives to the motherboard, and pass the PCH controller, if it's not used for something else.)
If that does not solve the issue, try bare metal.
If that still does not solve the problem, well, you may need proper server hardware, including a proper server NIC rather than Realtek 2.5G.

But in any case 3*20TB in raidz1 is not a recommended layout: Too much data at risk in the event of a drive failure.
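
If you do go the route of passing the chipset (PCH) SATA controller through instead of the add-on card, the rough idea on Proxmox is (sketch only; the PCI address and VM ID below are examples, and the Proxmox boot SSD must not be on that controller):

    # on the Proxmox host: find the chipset SATA controller's address
    lspci | grep -i sata

    # hand the whole controller to the TrueNAS VM (address below is an example)
    qm set 100 -hostpci0 0000:00:17.0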
 

LarsR

Guru
Joined
Oct 23, 2020
Messages
719
Next question would be how well the current BSD kernel supports 13th gen Intel, as a virtual machine or on bare metal...
 

MrMakuc

Dabbler
Joined
Nov 5, 2023
Messages
14
First recommendation, which cannot be skipped, is to ditch the SATA card and get a LSI 9200/9300 SAS HBA. (Or attach drives to the motherboard, and pass the PCH controller, if it's not used for something else.)
If that does not solve the issue, try bare metal.
If that still does not solve the problem, well, you may need proper server hardware, including a proper server NIC rather than Realtek 2.5G.

But in any case 3*20TB in raidz1 is not a recommended layout: Too much data at risk in the event of a drive failure.
The first 20 TB was moved via VMs on this same host. Do you think the NIC could still corrupt the transferred data? I'm using the internal SATA controller at the moment for the boot SATA SSD. I guess I could move it to an NVMe SSD and boot from there, then move the disks to the onboard ports.

I used to have 2x Samsung 2 TB SSDs as cache, hence the expansion SATA card, for a total of 5 disks (with potentially 2-3 additional HDDs following soon, switching to RAIDZ2 then). Had no idea it could potentially screw up so many things.

For the HBA, would these from China be OK, or should I be more cautious?
Next question would be how well the current BSD kernel supports 13th gen Intel, as a virtual machine or on bare metal...
Well, TrueNAS probably doesn't, but Proxmox should have no problems, based on my research. Hence why I set up affinity for the TrueNAS SCALE VM to always use the same cores, eliminating the problem of mixed core types potentially causing trouble for it. For bare metal, I guess I would have to disable the E-cores to make sure the proof of concept solves the problem, and then experiment from there.
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
For the HBA, would these from China be OK, or should I be more cautious?
I'm not skilled enough to spot fake cards (and this particular listing has a mix of references), but yes this is the general direction you should take if you need a SAS HBA. How many ports/drives do you actually need (ZFS pool and other VMs)?
 

MrMakuc

Dabbler
Joined
Nov 5, 2023
Messages
14
I'll probably be going for 12 MAX at any point in time, 5-6 for the near future.
 

asap2go

Patron
Joined
Jun 11, 2023
Messages
228
The first 20 TB was moved via VMs on this same host. Do you think the NIC could still corrupt the transferred data? I'm using the internal SATA controller at the moment for the boot SATA SSD. I guess I could move it to an NVMe SSD and boot from there, then move the disks to the onboard ports.

I used to have 2x Samsung 2 TB SSDs as cache, hence the expansion SATA card, for a total of 5 disks (with potentially 2-3 additional HDDs following soon, switching to RAIDZ2 then). Had no idea it could potentially screw up so many things.

For the HBA, would these from China be OK, or should I be more cautious?

Well, TrueNAS probably doesn't, but Proxmox should have no problems, based on my research. Hence why I set up affinity for the TrueNAS SCALE VM to always use the same cores, eliminating the problem of mixed core types potentially causing trouble for it. For bare metal, I guess I would have to disable the E-cores to make sure the proof of concept solves the problem, and then experiment from there.
Both Proxmox's and SCALE's schedulers can deal with the new Intel architecture. No reason to deactivate the E-cores.
 

MrMakuc

Dabbler
Joined
Nov 5, 2023
Messages
14
Both Proxmox's and SCALE's schedulers can deal with the new Intel architecture. No reason to deactivate the E-cores.
Thank you, didn't know that.

Removing the SATA expansion card, moving the boot disk from the SATA SSD to NVMe, and connecting all the HDDs that were on the SATA expansion card to the onboard SATA controller didn't resolve the auto-restarting of the VM. Neither did increasing the VM's memory allocation from 16 GB to 32 GB (it reports 26.3 GiB free of the 31.3 GiB available). Maybe I could try to reinstall everything instead of having it "upgraded" from the CORE version, but I doubt that would help. A barebones install is now probably my only choice.

Reboot happened at: 21:38

Nothing useful visible in:
[screenshot attached]
Neither in:
[screenshot attached]
Probably useless as well, but:
[screenshot attached]
And this is pretty much all I looked at, besides:
[screenshot attached]

And:
[screenshot attached]

Scrub is still in progress to begin everything with a clean slate before trying to copy new data onto it.
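
In case it helps anyone following along, this is roughly how I've been trying to dig out the moment of the reboot (assuming the systemd journal on SCALE keeps the previous boot; boot numbering will differ on other systems):

    # list recorded boots, then read the tail end of the previous one
    journalctl --list-boots
    journalctl -b -1 -e

    # kernel messages from the previous boot, filtered for hardware-looking trouble
    journalctl -k -b -1 | grep -iE 'mce|machine check|panic|error'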
 

MrMakuc

Dabbler
Joined
Nov 5, 2023
Messages
14
@Etorix Is the LSI SAS 9300-16i safe to use (since it has more ports, totaling 16 when using SFF-8643 to SATA cables)? Or should I go for the 9300-8i (8 ports with SFF-8643 to SATA cables)? Data transfer speed isn't important to me.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
Any 9300 HBA with the recommended P16 firmware is fine to use, and provides the best possible speed, including for SSDs. For HDDs, even a 9200 HBA would do.
In any case, make sure it's well ventilated: these SAS HBAs run hot.
Which brings up the next point: they run hot because they draw some wattage, especially the 9300-16i, which is two SAS3008 controllers (= two 9300-8i) on the same card. As described above, you have few drives, so consider whether you really need a -16i.
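
For checking what firmware a card actually ships with, Broadcom's sas3flash utility will report it (a sketch; the tool has to be obtained from Broadcom, and the adapter index may differ):

    # list all SAS3008-based adapters with their firmware/BIOS versions
    sas3flash -listall

    # details for the first adapter; the P16 release shows up as firmware 16.00.xx.00
    sas3flash -list -c 0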
 

asap2go

Patron
Joined
Jun 11, 2023
Messages
228
Any 9300 HBA with the recommended P16 firmware is fine to use, and provides the best possible speed, including for SSDs. For HDDs, even a 9200 HBA would do.
In any case, make sure it's well ventilated: these SAS HBAs run hot.
Which brings up the next point: they run hot because they draw some wattage, especially the 9300-16i, which is two SAS3008 controllers (= two 9300-8i) on the same card. As described above, you have few drives, so consider whether you really need a -16i.
Same "problem" as most server gear.
They are made for high air flow which most non server cases don't have.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Same "problem" as most server gear.
They are made for high air flow which most non server cases don't have.

That's why we recommend using server gear for server purposes, just like you would use gamer gear for gaming purposes or desktop gear for desktop purposes. You can try to use a desktop for gaming but it is likely to be suboptimal, just as you can use a gaming board for server use but it is super likely to be suboptimal. Server boards are designed for demanding applications and designed for 24/7 100% duty cycle as a result. This often includes high volume airflow, though you can also design servers for quiet environments with lower airflow. The typical 1U and 2U designs are good at maintaining airflow across the entire chassis, and you can absolutely duplicate that with a well-selected tower chassis, but the classic error made by enthusiasts familiar with gamer or desktop PC "builds" is a failure to do airflow design to make sure everything gets airflow. This can be very hard on high-wattage (especially server) CPU's, high capacity memory DIMM's, HBA's, network controllers, and flash and hard drives where heat is dissipated in distressingly high amounts, levels unfamiliar to gamers.

There's plenty of server gear that doesn't require high airflow too, though, so your first sentence is not really true.
 

MrMakuc

Dabbler
Joined
Nov 5, 2023
Messages
14
Thank you everyone for all your points. I'll do some more research into which HBA to go for. Does anyone know how much this wattage actually is? It'd be interesting to know and could greatly help my decision-making.

Now let me get back to my first point for a moment and post an update: having removed the SATA expansion card and reinstalled TrueNAS SCALE bare metal, the server is still auto-rebooting! All it was doing was a scrub running through the night, and approximately 2 hours ago it rebooted.

Again, nothing in the logs to indicate the reason it rebooted:
[screenshot attached]

Any help greatly appreciated.

EDIT:
It rebooted again just now, mere minutes ago. That was a very short interval! All I was doing was looking at the logs! And the scrub is still running in the background. When I was running Proxmox, Proxmox itself never auto-rebooted, only this VM did. I'm starting memtest now; it will probably run for a whole day to see if this turns anything up, but I have my doubts (everything left now is new gear from proper brands).
 

MrMakuc

Dabbler
Joined
Nov 5, 2023
Messages
14
This time I could find something in the logs (attached).
 

Attachments

  • logs.txt
    65.2 KB · Views: 70

MrMakuc

Dabbler
Joined
Nov 5, 2023
Messages
14
It appears something is wrong with the RAM. It didn't even pass the first pass of memtest, so I guess I'll be testing each individual stick now to see where the problem lies. Such a rookie mistake; I guess I had too much trust in new hardware from reputable manufacturers. Thanks @jgreco for the burn-in recommendation. Better to run it all late than never.
 