TrueNAS Scale (and CORE before it) keeps restarting on random intervals

MrMakuc

Dabbler
Joined
Nov 5, 2023
Messages
14
Hello,

I'm running TrueNAS SCALE now (latest version). I was running CORE before it with the same issues, which is why I tried to "upgrade" (no offense, that's what the automated process in the GUI calls it) to SCALE.

I'm running it on Proxmox 8.
  • Motherboard make and model: Gigabyte B760M Gaming X DDR4
  • CPU make and model: Intel i5-13600K
  • RAM quantity: 64 GB (non-ECC), Corsair Vengeance LPX 64GB (2x32GB) DDR4 3200MHz C16
  • Hard drives, quantity, model numbers, and RAID configuration, including boot drives: 3x Seagate Exos X20 20 TB, running in RAIDZ1
  • Hard disk controllers: MZHOU PCI-E SATA Expansion Card 10 SATA ports, PCIe passthrough to TrueNAS VM
  • Network cards: Onboard
Now for the problems. The first round I encountered was when I first populated the pool with 20 TB of data copied from my various external drives (or better said, when I first checked the reports after populating it; too noob at first, trusting it from the get-go). I had A LOT of checksum errors and 5 files with permanent errors. No biggie: I replaced those files, and scrubbing the pool fixed the other errors. But this time I set up e-mail notifications and began looking at things (still on TrueNAS CORE at this point).
  • The VM would automatically reboot approximately once per day. Proxmox still lists its original uptime, but TrueNAS itself says it booted at a different point in time. Sometimes the interval is longer, sometimes shorter.
  • When I upload ~100 GB of new data, checksum errors are back, in dozens, with possibly one permanent corruption.
  • TrueNAS thinks memory is ECC...no idea why or how to fix it.
I replaced the cables. All SMART tests are PASSED (long and short).
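
For reference, this is roughly how I've been checking things from the shell (the drive names and the pool name "tank" below are just placeholders for my actual setup):

    # run a long SMART self-test on each data disk, then review the result
    smartctl -t long /dev/sda          # repeat for /dev/sdb and /dev/sdc
    smartctl -a /dev/sda               # "PASSED" plus the raw error counters

    # scrub the pool and watch the per-device READ/WRITE/CKSUM counters
    zpool scrub tank
    zpool status -v tank               # also lists files with permanent errors, if any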

But for now, let's focus on the reboots (unless you want to tackle everything at once?). Proxmox doesn't catch anything. Where should I look next?

Everything was new hardware. I ran 4 passes of memtest when it was assembled, with no errors.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Hard disk controllers: MZHOU PCI-E SATA Expansion Card 10 SATA ports, PCIe passthrough to TrueNAS VM
Oh god, that's terrible... Cheap SATA controller plus a SATA port multiplier... Not a recipe for success in any way and it may be part of the problem if the boot device is attached to that thing.
When I upload ~100 GB of new data, checksum errors are back, in dozens, with possibly one permanent corruption.
Yeah, I suspect the SATA controller may be particularly nasty...
TrueNAS thinks memory is ECC...no idea why or how to fix it.
That's cosmetic, but something the hypervisor is responsible for. TrueNAS can only report what it sees...
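If you're curious what the guest is actually being told, a quick check from inside the TrueNAS VM (just a sketch) is to look at what the virtual firmware exposes via SMBIOS/DMI:

    # inside the VM: what the (virtual) DMI tables claim about the memory
    dmidecode -t memory | grep -i "error correction"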
I've just come across this post: proxmox with TrueNAS, which mentions possible problems with Proxmox. I'm guessing the first recommendation would be to go barebones?
Definitely. @jgreco's Resources also show other less extreme options.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Definitely. @jgreco's Resources also show other less extreme options.

Wow, that's lazy! Heh.


ESXi is still the king of hypervisors, though Proxmox has been okay for many folks. If you have recent vintage virtualization-capable server hardware (say post-2015 or so) it's likely to be just fine, but still test and burn-in your solution. Older hardware (pre-X9 for ESXi) has problems with PCIe passthru sometimes, and X11 seems to be the boundary line for Proxmox PCIe passthru. This doesn't mean all configurations fail, it's just where I've noticed the problems/success ratio seems to switch over from likely-to-fail and becomes likely-to-work. Non-server boards are more of a crapshoot as many do not have properly functioning BIOS support for PCIe passthru.
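
If you want to sanity-check passthru on the hypervisor side before trusting it with a pool, a quick look on a Proxmox host (sketch only, assuming a stock PVE kernel) goes something like:

    # on the Proxmox host: confirm the IOMMU actually came up
    dmesg | grep -iE 'dmar|iommu'

    # list the IOMMU groups; the controller you pass through should sit in its own group
    find /sys/kernel/iommu_groups/ -type l | sort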



etc
 

MrMakuc

Dabbler
Joined
Nov 5, 2023
Messages
14
Oh god, that's terrible... Cheap SATA controller plus a SATA port multiplier... Not a recipe for success in any way and it may be part of the problem if the boot device is attached to that thing.
So I was correct in assuming that non-ECC memory shouldn't be producing so many errors if everything is working properly (I should also mention that I haven't turned on the XMP profile to run the RAM at full speed, both for idle power consumption and for better stability; once everything works as it should, I'll start playing around with those settings). Thanks.

The Proxmox boot device is an older Samsung 840 EVO SSD (500 GB) attached to the onboard SATA controller. The TrueNAS boot device is a virtual disk on that drive, so I guess this shouldn't be the reason for the reboots? Unless the SATA expansion card can mess things up this much...
Wow, that's lazy! Heh.


ESXi is still the king of hypervisors, though Proxmox has been okay for many folks. If you have recent vintage virtualization-capable server hardware (say post-2015 or so) it's likely to be just fine, but still test and burn-in your solution. Older hardware (pre-X9 for ESXi) has problems with PCIe passthru sometimes, and X11 seems to be the boundary line for Proxmox PCIe passthru. This doesn't mean all configurations fail, it's just where I've noticed the problems/success ratio seems to switch over from likely-to-fail and becomes likely-to-work. Non-server boards are more of a crapshoot as many do not have properly functioning BIOS support for PCIe passthru.



etc
Due to idle power consumption I decided to try running consumer hardware. I chose a middle-ground processor with ECC support in case it becomes necessary down the road, or to reuse it as my desktop/workstation if this doesn't play out okay. So yes, it's 13th gen, which should tick the box for hardware new enough to support this stuff.

Another thing to mention: I've set up CPU affinity so the VM only ever uses the same type of cores, having read that TrueNAS uses an older kernel which didn't yet have a "proper" implementation of E-cores and P-cores.
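
(For reference, I did that with Proxmox's per-VM affinity setting, roughly like this; the VM ID and core range are just examples for my box, and I believe the option needs PVE 7.3 or newer:)

    # pin the TrueNAS VM (ID 100 here) to the P-core threads only; on a 13600K those are 0-11
    qm set 100 --affinity 0-11
    qm config 100 | grep affinity      # verify the setting took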

I'm guessing that for a barebones installation I should disable the E-cores first, to make sure everything works as it should, right? Should I also remove the SATA expansion card to rule it out as a suspect (at least until I confirm everything works when uploading files)? My motherboard does have 4 onboard SATA ports, so that should be enough.

---

Another problem I'm having with TrueNAS SCALE (not present in CORE) is the UPS service. I keep disabling it (and yes, the checkbox to run at startup is also unchecked) and after each reboot the service starts up on its own again (though at the moment this is what gives me a heads-up that it was rebooted, unlike CORE, which actually notified me about reboots). How can I enable that notification and permanently disable the UPS service?
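
(If anyone knows a cleaner way, I'd also be happy to try toggling it from the middleware CLI instead of the GUI; something along these lines is my guess, though I have no idea whether it will stick across reboots any better:)

    # query the UPS service entry (note its "id"), then try disabling auto-start via the middleware
    midclt call service.query '[["service","=","ups"]]'
    midclt call service.update <id-from-query> '{"enable": false}'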
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
I've just come across this post: proxmox with TrueNAS, which mentions possible problems with Proxmox. I'm guessing the first recommendation would be to go barebones?
First recommendation, which cannot be skipped, is to ditch the SATA card and get a LSI 9200/9300 SAS HBA. (Or attach drives to the motherboard, and pass the PCH controller, if it's not used for something else.)
If that does not solve the issue, try bare metal.
If that still does not solve the problem, well, you may need proper server hardware, including a proper server NIC rather than Realtek 2.5G.

But in any case 3*20TB in raidz1 is not a recommended layout: Too much data at risk in the event of a drive failure.
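
If you do go the route of passing the chipset (PCH) SATA controller through instead of the add-on card, the rough idea on Proxmox is (sketch only; the PCI address and VM ID below are examples, and the Proxmox boot SSD must not be on that controller):

    # on the Proxmox host: find the chipset SATA controller's address
    lspci | grep -i sata

    # hand the whole controller to the TrueNAS VM (address below is an example)
    qm set 100 -hostpci0 0000:00:17.0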
 

LarsR

Guru
Joined
Oct 23, 2020
Messages
719
Next question would be how well the current BSD kernel supports 13th gen Intel, as a virtual machine or on bare metal...
 

MrMakuc

Dabbler
Joined
Nov 5, 2023
Messages
14
First recommendation, which cannot be skipped, is to ditch the SATA card and get a LSI 9200/9300 SAS HBA. (Or attach drives to the motherboard, and pass the PCH controller, if it's not used for something else.)
If that does not solve the issue, try bare metal.
If that still does not solve the problem, well, you may need proper server hardware, including a proper server NIC rather than Realtek 2.5G.

But in any case 3*20TB in raidz1 is not a recommended layout: Too much data at risk in the event of a drive failure.
The first 20 TB was moved via VMs on this same host. Do you think the NIC could still corrupt the transferred data? I'm using the internal SATA controller at the moment for the boot SATA SSD. I guess I could move it to an NVMe SSD and boot from there, then move the disks to the onboard ports.

I used to have 2x Samsung 2 TB SSDs as cache, hence the expansion SATA card, for a total of 5 disks (with potentially 2-3 additional HDDs following soon, switching to RAIDZ2 then). Had no idea it could potentially screw up so many things.

For the HBA, would these from China be OK, or should I be more cautious?
Next question would be how well the current BSD kernel supports 13th gen Intel, as a virtual machine or on bare metal...
Well, TrueNAS probably doesn't, but Proxmox should have no problems, based on my research. Hence why I set up affinity for the TrueNAS SCALE VM to always use the same cores, eliminating the problem of mixed core types potentially causing trouble for it. For bare metal, I guess I would have to disable the E-cores to make sure the proof of concept solves the problem, and then experiment from there.
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
For the HBA, would these from China be OK, or should I be more cautious?
I'm not skilled enough to spot fake cards (and this particular listing has a mix of references), but yes this is the general direction you should take if you need a SAS HBA. How many ports/drives do you actually need (ZFS pool and other VMs)?
 

MrMakuc

Dabbler
Joined
Nov 5, 2023
Messages
14
I'll probably be going for 12 MAX at any point in time, 5-6 for the near future.
 

asap2go

Patron
Joined
Jun 11, 2023
Messages
228
The first 20 TB was moved via VMs on this same host. Do you think the NIC could still corrupt the transferred data? I'm using the internal SATA controller at the moment for the boot SATA SSD. I guess I could move it to an NVMe SSD and boot from there, then move the disks to the onboard ports.

I used to have 2x Samsung 2 TB SSDs as cache, hence the expansion SATA card, for a total of 5 disks (with potentially 2-3 additional HDDs following soon, switching to RAIDZ2 then). Had no idea it could potentially screw up so many things.

For the HBA, would these from China be OK, or should I be more cautious?

Well, TrueNAS probably doesn't, but Proxmox should have no problems, based on my research. Hence why I set up affinity for the TrueNAS SCALE VM to always use the same cores, eliminating the problem of mixed core types potentially causing trouble for it. For bare metal, I guess I would have to disable the E-cores to make sure the proof of concept solves the problem, and then experiment from there.
Both Proxmox's and SCALE's schedulers can deal with the new Intel architecture. No reason to deactivate the E-cores.
 

MrMakuc

Dabbler
Joined
Nov 5, 2023
Messages
14
Both Proxmox's and SCALE's schedulers can deal with the new Intel architecture. No reason to deactivate the E-cores.
Thank you, didn't know that.

Removing the SATA expansion card, moving the boot disk from the SATA SSD to NVMe, and connecting all the HDDs that were on the SATA expansion card to the onboard SATA controller didn't resolve the auto-restarting of the VM. Neither did increasing the VM's memory allocation from 16 GB to 32 GB (it reports 26.3 GiB free of the 31.3 GiB available). Maybe I could try to reinstall everything instead of having it "upgraded" from the CORE version, but I doubt that would help. A barebones install is now probably my only choice.

Reboot happened at: 21:38

Nothing useful visible in:
[screenshot attached]
Neither in:
[screenshot attached]
Probably useless as well, but:
[screenshot attached]
And this is pretty much all I looked at, besides:
[screenshot attached]

And:
[screenshot attached]

Scrub is still in progress to begin everything with a clean slate before trying to copy new data onto it.
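
In case it helps anyone following along, this is roughly how I've been trying to dig out the moment of the reboot (assuming the systemd journal on SCALE keeps the previous boot; boot numbering will differ on other systems):

    # list recorded boots, then read the tail end of the previous one
    journalctl --list-boots
    journalctl -b -1 -e

    # kernel messages from the previous boot, filtered for hardware-looking trouble
    journalctl -k -b -1 | grep -iE 'mce|machine check|panic|error'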
 

MrMakuc

Dabbler
Joined
Nov 5, 2023
Messages
14
@Etorix Is the LSI SAS 9300-16i safe to use (since it has more ports, totaling 16 when using SFF-8643 to SATA cables)? Or should I go for the 9300-8i (8 ports with SFF-8643 to SATA cables)? Data transfer speed isn't important to me.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
Any 9300 HBA with the recommended P16 firmware is fine to use, and provides the best possible speed, including for SSDs. For HDDs, even a 9200 HBA would do.
In any case, make sure it's well ventilated: these SAS HBAs run hot.
Which brings up the next point: they run hot because they draw some wattage, especially the 9300-16i, which is two SAS3008 controllers (= two 9300-8i) on the same card. As described above, you have few drives, so consider whether you really need a -16i.
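
For checking what firmware a card actually ships with, Broadcom's sas3flash utility will report it (a sketch; the tool has to be obtained from Broadcom, and the adapter index may differ):

    # list all SAS3008-based adapters with their firmware/BIOS versions
    sas3flash -listall

    # details for the first adapter; the P16 release shows up as firmware 16.00.xx.00
    sas3flash -list -c 0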
 

asap2go

Patron
Joined
Jun 11, 2023
Messages
228
Any 9300 HBA with the recommended P16 firmware is fine to use, and provides the best possible speed, including for SSDs. For HDDs, even a 9200 HBA would do.
In any case, make sure it's well ventilated: these SAS HBAs run hot.
Which brings up the next point: they run hot because they draw some wattage, especially the 9300-16i, which is two SAS3008 controllers (= two 9300-8i) on the same card. As described above, you have few drives, so consider whether you really need a -16i.
Same "problem" as most server gear.
They are made for high air flow which most non server cases don't have.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Same "problem" as most server gear.
They are made for high air flow which most non server cases don't have.

That's why we recommend using server gear for server purposes, just like you would use gamer gear for gaming purposes or desktop gear for desktop purposes. You can try to use a desktop for gaming but it is likely to be suboptimal, just as you can use a gaming board for server use but it is super likely to be suboptimal. Server boards are designed for demanding applications and designed for 24/7 100% duty cycle as a result. This often includes high volume airflow, though you can also design servers for quiet environments with lower airflow. The typical 1U and 2U designs are good at maintaining airflow across the entire chassis, and you can absolutely duplicate that with a well-selected tower chassis, but the classic error made by enthusiasts familiar with gamer or desktop PC "builds" is a failure to do airflow design to make sure everything gets airflow. This can be very hard on high-wattage (especially server) CPU's, high capacity memory DIMM's, HBA's, network controllers, and flash and hard drives where heat is dissipated in distressingly high amounts, levels unfamiliar to gamers.

There's plenty of server gear that doesn't require high airflow too, though, so your first sentence is not really true.
 

MrMakuc

Dabbler
Joined
Nov 5, 2023
Messages
14
Thank you everyone for all your points. I'll do some more research into which HBA to go for. Does anyone know how much this wattage actually is? It'd be interesting to know and could greatly help my decision-making.

Now let me get back to my first point for a moment and post an update: having removed the SATA expansion card and reinstalled TrueNAS SCALE bare metal, the server is still auto-rebooting! All it was doing was a scrub running through the night, and approximately 2 hours ago it rebooted.

Again, nothing in the logs to indicate the reason it rebooted:
[screenshot attached]

Any help greatly appreciated.

EDIT:
It rebooted again just now, mere minutes ago. That was a very short interval! All I was doing was looking at the logs! And the scrub is still running in the background. When I was running Proxmox, Proxmox itself never auto-rebooted, only this VM did. I'm starting memtest now; it will probably run for a whole day to see if this turns anything up, but I have my doubts (everything left now is new gear from proper brands).
 

MrMakuc

Dabbler
Joined
Nov 5, 2023
Messages
14
This time I could find something in the logs (attached).
 

Attachments

  • logs.txt
    65.2 KB · Views: 70

MrMakuc

Dabbler
Joined
Nov 5, 2023
Messages
14
It appears something is wrong with the RAM. It didn't even pass the first pass of memtest, so I guess I'll be testing each individual stick now to see where the problem lies. Such a rookie mistake; I guess I had too much trust in new hardware from reputable manufacturers. Thanks @jgreco for the burn-in recommendation. Better to run it all late than never.
 