System keeps rebooting/freezing after update to Cobia

cben0ist

Cadet
Joined
Jan 30, 2023
Messages
7
Disclaimer: I have already created a ticket for this issue. I am posting here for help from the community to figure out whether my problem is software or hardware: https://ixsystems.atlassian.net/browse/NAS-125248

I updated from Bluefin to Cobia yesterday. While everything looked OK at first, I quickly started noticing random errors:
  • The network was shown as down even though I was remotely connected to the system
  • Sometimes pools did not come back up after a reboot
  • The TrueNAS catalog could not sync
Then I realized the system just kept restarting...
I have reinstalled the system from scratch, reimported my pools, and restored my last backup. Things are more stable, but the system still randomly freezes.

I am not ruling out hardware issues, but could these have been triggered by the Cobia update?
I see lots of messages in my /var/log/messages (full log attached to the Jira ticket), but I don't really know how to read all of that, and someone else might be able to help me pinpoint the root issue.
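For anyone facing a log of this size, a first pass can be automated before reading it line by line. The sketch below is only an illustration: the keyword groups in PATTERNS are my own assumptions about what commonly hints at hardware trouble in a Linux syslog, they are not taken from the attached file, and the sample lines are made up.

```python
import re
from collections import Counter

# Hypothetical keyword groups (assumptions, not taken from the attached log).
PATTERNS = {
    "machine_check": re.compile(r"machine check|mce:", re.I),
    "memory_error": re.compile(r"EDAC|bad page|corrupt", re.I),
    "crash": re.compile(r"segfault|general protection|kernel panic", re.I),
    "storage": re.compile(r"I/O error|nvme.*timeout", re.I),
    "oom": re.compile(r"out of memory|oom-kill", re.I),
}


def summarize(lines):
    """Count matching lines per group and remember the first hit of each."""
    counts = Counter()
    first_hit = {}
    for number, line in enumerate(lines, start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
                first_hit.setdefault(label, (number, line.rstrip()))
    return counts, first_hit


# Made-up sample lines; to scan a real file, pass an open file object
# instead, e.g. summarize(open("messages.txt", errors="replace")).
sample = [
    "Nov  1 10:00:01 nas kernel: mce: [Hardware Error]: Machine check events logged",
    "Nov  1 10:00:05 nas kernel: app[123]: segfault at 0 ip 0000",
    "Nov  1 10:00:09 nas systemd[1]: Started example.service",
]

counts, first_hit = summarize(sample)
for label, count in counts.most_common():
    number, line = first_hit[label]
    print(f"{label}: {count} line(s), first at line {number}: {line}")
```

A group with many hits is not a diagnosis, only a hint about which part of the log to read first.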

Any help is appreciated.
Thank you!
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Please do not use pastebin or other outside sites. Many people will not look at your data because of the risk of viruses. Also, you can just add the data to your message here; that is the preferred method.

I realize that you may have included your hardware information in the pastebin, but please also list your system components here: how much RAM, CPU, motherboard, hard drives, etc. The more details you can provide, the better advice you will receive.

Welcome to TrueNAS Forums! Hope we can help with your problem.
 

cben0ist

Oops, I did not realize I could attach a file here too, sorry about that. The document is now attached.

I am also adding TN specs.txt, which contains the hardware details straight from the command line.
In a nutshell:
- 12th Gen Intel(R) Core(TM) i5-12600K - BX8071512600K
- MSI MAG B660M Mortar motherboard
- 128GB Kingston FURY RAM - KF556C40
- 1 x RAIDZ2 | 5 wide | 16.37 TiB SATA VDev (Data pool) - Seagate IronWolf Pro 18TB NAS Hard Drive ST18000NE000
- 1 x MIRROR | 2 wide | 931.51 GiB NVMe (Apps pool) - WD Black SN770 1TB PCIe Gen4 NVMe
- another 931.51 GiB NVMe (Boot pool) - WD Black SN770 1TB PCIe Gen4 NVMe

Let me know if there is anything else I can add.
Thank you!!!
 

Attachments

  • messages.txt
    861.3 KB
  • TN specs.txt
    14.7 KB

joeschmuck

At first glance, while your hardware may not be server grade, I do not see a reason the hardware would cause the issues you are having. I will say that a 1TB NVMe as the boot drive is a lot of wasted space, but that is not the issue, just an observation.

It was smart to file a ticket. It has already been assigned to an engineer, who will hopefully reach out to you soon to obtain additional information and solve your issue.

One question we need an answer to is: Was your system stable before the upgrade? I'm talking days of running without issue, not hours.

Can you roll back to Bluefin? If you can, that would be the easiest way to return your TrueNAS machine to operation. The Cobia release has not gone smoothly for many people. I have not seen your particular problem, but then again there are hundreds of these postings and I only read a small number of them.

While waiting for additional help from the forum members, I recommend that you run both a RAM and a CPU stress test just to verify your system is stable. Things break, and while I don't expect this to be the issue, it is one thing we can rule out in the meantime. It really does matter if you can state that Memtest86+ passed 5 times, or that Prime95 ran for one hour (or 30 minutes). True burn-in testing would have you run these tests for days. I will run Prime95 for a few hours; for my home system I'm good with that. Memtest86+ I run for days until I get at least 5 full passes.

And lastly, I'm not the guru for SCALE, not by a long shot. There are some very experienced people here that can help you. Answer the questions above and post the answers.

I wish you the best of luck to resolve your issues.
 

cben0ist

Thank you @joeschmuck for all these leads. I have never used Prime95 or Memtest86+ before so I am looking into that now.
Prime95 is actually running as we speak. Even though it is showing many "Self-test 348k passed" messages, it printed a "FATAL ERROR: Rounding was 0.5, expected less than 0.4" at the very beginning... It has only been 10 minutes so far.
For Memtest86+, I will make a bootable USB key and see if I can run it overnight.
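For context on that rounding message: as I understand it, Prime95 multiplies very large numbers with floating-point FFTs, so every result digit should come back very close to a whole number, and the program checks how far each value sits from its nearest integer. The sketch below only illustrates that kind of check. The function name and the sample values are hypothetical, and only the 0.4 limit comes from the error message quoted above.

```python
def max_rounding_error(values):
    """Largest distance between any value and its nearest integer."""
    return max(abs(v - round(v)) for v in values)


# Limit taken from the quoted error message ("expected less than 0.4").
THRESHOLD = 0.4

# Hypothetical sample data: a clean result sits within a hair of an integer,
# while a corrupted one can land halfway between two integers.
healthy = [12.0000003, 6.9999998, 41.0000001]
suspect = [12.0000003, 7.5, 41.0000001]

for name, data in (("healthy", healthy), ("suspect", suspect)):
    error = max_rounding_error(data)
    status = "ok" if error < THRESHOLD else "cannot be rounded reliably"
    print(f"{name}: max rounding error {error:.7f} -> {status}")
```

A value of 0.5 means the result was exactly between two integers, so the correct answer could not be recovered. On a machine at stock settings that usually points at the CPU or RAM producing wrong results rather than at the software.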
 

cben0ist

joeschmuck said: One question we need an answer to is: Was your system stable before the upgrade? I'm talking days of running without issue, not hours.
Yes, I had Bluefin running since January without any issue like that. I was very happy with it, with all my data and multiple apps running (Immich, Nextcloud...).
joeschmuck said: Can you roll back to Bluefin?
That would be my next try, I guess, but since I have opened a ticket I want to see whether that leads anywhere first. I am also worried because I activated some new features on the data and apps pools after the install. Will that prevent me from rolling back to Bluefin?
And since I have reinstalled Cobia from scratch, I'll have to reinstall Bluefin from scratch too, so no real "rollback".
 

joeschmuck

You can also burn the ISO for UBCD, the "Ultimate Boot CD". It has many programs on it, including several CPU stress tests and Memtest86+. I generally create a CD and boot from my USB CD/DVD drive. Easy. However, UBCD is not always easy to navigate, just a heads up.

Any failures are not good. If you have a failure, rerun the test multiple times to ensure it does not happen again. If it does, you cannot trust that system to work properly. I've got some new hardware coming to me and I plan to spend days doing the burn-in testing. If something fails then I have the opportunity to identify the problem part and have it replaced immediately.
cben0ist said: I have reinstalled the system from scratch
This is the one part that concerns me: I don't think you will be able to just switch the boot environment back to the previous version if you reinstalled over the boot drive. And it looks like you have a backup of your data, correct? If true, you are more experienced than many who roll the dice.
 

cben0ist

Thanks @joeschmuck for the UBCD suggestion, I am going to look at that soon. I was away for the last 3 days and had to take a break from all that nonsense ;)
Yes, I do have backups. I just need a properly running system.
The part I am less comfortable with is really the hardware selection. I built that config from documentation and advice gathered here and there, and it had been great until last week, honestly. If I have to get a new motherboard/CPU/RAM combo, what would you suggest I get today?
 

joeschmuck

cben0ist said: If I have to get a new Motherboard/CPU/RAM combo what would you suggest me to get today?
First, I hope you do not need to purchase a new system.
Second, it completely depends on what you require the system to do, and your wallet.
Third, location is a factor. If you are in the US, not a problem; overseas can be a real issue.

I just purchased a new setup with all NVMe storage; it cost $2,300 USD. It should be good for me for a long time. It will be replacing a perfectly good system with old hard drives. What prompted me to use NVMe was the fact that the drives cost the same as what I originally paid for the hard drives. Crazy for sure, but a good deal is hard to pass up. Of course, to support the NVMe drives I needed a new motherboard, CPU, and RAM. Then I bought a new case and power supply (I normally reuse cases and power supplies, as I buy good quality from the start). This was a lot of money for me to spend, but I don't spend much on myself, so what the hell. I pulled the trigger.

If you look at my system lists, you will see my ESXi1 system; that is what I'm replacing. I also have an ESXi Test System, which runs TrueNAS very well if all you are using a NAS for is backing up files and maybe serving media content, and it's fairly low power and silent. The hard drives are the only thing making noise.

If you do need to purchase a new system, write down what you need it to do and how much storage you need. A high-airflow case is important, but these are generally larger in size. Think about cooling, hence the larger case. Fans can make a lot of noise if you do not select them properly; I run my fans at 7 VDC. When you purchase hardware, think of it lasting you a minimum of 10 years; the hard drives, 5 years. Most people will need more storage before they know it, so I tell people to double what they think they need for the next 5 years.

You will have one more big hurdle to get over, and that is the layout of your pool. Research it. For my purposes a RAIDZ2 is adequate. It's not super fast, but I don't need super fast. It's reliable and I have the redundancy I want.
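To put rough numbers on the layout question, here is a small sketch using the pool from earlier in this thread (5-wide RAIDZ2 of 18 TB drives). The function names are made up for illustration, and the result is a simplified upper bound that only subtracts parity; real usable space will be lower once ZFS metadata, padding, and reserved space are accounted for.

```python
TIB = 2**40  # bytes in one tebibyte


def drive_tib(marketed_tb):
    """Convert a marketed drive size (decimal TB) to TiB."""
    return marketed_tb * 10**12 / TIB


def raidz_data_tib(width, parity, marketed_tb):
    """Upper bound on data capacity: every drive except the parity ones."""
    if parity >= width:
        raise ValueError("a vdev needs more drives than parity devices")
    return (width - parity) * drive_tib(marketed_tb)


def planning_target_tib(estimated_need_tib):
    """Rule of thumb from the post above: double the 5-year estimate."""
    return 2 * estimated_need_tib


print(f"one 18 TB drive: {drive_tib(18):.2f} TiB")                     # 16.37 TiB
print(f"5-wide RAIDZ2, parity only: {raidz_data_tib(5, 2, 18):.2f} TiB")  # 49.11 TiB
print(f"target for a 20 TiB estimate: {planning_target_tib(20)} TiB")    # 40 TiB
```

The first figure matches the 16.37 TiB per drive reported in the specs above. The 20 TiB estimate is just an example value.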
 

cben0ist

Alright, I think I found my issue. I ran Memtest86+ and the result was definitely not good. I have removed the 2 sticks I added a month ago, and Cobia has now been running for about 8 hours without restarting or freezing. Crossing my fingers that it will last through the night.
It looks like adding these sticks after the Bluefin install did not cause any issue until the upgrade (or the new install from scratch).
I'll close my ticket if everything is still good after 24h...
Thank you @joeschmuck for helping me locate my issue!
 

joeschmuck

I hope that fixes it. Most memory sticks have a lifetime warranty, so hopefully you can get those exchanged if they are the issue.
 