TrueNAS SCALE Bluefin freezes frequently

Felipe350

Cadet
Joined
Oct 10, 2023
Messages
4
Hello TrueNAS Community!

I've been using TrueNAS SCALE on my server for almost a year now, I think, and while the Docker experience isn't the best it could be, I have been enjoying it quite a lot.

In the past 3 months I've started leaving it on for weeks, even months, as I started using it more consistently. But in the past 2-3 weeks, every now and then (every 1-2 days, or sometimes after only a few hours) it simply freezes: it becomes unresponsive to any input, and the text cursor (the underscore) on the attached monitor is frozen too.

I've tried the following:
- Ran memtest on the RAM for a whole day: no warnings or errors.
- Checked for dust and the like: nothing.
- Tried to check whether the motherboard gave any error codes: nothing.
- Thought it could be an overheating issue, but nothing goes over 40 degrees Celsius.
- Thought it might be a problem with the GPU, as some have said on this forum; I removed it and it still froze.

Here's my system:
CPU: Ryzen 5 2600
COOLER: Noctua NH-L12S
MOBO: ASROCK X470D4U
RAM: 64GB Patriot Viper Steel DDR4 3200MHz NON-ECC (yeah, I know it's not appropriate, but a full day of memtest gave no errors)
PSU: Corsair RM550X
Drives:
- System drive: 128GB ADATA_SU800 SSD
- 1 TB Crucial P3 SSD (main pool for VM, Docker, SMB etc)
- 2 x 4TB Seagate ST4000DM005 HDD (RAID 1 Backup)
- 1 TB TOSHIBA_MQ01ABD100 HDD (old main pool, kept in case I needed something)

Also, I had a GTX 1660 inside to use for a Windows VM via passthrough, but I removed it 2 days ago to test whether it was the cause of the issue (it isn't).

I haven't found a clue so far as to what could be causing the issue. I've just recently updated to the latest version, TrueNAS SCALE 22.12.4 (about 16h ago), but I doubt it will fix this issue.

Thanks in advance to anyone willing to give me a hand ^-^
 

Fleshmauler

Explorer
Joined
Jan 26, 2022
Messages
79
Does anything pop up in the console messages when it freezes? (System Settings > General > Other Options > Show Console Messages [Check].) That has helped me diagnose some nonsense in the past.
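
If the screen shows nothing useful at the moment of the freeze, the kernel log from the previous boot may still hold a hint after you reboot. A minimal sketch, assuming the system keeps a persistent journal (not guaranteed on every install) and that `journalctl` is available. `filter_freeze_hints` is a hypothetical helper name, and the keyword list is only a starting guess at messages commonly seen around lockups:

```shell
#!/bin/sh
# filter_freeze_hints: hypothetical helper that keeps only log lines
# matching keywords often seen around hangs (lockups, clocksource
# trouble, machine-check errors, RCU stalls, out-of-memory kills).
filter_freeze_hints() {
    grep -iE 'soft lockup|hard lockup|clocksource|hardware error|mce:|rcu.*stall|out of memory'
}

# Kernel messages (-k) from the previous boot (-b -1), if a journal exists.
# Prints nothing when there is no earlier boot on record.
if command -v journalctl >/dev/null 2>&1; then
    journalctl -k -b -1 --no-pager 2>/dev/null | filter_freeze_hints | tail -n 50
fi
```

An empty result does not rule anything out: a hard freeze can stop the system before the last messages are written to disk.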
 

LarsR

Guru
Joined
Oct 23, 2020
Messages
719
Did you change anything in the BIOS during those 2-3 weeks?
First- and second-gen Ryzen had some quirks with BIOS power-saving modes that would, if enabled, lead to frozen systems.
But as far as I can remember that was only on CORE systems, since it was a BSD problem. Still, it would be worth checking whether those settings are present and what happens if you disable them.
The settings were: AMD Cool&Quiet, ERP-Ready and Global C-States.
 

Felipe350

Cadet
Joined
Oct 10, 2023
Messages
4
I've enabled the console messages as Fleshmauler suggested; when the next freeze happens I'll check whether I can find something.

Regarding LarsR's message: I haven't changed the BIOS settings during those weeks. I don't remember changing them at all, apart from updating and then downgrading the BIOS, as I later discovered the newest version didn't support 2nd-gen Ryzen. Nonetheless, I've now changed the following settings:
- PSS Support: Auto -> Disabled (this should be AMD Cool&Quiet, but renamed)
- Power Supply Idle Control: Auto -> Typical Current Idle (this should be ERP-Ready, but renamed)
- Global C-states: Auto -> Disabled
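
To check from the running system whether the C-state change took effect, the idle states the kernel exposes can be listed from sysfs. This is only a sketch: `list_idle_states` is a hypothetical helper name, and the assumption that fewer or shallower states show up once Global C-states is disabled may not hold on every board or kernel.

```shell
#!/bin/sh
# list_idle_states: hypothetical helper that prints the name of each
# cpuidle state under the given directory (defaults to CPU 0 in sysfs).
list_idle_states() {
    base="${1:-/sys/devices/system/cpu/cpu0/cpuidle}"
    for d in "$base"/state*; do
        # Skip silently if the directory or the name file is missing.
        [ -r "$d/name" ] || continue
        printf '%s: %s\n' "${d##*/}" "$(cat "$d/name")"
    done
}

# Prints nothing if the kernel exposes no cpuidle states here.
list_idle_states
```

Comparing the output before and after the BIOS change gives a rough confirmation that the setting is being honoured.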

Also, I've changed Restore AC Power Loss from No Change -> Last State, just in case.

I'll send another message in a couple of days to update the situation.
Thank you for the replies.
 

Felipe350

Cadet
Joined
Oct 10, 2023
Messages
4
Hello. After a week with no freezes or other problems whatsoever, I was going to write about how the fixes you gave me solved it, or how the system update somehow managed to pull it off.

Well, as you might have guessed, this is not that post. This morning I woke up to my wonderful cathode-ray tube monitor showing only a "9C" error code in the bottom right of the screen, which should stand for an error in loading the drivers during boot :)
That seemed weird, as other than the changes I reported I hadn't changed any drivers or the like, so I restarted, thinking the server might have frozen in the bootloader (which would be extremely weird, considering I don't even know how it rebooted in the first place).

But then I rebooted and it didn't have issues, so I thought perhaps the Bluetooth USB device could be the culprit and left it to see if that was the case.
Whether it is or not, less than an hour later the server froze again.
Confused as to how that happened after a week of flawless operation, I simply rebooted again, planning to write here in the late evening after thinking it over.
BUT THEN I came back 15 minutes ago to see how it was going and, well... this popped up:
[Attached image: IMG_20231019_125220925.jpg]


I'm thinking of reinstalling the BIOS for this motherboard in case it got an update, but at this point I'm not sure what the issue is, and if it is indeed the motherboard being faulty I would end up with a nice $300 unstable piece of server.

I'm also attaching the IPMI event log, as AFAIK the BIOS event log should record a reboot of the system, and there are "unknown" entries with the same description, but I don't know what they mean.

Again, any help whatsoever is greatly appreciated.
 

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
Well, you didn't post the log (I do that all the time), but I would definitely check for BIOS updates and look at all the power-saving options in the BIOS. I would also definitely add tsc=unstable to the kernel boot options. I believe you will find that fixes it. Most reports seem to involve Ryzen.

Something like this should do it:

midclt call system.advanced.update '{"kernel_extra_options": "tsc=unstable"}'
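
After rebooting, it may be worth confirming that the option actually reached the kernel. A minimal sketch using only standard Linux paths (`/proc/cmdline` and the clocksource entry in sysfs); `has_kernel_opt` is a hypothetical helper name, not a TrueNAS tool:

```shell
#!/bin/sh
# has_kernel_opt: hypothetical helper. Succeeds if OPT appears as a
# whole, space-separated word in the given kernel command line.
has_kernel_opt() {
    opt="$1"
    cmdline="$2"
    case " $cmdline " in
        *" $opt "*) return 0 ;;
        *) return 1 ;;
    esac
}

# /proc/cmdline holds the options the running kernel was booted with.
if [ -r /proc/cmdline ]; then
    if has_kernel_opt "tsc=unstable" "$(cat /proc/cmdline)"; then
        echo "tsc=unstable is active"
    else
        echo "tsc=unstable is NOT active (was the system rebooted?)"
    fi
fi

# Show which clocksource the kernel is currently using.
cs=/sys/devices/system/clocksource/clocksource0/current_clocksource
if [ -r "$cs" ]; then
    echo "current clocksource: $(cat "$cs")"
fi
```

With the TSC marked unstable, the reported clocksource would normally be something other than tsc (for example hpet), though the exact fallback depends on the hardware.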
 

Felipe350

Cadet
Joined
Oct 10, 2023
Messages
4
Oh yeah, sorry, I did attach the log file, but while I was sending the message the website gave me a 502 and I forgot to save it. Anyway, here's the file.

Also, I've completely restored everything to base settings: reset the BIOS, reinstalled the latest TrueNAS SCALE on another SSD, and restored the main config and dataset, and it still froze. Now I've added the tsc=unstable kernel option and I'll see how it goes.
 

Attachments

  • SELLog.txt (526.7 KB)

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
I don't recall if that's a bug in Debian, the kernel driver, or the "BIOS". Hopefully you rebooted after applying the option. It SHOULD be fine. Would love to hear back in a few days, though.
 