TrueNAS-13.0-U6 random freezes

Triox

Cadet
Joined
Dec 10, 2023
Messages
7
Yeah, the shell directly on the NAS video output is still functioning when it happens; that's why I can restart it easily. But SSH and everything else is a no-go. I suspect that in my case the onboard LAN card may be the problem, since it appears as though the link goes down. I haven't checked whether the LEDs on the back side show any activity, but I will do so when/if it happens again.
 

Silvan Burch

Explorer
Joined
May 1, 2016
Messages
65
So I have a suspicion: these freezes seem to happen whenever I update my VMs (and therefore have to restart some services) or when I restart my jails - so far without concrete evidence, though. Maybe the cause is some interaction with the network device?
Either way, I think it would be a good start to set up some logging to see what's happening and then do some testing on the jails and VMs. Is there any way that you guys know of to do useful logging on TrueNAS?
As stated before, the standard logs show no sign of the problem, or I haven't been able to find the right log file so far ...
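One low-effort idea (a sketch; these are the usual TrueNAS CORE log files, adjust the names if yours differ) is to leave a session following the main logs live, so the last lines before a freeze stay on screen:

Code:
tail -F /var/log/messages /var/log/daemon.log /var/log/middlewared.log

If the local console stays alive during a freeze (as Triox reported above), running this there means whatever was logged last should still be readable.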
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222

Silvan Burch

Explorer
Joined
May 1, 2016
Messages
65
I did - nothing in there

my H/W:
ASRock Rack C236 WSI
Intel Core i3-6100
Crucial CT16G4WFD8213 (2x 16 GB DDR4-RAM, ECC, 2133 MHz)
Crucial P3 SSD (on ASUS Hyper M.2 X16 Card)
8x WD Reds as storage (CMR)
be quiet! Dark Power Pro 11 (650W)
---
2 VMs
2 Jails
 
Last edited:

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
How many drives? Which PSU? How many VMs? How many jails? What do they do? Does this behaviour happen even with VMs stopped? With jails stopped? With both stopped? How are the temperatures? Have you tried a fresh install?
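For the temperature question, a quick sketch on CORE (assumes the coretemp module is loaded and the disks show up as ada devices; run it under sh, since the default root shell is csh):

Code:
sysctl dev.cpu | grep temperature
for d in /dev/ada?; do smartctl -A $d | grep -i temperature; done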
 

Silvan Burch

Explorer
Joined
May 1, 2016
Messages
65
Added the info above.

As said, I'm still trying things out and can't say anything for sure; I'm looking for anything that points me in a direction.

Temps are okay; a fresh install has already been tried.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Did you look into /var/log/system and /var/log/iocage.log? Which jails and VMs are you using?
 

Silvan Burch

Explorer
Joined
May 1, 2016
Messages
65
Sadly, both have no useful information:

jails: Emby & Nextcloud
VMs: one with Pi-hole, one with Ubuntu and Docker (mostly for automation, i.e. sabnzbd, etc.)

/var/log/system does not exist
/var/log/iocage.log has its latest entry more than two months in the past

EDIT:
I also checked console.log, cron, daemon.log, dmesg.today and middlewared.log.

From cron I was at least able to determine the more or less exact time of the freeze ... but nothing out of the ordinary.
From daemon.log I saw that smartd tests were running, which can also be seen being initialized in middlewared.log.


Maybe it has something to do with the SMART tests?
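One way to check that (a sketch; uses the freeze time worked out from cron above) is to grep the logs for smartd activity around that timestamp and see whether a test start lines up with the hang:

Code:
grep smartd /var/log/daemon.log | tail -n 20
grep -i smart /var/log/middlewared.log | tail -n 20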

EDIT 2:
Hours logged on each HD:

2023/12/22
ada0 257
ada1 257
ada2 23609
ada3 257
ada4 24654
ada5 57225
ada6 24654
ada7 24648

2023/11/26
ada0 65242
ada1 65243
ada2 23059
ada3 65242
ada4 24104
ada5 56675
ada6 24104
ada7 24098

2023/10/12
ada0 64164
ada1 64164
ada2 21987
ada3 64164
ada4 23032
ada5 55603
ada6 23025
ada7 23025

There are two things:
1st: as can be seen at first sight, ada0, 1 and 3 restarted their respective counts
2nd: the hours logged on ada6 and ada7 were identical on the 12th of October but differed on the 26th of November ... the difference of 6 h is still there on the 22nd of December
Also: smartd doesn't report any errors on any of those disks ...

I assume the hours counter restarts at zero when it reaches 65536, which would explain the first observation.
For the 2nd I have no idea.
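A quick sanity check of the wraparound theory (a sketch in sh syntax; the 551 h is simply the run time implied by the two ada0 readings, assuming one wrap):

Code:
# 65242 (2023/11/26) + 551 h of run time = 65793; 65793 - 65536 = 257 (2023/12/22)
echo $(( (65242 + 551) % 65536 ))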
 
Last edited:

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
/var/log/system does not exist
That's strange; mine looks as follows.

Code:
root@truenas[~]# cd /var/log
root@truenas[/var/log]# ls
aculog                  console.log.1.bz2       cron                    daemon.log              daemon.log.3.bz2        libvirt                 maillog.1.bz2           maillog.5.bz2           messages.1.bz2          messages.5.bz2          middlewared.log.1       proftpd                 ups.log                 wtmp
auth.log                console.log.2.bz2       cron.0.bz2              daemon.log.0.bz2        daemon.log.4.bz2        lpd-errs                maillog.2.bz2           maillog.6.bz2           messages.2.bz2          messages.6.bz2          nginx                   samba4                  userlog                 xferlog
console.log             console.log.3.bz2       cron.1.bz2              daemon.log.1.bz2        debug.log               maillog                 maillog.3.bz2           messages                messages.3.bz2          messages.7.bz2          nut                     security                utx.lastlogin           zettarepl.log
console.log.0.bz2       console.log.4.bz2       cron.2.bz2              daemon.log.2.bz2        iocage.log              maillog.0.bz2           maillog.4.bz2           messages.0.bz2          messages.4.bz2          middlewared.log         ppp.log                 system                  utx.log

Did you try cat /var/log/system?
 

Silvan Burch

Explorer
Joined
May 1, 2016
Messages
65
Code:
cat: /var/log/system: No such file or directory


As for that, I do get the following message while booting:
Code:
Dec 22 07:53:59 fenix4k kernel: pid 1699 (syslog-ng), jid 0, uid 0: exited on signal 6 (core dumped)
Dec 22 07:53:59 fenix4k syslog-ng[3324]: syslog-ng starting up; version='3.35.1'


However, when I run
Code:
service syslog-ng status

it says
Code:
syslog_ng is running as pid 3324.


I googled that, and it seems to be something that happens occasionally but doesn't cause any problems, since syslog-ng still starts itself afterwards.
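To make sure syslog-ng isn't just alive but actually writing, a quick check (a sketch using the stock FreeBSD logger tool; the message should land in /var/log/messages with the default config) is:

Code:
logger "syslog-ng write test"
tail -n 1 /var/log/messages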
 

Silvan Burch

Explorer
Joined
May 1, 2016
Messages
65
Well, for next steps:
1st: I'll check the BIOS event log for anything suspicious (see the sketch below)
2nd: I'll disable the SMART & scrub schedules for a while after the next freeze ... somehow I think it could be related
3rd: I'll disable the watchdog for a while as well ... I recall having months of problem solving back in 2015 where the watchdog was the cause. However, back then I at least had output on the screen. Nevertheless, I'll give it a try
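For the first point: many ASRock Rack boards include a BMC, so (a sketch; assumes ipmitool is present and the BMC is configured) the hardware event log can also be read from the shell without rebooting into the BIOS:

Code:
ipmitool sel elist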
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222

Silvan Burch

Explorer
Joined
May 1, 2016
Messages
65
It's gonna be a lot of trial and error, I assume ... but I'll let people know when I find the root cause, so I might be able to spare someone else some time.

Thanks for your help, though! Appreciated!
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
I must have missed something in the two pages of troubleshooting but just in case I didn't...

Did anyone go into the Boot Environment, activate the previous version and reboot? You know, the version you were running before all hell broke loose? This will validate whether it actually was the update to TrueNAS 13.0. So long as you did not upgrade the pool feature set, you may be able to roll back. If you updated the feature set, you are likely stuck with version 13.0 for now.

I think it's time to rule out items vice trying to go directly to a component.

What is your boot device and where is your System Dataset stored? This may not have anything to do with your problem but it's data that might help us figure out what is going on.

Let's hit on the drives for a minute: are SMART tests being conducted on them? Have you checked when the last Long/Extended test passed on each drive? TrueNAS should report a failure, but if you are not testing, that is a problem in itself; even if the drives turn out to be unrelated, testing lets us rule them out as a possible cause.

I have a few other things to suggest as well, but it will take me some time to type them up. I really should make a guide on this, since there are basic things everyone can do to isolate a problem. I hope you are able to roll back to TrueNAS 12.x; if that fixes it, then we can discuss how to test whether TrueNAS 13.0 is the problem or whether the upgrade simply went poorly.

I would actually like to see the output for each drive [in code brackets] using this command
Code:
smartctl -x /dev/ada0
(repeat for each drive), and I do want the entire output; some people trim it thinking they are being helpful, when in reality they remove something important. If it makes it easier for you (and it will), run this command
Code:
smartctl -x /dev/ada0 >> testfile.txt
and then rerun the command for each drive. All the data will be in the file testfile.txt. Make sure you run this command from a location where you can grab a copy of the file, like a shared dataset. Then you can just attach the text file to a posting. Easy.
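If retyping that eight times is a chore, a small loop does the same thing (a sketch; assumes the disks are ada0 through ada7 and that you run it under sh, since the default root shell is csh):

Code:
cd /mnt/pool/share   # hypothetical shared dataset; use any path you can reach over the network
for d in /dev/ada?; do
    echo "===== $d =====" >> testfile.txt
    smartctl -x $d >> testfile.txt
done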
 

Silvan Burch

Explorer
Joined
May 1, 2016
Messages
65
hey Joe

thanks for helping me out.

Can't go back to the old one. I reinstalled with 13.0-U6 and upgraded the pool feature set with it. Looking back, that was stupid, but since I had never encountered a problem before and TrueNAS 13 had been running for days without any sign of trouble, I upgraded. However, chances are that something got messed up in the update, as the problem seems to have started occurring since then.

boot device: Crucial P3 SSD (on ASUS Hyper M.2 X16 Card)
system dataset pool is the boot pool

SMART tests are scheduled; above you'll find the latest info about the running hours. All the info from smartctl is attached as a file.
 

Attachments

  • testfile.txt
    95.5 KB · Views: 57

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
So, drives WD-WCC4E5RTU4FL, WD-WCC4E1EA3C4H and WD-WCC4E1JFJNH8 have test data that rolled over the magical 64K mark and is starting at zero again, yet the power-on hours continue to increment normally. It is not a real problem, but you need to subtract 65536 from the power-on hours to match them up with the SMART test hours.

Drive WD-WCC4E1EA3C4H has errors: Raw Read Error Rate and MultiZone Error Rate. These are genuine error counts for this drive model, regardless of the fact that it's still passing an extended test.

Drive WD-WCC4E1JFJNH8 has a Raw Read Error Rate of 3. That alone, so long as it doesn't increase and no other errors occur, doesn't mean the drive needs replacement yet, but I'm sure it's getting close. Keep an eye on it.

The rest of the drives look fine.

If you haven't already done so (hopefully you did before you upgraded, but if not, you can create one now), make a backup of your TrueNAS configuration and secret seed file via the GUI. Next, download a fresh ISO of TrueNAS CORE. Install it on your boot drive, and after the system boots up and you are able to get into the GUI, restore your configuration file. The system will possibly reboot several times, but once that's over, it should be as it once was, hopefully without the intermittent freezes. This is different from an upgrade; it's a completely fresh install. Periodically someone (including me) will have a failed upgrade, and the best way to take care of it is to reinstall from scratch.
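The GUI download is the supported route, but for reference, a belt-and-braces sketch: on CORE the two files in question live at fixed paths and can also be copied by hand (the destination below is hypothetical):

Code:
cp /data/freenas-v1.db /mnt/pool/backup/    # configuration database
cp /data/pwenc_secret /mnt/pool/backup/     # secret seed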

I reinstalled with 13.0-U6 and upgraded the pool feature set with it. Looking back, that was stupid, but since I had never encountered a problem before and TrueNAS 13 had been running for days without any sign of trouble, I upgraded.
We all have things we learn the hard way. Remember: do not upgrade the pool unless you know there is a new feature you need; holding off generally leaves you the option to roll back.

I really hope this helps.
 

Silvan Burch

Explorer
Joined
May 1, 2016
Messages
65
hi there

Thanks for the rundown on my HDs. The read errors had indeed been reported; I reset the alert and will see if they come up again. Worst case, I always keep a spare drive just in case.

I do have backups of the config and make them regularly. I will go on and try to figure out the reason for the freezes. Either way, if it keeps coming up I'll definitely do a reinstall and see if that helps - thanks for the tip!

As stated before: I will report back either way, just in case it can help someone else! :)

Merry Xmas and thanks again!
 

Silvan Burch

Explorer
Joined
May 1, 2016
Messages
65
Well ... it went well for almost a month.
I am still trying to figure out what's happening, and I'll keep trying for a bit.
Just in case someone still reads this: I intend to write down what eventually solved it :)
 

Silvan Burch

Explorer
Joined
May 1, 2016
Messages
65
Seems like I'm on the right track now ... apparently one of my VMs was causing the problem.
I'll watch it a bit more and write again later if everything goes smoothly for a few more weeks.
 