SCALE becomes unreachable every few weeks

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
That wasn't my intention at all. I was merely trying to point out that my understanding of you as a TN user is that the vast majority of your experience is in CORE. If I misspoke there, that's on me and I can apologize for that.
Apology accepted. You are right that my experience is in CORE.

To be clear - it is NOT a solution. It's a shitty Walmart-brand band-aid masking the underlying problem.
Agreed, it's like using duct tape to hold your car together just long enough to reach the repair shop.
 

grigory

Dabbler
Joined
Dec 22, 2022
Messages
32
Okay, I didn't think 32 GB was so tight, but fair enough. I'm not actually using Docker at all, just TrueCharts apps on SCALE's k3s.

HeavyScript just says Docker, but it's a tool to manage TrueCharts apps.
I have this exact same issue. I am running an iXsystems MINI XL+ with 32 GB RAM (4x WD Red Plus drives and 1 WD SSD, plus the OS drive, with 2 GB of swap).
My system will run for a few days, maybe up to two weeks at most, before the only thing that will let me access the WebUI again is a reset. The system boots up and I can log in. Inspecting the log shows a lot of oom-killer messages. As we have heard here, the Linux memory management does its thing: from what I can see in the log, it starts killing the things with the highest oom score and then keeps going down the list. It never kills enough to recover.
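
For reference, a minimal sketch of how the oom-killer entries can be pulled out of the kernel log over SSH (standard Linux tooling; dockerd is only used as an example process to inspect):

# Recent oom-killer events, with human-readable timestamps
dmesg -T | grep -iE -B1 -A5 'oom-killer|out of memory'

# Same thing from the kernel journal, if it persists across the hang
journalctl -k | grep -iE 'oom-killer|out of memory'

# The oom score of a given process (higher score = killed first)
cat /proc/$(pidof dockerd)/oom_score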

I also run several apps, all from the TrueCharts catalog. I raised this in the TrueCharts Discord and was basically told something else must be wrong, since other users were running similar setups on 32 GB of RAM. The apps are:

1) blocky
2) cert-manager
3) chronos
4) cloudnative-pg
5) grafana
6) home-assistant
7) immich
8) jellyfin
9) loki
10) nextcloud
11) pgadmin
12) prometheus
13) prometheus-operator
14) prowlarr
15) qbittorrent
16) radarr
17) sonarr
18) traefik
19) zwavejs2mqtt

Since my system is still under warranty, I raised this with iXsystems and they pointed me to this thread, explaining that they cannot offer me support on software and that, based on the debug info, all the memory modules appeared to be functioning correctly with no hardware faults.

I am going to double the RAM and see how it goes.

I will say, it is quite confusing to see the free and cache lines in the TN SCALE interface showing so much RAM while the system still crashes in this way. I would expect the cache to be enough for the system to recover; that said, it is impossible to tell how much memory is being requested by the services. All I know is there is a spike and then oom-killer starts executing :P
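
A rough way to see what the services are actually holding at a given moment, over SSH (a sketch using standard procps tools; the ZFS ARC is accounted separately from process memory):

# Top 15 processes by resident memory (RSS, in KiB)
ps -eo pid,rss,comm --sort=-rss | head -n 16

# Current ARC size in MiB, to compare against the dashboard's "Cache" figure
awk '/^size/ {print $3 / 1024 / 1024 " MiB"}' /proc/spl/kstat/zfs/arcstats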
 

DonnerPlays

Cadet
Joined
Aug 8, 2023
Messages
3
I have experienced this error for a long time as well. The NAS would usually crash after ~5 days, killing process after process until nothing was left.

I'm pretty sure it's not RAM-size related. It appears to be more of a bug (the OS thinking there is no RAM left).
I had 32 GB of ECC RAM in my NAS (Ryzen build)
and the problem still occurred.

I upgraded to 64 GB ECC RAM but the crashes still happened; no difference, still after ~5 days.

The last time it happened I had a short time where I could still access the WebUI (SSH was long gone) and was able to see the memory usage.

It was sitting at ~20 GB cache and ~20 GB services (more or less, but it definitely wasn't like the system had no RAM left).
(At least that's what the WebUI told me)

The ONLY thing that has (at least so far) resolved the issue was a reinstall of TrueNAS (backing up the config, reinstalling to the same drive, restoring the backup).
The system has been online for 25 days now (with one manual reboot done by me while I was fighting an app that wasn't working the way I wanted), so it's probably been more than 30 days in total.

RAM usage according to the WebUI is currently:

Free: 6.0 GiB
ZFS Cache: 30.9 GiB
Services: 25.8 GiB
 

NetSoerfer

Explorer
Joined
May 8, 2016
Messages
57
I have experienced this error for a long time as well. The NAS would usually crash after ~5 days, killing process after process until nothing was left.

I'm pretty sure it's not RAM-size related. It appears to be more of a bug (the OS thinking there is no RAM left).
I had 32 GB of ECC RAM in my NAS (Ryzen build)
and the problem still occurred.

I upgraded to 64 GB ECC RAM but the crashes still happened; no difference, still after ~5 days.

The last time it happened I had a short time where I could still access the WebUI (SSH was long gone) and was able to see the memory usage.

It was sitting at ~20 GB cache and ~20 GB services (more or less, but it definitely wasn't like the system had no RAM left).
(At least that's what the WebUI told me)

The ONLY thing that has (at least so far) resolved the issue was a reinstall of TrueNAS (backing up the config, reinstalling to the same drive, restoring the backup).
The system has been online for 25 days now (with one manual reboot done by me while I was fighting an app that wasn't working the way I wanted), so it's probably been more than 30 days in total.

RAM usage according to the WebUI is currently:

Free: 6.0 GiB
ZFS Cache: 30.9 GiB
Services: 25.8 GiB
What’s the long-term history of RAM usage? Is it still walking towards 0 Free?
 

DonnerPlays

Cadet
Joined
Aug 8, 2023
Messages
3
What’s the long-term history of RAM usage? Is it still walking towards 0 Free?
Never goes to 0 RAM free

From what I know, the amount of free RAM doesn't matter, because ZFS just takes whatever is left after services (up to a certain point, I think) to use as cache.

If something needs RAM, the ZFS cache just shrinks a bit.
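
That matches how the ARC is documented to behave; if in doubt it can also be capped so it's ruled out. A sketch (run as root; the 16 GiB value is only an example, not a recommendation, and arc_summary output formatting varies by version):

# Current ARC size and configured maximum
arc_summary | grep -iE 'arc size|max size'

# Temporarily cap the ARC at 16 GiB (reverts on reboot)
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max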
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Just looking at the swap space statistics should give you a definitive answer regarding the lack of RAM; if your system crashes without using any of the swap space, it might be worth submitting a bug report.
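
For anyone wanting to check those statistics, a quick sketch (standard util-linux/procps commands):

# Overall memory and swap usage
free -h

# Which swap devices exist and how full they are
swapon --show

# Swap-in/swap-out activity (si/so columns), sampled every 5 seconds
vmstat 5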
 

grigory

Dabbler
Joined
Dec 22, 2022
Messages
32
I tried reinstalling the OS, and 6 days later oom-killer did its thing.
I can confirm that free swap goes to 0 when this happens.
Next I think I will install more RAM, since it's been recommended here and I have it on hand.
Then I think I will set my system to restart nightly rather than uninstalling/reinstalling each app/configuration to test which one might be causing this.

It is mildly upsetting that iXsystems support did not seem too concerned. If this could be a bug, I would expect the support person to direct me to file a bug report or suggest some other course of action; instead they checked my debug file, said the hardware was fine, and sent me away.

Anyway, just providing an update in case anyone else with the same issue comes across this.
 

NetSoerfer

Explorer
Joined
May 8, 2016
Messages
57
I tried reinstalling the OS, and 6 days later oom-killer did its thing.
I can confirm that free swap goes to 0 when this happens.
Next I think I will install more RAM, since it's been recommended here and I have it on hand.
Then I think I will set my system to restart nightly rather than uninstalling/reinstalling each app/configuration to test which one might be causing this.

It is mildly upsetting that iXsystems support did not seem too concerned. If this could be a bug, I would expect the support person to direct me to file a bug report or suggest some other course of action; instead they checked my debug file, said the hardware was fine, and sent me away.

Anyway, just providing an update in case anyone else with the same issue comes across this.
I upgraded my RAM to 64GB and just had the same thing happen to me again.
I'm pretty sure it's not RAM-size related. It appears to be more of a bug (the OS thinking there is no RAM left).
I had 32 GB of ECC RAM in my NAS (Ryzen build)
and the problem still occurred.

I upgraded to 64 GB ECC RAM but the crashes still happened; no difference, still after ~5 days.
It happened when I tried to stream (and transcode) a video with Emby. It had been fine for a week or so, with fairly constant RAM usage: somewhere around 25 GB, with 30 GB of cache and 6-7 GB free, for days.

I checked earlier on the same day and the above numbers were still true, and then suddenly the WebUI was inaccessible and htop reported RAM and swap exhausted.

What can I do to identify which app (or host process) is the cause of the issue?
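
One rough way to narrow it down, assuming the apps run on SCALE's built-in k3s (kubectl top needs the metrics server to be available):

# Per-pod memory across all namespaces, highest first
k3s kubectl top pods -A --sort-by=memory

# Node-level figure from the same source, as a sanity check against the dashboard
k3s kubectl top nodes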
 

grigory

Dabbler
Joined
Dec 22, 2022
Messages
32
I have added more RAM, bringing the system to 64 GB. I will wait now to see if it occurs again.
I did notice that, over time, swap usage seems to grow over the week prior to oom-killer kicking in.
I think if this happens again in a week, my next step will be to increase the swap size and maybe add the command mentioned to a cron job to log the oom-killer activity a bit.
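
A rough sketch of what such a cron entry could look like (the log path is a placeholder; writing to a pool dataset means the data survives a crash):

# /etc/cron.d entry: every 10 minutes, append swap usage and the top 10
# memory consumers to a log file
*/10 * * * * root { date; free -h; ps -eo pid,rss,comm --sort=-rss | head -n 11; } >> /mnt/tank/oom-watch.log 2>&1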

edit: I forgot to add that I notice this more when watching something on Jellyfin, but that could just be because it's the app I use the most.
 

NetSoerfer

Explorer
Joined
May 8, 2016
Messages
57
I have added more RAM, bringing the system to 64 GB. I will wait now to see if it occurs again.
I did notice that, over time, swap usage seems to grow over the week prior to oom-killer kicking in.
I think if this happens again in a week, my next step will be to increase the swap size and maybe add the command mentioned to a cron job to log the oom-killer activity a bit.

edit: I forgot to add that I notice this more when watching something on Jellyfin, but that could just be because it's the app I use the most.
Do you transcode?
 

DonnerPlays

Cadet
Joined
Aug 8, 2023
Messages
3
I have new information to share about this problem.

It was gone for me for a long time after reinstalling...

Unfortunately, it has come back.

This time I was able to catch it, though, so I can share some more info.

Let me know if any logs would be helpful now; since I was able to stop the NAS from crashing, I still have all the logs (usually they would be gone after a crash).

My findings:

Apps stopped responding, and looking at the WebUI:
- the Apps tab was no longer loading
- the WebUI reported the RAM completely used up by "Services"

htop over SSH showed all RAM used up by
/usr/bin/dockerd -H fd://
at ~71% and rising slowly.

There were multiple rows with the same command using up the exact same percentage of RAM;
at the top was one with a TIME+ of ~405h (I think; that wouldn't match my current uptime of ~9 days though).

I killed that process using htop => F9 => SIGKILL

- the RAM usage in htop immediately dropped to near zero
- the Apps tab in the WebUI worked again
- the WebUI responded faster
- RAM usage shown in the WebUI returned to near zero, slowly increasing again for ZFS cache and services
- all apps had been killed and were slowly deploying again.


I would assume it's some app suddenly causing this... I just don't know a way to figure out WHICH app is using up all the RAM...
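
One way to get a per-container breakdown before killing anything, assuming the apps are still running on the dockerd backend shown above:

# One-shot snapshot of memory usage per running container
docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}"
# Containers managed by k8s are usually named k8s_<container>_<pod>_<namespace>_...,
# which maps each row back to an app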
 

brahmy

Dabbler
Joined
Mar 24, 2022
Messages
13
Sorry to bump this thread, but the same thing as the OP just happened to me on TrueNAS-SCALE-22.12.3.2.

The Web UI and SSH were totally unresponsive.

I can add some additional details unique to my setup...
  • While TN was totally unresponsive, running apps were OK, including:
    • An Ubuntu VM running Docker, InfluxDB, Grafana, and a bunch of other stuff. I was able to SSH into the VM, and the VM-hosted apps were experiencing no lag or issues.
    • Three TrueNAS-hosted apps including Telegraf, Plex, and Tailscale...
... What's interesting about this failure mode is that Telegraf was able to keep firing metrics about what was happening on TrueNAS into InfluxDB (see attached). Basically, the CPU was absolutely pinned (though the system was still respecting the allocation to the VM, which was running fine), RAM still had plenty of free capacity, and in the bottom right of the image there was a hiccup in S.M.A.R.T. data from my boot drive. Not sure if this is cause or effect.

I can dig into the other metrics saved in InfluxDB if anyone thinks there would be any good clues there.

Fortunately I was able to do a controlled remote power-down of the VM, and after a hard reset of TrueNAS everything came back. This is the first time this has happened to me on this version and hardware mix (I've been running this setup since about June).

Hardware (random consumer grade stuff):
  • Motherboard: MSI PRO Z790-A
  • CPU: Intel i5-12400
  • RAM: Corsair Vengeance 64GB DDR5 5600MHz
  • 3x 8TB HDD in one RAIDZ1 pool (data)
  • 3x 1TB SSD in one RAIDZ1 pool (apps)
  • NVIDIA 2080 passed through to VM
  • Google Coral TPU passed through to VM
 

Attachments

  • issue.PNG (1.3 MB)

grigory

Dabbler
Joined
Dec 22, 2022
Messages
32
@brahmy thanks for providing all this detail. I would be curious to know what swap usage looked like.

Has anyone raised this as a bug with iXsystems?

I will also report that after moving from 32 GB to 64 GB my problem seems to be gone. Although I have had some other reasons to restart, I think I was able to get to 20 days of uptime (the longest over the past 7-10 restarts); time will tell if this is resolved.
 

brahmy

Dabbler
Joined
Mar 24, 2022
Messages
13
@brahmy thanks for providing all this detail. I would be curious to know what swap usage looked like.

Has anyone raised this as a bug with iXsystems?

I will also report that after moving from 32 GB to 64 GB my problem seems to be gone. Although I have had some other reasons to restart, I think I was able to get to 20 days of uptime (the longest over the past 7-10 restarts); time will tell if this is resolved.
If the screenshot from the TrueNAS dashboard can be believed, I was using 0 swap memory (I thought the Telegraf/InfluxDB metric was off, but the TN dashboard showed the same).

Also pictured:
  • free mem.png - lots of free RAM, which seems to indicate I'm not memory-constrained.
  • blocked and zombie processes shoot up around 4am CST, when this issue started.
  • Possible smoking gun? A big spike in scrub_task.png indicates something going on with a boot pool scrub just before the CPU usage and everything else spiked. Both my pools got scrubbed last night, so maybe that triggered something?
So MAYBE a good question for others experiencing the issue: does the freeze line up with the start or completion of a scrub task?
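
For anyone checking that correlation, scrub times can be pulled from the pool history (a sketch; "tank" is a placeholder pool name, boot-pool is SCALE's default boot pool):

# When scrubs were started on each pool
zpool history boot-pool | grep -i scrub
zpool history tank | grep -i scrub

# Completion time of the most recent scrub per pool
zpool status | grep -i scan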
 

Attachments

  • swap.PNG (92.3 KB)
  • free mem.png (506.8 KB)
  • blocked and zombie processes.PNG (51 KB)
  • scrub_task.png (57.7 KB)

grigory

Dabbler
Joined
Dec 22, 2022
Messages
32
Ugh, I spoke too soon; this happened to me again today while watching a video on Jellyfin.
The failure does not coincide with a scrub task.
As with the previous occurrences, all swap memory is used (0 MB free) and oom-killer starts.
I raised this ticket; I encourage others with this issue to log in there and provide context.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Apps stopped responding, and looking at the WebUI:
- the Apps tab was no longer loading
- the WebUI reported the RAM completely used up by "Services"

htop over SSH showed all RAM used up by
/usr/bin/dockerd -H fd://
at ~71% and rising slowly.

There were multiple rows with the same command using up the exact same percentage of RAM;
at the top was one with a TIME+ of ~405h (I think; that wouldn't match my current uptime of ~9 days though).

I killed that process using htop => F9 => SIGKILL

- the RAM usage in htop immediately dropped to near zero
- the Apps tab in the WebUI worked again
- the WebUI responded faster
- RAM usage shown in the WebUI returned to near zero, slowly increasing again for ZFS cache and services
- all apps had been killed and were slowly deploying again.


I would assume it's some app suddenly causing this... I just don't know a way to figure out WHICH app is using up all the RAM...

This diagnosis seems very interesting. Can anyone else confirm this dockerd problem with htop?

It seems to point to a memory leak in dockerd... perhaps exacerbated when an app does something.
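
For anyone wanting to confirm without leaving htop open, a minimal sketch that snapshots dockerd's resident memory (assumes a single dockerd process):

# Resident memory (KiB) and elapsed run time of the docker daemon
ps -o pid,rss,etime,cmd -C dockerd

# The same figure straight from the kernel
grep VmRSS /proc/$(pidof dockerd)/status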

Interestingly, Cobia replaces dockerd with containerd. If anyone can confirm the issue and then test with Cobia, it would be useful.

Cobia, or SCALE 23.10-RC.1, comes out tomorrow. Don't try it unless you have the skills to roll back if needed.
 