SCALE becomes unreachable every few weeks

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
That wasn't my intention at all. I was merely trying to point out that my understanding of you as a TN user is that the vast majority of your experience is in CORE. If I misspoke there, that's on me and I can apologize for that.
Apology accepted. You are right that my experience is in CORE.

To be clear - it is NOT a solution. It's a shitty Walmart-brand band-aid masking the underlying problem.
Agreed, it's like using duct tape to hold your car together just long enough to reach the repair shop.
 

grigory

Dabbler
Joined
Dec 22, 2022
Messages
32
Okay, I didn't think 32 GB was so tight, but fair enough. I'm not actually using Docker at all, just TrueCharts apps on SCALE's k3s.

HeavyScript just says Docker, but it's a tool to manage TrueCharts apps.
I have this exact same issue. I am running an iXsystems MINI XL+ with 32 GB RAM (4x WD Red Plus drives and 1 WD SSD, plus the OS drive, with 2 GB of swap).
My system will run for a few days, maybe up to two weeks at most, before the only thing that will let me access the WebUI again is a reset. The system boots up and I can log in. Inspecting the log shows a lot of oom-killer messages. As we have heard here, the Linux memory management does its thing: from what I can see in the log, it starts killing the things with the highest oom score and then keeps going down the list. It never kills enough to recover.
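
For reference, a minimal sketch of how the oom-killer entries can be pulled out of the kernel log over SSH (standard Linux tooling; dockerd is only used as an example process to inspect):

# Recent oom-killer events, with human-readable timestamps
dmesg -T | grep -iE -B1 -A5 'oom-killer|out of memory'

# Same thing from the kernel journal, if it persists across the hang
journalctl -k | grep -iE 'oom-killer|out of memory'

# The oom score of a given process (higher score = killed first)
cat /proc/$(pidof dockerd)/oom_score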

I also run several apps, all from the TrueCharts catalog. I raised this in the TrueCharts Discord and was basically told something else must be wrong, since other users were running similar setups on 32 GB of RAM. The apps are:

1) blocky
2) cert-manager
3) chronos
4) cloudnative-pg
5) grafana
6) home-assistant
7) immich
8) jellyfin
9) loki
10) nextcloud
11) pgadmin
12) prometheus
13) prometheus-operator
14) prowlarr
15) qbittorrent
16) radarr
17) sonarr
18) traefik
19) zwavejs2mqtt

Since my system is still under warranty, I raised this with iXsystems and they pointed me to this thread, explaining that they cannot offer me support on software and that, based on the debug info, all the memory modules appeared to be functioning correctly with no hardware faults.

I am going to double the RAM and see how it goes.

I will say, it is quite confusing to see the free and cache lines in the TN SCALE interface showing so much RAM while the system still crashes in this way. I would expect the cache to be enough for the system to recover; that said, it is impossible to tell how much memory is being requested by the services. All I know is there is a spike and then oom-killer starts executing :P
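
A rough way to see what the services are actually holding at a given moment, over SSH (a sketch using standard procps tools; the ZFS ARC is accounted separately from process memory):

# Top 15 processes by resident memory (RSS, in KiB)
ps -eo pid,rss,comm --sort=-rss | head -n 16

# Current ARC size in MiB, to compare against the dashboard's "Cache" figure
awk '/^size/ {print $3 / 1024 / 1024 " MiB"}' /proc/spl/kstat/zfs/arcstats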
 

DonnerPlays

Cadet
Joined
Aug 8, 2023
Messages
3
I have experienced this error for a long time as well. The NAS would usually crash after ~5 days, killing process after process until nothing was left.

I'm pretty sure it's not RAM-size related. It appears to be more of a bug (the OS thinking there is no RAM left).
I had 32 GB of ECC RAM in my NAS (Ryzen build)
and the problem still occurred.

I upgraded to 64 GB ECC RAM but the crashes still happened; no difference, still after ~5 days.

The last time it happened I had a short time where I could still access the WebUI (SSH was long gone) and was able to see the memory usage.

It was sitting at ~20 GB cache and ~20 GB services (more or less, but it definitely wasn't like the system had no RAM left).
(At least that's what the WebUI told me)

The ONLY thing that has (at least so far) resolved the issue was a reinstall of TrueNAS (backing up the config, reinstalling to the same drive, restoring the backup).
The system has been online for 25 days now (with one manual reboot done by me while I was fighting an app that wasn't working the way I wanted), so it's probably been more than 30 days in total.

RAM usage according to the WebUI is currently:

Free: 6.0 GiB
ZFS Cache: 30.9 GiB
Services: 25.8 GiB
 

NetSoerfer

Explorer
Joined
May 8, 2016
Messages
57
I have experienced this error for a long time as well. The NAS would usually crash after ~5 days, killing process after process until nothing was left.

I'm pretty sure it's not RAM-size related. It appears to be more of a bug (the OS thinking there is no RAM left).
I had 32 GB of ECC RAM in my NAS (Ryzen build)
and the problem still occurred.

I upgraded to 64 GB ECC RAM but the crashes still happened; no difference, still after ~5 days.

The last time it happened I had a short time where I could still access the WebUI (SSH was long gone) and was able to see the memory usage.

It was sitting at ~20 GB cache and ~20 GB services (more or less, but it definitely wasn't like the system had no RAM left).
(At least that's what the WebUI told me)

The ONLY thing that has (at least so far) resolved the issue was a reinstall of TrueNAS (backing up the config, reinstalling to the same drive, restoring the backup).
The system has been online for 25 days now (with one manual reboot done by me while I was fighting an app that wasn't working the way I wanted), so it's probably been more than 30 days in total.

RAM usage according to the WebUI is currently:

Free: 6.0 GiB
ZFS Cache: 30.9 GiB
Services: 25.8 GiB
What’s the long-term history of RAM usage? Is it still walking towards 0 Free?
 

DonnerPlays

Cadet
Joined
Aug 8, 2023
Messages
3
What’s the long-term history of RAM usage? Is it still walking towards 0 Free?
Never goes to 0 RAM free

From what I know, the amount of free RAM doesn't matter, because ZFS just takes whatever is left after services (up to a certain point, I think) to use as cache.

If something needs RAM, the ZFS cache just shrinks a bit.
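
That matches how the ARC is documented to behave; if in doubt it can also be capped so it's ruled out. A sketch (run as root; the 16 GiB value is only an example, not a recommendation, and arc_summary output formatting varies by version):

# Current ARC size and configured maximum
arc_summary | grep -iE 'arc size|max size'

# Temporarily cap the ARC at 16 GiB (reverts on reboot)
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max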
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Just looking at the swap space statistics should give you a definitive answer regarding the lack of RAM; if your system crashes without using any of the swap space, it might be worth submitting a bug report.
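
For anyone wanting to check those statistics, a quick sketch (standard util-linux/procps commands):

# Overall memory and swap usage
free -h

# Which swap devices exist and how full they are
swapon --show

# Swap-in/swap-out activity (si/so columns), sampled every 5 seconds
vmstat 5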
 

grigory

Dabbler
Joined
Dec 22, 2022
Messages
32
I tried reinstalling the OS, and 6 days later oom-killer did its thing.
I can confirm that free swap goes to 0 when this happens.
Next I think I will install more RAM, since it's been recommended here and I have it on hand.
Then I think I will set my system to restart nightly rather than uninstalling/reinstalling each app/configuration to test which one might be causing this.

It is mildly upsetting that iXsystems support did not seem too concerned. If this could be a bug, I would expect the support person to direct me to file a bug report or suggest some other course of action; instead they checked my debug file, said the hardware was fine, and sent me away.

Anyway, just providing an update in case anyone else with the same issue comes across this.
 

NetSoerfer

Explorer
Joined
May 8, 2016
Messages
57
I tried reinstalling the OS, and 6 days later oom-killer did its thing.
I can confirm that free swap goes to 0 when this happens.
Next I think I will install more RAM, since it's been recommended here and I have it on hand.
Then I think I will set my system to restart nightly rather than uninstalling/reinstalling each app/configuration to test which one might be causing this.

It is mildly upsetting that iXsystems support did not seem too concerned. If this could be a bug, I would expect the support person to direct me to file a bug report or suggest some other course of action; instead they checked my debug file, said the hardware was fine, and sent me away.

Anyway, just providing an update in case anyone else with the same issue comes across this.
I upgraded my RAM to 64GB and just had the same thing happen to me again.
I'm pretty sure it's not RAM-size related. It appears to be more of a bug (the OS thinking there is no RAM left).
I had 32 GB of ECC RAM in my NAS (Ryzen build)
and the problem still occurred.

I upgraded to 64 GB ECC RAM but the crashes still happened; no difference, still after ~5 days.
It happened when I tried to stream (and transcode) a video with Emby. It had been fine for a week or so, with fairly constant RAM usage: somewhere around 25 GB, with 30 GB of cache and 6-7 GB free, for days.

I checked earlier on the same day and the above numbers were still true, and then suddenly the WebUI was inaccessible and htop reported RAM and swap exhausted.

What can I do to identify which app (or host process) is the cause of the issue?
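
One rough way to narrow it down, assuming the apps run on SCALE's built-in k3s (kubectl top needs the metrics server to be available):

# Per-pod memory across all namespaces, highest first
k3s kubectl top pods -A --sort-by=memory

# Node-level figure from the same source, as a sanity check against the dashboard
k3s kubectl top nodes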
 

grigory

Dabbler
Joined
Dec 22, 2022
Messages
32
I have added more RAM, bringing the system to 64 GB. I will wait now to see if it occurs again.
I did notice that, over time, swap usage seems to grow over the week prior to oom-killer kicking in.
I think if this happens again in a week, my next step will be to increase the swap size and maybe add the command mentioned to a cron job to log the oom-killer activity a bit.
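
A rough sketch of what such a cron entry could look like (the log path is a placeholder; writing to a pool dataset means the data survives a crash):

# /etc/cron.d entry: every 10 minutes, append swap usage and the top 10
# memory consumers to a log file
*/10 * * * * root { date; free -h; ps -eo pid,rss,comm --sort=-rss | head -n 11; } >> /mnt/tank/oom-watch.log 2>&1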

edit: I forgot to add that I notice this more when watching something on Jellyfin, but that could just be because it's the app I use the most.
 

NetSoerfer

Explorer
Joined
May 8, 2016
Messages
57
I have added more RAM, bringing the system to 64 GB. I will wait now to see if it occurs again.
I did notice that, over time, swap usage seems to grow over the week prior to oom-killer kicking in.
I think if this happens again in a week, my next step will be to increase the swap size and maybe add the command mentioned to a cron job to log the oom-killer activity a bit.

edit: I forgot to add that I notice this more when watching something on Jellyfin, but that could just be because it's the app I use the most.
Do you transcode?
 

DonnerPlays

Cadet
Joined
Aug 8, 2023
Messages
3
I have new information to share about this problem.

It was gone for me for a long time after reinstalling...

Unfortunately, it has come back.

This time I was able to catch it, though, so I can share some more info.

Let me know if any logs would be helpful now; since I was able to stop the NAS from crashing, I still have all the logs (usually they would be gone after a crash).

My findings:

Apps stopped responding, and looking at the WebUI:
- the Apps tab was no longer loading
- the WebUI reported the RAM completely used up by "Services"

htop over SSH showed all RAM used up by
/usr/bin/dockerd -H fd://
at ~71% and rising slowly.

There were multiple rows with the same command using up the exact same percentage of RAM;
at the top was one with a TIME+ of ~405h (I think; that wouldn't match my current uptime of ~9 days though).

I killed that process using htop => F9 => SIGKILL

- the RAM usage in htop immediately dropped to near zero
- the Apps tab in the WebUI worked again
- the WebUI responded faster
- RAM usage shown in the WebUI returned to near zero, slowly increasing again for ZFS cache and services
- all apps had been killed and were slowly deploying again.


I would assume it's some app suddenly causing this... I just don't know a way to figure out WHICH app is using up all the RAM...
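
One way to get a per-container breakdown before killing anything, assuming the apps are still running on the dockerd backend shown above:

# One-shot snapshot of memory usage per running container
docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}"
# Containers managed by k8s are usually named k8s_<container>_<pod>_<namespace>_...,
# which maps each row back to an app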
 

brahmy

Dabbler
Joined
Mar 24, 2022
Messages
13
Sorry to bump this thread, but the same thing as the OP just happened to me on TrueNAS-SCALE-22.12.3.2.

The Web UI and SSH were totally unresponsive.

I can add some additional details unique to my setup...
  • While TN was totally unresponsive, running apps were OK, including:
    • An Ubuntu VM running Docker, InfluxDB, Grafana, and a bunch of other stuff. I was able to SSH into the VM, and the VM-hosted apps were experiencing no lag or issues.
    • Three TrueNAS-hosted apps including Telegraf, Plex, and Tailscale...
... What's interesting about this failure mode is that Telegraf was able to keep firing metrics about what was happening on TrueNAS into InfluxDB (see attached). Basically, the CPU was absolutely pinned (though the system was still respecting the allocation to the VM, which was running fine), RAM still had plenty of free capacity, and in the bottom right of the image there was a hiccup in S.M.A.R.T. data from my boot drive. Not sure if this is cause or effect.

I can dig into the other metrics saved in InfluxDB if anyone thinks there would be any good clues there.

Fortunately I was able to do a controlled remote power-down of the VM, and after a hard reset of TrueNAS everything came back. This is the first time this has happened to me on this version and hardware mix (I've been running this setup since about June).

Hardware (random consumer grade stuff):
  • Motherboard: MSI PRO Z790-A
  • CPU: Intel i5-12400
  • RAM: Corsair Vengeance 64GB DDR5 5600MHz
  • 3x 8TB HDD in one RAIDZ1 pool (data)
  • 3x 1TB SSD in one RAIDZ1 pool (apps)
  • NVIDIA 2080 passed through to VM
  • Google Coral TPU passed through to VM
 

Attachments

  • issue.PNG (1.3 MB)

grigory

Dabbler
Joined
Dec 22, 2022
Messages
32
@brahmy thanks for providing all this detail. I would be curious to know what swap usage looked like.

Has anyone raised this as a bug with iXsystems?

I will also report that after moving from 32 GB to 64 GB my problem seems to be gone. Although I have had some other reasons to restart, I think I was able to get to 20 days of uptime (the longest over the past 7-10 restarts); time will tell if this is resolved.
 

brahmy

Dabbler
Joined
Mar 24, 2022
Messages
13
@brahmy thanks for providing all this detail. I would be curious to know what swap usage looked like.

Has anyone raised this as a bug with iXsystems?

I will also report that after moving from 32 GB to 64 GB my problem seems to be gone. Although I have had some other reasons to restart, I think I was able to get to 20 days of uptime (the longest over the past 7-10 restarts); time will tell if this is resolved.
If the screenshot from the TrueNAS dashboard can be believed, I was using 0 swap memory (I thought the Telegraf/InfluxDB metric was off, but the TN dashboard showed the same).

Also pictured:
  • free mem.png - lots of free RAM, which seems to indicate I'm not memory-constrained.
  • blocked and zombie processes shoot up around 4am CST, when this issue started.
  • Possible smoking gun? A big spike in scrub_task.png indicates something going on with a boot pool scrub just before the CPU usage and everything else spiked. Both my pools got scrubbed last night, so maybe that triggered something?
So MAYBE a good question for others experiencing the issue: does the freeze line up with the start or completion of a scrub task?
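
For anyone checking that correlation, scrub times can be pulled from the pool history (a sketch; "tank" is a placeholder pool name, boot-pool is SCALE's default boot pool):

# When scrubs were started on each pool
zpool history boot-pool | grep -i scrub
zpool history tank | grep -i scrub

# Completion time of the most recent scrub per pool
zpool status | grep -i scan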
 

Attachments

  • swap.PNG (92.3 KB)
  • free mem.png (506.8 KB)
  • blocked and zombie processes.PNG (51 KB)
  • scrub_task.png (57.7 KB)

grigory

Dabbler
Joined
Dec 22, 2022
Messages
32
Ugh, I spoke too soon; this happened to me again today while watching a video on Jellyfin.
The failure does not coincide with a scrub task.
As with the previous occurrences, all swap memory is used (0 MB free) and oom-killer starts.
I raised this ticket; I encourage others with this issue to log in there and provide context.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Apps stopped responding, and looking at the WebUI:
- the Apps tab was no longer loading
- the WebUI reported the RAM completely used up by "Services"

htop over SSH showed all RAM used up by
/usr/bin/dockerd -H fd://
at ~71% and rising slowly.

There were multiple rows with the same command using up the exact same percentage of RAM;
at the top was one with a TIME+ of ~405h (I think; that wouldn't match my current uptime of ~9 days though).

I killed that process using htop => F9 => SIGKILL

- the RAM usage in htop immediately dropped to near zero
- the Apps tab in the WebUI worked again
- the WebUI responded faster
- RAM usage shown in the WebUI returned to near zero, slowly increasing again for ZFS cache and services
- all apps had been killed and were slowly deploying again.


I would assume it's some app suddenly causing this... I just don't know a way to figure out WHICH app is using up all the RAM...

This diagnosis seems very interesting. Can anyone else confirm this dockerd problem with htop?

It seems to point to a memory leak in dockerd... perhaps exacerbated when an app does something.
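
For anyone wanting to confirm without leaving htop open, a minimal sketch that snapshots dockerd's resident memory (assumes a single dockerd process):

# Resident memory (KiB) and elapsed run time of the docker daemon
ps -o pid,rss,etime,cmd -C dockerd

# The same figure straight from the kernel
grep VmRSS /proc/$(pidof dockerd)/status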

Interestingly, Cobia replaces dockerd with containerd. If anyone can confirm the issue and then test with Cobia, it would be useful.

Cobia, or SCALE 23.10-RC.1, comes out tomorrow. Don't try it unless you have the skills to roll back if needed.
 