SCALE becomes unreachable every few weeks

grigory · Sep 19, 2023

FYI, in the ticket with iXsystems it was suggested that one or more pods are overloading the system and that I should try to narrow down the problem process/pod.
Since replicating this takes about 2 weeks on average, I can't justify disabling an app and then waiting two weeks without that application.
For now I have set the following script to run every 5 minutes, and hopefully the next time this happens I will capture something useful.

Code:

#!/bin/bash

# Define the directory where the stats files will be stored (created if missing)
output_directory="/root/dockerstats_output/"
mkdir -p "$output_directory"

# Define the file name format with date
output_file="${output_directory}$(date +"%Y%m%d").txt"

# Get current timestamp
timestamp=$(date +"%Y-%m-%d %H:%M:%S")

# Get top processes with headers
top_processes=$(top -b -n 1 | head -n 22)

# Run docker stats command and filter top 10 containers by memory usage
docker stats --format "table {{.ID}}\t{{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}" --no-stream | {
    echo "Timestamp: $timestamp"
    echo -e "\nDocker Stats:"
    echo -e "CONTAINER ID\tNAME\tCPU %\tMEM USAGE (LIMIT)"
    # Drop docker's own header, then sort on the memory usage column only
    tail -n +2 | sort -k4,4 -h -r | head -n 10
    echo -e "\nTop Processes:"
    echo "$top_processes"
} >> "$output_file"
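After a crash, the interesting part of the day's file is usually the last block written before the box went down. A minimal sketch for pulling that out, assuming each block starts with the "Timestamp:" line the script above writes (the function name and the example file name are only illustrations):

```shell
#!/bin/bash

# Print the last snapshot block from a stats file.
# A block is everything from a "Timestamp: " line up to the next one.
last_snapshot() {
    awk '/^Timestamp: /{ block = "" }
         { block = block $0 "\n" }
         END { printf "%s", block }' "$1"
}

# Example (hypothetical file name for Sep 20):
# last_snapshot /root/dockerstats_output/20230920.txt
```

If the last block is cut off mid-write, that in itself hints at how abruptly the system stopped.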

Davvo · Sep 19, 2023

grigory said:
I am not able to justify disabling an app and then waiting two weeks without that application.

Imho the justification is the resolution of the issue, but it's not my system nor my time.

grigory · Sep 19, 2023

Davvo said:
Imho the justification is the resolution of the issue, but it's not my system nor my time.

Yes, fair. I mean I am not willing to disable each app for two weeks only to find out it isn't that app and that I need to move on to the next one. If I could easily pick out the bad one and disable it, I think I could justify that.

Davvo · Sep 19, 2023

grigory said:
Yes, fair. I mean I am not willing to disable each app for two weeks only to find out it isn't that app and that I need to move on to the next one. If I could easily pick out the bad one and disable it, I think I could justify that.

Yup and I completely understand that... in the end someone has to put in some of his own time for the greater (including his) good.

morganL · Sep 19, 2023

Looking for similar problems, I found this report and its later resolution:

PROBLEM: I've been troubleshooting my NAS crashing. I'm running the latest edition of TrueNAS Scale 22.12.3.3, but I've had this on previous versions too. This issue has popped up on and off since updating from Angelfish to Bluefin. Occasionally I will notice my NAS has crashed, and I have watched it crash. I have pinpointed it to when the Plex app is running (Official or Truecharts). Right before it crashes, the dockerd process (seen in top; the GUI dashboard shows "services" consuming RAM) consumes all system memory, causing other processes to be killed by oom-killer (seen in log/messages) until ultimately it needs a reboot. I have even tried backing up my config, reinstalling the OS fresh, and restoring the config, to no avail.

SOLUTION: unmounted pool in apps setting window, nuked the ix-applications folder, mounted pool and reconfigured apps from scratch the same. This resolved my issue. Suspect leftover garbage code from updating from angelfish to bluefin as that is when this issue started.

So a question: does the problem occur only on systems that updated from angelfish to bluefin??
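To check whether a given crash matches the oom-killer pattern described in that report, the kernel log can be filtered for the usual messages. A minimal sketch, assuming the standard Linux kernel wording ("invoked oom-killer", "Out of memory: Killed process"); the function name is just an illustration, and which log source still holds the messages after a reboot depends on the system:

```shell
#!/bin/bash

# Keep only kernel log lines that mention the OOM killer.
# Reads log text on stdin so it works with any log source.
oom_events() {
    grep -Ei 'oom-killer|out of memory'
}

# Possible ways to feed it (pick whichever log survived the reboot):
# journalctl -k --no-pager | oom_events
# dmesg -T | oom_events
# oom_events < /var/log/messages
```

No output means no OOM kills were logged in that source, which would point away from memory exhaustion for that particular crash.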

sdf786 · Sep 19, 2023

I am having the same issue on TrueNAS Scale 22.12.3.3 with a fresh install, no upgrade. I am not using any local apps or docker on the TrueNAS host itself. I only have a Rocky 9 VM running podman/docker, with an NFS volume mounted from TrueNAS. The single network port is in bridged mode.

After 2 days the network drops to all VMs and to the TrueNAS Scale host.

Running with:
32 GB ECC RAM
3900X CPU
ASUS AM4 TUF Gaming X570-Plus
RX 580 GPU
4 x 8 TB WD Red Plus
2 x 1 TB Samsung 870 EVOs

morganL · Sep 20, 2023

sdf786 said:
I am having the same issue on TrueNAS Scale 22.12.3.3 with a fresh install, no upgrade. I am not using any local apps or docker on the TrueNAS host itself. I only have a Rocky 9 VM running podman/docker, with an NFS volume mounted from TrueNAS. The single network port is in bridged mode.

After 2 days the network drops to all VMs and to the TrueNAS Scale host.

Sounds like a different problem.... can you see if the same symptoms of memory shortages are causing the issue?

How much RAM was allocated to the VM?

grigory · Sep 20, 2023

morganL said:
Looking for similar problem and found this problem and later resolution:

PROBLEM: I've been troubleshooting my NAS crashing. I'm running the latest edition of TrueNAS Scale 22.12.3.3, but I've had this on previous versions too. This issue has popped up on and off since updating from Angelfish to Bluefin. Occasionally I will notice my NAS has crashed, and I have watched it crash. I have pinpointed it to when the Plex app is running (Official or Truecharts). Right before it crashes, the dockerd process (seen in top; the GUI dashboard shows "services" consuming RAM) consumes all system memory, causing other processes to be killed by oom-killer (seen in log/messages) until ultimately it needs a reboot. I have even tried backing up my config, reinstalling the OS fresh, and restoring the config, to no avail.

SOLUTION: unmounted pool in apps setting window, nuked the ix-applications folder, mounted pool and reconfigured apps from scratch the same. This resolved my issue. Suspect leftover garbage code from updating from angelfish to bluefin as that is when this issue started.

So a question: does the problem occur only on systems that updated from angelfish to bluefin??

It would be interesting to compare the nuked ix-apps folder with the new fully configured one. What exactly would have changed there?
I do think this problem started for me after this upgrade, but I also added probably 5+ apps and made several other changes (changed datasets to unencrypted and other config changes) after upgrading as well.

grigory · Sep 20, 2023

I tried to replicate this by getting as close as I could to the conditions of yesterday's crash (my kids watching a movie), although this has happened at other times too.
I was monitoring htop, and jellyfin was using a lot of resources as I expected, although all cores pinned around 95~100% was not expected.
I let the movie play for 30m: no crash, but very high CPU. I noticed ffmpeg was taking a lot of CPU. I think it is transcoding when it should not be; my understanding is that jellyfin will only transcode when necessary. I have explicitly disabled this in jellyfin now.
When I stopped the movie in jellyfin and waited about 5~10m, I saw all resources drop to reasonable levels (30~60% CPU). After this happened, the server crashed.
I am hesitant to say this is the issue every time, but at least in my case this appears to be maybe the most common cause. As I said, it has also happened in the middle of the night when no one is actively streaming media.

Attached is some of the script output during this time.
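For confirming that ffmpeg (rather than jellyfin itself) is what pins the cores, per-command CPU totals are easier to read than a raw htop screen. A minimal sketch, assuming the procps `ps` available on Linux; the function name is only an illustration. Note that the %CPU reported by `ps` is averaged over each process's lifetime, so it is a rough indicator, not an instantaneous reading:

```shell
#!/bin/bash

# Sum %CPU per command name and print the top N (default 5).
# Expects "pcpu command" lines on stdin, as produced by: ps -eo pcpu=,comm=
cpu_by_command() {
    awk '{ pct = $1; $1 = ""; sub(/^ +/, ""); total[$0] += pct }
         END { for (name in total) printf "%.1f %s\n", total[name], name }' |
        sort -k1,1 -n -r | head -n "${1:-5}"
}

# Example:
# ps -eo pcpu=,comm= | cpu_by_command 5
```

Several ffmpeg processes near the top while a single stream is playing would be consistent with transcoding being active.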

morganL · Sep 20, 2023

grigory said:
It would be interesting to compare the nuked ix-apps folder with the new fully configured one. What exactly would have changed there?
I do think this problem started for me after this upgrade, but I also added probably 5+ apps and made several other changes (changed datasets to unencrypted and other config changes) after upgrading as well.

There was a major change to overlayfs data management. I have no idea whether this could be a cause.
Just trying to see if this is a pattern.

grigory · Sep 21, 2023

I just had this happen again while I was uploading a large number of assets to immich.
However, my script output is less helpful.

grigory · Sep 21, 2023

I have had this happen two more times while trying to finish uploading assets to immich. One other time, while the system was coming back up, it just crashed without me trying to do anything at all, just as all the apps were deploying.
I also noticed this Discord thread about prometheus being a resource hog; it may be useful for some. https://discord.com/channels/830763548678291466/1143544215809298522

I am very tempted to grasp the Cobia RC straw, and install it to see if this gets resolved.

grigory · Sep 22, 2023

I disabled prometheus metrics on all apps and shut down the prometheus-operator and prometheus apps (all truecharts).
So far, after doing that, I was able to perform the same action in immich that had crashed things previously.
So can this be chalked up to my system being overloaded? I will wait and see.

morganL · Sep 23, 2023

grigory said:
I disabled prometheus metrics on all apps and shut down the prometheus-operator and prometheus apps (all truecharts).
So far, after doing that, I was able to perform the same action in immich that had crashed things previously.
So can this be chalked up to my system being overloaded? I will wait and see.

Useful to know... I don't know if it's overloaded or whether the app or dockerd has a memory leak.
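One way to tell a leak from plain overload is to log the resident memory of the suspect process over time: a leak tends to show steady growth even when the system is idle, while overload follows the workload. A minimal sketch, assuming procps `ps`; the function names and the log path are only illustrations:

```shell
#!/bin/bash

# Sum the resident set size (KiB) of all processes with the given name.
rss_kib() {
    ps -C "$1" -o rss= | awk '{ sum += $1 } END { print sum + 0 }'
}

# Summarise a log of "<date> <time> <rss_kib>" lines read from stdin.
rss_trend() {
    awk 'NR == 1 { first = $3 }
         { last = $3 }
         END { if (NR) printf "first=%d last=%d delta=%d KiB\n", first, last, last - first }'
}

# Example, run every few minutes (hypothetical log path):
# printf '%s %s\n' "$(date +'%Y-%m-%d %H:%M:%S')" "$(rss_kib dockerd)" >> /root/dockerd_rss.log
# Later:
# rss_trend < /root/dockerd_rss.log
```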

sdf786 · Sep 24, 2023

morganL said:
Sounds like a different problem.... can you see if the same symptoms of memory shortages are causing the issue?

How much RAM was allocated to the VM?

Should have mentioned: I do have 64 GB of ECC RAM on the system, with 16 GB allocated to the single VM.

sdf786 · Sep 24, 2023

Will add too: I moved to 23.10-RC.1 and still have the same issue after a few days, with the network hanging.

morganL · Sep 24, 2023

sdf786 said:
Will add too: I moved to 23.10-RC.1 and still have the same issue after a few days, with the network hanging.

If you have any diagnostics on the memory consumption, e.g. htop, could you open a NAS ticket (report a bug)?

Simple configurations that exhibit the issue make diagnosis much easier.

Any Apps at all?

sdf786 · Sep 24, 2023

No, no apps. I did a fresh install of 22.12.3.3 from the ISO, at which point I moved away from using the apps directly, so I never configured that part at all.

I have a separate VM for docker on RHEL, for which I upped the memory too.

Outside of that I have one NFS share (mounted to the VM), one SMB share, and bridged networking (coming from an Ethernet PCIe card) so the VM and SCALE can talk. Z1 array. Nothing crazy going on in terms of config.

I am trying to find a way to get a crash dump/support bundle via the CLI the next time it nosedives, since I will probably only have console access. If you have the command, I will see if I can get it for you guys.
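Until the proper command is known, a stopgap from the console is to dump the output of a few generic Linux tools to a file that survives the reboot. This is only a sketch using standard utilities (`uptime`, `free`, `ip`, `dmesg`); it is not the TrueNAS support bundle, and the function name and output path are illustrations:

```shell
#!/bin/bash

# Append a snapshot of load, memory, network link counters and recent
# kernel messages to the given file. A missing tool is recorded as an
# error line instead of aborting the snapshot.
snapshot_state() {
    local out="$1"
    {
        echo "== $(date +"%Y-%m-%d %H:%M:%S") =="
        for cmd in "uptime" "free -m" "ip -s link" "dmesg"; do
            echo "-- $cmd --"
            $cmd 2>&1 | tail -n 40 || true
        done
    } >> "$out"
}

# Example (hypothetical path):
# snapshot_state /root/hang_snapshot.txt
```

The `ip -s link` counters are the part most relevant to a network hang, since they show whether the bridge and physical port are still passing or dropping packets.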

morganL · Sep 24, 2023

sdf786 said:
No, no apps. I did a fresh install of 22.12.3.3 from the ISO, at which point I moved away from using the apps directly, so I never configured that part at all.

I have a separate VM for docker on RHEL, for which I upped the memory too.

Outside of that I have one NFS share (mounted to the VM), one SMB share, and bridged networking (coming from an Ethernet PCIe card) so the VM and SCALE can talk. Z1 array. Nothing crazy going on in terms of config.

I am trying to find a way to get a crash dump/support bundle via the CLI the next time it nosedives, since I will probably only have console access. If you have the command, I will see if I can get it for you guys.

If you report the issue, the developers may tell you what they need. Thanks.
