SCALE becomes unreachable every few weeks

grigory

Dabbler
Joined
Dec 22, 2022
Messages
32
FYI, in the ticket with iXsystems it was suggested that one or more pods are overloading the system and that I should try to narrow down the problem process/pod.
Since replicating this seems to take on average 2 weeks, I am not able to justify disabling an app and then waiting two weeks without that application.
For now I have set the following script to run every 5 minutes; hopefully the next time this happens I will capture something useful.

Code:
#!/bin/bash

# Define the directory where the stats files will be stored (created if missing)
output_directory="/root/dockerstats_output/"
mkdir -p "$output_directory"

# Define the file name format with date
output_file="${output_directory}$(date +"%Y%m%d").txt"

# Get current timestamp
timestamp=$(date +"%Y-%m-%d %H:%M:%S")

# Get top processes with headers
top_processes=$(top -b -n 1 | head -n 22)

# Run docker stats command and filter top 10 containers by memory usage
docker stats --format "table {{.ID}}\t{{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}" --no-stream | {
    echo "Timestamp: $timestamp"
    echo -e "\nDocker Stats:"
    echo -e "CONTAINER ID\tNAME\tCPU %\tMEM USAGE (LIMIT)"
    tail -n +2 | sort -k4 -h -r | head -n 10
    echo -e "\nTop Processes:"
    echo "$top_processes"
} >> "$output_file"
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
I am not able to justify disabling an app and then waiting two weeks without that application.
Imho the justification is the resolution of the issue, but it's not my system nor my time.
 

grigory

Dabbler
Joined
Dec 22, 2022
Messages
32
Imho the justification is the resolution of the issue, but it's not my system nor my time.
Yes, fair. I mean I am not willing to disable each app for two weeks only to figure out it isn't that app and then move on to the next one. If I could easily pick the bad one and disable it, I think I could justify that.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Yes, fair. I mean I am not willing to disable each app for two weeks only to figure out it isn't that app and then move on to the next one. If I could easily pick the bad one and disable it, I think I could justify that.
Yup and I completely understand that... in the end someone has to put in some of his own time for the greater (including his) good.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Looking for similar problems, I found this report and its later resolution:

PROBLEM: I've been troubleshooting my NAS crashing. I'm running the latest edition of TrueNAS Scale 22.12.3.3, but I've had this on previous versions too. This issue has popped up on and off since updating from Angelfish to Bluefin. Occasionally I will notice my NAS has crashed, and I have watched it crash. I have pinpointed it to when the Plex app is running (Official or Truecharts). Right before it crashes, the dockerd process (seen in top; the GUI dashboard shows "services" consuming RAM) consumes all system memory, causing other processes to be killed by oom-killer (seen in log/messages) until ultimately it needs a reboot. I have even tried backing up my config, reinstalling the OS fresh and restoring the config, to no avail.

SOLUTION: unmounted pool in apps setting window, nuked the ix-applications folder, mounted pool and reconfigured apps from scratch the same. This resolved my issue. Suspect leftover garbage code from updating from angelfish to bluefin as that is when this issue started.

So a question: does the problem occur only on systems that updated from angelfish to bluefin??
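
For anyone wanting to check whether they are seeing the same symptoms described in that report (dockerd eating memory, oom-killer entries in the messages log), here is a rough sketch. The function names are made up for illustration, and /var/log/messages as the default path is an assumption taken from the "log/messages" mention above; adjust it for your system.

```shell
#!/bin/bash
# Sketch only: function names and the default log path are assumptions.

# Print log lines that mention the OOM killer.
# $1: log file to scan (defaults to /var/log/messages).
oom_events() {
    local log_file="${1:-/var/log/messages}"
    grep -iE 'oom-killer|out of memory' "$log_file"
}

# Print the total resident memory (KiB) of all processes named dockerd,
# or 0 if no such process is running.
dockerd_rss_kib() {
    ps -C dockerd -o rss= | awk '{ sum += $1 } END { print sum + 0 }'
}
```

Running `oom_events` after a crash and logging `dockerd_rss_kib` periodically beforehand should show whether dockerd memory was climbing before the oom-killer started firing.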
 

sdf786

Cadet
Joined
Sep 29, 2022
Messages
5
I am having the same issue on TrueNAS Scale 22.12.3.3 with a fresh install, no upgrade. I am not using any local apps or docker on the Truenas host itself. I only have a Rocky 9 VM going with podman/docker on it with a mounted NFS volume from Truenas. Single network port is in bridged mode.

After 2 days the network drops to all VMs and the Truenas Scale host.

Running with:
32GB ECC Ram
3900X CPU
ASUS AM4 TUF Gaming X570-Plus
rx 580 GPU
4x 8 TB WD Red plus
2 x 1 TB Samsung 870 evos
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
I am having the same issue on TrueNAS Scale 22.12.3.3 with a fresh install, no upgrade. I am not using any local apps or docker on the Truenas host itself. I only have a Rocky 9 VM going with podman/docker on it with a mounted NFS volume from Truenas. Single network port is in bridged mode.

After 2 days the network drops to all VMs and the Truenas Scale host.
Sounds like a different problem.... can you see if the same symptoms of memory shortages are causing the issue?

How much RAM was allocated to the VM?
 

grigory

Dabbler
Joined
Dec 22, 2022
Messages
32
Looking for similar problems, I found this report and its later resolution:

PROBLEM: I've been troubleshooting my NAS crashing. I'm running the latest edition of TrueNAS Scale 22.12.3.3, but I've had this on previous versions too. This issue has popped up on and off since updating from Angelfish to Bluefin. Occasionally I will notice my NAS has crashed, and I have watched it crash. I have pinpointed it to when the Plex app is running (Official or Truecharts). Right before it crashes, the dockerd process (seen in top; the GUI dashboard shows "services" consuming RAM) consumes all system memory, causing other processes to be killed by oom-killer (seen in log/messages) until ultimately it needs a reboot. I have even tried backing up my config, reinstalling the OS fresh and restoring the config, to no avail.

SOLUTION: unmounted pool in apps setting window, nuked the ix-applications folder, mounted pool and reconfigured apps from scratch the same. This resolved my issue. Suspect leftover garbage code from updating from angelfish to bluefin as that is when this issue started.

So a question: does the problem occur only on systems that updated from angelfish to bluefin??
It would be interesting to compare the nuked ix-apps folder with the new fully configured one. What exactly would have changed there?
I do think this problem first occurred for me after this upgrade, but I also added probably 5+ apps and made several other changes (changed datasets to unencrypted and other config changes) after upgrading as well.
 

grigory

Dabbler
Joined
Dec 22, 2022
Messages
32
I tried to replicate this using the closest thing I could to the crash that happened yesterday (my kids watching a movie), although this has happened at other times too.
I was monitoring htop, and Jellyfin was using a lot of resources as I expected, although all cores pinned around 95~100% was not expected.
I let the movie play for 30 minutes with no crash, but very high CPU. I noticed ffmpeg was taking a lot of CPU; I think it was transcoding when it should not have been, since my understanding is that Jellyfin will only transcode when necessary. I have now explicitly disabled this in Jellyfin.
After I stopped the movie in Jellyfin and waited about 5~10 minutes, I saw all resources drop to reasonable levels (30~60% CPU). After this happened, the server crashed.
I am hesitant to say this is the issue every time, but at least in my case this appears to be perhaps the most common cause. As I said, it has also happened in the middle of the night when no one is actively streaming media.

Attached is some of the script output during this time.
 

Attachments

  • 2023-09-20 11_44_29-TrueNAS - 192.168.50.200 — Mozilla Firefox.png
    804.7 KB · Views: 153

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
It would be interesting to compare the nuked ix-apps folder with the new fully configured one. What exactly would have changed there?
I do think this problem first occurred for me after this upgrade, but I also added probably 5+ apps and made several other changes (changed datasets to unencrypted and other config changes) after upgrading as well.
There was a major change to overlayfs data management. I have no idea whether this could be a cause.
Just trying to see if this is a pattern.
 

grigory

Dabbler
Joined
Dec 22, 2022
Messages
32
I just had this happen again while I was uploading a large number of assets to Immich.
However, my script output is less helpful.
 

Attachments

  • 2023-09-21 16_58_54-TrueNAS - 192.168.50.200 — Mozilla Firefox.png
    771 KB · Views: 154

grigory

Dabbler
Joined
Dec 22, 2022
Messages
32
I have had this happen two more times while trying to finish uploading assets to Immich. One other time, while the system was coming back up, it crashed without me doing anything at all, just as all the apps were deploying.
I also noticed this Discord thread about Prometheus being a resource hog; it may be useful for some: https://discord.com/channels/830763548678291466/1143544215809298522

I am very tempted to grasp the Cobia RC straw, and install it to see if this gets resolved.
 

grigory

Dabbler
Joined
Dec 22, 2022
Messages
32
I disabled Prometheus metrics on all apps and shut down the prometheus-operator and prometheus apps (all Truecharts).
So far, after doing that, I was able to perform the same action in immich that had crashed things previously.
So can this be chalked up to my system being overloaded? I will wait and see.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
I disabled Prometheus metrics on all apps and shut down the prometheus-operator and prometheus apps (all Truecharts).
So far, after doing that, I was able to perform the same action in immich that had crashed things previously.
So can this be chalked up to my system being overloaded? I will wait and see.

Useful to know. I don't know whether it's overloaded or whether the app or dockerd has a memory leak.
 

sdf786

Cadet
Joined
Sep 29, 2022
Messages
5
Sounds like a different problem.... can you see if the same symptoms of memory shortages are causing the issue?

How much RAM was allocated to the VM?
I should have mentioned that I do have 64 GB of ECC RAM on the system, with 16 GB allocated to the single VM.
 

sdf786

Cadet
Joined
Sep 29, 2022
Messages
5
I will add, too, that I moved to 23.10-RC.1 and still see the same issue after a few days, with the network hanging.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
I will add, too, that I moved to 23.10-RC.1 and still see the same issue after a few days, with the network hanging.
If you have any diagnostics on the memory consumption (e.g. htop), could you open a NAS ticket (report a bug)?

Simple configurations that exhibit the issue make diagnosis much easier.

Any Apps at all?
 

sdf786

Cadet
Joined
Sep 29, 2022
Messages
5
No, no apps. I did a fresh install of 22.12.3.3 from the ISO, at which point I moved away from using apps directly, so I never configured that part at all.

I have a separate VM for docker on RHEL, for which I upped the memory too.

Outside of that I have one NFS share (mounted to the VM), one SMB share, and bridged networking (coming from an Ethernet PCIe card) so the VM and SCALE can talk. Z1 array. Nothing crazy going on in terms of config.

I am trying to find a way to get a crash dump/support bundle via the CLI the next time it nosedives, since I will probably only have console access. If you have the command, I will see if I can get it for you guys.
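
This is not a support bundle or crash dump, but as a stopgap a small snapshot of memory state can be written to disk on a schedule so there is something to look at after the next hang. A minimal sketch follows; the function name and the log path in the comment are hypothetical, and it only reads the standard Linux /proc/meminfo file.

```shell
#!/bin/bash
# Sketch only: function name and output path are hypothetical.

# Print one timestamped line with MemTotal, MemAvailable and SwapFree in KiB.
# $1: meminfo-style file to read (defaults to /proc/meminfo).
mem_snapshot() {
    local meminfo="${1:-/proc/meminfo}"
    awk -v ts="$(date +"%Y-%m-%d %H:%M:%S")" '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        $1 == "SwapFree:"     { swap  = $2 }
        END {
            printf "%s total_kib=%d available_kib=%d swapfree_kib=%d\n",
                   ts, total, avail, swap
        }
    ' "$meminfo"
}

# Example: append a snapshot on each scheduled run.
# mem_snapshot >> /root/memstats.log
```

If available memory trends toward zero before each hang, that would point at the same memory-shortage symptoms discussed earlier in the thread; if it stays flat, the network drop is probably a different problem.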
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
No, no apps. I did a fresh install of 22.12.3.3 from the ISO, at which point I moved away from using apps directly, so I never configured that part at all.

I have a separate VM for docker on RHEL, for which I upped the memory too.

Outside of that I have one NFS share (mounted to the VM), one SMB share, and bridged networking (coming from an Ethernet PCIe card) so the VM and SCALE can talk. Z1 array. Nothing crazy going on in terms of config.

I am trying to find a way to get a crash dump/support bundle via the CLI the next time it nosedives, since I will probably only have console access. If you have the command, I will see if I can get it for you guys.
If you report the issue, the developers may tell you what they need. Thanks.
 