Some kind of memleak happening.

lfitt

Cadet
Joined
Oct 3, 2023
Messages
1
I have two systems that seem to be running out of RAM.

Both devices are backup destinations:
#1) "new one" - >100TB of space w/ deduplication, 96GB RAM, 2x 10GbE NICs, 2x Xeon 4214, Supermicro chassis.
#2) "old one" - less space, maybe dedupe?, 128GB RAM, 2x 10GbE NICs, 2x much, much older Xeons.

Services running on hosts: Middlewared. SMB. SNMP. SSH. NFS. SMART. No jails. No VMs. When there isn't a backup running, the machines do literally nothing.

Both are TrueNAS Core 13.0-U5.3

I've had a look around for solutions, and the internet 12 months ago was saying this was middlewared and you could restart it and away you go.
But that was for CORE 12.n / Python 3.8. Now, by the time the issue happens, SSH / WebUI / Console are already unresponsive.

After they stop answering, the screen plugged into the machine shows a lot of "collectd low watermark reached discarding 100% of metrics" messages. Pressing Enter in the console does not get me a menu or login prompt, I just get newlines on the screen. Ctrl-C prints a ^C and nothing useful happens.

Once the machines are in this state, the only button that does anything is a long press on the power button.

During the first boot after a power cycle, "new one" will reboot after some "spa_misc.c" lines on the console. Every time.
The second boot after a power cycle will seem to do nothing after those lines for a while, but will eventually continue.

All up, from power-on to login, both take almost 30 minutes to boot. For a lot of that time the fans are not ramped up and the HDD lights aren't blinking, so I'm really not sure what it's doing, but it looks suspiciously like "nothing". (Update: "old one" has two disks that blink constantly, so it might be loading the dedupe tables?)

When "new one" does finally boot up, "services" according to the WebUI uses a lot of RAM. Last check: growing at 0.3GB / tick until OOM and crash.

Just now I've been looking at it (not being able to run backups is getting distressing), and the machine was almost OOM after 30 minutes of uptime. "Services" on the dashboard was using 30GB initially, but I told the UI to reboot at about 50GB used by services.

After rebooting, the memory leak situation seems to have abated?
40 minutes of runtime, and "only" 23.3GB used by services.

I just typed a bit of "I don't know what's happening but it seems to be good now" spiel, but when I went back to check on it, it was memleaking again.

What I find weird is that I have a pair of dual-CPU Xeons here at work, both with 96GB of RAM or more, and they go OOM randomly. It was every few months, but seems to have escalated to "as little as an hour" recently. At home I have a (cheapest Ryzen at the time of build) with 32GB of RAM and ~16TB of storage (no dedupe) running a bunch of plugins and a VM, and it never has RAM issues.


ANYWAY. Does anyone have anything they can suggest?
 

jixam

Dabbler
Joined
May 1, 2015
Messages
47
We have similar problems with TrueNAS Core 13.0-U5.3 and it seems to be zettarepl that is leaking memory. Can you confirm that your backup destinations are handling the replication, i.e. the direction is "pull"?

It has been some time, did you work around the issue?
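
One way to check which process is actually growing is to sample resident memory per process over time. The following is a minimal sketch, not a TrueNAS tool: the function names are made up for illustration, and it assumes `ps -axo pid,rss,command` reports RSS in KiB, as FreeBSD's `ps` does. How zettarepl or middlewared appear in the command column is also an assumption, so match on whatever your system prints.

```python
import subprocess
from typing import List, Tuple


def parse_ps_rss(ps_output: str) -> List[Tuple[int, int, str]]:
    """Parse `ps -axo pid,rss,command` text into (rss_kib, pid, command) rows.

    Rows are returned largest resident set first. Lines that do not start
    with two integers (for example the header) are skipped.
    """
    rows = []
    for line in ps_output.strip().splitlines():
        parts = line.split(None, 2)  # pid, rss, then the rest as the command
        if len(parts) < 3:
            continue
        pid, rss, command = parts
        if not (pid.isdigit() and rss.isdigit()):
            continue
        rows.append((int(rss), int(pid), command))
    rows.sort(reverse=True)
    return rows


def top_memory_processes(n: int = 10) -> List[Tuple[int, int, str]]:
    """Run ps once and return the n processes with the largest RSS."""
    out = subprocess.run(
        ["ps", "-axo", "pid,rss,command"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_ps_rss(out)[:n]
```

Calling `top_memory_processes()` every few minutes from an SSH session started right after boot, and appending the result to a file, should show whether one process keeps climbing before the box stops responding. Note that ZFS ARC and dedup table memory live in the kernel and will not show up in any process's RSS, so a flat process list alongside shrinking free memory points away from a userland leak.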
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Depending on your data and how much dedup is happening - you may be (and very probably are) running into inadequate hardware for running dedupe.

Dedupe on TrueNAS is possible - but requires specific hardware to make it work properly. Dedupe is incredibly hardware intensive:
Loads of memory - which you don't have.
A dedup special vdev - which you don't appear to have.

The bad news is that you can't just add the required hardware and dedupe starts to work. You will have to remove any data from a deduped dataset / pool and then copy it back.
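
To put a rough number on "loads of memory", here is a back-of-the-envelope sketch. It assumes the commonly quoted rule of thumb of about 320 bytes of RAM per dedup table entry, one entry per unique block, and a 128 KiB average block size. None of these figures are measured from the systems in this thread, and the 100 TiB input is only illustrative.

```python
# Assumption: ~320 bytes of RAM per dedup table (DDT) entry is a widely
# repeated rule of thumb, not a value measured on these machines.
BYTES_PER_DDT_ENTRY = 320


def estimate_ddt_ram_gib(
    pool_data_tib: float,
    avg_block_kib: float = 128.0,
    bytes_per_entry: int = BYTES_PER_DDT_ENTRY,
) -> float:
    """Worst-case DDT size in GiB, treating every stored block as unique."""
    unique_blocks = pool_data_tib * 1024**4 / (avg_block_kib * 1024)
    return unique_blocks * bytes_per_entry / 1024**3


if __name__ == "__main__":
    # Illustrative input only: 100 TiB of unique data in 128 KiB blocks.
    print(f"{estimate_ddt_ram_gib(100):.0f} GiB")
```

Under those assumptions, 100 TiB of unique data works out to roughly 250 GiB of dedup table, well beyond 96GB or 128GB of RAM. Smaller blocks make it worse, and data that really does deduplicate well makes it better. The real table size for a pool can be read from `zpool status -D`, which is a far better basis for a decision than this estimate.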
 