I have two systems that seem to be running out of RAM.
Both devices are backup destinations:
#1) "new one" - >100TB of space w/ deduplication, 96GB RAM, 2x 10GbE NICs, 2x Xeon 4214, Supermicro chassis.
#2) "old one" -> less space, maybe dedupe?, 128GB RAM, 2x 10GbE NICs, 2x much, much older Xeons.
Services running on hosts: Middlewared. SMB. SNMP. SSH. NFS. SMART. No jails. No VMs. When there isn't a backup running, the machines do literally nothing.
Both run TrueNAS CORE 13.0-U5.3.
I've had a look around for solutions; the internet of ~12 months ago said this was middlewared, and that you could restart it and away you go.
But that was for CORE 12.x / Python 3.8. Now, by the time the issue happens, SSH / WebUI / console are all already unresponsive.
After they stop answering, the screen plugged into the machine shows a wall of "collectd low watermark reached discarding 100% of metrics" messages. Pressing Enter at the console doesn't get me a menu or login prompt, just newlines; Ctrl-C prints a ^C, and basically nothing useful happens.
Basically, once the machines are in this state, the only button that does anything is a long press on the power button.
During the first boot after a power-cycle, "new one" will reboot after some "spa_misc.c" lines on the console. Every time.
The second boot after a power-cycle will seem to do nothing after those lines for a while, but will eventually continue.
All up, from power-on to login, both take almost 30 minutes to boot. For a lot of that time the fans aren't ramped up and the HDD lights aren't blinking, so I'm really not sure what it's doing, but it looks suspiciously like "nothing". (Update: "old one" has two disks that blink constantly; it might be loading the dedupe tables?)
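On the dedupe theory, I did some back-of-envelope math on what the dedup table on "new one" could cost in RAM. The ~320 bytes per deduped block is the commonly quoted rule of thumb, and the 64K average block size is purely my guess (the real numbers would come from `zpool status -D`):

```python
# Back-of-envelope DDT RAM estimate for "new one". ~320 bytes per
# deduped block is the commonly quoted rule of thumb; the 64K average
# block size is an assumption (depends on recordsize and workload).
BYTES_PER_DDT_ENTRY = 320
pool_data_tb = 100              # "new one" is >100TB
avg_block_kb = 64               # assumption

blocks = pool_data_tb * 1024**3 / avg_block_kb        # TB -> KB, then per-block
ddt_ram_gb = blocks * BYTES_PER_DDT_ENTRY / 1024**3   # bytes -> GB
print(f"~{ddt_ram_gb:.0f} GB of RAM for a fully-loaded dedup table")
# -> ~500 GB, vs the 96GB installed
```

If that estimate is even in the right ballpark, dedupe at this pool size on 96GB of RAM would explain a lot by itself.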
When "new one" does finally boot, "Services" (according to the WebUI) uses a lot of RAM; on the last check it was growing at 0.3GB per tick until OOM and crash.
Just now I was watching it (not being able to run backups is getting distressing): the machine was almost OOM after 30 minutes of uptime. "Services" on the dashboard started at 30GB, and I told the UI to reboot at about 50GB.
After rebooting, the memory-leak situation seemed to have abated?
40 minutes of runtime, and "only" 23.3GB used by services.
I'd just typed a bit of "I don't know what's happening but it seems to be good now" spiel, but went back to check on it, and it's leaking again.
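Since the console is dead by the time it wedges, my plan is to leave something running that logs the top memory consumers to disk, so the log survives the hang and shows which process was growing. A rough sketch (my own idea, nothing built into TrueNAS; the log path, interval, and sample count below are placeholders):

```python
#!/usr/bin/env python3
# Crude memory watchdog sketch: every interval_s seconds, append a
# timestamp and the ten biggest processes by RSS to a log file.
import subprocess
import time
from datetime import datetime

def mem_watch(log_path, interval_s, samples):
    for _ in range(samples):
        lines = subprocess.run(
            ["ps", "-axo", "rss,pid,command"],
            capture_output=True, text=True, check=True,
        ).stdout.splitlines()[1:]                  # [1:] drops the header row
        top10 = sorted(lines, key=lambda l: int(l.split()[0]), reverse=True)[:10]
        with open(log_path, "a") as f:
            f.write(datetime.now().isoformat() + "\n")
            f.write("\n".join(top10) + "\n\n")
        time.sleep(interval_s)

# Demo run: 3 samples, 1s apart. In practice I'd use interval_s=60 and a
# huge sample count, launched under tmux before the machine wedges.
mem_watch("/tmp/mem-watch.log", 1, 3)
```

After the next hang, `tail` of that file should at least say whether it's middlewared again or something else entirely.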
What I find weird is that this pair of dual-CPU Xeons here at work, both with 96GB+ of RAM, go OOM randomly. It used to be every few months, but it seems to have escalated recently to "as little as an hour". At home I have a (cheapest Ryzen at the time of build) box with 32GB of RAM and ~16TB of storage (no dedupe) running a bunch of plugins and a VM, and it never has RAM issues.
ANYWAY. Does anyone have anything they can suggest?