I have two systems that seem to be running out of RAM.
Both devices are backup destinations:
#1) "new one" - >100TB of space w/ deduplication, 96GB RAM, 2x 10GbE NICs, 2x Xeon 4214, Supermicro chassis.
#2) "old one" -> less space, maybe dedupe?, 128GB RAM, 2x 10GbE NICs, 2x much, much older Xeons.
Services running on hosts: Middlewared. SMB. SNMP. SSH. NFS. SMART. No jails. No VMs. When there isn't a backup running, the machines do literally nothing.
Both run TrueNAS CORE 13.0-U5.3.
I've had a look around for solutions; the internet of ~12 months ago said this was middlewared, and that you could restart it and away you go.
But that was for CORE 12.x / Python 3.8. Now, by the time the issue happens, SSH / WebUI / console are all already unresponsive.
After they stop answering, the screen plugged into the machine shows a wall of "collectd low watermark reached discarding 100% of metrics" messages. Pressing Enter at the console doesn't get me a menu or login prompt, just newlines; Ctrl-C prints a ^C, and basically nothing useful happens.
Basically, once the machines are in this state, the only button that does anything is a long press on the power button.
During the first boot after a power-cycle, "new one" will reboot after some "spa_misc.c" lines on the console. Every time.
The second boot after a power-cycle will seem to do nothing after those lines for a while, but will eventually continue.
All up, from power-on to login, both take almost 30 minutes to boot. For a lot of that time the fans aren't ramped up and the HDD lights aren't blinking, so I'm really not sure what it's doing, but it looks suspiciously like "nothing". (Update: "old one" has two disks that blink constantly; it might be loading the dedupe tables?)
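On the dedupe theory, I did some back-of-envelope math on what the dedup table on "new one" could cost in RAM. The ~320 bytes per deduped block is the commonly quoted rule of thumb, and the 64K average block size is purely my guess (the real numbers would come from `zpool status -D`):

```python
# Back-of-envelope DDT RAM estimate for "new one". ~320 bytes per
# deduped block is the commonly quoted rule of thumb; the 64K average
# block size is an assumption (depends on recordsize and workload).
BYTES_PER_DDT_ENTRY = 320
pool_data_tb = 100              # "new one" is >100TB
avg_block_kb = 64               # assumption

blocks = pool_data_tb * 1024**3 / avg_block_kb        # TB -> KB, then per-block
ddt_ram_gb = blocks * BYTES_PER_DDT_ENTRY / 1024**3   # bytes -> GB
print(f"~{ddt_ram_gb:.0f} GB of RAM for a fully-loaded dedup table")
# -> ~500 GB, vs the 96GB installed
```

If that estimate is even in the right ballpark, dedupe at this pool size on 96GB of RAM would explain a lot by itself.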
When "new one" does finally boot, "Services" (according to the WebUI) uses a lot of RAM; on the last check it was growing at 0.3GB per tick until OOM and crash.
Just now I was watching it (not being able to run backups is getting distressing): the machine was almost OOM after 30 minutes of uptime. "Services" on the dashboard started at 30GB, and I told the UI to reboot at about 50GB.
After rebooting, the memory-leak situation seemed to have abated?
40 minutes of runtime, and "only" 23.3GB used by services.
I'd just typed a bit of "I don't know what's happening but it seems to be good now" spiel, but went back to check on it, and it's leaking again.
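Since the console is dead by the time it wedges, my plan is to leave something running that logs the top memory consumers to disk, so the log survives the hang and shows which process was growing. A rough sketch (my own idea, nothing built into TrueNAS; the log path, interval, and sample count below are placeholders):

```python
#!/usr/bin/env python3
# Crude memory watchdog sketch: every interval_s seconds, append a
# timestamp and the ten biggest processes by RSS to a log file.
import subprocess
import time
from datetime import datetime

def mem_watch(log_path, interval_s, samples):
    for _ in range(samples):
        lines = subprocess.run(
            ["ps", "-axo", "rss,pid,command"],
            capture_output=True, text=True, check=True,
        ).stdout.splitlines()[1:]                  # [1:] drops the header row
        top10 = sorted(lines, key=lambda l: int(l.split()[0]), reverse=True)[:10]
        with open(log_path, "a") as f:
            f.write(datetime.now().isoformat() + "\n")
            f.write("\n".join(top10) + "\n\n")
        time.sleep(interval_s)

# Demo run: 3 samples, 1s apart. In practice I'd use interval_s=60 and a
# huge sample count, launched under tmux before the machine wedges.
mem_watch("/tmp/mem-watch.log", 1, 3)
```

After the next hang, `tail` of that file should at least say whether it's middlewared again or something else entirely.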
What I find weird is that this pair of dual-CPU Xeons here at work, both with 96GB+ of RAM, go OOM randomly. It used to be every few months, but it seems to have escalated recently to "as little as an hour". At home I have a (cheapest Ryzen at the time of build) box with 32GB of RAM and ~16TB of storage (no dedupe) running a bunch of plugins and a VM, and it never has RAM issues.
ANYWAY. Does anyone have anything they can suggest?