TrueNAS 12 Stable: Plex getting killed at 02:00:00 every day - "out of swap space"

StevoFNF · May 9, 2021

This has been the most mystifying technical issue I have dealt with in a very long time. Machine specs below:

AMD Ryzen 2700X
16GB Crucial DDR4 ECC
ASUS TUF B450M-Plus GAMING (supports ECC)
4xSeagate Ironwolf 8TB + 1xWD White Label WD80EMAZ 8TB (slotted for replacement over the Summer)
Boot drive: KingDian 240GB SSD (though same issue happens with any other boot or install when using this config)
Corsair CX650M

This machine has two purposes: SMB NAS storage and Plex server. The only services that are enabled in the Services tab are S.M.A.R.T. and SMB. There are two scheduled tasks: one ZFS scrub that occurs every 14 days on Sunday, and one short S.M.A.R.T. test for all disks that happens nightly at 00:00:00. Plex is up to date and does not have any tasks scheduled in it like periodic library updates or anything. I have tried with the autotune tunables both enabled and disabled, and it makes no difference. I've switched out the PCIe NIC, cables and ports to see if that was it, but no dice there.

This issue started well before I even upgraded to TrueNAS 12. It began on FreeNAS 11.3-U5, and I upgraded to TrueNAS to see if that was part of the issue. The rest of the machine is fine, so I have two requests for assistance:

1) Determining the root cause of the issue without scrapping the Plex plugin instance. I have spent hundreds of hours meticulously modifying the metadata in this instance, which is in read-only mode, along with making playlists and I absolutely do not want to start over from scratch again if that can be avoided.
2) Is there a way to schedule a recycle of the Plex plugin at a specific time daily until the root issue can be uncovered? I really don't want to have to restart the Plex instance every day if it is something that can be automated.

Thanks so much for reading the thread and any help that can be provided.

StevoFNF · May 9, 2021

Quick update while I can sit down:

I was able to set up a cron job to restart the jail that contains Plex, so the second request is taken care of as a temporary workaround.

In the course of looking at that, I did another check of my Plex for any other scheduled tasks. Looks like a few did sneak in there somehow and they are, of course, set to start running every day at 2AM. So for the time being I am at least able to try a few more changes to the tasks that run, and can also set them to run on the hour, giving me a lot more chances to observe the issue as it occurs as opposed to many hours after the fact. Will post back more after additional testing.
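For anyone landing here with the same problem, a minimal sketch of the kind of cron entry described above. The jail name plex and the 05:30 run time are placeholders, not values from this thread; check iocage list for the real jail name. On TrueNAS the same command can be entered under Tasks > Cron Jobs rather than in a hand-edited crontab.

```
# minute hour day-of-month month day-of-week  command
# Hypothetical: restart the jail named "plex" daily at 05:30
30 5 * * * /usr/local/bin/iocage restart plex
```

Picking a time after the Plex maintenance window means the restart cleans up after a crash rather than interrupting the tasks mid-run.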

StevoFNF · May 9, 2021

Okay, so can confirm that it is something in the Plex scheduled tasks that is causing the jail to go belly-up. I'm going to try drilling down into the Plex logs and opening a simultaneous ticket with them, while also just disabling everything for the scheduled tasks entirely.

If anyone here has any suggestions for commands, log traces, etc. to set up on the TrueNAS end that could also help, that would be greatly appreciated.

StevoFNF · May 9, 2021

Thread opened on the Plex boards for it: https://forums.plex.tv/t/truenas12-...tect-intros-causes-memory-leak-crash/714892/2

The culprit ended up being a function which had been working for quite a while but suddenly decided to turn sour: Detect Intros. This is a Plex feature that detects the introductions to TV shows and then allows users to skip them via an OSD prompt. When it is turned off, I still get huge amounts of Plex CPU usage, but it only nibbles at the RAM and finishes the maintenance window without issue. With it turned on, whenever the butler gets around to the Detect Intros task, Plex consumes RAM until there's none left, then TrueNAS kills the PMS service in the jail. An iocage restart of the jail is the most reliable way to get out of that situation. It seems like it doesn't even try to access the swap, despite messages in the console to the contrary, and just gives up whenever RAM is full. The services segment of the live monitor reached 14GB of my 16GB available before clearing out.

I am now wondering if there is a way to limit the amount of RAM any specific service can request to see if it plays any nicer with a lower cap?

oCh33sYo · May 12, 2021

StevoFNF said:
I am now wondering if there is a way to limit the amount of RAM any specific service can request to see if it plays any nicer with a lower cap?

This might be along the lines of what you're looking for in that regard.

Reddit - The heart of the internet

www.reddit.com

Sorry I can't be of more help with the Plex issue itself.

sretalla · May 12, 2021

StevoFNF said:
I am now wondering if there is a way to limit the amount of RAM any specific service can request to see if it plays any nicer with a lower cap?

The one thing you can probably impact is the ZFS cache (ARC) size limit. That should normally be adjusted automatically by the system, but in heavy load situations (like your Plex maintenance task window/intro detection at 02:00 would seem to be) it can fail to respond quickly enough, so artificially pushing it a bit lower might have a positive result.

The tunable for that would be a sysctl type with the name vfs.zfs.arc_max and a value in bytes of the largest amount of RAM you would allocate to ARC.

You may be able to get a bit of an idea of how much to allocate by looking at the output from arc_summary

You can also set it for testing (without a tunable, but will only last until reboot) with sysctl vfs.zfs.arc_max=62277025792 (in my example I had it set up for 58GB... worth noting I don't have it enabled currently) which will take immediate effect (you should be able to see it on the dashboard graph).
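As a sanity check on the byte value above, here is a small sketch of the conversion. The helper name gib_to_bytes and the 8 GiB figure are made up for illustration; only the 58 GiB value comes from the post. The sysctl line is left commented out because it should only be run as root on the NAS itself.

```shell
#!/bin/sh
# Convert a size in GiB to bytes, the unit vfs.zfs.arc_max expects.
# gib_to_bytes is a hypothetical helper, not a TrueNAS or FreeBSD command.
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# 58 GiB, matching the value quoted above
gib_to_bytes 58    # prints 62277025792

# Hypothetical cap of 8 GiB for a 16 GB machine
gib_to_bytes 8     # prints 8589934592

# To apply until the next reboot (as root on the NAS):
# sysctl vfs.zfs.arc_max="$(gib_to_bytes 8)"
```

Running sysctl vfs.zfs.arc_max with no value afterwards shows the limit currently in effect.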

StevoFNF · May 15, 2021

oCh33sYo said:
This might be along the lines of what you're looking for in that regard.

Reddit - The heart of the internet

www.reddit.com

Sorry I can't be of more help with the Plex issue itself.

Thank you very much for the resource link. No worries about the Plex help. I'm currently biting the bullet, doing a full backup, and converting this particular machine from TrueNAS Core to TrueNAS SCALE. From what I've been reading, the underlying features that I use are already mature implementations on Debian (OpenZFS + scrubs, Plex, SMB, FTP) and the alpha stuff is mostly related to the UI and the scalable features. I'll definitely keep this in mind if I have to revert to Core though, as I'd rather not have to buy more RAM just to fix this bug.

I will also say that in the intervening days since I posted this, the issue has happened at different levels of crashing. Sometimes it just kills Plex, sometimes the whole machine goes down, and I still can't really nail down what's causing it. Unfortunately I don't really have the time at the moment to delve too much deeper unless the same issues are experienced on SCALE. (Side note: the SCALE conversion is appealing for the addition of Plex HW Transcoding which, afaik, will likely never be available on a FreeBSD-based solution).

sretalla said:
The one thing you can probably impact is the ZFS cache (ARC) size limit. That should normally be adjusted automatically by the system, but in heavy load situations (like your Plex maintenance task window/intro detection at 02:00 would seem to be) it can fail to respond quickly enough, so artificially pushing it a bit lower might have a positive result.

The tunable for that would be a sysctl type with the name vfs.zfs.arc_max and a value in bytes of the largest amount of RAM you would allocate to ARC.

You may be able to get a bit of an idea of how much to allocate by looking at the output from arc_summary

You can also set it for testing (without a tunable, but will only last until reboot) with sysctl vfs.zfs.arc_max=62277025792 (in my example I had it set up for 58GB... worth noting I don't have it enabled currently) which will take immediate effect (you should be able to see it on the dashboard graph).

Thanks for that. I've enabled the tunable with a limit of about 60% of my total RAM for now, just while I back things up and get ready to convert from Core to SCALE. On top of the ARC limitation, I've disabled the Plex and SMB services and, since doing that, I've yet to get a repeat of the issue. So one of those things, or some combination, is restoring the stability for now.

ThreeDee · May 15, 2021

...and on a side note .. you might want to invest in an Intel NIC at some point .. or sell your current motherboard and get a server orientated motherboard like:

ASRock Rack > X470D4U

www.asrockrack.com

or

ASRock Rack > X470D4U2-2T

www.asrockrack.com

StevoFNF · May 15, 2021

ThreeDee said:
...and on a side note .. you might want to invest in an Intel NIC at some point .. or sell your current motherboard and get a server orientated motherboard like:

ASRock Rack > X470D4U

www.asrockrack.com

or

ASRock Rack > X470D4U2-2T

www.asrockrack.com

Indeed, the reason I was using the onboard NIC was that my previous Intel NIC had been implicated in part of the original problem (TrueNAS would release all IPs saying "err 61 connection refused" and would not bind back to any of the network interfaces until a hard reboot). Got my Intel NIC in on Thursday, but the issues were prevalent both with the original Intel NIC and without. In hindsight it seems it was a victim, not a culprit, in whatever the issue is here.

I do have a SuperMicro A1SAM-2550F, but the processor in that was buckling fiercely under the Plex transcode threads, unfortunately. It's a difficult balance between acceptable performance and reliable server infrastructure for my particular needs. The Plex NAS also counts as my redundant media storage server, so high reliability and availability need to be balanced with processing power. Once I am done with my backup, the TrueNAS SCALE rig will look as follows:

AMD Ryzen 2700X
16GB Crucial DDR4 ECC
ASUS Prime B350M-Plus (supports ECC)
5xSeagate Ironwolf 8TB
Boot drive: KingDian 240GB SSD
2x GeForce GTX 960 2GB (for Plex hardware transcoding, since it's fully supported on the Debian Docker image that TrueNAS SCALE has in its catalog)
Intel NIC
Corsair CX750M

The motherboards you linked I have bookmarked for whenever I need to go from the 2x GTX960 to a Quadro P2000 or P4000 for additional threads though. The suggestion is very much appreciated!

ThreeDee · May 15, 2021

My wife runs Windows 10 on a Prime B350, currently with a Ryzen 5 3600 .. soon to be a 3700x ..with 2 x 16GB 3200 non ECC stuff .. it's been a rock solid motherboard since I initially got it .. 1700>2200g>3400g>3600

ThreeDee · May 15, 2021

the server boards I listed have a built-in video chip so a video card isn't needed ..

a 4GB GTX960 only has a passmark score of 6023

your 2700x is about the same as a 3600 with passmark 17599 .. you're kind of wasting electricity running old video cards.

Sell the 2700x with Prime B350 and GTX 960's .. buy a 3600 or 3700x (Passmark 22801) both 65w TDP and an AM4 server motherboard with onboard video and IPMI.. use same ECC RAM .. you'd probably have a couple bucks left over the way video cards are right now to buy more RAM

with my setup in sig I'll have up to 10 friends/family watching stuff on my plex at a given time without issue .. (all 1080p content though)

you can still run a Quadro or whatever down the road if you needed the extra processing power

just a friendly suggestion based on my own personal experience(s)

davidjwbailey · Dec 14, 2021

Probably worth linking this thread to the ones on corrupt metadata causing swap file overruns when metadata scanners run as scheduled tasks-

Plex docker is crashing my server when it runs maintenance cron

I can't seem to figure out why the PMS docker is crashing my server when it runs the maintenance task each night. All of my drives are in good condition, swapped out all the bad ones with SMART error. Hoping that would fix the problem, but it continues. I have 4 other docker images running and 2 ...

forums.unraid.net