swap_pager_getswapspace: failed

Status
Not open for further replies.

Amsoil_Jim

Contributor
Joined
Feb 22, 2016
Messages
175
Running on 11.2 Beta3.
This morning a cron job ran to scrub my main pool. It usually takes about 2 hours, and at 6am I get an email with the scrub and SMART test results and a config file.
Well, I didn't get that email. I attempted to navigate to the GUI and it was unresponsive; the jails were also unresponsive. So I went to the IPMI and pulled up the iKVM viewer to see this error posting over and over:
[screenshot: console repeating "swap_pager_getswapspace: failed"]

I don't get why the system is even using swap when it has 192GB of RAM. I restarted the system and it is operating again.
Is this just an issue with the Beta?
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
ok, well:

If you please, can we have the entire specs on the system in question?

And also, go to your GUI reporting tab, under "Memory", and verify indeed that swap was being pulled at the time in question, and take a look at those graphs so that you know when/if the behavior started (was it in fact coincident with the scrub? or perhaps another cron process?)

There's a distinct possibility something more severe than "using swap and running out of it" is at work.

Anyway, your first step, I think, is to identify when the unexpected behavior began by judicious use of the "reporting" information.
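
(If the box becomes unresponsive again before you can reach the GUI, the same information can be pulled over SSH with stock FreeBSD commands; nothing here is FreeNAS-specific:)

# show swap devices and how much of each is in use
swapinfo -h

# one-shot snapshot of memory/swap totals plus the 20 biggest resident-memory processes
top -b -o res 20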
 

Amsoil_Jim

Contributor
Joined
Feb 22, 2016
Messages
175
System:
FreeNAS-11.2 Beta3
Motherboard: X9DRI-LN4F+
CPU: 2 x Intel Xeon E5-2690 v1 @ 2.9GHz
Memory: 192GB DDR3 ECC SDRAM (24 x 8GB)
Disks: 6 x WD Red 3TB in RAIDZ1 & 1 x 128GB SSD stripe
Boot Disk: 2 x SanDisk Cruzer Fit 16GB "mirror"
RAID Controller: LSI 9211-4i HBA
Backplane: SAS2-846EL1

Swap usage:
[screenshot: swap usage, past week]

[screenshot: swap usage, past day]


The peak happens when the scrubs start:
[screenshot: swap usage at the start of the scrub]


But the chart does show a steady 2GB of swap in use up to that point, and I don't know why.

Since I started using the betas there has been an increase in swap usage.
[screenshots: swap usage over several recent periods]
Past year:
[screenshot: swap usage, past year]

CPU usage also shows a small increase, which even continues after the reboot.
[screenshot: CPU usage]
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
Well isn't that fun.

What do the Memory usage graphs on that same reporting tab look like when this is happening? What kind of jails do you have? What's running in them? How many are there?
 

Amsoil_Jim

Contributor
Joined
Feb 22, 2016
Messages
175
When this is happening the system is not accessible; I'll get a screenshot of the graphs when I get home later.
jails:
Warden-
Sickrage
Tautulli

Iocage-
Unifi controller
Plex
Couch potato
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
So, SickRage and Tautulli are known to be trivial. That can't be the problem.

However, both Plex and particularly Unifi are really nasty. Beta + iocage + Unifi sounds like a recipe for a problem to me. Plex has been known to leak memory.

If this were me, I'd first shut down the Unifi controller for some period of time and see if I still get the runaway memory problem. If not, I'd do the same with Plex. Almost certainly it's one of those two. Once we know which of the two it is, we might explore exactly what's gone wrong.
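
In case it helps, here's roughly what that isolation looks like from the shell; the jail names below are only examples, so use whatever iocage list and warden list actually report:

# iocage-based jails
iocage list              # confirm the exact jail name
iocage stop unifi        # stop the suspect jail while you watch the memory graphs
iocage start unifi       # bring it back afterwards

# legacy warden-based jails
warden list
warden stop sickrage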
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
One does not actually *NEED* the Unifi controller running to use the network stuff, so that's safe enough.
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
Also, for the record, Plex's "automatic" maintenance tasks, I believe, default to starting at local 0:00... so that could be it.
 

Amsoil_Jim

Contributor
Joined
Feb 22, 2016
Messages
175
Here are the memory graphs:
[screenshots: memory usage]

I'll try turning off the Unifi jail first. Plex is used a lot, so turning that off for a while is not as easy.
 

Amsoil_Jim

Contributor
Joined
Feb 22, 2016
Messages
175
OK, so I turned off the Unifi jail and tried running a scrub, and the system immediately became sluggish and the web GUI was unreachable. I could access it via SSH, but it was very slow to respond.
After about 45 minutes the "swap_pager_getswapspace: failed" error started showing on the console, so I'm restarting the system and going to shut down Plex and try the scrub again. But I have a feeling I won't get the error, because the restart cleared the memory.
Edit: added graphs
[screenshots: swap and memory usage during the scrub]
 

Amsoil_Jim

Contributor
Joined
Feb 22, 2016
Messages
175
After the reboot I have the Unifi controller and Plex turned off, and I have been scrubbing for almost 2 hours; everything is still fully accessible with no lag. Here are the memory graphs so far. So may we assume it has something to do with Plex in iocage? I only moved Plex to an iocage jail after going to the 11.2 beta; there were no issues like this before the beta or before Plex ran in an iocage jail.
[screenshot: memory usage during the scrub]
 

toadman

Guru
Joined
Jun 4, 2013
Messages
619
Very odd indeed. Looks like something has a memory leak?

At the very least you can add a large chunk of swap on your pool to stabilize the system in the short term as you debug (i.e. if swap starts growing you will at least have a bunch of space there so the system won't freak out). It's consuming about 3.5GB per hour according to that graph.

It would appear your VMs are playing a part for sure. Sorry I can't be of more help.
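
If anyone wants to try the extra-swap idea, here is a rough sketch of one way to do it on FreeBSD/FreeNAS using a zvol on the pool (the pool name "tank" and the 16G size are only examples, and swapping onto ZFS has its own caveats, so treat this strictly as a temporary debugging aid):

# create a zvol to use as swap space
zfs create -V 16G -o compression=off -o sync=always -o primarycache=metadata tank/swap0

# enable it right away and confirm it shows up
swapon /dev/zvol/tank/swap0
swapinfo -h

# optional: setting -o org.freebsd:swap=on on the zvol lets the rc scripts enable it at boot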
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
Yeah so, this is very interesting, right. We've got tens of thousands of users with Plex jails on FreeNAS. Literally, if even the smallest thing is wrong with it, we hear about it in about 5 seconds. And, we have NOT been hearing about this.

I can only assume that the real culprit here is the Beta version of FreeNAS itself (which is why no one is supposed to run anything they care about on a beta...)

@Amsoil_Jim, sir, I assume there's no way to tell whether your situation reproduces on a production release instead of a beta?
 

Amsoil_Jim

Contributor
Joined
Feb 22, 2016
Messages
175
Here's today's graph. Plex was restarted after the scrub completed around 5pm, but the graph doesn't really climb until this morning around 4am, and Plex starts its tasks at 2am.
[screenshots: swap and memory usage, past day]

DrKK said: "I can only assume that the real culprit here is the Beta version of FreeNAS itself (which is why no one is supposed to run anything they care about on a beta...)"
How do you find issues if you don't run stuff in a Beta like it was a finished product?
DrKK said: "@Amsoil_Jim, sir, I assume there's no way to tell whether your situation reproduces on a production release instead of a beta?"
Do you mean reproducing this issue on the 11.1 version? If so, I'm not sure the iocage jails will run, because they were built with the 11.2 iocage version and I know they would not run on 11.1-U5. But I could build another Plex iocage jail on the 11.1 version, because I have the Plex database in its own dataset.
 

toadman

Guru
Joined
Jun 4, 2013
Messages
619
Amsoil_Jim said: "How do you find issues if you don't run stuff in a Beta like it was a finished product?"

You don't. But I think that's why KK added the "anything they care about" part. I.e., if you know it's a beta and everything might be destroyed, proceed. If you care, use GA software. That's all. :)

I know iX and the community do value beta testers. I hope in this case the root cause of your issue can be identified!
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
Indeed. Of course we love Beta testers. We just certainly would not expect anyone to be trusting important and/or production data to a beta build of this, or anything else.

As for your question, "how do you find issues if you don't run stuff in a Beta like it was a finished product?"

Allow me to answer. Sir: there are betas, and there are... betas. Every ecosystem has a different idea of how finished a "beta" is. Some ecosystems call something a "beta" when it hasn't been released yet, but there is every reason to believe that it is in really good shape and that only weird/edge case problems are going to crop up. I would not say we are one of those ecosystems. Our betas are more... interesting.

Plus, of course, a totally jacked up Firefox beta or AMD video driver beta poses much less risk to your crucial data than even a mildly jacked up FreeNAS. I don't even run 11.x stable on my main FreeNAS, only on my backup FreeNAS.
 

adrianwi

Guru
Joined
Oct 15, 2013
Messages
1,231
I’m not convinced this is just related to the 11.2 beta as I’ve been having similar issues with 11.1-U6.

I need to do a little more investigation, as what's happening seems a little random and plays out over a number of days/weeks, but over time something is using all of the swap space and I'm getting the same error messages.

I am pushing my system pretty hard and would love to be able to drop in another 32GB of RAM, but that needs a new board and in reality probably means a new build, retiring my current one to backup duties, which current funds don't allow.
 

jbrown705

Explorer
Joined
Sep 10, 2018
Messages
62
I found this thread looking to troubleshoot a similar issue. I am also getting the swap error and can't figure out why. The only thing we have in common is Plex. One thing I noticed is that Plex shuts down within the jail and has to be restarted. I can't directly make a correlation to the swap errors, but it has been suggested that the two may be linked. I am wondering if you ever found out what was causing your issues. I am still seeing this in RC1, and I'm new to FreeNAS, so my troubleshooting abilities are pretty limited.

Supermicro MBD-X9SCL
32GB RAM
1 pool, 3 x 8TB drives
Xeon E3-1270 3.4GHz
 

ninjabucket

Cadet
Joined
Apr 8, 2017
Messages
2
I am seeing the same issue in 11.2 RC1 and 11.2 RC2. I never had this issue until updating from stable. The system becomes very sluggish, to the point of not being usable. Checking /var/log/messages I see the following repeating over and over:

swap_pager_getswapspace(32) failed

I ran a top command and saw a bunch of python3.6 processes running. Digging further, it looks like tons of autosnap.py processes were running. I did a "killall python3.6" command, which unfortunately breaks the web UI but stops that error in the logs and returns the system to its normal speed.
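
For anyone else hitting this, a quick way to see what those python3.6 processes are actually doing before resorting to killall (plain FreeBSD commands, nothing FreeNAS-specific):

# show the full command lines of the autosnap.py processes, and count them
ps auxww | grep '[a]utosnap.py'
ps auxww | grep -c '[a]utosnap.py'

# check how much swap is actually in use
swapinfo -h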

Is this something potentially in autosnap.py causing the issue, or maybe something with snapshots coming from a previous version?
 

ninjabucket

Cadet
Joined
Apr 8, 2017
Messages
2
I think I resolved my issue. There were some snapshots of jails where the actual jail had long since been removed. After manually removing those snapshots, all of the autosnap.py processes went away and the issue has not recurred. So for anyone experiencing this, check and see if you have a pile of autosnap.py processes running. If so, try manually removing the stale snapshots and see if that corrects the issue.
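
Roughly what that cleanup looks like from the shell, in case it helps someone; the pool/dataset/snapshot names below are only examples, so double-check what zfs list shows on your own system before destroying anything:

# list snapshots under the jail datasets (name and space used)
zfs list -t snapshot -o name,used -s name | grep jails

# destroy a stale snapshot once you are sure nothing references it
zfs destroy tank/jails/oldjail@auto-20180101.0000-2w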
 