Opening Dashboard causes Collectd, Python3.7, and rrdcached processes to spike CPU and system load through the roof

Dewey · May 9, 2020

Hoping someone can help me figure out what is going on. My FreeNAS box has been great, when it works. Lately it has been getting into a fit, where whenever I access the WebUI dashboard I have 3 processes nail the system to the wall to the point that unless I'm already logged in via SSH, I can't even login. The console is slow and unresponsive, slower than SSH, and even if I issue a shutdown, it never will after is has already unmounted all the pooled disks. If I am lucky to already be SSH'd in I can run top and see this:

Code:

last pid:  7979;  load averages: 22.66, 23.51, 22.78                                                                                  up 0+06:52:39  22:03:57

59 processes:  2 running, 57 sleeping

CPU: 19.3% user,  0.0% nice, 64.9% system,  5.9% interrupt,  9.9% idle

Mem: 222M Active, 1456M Inact, 1694M Wired, 59G Free

ARC: 547M Total, 126M MFU, 390M MRU, 9709K Anon, 5774K Header, 16M Other

     132M Compressed, 816M Uncompressed, 6.16:1 Ratio

Swap: 24G Total, 24G Free


  PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND

2929 root         11  25    0   390M   340M nanslp  9 273:01 246.04% collectd

   79 root         37  46    0   316M   238M umtxn   4 189:26 228.95% python3.7

1427 root          8  21    0 31968K 11632K select 12  79:18  93.42% rrdcached

7836 root          1  81    0  7892K  3920K CPU5    5  30:22  25.00% top

3328 root          1  38    0 12924K  7912K select 14   7:50  15.79% sshd

1359 root          1  30    0 12484K 12580K select 14  11:22  11.84% ntpd

1504 root          1  26    0   127M   106M kqread  5   6:12   9.21% uwsgi-3.7

1407 root          1  28    0 38072K 22540K select  6   3:30   3.95% winbindd

1391 root          1  22    0 31052K 17860K select  2   3:03   2.63% nmbd

1442 root          1  20    0   120M   103M select  5   0:05   1.32% smbd

1035 root          1  22    0  9164K  5556K select 10   0:04   1.32% devd

FreeNAS is latest stable version as of this writing (Updated 2 nights ago). Storage/Boot Drives, Memory and RAM is barely touched, CPU is cranking on something from collectd, rrdcached and python. I see nothing in the logs, and eventually it gets to the point the metric offloads to Graphite stop, Web UI won't load (in fact I will see socket timeout messages on the console from Django), NFS won't mount, and existing mounts are very slow or time out. I'd like to figure out what is causing it and get it to stop. Or worst yet, find whatever hardware maybe went bad. I'll admit, I don't really know how to check if my backplanes or drives are okay, outside of extended SMART checks are returning okay.
Here is the hardware FreeNAS is running on:
Processor: 2x AMD Opteron AMD 6212 Octo (8) Core 2.6Ghz (Total 16 Cores)
Memory: 64GB DDR3 (8 x 8GB - DDR3 - REG PC3-10600R (1333MHZ)
Server Chassis/ Case: CSE-847E16-R1400UBMotherboard: H8DGU-F

Just last night it happened again while I was asleep. I woke up to find it in the problematic state. I was poking around trying to get more information while it was happening (not very successful, I'm normally a Linux admin guy and FreeBSD is proving to really test what I can do with limited tools I'm used to, but that's a learning problem). Suddenly I noticed the problem stopped. I double checked my graphite/graphana server (which I stood up to specifically figure this out), and all the metrics that weren't getting saved off, were now there. Here is what metrics I have over the 8 hour period that collectd, python3.7, and rrdcached were working hard in some mine and I don't even know why.

Unfortunately, when I opened the UI after this happened, it happened yet again, and the UI has yet to return. I suspect that it will take another 8 hours to return unless I force a reboot. UI seems fine if I get in and out quick enough. Otherwise CMD line is my only way to do anything. Any help would be greatly appreciated.

Samuel Tai · May 9, 2020

Where is your system dataset? If it's on the boot volume, and the boot volume is a USB stick, you may be encountering the USB stick doing write reallocation, and the FreeNAS system spiking CPU while it's waiting.

Dewey · May 9, 2020

Samuel Tai said:
Where is your system dataset? If it's on the boot volume, and the boot volume is a USB stick, you may be encountering the USB stick doing write reallocation, and the FreeNAS system spiking CPU while it's waiting.

Boot volume is a mirrored SSD pair. I believe the system dataset is going there, but I honestly can't remember. Is there a way to check via command line? Is this getting triggered with the UI opening because of the metrics the front page tried to show?

Samuel Tai · May 9, 2020

If you've allowed root to login via a password, then you could run the REST 2.0 API call via curl:

Code:

curl --basic -u root -k -X GET "http://<FreeNAS GUI IP>/api/v2.0/systemdataset" -H "accept: */*"

Enter root's password when prompted, and you'll get the name of the zpool hosting the system dataset. If your web GUI uses certificate, substitute https in the curl GET string.

All the collectd data gets put on the system dataset, including the dashboard UI. It's not good practice to use the boot zpool for the system dataset.

Samuel Tai · May 9, 2020

To get the names of the active zpools, run zpool status | grep pool. You can also use the REST 2.0 API to change the system dataset.

Code:

curl --basic -u root -k -X PUT "http://<FreeNAS GUI IP>/api/v2.0/systemdataset" -H "accept: */*" -H "Content-Type: application/json" -d '{"pool":"<name of new system dataset zpool>","pool_exclude":"","syslog":true}'

Samuel Tai · May 9, 2020

From your description, I suspect your system dataset is on zpool freenas-boot, the mirrored SSD pair, and the IO from the mirroring is killing collectd performance, lagging your UI performance to nothing. Try switching the system dataset to your data zpool.

Samuel Tai · May 9, 2020

If you've let collectd run on freenas-boot since 2016, then it's likely both members of the mirror are past the SSD's point of no return for free blocks, and both SSDs are actively garbage collecting, with the mirror trying to synchronize the garbage collection. This would definitely drop your IOPS to nothing.

Dewey · May 9, 2020

This particular install is only a few months old. I just double checked and the system dataset is not set to the freenas-boot pool. It is set to the only dataset I have created.

Code:

{
"id": 1,
"pool": "HotSwapSpindleDisks",
"uuid": "1d75fdc729874c0cb54b5da2552ea8c9",
"uuid_b": null,
"is_decrypted": true,
"basename": "HotSwapSpindleDisks/.system",
"uuid_a": "1d75fdc729874c0cb54b5da2552ea8c9",
"syslog": true,
"path": "/var/db/system"
}

So I already have that going for me. I honestly should have known that sooner just by checking the stats I have. All the disks that are reporting latency and busy, belong to that pool. So does that mean I have something wrong with my pool? Here is the pool in question:

Code:

# zpool status                     
  pool: HotSwapSpindleDisks
 state: ONLINE
  scan: scrub repaired 0 in 0 days 02:30:09 with 0 errors on Sun May  3 02:30:11 2020
config:

    NAME                                            STATE     READ WRITE CKSUM
    HotSwapSpindleDisks                             ONLINE       0     0     0
      raidz2-0                                      ONLINE       0     0     0
        gptid/373bc548-deb6-11e6-b9fc-0cc47aac6d22  ONLINE       0     0     0
        gptid/faf9a51f-dee6-11e6-9c8a-0cc47aac6d22  ONLINE       0     0     0
        gptid/7d09df39-dffb-11e6-9c8a-0cc47aac6d22  ONLINE       0     0     0
        gptid/d6ad9cee-e030-11e6-9c8a-0cc47aac6d22  ONLINE       0     0     0
        gptid/fb7dcd2e-dfa9-11e6-9c8a-0cc47aac6d22  ONLINE       0     0     0
        gptid/127e0e7f-df72-11e6-9c8a-0cc47aac6d22  ONLINE       0     0     0
        gptid/5ce13e43-e0c4-11e6-9c8a-0cc47aac6d22  ONLINE       0     0     0
        gptid/75b94b82-e073-11e6-9c8a-0cc47aac6d22  ONLINE       0     0     0

Samuel Tai · May 10, 2020

On https://github.com/Cacti/rrdproxy/issues/3, I found a suggestion that rrdtool may be hitting a socket too frequently, and tripping the OS's SYN flood protection. This causes a wait state leading to the high CPU.

One thing to try is to enable autotune in System->Advanced, and then reboot. This will set various sysctls under System->Tunables that may help mitigate this issue.

Dewey · May 10, 2020

I enabled autotune and the system setup the following settings:

I have noticed with the autotune set tunables the web UI isn't as snappy, but I can live with that if it means my server doesn't eat itself alive ;)
And now I wait. I'll let it sit for a couple of hours and try hitting the web ui (as that tends to trigger the problem more often than not) every so often. I'll also keep an eye on metrics and see if it triggers again. Fingers crossed it doesn't.
I'll let this run for the rest of the day with these and will report back what happens.

Dewey · May 10, 2020

And it barfed again. Opened the dashboard (UI front page), and the system spiked. same processes with high cpu and same load numbers. It isn't even a slow burn, it is instantaneous. System shows just fine and very idle, the moment I open the dashboard, the system become unresponsive and even top barely opened.

Samuel Tai · May 10, 2020

This is kind of brute-force, but try stopping collectd (/usr/local/etc/rc.d/collectd stop) and then nuking everything under /var/db/system/rrd-<long hex string>/localhost, before restarting collectd. There will be several rrd-* directories; use the one that's most recent.

Dewey · May 10, 2020

Nuked and restarted collectd and the host while I was at it. After a couple of hours it came back with the same processes with high cpu and high load.

Samuel Tai · May 10, 2020

Ok, usually when I'm stumped I look at /var/log/middlewared.log. What does the middlewared itself say is making it unhappy?

Dewey · May 10, 2020

I have attached the log. I looked through it and couldn't find anything that stuck out. Just that whenever the heavy load was occuring that it was reporting a task being blocked. I presume because of the heavy load. I have also attached daemon.log as it was also reporting some collectd errors, but sadly does not go as far back.

Samuel Tai · May 11, 2020

Did you lose a fan? The daemon.log shows your disks are screaming about temperature. They're all over 70 Celsius, and most disks get unhappy above 60 Celsius.

Samuel Tai · May 11, 2020

The logs show socket timeouts, especially when trying to pull the SMART data from your disks. I'm surprised your disk controllers are even working at those temps.

Dewey · May 11, 2020

I had it turned off overnight, booted it up about an hour ago, all disks were reporting 25-27C, I hit the UI 5mins ago, same thing. Collectd, rrdcached, and python3.7 high CPU and >20 Load Avg. The SMART Attribute 190 Airflow_Temperature_Cel is reporting the difference between SMART 194 - Temp and the value set by a manufacturer (or defaults to 100 if not set by them) before you are officially outside the manufacturers operating temps (https://en.wikipedia.org/wiki/S.M.A.R.T.#Known_ATA_S.M.A.R.T._attributes). The highest value I am seeing reported in the log for SMART 194 is no higher than 30.

Samuel Tai · May 11, 2020

My apologies. You are correct; I'd confused 190 and 194. However, the observation still applies:

Code:

May  9 05:54:07 freenas collectd[2000]: Traceback (most recent call last):
  File "/usr/local/lib/collectd_pyplugins/disktemp.py", line 60, in read
    with Client() as c:
  File "/usr/local/lib/python3.7/site-packages/middlewared/client/client.py", line 250, in __init__
    self._ws.connect()
  File "/usr/local/lib/python3.7/site-packages/middlewared/client/client.py", line 93, in connect
    rv = super(WSClient, self).connect()
  File "/usr/local/lib/python3.7/site-packages/ws4py/client/__init__.py", line 223, in connect
    bytes = self.sock.recv(128)
socket.timeout: timed out

You've got some sort of timeout going on when trying to pull SMART temps off your disks.

This collectd bug report may be relevant. This Reddit thread may also be relevant. Unfortunately, it may be the only work-around is not to use graphite.

Dewey · May 11, 2020

Is Graphite running on FreeNAS? I enabled reporting to a Graphite server because of the high load and cpu problem to being with. That way I could look at metrics without being dependent on a UI that was unresponsive with the rest of the box.

Important Announcement for the TrueNAS Community.

Opening Dashboard causes Collectd, Python3.7, and rrdcached processes to spike CPU and system load through the roof

Dabbler

Never underestimate your own stupidity

Dabbler

Never underestimate your own stupidity

Never underestimate your own stupidity

Never underestimate your own stupidity

Never underestimate your own stupidity

Dabbler

Never underestimate your own stupidity

Dabbler

Dabbler

Never underestimate your own stupidity

Dabbler

Never underestimate your own stupidity

Dabbler

Attachments

Never underestimate your own stupidity

Never underestimate your own stupidity

Dabbler

Never underestimate your own stupidity

Dabbler

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Opening Dashboard causes Collectd, Python3.7, and rrdcached processes to spike CPU and system load through the roof"

Similar threads