Server becomes unresponsive after extended moderate load

My FreeNAS box, currently running whatever was the current release in the stable train late last week (it's locked up again as I started typing this), periodically becomes unresponsive. Sometimes it runs for several days of very light use without a problem, but lately it hasn't made it through the weekly scrub of one of the pools. Running a full backup of one pool to an external USB3 disk also rarely, if ever, completes.

In this current instance, it ran about 3 hours of the scrub. (I've got a cron job that runs every 5 minutes, grabs the drive temps, and pushes them to another server along with a graph of the last week; the scrub elevates drive and CPU temperatures enough that it's obvious when it started. Not high temps: the drives idle at about 32C and jump to 35C during the scrub, and the CPU jumps nearly 15C, to 40C.) In the past it has sometimes run longer, but it hasn't completed the scrub without becoming unresponsive for several weeks.
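For what it's worth, the temperature-grabbing cron job is nothing fancy; it's roughly along these lines (a sketch of the idea, not my exact script; the device names and the destination host are placeholders):

Code:
#!/bin/sh
# Rough sketch of a drive-temperature logging job (device names and the
# remote host "grapher" are placeholders, not the real setup).
TS=$(date '+%Y-%m-%d %H:%M:%S')
OUT=/tmp/drivetemps.txt
: > "$OUT"
for dev in /dev/ada0 /dev/ada1 /dev/ada2 /dev/ada3 /dev/ada4 /dev/ada5; do
    # smartctl -A prints the SMART attribute table; attribute 194 is the
    # drive temperature on most disks, and the raw value is column 10.
    temp=$(smartctl -A "$dev" | awk '$1 == 194 {print $10}')
    echo "$TS $dev $temp" >> "$OUT"
done
# Push the readings to the graphing host (key-based SSH assumed).
scp -q "$OUT" grapher:/var/db/fsfs-temps/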

When it enters this bad state, the lights on the disk drives suggest that the scrub is still proceeding (constant activity on the disks in that pool, but not on other disks), the GUI is unresponsive, Samba shares are unresponsive, SSH is unresponsive, and the console is unresponsive. It can persist for more than 10 hours (my patience and my need for access to my data have so far blocked any longer experiments). Ctrl-Alt-Del from the keyboard has no effect, and pushing the power button briefly has no effect. Pushing reset, or forcing powerdown with a long power-button push, works, and it reboots reasonably cleanly (and zpool status shows that the scrub completed without finding any errors). The console shows recent log activity when I view it while the system is unresponsive: ARP errors, "attempts to modify permanent entry for <server ip> on epair0b". I'm attaching a photo of the console from an instance last week (same message I'm seeing today), in case some detail is informative to someone else.

I've seen suggestions that the power supply could be the issue. Some of this may relate to my having put one more disk drive in (this round of trouble started when I replaced a failing drive a month or two back; I also experimented with using a spare slot rather than USB for backup, and that drive is still present); then again, it *first* came to my attention *before* I put the extra drive in. But maybe I'm on the ragged edge and the power supply is degrading, or something. Oh, I also disconnected one fan and installed another (and the disk drive temps dropped significantly; they haven't been above 40C since then); but there's no net increase in fan count, so no big change in power draw.

This system is years old and has been stable for most of that time. But degrading components happen, and I'd be happy if it's as simple as the power supply. The box currently has 6x6TB drives (one 3-way mirror, one 2-way mirror, and that drive I was attempting to make a backup onto; Hitachi, Toshiba, Whitelabel, and maybe one other brand). Modern drives seem to draw less and less power. The motherboard is an Asus M5A97LE R2.0 with an AMD FX-6300 and 16GB of ECC memory.

Any ideas on how to diagnose further? My problem, always, is that I'm maintaining this amount of data on an inadequate budget, and it's key data to me (decades of my photos, and a bit of other stuff), so replacing everything in sight at the first symptom of trouble isn't really an option.
 

Attachments

  • IMG_20190626_015437.jpg (425.1 KB)
Well... I've gotten the behavior to at least moderate (as of this morning's weekly scrub run) by replacing the power supply (and upgrading it by 200 watts, from 450 to 650).

During the scrub, the system was nearly unresponsive to the GUI, but responded fine to SSH connections, and I could at least see the shares over SMB (I deliberately avoided stressing SMB at all while the scrub was running; I need updated backups before I do anything stupid).

The scrub was showing rather strange status reports as it was running, like this:

Code:
[ddb@fsfs /mnt/zp1/public/misc/BlaisdellPoly/fsfs-temps]$ zpool status zp1
  pool: zp1
 state: ONLINE
  scan: scrub in progress since Tue Jul  9 02:00:24 2019
        4.01T scanned at 104M/s, 3.94T issued at 102M/s, 4.28T total
        0 repaired, 92.17% done, 0 days 00:57:16 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        zp1                                             ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/56c10dc9-0d98-11e6-bc09-00074305ce14  ONLINE       0     0     0
            gptid/5787ccce-0d98-11e6-bc09-00074305ce14  ONLINE       0     0     0
            gptid/b0fb5217-72af-11e9-8d5c-2c4d54526dc1  ONLINE       0     0     0

errors: No known data errors


4.01T scanned, 3.94T issued??? That looks weird, but I'm not really sure what it means, so maybe it's not actually weird. My mind jumps to "issued" I/O vs. "completed", which would mean "scanned" should be smaller than "issued", not bigger. But it probably just means something else; perhaps the newer sequential scrub scans metadata ahead of issuing the actual reads, which would let "scanned" run ahead of "issued".

The GUI unresponsiveness was also less total than before; many pages still wouldn't come up (notably the reporting pages), but the dashboard loaded at least some of the time.

And, after completion of the scrub, the GUI returned to the usual fairly quick response, without having to reboot the server.

So, I'm cautiously hopeful, but will continue testing things a bit carefully, and make sure my backups get fully up-to-date.
 
Now running a full backup, a couple of hours in (full because one of the backup sets was damaged in the earlier events; normally I do zfs send / receive incrementally from the last snapshot found on the backup volume). GUI and SSH are remaining responsive, and drive temps are staying down (so the physical part of the fix is holding up too).
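The incremental replication is roughly this shape (a sketch of the approach, not my exact commands; the pool, dataset, and snapshot names are placeholders):

Code:
#!/bin/sh
# Sketch of incremental zfs send/receive from the last snapshot already
# present on the backup volume; names below are placeholders.
SRC=zp1/public
DST=backup/public

# Newest snapshot that already exists on the backup dataset.
LAST=$(zfs list -H -t snapshot -o name -s creation -d 1 "$DST" | tail -1 | cut -d@ -f2)

# Take a fresh snapshot on the source to send up to.
NOW=backup-$(date +%Y%m%d)
zfs snapshot -r "$SRC@$NOW"

# Incremental send from the common snapshot; -F on receive rolls the
# destination back to that snapshot first if it has drifted.
zfs send -R -i "@$LAST" "$SRC@$NOW" | zfs receive -F "$DST"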

Maybe the power supply getting old and weaker really was the main factor! I'll feel more confident of that when the backup *completes*, though.
 