11.1-U1 Freezing But No Errors Recorded

Joined
Feb 6, 2018
Messages
8
tl;dr - System freezes when basically nothing is running. HW tests and logs don't indicate errors. System specs at bottom. Long history of troubleshooting follows.

I've been having some trouble since late December/early January with my FreeNAS system freezing, typically at night. I can't access the web console (the system has disappeared from the network), and the monitor attached to the system still displays output but the system doesn't respond to keyboard input. The Reset button doesn't work, and I have to hold the Power button to get it to shut down. When the system reboots, I get an email indicating an unauthorized system reboot. If I look back at the graphs in reporting, they abruptly end at whatever time the system froze, but otherwise don't really show anything suspicious. I have only a couple of hundred MB RAM free, but around 13GB of RAM in Wired. CPU temperatures are in the 20-30C range, and the CPU is not under heavy load. Sometimes the disks are under load (depending on the night and what is scheduled), but rarely. The error logs don't seem to indicate any problems either. They just abruptly stop.

The only thing I originally had running on this machine was a single bhyve Ubuntu VM (2 CPUs and 4GB of RAM allocated), which is itself only running Plex (I tried the jail for a while, but found it frustrating).

Originally, I was running 11.0-U4, to my knowledge without issue. I upgraded to 11.1, and my trouble started not long afterward (possibly a coincidence). Reading about the memory leak in that version, I rolled back to 11.0-U4, but the problem persisted. I didn't have time to diagnose at the time, and knowing that 11.1-U1 was coming, I decided to wait. When 11.1-U1 was released, I upgraded to that version, and the problem remained.

With 11.1-U1, the "Check system health" bug message let me easily tell that the system was routinely locking up at 2am. Recognizing that this was when my Plex server did its routine tasks, I disabled the VM and ran a scrub on my Primary Data volume. The system froze during the scrub, which led me to believe it was a hardware failure.

I ran SMART tests on my drives, none of which reported errors. I ran memtest86 on my RAM sticks, which also didn't show errors. When I went to reboot into FreeNAS after the memtest, my System USB stick wasn't recognized by the BIOS. So I replaced the drive with a new USB stick, loaded my config, and everything ran fine (Plex included) for one day and night. I even did a scrub of my data volume to make sure. On the second night, the system froze again.

Not having a full backup of my data volume, I disabled the Plex VM and set up a jail whose only job is to run rclone and sync everything to Google Drive. Given Google's per-day bandwidth limit, that's been taking a while. This ran fine for a week, but two nights ago, my system froze again. The only thing running on my system when it froze was that single jail running rclone. It froze around 22:00, and all of my scheduled tasks are set to start after midnight.

So now I'm at a loss for what to do or test. I'm obviously going to keep doing my backup as long as it will run, but I have no idea what to do next for troubleshooting. Any ideas or suggestions are useful, either of things to try or of data points I have yet to collect.

System Specs (built 04/2017):
Build FreeNAS-11.1-U1
Platform Intel(R) Core(TM) i5-7500 CPU @ 3.40GHz
Memory 16047MB (2x 8GB sticks, non-ECC)
MOBO ASRock Fatal1ty H270M Performance
System Disk SanDisk UltraFit 32GB USB 3.0 SDCZ43-032G-GAM46
Primary Volume
RaidZ2 with the following disks (less than ideal, I know)
  • 1x - 6TB WD Red
  • 4x - 4TB WD Red
Secondary Volumes
  • Single Disk - SAMSUNG 850 EVO 250GB SSD (stores my VMs and jails)
  • Single Disk - External 500GB HDD - Replication target for the previous disk
 

Nick2253

Wizard
Joined
Apr 21, 2014
Messages
1,633
What PSU are you running? Do you have a different PSU that you can test with? The symptoms of the problem make me immediately think PSU, but the regularity of the problem makes me think something different.
 
Joined
Feb 6, 2018
Messages
8
That's an interesting thought. My PSU is a (copying the Newegg name) Thermaltake TR2 Series TR-600P 600W V2.3 & EPS 12V 2.91 80 PLUS BRONZE Certified Active PFC. In terms of ability, it should be more than enough, but it obviously could be faulty. I might be able to borrow a PSU from work. I'll look into that.

Another data point that I just thought of is that I have yet to have it freeze while I was actually using Plex. So even when things were being transcoded and streamed, I haven't had problems. I would think that if it were load or power draw related, doing a transcode would likely cause a freeze.
 

Nick2253

Wizard
Joined
Apr 21, 2014
Messages
1,633
How old is that PSU?

The most power-intensive thing you can do is scrub your disks. Can you trigger a scrub and see what happens?
 
Joined
Feb 6, 2018
Messages
8
I bought the PSU new when I built the system in April 2017.

I just ran a scrub of my tank pool, while also leaving the jail running doing the rclone backup. The scrub did find a single checksum error on one disk, but otherwise it completed fine. Should I be worried about that checksum error? The phrasing of the error message seems both urgent and not a big deal: "The volume tank state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected."

*EDIT* I scrubbed all of my drives and none of them caused the freeze. The only issue was the single checksum error noted above.
 

Nick2253

Wizard
Joined
Apr 21, 2014
Messages
1,633
If you can complete a scrub, it's unlikely that the PSU is the issue.

The checksum error means exactly what you saw: ZFS read a block of data, it didn't match the stored checksum, and then it repaired the data using the parity information in your array.
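
To make the detect-and-repair idea concrete, here is a toy Python sketch. It is not how ZFS is implemented: it uses SHA-256 and a single XOR parity block purely for illustration, whereas a RAIDZ2 pool uses double parity. All names (`checksum`, `xor_blocks`, `read_with_repair`) are made up for this example.

```python
import hashlib
from functools import reduce


def checksum(block: bytes) -> str:
    """Checksum stored alongside each block (SHA-256 chosen only for the demo)."""
    return hashlib.sha256(block).hexdigest()


def xor_blocks(blocks):
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def read_with_repair(stored, sums, parity):
    """Verify every block; rebuild a single bad block from the others plus parity.

    Returns (blocks, indices_that_were_repaired).
    """
    bad = [i for i, block in enumerate(stored) if checksum(block) != sums[i]]
    if len(bad) > 1:
        raise IOError("more damaged blocks than single parity can rebuild")
    repaired = list(stored)
    for i in bad:
        others = [block for j, block in enumerate(stored) if j != i]
        repaired[i] = xor_blocks(others + [parity])
        if checksum(repaired[i]) != sums[i]:
            raise IOError("rebuilt block still fails its checksum")
    return repaired, bad


# "Write" three data blocks, recording their checksums and one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
sums = [checksum(block) for block in data]
parity = xor_blocks(data)

# Simulate silent corruption of the second block, then read everything back.
on_disk = [data[0], b"BxBB", data[2]]
blocks, repaired_indices = read_with_repair(on_disk, sums, parity)
print(repaired_indices, blocks == data)
```

The point is only that a checksum mismatch tells you which copy is wrong, and redundancy lets you rebuild it. That is why the alert can say "unrecoverable error" about the device and "Applications are unaffected" in the same breath.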

Errors happen, but you want to be mindful. Make sure you are doing regular SMART tests, and review the numbers. If you start seeing more checksum errors, you could have a problem with the drive, cable, or controller. For this reason, regular scrubs + SMART tests are a great tool for discovering problems before they become real problems.
 
Joined
Feb 6, 2018
Messages
8
That makes sense. The phrasing of "unrecoverable" together with "an attempt to correct" didn't make it clear that the correction actually succeeded. Glad to know that it did. I already had regular scrubs set up (monthly), and I just enabled SMART tests. I'll keep a closer eye on those.

Otherwise, the server will have been up and running for a week as of tomorrow. I'm still in the midst of the long backup, so I'm still waiting a bit before I do more hardware tests. I'll check back in when I have updates.
 
Joined
Feb 6, 2018
Messages
8
It took 2 months to back up all of my data, but I'm finally in a state where I can start testing things again.

Another thread here in the forums suggested that the USB HDD I was using for my backup (in particular, an overloaded USB interface) might have been the culprit, so I disconnected that drive. I have had only one freeze since then, which seems like a pretty good indicator that something with that drive was causing most of my issues.

Troubleshooting is still a work in progress.
 

boynep

Dabbler
Joined
Jan 9, 2012
Messages
29
Sorry to hijack your thread, but I am having exactly the same issue. As part of troubleshooting, I have disabled most of the services and will run memtest and a scrub today. I'd be curious to know if you managed to fix it. My specs are in my signature.
 

Meyers

Patron
Joined
Nov 16, 2016
Messages
211
Is the whole system freezing? Can you still ping it? Mine was partially freezing. The pool would just lock up. Someone mentioned heat so I set the fan to full on and haven't had a freeze since.
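
If it helps to pin down when the box drops off the network and whether it is a full or partial freeze, a small watcher run from another machine can log the moment a port stops answering. This is only a sketch: it checks a TCP port (for example the web UI) rather than sending ICMP pings, and the function names and the `freenas.local` hostname in the usage comment are placeholders, not anything from this thread.

```python
import socket
import time
from datetime import datetime


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def watch(host, port, interval=30.0, checks=None, log=print):
    """Poll host:port and log a timestamped line whenever its state changes.

    checks=None polls forever; an integer limits the number of polls.
    """
    last_state = None
    done = 0
    while checks is None or done < checks:
        up = is_reachable(host, port)
        if up != last_state:
            stamp = datetime.now().isoformat(timespec="seconds")
            log(f"{stamp} {host}:{port} {'UP' if up else 'DOWN'}")
            last_state = up
        done += 1
        if checks is None or done < checks:
            time.sleep(interval)


# Example (hypothetical hostname), left commented out because it loops forever:
# watch("freenas.local", 80)
```

Watching two ports at once (say the web UI and SSH) would also show whether everything dies together or only some services hang.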

I'd run a thorough memtest. Then I'd fire up a jail and run stress while monitoring CPU temps.
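
For the temperature monitoring part, one possible approach is to poll the `dev.cpu.N.temperature` sysctls while the stress test runs. This is a sketch under assumptions: it expects the FreeBSD coretemp driver to be loaded so those sysctls exist, it assumes they are visible from wherever the script runs (the host shell is the safe bet), and the helper names are invented for this example.

```python
import re
import subprocess

# Matches lines such as "dev.cpu.0.temperature: 45.0C".
TEMP_RE = re.compile(r"^dev\.cpu\.(\d+)\.temperature:\s*([\d.]+)C$", re.MULTILINE)


def parse_cpu_temps(sysctl_output):
    """Map core number -> temperature in Celsius from `sysctl dev.cpu` output."""
    return {int(core): float(temp) for core, temp in TEMP_RE.findall(sysctl_output)}


def read_cpu_temps():
    """Run `sysctl dev.cpu` and return the parsed per-core temperatures."""
    result = subprocess.run(
        ["sysctl", "dev.cpu"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return parse_cpu_temps(result.stdout)


def hottest(temps):
    """Return (core, temperature) for the hottest core, or None if no readings."""
    if not temps:
        return None
    core = max(temps, key=temps.get)
    return core, temps[core]
```

Calling `read_cpu_temps()` in a loop with a short sleep and appending the result to a file on a different disk would leave a record of the last readings before a freeze.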
 