SOLVED - FreeNAS hanging every 1-2 weeks with "Error 60 Can't Drain AEN queue after reset" / "Controller reset failed"

DarkSideMilk

Dabbler
Joined
Feb 22, 2019
Messages
13
FreeNAS server info:
Motherboard: Supermicro PDSM4+
RAM: 8 GB
CPU: Intel Xeon X3220 (4 cores)
FreeNAS OS drive: Samsung SATA SSD 860 PRO 256 GB
Storage controller: 3ware 9650SE SATA-II RAID PCIe, 12-drive SATA RAID controller
Storage config: 3x4 (3 pools with 4 drives per pool) RAIDZ2; 12 1.5 TB drives, ~10 TB in the pool
Network: 2 x Intel 82573E Gigabit Ethernet; only 1 currently configured
FreeNAS version: FreeNAS-11.3-U1 (the pool has been upgraded to the latest version with this upgrade)


The Problem:
In the last month or so, probably a little after we upgraded to FreeNAS 11.3 (which may be related; I'm hoping it's not), we started needing to manually reset our FreeNAS server.
We would come in some morning, find that we couldn't mount the server or that some other related function had failed, and go investigate to find the server unresponsive to keystrokes with the following errors on the console screen:
[screenshot of the console errors]

I searched the forums for related errors and haven't found anything. Hoping that someone can help.
It appears that something is going wrong during the verify of the 3ware controller drives. I'm hoping there's something I can do to fix this and that the controller isn't just dying.

We're not entirely sure when this happens or exactly how often, but we've had to manually reset once every 1-2 weeks for the last month. We updated to 11.3-U1 on 2/28 hoping the issue would be fixed in the update, but sadly it returned about a week and a half later.

Thanks in advance!
-JJ
 

hervon

Patron
Joined
Apr 23, 2012
Messages
353
I searched the forum and got that. This might be relevant :

 

DarkSideMilk

Dabbler
Joined
Feb 22, 2019
Messages
13
I searched the forum and got that. This might be relevant :


I appreciate the reply. I read through that, and it basically just calls the controller we have crap. In a perfect world I would love to migrate to some fancy new hardware, but this is what we have, and it's been working fine with FreeNAS for years. We have everything set up on the 3ware side as single disks, so all RAID operations are handled by FreeNAS with ZFS. I'm hoping to find a solution that doesn't require purchasing new hardware and migrating almost 8 TB of data, but I'll keep that in mind.
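For anyone wanting the same layout, the 3ware CLI (tw_cli) can export each drive as its own single-disk unit so ZFS sees plain disks. A sketch, assuming the controller is /c0 and tw_cli is installed; the port number is illustrative, and `add` will destroy any existing unit on that port:

```shell
# Export the physical drive on port 0 as a standalone unit on controller 0,
# so ZFS handles all redundancy. Repeat per port.
tw_cli /c0 add type=single disk=0

# Confirm the resulting unit layout.
tw_cli /c0 show
```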
 

DarkSideMilk

Dabbler
Joined
Feb 22, 2019
Messages
13
I also just noticed that I had autotune enabled. I have disabled it in case it was related; if someone thinks it was, I can post the settings autotune added.

However, I just noticed something else: I used to get a daily email with output similar to what is frozen on the console.

It looked like this:

Code:
Checking status of 3ware RAID controllers:

Controller c0:
--- /var/log/3ware_raid_c0.yesterday    2020-02-01 03:01:55.253183982 -0700
+++ /var/log/3ware_raid_c0.today    2020-02-02 03:01:49.800767341 -0700
@@ -1,31 +1,31 @@
 
 Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
 ------------------------------------------------------------------------------
 u0    SINGLE    OK             -       -       -       1862.63   RiW    OFF   
 u1    SINGLE    OK             -       -       -       1862.63   RiW    OFF   
 u2    SINGLE    OK             -       -       -       1862.63   RiW    OFF   
 u3    SINGLE    OK             -       -       -       1862.63   RiW    ON     
 u4    SINGLE    OK             -       -       -       1862.63   RiW    ON     
 u5    SINGLE    OK             -       -       -       1862.63   RiW    ON     
-u6    SINGLE    VERIFYING      -       92%     -       1862.63   RiW    ON     
-u7    SINGLE    VERIFYING      -       78%     -       1862.63   RiW    ON     
-u8    SINGLE    VERIFYING      -       37%     -       1862.63   RiW    ON     
-u9    SINGLE    VERIFYING      -       0%      -       1862.63   RiW    ON     
-u10   SINGLE    VERIFY-PAUSED  -       0%      -       1862.63   RiW    ON     
-u11   SINGLE    VERIFY-PAUSED  -       0%      -       1862.63   RiW    ON     
+u6    SINGLE    OK             -       -       -       1862.63   RiW    ON     
+u7    SINGLE    OK             -       -       -       1862.63   RiW    ON     
+u8    SINGLE    OK             -       -       -       1862.63   RiW    ON     
+u9    SINGLE    OK             -       -       -       1862.63   RiW    ON     
+u10   SINGLE    OK             -       -       -       1862.63   RiW    ON     
+u11   SINGLE    OK             -       -       -       1862.63   RiW    ON     
 
 VPort Status         Unit Size      Type  Phy Encl-Slot    Model
 ------------------------------------------------------------------------------
 p0    OK             u0   1.82 TB   SATA  0   -            ST2000VN004-2E4164 
-p1    VERIFYING      u7   1.82 TB   SATA  1   -            ST2000DM006-2DM164 
+p1    OK             u7   1.82 TB   SATA  1   -            ST2000DM006-2DM164 
 p2    OK             u2   1.82 TB   SATA  2   -            ST2000DM006-2DM164 
-p3    VERIFYING      u9   1.82 TB   SATA  3   -            ST2000DM006-2DM164 
+p3    OK             u9   1.82 TB   SATA  3   -            ST2000DM006-2DM164 
 p4    OK             u3   1.82 TB   SATA  4   -            ST2000DM006-2DM164 
 p5    OK             u4   1.82 TB   SATA  5   -            ST2000VN004-2E4164 
 p6    OK             u1   1.82 TB   SATA  6   -            ST2000VN004-2E4164 
-p7    VERIFYING      u10  1.82 TB   SATA  7   -            ST2000DM006-2DM164 
-p8    VERIFYING      u6   1.82 TB   SATA  8   -            ST2000DM006-2DM164 
-p9    VERIFYING      u8   1.82 TB   SATA  9   -            ST2000DM006-2DM164 
-p10   VERIFYING      u11  1.82 TB   SATA  10  -            ST2000DM006-2DM164 
+p7    OK             u10  1.82 TB   SATA  7   -            ST2000DM006-2DM164 
+p8    OK             u6   1.82 TB   SATA  8   -            ST2000DM006-2DM164 
+p9    OK             u8   1.82 TB   SATA  9   -            ST2000DM006-2DM164 
+p10   OK             u11  1.82 TB   SATA  10  -            ST2000DM006-2DM164 
 p11   OK             u5   1.82 TB   SATA  11  -            ST2000DM006-2DM164 
 
Alarms (most recent first):
+++ /var/log/3ware_raid_alarms.today    2020-02-02 03:01:49.912970462 -0700
@@ -190,4 +190,13 @@
 c0   [Sat Feb 01 2020 06:37:14]  INFO      Verify started: unit=7
 c0   [Sat Feb 01 2020 06:38:24]  INFO      Verify completed: unit=5
 c0   [Sat Feb 01 2020 06:38:25]  INFO      Verify started: unit=8
+c0   [Sat Feb 01 2020 10:30:24]  INFO      Verify completed: unit=6
+c0   [Sat Feb 01 2020 10:30:25]  INFO      Verify started: unit=9
+c0   [Sat Feb 01 2020 11:19:24]  INFO      Verify completed: unit=7
+c0   [Sat Feb 01 2020 11:19:25]  INFO      Verify started: unit=10
+c0   [Sat Feb 01 2020 13:18:56]  INFO      Verify completed: unit=8
+c0   [Sat Feb 01 2020 13:18:57]  INFO      Verify started: unit=11
+c0   [Sat Feb 01 2020 15:26:20]  INFO      Verify completed: unit=9
+c0   [Sat Feb 01 2020 17:01:48]  INFO      Verify completed: unit=10
+c0   [Sat Feb 01 2020 18:39:58]  INFO      Verify completed: unit=11
 

-- End of daily output --


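For anyone digging into the same report by hand: that table comes from the 3ware CLI (`tw_cli /c0 show`). A small sketch that filters saved output down to units still verifying — the sample lines and the awk filter are mine, not from the report above:

```shell
# Sample unit lines in the same format as the daily report
# (illustrative; real data comes from `tw_cli /c0 show`).
cat > /tmp/3ware_units.txt <<'EOF'
u5    SINGLE    OK             -       -       -       1862.63   RiW    ON
u6    SINGLE    VERIFYING      -       92%     -       1862.63   RiW    ON
u7    SINGLE    VERIFY-PAUSED  -       0%      -       1862.63   RiW    ON
EOF

# Print any unit whose Status column is VERIFYING or VERIFY-PAUSED.
awk '$3 ~ /^VERIFY/ {print $1, $3}' /tmp/3ware_units.txt
```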
A day after that email I got a notification of the available upgrade, which I'm pretty sure I applied right around then after reviewing the release notes. Since the upgrade to 11.3 I haven't gotten this daily verify output email. I just re-reviewed the release notes at https://www.ixsystems.com/blog/library/freenas-11-3-release/ and found this snippet:

The Alert system has been improved:
  • Support for one-shot critical alerts has been added. These alerts remain active until dismissed by the user.
  • Alert Settings has been reorganized: alerts are grouped functionally rather than alphabetically and per-alert severity and alert thresholds are configurable.
  • Periodic alert scripts have been replaced by the Alert framework. Periodic alert emails are disabled by default and previous email alert conditions have been added to the FreeNAS alert system. E-mail or other alert methods can be configured in Alert Services.

So it seems there were indeed changes to the way this report (the one that stopped showing up in my email) is generated. Looking through the new alert settings, I'm not sure which alert this report pertains to, so I can't yet adjust the settings to make the verify either run less frequently or run the way it used to, so it doesn't freeze up the server when it errors out. Any guidance or further thoughts would be appreciated.
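One note in case it helps anyone tuning the same controller: the verify itself is scheduled by the 3ware card (the AVrfy column in the table above), not by FreeNAS, so it can be adjusted from tw_cli regardless of the alert settings. A sketch; the unit number is illustrative:

```shell
# Show whether auto-verify is enabled for unit 6.
tw_cli /c0/u6 show autoverify

# Turn auto-verify off for that unit, so the controller stops scheduling
# its own surface scans (ZFS scrubs still check the data).
tw_cli /c0/u6 set autoverify=off
```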
 

DarkSideMilk

Dabbler
Joined
Feb 22, 2019
Messages
13
So far, having disabled the new SNMP alert method and reconfigured the email alert system with an email address (my address seems to have been cleared by the upgrade), the problem may have subsided, or this may have fixed part of it. I'm still not getting regular notification of the drives being verified. Over this last weekend there was an 'unscheduled' reboot, so it may have hung and then managed to reboot itself, or there may have been a power outage long enough for the battery backup to die. I now need to wait another 2 weeks to see if it freezes up again. But I'm thinking the hanging on verify is either related to the new alert system (more likely) or to the autotune settings that are now disabled.
 

DarkSideMilk

Dabbler
Joined
Feb 22, 2019
Messages
13
[screenshot of the new console error]

After a good while of it working, it froze on us again. The error message is similar, but a bit different this time around.
 


DarkSideMilk

Dabbler
Joined
Feb 22, 2019
Messages
13
If I am reading this error correctly, the timeout and freeze happen after trying to read disk temps. So, per the documentation, I am going to try enabling a standby timer on each drive, thus disabling temp monitoring, and see if that stops the freezing.
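For context, disk temperatures behind a 3ware controller are typically read through smartctl's 3ware passthrough, so this is roughly the kind of read that appears to be timing out. A sketch; the device path and port number are illustrative and it needs smartmontools:

```shell
# Query SMART data for the drive on 3ware port 0 via the twa0 controller
# device on FreeBSD, then pull out the temperature attribute.
smartctl -a -d 3ware,0 /dev/twa0 | grep -i temperature
```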
 

DarkSideMilk

Dabbler
Joined
Feb 22, 2019
Messages
13
Due to Covid-19 I was away from this for quite some time. However, whatever I last did seems to have fixed the issue, as it didn't freeze again for a good 2 months.
I wish I had written it down here, because I can't remember for sure now.
It looks like I removed the SNMP alert service that was added and removed any and all remnants of the old alert system.
I also removed and recreated my scrub schedules.
I also removed and reset the email alert configuration, setting it to critical alerts only, and added Slack notifications for all normal info, which is working much better than email.
I believe I also disabled temp monitoring on the disks by setting the critical, difference, and informational values of each disk to 0. I was going to try standby, but read other posts about that hurting disk life, so I decided against it.

So either something in the new alert system was causing an issue on my instance, or something in the 11.3-U1 update broke temp monitoring for old 3ware controllers.

I'm now going to daringly update to 11.3-U3 and hope it stays fixed. Marking as solved for now.
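If anyone wants to script the same threshold change instead of clicking through the GUI, the middleware client may work. A sketch only: the disk identifier is illustrative, and the field names are assumed from the 11.3 `disk.update` schema:

```shell
# List disks with their current temperature alert thresholds.
midclt call disk.query '[]' '{"select": ["identifier", "critical", "difference", "informational"]}'

# Zero the thresholds for one disk, disabling its temperature alerts.
# (Illustrative identifier; replace with one returned by disk.query.)
midclt call disk.update '{serial}XXXXXXXX' '{"critical": 0, "difference": 0, "informational": 0}'
```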
 