Error report more than two weeks late; alert indicator still green

Status
Not open for further replies.

Z300M

Guru
Joined
Sep 9, 2011
Messages
882
FreeNAS-9.2.1.2-RELEASE-x64 (002022c)
Six 2TB drives currently powered up; RAID-Z2
Everything else as in my sig.

This morning (Monday) I found an email, time-stamped 03:01 am today, notifying me of an error followed by resilvering on Saturday, March 1:

Code:
Checking status of zfs pools:
NAME    SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH    ALTROOT
Pool1  10.9T  8.43T  2.45T  77%  1.00x  DEGRADED  /mnt
 
  pool: Pool1
state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
    Sufficient replicas exist for the pool to continue functioning in a
    degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
    repaired.
  scan: resilvered 11.0M in 0h0m with 0 errors on Sat Mar  1 14:57:37 2014
config:
 
    NAME                                            STATE     READ WRITE CKSUM
    Pool1                                           DEGRADED     0     0     0
      raidz2-0                                      DEGRADED     0     0     0
        gptid/7bc9fb7b-14c5-11e3-b6d4-001b21c4dc34  ONLINE       0     0     0
        gptid/7c972217-14c5-11e3-b6d4-001b21c4dc34  ONLINE       0     0     0
        gptid/7d5d5566-14c5-11e3-b6d4-001b21c4dc34  FAULTED      1   147     0  too many errors
        gptid/7e979187-14c5-11e3-b6d4-001b21c4dc34  ONLINE       0     0     0
        gptid/7f8bd18b-14c5-11e3-b6d4-001b21c4dc34  ONLINE       0     0     0
        gptid/80a62fee-14c5-11e3-b6d4-001b21c4dc34  ONLINE       0     0     0
 
errors: No known data errors
 
I had been checking the GUI from time to time and saw that the Alert indicator was still green. It is still green now, so it is obviously not giving reliable information.
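
For cross-checking the GUI indicator against what `zpool status` itself reports, here is a minimal sketch that scans the `config:` section of captured output and lists every row that is not ONLINE or has a non-zero error counter. The function name `unhealthy_devices` and the sample constant are made up for illustration, and the sample text is an excerpt of the output above; this is not how the FreeNAS alert system works internally.

```python
def unhealthy_devices(status_text):
    """Return (name, state, read, write, cksum, note) for suspicious config rows."""
    problems = []
    in_config = False
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("config:"):
            in_config = True
            continue
        if stripped.startswith("errors:"):
            in_config = False
        fields = stripped.split()
        # Device rows have at least NAME STATE READ WRITE CKSUM.
        if not in_config or len(fields) < 5 or fields[0] == "NAME":
            continue
        name, state = fields[0], fields[1]
        read, write, cksum = fields[2:5]
        note = " ".join(fields[5:])
        # Counters are compared as strings so abbreviated values such as
        # "1.2K" still count as non-zero.
        if state != "ONLINE" or any(c != "0" for c in (read, write, cksum)):
            problems.append((name, state, read, write, cksum, note))
    return problems


# Excerpt of the emailed report quoted above.
SAMPLE_STATUS = """\
  pool: Pool1
 state: DEGRADED
config:

    NAME                                            STATE     READ WRITE CKSUM
    Pool1                                           DEGRADED     0     0     0
      raidz2-0                                      DEGRADED     0     0     0
        gptid/7c972217-14c5-11e3-b6d4-001b21c4dc34  ONLINE       0     0     0
        gptid/7d5d5566-14c5-11e3-b6d4-001b21c4dc34  FAULTED      1   147     0  too many errors

errors: No known data errors
"""

for row in unhealthy_devices(SAMPLE_STATUS):
    print(row)
```

On a live system the text would come from running `zpool status` (for example via `subprocess.run(["zpool", "status"], capture_output=True, text=True).stdout`) instead of the hard-coded sample.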

Why did the error message not get emailed immediately? And why did other daily run output messages since March 1 report only backups of mail aliases?

And would someone please remind me how to find out which daN device corresponds to each gptid designation?
 

ser_rhaegar

Patron
Joined
Feb 2, 2014
Messages
358
I believe this will give you a map of gptids to daN devices:
Code:
glabel status
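
If the mapping needs to be looked up repeatedly, a small sketch like the following can turn captured `glabel status` output into a dictionary. It assumes the usual three-column layout (Name, Status, Components); the helper names `parse_glabel_status` and `disk_of` are hypothetical, and the daN numbers in the sample are invented, not taken from this system.

```python
import re


def parse_glabel_status(text):
    """Map each label (e.g. 'gptid/...') to the component it sits on."""
    mapping = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly three columns; skip the header row.
        if len(fields) != 3 or fields[0] == "Name":
            continue
        label, _status, component = fields
        mapping[label] = component
    return mapping


def disk_of(component):
    """Strip a trailing GPT partition suffix: 'da2p2' -> 'da2'."""
    return re.sub(r"p\d+$", "", component)


# Hypothetical sample: the gptids are from this thread, the daN values are made up.
SAMPLE_GLABEL = """\
                                      Name  Status  Components
gptid/7bc9fb7b-14c5-11e3-b6d4-001b21c4dc34     N/A  da0p2
gptid/7d5d5566-14c5-11e3-b6d4-001b21c4dc34     N/A  da2p2
"""

labels = parse_glabel_status(SAMPLE_GLABEL)
faulted = "gptid/7d5d5566-14c5-11e3-b6d4-001b21c4dc34"
print(faulted, "->", disk_of(labels[faulted]))
```

The `disk_of` helper only handles GPT-style `pN` suffixes; other partitioning schemes would need a different pattern.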


If you don't completely refresh the GUI page, and just have it left open for days on end, the indicator light doesn't change. Or at least that has been my experience.

Regarding the email, is it possible the email from March 1st went to your junk folder or something similar? Have you checked the logs?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
So you need to think about this as 2 different events...

Likely the resilver that occurred was because a drive started failing, disconnected, and reconnected. The system simply fixed the problem by resilvering the drive for you automatically (no problem there).

Now, the "too many errors" and faulted condition likely happened in the last 24 hours, triggering the email.

I don't think it's appropriate to see the resilver and the FAULTED drive as the same event; they were likely 2 separate events.

Now, you are probably wondering why you didn't get any messages back on March 1st. For that I'd have to ask: when does your system run SMART short tests and long tests, and is SMART monitoring even set up? My guess is the answer is probably something like "no" or "never".
 

Z300M

Guru
Joined
Sep 9, 2011
Messages
882
Thanks. The system is shut down at present, but I'll try that command when it's back up.

I've refreshed the GUI page several times.

Edit: And clicking on the Alert indicator showed the pool as Healthy.
Nothing goes to my Junk folder automatically. Everything comes to Thunderbird, where some things get marked as potentially Junk, but they don't go to the Junk folder until I confirm that they are Junk.
 

Z300M

Guru
Joined
Sep 9, 2011
Messages
882
SMART is set up to check every 720 minutes (12 hours). I don't see an option to select whether it does short or long tests.
 

ser_rhaegar

Patron
Joined
Feb 2, 2014
Messages
358
When you set up the test you have to select the type: short, long, conveyance, or offline.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
It should be checking every 30 minutes. That's the whole point of monitoring. The default (30 minutes) is a pretty good setting; 12 hours is absurdly long for monitoring.
 

Z300M

Guru
Joined
Sep 9, 2011
Messages
882
OK. I've reset it to 30 minutes, but I can't see anything about selecting short, long, conveyance, or offline, as ser_rhaegar suggested -- only:

Check interval, Power mode, Difference, Informational, Critical, Email to report.
 

Z300M

Guru
Joined
Sep 9, 2011
Messages
882
OK, I discovered how to set up SMART tests of different kinds. Now I have a Short test every 60 minutes and a Long test at 2am every day.

Yesterday when I clicked the green Alert button to check that everything was OK, it told me that the pool was DEGRADED, with the same number of write errors on the same drive that I remembered from before. A manual Long SMART test revealed no drive errors. I rebooted and did a scrub, and the status was no longer DEGRADED. Earlier today I updated to 9.2.1.3-RELEASE. Now the Alert button is flashing yellow, and clicking it gives me:
Code:
WARNING: The volume Pool1 (ZFS) status is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'.


Executing
Code:
zpool status


gives me:
Code:
[root@freenas ~]# zpool status
  pool: Pool1
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 0h0m with 0 errors on Fri Mar 21 22:42:01 2014
config:

        NAME                                            STATE     READ WRITE CKSUM
        Pool1                                           ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/7bc9fb7b-14c5-11e3-b6d4-001b21c4dc34  ONLINE       0     0     0
            gptid/7c972217-14c5-11e3-b6d4-001b21c4dc34  ONLINE       0     0     0
            gptid/7d5d5566-14c5-11e3-b6d4-001b21c4dc34  ONLINE       0     0     2
            gptid/7e979187-14c5-11e3-b6d4-001b21c4dc34  ONLINE       0     0     0
            gptid/7f8bd18b-14c5-11e3-b6d4-001b21c4dc34  ONLINE       0     0     0
            gptid/80a62fee-14c5-11e3-b6d4-001b21c4dc34  ONLINE       0     0     0

errors: No known data errors
[root@freenas ~]#



But
Code:
smartctl -a /dev/daN


shows no errors for any drive.
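
To make "no errors" more concrete, one option is to pull a few raw attribute values out of the attribute table that `smartctl -a` prints for ATA drives. The sketch below does that for three attributes often looked at first. The function name `watched_attributes`, the choice of attributes, and the sample table are all assumptions for illustration; the sample values are invented and are not the output from these drives.

```python
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "UDMA_CRC_Error_Count",
)


def watched_attributes(smartctl_text, watched=WATCHED):
    """Return {attribute_name: raw_value} for the watched SMART attributes."""
    found = {}
    for line in smartctl_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have the raw value
        # in the tenth column.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in watched:
            found[fields[1]] = int(fields[9])
    return found


# Hypothetical sample in the layout smartctl uses for ATA attribute tables.
SAMPLE_SMART = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       4321
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""

print(watched_attributes(SAMPLE_SMART))
```

Raw values that are not plain integers would make `int()` fail, so a more defensive version would need to handle vendor-specific formats.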

And notice that the zpool status command reports that the previous scrub repaired 0 errors in 0 time.
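
For comparing scan results over time, the `scan:` line can be broken into its parts. This is a rough sketch that assumes the line format seen in the two reports in this thread (`resilvered ...` or `scrub repaired ...`, a duration in `NhNm` form, an error count, and a timestamp); `parse_scan_line` is a made-up name and other ZFS versions may word the line differently.

```python
import re
from datetime import datetime

SCAN_RE = re.compile(
    r"scan:\s+(?P<action>resilvered|scrub repaired)\s+(?P<amount>\S+)\s+"
    r"in\s+(?P<hours>\d+)h(?P<mins>\d+)m\s+"
    r"with\s+(?P<errors>\d+)\s+errors\s+on\s+(?P<when>.+)$"
)


def parse_scan_line(line):
    """Return a dict describing the last scan, or None if the line does not match."""
    match = SCAN_RE.search(line.rstrip())
    if match is None:
        return None
    # Collapse the double space zpool prints before single-digit days.
    when = " ".join(match.group("when").split())
    return {
        "action": match.group("action"),
        "amount": match.group("amount"),
        "minutes": int(match.group("hours")) * 60 + int(match.group("mins")),
        "errors": int(match.group("errors")),
        "finished": datetime.strptime(when, "%a %b %d %H:%M:%S %Y"),
    }


# The two scan lines quoted in this thread.
for text in (
    "  scan: resilvered 11.0M in 0h0m with 0 errors on Sat Mar  1 14:57:37 2014",
    "  scan: scrub repaired 0 in 0h0m with 0 errors on Fri Mar 21 22:42:01 2014",
):
    print(parse_scan_line(text))
```

A result with `minutes` equal to 0 for a scrub stands out immediately this way, which is the oddity noted above.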

What am I supposed to make of all this?
 