Error report more than two weeks late; alert indicator still green

Z300M · Mar 17, 2014

FreeNAS-9.2.1.2-RELEASE-x64 (002022c)
Six 2TB drives currently powered up; RAID-Z2
Everything else as in my sig.

This morning (Monday) I found an email, timestamped 03:01 am today, notifying me of an error followed by a resilver on Saturday, March 1:

Code:

Checking status of zfs pools:
NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH    ALTROOT
Pool1  10.9T  8.43T  2.45T    77%  1.00x  DEGRADED  /mnt

  pool: Pool1
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: resilvered 11.0M in 0h0m with 0 errors on Sat Mar  1 14:57:37 2014
config:

        NAME                                            STATE     READ WRITE CKSUM
        Pool1                                           DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/7bc9fb7b-14c5-11e3-b6d4-001b21c4dc34  ONLINE       0     0     0
            gptid/7c972217-14c5-11e3-b6d4-001b21c4dc34  ONLINE       0     0     0
            gptid/7d5d5566-14c5-11e3-b6d4-001b21c4dc34  FAULTED      1   147     0  too many errors
            gptid/7e979187-14c5-11e3-b6d4-001b21c4dc34  ONLINE       0     0     0
            gptid/7f8bd18b-14c5-11e3-b6d4-001b21c4dc34  ONLINE       0     0     0
            gptid/80a62fee-14c5-11e3-b6d4-001b21c4dc34  ONLINE       0     0     0

errors: No known data errors
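
With six members in the pool, the one with non-zero counters can be picked out mechanically. Below is a minimal sketch, assuming the members are all listed as gptid/ labels as in the report above; the function name zpool_bad_members is made up for illustration.

```shell
# List pool members whose READ, WRITE or CKSUM counter is non-zero.
# Reads `zpool status` text on stdin. Only gptid/ members are matched,
# which is an assumption that fits the pool shown in this report.
zpool_bad_members() {
    awk '$1 ~ /^gptid\// && ($3 + $4 + $5) > 0 {
        printf "%s state=%s read=%s write=%s cksum=%s\n", $1, $2, $3, $4, $5
    }'
}

# Two member lines copied from the report.
sample='gptid/7c972217-14c5-11e3-b6d4-001b21c4dc34  ONLINE      0    0    0
gptid/7d5d5566-14c5-11e3-b6d4-001b21c4dc34  FAULTED      1  147    0  too many errors'

printf '%s\n' "$sample" | zpool_bad_members
```

On a live system the same filter would be fed with `zpool status Pool1 | zpool_bad_members`.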

I had been checking the GUI from time to time and the Alert indicator was always green. It still is, so it is obviously not giving reliable information.

Why did the error message not get emailed immediately? And why did other daily run output messages since March 1 report only backups of mail aliases?

And would someone please remind me how to find out which daN device corresponds to a given gptid?

ser_rhaegar · Mar 17, 2014

I believe this will give you a map of gptids to daN devices:

Code:

glabel status
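
To look up a single label rather than scan the whole table by eye, the output can be filtered. This is only a sketch: it assumes the usual three-column `glabel status` layout (Name, Status, Components), the helper name gptid_to_dev is made up, and the daN assignments in the sample are invented, not taken from this system.

```shell
# Print the component (for example da2p2) that backs a given gptid label.
# Reads `glabel status` text on stdin; the label is the first argument.
gptid_to_dev() {
    awk -v label="$1" '$1 == label { print $3 }'
}

# Hypothetical sample output; the daN assignments here are invented.
sample='                                      Name  Status  Components
gptid/7bc9fb7b-14c5-11e3-b6d4-001b21c4dc34     N/A  da0p2
gptid/7d5d5566-14c5-11e3-b6d4-001b21c4dc34     N/A  da2p2'

printf '%s\n' "$sample" | gptid_to_dev gptid/7d5d5566-14c5-11e3-b6d4-001b21c4dc34
```

On the NAS itself this would be `glabel status | gptid_to_dev gptid/<label>`. The component is a partition name, so the disk to hand to smartctl is the part before the `p`.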

If you don't completely refresh the GUI page, and just have it left open for days on end, the indicator light doesn't change. Or at least that has been my experience.

Regarding the email, is it possible the email from March 1st went to your junk folder or something similar? Have you checked the logs?

cyberjock · Mar 17, 2014

So you need to think about this as 2 different events...

The resilver likely occurred because a drive started failing, disconnected, and then reconnected. The system simply fixed the problem by resilvering the drive for you automatically (no problem there).

Now, the "too many errors" and faulted condition likely happened in the last 24 hours, triggering the email.

I don't think it's appropriate to treat the resilver and the FAULTED drive as the same event; they were likely 2 separate events.

Now, you are probably wondering why you didn't get any messages back on March 1st. For that I'd have to ask: "When does your system run SMART short tests and long tests, and is SMART monitoring even set up?" My guess is the answer is probably something like "no" or "never".

Z300M · Mar 17, 2014

ser_rhaegar said:
I believe this will give you a map of gptids to daN devices:

Code:
glabel status

Thanks. The system is shut down at present, but I'll try that command when it's back up.

ser_rhaegar said:
If you don't completely refresh the GUI page, and just have it left open for days on end, the indicator light doesn't change. Or at least that has been my experience.

I've refreshed the GUI page several times.

Edit: And clicking on the Alert indicator showed the pool as Healthy.

ser_rhaegar said:
Regarding the email, is it possible the email from March 1st went to your junk folder or something similar? Have you checked the logs?

Nothing goes to my Junk folder automatically. Everything comes to Thunderbird, where some things get marked as potentially Junk, but they don't go to the Junk folder until I confirm that they are Junk.

Z300M · Mar 17, 2014

cyberjock said:
So you need to think about this as 2 different events...

The resilver likely occurred because a drive started failing, disconnected, and then reconnected. The system simply fixed the problem by resilvering the drive for you automatically (no problem there).

Now, the "too many errors" and faulted condition likely happened in the last 24 hours, triggering the email.

I don't think it's appropriate to treat the resilver and the FAULTED drive as the same event; they were likely 2 separate events.

Now, you are probably wondering why you didn't get any messages back on March 1st. For that I'd have to ask: "When does your system run SMART short tests and long tests, and is SMART monitoring even set up?" My guess is the answer is probably something like "no" or "never".

SMART is set up to check every 720 minutes (12 hours). I don't see an option to select whether it does short or long tests.

ser_rhaegar · Mar 17, 2014

When you set up the test you have to select the type: short, long, conveyance, or offline.

cyberjock · Mar 17, 2014

It should be checking every 30 minutes; that's the whole point of monitoring. The default (30 minutes) is a pretty good setting. 12 hours is absurdly long for monitoring.

Z300M · Mar 17, 2014

cyberjock said:
It should be checking every 30 minutes; that's the whole point of monitoring. The default (30 minutes) is a pretty good setting. 12 hours is absurdly long for monitoring.

OK. I've reset it to 30 minutes, but I can't see anything about selecting short, long, conveyance, or offline, as ser_rhaegar suggested. I see only:

Check interval, Power mode, Difference, Informational, Critical, Email to report.

Z300M · Mar 22, 2014

Z300M said:
OK. I've reset it to 30 minutes, but I can't see anything about selecting short, long, conveyance, or offline, as ser_rhaegar suggested. I see only:

Check interval, Power mode, Difference, Informational, Critical, Email to report.

OK, I discovered how to set up SMART tests of different kinds. Now I have a Short test every 60 minutes and a Long test at 2am every day.
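
For reference, a schedule like this can be written in smartd's own `-s` syntax, where each test is `T/MM/DD/d/HH` (type, month, day of month, day of week, hour) and a dot matches any digit. The sketch below only illustrates that syntax; whether the FreeNAS GUI writes exactly this line into its generated smartd.conf is an assumption, and smartd_schedule is a made-up helper name.

```shell
# Build a smartd.conf "-s" schedule: a Short test every hour and a
# Long test every day at the given hour (two digits, 00-23).
# smartd_schedule is a hypothetical helper, not part of smartmontools.
smartd_schedule() {
    printf '%s\n' "-s (S/../.././..|L/../.././$1)"
}

# Long test at 2am, as described in the post.
smartd_schedule 02
```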

Yesterday when I clicked the green Alert button to check that everything was OK, it told me that the pool was DEGRADED, with the same number of write errors on the same drive that I remembered from before. A manual Long SMART test revealed no drive errors. I rebooted and did a scrub, and the status was no longer DEGRADED. Earlier today I updated to 9.2.1.3-RELEASE. Now the Alert button is flashing yellow, and clicking it gives me:

Code:

WARNING: The volume Pool1 (ZFS) status is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'.

Executing

Code:

zpool status

gives me:

Code:

[root@freenas ~]# zpool status
  pool: Pool1
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 0h0m with 0 errors on Fri Mar 21 22:42:01 2014
config:

        NAME                                            STATE     READ WRITE CKSUM
        Pool1                                           ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/7bc9fb7b-14c5-11e3-b6d4-001b21c4dc34  ONLINE       0     0     0
            gptid/7c972217-14c5-11e3-b6d4-001b21c4dc34  ONLINE       0     0     0
            gptid/7d5d5566-14c5-11e3-b6d4-001b21c4dc34  ONLINE       0     0     2
            gptid/7e979187-14c5-11e3-b6d4-001b21c4dc34  ONLINE       0     0     0
            gptid/7f8bd18b-14c5-11e3-b6d4-001b21c4dc34  ONLINE       0     0     0
            gptid/80a62fee-14c5-11e3-b6d4-001b21c4dc34  ONLINE       0     0     0

errors: No known data errors
[root@freenas ~]#

But

Code:

smartctl -a /dev/daN

shows no errors for any drive.
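
Since `smartctl -a` prints a lot, it can help to pull out a few raw attribute values per drive. The sketch below assumes the standard ATA attribute table layout (raw value in the tenth column) and picks three attributes that are commonly checked first; the attribute names vary by vendor, the choice of attributes is an assumption, and the sample values are invented. A non-zero UDMA_CRC_Error_Count is usually associated with cabling or connection trouble rather than the disk surface, though that is not a diagnosis of this system.

```shell
# Print NAME=RAW for a few SMART attributes that are often checked first.
# Reads `smartctl -A` (or `smartctl -a`) text on stdin.
smart_key_attrs() {
    awk '$2 == "Reallocated_Sector_Ct" ||
         $2 == "Current_Pending_Sector" ||
         $2 == "UDMA_CRC_Error_Count" { print $2 "=" $10 }'
}

# Hypothetical attribute rows; the values are invented for illustration.
sample='  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4321
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       3'

printf '%s\n' "$sample" | smart_key_attrs
```

On the NAS this would be run once per disk, for example `smartctl -A /dev/da2 | smart_key_attrs`.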

And notice that the zpool status command reports that the previous scrub repaired 0 errors in 0 time.

What am I supposed to make of all this?
