Why do "Offline Uncorrectable" sectors not trigger an alert?

Status
Not open for further replies.

ixion

Dabbler
Joined
Dec 22, 2011
Messages
30
I have SMART enabled on all drives. I do a SMART short self-test nightly and a long self-test every two weeks.

When "Offline Uncorrectable" or "Current Pending Sector" counts are not zero, I never receive an email alert and the SMART self-test reports "Completed without error". Also, the drive reports: "the SMART overall-health self-assessment test result: PASSED".

Why is that? Shouldn't these types of failures trigger an alert and an email?

This has happened several times now. Each time, I discovered it by chance because I happened to check via an SSH terminal. I would like FreeNAS to alert me when these errors occur. Yes, I could write a script to do this, but I'm wondering why this is not the default behavior.

Thanks.
Note: yes, my email is set up properly, I receive security reports, etc.
Running FreeNAS 9.2.1.5
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
You shouldn't get an 'alert' if by alert you mean the stoplight turns red. Either of those being non-zero isn't an immediate jump-to-action situation.

You should be getting email if you have SMART monitoring set up, your hardware supports SMART, and email is configured. You gave zero information on your hardware or FreeNAS version, so I'll wait for you to provide that information before going further.
 

ixion

Dabbler
Joined
Dec 22, 2011
Messages
30
Actually I had edited my post right away when I realized I forgot to put my version: 9.2.1.5

Yes, I have full SMART support, smartctl works fine on all drives and SMART monitoring is turned on, emails are confirmed to work.

That's my point, these types of errors never get reported by email. Is this a bug in FreeNAS? This has happened to me on two different FreeNAS boxes (both running 9.2.x).
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Can you post the output of smartctl -a /dev/(somedisk). Please include the command line you use.

Also if you go to the SMART settings, is the email address field completed?

I will tell you that it definitely emails for both of those. There's probably 50 threads that could vouch that it works. The question is why it doesn't work for you. ;)

Keep in mind that you get an email when those numbers change, not just when they are non-zero. If it goes from 0 to 10, you'll get an email when it changes; if it's sitting at 100, it'll only email you again when it goes to 101.
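
To make the "email on change, not on non-zero" behavior described above concrete, here is a minimal sketch. It is only an illustration of that description, not smartd's actual code; the function name `changed_counters` and the assumption that an unseen counter starts at 0 are hypothetical.

```python
def changed_counters(previous, current):
    """Return {name: (old, new)} for every counter whose raw value differs."""
    changes = {}
    for name, new_value in current.items():
        # Assumption for this sketch: a counter never seen before started at 0.
        old_value = previous.get(name, 0)
        if new_value != old_value:
            changes[name] = (old_value, new_value)
    return changes


# The count moved from 0 to 10 between polls, so a notice would be due.
print(changed_counters({"Offline_Uncorrectable": 0}, {"Offline_Uncorrectable": 10}))

# The count is non-zero but unchanged, so nothing is reported.
print(changed_counters({"Offline_Uncorrectable": 10}, {"Offline_Uncorrectable": 10}))
```

Under this model, a drive that already had bad sectors when monitoring started would stay silent until the count moves again.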
 

ixion

Dabbler
Joined
Dec 22, 2011
Messages
30
Here's the output. Note that I already RMAed the failing drive, so this one has no errors showing.

Yes, the email field is filled in and is correct. I use a Gmail address. Here's a question: does it use the same SMTP authentication that I configured for root user? Or does SMART use something different?


Code:
[root@freenas] ~# smartctl -q noserial -a /dev/ada1
smartctl 6.2 2013-07-26 r3841 [FreeBSD 9.2-RELEASE-p3 amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
 
=== START OF INFORMATION SECTION ===
Model Family:    Seagate Barracuda Green (AF)
Device Model:    ST2000DL003-9VT166
Firmware Version: CC3C
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:    512 bytes logical, 4096 bytes physical
Rotation Rate:    5900 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:  ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Wed May 28 16:59:10 2014 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
 
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
 
General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (  0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  612) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (  1) minutes.
Extended self-test routine
recommended polling time:        ( 341) minutes.
Conveyance self-test routine
recommended polling time:        (  2) minutes.
SCT capabilities:              (0x30b7) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.
 
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   119   099   006    Pre-fail  Always       -       1923384
  3 Spin_Up_Time            0x0003   095   095   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       6
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   069   060   030    Pre-fail  Always       -       9768740
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       480
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       6
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       3
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   063   057   045    Old_age   Always       -       37 (Min/Max 28/43)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       3
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       6
194 Temperature_Celsius     0x0022   037   043   000    Old_age   Always       -       37 (0 23 0 0 0)
195 Hardware_ECC_Recovered  0x001a   036   006   000    Old_age   Always       -       1923384
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       264243567919584
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       3391048428
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1668432684
 
SMART Error Log Version: 1
No Errors Logged
 
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline      Completed without error      00%      466        -
# 2  Short offline      Completed without error      00%      442        -
# 3  Short offline      Completed without error      00%      419        -
# 4  Short offline      Completed without error      00%      395        -
# 5  Short offline      Completed without error      00%      371        -
# 6  Short offline      Completed without error      00%      347        -
# 7  Short offline      Completed without error      00%      323        -
# 8  Short offline      Completed without error      00%      299        -
# 9  Short offline      Completed without error      00%      275        -
#10  Short offline      Completed without error      00%      251        -
#11  Short offline      Completed without error      00%      227        -
#12  Short offline      Completed without error      00%      203        -
#13  Short offline      Completed without error      00%      179        -
#14  Short offline      Completed without error      00%      155        -
#15  Short offline      Completed without error      00%      131        -
#16  Short offline      Completed without error      00%      107        -
#17  Short offline      Aborted by host              10%        91        -
#18  Short offline      Completed without error      00%        58        -
#19  Extended offline    Completed without error      00%        41        -
#20  Short offline      Completed without error      00%        34        -
 
SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
SMART sends to the email address you've specified in the SMART settings. It doesn't email "root". That's why I asked if you set up your SMART emailing.
 

ixion

Dabbler
Joined
Dec 22, 2011
Messages
30
Yes, it's set up correctly in the SMART settings. But it needs to use an SMTP server, user ID and password to send it. And there are no fields to specify this in SMART. So does it just use the built-in "sendmail"? Or does it use the root user's email credentials? That's a valid question.

To test, I just did a test using "mail xxx@gmail.com" as root from the command line, and the email was received correctly.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
It uses the server info you provided in Settings -> Email.
 

ixion

Dabbler
Joined
Dec 22, 2011
Messages
30
Ok, thanks. So that eliminates that possibility, because all of that is working. I can send a test mail from the Settings -> Email menu, I can also send a test mail from command line, and I already receive Security emails nightly.

So there is something else going on. I'm definitely not getting email notices when the Offline Uncorrectable count goes up.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Yeah, I don't have any recommendations at all. Either you've found a bug that nobody else has been able to find, you've got something wonky going on that I can't explain from my position, you've done something wrong but don't realize it, or you don't understand what I'm asking when I ask if you've done A, B, and C.

I got nothing else other than that unfortunately. Sorry.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I was under the impression that smartd would only send mail if the drive's thresholds were exceeded or if the drive self-assessment changed. An error or two on a hard drive is not necessarily a problem, though most frequently it will lead to more problems rapidly appearing...
 

ixion

Dabbler
Joined
Dec 22, 2011
Messages
30
jgreco said:
I was under the impression that smartd would only send mail if the drive's thresholds were exceeded or if the drive self-assessment changed. An error or two on a hard drive is not necessarily a problem, though most frequently it will lead to more problems rapidly appearing...


That certainly appears to be the case. I don't think it's a bug, I think it's just that it ignores increases in Offline Uncorrectable. I guess I'll have to write a script and cron job to get the functionality I want.
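
A rough sketch of what such a script could look like, assuming the attribute table layout shown in the smartctl output earlier in this thread. The names `WATCHED_IDS`, `ROW` and `nonzero_watched` are made up for illustration, and the raw value of 8 in the sample is invented (the real drive above reported 0). Fetching the text (for example from `smartctl -A /dev/ada1`), sending the mail and the cron entry are left out.

```python
import re

# Attribute IDs discussed in this thread.
WATCHED_IDS = {197: "Current_Pending_Sector", 198: "Offline_Uncorrectable"}

# One attribute row: ID, name, flag, VALUE, WORST, THRESH, TYPE, UPDATED,
# WHEN_FAILED, then the leading integer of RAW_VALUE.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def nonzero_watched(smartctl_output):
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    found = {}
    for line in smartctl_output.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header lines and anything else that is not a table row
        attr_id = int(match.group(1))
        raw = int(match.group(3))
        if attr_id in WATCHED_IDS and raw > 0:
            found[match.group(2)] = raw
    return found


# Sample rows in the same layout as above; the 8 is a made-up value.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""

print(nonzero_watched(SAMPLE))
```

This version reports any non-zero count on every run; remembering the previous values between runs would be needed to report only increases.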
 

ixion

Dabbler
Joined
Dec 22, 2011
Messages
30
cyberjock said:
Yeah, I don't have any recommendations at all. Either you've found a bug that nobody else has been able to find, you've got something wonky going on that I can't explain from my position, you've done something wrong but don't realize it, or you don't understand what I'm asking when I ask if you've done A, B, and C.

I got nothing else other than that unfortunately. Sorry.


Thanks for the help, CJ. And yup, I understand. I've been a Unix/Linux developer for 25 years (since the Xenix and SCO days! :smile:).
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Nope. I get emails if it goes from 0 to 1 on 9.2.0 for both offline uncorrectable and pending sectors. I haven't seen it in 9.2.1+ personally since I haven't had a disk fail in my FreeNAS Mini. But we've had plenty of people in the forum with single or double digit failures that had various 9.2.1 versions.

The thresholds are when the disk is officially labeled as 'failed'. Unfortunately, many disk manufacturers are setting the threshold to zero, and since it's set at zero and the worst-case value can only be one, you will never get a "failed disk" warning. smartd solved this problem by monitoring the actual raw values themselves, so you can still RMA drives that are made by increasingly unscrupulous manufacturers.

Here's one of my WD Reds:
Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   173   171   021    Pre-fail  Always       -       4325
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       31
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       803
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       31
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       15
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       156
194 Temperature_Celsius     0x0022   118   094   000    Old_age   Always       -       29
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0


Notice that only 3 rows have a non-zero THRESH. To get a failure from the disk, VALUE has to fall to THRESH or below (VALUE ≤ THRESH). Well, since VALUE must be a non-zero positive integer, it can never get down to zero. Hence, for the rows with a THRESH of 000, you will never get a failure indicator no matter how hard you try. ;)

Ain't life great?
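
The threshold argument can be checked with a few lines. This is a sketch under two assumptions: that an attribute counts as failed when VALUE ≤ THRESH, and that the normalized VALUE never drops below 1, as the post states. The function name `attribute_failed` is hypothetical.

```python
def attribute_failed(value, thresh):
    """Assumed rule for this sketch: failed when normalized VALUE <= THRESH."""
    return value <= thresh


# Assumption taken from the post: VALUE is a positive integer, so 1 is the floor.
LOWEST_POSSIBLE_VALUE = 1

# Offline_Uncorrectable on the WD Red above has THRESH 000: it can never fail.
print(attribute_failed(LOWEST_POSSIBLE_VALUE, 0))  # False

# Reallocated_Sector_Ct has THRESH 140 and currently VALUE 200: healthy.
print(attribute_failed(200, 140))  # False

# The same attribute would only be flagged once VALUE sinks to 140.
print(attribute_failed(140, 140))  # True
```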
 