SOLVED Why didn't I get a warning on this?

Status
Not open for further replies.

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
So last night I had an apparent glitch in my server--one of my drives dropped offline, then came back online a few minutes later. I got an email alert that the pool was degraded, and another that it was resilvering. After four minutes, the disk was resilvered and the pool is fine. When I got up this morning, I saw the emails, identified the drive that had dropped offline, and checked its SMART status:

Code:
[root@freenas2] ~# smartctl -a /dev/da1
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.14 (AF)
Device Model:     ST2000DM001-1CH164
Serial Number:    Z2F0TD8E
LU WWN Device Id: 5 000c50 05089b845
Firmware Version: CC26
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat Aug 13 06:58:18 2016 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (  35)    The self-test routine was interrupted
                    by the host with a hard or soft reset.
Total time to complete Offline
data collection:         (  617) seconds.
Offline data collection
capabilities:             (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:     (   1) minutes.
Extended self-test routine
recommended polling time:     ( 268) minutes.
Conveyance self-test routine
recommended polling time:     (   2) minutes.
SCT capabilities:           (0x3085)    SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   101   099   006    Pre-fail  Always       -       234545136
  3 Spin_Up_Time            0x0003   094   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       52
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   065   055   030    Pre-fail  Always       -       820922786641
  9 Power_On_Hours          0x0032   069   069   000    Old_age   Always       -       27480
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       52
183 Runtime_Bad_Block       0x0032   099   099   000    Old_age   Always       -       1
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   091   000    Old_age   Always       -       2 2 9
189 High_Fly_Writes         0x003a   025   025   000    Old_age   Always       -       75
190 Airflow_Temperature_Cel 0x0022   065   046   045    Old_age   Always       -       35 (Min/Max 29/39)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       32
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       212
194 Temperature_Celsius     0x0022   035   054   000    Old_age   Always       -       35 (0 9 0 0 0)
197 Current_Pending_Sector  0x0012   048   048   000    Old_age   Always       -       8528
198 Offline_Uncorrectable   0x0010   048   048   000    Old_age   Offline      -       8528
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       27477h+17m+19.461s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       77266091192
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       339503383130

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Interrupted (host reset)      30%     27478         -
# 2  Short offline       Completed without error       00%     27473         -
# 3  Short offline       Completed without error       00%     27449         -
# 4  Short offline       Completed without error       00%     27425         -
# 5  Short offline       Completed without error       00%     27401         -
# 6  Short offline       Completed without error       00%     27377         -
# 7  Short offline       Completed without error       00%     27353         -
# 8  Short offline       Completed without error       00%     27329         -
# 9  Extended offline    Completed without error       00%     27311         -
#10  Short offline       Completed without error       00%     27305         -
#11  Short offline       Completed without error       00%     27281         -
#12  Short offline       Completed without error       00%     27257         -
#13  Short offline       Completed without error       00%     27233         -
#14  Short offline       Completed without error       00%     27209         -
#15  Short offline       Completed without error       00%     27185         -
#16  Short offline       Completed without error       00%     27161         -
#17  Extended offline    Completed without error       00%     27142         -
#18  Short offline       Completed without error       00%     27137         -
#19  Short offline       Completed without error       00%     27113         -
#20  Short offline       Completed without error       00%     27089         -
#21  Short offline       Completed without error       00%     27065         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

8500 bad sectors (and climbing--a few minutes later, and the count has jumped to over 10k)? Ouch! Obviously it's time to replace the drive. I'm OK with that; it's fully depreciated. But what I'm concerned about is that there was no warning of this. I have SMART monitoring enabled, I have my email address entered, and I've gotten SMART email alerts in the past, but nothing on this. I don't even have a yellow light in the web GUI.

Seems the system should have recognized this and said something. What can I check to track down why it isn't? Build specs in sig.
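One way to cross-check a drive by hand, independent of the alert mailer, is to scan the attribute table for non-zero raw counts on the sector-health attributes. The following is a minimal sketch, not anything FreeNAS or smartd ships: the function name `nonzero_watched` and the choice of watched IDs (5, 197, 198) are assumptions made for illustration, and the sample rows are copied from the da1 output above.

```python
import re

# Attribute IDs treated as early-failure indicators in this sketch.
# This selection is an assumption, not a FreeNAS or smartd default.
WATCHED_IDS = {5, 197, 198}

# Matches one row of the "Vendor Specific SMART Attributes" table:
# ID, name, flag, VALUE, WORST, THRESH, type, updated, when_failed, raw.
ATTR_LINE = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"\S+\s+\S+\s+\S+\s+(\d+)"
)


def nonzero_watched(smartctl_output):
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    found = {}
    for line in smartctl_output.splitlines():
        m = ATTR_LINE.match(line)
        if not m:
            continue  # header, blank line, or some other section
        attr_id, name, raw = int(m.group(1)), m.group(2), int(m.group(6))
        if attr_id in WATCHED_IDS and raw > 0:
            found[name] = raw
    return found


# Rows copied from the da1 report above.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   048   048   000    Old_age   Always       -       8528
198 Offline_Uncorrectable   0x0010   048   048   000    Old_age   Offline      -       8528
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""

if __name__ == "__main__":
    print(nonzero_watched(SAMPLE))
```

The sketch only reads the leading integer of the raw column, so vendor-specific raw formats (such as the multi-number Command_Timeout value above) would need separate handling.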

Edit: It's worse than I thought. Here's what another drive is saying:
Code:
[root@freenas2] ~# smartctl -a /dev/da13
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WL6000GSA6457
Serial Number:    WOL240336066
LU WWN Device Id: 0 000000 000000000
Firmware Version: 01.00F.3
User Capacity:    6,001,424,400,384 bytes [6.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ATA8-ACS (minor revision not indicated)
Local Time is:    Sat Aug 13 07:41:23 2016 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 121)    The previous self-test completed having
                    the read element of the test failed.
Total time to complete Offline 
data collection:         ( 6524) seconds.
Offline data collection
capabilities:             (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:     (   2) minutes.
Extended self-test routine
recommended polling time:     ( 719) minutes.
SCT capabilities:           (0x3035)    SCT Status supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   253   253   021    Pre-fail  Always       -       5800
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       9
  5 Reallocated_Sector_Ct   0x0033   198   198   140    Pre-fail  Always       -       71
  7 Seek_Error_Rate         0x002f   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       2603
 10 Spin_Retry_Count        0x0033   100   253   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       9
184 End-to-End_Error        0x0033   100   100   097    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   061   054   000    Old_age   Always       -       39
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       8
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   113   106   000    Old_age   Always       -       39
195 Hardware_ECC_Recovered  0x0036   200   200   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   196   196   000    Old_age   Always       -       4
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       13

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      2597         4198496
# 2  Short offline       Completed: read failure       40%      2596         4198496
# 3  Short offline       Completed: read failure       50%      2572         4198496
# 4  Short offline       Completed: read failure       50%      2548         4198496
# 5  Short offline       Completed: read failure       50%      2524         4198496
# 6  Short offline       Completed: read failure       40%      2500         4198504
# 7  Short offline       Completed: read failure       50%      2476         4198496
# 8  Short offline       Completed without error       00%      2452         -
# 9  Extended offline    Completed: read failure       90%      2429         4199088
#10  Short offline       Completed: read failure       50%      2428         4198496
#11  Short offline       Completed: read failure       50%      2404         4198496
#12  Short offline       Completed: read failure       50%      2380         4198504
#13  Short offline       Completed: read failure       20%      2356         4199096
#14  Short offline       Completed: read failure       30%      2332         4198496
#15  Short offline       Completed: read failure       50%      2308         4198504
#16  Short offline       Completed: read failure       50%      2284         4198496
#17  Extended offline    Completed: read failure       90%      2261         4198496
#18  Short offline       Completed: read failure       50%      2260         4198504
#19  Short offline       Completed: read failure       50%      2236         4198496
#20  Short offline       Completed: read failure       50%      2212         4198496
#21  Short offline       Completed: read failure       50%      2188         4198496

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

It's been consistently failing SMART tests for at least the last three weeks. No warning. And another one:
Code:
[root@freenas2] ~# smartctl -a /dev/da15
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WL6000GSA6457
Serial Number:    WOL240336064
LU WWN Device Id: 0 000000 000000000
Firmware Version: 01.00F.3
User Capacity:    6,001,424,400,384 bytes [6.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ATA8-ACS (minor revision not indicated)
Local Time is:    Sat Aug 13 07:44:10 2016 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 244)    Self-test routine in progress...
                    40% of test remaining.
Total time to complete Offline 
data collection:         ( 7424) seconds.
Offline data collection
capabilities:             (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:     (   2) minutes.
Extended self-test routine
recommended polling time:     ( 727) minutes.
SCT capabilities:           (0x3035)    SCT Status supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   253   253   021    Pre-fail  Always       -       5858
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       9
  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       61
  7 Seek_Error_Rate         0x002f   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       2604
 10 Spin_Retry_Count        0x0033   100   253   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       9
184 End-to-End_Error        0x0033   100   100   097    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   061   055   000    Old_age   Always       -       39
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       8
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   113   107   000    Old_age   Always       -       39
195 Hardware_ECC_Recovered  0x0036   200   200   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   196   196   000    Old_age   Always       -       4
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       2

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       60%      2596         3131594576
# 2  Short offline       Completed: read failure       60%      2572         3131594576
# 3  Short offline       Completed: read failure       60%      2548         3131594576
# 4  Short offline       Completed: read failure       60%      2524         3131594576
# 5  Short offline       Completed: read failure       60%      2500         3131594576
# 6  Short offline       Completed: read failure       50%      2476         3131594640
# 7  Short offline       Completed: read failure       10%      2452         3131594640
# 8  Extended offline    Completed: read failure       10%      2443         3131594576
# 9  Short offline       Completed: read failure       60%      2428         3131594640
#10  Short offline       Completed: read failure       50%      2404         3131594736
#11  Short offline       Completed: read failure       60%      2380         3131594576
#12  Short offline       Completed: read failure       60%      2356         3131594576
#13  Short offline       Completed without error       00%      2332         -
#14  Short offline       Completed: read failure       60%      2308         3131594632
#15  Short offline       Completed: read failure       60%      2284         3131594640
#16  Extended offline    Completed: read failure       10%      2275         3131594640
#17  Short offline       Completed: read failure       50%      2260         3131594672
#18  Short offline       Completed: read failure       60%      2236         3131594624
#19  Short offline       Completed: read failure       60%      2212         3131594576
#20  Short offline       Completed: read failure       60%      2188         3131594576
#21  Short offline       Completed: read failure       60%      2164         3131594672

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


That's three drives that should have been throwing big red warning lights, but aren't. Why?
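The self-test logs above can also be summarized mechanically. This is a minimal sketch under stated assumptions: `summarize_self_tests` is a hypothetical helper, the regular expression is written against the column layout shown in this thread, and the sample is a subset of the da13 log. Because the drive keeps only its 21 most recent self-test entries, the hour span it reports is a lower bound on how long the failures have been occurring.

```python
import re

# One row of the "SMART Self-test log": number, description, status,
# remaining percent, lifetime hours, LBA of first error (or "-").
ENTRY = re.compile(
    r"^#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s{2,}(\d+)%\s+(\d+)\s+(\S+)\s*$"
)


def summarize_self_tests(log_text):
    """Count read failures and measure the power-on-hour window they span."""
    failure_hours = []
    passed = 0
    for line in log_text.splitlines():
        m = ENTRY.match(line)
        if not m:
            continue
        status, hours = m.group(3), int(m.group(5))
        if "read failure" in status:
            failure_hours.append(hours)
        elif status == "Completed without error":
            passed += 1
    span = max(failure_hours) - min(failure_hours) if failure_hours else 0
    return {"failed": len(failure_hours), "passed": passed, "span_hours": span}


# Entries 1, 2, 8 and 21 copied from the da13 log above.
SAMPLE = """\
# 1  Extended offline    Completed: read failure       90%      2597         4198496
# 2  Short offline       Completed: read failure       40%      2596         4198496
# 8  Short offline       Completed without error       00%      2452         -
#21  Short offline       Completed: read failure       50%      2188         4198496
"""

if __name__ == "__main__":
    print(summarize_self_tests(SAMPLE))
```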
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
Quote: "That's three drives that should have been throwing big red warning lights, but aren't. Why?"
Shooting in the dark here...
I seem to recall an old cyberjock post about drive errors appearing during periods of high temperatures.
The first drive's SMART numbers obviously indicate a failing drive. If 39 degrees is the idle temperature
for the other two, perhaps those drives are just plain too hot and are throwing errors because of that.
Are your logs reporting these errors? I would also compare firmware versions for the drives of the same
model number...
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Yeah, the drives are a little warm. They're better than they used to be, but they still get into the low 40s pretty regularly. That might account for early failure, but it shouldn't account for failure to report the failure. And yes, all of these errors are in the logs:
Code:
[root@freenas2] /var/log# grep da13 messages
Aug 13 00:38:41 freenas2 smartd[4796]: Device: /dev/da13 [SAT], previous self-test completed with error (read test element)
Aug 13 00:38:41 freenas2 smartd[4796]: Device: /dev/da13 [SAT], new Self-Test Log error at hour timestamp 2596
Aug 13 00:38:41 freenas2 smartd[4796]: Device: /dev/da13 [SAT], previous self-test completed with error (read test element)
Aug 13 00:38:41 freenas2 smartd[4796]: Device: /dev/da13 [SAT], new Self-Test Log error at hour timestamp 2596
Aug 13 01:38:44 freenas2 smartd[4796]: Device: /dev/da13 [SAT], previous self-test completed with error (read test element)
Aug 13 01:38:44 freenas2 smartd[4796]: Device: /dev/da13 [SAT], new Self-Test Log error at hour timestamp 2597
Aug 13 01:38:44 freenas2 smartd[4796]: Device: /dev/da13 [SAT], previous self-test completed with error (read test element)
Aug 13 01:38:44 freenas2 smartd[4796]: Device: /dev/da13 [SAT], new Self-Test Log error at hour timestamp 2597
Aug 13 07:38:25 freenas2 smartd[91191]: Device: /dev/da13 [SAT], previous self-test completed with error (read test element)
Aug 13 07:38:25 freenas2 smartd[91191]: Device: /dev/da13 [SAT], previous self-test completed with error (read test element)
[root@freenas2] /var/log# grep da15 messages
Aug 13 00:38:41 freenas2 smartd[4796]: Device: /dev/da15 [SAT], new Self-Test Log error at hour timestamp 2596
Aug 13 00:38:41 freenas2 smartd[4796]: Device: /dev/da15 [SAT], previous self-test completed with error (read test element)
Aug 13 00:38:41 freenas2 smartd[4796]: Device: /dev/da15 [SAT], new Self-Test Log error at hour timestamp 2596
[root@freenas2] /var/log# grep da1\  messages
Aug 13 05:08:43 freenas2 smartd[4796]: Device: /dev/da1 [SAT], 856 Currently unreadable (pending) sectors
Aug 13 05:08:43 freenas2 smartd[4796]: Device: /dev/da1 [SAT], 856 Offline uncorrectable sectors
Aug 13 05:08:44 freenas2 smartd[4796]: Device: /dev/da1 [SAT], 856 Currently unreadable (pending) sectors
Aug 13 05:08:44 freenas2 smartd[4796]: Device: /dev/da1 [SAT], 856 Offline uncorrectable sectors
Aug 13 05:32:08 freenas2 da1 at mps0 bus 0 scbus0 target 9 lun 0
Aug 13 05:32:24 freenas2 da1 at mps0 bus 0 scbus0 target 9 lun 0
Aug 13 05:39:13 freenas2 smartd[4796]: Device: /dev/da1 [SAT], 3488 Currently unreadable (pending) sectors (changed +2632)
Aug 13 05:39:13 freenas2 smartd[4796]: Device: /dev/da1 [SAT], 3488 Offline uncorrectable sectors (changed +2632)
Aug 13 05:39:55 freenas2 smartd[4796]: Device: /dev/da1 [SAT], 3592 Currently unreadable (pending) sectors (changed +2736)
Aug 13 05:39:55 freenas2 smartd[4796]: Device: /dev/da1 [SAT], 3592 Offline uncorrectable sectors (changed +2736)
Aug 13 05:41:03 freenas2 da1 at mps0 bus 0 scbus0 target 9 lun 0
Aug 13 05:41:10 freenas2 da1 at mps0 bus 0 scbus0 target 9 lun 0
Aug 13 06:08:43 freenas2 smartd[4796]: Device: /dev/da1 [SAT], 5480 Currently unreadable (pending) sectors (changed +1992)
Aug 13 06:08:43 freenas2 smartd[4796]: Device: /dev/da1 [SAT], 5480 Offline uncorrectable sectors (changed +1992)
Aug 13 06:08:45 freenas2 smartd[4796]: Device: /dev/da1 [SAT], 5480 Currently unreadable (pending) sectors (changed +1888)
Aug 13 06:08:45 freenas2 smartd[4796]: Device: /dev/da1 [SAT], 5480 Offline uncorrectable sectors (changed +1888)
Aug 13 06:38:45 freenas2 smartd[4796]: Device: /dev/da1 [SAT], 7776 Currently unreadable (pending) sectors (changed +2296)
Aug 13 06:38:45 freenas2 smartd[4796]: Device: /dev/da1 [SAT], 7776 Offline uncorrectable sectors (changed +2296)
Aug 13 06:38:49 freenas2 smartd[4796]: Device: /dev/da1 [SAT], 7776 Currently unreadable (pending) sectors (changed +2296)
Aug 13 06:38:49 freenas2 smartd[4796]: Device: /dev/da1 [SAT], 7776 Offline uncorrectable sectors (changed +2296)
Aug 13 07:08:49 freenas2 smartd[4796]: Device: /dev/da1 [SAT], 9296 Currently unreadable (pending) sectors (changed +1520)
Aug 13 07:08:49 freenas2 smartd[4796]: Device: /dev/da1 [SAT], 9296 Offline uncorrectable sectors (changed +1520)
Aug 13 07:08:54 freenas2 smartd[4796]: Device: /dev/da1 [SAT], 9296 Currently unreadable (pending) sectors (changed +1520)
Aug 13 07:08:54 freenas2 smartd[4796]: Device: /dev/da1 [SAT], 9296 Offline uncorrectable sectors (changed +1520)
Aug 13 07:38:32 freenas2 smartd[91191]: Device: /dev/da1 [SAT], 10736 Currently unreadable (pending) sectors
Aug 13 07:38:32 freenas2 smartd[91191]: Device: /dev/da1 [SAT], 10736 Offline uncorrectable sectors
Aug 13 08:08:42 freenas2 smartd[91493]: Device: /dev/da1 [SAT], 10736 Currently unreadable (pending) sectors
Aug 13 08:08:42 freenas2 smartd[91493]: Device: /dev/da1 [SAT], 10736 Offline uncorrectable sectors
Aug 13 08:38:43 freenas2 smartd[91493]: Device: /dev/da1 [SAT], 10736 Currently unreadable (pending) sectors
Aug 13 08:38:43 freenas2 smartd[91493]: Device: /dev/da1 [SAT], 10736 Offline uncorrectable sectors
Aug 13 09:08:42 freenas2 smartd[91493]: Device: /dev/da1 [SAT], 10736 Currently unreadable (pending) sectors
Aug 13 09:08:42 freenas2 smartd[91493]: Device: /dev/da1 [SAT], 10736 Offline uncorrectable sectors
Aug 13 09:38:42 freenas2 smartd[91493]: Device: /dev/da1 [SAT], 10736 Currently unreadable (pending) sectors
Aug 13 09:38:42 freenas2 smartd[91493]: Device: /dev/da1 [SAT], 10736 Offline uncorrectable sectors
Aug 13 10:08:42 freenas2 smartd[91493]: Device: /dev/da1 [SAT], 10736 Currently unreadable (pending) sectors
Aug 13 10:08:42 freenas2 smartd[91493]: Device: /dev/da1 [SAT], 10736 Offline uncorrectable sectors
Aug 13 10:38:42 freenas2 smartd[91493]: Device: /dev/da1 [SAT], 10736 Currently unreadable (pending) sectors
Aug 13 10:38:42 freenas2 smartd[91493]: Device: /dev/da1 [SAT], 10736 Offline uncorrectable sectors
Aug 13 11:08:43 freenas2 smartd[91493]: Device: /dev/da1 [SAT], 10736 Currently unreadable (pending) sectors
Aug 13 11:08:43 freenas2 smartd[91493]: Device: /dev/da1 [SAT], 10736 Offline uncorrectable sectors
Aug 13 11:38:42 freenas2 smartd[91493]: Device: /dev/da1 [SAT], 10736 Currently unreadable (pending) sectors
Aug 13 11:38:42 freenas2 smartd[91493]: Device: /dev/da1 [SAT], 10736 Offline uncorrectable sectors
[root@freenas2] /var/log#
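The growth in pending sectors is easier to see with the repeated log lines collapsed. Below is a minimal sketch, assuming the smartd message format shown above; `pending_history` is a hypothetical helper name. It compares the device name exactly, which serves the same purpose as the escaped trailing space in `grep da1\ `: keeping da13 and da15 out of the da1 results.

```python
import re

# Matches smartd's "Currently unreadable (pending) sectors" messages as they
# appear in /var/log/messages in this thread.
PENDING = re.compile(
    r"^(\w{3}\s+\d+\s+[\d:]{8})\s+\S+\s+smartd\[\d+\]:\s+"
    r"Device:\s+(/dev/\w+)\b.*?,\s+(\d+)\s+Currently unreadable \(pending\) sectors"
)


def pending_history(log_text, device):
    """Return [(timestamp, count)] for one device, skipping repeated counts."""
    history = []
    for line in log_text.splitlines():
        m = PENDING.match(line)
        if m and m.group(2) == device:  # exact match, so da1 != da13
            stamp, count = m.group(1), int(m.group(3))
            if not history or history[-1][1] != count:
                history.append((stamp, count))
    return history


# Lines copied from the grep output above.
SAMPLE = """\
Aug 13 00:38:41 freenas2 smartd[4796]: Device: /dev/da13 [SAT], previous self-test completed with error (read test element)
Aug 13 05:08:43 freenas2 smartd[4796]: Device: /dev/da1 [SAT], 856 Currently unreadable (pending) sectors
Aug 13 05:08:43 freenas2 smartd[4796]: Device: /dev/da1 [SAT], 856 Offline uncorrectable sectors
Aug 13 05:08:44 freenas2 smartd[4796]: Device: /dev/da1 [SAT], 856 Currently unreadable (pending) sectors
Aug 13 05:39:13 freenas2 smartd[4796]: Device: /dev/da1 [SAT], 3488 Currently unreadable (pending) sectors (changed +2632)
Aug 13 06:08:43 freenas2 smartd[4796]: Device: /dev/da1 [SAT], 5480 Currently unreadable (pending) sectors (changed +1992)
Aug 13 07:38:32 freenas2 smartd[91191]: Device: /dev/da1 [SAT], 10736 Currently unreadable (pending) sectors
"""

if __name__ == "__main__":
    for stamp, count in pending_history(SAMPLE, "/dev/da1"):
        print(stamp, count)
```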
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Let's take a look at just one drive: da15 should be generating an email after each SMART Short test failure, and it's been failing for a while, at the same LBA range. I would RMA this drive. I would assume the email is all set up properly since you got a message in the first place.

I'm not sure what drive temperature would have to do with not getting email notifications. Is FreeNAS broken? Based on what I've read above, I think this warrants a bug report with a very high priority. If you find out that this is a setup error, I'd really like to know what it was so I can make sure I'm not making the same mistake.

Let's talk a bit more about the drives listed above... da13 and da15 are both having trouble reading data and are failing their SMART tests, yet the overall health status is PASSED. This could be the smoking gun for why FreeNAS isn't reporting the errors, but I've been too far out of the game to know what trigger makes it consider a drive failed and generate the email. I think da1 is also failing even though it passes the extended test.
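For context on the PASSED result: smartctl marks an attribute as failed when its normalized VALUE drops to or below THRESH, and raw counts do not enter that comparison. The sketch below applies that rule to rows from the da1 report. It is a simplified illustration: `failing_by_threshold` is a hypothetical name, treating a threshold of 000 as "can never fail" is an assumption of this sketch, and the drive's firmware, not this rule, ultimately decides the overall-health verdict.

```python
def failing_by_threshold(attribute_rows):
    """Return names of attributes whose normalized VALUE is at or below THRESH."""
    failing = []
    for row in attribute_rows.splitlines():
        fields = row.split()
        # Expect: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW...
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, value, thresh = fields[1], int(fields[3]), int(fields[5])
        # Assumption of this sketch: a threshold of 0 means "never fails".
        if thresh > 0 and value <= thresh:
            failing.append(name)
    return failing


# Rows copied from the da1 report above.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   048   048   000    Old_age   Always       -       8528
198 Offline_Uncorrectable   0x0010   048   048   000    Old_age   Offline      -       8528
"""

if __name__ == "__main__":
    print(failing_by_threshold(SAMPLE))
```

With a threshold of 000 on attributes 197 and 198, thousands of pending sectors leave no attribute at or below its threshold, which is consistent with the drive still reporting PASSED.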
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
It's obvious (to me, at least), that all three drives need to be replaced. I've already ordered a replacement for da1, and I'm checking with the vendor on da13 and da15 if I can do an advance exchange (since I don't want to degrade my pool if I can avoid it). I'm also seriously considering replacing da13 and da15 with name-brand drives rather than the white labels.

I've submitted a bug report, but it's marked private since my debug file is attached (which it seems would be very important for the devs to figure out what's going on). I'll update here if I learn anything.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
That's why I made a script to email me the SMART info after each SMART test (and the same for the scrubs). Until this problem is solved, I recommend you do the same so you don't miss any other drive failure ;)
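The script mentioned above is not reproduced in this thread. As a rough illustration of the idea only, here is a minimal sketch, assuming Python 3, a local mail relay, and that `smartctl` is on the PATH. The function names, the subject line, and the example.com addresses are all hypothetical.

```python
from email.message import EmailMessage
import smtplib
import subprocess


def collect_reports(devices, run=subprocess.run):
    """Run `smartctl -a` on each device and return {device: output_text}."""
    reports = {}
    for dev in devices:
        # smartctl's exit status is a bitmask and can be non-zero even when
        # it prints a full report, so the return code is not checked here.
        result = run(["smartctl", "-a", dev], capture_output=True, text=True)
        reports[dev] = result.stdout
    return reports


def build_message(reports, sender, recipient):
    """Assemble one plain-text email containing every device's report."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "SMART report for %d drive(s)" % len(reports)
    sections = [
        "===== %s =====\n%s" % (dev, text)
        for dev, text in sorted(reports.items())
    ]
    msg.set_content("\n\n".join(sections))
    return msg


def send_report(msg, host="localhost"):
    """Hand the message to an SMTP relay (assumed to accept local mail)."""
    with smtplib.SMTP(host) as smtp:
        smtp.send_message(msg)


# Example wiring (not executed here; addresses are placeholders):
#   reports = collect_reports(["/dev/da1", "/dev/da13", "/dev/da15"])
#   send_report(build_message(reports, "nas@example.com", "admin@example.com"))
```

Run from cron shortly after the scheduled SMART tests, something like this delivers the full report whether or not the built-in alerting fires, at the cost of having to read the reports yourself.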
 

LIGISTX

Guru
Joined
Apr 12, 2015
Messages
525
Whelp, looks like now that I am trying to be a part of this FreeNAS community and really, really like warm and fuzzies, I should probably keep up with stuff like this ;)

I really do have SO much to learn.
 

ethereal

Guru
Joined
Sep 10, 2012
Messages
762
I use Bidule0hm's script and get an email with all my drives' SMART test results once a week.

I check every drive for problems and also verify that the SMART tests are running as and when they should be.

Early in my FreeNAS experience I realised I needed to be proactive about my drives' health: the emails weren't coming (my fault), and sometimes the SMART tests weren't running on the drives I thought they were.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
My bug has been parked as a duplicate of 15898, which appears to have been resolved in the 9.10.1 update. That will get me the P21 firmware warning and who knows what boot problems, but I guess that's why I have IPMI and boot environments.
 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
Quote: "For closure, I updated to 9.10.1, and the alert notifications are now working as expected."
Just out of curiosity... are you using an email address with a '+' character?
 

Nick2253

Wizard
Joined
Apr 21, 2014
Messages
1,633
Quote: "Just out of curiosity... are you using an email address with a '+' character?"
Gmail, right? I am using an email with a + character, and whether or not I had the + character, it wouldn't work for me in 9.10. Once I upgraded to 9.10.1, it fixed the problem.
 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
Quote: "Gmail, right? I am using an email with a + character, and whether or not I had the + character, it wouldn't work for me in 9.10. Once I upgraded to 9.10.1, it fixed the problem."
Yes, Gmail. I don't use it for my monitoring account, but I read the bug report and it mentioned problems with email addresses containing '+' chars.

I think I'm getting all of the status mails I should be getting from my FreeNAS server... but I intend to find out for certain, given the email problems described in this thread.

Problem is... I'm running 9.10 and don't have a lot of confidence in 9.10.1 after seeing all the problems folks have posted about it.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
I'm not sure which versions of 9.10 had the bug in smart_alert.py, but at least 20160607 did. The result of that appears to be that any SMART alerts will not be emailed.
 