Is a scrub required after a SMART error?

Status
Not open for further replies.

saurav

Contributor
Joined
Jul 29, 2012
Messages
139
I just got an alert about a SMART error during a long self-test on one of the WD Reds in my FreeNAS Mini. The SMART status is shown below. Should I run a scrub so ZFS can fix the file(s) that use the sectors where the error(s) were encountered?

Full smartctl output: http://pastebin.com/naWqwDtr

Edit: pool status and all datasets are healthy.

Code:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       12
  3 Spin_Up_Time            0x0027   179   178   021    Pre-fail  Always       -       8016
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       16
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       2594
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       16
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       6
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       37
194 Temperature_Celsius     0x0022   119   110   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   198   000    Old_age   Offline      -       196

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      2585         426023488
# 2  Short offline       Completed without error       00%      2561         -
# 3  Short offline       Completed without error       00%      2537         -
# 4  Short offline       Completed without error       00%      2489         -
# 5  Short offline       Completed without error       00%      2465         -
# 6  Short offline       Completed without error       00%      2441         -
# 7  Extended offline    Completed without error       00%      2427         -
# 8  Short offline       Completed without error       00%      2393         -
# 9  Short offline       Completed without error       00%      2369         -
#10  Short offline       Completed without error       00%      2321         -
#11  Short offline       Completed without error       00%      2297         -
#12  Short offline       Completed without error       00%      2273         -
#13  Extended offline    Completed without error       00%      2259         -
#14  Short offline       Completed without error       00%      2225         -
#15  Short offline       Completed without error       00%      2201         -
#16  Short offline       Completed without error       00%      2153         -
#17  Short offline       Completed without error       00%      2129         -
#18  Short offline       Completed without error       00%      2105         -
#19  Extended offline    Completed without error       00%      2091         -
#20  Short offline       Completed without error       00%      2057         -
#21  Short offline       Completed without error       00%      2033         -


Thanks,
Saurav.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Not exactly. But your disk is basically saying "I'm starting to fail", so replacing the disk matters more than running a scrub.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,176
No point in trying to fix things if the drive is just going to screw up more stuff.

If the disk is still in warranty, that failed SMART test is enough to get a replacement. If not, just buy a new one.
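
For anyone who wants to check for this kind of failed self-test automatically, here is a minimal Python sketch. It assumes you already have the text of the "SMART Self-test log" section from smartctl, in the same layout as the output posted above. The function name `failed_selftests` and the regular expression are illustrative only, not part of smartctl or FreeNAS. Anything other than "Completed without error" is reported, which would also include a test that is still running.

```python
import re

# One row of the self-test log table, e.g.
# "# 1  Extended offline    Completed: read failure       90%      2585         426023488"
ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+"
    r"(?P<test>Short offline|Extended offline)\s+"
    r"(?P<status>.+?)\s+"
    r"(?P<remaining>\d+)%\s+"
    r"(?P<hours>\d+)\s+"
    r"(?P<lba>\d+|-)\s*$"
)


def failed_selftests(log_text):
    """Return (num, test, status, hours, lba) for each test that did not pass."""
    failures = []
    for line in log_text.splitlines():
        m = ROW.match(line.strip())
        if m is None:
            continue  # header line, blank line, or an unrecognised row
        if m["status"] == "Completed without error":
            continue
        lba = None if m["lba"] == "-" else int(m["lba"])
        failures.append(
            (int(m["num"]), m["test"], m["status"], int(m["hours"]), lba)
        )
    return failures


# Three rows copied from the log posted earlier in the thread.
SAMPLE = """\
# 1  Extended offline    Completed: read failure       90%      2585         426023488
# 2  Short offline       Completed without error       00%      2561         -
#10  Short offline       Completed without error       00%      2321         -
"""

for failure in failed_selftests(SAMPLE):
    print(failure)
```

The LBA in the result is the first failing sector the drive reported, which is the detail a warranty claim is likely to ask about.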
 

saurav

Contributor
Joined
Jul 29, 2012
Messages
139
Of course the disk is in warranty (it has been only 4 months), but since it came with the FreeNAS Mini I have to check with iXsystems to see if/how a replacement is possible. Also, I don't live in the US, so WD shows that disk as "Out of Region". I'll have to see how it goes...
 

saurav

Contributor
Joined
Jul 29, 2012
Messages
139
So iXsystems has shipped a replacement, but I will only get it on Monday (1/12) or Tuesday (1/13). In the meantime, I now have pending sectors on that disk (after a short self-test). The pool and all datasets are still healthy, and this is what the SMART status looks like right now for that disk:

Full SMART status: http://pastebin.com/k8qst4Fg

Code:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       36
  3 Spin_Up_Time            0x0027   179   178   021    Pre-fail  Always       -       8016
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       16
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       2632
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       16
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       6
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       38
194 Temperature_Celsius     0x0022   119   110   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       6
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   198   000    Old_age   Offline      -       196

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      2609         -
# 2  Extended offline    Completed: read failure       90%      2585         426023488
# 3  Short offline       Completed without error       00%      2561         -
# 4  Short offline       Completed without error       00%      2537         -
# 5  Short offline       Completed without error       00%      2489         -
# 6  Short offline       Completed without error       00%      2465         -
# 7  Short offline       Completed without error       00%      2441         -
# 8  Extended offline    Completed without error       00%      2427         -
# 9  Short offline       Completed without error       00%      2393         -
#10  Short offline       Completed without error       00%      2369         -
#11  Short offline       Completed without error       00%      2321         -
#12  Short offline       Completed without error       00%      2297         -
#13  Short offline       Completed without error       00%      2273         -
#14  Extended offline    Completed without error       00%      2259         -
#15  Short offline       Completed without error       00%      2225         -
#16  Short offline       Completed without error       00%      2201         -
#17  Short offline       Completed without error       00%      2153         -
#18  Short offline       Completed without error       00%      2129         -
#19  Short offline       Completed without error       00%      2105         -
#20  Extended offline    Completed without error       00%      2091         -
#21  Short offline       Completed without error       00%      2057         -
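
As a rough illustration of how the pending-sector count in the table above could be picked out by a script, here is a small Python sketch. It assumes the attribute table is available as text in the layout shown above. The names `parse_attributes`, `worrying`, and the `WATCHED` list are made up for this example, and which attributes are worth watching is my own assumption rather than an official rule.

```python
# Attributes whose raw value is commonly expected to stay at 0 on a healthy
# disk (an assumption made for this sketch, not a vendor specification).
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def parse_attributes(table_text):
    """Map ATTRIBUTE_NAME -> integer RAW_VALUE for each row of the table."""
    raw = {}
    for line in table_text.splitlines():
        fields = line.split()
        # Data rows have 10 columns and start with a numeric attribute ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        try:
            raw[fields[1]] = int(fields[9])
        except ValueError:
            continue  # skip rows whose raw value is not a plain integer
    return raw


def worrying(raw):
    """Return only the watched attributes that have a non-zero raw value."""
    return {name: raw[name] for name in WATCHED if raw.get(name, 0) > 0}


# Rows copied from the second SMART output in this thread.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       6
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
"""

print(worrying(parse_attributes(SAMPLE)))
```

Run against the first SMART output in the thread, the same check would report nothing, since Current_Pending_Sector was still 0 at that point.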


Should I skip the burn-in and just pop in the replacement and resilver? That will save me 2 days of running with a failing disk, and probably one that's failing fast.

Just asking for opinions. I know there is no technically correct answer to this...
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,176
It's a good question. If all the other drives were burned-in, I'd put it straight in, set up monitoring, and run jgreco's script only on the new drive (since it's supposed to be data-safe).

Ideally, one would have a burned-in spare always ready (I'll admit that that's still something I have to take care of - do as I say and not as I do :p).
 

saurav

Contributor
Joined
Jul 29, 2012
Messages
139
The RMA arrived today.

When I tried to offline the failing disk (ada3), I got this error:

Code:
Jan 12 20:07:49 freenas-primary manage.py: [middleware.exceptions:38] [MiddlewareError: Disk offline failed: "cannot offline gptid/c6e200a8-3abf-11e4-bf2e-d05099265082: no such device in pool, "]


The first few results Google turns up for this refer to a similar-looking bug in some earlier versions of PC-BSD, and there is only one reference in this forum:
https://forums.freenas.org/index.php?threads/drive-will-not-go-offline-to-replace.17508/

But that guy had some other h/w.

What to do now?

Code:
  pool: tank
 state: ONLINE
  scan: scrub repaired 0 in 4h20m with 0 errors on Mon Jan  5 06:20:14 2015
config:

    NAME                                            STATE     READ WRITE CKSUM
    tank                                            ONLINE       0     0     0
      raidz2-0                                      ONLINE       0     0     0
        gptid/c6140475-3abf-11e4-bf2e-d05099265082  ONLINE       0     0     0
        gptid/c67a8825-3abf-11e4-bf2e-d05099265082  ONLINE       0     0     0
        gptid/c6e200a8-3abf-11e4-bf2e-d05099265082  ONLINE       0     0     0
        gptid/c749abb4-3abf-11e4-bf2e-d05099265082  ONLINE       0     0     0

errors: No known data errors


gpart show: http://pastebin.com/qKDnt3M2
zdb -l /dev/gptid/c6e200a8-3abf-11e4-bf2e-d05099265082 tank: http://pastebin.com/akcbH37P
 

saurav

Contributor
Joined
Jul 29, 2012
Messages
139
This might also be useful. It shows a swap partition label only for the problem disk:
Code:
glabel status
                                      Name  Status  Components
gptid/c6140475-3abf-11e4-bf2e-d05099265082     N/A  ada0p2
gptid/c67a8825-3abf-11e4-bf2e-d05099265082     N/A  ada1p2
                             ufs/FreeNASs3     N/A  ada2s3
                             ufs/FreeNASs4     N/A  ada2s4
gptid/c6e200a8-3abf-11e4-bf2e-d05099265082     N/A  ada3p2
gptid/c749abb4-3abf-11e4-bf2e-d05099265082     N/A  ada4p2
                    ufsid/53e2e057a7d7f957     N/A  ada2s1a
                            ufs/FreeNASs1a     N/A  ada2s1a
                            ufs/FreeNASs2a     N/A  ada2s2a
gptid/c6ce0117-3abf-11e4-bf2e-d05099265082     N/A  ada3p1
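
To make the mapping between a disk and its gptid labels easier to see, here is a short Python sketch that groups `glabel status` rows by disk. It assumes the three-column layout shown above. The function name `labels_for_disk` is hypothetical and only for illustration.

```python
def labels_for_disk(glabel_text, disk):
    """Return {component: [labels]} for the partitions of `disk`, e.g. 'ada3'."""
    result = {}
    for line in glabel_text.splitlines():
        fields = line.split()
        # Data rows look like: <label> <status> <component>
        if len(fields) != 3 or fields[0] == "Name":
            continue
        label, _status, component = fields
        suffix = component[len(disk):]
        # Require the component to belong to this disk, so that 'ada3' does
        # not also match a hypothetical 'ada30'.
        if component.startswith(disk) and not suffix[:1].isdigit():
            result.setdefault(component, []).append(label)
    return result


# Rows copied from the glabel output above.
SAMPLE = """\
                                      Name  Status  Components
gptid/c6140475-3abf-11e4-bf2e-d05099265082     N/A  ada0p2
gptid/c6e200a8-3abf-11e4-bf2e-d05099265082     N/A  ada3p2
gptid/c749abb4-3abf-11e4-bf2e-d05099265082     N/A  ada4p2
gptid/c6ce0117-3abf-11e4-bf2e-d05099265082     N/A  ada3p1
"""

for component, labels in sorted(labels_for_disk(SAMPLE, "ada3").items()):
    print(component, labels)
```

For ada3 this lists both ada3p1 and ada3p2, while the other data disks only show a p2 label, which is the oddity pointed out above.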
 

saurav

Contributor
Joined
Jul 29, 2012
Messages
139
It seems to be exactly the same issue as this one:

https://bugs.freenas.org/issues/5035

It was resolved as a "User Configuration Issue" after the user was able to offline his disk in the shell.
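
The bug report does not say exactly what was typed in the shell, so the following is only a guess at what "offline in the shell" could look like, written as a small Python wrapper around `zpool offline <pool> <device>`. The pool name and gptid are the ones from the `zpool status` output earlier in this thread. The helper names `zpool_offline_cmd` and `offline` are made up for this sketch, and it defaults to a dry run that only prints the command.

```python
import subprocess


def zpool_offline_cmd(pool, device):
    """Build the argument list for `zpool offline <pool> <device>`."""
    return ["zpool", "offline", pool, device]


def offline(pool, device, dry_run=True):
    """Print the command (dry run) or actually run it as the current user."""
    cmd = zpool_offline_cmd(pool, device)
    if dry_run:
        print(" ".join(cmd))
        return None
    # check=True raises CalledProcessError if zpool exits with an error,
    # for example the "no such device in pool" message quoted above.
    return subprocess.run(cmd, check=True, capture_output=True, text=True)


# Dry run with the pool and device named in this thread.
offline("tank", "gptid/c6e200a8-3abf-11e4-bf2e-d05099265082")
```

Keeping the dry run as the default means nothing touches the pool until the printed command has been checked against `zpool status`.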
 