Possible Bad Drive?

Status
Not open for further replies.

raidflex

Guru
Joined
Mar 14, 2012
Messages
531
I have a drive that was disappearing out of my raid array, so I put in an RMA and replaced the drive. All was well for a couple of weeks, and now I am seeing checksum errors after running a scrub that are related to the same drive that I just replaced. I also did not see any SMART errors displayed.

My weekly scrub has run a couple of times since I replaced the drive. I did a zpool clear, and the checksum errors returned.

I guess it's possible I received a bad refurbished drive from Seagate.

I have attached info below.


ZPOOL STATUS

Code:
[root@freenas] ~# zpool status -v
  pool: Data
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
  scan: scrub repaired 419M in 7h48m with 0 errors on Wed May  8 10:48:35 2013
config:

        NAME                                            STATE     READ WRITE CKSUM
        Data                                            ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/e902c40d-3326-11e1-acd0-e06995ebe0de  ONLINE       0     0     0
            gptid/e91172b0-3326-11e1-acd0-e06995ebe0de  ONLINE       0     0     0
            gptid/e91f0e06-3326-11e1-acd0-e06995ebe0de  ONLINE       0     0 19.2K
            gptid/f96ec369-a47c-11e2-863b-e06995ebe0de  ONLINE       0     0     0
            gptid/e93a03e7-3326-11e1-acd0-e06995ebe0de  ONLINE       0     0     0
            gptid/e94867a0-3326-11e1-acd0-e06995ebe0de  ONLINE       0     0     0
            gptid/e956b27d-3326-11e1-acd0-e06995ebe0de  ONLINE       0     0     0
            gptid/e964d6cf-3326-11e1-acd0-e06995ebe0de  ONLINE       0     0     0

errors: No known data errors


Camcontrol devlist

Code:
[root@freenas] ~# camcontrol devlist
<ATA ST2000DL003-9VT1 CC32>        at scbus0 target 0 lun 0 (pass0,da0)
<ATA ST2000DL003-9VT1 CC32>        at scbus0 target 1 lun 0 (pass1,da1)
<ATA ST2000DL003-9VT1 CC3C>        at scbus0 target 2 lun 0 (pass2,da2)
<ATA ST2000DL003-9VT1 CC3C>        at scbus0 target 3 lun 0 (pass3,da3)
<ATA ST2000DL003-9VT1 CC32>        at scbus0 target 5 lun 0 (pass4,da4)
<ATA ST2000DL003-9VT1 CC32>        at scbus0 target 6 lun 0 (pass5,da5)
<ATA ST2000DL003-9VT1 CC32>        at scbus0 target 7 lun 0 (pass6,da6)
<ATA ST32000542AS CC34>            at scbus0 target 8 lun 0 (pass7,da7)
<ATAPI DVD A  DH16A3L 8H1F>        at scbus3 target 0 lun 0 (pass8,cd0)
<ST31500341AS CC1H>                at scbus4 target 0 lun 0 (pass9,ada0)
<ADATA USB Flash Drive 1.00>       at scbus7 target 0 lun 0 (pass10,da8)


glabel status

Code:
[root@freenas] ~# glabel status
                                      Name  Status  Components
gptid/e964d6cf-3326-11e1-acd0-e06995ebe0de     N/A  da0p2
gptid/e956b27d-3326-11e1-acd0-e06995ebe0de     N/A  da1p2
gptid/e94867a0-3326-11e1-acd0-e06995ebe0de     N/A  da2p2
gptid/e93a03e7-3326-11e1-acd0-e06995ebe0de     N/A  da3p2
gptid/e91f0e06-3326-11e1-acd0-e06995ebe0de     N/A  da4p2
gptid/e91172b0-3326-11e1-acd0-e06995ebe0de     N/A  da5p2
gptid/e902c40d-3326-11e1-acd0-e06995ebe0de     N/A  da6p2
gptid/f96ec369-a47c-11e2-863b-e06995ebe0de     N/A  da7p2
                             ufs/FreeNASs3     N/A  da8s3
                             ufs/FreeNASs4     N/A  da8s4
                            ufs/FreeNASs1a     N/A  da8s1a
                    ufsid/5144fed8f696ca23     N/A  da8s2a
                            ufs/FreeNASs2a     N/A  da8s2a
gptid/42c1112e-b5f8-11e2-9553-e06995ebe0de     N/A  ada0p2



GPART SHOW

Code:
[root@freenas] ~# gpart show
=>        34  3907029101  da0  GPT  (1.8T)
          34          94       - free -  (47k)
         128     4194304    1  freebsd-swap  (2.0G)
     4194432  3902834703    2  freebsd-zfs  (1.8T)

=>        34  3907029101  da1  GPT  (1.8T)
          34          94       - free -  (47k)
         128     4194304    1  freebsd-swap  (2.0G)
     4194432  3902834703    2  freebsd-zfs  (1.8T)

=>        34  3907029101  da2  GPT  (1.8T)
          34          94       - free -  (47k)
         128     4194304    1  freebsd-swap  (2.0G)
     4194432  3902834703    2  freebsd-zfs  (1.8T)

=>        34  3907029101  da3  GPT  (1.8T)
          34          94       - free -  (47k)
         128     4194304    1  freebsd-swap  (2.0G)
     4194432  3902834703    2  freebsd-zfs  (1.8T)

=>        34  3907029101  da4  GPT  (1.8T)
          34          94       - free -  (47k)
         128     4194304    1  freebsd-swap  (2.0G)
     4194432  3902834703    2  freebsd-zfs  (1.8T)

=>        34  3907029101  da5  GPT  (1.8T)
          34          94       - free -  (47k)
         128     4194304    1  freebsd-swap  (2.0G)
     4194432  3902834703    2  freebsd-zfs  (1.8T)

=>        34  3907029101  da6  GPT  (1.8T)
          34          94       - free -  (47k)
         128     4194304    1  freebsd-swap  (2.0G)
     4194432  3902834703    2  freebsd-zfs  (1.8T)

=>        34  3907029101  da7  GPT  (1.8T)
          34          94       - free -  (47k)
         128     4194304    1  freebsd-swap  (2.0G)
     4194432  3902834703    2  freebsd-zfs  (1.8T)

=>      63  30883777  da8  MBR  (14G)
        63   1930257    1  freebsd  [active]  (942M)
   1930320        63       - free -  (31k)
   1930383   1930257    2  freebsd  (942M)
   3860640      3024    3  freebsd  (1.5M)
   3863664     41328    4  freebsd  (20M)
   3904992  26978848       - free -  (12G)

=>      0  1930257  da8s1  BSD  (942M)
        0       16         - free -  (8.0k)
       16  1930241      1  !0  (942M)

=>      0  1930257  da8s2  BSD  (942M)
        0       16         - free -  (8.0k)
       16  1930241      1  !0  (942M)

=>        34  2930277101  ada0  GPT  (1.4T)
          34          94        - free -  (47k)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  2926082703     2  freebsd-zfs  (1.4T)
 

lpittman

Dabbler
Joined
May 2, 2013
Messages
35
Might be a long shot, but have you considered a bad cable/connection?
 

raidflex

Guru
Joined
Mar 14, 2012
Messages
531
Might be a long shot, but have you considered a bad cable/connection?

It's possible. This is using a SAS connector, and I do not have a spare. I could order one, just to rule it out.

Is it possible this could be controller related, maybe the one particular port that the HDD is plugged into?
 

lpittman

Dabbler
Joined
May 2, 2013
Messages
35
I was going to suggest that next. Do you have the luxury of killing your volumes or is the box in production? I wouldn't suggest it on a production machine, but your next step would be to shuffle the connections (keeping track of course) and checking the result. When I was reading up on purchasing my LSI HBA I read a couple of posts where users had a single port that was not functioning *entirely* .... not sure what entirely meant but I believe it could have been similar to what you are talking about.
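
If you do end up shuffling connections, it helps to write down which physical drive sits where before touching anything. Below is a rough sketch, not a tested procedure for this box: `serial_from_smartctl` is a made-up helper name, and the `da0` to `da7` device list is an assumption based on the `camcontrol devlist` output earlier in the thread. Since the pool members are referenced by gptid, the label travels with the disk. If the same gptid keeps collecting checksum errors on a different port, suspect the drive; if the errors move to whichever disk now sits on the old port, suspect the cable, port, or backplane.

```shell
# Hypothetical helper: extract the serial number from `smartctl -i` output on stdin.
serial_from_smartctl() {
    awk -F': *' '/^Serial Number:/ { print $2 }'
}

# Assumed usage on the live system, before moving any cables
# (device names are an assumption; adjust to your own devlist):
#   for d in da0 da1 da2 da3 da4 da5 da6 da7; do
#       printf '%s %s\n' "$d" "$(smartctl -i /dev/$d | serial_from_smartctl)"
#   done
```

Saving that listing before and after the swap gives a record of which serial moved to which device node.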
 

raidflex

Guru
Joined
Mar 14, 2012
Messages
531
Unfortunately it is a production machine. Although I do have a backup, I would prefer to be on the safe side. There are movies I have ripped from my DVD/Blu-ray that I do not back up due to the expense of storage.

I was considering pulling the drive out and running some hard drive tests, but I would find it hard to believe the HDD I received from Seagate was not tested prior to shipping. Then again, anything is possible; it could even have been bumped around during shipping.
 

lpittman

Dabbler
Joined
May 2, 2013
Messages
35
Oh, you haven't tested the new drive yet? Definitely do that first. Yes it is unlikely you received another bad drive, but not impossible.
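
One way to test the drive without pulling it is an extended SMART self-test, which generally reads the whole surface rather than the quick checks a short test does. This is only a sketch under assumptions: the device name `/dev/da4` is the one identified later in the thread, and `extended_test_passed` is a made-up helper name that just looks for a clean "Extended offline" entry in the self-test log.

```shell
# Hypothetical helper: succeed if a self-test log (read from stdin) contains an
# extended test that completed without error.
extended_test_passed() {
    grep -q 'Extended offline.*Completed without error'
}

# Assumed usage on the live system (device name is an assumption):
#   smartctl -t long /dev/da4        # start the extended self-test in the background
#   smartctl -l selftest /dev/da4 | extended_test_passed && echo "extended test clean"
```

The test runs inside the drive, so the second command only makes sense once the drive reports the test as finished.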
 

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,403
Code:
smartctl -q noserial -a /dev/da4
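
For anyone following along, `/dev/da4` presumably comes from matching the gptid that shows checksum errors in the `zpool status` output (gptid/e91f0e06-3326-11e1-acd0-e06995ebe0de) against the `glabel status` listing, where it maps to da4p2. A small sketch of that lookup, with `gptid_to_dev` being a made-up helper name:

```shell
# Hypothetical helper: print the component (e.g. da4p2) for a given gptid label.
# Reads `glabel status` output on stdin, so it also works on saved text.
gptid_to_dev() {
    awk -v id="$1" '$1 == id { print $3 }'
}

# Assumed usage on the live system:
#   glabel status | gptid_to_dev gptid/e91f0e06-3326-11e1-acd0-e06995ebe0de
```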
 

raidflex

Guru
Joined
Mar 14, 2012
Messages
531
Code:
smartctl -q noserial -a /dev/da4


Looks like there are no errors.

Code:
[root@freenas] ~# smartctl -q noserial -a /dev/da4
smartctl 5.43 2012-06-30 r3573 [FreeBSD 8.3-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda Green (Adv. Format)
Device Model:     ST2000DL003-9VT166
Firmware Version: CC32
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Wed May  8 21:51:37 2013 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  612) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 335) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x30b7) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   105   099   006    Pre-fail  Always       -       9407136
  3 Spin_Up_Time            0x0003   070   070   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       101
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   081   060   030    Pre-fail  Always       -       163928206
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       13169
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       99
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   098   000    Old_age   Always       -       4295032845
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   075   069   045    Old_age   Always       -       25 (Min/Max 22/27)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       96
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       101
194 Temperature_Celsius     0x0022   025   040   000    Old_age   Always       -       25 (0 18 0 0 0)
195 Hardware_ECC_Recovered  0x001a   021   007   000    Old_age   Always       -       9407136
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       91521458123643
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       3474488223
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       2398625808

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     13160         -
# 2  Short offline       Completed without error       00%     13148         -
# 3  Short offline       Completed without error       00%     13136         -
# 4  Short offline       Completed without error       00%     13124         -
# 5  Short offline       Completed without error       00%     13112         -
# 6  Short offline       Completed without error       00%     13100         -
# 7  Short offline       Completed without error       00%     13088         -
# 8  Short offline       Completed without error       00%     13076         -
# 9  Short offline       Completed without error       00%     13064         -
#10  Short offline       Completed without error       00%     13052         -
#11  Short offline       Completed without error       00%     13040         -
#12  Short offline       Completed without error       00%     13028         -
#13  Short offline       Completed without error       00%     13016         -
#14  Short offline       Completed without error       00%     13004         -
#15  Short offline       Completed without error       00%     12992         -
#16  Short offline       Completed without error       00%     12980         -
#17  Short offline       Completed without error       00%     12968         -
#18  Short offline       Completed without error       00%     12956         -
#19  Short offline       Completed without error       00%     12944         -
#20  Short offline       Completed without error       00%     12932         -
#21  Short offline       Completed without error       00%     12920         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

 

raidflex

Guru
Joined
Mar 14, 2012
Messages
531

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,403
So does this basically mean it is a drive issue and not the controller?
:confused: I'd try this first.
Or maybe it could be just a bad SATA cable ...
Also, suggested above. It would have been interesting to see the SMART info for the original drive. In addition to the cable it could be an issue with the particular port.

Feel free to read HolyKiller's follow-up post in that thread.
 

lpittman

Dabbler
Joined
May 2, 2013
Messages
35
How'd ya make out?
 

raidflex

Guru
Joined
Mar 14, 2012
Messages
531
I ended up cleaning the connectors inside my hot swap bay, and since then I have not had an issue. I was able to successfully run a scrub, and it's been about a week now without any errors. I am still keeping an eye on it, and I did purchase another SAS cable just in case it starts acting up. But it looks like it may just have been some dust, which possibly got in there during the replacement of the old HDD.
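
For anyone checking a similar fix, the clear, scrub, and status sequence mentioned in this thread can be followed by a quick scan of the CKSUM column. This is a minimal sketch: the pool name `Data` is taken from the output in the first post, and `nonzero_cksum` is a made-up helper name that assumes the usual NAME / STATE / READ / WRITE / CKSUM column layout.

```shell
# Hypothetical helper: list devices whose CKSUM column is not 0.
# Reads `zpool status` output on stdin; only rows with numeric counters are checked.
nonzero_cksum() {
    awk 'NF >= 5 && $3 ~ /^[0-9]/ && $5 != "0" { print $1, $5 }'
}

# Assumed usage on the live system:
#   zpool clear Data                          # reset the error counters
#   zpool scrub Data                          # start a fresh scrub
#   zpool status -v Data | nonzero_cksum      # after the scrub finishes; no output means clean
```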
 

lpittman

Dabbler
Joined
May 2, 2013
Messages
35
Ah! Good to hear, man. Glad you're stable. Cheers.
 