Scrub shows '(repairing)', but no errors

Aaron C. de Bruyn · May 15, 2014

I'm curious why ZFS is repairing two drives without showing any read, write, or checksum errors.

The drives are definitely having problems--dmesg shows a bunch of ATA errors.

-A

Code:

  pool: tank
state: ONLINE
  scan: scrub in progress since Thu May 15 09:00:00 2014
        229G scanned out of 1.22T at 43.2M/s, 6h45m to go
        3.73M repaired, 18.24% done
config:
 
    NAME                                                STATE    READ WRITE CKSUM
    tank                                                ONLINE      0    0    0
      raidz1-0                                          ONLINE      0    0    0
        gptid/c6825711-b912-11e3-81a7-002590dc4ed0.eli  ONLINE      0    0    0  (repairing)
        gptid/c6cdf6d0-b912-11e3-81a7-002590dc4ed0.eli  ONLINE      0    0    0
        gptid/c71821cf-b912-11e3-81a7-002590dc4ed0.eli  ONLINE      0    0    0
      raidz1-1                                          ONLINE      0    0    0
        gptid/c7a7294e-b912-11e3-81a7-002590dc4ed0.eli  ONLINE      0    0    0
        gptid/c820a68a-b912-11e3-81a7-002590dc4ed0.eli  ONLINE      0    0    0  (repairing)
        gptid/c8ae5b9b-b912-11e3-81a7-002590dc4ed0.eli  ONLINE      0    0    0
    logs
      gptid/276ccffe-b913-11e3-81a7-002590dc4ed0.eli    ONLINE      0    0    0
    spares
      gptid/5b96f81c-b913-11e3-81a7-002590dc4ed0.eli    AVAIL 
 
errors: No known data errors

cyberjock · May 15, 2014

What's your hardware? Are you using some SATA/SAS/RAID controller that isn't recommended?

Aaron C. de Bruyn · May 15, 2014

It's a supermicro 2U chassis with eight drives plugged into the motherboard SATA ports.

I'm seeing messages like this in dmesg:

Code:

(ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 f8 30 52 83 40 3c 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: ATA Status Error
(ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
(ada0:ahcich0:0:0:0): RES: 41 40 e8 52 83 40 3c 00 00 00 00
(ada0:ahcich0:0:0:0): Retrying command

We have ordered two replacement drives.

I guess I just don't know when those counters are incremented.

When a totally different NAS started throwing errors today, the checksum column started counting up when the drive said '(repairing)':

Code:

(ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 f8 30 52 83 40 3c 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: ATA Status Error
(ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
(ada0:ahcich0:0:0:0): RES: 41 40 e8 52 83 40 3c 00 00 00 00
(ada0:ahcich0:0:0:0): Retrying command

One NAS (four drives) has been in production for over a year without issues, and the other NAS with 8 drives has been in production for about a month.

Code:

(ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 f8 30 52 83 40 3c 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: ATA Status Error
(ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
(ada0:ahcich0:0:0:0): RES: 41 40 e8 52 83 40 3c 00 00 00 00
(ada0:ahcich0:0:0:0): Retrying command

We have approximately 20 boxes total in production, and they are all identical hardware (either a 1U 4-bay SuperMicro chassis or a 2U 8-bay SuperMicro chassis). All of them have 32 GB RAM.

To me it appears we have a few failing drives--just curious why those counters aren't incrementing.

Aaron C. de Bruyn · May 15, 2014

Oops--one of those was a bad copy/paste.

One of the drives in our 2U 8-bay system is reporting this:

Code:

(da0:isci0:0:0:0): READ(10). CDB: 28 00 3c cb 9c 88 00 01 00 00 
(da0:isci0:0:0:0): CAM status: SCSI Status Error
(da0:isci0:0:0:0): SCSI status: Check Condition
(da0:isci0:0:0:0): SCSI sense: MEDIUM ERROR asc:ffffffff,ffffffff (Reserved ASC/ASCQ pair)
(da0:isci0:0:0:0): Info: 0x3ccb9d47
(da0:isci0:0:0:0): Retrying command (per sense data)
(da0:isci0:0:0:0): READ(10). CDB: 28 00 3c 95 62 40 00 00 08 00 
(da0:isci0:0:0:0): CAM status: CCB request terminated by the host
(da0:isci0:0:0:0): Retrying command

So bad drive, bad cable, or bad port, right?

But why wouldn't 'zpool status' show read/write/checksum errors?

Is it just that the drive keeps retrying and eventually reads/writes the data? So ZFS doesn't see any problem yet?

-A

cyberjock · May 15, 2014

I don't know... telling me the brand of your motherboard and chassis is pretty useless to me for diagnostic purposes. The fact that you have a problem with da0 and ada0 tells me you are on at least 2 different controllers, so saying "motherboard ports" isn't useful because I've got evidence there are at *least* 2 controllers involved in this.

My guess is the MEDIUM ERRORs are from your disks, and the STATUS ERRORs could be disks, cables, or controller issues. Since I don't know your hardware, the MEDIUM ERRORs could be causing problems that are spilling over to STATUS ERRORs. I don't know. I need to know more about your hardware before I can give good advice on how to best identify (or rule out) the problems.

Aaron C. de Bruyn · May 15, 2014

I agree it's a hardware issue. Possibly a bad drive or cables.

I don't need help troubleshooting the hardware, I'm simply curious why 'zpool status' says it's repairing a drive, but it doesn't list read/write/checksum errors. If there are zero errors, what is there to repair? I'm fairly certain that has nothing to do with the hardware. ;)
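For anyone who wants to watch for this on several boxes, the condition described here (a device marked '(repairing)' while READ, WRITE, and CKSUM are all zero) can be picked out of the 'zpool status' text. This is a minimal Python sketch, assuming the column layout shown in this thread; the function names `parse_vdevs` and `repairing_without_errors` are hypothetical, and the sample is trimmed from the output posted above.

```python
SAMPLE_STATUS = """\
  pool: tank
 state: ONLINE
config:

    NAME                                                STATE    READ WRITE CKSUM
    tank                                                ONLINE      0    0    0
      raidz1-0                                          ONLINE      0    0    0
        gptid/c6825711-b912-11e3-81a7-002590dc4ed0.eli  ONLINE      0    0    0  (repairing)
        gptid/c6cdf6d0-b912-11e3-81a7-002590dc4ed0.eli  ONLINE      0    0    0
    spares
      gptid/5b96f81c-b913-11e3-81a7-002590dc4ed0.eli    AVAIL

errors: No known data errors
"""


def parse_vdevs(status_text):
    """Return (name, state, read, write, cksum, note) for each config row.

    Assumption: counters are plain integers, as in this thread. Rows whose
    counters are abbreviated or missing (e.g. spares) are skipped.
    """
    rows = []
    in_config = False
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("config:"):
            in_config = True
            continue
        if stripped.startswith("errors:"):
            in_config = False
        if not in_config or not stripped or stripped.startswith("NAME"):
            continue
        parts = stripped.split()
        # A countable row has at least: name, state, read, write, cksum.
        if len(parts) >= 5 and all(p.isdigit() for p in parts[2:5]):
            note = " ".join(parts[5:])  # trailing text such as "(repairing)"
            rows.append((parts[0], parts[1], int(parts[2]),
                         int(parts[3]), int(parts[4]), note))
    return rows


def repairing_without_errors(rows):
    """Names of devices flagged (repairing) with all three counters at zero."""
    return [r[0] for r in rows
            if "(repairing)" in r[5] and r[2:5] == (0, 0, 0)]


if __name__ == "__main__":
    for name in repairing_without_errors(parse_vdevs(SAMPLE_STATUS)):
        print(name)
```

It only reports the condition; it says nothing about why the counters stay at zero.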

The 2U servers we use are: http://www.supermicro.com/products/system/2U/5027/SYS-5027R-WRF.cfm
The 1U servers we use are: http://www.supermicro.com/products/system/1U/5018/SYS-5018D-MTLN4F.cfm

We don't turn on the Intel RAID in the BIOS.

-A

cyberjock · May 15, 2014

Well, those both look like Intel controllers. So they *should* work fine. You may just have one failing disk and one with a cable problem. You should check that the connections are tight. I'd check out the SMART data for your disks (smartctl -a /dev/XXX) and see what they all say. If you don't know how to interpret the info, post them all in the forum and I'll let you know what they all say.

Aaron C. de Bruyn · May 15, 2014

Oops--yeah, I should have posted the output of smartctl earlier.

The system that shows a bunch of checksum errors:

Code:

  pool: tank
state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
  see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 1.79M in 3h48m with 0 errors on Thu May 15 12:47:56 2014
config:
 
    NAME                                                STATE    READ WRITE CKSUM
    tank                                                ONLINE      0    0    0
      raidz1-0                                          ONLINE      0    0    0
        gptid/77e6fbf4-d19b-11e3-ad30-002590d4774b.eli  ONLINE      0    0  378
        gptid/76a74ea1-1197-11e3-8855-002590d4774b.eli  ONLINE      0    0    0
        gptid/b53f0f11-1108-11e3-9e52-002590d4774b.eli  ONLINE      0    0    0
        gptid/b5884c45-1108-11e3-9e52-002590d4774b.eli  ONLINE      0    0    0
 
errors: No known data errors
uswbjofnas01#

...doesn't show anything wrong when looking at the SMART status:

Code:

=== START OF INFORMATION SECTION ===
Model Family:    Western Digital RE4
Device Model:    WDC WD1003FBYX-01Y7B1
Serial Number:   
LU WWN Device Id: 5 0014ee 25e753a91
Firmware Version: 01.01V02
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:  ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 3.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu May 15 14:44:24 2014 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
 
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
 
General SMART Values:
Offline data collection status:  (0x80)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (  0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:        (16860) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (  2) minutes.
Extended self-test routine
recommended polling time:      ( 166) minutes.
Conveyance self-test routine
recommended polling time:      (  5) minutes.
SCT capabilities:            (0x303f)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.
 
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate    0x002f  197  174  051    Pre-fail  Always      -      1972
  3 Spin_Up_Time            0x0027  100  253  021    Pre-fail  Always      -      0
  4 Start_Stop_Count        0x0032  100  100  000    Old_age  Always      -      1
  5 Reallocated_Sector_Ct  0x0033  153  153  140    Pre-fail  Always      -      963
  7 Seek_Error_Rate        0x002e  200  200  000    Old_age  Always      -      0
  9 Power_On_Hours          0x0032  100  100  000    Old_age  Always      -      332
10 Spin_Retry_Count        0x0032  100  253  000    Old_age  Always      -      0
11 Calibration_Retry_Count 0x0032  100  253  000    Old_age  Always      -      0
12 Power_Cycle_Count      0x0032  100  100  000    Old_age  Always      -      1
192 Power-Off_Retract_Count 0x0032  200  200  000    Old_age  Always      -      0
193 Load_Cycle_Count        0x0032  200  200  000    Old_age  Always      -      0
194 Temperature_Celsius    0x0022  106  105  000    Old_age  Always      -      41
196 Reallocated_Event_Count 0x0032  154  154  000    Old_age  Always      -      46
197 Current_Pending_Sector  0x0032  200  200  000    Old_age  Always      -      29
198 Offline_Uncorrectable  0x0030  100  253  000    Old_age  Offline      -      0
199 UDMA_CRC_Error_Count    0x0032  200  200  000    Old_age  Always      -      0
200 Multi_Zone_Error_Rate  0x0008  100  253  000    Old_age  Offline      -      0
 
SMART Error Log Version: 1
No Errors Logged
 
SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]
 
 
SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 
uswbjofnas01#

The 8-bay NAS has two drives failing:

Code:

  pool: tank
state: ONLINE
  scan: scrub in progress since Thu May 15 09:00:00 2014
        1.03T scanned out of 1.22T at 51.9M/s, 1h6m to go
        10.4M repaired, 83.89% done
config:
 
    NAME                                                STATE    READ WRITE CKSUM
    tank                                                ONLINE      0    0    0
      raidz1-0                                          ONLINE      0    0    0
        gptid/c6825711-b912-11e3-81a7-002590dc4ed0.eli  ONLINE      0    0    0  (repairing)
        gptid/c6cdf6d0-b912-11e3-81a7-002590dc4ed0.eli  ONLINE      0    0    0
        gptid/c71821cf-b912-11e3-81a7-002590dc4ed0.eli  ONLINE      0    0    0
      raidz1-1                                          ONLINE      0    0    0
        gptid/c7a7294e-b912-11e3-81a7-002590dc4ed0.eli  ONLINE      0    0    0
        gptid/c820a68a-b912-11e3-81a7-002590dc4ed0.eli  ONLINE      0    0    0  (repairing)
        gptid/c8ae5b9b-b912-11e3-81a7-002590dc4ed0.eli  ONLINE      0    0    0
    logs
      gptid/276ccffe-b913-11e3-81a7-002590dc4ed0.eli    ONLINE      0    0    0
    spares
      gptid/5b96f81c-b913-11e3-81a7-002590dc4ed0.eli    AVAIL
 
errors: No known data errors
ushqzofnas01#

...only shows one drive reporting errors in dmesg, and SMART appears good:

Code:

ushqzofnas01# smartctl --all /dev/ada0
smartctl 6.2 2013-07-26 r3841 [FreeBSD 9.2-RELEASE-p3 amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
 
=== START OF INFORMATION SECTION ===
Model Family:    Western Digital RE4
Device Model:    WDC WD1003FBYX-01Y7B1
Serial Number:    
LU WWN Device Id: 5 0014ee 2b3afbd2e
Firmware Version: 01.01V02
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:  ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 3.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu May 15 14:47:10 2014 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
 
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
 
General SMART Values:
Offline data collection status:  (0x80)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (  0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:        (16800) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (  2) minutes.
Extended self-test routine
recommended polling time:      ( 165) minutes.
Conveyance self-test routine
recommended polling time:      (  5) minutes.
SCT capabilities:            (0x303f)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.
 
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate    0x002f  200  200  051    Pre-fail  Always      -      1559
  3 Spin_Up_Time            0x0027  100  253  021    Pre-fail  Always      -      0
  4 Start_Stop_Count        0x0032  100  100  000    Old_age  Always      -      3
  5 Reallocated_Sector_Ct  0x0033  195  195  140    Pre-fail  Always      -      114
  7 Seek_Error_Rate        0x002e  200  200  000    Old_age  Always      -      0
  9 Power_On_Hours          0x0032  099  099  000    Old_age  Always      -      1039
10 Spin_Retry_Count        0x0032  100  253  000    Old_age  Always      -      0
11 Calibration_Retry_Count 0x0032  100  253  000    Old_age  Always      -      0
12 Power_Cycle_Count      0x0032  100  100  000    Old_age  Always      -      3
192 Power-Off_Retract_Count 0x0032  200  200  000    Old_age  Always      -      1
193 Load_Cycle_Count        0x0032  200  200  000    Old_age  Always      -      1
194 Temperature_Celsius    0x0022  106  104  000    Old_age  Always      -      41
196 Reallocated_Event_Count 0x0032  196  196  000    Old_age  Always      -      4
197 Current_Pending_Sector  0x0032  200  200  000    Old_age  Always      -      68
198 Offline_Uncorrectable  0x0030  100  253  000    Old_age  Offline      -      0
199 UDMA_CRC_Error_Count    0x0032  200  200  000    Old_age  Always      -      0
200 Multi_Zone_Error_Rate  0x0008  100  253  000    Old_age  Offline      -      0
 
SMART Error Log Version: 1
No Errors Logged
 
SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]
 
 
SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 
ushqzofnas01#

Meh. Regardless we have replacement drives and will be swapping out the bad ones over the course of the next 48 hours.

-A

cyberjock · May 15, 2014

Well, the top drive has non-zero for parameters 1, 5, 196 and 197. The temp is 41, which is above the maximum 40C. That drive is definitely bad though.

The bottom drive has non-zero for parameters 1, 5, 196 and 197. The drive is also above the maximum 40C. That drive should be replaced too.

Additionally you've never done any SMART tests, so you are a bad boy. Spank yourself! < ---------- really bad not to do this!

I also know you haven't set up any SMART monitoring. If you had, you'd have gotten so many emails about failing disks you'd be asking how to turn them off. <------ really really bad to not do this!

Both of those drives are relatively new and might be under warranty.

Aaron C. de Bruyn · May 15, 2014

The RE4 specs say 5 to 55 C is acceptable.

http://www.wdc.com/wdproducts/library/SpecSheet/ENG/2879-701338.pdf

But yeah, they are in closets and the customer doesn't want to pay for AC.

I noticed the tests hadn't been run. For some reason I had it in my head that FreeNAS automatically ran tests every now and then. I probably confused it with the monthly scrub. I'll get on that. (Better check out the API so I can easily deploy a job to a bunch of boxes and update Icinga to catch SMART results that say 'passed' but obviously have lots of reallocations.)

I totally scanned down the 'When Failed' column and didn't see any failures and called it good. I wonder why smartctl isn't showing them as failed. They are obviously over the threshold. My linux boxen usually show me a failure status.

Does ZFS pay attention to the SMART values on drives and/or handle things like reallocation? I was under the impression after reading the docs that it lets the drive controller handle that stuff and it only pays attention to the data integrity (regardless of the controller relocating it).

Thanks for the kick in the pants.

-A

cyberjock · May 15, 2014

Yeah.. and if you read the Google white paper on hard drives there's a definite correlation between temperature and drive lifespan.

The "passed" is somewhat a misnomer. The "passed" is nothing more than a function on whether your SMART tests have failed or not. If you don't do any then the drive can't fail the SMART test. If it hasn't failed a SMART test it'll always show passed. Vicious little circle eh?

The threshold is also a reverse value. It slowly lowers to the threshold. When your value < threshold then you get a failed warning. See all the ones that are zero? Guess what.. those will *never* trigger a failure since they can't be < 0. So you have to be able to read and interpret this stuff for yourself and not trust the manufacturer. They're interested in selling disks that you won't RMA. So not surprisingly they don't use most of the parameters anymore because it saves them money. ;)
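The "read it yourself" advice can be turned into a small check for a monitoring job. This is a rough Python sketch, assuming the attribute table layout from the smartctl output posted earlier in this thread; `parse_attributes`, `suspicious`, and the `WATCHED_IDS` set are hypothetical names chosen for illustration, and which attributes to watch is a judgment call, not something smartctl defines.

```python
import re

# Assumption for this sketch: raw counts on these IDs should be zero on a
# healthy disk (reallocated sectors/events, pending and uncorrectable sectors).
WATCHED_IDS = {5, 196, 197, 198}

# ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(\d+)"
)

# Rows copied from the first drive's output above.
SAMPLE = """\
  1 Raw_Read_Error_Rate    0x002f  197  174  051    Pre-fail  Always      -      1972
  5 Reallocated_Sector_Ct  0x0033  153  153  140    Pre-fail  Always      -      963
194 Temperature_Celsius    0x0022  106  105  000    Old_age  Always      -      41
196 Reallocated_Event_Count 0x0032  154  154  000    Old_age  Always      -      46
197 Current_Pending_Sector  0x0032  200  200  000    Old_age  Always      -      29
198 Offline_Uncorrectable  0x0030  100  253  000    Old_age  Offline      -      0
"""


def parse_attributes(text):
    """Parse smartctl attribute rows into dictionaries."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append({
                "id": int(m.group(1)),
                "name": m.group(2),
                "value": int(m.group(3)),
                "worst": int(m.group(4)),
                "thresh": int(m.group(5)),
                "raw": int(m.group(9)),
            })
    return rows


def suspicious(rows):
    """Rows worth a look even when WHEN_FAILED shows '-'."""
    flagged = []
    for r in rows:
        # A threshold of 000 can never trip, so only test non-zero thresholds.
        at_threshold = r["thresh"] > 0 and r["value"] <= r["thresh"]
        # Independently of thresholds, flag non-zero raw counts on watched IDs.
        nonzero_raw = r["id"] in WATCHED_IDS and r["raw"] > 0
        if at_threshold or nonzero_raw:
            flagged.append(r)
    return flagged


if __name__ == "__main__":
    for r in suspicious(parse_attributes(SAMPLE)):
        print(r["id"], r["name"], "raw =", r["raw"])
```

On the sample rows this flags attributes 5, 196, and 197, none of which has reached its threshold. Raw_Read_Error_Rate is left out of the watched set because its raw value is vendor-specific; add it if it is meaningful for your drives.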

SirMaster · May 22, 2014

I saw this issue on my box too.

For me it seems to have been a faulty SATA cable.

The SATA cable was causing SMART errors on my disks that showed themselves as (attribute 197) Current_Pending_Sector.

I replaced the disk in the same slot, and sure enough saw the same behavior on the new disk. I then replaced the SATA cable, and the errors stopped.

So you can't always trust SMART attributes. My disk got 90/100 Report_Uncorrect because of a bad SATA cable, but the disk is still fine.

This is a ticket I had created.
https://github.com/zfsonlinux/zfs/issues/2246

Aaron C. de Bruyn · May 22, 2014

I thought the SMART tests were run directly on the drive controller.

SirMaster · May 22, 2014

Yes, SMART short, long and conveyance tests are run internally on the drive, but the SMART attributes listed above that section in smartctl are what log errors during any disk activity, including normal external disk usage.
