My 2-drive backup mirror zvol dropped out

Status
Not open for further replies.

scurrier

Patron
Joined
Jan 2, 2014
Messages
297
I awoke this morning to a few SMART emails and an email saying my backup volume was degraded. I checked the logs and I am seeing a lot of this:
Code:
Jun  1 05:57:43 thumper kernel: (ada0:ahcich0:0:0:0): CAM status: Uncorrectable parity/CRC error
Jun  1 05:57:43 thumper kernel: (ada0:ahcich0:0:0:0): Retrying command
Jun  1 05:57:43 thumper kernel: (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 50 ec b4 40 64 00 00 01 00 00
Jun  1 05:57:43 thumper kernel: (ada0:ahcich0:0:0:0): CAM status: Uncorrectable parity/CRC error

These messages repeated for at least 6 minutes. Then...
Code:
Jun  1 06:04:23 thumper kernel: ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Jun  1 06:04:53 thumper kernel: ahcich0: Timeout on slot 18 port 0
Jun  1 06:04:53 thumper kernel: ahcich0: is 00000000 cs 00040000 ss 00000000 rs 00040000 tfd 80 serr 00080000 cmd 0004d217
Jun  1 06:04:53 thumper kernel: (aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Jun  1 06:04:53 thumper kernel: (aprobe0:ahcich0:0:0:0): CAM status: Command timeout
Jun  1 06:04:53 thumper kernel: (aprobe0:ahcich0:0:0:0): Error 5, Retry was blocked
Jun  1 06:05:24 thumper kernel: ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Jun  1 06:05:54 thumper kernel: ahcich0: Timeout on slot 18 port 0
Jun  1 06:05:54 thumper kernel: ahcich0: is 00000000 cs 00040000 ss 00000000 rs 00040000 tfd 80 serr 00080000 cmd 0004d217
Jun  1 06:05:54 thumper kernel: (aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Jun  1 06:05:54 thumper kernel: (aprobe0:ahcich0:0:0:0): CAM status: Command timeout
Jun  1 06:05:54 thumper kernel: (aprobe0:ahcich0:0:0:0): Error 5, Retry was blocked
Jun  1 06:05:54 thumper kernel: ada0 at ahcich0 bus 0 scbus1 target 0 lun 0
Jun  1 06:05:54 thumper kernel: ada0: <ST4000VN000-1H4168 SC43> s/n Z3015ENK detached
Jun  1 06:06:26 thumper kernel: ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Jun  1 06:06:45 thumper kernel: ahcich0: Timeout on slot 18 port 0
Jun  1 06:06:45 thumper kernel: ahcich0: is 00000000 cs 00040000 ss 00000000 rs 00040000 tfd 80 serr 00080000 cmd 0004d217
Jun  1 06:06:45 thumper kernel: (ada0:ahcich0:0:0:0): SETFEATURES ENABLE RCACHE. ACB: ef aa 00 00 00 40 00 00 00 00 00 00
Jun  1 06:06:45 thumper kernel: (ada0:ahcich0:0:0:0): CAM status: Unconditionally Re-queue Request
Jun  1 06:06:45 thumper kernel: (ada0:ahcich0:0:0:0): Error 5, Periph was invalidated
Jun  1 06:06:47 thumper smartd[5540]: Device: /dev/ada0, failed to read SMART Attribute Data
Jun  1 06:07:26 thumper kernel: ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Jun  1 06:07:38 thumper kernel: ahcich0: Poll timeout on slot 18 port 0
Jun  1 06:07:38 thumper kernel: ahcich0: is 00000000 cs 00040000 ss 00000000 rs 00040000 tfd 80 serr 00080000 cmd 0004d217
Jun  1 06:07:38 thumper kernel: (aprobe0:ahcich0:0:0:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
Jun  1 06:07:38 thumper kernel: (aprobe0:ahcich0:0:0:0): CAM status: Command timeout
Jun  1 06:07:38 thumper kernel: (aprobe0:ahcich0:0:0:0): Error 5, Retries exhausted
Jun  1 06:08:08 thumper kernel: ahcich0: Timeout on slot 18 port 0
Jun  1 06:08:08 thumper kernel: ahcich0: is 00000000 cs 000c0000 ss 000c0000 rs 000c0000 tfd 80 serr 00080000 cmd 0004d217
Jun  1 06:08:08 thumper kernel: (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 08 8b 55 40 69 00 00 01 00 00
Jun  1 06:08:08 thumper kernel: (ada0:ahcich0:0:0:0): CAM status: Command timeout
Jun  1 06:08:08 thumper kernel: (ada0:ahcich0:0:0:0): Error 5, Periph was invalidated
Jun  1 06:08:08 thumper kernel: (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 08 8c 55 40 69 00 00 01 00 00
Jun  1 06:08:08 thumper kernel: (ada0:ahcich0:0:0:0): CAM status: Unconditionally Re-queue Request
Jun  1 06:08:08 thumper kernel: (ada0:ahcich0:0:0:0): Error 5, Periph was invalidated
Jun  1 06:09:00 thumper kernel: ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Jun  1 06:09:00 thumper kernel: ahcich0: Poll timeout on slot 19 port 0
Jun  1 06:09:00 thumper kernel: ahcich0: is 00000000 cs 00080000 ss 00000000 rs 00080000 tfd 80 serr 00080000 cmd 0004d317
Jun  1 06:09:00 thumper kernel: (aprobe0:ahcich0:0:0:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
Jun  1 06:09:00 thumper kernel: (aprobe0:ahcich0:0:0:0): CAM status: Command timeout
Jun  1 06:09:00 thumper kernel: (aprobe0:ahcich0:0:0:0): Error 5, Retries exhausted


So it looks like something was wrong with ada0. But both ada0 and ada1 (the other drive in the mirror) are missing from camcontrol devlist:
Code:
[root@thumper] ~# camcontrol devlist
<ATA ST4000VN000-1H41 SC43>        at scbus0 target 0 lun 0 (da0,pass0)
<ATA ST4000VN000-1H41 SC43>        at scbus0 target 1 lun 0 (da1,pass1)
<ATA ST4000VN000-1H41 SC43>        at scbus0 target 2 lun 0 (da2,pass2)
<ATA ST4000VN000-1H41 SC43>        at scbus0 target 3 lun 0 (da3,pass3)
<ATA WDC WD2500AAJS-7 1B02>        at scbus0 target 4 lun 0 (da4,pass4)
<ATA Maxtor 6V300F0 1900>          at scbus0 target 5 lun 0 (da5,pass5)
<ATA ST2000DL003-9VT1 CC32>        at scbus0 target 6 lun 0 (da6,pass6)
<ATA ST4000VN000-1H41 SC43>        at scbus0 target 7 lun 0 (da7,pass7)
<ST31500541AS CC35>                at scbus3 target 0 lun 0 (ada2,pass10)
<ST31500541AS CC35>                at scbus4 target 0 lun 0 (ada3,pass11)
<ST31500541AS CC35>                at scbus5 target 0 lun 0 (ada4,pass12)
<ST31500541AS CC35>                at scbus6 target 0 lun 0 (ada5,pass13)
<SanDisk Cruzer 1.27>              at scbus8 target 0 lun 0 (pass14,da8)


I am suspecting a bad drive, but I've had issues that might also indicate cables. What is safe to do? Should I reboot? I've already backed up some critical data from my array that was previously backing up to this array.

Thanks.
 

scurrier

Patron
Joined
Jan 2, 2014
Messages
297
Here's the emails that I got, in reverse chronological order. First, ada1 reports SMART issues. Then the volume degrades. Then ada0 reports issues. I couldn't get the volume degrade email to show in a view that also showed the ada1 and ada0 snippets from the email, but it happened 3 minutes after the SMART ada1 emails stopped. Then, 11 minutes later, the ada0 emails came.
2014-06-02 23_55_12-Search results - shaun.currier@gmail.com - Gmail - Internet Explorer.png


Changing gears... if I do zpool status on my zvol that dropped out, I get this:
Code:
[root@thumper] /# zpool status backupone
  pool: backupone
state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
  see: http://illumos.org/msg/ZFS-8000-JQ
  scan: scrub in progress since Sun Jun  1 03:32:01 2014
        815G scanned out of 1.49T at 5.20M/s, 38h53m to go
        0 repaired, 53.41% done
config:
 
        NAME                      STATE    READ WRITE CKSUM
        backupone                UNAVAIL    43    23    0
          mirror-0                UNAVAIL    91    20    21
            16577047649382721991  REMOVED      0    0    0  was /dev/gptid/c5f97ea2-c650-11e3-a8b9-002590f06808
            4011128556085398247  REMOVED      0    0    0  was /dev/gptid/c64e0866-c650-11e3-a8b9-002590f06808
 
errors: 16 data errors, use '-v' for a list


My other zvol is still running just fine. That zvol is on another controller, but other drives attached to the controller hosting the degraded vol appear to be ok.

My best guess here is that the something at a lower level than zfs dropped the drives and that's why zfs says they are removed. Based on the message log, it looks like they had some kind of timeout or something.

Can I reboot and see if I can get some SMART data off of them?
 

scurrier

Patron
Joined
Jan 2, 2014
Messages
297
Here's the output of zpool status -v:
Code:
[root@thumper] /# zpool status -v backupone
  pool: backupone
state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
  see: http://illumos.org/msg/ZFS-8000-JQ
  scan: scrub in progress since Sun Jun  1 03:32:01 2014
        815G scanned out of 1.49T at 5.18M/s, 39h2m to go
        0 repaired, 53.41% done
config:
 
        NAME                      STATE    READ WRITE CKSUM
        backupone                UNAVAIL    43    23    0
          mirror-0                UNAVAIL    91    20    21
            16577047649382721991  REMOVED      0    0    0  was /dev/gptid/c5f97ea2-c650-11e3-a8b9-002590f06808
            4011128556085398247  REMOVED      0    0    0  was /dev/gptid/c64e0866-c650-11e3-a8b9-002590f06808
 
errors: Permanent errors have been detected in the following files:
 
        <metadata>:<0x0>
        <metadata>:<0x1>
        <metadata>:<0x1b>
        <metadata>:<0x1d>
        <metadata>:<0x73>
        <metadata>:<0x74>
        <metadata>:<0x78>
        <metadata>:<0xb4>
        <metadata>:<0xb5>
        <metadata>:<0xb6>
        backupone:<0x4>
        backupone/hookupbackup:<0x0>
        backupone/hookupbackup:<0x483a5>
        backupone/hookupbackup:<0x483a6>
        backupone/hookupbackup:<0xfffffffffffffffe>
        backupone/hookupbackup:<0xffffffffffffffff>


One thing I forgot to mention is that it was scrubbing when this happened. Nice.

I think I need to run "zpool clear" after making sure that the drives are connected. But does "connected" mean that the system sees them in camcontrol devlist? Or just physically attached? I don't want to do something irreversible here and I want to learn.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Well, you appear to have some kind of catastrophic failure. Not sure what exactly went on though. Can you post the SMART raw data on your disks?(smartctl -a /dev/daX)
 

scurrier

Patron
Joined
Jan 2, 2014
Messages
297
Can't access smart on the disks. It's as if they are gone. I would probably need to reboot to see them again.

-------------
Welcome to FreeNAS
[root@thumper] ~# smartctl -a /dev/ada0
smartctl 6.2 2013-07-26 r3841 [FreeBSD 9.2-RELEASE-p4 amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

/dev/ada0: No such file or directory
Please specify device type with the -d option.

Use smartctl -h to get a usage summary
-------------------
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
You will need to reboot. In fact, you will probably need to cold boot your box...
 

scurrier

Patron
Joined
Jan 2, 2014
Messages
297
OK. I will try that tomorrow since I need to hit the sack. Thanks for your help!
 

scurrier

Patron
Joined
Jan 2, 2014
Messages
297
Here's "disk 0":
Code:
[root@thumper] ~# smartctl -a -q noserial /dev/ada0
smartctl 6.2 2013-07-26 r3841 [FreeBSD 9.2-RELEASE-p4 amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
 
=== START OF INFORMATION SECTION ===
Device Model:    ST4000VN000-1H4168
Firmware Version: SC43
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:    512 bytes logical, 4096 bytes physical
Rotation Rate:    5900 rpm
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:  ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat Jun  7 15:08:56 2014 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
 
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
 
General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (  0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  97) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (  1) minutes.
Extended self-test routine
recommended polling time:        ( 523) minutes.
Conveyance self-test routine
recommended polling time:        (  2) minutes.
SCT capabilities:              (0x10bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.
 
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate    0x000f  117  100  006    Pre-fail  Always      -      119437832
  3 Spin_Up_Time            0x0003  091  091  000    Pre-fail  Always      -      0
  4 Start_Stop_Count        0x0032  100  100  020    Old_age  Always      -      8
  5 Reallocated_Sector_Ct  0x0033  100  100  010    Pre-fail  Always      -      0
  7 Seek_Error_Rate        0x000f  073  060  030    Pre-fail  Always      -      24003694
  9 Power_On_Hours          0x0032  098  098  000    Old_age  Always      -      1917
10 Spin_Retry_Count        0x0013  100  100  097    Pre-fail  Always      -      0
12 Power_Cycle_Count      0x0032  100  100  020    Old_age  Always      -      8
184 End-to-End_Error        0x0032  100  100  099    Old_age  Always      -      0
187 Reported_Uncorrect      0x0032  100  100  000    Old_age  Always      -      0
188 Command_Timeout        0x0032  100  071  000    Old_age  Always      -      1127
189 High_Fly_Writes        0x003a  001  001  000    Old_age  Always      -      382
190 Airflow_Temperature_Cel 0x0022  080  074  045    Old_age  Always      -      20 (Min/Max 18/21)
191 G-Sense_Error_Rate      0x0032  100  100  000    Old_age  Always      -      0
192 Power-Off_Retract_Count 0x0032  100  100  000    Old_age  Always      -      4
193 Load_Cycle_Count        0x0032  100  100  000    Old_age  Always      -      8
194 Temperature_Celsius    0x0022  020  040  000    Old_age  Always      -      20 (0 16 0 0 0)
197 Current_Pending_Sector  0x0012  100  100  000    Old_age  Always      -      0
198 Offline_Uncorrectable  0x0010  100  100  000    Old_age  Offline      -      0
199 UDMA_CRC_Error_Count    0x003e  200  200  000    Old_age  Always      -      2296
 
SMART Error Log Version: 1
No Errors Logged
 
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error      00%      1836        -
# 2  Short offline      Completed without error      00%      1677        -
# 3  Extended offline    Completed without error      00%      1517        -
# 4  Short offline      Completed without error      00%      1341        -
# 5  Short offline      Completed without error      00%      1173        -
# 6  Extended offline    Completed without error      00%        83        -
# 7  Short offline      Completed without error      00%        1        -
# 8  Conveyance offline  Completed without error      00%        1        -
 
SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.




Here's "disk 1":
Code:
[root@thumper] ~# smartctl -a -q noserial /dev/ada0
smartctl 6.2 2013-07-26 r3841 [FreeBSD 9.2-RELEASE-p4 amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
 
=== START OF INFORMATION SECTION ===
Device Model:    ST4000VN000-1H4168
Firmware Version: SC43
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:    512 bytes logical, 4096 bytes physical
Rotation Rate:    5900 rpm
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:  ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat Jun  7 15:10:50 2014 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
 
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
 
General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (  0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  97) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (  1) minutes.
Extended self-test routine
recommended polling time:        ( 523) minutes.
Conveyance self-test routine
recommended polling time:        (  2) minutes.
SCT capabilities:              (0x10bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.
 
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate    0x000f  117  100  006    Pre-fail  Always      -      119437832
  3 Spin_Up_Time            0x0003  091  091  000    Pre-fail  Always      -      0
  4 Start_Stop_Count        0x0032  100  100  020    Old_age  Always      -      8
  5 Reallocated_Sector_Ct  0x0033  100  100  010    Pre-fail  Always      -      0
  7 Seek_Error_Rate        0x000f  073  060  030    Pre-fail  Always      -      24003694
  9 Power_On_Hours          0x0032  098  098  000    Old_age  Always      -      1917
10 Spin_Retry_Count        0x0013  100  100  097    Pre-fail  Always      -      0
12 Power_Cycle_Count      0x0032  100  100  020    Old_age  Always      -      8
184 End-to-End_Error        0x0032  100  100  099    Old_age  Always      -      0
187 Reported_Uncorrect      0x0032  100  100  000    Old_age  Always      -      0
188 Command_Timeout        0x0032  100  071  000    Old_age  Always      -      1127
189 High_Fly_Writes        0x003a  001  001  000    Old_age  Always      -      382
190 Airflow_Temperature_Cel 0x0022  080  074  045    Old_age  Always      -      20 (Min/Max 18/21)
191 G-Sense_Error_Rate      0x0032  100  100  000    Old_age  Always      -      0
192 Power-Off_Retract_Count 0x0032  100  100  000    Old_age  Always      -      4
193 Load_Cycle_Count        0x0032  100  100  000    Old_age  Always      -      8
194 Temperature_Celsius    0x0022  020  040  000    Old_age  Always      -      20 (0 16 0 0 0)
197 Current_Pending_Sector  0x0012  100  100  000    Old_age  Always      -      0
198 Offline_Uncorrectable  0x0010  100  100  000    Old_age  Offline      -      0
199 UDMA_CRC_Error_Count    0x003e  200  200  000    Old_age  Always      -      2296
 
SMART Error Log Version: 1
No Errors Logged
 
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error      00%      1836        -
# 2  Short offline      Completed without error      00%      1677        -
# 3  Extended offline    Completed without error      00%      1517        -
# 4  Short offline      Completed without error      00%      1341        -
# 5  Short offline      Completed without error      00%      1173        -
# 6  Extended offline    Completed without error      00%        83        -
# 7  Short offline      Completed without error      00%        1        -
# 8  Conveyance offline  Completed without error      00%        1        -
 
SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


I'm noting very high UDMA_CRC_Error_Count and high Command_Timeout. I think I want to replace all my sata cables.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I'm noting very high UDMA_CRC_Error_Count and high Command_Timeout. I think I want to replace all my sata cables.

I agree! Disks look healthy though...
 
Status
Not open for further replies.
Top