Noticing drive errors and degraded zpool

Status
Not open for further replies.

oversteer80

Cadet
Joined
Dec 22, 2013
Messages
2
I am running FreeNAS 9.2.1.5 on an HP Microserver N54L, with 4x WD Green 3TB drives in RAID-Z2.

Recently I have been alerted that the zpool is degraded as a drive has been removed:

Code:
[root@freenas] ~# zpool status
  pool: files
state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: resilvered 227M in 0h3m with 0 errors on Fri May  2 07:30:42 2014
config:
 
        NAME                                            STATE     READ WRITE CKSUM
        files                                           DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/add09ba6-65de-11e3-8169-38eaa7a92520  ONLINE       0     0     0
            gptid/aed59706-65de-11e3-8169-38eaa7a92520  ONLINE       0     0     0
            3495928376370431422                         REMOVED      0     0     0  was /dev/gptid/afdde02c-65de-11e3-8169-38eaa7a92520
            gptid/b0e42727-65de-11e3-8169-38eaa7a92520  ONLINE       0     0     0
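
For anyone who wants to spot this state without reading the whole table, here is a minimal sketch that lists every pool member whose STATE is not ONLINE. The function name `unhealthy_vdevs` is made up for illustration, and it assumes the config table layout shown above (a NAME/STATE header row, ended by a blank line).

```shell
#!/bin/sh
# Hypothetical helper: reads `zpool status` text on stdin and prints
# "<name> <state>" for every row of the config table that is not ONLINE.
unhealthy_vdevs() {
    awk '
        # The header row marks the start of the config table.
        $1 == "NAME" && $2 == "STATE" { in_config = 1; next }
        # A blank line ends the table.
        in_config && NF == 0 { in_config = 0 }
        # Inside the table, column 2 is the device state.
        in_config && NF >= 2 && $2 != "ONLINE" { print $1, $2 }
    '
}
```

Usage would be something like `zpool status files | unhealthy_vdevs`. With the output above it should list the pool, the raidz2-0 vdev, and the REMOVED member.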


Checking the logs I see this:
Code:
May  4 09:23:56 freenas kernel: ahcich2: Timeout on slot 17 port 0
May  4 09:23:56 freenas kernel: ahcich2: is 00000000 cs 00020000 ss 00000000 rs 00020000 tfd c0 serr 00000000 cmd 0000f117
May  4 09:23:56 freenas kernel: (ada2:ahcich2:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
May  4 09:23:56 freenas kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout
May  4 09:23:56 freenas kernel: (ada2:ahcich2:0:0:0): Retrying command
May  4 09:24:52 freenas kernel: ahcich2: AHCI reset: device not ready after 31000ms (tfd = 00000080)
May  4 09:24:52 freenas kernel: (aprobe0:ahcich2:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
May  4 09:24:52 freenas kernel: (aprobe0:ahcich2:0:0:0): CAM status: Unconditionally Re-queue Request
May  4 09:24:52 freenas kernel: (aprobe0:ahcich2:0:0:0): Error 5, Retry was blocked
May  4 09:24:52 freenas kernel: ada2 at ahcich2 bus 0 scbus2 target 0 lun 0
May  4 09:24:52 freenas kernel: ada2: <WDC WD30EZRX-00DC0B0 80.00A80> s/n WD-WCC1T1525005 detached
May  4 09:24:56 freenas kernel: (aprobe0:ahcich2:0:0:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
May  4 09:24:56 freenas kernel: (aprobe0:ahcich2:0:0:0): CAM status: ATA Status Error
May  4 09:24:56 freenas kernel: (aprobe0:ahcich2:0:0:0): ATA status: d1 (BSY DRDY SERV ERR), error: 04 (ABRT )
May  4 09:24:56 freenas kernel: (aprobe0:ahcich2:0:0:0): RES: d1 04 ff ff ff ff ff ff ff ff ff
May  4 09:24:56 freenas kernel: (aprobe0:ahcich2:0:0:0): Error 5, Retries exhausted
May  4 09:24:56 freenas kernel: (aprobe1:ahcich2:0:15:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
May  4 09:24:56 freenas kernel: (aprobe1:ahcich2:0:15:0): CAM status: ATA Status Error
May  4 09:24:56 freenas kernel: (aprobe1:ahcich2:0:15:0): ATA status: d1 (BSY DRDY SERV ERR), error: 04 (ABRT )
May  4 09:24:56 freenas kernel: (aprobe1:ahcich2:0:15:0): RES: d1 04 ff ff ff ff ff ff ff ff ff
May  4 09:24:56 freenas kernel: (aprobe1:ahcich2:0:15:0): Error 5, Retries exhausted
May  4 09:24:56 freenas kernel: (aprobe0:ahcich2:0:0:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
May  4 09:24:56 freenas kernel: (aprobe0:ahcich2:0:0:0): CAM status: ATA Status Error
May  4 09:24:56 freenas kernel: (aprobe0:ahcich2:0:0:0): ATA status: d1 (BSY DRDY SERV ERR), error: 04 (ABRT )
May  4 09:24:56 freenas kernel: (aprobe0:ahcich2:0:0:0): RES: d1 04 ff ff ff ff ff ff ff ff ff
May  4 09:24:56 freenas kernel: (aprobe0:ahcich2:0:0:0): Error 5, Retries exhausted



At this point I cannot even run smartctl as the drive has disappeared:
Code:
[root@freenas] ~# smartctl -i /dev/ada2
smartctl 6.2 2013-07-26 r3841 [FreeBSD 9.2-RELEASE-p4 amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
/dev/ada2: No such file or directory


After a reboot the zpool operates normally and all is well for a while (hours or days).

BUT the smartctl output makes me think something is wrong with the drive. Note the extended (long) self-test, which failed with a read error.

Code:
[root@freenas] ~# smartctl -a /dev/ada2
...
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       7
  3 Spin_Up_Time            0x0027   253   177   021    Pre-fail  Always       -       1033
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       9
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       3290
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       9
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       5
193 Load_Cycle_Count        0x0032   153   153   000    Old_age   Always       -       141874
194 Temperature_Celsius     0x0022   121   116   000    Old_age   Always       -       29
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       72
...
SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure        60%       3240        1708496136
# 2  Short offline       Completed without error        00%       3173        -
# 3  Extended offline    Aborted by host                90%       3156        -
# 4  Short offline       Completed without error        00%       3102        -
# 5  Short offline       Completed without error        00%       3087        -
# 6  Short offline       Completed without error        00%       3086        -
SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 


After a period of time (could be hours or days) the drive drops out again, and it is always the same drive.
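
To check the "always the same drive" part against the logs rather than memory, a rough sketch like the one below could tally detach and timeout events per device. The function name `drive_dropouts` is hypothetical, and it assumes syslog lines shaped like the kernel messages above (hostname in the fourth field, `kernel:` in the fifth).

```shell
#!/bin/sh
# Hypothetical helper: reads syslog text on stdin and counts, per device,
# how often a disk was detached and how often an AHCI channel timed out.
drive_dropouts() {
    awk '
        # e.g. "... kernel: ada2: <WDC ...> s/n ... detached"
        $5 == "kernel:" && $NF == "detached" { sub(/:$/, "", $6); detached[$6]++ }
        # e.g. "... kernel: ahcich2: Timeout on slot 17 port 0"
        $5 == "kernel:" && $7 == "Timeout"   { sub(/:$/, "", $6); timeouts[$6]++ }
        END {
            for (d in detached) print d, "detached", detached[d]
            for (c in timeouts) print c, "timeouts", timeouts[c]
        }
    ' | sort
}
```

On FreeNAS this would typically be fed from the system log, for example `drive_dropouts < /var/log/messages`. If only one disk and one channel ever show up, that supports the same-drive observation, though it cannot by itself separate a bad disk from a bad cable or port.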

Given the SMART information I think I will RMA the drive, but I wanted to check that these ATA errors aren't symptoms of some other issue, such as a motherboard problem?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
See that extended test with the read failure? That disk is dying. Replace it. ;)

There's lots of other evidence the disk is failing, but failure is failure. So just RMA or replace it using the FreeNAS manual. ;)
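
As a quick way to pull those signals out of `smartctl -a` output, here is a minimal sketch. It assumes the attribute and self-test log layouts shown in the first post, and the choice of attributes 5, 197 and 198 is just the commonly watched set, not an official rule. The function name `smart_warnings` is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: reads `smartctl -a` text on stdin and prints
# warnings for nonzero sector-health counters and failed self-tests.
smart_warnings() {
    awk '
        # Attribute rows have 10 columns; the raw value is the tenth.
        # 5 = Reallocated_Sector_Ct, 197 = Current_Pending_Sector,
        # 198 = Offline_Uncorrectable.
        NF >= 10 && ($1 == 5 || $1 == 197 || $1 == 198) && $10 + 0 > 0 {
            print "attribute", $1, $2, "raw=" $10
        }
        # Self-test log rows start with "#"; a failed test is reported as
        # "Completed: <reason>", and the last column is the first bad LBA.
        /^# *[0-9]+/ && /Completed: / {
            print "self-test failure, first error LBA", $NF
        }
    '
}
```

Run as `smartctl -a /dev/ada2 | smart_warnings`. Against the output in the first post it should flag the two pending sectors and the extended test that stopped at LBA 1708496136.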
 

oversteer80

Cadet
Joined
Dec 22, 2013
Messages
2
Thanks. I want to RMA it, but the WD site won't let me! Their RMA site seems to be broken. Annoying; I am in the UK, so it should be possible. Will call them in the morning.
 