oversteer80
Cadet
- Joined: Dec 22, 2013
- Messages: 2
I am running FreeNAS 9.2.1.5 on an HP MicroServer N54L, with 4x WD Green 3TB drives in RAID-Z2.
Recently I have been alerted that the zpool is degraded as a drive has been removed:
Code:
[root@freenas] ~# zpool status
  pool: files
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: resilvered 227M in 0h3m with 0 errors on Fri May 2 07:30:42 2014
config:

        NAME                                            STATE     READ WRITE CKSUM
        files                                           DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/add09ba6-65de-11e3-8169-38eaa7a92520  ONLINE       0     0     0
            gptid/aed59706-65de-11e3-8169-38eaa7a92520  ONLINE       0     0     0
            3495928376370431422                         REMOVED      0     0     0  was /dev/gptid/afdde02c-65de-11e3-8169-38eaa7a92520
            gptid/b0e42727-65de-11e3-8169-38eaa7a92520  ONLINE       0     0     0
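For anyone who wants to spot the affected member without reading the whole table, here is a minimal Python sketch that pulls every non-ONLINE row out of the config section. The function name `unhealthy_devices` and the constant `SAMPLE_STATUS` are made up for illustration, and the sample text is just the config table from the output above.

```python
# Config table copied from the `zpool status` output above.
SAMPLE_STATUS = """\
        NAME                                            STATE     READ WRITE CKSUM
        files                                           DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/add09ba6-65de-11e3-8169-38eaa7a92520  ONLINE       0     0     0
            gptid/aed59706-65de-11e3-8169-38eaa7a92520  ONLINE       0     0     0
            3495928376370431422                         REMOVED      0     0     0  was /dev/gptid/afdde02c-65de-11e3-8169-38eaa7a92520
            gptid/b0e42727-65de-11e3-8169-38eaa7a92520  ONLINE       0     0     0
"""


def unhealthy_devices(config_text):
    """Return (name, state) for every row whose state is not ONLINE."""
    rows = []
    for line in config_text.splitlines():
        fields = line.split()
        # A device row is: name, state, then three numeric error counters.
        # The header row fails the isdigit() check and is skipped.
        if len(fields) >= 5 and all(f.isdigit() for f in fields[2:5]):
            if fields[1] != "ONLINE":
                rows.append((fields[0], fields[1]))
    return rows


if __name__ == "__main__":
    for name, state in unhealthy_devices(SAMPLE_STATUS):
        print(f"{state:9} {name}")
```

The pool and the raidz2 vdev show up as DEGRADED only because one member is REMOVED; the REMOVED row is the one that matters here.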
Checking the logs I see this:
Code:
May 4 09:23:56 freenas kernel: ahcich2: Timeout on slot 17 port 0
May 4 09:23:56 freenas kernel: ahcich2: is 00000000 cs 00020000 ss 00000000 rs 00020000 tfd c0 serr 00000000 cmd 0000f117
May 4 09:23:56 freenas kernel: (ada2:ahcich2:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
May 4 09:23:56 freenas kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout
May 4 09:23:56 freenas kernel: (ada2:ahcich2:0:0:0): Retrying command
May 4 09:24:52 freenas kernel: ahcich2: AHCI reset: device not ready after 31000ms (tfd = 00000080)
May 4 09:24:52 freenas kernel: (aprobe0:ahcich2:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
May 4 09:24:52 freenas kernel: (aprobe0:ahcich2:0:0:0): CAM status: Unconditionally Re-queue Request
May 4 09:24:52 freenas kernel: (aprobe0:ahcich2:0:0:0): Error 5, Retry was blocked
May 4 09:24:52 freenas kernel: ada2 at ahcich2 bus 0 scbus2 target 0 lun 0
May 4 09:24:52 freenas kernel: ada2: <WDC WD30EZRX-00DC0B0 80.00A80> s/n WD-WCC1T1525005 detached
May 4 09:24:56 freenas kernel: (aprobe0:ahcich2:0:0:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
May 4 09:24:56 freenas kernel: (aprobe0:ahcich2:0:0:0): CAM status: ATA Status Error
May 4 09:24:56 freenas kernel: (aprobe0:ahcich2:0:0:0): ATA status: d1 (BSY DRDY SERV ERR), error: 04 (ABRT )
May 4 09:24:56 freenas kernel: (aprobe0:ahcich2:0:0:0): RES: d1 04 ff ff ff ff ff ff ff ff ff
May 4 09:24:56 freenas kernel: (aprobe0:ahcich2:0:0:0): Error 5, Retries exhausted
May 4 09:24:56 freenas kernel: (aprobe1:ahcich2:0:15:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
May 4 09:24:56 freenas kernel: (aprobe1:ahcich2:0:15:0): CAM status: ATA Status Error
May 4 09:24:56 freenas kernel: (aprobe1:ahcich2:0:15:0): ATA status: d1 (BSY DRDY SERV ERR), error: 04 (ABRT )
May 4 09:24:56 freenas kernel: (aprobe1:ahcich2:0:15:0): RES: d1 04 ff ff ff ff ff ff ff ff ff
May 4 09:24:56 freenas kernel: (aprobe1:ahcich2:0:15:0): Error 5, Retries exhausted
May 4 09:24:56 freenas kernel: (aprobe0:ahcich2:0:0:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
May 4 09:24:56 freenas kernel: (aprobe0:ahcich2:0:0:0): CAM status: ATA Status Error
May 4 09:24:56 freenas kernel: (aprobe0:ahcich2:0:0:0): ATA status: d1 (BSY DRDY SERV ERR), error: 04 (ABRT )
May 4 09:24:56 freenas kernel: (aprobe0:ahcich2:0:0:0): RES: d1 04 ff ff ff ff ff ff ff ff ff
May 4 09:24:56 freenas kernel: (aprobe0:ahcich2:0:0:0): Error 5, Retries exhausted
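Since the drop-outs recur, it may help to tally the events per channel and device across a longer log, to confirm it really is always the same port. Below is a rough Python sketch of that idea. The event names in `PATTERNS` and the function `summarize` are made up for illustration, and the regular expressions only cover message shapes that appear in the log above, so other failure modes would need their own patterns.

```python
import re
from collections import Counter

# A few lines copied from the log above.
SAMPLE_LOG = """\
May 4 09:23:56 freenas kernel: ahcich2: Timeout on slot 17 port 0
May 4 09:23:56 freenas kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout
May 4 09:24:52 freenas kernel: ahcich2: AHCI reset: device not ready after 31000ms (tfd = 00000080)
May 4 09:24:52 freenas kernel: ada2: <WDC WD30EZRX-00DC0B0 80.00A80> s/n WD-WCC1T1525005 detached
"""

# Hypothetical event names; each pattern captures the channel or device name.
PATTERNS = {
    "timeout": re.compile(r"(ahcich\d+): Timeout on slot"),
    "reset_failed": re.compile(r"(ahcich\d+): AHCI reset: device not ready"),
    "detached": re.compile(r"kernel: (ada\d+): <.*> s/n \S+ detached"),
}


def summarize(log_text):
    """Count matching events per (channel-or-device, event name)."""
    counts = Counter()
    for line in log_text.splitlines():
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[(match.group(1), name)] += 1
    return counts


if __name__ == "__main__":
    for (device, event), n in sorted(summarize(SAMPLE_LOG).items()):
        print(f"{device}: {event} x{n}")
```

To run it against a real log, read the file contents (for example /var/log/messages, assuming that is where kernel messages land on your system) and pass the text to `summarize`.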
At this point I cannot even run smartctl as the drive has disappeared:
Code:
[root@freenas] ~# smartctl -i /dev/ada2
smartctl 6.2 2013-07-26 r3841 [FreeBSD 9.2-RELEASE-p4 amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

/dev/ada2: No such file or directory
After a reboot the zpool operates normally and all is well for a while (hours or days).
However, the smartctl output makes me think something is wrong with the drive. Note the extended self-test in the log, which ended with a read failure.
Code:
[root@freenas] ~# smartctl -a /dev/ada2
...
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       7
  3 Spin_Up_Time            0x0027   253   177   021    Pre-fail  Always       -       1033
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       9
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       3290
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       9
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       5
193 Load_Cycle_Count        0x0032   153   153   000    Old_age   Always       -       141874
194 Temperature_Celsius     0x0022   121   116   000    Old_age   Always       -       29
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       72
...
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       60%      3240         1708496136
# 2  Short offline       Completed without error       00%      3173         -
# 3  Extended offline    Aborted by host               90%      3156         -
# 4  Short offline       Completed without error       00%      3102         -
# 5  Short offline       Completed without error       00%      3087         -
# 6  Short offline       Completed without error       00%      3086         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
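To make the interesting rows stand out, here is a small Python sketch that parses the attribute table and reports which of a handful of watched attributes have a nonzero raw value. The `WATCHED` set is my own rule-of-thumb selection, not a vendor specification, and the function names are made up for illustration. As I understand it, a nonzero Current_Pending_Sector means the drive has sectors it could not read and is waiting to remap, while UDMA_CRC_Error_Count is the counter that usually climbs when the cable or interface is at fault.

```python
# Attribute rows copied from the smartctl output above (a subset).
SAMPLE_ROWS = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

# Assumption: a rule-of-thumb list of attributes worth checking first.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
}


def parse_attributes(text):
    """Map attribute name -> integer raw value for each table row."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Rows start with a numeric ID and have ten columns; RAW_VALUE is last.
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[fields[1]] = int(fields[9])
    return attrs


def nonzero_watched(attrs):
    """Return the watched attributes whose raw value is above zero."""
    return {name: attrs[name] for name in WATCHED if attrs.get(name, 0) > 0}


if __name__ == "__main__":
    for name, raw in nonzero_watched(parse_attributes(SAMPLE_ROWS)).items():
        print(f"{name}: raw value {raw}")
```

On the rows above this reports only Current_Pending_Sector, and UDMA_CRC_Error_Count stays at 0. Read that way, the numbers look more like a media problem on the drive than a cabling problem, though that is an interpretation and not proof.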
After a period of time (hours or days) the drive drops out again, and it is always the same drive.
Given the SMART information I think I will RMA the drive, but I wanted to check whether these ATA errors could be symptoms of some other issue, such as a motherboard problem.