Drive Failure? ATA Error Count - Advice needed

SmallGuy · Jun 1, 2016

Hello Guys,

Set-Up:
ASRock E3C226D2I; OnBoard NIC INTEL I210AT ; Intel Core i3-4330 CPU @ 3.50GHz ; 2x Kingston KVR16E11/8 ; 6x WD RED WD20EFRX - ZFS RAIDZ 2 ; FreeNAS-9.10-STABLE

I am having trouble with ada1 in my 6-disk RAIDZ2 pool.
This is the second time it has happened in less than a month.
Let me explain the circumstances in which the failure appeared.

I recently upgraded from 9.2.1.9 to 9.10 with a clean installation:
-BIOS upgrade to latest version
-BMC upgrade to latest version

-Bought a USB adapter (1->2)
-Bought 2 SanDisk Cruzer 16 GB sticks
-Installed all of those inside the case
-Clean install following the manual
-The pool was imported automagically
-Reloaded the configuration file
-Done

What is interesting here is that I opened the case (and that the upgrade apparently went smoothly from the software point of view)...

After the installation was completed, just after the post-installation auto-reboot, ada1 was detected as failing by FreeNAS (ATA Error Count), and the system automatically set the disk offline.
So I suspected a bad connection, since I had put my BIG fingers inside the case to install the USB devices and probably touched the wiring:
-Shut down the system, checked the connections and rebooted... The disk was attached back to the pool automagically.
-Ran a long SMART test successfully: the errors recorded at the first failure are still there, but everything else looks good, and the pool status is reported as Healthy.

So I decided to continue using the drive as is, and to keep an eye on it.
On Tuesday morning my scheduled scrub was launched, and this Wednesday morning I received a critical alert e-mail:

Code:

The volume Volume1 (ZFS) state is DEGRADED: One or more devices could not be opened. Sufficient replicas exist for the pool to continue functioning in a degraded state.
Device: /dev/ada1, ATA error count increased from 10 to 20

Thanks to the @Bidule0hm script, I got the following report:

Code:

########## SMART status report for ada1 drive (Western Digital Red: WD-WMC3012xxxxx) ##########
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-RELEASE-p3 amd64] (local build)
SMART overall-health self-assessment test result: PASSED
ID# ATTRIBUTE_NAME  FLAG  VALUE WORST THRESH TYPE  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate  0x002f  200  200  051  Pre-fail  Always  -  0
  3 Spin_Up_Time  0x0027  174  172  021  Pre-fail  Always  -  4291
  4 Start_Stop_Count  0x0032  100  100  000  Old_age  Always  -  142
  5 Reallocated_Sector_Ct  0x0033  200  200  140  Pre-fail  Always  -  0
  7 Seek_Error_Rate  0x002e  200  200  000  Old_age  Always  -  0
  9 Power_On_Hours  0x0032  063  063  000  Old_age  Always  -  27188
10 Spin_Retry_Count 0x0032  100  100  000  Old_age  Always  -  0
11 Calibration_Retry_Count 0x0032  100  100  000  Old_age  Always  -  0
12 Power_Cycle_Count  0x0032  100  100  000  Old_age  Always  -  130
192 Power-Off_Retract_Count 0x0032  200  200  000  Old_age  Always  -  86
193 Load_Cycle_Count  0x0032  200  200  000  Old_age  Always  -  55
194 Temperature_Celsius  0x0022  113  106  000  Old_age  Always  -  34
196 Reallocated_Event_Count 0x0032  200  200  000  Old_age  Always  -  0
197 Current_Pending_Sector  0x0032  200  200  000  Old_age  Always  -  0
198 Offline_Uncorrectable  0x0030  100  253  000  Old_age  Offline  -  0
199 UDMA_CRC_Error_Count  0x0032  200  200  000  Old_age  Always  -  0
200 Multi_Zone_Error_Rate  0x0008  200  200  000  Old_age  Offline  -  0
ATA Error Count: 20 (device log contains only the most recent five errors)
  CR = Command Register [HEX]
  FR = Features Register [HEX]
  SC = Sector Count Register [HEX]
  SN = Sector Number Register [HEX]
  CL = Cylinder Low Register [HEX]
  CH = Cylinder High Register [HEX]
  DH = Device/Head Register [HEX]
  DC = Device Command Register [HEX]
  ER = Error register [HEX]
  ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 20 occurred at disk power-on lifetime: 27187 hours (1132 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 00 80 fc 3f 40  Device Fault; Error: ABRT at LBA = 0x003ffc80 = 4193408
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 80 fc 3f 40 08  24d+13:38:01.312  READ DMA
  c8 00 00 80 fc 3f 40 08  24d+13:38:01.308  READ DMA
  c8 00 00 80 fc 3f 40 08  24d+13:38:01.304  READ DMA
  c8 00 00 80 fc 3f 40 08  24d+13:38:01.300  READ DMA
  c8 00 00 80 fc 3f 40 08  24d+13:38:01.295  READ DMA
Error 19 occurred at disk power-on lifetime: 27187 hours (1132 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 00 80 fc 3f 40  Device Fault; Error: ABRT at LBA = 0x003ffc80 = 4193408
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 80 fc 3f 40 08  24d+13:38:01.308  READ DMA
  c8 00 00 80 fc 3f 40 08  24d+13:38:01.304  READ DMA
  c8 00 00 80 fc 3f 40 08  24d+13:38:01.300  READ DMA
  c8 00 00 80 fc 3f 40 08  24d+13:38:01.295  READ DMA
Error 18 occurred at disk power-on lifetime: 27187 hours (1132 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 00 80 fc 3f 40  Device Fault; Error: ABRT at LBA = 0x003ffc80 = 4193408
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 80 fc 3f 40 08  24d+13:38:01.304  READ DMA
  c8 00 00 80 fc 3f 40 08  24d+13:38:01.300  READ DMA
  c8 00 00 80 fc 3f 40 08  24d+13:38:01.295  READ DMA
Error 17 occurred at disk power-on lifetime: 27187 hours (1132 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 00 80 fc 3f 40  Device Fault; Error: ABRT at LBA = 0x003ffc80 = 4193408
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 80 fc 3f 40 08  24d+13:38:01.300  READ DMA
  c8 00 00 80 fc 3f 40 08  24d+13:38:01.295  READ DMA
Error 16 occurred at disk power-on lifetime: 27187 hours (1132 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 00 80 fc 3f 40  Device Fault; Error: ABRT at LBA = 0x003ffc80 = 4193408
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 80 fc 3f 40 08  24d+13:38:01.295  READ DMA
Test_Description  Status  Remaining  LifeTime(hours)  LBA_of_first_error
Extended offline  Aborted by host  90%  26604  -

Surprisingly, there are some errors I associate with communication, but no UDMA_CRC error.
No Raw_Read_Error_Rate either, nothing among the usual errors one normally sees.
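
One way to cross-check the attribute table above is to pull out the raw values of the counters usually watched for media or cable trouble. This is only a minimal Python sketch: it assumes the whitespace-separated 10-column layout shown in the report, and the names `parse_smart_attributes`, `nonzero_watched` and the watch list are illustrative choices, not part of the script used above.

```python
# Sketch: extract RAW_VALUE per attribute ID from smartctl-style attribute lines.
# Assumption: each attribute line has the 10 columns shown in the report
# (ID, name, flag, value, worst, thresh, type, updated, when_failed, raw).

WATCHED = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
    199: "UDMA_CRC_Error_Count",
}

def parse_smart_attributes(report):
    """Map attribute ID -> integer raw value for every well-formed line."""
    raw_values = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute lines start with a numeric ID and carry a hex flag in column 3.
        if len(fields) >= 10 and fields[0].isdigit() and fields[2].startswith("0x"):
            try:
                raw_values[int(fields[0])] = int(fields[9])
            except ValueError:
                continue  # raw value is not a plain integer, skip it
    return raw_values

def nonzero_watched(raw_values):
    """Return only the watched counters whose raw value is not zero."""
    return {WATCHED[i]: v for i, v in raw_values.items() if i in WATCHED and v != 0}

# Lines copied from the report above.
SAMPLE = """\
  5 Reallocated_Sector_Ct  0x0033  200  200  140  Pre-fail  Always  -  0
197 Current_Pending_Sector  0x0032  200  200  000  Old_age  Always  -  0
198 Offline_Uncorrectable  0x0030  100  253  000  Old_age  Offline  -  0
199 UDMA_CRC_Error_Count  0x0032  200  200  000  Old_age  Always  -  0
"""

if __name__ == "__main__":
    print(nonzero_watched(parse_smart_attributes(SAMPLE)))  # prints {}
```

On the lines copied from the report nothing non-zero turns up, which matches the observation that none of the usual counters moved even though the ATA error log keeps growing.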
I suspect that when I reboot the system this evening, the disk will be imported automatically as happened the first time, and that the new long SMART test I will launch will come back PASSED.
I will post the full result of the long SMART test ASAP (I stopped the last extended test myself because I had erroneously repeated the long test using the command history, and the script reports only the last one...)
I have already ordered a new drive, as the warranty period is over, but I would like some advice/clarification.
To narrow down the troubleshooting, can somebody tell me with confidence what the "ABRT at LBA = 0x003ffc80 = 4193408" error and the "READ DMA" command are, and whether this kind of error is generally related to the drive electronics, the connections (cable or bad contact), the power supply or the disk controller?
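
As a rough reading aid for the error entries in the report, the register dump can be decoded by hand. The Python sketch below assumes the classic ATA bit assignments for the status and error registers (only a subset of bits is listed) and 28-bit LBA addressing for READ DMA (command c8); the function names are made up for illustration.

```python
# Sketch: decode the ER/ST/SN/CL/CH/DH bytes of a smartctl error-log entry.
# Assumption: classic ATA register bit meanings; only some bits are listed.

STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}

def flag_names(value, table):
    """Return the names of the known bits set in value, highest bit first."""
    return [name for bit, name in sorted(table.items(), reverse=True) if value & bit]

def lba28(sn, cl, ch, dh):
    """Assemble a 28-bit LBA; the low nibble of DH holds bits 27..24."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn

if __name__ == "__main__":
    # Values copied from "Error 20": ER=04 ST=61 SN=80 CL=fc CH=3f DH=40
    print(flag_names(0x61, STATUS_BITS))   # ['DRDY', 'DF', 'ERR']
    print(flag_names(0x04, ERROR_BITS))    # ['ABRT']
    print(lba28(0x80, 0xFC, 0x3F, 0x40))   # 4193408
```

This reproduces what smartctl printed: status 0x61 is DRDY plus DF (Device Fault) plus ERR, the error register holds only ABRT, and the LBA is 4193408. The ICRC bit, the one normally associated with interface CRC problems, is not set, which is consistent with UDMA_CRC_Error_Count staying at 0.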

Bidule0hm · Jun 1, 2016

SmallGuy said:
and the script reports only the last one...

You can replace line 92 with something like:

Code:

smartctl -l selftest /dev/da0 | grep "# 1 \|# 2 \|# 3 \|Num" | cut -c6-

for example, to print the last 3 test results instead of just the last one ;)
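
For anyone who prefers to do the same filtering outside the shell, here is a hedged Python sketch of roughly the same idea. It assumes the self-test log lines start with `# 1`, `# 2`, ... and that the header line starts with `Num`; the function name `last_selftests` is hypothetical, and unlike the grep it only matches at the start of a line.

```python
import re

def last_selftests(log_text, count=3):
    """Keep the header line and the entries numbered 1..count,
    dropping the first 5 characters (the index column), like `cut -c6-`."""
    kept = []
    for line in log_text.splitlines():
        entry = re.match(r"#\s*(\d+)\s", line)
        is_header = line.startswith("Num")
        if is_header or (entry and int(entry.group(1)) <= count):
            kept.append(line[5:])
    return kept
```

The text to filter would be whatever `smartctl -l selftest` prints for the drive; changing `count` replaces the hand-written `# 1 \|# 2 \|# 3` alternation.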

Regarding your problem: have you tried another cable?

SmallGuy · Jun 2, 2016

Bidule0hm said:
You can replace line 92 with something like:

Code:
smartctl -l selftest /dev/da0 | grep "# 1 \|# 2 \|# 3 \|Num" | cut -c6-

for example, to print the last 3 test results instead of just the last one ;)

Thanks for the trick.

The disk won't attach to the pool any more, and it is impossible to access the SMART data.
This is what I found in dmesg:

Code:

(ada1:ahcich1:0:0:0): READ_FPDMA_QUEUED. ACB: 60 40 50 d6 40 40 7e 00 00 00 00 00
(ada1:ahcich1:0:0:0): CAM status: ATA Status Error
(ada1:ahcich1:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
(ada1:ahcich1:0:0:0): RES: 41 40 50 d6 40 40 7e 00 00 00 00
(ada1:ahcich1:0:0:0): Retrying command
(ada1:ahcich1:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 10 d7 40 40 7e 00 00 01 00 00
(ada1:ahcich1:0:0:0): CAM status: ATA Status Error
(ada1:ahcich1:0:0:0): ATA status: 61 (DRDY DF ERR), error: 04 (ABRT )
(ada1:ahcich1:0:0:0): RES: 61 04 50 d6 40 40 7e 00 00 00 00
(ada1:ahcich1:0:0:0): Retrying command
(ada1:ahcich1:0:0:0): READ_FPDMA_QUEUED. ACB: 60 40 68 82 e1 40 e1 00 00 00 00 00
(ada1:ahcich1:0:0:0): CAM status: ATA Status Error
(ada1:ahcich1:0:0:0): ATA status: 61 (DRDY DF ERR), error: 04 (ABRT )
(ada1:ahcich1:0:0:0): RES: 61 04 50 d6 40 40 7e 00 00 00 00
(ada1:ahcich1:0:0:0): Retrying command
(ada1:ahcich1:0:0:0): READ_FPDMA_QUEUED. ACB: 60 40 68 82 e1 40 e1 00 00 00 00 00
(ada1:ahcich1:0:0:0): CAM status: ATA Status Error
(ada1:ahcich1:0:0:0): ATA status: 61 (DRDY DF ERR), error: 04 (ABRT )
(ada1:ahcich1:0:0:0): RES: 61 04 50 d6 40 40 7e 00 00 00 00
(ada1:ahcich1:0:0:0): Retrying command
(ada1:ahcich1:0:0:0): READ_FPDMA_QUEUED. ACB: 60 40 68 82 e1 40 e1 00 00 00 00 00
(ada1:ahcich1:0:0:0): CAM status: ATA Status Error
(ada1:ahcich1:0:0:0): ATA status: 61 (DRDY DF ERR), error: 04 (ABRT )
(ada1:ahcich1:0:0:0): RES: 61 04 50 d6 40 40 7e 00 00 00 00
(ada1:ahcich1:0:0:0): Retrying command
(ada1:ahcich1:0:0:0): READ_FPDMA_QUEUED. ACB: 60 40 68 82 e1 40 e1 00 00 00 00 00
(ada1:ahcich1:0:0:0): CAM status: ATA Status Error
(ada1:ahcich1:0:0:0): ATA status: 61 (DRDY DF ERR), error: 04 (ABRT )
(ada1:ahcich1:0:0:0): RES: 61 04 50 d6 40 40 7e 00 00 00 00
(ada1:ahcich1:0:0:0): Retrying command
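
To see which sectors these retries are hitting, the ACB and RES byte strings can be unpacked. The Python sketch below rests on an assumption worth checking against the kernel source before relying on it: that the ACB bytes are printed in the order command, features, LBA low/mid/high, device, then the three extended LBA bytes, and that RES starts with status and error followed by the same LBA layout. The function names are illustrative.

```python
# Sketch: unpack the LBA from FreeBSD-style "ACB:" and "RES:" byte strings.
# ASSUMPTION: byte order as described in the text above, not verified here.

def _to_bytes(hex_string):
    """Turn '60 40 50 ...' into a list of integers."""
    return [int(b, 16) for b in hex_string.split()]

def _lba48(b):
    """b[2..4] are LBA low/mid/high, b[6..8] the extended low/mid/high bytes."""
    return (b[8] << 40) | (b[7] << 32) | (b[6] << 24) | (b[4] << 16) | (b[3] << 8) | b[2]

def decode_acb(acb):
    """Decode a command block: command opcode and target LBA."""
    b = _to_bytes(acb)
    return {"command": b[0], "lba": _lba48(b)}

def decode_res(res):
    """Decode a result block: status, error and the LBA the drive reports."""
    b = _to_bytes(res)
    return {"status": b[0], "error": b[1], "lba": _lba48(b)}

if __name__ == "__main__":
    # Strings copied from the dmesg excerpt above.
    first_cmd = decode_acb("60 40 50 d6 40 40 7e 00 00 00 00 00")
    first_res = decode_res("41 40 50 d6 40 40 7e 00 00 00 00")
    later_cmd = decode_acb("60 40 68 82 e1 40 e1 00 00 00 00 00")
    later_res = decode_res("61 04 50 d6 40 40 7e 00 00 00 00")
    print(hex(first_cmd["lba"]), hex(first_res["lba"]))  # 0x7e40d650 0x7e40d650
    print(hex(later_cmd["lba"]), hex(later_res["lba"]))  # 0xe1e18268 0x7e40d650
```

Under that assumed layout, the first command fails with UNC at LBA 0x7e40d650, and every later RES line keeps reporting that same LBA with DF set, even though the retried commands target a different address (0xe1e18268).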

I have received the new drive and replaced the suspect crappy one with it. It was recognized immediately as ada1 and I'm currently burning it in.
It seems the old drive was responsible for the trouble and is dead for good.
I will update the thread when the burn-in and resilvering are finished.
