SOLVED Help with badblocks on one of my drives

Status
Not open for further replies.

victorbrca

Dabbler
Joined
Oct 1, 2013
Messages
19
Hey everyone, I've been running FreeNAS 9.1.1 for quite a few years, and one of my drives has finally failed. I think it's toast, but I wanted to see if there's a way to attempt to fix it so I can recover some files.

Some background on my setup: because of my budget at the time of the build, I spent most of my money on the PC that runs FreeNAS rather than on the drives. I have two pools. One has 5 drives of different sizes and no parity, which is where I save my files. The other (called Backup) has 2 WD Greens that are mirrored. I have rsync jobs that copy my files to this Backup pool.

All my really important files were backed up, but I found out that some of the jobs for the not-so-important files hadn't been running for quite a while, and I would like to attempt to get those files back.

After resolving this issue, my plan is to install FreeNAS 11, import the Backup pool, and then set up a new RAIDZ1 with 3 new WD Reds that I just got.

Getting to the point... here's the error message I started to see:

Code:
Jun 13 14:10:33 FreeNAS kernel: (ada2:ahcich2:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
Jun 13 14:10:33 FreeNAS kernel: (ada2:ahcich2:0:0:0): RES: 41 40 ac 00 40 40 00 00 00 00 00
Jun 13 14:10:33 FreeNAS kernel: (ada2:ahcich2:0:0:0): Retrying command
Jun 13 14:10:33 FreeNAS kernel: (ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 80 00 40 40 00 00 00 01 00 00
Jun 13 14:10:33 FreeNAS kernel: (ada2:ahcich2:0:0:0): CAM status: ATA Status Error



Output from `smartctl`, showing the read failure:

Code:
# smartctl -a /dev/ada2
smartctl 6.1 2013-03-16 r3800 [FreeBSD 9.1-STABLE amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Western Digital Caviar Blue Serial ATA
Device Model:	 WDC WD5000AAKS-22TMA0
Serial Number:	WD-WCAPW4102538
LU WWN Device Id: 5 0014ee 255bcb94d
Firmware Version: 12.01C01
User Capacity:	500,107,862,016 bytes [500 GB]
Sector Size:	  512 bytes logical/physical
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ATA/ATAPI-7 (minor revision not indicated)
Local Time is:	Wed Jun 13 14:30:55 2018 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)   Offline data collection activity
				   was completed without error.
				   Auto Offline Data Collection: Enabled.
Self-test execution status:	  (   0)   The previous self-test routine completed
				   without error or no self-test has ever
				   been run.
Total time to complete Offline
data collection:		(12000) seconds.
Offline data collection
capabilities:			 (0x7b) SMART execute Offline immediate.
				   Auto Offline data collection on/off support.
				   Suspend Offline collection upon new
				   command.
				   Offline surface scan supported.
				   Self-test supported.
				   Conveyance Self-test supported.
				   Selective Self-test supported.
SMART capabilities:			(0x0003)   Saves SMART data before entering
				   power-saving mode.
				   Supports SMART auto save timer.
Error logging capability:		(0x01)   Error logging supported.
				   General Purpose Logging supported.
Short self-test routine
recommended polling time:	 (   2) minutes.
Extended self-test routine
recommended polling time:	 ( 150) minutes.
Conveyance self-test routine
recommended polling time:	 (   6) minutes.
SCT capabilities:		   (0x303f)   SCT Status supported.
				   SCT Error Recovery Control supported.
				   SCT Feature Control supported.
				   SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x000f   200   200   051	Pre-fail  Always	   -	   284
  3 Spin_Up_Time			0x0003   177   168   021	Pre-fail  Always	   -	   6116
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   520
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x000e   200   200   051	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   019   019   000	Old_age   Always	   -	   59419
 10 Spin_Retry_Count		0x0012   100   100   051	Old_age   Always	   -	   0
 11 Calibration_Retry_Count 0x0012   100   100   051	Old_age   Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   466
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   380
193 Load_Cycle_Count		0x0032   200   200   000	Old_age   Always	   -	   520
194 Temperature_Celsius	 0x0022   111   090   000	Old_age   Always	   -	   39
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0012   194   194   000	Old_age   Always	   -	   489
198 Offline_Uncorrectable   0x0010   195   195   000	Old_age   Offline	  -	   455
199 UDMA_CRC_Error_Count	0x003e   200   200   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   197   197   051	Old_age   Offline	  -	   248

SMART Error Log Version: 1
ATA Error Count: 3473 (device log contains only the most recent five errors)
   CR = Command Register [HEX]
   FR = Features Register [HEX]
   SC = Sector Count Register [HEX]
   SN = Sector Number Register [HEX]
   CL = Cylinder Low Register [HEX]
   CH = Cylinder High Register [HEX]
   DH = Device/Head Register [HEX]
   DC = Device Command Register [HEX]
   ER = Error register [HEX]
   ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 3473 occurred at disk power-on lifetime: 59418 hours (2475 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 80 5c 38 40  Error: UNC at LBA = 0x00385c80 = 3693696

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 80 5c 38 3a 08	  00:04:12.832  READ DMA EXT
  25 00 00 80 5c 38 3a 08	  00:04:10.601  READ DMA EXT
  25 00 00 80 5c 38 3a 08	  00:04:08.373  READ DMA EXT
  25 00 00 80 5c 38 3a 08	  00:04:06.433  READ DMA EXT
  25 00 00 80 5c 38 3a 08	  00:04:04.485  READ DMA EXT

Error 3472 occurred at disk power-on lifetime: 59418 hours (2475 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 80 5c 38 40  Error: UNC at LBA = 0x00385c80 = 3693696

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 80 5c 38 3a 08	  00:04:10.601  READ DMA EXT
  25 00 00 80 5c 38 3a 08	  00:04:08.373  READ DMA EXT
  25 00 00 80 5c 38 3a 08	  00:04:06.433  READ DMA EXT
  25 00 00 80 5c 38 3a 08	  00:04:04.485  READ DMA EXT
  2f 00 01 10 00 00 00 08	  00:04:04.482  READ LOG EXT

Error 3471 occurred at disk power-on lifetime: 59418 hours (2475 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 80 5c 38 40  Error: UNC at LBA = 0x00385c80 = 3693696

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 80 5c 38 3a 08	  00:04:08.373  READ DMA EXT
  25 00 00 80 5c 38 3a 08	  00:04:06.433  READ DMA EXT
  25 00 00 80 5c 38 3a 08	  00:04:04.485  READ DMA EXT
  2f 00 01 10 00 00 00 08	  00:04:04.482  READ LOG EXT
  60 00 00 80 5a 38 3a 08	  00:04:02.387  READ FPDMA QUEUED

Error 3470 occurred at disk power-on lifetime: 59418 hours (2475 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 80 5c 38 40  Error: UNC at LBA = 0x00385c80 = 3693696

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 80 5c 38 3a 08	  00:04:06.433  READ DMA EXT
  25 00 00 80 5c 38 3a 08	  00:04:04.485  READ DMA EXT
  2f 00 01 10 00 00 00 08	  00:04:04.482  READ LOG EXT
  60 00 00 80 5a 38 3a 08	  00:04:02.387  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 08	  00:04:02.385  READ LOG EXT

Error 3469 occurred at disk power-on lifetime: 59418 hours (2475 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 80 5c 38 40  Error: UNC at LBA = 0x00385c80 = 3693696

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 80 5c 38 3a 08	  00:04:04.485  READ DMA EXT
  2f 00 01 10 00 00 00 08	  00:04:04.482  READ LOG EXT
  60 00 00 80 5a 38 3a 08	  00:04:02.387  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 08	  00:04:02.385  READ LOG EXT
  60 00 00 80 5a 38 3a 08	  00:03:59.898  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed: read failure	   90%	 59311		 62207

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


I also ran `badblocks` in non-destructive mode, and it found a high number of bad blocks. I can't find where I saved the output, so I'm running it again and will post the results here.
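In case it helps when comparing the two tools, here is a minimal sketch of how the LBA in the `smartctl` error log relates to a byte offset and to the block numbers `badblocks` prints. It is pure arithmetic and never touches the disk. It assumes 512-byte logical sectors (which is what `smartctl` reports for this drive) and the default 1024-byte `badblocks` block size (i.e. no `-b` option was given). The function names are made up for illustration, and it should be run under `sh` rather than csh.

```shell
#!/bin/sh
# Sketch only: relate the smartctl error-log LBA to byte offsets and
# badblocks block numbers. The function names are hypothetical helpers.

# Rebuild the low 24 bits of the LBA from the SN, CL and CH register bytes
# (hex, as printed by smartctl). SN is the low byte, CH the high byte.
# The higher LBA bits are taken to be zero, matching smartctl's decoded value.
lba_from_regs() {
    echo $(( (0x$3 << 16) | (0x$2 << 8) | 0x$1 ))
}

# Byte offset of an LBA, assuming 512-byte logical sectors.
lba_to_bytes() {
    echo $(( $1 * 512 ))
}

# Block number badblocks would report for that LBA, assuming its
# default block size of 1024 bytes.
lba_to_badblock() {
    echo $(( $1 * 512 / 1024 ))
}

# Register bytes from "40 51 00 80 5c 38 40" in the error log: SN=80 CL=5c CH=38
lba=$(lba_from_regs 80 5c 38)
echo "LBA:             $lba"
echo "byte offset:     $(lba_to_bytes "$lba")"
echo "badblocks block: $(lba_to_badblock "$lba")"
```

With the register bytes from the log this gives LBA 3693696, the same value `smartctl` decodes, which is byte offset 1891172352 and `badblocks` block 1846848 under the assumptions above.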

Additional system info:

[Attached screenshot: IXB7oDy.png]


Code:
# camcontrol devlist
<WDC WD30EZRX-00D8PB0 80.00A80>	at scbus0 target 0 lun 0 (ada0,pass0)
<WDC WD30EZRX-00DC0B0 80.00A80>	at scbus1 target 0 lun 0 (ada1,pass1)
<WDC WD5000AAKS-22TMA0 12.01C01>   at scbus2 target 0 lun 0 (ada2,pass2)
<WDC WD30EZRX-00DC0B0 80.00A80>	at scbus3 target 0 lun 0 (ada3,pass3)
<WDC WD10EADS-00L5B1 01.01A01>	 at scbus4 target 0 lun 0 (ada4,pass4)
<WDC WD20EARS-00MVWB0 50.0AB50>	at scbus5 target 0 lun 0 (ada5,pass5)
<Kingmax USB2.0 FlashDisk 0.00>	at scbus6 target 0 lun 0 (pass6,da0)


Code:
# zpool status -v
  pool: Backup
 state: ONLINE
  scan: scrub repaired 0 in 2h8m with 0 errors on Sun Jun 10 02:08:43 2018
config:

   NAME										  STATE	 READ WRITE CKSUM
   Backup										ONLINE	   0	 0	 0
	 gptid/2bcb9274-74ba-11e3-a683-94de806dfb29  ONLINE	   0	 0	 0
	 gptid/2cc595fd-74ba-11e3-a683-94de806dfb29  ONLINE	   0	 0	 0

errors: No known data errors


Code:
# /sbin/ifconfig | grep media
   media: Ethernet autoselect (1000baseT <full-duplex>)


Thanks for any help.
 