Has my drive failed?

Jarrodspencerharper · Oct 16, 2016

I have an HP N40L with 8 GB of RAM running FreeNAS 8.3, with two 2 TB drives set up as a ZFS mirror. I set this up about 5 years ago and have never had any issues. I use it at home to store photos, movies, music, etc.

The other day I got an alert saying some corruption was detected, so I ran zpool status -v and got the output in the attachment.

My question is: has one of my drives failed? If so, how would I know which one?

Thanks in advance.

Robert Trevellyan · Oct 16, 2016

Not enough information.

Start with this (you might have to translate some points for your obsolete version):
https://forums.freenas.org/index.ph...leshooting-guide-basic-common-failures.41026/

SweetAndLow · Oct 16, 2016

Yep, you have a problem and have lost some data already. What is the output of zpool status -v, and what is the SMART data output?

rs225 · Oct 16, 2016

I would check cables first. I've never seen such a strange combination of error counts. Your best hope is that they are phantoms.

Stux · Oct 16, 2016

Run the -v and see what the damage is.

You have backups?

Jarrodspencerharper · Oct 17, 2016

I attached the output of zpool status -v in my original post. I am attaching the SMART data output to this post. I would greatly appreciate it if anyone could help me understand what the SMART data output is saying.

Drive 1

Code:

[root@freenas] ~# smartctl -a /dev/ada0
smartctl 5.43 2012-06-30 r3573 [FreeBSD 8.3-RELEASE-p6 amd64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:	 Seagate Barracuda (SATA 3Gb/s, 4K Sectors)
Device Model:	 ST2000DM001-1CH164
Serial Number:	S1E1BXA1
LU WWN Device Id: 5 000c50 05c35cdee
Firmware Version: CC26
User Capacity:	2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:	Mon Oct 17 02:55:29 2016 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:	  (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection:		 (  575) seconds.
Offline data collection
capabilities:			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:			(0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:		(0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time:	 (   1) minutes.
Extended self-test routine
recommended polling time:	 ( 227) minutes.
Conveyance self-test routine
recommended polling time:	 (   2) minutes.
SCT capabilities:			(0x3085)	SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x000f   094   089   006	Pre-fail  Always	   -	   115286552
  3 Spin_Up_Time			0x0003   097   096   000	Pre-fail  Always	   -	   0
  4 Start_Stop_Count		0x0032   100   100   020	Old_age   Always	   -	   116
  5 Reallocated_Sector_Ct   0x0033   100   100   010	Pre-fail  Always	   -	   248
  7 Seek_Error_Rate		 0x000f   076   060   030	Pre-fail  Always	   -	   47625093
  9 Power_On_Hours		  0x0032   078   078   000	Old_age   Always	   -	   20134
 10 Spin_Retry_Count		0x0013   100   100   097	Pre-fail  Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   020	Old_age   Always	   -	   116
183 Runtime_Bad_Block	   0x0032   100   100   000	Old_age   Always	   -	   0
184 End-to-End_Error		0x0032   100   100   099	Old_age   Always	   -	   0
187 Reported_Uncorrect	  0x0032   001   001   000	Old_age   Always	   -	   1335
188 Command_Timeout		 0x0032   096   096   000	Old_age   Always	   -	   12
189 High_Fly_Writes		 0x003a   100   100   000	Old_age   Always	   -	   0
190 Airflow_Temperature_Cel 0x0022   075   060   045	Old_age   Always	   -	   25 (Min/Max 18/25)
191 G-Sense_Error_Rate	  0x0032   100   100   000	Old_age   Always	   -	   0
192 Power-Off_Retract_Count 0x0032   100   100   000	Old_age   Always	   -	   53
193 Load_Cycle_Count		0x0032   100   100   000	Old_age   Always	   -	   205
194 Temperature_Celsius	 0x0022   025   040   000	Old_age   Always	   -	   25 (0 14 0 0 0)
197 Current_Pending_Sector  0x0012   093   093   000	Old_age   Always	   -	   1168
198 Offline_Uncorrectable   0x0010   093   093   000	Old_age   Offline	  -	   1168
199 UDMA_CRC_Error_Count	0x003e   200   200   000	Old_age   Always	   -	   0
240 Head_Flying_Hours	   0x0000   100   253   000	Old_age   Offline	  -	   159278862192296
241 Total_LBAs_Written	  0x0000   100   253   000	Old_age   Offline	  -	   8811850508
242 Total_LBAs_Read		 0x0000   100   253   000	Old_age   Offline	  -	   102406162799

SMART Error Log Version: 1
ATA Error Count: 2518 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2518 occurred at disk power-on lifetime: 20134 hours (838 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff 4f 00	  00:12:15.582  READ DMA EXT
  b0 d5 01 00 4f c2 40 00	  00:12:15.548  SMART READ LOG
  25 00 00 ff ff ff 4f 00	  00:12:12.712  READ DMA EXT
  b0 da 00 00 4f c2 40 00	  00:12:12.539  SMART RETURN STATUS
  25 00 00 ff ff ff 4f 00	  00:12:09.695  READ DMA EXT

Error 2517 occurred at disk power-on lifetime: 20134 hours (838 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff 4f 00	  00:12:15.582  READ DMA EXT
  b0 d5 01 00 4f c2 40 00	  00:12:15.548  SMART READ LOG
  25 00 00 ff ff ff 4f 00	  00:12:12.712  READ DMA EXT
  b0 da 00 00 4f c2 40 00	  00:12:12.539  SMART RETURN STATUS
  25 00 00 ff ff ff 4f 00	  00:12:09.695  READ DMA EXT

Error 2516 occurred at disk power-on lifetime: 20134 hours (838 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff 4f 00	  00:12:09.695  READ DMA EXT
  b0 d1 01 01 4f c2 40 00	  00:12:08.941  SMART READ ATTRIBUTE THRESHOLDS [OBS-4]
  25 00 00 ff ff ff 4f 00	  00:12:05.776  READ DMA EXT
  b0 d0 01 00 4f c2 40 00	  00:12:05.623  SMART READ DATA
  25 00 00 ff ff ff 4f 00	  00:12:02.774  READ DMA EXT

Error 2515 occurred at disk power-on lifetime: 20134 hours (838 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff 4f 00	  00:12:09.695  READ DMA EXT
  b0 d1 01 01 4f c2 40 00	  00:12:08.941  SMART READ ATTRIBUTE THRESHOLDS [OBS-4]
  25 00 00 ff ff ff 4f 00	  00:12:05.776  READ DMA EXT
  b0 d0 01 00 4f c2 40 00	  00:12:05.623  SMART READ DATA
  25 00 00 ff ff ff 4f 00	  00:12:02.774  READ DMA EXT

Error 2514 occurred at disk power-on lifetime: 20134 hours (838 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff 4f 00	  00:12:05.776  READ DMA EXT
  b0 d0 01 00 4f c2 40 00	  00:12:05.623  SMART READ DATA
  25 00 00 ff ff ff 4f 00	  00:12:02.774  READ DMA EXT
  ec 00 01 00 00 00 40 00	  00:12:02.739  IDENTIFY DEVICE
  25 00 00 ff ff ff 4f 00	  00:11:59.861  READ DMA EXT

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline	   Completed: read failure	   90%	 19234		 3907028096
# 2  Extended offline	Completed: read failure	   90%	 19211		 3907028096
# 3  Short offline	   Completed: read failure	   90%	 19186		 3907028096
# 4  Short offline	   Completed: read failure	   90%	 19138		 3907028096
# 5  Short offline	   Completed: read failure	   90%	 19116		 3907028096
# 6  Short offline	   Completed: read failure	   90%	 19067		 3907028096
# 7  Short offline	   Completed: read failure	   90%	 19019		 3907028096
# 8  Short offline	   Completed: read failure	   90%	 18971		 3907028096
# 9  Extended offline	Completed: read failure	   90%	 18904		 3907028096
#10  Short offline	   Completed: read failure	   90%	 18879		 3907028096
#11  Short offline	   Completed: read failure	   90%	 18831		 3907028096
#12  Short offline	   Completed: read failure	   90%	 18784		 3907028096
#13  Short offline	   Completed: read failure	   90%	 18736		 3907028096
#14  Short offline	   Completed: read failure	   90%	 18688		 3907028096
#15  Short offline	   Completed: read failure	   90%	 18640		 3907028096
#16  Short offline	   Completed: read failure	   90%	 18595		 3907028096
#17  Short offline	   Completed: read failure	   90%	 18594		 3907028096
#18  Extended offline	Completed: read failure	   90%	 18571		 3907028096
#19  Short offline	   Completed: read failure	   90%	 18546		 3907028096
#20  Short offline	   Completed: read failure	   90%	 18498		 3907028096
#21  Short offline	   Completed: read failure	   90%	 18450		 3907028096

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Drive 2

Code:

[root@freenas] ~# smartctl -a /dev/ada1
smartctl 5.43 2012-06-30 r3573 [FreeBSD 8.3-RELEASE-p6 amd64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:	 Seagate Barracuda (SATA 3Gb/s, 4K Sectors)
Device Model:	 ST2000DM001-1CH164
Serial Number:	S1E1FC21
LU WWN Device Id: 5 000c50 05c18e1d8
Firmware Version: CC26
User Capacity:	2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:	Mon Oct 17 02:57:34 2016 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

(pass1:ahcich1:0:0:0): SMART. ACB: b0 d0 00 4f c2 40 00 00 00 00 01 00
(pass1:ahcich1:0:0:0): CAM status: Command timeout
Error SMART Values Read failed: No error: 0
Smartctl: SMART Read Values failed.

=== START OF READ SMART DATA SECTION ===
(pass1:ahcich1:0:0:0): SMART. ACB: b0 da 00 4f c2 40 00 00 00 00 00 00
(pass1:ahcich1:0:0:0): CAM status: Command timeout
SMART overall-health self-assessment test result: UNKNOWN!
SMART Status, Attributes and Thresholds cannot be read.

SMART Error Log Version: 1
ATA Error Count: 260 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 260 occurred at disk power-on lifetime: 20105 hours (837 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 ff ff ff 4f 00	  03:20:04.788  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00	  03:20:04.745  READ LOG EXT
  60 00 08 ff ff ff 4f 00	  03:20:01.891  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00	  03:20:01.861  READ LOG EXT
  60 00 08 ff ff ff 4f 00	  03:19:59.003  READ FPDMA QUEUED

Error 259 occurred at disk power-on lifetime: 20105 hours (837 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 ff ff ff 4f 00	  03:20:04.788  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00	  03:20:04.745  READ LOG EXT
  60 00 08 ff ff ff 4f 00	  03:20:01.891  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00	  03:20:01.861  READ LOG EXT
  60 00 08 ff ff ff 4f 00	  03:19:59.003  READ FPDMA QUEUED

Error 258 occurred at disk power-on lifetime: 20105 hours (837 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 ff ff ff 4f 00	  03:20:01.891  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00	  03:20:01.861  READ LOG EXT
  60 00 08 ff ff ff 4f 00	  03:19:59.003  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00	  03:19:58.961  READ LOG EXT
  60 00 08 ff ff ff 4f 00	  03:19:56.111  READ FPDMA QUEUED

Error 257 occurred at disk power-on lifetime: 20105 hours (837 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 ff ff ff 4f 00	  03:20:01.891  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00	  03:20:01.861  READ LOG EXT
  60 00 08 ff ff ff 4f 00	  03:19:59.003  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00	  03:19:58.961  READ LOG EXT
  60 00 08 ff ff ff 4f 00	  03:19:56.111  READ FPDMA QUEUED

Error 256 occurred at disk power-on lifetime: 20105 hours (837 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 ff ff ff 4f 00	  03:19:56.111  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00	  03:19:56.010  READ LOG EXT
  60 00 08 ff ff ff 4f 00	  03:19:53.153  READ FPDMA QUEUED
  61 00 10 90 02 40 40 00	  03:19:53.152  WRITE FPDMA QUEUED
  60 00 08 ff ff ff 4f 00	  03:19:53.072  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline	   Completed without error	   00%	 19118		 -
# 2  Short offline	   Completed without error	   00%	 19069		 -
# 3  Short offline	   Completed without error	   00%	 19021		 -
# 4  Short offline	   Completed without error	   00%	 18973		 -
# 5  Extended offline	Completed without error	   00%	 18910		 -
# 6  Short offline	   Completed without error	   00%	 18881		 -
# 7  Short offline	   Completed without error	   00%	 18833		 -
# 8  Short offline	   Completed without error	   00%	 18785		 -
# 9  Short offline	   Completed without error	   00%	 18737		 -
#10  Short offline	   Completed without error	   00%	 18689		 -
#11  Short offline	   Completed without error	   00%	 18641		 -
#12  Short offline	   Completed without error	   00%	 18596		 -
#13  Short offline	   Completed without error	   00%	 18596		 -
#14  Extended offline	Completed without error	   00%	 18576		 -
#15  Short offline	   Completed without error	   00%	 18548		 -
#16  Short offline	   Completed without error	   00%	 18500		 -
#17  Short offline	   Completed without error	   00%	 18451		 -
#18  Short offline	   Completed without error	   00%	 18403		 -
#19  Short offline	   Completed without error	   00%	 18379		 -
#20  Short offline	   Completed without error	   00%	 18332		 -
#21  Short offline	   Completed without error	   00%	 18284		 -

Device does not support Selective Self Tests/Logging
[root@freenas] ~# p

I don't really have backups. I had a ZFS mirror set up in the hope that if one drive failed I could replace it with a new one. I am hoping that it's just one drive that has failed and the second is still fine.

I should also mention that I can power on my NAS and access the files for a short period of time. But after about 5-10 minutes I can no longer access them. The NAS is still powered on; I just can't reach it via the GUI or SSH.
Thanks

Robert Trevellyan · Oct 17, 2016

ada0 has failed every short self-test for at least the last month. It has a couple of hundred reallocated sectors and over a thousand pending reallocations.

ada1 is passing short self-tests, but throwing a lot of read errors, and the SMART data is missing.
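As a rough illustration, the counters mentioned for ada0 can be pulled out of `smartctl -a` output with a short script. This is a minimal Python sketch, not a FreeNAS tool: the function name `flag_attributes` and the watch-list of attribute IDs are my own assumptions, and the sample rows are copied from the ada0 output earlier in the thread.

```python
import re

# Attribute IDs whose raw value is normally zero on a healthy drive.
# This watch-list is an assumption for illustration, not an official list.
WATCHED = {5, 187, 197, 198}

# One row of the "Vendor Specific SMART Attributes" table:
# ID, name, flag, value, worst, thresh, type, updated, when_failed, raw value.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+\d+\s+\d+\s+\d+"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)

def flag_attributes(smartctl_text):
    """Return {attribute_name: raw_value} for watched attributes that are non-zero."""
    flagged = {}
    for line in smartctl_text.splitlines():
        m = ROW.match(line)
        if m is None:
            continue  # not an attribute row
        attr_id, name, raw = int(m.group(1)), m.group(2), int(m.group(3))
        if attr_id in WATCHED and raw > 0:
            flagged[name] = raw
    return flagged

# Rows copied from the ada0 report posted above.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       248
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       1335
197 Current_Pending_Sector  0x0012   093   093   000    Old_age   Always       -       1168
198 Offline_Uncorrectable   0x0010   093   093   000    Old_age   Offline      -       1168
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""

if __name__ == "__main__":
    for name, raw in flag_attributes(SAMPLE).items():
        print(f"{name}: {raw}")
```

The same function returns nothing for ada1 as posted, because that drive's attribute table could not be read at all, which is a warning sign in its own right.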

You should have been warned a long time ago about both of these drives having issues, so my guess is maybe you don't have the SMART monitoring service enabled, even though you have SMART tests running.
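To put a rough number on "a long time ago", the LifeTime(hours) column of the self-test log can be compared with Power_On_Hours. The sketch below is a Python illustration under my own assumptions (the function name `summarize_self_tests` and the returned keys are made up for this example); the sample rows are the newest, one intermediate, and the oldest entries from the ada0 log above. Since the log shown holds only 21 entries, the span it reports is a lower bound on how long the tests have been failing.

```python
import re

# One row of the "SMART Self-test log": number, test type, status,
# remaining %, lifetime hours, LBA of first error (or "-").
TEST_ROW = re.compile(
    r"^#\s*(\d+)\s+(Short|Extended|Conveyance) offline\s+(.+?)"
    r"\s+(\d+)%\s+(\d+)\s+(\S+)\s*$"
)

def summarize_self_tests(log_text, power_on_hours):
    """Summarize logged self-tests relative to the drive's current power-on hours."""
    tests = []  # (lifetime_hours, passed)
    for line in log_text.splitlines():
        m = TEST_ROW.match(line)
        if m:
            tests.append((int(m.group(5)), "without error" in m.group(3)))
    if not tests:
        return None
    hours = [lifetime for lifetime, _ in tests]
    return {
        "logged": len(tests),
        "failed": sum(1 for _, passed in tests if not passed),
        "span_hours": max(hours) - min(hours),
        "hours_since_last_test": power_on_hours - max(hours),
    }

# Newest, one intermediate, and oldest rows from the ada0 log above.
ADA0_LOG = """\
# 1  Short offline       Completed: read failure       90%     19234         3907028096
# 2  Extended offline    Completed: read failure       90%     19211         3907028096
#21  Short offline       Completed: read failure       90%     18450         3907028096
"""

if __name__ == "__main__":
    print(summarize_self_tests(ADA0_LOG, power_on_hours=20134))
```

For the ada0 figures this gives a span of 784 power-on hours (about 33 days) between the oldest and newest logged tests, all of them failed, with the newest test logged 900 power-on hours before the report was taken.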

It seems to me you have three options to try to save as much data as possible:

1. Immediately copy as much data as you can recover to external storage, then discard both existing drives and set up a new pool on new drives. This will be pretty tedious if the NAS stops responding every 10-15 minutes.
2. Attempt to replace each drive in place, one by one, following the directions in the manual. Ideally, if you have at least one spare SATA port, you can attach each new drive to replace an existing drive without removing an existing drive first. This procedure is documented for FreeNAS version 9.3.1 in section 8.1.11 of the user guide, "Replacing Drives to Grow a ZFS Pool". If you don't have a spare SATA port, I'm not sure how to pick which drive to replace first, since they both appear to be failing hard. Maybe ada0, since ada1 is passing short tests.
3. Attempt to clone one of the existing drives to a new drive with something like GNU ddrescue. If one drive won't clone, try cloning the other. If that works, you should be able to mount the pool degraded with one drive and recover your data to external storage before moving forward as in option 1.

Whatever you do, it's time to order some new drives.

rs225 · Oct 17, 2016

I agree on the ddrescue. If you can't fix this with cables or a PSU, put the drives in another computer, and clone to new drives with ddrescue. There is a good chance your damaged file count will drop significantly on a followup scrub of the new drives.

wblock · Oct 17, 2016

I agree with immediately trying to replace the drives one at a time, but would not attempt outside utilities unless it became a data recovery operation. Cloning a single drive out of a ZFS array could be interesting. Bad Things(TM) could happen.

Jarrodspencerharper · Oct 17, 2016

Sorry if this is a stupid question, but if I were to buy two new drives and replace the old ones one at a time using FreeNAS, wouldn't the two new drives inherit all the corrupt files that the current drives have?

Ericloewe · Oct 17, 2016

Jarrodspencerharper said:
Sorry if this is a stupid question, but if I were to buy two new drives and replace the old ones one at a time using FreeNAS, wouldn't the two new drives inherit all the corrupt files that the current drives have?

Well, yes, but it seems better than losing all the files.

rs225 · Oct 17, 2016

That is why ddrescue is being suggested.

Jarrodspencerharper · Oct 17, 2016

Just wanted to confirm: if I replace one drive at a time using FreeNAS, should I be following the process for growing the ZFS pool or for replacing a failed drive (resilvering)?

SweetAndLow · Oct 17, 2016

Well, are you growing the pool or replacing a disk?

You probably want the replace directions.

Jarrodspencerharper · Oct 17, 2016

SweetAndLow said:
Well, are you growing the pool or replacing a disk?

You probably want the replace directions.

Do I replace old drive 1 with new drive 1 and then replace old drive 2 with new drive 2?

Or do I replace old drive 1 with new drive 1, then replicate new drive 1 to new drive 2?

SweetAndLow · Oct 17, 2016

Replication is for backup; I don't think that will help with replacing a drive. You want to replace drive 1, then replace drive 2.

Jarrodspencerharper · Oct 17, 2016

SweetAndLow said:
Replication is for backup; I don't think that will help with replacing a drive. You want to replace drive 1, then replace drive 2.

OK, thanks. Once both drives are replaced, will the error count and the number of sectors waiting to be written go down to 0? Or will the new drives inherit the errors from the old drives?

SweetAndLow · Oct 17, 2016

The read errors and sectors are specific to the drives. But it's probably all going to fail anyway because you have filesystem corruption. Too many read errors on each disk.

Robert Trevellyan · Oct 18, 2016

Jarrodspencerharper said:
Should I be following the process for growing the ZFS pool or for replacing a failed drive (resilvering)?

The key takeaway from the directions for replacing drives to grow a pool is that, if you have a spare drive port, you can replace a drive without offlining the existing drive, which means no loss of redundancy during resilver. This is true regardless of whether your goal is to grow the pool or just to replace a failed or failing drive.

Stux · Oct 19, 2016

So, which drive to replace first? I'd replace the one with 35K errors. Connect a replacement drive via another SATA port, then click Replace Drive on the failed drive and select the new drive.

And hope it goes well.

Then try the same thing with the other drive.

If you're lucky... everything will be fine.
