Degraded pool: Please help me fix it.

Mike G · Apr 7, 2018

Hello, and thank you all in advance for any help you can provide for me.

My FreeNAS system is composed of the following hardware, built in January 2016:

Asrock C2750D4I

Six (6) WD Red 4TB WD40EFRX NAS hard drives that are NOT connected to the Marvell SATA ports. Instead they are on the four blue and two white SATA ports, all adjacent to each other.

Four (4) of MICRON MT18KSF1G72AZ-1G6E1ZE 8GB (1X8GB)1600MHZ PC3-12800 CL11 ECC REGISTERED DUAL RANK DDR3 SDRAM 240 PIN DIMM

Power Supply: Antec Earthwatts Green 380W EA-380E HT

I don't know if the SATA controller hardware on my Asrock C2750D4I is bad, or if I have a bad WD Red drive, or something else.

Starting on February 21st my daily system report emails showed long strings that repeat this:

(ada3:ahcich13:0:0:0): READ_DMA48. ACB: 25 00 68 38 7c 40 84 01 00 00 d0 00
(ada3:ahcich13:0:0:0): CAM status: ATA Status Error
(ada3:ahcich13:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 40 (UNC )
(ada3:ahcich13:0:0:0): RES: 51 40 af 38 7c 40 84 01 00 7f 00
(ada3:ahcich13:0:0:0): Retrying command

or

(ada3:ahcich13:0:0:0): Retrying command
(ada3:ahcich13:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e8 48 40 3f 40 83 01 00 00 00 00
(ada3:ahcich13:0:0:0): CAM status: ATA Status Error
(ada3:ahcich13:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
(ada3:ahcich13:0:0:0): RES: 41 40 9f 40 3f 40 83 01 00 00 00
(ada3:ahcich13:0:0:0): Error 5, Retries exhausted

So I assumed my ada3 device has an issue, and after looking at forum threads I tried to run some short and long SMART tests, although I don't really know how to read the long test results. I saved most of the output of whatever tests I ran and I didn't judge it a bad result, but I am not educated enough on this to troubleshoot properly. I checked my SATA and power cable connections on each end and rebooted, but when the ATA status errors persisted, I swapped out the SATA cable. The errors remained. I have not yet tried to move any SATA connections on the motherboard; I'm just scared to try anything without expert guidance, and I read that I shouldn't use the Marvell ports.

Starting on March 26th I got an email report that gives the information below, so I turned the NAS off and unplugged it:

NAME           SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH    ALTROOT
freenas-boot  29.8G   562M  29.2G         -      -     1%  1.00x  ONLINE    -
pool1         21.8T  13.4T  8.39T         -    10%    61%  1.00x  DEGRADED  /mnt

  pool: pool1
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub in progress since Sun Mar 25 00:00:04 2018
        7.02T scanned out of 13.3T at 75.7M/s, 24h19m to go
        0 repaired, 52.61% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        pool1                                           DEGRADED     0     0     0
          raidz1-0                                      DEGRADED     0     0     0
            gptid/fd6d1689-88f5-11e5-ae91-d05099c00684  ONLINE       0     0     0
            gptid/fe521d35-88f5-11e5-ae91-d05099c00684  ONLINE       0     0     0
            gptid/ff38e329-88f5-11e5-ae91-d05099c00684  ONLINE       0     0     0
            14083642073395126713                        REMOVED      0     0     0  was /dev/gptid/0021127e-88f6-11e5-ae91-d05099c00684
            gptid/010698d4-88f6-11e5-ae91-d05099c00684  ONLINE       0     0     0
            gptid/01ebee95-88f6-11e5-ae91-d05099c00684  ONLINE       0     0     0

Just today I booted this NAS back up and looked in the BIOS, because I knew the system time had been wrong for a while and wanted to fix it, and then realized that the BIOS did not say SMART was enabled. I enabled it. I had been trying to set the SMART schedules to run for quite some time, but I couldn't tell if they were working or how to make an output file to review. Anyway, once it booted up, I saw this:

This is where I ask for help and wait to be told that my SMART isn't turned on/get yelled at for not doing things right and not reading the manual. I tried to read the manual, and I feel like I made a good attempt at selecting hardware based on the advice on this board at the time. If I need to buy a new drive, let me know. If you have any instructions for me, please give me as much detail in the steps as possible, since I have to do internet searches to find the proper commands.
Thanks.

MrToddsFriends · Apr 8, 2018

I'm trying to answer a part of your questions.

I'm not sure whether the smartctl output you showed comes from one hard disk or from two different HDDs. A nonzero and constantly growing Reallocated_Sector_Ct value is a very bad sign. The drive(s) in question should be replaced on this fact alone. A nonzero Raw_Read_Error_Rate is also a bad sign for a WD Red, although the interpretation of this field is vendor specific. See also
https://en.wikipedia.org/wiki/S.M.A.R.T.#ATA_S.M.A.R.T._attributes
https://forums.freenas.org/index.ph...bleshooting-guide-all-versions-of-freenas.17/

To show all the information that smartctl can show for SATA drives, use the -x option, redirect the output to a file that is easily reachable through a share, and copy and paste the content using CODE tags. It will look like this (output for one of my 3TB WD Reds):

Code:

smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Western Digital Red
Device Model:	 WDC WD30EFRX-68EUZN0
Serial Number:	WD-WMC4N0E6K1VF
LU WWN Device Id: 5 0014ee 65a574394
Firmware Version: 82.00A82
User Capacity:	3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	5400 rpm
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:	Sun Apr  8 08:00:00 2018 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)   Offline data collection activity
				   was never started.
				   Auto Offline Data Collection: Disabled.
Self-test execution status:	  (   0)   The previous self-test routine completed
				   without error or no self-test has ever
				   been run.
Total time to complete Offline
data collection:		(40020) seconds.
Offline data collection
capabilities:			 (0x7b) SMART execute Offline immediate.
				   Auto Offline data collection on/off support.
				   Suspend Offline collection upon new
				   command.
				   Offline surface scan supported.
				   Self-test supported.
				   Conveyance Self-test supported.
				   Selective Self-test supported.
SMART capabilities:			(0x0003)   Saves SMART data before entering
				   power-saving mode.
				   Supports SMART auto save timer.
Error logging capability:		(0x01)   Error logging supported.
				   General Purpose Logging supported.
Short self-test routine
recommended polling time:	 (   2) minutes.
Extended self-test routine
recommended polling time:	 ( 401) minutes.
Conveyance self-test routine
recommended polling time:	 (   5) minutes.
SCT capabilities:		   (0x703d)   SCT Status supported.
				   SCT Error Recovery Control supported.
				   SCT Feature Control supported.
				   SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAGS	VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate	 POSR-K   200   200   051	-	0
  3 Spin_Up_Time			POS--K   177   175   021	-	6116
  4 Start_Stop_Count		-O--CK   100   100   000	-	310
  5 Reallocated_Sector_Ct   PO--CK   200   200   140	-	0
  7 Seek_Error_Rate		 -OSR-K   200   200   000	-	0
  9 Power_On_Hours		  -O--CK   065   065   000	-	25885
 10 Spin_Retry_Count		-O--CK   100   100   000	-	0
 11 Calibration_Retry_Count -O--CK   100   100   000	-	0
 12 Power_Cycle_Count	   -O--CK   100   100   000	-	120
192 Power-Off_Retract_Count -O--CK   200   200   000	-	17
193 Load_Cycle_Count		-O--CK   188   188   000	-	37973
194 Temperature_Celsius	 -O---K   126   114   000	-	24
196 Reallocated_Event_Count -O--CK   200   200   000	-	0
197 Current_Pending_Sector  -O--CK   200   200   000	-	0
198 Offline_Uncorrectable   ----CK   100   253   000	-	0
199 UDMA_CRC_Error_Count	-O--CK   200   200   000	-	0
200 Multi_Zone_Error_Rate   ---R--   200   200   000	-	0
							||||||_ K auto-keep
							|||||__ C event count
							||||___ R error rate
							|||____ S speed/performance
							||_____ O updated online
							|______ P prefailure warning

General Purpose Log Directory Version 1
SMART		   Log Directory Version 1 [multi-sector log support]
Address	Access  R/W   Size  Description
0x00	   GPL,SL  R/O	  1  Log Directory
0x01		   SL  R/O	  1  Summary SMART error log
0x02		   SL  R/O	  5  Comprehensive SMART error log
0x03	   GPL	 R/O	  6  Ext. Comprehensive SMART error log
0x06		   SL  R/O	  1  SMART self-test log
0x07	   GPL	 R/O	  1  Extended self-test log
0x09		   SL  R/W	  1  Selective self-test log
0x10	   GPL	 R/O	  1  NCQ Command Error log
0x11	   GPL	 R/O	  1  SATA Phy Event Counters log
0x21	   GPL	 R/O	  1  Write stream error log
0x22	   GPL	 R/O	  1  Read stream error log
0x80-0x9f  GPL,SL  R/W	 16  Host vendor specific log
0xa0-0xa7  GPL,SL  VS	  16  Device vendor specific log
0xa8-0xb7  GPL,SL  VS	   1  Device vendor specific log
0xbd	   GPL,SL  VS	   1  Device vendor specific log
0xc0	   GPL,SL  VS	   1  Device vendor specific log
0xc1	   GPL	 VS	  93  Device vendor specific log
0xe0	   GPL,SL  R/W	  1  SCT Command/Status
0xe1	   GPL,SL  R/W	  1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline	   Completed without error	   00%	 25851		 -
# 2  Short offline	   Completed without error	   00%	 25779		 -
# 3  Short offline	   Completed without error	   00%	 25611		 -
# 4  Short offline	   Completed without error	   00%	 25540		 -
# 5  Extended offline	Completed without error	   00%	 25501		 -
# 6  Short offline	   Completed without error	   00%	 25445		 -
# 7  Short offline	   Completed without error	   00%	 25373		 -
# 8  Short offline	   Completed without error	   00%	 25277		 -
# 9  Short offline	   Completed without error	   00%	 25205		 -
#10  Extended offline	Completed without error	   00%	 25165		 -
#11  Short offline	   Completed without error	   00%	 25109		 -
#12  Short offline	   Completed without error	   00%	 25037		 -
#13  Short offline	   Completed without error	   00%	 24941		 -
#14  Short offline	   Completed without error	   00%	 24869		 -
#15  Extended offline	Completed without error	   00%	 24830		 -
#16  Short offline	   Completed without error	   00%	 24773		 -
#17  Short offline	   Completed without error	   00%	 24701		 -
#18  Short offline	   Completed without error	   00%	 24605		 -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:				  3
SCT Version (vendor specific):	   258 (0x0102)
SCT Support Level:				   1
Device State:						Active (0)
Current Temperature:					24 Celsius
Power Cycle Min/Max Temperature:	 22/31 Celsius
Lifetime	Min/Max Temperature:	  2/36 Celsius
Under/Over Temperature Limit Count:   0/0
Vendor specific:
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

SCT Temperature History Version:	 2
Temperature Sampling Period:		 1 minute
Temperature Logging Interval:		1 minute
Min/Max recommended Temperature:	  0/60 Celsius
Min/Max Temperature Limit:		   -41/85 Celsius
Temperature History Size (Index):	478 (189)

Index	Estimated Time   Temperature Celsius
 190	2018-04-08 00:03	28  *********
 ...	..( 28 skipped).	..  *********
 219	2018-04-08 00:32	28  *********
 220	2018-04-08 00:33	27  ********
 ...	..( 16 skipped).	..  ********
 237	2018-04-08 00:50	27  ********
 238	2018-04-08 00:51	28  *********
 ...	..( 86 skipped).	..  *********
 325	2018-04-08 02:18	28  *********
 326	2018-04-08 02:19	27  ********
 ...	..(  5 skipped).	..  ********
 332	2018-04-08 02:25	27  ********
 333	2018-04-08 02:26	26  *******
 ...	..(  4 skipped).	..  *******
 338	2018-04-08 02:31	26  *******
 339	2018-04-08 02:32	25  ******
 ...	..( 14 skipped).	..  ******
 354	2018-04-08 02:47	25  ******
 355	2018-04-08 02:48	24  *****
 ...	..( 27 skipped).	..  *****
 383	2018-04-08 03:16	24  *****
 384	2018-04-08 03:17	28  *********
 ...	..(266 skipped).	..  *********
 173	2018-04-08 07:44	28  *********
 174	2018-04-08 07:45	27  ********
 ...	..(  2 skipped).	..  ********
 177	2018-04-08 07:48	27  ********
 178	2018-04-08 07:49	28  *********
 ...	..( 10 skipped).	..  *********
 189	2018-04-08 08:00	28  *********

SCT Error Recovery Control:
		   Read:	 70 (7.0 seconds)
		  Write:	 70 (7.0 seconds)

Device Statistics (GP/SMART Log 0x04) not supported

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID	  Size	 Value  Description
0x0001  2			0  Command failed due to ICRC error
0x0002  2			0  R_ERR response for data FIS
0x0003  2			0  R_ERR response for device-to-host data FIS
0x0004  2			0  R_ERR response for host-to-device data FIS
0x0005  2			0  R_ERR response for non-data FIS
0x0006  2			0  R_ERR response for device-to-host non-data FIS
0x0007  2			0  R_ERR response for host-to-device non-data FIS
0x0008  2			0  Device-to-host non-data FIS retries
0x0009  2		   50  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2		   51  Device-to-host register FISes sent due to a COMRESET
0x000b  2			0  CRC errors within host-to-device FIS
0x000f  2			0  R_ERR response for host-to-device data FIS, CRC
0x0012  2			0  R_ERR response for host-to-device non-data FIS, CRC
0x8000  4	  8511991  Vendor specific

Forum members helping here will be glad to see at least the SMART Extended Self-test Log section, which shows the history of self-test runs, and of course the output as text and not as a screenshot.

AFAICT activating SMART options in the BIOS setup is completely independent of configuring SMART tests in FreeNAS.
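For a quick look at a saved smartctl file, the handful of attributes that matter most here can be filtered out of the full output. The following is only a sketch: `smart_key_attrs` is a made-up helper name, and it assumes the attribute table layout shown in the output above (ID in the first column, raw value in the last). Attribute 199 is included because it is the one usually associated with cabling rather than the disk itself.

```shell
#!/bin/sh
# smart_key_attrs is a hypothetical helper name, not part of smartmontools.
# It reads "smartctl -x" (or "smartctl -A") output on stdin and prints the
# ID, name and raw value of attributes 5, 196, 197, 198 and 199.
# The "$2 starts with a capital letter" check skips unrelated lines that
# merely begin with the same number (e.g. the selective self-test table).
smart_key_attrs() {
  awk '$1 ~ /^(5|196|197|198|199)$/ && $2 ~ /^[A-Z]/ && NF >= 8 { print $1, $2, $NF }'
}

# Usage on the NAS (the device name is only an example):
#   smartctl -x /dev/ada3 | smart_key_attrs
```

A raw value that is nonzero, and especially one that grows between two runs, is the thing to look for in attributes 5, 197 and 198.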

Johnnie Black · Apr 8, 2018

That disk is failing and needs to be replaced ASAP

joeschmuck · Apr 8, 2018

Look at my Hard Drive Troubleshooting Guide (link in my signature); it explains the types of failures, for example that ID 5 errors are not SATA cable related.

ID 5 is the number of reallocated sectors, which basically means a physical flaw on the drive platter and/or a head failure (normally mechanical). The two outputs you provided show that the value increased significantly between them, which indicates this drive will be completely dead soon.
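The kernel messages in the first post point the same way. The byte printed after "error:" is the ATA error register, and 40 decodes to UNC (uncorrectable data), which is reported by the drive about its own media; an interface problem such as a bad cable would more typically show up as ICRC. A small decoder, offered as a sketch only: `decode_ata_error` is a hypothetical name, and only the four commonly seen bits are handled.

```shell
#!/bin/sh
# decode_ata_error is a hypothetical helper: it takes the hex byte shown
# after "error:" in the kernel log and prints the flag names for the bits
# that are set. Only four well-known bits of the ATA error register are
# decoded here; any other bits are ignored.
decode_ata_error() {
  val=$((0x$1))
  out=""
  [ $((val & 0x80)) -ne 0 ] && out="$out ICRC"   # interface CRC error
  [ $((val & 0x40)) -ne 0 ] && out="$out UNC"    # uncorrectable data
  [ $((val & 0x10)) -ne 0 ] && out="$out IDNF"   # ID (sector address) not found
  [ $((val & 0x04)) -ne 0 ] && out="$out ABRT"   # command aborted
  echo "${out# }"
}

decode_ata_error 40
```

Run against the value from the log above, this prints UNC.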

Mike G said:
This is where I ask for help and wait to be told that my SMART isn't turned on/get yelled at for not doing things right and not reading the manual.

Of course, what kind of people would we be if we didn't push you in the right direction. ;)

And never wait to troubleshoot a problem if you get an error message; many times hard drives just fail quickly.

Mike G · Apr 8, 2018

Thank you all so much for the quick response. I have ordered a new WD Red hard drive and will try to absorb as much as I can from the Hard Drive Troubleshooting Guide and other information from previous forum posts. This will be my first NAS drive replacement, and I do know the serial number associated with my ada3 drive. I may have some questions to ask before I actually attempt this, and once I'm done I will want to see if I can run long SMART tests on each drive and post the results to a file I can share for review.
Also, thanks for explaining that the BIOS setting about SMART does not impact FreeNAS.
I think I originally ran a WD diagnostics tool on all these drives on a Windows PC before installing them for use with FreeNAS, so I'll try to do that first with this new drive.
I'll also want to understand whether I should take the bad drive out and turn my NAS back on to run long SMART tests on the other drives while I wait for the new one, or whether I should first install the new one and attempt the resilvering process before conducting long SMART tests on each drive. I just want to protect the data (lots of priceless family home videos, etc.) as best I can at this point.
Can't thank you all enough!

Stux · Apr 8, 2018

Ada3 is the drive that is failing/failed right? But which drive is that?

And it’s not necessarily the one on SATA port 3.

You definitely want to work that out for sure.

I’m concerned because you said you replaced the Sata cable already... but how did you know which one to replace?

joeschmuck · Apr 8, 2018

Stux said:
Ada3 is the drive that is failing/failed right? But which drive is that?

Hopefully it's the one ending in serial number "DKV3" according to the first posting.

Mike G · Apr 8, 2018

When I replaced my locking SATA cable with another one, it was for the drive with the serial number ending in DKV3. I think I saw some posts stating that bad SATA cables could cause ATA status errors, and other posts stating that the SATA controllers on the board could fail, where someone bought a SATA controller card that fixed the problem.
As far as knowing which drive it is, I sure hope I have the right one. Since my emailed freenas.local security run output said ada3 had the ATA status errors, I thought it would be reasonable to look in the "storage" - disks area of FreeNAS, and sure enough, it did show a serial number associated with ada3 (shown in the first post screenshot with the alert popup message). I just didn't know if it was absolutely a bad disk, or if there was a potential that FreeNAS was being blocked from communicating with the drive due to some other hardware issue. The drive with that serial number is plugged into the SATA3 connector (SATA_0, white) on the board. I did not attempt to use any different SATA ports, as I use all six of the ones I understand to not have design issues.

I also wasn't getting the ada3 error message in my daily freenas.local security run output every day. I couldn't tell if there was a pattern to it and wondered if the error only appeared when a SMART test was run the previous night. I route emails from my NAS to a separate folder, because I really only want to have to look at them if I have an issue to deal with. Hopefully, once I understand when SMART tests are being run, I can figure out how to get only warning/critical emails to show up in my regular inbox. I'll try to tackle that aspect if I can get the pool healthy.

joeschmuck · Apr 8, 2018

When you posted your smartctl output you didn't show the serial number; that would have eliminated the questions. The identifier ada3 can move to a different drive: sometimes it's port based, sometimes it's just which drive is ready first. So when you have an issue, find out the drive serial number immediately and track the problem by serial number from there. If you move SATA cables around between ports and the ada number changes but the errors stay with the same serial number, then it's a drive problem. If the errors show up under a different serial number, then it's likely not the drive.
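One way to keep a record of which serial number currently sits behind which ada number is to pull the "serial number" line out of camcontrol identify for each disk. This is a minimal sketch: `serial_from_identify` is a made-up name, and the loop in the comment assumes the six disks are ada0 through ada5.

```shell
#!/bin/sh
# serial_from_identify is a hypothetical helper: it reads
# "camcontrol identify <disk>" output on stdin and prints only the serial.
serial_from_identify() {
  awk '/^serial number/ { print $3 }'
}

# Usage on the NAS (assumes the disks are ada0..ada5; not run here):
#   for d in ada0 ada1 ada2 ada3 ada4 ada5; do
#     printf '%s  ' "$d"; camcontrol identify "$d" | serial_from_identify
#   done
```

Saving that list before and after moving cables makes it easy to see whether the errors followed the serial number or stayed with the port.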

Mike G · Apr 20, 2018

Hello. I have returned after a busy couple of weeks (taxes, family stuff…) to take the next step in repairing my FreeNAS. To recap, the WD Red hard drive I ordered (same model as the existing ones) has arrived, and as I did with all the original drives, I’ve run the Western Digital Data LifeGuard Diagnostics test. No bad sectors were reported.

I feel like I have two main options here: A) Identify and replace the suspect disk immediately or B) try to run more tests and post to the board. If A) is best, see my plan below and kindly point out any flaws. I assume I don’t have to bother with shutting down jails for any portion of this work. If B) is best, please provide some direction.

For A) my steps will be as follows:

1) boot up to the Console Setup screen that gives me the numbered options. choose #9 to go into a shell command line or use shell from GUI.

2) type “zpool status” to see the gptid of the degraded disk, which as listed in my previous post should be 0021127e.

3) Hit “ESC” key to clear screen. (Not sure if that works or is even necessary)

4) type “glabel list | less” and scroll down to find the Geom name that starts with adaX (where X is a number) and reports the same gptid found in step 2.

5) Hit “ESC” to clear screen

6) type “camcontrol identify adaX | less”

7) This will display the serial number of the adaX drive, which I can match against the label on the drive to find it and remove it for replacement.

8) Go into the GUI and View Volumes –Volume Status to offline the disk, or make sure it is offline. Maybe this should go before step 2 but I don’t think it matters and may not even be necessary if a degraded disk is already offline.

9) type “shutdown -p now”, where “-p” results in power off. Or shut down from the GUI.

10) Replace old hard drive with new after removing power.

11) Power on, and boot up to the Console Setup screen that gives me the numbered options. choose #9 to go into a shell command line.

12) type “zpool status” to see the offline disk is reported in the degraded pool

13) Hit “ESC” to clear screen

14) type “smartz” (do I need to do this?, don’t think so)

15) Go into the GUI and into View Volumes and click on my pool, and then volume status icon at the bottom

16) Click on the ADAx hard drive number I am replacing, which will say offline to the right, and click on the Replace button at the bottom of the screen..

17) The popup dialog will give me a choice of a Member disk ADAx number, which I assume will be one drive to choose from.

18) Go into the shell and type “zpool status” to make sure it is resilvering. Repeat the command after it finishes to make sure there were no errors.

19) Wipe brow, find beer, do the polka, etc, etc.

20) Regroup and run SMART long tests on all the drives and keep short and long tests on a schedule.

Thanks again so much. I’ll attempt this after I read any feedback you can provide. I have not yet learned how to output command results to a file, so if it is not too much effort, not to mention that my Freenas is down, please direct me to the answer or an appropriate forum post to read.

MrToddsFriends · Apr 20, 2018

Mike G said:
For A) my steps will be as follows:

Replacing the disk is the way to go, as far as I can tell. Your list of steps looks rather long and confusing to me, so I'll try to give some hints and affirmations:

zpool status shows the gptid values of the disks that belong to each pool on the system
glabel list or glabel status (the latter has much shorter output) shows which gptid value corresponds to which adaX designation
camcontrol identify adaX (among other commands) shows which adaX designation corresponds to which serial number, with the X in adaX replaced by the actual device number
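The middle link of that chain (gptid to adaX) can be looked up mechanically from glabel status. The sketch below assumes the usual three-column glabel status layout (Name, Status, Components); `disk_for_gptid` is a hypothetical helper name, and the gptid prefix in the usage line is the one from the zpool status output earlier in the thread.

```shell
#!/bin/sh
# disk_for_gptid is a hypothetical helper: given the start of a gptid, it
# reads "glabel status" output on stdin and prints the disk that holds the
# matching partition, with the partition suffix (e.g. "p2") stripped.
disk_for_gptid() {
  awk -v id="gptid/$1" 'index($1, id) == 1 { sub(/p[0-9]+$/, "", $3); print $3 }'
}

# Usage on the NAS (not run here):
#   glabel status | disk_for_gptid 0021127e
#   camcontrol identify ada3 | grep "serial number"
```

The second usage line is the last link of the chain, from the adaX name to the serial number printed on the drive label.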

I'm not sure why you want to hit “ESC” to clear the screen between commands. Assume that this is not necessary. And what the hell should “smartz” be? Forget that, or try to find out what the correct name of this command is (or what it is supposed to do).

Be sure to have read and understood all steps in 8.1.10. Replacing a Failed Drive in the manual.

Also be sure to label all your drives appropriately during your current maintenance to make it easier for you in the future.

Mike G said:
I have not yet learned how to output command results to a file, so if it is not too much effort, not to mention that my Freenas is down, please direct me to the answer or an appropriate forum post to read.

The short answer is: > can be used to redirect the output of a command to a file. Example: To redirect the output of smartctl -x /dev/ada2 to a file named /mnt/volume0/scr/smartctl_out_ada2.txt I simply type

smartctl -x /dev/ada2 > /mnt/volume0/scr/smartctl_out_ada2.txt

In this example volume0 is the name of a pool, scr is the name of a dataset in that pool and smartctl_out_ada2.txt is the file name proper. Be prepared for permission issues with this approach: you may need to adjust permissions before the newly created file is visible in an associated share.

A longer answer would be that the details of output redirection are specific to the UNIX shell actually used (sh, bash, csh, ...). But > as mentioned above should always work as a first approach to save the output (stdout) of an arbitrary command, as long as error messages (stderr) do not need to be captured or kept separate.
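To make the stdout/stderr distinction concrete, here is a small sketch in sh/bash syntax. `demo_cmd` is just a stand-in for any real command, and the file names are made up. In csh-family shells the first form works unchanged, while capturing both streams in one file is written with >& instead of 2>&1.

```shell
#!/bin/sh
# demo_cmd stands in for any command that prints results on stdout and
# complaints on stderr.
demo_cmd() {
  echo "normal output"
  echo "error message" >&2
}

outdir=$(mktemp -d)

# Plain ">": only stdout lands in the file, stderr still goes to the terminal.
demo_cmd > "$outdir/stdout_only.txt"

# "2>&1" after the redirect: stdout and stderr end up in the same file.
demo_cmd > "$outdir/both.txt" 2>&1

# Separate files for the two streams.
demo_cmd > "$outdir/out.txt" 2> "$outdir/err.txt"
```

For sharing smartctl output on the forum, the plain > form is normally all that is needed.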

Edit: Some minor fixes and clarifications.

Mike G · Apr 20, 2018

In addition to reading that documentation and others, I also watched a few YouTube videos to see others do it. The hitting "ESC" and "smartz" and the bulk of what I wrote came from this video, but I just wanted to confirm those steps were not necessary:
https://www.youtube.com/watch?v=AoZkBugCUdA
and then I also watched this:
https://www.youtube.com/watch?v=9YmPg_uP08w
I think I feel comfortable with attempting the process now, but will hold off until I have a block of time without interruption.

joeschmuck · Apr 22, 2018

Just make sure you are replacing the correct drive, use the serial number.

Mike G said:
I’ve run the Western Digital Data LifeGuard Diagnostics test. No bad sectors were reported.

I'm not sure how thorough a test this is; I haven't run it in like a decade or more. I would recommend you run "badblocks" on the new drive to burn it in properly before using it to replace the suspect drive.

Mike G · Apr 22, 2018

OK, thanks. I was literally just about to start the disk replacement, so I'll try to figure out how to run badblocks.

Mike G · Apr 22, 2018

From https://wiki.archlinux.org/index.php/Badblocks I'm reading that
a destructive badblocks test can be run on a new drive with:
badblocks -wsv /dev/<device>
and
a non-destructive test on existing drive with:
badblocks -nsv /dev/<device>

I feel like after I identify the serial number of the degraded disk, offline it and physically remove it, that to be absolutely safe I should disconnect the sata cables from all the other disks and put in the new disk before attempting the destructive test on that new disk. This may be overkill for others, but can I go wrong here vs another method? Then after all the drives are back in and after the resilvering process I can run the non-destructive test on all disks.

MrToddsFriends · Apr 22, 2018

Mike G said:
[...] that to be absolutely safe I should disconnect the sata cables from all the other disks and put in the new disk before attempting the destructive test on that new disk.

You can't do harm to the remaining disks with the destructive badblocks test when all of them are disconnected.

Note that such a badblocks run will take several days to complete, though. Good luck!
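A back-of-the-envelope calculation shows where "several days" comes from. By default, badblocks -w uses four test patterns and each one is written to the whole disk and then read back, so the disk is traversed eight times. The sketch below assumes a nominal 4 TB capacity and an average sustained throughput of 120 MB/s; the throughput figure is purely an assumption and the real average will differ from drive to drive. `badblocks_hours` is a hypothetical helper name.

```shell
#!/bin/sh
# badblocks_hours is a hypothetical helper that estimates the run time of
# "badblocks -w" in whole hours.
#   $1 = disk size in bytes
#   $2 = assumed average throughput in MB/s (an assumption, not a measurement)
badblocks_hours() {
  bytes=$1
  mb_per_s=$2
  passes=8   # 4 patterns x (one write pass + one read-back pass)
  echo $(( bytes * passes / (mb_per_s * 1000000) / 3600 ))
}

# Nominal 4 TB drive at an assumed 120 MB/s average:
badblocks_hours 4000000000000 120
```

Under those assumptions the result is 74 hours, a little over three days.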

Mike G · Apr 22, 2018

Exactly. Thanks. I'll feel better just knowing only the new disk is connected when running a destructive test.
I just read on one thread that badblocks is more effective at finding bad sectors than SMART tests, so I'm also curious whether that should be understood as a true statement. It sounds like I should do a non-destructive test periodically on all drives.

By the way, so far I've booted up the system with all original disks, and all disks are ONLINE, with 0021127e showing a CKSUM of 12. I'll next verify the device name and serial number of that disk.

Mike G · Apr 22, 2018

I have to say, I'm so glad to now know how to output commands to a file. Thanks a million just for that help! I was actually having fun, when this is usually stressful. I ran the commands and was able to see that the gptid 0021127e corresponds to ada3 and my disk with the serial number ending in DKV3, which is what I had expected to see. I also ran a short SMART test. I know the badblocks test on the new drive will take over a day, so I'll start that sometime soon. See the output below, and if there is anything of note that might cause me to change course, please raise a flag.
I'm curious why my zpool status only has 12 checksum errors instead of the 512 listed the other week.
Also, it was up for an hour or so and I don't think it became degraded at any point, whereas it was definitely degraded the other week. The commands were run in the same order as below:

Code:

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Sat Mar 24 03:45:43 2018
config:

        NAME                                          STATE     READ WRITE CKSUM
        freenas-boot                                  ONLINE       0     0     0
          gptid/8b25d495-6a01-11e5-bc95-d05099c00684  ONLINE       0     0     0

errors: No known data errors

  pool: pool1
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 34h36m with 0 errors on Mon Mar 26 10:36:50 2018
config:

        NAME                                            STATE     READ WRITE CKSUM
        pool1                                           ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/fd6d1689-88f5-11e5-ae91-d05099c00684  ONLINE       0     0     0
            gptid/fe521d35-88f5-11e5-ae91-d05099c00684  ONLINE       0     0     0
            gptid/ff38e329-88f5-11e5-ae91-d05099c00684  ONLINE       0     0     0
            gptid/0021127e-88f6-11e5-ae91-d05099c00684  ONLINE       0     0    12
            gptid/010698d4-88f6-11e5-ae91-d05099c00684  ONLINE       0     0     0
            gptid/01ebee95-88f6-11e5-ae91-d05099c00684  ONLINE       0     0     0

errors: No known data errors

Code:

Geom name: da0p1
Providers:
1. Name: gptid/8b1a75da-6a01-11e5-bc95-d05099c00684
   Mediasize: 524288 (512k)
   Sectorsize: 512
   Stripesize: 0
   Stripeoffset: 17408
   Mode: r0w0e0
   secoffset: 0
   offset: 0
   seclength: 1024
   length: 524288
   index: 0
Consumers:
1. Name: da0p1
   Mediasize: 524288 (512k)
   Sectorsize: 512
   Stripesize: 0
   Stripeoffset: 17408
   Mode: r0w0e0

Geom name: da0p2
Providers:
1. Name: gptid/8b25d495-6a01-11e5-bc95-d05099c00684
   Mediasize: 32015114240 (29G)
   Sectorsize: 512
   Stripesize: 0
   Stripeoffset: 544768
   Mode: r1w1e1
   secoffset: 0
   offset: 0
   seclength: 62529520
   length: 32015114240
   index: 0
Consumers:
1. Name: da0p2
   Mediasize: 32015114240 (29G)
   Sectorsize: 512
   Stripesize: 0
   Stripeoffset: 544768
   Mode: r1w1e2

Geom name: ada0p2
Providers:
1. Name: gptid/fd6d1689-88f5-11e5-ae91-d05099c00684
   Mediasize: 3998639460352 (3.7T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e1
   secoffset: 0
   offset: 0
   seclength: 7809842696
   length: 3998639460352
   index: 0
Consumers:
1. Name: ada0p2
   Mediasize: 3998639460352 (3.7T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e2

Geom name: ada1p2
Providers:
1. Name: gptid/fe521d35-88f5-11e5-ae91-d05099c00684
   Mediasize: 3998639460352 (3.7T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e1
   secoffset: 0
   offset: 0
   seclength: 7809842696
   length: 3998639460352
   index: 0
Consumers:
1. Name: ada1p2
   Mediasize: 3998639460352 (3.7T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e2

Geom name: ada2p2
Providers:
1. Name: gptid/ff38e329-88f5-11e5-ae91-d05099c00684
   Mediasize: 3998639460352 (3.7T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e1
   secoffset: 0
   offset: 0
   seclength: 7809842696
   length: 3998639460352
   index: 0
Consumers:
1. Name: ada2p2
   Mediasize: 3998639460352 (3.7T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e2

Geom name: ada3p2
Providers:
1. Name: gptid/0021127e-88f6-11e5-ae91-d05099c00684
   Mediasize: 3998639460352 (3.7T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e1
   secoffset: 0
   offset: 0
   seclength: 7809842696
   length: 3998639460352
   index: 0
Consumers:
1. Name: ada3p2
   Mediasize: 3998639460352 (3.7T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e2

Geom name: ada4p2
Providers:
1. Name: gptid/010698d4-88f6-11e5-ae91-d05099c00684
   Mediasize: 3998639460352 (3.7T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e1
   secoffset: 0
   offset: 0
   seclength: 7809842696
   length: 3998639460352
   index: 0
Consumers:
1. Name: ada4p2
   Mediasize: 3998639460352 (3.7T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e2

Geom name: ada5p2
Providers:
1. Name: gptid/01ebee95-88f6-11e5-ae91-d05099c00684
   Mediasize: 3998639460352 (3.7T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e1
   secoffset: 0
   offset: 0
   seclength: 7809842696
   length: 3998639460352
   index: 0
Consumers:
1. Name: ada5p2
   Mediasize: 3998639460352 (3.7T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e2
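
The listing above can be used to tie the gptid that shows 12 checksum errors in the pool status (gptid/0021127e-...) back to a device name. Here is a minimal Python sketch of that lookup, assuming the text is in the same "Geom name / Providers / Consumers" layout as above. The function name and the shortened sample are made up for illustration.

```python
import re

def map_labels_to_devices(listing_text):
    """Return {provider label: geom name} from a 'Geom name:' style listing."""
    mapping = {}
    geom = None
    in_providers = False
    for raw_line in listing_text.splitlines():
        line = raw_line.strip()
        if line.startswith("Geom name:"):
            geom = line.split(":", 1)[1].strip()
            in_providers = False
        elif line == "Providers:":
            in_providers = True
        elif line == "Consumers:":
            in_providers = False
        elif in_providers and geom:
            # Provider entries look like "1. Name: gptid/<uuid>"
            match = re.match(r"\d+\.\s+Name:\s+(\S+)", line)
            if match:
                mapping[match.group(1)] = geom
    return mapping

# Shortened copy of the ada3p2 entry from the listing above.
SAMPLE = """\
Geom name: ada3p2
Providers:
1. Name: gptid/0021127e-88f6-11e5-ae91-d05099c00684
   Mediasize: 3998639460352 (3.7T)
Consumers:
1. Name: ada3p2
   Mediasize: 3998639460352 (3.7T)
"""

labels = map_labels_to_devices(SAMPLE)
print(labels["gptid/0021127e-88f6-11e5-ae91-d05099c00684"])  # ada3p2
```

Run against the full listing, the same lookup puts the gptid with the checksum errors on ada3p2, which matches the ada3 errors in the daily report emails.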

Below is the output of "camcontrol identify ada3". I wish the results named the device, so that all the proof was in one place.

Code:

pass4: <WDC WD40EFRX-68WT0N0 82.00A82> ATA-9 SATA 3.x device
pass4: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)

protocol			  ATA/ATAPI-9 SATA 3.x
device model		  WDC WD40EFRX-68WT0N0
firmware revision	 82.00A82
serial number		 WD-WCC4E4USDKV3
WWN				   50014ee20c7c9d7a
cylinders			 16383
heads				 16
sectors/track		 63
sector size		   logical 512, physical 4096, offset 0
LBA supported		 268435455 sectors
LBA48 supported	   7814037168 sectors
PIO supported		 PIO4
DMA supported		 WDMA2 UDMA6
media RPM			 5400

Feature					  Support  Enabled   Value		   Vendor
read ahead					 yes	yes
write cache					yes	yes
flush cache					yes	yes
overlap						no
Tagged Command Queuing (TCQ)   no	no
Native Command Queuing (NCQ)   yes		32 tags
NCQ Queue Management		   no
NCQ Streaming				  no
Receive & Send FPDMA Queued	no
SMART						  yes	yes
microcode download			 yes	yes
security					   yes	no
power management			   yes	yes
advanced power management	  no	no
automatic acoustic management  no	no
media status notification	  no	no
power-up in Standby			yes	no
write-read-verify			  no	no
unload						 yes	yes
general purpose logging		yes	yes
free-fall					  no	no
Data Set Management (DSM/TRIM) no
Host Protected Area (HPA)	  yes	  no	  7814037168/7814037168
HPA - Security				 no
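
As a quick sanity check, the sector counts in these outputs can be turned into byte sizes and compared with each other. This is only an arithmetic sketch. The numbers are copied from the output above and from the ada3p2 entry in the label listing, and the helper name is made up.

```python
SECTOR_BYTES = 512  # logical sector size reported for this drive

def sectors_to_bytes(sectors, sector_bytes=SECTOR_BYTES):
    """Convert a sector count into a size in bytes."""
    return sectors * sector_bytes

disk_bytes = sectors_to_bytes(7814037168)  # "LBA48 supported" sector count
part_bytes = sectors_to_bytes(7809842696)  # "seclength" of ada3p2

print(disk_bytes)               # 4000787030016
print(part_bytes)               # 3998639460352
print(disk_bytes - part_bytes)  # 2147569664
```

The partition size matches the Mediasize shown for ada3p2, so the drive is reporting a consistent geometry. The remaining bytes belong to whatever else is on the disk, which these outputs do not show.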

Code:

smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p25 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Western Digital Red
Device Model:	 WDC WD40EFRX-68WT0N0
Serial Number:	WD-WCC4E4USDKV3
LU WWN Device Id: 5 0014ee 20c7c9d7a
Firmware Version: 82.00A82
User Capacity:	4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	5400 rpm
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:	Sun Apr 22 11:21:37 2018 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:	  (   0)	The previous self-test routine completed
					without error or no self-test has ever
					been run.
Total time to complete Offline
data collection:		 (53280) seconds.
Offline data collection
capabilities:			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:			(0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:		(0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine
recommended polling time:	 (   2) minutes.
Extended self-test routine
recommended polling time:	 ( 532) minutes.
Conveyance self-test routine
recommended polling time:	 (   5) minutes.
SCT capabilities:			(0x703d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   171   170   051    -    192808
  3 Spin_Up_Time            POS--K   200   177   021    -    6958
  4 Start_Stop_Count        -O--CK   100   100   000    -    103
  5 Reallocated_Sector_Ct   PO--CK   158   158   140    -    1242
  7 Seek_Error_Rate         -OSR-K   200   168   000    -    0
  9 Power_On_Hours          -O--CK   074   074   000    -    19294
 10 Spin_Retry_Count        -O--CK   100   100   000    -    0
 11 Calibration_Retry_Count -O--CK   100   100   000    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    103
192 Power-Off_Retract_Count -O--CK   200   200   000    -    51
193 Load_Cycle_Count        -O--CK   200   200   000    -    608
194 Temperature_Celsius     -O---K   120   102   000    -    32
196 Reallocated_Event_Count -O--CK   111   111   000    -    89
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
198 Offline_Uncorrectable   ----CK   100   253   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
200 Multi_Zone_Error_Rate   ---R--   052   052   000    -    59520
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART		   Log Directory Version 1 [multi-sector log support]
Address	Access  R/W   Size  Description
0x00	   GPL,SL  R/O	  1  Log Directory
0x01		   SL  R/O	  1  Summary SMART error log
0x02		   SL  R/O	  5  Comprehensive SMART error log
0x03	   GPL	 R/O	  6  Ext. Comprehensive SMART error log
0x06		   SL  R/O	  1  SMART self-test log
0x07	   GPL	 R/O	  1  Extended self-test log
0x09		   SL  R/W	  1  Selective self-test log
0x10	   GPL	 R/O	  1  NCQ Command Error log
0x11	   GPL	 R/O	  1  SATA Phy Event Counters
0x21	   GPL	 R/O	  1  Write stream error log
0x22	   GPL	 R/O	  1  Read stream error log
0x80-0x9f  GPL,SL  R/W	 16  Host vendor specific log
0xa0-0xa7  GPL,SL  VS	  16  Device vendor specific log
0xa8-0xb6  GPL,SL  VS	   1  Device vendor specific log
0xb7	   GPL,SL  VS	  39  Device vendor specific log
0xbd	   GPL,SL  VS	   1  Device vendor specific log
0xc0	   GPL,SL  VS	   1  Device vendor specific log
0xc1	   GPL	 VS	  93  Device vendor specific log
0xe0	   GPL,SL  R/W	  1  SCT Command/Status
0xe1	   GPL,SL  R/W	  1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
Device Error Count: 17325 (device log contains only the most recent 24 errors)
	CR	 = Command Register
	FEATR  = Features Register
	COUNT  = Count (was: Sector Count) Register
	LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
	LH	 = LBA High (was: Cylinder High) Register	]   LBA
	LM	 = LBA Mid (was: Cylinder Low) Register	  ] Register
	LL	 = LBA Low (was: Sector Number) Register	 ]
	DV	 = Device (was: Device/Head) Register
	DC	 = Device Control Register
	ER	 = Error register
	ST	 = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 17325 [20] occurred at disk power-on lifetime: 18865 hours (786 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  10 -- 51 00 00 00 01 91 92 23 68 40 00  Error: IDNF at LBA = 0x191922368 = 6737240936

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 00 08 00 68 00 01 91 92 23 68 40 08 10d+08:23:39.300  WRITE FPDMA QUEUED
  61 00 10 00 68 00 01 91 92 23 48 40 08 10d+08:23:39.299  WRITE FPDMA QUEUED
  61 00 10 00 68 00 00 a6 b8 f5 18 40 08 10d+08:23:39.066  WRITE FPDMA QUEUED
  61 00 08 00 68 00 00 a6 5e b7 08 40 08 10d+08:23:39.066  WRITE FPDMA QUEUED
  61 00 08 00 68 00 01 91 92 23 60 40 08 10d+08:23:38.786  WRITE FPDMA QUEUED

Error 17324 [19] occurred at disk power-on lifetime: 18865 hours (786 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  10 -- 51 00 00 00 00 a6 b8 f4 30 40 00  Error: IDNF at LBA = 0xa6b8f430 = 2797138992

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 00 08 00 28 00 00 a6 b8 f4 30 40 08 10d+08:22:54.439  WRITE FPDMA QUEUED
  ea 00 00 00 00 00 00 00 00 00 00 40 08 10d+08:22:49.597  FLUSH CACHE EXT
  61 00 08 00 28 00 01 d1 c0 be 38 40 08 10d+08:22:49.596  WRITE FPDMA QUEUED
  61 00 08 00 20 00 01 d1 c0 bc 38 40 08 10d+08:22:49.596  WRITE FPDMA QUEUED
  61 00 08 00 20 00 00 00 40 04 38 40 08 10d+08:22:49.596  WRITE FPDMA QUEUED

Error 17323 [18] occurred at disk power-on lifetime: 18865 hours (786 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  10 -- 51 00 00 00 00 a6 b8 e5 60 40 00  Error: IDNF at LBA = 0xa6b8e560 = 2797135200

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 00 08 00 c0 00 00 a6 b8 e5 60 40 08 10d+08:19:35.341  WRITE FPDMA QUEUED
  ea 00 00 00 00 00 00 00 00 00 00 40 08 10d+08:19:30.939  FLUSH CACHE EXT
  61 00 08 00 c0 00 01 d1 c0 be 08 40 08 10d+08:19:30.897  WRITE FPDMA QUEUED
  61 00 08 00 b8 00 01 d1 c0 bc 08 40 08 10d+08:19:30.897  WRITE FPDMA QUEUED
  61 00 08 00 b0 00 00 00 40 04 08 40 08 10d+08:19:30.896  WRITE FPDMA QUEUED

Error 17322 [17] occurred at disk power-on lifetime: 18865 hours (786 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  10 -- 51 00 00 00 00 00 40 02 70 40 00  Error: IDNF at LBA = 0x00400270 = 4194928

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 00 08 00 10 00 00 00 40 02 70 40 08 10d+08:14:32.901  WRITE FPDMA QUEUED
  ea 00 00 00 00 00 00 00 00 00 00 40 08 10d+08:14:32.821  FLUSH CACHE EXT
  61 00 10 00 10 00 00 a6 b8 d8 c0 40 08 10d+08:14:32.821  WRITE FPDMA QUEUED
  61 00 10 00 10 00 00 a6 5e 9c 60 40 08 10d+08:14:32.820  WRITE FPDMA QUEUED
  61 00 28 00 10 00 01 91 91 f1 78 40 08 10d+08:14:32.820  WRITE FPDMA QUEUED

Error 17321 [16] occurred at disk power-on lifetime: 18865 hours (786 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  10 -- 51 00 00 00 01 91 91 f0 e8 40 00  Error: IDNF at LBA = 0x19191f0e8 = 6737228008

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 00 08 00 00 00 01 91 91 f0 e8 40 08 10d+08:14:12.018  WRITE FPDMA QUEUED
  ea 00 00 00 00 00 00 00 00 00 00 40 08 10d+08:14:07.184  FLUSH CACHE EXT
  61 00 08 00 00 00 01 d1 c0 be 60 40 08 10d+08:14:07.184  WRITE FPDMA QUEUED
  61 00 08 00 f8 00 01 d1 c0 bc 60 40 08 10d+08:14:07.184  WRITE FPDMA QUEUED
  61 00 08 00 f0 00 00 00 40 04 60 40 08 10d+08:14:07.184  WRITE FPDMA QUEUED

Error 17320 [15] occurred at disk power-on lifetime: 18865 hours (786 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  10 -- 51 00 00 00 00 e4 03 10 78 40 00  Error: IDNF at LBA = 0xe4031078 = 3825406072

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 00 08 00 f8 00 00 e4 03 10 78 40 08 10d+08:10:31.031  WRITE FPDMA QUEUED
  ea 00 00 00 00 00 00 00 00 00 00 40 08 10d+08:10:27.522  FLUSH CACHE EXT
  61 00 08 00 f8 00 01 d1 c0 be 50 40 08 10d+08:10:27.522  WRITE FPDMA QUEUED
  61 00 08 00 f8 00 01 d1 c0 bc 50 40 08 10d+08:10:27.522  WRITE FPDMA QUEUED
  61 00 08 00 f0 00 00 00 40 04 50 40 08 10d+08:10:27.522  WRITE FPDMA QUEUED

Error 17319 [14] occurred at disk power-on lifetime: 18865 hours (786 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  10 -- 51 00 00 00 00 e4 02 f9 60 40 00  Error: IDNF at LBA = 0xe402f960 = 3825400160

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 00 08 00 b8 00 00 e4 02 f9 60 40 08 10d+08:03:32.940  WRITE FPDMA QUEUED
  ea 00 00 00 00 00 00 00 00 00 00 40 08 10d+08:03:31.005  FLUSH CACHE EXT
  ea 00 00 00 00 00 00 00 00 00 00 40 08 10d+08:03:30.969  FLUSH CACHE EXT
  61 00 30 00 b0 00 01 86 db 37 88 40 08 10d+08:03:30.947  WRITE FPDMA QUEUED
  61 00 08 00 a8 00 01 d1 c0 bd f0 40 08 10d+08:03:30.919  WRITE FPDMA QUEUED

Error 17318 [13] occurred at disk power-on lifetime: 18865 hours (786 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  10 -- 51 00 00 00 01 86 e2 70 a8 40 00  Error: IDNF at LBA = 0x186e270a8 = 6557954216

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 00 08 00 80 00 01 86 e2 70 a8 40 08 10d+08:02:51.581  WRITE FPDMA QUEUED
  ea 00 00 00 00 00 00 00 00 00 00 40 08 10d+08:02:49.105  FLUSH CACHE EXT
  61 00 08 00 80 00 01 d1 c0 bd c8 40 08 10d+08:02:49.104  WRITE FPDMA QUEUED
  61 00 08 00 80 00 01 d1 c0 bb c8 40 08 10d+08:02:49.104  WRITE FPDMA QUEUED
  61 00 08 00 78 00 00 00 40 03 c8 40 08 10d+08:02:49.104  WRITE FPDMA QUEUED

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline	   Completed without error	   00%	 19292		 -
# 2  Short offline	   Completed without error	   00%	 19109		 -
# 3  Extended offline	Completed without error	   00%	 18668		 -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:				  3
SCT Version (vendor specific):	   258 (0x0102)
SCT Support Level:				   1
Device State:						Active (0)
Current Temperature:					32 Celsius
Power Cycle Min/Max Temperature:	 24/32 Celsius
Lifetime	Min/Max Temperature:	 23/50 Celsius
Under/Over Temperature Limit Count:   0/0

SCT Temperature History Version:	 2
Temperature Sampling Period:		 1 minute
Temperature Logging Interval:		1 minute
Min/Max recommended Temperature:	  0/60 Celsius
Min/Max Temperature Limit:		   -41/85 Celsius
Temperature History Size (Index):	478 (129)

Index	Estimated Time   Temperature Celsius
 130	2018-04-22 03:24	32  *************
 ...	..( 81 skipped).	..  *************
 212	2018-04-22 04:46	32  *************
 213	2018-04-22 04:47	31  ************
 ...	..(  8 skipped).	..  ************
 222	2018-04-22 04:56	31  ************
 223	2018-04-22 04:57	32  *************
 ...	..( 13 skipped).	..  *************
 237	2018-04-22 05:11	32  *************
 238	2018-04-22 05:12	31  ************
 ...	..(  2 skipped).	..  ************
 241	2018-04-22 05:15	31  ************
 242	2018-04-22 05:16	32  *************
 ...	..( 18 skipped).	..  *************
 261	2018-04-22 05:35	32  *************
 262	2018-04-22 05:36	31  ************
 ...	..( 17 skipped).	..  ************
 280	2018-04-22 05:54	31  ************
 281	2018-04-22 05:55	32  *************
 ...	..(  9 skipped).	..  *************
 291	2018-04-22 06:05	32  *************
 292	2018-04-22 06:06	31  ************
 ...	..(  8 skipped).	..  ************
 301	2018-04-22 06:15	31  ************
 302	2018-04-22 06:16	32  *************
 ...	..( 58 skipped).	..  *************
 361	2018-04-22 07:15	32  *************
 362	2018-04-22 07:16	33  **************
 363	2018-04-22 07:17	 ?  -
 364	2018-04-22 07:18	26  *******
 365	2018-04-22 07:19	26  *******
 366	2018-04-22 07:20	26  *******
 367	2018-04-22 07:21	27  ********
 368	2018-04-22 07:22	27  ********
 369	2018-04-22 07:23	27  ********
 370	2018-04-22 07:24	28  *********
 ...	..(  3 skipped).	..  *********
 374	2018-04-22 07:28	28  *********
 375	2018-04-22 07:29	29  **********
 ...	..(  8 skipped).	..  **********
 384	2018-04-22 07:38	29  **********
 385	2018-04-22 07:39	30  ***********
 ...	..(  5 skipped).	..  ***********
 391	2018-04-22 07:45	30  ***********
 392	2018-04-22 07:46	31  ************
 ...	..( 30 skipped).	..  ************
 423	2018-04-22 08:17	31  ************
 424	2018-04-22 08:18	32  *************
 ...	..(  3 skipped).	..  *************
 428	2018-04-22 08:22	32  *************
 429	2018-04-22 08:23	 ?  -
 430	2018-04-22 08:24	24  *****
 431	2018-04-22 08:25	25  ******
 ...	..(  2 skipped).	..  ******
 434	2018-04-22 08:28	25  ******
 435	2018-04-22 08:29	26  *******
 436	2018-04-22 08:30	27  ********
 437	2018-04-22 08:31	27  ********
 438	2018-04-22 08:32	28  *********
 ...	..(  4 skipped).	..  *********
 443	2018-04-22 08:37	28  *********
 444	2018-04-22 08:38	29  **********
 ...	..(  8 skipped).	..  **********
 453	2018-04-22 08:47	29  **********
 454	2018-04-22 08:48	30  ***********
 ...	..(  4 skipped).	..  ***********
 459	2018-04-22 08:53	30  ***********
 460	2018-04-22 08:54	31  ************
 ...	..( 10 skipped).	..  ************
 471	2018-04-22 09:05	31  ************
 472	2018-04-22 09:06	32  *************
 ...	..( 66 skipped).	..  *************
  61	2018-04-22 10:13	32  *************
  62	2018-04-22 10:14	33  **************
 ...	..( 11 skipped).	..  **************
  74	2018-04-22 10:26	33  **************
  75	2018-04-22 10:27	32  *************
 ...	..( 45 skipped).	..  *************
 121	2018-04-22 11:13	32  *************
 122	2018-04-22 11:14	33  **************
 ...	..(  6 skipped).	..  **************
 129	2018-04-22 11:21	33  **************

SCT Error Recovery Control:
		   Read:	 70 (7.0 seconds)
		  Write:	 70 (7.0 seconds)

Device Statistics (GP Log 0x04) not supported

SATA Phy Event Counters (GP Log 0x11)
ID	  Size	 Value  Description
0x0001  2			0  Command failed due to ICRC error
0x0002  2			0  R_ERR response for data FIS
0x0003  2			0  R_ERR response for device-to-host data FIS
0x0004  2			0  R_ERR response for host-to-device data FIS
0x0005  2			0  R_ERR response for non-data FIS
0x0006  2			0  R_ERR response for device-to-host non-data FIS
0x0007  2			0  R_ERR response for host-to-device non-data FIS
0x0008  2			0  Device-to-host non-data FIS retries
0x0009  2			2  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2			3  Device-to-host register FISes sent due to a COMRESET
0x000b  2			0  CRC errors within host-to-device FIS
0x000f  2			0  R_ERR response for host-to-device data FIS, CRC
0x0012  2			0  R_ERR response for host-to-device non-data FIS, CRC
0x8000  4		 4365  Vendor specific
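
For anyone trying to read a report like the one above, here is a small Python sketch of two checks. The first pulls the raw values of a few attributes out of the attribute table. The second rebuilds the LBA that smartctl prints in the error log from the six register bytes. The function names and the list of "watched" attributes are assumptions chosen for illustration, not an official list, and the parser only handles rows whose raw value is a plain integer, as in this report.

```python
# Attributes picked for this sketch; this choice is an assumption.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
)

def parse_attributes(table_text):
    """Return {name: raw value} for rows shaped like the attribute table."""
    attrs = {}
    for line in table_text.splitlines():
        parts = line.split()
        # Row layout: ID NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
        if len(parts) >= 8 and parts[0].isdigit() and parts[7].isdigit():
            attrs[parts[1]] = int(parts[7])
    return attrs

def nonzero_watched(attrs):
    """Return the watched attributes whose raw value is above zero."""
    return {name: attrs[name] for name in WATCHED if attrs.get(name, 0) > 0}

def decode_lba48(register_bytes):
    """Join six hex bytes (most significant first) into one LBA."""
    return int("".join(register_bytes.split()), 16)

# Four rows copied from the report above.
SAMPLE_ROWS = """\
  5 Reallocated_Sector_Ct   PO--CK   158   158   140    -    1242
196 Reallocated_Event_Count -O--CK   111   111   000    -    89
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
"""

print(nonzero_watched(parse_attributes(SAMPLE_ROWS)))
# {'Reallocated_Sector_Ct': 1242, 'Reallocated_Event_Count': 89}

# LBA bytes from error 17325: 00 01 91 92 23 68
print(decode_lba48("00 01 91 92 23 68"))  # 6737240936
```

On this drive the reallocation counters are the ones that stand out, while the CRC error count is zero. That fits a drive that is remapping bad sectors better than it fits a cabling problem, although the counters alone do not prove it.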

Mike G · Apr 22, 2018

Sorry to post another image, but I photographed what came across my screen shortly before running the SMART short test. It was at this point that I turned off my jails. I've seen this before, and wonder whether it is caused by the disk issue or whether there is another problem.

joeschmuck · Apr 22, 2018

Mike G said:
I just read on one thread that badblocks is more effective at finding bad sectors than SMART tests, so also curious to see if that should be understood as a true statement.

Yes, badblocks is more thorough than a SMART Long test. In its destructive write mode it writes a series of test patterns across the whole drive and reads each one back, so it effectively tests the drive 4 times before it completes.
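
To show what a pattern test means in practice, here is a rough Python sketch of the write-then-verify idea. It runs against an in-memory buffer standing in for a disk. It is not how badblocks is implemented, and the four pattern bytes are assumed to match the defaults of its write mode. The function name and block size are made up for the example.

```python
import io

# Assumed pattern bytes; see the note above.
PATTERNS = (0xAA, 0x55, 0xFF, 0x00)

def pattern_test(dev, size, block_size=4096, patterns=PATTERNS):
    """Write each pattern over `dev`, read it back, and return the
    sorted indices of blocks that did not read back as written."""
    bad = set()
    blocks = size // block_size
    for pattern in patterns:
        expected = bytes([pattern]) * block_size
        dev.seek(0)
        for _ in range(blocks):   # write pass: destroys existing data
            dev.write(expected)
        dev.seek(0)
        for index in range(blocks):  # verify pass
            if dev.read(block_size) != expected:
                bad.add(index)
    return sorted(bad)

# A healthy 64 KiB "disk" held in memory.
fake_disk = io.BytesIO(bytes(16 * 4096))
print(pattern_test(fake_disk, 16 * 4096))  # []
```

Every pass overwrites the whole device, so a test like this only makes sense on a drive that holds no data yet, such as a new replacement drive.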

Mike G said:
It sounds like I should do a non-destructive test periodically on all drives.

I would not recommend this; just run the SMART Long test on each drive periodically. For example, I run a Short test each day on my drives and a Long test once a week. Some folks run a Short test every day and a Long test once a month. Since my drives spin all the time and see little use, I have no issue with testing as often as I described.

Mike G said:
Sorry to post another image, but I photographed what came across my screen shortly before running the SMART short test. It was at this point that I turned off my jails. I've seen this before, and wonder whether it is caused by the disk issue or whether there is another problem.

You really should start using Google to search the internet for help. I searched for "freed uma keg freenas" and the first hit was this link here.

So run badblocks on the new drive. It will take a very long time, but once it's done you can be fairly certain that the drive is good to go, and then you can replace the failed drive.

I think you are on the road to recovery.
