uncorrectable sectors = drive failure imminent?


odoyle

Explorer
Joined
Sep 2, 2014
Messages
62
Hi guys,
I have a RAIDZ2 with 4 disks; one disk seems unhappy lately. I got this message:

Offline uncorrectable sectors
not capable of SMART self-check

I posted zpool and smartctl results below; any opinions on how bad this is?

Question is, should I wait for it to fail? Maybe it will be fine for a while...
From my understanding, two disks can fail, but not in the same vdev...
Should I use this opportunity to upgrade disk size with the replacement? Instead of replacing with a similar 2 TB drive, could I get a 3 or 4 TB even if the other disks are all 2 TB?
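From what I've read (correct me if I'm wrong), a bigger disk should work as a replacement, but ZFS will only use 2 TB of it until all four disks in the vdev are upgraded; after that the pool can grow on its own if the autoexpand property is set, something like:

zpool set autoexpand=on vol1
zpool get autoexpand vol1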

zpool status
pool: freenas-boot
state: ONLINE
scan: scrub repaired 0 in 0h0m with 0 errors on Thu Jun 7 03:45:39 2018
config:

NAME          STATE     READ WRITE CKSUM
freenas-boot  ONLINE       0     0     0
  da0p2       ONLINE       0     0     0

errors: No known data errors

pool: vol1
state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
the pool may no longer be accessible by software that does not support
the features. See zpool-features(7) for details.
scan: scrub repaired 0 in 5h48m with 0 errors on Sun Jun 17 05:48:30 2018
config:

NAME                                                  STATE     READ WRITE CKSUM
vol1                                                  ONLINE       0     0     0
  raidz2-0                                            ONLINE       0     0     0
    gptid/522190ca-3a38-11e4-a945-002590466eeb.eli    ONLINE       0     0     0
    gptid/52a74e89-3a38-11e4-a945-002590466eeb.eli    ONLINE       0     0     0
    gptid/5330724c-3a38-11e4-a945-002590466eeb.eli    ONLINE       0     0     0
    gptid/5412a572-3a38-11e4-a945-002590466eeb.eli    ONLINE       0     0     0

errors: No known data errors



smartctl -a /dev/ada1

smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p16 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.14 (AF)
Device Model: ST2000DM001-1ER164
Serial Number: Z4Z08R9E
LU WWN Device Id: 5 000c50 0673fee11
Firmware Version: CC43
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Fri Jun 22 18:49:42 2018 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 89) seconds.
Offline data collection
capabilities: (0x73) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 227) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x1085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 119 099 006 Pre-fail Always - 231892632
3 Spin_Up_Time 0x0003 094 094 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 58
5 Reallocated_Sector_Ct 0x0033 094 094 010 Pre-fail Always - 7688
7 Seek_Error_Rate 0x000f 083 060 030 Pre-fail Always - 228365176
9 Power_On_Hours 0x0032 065 065 000 Old_age Always - 31326
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 58
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 098 000 Old_age Always - 3 3 3
189 High_Fly_Writes 0x003a 089 089 000 Old_age Always - 11
190 Airflow_Temperature_Cel 0x0022 061 050 045 Old_age Always - 39 (Min/Max 20/50)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 0
193 Load_Cycle_Count 0x0032 001 001 000 Old_age Always - 774300
194 Temperature_Celsius 0x0022 039 050 000 Old_age Always - 39 (0 18 0 0 0)
197 Current_Pending_Sector 0x0012 001 001 000 Old_age Always - 29856
198 Offline_Uncorrectable 0x0010 001 001 000 Old_age Offline - 29856
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 27433h+05m+58.559s
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 29547895209
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 972292748944

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 31315 -
# 2 Short offline Completed without error 00% 31239 -
# 3 Short offline Completed without error 00% 31071 -
# 4 Extended offline Completed without error 00% 30979 -
# 5 Short offline Completed without error 00% 30903 -
# 6 Short offline Completed without error 00% 30663 -
# 7 Extended offline Completed without error 00% 30571 -
# 8 Short offline Completed without error 00% 30495 -
# 9 Short offline Completed without error 00% 30327 -
#10 Extended offline Completed without error 00% 30236 -
#11 Short offline Completed without error 00% 30159 -
#12 Short offline Completed without error 00% 29943 -
#13 Short offline Completed without error 00% 29775 -
#14 Short offline Completed without error 00% 29607 -
#15 Short offline Completed without error 00% 29439 -
#16 Short offline Completed without error 00% 29199 -
#17 Extended offline Completed without error 00% 29108 -
#18 Short offline Completed without error 00% 29031 -
#19 Short offline Completed without error 00% 28863 -
#20 Extended offline Completed without error 00% 28773 -
#21 Short offline Completed without error 00% 28696 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
You've got massive problems. Let me explain why:

  • Attribute #5: Reallocated Sector Count. Desired number: 0. Your number: 7688. Which, by the way, may be a FreeNAS forum record.
  • Attribute #9: Power_On_Hours. Your number: 31326. Drives are officially in the elderly zone at 30,000 hours; the more conservative among us would start replacing drives automatically at that point.
  • Attribute #193: Load Cycle Count. Well, if your drive is 31326 hours in, I'd expect this number to be about a thousand or so. This drive is designed for a few tens of thousands of load cycles. Your number, sir, is an exceptionally impressive 774300. Well done, sir. Perhaps you've been spinning down this drive or something, but you have worn it out.
  • Attributes #197 and #198: Bad sectors. Desired number: 0. The number at which the more conservative among us would replace the drive: 1. Your number: 29856. (Definitely a record for the FreeNAS forum; I have literally never seen that.)

While I salute you for what appears to be a reasonable regimen of SMART tests, long and short, this drive is totally hosed. I am surprised it hasn't joined the choir invisible yet.

This drive could have numbers only 1% as bad as these, and it would be WAY over the cutoff where I would replace it.

You are lucky. This particular Seagate Barracuda is known as one of the worst drives ever, and you got a lot of use out of it.

Replace the drive now.
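
For anyone else reading along, a quick way to pull just the attributes I called out above (a sketch; substitute your own device name for ada1):

smartctl -A /dev/ada1 | egrep 'Reallocated_Sector_Ct|Power_On_Hours|Load_Cycle_Count|Current_Pending_Sector|Offline_Uncorrectable'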
 

odoyle

Explorer
Joined
Sep 2, 2014
Messages
62
hah, thanks! Do I win some prizes?! Seriously, that was a very helpful description of those parameters, and I appreciate the time you took to respond and include some humor :)
So I have another one of these "worst drives ever"; I checked it, and it doesn't have any Offline_Uncorrectable or bad sectors, but its Load_Cycle_Count is 774587. Not sure why that happened; I followed some guidelines years ago to set all this up, and the power management setting is 128 (no spindown).
I checked the other two drives in my array which are both
Model Family: Western Digital Green
Device Model: WDC WD20EZRX-00D8PB0
Hopefully these aren't the "second worst drives ever"! The numbers look OK, but I didn't know people normally replace drives after a certain number of power-on hours... I learned a lot from your post!
As I shop for replacements, is there any way to avoid getting bad hard drive models? I was thinking of getting some WD Reds...
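For what it's worth, here's how I checked the power management level just now (I believe recent smartctl versions can read it directly; ada2 is just an example device name):

smartctl -g apm /dev/ada2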
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
So, there were problems with the Seagate Barracudas from that time period. If I'm not mistaken, the 3TB drive was the worst of that series, but they were all pretty bad. The failures of those Seagates did huge damage to Seagate's reputation for consumer drives for years. You may search the forum to see how many people had catastrophic failures. To this day, I won't use Seagate drives, even if they're free. I mean that literally--you could give me free Seagate drives, and I won't put a single byte of data on them. But I admit that is completely ridiculous of me...but ridiculous or not, that's what it is.

I would really super strongly recommend completely DISABLING the power management and acoustic features in the "view disks" screen. The idea that one can "save power" by setting these things in 2018 (or even 2013) is really quite passé. Drives like these, spinning 24/7, probably cost $5 of electricity per year to run. (That's actual mathematics, not a guess.) With AAM/APM features enabled, god knows what kind of spindowns or unloads the disks do, and for what purpose, really?
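
The supported way to do this is in the FreeNAS GUI, but for the curious, most drives will also take the change from the shell via smartctl; a sketch, not the official procedure (and the setting may not survive a power cycle on some drives):

smartctl -s apm,off /dev/ada1
smartctl -s aam,off /dev/ada1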

The issue, also, with drives not INTENDED for NAS deployment (and I don't believe any of the drives you have, not these Seagates, not those WD Greens, are) is that in a NAS, for reasons I won't get into, the way activity works, they can just thrash load-unload-load-unload every minute. That appears to be your problem. In the case of WD Greens (and Blues), in 95% of cases you can run wdidle3.exe from a true DOS boot and reset some timing behavior to disable that. You will find it referenced a million times if you google "wdidle3 freenas". Most drives are engineered to endure a certain number of load-unload cycles---that number is usually in the (low) tens of thousands, and is never more than a hundred thousand to my knowledge. The firmware on the WD Reds is designed, basically, not to trigger load-unload cycles at all, whereas the Greens and Blues come out of the box unloading the heads after something like 30 seconds of no activity (not sure what the exact number is).
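
An easy way to see whether your disks are thrashing like this is to snapshot Load_Cycle_Count on each one, wait an hour of normal activity, and run it again; a rough shell sketch (adjust the device names to match your system):

for d in ada0 ada1 ada2 ada3; do echo -n "$d: "; smartctl -A /dev/$d | awk '/Load_Cycle_Count/ {print $10}'; done

If the number climbs by more than a handful per hour, the drive is unloading constantly.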

As to your specific questions:

The WD20EZRX is a very good drive, but see the wdidle3 comment above, and also, make sure the temps on that drive stay good. Every WD20EZRX I have seen take a dump had a history of temperature problems.

If I am buying new drives TODAY for a FreeNAS box, I don't even consider anything other than the standard WD Reds. WD Reds are the galactically overwhelming favorite of the FreeNAS guys in the forum and on IRC. Other guys confirm the HGST NAS drives are also excellent, but I can't personally speak for those. As for Seagate, we have certainly had a lot of users report excellent results with the latest Seagate IronWolf NAS stuff, and I see almost zero users complaining about failures of those, but it'll be years before I even consider using Seagate myself.

And to augment the discussion about power-on hours: Do not take what I have said too strongly to heart. For me, these days, the cost of a drive is trivial. "Proactive replacement when a drive gets to be 3-4 years old" is certainly not considered "standard" or "necessary", and is a luxury of the overly careful NAS guy. It is very reasonable to run your drives until you have MATERIAL evidence that they are end of life.
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
But to reiterate: You don't have to be a conservative NAS psycho to replace your drive with the smart info above. That drive is completely hosed and in mid-death-knell, as we speak. You can ask anyone.
 

odoyle

Explorer
Joined
Sep 2, 2014
Messages
62
Thanks, running backup now and replacement drive ordered :)
Good to hear the WDs are OK - I remember reading at the time about NAS drives vs. non-NAS drives, and I don't remember why I chose the Greens, but I did do the mod you mentioned to change the settings on the WDs.
That makes sense about the power-saving thing; I thought it was good for the drives (maybe they spin slower?), but I will try turning it off.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
...and lest you think that @DrKK is exaggerating, I concur completely with his recommendation on the disk in question: Replace it immediately (and by "immediately" I mean "as soon as you have a burned-in and tested replacement on hand"). It is of that model disk that @DrKK has said that its best storage configuration is buried as deep as possible in the nearest landfill.

I consider myself fairly conservative, though still less so than some. I won't immediately replace a disk with a single-digit number of bad sectors that's staying constant (though I will keep an eye on it). I will use the hot swap capabilities of my hardware to minimize downtime. But your disk has many thousands of bad sectors--it's sailing off to that undiscovered country from whose bourn no traveler returns. Replace it as soon as you can, and be sure to follow the manual's instructions (DO NOT UNDER ANY CIRCUMSTANCES TOUCH THE VOLUME MANAGER).

I do want to address one misconception in what you've stated. Your pool consists of four disks in a single RAIDZ2 vdev. Of those four, any two can fail without data loss, as long as there are no data errors with the remaining two.
 

pro lamer

Guru
Joined
Feb 16, 2018
Messages
626
DO NOT UNDER ANY CIRCUMSTANCES TOUCH THE VOLUME MANAGER
I think I need to write that down, too. Is the warning literally in the manual already?
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Is the warning already in the manual literally?
Nowhere in the manual does it tell you to use the volume manager as any part of the disk replacement process, but it doesn't explicitly warn NOT to use the Volume Manager.
 

pro lamer

Guru
Joined
Feb 16, 2018
Messages
626
(DO NOT UNDER ANY CIRCUMSTANCES TOUCH THE VOLUME MANAGER)
I guess the reason is described in the thread, with this post:
people keep thinking that the way to replace a disk goes through the Volume Manager

Sent from my mobile phone
 

odoyle

Explorer
Joined
Sep 2, 2014
Messages
62
Good correction; I think I was confused with another system I used to manage, which I believe was a RAID 10.
Question for everyone - if the stats on this disk were so bad, how come smartctl or one of the tests I had set up didn't detect it until now? Should I change thresholds for warnings or something?
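Looking at my output again, every self-test says "Completed without error" and the overall assessment is PASSED, so I'm guessing only attribute-based monitoring would have flagged 5/197/198. If I'm reading the smartd man page right, a line like this in smartd.conf should email when those attributes go bad (FreeNAS normally generates this file from the GUI's S.M.A.R.T. settings; the address is a placeholder):

/dev/ada1 -a -m admin@example.com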

 