uncorrectable sectors = drive failure imminent?


odoyle

Explorer
Joined
Sep 2, 2014
Messages
62
Hi guys,
I have a RAIDZ2 with 4 disks; one disk seems unhappy lately. I got this message:

Offline uncorrectable sectors
not capable of SMART self-check

I posted zpool and smartctl results below; any opinions on how bad this is?

Question is, should I wait for it to fail? Maybe it will be fine for a while...
From my understanding, two disks can fail, but not in the same vdev...
Should I use this opportunity to upgrade disk size with the replacement? Instead of replacing with a similar 2 TB drive, could I get a 3 or 4 TB even if the other disks are all 2 TB?
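From what I've read (correct me if I'm wrong), a bigger disk should work as a replacement, but ZFS will only use 2 TB of it until all four disks in the vdev are upgraded; after that the pool can grow on its own if the autoexpand property is set, something like:

zpool set autoexpand=on vol1
zpool get autoexpand vol1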

zpool status
pool: freenas-boot
state: ONLINE
scan: scrub repaired 0 in 0h0m with 0 errors on Thu Jun 7 03:45:39 2018
config:

NAME          STATE     READ WRITE CKSUM
freenas-boot  ONLINE       0     0     0
  da0p2       ONLINE       0     0     0

errors: No known data errors

pool: vol1
state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
the pool may no longer be accessible by software that does not support
the features. See zpool-features(7) for details.
scan: scrub repaired 0 in 5h48m with 0 errors on Sun Jun 17 05:48:30 2018
config:

NAME                                                  STATE     READ WRITE CKSUM
vol1                                                  ONLINE       0     0     0
  raidz2-0                                            ONLINE       0     0     0
    gptid/522190ca-3a38-11e4-a945-002590466eeb.eli    ONLINE       0     0     0
    gptid/52a74e89-3a38-11e4-a945-002590466eeb.eli    ONLINE       0     0     0
    gptid/5330724c-3a38-11e4-a945-002590466eeb.eli    ONLINE       0     0     0
    gptid/5412a572-3a38-11e4-a945-002590466eeb.eli    ONLINE       0     0     0

errors: No known data errors



smartctl -a /dev/ada1

smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p16 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.14 (AF)
Device Model: ST2000DM001-1ER164
Serial Number: Z4Z08R9E
LU WWN Device Id: 5 000c50 0673fee11
Firmware Version: CC43
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Fri Jun 22 18:49:42 2018 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 89) seconds.
Offline data collection
capabilities: (0x73) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 227) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x1085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 119 099 006 Pre-fail Always - 231892632
3 Spin_Up_Time 0x0003 094 094 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 58
5 Reallocated_Sector_Ct 0x0033 094 094 010 Pre-fail Always - 7688
7 Seek_Error_Rate 0x000f 083 060 030 Pre-fail Always - 228365176
9 Power_On_Hours 0x0032 065 065 000 Old_age Always - 31326
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 58
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 098 000 Old_age Always - 3 3 3
189 High_Fly_Writes 0x003a 089 089 000 Old_age Always - 11
190 Airflow_Temperature_Cel 0x0022 061 050 045 Old_age Always - 39 (Min/Max 20/50)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 0
193 Load_Cycle_Count 0x0032 001 001 000 Old_age Always - 774300
194 Temperature_Celsius 0x0022 039 050 000 Old_age Always - 39 (0 18 0 0 0)
197 Current_Pending_Sector 0x0012 001 001 000 Old_age Always - 29856
198 Offline_Uncorrectable 0x0010 001 001 000 Old_age Offline - 29856
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 27433h+05m+58.559s
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 29547895209
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 972292748944

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 31315 -
# 2 Short offline Completed without error 00% 31239 -
# 3 Short offline Completed without error 00% 31071 -
# 4 Extended offline Completed without error 00% 30979 -
# 5 Short offline Completed without error 00% 30903 -
# 6 Short offline Completed without error 00% 30663 -
# 7 Extended offline Completed without error 00% 30571 -
# 8 Short offline Completed without error 00% 30495 -
# 9 Short offline Completed without error 00% 30327 -
#10 Extended offline Completed without error 00% 30236 -
#11 Short offline Completed without error 00% 30159 -
#12 Short offline Completed without error 00% 29943 -
#13 Short offline Completed without error 00% 29775 -
#14 Short offline Completed without error 00% 29607 -
#15 Short offline Completed without error 00% 29439 -
#16 Short offline Completed without error 00% 29199 -
#17 Extended offline Completed without error 00% 29108 -
#18 Short offline Completed without error 00% 29031 -
#19 Short offline Completed without error 00% 28863 -
#20 Extended offline Completed without error 00% 28773 -
#21 Short offline Completed without error 00% 28696 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
You've got massive problems. Let me explain why:

  • Attribute #5: Reallocated Sector Count. Desired number: 0. Your number: 7688. Which, by the way, may be a FreeNAS forum record.
  • Attribute #9: Power_On_Hours. Your number: 31326. Drives are officially in the elderly zone at 30,000 hours; the more conservative among us would start replacing drives automatically at that point.
  • Attribute #193: Load Cycle Count. Well, if your drive is 31326 hours in, I'd expect this number to be about a thousand or so. This drive is designed for a few tens of thousands of load cycles. Your number, sir, is an exceptionally impressive 774300. Well done, sir. Perhaps you've been spinning down this drive or something, but you have worn it out.
  • Attributes #197 and #198: Bad sectors. Desired number: 0. The number at which the more conservative among us would replace the drive: 1. Your number: 29856. (Definitely a record for the FreeNAS forum; I have literally never seen that.)

While I salute you for what appears to be a reasonable regimen of SMART tests, long and short, this drive is totally hosed. I am surprised it hasn't joined the choir invisible yet.

This drive could have numbers only 1% as bad as these, and it would be WAY over the cutoff where I would replace it.

You are lucky. This particular Seagate Barracuda is known as one of the worst drives ever, and you got a lot of use out of it.

Replace the drive now.
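
For anyone else reading along, a quick way to pull just the attributes I called out above (a sketch; substitute your own device name for ada1):

smartctl -A /dev/ada1 | egrep 'Reallocated_Sector_Ct|Power_On_Hours|Load_Cycle_Count|Current_Pending_Sector|Offline_Uncorrectable'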
 

odoyle

Explorer
Joined
Sep 2, 2014
Messages
62
hah, thanks! Do I win some prizes?! Seriously, that was a very helpful description of those parameters, and I appreciate the time you took to respond and include some humor :)
So I have another one of these "worst drives ever"; I checked it, and it doesn't have any Offline_Uncorrectable or bad sectors, but its Load_Cycle_Count is 774587. Not sure why that happened; I followed some guidelines years ago to set all this up, and the power management setting is 128 (no spindown).
I checked the other two drives in my array which are both
Model Family: Western Digital Green
Device Model: WDC WD20EZRX-00D8PB0
Hopefully these aren't the "second worst drives ever"! The numbers look OK, but I didn't know people normally replace drives after a certain number of power-on hours... I learned a lot from your post!
As I shop for replacements, is there any way to avoid getting bad hard drive models? I was thinking of getting some WD Reds...
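For what it's worth, here's how I checked the power management level just now (I believe recent smartctl versions can read it directly; ada2 is just an example device name):

smartctl -g apm /dev/ada2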
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
So, there were problems with the Seagate Barracudas from that time period. If I'm not mistaken, the 3TB drive was the worst of that series, but they were all pretty bad. The failures of those Seagates did huge damage to Seagate's reputation for consumer drives for years. You may search the forum to see how many people had catastrophic failures. To this day, I won't use Seagate drives, even if they're free. I mean that literally--you could give me free Seagate drives, and I won't put a single byte of data on them. But I admit that is completely ridiculous of me...but ridiculous or not, that's what it is.

I would really super strongly recommend completely DISABLING the power management and acoustic features in the "view disks" screen. The idea that one can "save power" by setting these things in 2018 (or even 2013) is really quite passé. Drives like these, spinning 24/7, probably cost $5 of electricity per year to run. (That's actual mathematics, not a guess.) With AAM/APM features enabled, god knows what kind of spindowns or unloads the disks do, and for what purpose, really?
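
The supported way to do this is in the FreeNAS GUI, but for the curious, most drives will also take the change from the shell via smartctl; a sketch, not the official procedure (and the setting may not survive a power cycle on some drives):

smartctl -s apm,off /dev/ada1
smartctl -s aam,off /dev/ada1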

The issue, also, with drives not INTENDED for NAS deployment (and I don't believe any of the drives you have, not these Seagates, not those WD Greens, are) is that in a NAS, for reasons I won't get into, the way activity works, they can just thrash load-unload-load-unload every minute. That appears to be your problem. In the case of WD Greens (and Blues), in 95% of cases you can run wdidle3.exe from a true DOS boot and reset some timing behavior to disable that. You will find it referenced a million times if you google "wdidle3 freenas". Most drives are engineered to endure a certain number of load-unload cycles---that number is usually in the (low) tens of thousands, and is never more than a hundred thousand to my knowledge. The firmware on the WD Reds is designed, basically, not to trigger load-unload cycles at all, whereas the Greens and Blues come out of the box unloading the heads after something like 30 seconds of no activity (not sure what the exact number is).
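
An easy way to see whether your disks are thrashing like this is to snapshot Load_Cycle_Count on each one, wait an hour of normal activity, and run it again; a rough shell sketch (adjust the device names to match your system):

for d in ada0 ada1 ada2 ada3; do echo -n "$d: "; smartctl -A /dev/$d | awk '/Load_Cycle_Count/ {print $10}'; done

If the number climbs by more than a handful per hour, the drive is unloading constantly.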

As to your specific questions:

The WD20EZRX is a very good drive, but see the wdidle3 comment above, and also, make sure the temps on that drive stay good. Every WD20EZRX I have seen take a dump had a history of temperature problems.

If I am buying new drives TODAY for a FreeNAS box, I don't even consider anything other than the standard WD Reds. WD Reds are the galactically overwhelming favorite of the FreeNAS guys in the forum and on IRC. Other guys confirm the HGST NAS drives are also excellent, but I can't personally speak for those. As for Seagate, we have certainly had a lot of users report excellent results with the latest Seagate IronWolf NAS stuff, and I see almost zero users complaining about failures of those, but it'll be years before I even consider using Seagate myself.

And to augment the discussion about power-on hours: Do not take what I have said too strongly to heart. For me, these days, the cost of a drive is trivial. "Proactive replacement when a drive gets to be 3-4 years old" is certainly not considered "standard" or "necessary", and is a luxury of the overly careful NAS guy. It is very reasonable to run your drives until you have MATERIAL evidence that they are end of life.
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
But to reiterate: You don't have to be a conservative NAS psycho to replace your drive with the smart info above. That drive is completely hosed and in mid-death-knell, as we speak. You can ask anyone.
 

odoyle

Explorer
Joined
Sep 2, 2014
Messages
62
Thanks, running backup now and replacement drive ordered :)
Good to hear the WDs are OK - I remember reading at the time about NAS drives vs. non-NAS drives, and I don't remember why I chose the Greens, but I did do the mod you mentioned to change the settings on the WDs.
That makes sense about the power-saving thing; I thought it was good for the drives (maybe they spin slower?), but I will try turning it off.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
...and lest you think that @DrKK is exaggerating, I concur completely with his recommendation on the disk in question: Replace it immediately (and by "immediately" I mean "as soon as you have a burned-in and tested replacement on hand"). It is of that model disk that @DrKK has said that its best storage configuration is buried as deep as possible in the nearest landfill.

I consider myself fairly conservative, though still less so than some. I won't immediately replace a disk with a single-digit number of bad sectors that's staying constant (though I will keep an eye on it). I will use the hot swap capabilities of my hardware to minimize downtime. But your disk has many thousands of bad sectors--it's sailing off to that undiscovered country from whose bourn no traveler returns. Replace it as soon as you can, and be sure to follow the manual's instructions (DO NOT UNDER ANY CIRCUMSTANCES TOUCH THE VOLUME MANAGER).

I do want to address one misconception in what you've stated. Your pool consists of four disks in a single RAIDZ2 vdev. Of those four, any two can fail without data loss, as long as there are no data errors with the remaining two.
 

pro lamer

Guru
Joined
Feb 16, 2018
Messages
626
DO NOT UNDER ANY CIRCUMSTANCES TOUCH THE VOLUME MANAGER
I think I need to write that down, too. Is the warning literally in the manual already?
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Is the warning already in the manual literally?
Nowhere in the manual does it tell you to use the volume manager as any part of the disk replacement process, but it doesn't explicitly warn NOT to use the Volume Manager.
 

pro lamer

Guru
Joined
Feb 16, 2018
Messages
626
(DO NOT UNDER ANY CIRCUMSTANCES TOUCH THE VOLUME MANAGER)
I guess the reason is described in the thread, with this post:
people keep thinking that the way to replace a disk goes through the Volume Manager

Sent from my mobile phone
 

odoyle

Explorer
Joined
Sep 2, 2014
Messages
62
Good correction; I think I was confused with another system I used to manage, which I believe was a RAID 10.
Question for everyone - if the stats on this disk were so bad, how come smartctl or one of the tests I had set up didn't detect it until now? Should I change thresholds for warnings or something?
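Looking at my output again, every self-test says "Completed without error" and the overall assessment is PASSED, so I'm guessing only attribute-based monitoring would have flagged 5/197/198. If I'm reading the smartd man page right, a line like this in smartd.conf should email when those attributes go bad (FreeNAS normally generates this file from the GUI's S.M.A.R.T. settings; the address is a placeholder):

/dev/ada1 -a -m admin@example.com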

 