After 9.10.1 update, getting 8 unreadable sectors error

Status
Not open for further replies.

Jr922

Explorer
Joined
Apr 22, 2016
Messages
58
I updated to FreeNAS-9.10.1 (d989edd) and am now getting the red light with this error:
CRITICAL: Aug. 11, 2016, 3:39 p.m. - Device: /dev/ada1, 8 Currently unreadable (pending) sectors

But the volumes show as healthy and zpool status shows everything online. A scrub is running.

This drive is in a 6-drive RAID-Z2 pool.
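
For reference, a minimal sketch of the checks behind that statement (device name taken from the alert; adjust to your system):

# pool and vdev state as seen by ZFS
zpool status -v

# SMART counters for the flagged drive; attribute 197 is Current_Pending_Sector
smartctl -a /dev/ada1 | grep -E 'Current_Pending_Sector|Reallocated_Sector_Ct|Offline_Uncorrectable'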
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Well, 8 pending sectors isn't critical, but if the number rises then the drive is probably going to die soon, so just keep an eye on it ;)
 

Jr922

Explorer
Joined
Apr 22, 2016
Messages
58
That's what I figured, but there are a couple of weird things: this is a less than 6-month-old drive, and I did a shutdown and reboot to move my power cord right before this update with no errors thrown; then right after the update it throws this error. Also, shouldn't this error cause the volume to show up as degraded or something other than healthy?

Do I need to run a SMART long test to check for pending sectors?
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
The volume shows up as degraded only if there's a checksum mismatch (or a device drop, but that's extreme), not if a drive has SMART warnings.

If I were you I'd probably do a long SMART test, then a scrub, then a long SMART test again ;)
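
A rough sketch of that sequence from the shell; the pool name "tank" is a placeholder, and each step should only start after the previous one has finished:

# 1. long SMART self-test on the flagged drive (runs in the background, takes hours)
smartctl -t long /dev/ada1
# check progress/result later with:
smartctl -a /dev/ada1 | grep -A1 'Self-test execution'

# 2. scrub the pool once the self-test is done
zpool scrub tank
zpool status tank     # shows scrub progress and any repairs

# 3. run the long self-test again and compare the pending sector count before and after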
 

Jr922

Explorer
Joined
Apr 22, 2016
Messages
58
The scrub repaired 0 errors and the long test completed without error, but it still shows the 8 pending sectors error, and smartctl -a /dev/ada1 still reports 8 pending sectors. The error comes up every 30 minutes in /var/log/messages.

Shouldn't 8 pending sectors throw a read error on a SMART long test? Is there anything else I can check to verify this and/or find where the sectors are so I can try to rewrite them?
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
OK, that's good news then, but keep an eye on this drive nonetheless.

IIRC, smartctl -x /dev/ada1 will give you more info than -a, and you may get the LBAs of those sectors ;)
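
For example (whether the LBAs actually show up depends on the drive's firmware; many drives only record an LBA once a read has actually failed):

# extended "everything" SMART report
smartctl -x /dev/ada1

# the sections most likely to contain LBAs:
smartctl -l xerror /dev/ada1     # extended comprehensive error log
smartctl -l selftest /dev/ada1   # LBA_of_first_error column of the self-test log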
 

Lars Jensen

Explorer
Joined
Feb 5, 2013
Messages
63
FYI, after upgrading to FreeNAS-9.10.1 I have the exact same SMART error on a drive that is less than 6 months old, except mine is on ada9, also in a 6-drive RAID-Z2. I ran a long self-test and the number increased to 16. The pool is still good, but I have ordered a new disk just to have one ready for exchange.

Maybe it's just a coincidence of almost-new drives dying.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
What are the temps of those drives?
 

Lars Jensen

Explorer
Joined
Feb 5, 2013
Messages
63
What are the temps of those drives?

Around 23 degrees Celsius, see:

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 119 099 006 Pre-fail Always - 223522992
3 Spin_Up_Time 0x0003 096 096 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 4
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 071 060 030 Pre-fail Always - 12886077
9 Power_On_Hours 0x0032 097 097 000 Old_age Always - 3286
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 4
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 077 069 045 Old_age Always - 23 (Min/Max 20/31)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 4
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 4
194 Temperature_Celsius 0x0022 023 040 000 Old_age Always - 23 (0 20 0 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 16
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 16
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed: read failure 90% 3285 -
# 2 Short offline Completed: read failure 90% 3273 -
# 3 Short offline Completed: read failure 90% 3261 -
# 4 Short offline Completed: read failure 90% 3249 -
# 5 Extended offline Completed: read failure 10% 3247 -
# 6 Short offline Completed: read failure 70% 3238 -
# 7 Short offline Completed: read failure 50% 3238 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
Maybe it's just a coincidence of almost-new drives dying.
I think it's more likely that a known bug that resulted in email notifications not working was fixed in the update, and the drive has been failing for a while.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
a known bug that resulted in email notifications not working was fixed in the update,
...and lest it be unclear, this is the case: there was such a bug, and it was fixed in 9.10.1. It may or may not be the case that the bad sectors had been there for some time.
 

Lars Jensen

Explorer
Joined
Feb 5, 2013
Messages
63
Great bugfix :smile: FYI, the disk actually died today, so in this case it was a "soon-to-die" SMART error. And now it's resilvering:
scan: resilver in progress since Mon Aug 29 12:23:10 2016
1.58T scanned out of 15.9T at 925M/s, 4h30m to go
387M resilvered, 9.95% done
 

Jr922

Explorer
Joined
Apr 22, 2016
Messages
58
To update this: my drive was fine. The pending sectors went to 0 after a 5 TB write and there were no reallocated sectors.
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
That just means the sectors were successfully written. It doesn't mean the underlying medium is good.
 

Jr922

Explorer
Joined
Apr 22, 2016
Messages
58
That just means the sectors were successfully written. It doesn't mean the underlying medium is good.

Yes, it could still be bad even though SMART tests show successful writes on those sectors. However, it doesn't mean it is bad either, and it's certainly a good sign that they were not reallocated, right?
I believe the pending sectors were due to a local brownout that occurred shortly after the update. I'm on a UPS, but it's old and might not have kicked in fast enough during a brownout. Unless I'm missing something else, I'm just going to keep an eye on this drive until there is something to indicate a bad drive.
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
If it were my disk, I would prefer to see the sectors reallocated. Take a look at what Wikipedia has to say about SMART attribute #197: https://en.wikipedia.org/wiki/S.M.A.R.T.

Any reason you can't run badblocks destructive on the disk? That way you get four write passes and four read passes with different patterns.
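
For reference, a destructive run along those lines looks roughly like this (exact flags from memory, so double-check the man page; -w wipes every sector on the target disk, so only do this on a drive that has been removed or offlined from the pool):

# WARNING: destroys all data on the disk
# writes and reads back four patterns (0xaa, 0x55, 0xff, 0x00) across the whole drive
badblocks -b 4096 -ws /dev/ada1

# then see whether anything was reallocated or left pending
smartctl -a /dev/ada1 | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector'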
 

Jr922

Explorer
Joined
Apr 22, 2016
Messages
58
If it were my disk, I would prefer to see the sectors reallocated. Take a look at what Wikipedia has to say about SMART attribute #197: https://en.wikipedia.org/wiki/S.M.A.R.T.

Any reason you can't run badblocks destructive on the disk? That way you get four write passes and four read passes with different patterns.

Why would you want them to be reallocated? That would indicate the sectors were in fact bad, right? In my case I had 8 pending, 0 reallocated, and 0 uncorrectable. After the rewrite I now have 0 pending, 0 reallocated, and 0 uncorrectable. At the least we know those sectors are good enough to be rewritten successfully. If these sectors pop up again then there is for sure an issue, but I don't agree with the wiki that attempting to rewrite the pending sectors rather than always reallocating them is "a serious shortcoming". That seems like the more intelligent approach, since a pending sector could be caused by a power failure, for instance, and the sectors might be fine. What is the advantage of reallocating them without testing whether they can be written? You should be able to see if the problem keeps popping up, and there is a very low chance of recurring pending sectors that can still be written successfully.

I'm still not understanding what you're basing the idea that it might be bad on. The wiki indicates that drives with their first pending sectors, reallocations, and uncorrectables are statistically much more likely to fail, and I get that; but that doesn't give me anything to do or check except to preemptively replace the drive. And in this case I'm on Z2 with a full backup on another PC, so I'm probably going to run the drive until it either dies without warning or starts throwing more errors.

The only reason I can't run badblocks is that I don't have a spare drive and I don't want to drop the pool to one disk of redundancy. I could buy another, but they're running around $250, and at that point why not just replace the drive and be done with it.
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
The sectors were pending reallocation due to read failure. They were subsequently written without an error being detected. This does not tell us whether they can be read successfully in the future. Being able to write data without detecting an error is of no benefit if that data can't be read later - think untested backups. Reallocation to known good sectors would ensure that potentially flaky sectors were no longer in use. That would make me more comfortable, but that's just me.
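
A lighter, non-destructive option (not a substitute for badblocks, just a sketch) is a full read pass of the raw device, which forces every sector, including the rewritten ones, to be read back:

# read the whole disk and discard the data; unreadable sectors show up as I/O errors
# (takes many hours on a large drive; best done while the pool is otherwise idle)
dd if=/dev/ada1 of=/dev/null bs=1m

# afterwards, re-check the SMART counters
smartctl -a /dev/ada1 | grep -E 'Current_Pending_Sector|Reallocated_Sector_Ct'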

Can you explain why you think the pending reallocation was related to a brownout?

Please note that I'm focusing on this specific disk and its potentially bad sectors. I'm not suggesting you're about to lose data.
 

Jr922

Explorer
Joined
Apr 22, 2016
Messages
58
The sectors were pending reallocation due to read failure. They were subsequently written without an error being detected. This does not tell us whether they can be read successfully in the future. Being able to write data without detecting an error is of no benefit if that data can't be read later - think untested backups. Reallocation to known good sectors would ensure that potentially flaky sectors were no longer in use. That would make me more comfortable, but that's just me.

Can you explain why you think the pending reallocation was related to a brownout?

Please note that I'm focusing on this specific disk and its potentially bad sectors. I'm not suggesting you're about to lose data.

But if it couldn't be read after the write, wouldn't that come back up in a long SMART test?

I'm not sure, that's just a theory. I had run SMART tests regularly and checked the SMART info on all the drives not long before the update, and there was no red light in the web GUI right before the update, so it is likely that this occurred after the update. I updated, let it reboot, and walked away since everything was working fine: my CIFS shares were up, etc. Later that night I went to the web GUI and noticed I had a red light for 8 pending sectors, so I thought it was too much of a coincidence that it happened right after the update, and I posted here. But after thinking about it, we have had quite a few intermittent power losses/brownouts lately with everyone running so much AC, and one of them happened to be on that day. At the time I was on my PC, and the UPS for my PC kicked on for a second before power came back almost instantly, so I didn't think much of it. But it's possible that my other UPS for my FreeNAS box failed in some way or got confused by the intermittent power. Other than that, I can't think of what could have happened between the update and when I checked the web GUI again. There was no real activity at all on this drive during that time; the system dataset, jails, etc. are all on other drives, not on this pool.
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
But if it couldn't be read after the write, wouldn't that come back up in a long SMART test?
Maybe; to be honest, I'm not sure.
just a theory
OK, well, I guess anything is possible, but I'm inclined to believe it's either a coincidence or a result of updating to a version that eliminated a reporting bug.
 