Fix Drive Offline Uncorrectable Sectors

KazuyaDarklight
Dabbler · Joined May 8, 2019 · Messages: 36
I hope I'm in the middle of doing what needs to be done to resolve this, but it's taken so long and had enough hiccups that I'm wondering whether I could have done it a better way, or should be doing things differently even now.

Situation/Steps Taken:
One of my drives spawned the notices "Currently unreadable (pending) sectors" and "Offline uncorrectable sectors", somehow without putting the pool into an unhealthy state. After a bit of looking around, I decided my safest bet was to replace it; I could consider trying to get SMART to map around the bad sectors later.

So I went to pool status, found the drive, used the normal "Replace" option, and chose another drive I'd freed up from the spare vdev. The pool became unhealthy as it started the replacement/resilvering process, which took around 24 hours to complete (14 TB mirror vdev, for reference). The resilver is now complete, but the pool is still unhealthy, with the alert "One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected." and a count of 2 in the replacement drive's Checksum column. I'm now in the middle of a scrub of the pool.
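For reference, I've been watching all of this from the command line as well as the UI; a minimal sketch, assuming the pool is named tank (substitute your own pool name):

  # show per-device read/write/checksum error counters plus resilver/scrub progress
  zpool status -v tank

The 2 I mentioned shows up in the CKSUM column of that output, against the replacement drive.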

Questions:
  • Is there anything else I should do? Will this likely fix the problem?
  • As a matter of best practice, should I have done anything differently? For instance, would it have been better to just pull the erroring drive outright and then replace it (see the sketch just below)? I'm assuming the checksum errors were caused by read problems while copying from the bad drive; if I'd pulled it first, presumably the resilver would have pulled from the good drive in the mirror instead.
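To make that second question concrete, the alternative I had in mind would have gone roughly like this; a hypothetical sketch, assuming a pool named tank, the failing disk at da1, and its replacement at da5 (all names made up for illustration):

  # take the failing disk out of service so reads come only from its mirror partner
  zpool offline tank da1

  # after physically swapping disks, resilver onto the replacement
  zpool replace tank da1 da5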
Any other thoughts are welcome. Thanks for your time.

KazuyaDarklight
Dabbler · Joined May 8, 2019 · Messages: 36
The scrub finished with no errors, but it did not restore the pool to a healthy state; the new drive still shows 2 checksum errors. Thoughts?
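Do I just need to reset the counters by hand? A minimal sketch of what I'm considering, assuming the pool is named tank and the drive shows up as da1:

  # reset the error counters on every device in the pool
  zpool clear tank

  # or clear just the one device
  zpool clear tank da1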

sretalla
Powered by Neutrality · Moderator · Joined Jan 1, 2016 · Messages: 9,703
You need to be looking at your SMART data.

If you're running regular SMART tests, you can check the results with smartctl -a /dev/da1 (or whatever the disk is).

If the drive has run a long test, you would usually see reports about bad blocks in that output.

Since reports of those coming from SMART are a good indicator of a drive on its way to failure, you might be better served by replacing it rather than going through the (IMHO time-wasting) process of trying to zero out the bad blocks.
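If you haven't run a long test recently, a minimal sketch of kicking one off and reading the result, assuming the disk is da1 (substitute your own device):

  # start a long (extended) offline self-test; smartctl prints the expected duration
  smartctl -t long /dev/da1

  # once it has finished, review the self-test log and attribute table
  smartctl -a /dev/da1

In that output, watch attributes 5 (Reallocated_Sector_Ct), 197 (Current_Pending_Sector) and 198 (Offline_Uncorrectable); non-zero and climbing values there are the classic signs of a dying drive.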