Failed disk remains after replace or Replace process does not complete.

Status
Not open for further replies.

Nyxxon

Dabbler
Joined
Apr 29, 2017
Messages
10
Hi all, and in advance thank you for your help and patience.

Build FreeNAS-9.10.1 (d989edd)
Platform AMD A6-6400K APU with Radeon(tm) HD Graphics
Memory 15269MB

I have the above build with a number of RAID sets, all running fine. One particular disk started to show some errors, so I started a replace on it through the GUI. While the replace was running, the disk failed. That was a few weeks ago, and while the server is "running" normally, the replace never appears to complete, although the resilver does, and I have run a scrub OK.

The following is the output from a zpool status

---------------------------------------------------------------------------
[root@xxxxxxxxx] ~# zpool status Raid5Root
pool: Raid5Root
state: DEGRADED
scan: scrub repaired 0 in 8h50m with 0 errors on Sun Apr 2 08:50:23 2017
config:

NAME                                              STATE     READ WRITE CKSUM
Raid5Root                                         DEGRADED     0     0     0
  gptid/5df43782-8177-11e4-8275-002590394642      ONLINE       0     0     0
  gptid/5e591c2d-8177-11e4-8275-002590394642      ONLINE       0     0     0
  gptid/5ec42610-8177-11e4-8275-002590394642      ONLINE       0     0     0
  replacing-3                                     DEGRADED     0     0     0
    13953885356757872590                          OFFLINE      0     0     0  was /dev/gptid/5f26c751-8177-11e4-8275-002590394642
    gptid/01438f44-77e3-11e6-b673-6805ca443bbe    ONLINE       0     0     0

errors: No known data errors
---------------------------------------------------------------------------

I have scanned the forums and there are references to "detaching" the disk via the Volume Status GUI. I can see the failed disk in the GUI, and it is showing as offline, but there is no "detach" option when I select it, only "replace", which looks as if it is going to restart the replace process.

Could anyone offer some advice? I am very conscious that the replacing-3 set is showing as degraded, so if I lose one of the other disks I am going to have to rebuild and restore, which I would like to avoid, and I wanted to check with experts before making the situation worse.
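If it helps with the diagnosis, the gptid labels in the output above can be mapped to physical device nodes from the shell; a minimal sketch using glabel (the device names it prints will differ per system, and the grep pattern is just the first segment of the replacement disk's gptid):
Code:
# List all GEOM labels and the partitions they correspond to
glabel status
# Narrow it to the replacement disk's gptid from the output above
glabel status | grep 01438f44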
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410

Nyxxon

Dabbler
Joined
Apr 29, 2017
Messages
10
Dice, I do appreciate you taking the time to look at the post. I had checked the manual, but sorry, I just realised I did not put the details into the original post. The only option I have to "replace" with is a backup 1 TB unraided disk I have on the server. The "replacement" disk is gptid/01438f44-77e3-11e6-b673-6805ca443bbe, which is not showing when I click the drop-down, but as you can see it is in "replacing-3".
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
How did you do the replace the first time? I think you just need to offline the disk and remove it.
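As a rough sketch of the CLI equivalent of "offline the disk", assuming the numeric ID that zpool status reports is the failed member (it already shows OFFLINE above, so this may simply confirm the current state):
Code:
# Take the failed member offline, referring to it by the ID zpool status shows
zpool offline Raid5Root 13953885356757872590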

 

Nyxxon

Dabbler
Joined
Apr 29, 2017
Messages
10
Hi SweetAndLow, thanks for getting back to me. The replace was started through the GUI, but the old disk failed during the replace. It is u/s (unserviceable), so I disconnected it some time back.
 

Nyxxon

Dabbler
Joined
Apr 29, 2017
Messages
10
Bump... I would appreciate any thoughts people have on this. I am very worried about running a "degraded" (although technically not) volume.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
The following is the output from a zpool status
Are you sure that's the complete output? Because ordinarily there'd be another line there, between the name of the pool and the list of the devices, indicating how they're arranged. Right now, it looks like all devices are striped, but that's obviously not the case or your pool would be unavailable.
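For comparison, and purely as a mock-up rather than output from the OP's system, a healthy four-disk raidz1 pool would normally show that grouping line between the pool name and the member disks, something like this (the gptid labels here are made up):
Code:
  pool: Raid5Root
 state: ONLINE
config:

NAME                                            STATE     READ WRITE CKSUM
Raid5Root                                       ONLINE       0     0     0
  raidz1-0                                      ONLINE       0     0     0
    gptid/aaaaaaaa-0000-0000-0000-000000000001  ONLINE       0     0     0
    gptid/aaaaaaaa-0000-0000-0000-000000000002  ONLINE       0     0     0
    gptid/aaaaaaaa-0000-0000-0000-000000000003  ONLINE       0     0     0
    gptid/aaaaaaaa-0000-0000-0000-000000000004  ONLINE       0     0     0

errors: No known data errors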
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Yeah, looks like a stripe to me. Since you said the drive died, you lost your whole pool. This is why stripes are dangerous.

 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Since you said the drive died, you lost your whole pool.
...but if the pool were dead, I'm thinking zpool status would report something other than DEGRADED.

Edit: Just tested it. Here's what happens when you pull a disk on a striped pool:
Code:
  pool: testpool
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
  see: http://illumos.org/msg/ZFS-8000-JQ
  scan: none requested
config:

NAME                                            STATE     READ WRITE CKSUM
testpool                                        UNAVAIL      0     0     0
  gptid/156d64ef-3d67-11e7-a1ca-002590caf340    ONLINE       0     0     0
  10484862489935661261                          REMOVED      0     0     0  was /dev/gptid/162406e0-3d67-11e7-a1ca-002590caf340

errors: No known data errors
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Is the resilver still going? If the disk dies during a resilver I'm not sure what will happen.

 

Nyxxon

Dabbler
Joined
Apr 29, 2017
Messages
10
Guys, I really appreciate the replies and the effort in doing some testing. Yes, this is the complete output (copied below again with top and tail to show). The resilver is complete, and I ran another scrub last week just for good measure, which completed OK. I'm currently backing up the whole volume with the expectation that it's a wipe and rebuild, but it's going to take days for the copy off/on, and I would prefer not to if you think there is any other option.

EDIT: Also, just to confirm, the pool is fully available and has been "serving" data fine for about 5 weeks now with no impact on normal behaviour. It's almost like "replacing-3" is counted as a new "disk" and within it is a mirror pair with one member missing, so that is degraded, but the overall Raid5Root pool is still OK because one of the replacing-3 mirror pair is available.

Could I destroy "replacing-3" somehow (hopefully the pool will still be OK with 3 of the 4-disk set still available), wipe gptid/01438f44-77e3-11e6-b673-6805ca443bbe back to factory, then put it back as a "replacement" for replacing-3?

Welcome to FreeNAS
[root@xxxxxxxx] ~# zpool status Raid5Root
pool: Raid5Root
state: DEGRADED
scan: scrub repaired 0 in 9h26m with 0 errors on Sun May 14 09:26:13 2017
config:

NAME                                              STATE     READ WRITE CKSUM
Raid5Root                                         DEGRADED     0     0     0
  gptid/5df43782-8177-11e4-8275-002590394642      ONLINE       0     0     0
  gptid/5e591c2d-8177-11e4-8275-002590394642      ONLINE       0     0     0
  gptid/5ec42610-8177-11e4-8275-002590394642      ONLINE       0     0     0
  replacing-3                                     DEGRADED     0     0     0
    13953885356757872590                          OFFLINE      0     0     0  was /dev/gptid/5f26c751-8177-11e4-8275-002590394642
    gptid/01438f44-77e3-11e6-b673-6805ca443bbe    ONLINE       0     0     0

errors: No known data errors
[root@xxxxxxxx] ~#
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
Hopefully the pool will still be OK with 3 of the 4-disk set still available
It won't, because it has no redundancy. If you remove gptid/01438f44-77e3-11e6-b673-6805ca443bbe the pool will become unavailable.
I'm currently backing up the whole volume with the expectation that it's a wipe and rebuild
This is what you should do, regardless of whether the pool remains degraded.
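For anyone in the same position, one common way to take that backup is a recursive snapshot plus zfs send/receive to a second pool; a rough sketch, where the snapshot name and "BackupPool" are just placeholders for whatever destination is available:
Code:
# Snapshot every dataset in the pool at a single point in time
zfs snapshot -r Raid5Root@pre-rebuild
# Replicate the whole dataset tree and its snapshots to the backup pool
zfs send -R Raid5Root@pre-rebuild | zfs receive -F BackupPool/Raid5Root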
 

Nyxxon

Dabbler
Joined
Apr 29, 2017
Messages
10
Thank you all for your time in replying to me. To close this out, in case anybody comes across this post: in the end it was a backup, wipe and restore. I tried a few things once the backup was secure, like forcing 'replacing-3' offline, but as Robert said, it just refused to mount Raid5Root if both disks in replacing-3 were offline.

Things could be worse; I have around a 3-day restore running now, but the important things are prioritised and I'm pretty much "back to normal", now with a healthy volume.

Thank you all for your attempts to help.
 

ZOP

Cadet
Joined
Oct 27, 2016
Messages
5
I can confirm FreeNAS 11.0-U2, at least, absolutely has a bug with replacing: it will never complete on its own, and a ZFS bug causes it to go back into resilver after you manually complete the replacement with a detach of the old drive. I'm in the process of upgrading from 6 TB to 10 TB drives on my Mini, and it's taking twice as long using replace, because even if both drives are online and healthy the replace *NEVER* completes. It finishes the resilver/data copy process and ZFS states it's done, but ZFS never detaches the original device. Manually running zpool detach <pool> <old_dev> after the resilver completes results in another resilver starting on the new device; however, the array does not go into a degraded state this way, so it's more like ZFS is deciding it's a device that came home after I detached the original. I'm still working through the pool using replace because I don't want to run a degraded rebuild by removing and replacing a live and healthy drive.

It's not the UI doing it (zpool history shows the replace command, followed days later by my detach, but nothing other than scheduled snapshots). I do have an occasional issue (which I've had since I got this unit) with the FreeNAS Mini's SATA DOM boot drive going away under high activity, which forces me to do a reboot, and both drives I've replaced so far HAVE had an unplanned, dirty reboot during the initial zpool replace resilver process. That MAY be part of the root cause for me (ZFS losing track of what it's supposed to do after the replace completes), but I cannot yet confirm.
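For anyone wanting to check the same thing on their own system, the pool's command history can be inspected directly (substitute your own pool name):
Code:
# Show every administrative command recorded against the pool, filtered to replace/detach
zpool history <pool> | grep -E 'replace|detach'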
 

ZOP

Cadet
Joined
Oct 27, 2016
Messages
5
Also, to the OP: zpool detach <pool> <old_dev> is the command you need, even if the UI is hiding the option. Obviously it must be done via sudo / as root.
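Filling that in with the pool name and the offlined member's numeric ID from the zpool status output earlier in the thread, it would look something like this (only after confirming the resilver has finished):
Code:
# Detach the failed original so the replacement becomes a normal pool member
zpool detach Raid5Root 13953885356757872590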
 