Is it possible to undo a Replace?

Status
Not open for further replies.

Integer

Dabbler
Joined
Mar 19, 2018
Messages
11
I am using an encrypted RAIDZ1 pool, and boy do I regret choosing Z1 instead of Z2 right now.

I had a drive start to go bad on me. No data was lost, but there were some SMART test failures, and metrics showed a couple of bad reads on the drive. So I bought a new drive and started replacing it according to http://doc.freenas.org/11/storage.html#replacing-an-encrypted-drive. During resilvering things started to go wrong. I noticed that reading files through SMB would sometimes fail, and I got write errors as well. Jails took a very long time to appear in the jails list. In my logs I see many messages like the following (ada7 is not the new disk; the new disk is ada0):

Code:
Mar 19 08:22:27 note GEOM_ELI(ada7:ahcich7:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 d0 ce 48 40 28 01 00 01 00 00
Mar 19 08:22:27 note g_eli_read_done() failed (error=5) gptid/a4ee10c8-45dd-11e3-9663-60a44caf3660.eli[READ(offset=2542914260992, length=798720)]
Mar 19 08:22:27 note (ada7:ahcich7:0:0:0): CAM status: Uncorrectable parity/CRC error
Mar 19 08:22:27 note (ada7:ahcich7:0:0:0): Retrying command
Mar 19 08:22:27 note (ada7:ahcich7:0:0:0): READ_FPDMA_QUEUED. ACB: 60 78 d0 cf 48 40 28 01 00 00 00 00
Mar 19 08:22:27 note (ada7:ahcich7:0:0:0): CAM status: Uncorrectable parity/CRC error


And when checking resilvering progress I'd see something up near 2% one time and then look again and see it back at 0.15%. In the volume status pane I see the volume is degraded, the pool (I think it's the pool, it's named "raidz1-0") shows as degraded, and disk ada0p2 also shows as degraded. All other disks show as healthy.

Maybe I'm off base here, but that looks like ada7 is in big trouble as well. I did try replacing the SATA cable just in case. Is it possible to go back and undo the replace? The data on the pulled drive is still there. My hope is that I can restore the volume using the pulled drive, back up what I can, replace ada7, resilver (hoping the originally pulled drive I've put back in place doesn't get worse), replace ada0 again, and resilver again.
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
Pretty sure the answer is yes. The way to find out is to shut down, swap the drives, and start back up. See if the pool is online and then check the zpool status. The old drive should very quickly be brought up to date. This is also a big reason why drives should not be offlined unless necessary.

If the drive can't be brought up to date without hitting an error on ada7, then you can't get back to normal without deleting the corrupt data (if possible) or backing up what you can and then rebuilding.
 

Integer

Dabbler
Joined
Mar 19, 2018
Messages
11
Thanks for the suggestion. Would I need to offline the replacement drive before removing it? Would I then expect the original just to show up in its old spot or would I need to replace the replacement with the original or something like that?
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
This being an encrypted pool, many of the regular answers don't apply. You need to copy the data out of the pool while you still can.

 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
Do back up what you can while you can, as suggested above.

No to offlining, no to replacing. If you run zpool status you will see the true configuration. If the old disk can get back to ONLINE, then you might be in good shape and can post your zpool status output. If not, the pool has to be abandoned with whatever you can get out of it.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
The encryption significantly limits your options. If you do the wrong thing here, the encryption will snap down on you like a trap and you will not be able to access this data EVER again.
After you have made a backup, then you can try any combination of options you want with no fear of losing your data.
Personally, I would do a full badblocks test on all the drives to determine which of them are still worth using and create a new RAIDZ2 pool (adding disks if needed) to have enough redundancy to not get in this spot again.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
PS. Can you give us a full rundown on your hardware and pool layout? How many drives do you have? What size are they?
 

Integer

Dabbler
Joined
Mar 19, 2018
Messages
11
I thank you all for your suggestions. I did manage to fix it, and I kind of feel dumb given the solution. It was really bugging me that I had a bunch of working drives throwing no errors, and when I replaced the one bad drive another good drive would coincidentally fail at exactly that time. Especially since the one that I was replacing was itself a recent replacement, and all the rest of the drives have been running stably for five years, and I know those CRC errors are really common with bad cables. Turns out the replacement SATA cable I put in was also bad. Third cable's the charm I guess.
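
For anyone who lands here with the same symptoms: one quick way to check whether CRC errors cluster on a single port (which points at a cable or connector rather than a dying disk) is to tally them per device and channel. The following is only a rough sketch in Python. The function name, the regular expression, and the sample list are my own illustration, written against the log lines quoted in the first post, and the pattern may need adjusting for other log formats.

```python
import re
from collections import Counter

# Matches CAM status lines shaped like the ones quoted in the first post, e.g.
# "(ada7:ahcich7:0:0:0): CAM status: Uncorrectable parity/CRC error".
# The pattern is an assumption based on those lines only.
CRC_LINE = re.compile(
    r"\((?P<dev>[a-z]+\d+):(?P<chan>[a-z]+\d+):[\d:]+\): "
    r"CAM status: Uncorrectable parity/CRC error"
)


def count_crc_errors(lines):
    """Return a Counter mapping (device, channel) to the number of CRC errors."""
    counts = Counter()
    for line in lines:
        match = CRC_LINE.search(line)
        if match:
            counts[(match.group("dev"), match.group("chan"))] += 1
    return counts


# The log excerpt from the first post, copied verbatim.
SAMPLE = [
    "Mar 19 08:22:27 note GEOM_ELI(ada7:ahcich7:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 d0 ce 48 40 28 01 00 01 00 00",
    "Mar 19 08:22:27 note g_eli_read_done() failed (error=5) gptid/a4ee10c8-45dd-11e3-9663-60a44caf3660.eli[READ(offset=2542914260992, length=798720)]",
    "Mar 19 08:22:27 note (ada7:ahcich7:0:0:0): CAM status: Uncorrectable parity/CRC error",
    "Mar 19 08:22:27 note (ada7:ahcich7:0:0:0): Retrying command",
    "Mar 19 08:22:27 note (ada7:ahcich7:0:0:0): READ_FPDMA_QUEUED. ACB: 60 78 d0 cf 48 40 28 01 00 00 00 00",
    "Mar 19 08:22:27 note (ada7:ahcich7:0:0:0): CAM status: Uncorrectable parity/CRC error",
]

if __name__ == "__main__":
    # Print devices with the most CRC errors first.
    for (dev, chan), n in count_crc_errors(SAMPLE).most_common():
        print(f"{dev} on {chan}: {n} CRC error(s)")
```

To use it on a real system, pass in the lines of wherever your kernel messages end up (on FreeBSD that is typically /var/log/messages, but check your own setup). If the errors keep following one port after you swap cables, suspect the new cable too, as happened to me.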
 