Is it possible to undo a Replace?

Integer · Mar 19, 2018

I am using encrypted raid z1 - and boy do I regret z1 instead of z2 right now.

I had a drive start to go bad on me. No data was lost, but there were some SMART test failures, and metrics showed a couple of bad reads on the drive. So I bought a new drive and started replacing it according to http://doc.freenas.org/11/storage.html#replacing-an-encrypted-drive. During resilvering things started to go wrong. I noticed that reading files through SMB would sometimes fail, and I got write errors as well. Jails took a very long time to appear in the jails list. In my logs I see many messages like the following (ada7 is not the new disk; the new disk is ada0):

Code:

Mar 19 08:22:27 note GEOM_ELI(ada7:ahcich7:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 d0 ce 48 40 28 01 00 01 00 00
Mar 19 08:22:27 note g_eli_read_done() failed (error=5) gptid/a4ee10c8-45dd-11e3-9663-60a44caf3660.eli[READ(offset=2542914260992, length=798720)]
Mar 19 08:22:27 note (ada7:ahcich7:0:0:0): CAM status: Uncorrectable parity/CRC error
Mar 19 08:22:27 note (ada7:ahcich7:0:0:0): Retrying command
Mar 19 08:22:27 note (ada7:ahcich7:0:0:0): READ_FPDMA_QUEUED. ACB: 60 78 d0 cf 48 40 28 01 00 00 00 00
Mar 19 08:22:27 note (ada7:ahcich7:0:0:0): CAM status: Uncorrectable parity/CRC error

And when checking resilvering progress I'd see something up near 2% one time and then look again and see it back at 0.15%. In the volume status pane I see the volume is degraded, the pool (I think it's the pool, it's named "raidz1-0") shows as degraded, and disk ada0p2 also shows as degraded. All other disks show as healthy.

Maybe I'm off base here, but that looks like ada7 is in big trouble as well. I did try replacing the SATA cable just in case. Is it possible to go back and unreplace? The data on the pulled drive are still there. My hope is I can restore the volume using the pulled drive, back up what I can, replace ada7, resilver (hoping the originally pulled drive I've put back in place doesn't get worse), replace ada0 again, and resilver again.

rs225 · Mar 19, 2018

Pretty sure the answer is yes. The way to find out is to shut down, swap the drives, and start back up. See if the pool is online and then check the zpool status. The old drive should very quickly be brought up to date. This is also a big reason why drives should not be off-lined unless necessary.

If the drive can't be brought up to date without hitting an error on ada7, then you can't get back to normal without deleting the corrupt data (if possible) or backing up what you can and then rebuilding.

Integer · Mar 19, 2018

Thanks for the suggestion. Would I need to offline the replacement drive before removing it? Would I then expect the original just to show up in its old spot or would I need to replace the replacement with the original or something like that?

Chris Moore · Mar 20, 2018

This being an encrypted pool, many of the regular answers don't apply. You need to copy the data out of the pool while you still can.

rs225 · Mar 20, 2018

Do back up what you can while you can, as suggested above.

No to offlining, no to replace. If you run zpool status you will see the true configuration. If the old disk can get back to ONLINE, then you might be in good shape and can post your zpool status output. If not, the pool has to be abandoned with whatever you can get out of it.
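
For anyone who wants to check this from a script rather than by eye, here is a minimal sketch that reads the device table out of zpool status text and lists everything that is not ONLINE. The sample text is made up for illustration (the pool name "tank" and the gptid labels are placeholders, not from this thread), and it assumes the usual NAME / STATE / READ / WRITE / CKSUM table layout. In practice you would feed it the real output of zpool status instead.

```python
def parse_device_states(status_text):
    """Map each name in the config table of `zpool status` text to its STATE column."""
    states = {}
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if not in_table:
            # The table starts right after the "NAME STATE READ WRITE CKSUM" header.
            in_table = fields[:2] == ["NAME", "STATE"]
            continue
        if not fields:
            break  # a blank line ends the table
        if len(fields) >= 2:
            states[fields[0]] = fields[1]
    return states


def not_online(states):
    """Return the sorted names whose state is anything other than ONLINE."""
    return sorted(name for name, state in states.items() if state != "ONLINE")


# Hypothetical sample, not the poster's actual output.
SAMPLE_STATUS = """
  pool: tank
 state: DEGRADED
config:

    NAME                      STATE     READ WRITE CKSUM
    tank                      DEGRADED     0     0     0
      raidz1-0                DEGRADED     0     0     0
        gptid/old-disk.eli    ONLINE       0     0     0
        gptid/new-disk.eli    DEGRADED     0     0     0

errors: No known data errors
"""

states = parse_device_states(SAMPLE_STATUS)
print(not_online(states))
```

If the list comes back empty after the old disk is back in place, every vdev and disk in the table is reporting ONLINE.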

Chris Moore · Mar 20, 2018

The encryption significantly limits your options. If you do the wrong thing here, the encryption will snap down on you like a trap and you will not be able to access this data EVER again.
After you have made a backup, then you can try any combination of options you want with no fear of losing your data.
Personally, I would do a full badblocks test on all the drives to determine which of them are still worth using and create a new RAIDZ2 pool (adding disks if needed) to have enough redundancy to not get in this spot again.

Chris Moore · Mar 20, 2018

PS. Can you give us a full rundown on your hardware and pool layout? How many drives do you have? What size are they?

Integer · Mar 20, 2018

I thank you all for your suggestions. I did manage to fix it, and I kind of feel dumb given the solution. It was really bugging me that I had a bunch of working drives throwing no errors, and when I replaced the one bad drive another good drive would coincidentally fail at exactly that time. Especially since the one that I was replacing was itself a recent replacement, and all the rest of the drives have been running stably for five years, and I know those CRC errors are really common with bad cables. Turns out the replacement SATA cable I put in was also bad. Third cable's the charm I guess.
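
Since CRC errors like these tend to point at the link rather than the disk, it can help to tally them per device and see whether they follow the drive or the cable/port. A rough sketch, assuming log lines in the same format as the ones quoted in the first post (the function name and regex are just for illustration):

```python
import re
from collections import Counter

# Matches the device name in lines such as
# "(ada7:ahcich7:0:0:0): CAM status: Uncorrectable parity/CRC error"
CRC_LINE = re.compile(
    r"\((?P<dev>[a-z]+\d+):[^)]*\): CAM status: Uncorrectable parity/CRC error"
)


def count_crc_errors(lines):
    """Count CAM parity/CRC error lines per device name."""
    counts = Counter()
    for line in lines:
        match = CRC_LINE.search(line)
        if match:
            counts[match.group("dev")] += 1
    return counts


# Lines copied from the log excerpt in the first post.
sample = [
    "Mar 19 08:22:27 note GEOM_ELI(ada7:ahcich7:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 d0 ce 48 40 28 01 00 01 00 00",
    "Mar 19 08:22:27 note (ada7:ahcich7:0:0:0): CAM status: Uncorrectable parity/CRC error",
    "Mar 19 08:22:27 note (ada7:ahcich7:0:0:0): Retrying command",
    "Mar 19 08:22:27 note (ada7:ahcich7:0:0:0): CAM status: Uncorrectable parity/CRC error",
]

crc_counts = count_crc_errors(sample)
print(crc_counts)
```

If the count keeps climbing on the same port after swapping the drive, or stops after swapping the cable, that is a decent hint about which one is at fault.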
