Replacing Failed Drive Troubles

AceMilo

Dabbler
Joined
Jul 12, 2015
Messages
25
Let me explain my issue from the start. I have two arrays; the one with issues is a 4x4TB RAIDZ array, all Seagate IronWolf drives. I started getting SMART errors on one of the drives, so I got a replacement from Seagate under warranty. I offlined the old drive, put the new one in, and resilvered, and it went fine. After a day or so I started getting new I/O errors and the pool wasn't acting right. After a couple of weeks I restarted my box and it wouldn't boot anymore. I removed that drive and it booted right up, so clearly that drive has a problem. I got a replacement for the first replacement and put it in the box. The old drive just shows as unavailable, since I couldn't boot with it installed to offline it. I resilvered, and now I'm having the issue described below.

The old drive is still listed even after I replaced it in the array. If I try to offline or remove the old one, I get the error "no valid replicas". If I try to offline the new drive, I get the same error. I read that scrubbing the pool solves this, but I did that and still get the same errors. If I try to remove the old drive or do anything else, the pool just resilvers again. It has probably resilvered half a dozen times, the old drive is still listed, and the pool is still degraded. I cannot get it out of the degraded state no matter what I do.

Please, any help is appreciated. I need to get this thing fixed. It's currently resilvering again.

[Attachment: 1589816215409.png]
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
If it's consistently that drive position that errors out, regardless of drive, then perhaps it's the backplane port that's faulty?
 

AceMilo

Dabbler
Joined
Jul 12, 2015
Messages
25
Samuel Tai said:
If it's consistently that drive position that errors out, regardless of drive, then perhaps it's the backplane port that's faulty?

Anything is possible, but I can't continue testing without fixing this new issue. I just need to know what to do next to get this array fixed.
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
"Permanent errors have been detected in the following files" will cause issues for you. That means they had errors that could not be handled by parity, likely because a second drive showed issues while you were replacing the one that failed.

Those files need to be deleted and replaced from backup. That may allow you to complete the resilver successfully.
 

AceMilo

Dabbler
Joined
Jul 12, 2015
Messages
25
Yorick said:
"Permanent errors have been detected in the following files" will cause issues for you. That means they had errors that could not be handled by parity, likely because a second drive showed issues while you were replacing the one that failed.

Those files need to be deleted and replaced from backup. That may allow you to complete the resilver successfully.

Great, how do I tell which files they are? They only show up as hex values of some kind.
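
For reference, `zpool status -v` lists each damaged object either as a file path or, when ZFS cannot resolve a path, as a pair of IDs in hex such as `<0x2c>:<0x5f1>` (dataset, then object). The following is a minimal Python sketch that separates the two kinds of entries. The sample text is invented for illustration and is not the output from the attached screenshot; the function name is hypothetical.

```python
import re

# Hypothetical sample; real `zpool status -v` output will differ.
SAMPLE = """\
errors: Permanent errors have been detected in the following files:

        /mnt/tank/media/video.mkv
        tank/iocage:<0x1a3>
        <0x2c>:<0x5f1>
"""

# Matches "dataset:<0xOBJ>" or "<0xDATASET>:<0xOBJ>" entries.
HEX_ENTRY = re.compile(
    r"^(?P<dataset><0x[0-9a-f]+>|[^:<]+):<0x(?P<obj>[0-9a-f]+)>$",
    re.IGNORECASE,
)

def parse_permanent_errors(status_text):
    """Split the error list into resolved paths and unresolved object IDs."""
    paths, objects = [], []
    in_list = False
    for line in status_text.splitlines():
        entry = line.strip()
        if entry.startswith("errors: Permanent errors"):
            in_list = True
            continue
        if not in_list or not entry:
            continue
        match = HEX_ENTRY.match(entry)
        if match:
            # Keep the dataset label as text and the object number as an int.
            objects.append((match.group("dataset"), int(match.group("obj"), 16)))
        else:
            paths.append(entry)
    return paths, objects
```

Entries that come back in the `objects` list have no path ZFS could print, so they cannot be mapped to a file name from this output alone.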
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
In another thread, someone tried to delete files with permanent errors, and killed their pool. Doing write operations on a degraded pool is asking for trouble. Restrict yourself to read operations only. At this time, you need to get your data off your pool as soon as possible. These files appear to be associated with plugins, so I wouldn't worry about trying to restore them. You're better off reinstalling the plugins after destroying your pool, recreating it, and restoring your data.
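
As a rough illustration of "read operations only", here is a minimal Python sketch that copies everything readable off a source tree and keeps going when a file fails to read, so one damaged file does not abort the whole rescue. The function name and both paths are hypothetical, and in practice a tool such as rsync would be the more usual choice; this only shows the idea.

```python
import shutil
from pathlib import Path

def copy_tree_tolerant(src_root, dst_root):
    """Copy every readable file from src_root to dst_root.

    Nothing is written to src_root. Files that raise OSError (for
    example an I/O error on a damaged block) are skipped and returned
    as (path, error message) pairs.
    """
    src_root, dst_root = Path(src_root), Path(dst_root)
    failed = []
    for src in sorted(src_root.rglob("*")):
        if not src.is_file():
            continue
        dst = dst_root / src.relative_to(src_root)
        try:
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)  # copies data plus timestamps
        except OSError as exc:
            failed.append((str(src), str(exc)))
    return failed

# Example call (hypothetical paths):
# unreadable = copy_tree_tolerant("/mnt/tank", "/mnt/backup/tank")
```

The returned list doubles as a record of which files will need to come from backup instead.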

If you can spare the space, you should rebuild your pool as RAIDZ2, so it can tolerate the loss of 2 drives instead of just 1 with RAIDZ1.
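
To put a number on "if you can spare the space", here is a back-of-the-envelope Python sketch. It assumes usable capacity is roughly (drives minus parity drives) times drive size and ignores ZFS metadata, padding, and TB/TiB differences, so treat the results as approximations. The function name is hypothetical.

```python
def raidz_usable_tb(drives, drive_tb, parity):
    """Approximate usable capacity of a single RAIDZ vdev in TB."""
    if parity not in (1, 2, 3):
        raise ValueError("RAIDZ parity level must be 1, 2, or 3")
    if drives <= parity:
        raise ValueError("need more drives than parity level")
    return (drives - parity) * drive_tb

print(raidz_usable_tb(4, 4, parity=1))  # 12 (RAIDZ1 with 4x4TB)
print(raidz_usable_tb(4, 4, parity=2))  # 8  (RAIDZ2 with 4x4TB)
```

With the same four 4TB drives, moving from RAIDZ1 to RAIDZ2 trades about 4TB of usable space for the ability to survive a second drive failure.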
 

subhuman

Contributor
Joined
Nov 21, 2019
Messages
121
Temporarily increasing the TLER values on the remaining drives may (emphasis on MAY) allow you to recover the data.
This is not without risk: a higher TLER setting will also put more strain on a drive that is already starting to fail.
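
As a sketch of what changing that setting could look like: smartmontools exposes the timer as SCT Error Recovery Control through `smartctl -l scterc,READTIME,WRITETIME`, with both times given in tenths of a second. The Python helper below only builds the command; the helper name, the device name, and the 30-second value are placeholders, not recommendations, and not every drive supports the setting.

```python
def scterc_command(device, read_seconds, write_seconds):
    """Build a smartctl command that sets the SCT ERC read/write timers.

    smartctl expects the timers in tenths of a second, so 7 seconds
    becomes 70.
    """
    read_ds = round(read_seconds * 10)
    write_ds = round(write_seconds * 10)
    return ["smartctl", "-l", f"scterc,{read_ds},{write_ds}", device]

# Hypothetical device and values; run as root only after reading the risks above.
cmd = scterc_command("/dev/ada1", 30, 30)
print(" ".join(cmd))  # smartctl -l scterc,300,300 /dev/ada1
# import subprocess; subprocess.run(cmd, check=True)
```

Running `smartctl -l scterc` on the device with no values reports the current timers, which is worth noting down before changing anything.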

You do not have a risk-free course of action left to you at this point. You had a raidz1 pool, but with one disk out you no longer have any redundancy.
 