Replaced faulty SSD - pool is now gone

thany

Dabbler
Joined
Sep 26, 2022
Messages
13
I have (or had, really) a pool consisting of 4 SSDs in RAID-Z mode.
One of them was faulty. I shut down the TrueNAS box so that I could pull them all out and check which serial number matched the one reported as faulty.
The faulty one got replaced.
Server is booted up.
And the pool is now completely gone.

This is ridiculous. Sorry but it really is. Am I not supposed to replace a faulty disk, and then simply select which new disk to add to the existing array? What else am I supposed to do?

I cannot import the pool either.

Luckily no important data is on it... YET.
But if this were to happen with important data on it, I would be thoroughly, let's say, "discouraged". I guess this is a perfectly failed test of a fault scenario.

So now what? Let's say, just for the sake of argument, that I have to retain the data that is supposed to be stored safely on the remaining 3 SSDs. What should I do to restore the array?

And also, what *should* I have done to replace the faulty SSD, other than what I already described? I had no way of knowing which one it means by the serial number, never mind its internal device name like `/dev/sdg` or something. I *have* to pull them all out to check which exact one is reported faulty. And you'd *think* that would be safer to do while the system is not powered on, wouldn't you?

Lastly, I want to say that even as a novice in the world of TrueNAS, I would consider this a critical bug. Top priority and all that. After all, there is no way to assign it a new drive and make it rebuild the raidness & redundancy.

Please advise.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
> Am I not supposed to replace a faulty disk, and then simply select which new disk to add to the existing array? What else am I supposed to do?
You mention adding a disk... did you use the "replace" action or the "Add" action?


> I cannot import the pool either.

What's the output from `zpool import`?

> And also, what *should* I have done to replace the faulty SSD, other than what I already described?
Replacement instructions linked above.

> like `/dev/sdg` or something
The first clue you've left as to the version you're using... seems to be SCALE...

Link to SCALE doc here:

> Lastly, I want to say that even as a novice in the world of TrueNAS, I would consider this a critical bug. Top priority and all that. After all, there is no way to assign it a new drive and make it rebuild the raidness & redundancy.
If you follow the instructions, you'll probably find (like most others) that everything works.

I suspect the "bug" was most likely a mistaken action on your part, but we can investigate that if you care to respond to the question above.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Just in the last few days, we had a user with a "RAIDZ" pool that turned out to be a striped pool. So we need that `zpool import` output. Fortunately, in that other user's case, he still had the old disk, and was able to re-install it and restore his pool to mostly functional, minus a few files he will have to restore from backups.

SSDs are a bit of an odd device. Without regular scrubs (monthly to several times a year), flash cells that are starting to get weak may not be detected until they bit flip. And if too many cells flip in a single block, internal ECC can't compensate, so that block goes away. ZFS would detect this and recover the block from redundancy, IF the block was read, like during a scrub.

I am NOT suggesting limiting scrubs to several times a year. Just pointing out that SSDs need them. (In theory, SSD firmware could perform its own internal scan and self-correct. But SSD vendors tend to be quite opaque about their firmware and its feature set.)


To be clear on ZFS pool failures: An overwhelming number of those that have shown up here are user error. Not saying it's your case, just reminding everyone reading that this is usually the case.
 

thany

Dabbler
Joined
Sep 26, 2022
Messages
13
> You mention adding a disk... did you use the "replace" action or the "Add" action?

Neither. The pool is GONE. How am I supposed to select any action on a disk inside it?

> What's the output from `zpool import`?

Sure thing:

```
   pool: SSD
     id: 3027844552434207005
  state: UNAVAIL
 status: One or more devices contains corrupted data.
 action: The pool cannot be imported due to damaged devices or data.
    see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-5E
 config:

        SSD                                       UNAVAIL  insufficient replicas
          raidz1-0                                UNAVAIL  insufficient replicas
            023172eb-29f5-4fce-b92b-2cff66c979d8  ONLINE
            f3e72362-fd80-450a-a74b-7c49124de95a  UNAVAIL
            57f3f728-ad70-4e87-8827-e815598d5990  ONLINE
            f2ed1ac6-36a1-45c1-aa6b-6941aa6228a1  UNAVAIL
```

This is after installing a replacement drive. All four drives are fine. UNAVAIL doesn't make sense to me.

> Replacement instructions linked above.

Sure, but what if an SSD (or HDD for that matter) *completely* fails, i.e. as far as the system is concerned, it is gone? That'll be the same situation as what I have here. It has to be possible to replace a drive after taking it out of the system. I would consider this an absolutely critical feature for any RAID config. Otherwise, it's basically as secure as a stripe, meaning 2^3 times as likely to die as a single drive, considering 4 drives.

> The first clue you've left as to the version you're using... seems to be SCALE...

Yes. Does this matter? This doesn't pertain to any SCALE-specific features, does it? It's just ZFS / RAID-Z stuff. Surely every TrueNAS has these features.

> If you follow the instructions, you'll probably find (like most others) that everything works.

Only if the entire pool hasn't shat itself. If I could still access the pool, fine, but it doesn't allow me to do anything to it, never mind replacing a disk... The instructions cannot be followed: they blindly assume the pool is still there, which it bloody well isn't.

> I suggest the "bug" was somehow likely to have been a mistaken action on your part, but we can investigate that if you care to respond to the question above.

Best not to assume user error. This is hardware error first of all, and sorry to say this again, but something as professional as TrueNAS should be able to bloody deal with it gracefully. How *could* I do something wrong? I have done *exactly* what anyone would and should do: replace the hardware that's faulty.

> SSDs are a bit of an odd device.

Not really.

> Without regular scrubs, (monthly to several times a year), flash cells that are starting to get weak

Not really. Any OS modern enough can deal with SSDs perfectly fine. The user does not have to do anything at all manually. Not in Windows, not in Linux. SSDs "just work" these days. Until they fail, of course.
 

thany

Dabbler
Joined
Sep 26, 2022
Messages
13
Welp, I just exported my array. So then I can import it again right? Wrong.

Import never shows a selectable array, neither before nor after export. Another thing that seems broken about TrueNAS. Probably user error again, but to me, import is the opposite of export, sooo... I dunno, I don't feel like I've made a mistake here.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
> This is after installing a replacement drive. All four drives are fine. UNAVAIL doesn't make sense to me.

Two drives are showing as UNAVAIL - have you confirmed that all drives are visible in the BIOS, and respond to things like smartctl queries? Check physical connections as well.
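
A rough sketch of one way to match a reported serial number to a device name without pulling any drives. It assumes a Linux-based SCALE system where `lsblk` is available; the helper names `read_lsblk` and `devices_by_serial` are made up for illustration and are not part of TrueNAS.

```python
import json
import subprocess


def read_lsblk():
    """Return JSON text from lsblk describing whole disks only."""
    # -d: whole disks only, -J: JSON output, -o: columns to show.
    result = subprocess.run(
        ["lsblk", "-d", "-J", "-o", "NAME,SERIAL,MODEL"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def devices_by_serial(lsblk_json):
    """Map each reported serial number to its /dev path."""
    mapping = {}
    for dev in json.loads(lsblk_json).get("blockdevices", []):
        serial = dev.get("serial")
        if serial:  # skip devices that report no serial (e.g. loop devices)
            mapping[serial] = "/dev/" + dev["name"]
    return mapping
```

Calling `devices_by_serial(read_lsblk())` on the box itself should give a serial-to-device table; a drive that is missing from that table is probably not being detected at all, which would point back at cabling or the BIOS.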

If you issue `zdb -ul /dev/daX` (that's a lowercase L, not an uppercase I) do you get a list of uberblocks from each device?

Can you show the output of `zdb -U /data/zfs/zpool.cache`?
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
@thany, you are right, today SSDs should just work. Until they don't. Many hard drives just run out of spares but are otherwise usable. Some SSDs just become dead, (when they run out of spares?), without definable reason.

Because you lost 2 members of your RAID-Z1 vDev, it's not importable, as @HoneyBadger stated above. This is always a risk in powering off a device.
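
The arithmetic behind that can be sketched as below. This is only an illustration, assuming a single RAID-Z vdev and treating every member that is not `ONLINE` as missing (a simplification); the function name `count_missing` is hypothetical, and the sample text is the config section from the `zpool import` output posted earlier in this thread.

```python
import re

# How many missing members each RAID-Z level can tolerate.
PARITY = {"raidz1": 1, "raidz2": 2, "raidz3": 3}

SAMPLE = """\
SSD UNAVAIL insufficient replicas
raidz1-0 UNAVAIL insufficient replicas
023172eb-29f5-4fce-b92b-2cff66c979d8 ONLINE
f3e72362-fd80-450a-a74b-7c49124de95a UNAVAIL
57f3f728-ad70-4e87-8827-e815598d5990 ONLINE
f2ed1ac6-36a1-45c1-aa6b-6941aa6228a1 UNAVAIL
"""


def count_missing(config_text):
    """Return (vdev_type, missing_members, tolerated) for one raidz vdev."""
    vdev_type = None
    missing = 0
    for line in config_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        match = re.fullmatch(r"(raidz[123])-\d+", fields[0])
        if match:
            vdev_type = match.group(1)   # lines after this are its members
        elif vdev_type is not None and fields[1] != "ONLINE":
            missing += 1                 # member is not ONLINE
    if vdev_type is None:
        raise ValueError("no raidz vdev found")
    return vdev_type, missing, PARITY[vdev_type]


vdev_type, missing, tolerated = count_missing(SAMPLE)
verdict = "importable" if missing <= tolerated else "not importable"
print(vdev_type, missing, tolerated, verdict)
```

For the listing in this thread it reports two missing members against a tolerance of one, which is why the pool shows "insufficient replicas".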

It may sound stupid, (but we have seen this exact mistake here before, so no disrespect intended):

Are you sure you pulled the correct SSD?
Do you have the old SSD?
Can you put it back in the exact same disk bay?
 