Replacing a dead HDD - questions... issues...

Status
Not open for further replies.

globus999

Contributor
Joined
Jun 9, 2011
Messages
105
OK, here we go again! Bitten by ZFS for the umpteenth time.

So I have a raidz1 called tank1 composed of 4 HDDs of different sizes.
One of the HDDs suddenly dies (i.e. it won't spin up).
OK, so zpool status indicates that tank1 is DEGRADED and ada5 (the dead hdd) is UNAVAILABLE.

So, after considerable reading of several ZFS manuals, I decide to follow the replacement instructions by the numbers:

1 - turn the system off
2 - replace the damaged hdd with a good hdd on the same controller, same port
3 - bring the system back up
4 - issue the command zpool replace tank1 /dev/ada5
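
For anyone following along, the steps above can be sketched as a small script. This is only a sketch under assumptions: the `run` helper, the `replace_in_place` function and the `DRY_RUN` switch are made up for illustration, and the pool and disk names are simply the ones from this post. The `zpool` subcommands themselves (`status`, `replace`) are standard, and `zpool replace` with a single device is the form meant for a new disk sitting in the old disk's slot.

```shell
#!/bin/sh
# Sketch only: prints the commands unless DRY_RUN=0 is set.
# POOL and DISK default to the names used in this thread.
POOL="${POOL:-tank1}"
DISK="${DISK:-/dev/ada5}"

# Hypothetical helper: echo the command in dry-run mode, otherwise run it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

replace_in_place() {
    run zpool status "$POOL"            # confirm which device is missing
    run zpool replace "$POOL" "$DISK"   # new disk is in the old disk's slot
    run zpool status -v "$POOL"         # watch resilver progress and errors
}

replace_in_place
```

Run as-is it only prints the three commands; setting `DRY_RUN=0` (as root, on the NAS) would execute them for real.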

So, ZFS starts resilvering ada5. So far so good.

A few checksum errors and 8 corrupted files later, ZFS finishes resilvering.

A few more pages later, I decide to erase those 8 files to clear the errors. I do so. However, there is no change in tank1's status.

Problem is, the pool is still listed as DEGRADED, /dev/ada5/old status is RESILVERING and the new ada5 hdd is ONLINE.

After checking that ZFS is doing bupkus (i.e. it has indeed finished resilvering), I try to bring ada5 ONLINE (just to make sure) with no effect whatsoever. OK.

Then I bring /dev/ada5/old OFFLINE, with no effect.

Then I DETACH /dev/ada5/old and tank1 becomes ONLINE.
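
The detach step that finally cleared the DEGRADED state could look roughly like the sketch below. Assumptions: the `run` helper, the `drop_stale_device` function and `DRY_RUN` are hypothetical names for illustration, and `OLD_DEV` is only a guess at how the leftover entry is spelled; copy the exact name from your own `zpool status` output before detaching anything.

```shell
#!/bin/sh
# Sketch only: prints the commands unless DRY_RUN=0 is set.
POOL="${POOL:-tank1}"
OLD_DEV="${OLD_DEV:-ada5/old}"   # assumption: use the exact name zpool status shows

# Hypothetical helper: echo the command in dry-run mode, otherwise run it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

drop_stale_device() {
    run zpool status -v "$POOL"           # check that no resilver is still running
    run zpool detach "$POOL" "$OLD_DEV"   # remove the leftover old-disk entry
    run zpool status "$POOL"              # in this thread the pool then showed ONLINE
}

drop_stale_device
```

Detaching while a resilver is genuinely still in progress would be a different situation, so the first status check is worth reading carefully.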

Fine. Then I, again, follow the manual's recommended procedure and issue a zpool scrub tank1 command... and... ZFS starts resilvering ada5!!!

WTF!!!

It is resilvering the hdd that just resilvered!!! And from scratch!!!

So, in summary:

1 - The manual is crap. ZFS does not do what the manual says it will do.
2 - The RESILVERING status does not get cleared automatically.
3 - Permanent errors do not get cleared (will they be cleared once the resilvering is done? do I need to scrub to get rid of them?)
4 - One has to DETACH a "ghost" hdd manually!!!
5 - The darn thing has a mind of its own!
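
On point 3, one way to check whether the permanent errors go away is sketched below. Treat it as an assumption rather than a statement of how this ZFS version behaves: my understanding is that the error list is only refreshed once a scrub completes, so deleting the files alone may not change the status. The `run` helper, the `recheck_errors` function and `DRY_RUN` are hypothetical names; `zpool status -v`, `zpool clear` and `zpool scrub` are standard subcommands.

```shell
#!/bin/sh
# Sketch only: prints the commands unless DRY_RUN=0 is set.
POOL="${POOL:-tank1}"

# Hypothetical helper: echo the command in dry-run mode, otherwise run it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

recheck_errors() {
    run zpool status -v "$POOL"   # list the files currently flagged as errors
    run zpool clear "$POOL"       # reset the per-device error counters
    run zpool scrub "$POOL"       # start a scrub; it runs in the background
    run zpool status -v "$POOL"   # repeat this until the scrub reports completion
}

recheck_errors
```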

:mad:

Does anybody have any idea what's going on? Any help would be *very* much appreciated indeed!
 

ProtoSD

MVP
Joined
Jul 1, 2011
Messages
3,348
I just had a nearly identical situation with my Z2 pool, with a few minor differences.

* I offlined the bad disk before shutting down and replacing it.
* I tried doing a zpool replace pool replaced-bad-disk-name, but got a bad vdev name error, and some other error saying that no disk by that name existed in the pool.
* I did a zpool online pool original-offlined-disk-name. It came online but said 'unavailable' under status.
* Without offlining the above drive, I tried doing a replace from the GUI and it started resilvering.
* 12 hours later it finished resilvering, but like you, a zpool status showed degraded with the OLD device listed also. I freaked out a little but then did a detach on that OLD device name and bingo, status was good and weird device was gone.

I didn't do a scrub after that, but from what you described it makes me wonder.
Also, WTF did the GUI do that the command line didn't? I saw the same error about a bad vdev, but then it just plugged along and resilvered.

I suspected part of the problem was the shifting of device names after offlining the bad disk, even though I used the same SATA port when connecting the new one. I think this (the name shifting) is fixed in the 8.0.1 betas.
 

Tekkie

Patron
Joined
May 31, 2011
Messages
353
8.0.1 beta3 fixes many issues with disk replacement but unfortunately introduces some others. There are at least 3 threads on this in the forum.
 

marian78

Patron
Joined
Jun 30, 2011
Messages
210

globus999

Contributor
Joined
Jun 9, 2011
Messages
105
* I offlined the bad disk before shutting down and replacing it.

I looked into this, but the manual says it's not necessary. Furthermore, the manual says OFFLINING is just a temporary solution. My thinking was that I didn't want ZFS writing some ghost hdd setting into the raidz1 config somewhere, threatening to show up at the next screw-up. The manual says ZFS handles everything automatically during hdd replacement. This is BS.

* I tried doing a zpool replace pool replaced-bad-disk-name, but got a bad vdev name error, and some other error saying that no disk by that name existed in the pool.

This is consistent with the manual. Manual says you cannot replace an OFFLINED disk. That's also why I did not OFFLINE mine.

* I did a zpool online pool original-offlined-disk-name. It came online but said 'unavailable' under status.

That is correct. ZFS still finds the dead disk... well dead :)

* Without offlining the above drive, I tried doing a replace from the GUI and it started resilvering.

I do not use the GUI, too buggy. I use strictly the CLI. My next step will be to check that the GUI reflects the raidz1 OK.

* 12 hours later it finished resilvering, but like you, a zpool status showed degraded with the OLD device listed also. I freaked out a little but then did a detach on that OLD device name and bingo, status was good and weird device was gone.

This is most definitely *NOT* in the manual. The manual says nothing of the sort. At this point I suspect that the ZFS implementation in FN8 is a *very* old one and there are a gazillion features that are not implemented. I got bitten by a similar non-implementation on a corruption issue some time ago.

I didn't do a scrub after that, but from what you described it makes me wonder.

I was just following the manual by the numbers. It recommends a scrub after each resilver... however, that sounds strange to me, since the manual also says that a resilver and a scrub are mutually exclusive because a resilver is "kind of a scrub". Huh? Anyhoo, a scrub should not hurt (just waste time and wattage).

As a matter of fact I am scrubbing now and it is scrubbing OK. Beyond that, who knows!


Also, WTF did the GUI do that the command line didn't? I saw the same error about a bad vdev, but then it just plugged along and resilvered.

Not a clue. I don't trust the GUI. At least on the CLI I know that ZFS features are missing but beyond that, there are no other bugs piled on.

I suspected part of the problem was the shifting of device names after offlining the bad disk, even though I used the same SATA port when connecting the new one. I think this (the name shifting) is fixed in the 8.0.1 betas.

Manual says nope. Manual says ZFS will recognize that the failed hdd was replaced simply because a good one was connected on the same controller same port. Manual says "zpool replace tank bad-hdd-name" is sufficient. At least in my case it did work.... to some degree...

Ohhhhh.... I am *SO NOT* impressed with ZFS!!!
 

globus999

Contributor
Joined
Jun 9, 2011
Messages
105
Hi, I also have a question about replacing HDDs. See my post (noob): http://forums.freenas.org/showthread.php?757-rebuild Thx. Marian.

Hi, this is a known bug in FN8. I opened a bug ticket for this issue. Changes made to any zpool through the CLI do not get properly reflected in the GUI. However, since FN8 hides a great deal more info from the user, I am not sure whether changing just the BSD hdd name is sufficient or whether the underlying parameters will also change.

In other words, it's a wait and see... unfortunately...
 

marian78

Patron
Joined
Jun 30, 2011
Messages
210
OK, and is FreeNAS v7 OK in this respect?
 

globus999

Contributor
Joined
Jun 9, 2011
Messages
105
OK, and is FreeNAS v7 OK in this respect?

Not sure, not too familiar with FN7 but, presumably, yes. Apparently there is a Synchronization button somewhere in the GUI that does just that. Haven't tried it.. so... caveat emptor!
 