How to force a resilver after a working drive was disconnected and later reconnected?


DukeNukem
Cadet · Joined May 8, 2013 · Messages: 4
Hello!

I'm running a RAID-Z2 with 6 drives. Since I only have 6 SATA ports and wanted to copy data from a 7th, single drive onto the pool, I disconnected one pool drive, connected the 7th drive (autoreplace being off!), copied the new data to the now-degraded pool, then disconnected the 7th drive and reconnected the previously disconnected pool drive (shutting down and restarting the system each time). I expected the pool to resilver automatically so that the reattached drive would be brought back in sync, but nothing happened. The pool simply came up ONLINE without resilvering, and even a scrub found no data to "repair".
How is that possible? I fear that the removed and reconnected drive is now somehow out of sync, so that if two other drives failed there would be data loss. Is that true? If so, how can I force a resilver of a drive that was never formally replaced because of errors, but only disconnected and reconnected?

Thank you very much!

Regards,
DukeNukem



Code:
  pool: tank
 state: ONLINE
  scan: scrub repaired 0 in 6h41m with 0 errors on Thu May  9 03:04:16 2013
config:

	NAME                                            STATE     READ WRITE CKSUM
	tank                                            ONLINE       0     0     0
	  raidz2-0                                      ONLINE       0     0     0
	    gptid/05a4deb5-b5c2-11e2-a6a1-60a44c3fd886  ONLINE       0     0     0
	    gptid/567c2a16-b36a-11e2-9b2c-60a44c3fd886  ONLINE       0     0     0
	    gptid/57255634-b36a-11e2-9b2c-60a44c3fd886  ONLINE       0     0     0
	    gptid/57cc2d36-b36a-11e2-9b2c-60a44c3fd886  ONLINE       0     0     0
	    gptid/58708a7a-b36a-11e2-9b2c-60a44c3fd886  ONLINE       0     0     0
	    gptid/590dc796-b36a-11e2-9b2c-60a44c3fd886  ONLINE       0     0     0

errors: No known data errors
 

cyberjock
Inactive Account · Joined Mar 25, 2012 · Messages: 19,526

I didn't do exactly what you did with a limited number of SATA ports, but I have experimented with pulling a drive and reattaching it after the pool's data had changed. In one case a scrub showed that some data was resilvered; in another it gave no indication that the drive being "out of sync" had ever been a problem. I have no explanation for why it showed nothing, but I can tell you that if you ran the scrub with all 6 drives attached, the drives are definitely in sync. I verified this on my test system back when I was testing this type of scenario.

I'd just make sure that some of the files or folders you copied are actually there on the zpool and call it good. You should also be doing scrubs regularly, so even if your manual scrub wasn't "good enough" (not sure why it wouldn't be), your next scrub should fix it. After all, the whole purpose of scrubs is to verify that all the drives are in sync and everything is fine.
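For reference, this is roughly how I kick off and watch a scrub from the shell; a minimal sketch, assuming the pool name tank from your paste:

Code:
# start a full verification pass over every block in the pool
zpool scrub tank

# watch progress and the per-disk READ/WRITE/CKSUM counters
zpool status -v tank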
 

paleoN
Wizard · Joined Apr 22, 2012 · Messages: 1,403
That's a bit bizarre. A scrub should pick it up. Are you sure you didn't miss it? I suppose you could offline/online the drive in question.
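If you want to try that from the shell rather than the GUI, a rough sketch, using your pool name and one of the gptids from your paste as a stand-in for the disk in question (the GUI is the supported way on FreeNAS):

Code:
# temporarily drop the suspect disk out of the pool
zpool offline tank gptid/05a4deb5-b5c2-11e2-a6a1-60a44c3fd886

# bring it back; ZFS should resilver whatever it missed
zpool online tank gptid/05a4deb5-b5c2-11e2-a6a1-60a44c3fd886

zpool status tank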

And use [code][/code] tags.
 

DukeNukem
Cadet · Joined May 8, 2013 · Messages: 4

Things got weird. After some time, "zpool status" reported 1.65K (I suppose 1,650) checksum errors on the drive:

Code:
  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 6h41m with 0 errors on Thu May  9 03:04:16 2013
config:

	NAME                                            STATE     READ WRITE CKSUM
	tank                                            ONLINE       0     0     0
	  raidz2-0                                      ONLINE       0     0     0
	    gptid/05a4deb5-b5c2-11e2-a6a1-60a44c3fd886  ONLINE       0     0 1.65K
	    gptid/567c2a16-b36a-11e2-9b2c-60a44c3fd886  ONLINE       0     0     0
	    gptid/57255634-b36a-11e2-9b2c-60a44c3fd886  ONLINE       0     0     0
	    gptid/57cc2d36-b36a-11e2-9b2c-60a44c3fd886  ONLINE       0     0     0
	    gptid/58708a7a-b36a-11e2-9b2c-60a44c3fd886  ONLINE       0     0     0
	    gptid/590dc796-b36a-11e2-9b2c-60a44c3fd886  ONLINE       0     0     0

errors: No known data errors


So I wanted to test whether that one drive really holds all the necessary data, and that's where it got worse. I offlined two of the other drives, keeping the drive in question online. The pool said DEGRADED, so in theory everything should still have been fine. But it was not: I got real read errors when trying to open files. zpool status reported:

Code:
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 6h41m with 0 errors on Thu May  9 03:04:16 2013
config:

	NAME                                            STATE     READ WRITE CKSUM
	tank                                            DEGRADED     0     0   245
	  raidz2-0                                      DEGRADED     0     0   490
	    gptid/05a4deb5-b5c2-11e2-a6a1-60a44c3fd886  ONLINE       0     0 1.65K
	    gptid/567c2a16-b36a-11e2-9b2c-60a44c3fd886  ONLINE       0     0     0
	    gptid/57255634-b36a-11e2-9b2c-60a44c3fd886  ONLINE       0     0     0
	    gptid/57cc2d36-b36a-11e2-9b2c-60a44c3fd886  ONLINE       0     0     0
	    8286206573674992439                         OFFLINE      0     0     0  was /dev/gptid/58708a7a-b36a-11e2-9b2c-60a44c3fd886
	    16420797784201749272                        OFFLINE      0     0     0  was /dev/gptid/590dc796-b36a-11e2-9b2c-60a44c3fd886

errors: 3 data errors, use '-v' for a list


I think this is a major flaw. Could the problem be that I didn't offline the drive before removing it in the first place, but just shut down the system and then removed it?

Trying to offline the drive first gave me:
Code:
cannot offline gptid/05a4deb5-b5c2-11e2-a6a1-60a44c3fd886: no valid replicas


But after a scrub, which finished instantaneously, offlining worked.

I guess I will try the following: shut down the system, remove the drive, wipe it, reattach it, start the system, and use the wiped drive to "zpool replace" the missing one.
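Roughly what I have in mind from the shell, using my pool name and the gptid of the drive in question from my status output above ("ada0" is just a placeholder for whatever device name the wiped disk ends up with; the FreeNAS GUI replace would normally handle the partitioning and swap, so treat this only as a sketch):

Code:
# drop the reconnected drive out of the pool
zpool offline tank gptid/05a4deb5-b5c2-11e2-a6a1-60a44c3fd886

# wipe its partition table so ZFS sees it as a blank disk (destroys everything on it!)
gpart destroy -F ada0

# resilver it back in, replacing the old entry "with itself"
zpool replace tank gptid/05a4deb5-b5c2-11e2-a6a1-60a44c3fd886 ada0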
Do you think this could work?


Thanks again!

Regards,
DukeNukem
 

cyberjock
Inactive Account · Joined Mar 25, 2012 · Messages: 19,526

I'm confused by your most recent post. You said you don't understand how there can be no valid replicas in a RAIDZ2 with only 1 drive offlined, but according to your zpool status you have 2 drives marked OFFLINE, so you have no redundancy left. Look at the last 2 disks in your previous post: with 2 disks already offline, it is normal that you can't take any more out.

Are you 100% sure that the disk you removed and reinstalled is the one with the 1.65K checksum errors? I'm just trying to make sure you aren't making bad assumptions. The biggest problem with disks is that they can be fine for months or years and then suddenly start having trouble overnight, and while you are trying to identify an older issue, a new one can confuse the diagnosis.

Just to clarify read/write/chksum errors, I think of it like this:

If a hard drive is requested to read data but the hard drive responds it can't read the sector, that's a READ error.

If a hard drive is requested to write data, but the hard drive says it can't write to that sector, that's a WRITE error.

If you have neither of the above, but the ZFS checksums determine that the data the hard drive read is not correct, that's a CHKSUM error.

Things do get muddy, though. With FreeBSD 8.3, if you unplug a disk while the system is on, it is not necessarily offlined immediately. Depending on your SATA controller, sometimes you will rack up a large number of READ and WRITE errors because the system still assumes the disk is connected, and sometimes you will only get CKSUM errors.

In any case, I think you need to shut down your server and reattach all of your disks. Boot up and check that they all appear in zpool status. Then, after you verify they are all online, do a zpool clear followed by a zpool scrub, and see what values you get when the scrub is complete.
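In other words, something along these lines from the shell (pool name taken from your paste):

Code:
zpool status tank     # confirm all six disks show ONLINE
zpool clear tank      # reset the READ/WRITE/CKSUM counters
zpool scrub tank      # re-verify every block against its checksums
zpool status tank     # check the counters again once the scrub completes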

Lastly, if you feel like rerunning the test you just tried, you can do so. But keep in mind that you should never, under any circumstances, remove a hard drive while the system is up and running without first detaching the disk from the array in the GUI; you can lose data that way. Doing a cold shutdown and removing disks with the power off is safe, however (as you did above); just don't do it with the system up.

To be honest, I'm wondering if you are confused about something else (or something else is wrong) and you aren't seeing the big picture. It does happen. I have complete faith that if all the drives were attached and none were offlined when you performed the scrub, and it returned no errors, then all of your data is safe.
 

DukeNukem
Cadet · Joined May 8, 2013 · Messages: 4

I edited my last post; perhaps that's why it's a bit confusing now.
I offlined two drives to test whether the first drive held the correct data. It did not. Definitely.
Then I put them back online and tried to offline the first drive, as paleoN suggested, which did not succeed at first, but after a scrub it did. That still did not lead to a resilvering process.
So the last thing I thought of was rebuilding the first drive by offlining it, wiping it, and replacing it "with itself". That seems to work; the resilvering process is ongoing as I type.

But to answer your question: I am definitely sure that the scrub was performed with all 6 drives attached and online. And I am definitely sure that data was missing when I took the last two drives offline, leaving the first drive online.

I can imagine this would not have happened if I had offlined the first drive before copying the new data to the degraded pool in the first place. The FreeNAS wiki actually says to always offline a drive before replacing it, to prevent "swap issues" ("... This step is needed to properly remove the device from the ZFS pool and to prevent swap issues. ...", http://wiki.freenas.org/index.php/Volumes).

But the remaining question is, why did the scrub not identify and fix the problem?

Best Regards,
DukeNukem
 

titan_rw
Guru · Joined Sep 1, 2012 · Messages: 586

I was under the impression that if you offline a drive, do some writes to the pool, and then online the original disk, ZFS will 'catch up' the previously offlined disk immediately. If the amount of written data wasn't huge, you might not even notice this catch-up resilver. Any subsequent scrub probably won't show any problems, since all drives should be back in sync by then.
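For example, on a throwaway test pool I'd expect something along these lines (pool and device names are made up for illustration):

Code:
zpool offline testpool ada3
# ... write some data to the pool while the disk is out ...
zpool online testpool ada3

# the 'scan:' line should briefly show a resilver in progress,
# or "resilvered ... with 0 errors" once the disk has caught up
zpool status testpool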

Maybe I'm either missing something from what the OP said they did, or not understanding zfs correctly.
 

DukeNukem
Cadet · Joined May 8, 2013 · Messages: 4

Perhaps my problem was that I did not offline the drive, but only shut the system down to remove it.
The amount of data written after removing the drive was substantial, so the catch-up should have been noticeable.
 

cyberjock
Inactive Account · Joined Mar 25, 2012 · Messages: 19,526

You know, this thread is really bothering me.

I had done some experimenting and found that a scrub would fix any potential loss of data from drives failing later on (despite "zpool status" not reporting that anything was fixed). I've read about quite a few people who pulled disks and later reattached them, and we all thought a scrub fixed it (mind you, I don't think anybody on the forums noticed that "zpool status" didn't show anything fixed). But what does this all mean? I thought I had this figured out and understood the risks, but now I'm questioning my knowledge of what to do about an accidental disconnect of a drive with the system off.

Does this mean that if I accidentally boot up my server with a drive disconnected, then shut it down when I find the mistake and reattach the drive, there is no way to bring the drive back in sync without offlining and replacing the disk via the GUI? That's a pretty scary thought, to be honest. I already had a situation where I accidentally left 3 disks disconnected from my RAIDZ3, and I simply did a scrub and called it good. I'm really wondering what is going on here, and whether someone is confused or this really is a potential path to disaster. I tested this last year and had convinced myself that if you removed a drive from the system temporarily and reattached it later, a scrub was all that was needed. But this thread is really making me question everything.

In all seriousness, I was convinced that if a disk ever ended up "out of sync" with the zpool it belongs to, all that was needed was to plug the drive back in and do a scrub. This thread is making me question that conviction.

Can anyone with know-how on this explain what you should do if you accidentally leave a disk in your zpool disconnected? For example, if I shut down my server to blow out the dust, forget to reinsert a SATA cable, and then boot the machine and use it, how do I fix that? Do I actually have to offline the old disk and replace it via the GUI, or can I just reattach the disk and run a scrub?
 

paleoN
Wizard · Joined Apr 22, 2012 · Messages: 1,403
So I wanted to test whether that one drive really holds all the necessary data, and that's where it got worse. I offlined two of the other drives, keeping the drive in question online.
Not what I suggested. I take it you have working backups?

Could the problem be that I didn't offline the drive before removing it in the first place, but just shut down the system and then removed it?
You would see different behavior if you were to offline the drive first, but a scrub will take care of it otherwise.

But after a scrub, which finished instantaneously, offlining worked.
I've seen a scrub do this before. It obviously didn't run. IIRC, I reran the scrub command and it worked.
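If it happens again, check that the scrub is actually doing work before trusting it; something like:

Code:
zpool scrub tank
zpool status tank   # the 'scan:' line should show "scrub in progress" with a growing byte count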
 