Problem due to hot plug, recovery, etc. (RAIDZ2 test config)

Status
Not open for further replies.

SkyMonkey

Contributor
Joined
Mar 13, 2013
Messages
102
In preparing to put my FreeNAS into production (read: trust my data to it, with additional offsite backups of vital data), today I did a few things, one of which caused an issue. I'm trying to learn from it and recover the system, and would appreciate advice or comments.

My system is described here for reference: http://forums.freenas.org/showthrea...for-home-storage&p=54229&viewfull=1#post54229

I unfortunately didn't really take good notes, as I am quite jetlagged after returning from halfway around the world yesterday...


0) Upgraded from 8.3.0-p1 to 8.3.1-p2. This went fine, and I restored my few custom scripts without an issue.

1) I wanted to practice a drive replacement on my config, so I offlined ada5, then unplugged it from my chassis. I then started copying files to the NAS. After a while, I got a degraded notice in the GUI (yellow light, etc.), and somewhat later an email from smartd. Good stuff. The file copies completed and performance seemed about the same (though I'm limited here to the speed of my source disks).

2) I plugged the removed drive (ada5) into a Windows PC, formatted it (quick format), wrote some data to it, etc., and then plugged it back into the NAS. This is where something happened. The instant I plugged it in (or within a second or two at most), I saw in the GUI console that ada3 and ada4 had been disconnected! A few seconds later it recognized that a disk was at ada5 again, but there were no similar messages for ada3/4.

3) I tried to copy some files, and was unable to do so (likely because ada3/4 were now inaccessible). The GUI showed some checksum errors (just a few) for read and write to those drives, and zpool status reported several errors in files (all metadata files).

4) Then I did something sort of dumb (I'm still learning, and majorly jetlagged...). I unplugged ada3 and plugged it back in without offlining it first. The GUI console spit out some messages about how it couldn't connect the device due to some status or another (0x6, I think?). I didn't copy it down; again, jetlag...

5) I decided to reboot the server. The reboot did not complete; it froze after the following messages on the actual console (I hooked up a monitor after it hadn't come back in 10 minutes):

Code:
Syncing disks, vnodes remaining...0 0 0 0 done
All buffers synced.


6) I hard rebooted the server. It came up fine, and I could again copy files, etc. The pool is still degraded, as I've not yet replaced ada5 in the GUI (it is in the chassis).

7) Because of the checksum and file errors from step 3, I decided to scrub the degraded pool before adding ada5 back and resilvering. The scrub is currently in progress and should complete in another half hour or so.
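For anyone wanting to wait on the scrub from step 7 at the command line rather than watching the GUI, here is a minimal sketch. It assumes that zpool status prints "scrub in progress" on its scan line while a scrub is running; the function names and the 60-second polling interval are made up for illustration.

```shell
#!/bin/sh
# Sketch only: decide from `zpool status` text whether a scrub is running.

# Reads `zpool status` output on stdin.
# Returns 0 if a scrub is in progress, non-zero otherwise.
scrub_running() {
    grep -q 'scrub in progress'
}

# Hypothetical wrapper: poll the given pool until its scrub has finished,
# then print the final status. The 60-second interval is an assumption.
wait_for_scrub() {
    pool="$1"
    while zpool status "$pool" | scrub_running; do
        sleep 60
    done
    zpool status "$pool"
}
```

Calling wait_for_scrub with the pool name would block until the scan line no longer reports a scrub in progress, after which the resilver could be started.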


So, I learned not to hotplug on my hardware! I've placed a sticky note internal to the chassis to remind myself of this if I ever have to replace a drive on this system once it's live.

A few questions:

A) Assuming the scrub completes and corrects any errors, am I good to go? I assume so, and I am going to attempt the resilver once it does complete.

B) Any idea what happened with the other devices dropping? I'm assuming something with my actual hardware config doesn't like hotplugging, but who knows. I am going to check all the connections and wiggle some cables around with the system up to see if it's a loose cable of some kind. Could be a backplane glitch in the hotswap chassis.

C) I know unplugging and replugging a drive without offlining it is a no-no. Any chance I messed something up permanently on the pool by doing this?

D) If the scrub does NOT find any errors to correct, what does that indicate given that a zpool status showed several checksum and file errors in step 3 above?


Thanks!
 

SkyMonkey

Contributor
Joined
Mar 13, 2013
Messages
102
Well, after some sleep and some thought, I think I can answer some of my questions....

A) The scrub completed (without finding/correcting any errors). I replaced ada5 and it resilvered successfully. I detached the old ghost drive in the GUI and everything is now back to how it was when I started.

B) No idea, haven't had time to experiment/wiggle cables yet.

C) Still looking for info on this one...

D) I assume this means that the drives were for some reason unavailable to the system, and completely untouched. Any checksum and file errors were due to the drives being unavailable to the system(?). The fact that the 4 file errors were metadata was indicative of the system trying to access the drive and being unable to do so(?).
 

SkyMonkey

Contributor
Joined
Mar 13, 2013
Messages
102
Well, experimentation has continued in the interest of learning. All appeared well after the scrubs, so I decided to repeat the hot plug after offlining experiment (steps 1 and 2). Same result, though this time only a single drive dropped. A subsequent reboot froze with a fatal trap 12, though after a replace/scrub, everything was good again. So, definitely do not hot plug on this system...

I then offlined and immediately onlined a drive, without shutting down or removing it. This worked fine: the online command picked the drive back up and re-added it to the array. I didn't copy any files to the array during the time it was offline.

Then, I tried the following:

1) Offline a drive.
2) Shut down and remove the drive.
3) Boot and copy files to the array.
4) Shut down and reinstall the drive.
5) Boot the system, and try to online the drive via the command line.
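
The command-line side of the steps above (steps 1 and 5) can be sketched as below. This is a dry-run sketch, not a tested procedure: by default it only prints the zpool commands, and the helper names (run, offline_drive, online_drive) and the RUN switch are invented for illustration. The pool and device names are passed in as arguments.

```shell
#!/bin/sh
# Dry-run sketch of the offline / online test. By default the commands are
# only printed; set RUN=1 to actually execute them (assumes a root shell).

run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$@"
    fi
}

# Step 1: take the drive offline before shutting down and removing it.
# $1 = pool name, $2 = device (for example a gptid/... label).
offline_drive() {
    run zpool offline "$1" "$2"
}

# Step 5: after reinstalling the same drive and booting, bring it back
# online and show the pool status to see whether a resilver started.
online_drive() {
    run zpool online "$1" "$2"
    run zpool status "$1"
}
```

For example, offline_drive bluemesa followed by the device's gptid label would print the zpool offline command that would be run.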

Based on my reading, I understood that onlining the same drive I removed should re-add it to the array, and should automatically resilver it for any missing data. This did not work.

Code:
[root@freenas] ~# zpool status
  pool: bluemesa
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 0 in 1h6m with 0 errors on Tue May 28 22:37:54 2013
config:

        NAME                                            STATE     READ WRITE CKSUM
        bluemesa                                        DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/b930828c-ae44-11e2-a357-7054d21693cc  ONLINE       0     0     0
            gptid/b9969742-ae44-11e2-a357-7054d21693cc  ONLINE       0     0     0
            gptid/ba00b169-ae44-11e2-a357-7054d21693cc  ONLINE       0     0     0
            gptid/ba756e6a-ae44-11e2-a357-7054d21693cc  ONLINE       0     0     0
            gptid/bae4d348-ae44-11e2-a357-7054d21693cc  ONLINE       0     0     0
            10771722031914304598                        OFFLINE      0     0     0  was /dev/dsk/gptid/e6dfea9a-c808-11e2-bddb-7054d21693cc

errors: No known data errors


Code:
[root@freenas] ~# zpool online bluemesa /dev/dsk/gptid/e6dfea9a-c808-11e2-bddb-7054d21693cc
warning: device '/dev/dsk/gptid/e6dfea9a-c808-11e2-bddb-7054d21693cc' onlined, but remains in faulted state
use 'zpool replace' to replace devices that are no longer present
[root@freenas] ~# zpool status -x
  pool: bluemesa
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-2Q
  scan: scrub repaired 0 in 1h6m with 0 errors on Tue May 28 22:37:54 2013
config:

        NAME                                            STATE     READ WRITE CKSUM
        bluemesa                                        DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/b930828c-ae44-11e2-a357-7054d21693cc  ONLINE       0     0     0
            gptid/b9969742-ae44-11e2-a357-7054d21693cc  ONLINE       0     0     0
            gptid/ba00b169-ae44-11e2-a357-7054d21693cc  ONLINE       0     0     0
            gptid/ba756e6a-ae44-11e2-a357-7054d21693cc  ONLINE       0     0     0
            gptid/bae4d348-ae44-11e2-a357-7054d21693cc  ONLINE       0     0     0
            10771722031914304598                        UNAVAIL      0     0     0  was /dev/dsk/gptid/e6dfea9a-c808-11e2-bddb-7054d21693cc

errors: No known data errors


Any idea why this didn't work? Attempting to replace the drive in the GUI didn't work either, because the GUI (correctly) sees ada5, the drive listed as unavail/offline above, as still belonging to the pool.


EDIT: Saw in this post: (http://forums.freenas.org/showthread.php?10359-Unable-to-ONLINE-a-drive-back-into-the-pool/page2) that shutdown and cold boot fixed a similar issue. I did this, and now have upon bootup:

Code:
[root@freenas] ~# zpool status -v
  pool: bluemesa
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 1h6m with 0 errors on Tue May 28 22:37:54 2013
config:

        NAME                                            STATE     READ WRITE CKSUM
        bluemesa                                        ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/b930828c-ae44-11e2-a357-7054d21693cc  ONLINE       0     0     0
            gptid/b9969742-ae44-11e2-a357-7054d21693cc  ONLINE       0     0     0
            gptid/ba00b169-ae44-11e2-a357-7054d21693cc  ONLINE       0     0     0
            gptid/ba756e6a-ae44-11e2-a357-7054d21693cc  ONLINE       0     0     0
            gptid/bae4d348-ae44-11e2-a357-7054d21693cc  ONLINE       0     0     0
            gptid/55892d30-c89c-11e2-9452-7054d21693cc  ONLINE       0     0     3

errors: No known data errors
[root@freenas] ~#


No partial resilver was done (I think?), so I'm going to scrub. Still confused about why this isn't working as I expected...

EDIT 2: The scrub completed nearly immediately (come to think of it, this happened before), and so I restarted a new one. Now it's actually repairing stuff. Odd....
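
To pick out things like the CKSUM count of 3 on the re-added drive without reading the whole table, the config section of zpool status can be filtered. The following is a rough sketch that assumes the column layout shown in the outputs above (NAME, STATE, READ, WRITE, CKSUM); the function name flag_vdevs is made up.

```shell
#!/bin/sh
# Sketch: read `zpool status` text on stdin and print every row of the
# config table that is not ONLINE or has a nonzero READ/WRITE/CKSUM counter.
flag_vdevs() {
    awk '
        # The header row marks the start of the config table.
        $1 == "NAME" && $2 == "STATE" { in_cfg = 1; next }
        # The "errors:" line marks the end of the table.
        /^errors:/ { in_cfg = 0 }
        # Table rows have at least five columns; blank lines are skipped.
        in_cfg && NF >= 5 {
            if ($2 != "ONLINE" || $3 != "0" || $4 != "0" || $5 != "0")
                print $1, $2, "read=" $3, "write=" $4, "cksum=" $5
        }
    '
}
```

Usage would be along the lines of zpool status bluemesa | flag_vdevs. On a degraded pool this also prints the pool and raidz2 rows, since their state is not ONLINE.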
 