In preparing to put my FreeNAS into production (read: trust my data to it, with additional offsite backups of vital data), today I did a few things, one of which caused an issue. I'm trying to learn from it and recover the system, and would like advice/comments.
My system is described here for reference: http://forums.freenas.org/showthrea...for-home-storage&p=54229&viewfull=1#post54229
I unfortunately didn't really take good notes, as I am quite jetlagged after returning from halfway around the world yesterday...
0) Upgraded from 8.3.0-p1 to 8.3.1-p2. This went fine, and I restored my few custom scripts without an issue.
1) I wanted to practice a drive replacement on my config, so I offlined ada5, then unplugged it from my chassis. I then started copying files to the NAS. After a while, I got a degraded notice in the GUI (yellow light, etc), and also (after a while), an email from smartd. Good stuff. The file copies completed and performance seemed about the same (though I'm limited here to the speed of my source disks).
2) I plugged the removed drive (ada5) into a Windows PC, formatted it (quick format), wrote some data to it, etc., and then plugged it back into the NAS. This is where something happened. The instant I plugged it in (or within a second or two at most), I saw in the GUI console that ada3 and ada4 had been disconnected! A few seconds later it recognized that a disk was at ada5 again, but there were no similar messages for ada3/4.
3) I tried to copy some files, and was unable to do so (likely because ada3/4 were now inaccessible). The GUI showed some checksum errors (just a few) for read and write to those drives, and zpool status reported several errors in files (all metadata files).
4) Then I did something sort of dumb (I'm still learning, and majorly jetlagged...). I unplugged ada3 and plugged it back in without offlining it first. The console GUI spit out some messages about how it couldn't connect the device due to some status or another (0x6 I think?). I didn't copy it down - again, jetlag...
5) I decided to reboot the server. The reboot did not complete; it froze after the following messages on the actual console (I hooked a monitor up after it hadn't come back in 10 minutes):
Code:
Syncing disks, vnodes remaining...0 0 0 0 done
All buffers synced.
6) I hard-rebooted the server. It came up fine, and I could again copy files etc. The pool is still degraded, as I've not replaced ada5 yet in the GUI (it is in the chassis).
7) Because of the checksum errors and file errors mentioned in step 3 above, I decided to run a scrub on the degraded pool before adding ada5 back and resilvering. This scrub is currently in progress, and will complete in another half hour or so.
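For anyone who wants to compare the per-device error counters from step 3 before and after the scrub, here is a rough sketch of how the device table printed by zpool status could be filtered for non-zero READ/WRITE/CKSUM values. The pool name (tank), the raidz2 layout, and every number in the sample text are made up for illustration and are NOT my real output; the counters are kept as strings because I believe zpool can abbreviate large counts.

```python
import re

# Matches one device row of the "config:" table in `zpool status` output:
#   NAME  STATE  READ  WRITE  CKSUM
ROW = re.compile(
    r"^\s+(\S+)\s+(ONLINE|DEGRADED|OFFLINE|FAULTED|UNAVAIL|REMOVED)"
    r"\s+(\S+)\s+(\S+)\s+(\S+)"
)


def devices_with_errors(status_text):
    """Return {name: (read, write, cksum)} for rows with any non-zero counter."""
    flagged = {}
    for line in status_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header, pool/state lines, blank lines, etc.
        name, _state, *counts = match.groups()
        # Counters stay as strings; anything other than "0" is flagged.
        if any(count != "0" for count in counts):
            flagged[name] = tuple(counts)
    return flagged


# Hypothetical sample text (assumed pool name, layout, and counts).
SAMPLE = """\
  pool: tank
 state: DEGRADED
config:

\tNAME        STATE     READ WRITE CKSUM
\ttank        DEGRADED     0     0     0
\t  raidz2-0  DEGRADED     0     0     0
\t    ada3    ONLINE       2     1     4
\t    ada4    ONLINE       0     0     3
\t    ada5    OFFLINE      0     0     0
"""

if __name__ == "__main__":
    for name, counts in devices_with_errors(SAMPLE).items():
        print(name, counts)
```

On a real system the text would come from saving the output of zpool status to a file (or piping it in) instead of the SAMPLE string.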
So, I learned not to hotplug on my hardware! I've placed a sticky note inside the chassis to remind myself of this if I ever have to replace a drive on this system once it's live.
A few questions:
A) Assuming the scrub completes and corrects any errors, am I good to go? I assume so, and I am going to attempt the resilver once it does complete.
B) Any idea what happened with the other devices dropping? I'm assuming something with my actual hardware config doesn't like hotplugging, but who knows. I am going to check all the connections and wiggle some cables around with the system up to see if it's a loose cable of some kind. Could be a backplane glitch in the hotswap chassis.
C) I know unplugging and replugging a drive without offlining it is a no-no. Any chance I messed something up permanently on the pool by doing this?
D) If the scrub does NOT find any errors to correct, what does that indicate given that a zpool status showed several checksum and file errors in step 3 above?
Thanks!