Testing Plans for Hardware Failures


headconnect
Following up a bit on the 'planning for hardware failure' thread, I thought I'd see what happened if I forced a bit of HW failure on my VirtualBox test environment.

So - what I've set up in VBox is (the VBoxManage equivalents are sketched below):

- 1.5 GB RAM
- FreeNAS 8.0 installed on a 2 GB virtual IDE drive
- Added 4x 6 GB SATA disks
- Set up RAID-Z across the 4 disks
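
For anyone wanting to reproduce the setup from the host shell, something like this should do it (a sketch; the VM name "FreeNAS" and controller name "SATA Controller" are assumptions, adjust to your VM):

Code:
# create and attach the four 6 GB data disks (VirtualBox 4.x syntax)
for i in 1 2 3 4; do
  VBoxManage createhd --filename "disk$i.vdi" --size 6144
  VBoxManage storageattach "FreeNAS" --storagectl "SATA Controller" \
    --port $((i-1)) --device 0 --type hdd --medium "disk$i.vdi"
done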

Been working quite nicely, so I thought it was high time I introduced some destruction.

What I was curious about is how FreeNAS would react to a disk failure - I'm still in the 'playing around' phase, so I haven't dug too deeply, but I haven't found anywhere I could set up notifications for failures or problems.

What I did was (CLI sketch below):
1. Shut down FreeNAS
2. Remove the fourth disk
3. Add a new - blank - 6 GB disk in its place
4. Boot FreeNAS
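
For reference, the VBoxManage equivalent of that swap (VM and controller names are assumptions, as above):

Code:
VBoxManage controlvm "FreeNAS" acpipowerbutton             # 1. shut down
VBoxManage storageattach "FreeNAS" --storagectl "SATA Controller" \
  --port 3 --device 0 --medium none                        # 2. detach disk 4
VBoxManage createhd --filename "disk4-new.vdi" --size 6144
VBoxManage storageattach "FreeNAS" --storagectl "SATA Controller" \
  --port 3 --device 0 --type hdd --medium "disk4-new.vdi"  # 3. blank disk in its place
VBoxManage startvm "FreeNAS"                               # 4. boot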

The good news? FreeNAS boots fine, and I can get into the web interface. The bad news? I can't access my CIFS share any longer (at the time I had only configured CIFS).

What I see in the console menu is this (may remove it if it gets too long):

Code:
Jun  2 09:49:29 FreeNAS kernel: ada0 at ata0 bus 0 scbus0 target 0 lun 0
Jun  2 09:49:29 FreeNAS kernel: ada0: <VBOX HARDDISK 1.0> ATA-6 device
Jun  2 09:49:29 FreeNAS kernel: ada0: 33.300MB/s transfers (UDMA2, PIO 65536bytes)
Jun  2 09:49:29 FreeNAS kernel: ada0: 2048MB (4194304 512 byte sectors: 16H 63S/T 4161C)
Jun  2 09:49:29 FreeNAS kernel: ada1 at ahcich0 bus 0 scbus2 target 0 lun 0
Jun  2 09:49:29 FreeNAS kernel: ada1: <VBOX HARDDISK 1.0> ATA-6 SATA 2.x device
Jun  2 09:49:29 FreeNAS kernel: ada1: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
Jun  2 09:49:29 FreeNAS kernel: ada1: Command Queueing enabled
Jun  2 09:49:29 FreeNAS kernel: ada1: 6144MB (12582912 512 byte sectors: 16H 63S/T 12483C)
Jun  2 09:49:29 FreeNAS kernel: ada2 at ahcich1 bus 0 scbus3 target 0 lun 0
Jun  2 09:49:29 FreeNAS kernel: ada2: <VBOX HARDDISK 1.0> ATA-6 SATA 2.x device
Jun  2 09:49:29 FreeNAS kernel: ada2: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
Jun  2 09:49:29 FreeNAS kernel: ada2: Command Queueing enabled
Jun  2 09:49:29 FreeNAS kernel: ada2: 6144MB (12582912 512 byte sectors: 16H 63S/T 12483C)
Jun  2 09:49:29 FreeNAS kernel: ada3 at ahcich2 bus 0 scbus4 target 0 lun 0
Jun  2 09:49:29 FreeNAS kernel: ada3: <VBOX HARDDISK 1.0> ATA-6 SATA 2.x device
Jun  2 09:49:29 FreeNAS kernel: ada3: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
Jun  2 09:49:29 FreeNAS kernel: ada3: Command Queueing enabled
Jun  2 09:49:29 FreeNAS kernel: ada3: 6144MB (12582912 512 byte sectors: 16H 63S/T 12483C)
Jun  2 09:49:29 FreeNAS kernel: ada4 at ahcich3 bus 0 scbus5 target 0 lun 0
Jun  2 09:49:29 FreeNAS kernel: ada4: <VBOX HARDDISK 1.0> ATA-6 SATA 2.x device
Jun  2 09:49:29 FreeNAS kernel: ada4: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
Jun  2 09:49:29 FreeNAS kernel: ada4: Command Queueing enabled
Jun  2 09:49:29 FreeNAS kernel: ada4: 6144MB (12582912 512 byte sectors: 16H 63S/T 12483C)
Jun  2 09:49:29 FreeNAS kernel: Trying to mount root from ufs:/dev/ufs/FreeNASs1a
Jun  2 09:49:29 FreeNAS kernel: ZFS NOTICE: Prefetch is disabled by default if less than 4GB of RAM is present;
Jun  2 09:49:29 FreeNAS kernel: to enable, add "vfs.zfs.prefetch_disable=0" to /boot/loader.conf.
Jun  2 09:49:29 FreeNAS kernel: ZFS filesystem version 4
Jun  2 09:49:29 FreeNAS kernel: ZFS storage pool version 15
Jun  2 09:49:30 FreeNAS root: /etc/rc.d/mountd: WARNING: /etc/exports is not readable.
Jun  2 09:49:30 FreeNAS mountd[1136]: can't open /etc/exports
Jun  2 09:49:31 FreeNAS root: /etc/rc: WARNING: failed precmd routine for vmware_guestd
Jun  2 09:49:31 FreeNAS smartd[1202]: Unable to register ATA device /dev/ada1 at line 4 of file /usr/local/etc/smartd.conf
Jun  2 09:49:31 FreeNAS smartd[1202]: Unable to register device /dev/ada1 (no Directive -d removable). Exiting.
Jun  2 09:49:31 FreeNAS root: /etc/rc: WARNING: failed to start smartd
Jun  2 09:49:33 FreeNAS kernel: fuse4bsd: version 0.3.9-pre1, FUSE ABI 7.8
Jun  2 09:49:33 FreeNAS ntpd[1379]: ntpd 4.2.4p5-a (1)
Jun  2 09:49:34 FreeNAS root: /etc/rc: WARNING: failed to start proftpd
Jun  2 09:49:34 FreeNAS cnid_metad[1495]: Set syslog logging to level: LOG_NOTE
Jun  2 09:49:34 FreeNAS mountd[1137]: can't open /etc/exports
Jun  2 09:49:34 FreeNAS mountd[1137]: can't open /etc/exports
Jun  2 09:49:39 FreeNAS avahi-daemon[1614]: WARNING: No NSS support for mDNS detected, consider installing nss-mdns!
Jun  2 07:49:42 FreeNAS ntpd[1380]: time reset -7199.328675 s
Jun  2 07:53:49 FreeNAS mountd[1137]: mount request denied from 10.0.1.72 for /dev
Jun  2 07:53:49 FreeNAS mountd[1137]: mount request denied from 10.0.1.72 for /dev


So it sees all drives, but there's no hint of rebuilding or anything to that effect. From the machine window, I don't see the regular 1-9 options.

Anyway - I'll probably add more to (and remove some of) this post once I play around a bit more, but basically, I would have expected RAID-Z to start rebuilding itself once it realised it had lost a disk and that a new disk was present.

While I can see the files in the dataset when logging into the machine via ssh, I would have hoped I could see them as well through CIFS.
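
From what I've read, ZFS only attempts an automatic replace onto a new disk in the same slot when the pool's autoreplace property is on, and it defaults to off; even then it depends on the platform noticing the swap. Something to check from the shell (pool name taken from later in this thread):

Code:
FreeNAS# zpool get autoreplace nasty2
FreeNAS# zpool set autoreplace=on nasty2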

Update 1:
So I rebooted FreeNAS again, and in the VirtualBox machine view I get the 1-9 option list, and lo and behold, I manage to access it through CIFS. However, I'm quite sure FreeNAS isn't actually rebuilding anything, as the .vdi (virtual disk image) which would need to get up to date is still at 28 KB compared to 200 MB for the other .vdis. Let's see what time does :)
 

mcleanrs
Do you need to fail out the missing drive first? I.e., boot with one drive totally missing before you boot with an empty disk in place.
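
Something like this before pulling the disk might be the cleaner simulation (a sketch; pool and device labels are taken from the zpool status output later in this thread):

Code:
FreeNAS# zpool offline nasty2 gpt/ada4    # mark the device as failed
FreeNAS# zpool status nasty2              # pool should now show DEGRADED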
 

headconnect
1. Shut down FreeNAS from console option '9'
2. Remove the disk from VirtualBox
3. Power on FreeNAS again
4. Noted no differences in the UI (nothing different from the previous console log)
5. Shut down FreeNAS (since the 1-9 menu didn't come up, doing so from the UI - note to GUI designer: nice touch to have the red mask :) You should add that for deleting datasets and suchlike as well ;))
6. Power on FreeNAS again (without adding the new disk; I just want to see that 1-9 menu pop up before I certify myself happy that it booted fine)
7. Still no 1-9 menu. Oh well. It seems I don't get CIFS if I don't see the 1-9 menu, so I guess it's getting stuck somewhere. No matter, that's fodder for another thread at another time, though this is getting tedious.
8. Got bored of that, started playing around in the UI, and found there is at least a place where I can see that the dataset is 'degraded'. 'View Disks' told me nothing of interest, but it seems to have a 'replace' function, so I'll try that now.
9. Shut down FreeNAS from the UI
10. Added the empty disk again in VirtualBox
11. Booted - also noted that it was Active Directory somehow hanging itself up that was causing the 1-9 issue. Strange stuff.. yet another possible post for the future, but no matter. CIFS wasn't accessible after hitting Ctrl-C to get away from that, but that's irrelevant for the moment.

So - no automatic actions were taken by FreeNAS to rebuild my volume. I went back into the 'View Disks' bit and did a 'replace' of ada4, with 'in-place' as the only option. That produced lots of info in the console logs, and I also noticed that the .vdi has grown decently in size, indicating I've probably succeeded in getting healthy again, but the UI is still stuck on 'degraded' even after logging in again. Considering trying another reboot for the hell of it, but going to wait a bit more..

Code:
Jun  2 14:30:06 FreeNAS freenas[1513]: Popen()ing: zpool list -H -o health nasty2
Jun  2 14:30:07 FreeNAS freenas[1513]: Popen()ing: zpool list -H -o health nasty2
Jun  2 14:30:24 FreeNAS freenas[1513]: Popen()ing: /sbin/sysctl -n kern.disks
Jun  2 14:30:24 FreeNAS freenas[1513]: Popen()ing: /usr/sbin/diskinfo ada4
Jun  2 14:30:25 FreeNAS freenas[1513]: Popen()ing: /usr/sbin/diskinfo ada3
Jun  2 14:30:25 FreeNAS freenas[1513]: Popen()ing: /usr/sbin/diskinfo ada2
Jun  2 14:30:25 FreeNAS freenas[1513]: Popen()ing: /usr/sbin/diskinfo ada1
Jun  2 14:30:25 FreeNAS freenas[1513]: Popen()ing: /usr/sbin/diskinfo ada0
Jun  2 14:30:28 FreeNAS freenas[1513]: Popen()ing: /sbin/sysctl -n kern.disks
Jun  2 14:30:28 FreeNAS freenas[1513]: Popen()ing: /usr/sbin/diskinfo ada4
Jun  2 14:30:28 FreeNAS freenas[1513]: Popen()ing: /usr/sbin/diskinfo ada3
Jun  2 14:30:28 FreeNAS freenas[1513]: Popen()ing: /usr/sbin/diskinfo ada2
Jun  2 14:30:28 FreeNAS freenas[1513]: Popen()ing: /usr/sbin/diskinfo ada1
Jun  2 14:30:28 FreeNAS freenas[1513]: Popen()ing: /usr/sbin/diskinfo ada0
Jun  2 14:30:28 FreeNAS freenas[1513]: Executing: /sbin/swapoff /dev/gpt/swap-ada4
Jun  2 14:30:28 FreeNAS freenas: swapoff: /dev/gpt/swap-ada4: No such file or directory
Jun  2 14:30:28 FreeNAS freenas[1513]: Executing: /sbin/zpool offline nasty2 gpt/ada4
Jun  2 14:30:29 FreeNAS freenas[1513]: Executing: geom disk list ada4 | grep 'Stripesize: 4096'
Jun  2 14:30:29 FreeNAS freenas[1513]: Executing: geom disk list ada4 | grep 'Sectorsize: 512'
Jun  2 14:30:29 FreeNAS freenas[1513]: Executing: dd if=/dev/zero of=/dev/ada4 bs=1m count=1
Jun  2 14:30:29 FreeNAS freenas: 1+0 records in
Jun  2 14:30:29 FreeNAS freenas: 1+0 records out
Jun  2 14:30:29 FreeNAS freenas: 1048576 bytes transferred in 0.012479 secs (84026796 bytes/sec)
Jun  2 14:30:29 FreeNAS freenas[1513]: Executing: dd if=/dev/zero of=/dev/ada4 bs=1m oseek=`diskinfo ada4 | awk '{print int($3 / (1024*1024)) - 4;}'`
Jun  2 14:30:29 FreeNAS freenas: dd: /dev/ada4: end of device
Jun  2 14:30:29 FreeNAS freenas: 5+0 records in
Jun  2 14:30:29 FreeNAS freenas: 4+0 records out
Jun  2 14:30:29 FreeNAS freenas: 4194304 bytes transferred in 0.057969 secs (72354439 bytes/sec)
Jun  2 14:30:29 FreeNAS freenas[1513]: Executing: gpart create -s gpt /dev/ada4 && gpart add -b 128 -t freebsd-swap -l swap-ada4 -s 4194304 ada4 && gpart add -t freebsd-zfs -l ada4 ada4
Jun  2 14:30:29 FreeNAS freenas: ada4 created
Jun  2 14:30:29 FreeNAS freenas: ada4p1 added
Jun  2 14:30:29 FreeNAS freenas: ada4p2 added
Jun  2 14:30:29 FreeNAS freenas[1513]: Executing: /sbin/swapon /dev/gpt/swap-ada4
Jun  2 14:30:29 FreeNAS freenas[1513]: Executing: /sbin/zpool online nasty2 gpt/ada4
Jun  2 14:30:30 FreeNAS freenas: warning: device 'gpt/ada4' onlined, but remains in faulted state
Jun  2 14:30:30 FreeNAS freenas: use 'zpool replace' to replace devices that are no longer present
Jun  2 14:30:30 FreeNAS freenas[1513]: Executing: /sbin/zpool replace nasty2 gpt/ada4
Jun  2 14:30:53 FreeNAS freenas[1513]: Popen()ing: zpool list -H -o health nasty2
Jun  2 14:30:53 FreeNAS freenas[1513]: Popen()ing: zpool list -H -o health nasty2
Jun  2 14:33:14 FreeNAS last message repeated 2 times
Jun  2 14:33:54 FreeNAS last message repeated 2 times
Jun  2 14:34:48 FreeNAS freenas[1513]: Popen()ing: /sbin/sysctl -n kern.disks
Jun  2 14:34:48 FreeNAS freenas[1513]: Popen()ing: /usr/sbin/diskinfo ada4
Jun  2 14:34:48 FreeNAS freenas[1513]: Popen()ing: /usr/sbin/diskinfo ada3
Jun  2 14:34:48 FreeNAS freenas[1513]: Popen()ing: /usr/sbin/diskinfo ada2
Jun  2 14:34:48 FreeNAS freenas[1513]: Popen()ing: /usr/sbin/diskinfo ada1
Jun  2 14:34:48 FreeNAS freenas[1513]: Popen()ing: /usr/sbin/diskinfo ada0


Thought I might as well have a quick look via SSH while I was at it - though I would prefer to stay completely on the UI side, for simplicity and usability for those not familiar with.. well, anything - but hey, a quick 'zpool status' in the shell yielded:

Code:
FreeNAS# zpool status
  pool: nasty2
 state: DEGRADED
 scrub: resilver completed after 0h0m with 0 errors on Thu Jun  2 14:30:37 2011
config:

        NAME                        STATE     READ WRITE CKSUM
        nasty2                      DEGRADED     0     0     0
          raidz1                    DEGRADED     0     0     0
            gpt/ada1                ONLINE       0     0     0
            gpt/ada2                ONLINE       0     0     0
            gpt/ada3                ONLINE       0     0     0
            replacing               DEGRADED     0     0     0
              12900431728038420588  UNAVAIL      0     0     0  was /dev/gpt/ada4/old
              gpt/ada4              ONLINE       0     0     0  119M resilvered

errors: No known data errors


Seems like the 'resilvered' size isn't changing much, and it looks about right.. so I'm going to try a restart for the hell of it (if I lose it all now, it's better than losing it all when it's the real deal).

Update: No change on restart; health is still the same, and I'm still fairly clueless as to how to get the volume healthy again. On the plus side, AD decided to cooperate this boot and CIFS now works again. Yay.
 
Joined May 27, 2011 · Messages: 566
Now you need to detach the old disk:

zpool detach <pool> <device>

So in your case: zpool detach nasty2 12900431728038420588

(Note: 'zpool remove' only handles hot spares, cache, and log devices on this pool version; 'zpool detach' is what drops the stale half of a 'replacing' vdev.)
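
A status check afterwards should show the 'replacing' vdev collapsed back into a plain raidz1:

Code:
FreeNAS# zpool detach nasty2 12900431728038420588
FreeNAS# zpool status nasty2    # all four gpt/adaN members should read ONLINE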
 

headconnect
Right - a bit of bad timing for this kind of testing on my part, as I'm now in France for ten days.. so hopefully I'll pick it back up when I return. In any event, shouldn't it be possible for the UI to take care of such things? It alludes to it with the 'replace' option.. Perhaps I'll start toying with Dojo when I get back :)
 
Joined May 27, 2011 · Messages: 566
ZFS is very robust. When I first started using ZFS, I was trying to break it.

I set up a raidz with 4 drives,
added some known data and hashed it,
wrote a script to constantly re-hash the file and compare it to the original,
then ran dd to zero out one of the drives while running the hash loop.

No errors.
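
A minimal sketch of that kind of test (FreeBSD-flavored; pool name, file size, and the victim device are assumptions):

Code:
# write known data to the pool and record its hash
dd if=/dev/random of=/tank/testfile bs=1m count=1024
sha256 -q /tank/testfile > /tmp/orig.sha256

# keep re-hashing the file and flag any mismatch
while true; do
    sha256 -q /tank/testfile | cmp -s - /tmp/orig.sha256 || echo MISMATCH
    sleep 5
done &

# meanwhile, trash one raidz member
sysctl kern.geom.debugflags=16   # FreeBSD: allow writes to an in-use disk
dd if=/dev/zero of=/dev/ada3 bs=1m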
 
ixdwhite
This test is erroneous due to file caching - when you hashed the file to verify after trashing the disk, you hashed an in-memory copy and not the trashed on-disk copy. You might have noticed that ZFS still thought the array was ONLINE, since it hadn't found errors during an actual disk I/O op yet. "zpool scrub" would have found the broken disk and fixed it.

For fun sometime try building a FreeBSD box with UFS, set kern.geom.debugflags=16 to turn off device access protection, then zero over the rootfs. The machine will last quite a while because of file caching. Once you try to do something that needs disk I/O (like create/remove directories) things go bad in a hurry.
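
Sketched out (only for a throwaway box, obviously; the device name is an assumption):

Code:
sysctl kern.geom.debugflags=16      # turn off GEOM's write protection
dd if=/dev/zero of=/dev/ada0 bs=1m  # zero over the live root disk
# the box keeps running on cached data until something needs real disk I/O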

Also, for the record, the "xxxxM resilvered" stat stays around until you run 'zpool clear'; the scrub: line at the top states when the resilver finished, which was about 9 minutes before your post.
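
That is, something along the lines of:

Code:
FreeNAS# zpool clear nasty2     # clear error counts / resilver annotations
FreeNAS# zpool status nasty2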
 
Joined May 27, 2011 · Messages: 566
The file was 100 GB, generated from /dev/random (this was on OpenSolaris, back when it existed) - well more than the cache, and not a bunch of zeros. I totally agree that if the file had been small enough, it would not have been read from disk again.
 