HDD failure, several problems

Status
Not open for further replies.

monarchdodra

Explorer
Joined
Feb 15, 2012
Messages
79
So I had an HDD die out on me. Drives die, but it is what happened afterwards that kind of bothers me:

1) I did not receive any diagnostics about this. My "root" account has an email address set, and I do receive notifications when replications fail, for example. On reboot, FreeNAS failed to detect an HDD and did not even send an email notification about it. Is this normal, or did I miss a setting somewhere? If one of my drives dies, I expect to be notified ASAP. Am I expected to log into the WebGUI daily to make sure everything is running fine?

2) I finally understood what was going on when I went to the WebGUI, which is not something I do very often anymore. There was a blinking yellow "warning" light. I'm not a professional network admin, but if a server fails to load all its mounts, isn't this a "critical" error? When looking at my pools, the only message was "Could not get drive size" (or something similar, I don't remember). I'm not fluent in FreeBSD, but can't it detect that I just plain removed the drive because it was dead? Did I "only" get a warning because FreeNAS made a wrong diagnosis?

3) (minor) In the Alert System window here:
[Screenshot: Clipboard01.png]
The message does not fit into the actual alert system box.

----
I'm still a FreeNAS newbie, but I expect that if an HDD craps out on me:
1) I receive an email.
2) The WebGUI notes a critical error.

Am I wrong in my way of assessing the situation?

PS: Did not find this reported yet. Sorry if duplicate.
 
Joined
Mar 15, 2012
Messages
2
We are testing FreeNAS 8 here at my office, and, during testing, I've noticed this same problem - possibly a little worse, actually. As a part of our testing, we physically removed a SATA drive from a RAID-Z2 RAID-set in a running system to simulate a critical failure of a member disk.

After 30 minutes, the "Alert" light is still green, the array status in FreeNAS still says "HEALTHY". The only indications that something is wrong are a) clicking on "View Disks" for the affected RAID-set shows the removed disk with no name and the serial number as "Unknown" and b) clicking zpool status yields "Sorry, an error occurred".

Attempting to use the "Replace" buttons to manually failover to the hot spare also yields a pop-up with "Sorry, an error occurred".

We are running FreeNAS-8.0.4-RELEASE-x64 (10351).
 

ProtoSD

MVP
Joined
Jul 1, 2011
Messages
3,348
Guys, this discussion has occurred here several times. The notification doesn't work right, and disconnecting a disk like that for testing doesn't work either. It's not a FreeNAS problem but a FreeBSD one. I think it's been fixed in FreeBSD 9.0 and possibly in FreeBSD 8.3, but for now we have to wait (I know it sucks, both the waiting and the problem). Search for some of the other discussions; they have more details.
 
Joined
Mar 15, 2012
Messages
2
Okay. Thanks for the info - and sorry for the duplicate posts, then. This is the first one I came across in my searching.

Thanks again!
-Eric
 

jpaetzel

Guest
Just to provide a little more info..

FreeBSD 8.2 (the base system for FreeNAS 8.x) will not notice a drive failure in a zpool unless a scrub is run or the system is rebooted. We are working on a solution to this, but it involves a ton of kernel changes that aren't even in the development branches of FreeBSD, and for the moment we are unwilling to unleash them on the FreeNAS user community, as they are not without some risk.

A related bug is that hot spares won't kick in automatically.

Both of these issues are in our top 5 list of very important things to work on resolving, and hopefully sooner rather than later we'll get them out to the world.
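
For anyone who wants a stopgap in the meantime, below is a minimal sketch of an external health check. It is not a FreeNAS feature and not how FreeNAS generates its alerts. It assumes Python 3.7+ is available on the box, that a local SMTP relay is listening on localhost, and that `zpool status -x` prints "all pools are healthy" when nothing is wrong. The function names and the addresses are made up for illustration. Note the limitation described above: the script only reports what ZFS already knows, so on FreeBSD 8.2 a dead drive may not show up until a scrub has run or the system has rebooted.

```python
import smtplib
import subprocess
from email.message import EmailMessage

# Text that `zpool status -x` prints when every pool is fine.
HEALTHY_MARKER = "all pools are healthy"


def pools_need_attention(status_output: str) -> bool:
    """Return True unless the output is the all-healthy message.

    Empty or unexpected output is also treated as a problem, so a
    failing `zpool` command does not pass silently.
    """
    return HEALTHY_MARKER not in status_output.strip().lower()


def build_alert(status_output: str, sender: str, recipient: str) -> EmailMessage:
    """Wrap the raw zpool output in a plain-text email."""
    msg = EmailMessage()
    msg["Subject"] = "ZFS pool needs attention"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(status_output)
    return msg


def check_and_notify(sender: str = "root@localhost",
                     recipient: str = "root@localhost",
                     smtp_host: str = "localhost") -> bool:
    """Run `zpool status -x` and mail the output if anything looks wrong.

    The default addresses and SMTP host are placeholders (assumptions).
    Returns True if an alert was sent.
    """
    result = subprocess.run(["zpool", "status", "-x"],
                            capture_output=True, text=True)
    output = result.stdout + result.stderr
    if not pools_need_attention(output):
        return False
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(build_alert(output, sender, recipient))
    return True
```

Calling `check_and_notify()` from a periodic job (cron, for example) would give an email whenever the pool status is anything other than healthy. Pairing it with a regularly scheduled scrub is what makes it useful on 8.x, since the scrub is what surfaces the failure in the first place.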
 

monarchdodra

Explorer
Joined
Feb 15, 2012
Messages
79
ProtoSD said: "Guys, this discussion has occurred here several times."
jpaetzel said: "Just to provide a little more info.."

Thank you both for the support and explanations.

I totally understand that FreeNAS is not able to immediately detect the failure. My surprise was that when it finally did detect something fishy, it didn't say anything. I guess you are trying to develop something reliable rather than just kludge a patch.

I'm just a home user, so this is not a real issue for me, but I report what I find to help as much as I can.

PS: What about the minor interface bug in my first post? Seems like it is just a cell size problem. Was this reported before?
 