3ware Drive Monitoring

Status
Not open for further replies.

MadsRC

Dabbler
Joined
Jul 14, 2013
Messages
20
NOTE: This is beta software. Use at your own risk!
Prologue

Let me start by saying, I'm fairly new to FreeNAS, so please bear with me if this has been done before.

Before I explain what the program does, you need to know my build.
I'm using a SuperMicro SC846 server, which comes with a 3ware/LSI 9650SE-24M8 RAID card.
A couple of days ago when I was setting up FreeNAS, I wanted to do something to monitor my disks. I realised I would get a daily mail from the FreeNAS system, but daily doesn't cut it with 24 drives...
I started setting up SMART checks (short, long and offline), realising it was a pain to time it all (so as to stay clear of scrubs and other SMART checks).
Then last night I stumbled across a line in the manual stating that if you have a RAID controller, you shouldn't schedule SMART checks, as the RAID controller should do that for you.

GREAT!!!

I did some testing using tw_cli and it seems like mine does.

Then I went down to the basement and pulled a drive. Back at the computer, the only "notice" I got was an entry in the messages log. No mail, no nothing...

After some more reading around, I found that I would need the 3ware GUI (w3dm or something) - but that isn't included in FreeNAS and I couldn't really find a way to get it working.

So, what's a hacker to do?

I created my own little "wrapper" around the tw_cli program.

What It Does

It queries tw_cli for the disk status. This status is then checked for the keyword "OK".
If OK isn't in the status, a mail is sent to root.

The next time it runs, it will remember the last errors. This makes sure you are only notified about a disk once (I run it every 15 minutes and only want to be told once), but it still notifies you of new errors.

It will use a temporary directory in /tmp called tw (/tmp/tw).

This will notify you if you pull a drive, and potentially if one of your drives fails (I haven't tested this as I don't have a bad drive... But a failed drive wouldn't have "OK" as a status... So it should notify you).
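
The check-and-remember logic described above could be sketched roughly like this. It is not the actual 3ware_info script: it assumes each status line looks like the `/c0/u4 status = INOPERABLE` line in the sample mail further down, and the function names are made up for illustration. Calling tw_cli, storing the previous result in /tmp/tw and sending the mail are left out.

```python
import re

# Matches lines such as "/c0/u4 status = INOPERABLE"
# (format assumed from the sample mail in this post).
STATUS_RE = re.compile(r"^(/c\d+/u\d+)\s+status\s*=\s*(\S+)$")


def failed_units(status_lines):
    """Return {unit: status} for every unit whose status is not OK."""
    failed = {}
    for line in status_lines:
        match = STATUS_RE.match(line.strip())
        if match and match.group(2) != "OK":
            failed[match.group(1)] = match.group(2)
    return failed


def new_failures(current, previous):
    """Return only the failures that were not already reported last run."""
    return {unit: status for unit, status in current.items()
            if previous.get(unit) != status}


if __name__ == "__main__":
    lines = ["/c0/u3 status = OK", "/c0/u4 status = INOPERABLE"]
    current = failed_units(lines)
    # With no previous errors remembered, every failure counts as new.
    print(new_failures(current, previous={}))
```

Keeping the comparison separate from the parsing means the "only tell me once" rule can be changed without touching the tw_cli handling.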

Installing

Installing is easy. Currently it's only distributed in source format, but it will run fine on FreeNAS.

Use the following commands to install it:
Code:
cd /tmp
wget www.v42.dk/data/code/python2/3ware_info_0.1.py
chmod +x 3ware_info_0.1.py


Please note: It will NOT survive a reboot! Haven't found a way to do that yet...


Then I add 2 cron jobs.
One to run every 15 minutes using this command: /tmp/3ware_info_0.1.py
One to run every night at 00:00: rm /tmp/tw/tw_last_error

The second cron job resets my "last_error", so I get notified once per day per failed drive.
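
If you would rather not keep a second cron job, the nightly reset could in principle be folded into the script itself. This is a minimal sketch, assuming the state file is /tmp/tw/tw_last_error as above. The function name is hypothetical and this is not how version 0.1 does it.

```python
import datetime
import os


def reset_if_from_earlier_day(path, today=None):
    """Delete the state file if it was last written before today.

    Returns True if the file was removed, False otherwise.
    """
    today = today or datetime.date.today()
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return False  # no state file yet, nothing to reset
    if datetime.date.fromtimestamp(mtime) < today:
        os.remove(path)
        return True
    return False
```

Calling `reset_if_from_earlier_day("/tmp/tw/tw_last_error")` at the start of each run would give roughly the same once-per-day behaviour as the midnight `rm`.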

Epilogue

Hope you can use it.
Currently it only supports 24 drives and a 3ware controller. But I bet you can modify it if you need to.
If you can't do it yourself, ask me nicely and I'll give it a shot.

EDIT: Oh yeah, I forgot. The output mail will look like this:
Code:
Local Time: Wed Jul 17 16:45:04 2013
 
DISC WARNING!!!
The following disk is not behaving normally:
 
/c0/u4 status = INOPERABLE
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Great job. To solve some of your problems, here's how you make it stay during reboot.

1. Store the files on the zpool. This sounds stupid on the surface but if you were to suddenly lose so many disks that the zpool isn't accessible the problem will be self-revealing so an email is unnecessary.
2. Mount the USB stick writable (mount -uw /), save your file wherever you want it, then make the USB stick read-only again (mount -ur /). Keep in mind that this option does NOT survive FreeNAS upgrades. Option 1 is better because it does survive an upgrade.

Now a few comments. Please don't take this as me bashing your process. I'm sure you learned a whole lot and got to get all snuggly with your controller. Your knowledge and experience may save your data someday.

For most users /c0/u4 isn't going to help them figure out which disk is bad. They aren't likely to know what /c0/u4 is (but at least they'll have information that "something" is wrong). Also, if I remember correctly, c0 represents controller 0 but u4 is the 4th drive attached to the controller and not necessarily the drive attached to port 4. /c0/p4 would match an exact port on the controller at all times. Some people (like myself) happen to scatter the drives throughout the controller ports because of the physical location of the hard drives (I spaced them apart for maximum cooling). Because of the disparity between uX and pX, I had to work with the FreeNAS developers to fix some user interface issues for 3ware controllers last year.

I did do a little bit of experimenting with tw_cli. While it does warn you if a disk is inoperable (i.e. it's been disconnected from the host controller), it didn't warn me of disks that were experiencing large numbers of errors. It kept saying that there were no problems with any disks despite one of them clearly failing in a miserable way (millions of read and write errors had occurred on the drive within 24 hours). SMART said that the disk had failed. I have no clue what 3ware's "threshold" is for labeling a disk as "inoperable", but it appears to only be "on disconnect from the host controller". Personally, I find this too crude and ineffective to be something I would ever rely on. Of much more importance are hard drives that are still attached to the system but doing nasty things while you are oblivious to the fact that the disk has already failed. Additionally, if you are using ZFS and shut down a server and a disk fails to come back online on power-up, your script will never notice the failed disk since the disks are not part of a hardware RAID (at least they shouldn't be with ZFS).

If you set up the SMART function in FreeNAS to monitor your drives every 15 minutes, you'll get an email if a disk starts racking up errors (Current Pending Sector Count and Offline Uncorrectables at the least, since I've seen those personally, on my 3ware 9650-24M8 too). If a disk is disconnected from the system, you'll get an email at the next SMART check that the disk didn't respond (I have personally seen this on a disk that failed on my 3ware 9650-24M8). So essentially the built-in SMART functions in FreeNAS, if set up, will provide you with a superior level of protection and are compatible with more than just the 3ware controllers.
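
For anyone who wants to check those two counters from a script, here is a rough sketch that picks them out of `smartctl -A` output. It assumes the usual attribute table with the raw value in the tenth column. The sample text is made up for illustration, and fetching the output from a drive behind the controller is left out.

```python
# Attribute names as they appear in the smartctl attribute table.
WATCHED = ("Current_Pending_Sector", "Offline_Uncorrectable")


def nonzero_counters(smartctl_output):
    """Return {attribute_name: raw_value} for watched attributes above zero."""
    found = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns; the raw value is the tenth.
        if len(fields) >= 10 and fields[1] in WATCHED and fields[9].isdigit():
            raw = int(fields[9])
            if raw > 0:
                found[fields[1]] = raw
    return found


# Made-up sample in the usual smartctl -A layout.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
"""

if __name__ == "__main__":
    print(nonzero_counters(SAMPLE))
```

A non-empty result would be the trigger for a warning mail. Which thresholds are worth alerting on is a judgment call, and this sketch simply treats anything above zero as worth a look.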

The SMART short tests are useless. If the hardware weren't capable of passing the short test, it wouldn't have been able to function at power-on, nor would it be performing normally. You'd already be getting the emails I described in the previous paragraph, plus additional emails throughout the day that your zpool has developed errors, and it would show up in the nightly email.

If you want more than just the standard FreeNAS features, you can use a script like http://forums.freenas.org/threads/setup-smart-reporting-via-email.6211/ and collect all of the disk information by email. It's not quite as simple as yours, but it's very thorough. I keep all of the emails on the zpool so I can refer back to them if I ever need to. They're only 150 KB each.

Edit: I just went looking for another thread that had a similar situation, but I just realized some of this is exactly what I said in the other thread. Sorry.
 

MadsRC

Again, thank you very much for your input CyberJock - I really love that you take your time to reply.

I created a dataset and put the script on that :)

As for the script: I'm aware c0/u4 would confuse some... I guess I could add the drive's serial number (as I expect people to know their drives' serial numbers...). I will also add support for multiple controllers (c0 and c1, for example).

This is version 0.1 and I intend to expand the use of the script. I noticed, as you said, that I didn't get any warnings from the script about Offline Uncorrectable errors (I've got a disk with 1 of those...). But as soon as I scheduled a couple of long tests and offline tests (2 a month of each), I began receiving some warnings.

The script you linked to actually was my inspiration. As I'm a hacker/coder by trade I enjoy cooking my own programs/scripts.
In the future, the script will be able to start SMART tests, analyse them and notify root of the problem.

While you could use the scheduled SMART tests, I prefer a script where I know 100% what it will do and when it will mail me. Therefore I'm adding support for SMART tests to this script.

Better to have 2 notifications about an error, than none at all.
 

MadsRC

I had forgotten everything about this thread. Code updated - I might even work some more on this ;)
 