how do I know if a disk is breaking?

creepwood · Jan 1, 2012

Having used v8 since release, I get the daily security output and daily run output mails. Now I wonder: how do I know if something is about to break, or a disk is about to fail? I'm not sure what everything in the daily output mail really means. Are there any settings I should set, like SMART testing once a month? I think I set that up but never got any mail about it.

I'd like some guidelines. Everything is running awesome, and it's making me worried :P

cpotter638 · Jan 1, 2012

Creepwood:
I found the below thread helpful. However, I have unanswered questions.

http://forums.freenas.org/showthread.php?821-HDD-Fail-Notification&highlight=smart

Creepwood & others:
Similar to creepwood's question - does anyone have a suggested schedule for SMART scan? Specifically - which SMART task or tasks and what frequency?

Above referenced thread indicates that NAS emailing with drive failure was not working in earlier versions of FreeNAS 8. Protosd had a work around script. What about FreeNAS 0.8.2? Is email with drive failure working? Or is protosd's workaround script still necessary? Is there a way this can be tested?

Thanks for the help.

cpotter638 · Jan 5, 2012

Another question - I've created a script which checks CPU & hard drive temps and emails at a threshold. At what threshold is CPU temp a problem? What about hard drive temps?

Thanks.

creepwood · Jan 5, 2012

cpotter638 said:
Creepwood:
I found the below thread helpful. However, I have unanswered questions.

http://forums.freenas.org/showthread.php?821-HDD-Fail-Notification&highlight=smart

Yeah, I already have -m root in the extra SMART options on each drive. The problem is that there isn't really a good way of testing it either?
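One possible way to test the notification path: the smartd.conf man page documents a `-M test` directive that sends a single test email when smartd starts. The helper below only prints such a config line. The name `smartd_test_line` is made up for illustration, and whether FreeNAS 8 passes `-M test` through from the extra SMART options field is an assumption.

```shell
# Sketch: print a smartd.conf line that requests a test email at smartd startup.
# smartd_test_line DEVICE [RECIPIENT]   (hypothetical helper, recipient defaults to root)
smartd_test_line() {
    dev=$1
    rcpt=${2:-root}
    # -a       = monitor all SMART attributes
    # -m ADDR  = mail warnings to ADDR
    # -M test  = send one test email when smartd starts
    echo "${dev} -a -m ${rcpt} -M test"
}
```

For example, `smartd_test_line /dev/ada0` prints `/dev/ada0 -a -m root -M test`. Once the test mail has arrived, `-M test` should be removed again, otherwise a test mail goes out on every smartd restart.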

joeschmuck · Feb 5, 2012

cpotter638 said:
Another question - I've created a script which checks CPU & hard drive temps and emails at a threshold. At what threshold is CPU temp a problem? What about hard drive temps?

That depends on your CPU's maximum operating temperature, or whatever you're comfortable with. For your CPU the thermal max is 69C. I would look at your CPU temp under normal load, and if it's 50C, I'd set the alarm set point to 59C to give yourself a little extra room. You could also try 55C, see if you get any false alarms, and adjust from there. This is just an example.
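That rule of thumb (stay a fixed margin below the thermal maximum, but above the normal load temperature) can be sketched as a small shell helper. The name `alarm_setpoint` and the default 10C margin are illustrative assumptions, not part of any FreeNAS tool.

```shell
# Sketch: derive an alarm set point (in C) from the CPU's thermal maximum.
# alarm_setpoint NORMAL MAX [MARGIN]   (hypothetical helper, MARGIN defaults to 10)
alarm_setpoint() {
    normal=$1
    max=$2
    margin=${3:-10}
    setpoint=$((max - margin))
    # The set point must sit above the normal operating temperature,
    # otherwise the alarm would fire all the time.
    if [ "$setpoint" -le "$normal" ]; then
        echo "error: set point ${setpoint}C is not above normal ${normal}C" >&2
        return 1
    fi
    echo "$setpoint"
}
```

With the numbers from this example, `alarm_setpoint 50 69` prints `59`.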

It would be nice if you could post your script.

cpotter638 · Mar 10, 2012

cpotter638 said:
Creepwood & others:
Similar to creepwood's question - does anyone have a suggested schedule for SMART scan? Specifically - which SMART task or tasks and what frequency?

Above referenced thread indicates that NAS emailing with drive failure was not working in earlier versions of FreeNAS 8. Protosd had a work around script. What about FreeNAS 0.8.2? Is email with drive failure working? Or is protosd's workaround script still necessary? Is there a way this can be tested?

I still have not found answers to the above questions. I've now upgraded to 8.0.4. Anyone who can shed some light re: above questions?

joeschmuck said:
It would be nice if you could post your script.

Joeschmuck: Thanks for your suggestions. My CPU / HD temp monitor script seems to be working well now. I've posted my script in case it helps others.

#!/bin/bash

# --------------------------Email Parameters-----------------------------------------
# Set "From"
FROM=******@sbcglobal.net

# Set email subject
SUBJECT="NAS Temperatures (Routine Monthly Report)"
SUBJECT_ALERT="NAS Temperature Alert"

# Set email recipient
TO=******@sbcglobal.net

# Set variable for email body
BODY=""
BODY_ALERT=""

# Set alert temperatures
# Can't find info re: Intel E3-1200 processor temps. Currently running with max around 85F. Saw one 89F.
CORE_ALERT_TEMP=110
# Max Seagate operating temp = 140F (60C). Drives currently running approx 75F
# Script found online used 110 degrees F
HARD_DRIVE_ALERT_TEMP=100

#Set loop variable
LOOP_COUNT=0

PRINTF=/usr/bin/printf

# --------------------------------CPU TEMP------------------------------------------
# CPU temp header
BODY=$BODY`echo "CPU temperatures: \n \n"`

# Check CPU temps
for i in {0..3}
do
CPU_TEMP=`sysctl -n dev.cpu.$i.temperature`
CPU_TEMP=`echo "1.8*${CPU_TEMP}+32" | bc | awk -F. '{print $1}'`
BODY=$BODY`echo -e "CPU Core #$i temperature is ${CPU_TEMP}\xB0F"`
BODY=$BODY`echo "\n"`
# CPU Temperature Alert
if [ ${CPU_TEMP} -gt $CORE_ALERT_TEMP ]; then
if [ ${LOOP_COUNT} == 0 ]; then
LOOP_COUNT=1
BODY_ALERT=$BODY_ALERT`echo "NAS Temperature Alert: \n \n"`
fi
BODY_ALERT=$BODY_ALERT`echo -e "CPU Core #$i temperature is too high measuring ${CPU_TEMP}\xB0F"`
BODY_ALERT=$BODY_ALERT`echo "\n"`
fi
done

# ------------------------------HARD DRIVE TEMP----------------------------------------
# Set # of disks (Not used until adding more disks)
# Disks=`sysctl -an kern.disks`

# Hard drive temp header
BODY=$BODY`echo "\n \nHard drive temperatures: \n \n"`

# Check hard drive temps
for i in ada0 ada1 ada2 ada3 ada4 ada5;
do
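HD_TEMP=`smartctl -a /dev/${i} | awk '$2=="Temperature_Celsius"{print $10}'`
# Skip drives that report no temperature (missing device or no Temperature_Celsius attribute)
if [ -z "${HD_TEMP}" ]; then continue; fi
HD_TEMP=`echo "1.8*${HD_TEMP}+32" | bc | awk -F. '{print $1}'`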
SERIAL=`smartctl -a /dev/$i| awk '/Serial/{print $NF}'`
BODY=$BODY`echo -e "Hard Drive ${i} (serial #${SERIAL}) temperature is ${HD_TEMP}\xB0F"`
BODY=$BODY`echo "\n"`
# Hard Drive Temperature Alert
if [ ${LOOP_COUNT} == 1 ]; then BODY_ALERT=$BODY_ALERT`echo "\n"`; LOOP_COUNT=2; fi
if [ ${HD_TEMP} -gt $HARD_DRIVE_ALERT_TEMP ]; then
if [ ${LOOP_COUNT} == 0 ]; then
LOOP_COUNT=1
BODY_ALERT=$BODY_ALERT`echo "NAS Temperature Alert: \n \n"`
fi
BODY_ALERT=$BODY_ALERT`echo -e "Hard Drive $i (serial #${SERIAL}) temperature is too high measuring ${HD_TEMP}\xB0F"`
BODY_ALERT=$BODY_ALERT`echo "\n"`
fi
done

# ---------------------------------Reporting--------------------------------------------------------------
# Send email with all CPU & hard drive temps on the 1st of each month
DATE=`date +%d`
if [ "${DATE}" == 01 ]; then
# Pass the body as an argument to %b so the "\n" sequences are expanded
# and any "%" in the text is not treated as a format directive
$PRINTF '%b' "$BODY" | mail -s "$SUBJECT" "$TO"
fi

# If alert message is not null, then email report
if [ -n "$BODY_ALERT" ]; then
$PRINTF '%b' "$BODY_ALERT" | mail -s "$SUBJECT_ALERT" "$TO"
fi
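
The script above hard-codes the drive list and leaves the `kern.disks` line commented out. A minimal sketch of two helpers that could replace the fixed list and the bc pipeline follows. Both function names are hypothetical, and the integer conversion assumes temperatures above 0C.

```shell
# Sketch: helpers for the temperature script. Both names are hypothetical.

# Print only the ada* devices from a space-separated disk list,
# such as the output of `sysctl -n kern.disks` on FreeBSD.
filter_ada_disks() {
    for d in $1; do
        case "$d" in
            ada[0-9]*) echo "$d" ;;
        esac
    done
}

# Convert whole degrees Celsius to whole degrees Fahrenheit
# using integer math (the fraction is truncated, as in the bc/awk version).
c_to_f() {
    echo $(( $1 * 9 / 5 + 32 ))
}
```

The drive loop could then start with `for i in $(filter_ada_disks "$(sysctl -n kern.disks)")`, and `c_to_f 60` prints `140`.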

joeschmuck · Mar 11, 2012

To answer some of the SMART questions: I have written a very small script that reports the status of your drives. You can modify it any way you see fit to meet your needs.

You asked how to get the error results back. The following scripts not only report the SMART test results, they also report additional data that you can use to spot possible failures coming down the road. This data helped me isolate a failing SATA cable, where some folks might have just replaced the drive because they didn't have the proper data in their hands.

As I understand it, the SMART test does not send you an email unless there was a failure, and that should work properly in 8.0.4. Until you have actually had a SMART test failure, it's hard to feel confident that the notification works, so the script will at least let you know that the drive passed.

The other thing you asked about was how often to run the Short or Long SMART test. I have not found anything on the internet that recommends a frequency. My personal feeling is to run the Short test weekly and the Long test maybe monthly. I'm only running the Short test, which checks the main electronics of the drive and that's about it. The Long test, as I understand it, additionally reads the entire drive surface. I'm not worried about that myself, as I'm using ZFS in a RAID, so if I have a single drive failure I expect ZFS to notify me. Hopefully I got that right. Also, from everything I've read about SMART testing, if you get a failure report from a SMART test you have 24 hours or less before the drive fails. I hope that isn't true, but I can't say I've read anything more positive than that.
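
The weekly Short / monthly Long schedule could be sketched like this for a daily cron job. The function names and the choice of Sunday and the 1st of the month are assumptions. `smartctl -t short` and `smartctl -t long` are the standard commands for starting a self-test.

```shell
# Sketch: decide which SMART self-test to start today.
# pick_smart_test DAY_OF_MONTH DAY_OF_WEEK   (hypothetical helper)
#   DAY_OF_MONTH as printed by `date +%d` (01..31)
#   DAY_OF_WEEK  as printed by `date +%u` (1 = Monday .. 7 = Sunday)
pick_smart_test() {
    if [ "$1" = "01" ]; then
        echo long        # monthly Long test on the 1st
    elif [ "$2" = "7" ]; then
        echo short       # weekly Short test on Sunday
    else
        echo none
    fi
}

# Start the chosen test on each drive given as an argument.
run_smart_tests() {
    kind=$(pick_smart_test "$(date +%d)" "$(date +%u)")
    if [ "$kind" = "none" ]; then
        return 0
    fi
    for dev in "$@"; do
        smartctl -t "$kind" "$dev"
    done
}
```

Called once a day as `run_smart_tests /dev/ada0 /dev/ada1`, this starts at most one test per drive per day. The test runs inside the drive in the background, so its result only shows up later in the self-test log (`smartctl -l selftest`).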

I have two scripts for your viewing pleasure...

Single Drive Request
Call it: sh esmart.sh /dev/ada0

Code:

#!/bin/sh
#
# Place this in /conf/base/etc/
# Call: sh esmart.sh drive
#
# smartctl options:
# -n standby = Skip the drive if it is in standby, so it doesn't spin up
# -i = Device Info
# -H = Device Health
# -A = Only Vendor specific SMART attributes
# -l error = SMART Error Log
# -a = All reportable data (not used here)
switch1=$1
(
echo "To: youremailaddress@email.net"
echo "Subject: SMART Drive Results for ${switch1}"
# An empty line separates the mail headers from the body
echo ""
) > /var/cover
smartctl -i -H -A -n standby -l error "${switch1}" >> /var/cover
echo "---------------END--END--END--END--END--END-----------------" >> /var/cover
sendmail -t < /var/cover
exit 0

This is great if you only want to poll a single drive at a time. Just set up a CRON job to do this whenever you desire. Do not schedule more than one of these jobs for the same time, as the runs would all write to the same file "cover" and mix up their results. Separate them by at least 1 minute, or go for the following script, which solves that problem.
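
Another way around the shared-file problem, sketched under the assumption that `mktemp` is available and that `/tmp` (or `$TMPDIR`) is writable: give every run its own report file. The helper name `new_report_file` is hypothetical.

```shell
# Sketch: create a unique report file per run so parallel jobs cannot collide.
# new_report_file is a hypothetical helper; the directory is an assumption.
new_report_file() {
    mktemp "${TMPDIR:-/tmp}/esmart.XXXXXX"
}

# Example use inside the single-drive script:
#   report=$(new_report_file) || exit 1
#   smartctl -i -H -A -n standby -l error "$1" >> "$report"
#   sendmail -t < "$report"
#   rm -f "$report"
```

Each call returns a different file name, so several cron jobs can start in the same minute without overwriting each other's output.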

The second script lets you do all your drives at once.
Call it: sh esmartall.sh

Code:

#!/bin/sh
#
# Place this in /conf/base/etc/
# Call: sh esmartall.sh
(
echo "To: youremail@address.net"
echo "Subject: SMART Drive Results for all drives"
echo "MIME-Version: 1.0"
echo "Content-Type: text/plain"
echo ""
echo "Drive ada0"
) > /var/cover1.txt
smartctl -i -H -n standby -l error -l selftest /dev/ada0 >> /var/cover1.txt
(
echo "---------------END--END--END--END--END--END-----------------"
echo " "
echo "Drive ada1"
) >> /var/cover1.txt
smartctl -i -H -n standby -l error -l selftest /dev/ada1 >> /var/cover1.txt
(
echo "---------------END--END--END--END--END--END-----------------"
echo " "
echo "Drive ada2"
) >> /var/cover1.txt
smartctl -i -H -n standby -l error -l selftest /dev/ada2 >> /var/cover1.txt
(
echo "---------------END--END--END--END--END--END-----------------"
echo " "
echo "Drive ada3"
) >> /var/cover1.txt
smartctl -i -H -n standby -l error -l selftest /dev/ada3 >> /var/cover1.txt
sendmail -t < /var/cover1.txt
exit 0

# Use -n standby so a drive in standby doesn't spin up.
# Options
# -n standby = If drive is in Standby, exit. (Will not allow drive spinup)
# -i = Device Info (Does not force spinup)
# -H = Device Health (Does force spinup)
# -A = Only Vendor specific SMART attributes
# -l error = SMART Error Log (Does force spinup)
# -a = List all (Does force spinup)

This second script was written for 4 drives (mine in particular). You can easily add or remove drives; there is enough code there to see the repeating pattern, but ask if you need help.
You can change the smartctl parameters, but I'm using what's best for me. I do not want to spin up the drives just to get a SMART status, as that would lead to an earlier death of the drives. If your drives are already spinning you get a full report back; if not, the email contains an error 2 message instead. I do not like the -a option: it is nice to see when you first start messing with the smartctl command, but it produces too much useless data and the extra data doesn't help you see a problem.
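
For a different number of drives, the repeated blocks can also be written as a loop. This is only a sketch of the same report using the same smartctl options. The function name `build_report` is hypothetical, and the drive names are passed as arguments.

```shell
# Sketch: build the all-drives report with a loop instead of one block per drive.
# build_report DRIVE...   (hypothetical helper; drives are names such as ada0)
build_report() {
    echo "To: youremail@address.net"
    echo "Subject: SMART Drive Results for all drives"
    # An empty line separates the mail headers from the body
    echo ""
    for drive in "$@"; do
        echo "Drive ${drive}"
        smartctl -i -H -n standby -l error -l selftest "/dev/${drive}"
        echo "---------------END--END--END--END--END--END-----------------"
        echo " "
    done
    # smartctl exits non-zero for drives in standby; do not treat that as failure
    return 0
}

# Example: build_report ada0 ada1 ada2 ada3 | sendmail -t
```

Piping straight into sendmail also avoids the temporary file in /var.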

Here is a link to the SMARTCTL manual page. You can use any options you like. http://smartmontools.sourceforge.net/man/smartctl.8.html
