Set up SMART Reporting via email

Wolfeman0101 · Jul 7, 2012

Code:

[Derp@freenas] ~# smartctl -n standby -l error -l selftest /dev/ada2
smartctl 5.42 2011-10-20 r3458 [FreeBSD 8.2-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
ATA Error Count: 1
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 16877 hours (703 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 51 01 00 00 00 a0  Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  b0 d6 01 e0 4f c2 a0 00   1d+17:29:53.210  SMART WRITE LOG

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     17846         -
# 2  Short offline       Completed without error       00%     17841         -
# 3  Short offline       Completed without error       00%     17817         -
# 4  Short offline       Completed without error       00%     17793         -
# 5  Short offline       Completed without error       00%     17771         -
# 6  Short offline       Completed without error       00%     17759         -
# 7  Short offline       Completed without error       00%     11317         -
# 8  Short offline       Completed without error       00%      5860         -
# 9  Short offline       Completed without error       00%      1837         -
#10  Short offline       Completed without error       00%       672         -

Code:

[Derp@freenas] ~# smartctl -n standby -l error -l selftest /dev/ada3
smartctl 5.42 2011-10-20 r3458 [FreeBSD 8.2-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
ATA Error Count: 1
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 7368 hours (307 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 51 01 00 00 00 a0  Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  b0 d6 01 e0 4f c2 a0 00   1d+17:29:42.759  SMART WRITE LOG

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      8333         -
# 2  Short offline       Completed without error       00%      8309         -
# 3  Short offline       Completed without error       00%      8285         -
# 4  Short offline       Completed without error       00%      8263         -
# 5  Short offline       Completed without error       00%      8251         -

joeschmuck · Jul 7, 2012

My first thought is whether there is a way to reset the error flag on those two drives. The failures on both drives seem to have occurred at the same time; I'm not sure what happened.

The other option is to create a second script for those two drives that dumps the entire error message into the body of the email instead of trying to make it fit into the subject line. I chose to use the subject line so I didn't have to open the email unless I wanted to.

You can mess around with the makeheader() section, remove the "; cat /var/cover1${drv}" text. See what happens there. I suspect somehow the long text in the subject line is causing an issue with the email. If that doesn't produce text in the message body then you will need to adjust the procnormal() section.

While working on this you should disable sendmail and use cat as I indicated earlier and it will echo everything to the screen so you can work it out.
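
One way to flip between the two without editing the script every time is a small wrapper around the delivery step. This is only a sketch: the `deliver` function and the `DRY_RUN` variable are made-up names for illustration and are not part of the script discussed in this thread.

```shell
#!/bin/sh
# Hypothetical helper: mail the finished message, or just print it.
# DRY_RUN is an assumed variable name; set DRY_RUN=1 while testing.
deliver() {
    msgfile=$1
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Echo the message to the screen instead of mailing it.
        cat "$msgfile"
    else
        # -t tells sendmail to take the recipients from the To: header.
        sendmail -t < "$msgfile"
    fi
}
```

With something like this in place, the last step of the script would call `deliver` on the message file, and running the script with `DRY_RUN=1` set in the environment shows the headers and body on screen.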

Also at this line you should comment out 'runsmartshort' so you are not running the test needlessly and waiting 5 minutes while testing the format of the email.

### Lets test the drive
#runsmartshort

If you get real fancy you could add another switch to denote the difference between the two drives but I think a separate script would work better.

Post your results.

Wolfeman0101 · Jul 8, 2012

I commented out runsmartshort. I haven't done anything else because I was busy last night. This is what I got for 1 of the drives.

Code:

Subject: SMART Drive Results for /dev/ada2 - ATA Error Count: 1
 CR = Command Register [HEX] FR = Features Register [HEX]
 SC = Sector Count Register [HEX] SN = Sector Number Register [HEX]
 CL = Cylinder Low Register [HEX] CH = Cylinder High Register [HEX]
 DH = Device/Head Register [HEX] DC = Device Command Register [HEX]
From: ****
To: ****
Date: Sun, 08 Jul 2012 12:10:01 -0000

#10  Short offline       Completed without error       00%      5860       =
  -
#11  Short offline       Completed without error       00%      1837       =
  -
#12  Short offline       Completed without error       00%       672       =
  -

=20

joeschmuck · Jul 8, 2012

I recommend you remove the code I referenced in the third line of my last reply. This should remove the text stating the ATA Error Count and then see where that takes you.

What model are those two hard drives? If you could post the results of 'smartctl -a /dev/ada2', that should list all the data for that drive.
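
If the error count should stay in the subject, one option is to keep only the first summary line so the header can never spill onto extra lines. A minimal sketch, assuming the smartctl output has already been saved to a file; `make_subject` is a hypothetical name and not a function from the script in this thread.

```shell
#!/bin/sh
# Hypothetical helper: build a one-line Subject from saved smartctl output.
# $1 = device name, $2 = file holding the "smartctl -l error" output.
make_subject() {
    dev=$1
    report=$2
    # Keep only the first matching line so the header stays on one line.
    summary=$(grep -E 'ATA Error Count|No Errors Logged' "$report" | head -n 1)
    echo "Subject: SMART Drive Results for ${dev} - ${summary:-no error summary found}"
}
```

The register legend (CR, FR, SC and so on) then stays in the message body, where long text is harmless.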

Wolfeman0101 · Jul 8, 2012

Code:

smartctl 5.42 2011-10-20 r3458 [FreeBSD 8.2-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD1501FASS-00U0B0
Serial Number:    WD-WMAUR0383617
LU WWN Device Id: 5 0014ee 6aab9fc0a
Firmware Version: 01.00101
User Capacity:    1,500,301,910,016 bytes [1.50 TB]
Sector Size:      512 bytes logical/physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sun Jul  8 15:39:25 2012 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
					was suspended by an interrupting command from host.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(23100) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x3037)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       182
  3 Spin_Up_Time            0x0027   054   040   021    Pre-fail  Always       -       14316
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       231
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   076   076   000    Old_age   Always       -       17875
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       139
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       112
193 Load_Cycle_Count        0x0032   167   167   000    Old_age   Always       -       101233
194 Temperature_Celsius     0x0022   105   103   000    Old_age   Always       -       47
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       10

SMART Error Log Version: 1
ATA Error Count: 1
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 16877 hours (703 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 51 01 00 00 00 a0  Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  b0 d6 01 e0 4f c2 a0 00   1d+17:29:53.210  SMART WRITE LOG

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     17858         -
# 2  Extended offline    Aborted by host               90%     17858         -
# 3  Short offline       Completed without error       00%     17846         -
# 4  Short offline       Completed without error       00%     17841         -
# 5  Short offline       Completed without error       00%     17817         -
# 6  Short offline       Completed without error       00%     17793         -
# 7  Short offline       Completed without error       00%     17771         -
# 8  Short offline       Completed without error       00%     17759         -
# 9  Short offline       Completed without error       00%     11317         -
#10  Short offline       Completed without error       00%      5860         -
#11  Short offline       Completed without error       00%      1837         -
#12  Short offline       Completed without error       00%       672         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

joeschmuck · Jul 8, 2012

Based on what I have read on the internet and the hours the drives appear to have been running continuously, I would recommend replacing the two drives. They have over 2 years of run time. If you want to continue to use them, I strongly recommend running the long SMART test on them to verify they are good for right now. If you are running ZFS with 4 drives in a single pool, losing two drives will cause complete loss of data. If this is just the testing phase of building your NAS and you can afford to lose any data you have stored, then that is fine. My NAS was not complete until 8.0.4 came out, and we had been working on the project for over a year. Once it became stable I transferred all my data to the new FreeNAS box. I keep my old NAS up to date because once 8.1.x comes out I will likely be testing it.

I was hoping to find a way to reset the error code, but you are probably better off always being reminded that an error on those drives does exist.

The storm is over here so time to get the BBQ warmed up for some thick pork chops. My daughter (15) is doing the cooking so it's a nice day in spite of the storm.

joeschmuck · Jul 8, 2012

Also, I think the line 'It "wraps" after 49.710 days.' might mean that at ~18071 hours of power-on time (8.12 days from now) the error falls off. It would be nice to know that in 9 days this error message is gone for this drive. The other drive should be cleared in 9 days as well, based on values from a previous posting.
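
The arithmetic behind that estimate can be sketched as below. The 49.710 days is 2^32 milliseconds, and in the smartctl output the note is attached to the Powered_Up_Time stamp, so whether the log entry itself disappears at that point is only a guess. The function names are made up for illustration; the input numbers are the ones posted above for /dev/ada2.

```shell
#!/bin/sh
# 2^32 milliseconds expressed in days.
wrap_days() {
    awk 'BEGIN { printf "%.3f\n", 4294967296 / 1000 / 86400 }'
}

# Power-on hours left until (error hours + one wrap period) is reached.
# $1 = lifetime hours when the error was logged, $2 = current Power_On_Hours.
hours_until_wrap() {
    wrap_hours=$((4294967296 / 1000 / 3600))   # whole hours in one wrap period
    echo $(( $1 + wrap_hours - $2 ))
}

wrap_days                      # prints 49.710
hours_until_wrap 16877 17875   # prints 195, a little over 8 days
```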

So what happened, power outage while the drives were writing, MB failure, something else? Since they happened at the same time I would have to say the drives are probably still good.

Let me know what you think. Also I still recommend the long test either way for peace of mind. I do one once a month on each drive, not at the same time, each is run on a different day. Just a personal preference.
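
A rough sketch of that kind of staggering, assuming one cron job that runs once a day: drive 1 gets its long test on the 1st, drive 2 on the 2nd, and so on. The `drive_for_day` helper is a hypothetical name; `smartctl -t long` is the standard way to start an extended self-test.

```shell
#!/bin/sh
# Hypothetical helper: pick the drive whose turn it is today.
# $1 = day of month (1-31); remaining arguments = drive names in order.
# Prints the Nth drive on day N; returns 1 on days with no drive assigned.
drive_for_day() {
    day=$1
    shift
    i=1
    for d in "$@"; do
        if [ "$i" -eq "$day" ]; then
            echo "$d"
            return 0
        fi
        i=$((i + 1))
    done
    return 1
}

# Example daily call (left commented out so nothing runs by accident):
# day=$(date +%d); day=${day#0}
# drv=$(drive_for_day "$day" ada0 ada1 ada2 ada3) && smartctl -t long "/dev/$drv"
```

The leading zero is stripped from the day so that values such as 08 are compared as plain decimal numbers.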

Wolfeman0101 · Jul 8, 2012

There was a power outage on the 4th of July.

joeschmuck · Jul 24, 2012

@Wolfeman

Did the problem ever resolve itself?

Wolfeman0101 · Jul 24, 2012

I ended up rebuilding my whole volume with 6 2TB drives in RAIDZ2. I took the bad drives out of the mix. But now I have another issue where I set a daily report for ada0-ada5 but I only get emails for 0-3, never 4 & 5.

joeschmuck · Jul 24, 2012

I thought I had responded but maybe I didn't hit send...

Are the last two drives on a separate controller?
Are the last two drives actually named ada4 and ada5?
Does the 'smartctl -a /dev/ada4' command work when you run it?

Just some thoughts is all.
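
A quick way to check the second question for all six drives at once is to look for the device nodes. This is just a sketch; `missing_devices` is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: print every name that has no entry in a directory.
# $1 = directory to look in (normally /dev); remaining arguments = names.
missing_devices() {
    dir=$1
    shift
    for name in "$@"; do
        [ -e "${dir}/${name}" ] || echo "$name"
    done
}
```

Calling it as `missing_devices /dev ada0 ada1 ada2 ada3 ada4 ada5` prints nothing when all six nodes exist.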

Wolfeman0101 · Jul 24, 2012

Yeah they are ada4 and 5, same controller.

Code:

[Bryan@nibbler] /mnt/Vol1# smartctl -a /dev/ada4
smartctl 5.42 2011-10-20 r3458 [FreeBSD 8.2-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (Adv. Format)
Device Model:     WDC WD20EARX-00PASB0
Serial Number:    WD-WMAZA8497834
LU WWN Device Id: 5 0014ee 25c30a920
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Tue Jul 24 14:10:27 2012 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (38460) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   166   166   021    Pre-fail  Always       -       6666
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       9
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       330
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       7
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       6
193 Load_Cycle_Count        0x0032   198   198   000    Old_age   Always       -       7824
194 Temperature_Celsius     0x0022   107   100   000    Old_age   Always       -       43
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       321         -
# 2  Short offline       Completed without error       00%       297         -
# 3  Short offline       Completed without error       00%       273         -
# 4  Short offline       Completed without error       00%       225         -
# 5  Short offline       Completed without error       00%       201         -
# 6  Short offline       Completed without error       00%       177         -
# 7  Short offline       Completed without error       00%       153         -
# 8  Short offline       Completed without error       00%       129         -
# 9  Short offline       Completed without error       00%       106         -
#10  Short offline       Completed without error       00%        87         -
#11  Short offline       Completed without error       00%        58         -
#12  Short offline       Completed without error       00%        36         -
#13  Short offline       Completed without error       00%        34         -
#14  Short offline       Completed without error       00%        18         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

[Bryan@nibbler] /mnt/Vol1#

joeschmuck · Jul 24, 2012

That is odd. I don't know why the script wouldn't work for ada4 & 5. Very odd. It looks like you are testing it every 24 hours but you say you are not seeing the email, correct?

You could run the script manually but first change the last lines in the script to comment out sendmail as in posting #20 above. See what spits out on the screen, if anything.

Wolfeman0101 · Jul 24, 2012

I'm running it, but I left it for about 45 minutes and it never output anything.

joeschmuck · Jul 25, 2012

In this section of the code make this change (for testing purposes only, not a complete fix) and take out the "-n standby" so it looks like this:

Code:

### Process to run our check on the drive, setup exclusively for only "-l error". 
# Output cover0
chkdrive()
{
smartctl -l error -l selftest ${switch1} > /var/cover0${drv}
}

This will basically not wait for the drive to spin up. I suspect there is an issue reading the drive error code, but I'm not sure why since it's a standard routine.

Are the last two drives a different model than the other drives in the system?
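
For anyone digging into the return code: going by the exit-status table in the smartctl man page, the status is a bit mask rather than a plain pass/fail value. Bit 1 (value 2) covers "device open failed or device is in a low-power mode", and bit 6 (value 64) is set when the error log contains entries, so a drive that is awake but has a logged error can also return a non-zero status. The helper names below are made up; treat this as a sketch for experimenting, not as part of the script.

```shell
#!/bin/sh
# True when the given bit value is set in a smartctl exit status.
# $1 = exit status, $2 = bit value (2, 64, ...).
status_has_bit() {
    [ $(( $1 & $2 )) -ne 0 ]
}

# Hypothetical wrapper: describe a status from "smartctl -n standby -l error".
describe_status() {
    if status_has_bit "$1" 2; then
        echo "asleep-or-unopened"
    elif status_has_bit "$1" 64; then
        echo "checked-errors-logged"
    else
        echo "checked"
    fi
}
```

A loop that waits for a status of exactly 0 would treat the second case the same as a sleeping drive.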

Wolfeman0101 · Aug 23, 2012

So I took "-n standby" out and it seems to work for ada4 and ada5 now.

joeschmuck · Aug 23, 2012

Looks like the drive/controller does not recognize when the drive is coming out of standby so you may have to live with this. It's not perfect as I personally would like the script to only run when the drive "needs" to spin up but doing it this way will force a drive spin-up and at least test your drive.
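
A possible middle ground is to poll with "-n standby" a limited number of times and only force the spin-up if the drive never reports awake. The sketch below assumes the low-power case shows up as bit 1 (value 2) of the smartctl exit status; `check_with_fallback` and the `SMARTCTL` override are hypothetical names, and the smartctl options are the same ones used in the chkdrive() change above.

```shell
#!/bin/sh
# Hypothetical helper: poll a drive without waking it, then force a check.
# $1 = device, $2 = output file, $3 = max polls, $4 = seconds between polls.
# SMARTCTL is an assumed override so the sketch can be tried with a stub.
check_with_fallback() {
    dev=$1
    out=$2
    max=$3
    delay=$4
    smart=${SMARTCTL:-smartctl}
    tries=0
    while [ "$tries" -lt "$max" ]; do
        "$smart" -n standby -l error -l selftest "$dev" > "$out"
        # Bit 1 (value 2) clear: the drive was awake and has been checked.
        if [ $(( $? & 2 )) -eq 0 ]; then
            return 0
        fi
        tries=$((tries + 1))
        sleep "$delay"
    done
    # Still asleep after all polls: drop -n standby, which spins the drive up.
    "$smart" -l error -l selftest "$dev" > "$out"
    return 0
}
```

For example, `check_with_fallback /dev/ada4 /var/cover0ada4 60 59` would poll for about an hour before forcing the check.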

madmax · Sep 14, 2012

help with multiple drives code for short and long test.

Here is the more complex script, which brings something extra (it is not explained the way the basic code is in the text below, but you should be able to understand it). With the previous script, if the drive is in standby you get a report that doesn't tell you much because the drive is not running. This script periodically polls the hard drive to see if it's out of standby and then generates the report. It also cleans up the report some and, more importantly, you can run it on all the drives at once (same time period), whereas with the previous script you could only run a CRON job on one drive, wait a minute, and run another CRON job. This is by far the better script of the two.

Code:
#!/usr/local/bin/sh
#
# Place this in /conf/base/etc/
# Call: sh esmart.sh /dev/ada0

# switch1 is the drive to check (passed parameter)
switch1=$1

# This will use the characters after "/dev/" for the temp file names.
# Example: /dev/ada0 becomes coverada0 or cover0ada0 or cover1ada0
# This needs to be done to keep multiple jobs from using the same files.
drv=`echo $switch1 | cut -c6-`

# Variable just so we can add a note that the drive was asleep when the
# application started but is now awake.
c=0

# Process to run our check on the drive
chkdrive()
{
smartctl -H -n standby -l error ${switch1} > /var/cover0${drv}
}

(
echo "To: youremail@address.net"
echo "Subject: SMART Drive Results for ${switch1}"
echo " "
) > /var/cover${drv}

chkdrive
while [ $? != "0" ]
do
# Pause the checking of the drive to about once a minute if the drive is not running.
sleep 59
c=1
chkdrive
done

if [ $c -eq 1 ]
then
echo "THE DRIVE WAS ASLEEP AND JUST WOKE UP" >> /var/cover${drv}
fi

# These lines remove the automatic Branding lines
sed -e '1d' /var/cover0${drv} > /var/cover1${drv}
sed -e '1d' /var/cover1${drv} > /var/cover0${drv}
sed -e '1d' /var/cover0${drv} > /var/cover1${drv}
sed -e '1d' /var/cover1${drv} > /var/cover0${drv}

cat /var/cover0${drv} >> /var/cover${drv}

sendmail -t < /var/cover${drv}

# Cleanup our trash
rm /var/cover${drv}
rm /var/cover0${drv}
rm /var/cover1${drv}

exit 0

# Set idle mode to so it doesn't spin up.
# Options
# -n standby (Remove this to force a spinup)
# -i = Device Info
# -H = Device Health
# -A = Only Vendor specific SMART attributes
# -l error = SMART Error Log

Code:
#!/usr/local/bin/sh
#
# Place this in /conf/base/etc/
# Call: sh esmart.sh

(
echo "To: YourEmail@Address.net"
echo "Subject: SMART Drive Results for all drives"
echo " "
) > /var/cover

smartctl -i -H -A -n standby -l error /dev/ada0 >> /var/cover
smartctl -i -H -A -n standby -l error /dev/ada1 >> /var/cover
smartctl -i -H -A -n standby -l error /dev/ada2 >> /var/cover
smartctl -i -H -A -n standby -l error /dev/ada3 >> /var/cover

sendmail -t < /var/cover

exit 0

# Set idle mode to so it doesn't spin up.
# Options -n standby
# -i = Device Info
# -H = Device Health
# -A = Only Vendor specific SMART attributes
# -l error = SMART Error Log

Again, the '-n standby' will cause an issue at the point where a drive not spinning is encountered. Since I have a single pool of 4 drives, should my first drive exit due to not spinning, I can safely assume my other drives are not spinning either since I have the same HDD Standby (in FreeNAS GUI) settings for each.

-Mark

Is there a difference between the first, complicated script posted at the beginning of that post and the multiple-drive script posted at the end of it? The first one seems to have more to it, so if that's the case, where would I copy the multiple-drive code into the complicated code? The complicated code seems to have wake-up and sleep handling while the multiple-drive code does not. I might be missing something about why the multiple-drive code has less. There was a mention that the drives spin down the same way, so it might not matter, but is that the whole difference?

Also, for the long and short test code that was made, is there a way I can handle multiple drives in one script instead of making a cron job for each drive?

joeschmuck · Sep 15, 2012

The code in posting #12 is the most current version. Its advantage is that it will not wake up the drives just to run a SMART test; it waits until something else wakes them up. The script handles only a single drive, but you can run it multiple times, each time with a different drive, and the copies will sit in the background waiting for their drive to wake up. If the drive is already running, the SMART test starts right away.

The second script you listed is an example of how to run the SMART test on all four drives (in the example) with one script, but if the drives are sleeping, no testing is performed on them because of the -n standby parameter. If you remove -n standby, the drives will spin up immediately and the SMART testing runs as indicated in the script.

If you plan to run either of these scripts, it's a good idea to read up on the smartctl command.

And to answer your question: yes, you can make one script to handle multiple drives. That script gets a bit more complex, but it can be done if you have a clear idea of how you want it done. One way is to write a script that calls my script in posting #12 for each drive you have. This would let you run only one cron job and seems like the simplest way to do it.
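
A minimal sketch of such a wrapper, assuming the per-drive script takes the device path as its only argument (as in "sh esmart.sh /dev/ada0"). The `run_for_drives` name is made up for illustration.

```shell
#!/bin/sh
# Hypothetical wrapper: start one copy of the per-drive script per drive.
# $1 = path to the per-drive script; remaining arguments = drive names.
run_for_drives() {
    script=$1
    shift
    for d in "$@"; do
        # Each copy runs in the background, so a sleeping drive does not
        # hold up the reports for the others.
        sh "$script" "/dev/${d}" &
    done
    # Wait for every background copy before the wrapper itself exits.
    wait
}
```

A single cron job could then call something like `run_for_drives /conf/base/etc/esmart.sh ada0 ada1 ada2 ada3`, with the path adjusted to wherever the per-drive script actually lives.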

madmax · Sep 17, 2012

I can run the script without a problem and get an email and everything, but when I put it in a cron job, the result is an email that says:

/etc: Permission denied

I changed the file permissions to 777, the owner is root and the group is wheel, and of course the cron job runs as root.

Is the file in a bad location?
