Areca RAID controllers and SMART support


cyberjock

So, I just got bit by this "issue" so I figured I'd share.

As many of you may know, I have a custom email script that emails me nightly. I thought I was super cool because I had the script provide me with what I consider the two most important parameters for determining drive health: Temperature (for long-term health) and Current Pending Sector Count (for short-term health). Temperature should be kept below 45C at all times, and below 40C at least 50% of the time (but preferably all of the time). Any non-zero Current Pending Sector Count is usually a situation that should be monitored very closely, as you likely have a drive that is getting ready to fail.
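
The idea is simple enough that a minimal sketch fits in a few lines (this is not my actual script; the drive numbers and mail setup are placeholders you'd adjust):

Code:
#!/bin/sh
# Sketch of a nightly SMART summary pulled through the Areca CLI.
# Assumes 5 drives on the controller and working outbound mail.
(
echo "The following lists the Current Pending Sector Count for all hard drives on the system in order:"
for n in 1 2 3 4 5; do
  areca-cli disk smart drv=$n | grep 'Current Pending Sector Count'
done
echo ""
echo "The following lists the current temperatures for all hard drives on the system in order:"
for n in 1 2 3 4 5; do
  areca-cli disk smart drv=$n | grep 'Temperature'
done
) | mail -s "Nightly SMART report" root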

So my script provides me with something like the following every morning:


The following lists the Current Pending Sector Count for all hard drives on the system in order:

197 Current Pending Sector Count 0x32 200 0 OK
197 Current Pending Sector Count 0x32 200 0 OK
197 Current Pending Sector Count 0x32 200 0 OK
197 Current Pending Sector Count 0x32 200 0 OK
197 Current Pending Sector Count 0x32 200 0 OK

The following lists the current temperatures for all hard drives on the system in order:

194 Temperature 0x22 100 0 OK
194 Temperature 0x22 100 0 OK
194 Temperature 0x22 100 0 OK
194 Temperature 0x22 100 0 OK
194 Temperature 0x22 100 0 OK

I have more than 5 drives, but this proves the point. At a glance I can see that hard drive temps are okay and the current pending sector counts are okay.

But wait. Don't you dare think that current pending sector counts are okay. Check this out... I'll use my newly failed drive to prove my point:

Code:
# areca-cli disk smart drv=12
S.M.A.R.T Information For Drive[#12]
  # Attribute Items                           Flag   Value  Thres  State
===============================================================================
  1 Raw Read Error Rate                       0x2f     200     51  OK
  3 Spin Up Time                              0x27     144     21  OK
  4 Start/Stop Count                          0x32     100      0  OK
  5 Reallocated Sector Count                  0x33     200    140  OK
  7 Seek Error Rate                           0x2e     200      0  OK
  9 Power-on Hours Count                      0x32      86      0  OK
 10 Spin Retry Count                          0x32     100      0  OK
 11 Calibration Retry Count                   0x32     100      0  OK
 12 Device Power Cycle Count                  0x32     100      0  OK
192 Power-off Retract Count                   0x32     200      0  OK
193 Load Cycle Count                          0x32     199      0  OK
194 Temperature                               0x22     112      0  OK
196 Reallocation Event Count                  0x32     200      0  OK
197 Current Pending Sector Count              0x32     200      0  OK
198 Off-line Scan Uncorrectable Sector Count  0x30     200      0  OK
199 Ultra DMA CRC Error Count                 0x32     200      0  OK
===============================================================================


Hmm. Even the "long" printout from the Areca CLI says that Current Pending Sector Count is good. How odd. I do get this information in my email, but as you can see there's no reason to even question how healthy the disk is. Let's use smartctl...


Code:
# smartctl -q noserial -a --device=areca,12 /dev/arcmsr0
smartctl 5.43 2012-06-30 r3573 [FreeBSD 8.3-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (Adv. Format)
Device Model:     WDC WD30EZRX-00MMMB0
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Fri May 10 00:09:41 2013 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x85) Offline data collection activity
                                        was aborted by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (51180) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 492) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   144   144   021    Pre-fail  Always       -       9791
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       778
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   086   086   000    Old_age   Always       -       10533
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       178
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       174
193 Load_Cycle_Count        0x0032   199   199   000    Old_age   Always       -       5967
194 Temperature_Celsius     0x0022   112   104   000    Old_age   Always       -       40
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       19
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     10533         2375802360

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


Notice attribute 197: the raw Current Pending Sector Count is actually 19! I did try to perform an extended offline test, but you cannot perform any SMART tests on drives connected to an Areca controller (or so I thought at the time; more on that below). The only option I had was to swap the drive with one connected to my onboard Intel SATA controller and run the long test there. Naturally it failed. I knew it would fail.

I'd been moving data to the zpool this drive is part of all day at an amazing 6MB/sec average. When I looked at the drive LEDs, the other drives in the zpool would blink for a fraction of a second every 10-15 seconds, but this drive's LED was on solid. Ironically, this was a hardware RAID that had been abandoned for about 120 days, then put back online to retrieve the data (which went flawlessly), then converted to a zpool. Naturally the zpool has had weird performance issues since the moment it was created, but I assumed the cause was not a failing disk because my emails were saying everything was fine. What makes matters worse is if you do this:

Code:
# areca-cli event info
Date-Time            Device           Event Type            Elapsed Time Errors
===============================================================================
2013-05-10 04:57:59  H/W MONITOR      Raid Powered On
2013-05-10 04:36:30  SW API Interface API Log In
2013-05-10 04:21:14  H/W MONITOR      Raid Powered On
2013-05-10 03:39:25  H/W MONITOR      Raid Powered On
2013-05-09 08:04:46  SW API Interface API Log In
2013-05-09 06:09:03  H/W MONITOR      Raid Powered On
2013-05-09 05:45:37  H/W MONITOR      Raid Powered On
2013-05-09 05:31:36  IDE Channel #17  Reading Error
2013-05-09 05:29:42  H/W MONITOR      Raid Powered On
2013-05-08 08:02:46  SW API Interface API Log In
2013-05-08 02:00:49  IDE Channel #11  Reading Error
2013-05-08 01:44:35  IDE Channel #11  Reading Error
2013-05-08 01:44:27  IDE Channel #11  Reading Error
2013-05-08 01:44:20  IDE Channel #11  Reading Error
2013-05-08 01:44:10  IDE Channel #11  Reading Error
2013-05-08 01:44:03  IDE Channel #11  Reading Error
2013-05-08 01:43:56  IDE Channel #11  Reading Error
2013-05-08 01:43:49  IDE Channel #11  Reading Error
2013-05-08 01:43:41  IDE Channel #11  Reading Error
2013-05-08 01:42:57  IDE Channel #11  Reading Error
2013-05-08 01:42:50  IDE Channel #11  Reading Error
2013-05-08 01:42:43  IDE Channel #11  Reading Error
2013-05-08 01:42:35  IDE Channel #11  Reading Error
2013-05-08 01:42:28  IDE Channel #11  Reading Error
2013-05-08 01:42:20  IDE Channel #11  Reading Error
2013-05-08 01:42:07  IDE Channel #11  Reading Error
2013-05-08 01:42:00  IDE Channel #11  Reading Error
2013-05-08 01:41:53  IDE Channel #11  Reading Error
2013-05-08 01:41:46  IDE Channel #11  Reading Error
2013-05-07 17:16:13  SW API Interface API Log In
2013-05-07 15:18:15  H/W MONITOR      Raid Powered On
2013-05-07 15:01:52  H/W MONITOR      Raid Powered On
2013-05-07 13:02:42  IDE Channel #11  Reading Error
2013-05-07 12:59:50  IDE Channel #11  Reading Error
2013-05-07 12:59:37  IDE Channel #11  Reading Error
2013-05-07 12:59:12  IDE Channel #11  Reading Error
2013-05-07 12:58:57  IDE Channel #11  Reading Error
2013-05-07 12:58:50  IDE Channel #11  Reading Error
2013-05-07 12:58:43  IDE Channel #11  Reading Error
2013-05-07 12:57:42  IDE Channel #11  Reading Error
2013-05-07 12:57:28  IDE Channel #11  Reading Error
2013-05-07 12:53:55  IDE Channel #11  Reading Error
2013-05-07 12:53:48  IDE Channel #11  Reading Error
2013-05-07 12:53:21  IDE Channel #11  Reading Error
2013-05-07 12:53:05  IDE Channel #11  Reading Error
2013-05-07 12:52:55  IDE Channel #11  Reading Error
2013-05-07 12:52:29  IDE Channel #11  Reading Error
2013-05-07 12:51:59  IDE Channel #11  Reading Error
2013-05-07 12:51:52  IDE Channel #11  Reading Error
2013-05-07 12:51:23  IDE Channel #11  Reading Error
2013-05-07 12:51:16  IDE Channel #11  Reading Error
2013-05-07 12:51:07  IDE Channel #11  Reading Error
2013-05-07 12:50:59  IDE Channel #11  Reading Error
2013-05-07 12:50:38  IDE Channel #11  Reading Error
2013-05-07 12:50:18  IDE Channel #11  Reading Error
2013-05-07 12:49:56  IDE Channel #11  Reading Error
2013-05-07 12:49:44  IDE Channel #11  Reading Error
2013-05-07 12:49:30  IDE Channel #11  Reading Error
2013-05-07 12:49:19  IDE Channel #11  Reading Error
2013-05-07 12:49:11  IDE Channel #11  Reading Error
2013-05-07 12:49:04  IDE Channel #11  Reading Error
2013-05-07 12:48:37  IDE Channel #11  Reading Error
2013-05-07 12:48:30  IDE Channel #11  Reading Error
2013-05-07 12:48:01  IDE Channel #11  Reading Error
===============================================================================
GuiErrMsg<0x00>: Success.


I had moved the drive around as part of troubleshooting, which is why the same disk shows reading errors on two different channels. I never suspected that the areca-cli would tell me the drive is good when it isn't.


So what's the moral of the story? Areca controllers, even in JBOD mode, are not the best choice. Not only do they not let you run SMART tests, but their CLI lies about hard drive status. I knew it wouldn't let me run SMART tests, so in my 24-bay Norco case I have 4 drives connected to the Intel SATA controller; I figured that if I ever needed to run a SMART test, I need only swap the questionable disk with a disk on the Intel controller. Add to that the fact that for this type of error, a # zpool status shows zero errors on all disks. The only way I would have known about the exact issue was by noticing the LED on almost constantly and actually querying the drive via smartctl rather than the areca-cli. Long live smartctl. Now to resilver with my spare disk and fix my script so I don't get screwed over by this issue again.

Edit: Something else I'd like to add. If I ran a # zpool iostat 1 I'd get non-zero values for writes, then zeros for 3-6 seconds, which is how you'd expect the zpool to behave when everything is working. So don't expect zpool iostat to help you out.

Also, I did a # gstat and watched the latency of the devices. All 6 drives in the zpool (the bad disk plus the 5 good drives) were attached to the Areca controller, and all 6 would show very high latency values on occasion (2000ms or higher), so I knew something was up. My other zpool is also attached to the Areca controller, and whenever its latency spiked while a read or write hit it at the same time, the latency on every disk attached to the Areca controller would be extremely high. So gstat was a way to identify that there was a disk issue, but it was not useful in identifying the exact disk in my situation. It may provide more value (perhaps pinpointing the exact device) on the onboard Intel SATA or an HBA.
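
For anyone wanting to try the same thing, a one-shot gstat invocation like this is what I mean (the device-name filter is an assumption; adjust the regex to match your disks):

Code:
# Batch mode: print one round of per-device stats and exit.
# Watch the ms/r and ms/w columns for latency spikes.
gstat -b -I 5s -f '^da[0-9]+$'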
 

titan_rw

Code:
# areca-cli disk smart drv=12
--cut--
197 Current Pending Sector Count              0x32     200      0  OK
--cut--


Hmm. Even the "long" printout from the Areca CLI says that Current Pending Sector Count is good. How odd. I do get this information in my email, but as you can see there's no reason to even question how healthy the disk is. Let's use smartctl...

This is NOT saying that the pending sector count is 0. It's saying that the threshold "failure point" is 0. The normalized value is 200. The attribute is listed as "OK" because 200 > 0.

This method of listing SMART info is not that great because it doesn't give you access to the raw values. For example, how do you check the drive's temperature? Generally you need the raw value, as it's normally in Celsius. As the raw value (the temperature) rises, the normalized value decreases until it hits the threshold. Unless you know how the normalized value is calculated, it is useless for determining the actual temperature of the drive.

Code:
# smartctl -q noserial -a --device=areca,12 /dev/arcmsr0
--cut--
194 Temperature_Celsius     0x0022   112   104   000    Old_age   Always       -       40
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       19
--cut--

Now this information is useful. The raw values tell us the temperature and the actual number of pending sectors. The other info simply tells us that, according to SMART, the values are in the 'normal' range.

Note the quotes around normal. SMART attributes are outside the 'normal' range when the normalized value hits the threshold; if this happens, the drive will fail SMART. The thing is, the lowest an attribute can EVER get is 1. Look at the thresholds the manufacturer has defined in the case above: thresholds of 0. In other words, the "temperature" and "pending sectors" attributes will NEVER trip SMART. Therefore, any attribute with a threshold of 0 will always be listed as "OK" by software that only looks at thresholds and normalized values. That's why the output of areca-cli lists "OK" by the attribute. It is OK, at least according to how the SMART attributes were set up by WD. This attribute will always be 'OK'.
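
If you want the raw values that the Areca CLI hides, smartctl's attribute table has them in the last column; a quick filter does it (the device name is just an example):

Code:
# smartctl -A columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
smartctl -A /dev/ada0 | awk '$1==194 || $1==197 {print $2, "normalized=" $4, "raw=" $10}'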

I'm not surprised you can't run smart tests on drives connected to a 'hardware' raid controller. At least smartctl supports it enough to get real smart info off it.

I have a Highpoint RocketRAID 3520 8-port card that runs hardware RAID 5 backing an NTFS partition. I also can't run SMART tests on its drives; the only time I have is when I've pulled the drives from the system and hooked them up to another machine. I don't even think smartctl lets me peek at the SMART info the way yours does. My card does let me set up automatic emails on events, though, and I have scheduled automatic RAID 'verifies'. I get emails like this:

Code:
Thu, 02 May 2013 20:06:28 Mountain Daylight Time:    
[hptiop]: 	Successfully repaired bad sector on disk 'ST31000528AS-disk_serial_here' at Controller1-Channel8: LBA 0x43de5a00 sectors 0x80 .


This tells me the controller has detected a bad sector, and forced the drive to remap it.

The card's web interface (http://127.0.0.1:7402) does give me a little more info, like drive temperature, total reallocated sectors, etc. If a drive were developing errors like yours, the card would have dropped it from the array completely, started beeping about a bad drive and a degraded array, and emailed me, of course. Here's a case where it did:

Code:
	2011/8/26 2:40:6	Disk 'ST31000528AS-disk_serial_here' at Controller1-Channel2 failed.
	2011/8/26 2:40:6	An error occured on the disk at 'ST31000528AS-disk_serial_here' at Controller1-Channel2.
	2011/8/26 2:39:58 An error occured on the disk at 'ST31000528AS-disk_serial_here' at Controller1-Channel2.
	2011/8/26 2:39:54	An error occured on the disk at 'ST31000528AS-disk_serial_here' at Controller1-Channel2.
	2011/8/26 2:39:48	An error occured on the disk at 'ST31000528AS-disk_serial_here' at Controller1-Channel2.
	2011/8/26 2:39:44	An error occured on the disk at 'ST31000528AS-disk_serial_here' at Controller1-Channel2.
	2011/8/26 2:39:40	An error occured on the disk at 'ST31000528AS-disk_serial_here' at Controller1-Channel2.
	2011/8/26 2:39:36	An error occured on the disk at 'ST31000528AS-disk_serial_here' at Controller1-Channel2.
	2011/8/26 2:39:30	An error occured on the disk at 'ST31000528AS-disk_serial_here' at Controller1-Channel2.
	2011/8/26 2:39:26	An error occured on the disk at 'ST31000528AS-disk_serial_here' at Controller1-Channel2.
	2011/8/26 2:39:22	An error occured on the disk at 'ST31000528AS-disk_serial_here' at Controller1-Channel2. 
	2011/8/26 2:39:22	An error occured on the disk at 'ST31000528AS-disk_serial_here' at Controller1-Channel2.


Note it's in reverse chronological order. Read from the bottom up.


I think the moral of this story is: don't use hardware RAID controllers with ZFS, but cyberjock already knows that. I personally won't use any controller that doesn't give me FULL smartctl functionality. Not being able to run SMART long tests directly from smartctl is a showstopper to me. Not having access to the raw SMART values is a showstopper too.

It still baffles me that people run hourly short tests on their drives. I don't know why; a drive has to be pretty much bricked before it'll fail a short test. It really does very little, especially if the manufacturer has most of the SMART attribute thresholds set to 0, as is the case with most drives. Note the drive above has a threshold of 0 for temperature too: this drive will never fail SMART because it's too hot. Cyberjock, I'm curious whether that drive would pass a short test. It very well might; the test will only fail if the bad spot happens to be in a location the short test seeks to. The fact that it's failing the long test means you shouldn't have any problems with an RMA, if it's still under warranty of course.

I really think some sort of long-term SMART tracking script is in order. I'd like to implement it, but I don't have the knowledge. It would have to check all the SMART attributes and apply some logic to the values. You can't just check for changing values, as most attributes change over time. You can't just check for slowly changing values, as some change quite quickly without there being cause for alarm. Some attributes you could simply monitor for non-zero raw values, such as pending and reallocated sectors, but there would have to be a way to 'acknowledge' a non-zero value and only raise a flag if it gets any higher. I have some disks with 20-30 reallocated sectors that work just fine, and have had 20-30 reallocated sectors for years. But if they got any worse, at any rate of change, I'd be looking to retire them.
 

cyberjock

This is NOT saying that the pending sector count is 0. It's saying that the threshold "failure point" is 0. The normalized value is 200. The attribute is listed as "OK" because 200 > 0.

This method of listing SMART info is not that great because it doesn't give you access to the raw values. For example, how do you check the drive's temperature? Generally you need the raw value, as it's normally in Celsius. As the raw value (the temperature) rises, the normalized value decreases until it hits the threshold. Unless you know how the normalized value is calculated, it is useless for determining the actual temperature of the drive.

You're right. I was half asleep and running on adrenaline, trying to get the drive swapped out and the resilvering started.

But the temperature is provided. Scroll back up and look at my first code section. The email I get is slightly different from the standard default, though. I simplified it down into 3 sections in my script: one section for temperatures only (in device order), another for Current Pending Sector Count (again, in device order), and the last section is the long output (yet again, in device order). Since "temperature" was correct (it's in Fahrenheit and changes, but whatever), I had no reason to question that Current Pending Sector Count would do something... anything... when the raw value changed on the hard drive. Lo and behold, it doesn't. That's my whole problem and the reason for the post. As far as I can tell, the Current Pending Sector Count being parsed from the RAID controller uses static values, so it'll always appear good. They never change!

Now this information is useful. The raw values tell us the temperature and the actual number of pending sectors. The other info simply tells us that, according to SMART, the values are in the 'normal' range.

Note the quotes around normal. SMART attributes are outside the 'normal' range when the normalized value hits the threshold; if this happens, the drive will fail SMART. The thing is, the lowest an attribute can EVER get is 1. Look at the thresholds the manufacturer has defined in the case above: thresholds of 0. In other words, the "temperature" and "pending sectors" attributes will NEVER trip SMART. Therefore, any attribute with a threshold of 0 will always be listed as "OK" by software that only looks at thresholds and normalized values. That's why the output of areca-cli lists "OK" by the attribute. It is OK, at least according to how the SMART attributes were set up by WD. This attribute will always be 'OK'.

Yeah, the fact that WD used a value of zero is just stupid.

I'm not surprised you can't run smart tests on drives connected to a 'hardware' raid controller. At least smartctl supports it enough to get real smart info off it.

Actually, I'd like to amend what I said about running SMART tests. It appears that you can run SMART tests via smartctl (but not areca-cli). Strangely though, if you start a long test and then view the SMART data on the drive, it doesn't appear that the test is running. But if you try to start another long test within the expected time frame of the previous one (my drive says 446 minutes), it tells you that a test is already in progress. Go figure! If you view the SMART data on the drive that is running the test, it says "No self-tests have been logged. To run self-tests, use: smartctl -t". Well, that's what I did to start the test. I'll see what happens when the disk finishes its long test. Hopefully I'll be able to see the results. I'll definitely post back and let everyone know if the test results are viewable.
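
For reference, the commands in question look like this, using the same --device syntax as the earlier query (substitute your own drive number):

Code:
# Start an extended (long) self-test through the Areca passthrough
smartctl -t long --device=areca,12 /dev/arcmsr0

# Later, check the self-test log for results
smartctl -l selftest --device=areca,12 /dev/arcmsr0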

I have a Highpoint RocketRAID 3520 8-port card that runs hardware RAID 5 backing an NTFS partition. I also can't run SMART tests on its drives; the only time I have is when I've pulled the drives from the system and hooked them up to another machine. I don't even think smartctl lets me peek at the SMART info the way yours does. My card does let me set up automatic emails on events, though, and I have scheduled automatic RAID 'verifies'. I get emails like this:

Code:
Thu, 02 May 2013 20:06:28 Mountain Daylight Time:    
[hptiop]: 	Successfully repaired bad sector on disk 'ST31000528AS-disk_serial_here' at Controller1-Channel8: LBA 0x43de5a00 sectors 0x80 .


This tells me the controller has detected a bad sector, and forced the drive to remap it.

The card's web interface (http://127.0.0.1:7402) does give me a little more info, like drive temperature, total reallocated sectors, etc. If a drive were developing errors like yours, the card would have dropped it from the array completely, started beeping about a bad drive and a degraded array, and emailed me, of course. Here's a case where it did:

Code:
	2011/8/26 2:40:6	Disk 'ST31000528AS-disk_serial_here' at Controller1-Channel2 failed.
	2011/8/26 2:40:6	An error occured on the disk at 'ST31000528AS-disk_serial_here' at Controller1-Channel2.
	2011/8/26 2:39:58 An error occured on the disk at 'ST31000528AS-disk_serial_here' at Controller1-Channel2.
	2011/8/26 2:39:54	An error occured on the disk at 'ST31000528AS-disk_serial_here' at Controller1-Channel2.
	2011/8/26 2:39:48	An error occured on the disk at 'ST31000528AS-disk_serial_here' at Controller1-Channel2.
	2011/8/26 2:39:44	An error occured on the disk at 'ST31000528AS-disk_serial_here' at Controller1-Channel2.
	2011/8/26 2:39:40	An error occured on the disk at 'ST31000528AS-disk_serial_here' at Controller1-Channel2.
	2011/8/26 2:39:36	An error occured on the disk at 'ST31000528AS-disk_serial_here' at Controller1-Channel2.
	2011/8/26 2:39:30	An error occured on the disk at 'ST31000528AS-disk_serial_here' at Controller1-Channel2.
	2011/8/26 2:39:26	An error occured on the disk at 'ST31000528AS-disk_serial_here' at Controller1-Channel2.
	2011/8/26 2:39:22	An error occured on the disk at 'ST31000528AS-disk_serial_here' at Controller1-Channel2. 
	2011/8/26 2:39:22	An error occured on the disk at 'ST31000528AS-disk_serial_here' at Controller1-Channel2.


Note it's in reverse chronological order. Read from the bottom up.

Yeah, I figured SMART tests via the RAID controller would be out of the question too. I did try to use smartctl in the beginning; everything I read said it should work, but when I tried it, it wouldn't. Now it does, so my command-line syntax must have been wrong.

My controller can send emails too, but it can't send to services that use SSL or TLS, so I couldn't set up the controller to auto-email me when there's a problem. Boo! There are posts in the Areca support forum where people have complained about this; Areca said that only their newest controllers support it.

I knew about the potential issues with using a RAID controller, but I was shocked to find that the pending sector count reporting is basically a cheat. Since I knew long ago there might be a day when I wanted to run SMART tests on my hard drives, I have 4 ports on Intel SATA and the other 20 (of 24) ports on my Areca. Just a quick drive swap and I can test any suspect drive.

I think the moral of this story is: don't use hardware RAID controllers with ZFS, but cyberjock already knows that. I personally won't use any controller that doesn't give me FULL smartctl functionality. Not being able to run SMART long tests directly from smartctl is a showstopper to me. Not having access to the raw SMART values is a showstopper too.

It wasn't a showstopper for me, but if I were building a new system for myself or someone else, it would be. Disabled veterans don't have cash to drop on lots of controllers and stuff. :P I tried to sell my other spare 24-port RAID controller for less than half the price of new (with a free BBU), but there were no buyers. I had this 24-port controller already, and I don't have 3 free PCIe slots to run 3 M1015s.

It still baffles me that people run hourly short tests on their drives. I don't know why; a drive has to be pretty much bricked before it'll fail a short test. It really does very little, especially if the manufacturer has most of the SMART attribute thresholds set to 0, as is the case with most drives. Note the drive above has a threshold of 0 for temperature too: this drive will never fail SMART because it's too hot. Cyberjock, I'm curious whether that drive would pass a short test. It very well might; the test will only fail if the bad spot happens to be in a location the short test seeks to. The fact that it's failing the long test means you shouldn't have any problems with an RMA, if it's still under warranty of course.

I did run a short test. It passes.. haha. I just didn't mention it when I made my post because I ran the test afterwards. As I told Stephen in another thread (he confused the short test with the SMART query and said he does short tests every 30 mins), short tests are useless. If a hard drive is broken enough to fail a short test, it probably isn't even being detected by the SATA controller it's connected to, so you'd already know it's bad.

I really think some sort of long-term SMART tracking script is in order. I'd like to implement it, but I don't have the knowledge. It would have to check all the SMART attributes and apply some logic to the values. You can't just check for changing values, as most attributes change over time. You can't just check for slowly changing values, as some change quite quickly without there being cause for alarm. Some attributes you could simply monitor for non-zero raw values, such as pending and reallocated sectors, but there would have to be a way to 'acknowledge' a non-zero value and only raise a flag if it gets any higher. I have some disks with 20-30 reallocated sectors that work just fine, and have had 20-30 reallocated sectors for years. But if they got any worse, at any rate of change, I'd be looking to retire them.

I agree. I use temps and Current Pending Sector Count because every drive I've had fail that was still detected by its SATA controller always had a non-zero Current Pending Sector Count. A script that had some brains would be nice. Unfortunately, that's not for me to do :(
 

budmannxx

This is NOT saying that the pending sector count is 0. It's saying that the threshold "failure point" is 0. The normalized value is 200. The attribute is listed as "OK" because 200 > 0.

But the temperature is provided. Scroll back up and look at my first code section.
...
As far as I can tell, the Current Pending Sector Count being parsed from the RAID controller uses static values, so it'll always appear good. They never change!

@titan_rw, massively informative post, thank you.

@cyberjock, if I'm interpreting what titan said correctly, I don't think your email is showing you the drive temps either. That VALUE=112 in your first code section is the normalized value of the actual temp. In your second code section, the smartctl output, the raw temp is 40C, which is 104F, not 112F. My guess is that your bad drive's Current Pending Sector count hasn't changed recently, so the normalized "200" value hasn't changed. But the drive temp obviously changes all the time, so the normalized value changes too (e.g. it's 100 in the email sample at the top of your post, and 112 in the smartctl output).

Very easy to confuse once you start getting Fahrenheit into your thinking, since 40C-45C (about the max HDDs should be allowed to get, as you've thankfully posted a million times in the forum) is 104F-113F (F = C x 9/5 + 32), and the normalized values when the drive is around the low-to-mid 40s Celsius apparently sit around 100-112. Thanks, HDD manufacturers! [/sarcasm]

Hopefully what doesn't get lost in this thread is that you've found yet another real-world example showing why using a hardware RAID controller may be a bad idea.

A script that had some brains would be nice. Unfortunately, that's not for me to do :(
Unfortunately I'm far too underqualified to tackle this as well.
 

titan_rw

@titan_rw, massively informative post, thank you.

@cyberjock, if I'm interpreting what titan said correctly, I don't think your email is showing you the drive temps either. That VALUE=112 in your first code section is the normalized value of the actual temp. In your second code section, the smartctl output, the raw temp is 40C, which is 104F, not 112F. My guess is that your bad drive's Current Pending Sector count hasn't changed recently, so the normalized "200" value hasn't changed. But the drive temp obviously changes all the time, so the normalized value changes too (e.g. it's 100 in the email sample at the top of your post, and 112 in the smartctl output).

You're right. I was half asleep and running on adrenaline, trying to get the drive swapped out and the resilvering started.

But the temperature is provided. Scroll back up and look at my first code section. The email I get is slightly different from the standard default, though. I simplified it down into 3 sections in my script: one section for temperatures only (in device order), another for Current Pending Sector Count (again, in device order), and the last section is the long output (yet again, in device order). Since "temperature" was correct (it's in Fahrenheit and changes, but whatever)


If this:

Code:
194 Temperature 0x22 100 0 OK


is the most 'verbose' the output gets, then budmannxx is correct. The raw temperature is NOT reported here, just as the raw value for reallocated sectors is not reported. This tells you the 'normalized' value is 100. Unless you know the exact formula the manufacturer used to convert raw temperature into the normalized value, the normalized value is pretty useless. As the actual temperature goes up, the normalized value will go down. But by how much, and how fast, only WD knows, unless you did extensive 'calibration tests' to reverse-engineer it. It could also be that the normalized value is too 'coarse' for what we need, i.e. it might take a 5-10 degree change for a single-point drop in the normalized value. Who knows.
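
Just to illustrate how opaque this is: if WD hypothetically computed the value as 152 minus the Celsius temperature, then cyberjock's smartctl output earlier (VALUE 112, WORST 104, raw 40) would correspond to 40C now and a worst-ever of 48C, which happens to fit. But nothing guarantees that formula, or even that the mapping is linear, which is exactly the problem.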

I had no reason to question that Current Pending Sector Count would do something... anything... when the raw value changed on the hard drive. Lo and behold, it doesn't. That's my whole problem and the reason for the post. As far as I can tell, the Current Pending Sector Count being parsed from the RAID controller uses static values, so it'll always appear good. They never change!

Ahhh. But it does, and will, change. The values are not static. As with the temperature above, as sectors get reallocated, the normalized value you're seeing will drop. However, how fast, and how many sectors need to be reallocated before you see a drop, is set by the manufacturer.

Honestly, I think I read your thread on the script that you use(d). I saw that it was only reporting the 'normalized' value and thought to myself, "whatever, wouldn't be the info I want, but it's not my script". Sometimes, depending on the RAID controller, that's all you have. With my Highpoint card I have no smartctl access at all: its web interface gives me a 'real' temperature, but all the rest of the SMART attributes only report 'normalized' values. Not too useful.


I actually have a perfect example of a failed drive as well. This is a 1.5 TB WD Green, well outside of warranty. There are tons of reallocated sectors. The drive still functions, but is extremely slow; I can only write to it at about 2MB/sec.

[Screenshot: CrystalDiskInfo SMART readout for the failed 1.5 TB WD Green]


I don't have the drive accessible via smartctl, so the output of CrystalDiskInfo on Windows will have to suffice.

As you can see, "Reallocation Event Count" has a 'normalized' value of 1. That is as low as it will ever get, so it has hit rock bottom, per se. The threshold is 0, so this attribute is "OK". The raw value, though, actually shows 1156 reallocation events. Far, far from 'OK'.

Pending sectors is 0, so (in theory at least), all the bad sectors have been found and reallocated.

"reallocated sectors count" has a normalized value of 41. WD actually set a threshold on this one, of 140. Since 41 <= 140, smart has been tripped on this attribute. Again, raw values show us 1265 bad sectors.

Also, now that I'm looking at it, check out "Write Error Rate". This is a case where the raw value doesn't help much unless you know what's being reported; I have another thread that explains the raw values of some Seagate drives, and they can get pretty convoluted. However, look at the normalized value and the worst value. Normalized is 195, which is good, but the worst value is 1, i.e. rock bottom. Again, since there's no threshold set, this doesn't set off any alarms. If WD had set a proper threshold, this attribute would be listed as "failed previously" or something like that.


I need to do more learning on perl and/or bash. It can't be that hard to write something that checks certain attributes and, if they differ from or exceed values defined in a 'config' file, does something; see the sketch below. That way, if reallocated sectors goes from 0 to 1, you'd get an alert. You'd then change the config to tell it "1 is OK, alert me if it's higher than 1 from now on". CrystalDiskInfo pretty much does this: you set certain numbers that are 'OK' for reallocated sectors, etc. If the raw value exceeds those numbers, you get a yellow 'warning' box instead of a green 'OK'. If the attribute trips SMART, as on my drive, you get a red 'bad' box.
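
Something like this rough sketch is all it would take for a single attribute (the paths, device, and the choice of attribute 197 are all placeholders; a real tool would loop over drives and attributes):

Code:
#!/bin/sh
# Sketch of the 'acknowledged baseline' idea, not a finished tool.
# /root/smart_baseline holds the acknowledged raw value, e.g. one line: 0
DEV=/dev/ada0
BASELINE=/root/smart_baseline
ACK=$(cat "$BASELINE" 2>/dev/null || echo 0)
# Pull the raw value of attribute 197 (Current_Pending_Sector)
CUR=$(smartctl -A "$DEV" | awk '$1==197 {print $10}')
if [ "$CUR" -gt "$ACK" ]; then
  echo "$DEV: pending sectors rose from $ACK to $CUR" | mail -s "SMART alert" root
fi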
 

cyberjock

Ok, so I ran the smartctl and areca-cli commands. The areca-cli returns a "Value" of 113 and a "Thres"(hold) of 0. smartctl returns a "value" of 113, but a "raw_value" of 39. I thought the 113 was Fahrenheit before, but it turns out that's wrong (bad, bad assumption on my part). My raw value of 39 is Celsius, which converted to Fahrenheit is 102F. So yeah, more BS, and I had it wrong. Thanks for correcting me. I probably would have never noticed that specific problem, but I've definitely determined that the areca-cli is useless for obtaining SMART data. I incorrectly assumed that the "value" from the areca-cli, since it changes throughout the day, was the temp in Fahrenheit; I knew the drives weren't 113C. :P Thanks budmannxx for clarifying that for me. So much is going on with this that now I'm questioning everything and wondering what else I have wrong. :P

As for the Current Pending Sector Count, all of my drives are WD Greens (a mix of 2 and 3 TB drives) and all have a value of 200 per areca-cli. Even the bad drive is 200. Not sure what will trigger that to move, but I was looking for any change on those lines, since all 24 of my Green drives have a value of 200 and a threshold of 0. I thought that if I had one bad sector the 200 value would change to 199, and so on. It turns out that that is a bad assumption to make (as you've pointed out, titan_rw), and I've figured it out... after the fact. LOL. I thought the "value" for each parameter was interpreted by the Areca controller (their engineers' English isn't the best), so I assumed it was just an error in translation. Turns out they had it right and I had it wrong by making that assumption.

As a sidenote, the spare controllers I had to choose from when building this system (it was a Windows Server until December) were a Highpoint RocketRAID 3560 (24-port SATA) or a RocketRAID 4520 plus a SAS expander I have, for 24 ports. As you've pointed out, titan_rw, the Highpoint controllers have no SMART support at all except through their CLI, which doesn't work on FreeNAS. I have a ticket in to have it added, but it's not a high priority right now (hopefully with 9.1 it will just work if we add it ourselves).

Lastly, the hard drive should have completed the long test I started earlier with smartctl -t long. It has no log entry for the long test, so either I interrupted it with a server reboot earlier (I did the math and it shouldn't have been interrupted) or it didn't actually run the long test at all. Remember that it didn't say a long test was in progress when I viewed the drive's status after starting the test, yet when I attempted to run another long test a minute later it said I had to wait for the current test to finish. Clearly more testing is in order to figure out whether the long tests actually run. I did do a short test, and it does have a log entry in the hard drive's SMART test results.
 

titan_rw

I've had drives that never report a test 'in progress'. Once the test finished, though, it showed up in the normal area as "test complete, no errors".

Looks like you picked the best-suited RAID controller of the ones you have. I'd say the important part is that you can access the SMART attributes and their raw values, and with smartctl it looks like you can. Being able to run SMART tests live is nice, but considering you already had the RAID card and it's doing what you need (for the most part), I think you're fine.

I'm not sure how many pending sectors it takes to get the normalized value to change. That's up to the manufacturer and how they implement SMART on their end. That's one of the problems with SMART: a lack of standards within the 'standard'.
 

cyberjock

I'm not sure how many pending sectors it takes to get the normalized value to change. That's up to the manufacturer and how they implement SMART on their end. That's one of the problems with SMART: a lack of standards within the 'standard'.

That's what's really bothering me. I thought that 1 pending sector would decrement the value by 1. Turns out it's not a step of 1, but definitely something above 36 (that's what the raw count is up to now). The zpool is happily resilvering (8 hours to go). This whole thing is just a major disappointment. I thought I had this all thought out and understood, and had processes in place so I'd know when something wasn't right. Turns out that wasn't true. I do monitor my system pretty closely anyway, but I always felt that I didn't need to. Now I'm going to be watching those emails much more closely, and I think I have an improved email.
 

cyberjock

And... FreeNAS ticket 2175 (add SMART support for Areca controllers).

After 3 hours I got source code working for true Areca support in FreeNAS. Hopefully this will help prevent someone else from repeating my issues with this controller. So now we Areca users can rejoice ;)
 

titan_rw

I use temps and Current Pending Sector Count because every drive I've had fail that was still detected by its SATA controller always had a non-zero Current Pending Sector Count. A script that had some brains would be nice. Unfortunately, that's not for me to do :(

Dooohhh..

There's no need for any kind of script like the one I / you were thinking of. smartd will do everything I want. It already emails about current pending and offline uncorrectable sectors, and anything else you want it to do can be set up. Checking out

http://smartmontools.sourceforge.net/man/smartd.conf.5.html

it seems you can do pretty much anything with it.

-l selftest will warn about any smart self tests that have failed.

-l error or -l xerror will warn about any errors logged in the smart error log.

-C and -U seem to be default as they're for current pending, and offline uncorrectable.

-f -p -u -t -i -I -r -R can be used in a variety of ways for monitoring other variables, either their normalized value, or raw values.

Just for a start I'm going to try "-R 5! -R 10! -R 187! -R 197! -R 198! -R 199!". These are attributes where I want to know about any change in the raw value. For attributes like 1 and 7, something like "-r 1! -r 7!" would be better, as the raw value is less meaningful (health-status-wise) than changes in the normalized value.

Or one could use "-t" to monitor changes in normalized values of ALL attributes, using "-I" to ignore attributes such as temperature, or power on hours that are expected to change. Then specifically add -R alerts for raw values as well.

So many options. Never knew it would do all that.

Hmm. Reading through it, it looks like "-a" is implied if no other options are stated in the config. The FreeNAS smartd.conf file doesn't state any specific options, so if we go adding our own, I assume the 'automatic' -a won't apply. So I guess we should start any custom options with -a to explicitly include all the 'standard' options in addition to our own. In other words, no options implies "-a", and -a is synonymous with "-H -f -t -l error -l selftest -l selfteststs -C 197 -U 198".
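
So a per-drive entry would end up looking something like this (the device name is a placeholder):

Code:
# smartd.conf: keep the '-a' defaults, plus critical warnings on raw-value
# changes for reallocated, spin retry, pending, uncorrectable, and CRC counts
/dev/ada0 -a -R 5! -R 10! -R 197! -R 198! -R 199! -m root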


Playing around with my settings now.
 

titan_rw

Cool, it even works. I set a "-R 9!" on one of my drives; that's 'alert on raw value change' for attribute 9, Power On Hours. Just got this email:


Device: /dev/ada2, SMART Usage Attribute: 9 Power_On_Hours changed from 95 [Raw 4835] to 95 [Raw 4836]


Now to disable the test and set up what I want.
 

cyberjock

There are some hard drives that will constantly cause warnings because the parameters are always changing. Seagate's Raw Read Error Rate (attribute 1, though I might have that wrong) does weird things, and you'd get a warning every single time the drive is queried. Not what I'd like to see every 30 mins.
 

titan_rw

Yea, if the normalized values are in constant flux, you can't set up a trigger event on them. By default, however, you only get one alert per drive per 'trigger'. You can change that to one per day, or one per check interval (30 min by default), if you'd like.

If something is changing that frequently, there's probably little need to monitor the normalized value anyway. All you need to know is that there's no SMART trip on that attribute, and smartd checks for failed SMART attributes by default, so you don't need to do anything extra to be notified in that case.

And if that particular attribute has a threshold of 0, like we've seen on a lot of smart attributes, then there's no way to monitor it through automation.
 

budmannxx

Dooohhh..

There's no need for any kind of script like the one I / you were thinking of. smartd will do everything I want. It already emails about current pending and offline uncorrectable sectors, and anything else you want it to do can be set up. Checking out

http://smartmontools.sourceforge.net/man/smartd.conf.5.html

it seems you can do pretty much anything with it.

This is awesome. I never thought to look for where FreeNAS stores the options entered in Services>SMART.

There are some hard drives that will constantly cause warnings because the parameters are always changing. Seagate's Raw Read Error Rate (attribute 1, though I might have that wrong) does weird things, and you'd get a warning every single time the drive is queried. Not what I'd like to see every 30 mins.

Yeah, for most of the attributes, getting an email whenever they change would be insane. But for the 2 you seem to care about, Current_Pending_Sector and Temperature_Celsius, there are specific flags for monitoring them: -C for current pending sectors (with a '+' suffix if you only want increases since the last check) and -W DIFF for temperature. The -i and -I flags let you ignore whatever SMART attributes you don't care about, to keep email clutter to a minimum.
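
Putting that together, the extra options might look something like this (the temperature numbers are just examples):

Code:
# Warn when pending sectors increase, track temperature changes of 4C or more
# (informational at 40C, critical at 45C), and skip 194 in normal tracking
-C 197+ -W 4,40,45 -I 194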

For anyone else finding this thread: you put the additional smartd parameters in the "S.M.A.R.T. extra options" box found at Storage > Volumes > View Disks > Edit. You can just use the GUI for the -n, -m, and -W flags. To verify your changes you can:
Code:
cat /usr/local/etc/smartd.conf

To test your saved flags, you can reload the smartd.conf with:
Code:
smartd -c /usr/local/etc/smartd.conf


It would be cool if the GUI supported setting the flags once for multiple drives, as appears to be supported by smartd.conf itself. From the man page:

[NEW EXPERIMENTAL SMARTD FEATURE] If an entry in the configuration file starts with DEFAULT instead of a device name, then all directives in this entry are set as defaults for the next device entries.

This configuration:

DEFAULT -a -R5! -W 2,40,45 -I 194 -s L/../../7/00 -m admin@example.com
/dev/sda
/dev/sdb
/dev/sdc
DEFAULT -H -m admin@example.com
/dev/sdd
/dev/sde -d removable

has the same effect as:

/dev/sda -a -R5! -W 2,40,45 -I 194 -s L/../../7/00 -m admin@example.com
/dev/sdb -a -R5! -W 2,40,45 -I 194 -s L/../../7/00 -m admin@example.com
/dev/sdc -a -R5! -W 2,40,45 -I 194 -s L/../../7/00 -m admin@example.com
/dev/sdd -H -m admin@example.com
/dev/sde -d removable -H -m admin@example.com

Maybe I'll submit this as a feature request.


And if that particular attribute has a threshold of 0, like we've seen on a lot of smart attributes, then there's no way to monitor it through automation.

I could be misreading the man page, but wouldn't the -l (lowercase "L") flag let you monitor individual attributes, even if the threshold hasn't been crossed? You could also use -r if you wanted to see the raw values alongside the normalized ones.
 

titan_rw

For anyone else finding this thread: you put the additional smartd parameters in the "S.M.A.R.T. extra options" box found at Storage > Volumes > View Disks > Edit. You can just use the GUI for the -n, -m, and -W flags. To verify your changes you can:
Code:
cat /usr/local/etc/smartd.conf

To test your saved flags, you can reload the smartd.conf with:
Code:
smartd -c /usr/local/etc/smartd.conf

There's no need to call smartd manually. Every time you update the additional SMART parameters in the disk section, the system reloads smartd.

Also, specifying -W manually is not needed. The system specifies this for you, set globally for the entire system; change the settings in Services > S.M.A.R.T. You may be able to override a drive's warning temperature by specifying -W manually, but I personally wouldn't need to. A hot drive is a hot drive. Some of my drives, due to where they are in the case, always run a little cooler than others, but the temperature at which I'd consider them "too hot" is the same as for the other drives. They'll just take longer to get there.

It would be cool if the GUI supported setting the flags once for multiple drives, as appears to be supported by smartd.conf itself. From the man page:

Maybe I'll submit this as a feature request.

I really don't see this as needed. smartd.conf is machine-generated, so who cares if it's a little 'wordy' or contains duplicate info?

What we would need is an automated way to apply the 'additional SMART parameters' setting to multiple drives. I believe there was already a feature request for this. Setting the same string of additional parameters on 20+ drives would be tedious. As a workaround, one could use the sqlite3 command and modify the config database directly if you had lots of drives to change.
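
Purely as an illustration of that workaround (the database path, table, and column names are assumptions from memory; verify the schema on your own box and back up the config database first):

Code:
# Hypothetical one-liner; check the schema before running anything like this
sqlite3 /data/freenas-v1.db "UPDATE storage_disk SET disk_smartoptions = '-a -R 197! -R 198!';"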


I could be misreading the man page, but wouldn't the -l (lowercase "L") flag let you monitor individual attributes, even if the threshold hasn't been crossed? You could also use -r if you wanted to see the raw values alongside the normalized ones.

You're misquoting me by omission. The discussion cyberjock and I were having was about an attribute with a raw value like Seagate's "Raw Read Error Rate". That one is bit-encoded, so monitoring raw changes won't help you; the value will always be changing. The normalized value may swing quite widely as well, so you can't simply monitor for changes in it either. The only option is to monitor for if/when the attribute exceeds the threshold value and therefore trips SMART. If the manufacturer has not set a threshold, i.e. it's 0, then you can't monitor that attribute through automation: raw value always changing, normalized value always changing, no SMART threshold set for monitoring. What is left to monitor?


I just had a valid email regarding my additional SMART parameters. This is a 'good' email, as it shows an improvement in the drive's health:

Code:
Device: /dev/ada0, SMART Usage Attribute: 197 Current_Pending_Sector changed from 100 [Raw 1] to 100 [Raw 0]


I knew there was a pending sector, because every time smartd started it would email me that the raw value was non-zero. But now I know that there are no longer any pending sectors, and that the lack of emails is not due to smartd being set to 'one email per start instance', but due to the problem no longer existing.

Setting a bunch of additional parameters to 'track' won't always result in emails that are 'bad' or need action taken. It's up to the admin of the box to determine what action, if any, needs to be taken. But I appreciate the extra information on my drives.
 

cyberjock

I know. After I get my other ticket resolved, I'll be testing that code. I'm kind of at a standstill on that server until this encryption thing is resolved. That issue is actually kind of scary to me, because I don't even know how to work around it behind the GUI's back. And as of right now my zpool can no longer be recovered with any redundancy using the recovery key. Yikes!

I might open a thread on it in the forum and see if anyone has ideas.
 