So, I just got bit by this "issue" so I figured I'd share.
As many of you may know, I have a custom email script that emails me nightly. I thought I was super cool because I had the script report what I consider the two most important parameters for determining drive health: temperature (for long-term health) and Current Pending Sector Count (for short-term health). Temperature should be kept below 45C at all times, and below 40C at least 50% of the time (but preferably all of the time). Any non-zero Current Pending Sector Count is a situation that should be monitored very closely, as you likely have a drive that is getting ready to fail.
So my script provides me with something like the following every morning:
The following lists the Current Pending Sector Count for all hard drives on the system in order:
197 Current Pending Sector Count 0x32 200 0 OK
197 Current Pending Sector Count 0x32 200 0 OK
197 Current Pending Sector Count 0x32 200 0 OK
197 Current Pending Sector Count 0x32 200 0 OK
197 Current Pending Sector Count 0x32 200 0 OK
The following lists the current temperatures for all hard drives on the system in order:
194 Temperature 0x22 100 0 OK
194 Temperature 0x22 100 0 OK
194 Temperature 0x22 100 0 OK
194 Temperature 0x22 100 0 OK
194 Temperature 0x22 100 0 OK
I have more than 5 drives, but this proves the point. At a glance I can see that hard drive temps are okay and the current pending sector counts are okay.
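For context, the report above boils down to scraping areca-cli "disk smart" output for those two attributes. A minimal sketch of the idea (drive numbers here are placeholders, and the real script adds the mailing around it):
Code:
#!/bin/sh
# Sketch of the nightly report: scrape areca-cli "disk smart" output for
# attributes 197 and 194. Drive numbers are placeholders for this example.
DRIVES="1 2 3 4 5"

echo "The following lists the Current Pending Sector Count for all hard drives on the system in order:"
for drv in $DRIVES; do
    areca-cli disk smart drv=${drv} | grep "Current Pending Sector Count"
done

echo "The following lists the current temperatures for all hard drives on the system in order:"
for drv in $DRIVES; do
    areca-cli disk smart drv=${drv} | grep "Temperature"
done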
But wait. Don't you dare think that current pending sector counts are okay. Check this out... I'll use my newly failed drive to prove my point:
Code:
# areca-cli disk smart drv=12
S.M.A.R.T Information For Drive[#12]
  # Attribute Items                           Flag  Value Thres State
===============================================================================
  1 Raw Read Error Rate                       0x2f  200   51    OK
  3 Spin Up Time                              0x27  144   21    OK
  4 Start/Stop Count                          0x32  100   0     OK
  5 Reallocated Sector Count                  0x33  200   140   OK
  7 Seek Error Rate                           0x2e  200   0     OK
  9 Power-on Hours Count                      0x32  86    0     OK
 10 Spin Retry Count                          0x32  100   0     OK
 11 Calibration Retry Count                   0x32  100   0     OK
 12 Device Power Cycle Count                  0x32  100   0     OK
192 Power-off Retract Count                   0x32  200   0     OK
193 Load Cycle Count                          0x32  199   0     OK
194 Temperature                               0x22  112   0     OK
196 Reallocation Event Count                  0x32  200   0     OK
197 Current Pending Sector Count              0x32  200   0     OK
198 Off-line Scan Uncorrectable Sector Count  0x30  200   0     OK
199 Ultra DMA CRC Error Count                 0x32  200   0     OK
===============================================================================
Hmm. Even the "long" printout from the Areca CLI says that Current Pending Sector Count is good. How odd. I do get this information in my email, but as you can see there's no reason to even question how healthy the disk is. Let's use smartctl...
Code:
# smartctl -q noserial -a --device=areca,12 /dev/arcmsr0
smartctl 5.43 2012-06-30 r3573 [FreeBSD 8.3-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (Adv. Format)
Device Model:     WDC WD30EZRX-00MMMB0
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Fri May 10 00:09:41 2013 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status:  (0x85) Offline data collection activity
                                        was aborted by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (51180) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 492) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   144   144   021    Pre-fail  Always       -       9791
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       778
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   086   086   000    Old_age   Always       -       10533
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       178
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       174
193 Load_Cycle_Count        0x0032   199   199   000    Old_age   Always       -       5967
194 Temperature_Celsius     0x0022   112   104   000    Old_age   Always       -       40
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       19
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     10533         2375802360
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Notice attribute 197: the Current Pending Sector Count raw value is actually 19! I did try to perform an extended offline test, but you cannot perform any SMART tests on drives connected to an Areca controller. The only option I had was to swap the drive with a drive connected to my Intel onboard SATA and run the long test. Naturally it failed. I knew it would fail. I had been moving data to the zpool this drive is part of all day at an amazing 6MB/sec average. When I looked at the drive LEDs, the other drives in the zpool would blink every 10-15 seconds for a fraction of a second, but this drive's LED was on solid. Ironically, this was a hardware RAID that had been abandoned for about 120 days, then put back online to retrieve the data (which went flawlessly), then converted to a zpool. Naturally the zpool has had weird performance issues since the moment it was created, but I assumed the issue was not a failing disk because my emails were saying everything was fine.
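For reference, the swap-and-test step itself is trivial once the drive hangs off a controller that passes SMART commands through. Something like this (the ada4 device name is only an example; check dmesg for the real one):
Code:
# smartctl -t long /dev/ada4
(wait out the ~8 hour estimate from "Extended self-test routine recommended polling time", then)
# smartctl -l selftest /dev/ada4
The result is the "Completed: read failure" entry you can see in the self-test log above.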
What makes matters worse is if you do this:
Code:
# areca-cli event info
Date-Time            Device            Event Type       Elapsed Time Errors
===============================================================================
2013-05-10 04:57:59  H/W MONITOR       Raid Powered On
2013-05-10 04:36:30  SW API Interface  API Log In
2013-05-10 04:21:14  H/W MONITOR       Raid Powered On
2013-05-10 03:39:25  H/W MONITOR       Raid Powered On
2013-05-09 08:04:46  SW API Interface  API Log In
2013-05-09 06:09:03  H/W MONITOR       Raid Powered On
2013-05-09 05:45:37  H/W MONITOR       Raid Powered On
2013-05-09 05:31:36  IDE Channel #17   Reading Error
2013-05-09 05:29:42  H/W MONITOR       Raid Powered On
2013-05-08 08:02:46  SW API Interface  API Log In
2013-05-08 02:00:49  IDE Channel #11   Reading Error
2013-05-08 01:44:35  IDE Channel #11   Reading Error
2013-05-08 01:44:27  IDE Channel #11   Reading Error
2013-05-08 01:44:20  IDE Channel #11   Reading Error
2013-05-08 01:44:10  IDE Channel #11   Reading Error
2013-05-08 01:44:03  IDE Channel #11   Reading Error
2013-05-08 01:43:56  IDE Channel #11   Reading Error
2013-05-08 01:43:49  IDE Channel #11   Reading Error
2013-05-08 01:43:41  IDE Channel #11   Reading Error
2013-05-08 01:42:57  IDE Channel #11   Reading Error
2013-05-08 01:42:50  IDE Channel #11   Reading Error
2013-05-08 01:42:43  IDE Channel #11   Reading Error
2013-05-08 01:42:35  IDE Channel #11   Reading Error
2013-05-08 01:42:28  IDE Channel #11   Reading Error
2013-05-08 01:42:20  IDE Channel #11   Reading Error
2013-05-08 01:42:07  IDE Channel #11   Reading Error
2013-05-08 01:42:00  IDE Channel #11   Reading Error
2013-05-08 01:41:53  IDE Channel #11   Reading Error
2013-05-08 01:41:46  IDE Channel #11   Reading Error
2013-05-07 17:16:13  SW API Interface  API Log In
2013-05-07 15:18:15  H/W MONITOR       Raid Powered On
2013-05-07 15:01:52  H/W MONITOR       Raid Powered On
2013-05-07 13:02:42  IDE Channel #11   Reading Error
2013-05-07 12:59:50  IDE Channel #11   Reading Error
2013-05-07 12:59:37  IDE Channel #11   Reading Error
2013-05-07 12:59:12  IDE Channel #11   Reading Error
2013-05-07 12:58:57  IDE Channel #11   Reading Error
2013-05-07 12:58:50  IDE Channel #11   Reading Error
2013-05-07 12:58:43  IDE Channel #11   Reading Error
2013-05-07 12:57:42  IDE Channel #11   Reading Error
2013-05-07 12:57:28  IDE Channel #11   Reading Error
2013-05-07 12:53:55  IDE Channel #11   Reading Error
2013-05-07 12:53:48  IDE Channel #11   Reading Error
2013-05-07 12:53:21  IDE Channel #11   Reading Error
2013-05-07 12:53:05  IDE Channel #11   Reading Error
2013-05-07 12:52:55  IDE Channel #11   Reading Error
2013-05-07 12:52:29  IDE Channel #11   Reading Error
2013-05-07 12:51:59  IDE Channel #11   Reading Error
2013-05-07 12:51:52  IDE Channel #11   Reading Error
2013-05-07 12:51:23  IDE Channel #11   Reading Error
2013-05-07 12:51:16  IDE Channel #11   Reading Error
2013-05-07 12:51:07  IDE Channel #11   Reading Error
2013-05-07 12:50:59  IDE Channel #11   Reading Error
2013-05-07 12:50:38  IDE Channel #11   Reading Error
2013-05-07 12:50:18  IDE Channel #11   Reading Error
2013-05-07 12:49:56  IDE Channel #11   Reading Error
2013-05-07 12:49:44  IDE Channel #11   Reading Error
2013-05-07 12:49:30  IDE Channel #11   Reading Error
2013-05-07 12:49:19  IDE Channel #11   Reading Error
2013-05-07 12:49:11  IDE Channel #11   Reading Error
2013-05-07 12:49:04  IDE Channel #11   Reading Error
2013-05-07 12:48:37  IDE Channel #11   Reading Error
2013-05-07 12:48:30  IDE Channel #11   Reading Error
2013-05-07 12:48:01  IDE Channel #11   Reading Error
===============================================================================
GuiErrMsg<0x00>: Success.
I had moved the drive around as part of troubleshooting, which is why the reading errors show up on both channel #11 and channel #17. I never suspected that the areca-cli would tell me the drive is good when it isn't.
So what's the moral of the story? Areca controllers, even in JBOD mode, are not the best choice. Not only do they not let you run SMART tests, but their CLI lies about the hard drive status. I knew it wouldn't let you run SMART tests, so in my 24-port Norco case I have 4 drives that connect to the Intel SATA controller. I thought that if I ever needed to run a SMART test, I need only swap the questionable disk with a disk on the Intel SATA controller. Add to that the fact that for this type of error a # zpool status shows zero errors on all disks. The only way I would have known about the exact issue was by noticing the LED light on almost constantly and actually querying the drive via smartctl and not the areca-cli. Long live smartctl. Now to resilver with my spare disk and fix my script so I don't get screwed over by this issue again.
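The fix I have in mind is to read the raw SMART values through smartctl's Areca pass-through instead of trusting the areca-cli summary. A rough sketch of that direction (drive list, the /dev/arcmsr0 node, and the alert address are placeholders for my setup):
Code:
#!/bin/sh
# Rough sketch: pull raw SMART values via smartctl's Areca pass-through
# rather than areca-cli. Placeholders: drive list, device node, alert address.
DRIVES="1 2 3 4 5"
DEVICE="/dev/arcmsr0"
ALERT="root@localhost"

for drv in $DRIVES; do
    attrs=$(smartctl -q noserial -A --device=areca,${drv} ${DEVICE})

    # Column 10 of the smartctl attribute table is the raw value.
    pending=$(echo "$attrs" | awk '$1 == 197 { print $10 }')
    temp=$(echo "$attrs" | awk '$1 == 194 { print $10 }')

    if [ -n "$pending" ] && [ "$pending" -gt 0 ]; then
        echo "Drive ${drv}: Current Pending Sector Count = ${pending}" | \
            mail -s "SMART warning: drive ${drv}" ${ALERT}
    fi
    if [ -n "$temp" ] && [ "$temp" -ge 45 ]; then
        echo "Drive ${drv}: temperature = ${temp}C" | \
            mail -s "Temperature warning: drive ${drv}" ${ALERT}
    fi
done
That way the alert keys off the raw value of attribute 197 (which was 19 here) instead of the normalized "200 ... OK" that areca-cli was happy to report.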
Edit: Something else I'd like to add. If I did a # zpool iostat 1 I'd get non-zero values for writes, but then zero for 3-6 seconds, which is how you'd expect the zpool to behave when everything is working. So don't expect zpool iostat to help you out.
Also, I did a # gstat and watched the latency of the devices. All 6 drives in the zpool (the bad disk plus the 5 good drives) were attached to the Areca controller. All 6 drives would show very high latency values on occasion (2000ms or higher), so I thought something was up. My other zpool is also attached to the Areca controller, and whenever a read or write hit that other zpool at the same time, the latency of all of the hard disks attached to the Areca controller would be extremely high. So gstat was a way to identify that there was a disk issue, but it was not useful in identifying the exact disk in my situation. It may have provided more value (perhaps pinpointing the exact device that is having issues) on the onboard Intel SATA or an HBA.
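If you want to watch per-disk latency yourself, something along these lines works (the filter regex is just an example; match it to your device names):
Code:
# gstat -p -I 1s
(or, limited to the da devices behind the controller)
# gstat -f '^da[0-9]+$' -I 1s
The ms/r and ms/w columns are the per-request latencies I was watching; spikes into the thousands of milliseconds are the red flag.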