I am running FreeNAS 9.1.0, I have 6 * 3tb hard drives, using RAIDZ2. The system was built a little over a month ago, all new components and hard drives.
When I log into the GUI front end in my web browser the main log (a preview is shown at the bottom of the screen) is reporting the following:
Code:
Oct 6 06:32:01 freenas smartd[2337]: Device: /dev/ada1, FAILED SMART self-check. BACK UP DATA NOW!
Oct 12 10:32:01 freenas smartd[2337]: Device: /dev/ada0, FAILED SMART self-check. BACK UP DATA NOW!
Oct 13 08:02:02 freenas smartd[2337]: Device: /dev/ada2, FAILED SMART self-check. BACK UP DATA NOW!
Oct 13 08:02:02 freenas smartd[2337]: Device: /dev/ada3, FAILED SMART self-check. BACK UP DATA NOW!
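For reference, here is a small sketch (not from the original post) of how the device names can be pulled out of smartd log lines like the ones above, e.g. to build a list of the "failing" drives; the sample log line is copied from the output above:

```shell
# Sketch: extract the device name from a smartd "FAILED SMART self-check" line.
# The $log value below is one of the real log lines quoted above.
log='Oct 6 06:32:01 freenas smartd[2337]: Device: /dev/ada1, FAILED SMART self-check. BACK UP DATA NOW!'
dev=$(echo "$log" | sed -n 's/.*Device: \(\/dev\/[a-z0-9]*\),.*/\1/p')
echo "$dev"
```

Run against the full log, the same `sed` expression would print one device path per failure line.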
Very clear, but upsetting, since that's 4 out of 6 hard drives. I have spent today doing extra backups of my NAS, for obvious reasons. Wanting some more details, I opened the console and ran the command:
smartctl -a /dev/ada0 | more
the interesting bits were:
Code:
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Red (AF)
Device Model: WDC WD30EFRX-68AX9N0
Serial Number: WD-WCC1T1490741
LU WWN Device Id: 5 0014ee 2b38719e7
Firmware Version: 80.00A80
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sun Oct 13 18:53:39 2013 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
...
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 179 175 021 Pre-fail Always - 6008
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 18
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 099 099 000 Old_age Always - 1067
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 18
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 8
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 9
194 Temperature_Celsius 0x0022 122 115 000 Old_age Always - 28
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 17
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 1038 -
# 2 Extended offline Completed without error 00% 870 -
Unless I am reading this totally wrong, this is saying there are zero errors, and everything is looking good. I am getting the same results for the other 5 drives in the system. I can see no useful difference between the "failing" drives and the healthy drives.
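In case it helps, this is roughly how I compared the drives: a sketch (not part of the original console session) that pulls out the raw values of the attributes most often tied to real failures, using the same attribute names as the table above; the sample lines below are copied from the `smartctl` output:

```shell
# Sketch: given saved `smartctl -a` output, print the raw values of the
# attributes that usually matter most. The sample below is copied from
# the real output above (note UDMA_CRC_Error_Count is 17, the rest are 0).
smart_out='  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       17'
echo "$smart_out" | awk '/Reallocated_Sector_Ct|Current_Pending_Sector|UDMA_CRC_Error_Count/ {print $2, $NF}'
```

The same `awk` filter run over `smartctl -a /dev/adaN` for each drive gives a quick side-by-side comparison.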
I then ran
zpool status Main-Storage
which returned:
Code:
pool: Main-Storage
state: ONLINE
scan: scrub repaired 0 in 4h50m with 0 errors on Sat Oct 5 04:50:21 2013
config:
NAME STATE READ WRITE CKSUM
Main-Storage ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
gptid/ef2bd5c3-1188-11e3-8467-c83a35d12cd7 ONLINE 0 0 0
gptid/efa5f830-1188-11e3-8467-c83a35d12cd7 ONLINE 0 0 0
gptid/f0222234-1188-11e3-8467-c83a35d12cd7 ONLINE 0 0 0
gptid/f09a8dc3-1188-11e3-8467-c83a35d12cd7 ONLINE 0 0 0
gptid/f1108fc6-1188-11e3-8467-c83a35d12cd7 ONLINE 0 0 0
gptid/f186fb53-1188-11e3-8467-c83a35d12cd7 ONLINE 0 0 0
errors: No known data errors
Can anyone offer any thoughts or suggestions? Am I just misunderstanding this? The log messages are very clear, but seem to contradict the SMART test results.
In theory a scrub runs every Saturday, as recommended here: http://doc.freenas.org/index.php/ZFS_Scrubs
but my initial settings were not quite right, so it seems only one scrub has run so far.
If I have not provided enough details, which is likely, what information will help?