FreeNAS maintenance... confirm my diagnosis?

Richelieu · Jan 22, 2014

Hey all -

First time poster, been using FreeNAS 8.3.0 for close to a year now. Lemme rephrase that, I've had it set up, working with no problems for almost a year now. (that includes a stretch of 180+ days of constant uptime, with almost no maintenance needed... just checked the drives now and then and I was good to go...)

In the last 2-3 weeks, I started having problems. Just looking for some "you read it right, still need to do this", "totally wrong, what were you thinking" sorts of thoughts from everyone.

I have a RAIDZ2 array set up, 7x2TB drives. (I know, works better with 6, but I had 7, so when I built it, that's what I used).. They're connected to a Highpoint 27xx controller. (yeah, I know, they're crap, don't run SMART worth a darn, etc... but as I said, this is the first issue I've had in a year of using it, so I'm not complaining...)

I came home to find that the GUI had completely stopped working. Couldn't SSH in, couldn't web connect... Nothing. I thought all was lost! I rebooted the machine and everything came up, but I had the yellow "Alert" indicator... I scrubbed the volume and found that a drive had some issues, but it cleaned itself up. I found out it was drive 1, got the serial number, everything good. I also found that my console monitor shows "1 pending sector..." for the 1/1/1 drive... so that sorta confirmed to me drive 1 had an issue.

Now, once the scrub was complete, the "Alert" indicator went back to green.... The drive obviously had a read error... the SMART log (which I run manually... since the drivers are crap, see above, I know...) shows that. But after the scrub, everything went green...

Question 1: Does this mean that the "system" is just working around the read error on the drive? I should replace the drive, but it doesn't necessarily count right now as one of my two "failures" before my array has issues?

Things went okay for about 4-5 days... and I've since had a second drive start having issues. I'm in the middle of a scrub on that drive now to get things all cleared up... the "Alert" icon is currently showing yellow. (the Volume Status is showing checksum errors on drive 4, as my SMART logs indicated it should...)

Question 2: Does this count as my second drive out of action and I need to immediately start swapping drives, or if the scrub (that's currently ongoing) fixes things, will I be back to normal? I realize once a drive has a read error, they'll just continue to pile up... so it's best to get them replaced... just curious if I need to immediately swap now, or can swap one out now, then one when the warranty replacement arrives...

I manually started a scrub to see what it cleans up prior to doing a replacement... (I've read it's best to have the data in good condition before starting a replacement/resilvering...)

Code:

  pool: HD
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
  scan: scrub in progress since Wed Jan 22 19:55:50 2014
        452G scanned out of 8.65T at 239M/s, 9h59m to go
        242K repaired, 5.10% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        HD                                              ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/b9c89418-414c-11e2-b2e1-902b34adb688  ONLINE       0     0     0
            gptid/ba04b1d7-414c-11e2-b2e1-902b34adb688  ONLINE       0     0     0
            gptid/ba3f6ae5-414c-11e2-b2e1-902b34adb688  ONLINE       0     0     0
            gptid/ba7079df-414c-11e2-b2e1-902b34adb688  ONLINE       0     0    25  (repairing)
            gptid/baaaf5bb-414c-11e2-b2e1-902b34adb688  ONLINE       0     0     0
            gptid/bb31273a-414c-11e2-b2e1-902b34adb688  ONLINE       0     0     0
            gptid/bb8063bd-414c-11e2-b2e1-902b34adb688  ONLINE       0     0     0

errors: No known data errors

I've got the manual open, I see what I need to do for the replacement.. just looking for a little guidance on how I'm reading the errors and the actions I've taken so far to make sure things stay operational...

Thanks guys!
Rich

warri · Jan 22, 2014

Question 1: Yes, ZFS repaired the files affected by bad sectors and the hard drive probably mapped out those sectors.

Question 2: You will be back to normal after the scrub is finished. Nonetheless you should replace the affected drives (see below). Usually pending and uncorrectable sectors are first signs of failing drives.

If errors are creeping up and are being corrected by ZFS (repaired KB in scrubs) like in your case you should definitely consider replacing the problematic drives. To make sure can you please post the complete smart output of the drives in question? You can obtain them with smartctl -q noserial -a /dev/adaX (replace X with actual drive numbers).
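For watching those counters over time, here is a minimal Python sketch (not part of FreeNAS or smartmontools; `WATCHED` and `raw_counts` are names made up for illustration). It assumes the standard ten-column attribute table that `smartctl -a` prints and pulls the raw values for attributes 5, 197 and 198 out of saved output, so two runs can be compared.

```python
# Sketch only: extract the raw counts that usually matter for a failing disk
# from saved `smartctl -a` text. WATCHED and raw_counts are hypothetical names.
WATCHED = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
}


def raw_counts(smart_text):
    """Return {attribute_name: raw_value} for the watched attribute IDs."""
    counts = {}
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns:
        # ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit():
            attr_id = int(fields[0])
            if attr_id in WATCHED and fields[9].isdigit():
                counts[fields[1]] = int(fields[9])
    return counts


# Sample rows in the layout smartctl prints (assumed, unwrapped).
sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
"""

print(raw_counts(sample))
```

If the pending or reallocated counts climb between runs, that is the "creeping up" pattern described above.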

Richelieu · Jan 22, 2014

Warri -

Thanks for the quick reply!

I figured that was the case... I know on HW RAID that I deal with on an almost daily basis, if the drive has an issue, we hot swap to a new drive... The card doesn't really care whether it's something it can fix or not... Though truth be told, I'm not sure I've seen a drive have a read error and not a total disintegration... Most I've seen die hard...

Due to my HPT27xx card, I run my SMART tests with smartctl -t short -d hpt,1/1/1 /dev/hpt27xx and get the output by using smartctl -a -d hpt,1/1/1 /dev/hpt27xx.

Drive #1:

Code:

smartctl 5.43 2012-06-30 r3573 [FreeBSD 8.3-RELEASE-p4 amd64] (local build)   
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net     
                                                                               
=== START OF INFORMATION SECTION ===                                           
Device Model:    WDC WD20EFRX-68AX9N0                                         
Serial Number:    xxxxx                                           
LU WWN Device Id: 5 0014ee 058c37522                                           
Firmware Version: 80.00A80                                                     
User Capacity:    2,000,398,934,016 bytes [2.00 TB]                           
Sector Sizes:    512 bytes logical, 4096 bytes physical                       
Device is:        Not in smartctl database [for details use: -P showall]       
ATA Version is:  8                                                           
ATA Standard is:  ACS-2 (revision not indicated)                               
Local Time is:    Wed Jan 22 19:06:09 2014 CST                                 
SMART support is: Available - device has SMART capability.                     
SMART support is: Enabled                                                     
                                                                               
=== START OF READ SMART DATA SECTION ===                                       
SMART overall-health self-assessment test result: PASSED                       
                                                                               
General SMART Values:                                                         
Offline data collection status:  (0x00) Offline data collection activity       
                                        was never started.                     
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 121) The previous self-test completed having
                                        the read element of the test failed.   
Total time to complete Offline                                                 
data collection:                (27240) seconds.                               
Offline data collection                                                       
capabilities:                    (0x7b) SMART execute Offline immediate.       
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new   
                                        command.                               
                                        Offline surface scan supported.       
                                        Self-test supported.                   
                                        Conveyance Self-test supported.       
                                        Selective Self-test supported.         
SMART capabilities:            (0x0003) Saves SMART data before entering       
                                        power-saving mode.                     
                                        Supports SMART auto save timer.       
Error logging capability:        (0x01) Error logging supported.               
                                        General Purpose Logging supported.     
Short self-test routine                                                       
recommended polling time:        (  2) minutes.                               
Extended self-test routine                                                     
recommended polling time:        (  5) minutes.                               
SCT capabilities:              (0x70bd) SCT Status supported.                 
                                        SCT Error Recovery Control supported. 
                                        SCT Feature Control supported.         
                                        SCT Data Table supported.             
                                                                               
SMART Attributes Data Structure revision number: 16                           
Vendor Specific SMART Attributes with Thresholds:                             
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       29
  3 Spin_Up_Time            0x0027   172   171   021    Pre-fail  Always       -       4375
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       30
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   197   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   087   087   000    Old_age   Always       -       9950
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       30
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       29
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   122   107   000    Old_age   Always       -       25
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0
                                                                               
SMART Error Log Version: 1                                                                                                                                     
No Errors Logged                                                               
                                                                               
SMART Self-test log structure revision number 1                               
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%      9950         2354043704
# 2  Short offline       Completed: read failure       90%      9880         2354043704
# 3  Short offline       Completed: read failure       90%      9784         2354043704
                                                                               
SMART Selective self-test log data structure revision number 1                 
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS                                   
    1        0        0  Not_testing                                           
    2        0        0  Not_testing                                           
    3        0        0  Not_testing                                           
    4        0        0  Not_testing                                           
    5        0        0  Not_testing                                           
Selective self-test flags (0x0):                                               
  After scanning selected spans, do NOT read-scan remainder of disk.           
If Selective self-test is pending on power-up, resume after 0 minute delay.

Drive #4:

Code:

smartctl 5.43 2012-06-30 r3573 [FreeBSD 8.3-RELEASE-p4 amd64] (local build)   
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net     
                                                                               
=== START OF INFORMATION SECTION ===                                           
Device Model:    WDC WD20EFRX-68AX9N0                                         
Serial Number:    xxxxx                                          
LU WWN Device Id: 5 0014ee 058c36d2e                                           
Firmware Version: 80.00A80                                                     
User Capacity:    2,000,398,934,016 bytes [2.00 TB]                           
Sector Sizes:    512 bytes logical, 4096 bytes physical                       
Device is:        Not in smartctl database [for details use: -P showall]       
ATA Version is:  8                                                           
ATA Standard is:  ACS-2 (revision not indicated)                               
Local Time is:    Wed Jan 22 19:15:09 2014 CST                                 
SMART support is: Available - device has SMART capability.                     
SMART support is: Enabled                                                     
                                                                               
=== START OF READ SMART DATA SECTION ===                                       
SMART overall-health self-assessment test result: PASSED                       
                                                                               
General SMART Values:                                                         
Offline data collection status:  (0x00) Offline data collection activity       
                                        was never started.                     
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 121) The previous self-test completed having
                                        the read element of the test failed.   
Total time to complete Offline                                                 
data collection:                (25140) seconds.                               
Offline data collection                                                       
capabilities:                    (0x7b) SMART execute Offline immediate.       
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new   
                                        command.                               
                                        Offline surface scan supported.       
                                        Self-test supported.                   
                                        Conveyance Self-test supported.       
                                        Selective Self-test supported.         
SMART capabilities:            (0x0003) Saves SMART data before entering       
                                        power-saving mode.                     
                                        Supports SMART auto save timer.       
Error logging capability:        (0x01) Error logging supported.               
                                        General Purpose Logging supported.     
Short self-test routine                                                       
recommended polling time:        (  2) minutes.                               
Extended self-test routine                                                     
recommended polling time:        ( 254) minutes.                               
Conveyance self-test routine                                                   
recommended polling time:        (  5) minutes.                               
SCT capabilities:              (0x70bd) SCT Status supported.                 
                                        SCT Error Recovery Control supported. 
                                        SCT Feature Control supported.         
                                        SCT Data Table supported.             
                                                                               
SMART Attributes Data Structure revision number: 16                           
Vendor Specific SMART Attributes with Thresholds:                             
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       246
  3 Spin_Up_Time            0x0027   174   172   021    Pre-fail  Always       -       4283
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       30
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   087   087   000    Old_age   Always       -       9945
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       30
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       29
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   125   106   000    Old_age   Always       -       22
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0
                                                                               
SMART Error Log Version: 1                                                     
No Errors Logged                                                               
                                                                               
SMART Self-test log structure revision number 1                               
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%      9945         2432116760
# 2  Short offline       Completed without error       00%      9874         -
# 3  Short offline       Completed without error       00%      9779         -
                                                                               
SMART Selective self-test log data structure revision number 1                 
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS                                   
    1        0        0  Not_testing                                           
    2        0        0  Not_testing                                           
    3        0        0  Not_testing                                           
    4        0        0  Not_testing                                           
    5        0        0  Not_testing                                           
Selective self-test flags (0x0):                                               
  After scanning selected spans, do NOT read-scan remainder of disk.           
If Selective self-test is pending on power-up, resume after 0 minute delay.

All other drives under the "Self-test log structure" have a status of "Completed without error".

Rich

warri · Jan 22, 2014

Usually with only 1 or 2 pending sectors I probably wouldn't replace the drives, only if the numbers are starting to go up quickly. But additionally your SMART tests are failing, and there are two flaky drives - so in this case I'd go for "better be safe than sorry".

Richelieu · Jan 22, 2014

Thanks for the double check, Warri. I'm watching the scrub right now and it's had to fix 324 KB out of 1.53 TB scanned so far... and the errors all appear confined to one drive... I think tomorrow when the scrub is done I'll double check the results and probably swap that drive out... I've already got my spare spun up, with some data moved onto it, and SMART tested... so I know I can rely on it... (I'd hate to have my spare go in and have faults in it right off the bat!)

One note on those SMART logs from the disks I'm having issues with...

Disk #4:

Code:

=== START OF READ SMART DATA SECTION ===                                     
SMART overall-health self-assessment test result: PASSED                     
                                                                                                                                                                     
SMART Attributes Data Structure revision number: 16                         
Vendor Specific SMART Attributes with Thresholds:                           
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       246
  3 Spin_Up_Time            0x0027   174   172   021    Pre-fail  Always       -       4283
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       30
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   087   087   000    Old_age   Always       -       9945
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       30
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       29
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   125   106   000    Old_age   Always       -       22
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0
                                                                             
SMART Error Log Version: 1                                                   
No Errors Logged                                                             
                                                                             
SMART Self-test log structure revision number 1                             
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%      9945         2432116760
# 2  Short offline       Completed without error       00%      9874         -
# 3  Short offline       Completed without error       00%      9779         -

So in this section, it has at the top saying "SMART overall-health self-assessment test result: PASSED", but the attributes have multiple "Pre-Fails" in them, and then the log structure has "Read Failures" in it... Soooo... the drive thinks it's okay in spite of the pre-fails and read failures? Would that be a correct assessment?

warri · Jan 23, 2014

Yes, the drive is probably still operating in its manufacturer specified "safe zones" - for whatever they are worth ;) SMART values are always subject to interpretation, but as I said before if ZFS has to fix data, in my opinion the drive doesn't work reliably anymore.

You could additionally perform a long smart test and see what it reports. This will take 4 hours+ though before you see any results. But it's entirely possible that the long smart test passes, while a short one fails..
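On the PASSED-versus-Pre-fail question, a rough Python sketch may help. As far as I understand smartctl's output, "Pre-fail" in the TYPE column is only the attribute's category (what a failure of that attribute would mean), and an attribute counts as failed only when its normalized VALUE drops to or below THRESH. The helper name `below_threshold` is hypothetical; the sample rows are taken from the Drive #4 output posted above.

```python
# Sketch only: list attributes whose normalized VALUE has reached THRESH.
# below_threshold is a made-up helper, not a smartmontools function.
def below_threshold(smart_text):
    """Return names of attributes with VALUE <= THRESH (THRESH of 0 never trips)."""
    failing = []
    for line in smart_text.splitlines():
        f = line.split()
        # Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(f) >= 10 and f[0].isdigit() and f[3].isdigit() and f[5].isdigit():
            value, thresh = int(f[3]), int(f[5])
            if thresh > 0 and value <= thresh:
                failing.append(f[1])
    return failing


# Rows from Drive #4: every VALUE is still well above its THRESH.
sample = """\
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       246
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
"""

print(below_threshold(sample))
```

Nothing is flagged for these rows, which fits the PASSED result: the self-test read failures and pending sectors show up in the raw values and the self-test log, not in the threshold comparison.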

Richelieu · Jan 23, 2014

One problem I have run into this morning. I checked on the status of my array, expecting the scrub to be close to finished, if not done. I logged in through the web interface and it wouldn't display everything, it would just hang. I couldn't get the main menu on the left in which to pick "Shell" so I could do a "zpool status".

I went downstairs and checked the console output. Went into a shell there and did the "zpool status". It showed it was 35% done, but it wasn't updating. I sent the commands in about 1-2 minutes apart for 3-4 times and the percent complete, transfer rate, and rate scanned did not change. It seems like something crashed/locked up last night while it was scrubbing.

I'm tempted to reboot, but first... any data I should get? Should I be worried about this?

warri · Jan 23, 2014

Can you post the last log outputs of dmesg? Anything in /data/crash?

Scrubbing should continue automatically after a reboot. Btw, it is not necessary to perform a scrub before you exchange disks, since the resilver operation is essentially a scrub and you have double redundancy.

EDIT: Two system lockups in a short time are suspicious. Can you run a RAM test if you are not using ECC RAM?

EDIT2: I also agree with Yatti420. Give it some more time and double check the status in 30 minutes or so.
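One way to tell a slow scrub from a hung one is to compare the scan lines of two `zpool status` outputs taken some time apart. The sketch below is an illustration only, assuming the scan-line wording shown in the status output earlier in this thread (other ZFS versions may word it differently); `scrub_progress` and `looks_stalled` are made-up helper names.

```python
import re

# Matches e.g. "452G scanned out of 8.65T at 239M/s, ... 5.10% done"
# (wording assumed from the zpool status output earlier in the thread).
SCAN_RE = re.compile(
    r"(\S+) scanned out of (\S+) at (\S+)/s.*?([\d.]+)% done", re.S
)


def scrub_progress(status_text):
    """Return (scanned, total, rate, percent_done) or None if no scrub line is found."""
    m = SCAN_RE.search(status_text)
    if m is None:
        return None
    scanned, total, rate, pct = m.groups()
    return scanned, total, rate, float(pct)


def looks_stalled(earlier, later):
    """True when the scanned amount and percent done did not move between samples."""
    a, b = scrub_progress(earlier), scrub_progress(later)
    if a is None or b is None:
        return False
    return (a[0], a[3]) == (b[0], b[3])


sample = """\
  scan: scrub in progress since Wed Jan 22 19:55:50 2014
        452G scanned out of 8.65T at 239M/s, 9h59m to go
        242K repaired, 5.10% done
"""

print(scrub_progress(sample))
```

Because the figures are rounded to three digits, samples only a minute or two apart can look identical even on a healthy scrub, so a gap of 30 minutes or so gives a more trustworthy answer.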

Yatti420 · Jan 23, 2014

Give it more time.. I don't interrupt scrubs.. Yes I would replace both disks.. If it's enough to pull my pool into a degraded state - then those drives are gone.

Richelieu · Jan 29, 2014

RAM test came out clean... I'm going to run another one in a week or so.

Everything is back to normal! I was able to get my spare drive into the system to replace the "worse" of the two drives... which got things on a more level footing... but apparently both drives started to fail hardcore at the same time... They wouldn't run SMART test... one was clicking pretty bad too...

I ended up pulling the drive rack out, and one by one hooking up the drive to a WIN PC and ran the WD Smart Util on them.. picked out the bad drives... did a replace on drive one... then a day later I got the second spare drive (RMA from WD) and swapped it in... 15 hours later... My array is healthy with no errors and functioning correctly.

Thanks for all the help on this one! I was pretty sure I was diagnosing things properly, but when you've got data you'd like to keep (but aren't overly afraid of losing), I figured it was good to make sure I was reading it right.
