Extensive Scrub Duration

amitkhas · Apr 29, 2014

Hi guys,

I wanted to ask how long a scrub should take. In the last 18 hours it has advanced only 2%, so the rate is roughly 3% per 24 hours. At that pace it would take 33+ days to complete! I don't think this is right... any idea what could be causing it?

Here are the system specs:
FreeNAS 9.2.1.5 Release x64
6x 2TB Hitachi 7200RPM drives in RAIDZ1
2.2TB of 8.9TB used (6.7TB available)

Intel i5-2500K CPU @3.30 GHz
8GB RAM

warri · Apr 29, 2014

Should definitely be faster. Any errors while scrubbing? Are your drives ok? Check the SMART values.

amitkhas · Apr 29, 2014

I apologize in advance for these 'stupid' questions:

How do I view the report while scrubbing to see if there are any errors? Can I see it from the GUI?

How do I verify the drives are OK? The volume status is "healthy." Is there a way to check the individual drives?

How do I check the SMART values? SMART is enabled on each drive. Do you see it from the GUI?

warri · Apr 29, 2014

You can obtain the information through the Shell included in the WebGUI, or by logging in via SSH.

This gives you the status of the current scrub, and an estimated time frame:

Code:

zpool status

If any checksum errors show up here, that would be a reason to worry.

You can poll the SMART values of your drives by issuing:

Code:

smartctl -a -q noserial /dev/adaX

Replace X with the respective drive number (usually starting with 0, then 1, etc.).

There you should look out especially for read errors or write errors. You can also just post the output here and we can help you, but please use the [code]-tags to retain the correct formatting.
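If the full report is too long to read comfortably, a small filter can pull out just the attributes that most often point to failing media. This is only a sketch: flag_smart_attrs is a made-up helper name, and the choice of IDs (5, 196, 197, 198) is an assumption about which attributes matter most, not an official list. It expects the attribute table as printed by smartctl -A on standard input.

```shell
#!/bin/sh
# Hypothetical helper: read a "smartctl -A" attribute table on stdin and
# print ID, name and raw value for reallocation / pending-sector attributes
# (IDs 5, 196, 197, 198) whose raw value (10th column) is non-zero.
flag_smart_attrs() {
    awk '$1 ~ /^(5|196|197|198)$/ && $10 + 0 > 0 { print $1, $2, $10 }'
}
```

For example, smartctl -A /dev/ada0 | flag_smart_attrs prints nothing for a drive whose counters are all zero. Any line it does print is worth a closer look in the full report.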

amitkhas · Apr 29, 2014

Here is the output when I issue zpool status. It says 266h remaining and it's only 60% done, yikes!

Code:

[root@freenas ~]# zpool status
  pool: share
 state: ONLINE
status: The pool is formatted using a legacy on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on software that does not support feature
        flags.
  scan: scrub in progress since Mon Apr 28 00:09:20 2014
        1.59T scanned out of 2.62T at 1.13M/s, 266h0m to go
        15.0M repaired, 60.48% done
config:

        NAME          STATE     READ WRITE CKSUM
        share         ONLINE       0     0     0
          raidz1-0    ONLINE       0     0     0
            ada0p2    ONLINE       0     0     0
            ada1p2    ONLINE       0     0     0
            ada2p2    ONLINE       0     0     0
            ada3p2    ONLINE       0     0     0
            ada4p2    ONLINE       0     0     0  (repairing)
            ada5p2    ONLINE       0     0     0

errors: No known data errors
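
The "to go" figure in the scan lines above can be sanity-checked by hand: remaining data divided by the current rate. A rough sketch, assuming zpool reports binary units (T = TiB, M = MiB); scrub_eta_hours is a made-up helper name, not a ZFS command.

```shell
#!/bin/sh
# Hypothetical helper: estimate remaining scrub time in hours.
# Arguments: scanned (TiB), total (TiB), rate (MiB/s), as shown by zpool status.
scrub_eta_hours() {
    awk -v scanned="$1" -v total="$2" -v rate="$3" 'BEGIN {
        remaining_mib = (total - scanned) * 1024 * 1024   # TiB -> MiB
        printf "%.1f\n", remaining_mib / rate / 3600      # seconds -> hours
    }'
}

# Figures from the status output above; prints 265.5
scrub_eta_hours 1.59 2.62 1.13
```

That is in line with the reported 266h0m (the displayed figures are rounded), so the estimate itself is consistent. The real problem is the 1.13M/s scan rate.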

When I tried to get the SMART values, some of the text scrolled off screen in the Shell. There is no scrollbar, so I was only able to copy the bottom part of each report.

SMART Values ada0:

Code:

                                        General Purpose Logging supported.                                                         
Short self-test routine                                                                                                           
recommended polling time:        (  1) minutes.                                                                                   
Extended self-test routine                                                                                                         
recommended polling time:        ( 328) minutes.                                                                                   
SCT capabilities:              (0x003d) SCT Status supported.                                                                     
                                        SCT Error Recovery Control supported.                                                     
                                        SCT Feature Control supported.                                                             
                                        SCT Data Table supported.                                                                 
                                                                                                                                   
SMART Attributes Data Structure revision number: 16                                                                               
Vendor Specific SMART Attributes with Thresholds:                                                                                 
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   136   136   054    Pre-fail  Offline      -       83
  3 Spin_Up_Time            0x0007   140   140   024    Pre-fail  Always       -       395 (Average 423)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       246
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   133   133   020    Pre-fail  Offline      -       27
  9 Power_On_Hours          0x0012   099   099   000    Old_age   Always       -       9436
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       246
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       543
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       543
194 Temperature_Celsius     0x0002   176   176   000    Old_age   Always       -       34 (Min/Max 15/42)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   072   072   000    Old_age   Always       -       698
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
                                                                                                                                   
SMART Error Log Version: 1                                                                                                         
No Errors Logged                                                                                                                   
                                                                                                                                   
SMART Self-test log structure revision number 1                                                                                   
No self-tests have been logged.  [To run self-tests, use: smartctl -t]                                                             
                                                                                                                                   
                                                                                                                                   
SMART Selective self-test log data structure revision number 1                                                                     
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS                                                                                       
    1        0        0  Not_testing                                                                                               
    2        0        0  Not_testing                                                                                               
    3        0        0  Not_testing                                                                                               
    4        0        0  Not_testing                                                                                               
    5        0        0  Not_testing                                                                                               
Selective self-test flags (0x0):                                                                                                   
  After scanning selected spans, do NOT read-scan remainder of disk.                                                               
If Selective self-test is pending on power-up, resume after 0 minute delay.

SMART Values ada1:

Code:

Error 267 occurred at disk power-on lifetime: 9362 hours (390 days + 2 hours)                                                     
  When the command that caused the error occurred, the device was active or idle.                                                 
                                                                                                                                   
  After command completion occurred, registers were:                                                                               
  ER ST SC SN CL CH DH                                                                                                             
  -- -- -- -- -- -- --                                                                                                             
  40 51 15 eb 22 40 0c  Error: UNC 21 sectors at LBA = 0x0c4022eb = 205529835                                                     
                                                                                                                                   
  Commands leading to the command that caused the error were:                                                                     
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name                                                                 
  -- -- -- -- -- -- -- --  ----------------  --------------------                                                                 
  25 00 80 80 22 40 40 00      13:05:01.510  READ DMA EXT                                                                         
  25 00 80 80 22 40 40 00      13:04:58.759  READ DMA EXT                                                                         
  25 00 80 80 8e 96 40 00      13:04:58.740  READ DMA EXT                                                                         
  25 00 80 00 09 fb 40 00      13:04:58.732  READ DMA EXT                                                                         
  25 00 80 00 9b f4 40 00      13:04:58.699  READ DMA EXT                                                                         
                                                                                                                                   
Error 266 occurred at disk power-on lifetime: 9362 hours (390 days + 2 hours)                                                     
  When the command that caused the error occurred, the device was active or idle.                                                 
                                                                                                                                   
  After command completion occurred, registers were:                                                                               
  ER ST SC SN CL CH DH                                                                                                             
  -- -- -- -- -- -- --                                                                                                             
  40 51 15 eb 22 40 0c  Error: UNC 21 sectors at LBA = 0x0c4022eb = 205529835                                                     
                                                                                                                                   
  Commands leading to the command that caused the error were:                                                                     
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name                                                                 
  -- -- -- -- -- -- -- --  ----------------  --------------------                                                                 
  25 00 80 80 22 40 40 00      13:04:58.759  READ DMA EXT                                                                         
  25 00 80 80 8e 96 40 00      13:04:58.740  READ DMA EXT                                                                         
  25 00 80 00 09 fb 40 00      13:04:58.732  READ DMA EXT                                                                         
  25 00 80 00 9b f4 40 00      13:04:58.699  READ DMA EXT                                                                         
  35 00 04 9f 2b eb 40 00      13:04:58.633  WRITE DMA EXT                                                                         
                                                                                                                                   
SMART Self-test log structure revision number 1                                                                                   
No self-tests have been logged.  [To run self-tests, use: smartctl -t]                                                             
                                                                                                                                   
                                                                                                                                   
SMART Selective self-test log data structure revision number 1                                                                     
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS                                                                                       
    1        0        0  Not_testing                                                                                               
    2        0        0  Not_testing                                                                                               
    3        0        0  Not_testing                                                                                               
    4        0        0  Not_testing                                                                                               
    5        0        0  Not_testing                                                                                               
Selective self-test flags (0x0):                                                                                                   
  After scanning selected spans, do NOT read-scan remainder of disk.                                                               
If Selective self-test is pending on power-up, resume after 0 minute delay.

SMART Values ada2:

Code:

Error 1654 occurred at disk power-on lifetime: 9383 hours (390 days + 23 hours)                                                   
  When the command that caused the error occurred, the device was active or idle.                                                 
                                                                                                                                   
  After command completion occurred, registers were:                                                                               
  ER ST SC SN CL CH DH                                                                                                             
  -- -- -- -- -- -- --                                                                                                             
  40 51 62 13 b7 7a 03  Error: UNC 98 sectors at LBA = 0x037ab713 = 58373907                                                       
                                                                                                                                   
  Commands leading to the command that caused the error were:                                                                     
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name                                                                 
  -- -- -- -- -- -- -- --  ----------------  --------------------                                                                 
  c8 00 ce a7 b6 7a e3 00      00:32:21.941  READ DMA                                                                             
  c8 00 ce a7 b6 7a e3 00      00:32:19.164  READ DMA                                                                             
  c8 00 00 a7 b5 7a e3 00      00:32:19.164  READ DMA                                                                             
  c8 00 9a d9 b4 7a e3 00      00:32:19.163  READ DMA                                                                             
  c8 00 cd 0c b4 7a e3 00      00:32:19.162  READ DMA                                                                             
                                                                                                                                   
Error 1653 occurred at disk power-on lifetime: 9383 hours (390 days + 23 hours)                                                   
  When the command that caused the error occurred, the device was active or idle.                                                 
                                                                                                                                   
  After command completion occurred, registers were:                                                                               
  ER ST SC SN CL CH DH                                                                                                             
  -- -- -- -- -- -- --                                                                                                             
  40 51 62 13 b7 7a 03  Error: UNC 98 sectors at LBA = 0x037ab713 = 58373907                                                       
                                                                                                                                   
  Commands leading to the command that caused the error were:                                                                     
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name                                                                 
  -- -- -- -- -- -- -- --  ----------------  --------------------                                                                 
  c8 00 ce a7 b6 7a e3 00      00:32:19.164  READ DMA                                                                             
  c8 00 00 a7 b5 7a e3 00      00:32:19.164  READ DMA                                                                             
  c8 00 9a d9 b4 7a e3 00      00:32:19.163  READ DMA                                                                             
  c8 00 cd 0c b4 7a e3 00      00:32:19.162  READ DMA                                                                             
  c8 00 ce 0b b3 7a e3 00      00:32:19.162  READ DMA                                                                             
                                                                                                                                   
SMART Self-test log structure revision number 1                                                                                   
No self-tests have been logged.  [To run self-tests, use: smartctl -t]                                                             
                                                                                                                                   
                                                                                                                                   
SMART Selective self-test log data structure revision number 1                                                                     
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS                                                                                       
    1        0        0  Not_testing                                                                                               
    2        0        0  Not_testing                                                                                               
    3        0        0  Not_testing                                                                                               
    4        0        0  Not_testing                                                                                               
    5        0        0  Not_testing                                                                                               
Selective self-test flags (0x0):                                                                                                   
  After scanning selected spans, do NOT read-scan remainder of disk.                                                               
If Selective self-test is pending on power-up, resume after 0 minute delay.

amitkhas · Apr 29, 2014

SMART Values ada3:

Code:

                                        General Purpose Logging supported.                                                         
Short self-test routine                                                                                                           
recommended polling time:        (  1) minutes.                                                                                   
Extended self-test routine                                                                                                         
recommended polling time:        ( 333) minutes.                                                                                   
SCT capabilities:              (0x003d) SCT Status supported.                                                                     
                                        SCT Error Recovery Control supported.                                                     
                                        SCT Feature Control supported.                                                             
                                        SCT Data Table supported.                                                                 
                                                                                                                                   
SMART Attributes Data Structure revision number: 16                                                                               
Vendor Specific SMART Attributes with Thresholds:                                                                                 
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   135   135   054    Pre-fail  Offline      -       85
  3 Spin_Up_Time            0x0007   140   140   024    Pre-fail  Always       -       395 (Average 423)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       247
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   133   133   020    Pre-fail  Offline      -       27
  9 Power_On_Hours          0x0012   099   099   000    Old_age   Always       -       9441
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       247
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       549
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       549
194 Temperature_Celsius     0x0002   187   187   000    Old_age   Always       -       32 (Min/Max 15/43)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
                                                                                                                                   
SMART Error Log Version: 1                                                                                                         
No Errors Logged                                                                                                                   
                                                                                                                                   
SMART Self-test log structure revision number 1                                                                                   
No self-tests have been logged.  [To run self-tests, use: smartctl -t]                                                             
                                                                                                                                   
                                                                                                                                   
SMART Selective self-test log data structure revision number 1                                                                     
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS                                                                                       
    1        0        0  Not_testing                                                                                               
    2        0        0  Not_testing                                                                                               
    3        0        0  Not_testing                                                                                               
    4        0        0  Not_testing                                                                                               
    5        0        0  Not_testing                                                                                               
Selective self-test flags (0x0):                                                                                                   
  After scanning selected spans, do NOT read-scan remainder of disk.                                                               
If Selective self-test is pending on power-up, resume after 0 minute delay.

SMART Values ada4:

Code:

                                        General Purpose Logging supported.                                                         
Short self-test routine                                                                                                           
recommended polling time:        (  1) minutes.                                                                                   
Extended self-test routine                                                                                                         
recommended polling time:        ( 307) minutes.                                                                                   
SCT capabilities:              (0x003d) SCT Status supported.                                                                     
                                        SCT Error Recovery Control supported.                                                     
                                        SCT Feature Control supported.                                                             
                                        SCT Data Table supported.                                                                 
                                                                                                                                   
SMART Attributes Data Structure revision number: 16                                                                               
Vendor Specific SMART Attributes with Thresholds:                                                                                 
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   007   007   016    Pre-fail  Always   FAILING_NOW 3828210787
  2 Throughput_Performance  0x0005   137   137   054    Pre-fail  Offline      -       77
  3 Spin_Up_Time            0x0007   134   134   024    Pre-fail  Always       -       428 (Average 427)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       248
  5 Reallocated_Sector_Ct   0x0033   001   001   005    Pre-fail  Always   FAILING_NOW 2005
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   133   133   020    Pre-fail  Offline      -       27
  9 Power_On_Hours          0x0012   099   099   000    Old_age   Always       -       9362
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       248
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       541
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       541
194 Temperature_Celsius     0x0002   187   187   000    Old_age   Always       -       32 (Min/Max 15/43)
196 Reallocated_Event_Count 0x0032   001   001   000    Old_age   Always       -       2777
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
                                                                                                                                   
SMART Error Log Version: 1                                                                                                         
No Errors Logged                                                                                                                   
                                                                                                                                   
SMART Self-test log structure revision number 1                                                                                   
No self-tests have been logged.  [To run self-tests, use: smartctl -t]                                                             
                                                                                                                                   
                                                                                                                                   
SMART Selective self-test log data structure revision number 1                                                                     
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS                                                                                       
    1        0        0  Not_testing                                                                                               
    2        0        0  Not_testing                                                                                               
    3        0        0  Not_testing                                                                                               
    4        0        0  Not_testing                                                                                               
    5        0        0  Not_testing                                                                                               
Selective self-test flags (0x0):                                                                                                   
  After scanning selected spans, do NOT read-scan remainder of disk.                                                               
If Selective self-test is pending on power-up, resume after 0 minute delay.

SMART Values ada5:

Code:

                                        General Purpose Logging supported.                                                         
Short self-test routine                                                                                                           
recommended polling time:        (  1) minutes.                                                                                   
Extended self-test routine                                                                                                         
recommended polling time:        ( 321) minutes.                                                                                   
SCT capabilities:              (0x003d) SCT Status supported.                                                                     
                                        SCT Error Recovery Control supported.                                                     
                                        SCT Feature Control supported.                                                             
                                        SCT Data Table supported.                                                                 
                                                                                                                                   
SMART Attributes Data Structure revision number: 16                                                                               
Vendor Specific SMART Attributes with Thresholds:                                                                                 
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE                                   
  1 Raw_Read_Error_Rate    0x000b  100  100  016    Pre-fail  Always      -      0                                           
  2 Throughput_Performance  0x0005  136  136  054    Pre-fail  Offline      -      80                                         
  3 Spin_Up_Time            0x0007  132  132  024    Pre-fail  Always      -      436 (Average 433)                           
  4 Start_Stop_Count        0x0012  100  100  000    Old_age  Always      -      253                                         
  5 Reallocated_Sector_Ct  0x0033  100  100  005    Pre-fail  Always      -      0                                           
  7 Seek_Error_Rate        0x000b  100  100  067    Pre-fail  Always      -      0                                           
  8 Seek_Time_Performance  0x0005  133  133  020    Pre-fail  Offline      -      27                                         
  9 Power_On_Hours          0x0012  099  099  000    Old_age  Always      -      9393                                       
10 Spin_Retry_Count        0x0013  100  100  060    Pre-fail  Always      -      0                                           
12 Power_Cycle_Count      0x0032  100  100  000    Old_age  Always      -      253                                         
192 Power-Off_Retract_Count 0x0032  100  100  000    Old_age  Always      -      556                                         
193 Load_Cycle_Count        0x0012  100  100  000    Old_age  Always      -      556                                         
194 Temperature_Celsius    0x0002  187  187  000    Old_age  Always      -      32 (Min/Max 15/42)                         
196 Reallocated_Event_Count 0x0032  100  100  000    Old_age  Always      -      0                                           
197 Current_Pending_Sector  0x0022  100  100  000    Old_age  Always      -      0                                           
198 Offline_Uncorrectable  0x0008  100  100  000    Old_age  Offline      -      0                                           
199 UDMA_CRC_Error_Count    0x000a  200  200  000    Old_age  Always      -      0                                           
                                                                                                                                   
SMART Error Log Version: 1                                                                                                         
No Errors Logged                                                                                                                   
                                                                                                                                   
SMART Self-test log structure revision number 1                                                                                   
No self-tests have been logged.  [To run self-tests, use: smartctl -t]                                                             
                                                                                                                                   
                                                                                                                                   
SMART Selective self-test log data structure revision number 1                                                                     
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS                                                                                       
    1        0        0  Not_testing                                                                                               
    2        0        0  Not_testing                                                                                               
    3        0        0  Not_testing                                                                                               
    4        0        0  Not_testing                                                                                               
    5        0        0  Not_testing                                                                                               
Selective self-test flags (0x0):                                                                                                   
  After scanning selected spans, do NOT read-scan remainder of disk.                                                               
If Selective self-test is pending on power-up, resume after 0 minute delay.

I am not quite sure how to read all this. Seems like there is an error?

cyberjock · Apr 29, 2014

Let's see:

First, you only included part of the output, so I can't tell you "how bad" things are. Based on what little you provided, here's what I know:

ada0 is failing
ada1 is throwing errors (failing?)
ada2 is throwing errors (failing?)
ada3 is fine.
ada4 has a high reallocated sector count, so it's probably failing too. It does have "FAILING NOW" indicators flagged.
ada5 is fine.
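
For anyone reading along, a rough way to pull out the rows behind verdicts like these is to filter the attribute table for a non-empty WHEN_FAILED column, or for a non-zero raw count on the sector-related attributes. This is only a sketch: the function name `smart_flags` is made up for illustration, and it assumes the ten-column attribute layout shown in the outputs above.

```shell
# Hypothetical helper: reads `smartctl -A` style output on stdin and prints
# the attribute rows that look worrying (ID, name, WHEN_FAILED, raw value).
smart_flags() {
  awk '
    $1 ~ /^[0-9]+$/ && NF >= 10 {
      # Column 9 is WHEN_FAILED; "-" means the threshold was never crossed.
      failed = ($9 != "-")
      # IDs 5, 196, 197 and 198 count reallocated, pending and uncorrectable
      # sectors; a non-zero raw value (column 10) deserves a closer look.
      sectors = (($1 == 5 || $1 == 196 || $1 == 197 || $1 == 198) && $10 + 0 > 0)
      if (failed || sectors) print $1, $2, $9, $10
    }'
}

# Usage on the FreeNAS box (device name is an example):
#   smartctl -A /dev/ada4 | smart_flags
```

Run against the ada4 table above, this would print the Raw_Read_Error_Rate, Reallocated_Sector_Ct and Reallocated_Event_Count rows; for ada5 it would print nothing.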

You've failed to schedule SMART testing.

You've either failed to properly set up SMART monitoring or you've failed to set up emailing properly, or both.

Considering you have RAIDZ1, I wouldn't expect to walk away from this with your data when all is said and done. You're in serious trouble with your pool. I don't think you even realize how close to losing your data you are right now. It's only dumb luck you even have access to your pool at all right now. If you run out and buy replacement disks and make a new pool from scratch (and use RAIDZ2!), you *might* be able to copy your files to the new pool. Trying to replace the disks one by one is impossible since you'd be losing the little bit of redundancy you have while you have other bad disks.

This situation is *precisely* why we tell people that they are crazy irresponsible to go with RAIDZ1.

Considering you have 4 of 6 failing disks, as soon as any one of your disks fails for good your pool is likely to be gone. You'll have no more redundancy to recover from the errors on the other disks.

So I hope you have backups.. and if you don't, you'd better start copying the most important data while you can.

DrKK · Apr 29, 2014

This is so common.

At least 9 times out of 10, when people come in here with problems like this, they've NEVER run SMART tests, and they have no recurring regimen of SMART test runs on a biweekly basis. Like this guy has NEVER run one, and had he been running them, he would have detected problems on these drives much earlier, and could have replaced them before it got to a high risk/precarious situation.

Maybe we should put more in the documentation to make sure users set up a recurring SMART test regimen?

Sir, I would advise you to stop the scrub, if you can, "zpool scrub -s NAME_OF_POOL" (it might not let you because everything is so bad off), and copy what you can off the pool that you want to keep, IF you can.

But this pool has had drives in it that almost surely have been a problem for a long time, and you would have detected it (and FreeNAS would have emailed you) had you had the rigorous SMART test regimen that we'd recommend.
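
As a rough sketch, the manual commands behind this advice would look something like the following. The pool name `share` and the device names ada0 through ada5 come from the outputs earlier in the thread. The `RUN` variable is just an illustrative dry-run switch (not a FreeNAS feature), so nothing executes until the list has been checked. The recurring schedule itself is set up in the FreeNAS GUI; this only shows how a scrub is stopped and how a one-off self-test is started by hand.

```shell
# Dry run by default: each command is printed, not executed.
# Set RUN to an empty string to actually run them.
RUN=${RUN-echo}
POOL=share   # pool name from the zpool status output in this thread

maintenance_cmds() {
  # Stop the running scrub, as suggested above.
  $RUN zpool scrub -s "$POOL"
  # Start a short self-test on each drive; the results appear later in
  # the output of: smartctl -l selftest /dev/adaN
  for n in 0 1 2 3 4 5; do
    $RUN smartctl -t short "/dev/ada$n"
  done
}

maintenance_cmds
```

With the pool in this state, copying the data off comes first; the self-tests can wait until that is done.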

amitkhas · Apr 29, 2014

Thank you very much for the input.

When I built the FreeNAS system 2 years ago, there was no recommendation at the time to NOT use RAIDz1. I wish I had known that at the time. The Hard Drives were bought at that time, so 4 of the 6 are failing within 2 years.

For the entire 2 years, I was running FreeNAS 8.0.2. I had SMART tests scheduled, but I don't think the emailing functionality worked properly. That version did not have the ability to send Test Emails. I upgraded to FreeNAS 9.2.1.5 only 2 days ago when I noticed the transfer speeds were sluggish. The failing disks are likely the culprit of the slow transfer speeds?

In any case, a few questions on what I can do now:

This may be a dumb question, but why is it impossible to try to repair the pool by replacing the disks 1 by 1? The other disks may be failing, but they are still operational. None of the disks have failed completely (for now). Wouldn't this be sufficient to rebuild the pool while the clock ticks?

As a lesson learned, what would be the recommended RAID redundancy now that RAIDZ1 is dead? RAIDZ2? Or will that also soon die, making RAIDZ3 the new recommendation? I wouldn't want to find myself in this position again in another 2 years.

Here's a long shot: Is it possible that the failing disks could be a false positive caused by the upgrade? In other words, could upgrading from v8.0.2 to the current version have resulted in the pool being read incorrectly? Perhaps the "auto-import volume" was not done properly? The FreeNAS upgrade was not done by upgrading the firmware. It was a clean, fresh install.

cyberjock · Apr 29, 2014

Well, RAIDZ1 is nothing more than RAID5, and RAID5 was declared dead in 2009. See http://www.zdnet.com/blog/storage/why-raid-5-stops-working-in-2009/162. This is why I have the link in my sig that RAIDZ1/RAID5 is dead. People using it are playing with fire, but going to bed every night not knowing they have set themselves up for failure in the future. For you, the future is now.

The failing disks are absolutely "a" cause for slow transfer speeds like yours. There may be other problems, but I will tell you that 1MB/sec or so on scrubs is a definite flag for failed disks.

You can't repair the pool because you have several failing disks, and when you do a disk replacement you *will* have to expect that the other disks are 100% correct and error free. We already know they aren't, so if you replace a disk you'll see the resilver start, and then *bang*. Your pool will go offline and you'll never see your data again. If you had done RAIDZ2 then you could probably do it because you'd lose one disk's redundancy during the disk replacement and resilvering, but there would still be one disk's redundancy. You will have zero redundancy while the resilver takes place, and we already know for more than 300% certainty that you don't have all of your data on all of your disks right now.

RAIDZ2 will be "dead" probably around 2019 or so. There's no way to know as technology changes. If hard drives become more reliable then it might be 2025 or later. If they become less reliable it might be 2017.

There is no chance that these are false positives. The SMART data you provided is hard drive internal diagnostic information. The hard drives themselves know about the problems we are discussing.

amitkhas · Apr 30, 2014

I'm still surprised that within 2 years, 4 of 6 hard drives are failing. Only 2TB of storage has been utilized. I haven't done the math based on the failure rates, but 2 years seems short.

Fortunately, I only have 2TB of data on it. I'm trying to back it up, but the transfer speeds are slow. It often gets hung up. Even transferring 5GB takes many hours. Also, some files are not able to be transferred out. It says it's a network error. But the "network error" only impacts specific files. I'm thinking it's because of the bad sectors, and not a 'network error.'

The network error is: error 0x8007003B in windows 8.1.

DrKK · Apr 30, 2014

amitkhas said:
I'm still surprised that within 2 years, 4 of 6 hard drives are failing. Only 2TB of storage has been utilized. I haven't done the math based on the failure rates, but 2 years seems short.

Fortunately, I only have 2TB of data on it. I'm trying to back it up, but the transfer speeds are slow. It often gets hung up. Even transferring 5GB takes many hours. Also, some files are not able to be transferred out. It says it's a network error. But the "network error" only impacts specific files. I'm thinking it's because of the bad sectors, and not a 'network error.'

The network error is: error 0x8007003B in windows 8.1.

I'm going to ask something stupid.

That hex error, 0x8007003B, classically shows up when you attempt to copy a file larger than 4GB to a filesystem that won't handle files over 4GB. Just FYI.

cyberjock · Apr 30, 2014

My guess is that the error is caused by what DrKK says or the data stream timed out because your disks are failing. If you attach the disk locally and try to copy your files over locally you'll have a higher success rate. That's why in my previous post I mentioned creating a new pool and copying your data to the new pool locally. ;)

I was way ahead of you.

amitkhas · Apr 30, 2014

My network is a home network on a gigabit switch. I never have any connection dropouts. I normally (when the drives were healthy) see 50-60MB/sec transfer rates. Now, I see 1-2MB/sec, and even then, it occasionally stops for 5-10 minutes before it resumes. My guess is the transfer rates are slow because of the failing disks, not the network.

I am backing up the 2TB worth of data to a few different external drives as a temporary storage until I am able to purchase replacement disks for a new pool. How would I go about attaching the disk locally? Connect it via USB, set up a pool, then transfer files? I suppose it may be better because it's going through USB rather than ethernet?

cyberjock · Apr 30, 2014

You don't understand what I'm saying... when one or more of your disks start having problems, it halts the CIFS/Samba data stream. In Windows, after 30 seconds or a minute (I forget which), Windows declares the server to be offline and drops the connection. Poof, you get an error and whatever file it was copying isn't copied.

I never thought your transfer rates were because of your network. Even if you had 10Mb networking you wouldn't get errors, it would just be slow. The error means that *something is wrong*. You should try to alleviate that problem if possible, which is why I said you should connect the disk locally. When the disk is local there's no CIFS data stream to time out, and cp has no timeout limit, so it'll keep trucking along even if it takes 2 minutes to get your data back.

Well, external drives are a total nightmare and should never ever be trusted with data for reliability reasons. Your best bet would be to use an internal drive. USB presents its own serious risks for data integrity and reliability. If you have no choice but to use USB, I'd plug one into the computer and format it ZFS and do a cp copy from the FreeNAS box locally. Just realize you are taking a grave risk using USB.

I realize this sounds harsh with me complaining about your pool, then USB... it's like you can't get a break. But you are wanting to do all of the things we're constantly telling people not to do. Some listen, some don't. Of those that don't, plenty come back later to admit they lost data and they've learned from their mistake. :/

amitkhas · Apr 30, 2014

It may be harsh, but I really do appreciate your expertise. I am learning a lot, and glad that you are willing to inform me rather than let me suffer on my own :) These are valuable lessons learned so I know what to do next time.

My FreeNAS box is pretty much maxed out with the 6 drives. It doesn't have the spare capacity to connect additional disks. Thus, I think I may have to resort to the USB external.

Either way, once I have another disk connected, what is the exact cp copy command I should issue? I'm assuming this would be issued via the shell? I want to make sure I do it correctly.

cyberjock · Apr 30, 2014

In your shoes USB sounds like the only option. Just make sure it's a drive that does work properly and is reliable.

The cp command will be dependent on where your data is and where you are wanting to move it. It'll probably be something like "cp -R /mnt/pool1 /mnt/pool2" but that's about the best I can offer. You should give the cp manpages a read though to see if you want to use any other parameters. I might add -v so you can watch it copy and tell if it freezes for a day or something.

Keep in mind there will be no way for you to know how far it is done until it finishes. :/

As long as your data is moving I'd leave it alone. The best way for you to guess how far along it is is to check the quantity of data you are moving and the quantity of data in your new pool. That'll give you the best guess.
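
That comparison can be scripted from a second shell while the copy runs. The sketch below is only an illustration: `pct_done` is a made-up helper name, and the paths in the usage comment are the placeholder ones from the cp example. Sizes reported by `du` can differ somewhat between pools, so treat the result as a rough guess.

```shell
# Hypothetical helper: prints what percentage of SRC_KB the DST_KB figure is.
# Usage: pct_done SRC_KB DST_KB
pct_done() {
  awk -v s="$1" -v d="$2" 'BEGIN {
    if (s <= 0) { print "unknown"; exit }
    printf "%.0f\n", 100 * d / s
  }'
}

# Example with placeholder paths. du -sk prints a total in kilobytes; measure
# the source only once, since walking a failing pool repeatedly adds read load.
#   src_kb=$(du -sk /mnt/pool1 | awk '{print $1}')
#   dst_kb=$(du -sk /mnt/pool2 | awk '{print $1}')
#   echo "about $(pct_done "$src_kb" "$dst_kb")% copied"
```

Keeping the arithmetic in its own function means the source total can be measured once and reused each time the destination is checked.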

Good luck! You definitely need it.

amitkhas · May 4, 2014

So I tried connecting the 2TB Western Digital External hard drive via USB. However, when I attempt to "Import Volume" in FreeNAS, the disk does not appear in the drop down menu. Any idea why it may not be appearing? I do not know if the disk is being recognized by FreeNAS at all.

cyberjock · May 4, 2014

Do you have a zpool on that drive? If not that's why it's not importable...

amitkhas · May 4, 2014

I thought it had to have a zpool in order to utilize "Auto-Import."

It's configured as NTFS, which I thought is fine for "Import."
