Just fixed my ZFS NAS errors and I now need suggestions

MarciSD · Jun 29, 2014

Hi,

A few days ago, my FreeNAS "broke" after a power failure. I did a little research and found out about zpool scrub. Here was my zpool status before the scrub finished:

Code:

[root@freenas ~]# zpool status
  pool: FNAS
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub in progress since Sun Jun 29 01:21:28 2014
        1.40T scanned out of 1.40T at 42.6M/s, 0h0m to go
        0 repaired, 100.00% done
config:

        NAME                                          STATE     READ WRITE CKSUM
        FNAS                                          ONLINE      69     0     0
          gptid/2f36ff86-3ab2-11e3-a5dd-bc5ff495ed99  ONLINE       0     0     0
          gptid/1e026827-4385-11e3-9e04-bc5ff495ed99  ONLINE      69     0     0

errors: 5 data errors, use '-v' for a list

And here is the status after scrub:

Code:

                                                                                   
[root@freenas ~]# zpool status
  pool: FNAS
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 9h36m with 0 errors on Sun Jun 29 10:57:59 2014
config:

        NAME                                          STATE     READ WRITE CKSUM
        FNAS                                          ONLINE      69     0     0
          gptid/2f36ff86-3ab2-11e3-a5dd-bc5ff495ed99  ONLINE       0     0     0
          gptid/1e026827-4385-11e3-9e04-bc5ff495ed99  ONLINE      69     0     0

errors: No known data errors

Then I ran a SMART test to check the state of my disks, and this one looks problematic:

Code:

=== START OF INFORMATION SECTION ===                                                                                
Model Family:     Western Digital Caviar Blue Serial ATA                                                                     
Device Model:     WDC WD10EALS-00Z8A0                                                                                          
Serial Number:    WD-WCATR0945010                                                                                                
LU WWN Device Id: 5 0014ee 259b796b6                                                                                              
Firmware Version: 05.01D05                                                                                                          
User Capacity:    1,000,204,886,016 bytes [1.00 TB]                                                                            
Sector Size:      512 bytes logical/physical                                                                                        
Device is:        In smartctl database [for details use: -P show]                                                              
ATA Version is:   ATA8-ACS (minor revision not indicated)                                                                  
SATA Version is:  SATA 2.6, 3.0 Gb/s                                                                                                
Local Time is:    Sun Jun 29 11:20:32 2014 PDT                                                                                    
SMART support is: Available - device has SMART capability.                                                               
SMART support is: Enabled  
 
Short self-test routine                                                                                                             
recommended polling time:        (   2) minutes.                                                                                    
Extended self-test routine                                                                                                          
recommended polling time:        ( 182) minutes.                                                                                    
Conveyance self-test routine                                                                                                        
recommended polling time:        (   5) minutes.                                                                                    
SCT capabilities:              (0x3037) SCT Status supported.                                                                       
                                        SCT Feature Control supported.                                                              
                                        SCT Data Table supported.                                                                   
                                                                                                                                    
SMART Attributes Data Structure revision number: 16                                                                                 
Vendor Specific SMART Attributes with Thresholds:                                                                                   
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE                                    
  1 Raw_Read_Error_Rate     0x002f   171   171   051    Pre-fail  Always       -       71842                   
  3 Spin_Up_Time            0x0027   175   174   021    Pre-fail  Always       -       4250                           
  4 Start_Stop_Count        0x0032   098   098   000    Old_age   Always       -       2330                      
  5 Reallocated_Sector_Ct   0x0033   106   106   140    Pre-fail  Always   FAILING_NOW 750        
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0                            
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       6517                    
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0                          
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0                      
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       1714                  
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       114              
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2215                  
194 Temperature_Celsius     0x0022   105   093   000    Old_age   Always       -       42                      
196 Reallocated_Event_Count 0x0032   001   001   000    Old_age   Always       -       680               
197 Current_Pending_Sector  0x0032   199   199   000    Old_age   Always       -       218                  
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       9                        
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0              
200 Multi_Zone_Error_Rate   0x0008   165   149   000    Old_age   Offline      -       7051                                         
                                                                                                                                    
SMART Error Log Version: 1                                                                                                          
No Errors Logged                                                                                                                    
                                                                                                                                    
SMART Self-test log structure revision number 1                                                                                     
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error                                     
# 1  Extended offline    Completed: read failure       90%      6485         995865938                                              
                                                                                                                                    
SMART Selective self-test log data structure revision number 1                                                                      
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS                                                                                        
    1        0        0  Not_testing                                                                                                
    2        0        0  Not_testing                                                                                                
    3        0        0  Not_testing                                                                                                
    4        0        0  Not_testing                                                                                                
    5        0        0  Not_testing                                                                                                
Selective self-test flags (0x0):                                                                                                    
  After scanning selected spans, do NOT read-scan remainder of disk.                                                                
If Selective self-test is pending on power-up, resume after 0 minute delay.

Now, my question is: what are the next steps I need to take? Do I need to replace the disk, or is it safe to just remove it from my ZFS pool?

My device is a simple desktop-turned-NAS, and it's not using ECC memory.

Thanks in advance for the advice.

danb35 · Jun 29, 2014

Are those two disks striped or mirrored? If striped (which they appear to be), you need to copy as much of the data as possible off of that pool ASAP and set up a pool with some redundancy. If one of those disks fails completely, you'll lose all the data on your pool--and that disk is definitely showing signs of trouble.
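To show why the pool looks striped, here is a minimal Python sketch that reads the config section of the `zpool status` output posted above. Disks listed directly under the pool name, with no `mirror` or `raidz` line in between, have no redundancy. The function names are made up for illustration, and the parser assumes plain-integer error counters and a pool with no log, cache or spare devices, as in this thread.

```python
# Config section copied from the zpool status output in the first post.
ZPOOL_STATUS = """\
config:

        NAME                                          STATE     READ WRITE CKSUM
        FNAS                                          ONLINE      69     0     0
          gptid/2f36ff86-3ab2-11e3-a5dd-bc5ff495ed99  ONLINE       0     0     0
          gptid/1e026827-4385-11e3-9e04-bc5ff495ed99  ONLINE      69     0     0

errors: No known data errors
"""

# Name prefixes that mark a redundant top-level vdev in zpool status output.
REDUNDANT_PREFIXES = ("mirror", "raidz")


def parse_config(status_text):
    """Return (indent, name, state, read, write, cksum) for each config row."""
    rows = []
    in_config = False
    for line in status_text.splitlines():
        if line.startswith("config:"):
            in_config = True
            continue
        if line.startswith("errors:"):
            break
        if not in_config:
            continue
        fields = line.split()
        if len(fields) == 1 and rows:
            # A bare section label such as "logs": stop, this sketch ignores those.
            break
        if len(fields) != 5 or fields[0] == "NAME":
            continue
        indent = len(line) - len(line.lstrip())
        name, state, read, write, cksum = fields
        # Assumption: counters are plain integers (no "1.2K" style values).
        rows.append((indent, name, state, int(read), int(write), int(cksum)))
    return rows


def unprotected_disks(rows):
    """Disks sitting directly under the pool, i.e. not inside a mirror or raidz."""
    pool_indent = rows[0][0]
    top_level = [r for r in rows[1:] if r[0] == pool_indent + 2]
    return [r for r in top_level if not r[1].startswith(REDUNDANT_PREFIXES)]


if __name__ == "__main__":
    for indent, name, state, read, write, cksum in unprotected_disks(parse_config(ZPOOL_STATUS)):
        note = "has read errors" if read else "no errors so far"
        print(f"{name}: no redundancy, {note} (read={read})")
```

Both disks come back as unprotected, and the one with 69 read errors is the same disk whose SMART report is shown above.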

I also note that the drive is running a little hot--it's 42 deg C, while 40 deg C is the recommended max. That's not likely to immediately kill the drive, but it sure won't help.

Ericloewe · Jun 29, 2014

You "found out" about scrubs? Honestly, if you can't be bothered to research the software you'll be using to store your data, you shouldn't touch it.

http://forums.freenas.org/index.php...ning-vdev-zpool-zil-and-l2arc-for-noobs.7775/

http://web.freenas.org/images/resources/freenas9.2.1/freenas9.2.1_guide.pdf

As far as I can tell, you're running two striped drives. You can't replace one, you have to copy the data over to a new pool. Yes, that drive is failing, big-time.

Things may still be recoverable if you fix this ASAP. However, you've made several mistakes:

Did not use ECC
Striped all your drives, allowing for 0 redundancy
Did not schedule (or even manually run!) scrubs and SMART tests (along with e-mailing the results of these tests...)!
Drives are running way too hot. Recommended maximum is 40°C, that drive has seen 45°C and is currently at 42°C.
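For the SMART and temperature points, here is a rough Python sketch that picks the worrying rows out of the `smartctl` attribute table posted above. It is only an illustration: the function names are invented, the rows are copied from the first post, the 40°C limit is the one quoted in this thread, and the parser assumes every raw value is a plain integer.

```python
# Selected rows copied from the smartctl output in the first post.
SMART_ROWS = """\
  5 Reallocated_Sector_Ct   0x0033   106   106   140    Pre-fail  Always   FAILING_NOW 750
194 Temperature_Celsius     0x0022   105   093   000    Old_age   Always       -       42
196 Reallocated_Event_Count 0x0032   001   001   000    Old_age   Always       -       680
197 Current_Pending_Sector  0x0032   199   199   000    Old_age   Always       -       218
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       9
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

MAX_TEMP_C = 40  # recommended maximum quoted in this thread

# Attributes whose raw value counts bad or suspect sectors.
SECTOR_COUNTERS = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}


def parse_attributes(text):
    """Map attribute name -> the columns this sketch looks at."""
    attrs = {}
    for line in text.splitlines():
        f = line.split()
        # Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(f) < 10 or not f[0].isdigit():
            continue
        attrs[f[1]] = {
            "value": int(f[3]),
            "thresh": int(f[5]),
            "when_failed": f[8],
            "raw": int(f[9]),  # assumption: raw value is a plain integer
        }
    return attrs


def find_problems(attrs):
    """Return (attribute, reason) pairs in table order."""
    problems = []
    for name, a in attrs.items():
        if a["when_failed"] != "-":
            problems.append((name, f"threshold tripped ({a['when_failed']})"))
        elif name in SECTOR_COUNTERS and a["raw"] > 0:
            problems.append((name, f"{a['raw']} sectors"))
        elif name == "Temperature_Celsius" and a["raw"] > MAX_TEMP_C:
            problems.append((name, f"{a['raw']} C, above {MAX_TEMP_C} C"))
    return problems


if __name__ == "__main__":
    for name, reason in find_problems(parse_attributes(SMART_ROWS)):
        print(f"{name}: {reason}")
```

On this drive the sketch flags the reallocated sector count (already marked FAILING_NOW), the pending and uncorrectable sectors, and the temperature.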

I urge you to read the stuff I linked so that in the future you don't repeat these same mistakes.

cyberjock · Jun 29, 2014

+1 for everything Ericloewe says.

panz · Jul 1, 2014

Ericloewe said:
Drives are running way too hot. Recommended maximum is 40°C, that drive has seen 45°C and is currently at 42°C.

Temperature is an interesting thing: I'm running two 1TB drives (a Samsung and a Seagate) that have spent their life since 2008 running 24/7 in two LaCie "Neil Poulton" enclosures. I've just put them into my FreeNAS box and tested them with all sorts of stress routines. They're just fine.

I just discovered that, during their life in those fashionable boxes, they reached 72°C!!!

Ericloewe · Jul 1, 2014

panz said:
Temperature is an interesting thing: I'm running two 1TB drives (a Samsung and a Seagate) that have spent their life since 2008 running 24/7 in two LaCie "Neil Poulton" enclosures. I've just put them into my FreeNAS box and tested them with all sorts of stress routines. They're just fine.

I just discovered that, during their life in those fashionable boxes, they reached 72°C!!!

Holy *redacted*!

And I thought my old Samsung F3s stuck in external aluminum enclosures had it bad at 43°C. Hell, that's (almost?) hot enough to pasteurize stuff!

panz · Jul 1, 2014

"Poulton Design" is rubbish: it's a plastic enclosure (yes, plastic) with small holes on the bottom side of the "brick". No vent holes elsewhere! It becomes so hot OUTSIDE that you can barely touch it!

I pulled the drives from the Poulton nightmare and put them inside my 24-bay FreeNAS box: after one week of testing and stressing (smartctl, dd, iozone, etc.) they are perfectly healthy!

So, temperature: I don't know. My current Red drives never get hotter than 37°C when they're resilvering, scrubbing or running a SMART long test. So I follow the advice and don't let them go over 40, but I'm confident that 45°C won't hurt them. Just my 2 cents.

rs225 · Jul 1, 2014

If you have another 1TB drive lying around and a spare bay for it, you should connect it immediately and use it to make a mirror of your failing drive.

zpool attach FNAS gptid/1e026827-4385-11e3-9e04-bc5ff495ed99 gptid/your_new_drive
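If you'd rather double-check the command before running it, here is a small dry-run sketch in Python. It only builds and prints the `zpool attach` line from the post above plus a follow-up `zpool status`, and it never executes anything. The helper names are invented for illustration, and `gptid/your_new_drive` is still the placeholder from above, not a real device ID.

```python
import shlex

POOL = "FNAS"
FAILING_DISK = "gptid/1e026827-4385-11e3-9e04-bc5ff495ed99"
NEW_DISK = "gptid/your_new_drive"  # placeholder, replace with the real gptid


def attach_command(pool, existing, new):
    """Build the argv for: zpool attach <pool> <existing> <new>."""
    if existing == new:
        raise ValueError("existing and new device must differ")
    return ["zpool", "attach", pool, existing, new]


def status_command(pool):
    """Build the argv for: zpool status <pool> (shows resilver progress)."""
    return ["zpool", "status", pool]


if __name__ == "__main__":
    # Dry run only: print the commands instead of executing them.
    for argv in (attach_command(POOL, FAILING_DISK, NEW_DISK), status_command(POOL)):
        print(shlex.join(argv))
```

Attaching a new disk to a single-disk vdev turns that vdev into a two-way mirror, and the resilver has to read everything from the failing drive, so keep an eye on `zpool status` until it finishes.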

joeschmuck · Jul 1, 2014

@MarciSD
You were lucky to have had that power failure, because otherwise you would never have known about the imminent danger you were in until it was too late. But of course, maybe the data isn't important to you and the loss is acceptable; I have no way of knowing how you use your system.

@panz
72°C! No way, that thing must have felt like it orbited the sun.
