Zpool resilvering constantly restarting

RandomTask · Jan 16, 2019

I recently started seeing SMART test error messages for two of my drives in the WebGUI:

Code:

- Device: /dev/ada1, Read SMART Self-Test Log Failed
- Device: /dev/ada3, Self-Test Log error count increased from 1 to 2

Running long smartctl tests and reviewing the results revealed the following:

Code:

# smartctl -a /dev/ada1
// ... drive information
SMART overall-health self-assessment test result: PASSED
// ... more drive information

SMART Attributes Data Structure revision number: 16                                                                                 
Vendor Specific SMART Attributes with Thresholds:                                                                                  
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE                                    
  1 Raw_Read_Error_Rate     0x002f   119   095   051    Pre-fail  Always       -       260985                                      
  3 Spin_Up_Time            0x0027   237   175   021    Pre-fail  Always       -       1133                                        
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       73                                          
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0                                            
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0                                            
  9 Power_On_Hours          0x0032   036   036   000    Old_age   Always       -       47249                                        
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0                                            
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0                                            
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       73                                          
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       59                                          
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       13                                          
194 Temperature_Celsius     0x0022   108   095   000    Old_age   Always       -       39                                          
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0                                            
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       13                                          
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0                                            
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0                                            
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1                                            
                                                                                                                                   
SMART Error Log Version: 1                                                                                                          
No Errors Logged                                                                                                                    
                                                                                                                                   
SMART Self-test log structure revision number 1                                                                                    
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error                                    
# 1  Extended offline    Interrupted (host reset)      90%     46915         -                                                      
# 2  Short offline       Completed: read failure       90%     46804         362115280                                              
# 3  Short offline       Completed: read failure       70%     46804         3906320080                                            
# 4  Short offline       Completed: read failure       90%     46804         362115280                                              
# 5  Extended offline    Completed: read failure       90%     46781         362115280                                              
# 6  Short offline       Completed: read failure       90%     46781         362115280                                              
# 7  Extended offline    Completed: read failure       90%     30850         362119544                                              
# 8  Extended offline    Completed: read failure       90%     30843         362119544                                              
# 9  Extended offline    Completed: read failure       90%     30826         362119544                                              
#10  Extended offline    Completed: read failure       90%     30819         362119544                                              
#11  Extended offline    Completed: read failure       90%     30587         362119544                                              
#12  Extended offline    Completed: read failure       90%     30580         362119544                                              
#13  Extended offline    Completed: read failure       90%     30347         362119544                                              
#14  Extended offline    Completed: read failure       90%     30340         362119544                                              
#15  Extended offline    Completed: read failure       90%     30323         362119544                                              
#16  Extended offline    Completed: read failure       90%     30316         362119544                                              
#17  Extended offline    Completed: read failure       90%     30107         362119544                                              
#18  Extended offline    Completed: read failure       90%     30100         362119544                                              
#19  Extended offline    Completed: read failure       90%     30083         362119544                                              
#20  Extended offline    Completed: read failure       90%     30076         362119544                                              
#21  Extended offline    Completed: read failure       90%     29843         362119544

Code:

# smartctl -a /dev/ada3

// ... drive information
SMART overall-health self-assessment test result: PASSED
// ... more drive information

SMART Attributes Data Structure revision number: 16                                                                                 
Vendor Specific SMART Attributes with Thresholds:                                                                                  
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE                                    
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       777                                          
  3 Spin_Up_Time            0x0027   179   177   021    Pre-fail  Always       -       4033                                        
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       74                                          
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0                                            
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0                                            
  9 Power_On_Hours          0x0032   037   037   000    Old_age   Always       -       46058                                        
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0                                            
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0                                            
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       73                                          
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       56                                          
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       17                                          
194 Temperature_Celsius     0x0022   108   095   000    Old_age   Always       -       39                                          
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0                                            
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       3                                            
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0                                            
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0                                            
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1                                            
                                                                                                                                   
SMART Error Log Version: 1                                                                                                          
No Errors Logged                                                                                                                    
                                                                                                                                   
SMART Self-test log structure revision number 1                                                                                    
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error                                    
# 1  Extended offline    Completed: read failure       80%     45772         296677872                                              
# 2  Short offline       Completed: read failure       90%     45680         305110318                                              
# 3  Extended offline    Completed: read failure       90%     45601         296266848                                              
# 4  Extended offline    Aborted by host               90%     45594         -                                                      
# 5  Extended offline    Completed without error       00%     29662         -                                                      
# 6  Extended offline    Completed without error       00%     29655         -                                                      
# 7  Extended offline    Completed without error       00%     29638         -                                                      
# 8  Extended offline    Completed without error       00%     29631         -                                                      
# 9  Extended offline    Completed without error       00%     29399         -                                                      
#10  Extended offline    Completed without error       00%     29392         -                                                      
#11  Extended offline    Completed without error       00%     29159         -                                                      
#12  Extended offline    Completed without error       00%     29152         -                                                      
#13  Extended offline    Completed without error       00%     29135         -                                                      
#14  Extended offline    Completed without error       00%     29128         -                                                      
#15  Extended offline    Completed without error       00%     28919         -                                                      
#16  Extended offline    Completed without error       00%     28912         -                                                      
#17  Extended offline    Completed without error       00%     28895         -                                                      
#18  Extended offline    Completed without error       00%     28888         -

Although the overall-health assessment reports PASSED for both drives, I've read a number of posts saying that once pending sectors start to stack up, it's time to replace the drive. Those posts also suggest that the cables should be the first thing to scrutinize, so I've ordered a complete new set, which should arrive soon.
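
As a rough illustration of what those posts look for, here is a minimal Python sketch that scans `smartctl -a` text for nonzero raw values on the sector-health attributes and for self-test read failures. The function name `smart_warnings` and the choice of watched attributes are assumptions made for this example, and the sample rows are copied from the ada1 output above.

```python
# Sample rows copied from the ada1 smartctl output in this post.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       13
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
# 2  Short offline       Completed: read failure       90%     46804         362115280
# 5  Extended offline    Completed: read failure       90%     46781         362115280
"""

# Attributes watched here are an assumption for this sketch, not an official list.
WATCHED = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"}


def smart_warnings(text):
    """Return human-readable warnings found in smartctl -a output."""
    warnings = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # Attribute rows have ten columns: ID, name, flag, value, worst,
        # thresh, type, updated, when_failed, raw value.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            if fields[9].isdigit() and int(fields[9]) > 0:
                warnings.append(f"{fields[1]} raw value is {fields[9]}")
        # Self-test log rows start with '#' and end with the failing LBA.
        elif fields[0].startswith("#") and "read failure" in line:
            warnings.append(f"self-test read failure at LBA {fields[-1]}")
    return warnings


for warning in smart_warnings(SAMPLE):
    print(warning)
```

Note that the PASSED line does not appear in the warnings at all: a drive can pass the overall-health check while still logging pending sectors and failed self-tests, which is the situation both drives are in.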

Before replacing the drives, I tried dd on one of the offending sectors to see whether the pending-sector count would drop. It didn't seem to help.
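
For reference, a read-only probe of a single reported LBA can be built as in the sketch below. It assumes 512-byte logical sectors (check the drive information section of `smartctl` for the real value), and `dd_read_command` is a hypothetical helper name. It only formats the command string and never touches a disk.

```python
def dd_read_command(device, lba, sector_size=512):
    """Build a read-only dd command for one sector and report its byte offset.

    sector_size=512 is an assumption; some drives report 4096-byte sectors.
    """
    byte_offset = lba * sector_size
    # skip= counts input blocks of size bs, so with bs equal to the sector
    # size the skip value is the LBA itself.
    command = f"dd if={device} of=/dev/null bs={sector_size} skip={lba} count=1"
    return byte_offset, command


offset, command = dd_read_command("/dev/ada1", 362115280)
print(offset)   # 185403023360
print(command)  # dd if=/dev/ada1 of=/dev/null bs=512 skip=362115280 count=1
```

A read like this only tests whether the sector can still be read. It writes nothing, so on its own it would not be expected to force the drive to remap the sector.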

I've used the WebGUI to replace the worst offender, ada1. However, the resilvering process seems unable to finish and often restarts one or more times each day. It has been running since January 1 and has yet to complete successfully.

I did notice once that ada1 was offline and then online again. I'm wondering whether faulty cables are causing the process to restart, or whether I need to replace ada3 as well because it has unreadable sectors.

I've also disabled SMART tests and zpool scrubs while the resilver runs, in case they were causing the restarts, but it doesn't seem to have made a difference. Scrubs were set to run once a month, as were long SMART tests (each disk on a different day).

My zpool reports as healthy:

Code:

pool: Tartarus                                                                                                                    
 state: ONLINE                                                                                                                      
status: One or more devices is currently being resilvered.  The pool will                                                          
        continue to function, possibly in a degraded state.                                                                        
action: Wait for the resilver to complete.                                                                                          
  scan: resilver in progress since Wed Jan 16 20:00:55 2019                                                                        
        52.2G scanned out of 1.27T at 5.79M/s, 61h26m to go                                                                        
        8.67G resilvered, 4.01% done                                                                                                
config:                                                                                                                            
                                                                                                                                   
        NAME                                              STATE     READ WRITE CKSUM                                                
        Tartarus                                          ONLINE       0     0     0                                                
          raidz2-0                                        ONLINE       0     0     0                                                
            gptid/b532dea8-2c7e-11e2-b689-902b34563f56    ONLINE       0     0     0                                                
            replacing-1                                   ONLINE       0     0 1.05K                                                
              gptid/b587a21a-2c7e-11e2-b689-902b34563f56  ONLINE       0     0     0  (resilvering)                                
              gptid/9caf3b27-0b81-11e9-af0c-902b34563f56  ONLINE       0     0     0  (resilvering)                                
            gptid/b5d9849a-2c7e-11e2-b689-902b34563f56    ONLINE       0     0     0                                                
            gptid/b62b9a91-2c7e-11e2-b689-902b34563f56    ONLINE       0     0     0                                                
            gptid/b6808fc8-2c7e-11e2-b689-902b34563f56    ONLINE       0     0     0                                                
            gptid/b6d39638-2c7e-11e2-b689-902b34563f56    ONLINE       0     0     0                                                
                                                                                                                                   
errors: No known data errors                                                                                                        
                                                                                                                                   
  pool: freenas-boot                                                                                                                
 state: ONLINE                                                                                                                      
  scan: scrub repaired 0 in 0h2m with 0 errors on Thu Dec 13 03:47:08 2018                                                          
config:                                                                                                                            
                                                                                                                                   
        NAME        STATE     READ WRITE CKSUM                                                                                      
        freenas-boot  ONLINE       0     0     0                                                                                    
          da0p2     ONLINE       0     0     0                                                                                      
                                                                                                                                   
errors: No known data errors
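
As a sanity check on the numbers in that scan line, the sketch below recomputes the time remaining from the scanned size, total size, and rate. It assumes the K/M/G/T suffixes are powers of 1024, and the helper names are made up for this example.

```python
# Assumption: zpool status size suffixes are binary (powers of 1024).
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def to_bytes(text):
    """Convert a value such as '52.2G' or '5.79M' to bytes."""
    return float(text[:-1]) * UNITS[text[-1]]


def resilver_eta_hours(scanned, total, rate_per_second):
    """Estimate hours remaining from the figures on the zpool status scan line."""
    remaining = to_bytes(total) - to_bytes(scanned)
    return remaining / to_bytes(rate_per_second) / 3600


eta = resilver_eta_hours("52.2G", "1.27T", "5.79M")
print(f"{eta:.1f} hours to go")
```

The result lands close to the 61h26m that zpool reports, with the small gap down to rounding in the displayed figures. At that rate a full pass takes roughly two and a half days, so a resilver that restarts at least once a day has no chance to finish.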

So my questions are:
1. Can I turn the server off if the resilvering process doesn't complete so that I can replace the cables or am I stuck?
2. Is there a log or something that I can look into to see why the resilvering keeps restarting?
3. Is there something else I should be doing to make the resilvering successful?

My Specs:

Code:

FreeNAS-9.10.2-U1 (86c7ef5)
Intel(R) Core(TM) i5-3550 CPU @ 3.30GHz
Gigabyte GA-Z77X-UD5H Motherboard
16GB RAM
8GB Sandisk USB for OS
6x 2TB WD Red 7200 RPM SATA (Added a 7th to start the replacement process)
1 zpool, 2 disks of redundancy

No RAID setup, all disks are connected directly to the motherboard

I've reviewed the information in these posts but I'm stumped as to why the resilvering keeps restarting:
https://www.ixsystems.com/documenta...rage.html#replacing-drives-to-grow-a-zfs-pool
https://www.ixsystems.com/documentation/freenas/9.10/storage.html#replacing-a-failed-drive
https://forums.freenas.org/index.ph...-10-freenas-reference-manual-not-clear.54195/
https://forums.freenas.org/index.ph...bleshooting-guide-all-versions-of-freenas.17/
https://forums.freenas.org/index.ph...us-resilvering-please-help.57619/#post-406483

I've exported the debug information from the WebGUI as well and can upload if needed.

Chris Moore · Jan 16, 2019

RandomTask said:

This drive should have been replaced THOUSANDS of hours ago, the first time it failed a self-test. What is your problem?

RandomTask said:

Again, a drive that should have already been replaced.

RandomTask said:
Can I turn the server off if the resilvering process doesn't complete so that I can replace the cables or am I stuck?

Why do you think the cables need replacement? You should have pulled the defective drive out. The problems with the drive are what is causing the repeated restarts. It is probably accumulating new errors as it is trying to read from the defective drive.

RandomTask · Jan 17, 2019

Thanks for your quick reply!

My concern about the cables was that I've seen the drives periodically go "offline". I'd check the cables and restart the machine, and then it would see the drive again and resilver.

From my reading (specifically one of the threads I linked), I was under the impression that you shouldn't remove a drive from a pool until it has been replaced, unless the WebGUI gives you the "offline" button for it.

I have recorded their serial numbers so I can pull the drives. Is it safe to shut the system down (properly) and just remove the faulty drives?

RandomTask · Jan 17, 2019

OK, I've swapped out the offending disks and things seem to be looking much better. Thanks again!
