RAIDZ1: second drive UNAVAIL while swapping

birke

Cadet
Joined
Jan 8, 2023
Messages
7
After an apparent drive fault and an attempt to swap the defective drive, I now have two UNAVAIL drives. Is there any chance this is my fault, and not the drive?

I have four WD Red 4TB drives in an encrypted RAIDZ1 volume. Before Christmas one drive went UNAVAIL and the volume became degraded. I ordered a new drive and attempted to swap it in (a hot-swap, since I couldn't figure out how to swap the drive without the volume being unlocked), but afterwards the system was unreachable. After a reboot, the volume would not unlock and is reporting "error getting available space". zpool import shows:

Code:
[root@freenas ~]# zpool import                                                                                                     
   pool: Bunker                                                                                                                     
     id: 14960215150828443373                                                                                                       
  state: UNAVAIL                                                                                                                   
 status: One or more devices are missing from the system.                                                                           
 action: The pool cannot be imported. Attach the missing                                                                           
        devices and try again.                                                                                                     
   see: http://illumos.org/msg/ZFS-8000-3C                                                                                         
 config:                                                                                                                           
                                                                                                                                    
        Bunker                                              UNAVAIL  insufficient replicas                                         
          raidz1-0                                          UNAVAIL  insufficient replicas                                         
            1162922274020371250                             UNAVAIL  cannot open                                                   
            2820994357183782969                             UNAVAIL  cannot open                                                   
            gptid/2010cbfe-da2a-11e7-8037-7085c24bc353.eli  ONLINE                                                                 
            gptid/242f9af5-da2a-11e7-8037-7085c24bc353.eli  ONLINE                                                                 
[root@freenas ~]#        
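
For readers who want to summarize a listing like the one above, here is a minimal Python sketch. It is not part of FreeNAS or ZFS; the names `parse_pool_config` and `leaf_members` are made up for illustration, and it assumes the indentation-based layout shown in this output (pool, then vdev, then member devices at the deepest indent). The sample text is copied from the config section above.

```python
# Sketch only: summarizes the "config:" section of `zpool import` output.
# Assumes the layout seen in this thread; function names are hypothetical.

SAMPLE = """\
 config:

        Bunker                                              UNAVAIL  insufficient replicas
          raidz1-0                                          UNAVAIL  insufficient replicas
            1162922274020371250                             UNAVAIL  cannot open
            2820994357183782969                             UNAVAIL  cannot open
            gptid/2010cbfe-da2a-11e7-8037-7085c24bc353.eli  ONLINE
            gptid/242f9af5-da2a-11e7-8037-7085c24bc353.eli  ONLINE
"""

# Device states that can appear in the second column.
STATES = {"ONLINE", "DEGRADED", "FAULTED", "OFFLINE", "UNAVAIL", "REMOVED"}


def parse_pool_config(text):
    """Return (indent, name, state, note) tuples for rows after 'config:'."""
    rows = []
    in_config = False
    for line in text.splitlines():
        if line.strip().startswith("config:"):
            in_config = True
            continue
        if not in_config or not line.strip():
            continue
        parts = line.split()
        if len(parts) < 2 or parts[1] not in STATES:
            continue  # not a device row
        indent = len(line) - len(line.lstrip())
        rows.append((indent, parts[0], parts[1], " ".join(parts[2:])))
    return rows


def leaf_members(rows):
    """Treat the most deeply indented rows as the member devices."""
    if not rows:
        return []
    deepest = max(r[0] for r in rows)
    return [r for r in rows if r[0] == deepest]


if __name__ == "__main__":
    leaves = leaf_members(parse_pool_config(SAMPLE))
    for _, name, state, note in leaves:
        encrypted_name = name.endswith(".eli")
        print(f"{state:8} eli={encrypted_name!s:5} {name} {note}")
    missing = [r for r in leaves if r[2] != "ONLINE"]
    # RAIDZ1 can survive one missing member, not two.
    print(f"{len(missing)} of {len(leaves)} members not ONLINE")
```

On the sample it reports two of four members not ONLINE. It also shows that the two ONLINE members appear under `.eli` names, while the two UNAVAIL members are listed only by numeric ID.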


I swapped cables between the "good" and "bad" drives, but the same two drives are still UNAVAIL; the fault "moved" with them.
The SMART tests (short offline), though, show no problems as far as I can tell.

The "new" UNAVAIL drive:

Code:
smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.0-STABLE amd64] (local build)                                                             
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org                                                         
                                                                                                                                    
=== START OF INFORMATION SECTION ===                                                                                               
Model Family:     Western Digital Red                                                                                               
Device Model:     WDC WD40EFRX-68N32N0                                                                                             
Serial Number:    WD-WCC7K7FCF08A                                                                                                   
LU WWN Device Id: 5 0014ee 20f2fcefb                                                                                               
Firmware Version: 82.00A82                                                                                                         
User Capacity:    4,000,787,030,016 bytes [4.00 TB]                                                                                 
Sector Sizes:     512 bytes logical, 4096 bytes physical                                                                           
Rotation Rate:    5400 rpm                                                                                                         
Form Factor:      3.5 inches                                                                                                       
Device is:        In smartctl database [for details use: -P show]                                                                   
ATA Version is:   ACS-3 T13/2161-D revision 5                                                                                       
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)                                                                           
Local Time is:    Sun Jan  8 08:50:41 2023 PST                                                                                     
SMART support is: Available - device has SMART capability.                                                                         
SMART support is: Enabled                                                                                                           
                                                                                                                                    
=== START OF READ SMART DATA SECTION ===                                                                                           
SMART overall-health self-assessment test result: PASSED                                                                           
                                                                                                                                    
General SMART Values:                                                                                                               
Offline data collection status:  (0x00) Offline data collection activity                                                           
                                        was never started.                                                                         
                                        Auto Offline Data Collection: Disabled.                                                     
Self-test execution status:      (   0) The previous self-test routine completed                                                   
                                        without error or no self-test has ever                                                     
                                        been run.                                                                                   
Total time to complete Offline                                                                                                     
data collection:                (46140) seconds.                                                                                   
Offline data collection                                                                                                             
capabilities:                    (0x7b) SMART execute Offline immediate.                                                           
                                        Auto Offline data collection on/off support.                                               
                                        Suspend Offline collection upon new                                                         
                                        command.                                                                                   
                                        Offline surface scan supported.                                                             
                                        Self-test supported.                                                                       
                                        Conveyance Self-test supported.                                                             
                                        Selective Self-test supported.                                                             
SMART capabilities:            (0x0003) Saves SMART data before entering                                                           
                                        power-saving mode.                                                                         
                                        Supports SMART auto save timer.                                                             
Error logging capability:        (0x01) Error logging supported.                                                                   
                                        General Purpose Logging supported.                                                         
Short self-test routine                                                                                                             
recommended polling time:        (   2) minutes.                                                                                   
Extended self-test routine                                                                  
recommended polling time:        ( 489) minutes.                                                                                    
Conveyance self-test routine                                                                                                        
recommended polling time:        (   5) minutes.                                                                                    
SCT capabilities:              (0x303d) SCT Status supported.                                                                      
                                        SCT Error Recovery Control supported.                                                      
                                        SCT Feature Control supported.                                                              
                                        SCT Data Table supported.                                                                  
                                                                                                                                   
SMART Attributes Data Structure revision number: 16                                                                                
Vendor Specific SMART Attributes with Thresholds:                                                                                  
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE                                    
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0                                            
  3 Spin_Up_Time            0x0027   162   162   021    Pre-fail  Always       -       6883                                        
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       56                                          
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0                                            
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0                                            
  9 Power_On_Hours          0x0032   041   041   000    Old_age   Always       -       43145                                        
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0                                            
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0                                            
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       56                                          
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       38                                          
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       179                                          
194 Temperature_Celsius     0x0022   123   107   000    Old_age   Always       -       27                                          
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0                                            
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0                                            
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0                                            
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0                                            
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0                                            
                                                                                                                                   
SMART Error Log Version: 1                                                                                                          
No Errors Logged                                                                                  

                                                                                                                                    
SMART Self-test log structure revision number 1                                                                                    
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error                                    
# 1  Short offline       Completed without error       00%     43144         -                                                      
# 2  Short captive       Interrupted (host reset)      90%     43144         -                                                      
# 3  Short captive       Interrupted (host reset)      90%     43143         -                                                      
# 4  Short captive       Interrupted (host reset)      90%     43143         -                                                      
# 5  Short offline       Completed without error       00%     41510         -                                                      
# 6  Short offline       Completed without error       00%     41342         -                                                      
# 7  Short offline       Completed without error       00%     41174         -                                                      
# 8  Short offline       Completed without error       00%     41006         -                                                      
# 9  Short offline       Completed without error       00%     40839         -                                                      
#10  Short offline       Completed without error       00%     40671         -                                                      
#11  Short offline       Completed without error       00%     40503         -                                                      
#12  Short offline       Completed without error       00%     40335         -                                                      
#13  Short offline       Completed without error       00%     40167         -                                                      
#14  Short offline       Completed without error       00%     39999         -                                                      
#15  Short offline       Completed without error       00%     39832         -                                                      
#16  Short offline       Completed without error       00%     39664         -                                                      
#17  Short offline       Completed without error       00%     39532         -                                                      
#18  Short offline       Completed without error       00%     39364         -                                                      
#19  Short offline       Completed without error       00%     39197         -                                                      
#20  Short offline       Completed without error       00%     39029         -                                                      
#21  Short offline       Completed without error       00%     38861         -                                                      
                                                                                                                                   
SMART Selective self-test log data structure revision number 1                                                                      
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS                                                                                        
    1        0        0  Not_testing                                                                                                
    2        0        0  Not_testing                                                                                                
    3        0        0  Not_testing                                                                                                
    4        0        0  Not_testing                                                                                                
    5        0        0  Not_testing                                                                                                
Selective self-test flags (0x0):                                                                                                    
  After scanning selected spans, do NOT read-scan remainder of disk.                                                                
If Selective self-test is pending on power-up, resume after 0 minute delay.                


    


Drive that went UNAVAIL before Christmas:

Code:
smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.0-STABLE amd64] (local build)                                                             
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org                                                         
                                                                                                                                    
=== START OF INFORMATION SECTION ===                                                                                               
Model Family:     Western Digital Red                                                                                               
Device Model:     WDC WD40EFRX-68N32N0                                                                                             
Serial Number:    WD-WCC7K5ZJKAC9                                                                                                   
LU WWN Device Id: 5 0014ee 20f2fd4d1                                                                                               
Firmware Version: 82.00A82                                                                                                         
User Capacity:    4,000,787,030,016 bytes [4.00 TB]                                                                                 
Sector Sizes:     512 bytes logical, 4096 bytes physical                                                                           
Rotation Rate:    5400 rpm                                                                                                         
Form Factor:      3.5 inches                                                                                                       
Device is:        In smartctl database [for details use: -P show]                                                                   
ATA Version is:   ACS-3 T13/2161-D revision 5                                                                                       
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)                                                                           
Local Time is:    Sun Jan  8 08:54:07 2023 PST                                                                                     
SMART support is: Available - device has SMART capability.                                                                         
SMART support is: Enabled                                                                                                           
                                                                                                                                    
=== START OF READ SMART DATA SECTION ===                                                                                           
SMART overall-health self-assessment test result: PASSED                                                                           
                                                                                                                                    
General SMART Values:                                                                                                               
Offline data collection status:  (0x00) Offline data collection activity                                                           
                                        was never started.                                                                         
                                        Auto Offline Data Collection: Disabled.                                                     
Self-test execution status:      (   0) The previous self-test routine completed                                                   
                                        without error or no self-test has ever                                                     
                                        been run.                                                                                   
Total time to complete Offline                                                                                                     
data collection:                (45180) seconds.                                                                                   
Offline data collection                                                                                                             
capabilities:                    (0x7b) SMART execute Offline immediate.                                                           
                                        Auto Offline data collection on/off support.                                               
                                        Suspend Offline collection upon new                                                         
                                        command.                                                                                   
                                        Offline surface scan supported.                                                             
                                        Self-test supported.                                                                       
                                        Conveyance Self-test supported.                                                             
                                        Selective Self-test supported.                                                             
SMART capabilities:            (0x0003) Saves SMART data before entering                                                           
                                        power-saving mode.                                                                         
                                        Supports SMART auto save timer.                                                             
Error logging capability:        (0x01) Error logging supported.                                                                   
                                        General Purpose Logging supported.                                                         
Short self-test routine                                                                                                             
recommended polling time:        (   2) minutes.                                                                                   
Extended self-test routine                                                              
recommended polling time:        ( 480) minutes.                                                                                    
Conveyance self-test routine                                                                                                        
recommended polling time:        (   5) minutes.                                                                                    
SCT capabilities:              (0x303d) SCT Status supported.                                                                      
                                        SCT Error Recovery Control supported.                                                      
                                        SCT Feature Control supported.                                                              
                                        SCT Data Table supported.                                                                  
                                                                                                                                   
SMART Attributes Data Structure revision number: 16                                                                                
Vendor Specific SMART Attributes with Thresholds:                                                                                  
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE                                    
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0                                            
  3 Spin_Up_Time            0x0027   164   163   021    Pre-fail  Always       -       6791                                        
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       48                                          
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0                                            
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0                                            
  9 Power_On_Hours          0x0032   040   040   000    Old_age   Always       -       44416                                        
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0                                            
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0                                            
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       48                                          
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       34                                          
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       69                                          
194 Temperature_Celsius     0x0022   120   106   000    Old_age   Always       -       30                                          
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0                                            
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0                                            
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0                                            
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0                                            
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0                                            
                                                                                                                                   
SMART Error Log Version: 1                                                                                                          
No Errors Logged                                                          

SMART Self-test log structure revision number 1                                                                                     
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error                                    
# 1  Short offline       Completed without error       00%     44415         -                                                      
# 2  Short captive       Interrupted (host reset)      90%     44415         -                                                      
# 3  Short captive       Interrupted (host reset)      90%     44413         -                                                      
# 4  Short offline       Completed without error       00%     41682         -                                                      
# 5  Short offline       Completed without error       00%     41514         -                                                      
# 6  Short offline       Completed without error       00%     41346         -                                                      
# 7  Short offline       Completed without error       00%     41179         -                                                      
# 8  Short offline       Completed without error       00%     41011         -                                                      
# 9  Short offline       Completed without error       00%     40843         -                                                      
#10  Short offline       Completed without error       00%     40675         -                                                      
#11  Short offline       Completed without error       00%     40507         -                                                      
#12  Short offline       Completed without error       00%     40340         -                                                      
#13  Short offline       Completed without error       00%     40172         -                                                      
#14  Short offline       Completed without error       00%     40004         -                                                      
#15  Short offline       Completed without error       00%     39836         -                                                      
#16  Short offline       Completed without error       00%     39668         -                                                      
#17  Short offline       Completed without error       00%     39537         -                                                      
#18  Short offline       Completed without error       00%     39369         -                                                      
#19  Short offline       Completed without error       00%     39201         -                                                      
#20  Short offline       Completed without error       00%     39033         -                                                      
#21  Short offline       Completed without error       00%     38865         -                                                      
                                                                                                                                   
SMART Selective self-test log data structure revision number 1                                                                      
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS                                                                                        
    1        0        0  Not_testing                                                                                                
    2        0        0  Not_testing                                                                                                
    3        0        0  Not_testing                                                                                                
    4        0        0  Not_testing                                                                                                
    5        0        0  Not_testing                                                                                                
Selective self-test flags (0x0):                                                                                                    
  After scanning selected spans, do NOT read-scan remainder of disk.                                                                
If Selective self-test is pending on power-up, resume after 0 minute delay.          

  


I have the encryption passphrase and the geli.key backed up.
Is there any chance I can get one of the drives back online? Any suggestions?
Should I run the long smart tests for further information?

Thanks in advance for any help!
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
you have discovered the reason RAIDz1 is highly discouraged, particularly on drives over 2TB.
you can only lose 1 drive but have lost 2. this is also likely to occur during rebuild but you experienced it even sooner.
RAIDz1 should only be used for data you don't care about or data you have good backups for.

ideally, restore from backups.

there are paid services but I doubt this is worth thousands of dollars to attempt to recover, especially since you encrypted it.

more SMART tests are just going to put more wear on the drives, so don't. they won't recover anything, they'll just keep telling you bad news.

there is a chance, perhaps, that if you import the pool as read-only, you *might* be able to get it online long enough to copy from it, since it won't fail the drives due to writes.
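
A minimal sketch of what a read-only import attempt could look like, written as a small Python wrapper that only prints the command unless told to run it. `-o readonly=on` and `-R` are standard `zpool import` options; the `/mnt` altroot and the helper names are assumptions for illustration. On this encrypted pool the GELI providers would have to be attached first (not shown here), and RAIDZ1 still needs at least three of the four disks visible.

```python
import shlex
import subprocess


def readonly_import_cmd(pool, altroot="/mnt"):
    """Build a `zpool import` command that imports the pool read-only."""
    # -o readonly=on : import the pool without allowing any writes to it
    # -R altroot     : mount datasets under an alternate root (assumed /mnt)
    return ["zpool", "import", "-o", "readonly=on", "-R", altroot, pool]


def run(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(shlex.join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


# Dry run: only shows what would be executed.
run(readonly_import_cmd("Bunker"))
```

Calling `run(..., dry_run=False)` as root would actually attempt the import; the dry run is the default so the command can be reviewed first.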

you also are missing your hardware and OS version, but that's forgivable since it looks like your pool is gone and that's never a good day :frown:

as you didn't post your hardware, it's difficult to make any other recommendations, but some hardware configs are dangerous as well. if you have one of those, it might be possible to switch to something reliable and get better results.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Plug the UNAVAIL drives back in (you need at least one of the two alive) and, if you can, the new disk as well. Then replace the drive from the WebUI.
If both drives have died, I hope you have a backup.
 

AlexGG

Contributor
Joined
Dec 13, 2018
Messages
171
I don't see any problem with the drives. Unless I missed something, the SMART tests are good on both?
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
AlexGG said: "I don't see any problem with the drives. Unless I missed something, the SMART tests are good on both?"
There is no long test in the logs.

From the smartctl output (FreeBSD 11) we can deduce that the OP is on FreeNAS. I have no expertise there beyond some common sense.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
AlexGG said: "I don't see any problem with the drives. Unless I missed something, the SMART tests are good on both?"
The problem with SMART tests is that the absence of an error does not always mean that the drive is ok.

It is a bit like electrocardiograms (ECGs) and heart attacks. Even if you had an ECG just 5 minutes ago and it was perfectly fine, you can still have an attack now. At least that is what my mother says her cardiologist told her.
 

birke

Cadet
Joined
Jan 8, 2023
Messages
7
Missing info:
FreeNAS-11.0-U4 (54848d13b) (a somewhat older install)
Intel(R) Pentium(R) CPU J4205 @ 1.50GHz
Asrock J4205-ITX motherboard using the 4x onboard SATA
Booting from SSD on USB

In addition: The above tests are with both old / unavail drives installed.

I actually started the longer tests before going to bed yesterday:

Drive 1:

Code:
=== START OF READ SMART DATA SECTION ===                                                                                            
SMART overall-health self-assessment test result: PASSED                                                                            
[...]                                                  

SMART Attributes Data Structure revision number: 16                                                                                 
Vendor Specific SMART Attributes with Thresholds:                                                                                  
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE                                    
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0                                            
  3 Spin_Up_Time            0x0027   162   162   021    Pre-fail  Always       -       6883                                        
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       56                                          
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0                                            
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0                                            
  9 Power_On_Hours          0x0032   041   041   000    Old_age   Always       -       43164                                        
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0                                            
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0                                            
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       56                                          
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       38                                          
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       181                                          
194 Temperature_Celsius     0x0022   124   107   000    Old_age   Always       -       26                                          
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0                                            
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0                                            
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0                                            
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0                                            
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0                       

[...]
SMART Self-test log structure revision number 1                                                                                     
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error                                     
# 1  Extended offline    Completed without error       00%     43156         -                   
...
 



Drive 2:
Code:
=== START OF READ SMART DATA SECTION ===                                                                                            
SMART overall-health self-assessment test result: PASSED                                                                            
[...]
SMART Attributes Data Structure revision number: 16                                                                                 
Vendor Specific SMART Attributes with Thresholds:                                                                                  
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE                                    
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0                                            
  3 Spin_Up_Time            0x0027   164   163   021    Pre-fail  Always       -       6791                                        
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       48                                          
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0                                            
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0                                            
  9 Power_On_Hours          0x0032   040   040   000    Old_age   Always       -       44435                                        
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0                                            
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0                                            
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       48                                          
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       34                                          
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       70                                          
194 Temperature_Celsius     0x0022   121   106   000    Old_age   Always       -       29                                          
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0                                            
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0                                            
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0                                            
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0                                            
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0                                            
                                                         
[...]

SMART Self-test log structure revision number 1                                                                                     
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error                                     
# 1  Extended offline    Completed without error       00%     44426         -                                          
...
 


Shouldn't the extended tests show problems if there really are issues with the drives?
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
- 2 x SATA3 6.0 Gb/s Connectors, support NCQ, AHCI and Hot Plug
- 2 x SATA3 6.0 Gb/s Connectors by ASMedia ASM1061, support NCQ, AHCI and Hot Plug


are the 2 non working drives connected to the ASMedia ports?

you are using hardware outside the recommended, fully compatible realm. your experience will vary.
it's a Realtek NIC as well. those are known to be terrible for servers.
 

birke

Cadet
Joined
Jan 8, 2023
Messages
7
artlessknave said: "are the 2 non working drives connected to the ASMedia ports?"

Port names indicate that they are not, and I swapped connectors on the drive side with the two working drives.
I'm not sure how to tell which ports belong to which controller; the motherboard manual isn't helpful.

The ports are labeled SATA3_1 and SATA3_2, and in the second column SATA3_A1 and SATA3_A2.

The first UNAVAIL drive /dev/ada3 is currently on port SATA3_A2, and was previously on SATA3_A1
The second UNAVAIL drive /dev/ada0 is currently on port SATA3_1 and was previously on SATA3_2.


I will definitely upgrade to a more fault-tolerant, recommended setup after this experience. Right now, though, I haven't quite lost hope of getting the volume back together, since the drives are not indicating problems. Or are my hopes misplaced and this is a lost cause? Restoring from backup isn't an option...

I could also temporarily use another motherboard (Intel X99) if that could help.
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
birke said: "the motherboard manual isn't helpful"
yup, one of the reasons for the recommended hardware

if you have swapped the connections around then it's pretty much either the drives or the controller. another motherboard/controller combination should be able to tell you which.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
birke said: "I will definitely upgrade to a more fault-tolerant, recommended setup after this experience. Right now, though, I haven't quite lost hope of getting the volume back together, since the drives are not indicating problems. Or are my hopes misplaced and this is a lost cause? Restoring from backup isn't an option..."
If you are able to get even one of them up you can bring back the pool: the good news is that they look healthy so it's not a hopeless endeavour; the bad news is that you have to troubleshoot cables and motherboard.
 

birke

Cadet
Joined
Jan 8, 2023
Messages
7
Good news, I got the pool back online on the other motherboard and am currently backing up data!

A few minutes into the backup, I got a critical alert, "One or more devices has experienced an unrecoverable error", which seems to come from checksum errors on the same drive that had issues last year:

Code:
root@truenas[~]# zpool status
  pool: Bunker
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 3.12G in 00:01:03 with 0 errors on Sat Jan 14 03:39:54 2023
config:

        NAME                                                STATE     READ WRITE CKSUM
        Bunker                                              ONLINE       0     0     0
          raidz1-0                                          ONLINE       0     0     0
            gptid/1a28e9dd-da2a-11e7-8037-7085c24bc353.eli  ONLINE       0     0     0
            gptid/1d2b4490-da2a-11e7-8037-7085c24bc353.eli  ONLINE       0     0    67
            gptid/2010cbfe-da2a-11e7-8037-7085c24bc353.eli  ONLINE       0     0     0
            gptid/242f9af5-da2a-11e7-8037-7085c24bc353.eli  ONLINE       0     0     0

errors: No known data errors

  pool: boot-pool
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          ada4p2    ONLINE       0     0     0

errors: No known data errors
root@truenas[~]#


20 minutes later:

Code:
        NAME                                                STATE     READ WRITE CKSUM
        Bunker                                              ONLINE       0     0     0
          raidz1-0                                          ONLINE       0     0     0
            gptid/1a28e9dd-da2a-11e7-8037-7085c24bc353.eli  ONLINE       0     0     0
            gptid/1d2b4490-da2a-11e7-8037-7085c24bc353.eli  ONLINE       0     0   723
            gptid/2010cbfe-da2a-11e7-8037-7085c24bc353.eli  ONLINE       0     0     0
            gptid/242f9af5-da2a-11e7-8037-7085c24bc353.eli  ONLINE       0     0     0


System is currently:
TrueNAS-13.0-U3.1
Intel(R) Core(TM) i7-5930K CPU
Asus X99-Deluxe motherboard, drives connected to the Intel X99 controller

Everything except the drives is different, including cables and PSU.

I assume replacing the drive that is throwing errors is still the way to go?
I did not perform any SMART self-tests yet.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
I would:
- zpool clear Bunker
- zpool scrub Bunker
- run a long SMART test on the disk behind gptid/1d2b4490-da2a-11e7-8037-7085c24bc353.eli

Then assess the situation: if anything looks out of the norm, replace the drive.
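
For the "assess the situation" step, here is a rough sketch of how the READ/WRITE/CKSUM counters could be pulled out of saved `zpool status` text. The function names are made up for illustration, and the K/M/G/T multipliers are assumed to be binary multiples; for a zero versus non-zero check the exact factor does not matter. The sample rows are taken from the output posted above.

```python
import re

# Assumed multipliers for abbreviated counters such as "315K".
_SUFFIX = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_count(text):
    """Convert a zpool counter such as '67' or '315K' to an approximate int."""
    m = re.fullmatch(r"(\d+(?:\.\d+)?)([KMGT]?)", text)
    if not m:
        raise ValueError(f"unrecognised counter: {text!r}")
    return int(float(m.group(1)) * _SUFFIX.get(m.group(2), 1))


def devices_with_errors(status_text):
    """Return {device: (read, write, cksum)} for rows with a non-zero counter."""
    bad = {}
    for line in status_text.splitlines():
        fields = line.split()
        # Device rows have exactly five columns: NAME STATE READ WRITE CKSUM
        if len(fields) != 5 or fields[0] == "NAME":
            continue
        name, _state, *raw = fields
        try:
            counts = tuple(parse_count(c) for c in raw)
        except ValueError:
            continue  # five words, but not a counter row
        if any(counts):
            bad[name] = counts
    return bad


SAMPLE = """
        NAME                                                STATE     READ WRITE CKSUM
        Bunker                                              ONLINE       0     0     0
          raidz1-0                                          ONLINE       0     0     0
            gptid/1a28e9dd-da2a-11e7-8037-7085c24bc353.eli  ONLINE       0     0     0
            gptid/1d2b4490-da2a-11e7-8037-7085c24bc353.eli  ONLINE       0     0   723
errors: No known data errors
"""

for dev, (rd, wr, ck) in devices_with_errors(SAMPLE).items():
    print(f"{dev}: read={rd} write={wr} cksum={ck}")
```

Running it again after a `zpool clear` and a scrub shows whether the counters are still growing, which is the signal to replace the drive or look at the cabling.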
 

birke

Cadet
Joined
Jan 8, 2023
Messages
7
I did as advised. The scrub repaired a bunch of faults with 0 errors. I cleared again, then ran the long SMART test, which also completed without errors.

Code:
root@truenas[~]# zpool status
  pool: Bunker
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 12.8G in 06:04:45 with 0 errors on Sat Jan 14 13:00:02 2023
config:

        NAME                                                STATE     READ WRITE CKSUM
        Bunker                                              ONLINE       0     0     0
          raidz1-0                                          ONLINE       0     0     0
            gptid/1a28e9dd-da2a-11e7-8037-7085c24bc353.eli  ONLINE       0     0     0
            gptid/1d2b4490-da2a-11e7-8037-7085c24bc353.eli  ONLINE       0     0  315K
            gptid/2010cbfe-da2a-11e7-8037-7085c24bc353.eli  ONLINE       0     0     0
            gptid/242f9af5-da2a-11e7-8037-7085c24bc353.eli  ONLINE       0     0     0

errors: No known data errors


I cleared again, then performed the smart test:

Code:
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p2 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD40EFRX-68N32N0
Serial Number:    WD-WCC7K7FCF08A
LU WWN Device Id: 5 0014ee 20f2fcefb
Firmware Version: 82.00A82
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Jan 15 03:30:58 2023 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (46140) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 489) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x303d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   162   162   021    Pre-fail  Always       -       6900
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       57
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   041   041   000    Old_age   Always       -       43194
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       57
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       38
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       187
194 Temperature_Celsius     0x0022   113   106   000    Old_age   Always       -       37
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     43188         -
# 2  Extended offline    Completed without error       00%     43156         -
# 3  Extended captive    Interrupted (host reset)      90%     43147         -
# 4  Short offline       Completed without error       00%     43144         -
# 5  Short captive       Interrupted (host reset)      90%     43144         -
# 6  Short captive       Interrupted (host reset)      90%     43143         -
# 7  Short captive       Interrupted (host reset)      90%     43143         -
# 8  Short offline       Completed without error       00%     41510         -
# 9  Short offline       Completed without error       00%     41342         -
#10  Short offline       Completed without error       00%     41174         -
#11  Short offline       Completed without error       00%     41006         -
#12  Short offline       Completed without error       00%     40839         -
#13  Short offline       Completed without error       00%     40671         -
#14  Short offline       Completed without error       00%     40503         -
#15  Short offline       Completed without error       00%     40335         -
#16  Short offline       Completed without error       00%     40167         -
#17  Short offline       Completed without error       00%     39999         -
#18  Short offline       Completed without error       00%     39832         -
#19  Short offline       Completed without error       00%     39664         -
#20  Short offline       Completed without error       00%     39532         -
#21  Short offline       Completed without error       00%     39364         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


Can I assume the drives are OK or are there further tests I can do?
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Did you clear before or after the scrub?

The drive looks fine from the smart data.
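
As a rough illustration of what "looks fine" means here, the sketch below pulls the raw values of a few attributes that are commonly watched (reallocated, pending and offline-uncorrectable sectors, plus the interface CRC counter) out of saved `smartctl` attribute text. The choice of attributes and the function name are assumptions, not an official health check, and zeros here do not prove the drive is healthy. The sample rows are copied from the output posted above.

```python
# Attribute IDs commonly watched for early signs of trouble (an assumed selection).
WATCHED = {5, 197, 198, 199}


def watched_raw_values(smart_text):
    """Return {attribute_name: raw_value} for the watched attribute IDs."""
    found = {}
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        if int(fields[0]) in WATCHED:
            found[fields[1]] = int(fields[9])
    return found


SAMPLE = """
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   041   041   000    Old_age   Always       -       43194
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

for name, raw in watched_raw_values(SAMPLE).items():
    print(f"{name}: {raw}")
```

A non-zero UDMA_CRC_Error_Count would point towards the SATA link or cable rather than the platters, which is why it is included.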
 

birke

Cadet
Joined
Jan 8, 2023
Messages
7
Davvo said: "Did you clear before or after the scrub?"
Both. The above output is from after the scrub (before clearing again and starting smartctl). Since the displayed faults were in the thousands, I wanted to clear to see new / individual faults.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
If you get more checksum errors, replace the drive.
 

birke

Cadet
Joined
Jan 8, 2023
Messages
7
scrub finished with no further problems detected.
Any further testing I should do for the drives, or can I assume they are good and I just need to get them back into a more reliable system?


Code:
  pool: Bunker
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 05:52:52 with 0 errors on Sun Jan 15 13:02:25 2023
config:

        NAME                                                STATE     READ WRITE CKSUM
        Bunker                                              ONLINE       0     0     0
          raidz1-0                                          ONLINE       0     0     0
            gptid/1a28e9dd-da2a-11e7-8037-7085c24bc353.eli  ONLINE       0     0     0
            gptid/1d2b4490-da2a-11e7-8037-7085c24bc353.eli  ONLINE       0     0     0
            gptid/2010cbfe-da2a-11e7-8037-7085c24bc353.eli  ONLINE       0     0     0
            gptid/242f9af5-da2a-11e7-8037-7085c24bc353.eli  ONLINE       0     0     0

errors: No known data errors
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Checksum errors are often the result of dodgy cables. Try replacing the SATA cable on the drive generating the checksum errors.
 