Disk replaced for upgrade, now zpool degraded after resilvering

vidjcb · Jan 5, 2019

Hello,

I had to replace a faulty 3TB drive in a 4-disk pool. I did it per the manual using a new 10TB disk, and resilvering went fine.
My intention was then to replace the remaining 3TB disks one by one with 10TB disks. To avoid confusion, I powered off the box, swapped the first disk, powered back on, and from the GUI added the new 10TB disk to the pool to start resilvering.

After a long 5-hour wait, I found the pool had ended up in a degraded state:

Code:

zpool status -v                                                                                             
  pool: Volume001                                                                                                             
state: DEGRADED                                                                                                               
status: One or more devices has experienced an error resulting in data                                                         
        corruption.  Applications may be affected.                                                                             
action: Restore the file in question if possible.  Otherwise restore the                                                       
        entire pool from backup.                                                                                               
   see: http://illumos.org/msg/ZFS-8000-8A                                                                                     
  scan: scrub in progress since Fri Jan  4 16:07:38 2019                                                                       
        3.44T scanned at 3.83G/s, 427G issued at 476M/s, 8.23T total                                                           
        0 repaired, 5.07% done, 0 days 04:46:51 to go                                                                         
config:                                                                                                                       
                                                                                                                              
        NAME                                            STATE     READ WRITE CKSUM                                             
        Volume001                                       DEGRADED     0     0    49                                             
          raidz1-0                                      DEGRADED     0     0    98                                             
            gptid/dbc00018-1029-11e9-a5ca-94f12895b13c  ONLINE       0     0     0                                             
            gptid/b053e7c9-e8af-11e7-b707-94f12895b13c  DEGRADED     0     0     0  too many errors                           
            gptid/b15ed11f-e8af-11e7-b707-94f12895b13c  DEGRADED     0     0     0  too many errors                           
            gptid/a9248c05-0604-11e9-9ed5-94f12895b13c  DEGRADED     0     0     0  too many errors                           
                                                                                                                              
errors: Permanent errors have been detected in the following files:                                                           
                                                                                                                              
        <metadata>:<0x0>                                                                                                       
                                                                                                                              
  pool: freenas-boot                                                                                                           
state: ONLINE                                                                                                                 
  scan: scrub repaired 0 in 0 days 00:00:50 with 0 errors on Fri Jan  5 03:45:50 2018                                         
config:                                                                                                                       
                                                                                                                              
        NAME        STATE     READ WRITE CKSUM                                                                                 
        freenas-boot  ONLINE       0     0     0                                                                               
          da0p2     ONLINE       0     0     0                                                                                 
                                                                                                                              
errors: No known data errors

After that I tried:

- Rebooting and waiting through another 5h resilver. Nothing changed.
- Running zpool clear Volume001 and waiting through yet another 5h resilver, but the pool is still degraded.

I also tried:

Code:

glabel status                                                                                                 
                                      Name  Status  Components                                                                  
                              label/efibsd     N/A  da0p1                                                                       
gptid/27cf8ad3-e8ad-11e7-aa15-94f12895b13c     N/A  da0p1                                                                       
gptid/b053e7c9-e8af-11e7-b707-94f12895b13c     N/A  ada0p2                                                                      
gptid/b15ed11f-e8af-11e7-b707-94f12895b13c     N/A  ada1p2                                                                      
gptid/a9248c05-0604-11e9-9ed5-94f12895b13c     N/A  ada2p2                                                                      
gptid/dbc00018-1029-11e9-a5ca-94f12895b13c     N/A  ada3p2

Apparently there is no "ghost" disk present.
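
One way to double-check the "no ghost disk" observation is to cross-reference the gptid names listed in the pool against the labels that glabel reports. The following is only a minimal Python sketch, not part of the FreeNAS tooling: the function names parse_glabel and pool_members are made up for illustration, and the sample text is copied from the glabel status and zpool status output above.

```python
import re

# Sample text copied from the outputs posted above.
GLABEL = """\
                                      Name  Status  Components
                              label/efibsd     N/A  da0p1
gptid/27cf8ad3-e8ad-11e7-aa15-94f12895b13c     N/A  da0p1
gptid/b053e7c9-e8af-11e7-b707-94f12895b13c     N/A  ada0p2
gptid/b15ed11f-e8af-11e7-b707-94f12895b13c     N/A  ada1p2
gptid/a9248c05-0604-11e9-9ed5-94f12895b13c     N/A  ada2p2
gptid/dbc00018-1029-11e9-a5ca-94f12895b13c     N/A  ada3p2
"""

POOL = """\
            gptid/dbc00018-1029-11e9-a5ca-94f12895b13c  ONLINE       0     0     0
            gptid/b053e7c9-e8af-11e7-b707-94f12895b13c  DEGRADED     0     0     0  too many errors
            gptid/b15ed11f-e8af-11e7-b707-94f12895b13c  DEGRADED     0     0     0  too many errors
            gptid/a9248c05-0604-11e9-9ed5-94f12895b13c  DEGRADED     0     0     0  too many errors
"""


def parse_glabel(text):
    """Map each label (e.g. 'gptid/...') to its component (e.g. 'ada0p2')."""
    labels = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows have three columns: name, status, component.
        # The header row has no '/' in its first column, so it is skipped.
        if len(parts) == 3 and "/" in parts[0]:
            labels[parts[0]] = parts[2]
    return labels


def pool_members(status_text):
    """Return the gptid/... device names found at the start of each line."""
    return re.findall(r"^[ \t]*(gptid/\S+)", status_text, flags=re.MULTILINE)


labels = parse_glabel(GLABEL)
members = pool_members(POOL)

# Any pool member without a matching label would be a candidate "ghost".
missing = [m for m in members if m not in labels]

for m in members:
    print(m, "->", labels.get(m, "NOT FOUND"))
print("members without a label:", missing)
```

With the outputs posted here, every pool member resolves to one of ada0p2 through ada3p2, which is consistent with there being no leftover device in the pool.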

At this point I kindly ask for a guru's help; I would be more than grateful for it.

Thanks in advance!

Best,

VIDJCB

kdragon75 · Jan 5, 2019

Let's start by confirming which part of the manual you used. Please link to the section that you followed for replacing the disk. From there, let's get some details about your hardware and zpool. Please provide the output of the following two commands.
lspci
zpool status -v

vidjcb · Jan 5, 2019

OK. When the first disk failed, I followed section 8.1.10: Replacing a Failed Drive. Everything went fine in that part.

When I started replacing the next 3TB disk with a 10TB one, I just shut down the box without offlining the disk first. I replaced it via the GUI after powering the system back on with the new disk inserted.

Code:

lspci                                                                                                            
00:00.0 Host bridge: Intel Corporation Device 5918 (rev 05)                                                                        
00:14.0 USB controller: Intel Corporation Sunrise Point-H USB 3.0 xHCI Controller (rev 31)                                        
00:16.0 Communication controller: Intel Corporation Sunrise Point-H CSME HECI #1 (rev 31)                                          
00:17.0 RAID bus controller: Intel Corporation Device a106 (rev 31)                                                                
00:1c.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #1 (rev f1)                                            
00:1c.4 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #5 (rev f1)                                            
00:1d.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #9 (rev f1)                                            
00:1d.2 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #11 (rev f1)                                          
00:1d.3 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #12 (rev f1)                                          
00:1f.0 ISA bridge: Intel Corporation Sunrise Point-H LPC Controller (rev 31)                                                      
00:1f.2 Memory controller: Intel Corporation Sunrise Point-H PMC (rev 31)                                                          
00:1f.4 SMBus: Intel Corporation Sunrise Point-H SMBus (rev 31)                                                                    
01:00.0 System peripheral: Hewlett-Packard Company Integrated Lights-Out Standard Slave Instrumentation & System Support (rev 06)  
01:00.1 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200EH (rev 01)                                            
01:00.2 System peripheral: Hewlett-Packard Company Integrated Lights-Out Standard Management Processor Support and Messaging (rev 06)
01:00.4 USB controller: Hewlett-Packard Company Integrated Lights-Out Standard Virtual USB Controller (rev 03)                    
01:00.7 Unassigned class [ff00]: Hewlett-Packard Company Device 193f                                                              
02:00.0 Ethernet controller: Broadcom Limited NetXtreme BCM5720 Gigabit Ethernet PCIe                                              
02:00.1 Ethernet controller: Broadcom Limited NetXtreme BCM5720 Gigabit Ethernet PCIe                                              
[root@freenas ~]#

The last thing I tried was scrubbing the pool, but with roughly 40 minutes left to complete, nothing has changed:

Code:

zpool status -v                                                                                                  
  pool: Volume001                                                                                                                  
state: DEGRADED                                                                                                                  
status: One or more devices has experienced an error resulting in data                                                            
        corruption.  Applications may be affected.                                                                                
action: Restore the file in question if possible.  Otherwise restore the                                                          
        entire pool from backup.                                                                                                  
   see: http://illumos.org/msg/ZFS-8000-8A                                                                                        
  scan: scrub in progress since Sat Jan  5 08:28:56 2019                                                                          
        8.23T scanned at 475M/s, 7.20T issued at 415M/s, 8.23T total                                                              
        28K repaired, 87.48% done, 0 days 00:43:21 to go                                                                          
config:                                                                                                                            
                                                                                                                                   
        NAME                                            STATE     READ WRITE CKSUM                                                
        Volume001                                       DEGRADED     0     0    98                                                
          raidz1-0                                      DEGRADED     0     0   196                                                
            gptid/dbc00018-1029-11e9-a5ca-94f12895b13c  ONLINE       0     0     0                                                
            gptid/b053e7c9-e8af-11e7-b707-94f12895b13c  DEGRADED     0     0     0  too many errors                                
            gptid/b15ed11f-e8af-11e7-b707-94f12895b13c  DEGRADED     0     0     0  too many errors                                
            gptid/a9248c05-0604-11e9-9ed5-94f12895b13c  DEGRADED     0     0     7  too many errors  (repairing)                  
                                                                                                                                   
errors: Permanent errors have been detected in the following files:                                                                
                                                                                                                                   
        <metadata>:<0x0>                                                                                                          
                                                                                                                                   
  pool: freenas-boot                                                                                                              
state: ONLINE                                                                                                                    
  scan: scrub repaired 0 in 0 days 00:05:15 with 0 errors on Sat Jan  5 03:50:15 2019                                              
config:                                                                                                                            
                                                                                                                                   
        NAME        STATE     READ WRITE CKSUM                                                                                    
        freenas-boot  ONLINE       0     0     0                                                                                  
          da0p2     ONLINE       0     0     0                                                                                    
                                                                                                                                   
errors: No known data errors                                                                                                      
[root@freenas ~]#

UPDATE:

The scrub ended after just over 6h, but things look pretty much the same:

Code:

zpool status -v                                                                                                   
  pool: Volume001                                                                                                                   
 state: DEGRADED                                                                                                                   
status: One or more devices has experienced an error resulting in data                                                             
        corruption.  Applications may be affected.                                                                                 
action: Restore the file in question if possible.  Otherwise restore the                                                           
        entire pool from backup.                                                                                                   
   see: http://illumos.org/msg/ZFS-8000-8A                                                                                         
  scan: scrub repaired 28K in 0 days 06:04:44 with 49 errors on Sat Jan  5 14:33:40 2019                                           
config:                                                                                                                             
                                                                                                                                    
        NAME                                            STATE     READ WRITE CKSUM                                                 
        Volume001                                       DEGRADED     0     0    98                                                 
          raidz1-0                                      DEGRADED     0     0   196                                                 
            gptid/dbc00018-1029-11e9-a5ca-94f12895b13c  ONLINE       0     0     0                                                 
            gptid/b053e7c9-e8af-11e7-b707-94f12895b13c  DEGRADED     0     0     0  too many errors                                 
            gptid/b15ed11f-e8af-11e7-b707-94f12895b13c  DEGRADED     0     0     0  too many errors                                 
            gptid/a9248c05-0604-11e9-9ed5-94f12895b13c  DEGRADED     0     0     7  too many errors                                 
                                                                                                                                    
errors: Permanent errors have been detected in the following files:                                                                 
                                                                                                                                    
        <metadata>:<0x0>                                                                                                           
                                                                                                                                    
  pool: freenas-boot                                                                                                               
 state: ONLINE                                                                                                                     
  scan: scrub repaired 0 in 0 days 00:05:15 with 0 errors on Sat Jan  5 03:50:15 2019                                               
config:                                                                                                                             
                                                                                                                                    
        NAME        STATE     READ WRITE CKSUM                                                                                     
        freenas-boot  ONLINE       0     0     0                                                                                   
          da0p2     ONLINE       0     0     0                                                                                     
                                                                                                                                    
errors: No known data errors                                                                                                       
[root@freenas ~]#

Thanks,

vidjcb · Jan 8, 2019

Interestingly, once resilvering completed, the only ONLINE disk is the one I replaced:

Code:

zpool status -v                                                                                                   
  pool: Volume001                                                                                                                   
 state: DEGRADED                                                                                                                   
status: One or more devices has experienced an error resulting in data                                                             
        corruption.  Applications may be affected.                                                                                 
action: Restore the file in question if possible.  Otherwise restore the                                                           
        entire pool from backup.                                                                                                   
   see: http://illumos.org/msg/ZFS-8000-8A                                                                                         
  scan: resilvered 1.39T in 0 days 04:06:24 with 49 errors on Tue Jan  8 10:45:25 2019                                             
config:                                                                                                                             
                                                                                                                                    
        NAME                                            STATE     READ WRITE CKSUM                                                 
        Volume001                                       DEGRADED     0     0    49                                                 
          raidz1-0                                      DEGRADED     0     0    98                                                 
            gptid/dbc00018-1029-11e9-a5ca-94f12895b13c  ONLINE       0     0     0                                                 
            gptid/b053e7c9-e8af-11e7-b707-94f12895b13c  DEGRADED     0     0     0  too many errors                                 
            gptid/b15ed11f-e8af-11e7-b707-94f12895b13c  DEGRADED     0     0     0  too many errors                                 
            gptid/a9248c05-0604-11e9-9ed5-94f12895b13c  DEGRADED     0     0     0  too many errors                                 
                                                                                                                                    
errors: Permanent errors have been detected in the following files:                                                                 
                                                                                                                                    
        <metadata>:<0x0>                                                                                                           
                                                                                                                                    
  pool: freenas-boot                                                                                                               
 state: ONLINE                                                                                                                     
  scan: scrub repaired 0 in 0 days 00:05:15 with 0 errors on Sat Jan  5 03:50:15 2019                                               
config:                                                                                                                             
                                                                                                                                    
        NAME        STATE     READ WRITE CKSUM                                                                                     
        freenas-boot  ONLINE       0     0     0                                                                                   
          da0p2     ONLINE       0     0     0                                                                                     
                                                                                                                                    
errors: No known data errors

I really hope someone can help me with this. Meanwhile, I am moving as much data as I can off the volume in case I need to start everything over!
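
Since several zpool status outputs are posted above, it can help to compare them mechanically rather than by eye. The following is a minimal Python sketch under stated assumptions: parse_config is a hypothetical helper (not a ZFS API), it assumes the error counters are plain integers as in the outputs here, and the sample text is the config table from the Jan 8 output.

```python
# Config table copied from the Jan 8 `zpool status -v` output above.
STATUS = """\
config:

        NAME                                            STATE     READ WRITE CKSUM
        Volume001                                       DEGRADED     0     0    49
          raidz1-0                                      DEGRADED     0     0    98
            gptid/dbc00018-1029-11e9-a5ca-94f12895b13c  ONLINE       0     0     0
            gptid/b053e7c9-e8af-11e7-b707-94f12895b13c  DEGRADED     0     0     0  too many errors
            gptid/b15ed11f-e8af-11e7-b707-94f12895b13c  DEGRADED     0     0     0  too many errors
            gptid/a9248c05-0604-11e9-9ed5-94f12895b13c  DEGRADED     0     0     0  too many errors

errors: Permanent errors have been detected in the following files:
"""


def parse_config(status_text):
    """Return (name, state, read, write, cksum) tuples from the config table."""
    rows = []
    in_config = False
    for line in status_text.splitlines():
        parts = line.split()
        if parts[:2] == ["NAME", "STATE"]:
            in_config = True  # header row found, data rows follow
            continue
        if in_config:
            if len(parts) < 5:
                break  # a blank or short line ends the table
            name, state, read, write, cksum = parts[:5]
            # Assumption: counters are plain integers, as in this thread.
            rows.append((name, state, int(read), int(write), int(cksum)))
    return rows


rows = parse_config(STATUS)
online = [name for name, state, *_ in rows if state == "ONLINE"]
not_online = [name for name, state, *_ in rows if state != "ONLINE"]
cksum_by_name = {name: cksum for name, _, _, _, cksum in rows}

print("ONLINE:", online)
print("not ONLINE:", not_online)
print("checksum errors:", cksum_by_name)
```

Run against each posted output in turn, this makes it easy to see which devices changed state and how the CKSUM column moved between the scrub and the resilver.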

Thanks,

kdragon75 · Jan 8, 2019

Are you running SMART checks? How are the drives cabled? I would start by checking the SMART status of the drives.
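
As a rough illustration of what "checking the SMART status" can look at, here is a minimal Python sketch that scans the attribute table printed by smartctl and reports watched attributes with a non-zero raw value. The helper name flag_attributes and the choice of watched IDs (5, 197, 198, 199) are assumptions made for this example, not an official list; the sample rows are taken from the smartctl output posted later in this thread.

```python
# Rows in the format of the smartctl attribute table; values are taken
# from the output for /dev/ada2 posted later in this thread.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       319
"""

# Assumption: attribute IDs worth a first look (reallocated, pending,
# uncorrectable sectors, and interface CRC errors).
WATCH = {5, 197, 198, 199}


def flag_attributes(table_text, watch=WATCH):
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    flagged = {}
    for line in table_text.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header row starts with 'ID#', so it is skipped.
        if len(parts) >= 10 and parts[0].isdigit():
            attr_id, name, raw = int(parts[0]), parts[1], parts[9]
            if attr_id in watch and raw.isdigit() and int(raw) > 0:
                flagged[name] = int(raw)
    return flagged


flagged = flag_attributes(SAMPLE)
print(flagged)
```

A non-zero UDMA_CRC_Error_Count is commonly associated with problems on the link between drive and controller (cable, connector, backplane) rather than with the platters themselves, which is why the cabling question matters here.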

vidjcb · Jan 8, 2019

OK, I:

- Powered off the box and checked all the cables; all were re-seated and plugged back in.
- Ran SMART tests on all disks and, to my surprise, the first disk that I replaced (and had tested before) reported errors:

Code:

smartctl -a /dev/ada2 | more

smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.1-STABLE amd64] (local build)                                                             
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org                                                         
                                                                                                                                    
=== START OF INFORMATION SECTION ===                                                                                               
Device Model:     WDC WD100EMAZ-00WJTA0                                                                                             
Serial Number:    2YHGJ02D                                                                                                         
LU WWN Device Id: 5 000cca 273d4b0c5                                                                                               
Firmware Version: 83.H0A83                                                                                                         
User Capacity:    10,000,831,348,736 bytes [10.0 TB]                                                                               
Sector Sizes:     512 bytes logical, 4096 bytes physical                                                                           
Rotation Rate:    5400 rpm                                                                                                         
Form Factor:      3.5 inches                                                                                                       
Device is:        Not in smartctl database [for details use: -P showall]                                                           
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4                                                                             
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)                                                                           
Local Time is:    Tue Jan  8 12:37:41 2019 -05                                                                                     
SMART support is: Available - device has SMART capability.                                                                         
SMART support is: Enabled                                                                                                           
                                                                                                                                    
=== START OF READ SMART DATA SECTION ===                                                                                           
SMART overall-health self-assessment test result: PASSED                                                                           
                                                                                                                                    
General SMART Values:                                                                                                               
Offline data collection status:  (0x80) Offline data collection activity                                                           
                                        was never started.                                                                         
                                        Auto Offline Data Collection: Enabled.                                                     
Self-test execution status:      (   0) The previous self-test routine completed                                                   
                                        without error or no self-test has ever                                                     
                                        been run.                                                                                   
Total time to complete Offline                                                                                                     
data collection:                (   93) seconds.                                                                                   
Offline data collection                                                                                                             
capabilities:                    (0x5b) SMART execute Offline immediate.                                                           
                                        Auto Offline data collection on/off support.                                               
                                        Suspend Offline collection upon new                                                         
                                        command.                                                                                   
                                        Offline surface scan supported.                                                             
                                        Self-test supported.                                                                       
                                        No Conveyance Self-test supported.                                                         
                                        Selective Self-test supported.                                                             
SMART capabilities:            (0x0003) Saves SMART data before entering                                                           
                                        power-saving mode.                                                                         
                                        Supports SMART auto save timer.                                                             
Error logging capability:        (0x01) Error logging supported.                                                                   
                                        General Purpose Logging supported.                                                         
Short self-test routine                                                                                                             
recommended polling time:        (   2) minutes.                                                                                   
Extended self-test routine                                                                                                         
recommended polling time:        (1170) minutes.

SCT capabilities:              (0x003d) SCT Status supported.                                                                       
                                        SCT Error Recovery Control supported.                                                       
                                        SCT Feature Control supported.                                                             
                                        SCT Data Table supported.                                                                   
                                                                                                                                    
SMART Attributes Data Structure revision number: 16                                                                                 
Vendor Specific SMART Attributes with Thresholds:                                                                                   
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE                                   
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0                                           
  2 Throughput_Performance  0x0004   129   129   054    Old_age   Offline      -       112                                         
  3 Spin_Up_Time            0x0007   147   147   024    Pre-fail  Always       -       448 (Average 449)                           
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       31                                           
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0                                           
  7 Seek_Error_Rate         0x000a   100   100   067    Old_age   Always       -       0                                           
  8 Seek_Time_Performance   0x0004   128   128   020    Old_age   Offline      -       18                                           
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       197                                         
 10 Spin_Retry_Count        0x0012   100   100   060    Old_age   Always       -       0                                           
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       31                                           
 22 Unknown_Attribute       0x0023   100   100   025    Pre-fail  Always       -       100                                         
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       36                                           
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       36                                           
194 Temperature_Celsius     0x0002   203   203   000    Old_age   Always       -       32 (Min/Max 19/48)                           
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0                                           
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0                                           
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0                                           
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       319                                         
                                                                                                                                    
SMART Error Log Version: 1                                                                                                         
ATA Error Count: 319 (device log contains only the most recent five errors)                                                         
        CR = Command Register [HEX]                                                                                                 
        FR = Features Register [HEX]                                                                                               
        SC = Sector Count Register [HEX]                                                                                           
        SN = Sector Number Register [HEX]                                                                                           
        CL = Cylinder Low Register [HEX]                                                                                           
        CH = Cylinder High Register [HEX]                                                                                           
        DH = Device/Head Register [HEX]                                                                                             
        DC = Device Command Register [HEX]                                                                                         
        ER = Error register [HEX]                                                                                                   
        ST = Status register [HEX]                                                                                                 
Powered_Up_Time is measured from power on, and printed as                                                                           
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,                                                                               
SS=sec, and sss=millisec. It "wraps" after 49.710 days.                                                                             
                                                                                                                                    
Error 319 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)                                                           
  When the command that caused the error occurred, the device was active or idle.                                                   
                                                                                                                                    
  After command completion occurred, registers were:                                                                               
  ER ST SC SN CL CH DH                                                                                                             
  -- -- -- -- -- -- --
  84 51 00 07 08 00 40  Error: ICRC, ABRT at LBA = 0x00000807 = 2055                                                               
                                                                                                                                    
  Commands leading to the command that caused the error were:                                                                       
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name                                                                   
  -- -- -- -- -- -- -- --  ----------------  --------------------                                                                   
  25 00 08 00 08 00 40 00      00:04:21.090  READ DMA EXT                                                                           
  25 00 08 00 08 00 40 00      00:04:00.084  READ DMA EXT                                                                           
  25 00 10 00 08 00 40 00      00:04:00.075  READ DMA EXT                                                                           
  25 00 08 00 08 00 40 00      00:04:00.048  READ DMA EXT                                                                           
  25 00 10 00 08 00 40 00      00:03:59.061  READ DMA EXT                                                                           
                                                                                                                                    
Error 318 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)                                                           
  When the command that caused the error occurred, the device was active or idle.                                                   
                                                                                                                                    
  After command completion occurred, registers were:                                                                               
  ER ST SC SN CL CH DH                                                                                                             
  -- -- -- -- -- -- --                                                                                                             
  84 51 00 0f 08 00 40  Error: ICRC, ABRT at LBA = 0x0000080f = 2063                                                               
                                                                                                                                    
  Commands leading to the command that caused the error were:                                                                       
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name                                                                   
  -- -- -- -- -- -- -- --  ----------------  --------------------                                                                   
  25 00 10 00 08 00 40 00      00:04:00.083  READ DMA EXT                                                                           
  25 00 08 00 08 00 40 00      00:04:00.048  READ DMA EXT                                                                           
  25 00 10 00 08 00 40 00      00:03:59.061  READ DMA EXT                                                                           
  25 00 08 00 08 00 40 00      00:03:59.045  READ DMA EXT                                                                           
  25 00 10 00 08 00 40 00      00:03:58.051  READ DMA EXT                                                                           
                                                                                                                                    
Error 317 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)                                                           
  When the command that caused the error occurred, the device was active or idle.                                                   
                                                                                                                                    
  After command completion occurred, registers were:                                                                               
  ER ST SC SN CL CH DH                                                                                                             
  -- -- -- -- -- -- --                                                                                                             
  84 51 00 07 08 00 40  Error: ICRC, ABRT at LBA = 0x00000807 = 2055                                                               
                                                                                                                                    
  Commands leading to the command that caused the error were:                                                                       
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name                                                                   
  -- -- -- -- -- -- -- --  ----------------  --------------------                                                                   
  25 00 08 00 08 00 40 00      00:04:00.075  READ DMA EXT                                                                           
  25 00 10 00 08 00 40 00      00:03:59.061  READ DMA EXT                                                                           
  25 00 08 00 08 00 40 00      00:03:59.045  READ DMA EXT                                                                           
  25 00 10 00 08 00 40 00      00:03:58.051  READ DMA EXT                                                                           
  25 00 08 00 08 00 40 00      00:03:58.012  READ DMA EXT                                                                           
                                                                                                                                    
Error 316 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)                                                           
  When the command that caused the error occurred, the device was active or idle.                                                   
                                                                                                                                    
  After command completion occurred, registers were:                                                                               
  ER ST SC SN CL CH DH                                                                                                             
  -- -- -- -- -- -- --                                                                                                             
  84 51 00 0f 08 00 40  Error: ICRC, ABRT at LBA = 0x0000080f = 2063                                                               
                                                                                                                                    
  Commands leading to the command that caused the error were:                                                                       
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name                                                                   
  -- -- -- -- -- -- -- --  ----------------  --------------------                                                                   
  25 00 10 00 08 00 40 00      00:03:59.067  READ DMA EXT                                                                           
  25 00 08 00 08 00 40 00      00:03:59.045  READ DMA EXT                                                                           
  25 00 10 00 08 00 40 00      00:03:58.051  READ DMA EXT                                                                           
  25 00 08 00 08 00 40 00      00:03:58.012  READ DMA EXT                                                                           
  25 00 10 00 08 00 40 00      00:03:57.000  READ DMA EXT                                                                           
                                                                                                                                    
Error 315 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)                                                           
  When the command that caused the error occurred, the device was active or idle.                                                   
                                                                                                                                    
  After command completion occurred, registers were:                                                                               
  ER ST SC SN CL CH DH                                                                                                             
  -- -- -- -- -- -- --                                                                                                             
  84 51 00 07 08 00 40  Error: ICRC, ABRT at LBA = 0x00000807 = 2055                                                               
                                                                                                                                    
  Commands leading to the command that caused the error were:                                                                       
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name                                                                   
  -- -- -- -- -- -- -- --  ----------------  --------------------                                                                   
  25 00 08 00 08 00 40 00      00:03:59.050  READ DMA EXT                                                                           
  25 00 10 00 08 00 40 00      00:03:58.051  READ DMA EXT                                                                           
  25 00 08 00 08 00 40 00      00:03:58.012  READ DMA EXT                                                                           
  25 00 10 00 08 00 40 00      00:03:57.000  READ DMA EXT                                                                           
  25 00 08 00 08 00 40 00      00:03:56.992  READ DMA EXT                                                                           
                                                                                                                                    
SMART Self-test log structure revision number 1                                                                                     
No self-tests have been logged.  [To run self-tests, use: smartctl -t]                                                             
                                                                                                                                    
SMART Selective self-test log data structure revision number 1                                                                     
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS                                                                                       
    1        0        0  Not_testing                                                                                               
    2        0        0  Not_testing                                                                                               
    3        0        0  Not_testing                                                                                               
    4        0        0  Not_testing                                                                                               
    5        0        0  Not_testing                                                                                               
Selective self-test flags (0x0):                                                                                                   
  After scanning selected spans, do NOT read-scan remainder of disk.                                                               
If Selective self-test is pending on power-up, resume after 0 minute delay.
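
A note on reading the error log above: every logged error carries the same register pair (ER = 84, ST = 51), which smartctl prints as "ICRC, ABRT". ICRC is an interface CRC error on the SATA link, and it matches the raw UDMA_CRC_Error_Count of 319, while Reallocated_Sector_Ct and Current_Pending_Sector are both 0 in this dump. That pattern usually points at the cable, connector or backplane rather than at the platters, though the dump alone does not prove it. Below is a minimal sketch of how the register row decodes. The helper name `decode_error_row` is made up for illustration, and the LBA28-style register layout is an assumption that happens to reproduce the LBAs smartctl reports here.

```python
from typing import List, Tuple

# Error-register bits that smartctl names for DMA commands.
ERROR_BITS = {
    0x80: "ICRC",  # interface CRC error on the link
    0x40: "UNC",   # uncorrectable data
    0x10: "IDNF",  # ID (sector) not found
    0x04: "ABRT",  # command aborted
}


def decode_error_row(row: str) -> Tuple[List[str], int]:
    """Decode one 'ER ST SC SN CL CH DH' row from the smartctl error log.

    Hypothetical helper, not part of smartmontools.
    """
    er, st, sc, sn, cl, ch, dh = (int(b, 16) for b in row.split()[:7])
    flags = [name for bit, name in ERROR_BITS.items() if er & bit]
    # Assumed layout: SN = LBA bits 0-7, CL = bits 8-15, CH = bits 16-23,
    # low nibble of DH = bits 24-27.
    lba = sn | (cl << 8) | (ch << 16) | ((dh & 0x0F) << 24)
    return flags, lba


flags, lba = decode_error_row("84 51 00 07 08 00 40")
print(flags, lba)  # → ['ICRC', 'ABRT'] 2055
```

The second row in the log, `84 51 00 0f 08 00 40`, decodes the same way to LBA 2063, so the errors alternate between two neighbouring reads within the first few minutes after power-on.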

Krautmaster · Oct 22, 2019

I'm facing the same completely weird issue. At least I have a 100% backup, so no worries.

I replaced the disks one by one because I added the encryption after pool creation, following this guide:
https://www.ixsystems.com/community/threads/how-to-encrypt-an-existing-raidz-or-mirror.16975/

The disks are new.

The 1st disk exchange worked well.
The 2nd disk resilvered for a day (like the first), and then the whole pool stayed degraded with "too many errors". What the heck is going on?

Code:

  pool: RaidZ
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub in progress since Tue Oct 22 18:10:50 2019
        1.91T scanned at 2.91G/s, 73.8G issued at 112M/s, 24.2T total
        0 repaired, 0.30% done, 2 days 14:29:18 to go
config:

        NAME                                                  STATE     READ WRITE CKSUM
        RaidZ                                                 DEGRADED     0     0 15.2K
          raidz1-0                                            DEGRADED     0     0 30.4K
            gptid/2f9d8da7-ebed-11e9-a786-000c29445504.eli    DEGRADED     0     0     0  too many errors
            replacing-1                                       DEGRADED     0     0     0
              6770610332852622660                             OFFLINE      0     0     0  was /dev/gptid/30bc058d-ebed-11e9-a786-000c29445504
              gptid/30bc058d-ebed-11e9-a786-000c29445504.eli  ONLINE       0     0     0
            gptid/31d75a1a-ebed-11e9-a786-000c29445504        DEGRADED     0     0     0  too many errors
            gptid/32f29bc9-ebed-11e9-a786-000c29445504        DEGRADED     0     0     0  too many errors
        cache
          gptid/349445cf-ebed-11e9-a786-000c29445504          ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x0>

I just started a scrub on that pool.

This ZFS stuff seems pretty unreliable. I can hardly imagine that a second disk failed at the same time, especially since all the data still seems to be online. So what might be going on?
