LIGISTX
Guru
- Joined
- Apr 12, 2015
- Messages
- 525
Hi everyone,
This is not the first time I have had drives go bad, but it is the first time I have had two fault at the same time. Thankfully it's a Z2 array, and I have a cold spare waiting to go in. But with two drives down, I am a little wary of how to approach this.
Would it make sense to do a zpool clear to force it to think everything is OK, and then replace one of the drives with my cold spare? I would then see how the resilver goes, do a scrub, and monitor the situation. I don't want to get myself into a worse situation by jumping to any conclusions prematurely.
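To make the sequence I am contemplating concrete, here is a minimal Python sketch that only prints the commands and runs nothing. The pool name is from my zpool status; OLD_DEV and NEW_DEV are hypothetical placeholders, not real identifiers, and the ordering is just my current idea, not a recommendation.

```python
import shlex

POOL = "pergamum"
# Hypothetical placeholders: substitute real identifiers before running anything.
OLD_DEV = "gptid/OLD-DEVICE-PLACEHOLDER"
NEW_DEV = "gptid/COLD-SPARE-PLACEHOLDER"


def plan(pool, old_dev, new_dev):
    """Return the contemplated command sequence as argument lists (dry run only)."""
    return [
        ["zpool", "clear", pool],                      # reset error counters on the pool
        ["zpool", "replace", pool, old_dev, new_dev],  # start a resilver onto the spare
        ["zpool", "status", "-v", pool],               # check resilver progress and errors
        ["zpool", "scrub", pool],                      # run only after the resilver finishes
    ]


for cmd in plan(POOL, OLD_DEV, NEW_DEV):
    print(shlex.join(cmd))
```

Keeping it as a printed plan means nothing touches the pool until each step has been sanity-checked.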
I have had a few SMART errors pop up over the past few SMART tests. I guess I didn't dig deep enough into them because I thought the drives were still working fine (I have had spurious errors in the past that didn't actually result in bad drives or any corruption). Looking at the SMART status of da5 (the drive with 62 faults according to zpool status), I am seeing:
Code:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   199   199   051    Pre-fail  Always       -       43
  3 Spin_Up_Time            0x0027   186   161   021    Pre-fail  Always       -       5700
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       234
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   028   028   000    Old_age   Always       -       52765
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       228
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       224
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       131
194 Temperature_Celsius     0x0022   121   105   000    Old_age   Always       -       29
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   055   000    Old_age   Always       -       608
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1

SMART Error Log Version: 1
No Errors Logged
da7 (61 faults):
Code:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       1
  3 Spin_Up_Time            0x0027   184   159   021    Pre-fail  Always       -       5766
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       235
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   028   028   000    Old_age   Always       -       52758
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       228
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       224
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       107
194 Temperature_Celsius     0x0022   120   105   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged
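To compare the two drives without eyeballing every row, here is a minimal Python sketch that pulls the raw values for a handful of attributes out of smartctl-style rows. The watch list (IDs 5, 197, 198, 199) is my own assumption about what to look at first, not any official rule, and the sample rows are copied from the two tables above.

```python
import re

# Attribute IDs to check first; this watch list is an assumption, not a standard.
WATCHED = {5, 197, 198, 199}

# ID, name, flag, value, worst, thresh, type, updated, when_failed, raw value
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+(\S+)\s+(\S+)\s+(\S+)\s+(\d+)"
)


def nonzero_watched(smart_text):
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    hits = {}
    for line in smart_text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # skip headers and anything that is not an attribute row
        attr_id, name, raw = int(m.group(1)), m.group(2), int(m.group(9))
        if attr_id in WATCHED and raw > 0:
            hits[name] = raw
    return hits


DA5 = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   055   000    Old_age   Always       -       608
"""

DA7 = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

print("da5:", nonzero_watched(DA5))
print("da7:", nonzero_watched(DA7))
```

On these rows the only watched counter that is nonzero is UDMA_CRC_Error_Count on da5 (608); da7 comes back empty. As far as I know, that counter tracks transfer errors on the link rather than media defects.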
Zpool status:
Code:
pool: pergamum
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
    Sufficient replicas exist for the pool to continue functioning in a
    degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
    repaired.
  scan: scrub repaired 1.89M in 06:25:16 with 0 errors on Thu Dec 21 06:25:36 2023
config:
    NAME                                            STATE     READ WRITE CKSUM
    pergamum                                        DEGRADED     0     0     0
      raidz2-0                                      DEGRADED     0     0     0
        gptid/ab0351e8-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     0
        gptid/abbfceac-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     0
        gptid/e3c7752a-1fc4-11ea-8e70-000c29cab7ac  ONLINE       0     0     0
        gptid/6ebdcf54-ac93-11ec-b2a3-279dd0c48793  ONLINE       0     0     0
        gptid/ae0d7e64-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     0
        gptid/aeca106f-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     0
        gptid/af89686d-44ea-11e8-8cad-e0071bffdaee  FAULTED     61     0     0  too many errors
        gptid/b04ad4fc-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     0
        gptid/b10b6452-44ea-11e8-8cad-e0071bffdaee  FAULTED     62     0     0  too many errors
        gptid/b1d949c1-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     0
errors: No known data errors
What should my next steps here be?
To stave off any questions about hardware: it can all be found in my signature, but the controller is an H310 connected to a SAS expander, and it has been in good working order for 7+ years (minus a few drive failures over the years). It's possible a SAS -> SATA cable (or two) is going bad; it has happened to me before. But I am thinking this is not that sort of situation.
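For reference, this is roughly how I would pull the faulted members and their counters out of the zpool status text. It is a minimal Python sketch, assuming the whitespace-separated NAME / STATE / READ / WRITE / CKSUM layout shown in my output and plain integer counters (I have not handled abbreviated counts such as values with a K suffix). The sample rows are a subset copied from my output.

```python
def faulted_devices(status_text):
    """Return [(name, read, write, cksum)] for rows whose STATE is FAULTED."""
    rows = []
    for line in status_text.splitlines():
        parts = line.split()
        # Device rows have at least: NAME STATE READ WRITE CKSUM
        if len(parts) >= 5 and parts[1] == "FAULTED":
            rows.append((parts[0], int(parts[2]), int(parts[3]), int(parts[4])))
    return rows


SAMPLE = """\
    NAME                                            STATE     READ WRITE CKSUM
    pergamum                                        DEGRADED     0     0     0
      raidz2-0                                      DEGRADED     0     0     0
        gptid/aeca106f-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     0
        gptid/af89686d-44ea-11e8-8cad-e0071bffdaee  FAULTED     61     0     0  too many errors
        gptid/b04ad4fc-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     0
        gptid/b10b6452-44ea-11e8-8cad-e0071bffdaee  FAULTED     62     0     0  too many errors
"""

for name, read, write, cksum in faulted_devices(SAMPLE):
    print(name, "read:", read, "write:", write, "cksum:", cksum)
```

Both faulted members show only read errors (61 and 62), with zero write and checksum errors.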