Recently I had a power outage that affected my TrueNAS SCALE system. Since then, different drives have been marked as faulted with 15 read errors during scrub tasks. I'm 80% sure the drives are fine, as they are only 2-3 months old and it's not the same drive or the same slot each time. The pool is an all-SSD pool: one 6-wide raidz2 vdev with 2 hot spares. The data is backed up in multiple locations and safe, so that's not a concern. However, I do not want to rebuild the pool. I don't want to nuke the data; I'd like to be able to recover from this, to prove it can be fixed without data loss in case it somehow happens again.
System info:
- TrueNAS-SCALE-22.12.3.3
- R720xd x24 2.5 bay with Rear flex bay
- X2 Xeon E5-2697 v2
- H710 HBA (IT mode)
- 256 GB ECC Registered DDR3 RAM
- 10GbE networking card
- x2 1100 W PSU
- x8 Samsung_SSD_870_EVO_4TB (Raidz2 x6 wide vdev, with x2 hot spares)
- x2 Samsung_SSD_870_EVO_250GB (mirrored boot pool)
What I've Tried:
- Replaced HBA (multiple times. I even bought an HBA, already in IT mode, from the Art of Server on eBay. I've flashed 2 HBAs to IT mode myself, an H310 & H710)
- Replaced backplane & SAS cables (more SAS cables on the way)
- Replaced PSUs
- Replaced RAM
- Replaced motherboard
- Swapped drive locations (not including rear flex bay)
1. S6PJNS0W501176B
Code:
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       2094
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       79
177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       8
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   062   060   000    Old_age   Always       -       38
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       73
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       12615480528
252 Added_Bad_Flash_Blk_Ct  0x0032   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged
2. S6PJNS0W500034H
Code:
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       2094
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       76
177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       7
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   062   060   000    Old_age   Always       -       38
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       74
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       11553271266
252 Added_Bad_Flash_Blk_Ct  0x0032   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged
3. S6PJNS0W500042J
Code:
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       2089
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       84
177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       8
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   063   045   000    Old_age   Always       -       37
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   099   099   000    Old_age   Always       -       1
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       81
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       16103206435
252 Added_Bad_Flash_Blk_Ct  0x0032   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged
4. S6PJNS0W502767R
Code:
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       2093
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       79
177 Wear_Leveling_Count     0x0013   100   100   000    Pre-fail  Always       -       0
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   063   060   000    Old_age   Always       -       37
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       76
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       3863904288
252 Added_Bad_Flash_Blk_Ct  0x0032   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged
5. S6PJNS0W501290Y
Code:
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       408
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       111
177 Wear_Leveling_Count     0x0013   100   100   000    Pre-fail  Always       -       0
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   064   050   000    Old_age   Always       -       36
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   099   099   000    Old_age   Always       -       1
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       105
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       5267480562
252 Added_Bad_Flash_Blk_Ct  0x0032   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged
6. S6PJNS0W501175A
Code:
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       2094
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       75
177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       7
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   062   058   000    Old_age   Always       -       38
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       73
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       12045385001
252 Added_Bad_Flash_Blk_Ct  0x0032   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged
7. S6PJNS0W501171K
Code:
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       2094
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       73
177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       8
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   074   057   000    Old_age   Always       -       26
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       70
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       18555807651
252 Added_Bad_Flash_Blk_Ct  0x0032   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged
8. S6PJNS0W501170P
Code:
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       2094
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       80
177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       11
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   074   058   000    Old_age   Always       -       26
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   099   099   000    Old_age   Always       -       3
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       77
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       40138025532
252 Added_Bad_Flash_Blk_Ct  0x0032   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged
Notes on SMART Tests:
I know I have some CRC errors on 3 drives (raw counts of 1, 1, and 3), but this doesn't really bother me or cause any concern: the counts have not increased any further, and I'm only getting read errors, no checksum errors. I did get a write error one time, but I don't have that data to share and it has not occurred again. I think the drives are fine!
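To back up the claim that the CRC counts are no longer climbing, the raw values can be compared between two SMART reports. The following is only a rough Python sketch, assuming the attribute rows are pasted in the same 10-column layout as the reports above; the function name `parse_smart_attributes` and the second "later" reading are made up for illustration.

```python
def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} from SMART attribute rows."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the raw value is the 10th column.
        if len(fields) >= 10 and fields[0].isdigit():
            name, raw = fields[1], fields[9]
            if raw.isdigit():
                attrs[name] = int(raw)
    return attrs


# Two rows copied from drive 8 (S6PJNS0W501170P) above.
sample = """\
199 CRC_Error_Count 0x003e 099 099 000 Old_age Always - 3
187 Uncorrectable_Error_Cnt 0x0032 100 100 000 Old_age Always - 0
"""

before = parse_smart_attributes(sample)
# Hypothetical later reading, only to show the comparison; not real data.
after = parse_smart_attributes(sample.replace("- 3", "- 5"))

# Keep only the attributes whose raw value increased between the two readings.
grew = {name: after[name] - before[name] for name in before if after[name] > before[name]}
print(grew)
```

An empty result between two real reports would support the "not increasing" observation. A growing `CRC_Error_Count` usually points at the link (cabling, backplane, HBA) rather than the flash itself.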
Latest Resilver Email:
Code:
ZFS has finished a resilver:

   eid: 51
 class: resilver_finish
  host: truenas-scale
  time: 2023-10-05 03:13:46-0400
  pool: mainframe
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: resilvered 990G in 00:40:39 with 0 errors on Thu Oct 5 03:13:46 2023
config:

        NAME                                        STATE     READ WRITE CKSUM
        mainframe                                   DEGRADED     0     0     0
          raidz2-0                                  DEGRADED     0     0     0
            e7ca087b-3fcc-4c87-a72e-269a50751068    ONLINE       0     0     0
            spare-1                                 DEGRADED     0     0     0
              4ae20dae-448b-43f8-93f7-6a3577743cba  FAULTED     15     0     0  too many errors
              2bdfb31c-c361-4a15-8ee7-9f83f938b1b7  ONLINE       0     0     0
            a6b4e4cf-4d8a-4363-bb95-e7d88b6874cc    ONLINE       0     0     0
            1c25a11a-96bb-4d61-830e-8b80684b2590    ONLINE       0     0     0
            f5015a2d-30f7-42f9-ae18-5c2e6feade1b    ONLINE       0     0     0
            87d21052-4dbe-4bdf-afa9-94aa2353e678    ONLINE       0     0     0
        spares
          2bdfb31c-c361-4a15-8ee7-9f83f938b1b7      INUSE     currently in use
          adb4cb64-58fa-4d18-b965-0afcad5dac81      AVAIL

errors: No known data errors
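Since a different drive faults each time, it may help to log which device reports errors after every scrub or resilver and look for a pattern across emails. Below is a rough Python sketch, assuming the config table from an email like the one above is available as text; `devices_with_errors` is a hypothetical helper name, and it only handles plain integer counters (not abbreviated ones such as `1.2K`).

```python
def devices_with_errors(status_text):
    """Return (name, state, read, write, cksum) for rows with a nonzero counter."""
    flagged = []
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True  # device rows follow this header
            continue
        if not in_config or len(fields) < 5:
            continue
        name, state = fields[0], fields[1]
        counters = fields[2:5]
        # Skip rows whose counter columns are not plain integers,
        # e.g. the "spares" entries or the trailing "errors:" line.
        if not all(c.isdigit() for c in counters):
            continue
        read, write, cksum = (int(c) for c in counters)
        if read or write or cksum:
            flagged.append((name, state, read, write, cksum))
    return flagged


# Rows taken from the resilver email above (shortened).
sample = """\
NAME STATE READ WRITE CKSUM
mainframe DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
spare-1 DEGRADED 0 0 0
4ae20dae-448b-43f8-93f7-6a3577743cba FAULTED 15 0 0 too many errors
2bdfb31c-c361-4a15-8ee7-9f83f938b1b7 ONLINE 0 0 0
spares
2bdfb31c-c361-4a15-8ee7-9f83f938b1b7 INUSE currently in use
errors: No known data errors
"""

flagged = devices_with_errors(sample)
print(flagged)
```

Recording the flagged partition UUID together with the drive serial and bay for each event would show whether the faults follow a drive, a slot, or neither.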
Final Notes:
Brain hurts. Not sure how to proceed; I'll probably replace the CPUs next, depending on what the community has to say. I've dropped BANK on an automatic generator system to make sure this never happens again. Knock on wood. I'd like to fix the server. I could get another R720xd for $200 and call it a day, but that wouldn't help figure out the issue.
I've read a LOT in the forums and tried to avoid posting but I'm at my wit's end.
This is my first post. Hopefully, I've included all the relevant information. How should I proceed?
Thank you in advance for any assistance.