I have a 6-drive RAIDZ2 setup, and when I got a critical alert yesterday I figured I could just find the broken drive and replace it. But I'm a bit confused, because it looks like one drive disconnected and a different drive has a lot of errors. I'm not sure if I'm worrying too much about the errors, if I have multiple imminent failures, or if I'm just reading this all wrong.
zpool status gives me this output:
Code:
  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Sun Dec 3 03:45:26 2017
config:

        NAME                                          STATE     READ WRITE CKSUM
        freenas-boot                                  ONLINE       0     0     0
          gptid/cdb7ef94-4e56-11e6-929f-0cc47acb9984  ONLINE       0     0     0

errors: No known data errors

  pool: tank0
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 35.0M in 2h17m with 0 errors on Sun Dec 3 01:04:40 2017
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank0                                           DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/769858e6-4e1c-11e6-bdfb-0cc47acb9984  ONLINE       0     0     0
            gptid/77beff42-4e1c-11e6-bdfb-0cc47acb9984  ONLINE       0     0     0
            gptid/78ddc39d-4e1c-11e6-bdfb-0cc47acb9984  ONLINE       0     0     0
            gptid/7a024fac-4e1c-11e6-bdfb-0cc47acb9984  ONLINE       0     0     0
            gptid/7b26d0a9-4e1c-11e6-bdfb-0cc47acb9984  ONLINE       0     0     0
            gptid/7c52a93a-4e1c-11e6-bdfb-0cc47acb9984  DEGRADED     0     0 3.16K  too many errors

errors: No known data errors
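With this much output it is easy to miss the one row that matters, so here is a rough Python sketch that pulls out only the pool members that are not ONLINE or that have non-zero error counters. It assumes the `zpool status` text has been captured into a string; the function name `flagged_devices` and the `SAMPLE` variable are made up for illustration, and the sample rows are copied from the output above.

```python
def flagged_devices(zpool_status_text):
    """Return (name, state, read, write, cksum) for members that look unhealthy."""
    flagged = []
    for line in zpool_status_text.splitlines():
        fields = line.split()
        # Member rows look like: gptid/<uuid> STATE READ WRITE CKSUM [note...]
        if len(fields) < 5 or not fields[0].startswith("gptid/"):
            continue
        name, state, read, write, cksum = fields[:5]
        # Counters stay as strings because zpool abbreviates them (e.g. "3.16K").
        if state != "ONLINE" or (read, write, cksum) != ("0", "0", "0"):
            flagged.append((name, state, read, write, cksum))
    return flagged


# Three member rows taken from the tank0 section of the output above.
SAMPLE = """\
gptid/769858e6-4e1c-11e6-bdfb-0cc47acb9984  ONLINE       0     0     0
gptid/7b26d0a9-4e1c-11e6-bdfb-0cc47acb9984  ONLINE       0     0     0
gptid/7c52a93a-4e1c-11e6-bdfb-0cc47acb9984  DEGRADED     0     0 3.16K  too many errors
"""

for row in flagged_devices(SAMPLE):
    print(row)
```

On the sample rows this leaves only the 7c52a93a member, which is the one reported as DEGRADED.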
7c52a93a is my ada3 drive:
Code:
$ glabel status
                                      Name  Status  Components
gptid/cda8d6d1-4e56-11e6-929f-0cc47acb9984     N/A  da0p1
gptid/cdb7ef94-4e56-11e6-929f-0cc47acb9984     N/A  da0p2
gptid/78ddc39d-4e1c-11e6-bdfb-0cc47acb9984     N/A  ada0p2
gptid/7a024fac-4e1c-11e6-bdfb-0cc47acb9984     N/A  ada1p2
gptid/7b26d0a9-4e1c-11e6-bdfb-0cc47acb9984     N/A  ada2p2
gptid/7c52a93a-4e1c-11e6-bdfb-0cc47acb9984     N/A  ada3p2
gptid/769858e6-4e1c-11e6-bdfb-0cc47acb9984     N/A  ada4p2
gptid/77beff42-4e1c-11e6-bdfb-0cc47acb9984     N/A  ada5p2
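The lookup from gptid to device name can be sketched in Python as follows. This is only an illustration that assumes the `glabel status` text is available as a string; `parse_glabel_status` and `GLABEL_OUTPUT` are hypothetical names, and the rows are copied from the output above.

```python
def parse_glabel_status(text):
    """Map each gptid label to its component partition (e.g. 'ada3p2')."""
    mapping = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have three columns: Name, Status, Components.
        # The header row is skipped because it does not start with "gptid/".
        if len(fields) == 3 and fields[0].startswith("gptid/"):
            mapping[fields[0]] = fields[2]
    return mapping


GLABEL_OUTPUT = """\
                                      Name  Status  Components
gptid/7b26d0a9-4e1c-11e6-bdfb-0cc47acb9984     N/A  ada2p2
gptid/7c52a93a-4e1c-11e6-bdfb-0cc47acb9984     N/A  ada3p2
gptid/769858e6-4e1c-11e6-bdfb-0cc47acb9984     N/A  ada4p2
"""

labels = parse_glabel_status(GLABEL_OUTPUT)
print(labels["gptid/7c52a93a-4e1c-11e6-bdfb-0cc47acb9984"])
```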
ada3 is the drive that was disconnected when I first started looking at the system after the critical alert. I rebooted and it showed back up. At first, though, I thought ada4 was the drive with the error, because I hadn't noticed that ada3 was missing. So I ran a long test on ada4, and it's showing a few errors.
As I understand it, this is the relevant section for ada3:
Code:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   205   205   021    Pre-fail  Always       -       6716
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       5
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       7253
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       5
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       1
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       31
194 Temperature_Celsius     0x0022   116   112   000    Old_age   Always       -       36
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   197   000    Old_age   Offline      -       6
And this is the relevant section for ada4:
Code:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   102   101   051    Pre-fail  Always       -       8370
  3 Spin_Up_Time            0x0027   100   253   021    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       5
  5 Reallocated_Sector_Ct   0x0033   165   165   140    Pre-fail  Always       -       1050
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       6060
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       5
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       1
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       33
194 Temperature_Celsius     0x0022   115   111   000    Old_age   Always       -       37
196 Reallocated_Event_Count 0x0032   087   087   000    Old_age   Always       -       113
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       30
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       3
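To compare the two drives without reading every row, a small Python sketch like the one below can pick out the sector-health counters. It rests on an assumption worth stating plainly: that non-zero raw values for attributes 5, 196, 197 and 198 are the ones usually treated as signs of media trouble, which is a common rule of thumb rather than anything the drive vendor guarantees. The names `WATCH_IDS`, `parse_smart_attributes` and `nonzero_watch_attributes` are hypothetical, and the sample rows are copied from the two tables above.

```python
# Attribute IDs assumed (for this sketch) to indicate sector-level problems.
WATCH_IDS = {5, 196, 197, 198}


def parse_smart_attributes(text):
    """Map attribute ID -> (name, raw value) from a smartctl attribute table."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Rows: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[int(fields[0])] = (fields[1], int(fields[9]))
    return attrs


def nonzero_watch_attributes(text):
    """Return sorted (id, name, raw) for watched attributes with a non-zero raw value."""
    attrs = parse_smart_attributes(text)
    return [(i, attrs[i][0], attrs[i][1])
            for i in sorted(WATCH_IDS) if i in attrs and attrs[i][1] != 0]


ADA3 = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
"""

ADA4 = """\
  5 Reallocated_Sector_Ct   0x0033   165   165   140    Pre-fail  Always       -       1050
196 Reallocated_Event_Count 0x0032   087   087   000    Old_age   Always       -       113
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       30
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
"""

print("ada3:", nonzero_watch_attributes(ADA3))
print("ada4:", nonzero_watch_attributes(ADA4))
```

By this check ada3 shows nothing in the watched attributes, while ada4 shows reallocated sectors, reallocation events and pending sectors.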
When I run smartctl -a /dev/ada4, I also see an actual error log listing register values and a command called "READ DMA" that led up to the error, but I'm not sure what to make of it. Every entry in the self-test log for ada4 has "Completed: read failure" as its Status, while ada3 only has read failures on extended tests.
Please let me know what additional information I can provide - there's a lot of output here and I'm having a hard time figuring out what is relevant.