Replacing Drive Re-silver Errors

jerkbag · Dec 3, 2013

Hi -- while replacing a drive on an otherwise healthy 4-drive RAIDZ (I wanted to grow the size of the pool) I got a ton of errors (around 37K) overnight during the resilver, and when I woke up it had slowed to a crawl. I believe a drive cable got unseated: I shut down during the resilver, reseated all the cables, and when I brought it back up the errors stopped right away and the resilver finished quickly.

So now my resilver is finished, but zpool status is telling me I have data corruption and to restore the files from backup, and it's a ton of files. I am almost positive it was a cable issue -- I think the system lost a drive during the resilver. If I launch a scrub, will it re-check and sort all this stuff out? Or have these errors been marked in the system somehow so it thinks the files are gone forever? Surely now that it has all the drives again it can put it all back together?

Also -- I did the replace through the GUI. The resilver is done, but the old drive is still attached. I think the move is to scrub first, and then detach the old drive, correct?

Any help would be appreciated -- thanks!

cyberjock · Dec 3, 2013

First, if you had a cable come loose during a resilver you'd have lost the pool (insufficient replicas). It sounds like you have a second failing drive. Post the output of smartctl -a for each of your disks. My money says one of them is failing.

As for the bad files, I've never seen someone have corrupted files that fixed themselves. More than likely you'll have to delete and restore from backup. What can you expect though... when using a RAIDZ1 if a disk fails you are literally running a striped array of 3 disks until the resilvering completes. Would YOU ever do a 3 disk striped array and call it "reliable"? I wouldn't either.

jerkbag · Dec 3, 2013

You're right, I guess the drive couldn't have gone completely offline or the pool would have had insufficient replicas -- maybe it's failing, or the cable was creating errors. I guess my thought was: if these files were corrupted because of bad reads/lack of replicas *at the time* of the resilver, would a scrub restore them once the drives come back online error free? I'll post smartctl -a when I'm home, thanks.

cyberjock · Dec 3, 2013

The smartctl output will tell us if you have (or have had) cable errors. :)

jerkbag · Dec 3, 2013

alrighty -- I'm about 50% of the way through a scrub with no errors so far. Here's the smart output -- they all say passed. Any idea what I should be looking for?

edit -- figured out code tags thanks!

Code:

smartctl 5.43 2012-06-30 r3573 [FreeBSD 8.3-RELEASE-p10 amd64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
 
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Caviar Green (Adv. Format)
Device Model: WDC WD20EARX-008FB0
Serial Number: WD-WCAZAF388777
LU WWN Device Id: 5 0014ee 20737703d
Firmware Version: 51.0AB51
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Tue Dec 3 20:04:29 2013 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
 
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
 
General SMART Values:
Offline data collection status: (0x84)Offline data collection activity
was suspended by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0)The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection:(33300) seconds.
Offline data collection
capabilities:(0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003)Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01)Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time:( 2) minutes.
Extended self-test routine
recommended polling time:( 358) minutes.
Conveyance self-test routine
recommended polling time:( 5) minutes.
SCT capabilities: (0x30b5)SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
 
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 189 186 021 Pre-fail Always - 5541
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 76
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 194 000 Old_age Always - 0
9 Power_On_Hours 0x0032 088 088 000 Old_age Always - 8900
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 65
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 42
193 Load_Cycle_Count 0x0032 119 119 000 Old_age Always - 244046
194 Temperature_Celsius 0x0022 109 098 000 Old_age Always - 41
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0
 
SMART Error Log Version: 1
ATA Error Count: 20 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
 
Error 20 occurred at disk power-on lifetime: 4713 hours (196 days + 9 hours)
When the command that caused the error occurred, the device was in standby mode.
 
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 61 10 d8 07 00 40 Device Fault; Error: ABRT at LBA = 0x000007d8 = 2008
 
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
ca 00 10 d8 07 00 40 08 6d+22:14:00.127 WRITE DMA
ca 00 10 d8 07 00 40 08 6d+22:14:00.127 WRITE DMA
ca 00 10 d8 07 00 40 08 6d+22:14:00.126 WRITE DMA
ca 00 10 d8 07 00 40 08 6d+22:14:00.126 WRITE DMA
ca 00 10 d8 07 00 40 08 6d+22:14:00.126 WRITE DMA
 
Error 19 occurred at disk power-on lifetime: 4713 hours (196 days + 9 hours)
When the command that caused the error occurred, the device was in standby mode.
 
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 61 10 d8 07 00 40 Device Fault; Error: ABRT at LBA = 0x000007d8 = 2008
 
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
ca 00 10 d8 07 00 40 08 6d+22:14:00.127 WRITE DMA
ca 00 10 d8 07 00 40 08 6d+22:14:00.126 WRITE DMA
ca 00 10 d8 07 00 40 08 6d+22:14:00.126 WRITE DMA
ca 00 10 d8 07 00 40 08 6d+22:14:00.126 WRITE DMA
 
Error 18 occurred at disk power-on lifetime: 4713 hours (196 days + 9 hours)
When the command that caused the error occurred, the device was in standby mode.
 
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 61 10 d8 07 00 40 Device Fault; Error: ABRT at LBA = 0x000007d8 = 2008
 
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
ca 00 10 d8 07 00 40 08 6d+22:14:00.126 WRITE DMA
ca 00 10 d8 07 00 40 08 6d+22:14:00.126 WRITE DMA
ca 00 10 d8 07 00 40 08 6d+22:14:00.126 WRITE DMA
 
Error 17 occurred at disk power-on lifetime: 4713 hours (196 days + 9 hours)
When the command that caused the error occurred, the device was in standby mode.
 
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 61 10 d8 07 00 40 Device Fault; Error: ABRT at LBA = 0x000007d8 = 2008
 
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
ca 00 10 d8 07 00 40 08 6d+22:14:00.126 WRITE DMA
ca 00 10 d8 07 00 40 08 6d+22:14:00.126 WRITE DMA
 
Error 16 occurred at disk power-on lifetime: 4713 hours (196 days + 9 hours)
When the command that caused the error occurred, the device was in standby mode.
 
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 61 10 d8 07 00 40 Device Fault; Error: ABRT at LBA = 0x000007d8 = 2008
 
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
ca 00 10 d8 07 00 40 08 6d+22:14:00.126 WRITE DMA
 
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
 
 
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 
smartctl 5.43 2012-06-30 r3573 [FreeBSD 8.3-RELEASE-p10 amd64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
 
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Caviar Green (Adv. Format)
Device Model: WDC WD20EARX-008FB0
Serial Number: WD-WCAZAK235257
LU WWN Device Id: 5 0014ee 25d1bdc62
Firmware Version: 51.0AB51
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Tue Dec 3 20:04:37 2013 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
 
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
 
General SMART Values:
Offline data collection status: (0x84)Offline data collection activity
was suspended by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0)The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection:(33960) seconds.
Offline data collection
capabilities:(0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003)Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01)Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time:( 2) minutes.
Extended self-test routine
recommended polling time:( 365) minutes.
Conveyance self-test routine
recommended polling time:( 5) minutes.
SCT capabilities: (0x30b5)SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
 
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 182 182 021 Pre-fail Always - 5875
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 10
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 521
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 10
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 6
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 277
194 Temperature_Celsius 0x0022 108 101 000 Old_age Always - 42
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 100 253 000 Old_age Offline - 0
 
SMART Error Log Version: 1
No Errors Logged
 
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
 
 
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 
smartctl 5.43 2012-06-30 r3573 [FreeBSD 8.3-RELEASE-p10 amd64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
 
=== START OF INFORMATION SECTION ===
Device Model: WDC WD20EZRX-00D8PB0
Serial Number: WD-WMC4M1375651
LU WWN Device Id: 5 0014ee 058ff44dc
Firmware Version: 80.00A80
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ACS-2 (revision not indicated)
Local Time is: Tue Dec 3 20:04:41 2013 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
 
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
 
General SMART Values:
Offline data collection status: (0x80)Offline data collection activity
was never started.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0)The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection:(25680) seconds.
Offline data collection
capabilities:(0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003)Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01)Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time:( 2) minutes.
Extended self-test routine
recommended polling time:( 260) minutes.
Conveyance self-test routine
recommended polling time:( 5) minutes.
SCT capabilities: (0x7035)SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
 
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 253 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 100 253 021 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 3
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 31
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 3
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 2
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 18
194 Temperature_Celsius 0x0022 106 106 000 Old_age Always - 41
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 100 253 000 Old_age Offline - 0
 
SMART Error Log Version: 1
No Errors Logged
 
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
 
 
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 
smartctl 5.43 2012-06-30 r3573 [FreeBSD 8.3-RELEASE-p10 amd64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
 
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda (SATA 3Gb/s, 4K Sectors)
Device Model: ST2000DM001-9YN164
Serial Number: Z1E0M7GE
LU WWN Device Id: 5 000c50 040678f1e
Firmware Version: CC4C
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Tue Dec 3 20:04:45 2013 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
 
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
 
General SMART Values:
Offline data collection status: (0x00)Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0)The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection:( 575) seconds.
Offline data collection
capabilities:(0x73) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003)Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01)Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time:( 1) minutes.
Extended self-test routine
recommended polling time:( 217) minutes.
Conveyance self-test routine
recommended polling time:( 2) minutes.
SCT capabilities: (0x3085)SCT Status supported.
 
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 119 099 006 Pre-fail Always - 217317488
3 Spin_Up_Time 0x0003 095 094 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 137
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 076 060 030 Pre-fail Always - 52099670048
9 Power_On_Hours 0x0032 085 085 000 Old_age Always - 13290
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 87
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 076 076 000 Old_age Always - 24
188 Command_Timeout 0x0032 100 099 000 Old_age Always - 8590065666
189 High_Fly_Writes 0x003a 093 093 000 Old_age Always - 7
190 Airflow_Temperature_Cel 0x0022 058 049 045 Old_age Always - 42 (Min/Max 29/42)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 51
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 306
194 Temperature_Celsius 0x0022 042 051 000 Old_age Always - 42 (0 18 0 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 8
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 8
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 79435420152792
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 1382624227075
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 240958225743952
 
SMART Error Log Version: 1
ATA Error Count: 24 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
 
Error 24 occurred at disk power-on lifetime: 13283 hours (553 days + 11 hours)
When the command that caused the error occurred, the device was active or idle.
 
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
 
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 55 ff ff ff 4f 00 04:17:27.627 READ FPDMA QUEUED
2f 00 01 10 00 00 00 00 04:17:27.536 READ LOG EXT
60 00 55 ff ff ff 4f 00 04:17:24.785 READ FPDMA QUEUED
2f 00 01 10 00 00 00 00 04:17:24.694 READ LOG EXT
60 00 55 ff ff ff 4f 00 04:17:21.882 READ FPDMA QUEUED
 
Error 23 occurred at disk power-on lifetime: 13283 hours (553 days + 11 hours)
When the command that caused the error occurred, the device was active or idle.
 
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
 
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 55 ff ff ff 4f 00 04:17:24.785 READ FPDMA QUEUED
2f 00 01 10 00 00 00 00 04:17:24.694 READ LOG EXT
60 00 55 ff ff ff 4f 00 04:17:21.882 READ FPDMA QUEUED
2f 00 01 10 00 00 00 00 04:17:21.760 READ LOG EXT
60 00 55 ff ff ff 4f 00 04:17:18.995 READ FPDMA QUEUED
 
Error 22 occurred at disk power-on lifetime: 13283 hours (553 days + 11 hours)
When the command that caused the error occurred, the device was active or idle.
 
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
 
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 55 ff ff ff 4f 00 04:17:21.882 READ FPDMA QUEUED
2f 00 01 10 00 00 00 00 04:17:21.760 READ LOG EXT
60 00 55 ff ff ff 4f 00 04:17:18.995 READ FPDMA QUEUED
2f 00 01 10 00 00 00 00 04:17:18.401 READ LOG EXT
60 00 55 ff ff ff 4f 00 04:17:15.647 READ FPDMA QUEUED
 
Error 21 occurred at disk power-on lifetime: 13283 hours (553 days + 11 hours)
When the command that caused the error occurred, the device was active or idle.
 
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
 
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 55 ff ff ff 4f 00 04:17:18.995 READ FPDMA QUEUED
2f 00 01 10 00 00 00 00 04:17:18.401 READ LOG EXT
60 00 55 ff ff ff 4f 00 04:17:15.647 READ FPDMA QUEUED
61 00 10 ff ff ff 4f 00 04:17:15.062 WRITE FPDMA QUEUED
61 00 10 ff ff ff 4f 00 04:17:15.059 WRITE FPDMA QUEUED
 
Error 20 occurred at disk power-on lifetime: 13283 hours (553 days + 11 hours)
When the command that caused the error occurred, the device was active or idle.
 
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
 
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 55 ff ff ff 4f 00 04:17:15.647 READ FPDMA QUEUED
61 00 10 ff ff ff 4f 00 04:17:15.062 WRITE FPDMA QUEUED
61 00 10 ff ff ff 4f 00 04:17:15.059 WRITE FPDMA QUEUED
61 00 10 90 02 40 40 00 04:17:15.024 WRITE FPDMA QUEUED
60 00 04 ff ff ff 4f 00 04:17:14.981 READ FPDMA QUEUED
 
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
 
 
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

warri · Dec 3, 2013

The Seagate Barracuda ST2000DM001-9YN164 seems to have logged some errors 7 hours ago (in terms of the HDD's power-on hours). Your formatting makes the report really hard to read though; it would be appreciated if you could post it again and copy the output into [code] tags to preserve the original formatting. You can just edit your old post!
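
The "7 hours" figure can be reproduced by subtracting the power-on lifetime stamped on the newest error-log entry from the drive's Power_On_Hours attribute. Here is a minimal Python sketch of that arithmetic. The function name `hours_since_last_error` and the regular expressions are illustrative assumptions, and the sample text is trimmed from the Seagate report in this thread.

```python
import re

def hours_since_last_error(smart_text):
    """Power_On_Hours minus the lifetime hour of the newest logged error.

    Returns None when either value is missing from the text.
    Assumes both numbers come from the same power-on counter.
    """
    poh = re.search(r"^\s*9\s+Power_On_Hours\s.*-\s+(\d+)", smart_text, re.MULTILINE)
    stamps = re.findall(
        r"Error \d+ occurred at disk power-on lifetime: (\d+) hours", smart_text
    )
    if poh is None or not stamps:
        return None
    return int(poh.group(1)) - max(int(h) for h in stamps)

# Lines copied from the ST2000DM001 report in this thread.
SEAGATE = """\
9 Power_On_Hours 0x0032 085 085 000 Old_age Always - 13290
Error 24 occurred at disk power-on lifetime: 13283 hours (553 days + 11 hours)
Error 20 occurred at disk power-on lifetime: 13283 hours (553 days + 11 hours)
"""

print(hours_since_last_error(SEAGATE))  # 13290 - 13283 = 7
```

The same subtraction on the first WD20EARX report gives 8900 - 4713 = 4187 hours, so the errors logged on that drive are much older than the resilver.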

cyberjock · Dec 3, 2013

Well, for one your hard drives are too hot. 40C max, and all of yours are above that temp. :(

Your first disk had some internal errors. Your last disk had some internal errors too. They were probably the cause of your problems before.

None of your disks recorded UDMA CRC errors (which are an indicator of communication, aka cable, errors). So I think your cables are/were fine. Never had a problem with them, ever.

You are a bad boy though, you don't do long SMART tests. Spank yourself for being a bad boy, then go set them up! I'd recommend you do bi-weekly, but you must make sure they NEVER run at the same time as a scrub. My money says that if you run long SMART tests you'll probably find that one of the disks with internal errors has failed (and perhaps both).

At the least, I'd SERIOUSLY consider switching to RAIDZ2 and definitely monitor your 2 disks with some errors like a hawk.

So what you need to do is:

1. Back up your important data.. this might not go well for you in the end.
2. Replace your bad disk (per the manual).
3. Monitor your remaining disks. They're overheating and that's not cool. They will certainly have shorter lifespans as a result.
4. Use RAIDZ2 in the future and monitor your disks more closely. You probably don't have FreeNAS emailing set up (which is bad), nor do you have SMART monitoring (which is bad).

Overall, you missed a couple of key indicators that things were going wrong before you lost data.
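
For anyone reading the attribute tables by hand, the check described above (non-zero internal error counters versus a zero UDMA_CRC_Error_Count) can be roughly sketched in Python. The watch list, the regular expression, and the function names are assumptions made for illustration, not an official smartmontools rule set. The sample rows are copied from the Seagate report in this thread.

```python
import re

# Attributes whose non-zero raw value usually deserves a closer look.
# This watch list is an assumption for illustration only.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Reported_Uncorrect",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
}

# ID, name, flag, value, worst, thresh, type, updated, when_failed, raw
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+(\S+)\s+(\S+)\s+(\S+)\s+(\d+)"
)

def parse_attributes(text):
    """Return {attribute_name: raw_value} for every attribute row found."""
    attrs = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            attrs[match.group(2)] = int(match.group(9))
    return attrs

def flag_attributes(attrs):
    """Keep only watched attributes whose raw value is above zero."""
    return {name: raw for name, raw in attrs.items() if name in WATCHED and raw > 0}

# Rows copied from the ST2000DM001 report in this thread.
SAMPLE = """\
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 076 076 000 Old_age Always - 24
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 8
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 8
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
"""

print(flag_attributes(parse_attributes(SAMPLE)))
```

On these rows the sketch flags Reported_Uncorrect (24), Current_Pending_Sector (8) and Offline_Uncorrectable (8), while UDMA_CRC_Error_Count stays at zero, which matches the reading that the disk itself, rather than the cabling, is the suspect.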

jerkbag · Dec 3, 2013

Are you sure about the temperature? I have thought it was high in past SMART tests, but when I looked it up, my drives' normal operating temperature is 0-60C?

Anyway -- I guess my real question is: do these data errors actually mean any data loss? When I check the errors with zpool status -v I get lists of music files with permanent errors, for example, but they all play fine. I have a monthly scrub and daily email alerts set up, etc. The last scrub was a week ago. I guess it's not a cable, but I don't understand how I can have an error-free scrub a week ago, go to replace a disk and get like 37K errors during a resilver that stop occurring after a reboot, without that being an indication of some kind of temporary issue.

In any case, if the scrub I'm currently doing ends error free, does that mean that all my data is -- for the moment at least -- intact?

EDIT: oh wait, I just noticed this. My list of errors looks like this:

Code:

        canary@auto-20131127.0200-1w:<0x0>
        canary@auto-20131127.0200-1w:/var/db/portsnap/files/.AppleDouble/6b0a3c5c8ae0cd277dd73c65d970e16925e51eb9d7eadb15157539fa40e46f11.gz
        canary@auto-20131127.0200-1w:/var/db/portsnap/files/.AppleDouble/5702bafaf994bb1a85a3f66d9b260ac81413ba42cba2027f5fb6e59373990d4c.gz
        canary@auto-20131127.0200-1w:/var/db/portsnap/files/.AppleDouble/170bebcc4b34f090dae7706819f6640c478d04c185c6d8b4b5087c5688b4f56e.gz
        canary@auto-20131127.0200-1w:/var/db/portsnap/files/.AppleDouble/2394057661e506eae8cc04e75a1c120cbb0de357527ae63f8f3ae5771acb4257.gz

etc. . .

Does that auto-20131127 mean that all these errors have something to do with a nightly replication task I have running? Would that explain why I can still access my files with "permanent" errors as all my errors are on the snapshot? Any idea how that would occur?

cyberjock · Dec 3, 2013

Those files that get listed under zpool status -v ARE corrupt. The corruption could range from a split second of audio that you don't notice to a file that won't open at all.

What you'll need to do is delete those files to clear the errors and then restore from backup.

If you get any metadata locations listed under zpool status -v, then the pool is corrupted and you will have to destroy and recreate the pool from scratch.

Dusan · Dec 3, 2013

jerkbag said:
Does that auto-20131127 mean that all these errors have something to do with a nightly replication task I have running?

All the "@auto-..." files are snapshots ("previous versions").

cyberjock said:
If you get any metadata locations listed under zpool status -v, then the pool is corrupted and you will have to destroy and recreate the pool from scratch.

If all the metadata errors are in snapshots, then it is possible that destroying the snapshots will be enough.
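
To check whether every entry in a long error list belongs to a snapshot, the entries can be sorted by whether the part before the first colon contains an "@". The Python sketch below assumes the list has been saved as plain text, one entry per line. The function name `split_error_entries` is made up for illustration, and the `/mnt/canary/...` path in the sample is a hypothetical live-file entry, not something from the original post.

```python
def split_error_entries(lines):
    """Separate snapshot entries (dataset@snapshot:path) from everything else."""
    snapshot, live = [], []
    for raw in lines:
        entry = raw.strip()
        if not entry:
            continue
        if entry.startswith("/"):
            # Plain absolute path: a file in the live filesystem.
            live.append(entry)
            continue
        # Everything before the first ':' names the dataset or snapshot.
        head = entry.split(":", 1)[0]
        (snapshot if "@" in head else live).append(entry)
    return snapshot, live

SAMPLE = """\
        canary@auto-20131127.0200-1w:<0x0>
        canary@auto-20131127.0200-1w:/var/db/portsnap/files/.AppleDouble/6b0a3c5c8ae0cd277dd73c65d970e16925e51eb9d7eadb15157539fa40e46f11.gz
        /mnt/canary/music/example.mp3
"""

snapshot_entries, live_entries = split_error_entries(SAMPLE.splitlines())
snapshots = sorted({e.split(":", 1)[0] for e in snapshot_entries})
print(len(snapshot_entries), "snapshot entries in", snapshots)
print(len(live_entries), "live entries:", live_entries)
```

If the live list comes back empty, every reported error sits in a snapshot, which is the situation where destroying the affected snapshots may be enough.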

cyberjock · Dec 4, 2013

Metadata errors aren't in files. They are in metadata. They'll say things like 0x04 and won't give a file name.

Dusan said:
All the "@auto-..." files are snapshots ("previous versions").

If all the metadata errors are in snapshots, then it is possible that destroying the snapshots will be enough.

We haven't seen that here historically. Not sure exactly why. I guess if the OP deletes the files and posts what's left we can see what is going on. It seems more often than not metadata that is corrupted is only fixed by destroying the pool.

jerkbag · Dec 4, 2013

Thanks for the help everyone. So it turned out every single error was from an old snapshot, so hopefully it's not as bad as it sounds. I have a few last questions:

1) So I did another scrub and it completed successfully with no errors. Does this mean that the 37K errors from before *no longer exist*? I can't 100% get my head around it, but I believe a scrub is a system check of all data at a moment in time, correct? So if these errors still existed, it would find them the second time around as well? If so, am I right in thinking that all my data is currently perfectly intact? If that is the case, is it possible that the problematic snapshot has been rolled over, so the errors associated with it are no longer present? I am hoping that's what it is -- which would mean the scrub didn't "fix" the errors so much as the files that caused them are gone.

2) A recommended SMART test schedule :) ?

3) Can you migrate from RAIDZ1 to RAIDZ2 without starting over?

Thanks again for the info!

cyberjock · Dec 4, 2013

1. If you have no errors after a scrub then your data is probably safe.
2. Yeah, that's not something we can give you, as you have to set up a schedule for your situation. I made a post a few weeks ago with my detailed schedule for hard drive maintenance. Might want to find that post or read up and come up with your own.
3. No, you have to destroy the pool and start over.

Edit: My post... http://forums.freenas.org/threads/need-help-setting-up-smart-zfs-scrubs.16182/#post-82403

jerkbag · Dec 4, 2013

sounds good - thanks!
