Hello,
(sorry for my bad English)
I have an old FreeNAS-9.10.2-U6 (561f0d7a1) system running on an old first-generation HP ProLiant MicroServer, with two desktop Seagate 4TB disks in a mirror setup. This is the current status:
Code:
  pool: zfs1
 state: DEGRADED
status: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 46h25m with 1 errors on Tue Mar 26 01:26:29 2019
config:

        NAME                                            STATE     READ WRITE CKSUM
        zfs1                                            DEGRADED 1.65M     0 7.57K
          mirror-0                                      DEGRADED 6.58M     0 30.3K
            gptid/03cc6e09-df0f-11e5-9697-009c02a7fa32  DEGRADED     0     0 6.61M  too many errors
            2975835688515449866                         UNAVAIL      0     0     0  was /dev/gptid/9d79ca97-47f5-11e8-b3a4-009c02a7fa32
            gptid/b4dd3975-4f65-11e8-ab2f-009c02a7fa32  ONLINE       0     0 6.61M

errors: Permanent errors have been detected in the following files:

        /mnt/zfs1/jails/OpenVPN/tmp
        <0xffffffffffffffff>:<0x0>
Now, the long story.
Last December, I got an email from the FreeNAS server reporting a lot of SMART errors on one of the disks. The disk was totally unusable, so I just physically removed it from the server (my bad, yes). Then I installed a new disk (the same model, Seagate 4TB) and issued an add command. The resilver started.
This is the first problem: I have a ghost device (2975835688515449866) which I can't remove. The detach command returns:
Code:
[root@freenas] ~# zpool detach zfs1 2975835688515449866
cannot detach 2975835688515449866: no valid replicas
[root@freenas] ~# zpool detach zfs1 /dev/gptid/9d79ca97-47f5-11e8-b3a4-009c02a7fa32
cannot detach /dev/gptid/9d79ca97-47f5-11e8-b3a4-009c02a7fa32: no valid replicas
More problems: the READ, WRITE and CKSUM columns show a lot of errors, and the CKSUM columns for the two disks ALWAYS show the same value.
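A minimal sketch for tracking this over time, assuming the leaf devices are labelled gptid/... as in the status output above. The helper name cksum_per_disk is hypothetical; it only filters text from zpool status and changes nothing in the pool.

```shell
#!/bin/sh
# Print "<device> <CKSUM>" for every gptid-labelled leaf device found in
# `zpool status` output read from stdin. The fifth column is CKSUM.
# Assumption: leaf devices appear as gptid/<uuid>, as in this pool.
cksum_per_disk() {
    awk '$1 ~ /^gptid\// && NF >= 5 { print $1, $5 }'
}

# Usage on the live system (pool name taken from this post):
#   zpool status zfs1 | cksum_per_disk
```

Running it before and after a scrub or resilver would show whether the two counters really move in lockstep.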
I have already checked both disks with SMART tests; both the short and the long tests return no errors. These are the results for the two disks:
Code:
[root@freenas] ~# smartctl -a /dev/ada0
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 3.5
Device Model:     ST4000DM004-2CV104
Serial Number:    ZFN14JSY
LU WWN Device Id: 5 000c50 0aff03e78
Firmware Version: 0001
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5425 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Mon Apr 15 21:13:26 2019 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (    0) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 484) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x30a5) SCT Status supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   084   064   006    Pre-fail  Always       -       240686392
  3 Spin_Up_Time            0x0003   097   097   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       16
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   092   060   045    Pre-fail  Always       -       1542408167
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       8474 (177 116 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       16
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       0 0 2
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   072   061   040    Old_age   Always       -       28 (Min/Max 25/29)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       276
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       352
194 Temperature_Celsius     0x0022   028   040   000    Old_age   Always       -       28 (0 20 0 0 0)
195 Hardware_ECC_Recovered  0x001a   084   064   000    Old_age   Always       -       240686392
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       8439h+10m+11.326s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       20354682795
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       101287520165

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      8458         -
# 2  Short offline       Completed without error       00%      8445         -
# 3  Short offline       Completed without error       00%      8410         -
# 4  Short offline       Completed without error       00%      8362         -
# 5  Short offline       Completed without error       00%      8314         -
# 6  Short offline       Completed without error       00%      8266         -
# 7  Short offline       Completed without error       00%      8218         -
# 8  Short offline       Completed without error       00%      8170         -
# 9  Extended offline    Completed without error       00%      8134         -
#10  Short offline       Completed without error       00%      8051         -
#11  Short offline       Completed without error       00%      8004         -
#12  Short offline       Completed without error       00%      7955         -
#13  Short offline       Completed without error       00%      7907         -
#14  Short offline       Completed without error       00%      7859         -
#15  Short offline       Completed without error       00%      7811         -
#16  Extended offline    Completed without error       00%      7779         -
#17  Short offline       Completed without error       00%      7715         -
#18  Short offline       Completed without error       00%      7667         -
#19  Short offline       Completed without error       00%      7619         -
#20  Short offline       Completed without error       00%      7571         -
#21  Short offline       Completed without error       00%      7523         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Code:
[root@freenas] ~# smartctl -a /dev/ada1
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 3.5
Device Model:     ST4000DM004-2CV104
Serial Number:    ZFN0GWZD
LU WWN Device Id: 5 000c50 0a4b23fe1
Firmware Version: 0001
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5425 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Mon Apr 15 21:15:55 2019 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (    0) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 473) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x30a5) SCT Status supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   075   064   006    Pre-fail  Always       -       29218056
  3 Spin_Up_Time            0x0003   097   097   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       8
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   092   060   045    Pre-fail  Always       -       1560721189
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       8252 (215 250 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       8
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       0 0 2
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   072   051   040    Old_age   Always       -       28 (Min/Max 26/29)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       233
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       333
194 Temperature_Celsius     0x0022   028   049   000    Old_age   Always       -       28 (0 20 0 0 0)
195 Hardware_ECC_Recovered  0x001a   075   064   000    Old_age   Always       -       29218056
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       8219h+20m+48.654s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       47000063073
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       56545103875

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      8236         -
# 2  Short offline       Completed without error       00%      8223         -
# 3  Short offline       Completed without error       00%      8188         -
# 4  Short offline       Completed without error       00%      8143         -
# 5  Short offline       Completed without error       00%      8092         -
# 6  Short offline       Completed without error       00%      8044         -
# 7  Short offline       Completed without error       00%      7997         -
# 8  Short offline       Completed without error       00%      7948         -
# 9  Extended offline    Completed without error       00%      7901         -
#10  Short offline       Completed without error       00%      7830         -
#11  Short offline       Completed without error       00%      7781         -
#12  Short offline       Completed without error       00%      7733         -
#13  Short offline       Completed without error       00%      7685         -
#14  Short offline       Completed without error       00%      7637         -
#15  Short offline       Completed without error       00%      7589         -
#16  Extended offline    Completed without error       00%      7556         -
#17  Short offline       Completed without error       00%      7493         -
#18  Short offline       Completed without error       00%      7445         -
#19  Short offline       Completed without error       00%      7397         -
#20  Short offline       Completed without error       00%      7349         -
#21  Short offline       Completed without error       00%      7301         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
I'm not an expert, but I think the disks are OK. So why such large CKSUM error counts?
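For readers who want to compare the two disks quickly, here is a minimal sketch that pulls a few raw values out of the smartctl attribute table. The choice of attributes (5, 197, 198 and 199) is an assumption about which counters are most often looked at for media and cabling problems, not a rule, and the helper name smart_key_attrs is hypothetical.

```shell
#!/bin/sh
# Print "<attribute name> <raw value>" for SMART attributes 5, 197, 198
# and 199 from smartctl attribute-table output read from stdin.
# The NF >= 10 guard skips shorter rows that also start with a number
# (for example the selective self-test table in `smartctl -a` output).
smart_key_attrs() {
    awk 'NF >= 10 && ($1 == 5 || $1 == 197 || $1 == 198 || $1 == 199) { print $2, $10 }'
}

# Usage on the live system (device names taken from this post):
#   smartctl -A /dev/ada0 | smart_key_attrs
#   smartctl -A /dev/ada1 | smart_key_attrs
```

In the two dumps above, all four of these raw values are 0 on both disks.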
I have dismantled the server, cleaned out all the dust, cleaned the connectors, and reassembled it to rule out bad cables or connectors (the server is very old), but I still get the same errors.
However, I have checked the data and it seems to be correct. I can't find any corrupted file.
The only problem is the tmp directory of the OpenVPN jail, which is listed as a permanent error. It's like a black hole: you can copy or create any file in it, but ls returns 0 files.
I have already run a scrub several times, but it always ends with 0 repaired and 1 error, which I think is the tmp directory.
I don't know what <0xffffffffffffffff>:<0x0> is; searching on Google doesn't give any clue.
More information: a zpool clear command only starts a resilver, which ends in the same situation. Rebooting the server starts a resilver again, which also ends in the same situation. The resilver always goes from the first disk in the list (the old, good disk) to the second disk in the list (the new, good disk).
So, I'm totally stuck with this. I'm out of ideas.
Any help, please?
Thanks!