RandomTask
Dabbler
- Joined
- Jul 8, 2016
- Messages
- 11
I recently started seeing SMART test error messages in the WebGUI for two of my drives:
Code:
- Device: /dev/ada1, Read SMART Self-Test Log Failed
- Device: /dev/ada3, Self-Test Log error count increased from 1 to 2
Running long smartctl tests and reviewing the results revealed the following:
Code:
# smartctl -a /dev/ada1
// ... drive information
SMART overall-health self-assessment test result: PASSED
// ... more drive information
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   119   095   051    Pre-fail  Always       -       260985
  3 Spin_Up_Time            0x0027   237   175   021    Pre-fail  Always       -       1133
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       73
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   036   036   000    Old_age   Always       -       47249
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       73
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       59
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       13
194 Temperature_Celsius     0x0022   108   095   000    Old_age   Always       -       39
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       13
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                    Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Interrupted (host reset)      90%     46915         -
# 2  Short offline       Completed: read failure       90%     46804         362115280
# 3  Short offline       Completed: read failure       70%     46804         3906320080
# 4  Short offline       Completed: read failure       90%     46804         362115280
# 5  Extended offline    Completed: read failure       90%     46781         362115280
# 6  Short offline       Completed: read failure       90%     46781         362115280
# 7  Extended offline    Completed: read failure       90%     30850         362119544
# 8  Extended offline    Completed: read failure       90%     30843         362119544
# 9  Extended offline    Completed: read failure       90%     30826         362119544
#10  Extended offline    Completed: read failure       90%     30819         362119544
#11  Extended offline    Completed: read failure       90%     30587         362119544
#12  Extended offline    Completed: read failure       90%     30580         362119544
#13  Extended offline    Completed: read failure       90%     30347         362119544
#14  Extended offline    Completed: read failure       90%     30340         362119544
#15  Extended offline    Completed: read failure       90%     30323         362119544
#16  Extended offline    Completed: read failure       90%     30316         362119544
#17  Extended offline    Completed: read failure       90%     30107         362119544
#18  Extended offline    Completed: read failure       90%     30100         362119544
#19  Extended offline    Completed: read failure       90%     30083         362119544
#20  Extended offline    Completed: read failure       90%     30076         362119544
#21  Extended offline    Completed: read failure       90%     29843         362119544
Code:
# smartctl -a /dev/ada3
// ... drive information
SMART overall-health self-assessment test result: PASSED
// ... more drive information
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       777
  3 Spin_Up_Time            0x0027   179   177   021    Pre-fail  Always       -       4033
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       74
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   037   037   000    Old_age   Always       -       46058
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       73
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       56
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       17
194 Temperature_Celsius     0x0022   108   095   000    Old_age   Always       -       39
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       3
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                    Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       80%     45772         296677872
# 2  Short offline       Completed: read failure       90%     45680         305110318
# 3  Extended offline    Completed: read failure       90%     45601         296266848
# 4  Extended offline    Aborted by host               90%     45594         -
# 5  Extended offline    Completed without error       00%     29662         -
# 6  Extended offline    Completed without error       00%     29655         -
# 7  Extended offline    Completed without error       00%     29638         -
# 8  Extended offline    Completed without error       00%     29631         -
# 9  Extended offline    Completed without error       00%     29399         -
#10  Extended offline    Completed without error       00%     29392         -
#11  Extended offline    Completed without error       00%     29159         -
#12  Extended offline    Completed without error       00%     29152         -
#13  Extended offline    Completed without error       00%     29135         -
#14  Extended offline    Completed without error       00%     29128         -
#15  Extended offline    Completed without error       00%     28919         -
#16  Extended offline    Completed without error       00%     28912         -
#17  Extended offline    Completed without error       00%     28895         -
#18  Extended offline    Completed without error       00%     28888         -
Despite the overall health assessment reporting the drives as OK, I've read a number of posts that say once you start seeing pending sectors stack up, it's time to replace the drive. Those posts also suggest that the cables should be the first thing to scrutinize, so I've ordered a complete new set, which should arrive soon.
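For anyone who wants to check the same counters on their own drives, here is a rough Python sketch that pulls the raw values out of the smartctl attribute table and reports any non-zero sector counters. The sample rows are copied from the ada1 output above; the function names and the choice of which attributes to watch are my own assumptions, and it assumes the ten-column table layout shown in this post.

```python
# Sample rows copied from the ada1 attribute table in this post.
# In practice the text would come from something like `smartctl -A /dev/ada1`.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       13
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
"""

# Attributes to watch (an assumption for this sketch, not an official list).
WATCHED = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"}


def parse_attributes(text):
    """Return {attribute_name: raw_value} for each row of the attribute table."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # A table row has 10 columns: ID, name, flag, value, worst,
        # thresh, type, updated, when_failed, raw value.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            attrs[fields[1]] = int(fields[9])
    return attrs


def nonzero_watched(attrs):
    """Keep only the watched attributes whose raw value is above zero."""
    return {name: raw for name, raw in attrs.items() if name in WATCHED and raw > 0}


if __name__ == "__main__":
    print(nonzero_watched(parse_attributes(SAMPLE)))
```

With the sample rows, only Current_Pending_Sector is reported, which matches what I see on ada1: no reallocated or offline-uncorrectable sectors, but 13 pending.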
Before replacing the drives, I did attempt dd on one of the offending sectors just to see if the pending sectors would reduce, but it didn't seem to help.
I've used the WebGUI to replace the big offender, ada1; however, the resilvering process seems unable to finish, often restarting one or more times each day. It has been running since January 1 and has yet to complete successfully.
I did happen to notice once that ada1 was offline and then was online again. I'm wondering if the cables may be faulty and causing the process to restart, or maybe I need to replace ada3 as well because it has unreadable sectors?
I've also disabled SMART tests and zpool scrubs while the resilvering was happening just in case that was causing the reset, but it doesn't seem to have made a difference. I had scrubs set to run once a month and long SMART tests to run once a month (each disk on a different day).
My zpool reports as healthy:
Code:
  pool: Tartarus
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Jan 16 20:00:55 2019
        52.2G scanned out of 1.27T at 5.79M/s, 61h26m to go
        8.67G resilvered, 4.01% done
config:

        NAME                                              STATE     READ WRITE CKSUM
        Tartarus                                          ONLINE       0     0     0
          raidz2-0                                        ONLINE       0     0     0
            gptid/b532dea8-2c7e-11e2-b689-902b34563f56    ONLINE       0     0     0
            replacing-1                                   ONLINE       0     0 1.05K
              gptid/b587a21a-2c7e-11e2-b689-902b34563f56  ONLINE       0     0     0  (resilvering)
              gptid/9caf3b27-0b81-11e9-af0c-902b34563f56  ONLINE       0     0     0  (resilvering)
            gptid/b5d9849a-2c7e-11e2-b689-902b34563f56    ONLINE       0     0     0
            gptid/b62b9a91-2c7e-11e2-b689-902b34563f56    ONLINE       0     0     0
            gptid/b6808fc8-2c7e-11e2-b689-902b34563f56    ONLINE       0     0     0
            gptid/b6d39638-2c7e-11e2-b689-902b34563f56    ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0h2m with 0 errors on Thu Dec 13 03:47:08 2018
config:

        NAME        STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          da0p2     ONLINE       0     0     0

errors: No known data errors
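As a sanity check on the "to go" figure in the scan line, here is a small Python sketch that recomputes the remaining time from the scanned size, total size, and rate. It assumes the K/M/G/T suffixes are binary units (powers of 1024) and that the rate stays constant; the helper names are made up for this example.

```python
# Binary unit multipliers (an assumption about how zpool rounds its sizes).
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def to_bytes(size):
    """Convert a zpool-style size such as '52.2G' to a byte count."""
    return float(size[:-1]) * UNITS[size[-1]]


def hours_remaining(scanned, total, rate_per_sec):
    """Estimate hours left, assuming the scan rate does not change."""
    remaining = to_bytes(total) - to_bytes(scanned)
    return remaining / to_bytes(rate_per_sec) / 3600


if __name__ == "__main__":
    # Figures taken from the scan line above.
    print(round(hours_remaining("52.2G", "1.27T", "5.79M"), 1))
```

This comes out at roughly 61 hours, in line with the 61h26m that zpool reports, so at under 6M/s every restart costs about two and a half days of progress.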
So my questions are:
1. Can I turn the server off if the resilvering process doesn't complete so that I can replace the cables or am I stuck?
2. Is there a log or something that I can look into to see why the resilvering keeps restarting?
3. Is there something else I should be doing to make the resilvering successful?
My Specs:
Code:
FreeNAS-9.10.2-U1 (86c7ef5)
Intel(R) Core(TM) i5-3550 CPU @ 3.30GHz
Gigabyte GA-Z77X-UD5H Motherboard
16GB RAM
8GB Sandisk USB for OS
6x 2TB WD Red 7200 RPM SATA (Added a 7th to start the replacement process)
1 zpool, 2 disks of redundancy
No RAID setup, all disks are connected directly to the motherboard
I've reviewed the information in these posts but I'm stumped as to why the resilvering keeps restarting:
https://www.ixsystems.com/documenta...rage.html#replacing-drives-to-grow-a-zfs-pool
https://www.ixsystems.com/documentation/freenas/9.10/storage.html#replacing-a-failed-drive
https://forums.freenas.org/index.ph...-10-freenas-reference-manual-not-clear.54195/
https://forums.freenas.org/index.ph...bleshooting-guide-all-versions-of-freenas.17/
https://forums.freenas.org/index.ph...us-resilvering-please-help.57619/#post-406483
I've exported the debug information from the WebGUI as well and can upload if needed.