Update: I ended up RMAing the drive, as it was *just* within the 3-year WD Red warranty (last month). I included the SMART error log with the request, and it was accepted with no questions asked.
Also got a spare one to prevent a similar lengthy downtime in the future. And finally I made sure all drives were selected in the (short/long) SMART tasks.
If anyone has the same problem, can't RMA, and is also unable to fix it using badblocks or dd by forcing a write to the bad sector(s): try this FreeNAS post (set the drive offline, wipe it, resilver).
I can't vouch that that will work, but it'd have been my last resort. I'll be sure to try it once one of my other (old) drives starts giving bad sectors.
===========
Hello,
Last week I received an email that a sector can't be read anymore from my ada4 device.
I've been trying to get the sector remapped in order to "fix" the issue, without any success: dd keeps failing with "Input/output error".
I read and used the following sources:
https://dekoder.wordpress.com/2014/10/08/fixing-freenas-currently-unreadable-pending-sectors-error/
https://forums.freenas.org/index.php?threads/currently-unreadable-pending-sectors.46395/
https://forums.freenas.org/index.ph...bleshooting-guide-all-versions-of-freenas.17/
http://linux.101hacks.com/unix/badblocks/
https://forums.freenas.org/index.ph...-1-currently-unreadable-pending-sectors.9824/
http://daemon-notes.com/articles/system/smartmontools/current-pending
http://www.freebsddiary.org/smart-fixing-bad-sector.php
First I tried a long SMART test to find the failed sector:
Code:
[root@fnas] ~# smartctl -t long /dev/ada4
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 542 minutes for test to complete.
Test will complete after Thu Jun 1 21:35:48 2017
Use smartctl -X to abort test.
Which was successful in locating the problem:
Code:
[root@fnas] ~# smartctl -a /dev/ada4
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
...
Sector Sizes: 512 bytes logical, 4096 bytes physical
...
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
...
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     24578         1678762305 << seek = 1678762305
# 2  Short offline       Completed without error       00%      5518         -
...
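To avoid copying the LBA out of the self-test log by hand, the failing entries can be filtered with awk. This is only a sketch: the function name `first_error_lbas` is made up for illustration, and it assumes the log rows look like the ones shown above (row number after `#`, the words "read failure" in the status, the LBA as the last field, and no trailing annotation).

```shell
# Print the LBA_of_first_error of every self-test row that reports a read failure.
# Reads smartctl self-test log text on stdin.
# Hypothetical helper, not part of smartmontools.
first_error_lbas() {
    awk '/^# *[0-9]+ / && /read failure/ { print $NF }'
}

# Demo on two rows in the same layout as the log above.
printf '%s\n' \
    '# 1  Extended offline    Completed: read failure       90%     24578         1678762305' \
    '# 2  Short offline       Completed without error       00%      5518         -' \
    | first_error_lbas
```

On a live system this would be fed from something like `smartctl -l selftest /dev/ada4 | first_error_lbas`.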
Next up, I wanted to write to the sector to make it relocate:
Code:
[root@fnas] ~# diskinfo -v /dev/ada4
/dev/ada4
        512             # sectorsize
        ...
        4096            # stripesize << blocksize (bs) = 4096
        ...
[root@fnas] ~# sysctl kern.geom.debugflags=16 #same as: sysctl kern.geom.debugflags=0x10
kern.geom.debugflags: 0 -> 16
[root@fnas] ~# dd if=/dev/zero of=/dev/ada4 bs=4096 count=1 seek=1678762305 conv=noerror,sync
dd: /dev/ada4: Input/output error
1+0 records in
0+0 records out
0 bytes transferred in 0.000061 secs (0 bytes/sec)
Initially I hadn't noticed the I/O error, and immediately started another long SMART test for the drive:
Code:
[root@fnas] ~# smartctl -t long /dev/ada4
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
...
Please wait 542 minutes for test to complete.
Test will complete after Sat Jun 3 06:53:16 2017
The following evening I checked the log again:
Code:
[root@fnas] ~# smartctl -a /dev/ada4
...
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
...
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     24611         1678762304 << seek = 1678762304
# 2  Extended offline    Completed: read failure       90%     24578         1678762305
# 3  Short offline       Completed without error       00%      5518         -
...
That is a new LBA; it seems the test failed one sector earlier this time. I tried the "fix" again:
Code:
[root@fnas] ~# sysctl kern.geom.debugflags=16
kern.geom.debugflags: 16 -> 16
[root@fnas] ~# dd if=/dev/zero of=/dev/ada4 bs=4096 count=1 seek=1678762304 conv=noerror,sync
dd: /dev/ada4: Input/output error
1+0 records in
0+0 records out
0 bytes transferred in 0.000066 secs (0 bytes/sec)
Tried with badblocks next:
Code:
[root@fnas] ~# badblocks -b 4096 -wsv -c 64 -p 10 /dev/ada4 1678762305 1678762304
Checking for bad blocks in read-write mode
From block 1678762304 to 1678762305
Testing with pattern 0xaa: set_o_direct: Inappropriate ioctl for device
1678762304
1678762305
done
Reading and comparing: done
Testing with pattern 0x55: done
Reading and comparing: done
Testing with pattern 0xff: done
Reading and comparing: done
Testing with pattern 0x00: done
Reading and comparing: done
Pass completed, 2 bad blocks found. (0/2/0 errors)
...
Checking for bad blocks in read-write mode
From block 1678762304 to 1678762305
Testing with pattern 0xaa: done
Reading and comparing: done
Testing with pattern 0x55: done
Reading and comparing: done
Testing with pattern 0xff: done
Reading and comparing: done
Testing with pattern 0x00: done
Reading and comparing: done
Pass completed, 0 bad blocks found. (0/2/0 errors)
But every single time I retry this command, it'll say "2 bad blocks found" once, followed nine times by "0 bad blocks found".
So it doesn't seem to do anything.
Since badblocks also has a non-destructive mode, I decided to run the following for the entire disk in screen overnight (and scheduled another long test):
Code:
[root@fnas] ~# badblocks -v -b 4096 -s /dev/ada4
Checking blocks 0 to 976754645
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found. (0/0/0 errors)
So the read-only pass over the whole disk found nothing.
Then I read something about taking the drive offline from the ZFS pool while doing these operations, so I attempted that next:
Code:
[root@fnas] ~# glabel status
                                      Name  Status  Components
gptid/221fef8d-6648-11e6-93ab-bc5ff4fb5e9c     N/A  ada0p1
gptid/c3605fab-09e1-11e4-9783-bc5ff4fb5e9c     N/A  ada1p2
gptid/30d71158-0de2-11e4-b754-bc5ff4fb5e9c     N/A  ada2p2
gptid/31bca9bc-0de2-11e4-b754-bc5ff4fb5e9c     N/A  ada3p2
gptid/32290850-0de2-11e4-b754-bc5ff4fb5e9c     N/A  ada4p2 << !
gptid/32922a60-0de2-11e4-b754-bc5ff4fb5e9c     N/A  ada5p2
gptid/33025476-0de2-11e4-b754-bc5ff4fb5e9c     N/A  ada6p2
gptid/314f5e04-0de2-11e4-b754-bc5ff4fb5e9c     N/A  ada7p2
[root@fnas] ~# zpool offline HDD gptid/32290850-0de2-11e4-b754-bc5ff4fb5e9c
[root@fnas] ~# ls /dev/ada4*
/dev/ada4 /dev/ada4p1 /dev/ada4p1.eli /dev/ada4p2
[root@fnas] ~# dd if=/dev/zero of=/dev/ada4 bs=4096 count=1 seek=1678762304 conv=noerror,sync
dd: /dev/ada4: Input/output error
1+0 records in
0+0 records out
0 bytes transferred in 0.000059 secs (0 bytes/sec)
[root@fnas] ~# dd if=/dev/zero of=/dev/ada4p1 bs=4096 count=1 seek=1678762304 conv=noerror,sync
dd: /dev/ada4p1: Operation not permitted
[root@fnas] ~# dd if=/dev/zero of=/dev/ada4p1.eli bs=4096 count=1 seek=1678762304 conv=noerror,sync
dd: /dev/ada4p1.eli: Input/output error
1+0 records in
0+0 records out
0 bytes transferred in 0.000120 secs (0 bytes/sec)
[root@fnas] ~# dd if=/dev/zero of=/dev/ada4p2 bs=4096 count=1 seek=1678762304 conv=noerror,sync
dd: /dev/ada4p2: Input/output error
1+0 records in
0+0 records out
0 bytes transferred in 0.000064 secs (0 bytes/sec)
[root@fnas] ~# zpool status
...
NAME                      STATE     READ WRITE CKSUM
HDD                       DEGRADED     0     0     0
  raidz2-0                DEGRADED     0     0     0
...
    15598725717587329647  OFFLINE      0     0     0  was /dev/gptid/32290850-0de2-11e4-b754-bc5ff4fb5e9c
[root@fnas] ~# zpool online HDD 15598725717587329647
[root@fnas] ~# zpool status
pool: HDD
state: ONLINE
scan: resilvered 596K in 0h0m with 0 errors on Sun Jun 4 16:27:35 2017
...
So apart from me learning some new commands, it didn't change anything.
Finally I tried some smaller block sizes:
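For anyone repeating the offline step: the gptid that zpool expects can be pulled out of the `glabel status` listing instead of being read off by eye. A minimal sketch, assuming the three-column layout shown above (name, status, component); `gptid_for` is a hypothetical helper name, not an existing tool.

```shell
# Print the label (first column) whose component (third column) matches $1.
# Reads `glabel status` output on stdin. Hypothetical helper for illustration.
gptid_for() {
    awk -v part="$1" '$3 == part { print $1 }'
}

# Demo on two rows copied from the listing above.
printf '%s\n' \
    'gptid/31bca9bc-0de2-11e4-b754-bc5ff4fb5e9c     N/A  ada3p2' \
    'gptid/32290850-0de2-11e4-b754-bc5ff4fb5e9c     N/A  ada4p2' \
    | gptid_for ada4p2
```

On the NAS itself the input would come from `glabel status | gptid_for ada4p2`, and the result is what was passed to `zpool offline HDD ...` above.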
Code:
[root@fnas] ~# dd if=/dev/ada4 of=/dev/ada4 bs=512 count=1 iseek=1678762304 oseek=1678762304 conv=noerror,sync
1+0 records in
1+0 records out
512 bytes transferred in 0.020933 secs (24459 bytes/sec)
[root@fnas] ~# dd if=/dev/ada4 of=/dev/ada4 bs=1024 count=1 iseek=1678762304 oseek=1678762304 conv=noerror,sync
1+0 records in
1+0 records out
1024 bytes transferred in 0.014450 secs (70866 bytes/sec)
[root@fnas] ~# dd if=/dev/ada4 of=/dev/ada4 bs=2048 count=1 iseek=1678762304 oseek=1678762304 conv=noerror,sync
1+0 records in
1+0 records out
2048 bytes transferred in 0.016503 secs (124100 bytes/sec)
[root@fnas] ~# dd if=/dev/ada4 of=/dev/ada4 bs=4096 count=1 iseek=1678762304 oseek=1678762304 conv=noerror,sync
dd: /dev/ada4: Input/output error
0+0 records in
0+0 records out
0 bytes transferred in 0.000042 secs (0 bytes/sec)
dd: /dev/ada4: Input/output error
dd: /dev/ada4: Input/output error
1+0 records in
0+0 records out
0 bytes transferred in 0.000068 secs (0 bytes/sec)
Same with /dev/zero as input:
Code:
[root@fnas] ~# dd if=/dev/zero of=/dev/ada4 bs=512 count=1 oseek=1678762304 conv=noerror,sync
1+0 records in
1+0 records out
512 bytes transferred in 0.000160 secs (3200423 bytes/sec)
[root@fnas] ~# dd if=/dev/zero of=/dev/ada4 bs=1024 count=1 oseek=1678762304 conv=noerror,sync
1+0 records in
1+0 records out
1024 bytes transferred in 0.000201 secs (5094860 bytes/sec)
[root@fnas] ~# dd if=/dev/zero of=/dev/ada4 bs=2048 count=1 oseek=1678762304 conv=noerror,sync
1+0 records in
1+0 records out
2048 bytes transferred in 0.000163 secs (12558384 bytes/sec)
[root@fnas] ~# dd if=/dev/zero of=/dev/ada4 bs=4096 count=1 oseek=1678762304 conv=noerror,sync
dd: /dev/ada4: Input/output error
1+0 records in
0+0 records out
0 bytes transferred in 0.000056 secs (0 bytes/sec)
[root@fnas] ~# dd if=/dev/zero of=/dev/ada4 bs=512 count=8 oseek=1678762304 conv=noerror,sync
8+0 records in
8+0 records out
4096 bytes transferred in 0.000802 secs (5106977 bytes/sec)
bs=4096 always fails.
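A likely reason for this pattern, and the one EDIT #1 further down arrives at: dd counts seek/oseek in units of bs, not in 512-byte LBAs. With bs=4096 the same seek value points far beyond the end of a 4 TB disk, while with bs=512, 1024 or 2048 it still lands inside it. The sketch below just does that arithmetic, using the capacity smartctl reports for this drive (4,000,787,030,016 bytes); `seek_in_range` is a hypothetical helper name, and the assumption that an out-of-range seek is what produces the I/O error is mine.

```shell
# Capacity of the drive in bytes, as reported by smartctl in this thread.
DISK_BYTES=4000787030016

# seek_in_range BS SEEK
# Succeeds when one block of BS bytes written at dd seek=SEEK ends inside the disk.
# Hypothetical helper for illustration.
seek_in_range() {
    [ $(( ($2 + 1) * $1 )) -le "$DISK_BYTES" ]
}

# Same seek value, different block sizes, as in the dd runs above.
for size in 512 1024 2048 4096; do
    if seek_in_range "$size" 1678762304; then
        echo "bs=$size seek=1678762304: byte offset $((1678762304 * size)) is inside the disk"
    else
        echo "bs=$size seek=1678762304: byte offset $((1678762304 * size)) is past the end"
    fi
done
```

Note that this also means the bs=1024 and bs=2048 runs wrote to byte offsets other than the failing sector, since only bs=512 makes the seek value equal to the LBA.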
The Current_Pending_Sector is never reset to 0:
Code:
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD40EFRX-68WT0N0
Firmware Version: 80.00A80
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Jun 4 18:18:26 2017 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (54240) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 542) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       8
  3 Spin_Up_Time            0x0027   194   178   021    Pre-fail  Always       -       7300
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       67
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   067   067   000    Old_age   Always       -       24655
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       67
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       43
193 Load_Cycle_Count        0x0032   192   192   000    Old_age   Always       -       26449
194 Temperature_Celsius     0x0022   112   099   000    Old_age   Always       -       40
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     24637         -
# 2  Extended offline    Completed: read failure       90%     24634         1678762304
# 3  Extended offline    Completed: read failure       90%     24611         1678762304
# 4  Extended offline    Completed: read failure       90%     24578         1678762305
# 5  Short offline       Completed without error       00%      5518         -
# 6  Short offline       Completed without error       00%      5470         -
# 7  Short offline       Completed without error       00%      5422         -
# 8  Short offline       Completed without error       00%      5374         -
# 9  Short offline       Completed without error       00%      5326         -
#10  Short offline       Completed without error       00%      5278         -
#11  Extended offline    Completed without error       00%      5265         -
#12  Short offline       Completed without error       00%      5230         -
#13  Short offline       Completed without error       00%      5182         -
#14  Short offline       Completed without error       00%      5134         -
#15  Short offline       Completed without error       00%      5086         -
#16  Short offline       Completed without error       00%      5038         -
#17  Short offline       Completed without error       00%      4990         -
#18  Short offline       Completed without error       00%      4942         -
#19  Extended offline    Interrupted (host reset)      10%      4928         -
#20  Short offline       Completed without error       00%      4894         -
#21  Short offline       Completed without error       00%      4846         -
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Does anyone know what I might be doing wrong? How can I "fix" this error?
Please don't just say "RMA the drive". It isn't worth the hassle so far.
Thank you
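EDIT #1: I decided to read some more, and apparently I had to divide the LBA by 8 when writing with bs=4096 (the LBA is counted in 512-byte sectors, while dd's seek is counted in blocks of bs bytes), so I did that.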
Code:
[root@fnas] ~# dd if=/dev/zero of=/dev/ada4 bs=4096 count=1 seek=209845288 conv=noerror,sync # 1 678 762 304 / 8 = 209 845 288
1+0 records in
1+0 records out
4096 bytes transferred in 0.000204 secs (20093414 bytes/sec)
[root@fnas] ~# smartctl -a /dev/ada4 | grep -e Current_Pending_Sector -e failure
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
# 2  Extended offline    Completed: read failure       90%     24634         1678762304
# 3  Extended offline    Completed: read failure       90%     24611         1678762304
# 4  Extended offline    Completed: read failure       90%     24578         1678762305
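The division by 8 is just the ratio of the physical to the logical sector size (4096 / 512). A small sketch of the conversion, using the sector sizes reported for this drive; the function names `lba_to_seek` and `block_lba_range` are made up for illustration. It also shows that both LBAs from the self-test log (1678762304 and 1678762305) fall in the same 4096-byte block.

```shell
LOGICAL=512     # bytes per logical sector (smartctl: "512 bytes logical")
PHYSICAL=4096   # bytes per physical sector (smartctl: "4096 bytes physical")

# Convert a 512-byte LBA into the dd seek value to use with bs=4096.
# Integer division rounds down to the start of the containing block.
lba_to_seek() {
    echo $(( $1 * LOGICAL / PHYSICAL ))
}

# Print the first and last 512-byte LBA covered by the 4096-byte block
# that contains the given LBA.
block_lba_range() {
    seek=$(lba_to_seek "$1")
    first=$(( seek * PHYSICAL / LOGICAL ))
    echo "$first $(( first + PHYSICAL / LOGICAL - 1 ))"
}

lba_to_seek 1678762304       # seek value used in the dd command above
lba_to_seek 1678762305       # the other reported LBA, same block
block_lba_range 1678762305   # LBAs sharing that physical sector
```

Writing one bs=4096 block at that seek value therefore covers LBAs 1678762304 through 1678762311 in a single aligned write.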