Must Drive Be Replaced?

TooMuchData · Nov 13, 2022

On TrueNAS Core 13.0-U3 I have a RAIDZ2 pool of 6 x 6 TB WD Reds. zpool status reports no known errors and no devices showing errors. The latest scrub repaired 0B in 09:15:38 with 0 errors on Sun Nov 13 09:15:49 2022 (today). However, one of the drives has failed a SMART long test 3 times, each time failing to read the same LBA with 30% of the test remaining.

I suspect I'll replace this device in the not too distant future. But, is it necessary to do so immediately? It seems that the device is not (yet) affecting the pool, so why not wait until I can buy the replacement at a reduced price (maybe Black Friday)? I would also prefer to first burn-in the replacement.

If the device presents uncorrectable errors to ZFS, won't ZFS mark the device failed and continue without it (at which time I would promptly replace the device)?

Thanks for your thoughts about this.

TooMuchData · Nov 14, 2022

Ran the rest of the long test by using -t select. No further LBA errors. So, I have a device that shows no errors of any kind other than one bad LBA (which is probably not in use). Do you replace that disk? I'm inclined to wait for an error that gets noticed by ZFS.

Would still appreciate other opinions.

Davvo · Nov 14, 2022

Are we talking about Red Pros (CMR) or standard Reds (SMR)?

TooMuchData · Nov 14, 2022

Red Pro and Red Plus are CMR. This happens to be an older WD60EFRX which is also CMR. Why do you ask?

Davvo · Nov 14, 2022

Generally CMR drives are of higher manufacturing quality and tend to have better warranties than the consumer standard.
Given the redundancy level of your pool I would monitor it closely while looking for discounts for a replacement, as you planned.
If it's under warranty check if that's enough for an RMA.

NugentS · Nov 14, 2022

You could try running a badblocks across the drive. This will tend to map out any pending bad sectors. This could achieve one of three things:
1. Completely fail the drive
2. Apparently fix the drive, but it fails again shortly after as the bad blocks expand
3. Fix the drive for a significant period of time.

2 & 3 are most likely, with about a 50/50 split in my experience.

joeschmuck · Nov 14, 2022

Hey buddy, I've seen this before and the drive is failing with a surface defect. Currently you do not have any data stored at the failing LBA which is why a SCRUB passes without issue. A Short test likely jumps over that LBA region. The Long test has failed more than once so it's a failure.

You could try to force the LBAs to be remapped out of the usable space, but typically once you have a surface defect, it will continue to grow.

My advice, you have time to purchase another drive and then replace it. It's not a critical replacement as of now and I definitely advise against trying to make it last longer, it generally does not last long.

If you want to play with running badblocks, I recommend you take the failing LBA, subtract 10,000 and use that as your starting point, then add 10,000 to the failing LBA and use that as your stopping point. If I recall correctly you actually enter the ending LBA first and then the beginning LBA. My little hard drive troubleshooting guide has some information in it about this. I forget what I previously recommended for the LBA beginning and end, maybe it was 1,000.
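The window arithmetic described above can be sketched in a few lines of Python. This is only an illustration, not a tested procedure: it assumes badblocks is given -b 512 so that its block numbers line up with the drive's 512-byte logical LBAs (badblocks defaults to 1024-byte blocks), and the helper names, the /dev/sdX placeholder, and the example LBA (the one reported later in this thread) are hypothetical choices. The script only prints a command line; it does not run anything.

```python
def badblocks_window(failing_lba, margin=10_000, max_lba=None):
    """Return (last_block, first_block), the order badblocks expects."""
    first = max(failing_lba - margin, 0)       # never go below block 0
    last = failing_lba + margin
    if max_lba is not None:
        last = min(last, max_lba)              # never run past the end of the disk
    return last, first


def badblocks_args(device, failing_lba, margin=10_000, write_test=False):
    """Build an argument list; -n is the non-destructive mode, -w destroys data."""
    last, first = badblocks_window(failing_lba, margin)
    mode = "-wsv" if write_test else "-nsv"
    # -b 512 is an assumption: it makes block numbers equal to logical LBAs.
    return ["badblocks", "-b", "512", mode, device, str(last), str(first)]


if __name__ == "__main__":
    # /dev/sdX is a placeholder; 3751514384 is the LBA from the self-test log.
    print(" ".join(badblocks_args("/dev/sdX", 3751514384)))
```

Keeping the non-destructive -n mode as the default means the destructive -w test has to be requested explicitly, which seems the safer choice for a drive that may still hold data.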

TooMuchData said:
If the device presents uncorrectable errors to ZFS, won't ZFS mark the device failed and continue without it (at which time I would promptly replace the device)?

I'm not sure it actually works that way. The hard drive is told to write a block of data, it writes it, then it reads it and compares it. If it fails it may try to write the data a few more times but if it never compares properly then the sector(s) that failed are mapped as bad blocks and the data will be written elsewhere. This will happen blind to you. You will see the errors when the SMART data is checked.

Anyway, good luck with whatever you decide to do.

TooMuchData · Nov 14, 2022

Thanks, you Schmuck you! I'm already prepared to replace the device (have spare burned in and ready), but will play with badblocks just for the learning experience. I'll have to run the variation that writes data, but does not lose any data.

joeschmuck · Nov 14, 2022

Yes, but you should be able to use dd as well to read a block and write it back. I don't think you can specify an LBA, unfortunately, and writing 6TB will take a long time. I could be wrong, though; I just don't know whether such a parameter exists. This is why badblocks is good to use. Good luck.

NugentS · Nov 14, 2022

TooMuchData said:
Thanks, you Schmuck you! I'm already prepared to replace the device (have spare burned in and ready), but will play with badblocks just for the learning experience. I'll have to run the variation that writes data, but does not lose any data.

Opinion: You need to remove the drive from the array and run badblocks on it properly, with the full multi-pattern write test; otherwise you are only running half the test.

TooMuchData · Nov 18, 2022

I removed the device to a Linux system and ran badblocks twice, with -w and -n. But, a SMART long test still fails on the same LBA.

I've replaced the device in the pool and am now running it through chkdsk /r on Win 7. Regardless of results I will use the device in an unimportant mirror until it outright fails. I will continue to run periodic SMART long tests augmented by SMART selective tests that begin on the LBA following the error (just to see how the device holds up). I may eventually stop the long tests as they raise the logged errors count.
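A selective test that resumes on the LBA after the failure can be expressed as two spans, one on each side of the bad sector. The sketch below is an assumption-laden illustration: the last LBA is derived from the 6,001,175,126,016-byte capacity and 512-byte logical sectors that smartctl reports for this drive, the helper names and /dev/sdX are hypothetical, and the script only prints the command. smartctl accepts several -t select,N-M options in one invocation.

```python
CAPACITY_BYTES = 6_001_175_126_016   # "User Capacity" reported by smartctl
LOGICAL_SECTOR = 512                 # logical sector size reported by smartctl
LAST_LBA = CAPACITY_BYTES // LOGICAL_SECTOR - 1


def spans_skipping(bad_lba, last_lba, first_lba=0):
    """Return (start, end) LBA spans that cover the disk except bad_lba."""
    spans = []
    if bad_lba > first_lba:
        spans.append((first_lba, bad_lba - 1))
    if bad_lba < last_lba:
        spans.append((bad_lba + 1, last_lba))
    return spans


def smartctl_select_args(device, spans):
    """Build a smartctl argument list with one -t select option per span."""
    args = ["smartctl"]
    for start, end in spans:
        args += ["-t", f"select,{start}-{end}"]
    args.append(device)
    return args


if __name__ == "__main__":
    spans = spans_skipping(3751514384, LAST_LBA)
    print(" ".join(smartctl_select_args("/dev/sdX", spans)))
```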

Thanks to all who contributed.

joeschmuck · Nov 18, 2022

TooMuchData said:
I removed the device to a Linux system and ran badblocks twice, with -w and -n. But, a SMART long test still fails on the same LBA.

Just so I understand it, you ran badblocks using the -w parameter, something like this badblocks -b 4096 -wsv -c 64 -p 10 /dev/ada0 1244448 1044448 where the 1244448 is the ending LBA and 1044448 is the beginning LBA? Of course you could drop the LBA parameters to do the entire drive. If that is what you did, it's very odd that faulty blocks were not picked up but a SMART Long test still fails at the same LBA. Odd.

TooMuchData · Nov 18, 2022

Not quite. Ran -nsv without -b and without -p starting 5k LBAs before failing one and ending 5k after. Then -wsv. You think the -b and/or -p matters?

joeschmuck · Nov 18, 2022

TooMuchData said:
You think the -b and/or -p matters?

The -p matters so that you can perform multiple passes, since you have defined the LBA range I would set the -p for 100 or maybe 500 just to beat that section up really good. The -b parameter just writes larger blocks which speeds up the process considerably (think Advanced Format and 4K blocks). Also, do you have the device "not mounted" ? It must not be mounted if I recall correctly in order to allow badblocks to actually write to the drive. I don't recall if you get an error message or it moves right along letting you think it's working. I'm not sure if you should use the -f parameter as well as I've never run badblocks on Linux before. I'm sure a Google search on "badblocks linux" will find you some references.

Like I said, very odd that SMART will fail but badblocks does not. Just keep an eye out for Pending Sector Errors and Reallocated Sector Errors going up.

Another way to force it is to write data to those LBAs. You could just write to the entire drive, or maybe run a secure format to write a DoD-level wipe on the drive. It would take considerable time, since I doubt you would be able to specify an LBA range, but that should work too.

Or, just let it be. When data is written to the area it will fail to be verified and after several attempts you will have it mapped out. Several could be quite a few. And as you said, you plan to use it until it no longer works.

TooMuchData · Nov 19, 2022

Win 7 chkdsk /r ran for almost 24 hours. Found no errors on the drive.

I replaced the drive in the pool and moved it to a test SCALE system. It is offline. Here are the recent self-test results:

Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 6  Extended offline    Completed: read failure     30%     42273            3751514384
# 7  Selective offline   Completed without error     00%     42265            -
# 8  Short offline       Completed without error     00%     42262            -
# 9  Selective offline   Completed without error     00%     42249            -
#10  Selective offline   Completed: read failure     10%     42246            3751514384
#11  Short offline       Completed without error     00%     42238            -
#12  Extended offline    Completed: read failure     30%     42226            3751514384
#13  Short offline       Completed without error     00%     42216            -
#14  Short offline       Completed without error     00%     42207            -
#15  Short offline       Completed without error     00%     42183            -
#16  Extended offline    Completed: read failure     30%     42172            3751514384
#17  Short offline       Completed without error     00%     42165            -
#18  Extended offline    Completed: read failure     30%     42162            3751514384

The recent long tests and one selective test all failed at the same LBA, 3751514384.
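Tallying the failing LBA by hand works for a short log, but the same check can be scripted. The sketch below parses the self-test rows quoted above and counts read failures per LBA. It is a rough illustration: the regular expression only handles rows whose status starts with "Completed" (statuses such as aborted or interrupted tests are ignored), and the function name is made up for this example.

```python
import re
from collections import Counter

LOG = """\
# 6 Extended offline Completed: read failure 30% 42273 3751514384
# 7 Selective offline Completed without error 00% 42265 -
# 8 Short offline Completed without error 00% 42262 -
# 9 Selective offline Completed without error 00% 42249 -
#10 Selective offline Completed: read failure 10% 42246 3751514384
#11 Short offline Completed without error 00% 42238 -
#12 Extended offline Completed: read failure 30% 42226 3751514384
#13 Short offline Completed without error 00% 42216 -
#14 Short offline Completed without error 00% 42207 -
#15 Short offline Completed without error 00% 42183 -
#16 Extended offline Completed: read failure 30% 42172 3751514384
#17 Short offline Completed without error 00% 42165 -
#18 Extended offline Completed: read failure 30% 42162 3751514384
"""

# One self-test row: number, test type, status, remaining %, hours, first bad LBA.
ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+(?P<test>.+?)\s+(?P<status>Completed.*?)\s+"
    r"(?P<remaining>\d+)%\s+(?P<hours>\d+)\s+(?P<lba>\d+|-)\s*$"
)


def failing_lbas(log_text):
    """Count read failures per LBA in a smartctl self-test log excerpt."""
    counts = Counter()
    for line in log_text.splitlines():
        match = ROW.match(line.strip())
        if match and "read failure" in match["status"] and match["lba"] != "-":
            counts[int(match["lba"])] += 1
    return dict(counts)


if __name__ == "__main__":
    for lba, count in failing_lbas(LOG).items():
        print(f"LBA {lba}: {count} read failure(s)")
```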

So, I ran "badblocks -b 512 -wsv -c 64 -p 10 /dev/sdb 3760000000 3750000000" and it found 0/0/0 errors.
(I used 512 for the block size because it is the logical sector size of the WD Red 6TB, so LBAs can be passed directly to badblocks.)
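The unit issue matters because badblocks counts in its own block size, not in LBAs: its default is 1024 bytes, and the earlier suggestion in this thread used -b 4096. A small conversion, sketched here under the assumption of 512-byte logical sectors (as smartctl reports for this drive), shows which badblocks block holds a given LBA. The function names are hypothetical.

```python
def lba_to_block(lba, block_size, sector_size=512):
    """Return the badblocks block number that contains the given LBA."""
    if block_size % sector_size:
        raise ValueError("block size must be a multiple of the sector size")
    return (lba * sector_size) // block_size


def physical_sector_lbas(lba, physical=4096, logical=512):
    """Return the first and last LBA of the physical sector holding lba."""
    per_sector = physical // logical
    first = lba - (lba % per_sector)
    return first, first + per_sector - 1


if __name__ == "__main__":
    bad = 3751514384
    for size in (512, 1024, 4096):
        print(f"-b {size}: block {lba_to_block(bad, size)}")
    print("physical sector spans LBAs", physical_sector_lbas(bad))
```

With -b 512 the block number equals the LBA, which is why the range in the command above can be read directly as LBAs. With any other block size, the start and end arguments would need converting first, or the test would cover a different part of the disk.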

Then ran "smartctl -t select,3750000000-3759999999 /dev/sdb"

smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.142+truenas] (local build)

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Selective self-test routine immediately in off-line mode".
SPAN  STARTING_LBA  ENDING_LBA
   0    3750000000  3759999999
Drive command "Execute SMART Selective self-test routine immediately in off-line mode" successful.

Result was no errors.

Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Selective offline   Completed without error     00%     42354            -
# 2  Selective offline   Completed without error     00%     42354            -

So, will now run long test again and report results tomorrow.

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 673 minutes for test to complete.
Test will complete after Sat Nov 19 22:37:44 2022 EST

Here are recent SMART data:

root@truenas3[~]# smartctl -a /dev/sdb
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.142+truenas] (local build)

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD60EFRX-68L0BN1
Serial Number:    WD-WX11D66510H9
LU WWN Device Id: 5 0014ee 20dcae5f2
Firmware Version: 82.00A82
User Capacity:    6,001,175,126,016 bytes [6.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5700 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat Nov 19 10:56:15 2022 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME           FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED  RAW_VALUE
  1 Raw_Read_Error_Rate      0x002f  200   200   051    Pre-fail  Always   -            0
  3 Spin_Up_Time             0x0027  231   196   021    Pre-fail  Always   -            7450
  4 Start_Stop_Count         0x0032  100   100   000    Old_age   Always   -            306
  5 Reallocated_Sector_Ct    0x0033  200   200   140    Pre-fail  Always   -            0
  7 Seek_Error_Rate          0x002e  100   253   000    Old_age   Always   -            0
  9 Power_On_Hours           0x0032  042   042   000    Old_age   Always   -            42354
 10 Spin_Retry_Count         0x0032  100   100   000    Old_age   Always   -            0
 11 Calibration_Retry_Count  0x0032  100   100   000    Old_age   Always   -            0
 12 Power_Cycle_Count        0x0032  100   100   000    Old_age   Always   -            304
192 Power-Off_Retract_Count  0x0032  200   200   000    Old_age   Always   -            300
193 Load_Cycle_Count         0x0032  200   200   000    Old_age   Always   -            2810
194 Temperature_Celsius      0x0022  120   112   000    Old_age   Always   -            32
196 Reallocated_Event_Count  0x0032  200   200   000    Old_age   Always   -
197 Current_Pending_Sector   0x0032  200   200   000    Old_age   Always   -            0
198 Offline_Uncorrectable    0x0030  100   253   000    Old_age   Offline  -            0
199 UDMA_CRC_Error_Count     0x0032  200   200   000    Old_age   Always   -            0
200 Multi_Zone_Error_Rate    0x0008  200   200   000    Old_age   Offline  -            1

joeschmuck · Nov 19, 2022

So this drive now has reported a MultiZone Error where a few months ago (at 40210 hours) it didn't have that error. This is why I like statistics.
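Spotting a change like this is easier when two snapshots of the raw values are compared mechanically. The sketch below is a minimal illustration, not a monitoring tool. The "current" values come from the smartctl output quoted in this thread; the "earlier" snapshot is an assumption (all watched attributes at 0), since the thread only states that the Multi_Zone error was absent at 40210 hours. The function name and the choice of watched attributes are likewise illustrative.

```python
# Attributes commonly watched for surface problems (an illustrative selection).
WATCHED = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
    200: "Multi_Zone_Error_Rate",
}


def increased(before, after, watched=WATCHED):
    """Return {name: (old, new)} for watched raw values that went up."""
    changes = {}
    for attr_id, name in watched.items():
        old = before.get(attr_id, 0)
        new = after.get(attr_id, 0)
        if new > old:
            changes[name] = (old, new)
    return changes


if __name__ == "__main__":
    earlier = {5: 0, 197: 0, 198: 0, 200: 0}   # ASSUMED values at ~40210 hours
    current = {5: 0, 197: 0, 198: 0, 200: 1}   # raw values at 42354 hours
    for name, (old, new) in increased(earlier, current).items():
        print(f"{name}: {old} -> {new}")
```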

I say run it until you get a few more errors and then replace it.

As for why the testing passes at times and fails at others, well, I suspect you are polishing the platter area. Eventually it will be a solid fail.

If you are just goofing around with the drive, then I'd say test it with a larger number of LBAs. I can't explain why it generally passes when you select specific LBAs, so do a larger chunk and see how that plays out. Change the starting LBA to 3000000000. Why? Maybe the stepping across the platter is the weak point, so put the heads further away and have them step into the field. That is the only thing at the moment which makes sense to me.

The Schmuck !

TooMuchData · Nov 19, 2022

Once again the long test failed:
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 30% 42362 3751514384

So, no other mechanism can detect an error or problem, but a standard long test will always fail on the particular LBA. Selective self tests that cross the supposedly failing LBA do not fail. Badblocks over the same LBA multiple times does not result in correction. CHKDSK finds nothing. You tell me!

I will use the device in a non-essential mirror until it produces visible smoke.

Thanks for all comments.

NASbox · Nov 21, 2022

NugentS said:
You could try running a badblocks across the drive. This will tend to map out any pending bad sectors. This could achieve one of three things:
1. Completely fail the drive
2. Apparently fix the drive, but it fails again shortly after as the bad blocks expand
3. Fix the drive for a significant period of time.

2 & 3 are most likely, with about a 50/50 split in my experience.

I second this approach. I had a drive that was a bit flaky (but in my case the SMART long test was error free). I temporarily swapped the drive out, ran badblocks on the drive that was causing errors, and then resilvered. The drive has been error free for 3 months so far.

The standard 4-pass badblocks gives a drive a good run and you will see how it performs. If there are more than a few errors, scrap the drive.

Good luck
