Scrub reveals repairs two scrubs in a row

Stux · Jan 28, 2017

Robert Trevellyan said:
@SweetAndLow nailed it.

To give an idea how critical the top cover is in a rack mount case like this:

https://forums.freenas.org/index.ph...o-x10-sri-f-xeon-e5-1650v4.46262/#post-315996

trsupernothing · Jan 28, 2017

With the fans set to heavy IO and the top on the case, idle drive temperature has dropped 20°C on two of the troublesome drives. I'll be starting a scrub in a few days to test for read errors.

trsupernothing · Jan 30, 2017

43% through a scrub. A drive currently at 25°C just threw a read error. Either heat wasn't the cause of the read errors, or I permanently damaged the drives with excess heat and they'll always throw read errors no matter what. What are people's bets? Haha

SweetAndLow · Jan 30, 2017

If it were really damaged it would fail a SMART test. SMART tests are not perfect though.

trsupernothing · Jan 30, 2017

So far, another read error on a second drive. Both are the same drives that errored in the last scrub: da35 and da36. da36 just gave the error and is at 23°C. If this scrub goes the same as the last, the same third drive will error soon. I have two unused Reds I can swap in one at a time to replace two of these three drives and scrub again.

SweetAndLow · Jan 30, 2017

trsupernothing said:
So far, another read error on a second drive. Both are the same drives that errored in the last scrub: da35 and da36. da36 just gave the error and is at 23°C. If this scrub goes the same as the last, the same third drive will error soon. I have two unused Reds I can swap in one at a time to replace two of these three drives and scrub again.

That seems like a reasonable next troubleshooting step. Do you have enough slots to do an in-place replacement? This way you don't have to put your pool into a degraded state.

trsupernothing · Jan 30, 2017

SweetAndLow said:
That seems like a reasonable next troubleshooting step. Do you have enough slots to do an in-place replacement? This way you don't have to put your pool into a degraded state.

I have 8 open slots on that chassis, so I can add a drive. I've been offlining drives and replacing them to swap. Is an in-place replacement different?

And now with this new scrub, we have a new contender... we're now at 4 drives repairing. da38 and da7 have joined the party. I'm so lost. I spent the money and did this build right. I'm really trying. I'm trying to do this right. 4 drives with read errors this scrub.

trsupernothing · Jan 30, 2017

And they're all in the same vdev. I have 5 vdevs of 8 drives apiece. This is the only vdev fighting me.

trsupernothing · Jan 30, 2017

I have another chassis as well, the same model Supermicro. It's still in the box... I'm going to open it up and move the troublesome vdev over to it... it will be all alone. I'm going to use BRAND NEW SAS cables for it... and I'll even add a new M1015 solely for this new box / troublesome vdev.

At this point it will be new case, new PSUs, new SAS cables, new backplane, new RAID card, new everything. If I've still got read errors at that point, I'll just stare blankly at the cosmos.

Stux · Jan 30, 2017

Is your vdev the sole vdev in its pool?

If so, have you tried recreating the vdev?

Sounds like you really have done everything right :-/

How are the drives powered?

trsupernothing · Jan 30, 2017

Stux said:
Is your vdev the sole vdev in its pool?

If so, have you tried recreating the vdev?

Sounds like you really have done everything right :-/

How are the drives powered?

The problem vdev is part of the only pool I run. 1 pool, 5 vdevs of 8. Dual PSUs in the Supermicro case: 1200 watt PSUs attached to the stock backplane.

trsupernothing · Jan 30, 2017

846E16-R1200B cases

Robert Trevellyan · Jan 31, 2017

trsupernothing said:
The problem vdev is part of the only pool I run

Then unfortunately you can't move it out of the system without destroying the pool.

trsupernothing · Jan 31, 2017

Robert Trevellyan said:
Then unfortunately you can't move it out of the system without destroying the pool.

I simply meant physically isolating the vdev on its own chassis whilst still functioning as a vdev of the pool

Stux · Jan 31, 2017

After all these hardware shenanigans, is it possible the issue is actually with the vdev rather than the devices?

trsupernothing · Jan 31, 2017

Stux said:
After all these hardware shenanigans, is it possible the issue is actually with the vdev rather than the devices?

I don't know how we can figure that out. I've thought it possible there was "bad data" or some fragmented something or other causing this issue for some time now. I've swapped so many drives and cases and cables trying to track this issue down. Each and every scrub there are read errors on 2-4 drives and on average 800k bits repaired during the scrub.

Robert Trevellyan · Jan 31, 2017

It is theoretically possible that some drives were permanently damaged by overheating. Can you tell if the drives reporting errors were worst-case for heat? Other than looking at their physical situation, you could run smartctl -x /dev/daN | grep Lifetime on each drive, where daN is that drive's device name.
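
A minimal sketch of running that check across several drives at once. smartctl takes a device argument, so the loop supplies one per drive. The helper names lifetime_max and report_lifetime_max are hypothetical, the /dev/daN paths are an assumption based on the device names used in this thread, and the parsing assumes the drive reports an SCT "Lifetime Min/Max Temperature" line, which not every drive does.

```shell
#!/bin/sh
# Read `smartctl -x` output on stdin and print the lifetime maximum temperature.
# Assumes an SCT status line of the form (values here are illustrative):
#   Lifetime    Min/Max Temperature:     18/46 Celsius
lifetime_max() {
    awk '/^Lifetime/ && /Min\/Max Temperature/ {
        split($(NF - 1), t, "/")   # "18/46" -> t[1] = min, t[2] = max
        print t[2]
    }'
}

# Print one "device: max" line per drive; device names are passed as arguments.
report_lifetime_max() {
    for dev in "$@"; do
        max=$(smartctl -x "/dev/$dev" | lifetime_max)
        echo "$dev: ${max:-not reported}"
    done
}
```

Run as root, something like report_lifetime_max da7 da35 da36 da38 would cover the four drives named so far, making it easy to compare the erroring drives against the healthy ones.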

trsupernothing · Jan 31, 2017

Robert Trevellyan said:
smartctl -x /dev/daN | grep Lifetime

That's a cool command. The lifetime high temp of da36 is 46°C.

Alright, K-Mart shoppers, I have just replaced da35 with a brand-new, had-to-rip-open-the-static-bag Western Digital Red. da35 has had read errors the past 3 scrubs. If the replacement drive has read errors on the next scrub, then the drives themselves are not the issue.

trsupernothing · Jan 31, 2017

Another troublesome drive has a lifetime high temp of 52°C.

trsupernothing · Jan 31, 2017

Here is a pic of my spreadsheet documenting my read errors over time. Each row is a drive in the vdev.
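
For anyone wanting to keep a similar log, here is a rough sketch of pulling the per-drive read-error counts out of zpool status after each scrub. It is not the method used for the spreadsheet above. The function name drives_with_read_errors is hypothetical, and the parsing assumes the usual NAME / STATE / READ / WRITE / CKSUM config table; on FreeNAS the NAME column typically shows gptid labels rather than daN names.

```shell
#!/bin/sh
# Read `zpool status` output on stdin and print "device read_count" for every
# row of the config table whose READ column is not 0.
drives_with_read_errors() {
    awk '
        # Header row of the config table: start collecting after it.
        $1 == "NAME" && $3 == "READ" { in_config = 1; next }
        # A blank line ends the table (the "errors:" summary follows it).
        NF == 0 { in_config = 0 }
        # Rows may carry a trailing note, so allow more than five fields.
        in_config && NF >= 5 && $3 != "0" { print $1, $3 }
    '
}
```

Usage would look like zpool status tank | drives_with_read_errors, where tank is a placeholder pool name. Appending the output to a dated file after every scrub gives one row per erroring drive per scrub.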
