Scrub reveals repairs two scrubs in a row

Status
Not open for further replies.

trsupernothing

Explorer
Joined
Sep 5, 2013
Messages
65
With the fans set to heavy IO and the top on the case, idle drive temperature has dropped 20°C on two of the troublesome drives. I'll be starting a scrub in a few days to test for read errors.
 

trsupernothing

43% through a scrub. A drive currently at 25°C just threw a read error. Either heat wasn't the cause of the read errors, or I permanently damaged the drives with excess heat and they'll always throw read errors now, no matter what. What are people's bets? haha
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
If it was really damaged it would fail a SMART test. SMART tests are not perfect, though.

 

trsupernothing

So far, another read error on a second drive. Both are the same drives that errored in the last scrub: da35 and da36. da36 just gave the error and is at 23°C. If this scrub goes the same as the last, the same third drive will error soon. I have 2 unused Reds I can swap in one at a time to replace 2 of these 3 drives and scrub again.
 

SweetAndLow

That seems like a reasonable next troubleshooting step. Do you have enough slots to do an in-place replacement? That way you don't have to put your pool into a degraded state.
 

trsupernothing

I have 8 open slots on that chassis, so I can add a drive. I've been offlining drives and replacing them to swap. Is an in-place replacement different?

And now with this new scrub, we have a new contender... we're now at 4 drives repairing. da38 and da7 have joined the party. I'm so lost. I spent the money and did this build right. I'm really trying. I'm trying to do this right. 4 drives with read errors this scrub.
 

trsupernothing

I have another chassis as well, the same model Supermicro. It's still in the box... I'm going to open it up and move the troublesome vdev over to it... it will be all alone. I'm going to run BRAND NEW SAS cables to it... and I'll even add a new M1015 solely for this new box / troublesome vdev.

At this point it will be new case, new PSUs, new SAS cables, new backplane, new RAID card, new everything. If I've still got read errors at that point, I'll just stare blankly at the cosmos.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Is your vdev the sole vdev in its pool?

If so, have you tried recreating the vdev?

Sounds like you really have done everything right :-/

How are the drives powered?
 

trsupernothing

The problem vdev is part of the only pool I run: 1 pool, 5 vdevs of 8 drives. Dual 1200 watt PSUs in the Supermicro case, attached to the stock backplane.
 

Stux

After all these hardware shenanigans, is it possible the issue is actually with the vdev rather than the devices?
 

trsupernothing

I don't know how we can figure that out. I've thought it possible there was "bad data" or some fragmented something or other causing this issue for some time now. I've swapped so many drives and cases and cables trying to track this issue down. Each and every scrub there are read errors on 2-4 drives and on average 800k bits repaired during the scrub.
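
One way to track which drives show read errors after each scrub is to pull the READ column out of saved `zpool status` text. The following is a minimal sketch, not a definitive tool: it assumes the usual `NAME STATE READ WRITE CKSUM` table layout, the function name `read_errors` is made up for illustration, and the sample text is invented rather than taken from this pool.

```python
from typing import Dict

def read_errors(zpool_status: str) -> Dict[str, int]:
    """Map device name -> READ error count for rows with a nonzero count."""
    errors: Dict[str, int] = {}
    in_config = False
    for line in zpool_status.splitlines():
        fields = line.split()
        # The device table starts right after this header row.
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_config = True
            continue
        if not in_config or len(fields) < 5:
            continue
        name, read = fields[0], fields[2]
        # Abbreviated counts such as "1.2K" are skipped in this sketch.
        if read.isdigit() and int(read) > 0:
            errors[name] = int(read)
    return errors

if __name__ == "__main__":
    # Hypothetical output for illustration only.
    sample = (
        "config:\n\n"
        "\tNAME        STATE     READ WRITE CKSUM\n"
        "\ttank        ONLINE       0     0     0\n"
        "\t  raidz2-4  ONLINE       0     0     0\n"
        "\t    da35    ONLINE       3     0     0\n"
        "\t    da36    ONLINE       0     0     0\n\n"
        "errors: No known data errors\n"
    )
    print(read_errors(sample))
```

On systems that list pool members by gptid rather than daN, the keys will be those labels instead.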
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
It is theoretically possible that some drives were permanently damaged by overheating. Can you tell if the drives reporting errors were worst-case for heat? Other than looking at their physical situation, you could run smartctl -x /dev/daN | grep Lifetime on each drive (substituting the device name).
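
To compare lifetime peak temperatures across many drives, the grep above can be wrapped in a few lines of Python. This is a sketch only: it assumes the `smartctl -x` output contains an SCT status line of the form `Lifetime    Min/Max Temperature:     18/46 Celsius`, which not every drive reports, and the name `lifetime_max_temp` is hypothetical.

```python
import re
from typing import Optional

# Matches the SCT status line, e.g. "Lifetime    Min/Max Temperature:     18/46 Celsius".
# Assumption: the drive reports SCT status in this layout; some drives do not.
_LIFETIME_RE = re.compile(
    r"Lifetime\s+Min/Max Temperature:\s*(-?\d+)/(-?\d+)\s+Celsius"
)

def lifetime_max_temp(smartctl_output: str) -> Optional[int]:
    """Return the lifetime maximum temperature in Celsius, or None if absent."""
    match = _LIFETIME_RE.search(smartctl_output)
    return int(match.group(2)) if match else None

if __name__ == "__main__":
    # Illustrative line, not captured from a real drive.
    sample = "Lifetime    Min/Max Temperature:     18/46 Celsius"
    print(lifetime_max_temp(sample))
```

The text passed in would be the captured output of smartctl -x for one device; sorting drives by the returned value shows which ones ran hottest.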
 

trsupernothing

That's a cool command. Lifetime high temp of da36 is 46°C.

Alright k-mart shoppers, I have just replaced da35 with a brand new, had-to-rip-open-the-static-bag Western Digital Red. da35 has had read errors the past 3 scrubs. If the replacement drive has read errors on the next scrub, we can rule out the drives as the issue.
 

trsupernothing

Here is a pic of my spreadsheet documenting my read errors over time. Each row is a drive in the vdev.

H2AL3un.jpg
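
The same per-drive record can be kept in a few lines of Python to see which drives error in more than one scrub. This is a minimal sketch under stated assumptions: the function name `repeat_offenders` is made up, and the history below is hypothetical, not the data from the spreadsheet.

```python
from collections import Counter
from typing import Dict, List, Set

def repeat_offenders(scrubs: List[Set[str]], min_scrubs: int = 2) -> Dict[str, int]:
    """Count how many scrubs each drive errored in; keep drives at or above min_scrubs."""
    counts = Counter(drive for scrub in scrubs for drive in scrub)
    return {drive: n for drive, n in counts.items() if n >= min_scrubs}

if __name__ == "__main__":
    # Hypothetical history, oldest scrub first; each set holds the drives
    # that reported read errors during that scrub.
    history = [{"da35", "da36"}, {"da35", "da36", "da7", "da38"}]
    print(sorted(repeat_offenders(history).items()))
```

A drive slot that keeps reappearing after its disk has been swapped points away from the disk itself and toward the slot, cabling, or controller.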
 