Scrub reveals repairs two scrubs in a row

Status
Not open for further replies.

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Also, I assume you're tracking your drives by serial number or gptid rather than by device name, since device names may change between reboots.

I.e., da36 might become da34 after a reboot.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
da device names are not guaranteed to be the same on every boot. Serial numbers and gptids are the most reliable identifiers.
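To make the point concrete, here is a minimal Python sketch that maps gptid labels to whatever da partition they currently sit on. It assumes the three-column layout (Name, Status, Components) that `glabel status` prints on FreeBSD; the sample text and its gptid values are made up for illustration, and `parse_glabel_status` is a hypothetical helper name.

```python
def parse_glabel_status(text):
    """Map each label (e.g. gptid/...) to the partition it currently sits on."""
    mapping = {}
    for line in text.splitlines():
        fields = line.split()
        # Expect three columns (Name, Status, Components); skip the header row.
        if len(fields) != 3 or fields[0] == "Name":
            continue
        name, _status, component = fields
        mapping[name] = component
    return mapping


# Hypothetical sample output; these gptid values are invented for the example.
SAMPLE = """\
                                      Name  Status  Components
gptid/11111111-aaaa-11e6-bbbb-0cc47a000001     N/A  da34p2
gptid/22222222-aaaa-11e6-bbbb-0cc47a000001     N/A  da35p2
"""

labels = parse_glabel_status(SAMPLE)
for label, partition in sorted(labels.items()):
    print(label, "->", partition)
```

Re-running something like this after each reboot would show which da name a given gptid landed on, so notes keyed by gptid or serial stay valid even when the da numbers shuffle.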

 

trsupernothing

Explorer
Joined
Sep 5, 2013
Messages
65
Thanks guys, I am in fact using drive serials for my spreadsheet as well as my troubleshooting attempts. I've been lazily referring to the da#'s here for quicker "shop talk". Good looking out though.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Awesome! Well, I'm out of ideas, so I hope replacing them does something. Keep tracking the failures, though; that will probably come in handy.

 

trsupernothing

Explorer
Joined
Sep 5, 2013
Messages
65
We'll know soon enough if swapping out that one drive made any difference. If it immediately has issues, like I said, all I can do at that point is break this vdev out to its own separate chassis, which will be new EVERYTHING: cables, cards, backplanes, PSUs, etc. If it still has read errors at that point... I guess I can try replacing the CPU and/or the memory (which has been memtested to no end), but I'll just be speechless at that point.
 

trsupernothing

Explorer
Joined
Sep 5, 2013
Messages
65
This is what I get for purchasing the appropriate hardware and spending thousands on this build. If I had put it on an old gen-1 i5 processor with non-ECC RAM in a desktop case, somehow I'd have zero issues. Haha.
 

trsupernothing

Explorer
Joined
Sep 5, 2013
Messages
65
Just an update on this thread. It has been over a month since I posted here and I am still chasing down this issue. At this point I'm more convinced it may be some stuck or corrupt data (a problem with the vdev) than hardware. For 3 scrubs in a row just after my last post here, only 2 drives threw read errors, and it was consistently the same 2 drives in that vdev. I decided to try a Seagate IronWolf as a replacement drive, mostly because I'm sick of swapping out WD Reds (I have now pulled enough WD Reds to make a whole other vdev of 8 and add it in the future). I replaced one of the two consistently failing drives with the Seagate, and then suddenly I was seeing read errors on only the one remaining drive. I thought to myself... could it be? For 2 scrubs in a row after the first Seagate replacement, read errors remained on only one of the WD Reds. I replaced that second drive with a second IronWolf and hoped I had finally solved this issue. A 10-hour resilver later, a scrub reveals... NOPE, STILL HAVING READ ERRORS...

I woke up this morning to 2 drives throwing read errors in this vdev. One of the drives hadn't had a read error show up on my chart since December 3rd, 2016, and the other hadn't had a read error in over a month. This is after a dozen forced scrubs (for troubleshooting). I replace the 2 drives that consistently gave read errors with Seagate IronWolfs, and now suddenly 2 other drives that have been quiet for months decide to start speaking up?!

In my absence from this thread I also doubled my RAM to 64 GB of ECC and left the chassis fans maxed for over a month during these troubleshooting tests. I have been moving data off of the box in an attempt to clear out potentially bad data, but snapshots are keeping the data around for 3 months. I've contemplated force-deleting the snapshots, but I wasn't sure if it was safe to delete auto-generated snapshots that are on a schedule, so I figured I would just keep troubleshooting and wait for the natural deletion cycle.
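For anyone keeping a similar chart, a rough Python sketch of the bookkeeping is below. It counts, per drive serial and per physical slot, how many scrubs showed read errors, which helps tell whether errors follow a drive or stay with a position. The log entries, serials, and bay names are entirely hypothetical, and tracking the slot at all is an assumption on top of what is described above.

```python
from collections import defaultdict


def summarize(observations):
    """Count scrubs with read errors per serial and per slot.

    observations: iterable of (scrub_date, serial, slot, read_errors).
    """
    by_serial = defaultdict(int)
    by_slot = defaultdict(int)
    for _date, serial, slot, errors in observations:
        if errors > 0:
            by_serial[serial] += 1
            by_slot[slot] += 1
    return dict(by_serial), dict(by_slot)


# Hypothetical log: SERIAL-A is swapped for SERIAL-C in bay3 before the last scrub.
LOG = [
    ("2017-01-08", "SERIAL-A", "bay3", 2),
    ("2017-01-08", "SERIAL-B", "bay5", 0),
    ("2017-01-15", "SERIAL-A", "bay3", 1),
    ("2017-01-15", "SERIAL-B", "bay5", 0),
    ("2017-01-22", "SERIAL-C", "bay3", 4),
    ("2017-01-22", "SERIAL-B", "bay5", 0),
]

by_serial, by_slot = summarize(LOG)
print("scrubs with errors per serial:", by_serial)
print("scrubs with errors per slot:  ", by_slot)
```

In this made-up log the errors stay with bay3 across a drive swap, which would point away from the drives themselves. If the counts instead jump between serials and slots from scrub to scrub, as described in this post, neither view singles out a culprit.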
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
I think it's time to restore from backup.
 

trsupernothing

Explorer
Joined
Sep 5, 2013
Messages
65
Has SuperMicro been contacted to see if they have ideas?
I'm on the second mobo (as part of part-swapping tests). I'm more convinced than ever it's a corrupt file (or files)... the past 4 scrubs in a row have each revealed 216K repaired. It's the same amount of repairs, just on various drives... I'm in the process of moving as many terabytes of files off of the FreeNAS box as possible. The repair figure used to be 800 plus and has been going down the more files I move. I'm going to pull most of my data off and see if the scrubs complete successfully, then perhaps move data back slowly to identify what may be causing this.
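To follow the repaired amount from scrub to scrub without copying it by hand, something like the Python sketch below could pull the figure out of saved `zpool status` output. It assumes a scan line of the form "scrub repaired 216K in ... with 0 errors on ...", which may differ between ZFS versions; the sample line, including its duration and date, is hypothetical.

```python
import re

# Assumed wording of the scan line; adjust the pattern if your version differs.
SCAN_RE = re.compile(r"scrub repaired (\S+) in \S+ with (\d+) errors")
UNITS = {"B": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def to_bytes(size):
    """Convert a size such as '216K', '1.5M' or '0' to a byte count."""
    suffix = size[-1].upper()
    if suffix in UNITS:
        return int(float(size[:-1]) * UNITS[suffix])
    return int(float(size))


def repaired_bytes(status_text):
    """Return the repaired byte count from the scan line, or None if absent."""
    match = SCAN_RE.search(status_text)
    return to_bytes(match.group(1)) if match else None


# Hypothetical scan line; only the 216K figure comes from this thread.
SAMPLE = "  scan: scrub repaired 216K in 10h2m with 0 errors on Sun Feb 12 10:02:11 2017"

print(repaired_bytes(SAMPLE))
```

Logging this number after every scrub next to the amount of data moved off the pool would show whether the downward trend is real or just noise.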
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
Since you have read errors, I'd say it is not the data. It is the system.

Anything strange about your power? How about the network cables or any other cables connected to this system?
 

trsupernothing

Explorer
Joined
Sep 5, 2013
Messages
65
This is why I'm so frustrated... the system isn't even in its original form in any shape after all of my troubleshooting. I've changed chassis, mobos, cables, backplanes, PSUs, HBA cards, hard drives, etc. I could have built 3 systems out of the parts I've changed out whilst troubleshooting. I've had read errors on this vdev for months across dozens of part swaps.
 

trsupernothing

Explorer
Joined
Sep 5, 2013
Messages
65
Well, I just had my first read-error-free scrub since September 2016... Repaired zero bytes, scrub completed successfully. Only 2 things have changed since my last error-filled scrub. 1. I have moved a ton of data off of the FreeNAS, although it's nearly impossible to tell whether I actually pulled data off of that vdev, since there are 5 vdevs with over 100 terabytes of data total and I only moved 15 TB or so. 2. I don't have my newly replaced drives set to run SMART tests, long or short. Running a long test during a scrub wouldn't cause read errors, would it?
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
Device Model: WDC WD60EFRX-68L0BN1

I think we have the same problem. After replacing almost everything in my machine, it turns out that the drive with the worst behaviour is of type WD60EFRX-68L0BN1, just like yours. Other disks in my zpool, of type WD60EFRX-68MYMN1, do not exhibit the same behaviour (throwing SCSI errors).
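A quick way to check whether errors cluster on one model revision is to group the affected drives by their reported model string. The Python sketch below assumes `smartctl -i` style text containing "Device Model:" and "Serial Number:" lines; the serial numbers and the set of drives flagged as erroring are placeholders, not real data.

```python
from collections import Counter


def parse_identity(text):
    """Pull the model and serial number out of `smartctl -i` style output."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key in ("Device Model", "Serial Number"):
            info[key] = value.strip()
    return info


def errors_by_model(reports, serials_with_errors):
    """Count erroring drives per model string."""
    counts = Counter()
    for text in reports:
        info = parse_identity(text)
        if info.get("Serial Number") in serials_with_errors:
            counts[info.get("Device Model", "unknown")] += 1
    return dict(counts)


# Placeholder reports: model strings are from this thread, serials are invented.
REPORTS = [
    "Device Model:     WDC WD60EFRX-68L0BN1\nSerial Number:    EXAMPLE-0001\n",
    "Device Model:     WDC WD60EFRX-68MYMN1\nSerial Number:    EXAMPLE-0002\n",
    "Device Model:     WDC WD60EFRX-68L0BN1\nSerial Number:    EXAMPLE-0003\n",
]

result = errors_by_model(REPORTS, {"EXAMPLE-0001", "EXAMPLE-0003"})
print(result)
```

If every erroring drive ends up under the 68L0BN1 revision while the 68MYMN1 drives stay clean, that would support the idea that the revision matters; a mixed result would argue against it.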
 

trsupernothing

Explorer
Joined
Sep 5, 2013
Messages
65
Are you by chance storing recorded TV files? WTV?
 

trsupernothing

Explorer
Joined
Sep 5, 2013
Messages
65
No, I do not store movies.

What types of disks do you have? Please make a list.

Do you see errors on disk not of type WD60EFRX-68L0BN1?
I wasn't referring to movies at all. Recorded TV / WTV files are Windows Media Center recorded TV files. I only ask because once I removed the bulk of these files from my server (as I feared one or more may have been corrupt), the SCSI errors disappeared from the server forever.
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
Please answer this question: "Do you see errors on disk NOT of type WD60EFRX-68L0BN1?". It would mean a lot to me.

Thanks,
Tobias
 