Resilvering to an UNAVAIL disk? What's going on?

diskdiddler · Mar 24, 2016

Robert Trevellyan said:
ZFS doesn't have a database of every disk ever made that it consults to figure out the size of the drive; it just asks the drive what size it is. If drives lied about their size, everything would go to hell ... which some people have experienced with counterfeit USB sticks.

I agree with the logic entirely - but I did get h2testw to spit out this error on my replacement Toshiba hard disk with the new firmware...

Media has filled up earlier than expected!
In the beginning there were 4769030 MByte free, but only
4769023 MByte could be written.
Warning: Only 4769030 of 4769305 MByte tested.
Writing speed: 158 MByte/s
H2testw v1.4

Yes, I ran h2testw on a 5TB hard disk - and I'm gonna quote that line again. Media has filled up earlier than expected!

Robert Trevellyan · Mar 24, 2016

diskdiddler said:
Does WD define 5TB as exactly the same amount of space as Toshiba?

The thing to look for is the LBA count on the label on the drive. If it's the same or larger, it'll work. It's been quite a while since I saw different LBA counts for drives of nominally equal size.

Robert Trevellyan · Mar 24, 2016

diskdiddler said:
Media has filled up earlier than expected!

When I search for that online, I get one other result, also related to h2testw. I blame the software.

diskdiddler · Mar 24, 2016

No, the software works fine for flash drives and my other Toshiba drive which has had bad sectors already marked.

danb35 · Mar 25, 2016

Well, there are two possibilities: the drive is lying about its size, or there's a bug in the software. The former would break an awful lot of stuff, so I think the latter is more likely.

diskdiddler · Mar 25, 2016

Well, we'll know soon: 3h21m to go, and soon the nightmare will be over. (I still feel I then need to do a scrub, due to disk 0.)
Then in about 5 or 6 days, when I get my replacement disks, re-add disk 6 and do the same again (resilver, scrub),
then finally, when that's done, replace disk #0.

In theory in about 10 or 12 days, all 6 disks should be 100% ......

danb35 · Mar 25, 2016

diskdiddler said:
(I still feel I then need to do a scrub, due to disk 0)

Unnecessary--the resilver necessarily reads everything in the vdev, so if there are any data errors on disk 0, they'll be found.

diskdiddler · Mar 25, 2016

Oh lovely, thanks - well that will save some time, it's certainly not a short process - 25 hours as it stands.
I'll update the thread when it's all over with I guess - currently, 1 file lost (un-important) .... phew.

rs225 · Mar 25, 2016

The drive makers set standards for how many LBA in a given capacity, so equal sizes are interchangeable between makes. This was not always the case.

Because of that, I think ZFS does allow for a variation in size for a replace, on the order of megabytes.

I also think the software is buggy on your HD test.

Robert Trevellyan · Mar 25, 2016

rs225 said:
drive makers set standards for how many LBA in a given capacity

They do? That would explain a lot, and be very good news. Do you have a reference?

diskdiddler · Mar 25, 2016

Well finally a disk has worked, the replacement server was worth it. It's up and running.
So now to very, very sheepishly (sigh) replace 2 more disks and then re-add the 6th, and when that's finally done, replace the 1st (0).
I'm still not safe, I have "4.5" disks right now, but it's better than the 3.5 I had previously.

Still super blown away at disks getting perm SMART bad sectors from a bloody power supply, BLOWN AWAY
Entirely destroyed disk? Sure
Read surface error, ok fine
Actual SMART level, diagnosed bad blocks? what?!
(It could've been the motherboard too)

Just to clarify, I did some googling - but to my knowledge NOTHING will do a 100% total scan of the disk and un-mark bad sectors if they prove right, correct?

Bidule0hm · Mar 26, 2016

diskdiddler said:
Still super blown away at disks getting perm SMART bad sectors from a bloody power supply, BLOWN AWAY
Entirely destroyed disk? Sure
Read surface error, ok fine
Actual SMART level, diagnosed bad blocks? what?!

Well, bad power can lead to bad writes, which can lead to bad reads when you try to get the data back, and a bad read means the sector is added to the pending sector list. If the drive can then read it by retrying, the sector is healthy again; if it can't (the more likely option here, because the write was bad in the first place), it's reallocated. That's how you get reallocated sectors because of a crappy PSU. There are probably other ways to get there because of the PSU too, but this is the one I think is most likely to happen :)

Bad power can also lead to the motor controller failure, CRC errors, ...
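
The pending-then-reallocated path described above can be sketched as a toy model. This only illustrates the logic in this post; the function name and state labels are made up for the example, and real drive firmware is proprietary and more involved.

```python
def sector_fate(first_read_ok: bool, retry_ok: bool) -> str:
    """Toy model of one sector's fate after a possibly bad write.

    Hypothetical helper for illustration only, not real firmware logic.
    """
    if first_read_ok:
        return "healthy"      # the read worked, nothing was ever flagged
    # A failed read puts the sector on the pending list.
    if retry_ok:
        return "healthy"      # a retry succeeded, so it leaves the pending list
    return "reallocated"      # still unreadable, so it is remapped to a spare


# A bad write that never reads back ends up as a reallocated sector.
print(sector_fate(first_read_ok=False, retry_ok=False))
```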

diskdiddler said:
Just to clarify, I did some googling - but to my knowledge NOTHING will do a 100% total scan of the disk and un-mark bad sectors if they prove right, correct?

Yep, unless you're the manufacturer once a sector is remapped it's for life.

pschatz100 · Mar 27, 2016

diskdiddler said:
Well finally a disk has worked, the replacement server was worth it. It's up and running.
So now to very, very sheepishly (sigh) replace 2 more disks and then re-add the 6th, and when that's finally done, replace the 1st (0).
I'm still not safe, I have "4.5" disks right now, but it's better than the 3.5 I had previously.

Still super blown away at disks getting perm SMART bad sectors from a bloody power supply, BLOWN AWAY
Entirely destroyed disk? Sure
Read surface error, ok fine
Actual SMART level, diagnosed bad blocks? what?!
(It could've been the motherboard too)

If your disks were running hot due to inadequate cooling, it is entirely possible that other components were running hot as well. Power supplies can run hot, even with decent cooling, and motherboard components can be damaged by excessive heat without failing entirely. A couple of years ago, I had a power supply that tested OK with testing equipment, but had intermittent failures when installed. It caused all sorts of untraceable headaches and was only discovered via trial and error after replacing the cpu and motherboard failed to fix the problem.

rs225 · Mar 27, 2016

Robert Trevellyan said:
They do? That would explain a lot, and be very good news. Do you have a reference?

Well, of course not.

Oh, wait, actually, this one time: http://idema.org/wp-content/plugins/download-monitor/download.php?id=1223
More interesting stuff at www.idema.org
http://idema.org/?page_id=416

It is a trade group for drive makers, and I'm sure there is far more interesting stuff available to members.

Robert Trevellyan · Mar 28, 2016

Interesting. I'm having trouble figuring out the real meaning of that document.
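
As a rough reading of that document (IDEMA LBA1-03, as far as can be told), the core of it is a single formula that fixes the LBA count for a given advertised capacity on drives over 50 GB with 512-byte logical sectors. A minimal Python sketch, assuming those constants are quoted correctly; the function name is only for illustration:

```python
def idema_lba_count(capacity_gb: int) -> int:
    """LBA count for 512-byte logical sectors, following the IDEMA formula.

    Assumes capacity_gb is the advertised decimal capacity and exceeds 50 GB.
    """
    return 97_696_368 + 1_953_504 * (capacity_gb - 50)


lbas = idema_lba_count(5000)   # a nominal 5TB drive
print(lbas, lbas * 512)        # sector count and size in bytes
```

If both makers follow the formula, two drives sold as 5TB report the same LBA count, which is why one can replace the other in a vdev.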

diskdiddler · Mar 29, 2016

Never ending nightmares with this thing........
Any idea why this happened?

[root@freenas] ~# zpool status
  pool: ARRAY
 state: ONLINE
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Mar 29 17:32:59 2016
        466G scanned out of 24.2T at 169M/s, 41h4m to go
        155G resilvered, 1.88% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        ARRAY                                           ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/7a3a09f4-10cb-11e4-bc4d-441ea13cac6b  ONLINE       0     0     0
            gptid/7aaa3c1a-10cb-11e4-bc4d-441ea13cac6b  ONLINE       0     0     0
            gptid/2a07e64b-9fc5-11e4-8140-28924a2d5aca  ONLINE       0     0     0
            gptid/7b8d77f7-10cb-11e4-bc4d-441ea13cac6b  ONLINE       0     0     0
            gptid/765fc016-f1a5-11e5-b50b-9cb6540b10fe  ONLINE       0     0     0  (resilvering)
            gptid/02a66f4c-f578-11e5-9fd0-9cb6540b10fe  ONLINE       0     0     0  (resilvering)

errors: No known data errors

?
I had a working system before shutdown, 5 disks (out of 6) installed and working.
Added disk #6 and resilvered, so now it feels compelled to do 5 as well..? Worrying, or normal?
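
The time estimate in that status output is roughly the remaining data divided by the current scan rate. A quick sketch of the arithmetic, assuming zpool's T, G and M suffixes are powers of 1024; the function name is only for illustration, and the result differs a little from zpool's own figure because the printed values are rounded:

```python
def resilver_eta_hours(total_tib: float, scanned_gib: float, rate_mib_s: float) -> float:
    """Rough hours remaining: (total - scanned) / rate, all in binary units."""
    remaining_mib = (total_tib * 1024 - scanned_gib) * 1024
    return remaining_mib / rate_mib_s / 3600


# Figures from the output above: 466G scanned out of 24.2T at 169M/s
print(round(resilver_eta_hours(24.2, 466, 169), 1))
```

That comes out at about 41 hours, in line with the 41h4m that zpool reports.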

diskdiddler · Mar 29, 2016

Seriously, does anyone know why 2 disks would resilver when replacing 1? It feels worrying to me.

Robert Trevellyan · Mar 29, 2016

diskdiddler said:
It feels worrying to me.

Justifiably so.

I'd seriously consider backing up all the data and starting over with full hardware burn-in.

diskdiddler · Mar 29, 2016

Is there some kind of advanced log I can check to see what initiated it on 2 disks, or why? (Thus far it seems to be working fine, 7 hours to go.) But still, an extra disk is thrashing now. Makes 0 sense to me.
I couldn't have accidentally done this by clicking something, could I?
