Resilvering to an UNAVAIL disk? What's going on


diskdiddler

Wizard
Joined
Jul 9, 2014
Messages
2,377
I see some information indicating they did have a live version but no longer support it; interestingly, they seem to have pulled the download links.
 

diskdiddler

Wizard
Joined
Jul 9, 2014
Messages
2,377
Well, it's passed a couple of memtests, but I'm now running one with all the disks plugged in too, to see if it works.
Memtest.org, which offers an SMP all-core mode, will crash; HOWEVER, I've seen that mode crash on a heap of systems :/ so I'm not convinced that's a reliable test.
(It also crashes at exactly the same point each time: 19%.)

I'm going to let it run at least 3 passes tonight.
I'm also continuing to thoroughly test the replacement drive before putting it in. It passed a full SMART check, but H2testw spat this out at me:
"Media has filled up earlier than expected!
In the beginning there were 4769030 MByte free, but only
4769023 MByte could be written.
Warning: Only 4769030 of 4769305 MByte tested.
Writing speed: 158 MByte/s
H2testw v1.4"
That program is actually designed to spot fake USB keys, but the logic is sound: write to 100% of the device, then try to read 100% of it back.
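For anyone wanting to run the same kind of check without a Windows box, a rough command-line equivalent (a sketch only: badblocks is from Linux e2fsprogs, smartctl from smartmontools, and /dev/sdX is a placeholder for the spare drive, which must hold no data you care about):

    # Destructive full-surface write-and-verify, the same idea as H2testw:
    badblocks -wsv -b 4096 /dev/sdX
    # Then a long SMART self-test so the drive reads every sector itself:
    smartctl -t long /dev/sdX
    smartctl -a /dev/sdX    # afterwards, check the self-test log and pending/reallocated sector counts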

An interesting error message from a brand-new hard drive, to be honest.
 

Mirfster

Doesn't know what he's talking about
Joined
Oct 2, 2015
Messages
3,215
I see some information indicating they did have a live version but no longer support it; interestingly, they seem to have pulled the download links.
Haven't found my notes on OCCT, but maybe take a look at UBCD. Others have said they use it, but I'm not sure if they use any of the CPU tools or not.
 

diskdiddler

Wizard
Joined
Jul 9, 2014
Messages
2,377
Well, I'm replacing the entire server today, just in case. It's cheap and easy. The spare disk has seemingly passed multiple tests.
I have no idea what FreeNAS will do booting up on an identical but different motherboard (which will have a very slightly different BIOS).
It could be interesting...
 

diskdiddler

Wizard
Joined
Jul 9, 2014
Messages
2,377
This entire thread has completely challenged my understanding of hard disks, and I've been working on PCs for 26 years... I'll explain why in a minute.

So:
I've replaced the entire server with the same model, brand new. The RAM has been retained but thoroughly tested, and obviously the remaining 4 working drives with my data have been kept too. (Disk 0 is starting to exhibit bad sectors; only 1 file damaged so far.**)
I ensured different Molex-to-SATA adapter power cables and different SATA data cables.
I've thoroughly thrashed my final replacement disk in another PC with H2testw and a SMART short and full test; it's got 0 bad sectors, according to the PC it was in.
It's in the process of resilvering right now; we'll see what happens. In theory, this should work.
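A minimal way to keep an eye on the resilver from the FreeNAS shell (a sketch; "tank" stands in for the actual pool name):

    zpool status -v tank      # resilver progress and per-disk read/write/checksum error counters
    zpool iostat -v tank 5    # per-disk throughput every 5 seconds while it runs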


SO!
If this works, here's my conundrum.
I can understand a machine with a faulty board, PSU, CPU, disk controller or cable (whatever it was) causing multiple soft read errors, perhaps due to the disks spinning up and down too much (not enough power?).
BUT how on EARTH can it cause actual, real, SMART-level bad sectors on multiple drives? (It has happened to at least THREE disks in the past 4 weeks.)
That I do not comprehend in any way; I dunno, I've never seen this before, ever! It makes no sense to me. Those 3 disks might in fact be fine? They just spun down too often? I simply don't know. What frustrates me is that I need to go back to my disk supplier, sheepishly, again, and effectively play dumb, saying "yeah, here's another 3 disks". I'm not looking forward to it. (Retail customer support in Australia is atrocious compared to the USA; there's none of that "yes sir! no sir! no questions asked, sir!" stuff, and I'm already feeling I'm close to the "get out of here, we won't honor this warranty" stage.)

I honestly think this requires further discussion, not to help me but as a general technical discussion, because this is a fascinating thing and may help others in the future.


** Thank goodness, and not an important file either, but it has me sweating.
 

diskdiddler

Wizard
Joined
Jul 9, 2014
Messages
2,377
BTW: thanks to all who have helped; I appreciate it.
 

Mirfster

Doesn't know what he's talking about
Joined
Oct 2, 2015
Messages
3,215
I've replaced the entire server with the same model, brand new. The RAM has been retained but thoroughly tested, and obviously the remaining 4 working drives with my data have been kept too. (Disk 0 is starting to exhibit bad sectors; only 1 file damaged so far.**)
So, from my POV:
  1. Disks were validated/vetted on an entirely different machine and passed without errors.
  2. Only the disks and RAM were re-used in an entirely new build.
  3. The RAM is still non-ECC, correct? (If I recall, you stated so earlier.)
    • Still not sure if this is accurate/possible, since the system specs call for ECC RAM.
    • Any way you can validate this or provide a part #?
    • If it turns out to be ECC, I will stop harping on you about it. :)
  4. Even though the server is "new", I think you still have the under-powered PSU issue, correct?
So, for argument's sake, let's eliminate cabling, connectors, and the motherboard, since they are all new. Toss out the drives as well, since you tested them. That would tend to lead me to look at the other common denominators... your RAM and PSU...

Personally, I would lean towards the RAM right now and figure out whether it is in fact non-ECC. If so, I am seriously confused about how you got that to work. The PSU still needs to be addressed too. Also, I would look at temps to see if the drives are getting cooked...
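A quick way to check those drive temperatures from the FreeNAS shell (a sketch only; the ada0-ada5 device names are just examples):

    for d in ada0 ada1 ada2 ada3 ada4 ada5; do
        echo -n "$d: "; smartctl -A /dev/$d | grep -i temperature_celsius
    done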
 

diskdiddler

Wizard
Joined
Jul 9, 2014
Messages
2,377
I am using non-ECC; FreeNAS will work fine without it. I would prefer ECC, but I had difficulty getting hold of it at a reasonable price.
The memory has completed about 10 passes in 2 different bootable USB memtest programs.

4 of the disks are from the old server; I can't go without these disks, that's my data.
The 5th (of 6) disks has been thoroughly vetted on another machine.

I am still on an under-powered PSU, but I've removed one hard drive and one fan, and the PSU is brand new. I also don't intend to cook this one.
This machine is 3 years old this month and spent the vast majority of its time with 6, and at one point 7, disks in it.

Note: my system ran completely fine from Oct 2014 until the last month. The issues it's had are, I suspect, heat related, from one summer when I thought I'd gamble without the external fan. Hence the 59 °C.
I suspect that if it remains well cooled, unlike the last one, I should get 12 to 36 months out of it, which is all I need.
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
If you weren't doing SMART long tests before, those bad sectors could be unrelated.

If you were doing them, then their sudden appearance is probably related to the overall problem, which is likely power. While two drives were clearly starved, the other drives may have been at the edge. Writes may have been getting corrupted, or had sufficiently low signal to be unreadable even with the drive's internal ECC. In other words, these drives may be fine. You should do long tests on each of them and manually overwrite sectors which fail, which should allow a selective long test to resume until the next failure. If you can get to 100%, the drives are fine.
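A sketch of that test-overwrite-resume loop, assuming smartmontools on the FreeNAS box; /dev/ada1, the LBA 123456789 and a 512-byte logical sector size are placeholders:

    smartctl -t long /dev/ada1              # full-surface read test
    smartctl -l selftest /dev/ada1          # if it failed, note LBA_of_first_error
    # Overwrite just that sector so the drive can remap it (destroys that sector's contents;
    # on a redundant pool ZFS will repair the affected block on the next scrub).
    # Note: FreeBSD may refuse writes to a disk that is in use; offline it from the pool first.
    dd if=/dev/zero of=/dev/ada1 bs=512 count=1 seek=123456789
    smartctl -t select,123456789-max /dev/ada1   # resume the surface test from that LBA onward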

That you lost data implies this is all related, because otherwise losing data on raidz2 is improbable. And yet, it happens, which is why I say raidz1 is not as bad as people say, because genuine double drive failure is simply unlikely on a properly scrubbed/tested array. But you didn't suffer double drive failure: You suffered systemic failure, which affected many of your drives simultaneously (hence, RAID is not a backup). And in this case, you appear to be a very lucky person.

What would have happened if you had a raidz1 pool? I will speculate: The pool would have become UNAVAIL when the second disk registered write failures. That freezes the pool, but it isn't necessarily lost. With luck (which you turn out to have), you would have ended up in much the same situation once you were able to get the right drives to work at the same time, with potentially the first-failed drive disconnected. Knowing all of that may not have been easy at that hypothetical time. The power situation may have left you in a situation where you couldn't get 5 drives working in that system, but you could get 4, which was good enough for your raidz2, but wouldn't have been in a hypothetical raidz1.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
This entire thread has completely challenged my understanding of hard disks, and I've been working on PCs for 26 years... I'll explain why in a minute.

So:
I've replaced the entire server with the same model, brand new. The RAM has been retained but thoroughly tested, and obviously the remaining 4 working drives with my data have been kept too. (Disk 0 is starting to exhibit bad sectors; only 1 file damaged so far.**)
I ensured different Molex-to-SATA adapter power cables and different SATA data cables.
I've thoroughly thrashed my final replacement disk in another PC with H2testw and a SMART short and full test; it's got 0 bad sectors, according to the PC it was in.
It's in the process of resilvering right now; we'll see what happens. In theory, this should work.


SO!
If this works, here's my conundrum.
I can understand a machine with a faulty board, PSU, CPU, disk controller or cable (whatever it was) causing multiple soft read errors, perhaps due to the disks spinning up and down too much (not enough power?).
BUT how on EARTH can it cause actual, real, SMART-level bad sectors on multiple drives? (It has happened to at least THREE disks in the past 4 weeks.)
That I do not comprehend in any way; I dunno, I've never seen this before, ever! It makes no sense to me. Those 3 disks might in fact be fine? They just spun down too often? I simply don't know. What frustrates me is that I need to go back to my disk supplier, sheepishly, again, and effectively play dumb, saying "yeah, here's another 3 disks". I'm not looking forward to it. (Retail customer support in Australia is atrocious compared to the USA; there's none of that "yes sir! no sir! no questions asked, sir!" stuff, and I'm already feeling I'm close to the "get out of here, we won't honor this warranty" stage.)

I honestly think this requires further discussion, not to help me but as a general technical discussion, because this is a fascinating thing and may help others in the future.


** Thank goodness, and not an important file either, but it has me sweating.

I already answered this on the previous page: https://forums.freenas.org/index.ph...-disk-whats-going-on.42197/page-2#post-273012
 

diskdiddler

Wizard
Joined
Jul 9, 2014
Messages
2,377
If you weren't doing SMART long tests before, those bad sectors could be unrelated.

If you were doing them, then their sudden appearance is probably related to the overall problem, which is likely power. While two drives were clearly starved, the other drives may have been at the edge. Writes may have been getting corrupted, or had sufficiently low signal to be unreadable even with the drive's internal ECC. In other words, these drives may be fine. You should do long tests on each of them and manually overwrite sectors which fail, which should allow a selective long test to resume until the next failure. If you can get to 100%, the drives are fine.


Sectors are still dying on Disk 0 (50% through now), but only a handful. I'm inclined to believe the disk is now genuinely dying. (I did do full SMART checks once a month previously, and short SMART checks every 2 days.)
  • CRITICAL: Device: /dev/ada0, 16 Currently unreadable (pending) sectors
  • CRITICAL: Device: /dev/ada0, 2 Offline uncorrectable sectors
  • CRITICAL: Device: /dev/ada0, ATA error count increased from 12 to 22
(Believe it or not, this disk is NOT my major concern, as it's part of the 4 which have been in the system for over a year.)
Until this resilver hits 100%, I'm not convinced I'm out of the woods.
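For reference, the raw counters behind those alerts can be pulled straight from the drive (a sketch, using the same /dev/ada0 as in the alerts):

    smartctl -A /dev/ada0 | egrep 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count'
    smartctl -l error /dev/ada0    # the ATA error log behind the "error count increased from 12 to 22" alert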

That you lost data implies this is all related, because otherwise losing data on raidz2 is improbable. And yet, it happens, which is why I say raidz1 is not as bad as people say, because genuine double drive failure is simply unlikely on a properly scrubbed/tested array. But you didn't suffer double drive failure: You suffered systemic failure, which affected many of your drives simultaneously (hence, RAID is not a backup). And in this case, you appear to be a very lucky person.


I actually do have a backup of my super-critical data, which is irreplaceable. What remains is simply awkward and time-consuming to replace (extremely so, but not irreplaceable).


What would have happened if you had a raidz1 pool? I will speculate: The pool would have become UNAVAIL when the second disk registered write failures. That freezes the pool, but it isn't necessarily lost. With luck (which you turn out to have), you would have ended up in much the same situation once you were able to get the right drives to work at the same time, with potentially the first-failed drive disconnected. Knowing all of that may not have been easy at that hypothetical time. The power situation may have left you in a situation where you couldn't get 5 drives working in that system, but you could get 4, which was good enough for your raidz2, but wouldn't have been in a hypothetical raidz1


So here's my REAL worry, since I'm only 50% in.
Despite Disk 0 clearly being on its way out, the last few weeks' issues have really been about replacement drives (3 have failed with bad sectors... 3!).
There's been a firmware change from FP1A to FP2A (the disks are Toshiba MD04ACA500).
All my new disks are FP2A. I'm concerned there's been an ever-so-slight tweak to the amount of usable space on the disks, or something like that, so FreeNAS will resilver and test away, assuming there's going to be a very precise amount of space to work with, fall just slightly short, mark those sectors as bad and (yet again) trip a resilver failure.
It's a theory, and I could, hopefully, be wrong. I didn't want to mention this earlier as it makes this thread vastly more complex.
We'll know in 14 hours if I'm right, or close to it.
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
The memory has completed about 10 passes in 2 different bootable USB memtest programs.
It's good that you've tested your RAM, and I understand that you've made choices based on factors beyond your control, but it's important to understand the real issue here. All RAM is prone to random bit flips due to cosmic radiation. ECC RAM can detect a bit flip and either correct it, or halt the system. Non-ECC RAM just delivers the incorrect data. This has nothing to do with whether the RAM is faulty or not, so no amount of successful memory test passes can prove that your data will be safe with non-ECC RAM.
assuming there's going to be a very precise amount of space to work with, fall just slightly short, mark those sectors as bad and (yet again) trip a resilver failure ... It's a theory, and I could, hopefully, be wrong.
That's not how it works. ZFS won't let you replace a drive with a smaller drive.
 

diskdiddler

Wizard
Joined
Jul 9, 2014
Messages
2,377
It's good that you've tested your RAM, and I understand that you've made choices based on factors beyond your control, but it's important to understand the real issue here. All RAM is prone to random bit flips due to cosmic radiation. ECC RAM can detect a bit flip and either correct it, or halt the system. Non-ECC RAM just delivers the incorrect data. This has nothing to do with whether the RAM is faulty or not, so no amount of successful memory test passes can prove that your data will be safe with non-ECC RAM.


Sure, but this same RAM worked fine for 2 years, it's passed testing, and even if it passes a bad piece of data along, it's not going to be the cause of hard disks getting sectors marked as bad. ECC is a paranoid move to ensure your data is EXTRA extra safe; my next system will use it, but as it stands, I'm pretty comfortable not using it in this system as is.

That's not how it works. ZFS won't let you replace a drive with a smaller drive.

Can you please elaborate? Or perhaps I should elaborate on my thinking.
The replacement disk is marked as the same model, same type, same size, in THEORY, but perhaps the usable surface is 500 kB short, or 5 MB short; I don't know, I really don't.
Hence ZFS / FreeNAS accepts the disk because it reports the exact same 5,000,111,222,333,000 bytes (or whatever it is) as the others, but you can't actually write in full to every last, single sector?
That's my theory. I hope I'm wrong.

Here's a similar, dumb question.
If I cracked it with Toshiba and replaced my 5TB Toshiba with a 5TB WD Red, would FreeNAS accept it? Does WD define 5 TB as exactly the same amount of space as Toshiba does?
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
If the replacement drive were smaller than the others in your pool, the replacement would fail right away, not after 99% of the resilvering is complete. As long as the replacement drive is at least as large as the others in your pool, you'll be fine.
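If in doubt, the exact capacities can be compared before committing to a swap (a sketch using FreeBSD's diskinfo; ada1 stands in for an existing pool member and ada6 for the hypothetical candidate):

    diskinfo -v ada1 | grep 'mediasize in bytes'    # existing pool member
    diskinfo -v ada6 | grep 'mediasize in bytes'    # candidate replacement
    # zpool replace should succeed as long as the candidate's byte count is equal or larger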
 

diskdiddler

Wizard
Joined
Jul 9, 2014
Messages
2,377
So it would simply not accept it as a viable replacement, even if it's only a handful of bytes short?
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Correct, and that's one of the reasons for the swap partition on each disk. If the replacement disk is a little smaller such that the system refuses to accept it, you can set the swap size to 0 and replace the disk. That will give you an extra 2 GB of wiggle room.
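The partition layout, including that swap buffer, can be inspected on any member disk from the shell (a sketch; ada0 is an example device name):

    gpart show ada0    # the default FreeNAS layout puts a 2 GiB freebsd-swap partition ahead of the freebsd-zfs partition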
 

diskdiddler

Wizard
Joined
Jul 9, 2014
Messages
2,377
That I wasn't aware of; where is that configured? (Thanks for the help; sorry we've branched off topic.)
How generous is the default? It seems like a good idea / feature which paranoid admins might want to tinker with in case they think they may change disk sizes / brands.
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
Can you please elaborate?
ZFS doesn't have a database of every disk ever made that it consults to figure out the size of the drive; it just asks the drive what size it is. If drives lied about their size, everything would go to hell ... which some people have experienced with counterfeit USB sticks.
 