diskdiddler
Wizard - Joined Jul 9, 2014 - Messages 2,377
"Haven't found my notes on OCCT; but maybe take a look at UBCD. Others have said they use it, but I'm not sure if they use any of the CPU tools or not."
I see some information indicating they did have a live version but no longer support it; interestingly, they seem to have pulled the download links.
This entire thread has completely challenged my understanding of hard disks, and I've been working on PCs for 26 years. I'll explain why in a minute.
So, from my POV: I've replaced the entire server with the same model, brand new. The RAM has been retained but thoroughly tested, and obviously the remaining 4 working drives with my data have been kept as well. (Disk 0 is starting to exhibit bad sectors; only 1 file damaged so far**.)
I made sure to use a different Molex -> SATA adapter power cable and different SATA data cables.
I've thoroughly thrashed the final replacement disk in another PC with h2testw plus short and extended SMART tests - it has 0 bad sectors, according to the PC it was in.
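For anyone curious, the fill-then-verify idea behind h2testw can be sketched in a few lines of shell. This is only an illustration: a scratch file stands in for the device under test, and the filenames are made up - pointing this at a real disk is destructive and should only ever be done deliberately.

```shell
# Sketch of h2testw's approach: write a known pattern, read it back,
# and compare. A scratch file stands in for the real device here.
dd if=/dev/urandom of=pattern.bin bs=1M count=8 status=none  # known pattern
cp pattern.bin scratch.img                                   # the "write" pass
cmp -s pattern.bin scratch.img && echo "verify OK" || echo "verify FAILED"
```

Any mismatch on the read-back pass means the medium (or the path to it - cable, controller, power) silently corrupted data.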
It's in the process of resilvering right now; we'll see what happens. In theory, this should work.
SO!
If this works, here's my conundrum.
I can understand a machine with a faulty board, PSU, CPU, disk controller, or cable (whatever it was) causing multiple soft read errors, perhaps due to the disks spinning up and down too much (not enough power?).
BUT how on EARTH can it cause actual, real SMART-level bad sectors on multiple drives? (It has occurred on at least THREE disks in the past 4 weeks.)
That I do not in any way comprehend, and I've never seen this before, ever! It makes no sense to me. Those 3 disks might in fact be fine? They just spun down too often? I simply don't know. What frustrates me is that I need to go back to my disk supplier, sheepishly, again, and effectively play dumb, saying "yeah, here's another 3 disks." I'm not looking forward to it. (Retail customer support in Australia is atrocious compared to the USA; there's none of that "yes sir! no sir! no questions asked, sir!" stuff. I already feel I'm close to the "get out of here, we won't honor this warranty" stage.)
I honestly think this warrants further discussion - not to help me, but as general technical discussion, because this is a fascinating failure and may help others in future.
** Thank goodness, and not an important file either, but it has me sweating.
If you weren't doing SMART long tests before, those bad sectors could be unrelated.
If you were doing them, then their sudden appearance is probably related to the overall problem. Which is likely power. While two drives were clearly starved, the other drives may have been at the edge. Writes may have been getting corrupted or had sufficiently low signal to be unreadable, even with drive-ECC. In other words, these drives may be fine. You should do long tests on each of them, and manually overwrite sectors which fail, which should allow a selective-long test to resume until the next fail. If you can get to 100%, the drives are fine.
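The "manually overwrite sectors which fail" step is usually done with dd seeking to the reported LBA, which forces the drive to remap the sector; a selective SMART test can then resume from there. Below is a hedged sketch: the image file stands in for the real device, and LBA 12345 is a made-up example (on a real drive you'd take it from the smartctl error log).

```shell
# Overwrite one (hypothetical) failing sector so the drive can remap it.
# DISK is a scratch image standing in for /dev/sdX; LBA 12345 is made up.
DISK=disk.img
LBA=12345
dd if=/dev/zero of="$DISK" bs=1M count=16 status=none             # stand-in "disk"
dd if=/dev/zero of="$DISK" bs=512 seek="$LBA" count=1 conv=notrunc status=none
# On the real device, resume the selective self-test afterwards, e.g.:
#   smartctl -t select,12345-max /dev/sdX
echo "sector $LBA overwritten"
```

`conv=notrunc` matters: without it, dd would truncate the target at the overwritten sector.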
That you lost data implies this is all related, because otherwise losing data on raidz2 is improbable. And yet it happens, which is why I say raidz1 is not as bad as people say: genuine double drive failure is simply unlikely on a properly scrubbed/tested array. But you didn't suffer double drive failure: you suffered systemic failure, which affected many of your drives simultaneously (hence, RAID is not a backup). And in this case, you appear to be a very lucky person.
What would have happened if you had a raidz1 pool? I will speculate: the pool would have become UNAVAIL when the second disk registered write failures. That freezes the pool, but it isn't necessarily lost. With luck (which you turn out to have), you would have ended up in much the same situation once you were able to get the right drives working at the same time, with potentially the first-failed drive disconnected. Knowing all of that may not have been easy at that hypothetical time. The power situation may have left you unable to get 5 drives working in that system while still able to get 4 - good enough for your raidz2, but it wouldn't have been for a hypothetical raidz1.
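The raidz1-vs-raidz2 reasoning above reduces to one comparison: the pool survives while the number of failed drives does not exceed its parity count. A trivial sketch, using the two power-starved drives from this thread as the assumed failure count:

```shell
# A raidz pool stays importable while failed drives <= parity drives.
failed=2      # the two power-starved drives in this thread
for parity in 1 2; do                        # raidz1 vs raidz2
  if [ "$failed" -le "$parity" ]; then
    echo "raidz$parity: degraded but alive"
  else
    echo "raidz$parity: UNAVAIL"
  fi
done
```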
"The memory has passed about 10 passes in 2 different bootable USB memtest programs."
It's good that you've tested your RAM, and I understand that you've made choices based on factors beyond your control, but it's important to understand the real issue here. All RAM is prone to random bit flips due to cosmic radiation. ECC RAM can detect a bit flip and either correct it or halt the system. Non-ECC RAM just delivers the incorrect data. This has nothing to do with whether the RAM is faulty or not, so no amount of successful memory test passes can prove that your data will be safe with non-ECC RAM.
"assuming there's going to be a very precise amount of space to work with and fall just slightly short, marking those sectors as bad and (yet again) tripping off a resilver failure ... It's a theory and I could, hopefully, be wrong."
That's not how it works. ZFS won't let you replace a drive with a smaller drive.
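The bit-flip point is easy to demonstrate. If one bit flips in RAM before a checksum is computed, every checksum downstream faithfully covers the corrupted bytes - nothing detects it. Here 'r' (0x72) and 'R' (0x52) differ in exactly one bit:

```shell
# One bit flip in RAM before checksumming -> the checksum validates
# the wrong data. 'r' (0x72) vs 'R' (0x52) differ by a single bit (0x20).
printf 'important data' > good.bin
printf 'impoRtant data' > flipped.bin
sha256sum good.bin flipped.bin          # two entirely different digests
cmp -s good.bin flipped.bin || echo "one flipped bit, silently different data"
```

That is why memtest passes can't rule this failure mode out: the RAM isn't defective, the flip is random, and the corrupted data looks perfectly valid to everything that touches it afterwards.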
"Can you please elaborate?"
ZFS doesn't have a database of every disk ever made that it consults to figure out the size of a drive; it just asks the drive what size it is. If drives lied about their size, everything would go to hell ... which some people have experienced with counterfeit USB sticks.
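That "just asks the drive" inquiry can be mimicked on a plain file. The sparse image below is a made-up stand-in; on a real block device the equivalent query is `blockdev --getsize64 /dev/sdX`.

```shell
# A sparse file standing in for a drive; then ask its size -- the same
# "how big are you?" inquiry the OS makes of a real block device.
truncate -s 4G fake-disk.img
stat -c %s fake-disk.img          # size in bytes
# Real-device equivalent: blockdev --getsize64 /dev/sdX
```

A counterfeit USB stick answers this inquiry with a capacity it doesn't physically have, which is why writes past the real flash silently wrap or vanish - exactly what fill-and-verify tools like h2testw catch.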