Safest course when all disks possibly compromised...

David Dyer-Bennet · Dec 30, 2015

We've had a box running for two years (5+1 pool; yeah, I know that's risky and inefficient). Two weeks ago a disk reported itself bad, we replaced it, and it resilvered.

But this week, the replacement disk reported itself bad. No other disks showed any problems. Infant mortality is always a possibility of course, but...

When we opened the box, the disks all seemed kind of hot. On further investigation, it looks like probably the main fan that cools the disks wasn't hooked up. Seems very unlikely that problem is two years old and didn't show up until now, so maybe we messed it up when changing the disk two weeks ago, or maybe the connector vibrated loose, or maybe gremlins. Whatever, we can't prove anything.

The thing is -- now I'm sitting on a degraded pool (working, but no redundancy), the remaining disks of which may well have been exposed to over-temp for up to two weeks (enough, maybe, to kill the new disk). So now I'm terrified that the remaining disks have very limited future life. Panic! (Server is currently powered off, so that life isn't currently ticking away).

Options we've thought of include:

1. Replace the bad disk again, and let it start resilvering. Be very sure the fans are hooked up and running right. Then get the backup server in place for regular automatic backups (most of the data actually exists in multiple places, but not in an adequately organized fashion and it's not being automatically kept current).
2. Bring the pool up in degraded state (don't replace the failed disk), and immediately start replicating the data onto another server (we've got one, intended to be the backup server, nearly ready to go).
3. Bit-copy the disks individually somehow to new drives, bring the copies up in a server, and then replace the missing disk and let it resilver.

#3 is more work and time. However, #3 appears to require the fewest hours of continued life in the existing disks; only a disk actively being copied would be powered on at all, and the copying should be more efficient than other methods (this pool is hideously full, so copying unused space won't hurt us much). The point here is to minimize the chance of one of the drives that may have been exposed to over-temp conditions failing before we can copy the data off it.

Is #3 possible, using FreeBSD or Windows tools? Does the general class of Windows utilities that replicate drives without regard to partitions or filesystems copy everything that matters (at least master boot record, partition table, and all partitions)? Has anybody actually done it?

We've ruled out #1 as far too risky.

I'm planning to examine the failed disks tonight and tomorrow; look at SMART data and such. If they were exposed to overtemp, there should be a record there, shouldn't there? I kind of would have expected the short SMART test to catch overtemp conditions and report them, also, and we got no such reports.
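For anyone wanting to check the same thing: a rough sketch of pulling the temperature attributes out of `smartctl -A` output. It assumes the usual ten-column attribute table and that the drive puts a "(Min/Max a/b)" note in the raw value of attribute 194; that format is vendor-specific and many drives don't report it. The sample text and the function names (`parse_temperatures`, `max_recorded`) are made up for illustration, not taken from any drive in this thread.

```python
import re

# SMART attribute IDs that commonly carry temperature readings.
TEMP_ATTRIBUTE_IDS = {190, 194}

def parse_temperatures(smartctl_output):
    """Return {attribute_name: raw_value} for temperature attributes."""
    temps = {}
    for line in smartctl_output.splitlines():
        # Ten columns: ID, name, flag, value, worst, thresh, type,
        # updated, when_failed, raw value (which may contain spaces).
        fields = line.split(None, 9)
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # header or unrelated line
        if int(fields[0]) in TEMP_ATTRIBUTE_IDS:
            temps[fields[1]] = fields[9].strip()
    return temps

def max_recorded(raw_value):
    """Return the recorded maximum if the raw value has a Min/Max note, else None."""
    match = re.search(r"Min/Max\s+(\d+)/(\d+)", raw_value)
    return int(match.group(2)) if match else None

# Invented sample in the layout smartctl -A normally prints.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   076   076   000    Old_age   Always       -       17520
194 Temperature_Celsius     0x0022   108   084   000    Old_age   Always       -       42 (Min/Max 18/66)
"""

if __name__ == "__main__":
    for name, raw in parse_temperatures(SAMPLE).items():
        print(name, "raw:", raw, "max recorded:", max_recorded(raw))
```

If a drive doesn't record a maximum, the only hint left is the WORST column, and what that number means is also up to the vendor.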

Is there some other approach that has better odds of recovering the pool?

Pending the results of examining the failed disks, we're kind of leaning towards #2 currently.

Mlovelace · Dec 30, 2015

I would go with option #2: back up the data ASAP, then rebuild the old server with a raidz2 pool.

Bidule0hm · Dec 31, 2015

I think #1 is less risky than #2, because with #2 one read error and the pool can be gone, but it depends on how bad the state of this drive is.

SMART tests don't warn you about overtemp. The SMART service in FreeNAS does, but only when the moon is full, on even days of the month and if you jump 3 times... More seriously, there's a bug: I have never received an email because of an overtemp (even when running tests where I know it should send the mail). For some people everything works. There's a thread about this bug somewhere. It's a shame because it's a very important thing and it's still unresolved.

Mirfster · Dec 31, 2015

Scary situation all around. Especially if there is the chance that any other drive could be bad (due to the heating issue). Myself, I would think about option #3 since I would want to baby it as much as possible.

Disclaimer: I have not tried this for a FreeNAS or even a RAIDed drive, but have done it for individual OS drives (like Windows and Ubuntu).

I would presume that you could take one drive at a time and put it in a "known good" system and make a backup or direct copy with Clonezilla (via bootable USB thumb drive). Personally, I would make a backup image to a network share or safe storage first.

Then once you have all the drives duplicated you can bring up the newly duplicated drives from there and proceed. If you made backup images with Clonezilla, you could even restore the images to VM hard drive(s) and possibly run the scenario in a VM instance of FreeNAS (again, I have not tested this either).

Last note: I am unsure whether FreeNAS would freak out if all the disks were new at once, with different identifiers. Perhaps someone could chime in on this?

Best of luck.

Mlovelace · Dec 31, 2015

Bidule0hm said:
I think #1 is less risky than #2 because one read error and the pool can be gone with #2

One read error on resilver and the pool is gone too. At least with option #2 you get some data recovery. First thing should be to save as much data as possible then replace the drives.

jgreco · Dec 31, 2015

#3 is possible but don't use the Windows tools. Use FreeBSD to make a dd copy from the cooked drives onto new drives.

jgreco · Dec 31, 2015

and also #3 is probably the SAFEST as long as you don't screw up and overwrite a disk you are trying to recover.
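To make the idea concrete, here is a hedged sketch in Python of what a dd-style rescue copy does: read the source in fixed-size blocks, write each block to the same offset on the target, and zero-fill any block that fails to read (roughly the behaviour of `dd` with `conv=noerror,sync`). It is an illustration only, not a replacement for dd. The function name `rescue_copy` is invented, finding the size by seeking to the end is an assumption that holds for regular files but may not hold for every disk device node (pass `total_size` explicitly if it doesn't), and the same-path check only catches the most obvious way of overwriting the disk you are trying to recover.

```python
import os

def rescue_copy(src_path, dst_path, block_size=1024 * 1024, total_size=None):
    """Copy src to dst block by block, zero-filling unreadable blocks.

    Returns the byte offsets of the blocks that could not be read.
    """
    # Guard against the classic mistake: source and target are the same.
    if os.path.exists(dst_path) and os.path.samefile(src_path, dst_path):
        raise ValueError("source and destination are the same disk or file")

    bad_offsets = []
    src = os.open(src_path, os.O_RDONLY)
    try:
        if total_size is None:
            # Assumption: seeking to the end reports the size.
            total_size = os.lseek(src, 0, os.SEEK_END)
        dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT, 0o600)
        try:
            offset = 0
            while offset < total_size:
                want = min(block_size, total_size - offset)
                try:
                    chunk = os.pread(src, want, offset)
                except OSError:
                    # Unreadable block: remember it and keep the layout
                    # intact by writing zeros of the same length.
                    bad_offsets.append(offset)
                    chunk = b"\0" * want
                if not chunk:
                    break  # source ended earlier than expected
                # Advance by what was actually written; a short write
                # just means the remainder is retried next iteration.
                offset += os.pwrite(dst, chunk, offset)
        finally:
            os.close(dst)
    finally:
        os.close(src)
    return bad_offsets
```

Called as something like `rescue_copy("/dev/ada1", "/dev/ada7")` (hypothetical device names), it returns the offsets that were zero-filled, which tells you how much damage ZFS will have to work around on that copy.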

danb35 · Dec 31, 2015

Mlovelace said:
One read error on resilver and the pool is gone too.

Why would this be the case? If there's a block read error somewhere in the data, the result should be that the file containing that block would be reported as having a data error. If it's in the metadata, all the metadata is redundant (up to 6x redundant, IIRC) and checksummed, so ZFS would know there's an error, and should be able to grab a clean copy somewhere else. Sure, if another disk outright dies, the pool is gone. But a bad block should have much more limited consequences. Or am I missing something?

Mirfster · Dec 31, 2015

jgreco said:
and also #3 is probably the SAFEST as long as you don't screw up and overwrite a disk you are trying to recover.

Agreed, that is why I mentioned making an image with Clonezilla. At least from there you can do multiple things with it: image a new drive or drop the image on a virtual drive.

David Dyer-Bennet · Jan 3, 2016

Well, I'm surprised. We re-booted the system with the "failed" new drive in place (which showed no signs of problems in the SMART data, at least; yeah, I know how much that can sometimes miss) and it took that drive back into the pool. So we're back to the original (inadequate) level of redundancy, at least. It does seem to be keeping the temp down, so the fan issue was real and we did fix it.

Still transferring the data off as fast as we can of course. Which is also having problems, and will be addressed in a new thread....

TwittyFlash · Apr 19, 2016

Hi folks,

I was experimenting with Clonezilla to back up my FreeNAS install until I discovered an error.
I created images of FreeNAS 9.3 and 9.3.1 with Clonezilla, but neither image could be restored.
For 9.3, I only saw "Booting..." after restoring the Clonezilla image.
For 9.3.1, I received "NTLDR is missing
Press Ctrl+Alt+Del to restart".
Has anyone encountered this?
Any advice will be appreciated. Thanks.

depasseg · Apr 19, 2016

It's probably a good idea to start a new thread. And it looks like you are trying to boot a Windows disk.

TwittyFlash · Apr 19, 2016

You mean my disk was being used for Windows earlier?
