David Dyer-Bennet
We've had a box running for two years (5+1 pool; yeah, I know that's risky and inefficient). Two weeks ago a disk reported itself bad, we replaced it, and it resilvered.
But this week, the replacement disk reported itself bad. No other disks showed any problems. Infant mortality is always a possibility of course, but...
When we opened the box, the disks all seemed kind of hot. On further investigation, it looks like the main fan that cools the disks probably wasn't hooked up. It seems very unlikely that the problem is two years old and didn't show up until now, so maybe we messed it up when changing the disk two weeks ago, or maybe the connector vibrated loose, or maybe gremlins. Whatever, we can't prove anything.
The thing is -- now I'm sitting on a degraded pool (working, but no redundancy), the remaining disks of which may well have been exposed to overtemp for up to two weeks (enough, maybe, to kill the new disk). So now I'm terrified that the remaining disks have very limited future life. Panic! (The server is currently powered off, so that life isn't ticking away.)
Options we've thought of include:
1. Replace the bad disk again, and let it start resilvering. Be very sure the fans are hooked up and running right. Then get the backup server in place for regular automatic backups (most of the data actually exists in multiple places, but not in an adequately organized fashion and it's not being automatically kept current).
2. Bring the pool up in degraded state (don't replace the failed disk), and immediately start replicating the data onto another server (we've got one, intended to be the backup server, nearly ready to go).
3. Bit-copy the disks individually somehow to new drives, bring the copies up in a server, and then replace the missing disk and let it resilver.
Is #3 possible, using FreeBSD or Windows tools? Do the general class of Windows utilities that do drive replication without regard to partitions or filesystem copy everything that matters (at least master boot record, partition table, and all partitions)? Has anybody actually done it?
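To make the bit-copy idea concrete, here is a minimal Python sketch of what such a tool does underneath: read the source in fixed-size blocks, write each block at the same offset on the target, and zero-fill any block that fails to read (roughly what `dd conv=noerror,sync` does; GNU ddrescue is the more careful tool for failing disks). The function name `clone_device`, the block size, and the `max_errors` cutoff are all made up for illustration, and it has only been reasoned about against ordinary files, not real drives.

```python
import os


def clone_device(src_path, dst_path, block_size=1024 * 1024, max_errors=100):
    """Copy src to dst block by block, zero-filling unreadable blocks.

    Returns (bytes_copied, bad_offsets). Hypothetical helper, Unix only.
    """
    bad_offsets = []
    offset = 0
    src_fd = os.open(src_path, os.O_RDONLY)
    dst_fd = os.open(dst_path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        while True:
            try:
                block = os.pread(src_fd, block_size, offset)
            except OSError:
                # Unreadable region: remember where it was, substitute zeros.
                bad_offsets.append(offset)
                if len(bad_offsets) > max_errors:
                    raise RuntimeError("too many read errors, giving up")
                block = b"\0" * block_size
            if not block:
                break  # end of source reached
            # pwrite may write fewer bytes than asked, so loop until done.
            view = memoryview(block)
            written = 0
            while written < len(block):
                written += os.pwrite(dst_fd, view[written:], offset + written)
            offset += len(block)
        os.fsync(dst_fd)
    finally:
        os.close(src_fd)
        os.close(dst_fd)
    return offset, bad_offsets


# Example call (device names are placeholders, double-check before running):
# copied, bad = clone_device("/dev/ada1", "/dev/ada7")
```

Because the copy works below the partition and filesystem level, it carries the partition table and every partition along with it, provided the target is at least as large as the source. One caveat of this sketch: if the very last block is unreadable, it pads a full block of zeros, which can overrun a target of exactly the same size.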
We've ruled out #1 as far too risky.
I'm planning to examine the failed disks tonight and tomorrow; look at SMART data and such. If they were exposed to overtemp, there should be a record there, shouldn't there? I'd also have expected the short SMART test to catch overtemp conditions and report them, and we got no such reports.
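For the SMART check, a rough sketch of pulling the temperature attributes out of `smartctl -A` output might look like the following. It assumes the usual ten-column attribute table and that the drive reports attribute 190 and/or 194; the raw-value layout is vendor-specific, so the "Min/Max" parsing only applies to drives that print it that way. The function name and the sample text are invented for illustration, not output from these disks.

```python
import re

# Attribute IDs commonly used for drive temperature (vendor-dependent).
TEMP_ATTRIBUTE_IDS = {190, 194}

# Made-up sample in the shape of `smartctl -A` output, for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
194 Temperature_Celsius     0x0022   112   089   000    Old_age   Always       -       38 (Min/Max 21/61)
"""


def parse_temperatures(smartctl_output):
    """Return {attribute_name: {"current": int|None, "max": int|None}}."""
    results = {}
    for line in smartctl_output.splitlines():
        fields = line.split(None, 9)  # the 10th field is the whole raw value
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # header or non-attribute line
        if int(fields[0]) not in TEMP_ATTRIBUTE_IDS:
            continue
        raw = fields[9]
        current = re.match(r"\d+", raw)
        minmax = re.search(r"Min/Max (\d+)/(\d+)", raw)
        results[fields[1]] = {
            "current": int(current.group()) if current else None,
            # Lifetime maximum, only when the drive reports it in this form.
            "max": int(minmax.group(2)) if minmax else None,
        }
    return results


if __name__ == "__main__":
    print(parse_temperatures(SAMPLE))
```

Where a drive does record a lifetime maximum, comparing it against the manufacturer's rated operating range would be one way to see whether the overtemp theory holds up.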
Is there some other approach that has better odds of recovering the pool?
Pending the results of examining the failed disks, we're kind of leaning towards #2 currently.
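For reference, option #2 would presumably boil down to a recursive snapshot followed by a `zfs send -R` piped into `zfs receive` on the backup box. The sketch below only builds the command strings so they can be reviewed before anything touches the pool; the pool name `tank`, the snapshot name, the host `backuphost`, and the target dataset are all placeholders, and `-F` on the receiving side will roll back the target dataset, so it only makes sense if the target is empty or disposable.

```python
import shlex


def replication_commands(pool, snapshot, backup_host, backup_dataset):
    """Build (snapshot_command, send_receive_pipeline) as shell strings.

    Nothing is executed here; all names passed in are placeholders.
    """
    snap = f"{pool}@{snapshot}"
    take = ["zfs", "snapshot", "-r", snap]   # recursive snapshot of the pool
    send = ["zfs", "send", "-R", snap]       # full replication stream
    receive = ["ssh", backup_host, "zfs", "receive", "-F", backup_dataset]
    pipeline = f"{shlex.join(send)} | {shlex.join(receive)}"
    return shlex.join(take), pipeline


if __name__ == "__main__":
    for cmd in replication_commands("tank", "rescue", "backuphost", "backup/tank"):
        print(cmd)
```

A send stream reads every allocated block once, which is about the same load on the surviving disks as a resilver, but it ends with the data sitting safely on a second machine.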