Failed drive

Status
Not open for further replies.

PerryM

Dabbler
Joined
Aug 8, 2013
Messages
32
Nightmare! I am a rather ignorant old-timer so treat me gently!
I have a 5 drive FreeNAS system with ZFS and a single drive redundancy. I decided to replace one drive due to increased errors. I took it off line (as per manual) and proceeded to replace it with a new drive. On rebooting, it transpired that one of the other drives had failed completely during the change. I have thus replaced this with the 'error' drive, and thus should have 4/5 of the original data, but not in the original ATA slots.
What do I need to do to get the volumes restored? I am on FreeNAS-9.3-STABLE-201605170422 version.
Fingers crossed!
 

philhu

Patron
Joined
May 17, 2016
Messages
258
The slots do not matter in the least.
If the second drive errored while rebuilding the first, I think you are toast, as you were effectively down to RAID 0 (no redundancy) and the next drive died during the rebuild, causing RAID failure.

Did you boot the system? In shell try: zpool status

Due to current disk sizes, RAID 1 and raidz1 are NOT recommended, as you found out. Loss of a drive during a rebuild *IS* a very big problem, and really has about a 35% chance of occurring on a 4 TB drive.
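
The 35% figure above is not derived in the thread. As a rough sanity check, the sketch below estimates the chance of hitting at least one unrecoverable read error (URE) while reading a given amount of data during a rebuild. It assumes a datasheet rate of one URE per 10^14 bits read and independent errors; both are assumptions, not numbers from the thread, and real drives often do better than this worst-case spec. The function name `p_unrecoverable_read` is made up for illustration.

```python
import math

TB = 1e12  # decimal terabyte in bytes, as used on drive labels


def p_unrecoverable_read(bytes_read: float, ure_per_bit: float = 1e-14) -> float:
    """Chance of at least one URE while reading `bytes_read` bytes.

    ASSUMPTIONS (not from the thread): errors are independent and occur at
    `ure_per_bit` per bit read, a typical consumer-drive datasheet figure.
    """
    bits = bytes_read * 8
    # 1 - (1 - p)^bits, evaluated with log1p/expm1 so the tiny p is not lost
    return -math.expm1(bits * math.log1p(-ure_per_bit))


# One 4 TB drive, and the four surviving drives of a 5-drive raidz1
for label, terabytes in (("one 4 TB drive", 4), ("four 4 TB drives", 16)):
    print(f"{label}: {p_unrecoverable_read(terabytes * TB):.0%}")
```

Under these assumptions, reading one full 4 TB drive comes out at roughly 27%, and reading the four surviving 4 TB drives of a 5-drive raidz1 at roughly 72%. The result is very sensitive to the assumed error rate, so it is better read as an order-of-magnitude argument for raidz2 than as a prediction.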

Doing raidz2 is a pretty cheap way to keep your data.

I do 24 disks, broken into 2 vdevs using raidz3 on each, under one volume. My backup is an LTO-4 tape system running Bacula in a FreeBSD jail.
 

PerryM

Dabbler
Joined
Aug 8, 2013
Messages
32
philhu said:
The slots do not matter in the least.
If the second drive errored while rebuilding the first, I think you are toast, as you were effectively down to RAID 0 (no redundancy) and the next drive died during the rebuild, causing RAID failure.

Did you boot the system? In shell try: zpool status

The drive failed during the changeover - when I rebooted I immediately noticed that I only had 4 drives on line. Can I thus assume that it would not have started a rebuild?
Having replaced the failed drive with the original 'error' drive, I have the 5 drives back on line but the system is showing no sign of rebuilding. The master volume is saying: "Error getting available space" in the GUI.
The answer to zpool status is:
Code:
  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0h17m with 0 errors on Thu Aug  4 04:02:21 2016
config:

        NAME          STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          da0p2       ONLINE       0     0     0

errors: No known data errors
 

philhu

Patron
Joined
May 17, 2016
Messages
258
The only pool it sees is the boot device (freenas-boot); your data pool is not showing up at all.
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
It is far more likely you changed the wrong drive or knocked a cable loose during the change.

In the event you actually DID have a second drive fail, all you need to do is turn the system off, put the 'bad' drive back in, swap the Really Bad drive, and restart. raidz1 will not self-destruct; if there are insufficient drives available to import the pool, the pool doesn't import until you put back enough drives to operate the pool.

I am curious about the 'offline' step in the manual: I can see why you might do that for a hotswap, but for a cold swap, it seems to just create opportunity for disaster of exactly this variety.
 

PerryM

Dabbler
Joined
Aug 8, 2013
Messages
32
I appreciate your suggestions, but I definitely changed the 'right' drive, the one that I took offline. The drive that went down subsequently has failed catastrophically - it won't power up on a separate computer. I am an engineer and I don't normally believe in such coincidence - but not sure how I provoked the failure.
Just to emphasise - I essentially did what you suggested:
1) Put the 'bad' drive offline in the GUI Volume Menu, and Shut Down
2) Replaced the 'bad' drive with a brand new drive
3) Rebooted and found another (broken) drive missing
4) Replaced this drive with the 'bad' drive and rebooted to five drives again (4 with pool data)
5) No sign of any rebuilding of the volume
So to return to brass tacks - is there nothing I can do to try and restore the volume? The 'bad' drive has some bad sectors but should still have all the data, but presumably is 'still' offline. Will forcing a Scrub achieve anything?
 

philhu

Patron
Joined
May 17, 2016
Messages
258
Does your system support 'HOT-SWAP'? If not, putting the drive in the running machine or taking the old one out could cause such a problem.

If it does, then no shutdown is/was needed: just offline, replace, etc.

If it doesn't, then: offline, power down, replace drive, power up, etc.
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
I am glad you made no mistakes, but I was hoping you did, because it would help your situation.

The problem is the 'offline' drive. If the pool will not import now, then how do you online the drive? In theory, you should be fine. But offlining a drive, and then losing a second, you might be completely out of luck unless you can get the Really Dead drive to come back to life.

I would keep trying on the Really Dead drive; if you can get it to spin up on another machine, you might be able to make an image of it to your Very New drive, and get the pool online that way.

The only other remotely possible option I can think of is that, if you did the offline and then shut down immediately, it might be possible to have an expert modify your disk labels to reflect that the 'offline' drive is actually online, and then force a rollback to the moment you offlined the drive.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
With all the drives attached (ideally, all five drives of the original pool, plus the replacement drive), what's the output of 'zpool import'?
 

PerryM

Dabbler
Joined
Aug 8, 2013
Messages
32
Somehow, with a lot of tinkering I have managed to get the broken drive working. It seems to have a motor power issue and took ages to get to speed - it is only working 'on edge'. My drives are all mounted inside a cabinet, but it is currently perched on top!
Anyways - the system is currently resilvering the new drive (14% done) so we live in hope. The monitor on the FreeNAS system is showing all manner of on-going business but the web GUI is happy.
So to be sure - since I have 3 more drives to change - I do not need to take the drive off-line before I change it? This would of course be after a power down.
Thanks for all the suggestions & support, much appreciated.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
While you can, back up everything that is important on this system. It's probably not going to survive. Next, decide if you want to keep the raidz1 or switch to raidz2, since you will be buying new drives anyway. I would rebuild everything if I were you.

If not, then replace the other failing drive by following the manual. You can even leave it in the system if you don't need the SATA port it's using.

 

PerryM

Dabbler
Joined
Aug 8, 2013
Messages
32
All good points.
My 'broken' drive survived till about 75% resilvering of the new drive had completed and then gave up the ghost! I then tried the 'bad' drive that had been offlined, and this time the volume appeared and so I could select 'Replace' and it is now resilvering again (presumably from scratch). The only mystery is that I am left with another phantom 'drive' in the list that is currently 'unavailable'. I'll wait till the resilvering finishes and see what is showing.
Most of the NAS is used to back up other computers in the house (it is a domestic environment) but I have about 3TB of movies with nowhere to go! Using 'zpool status' in the console, it looks like most have survived so I guess I'll need to borrow a RAID or three from work and park the movies whilst I rebuild the new system.
I own a 16TB (16x1TB drives) RAID box that used to be in a video edit system. Currently my Mac has the connecting SAS board but the driver got left behind with an OS update. Probably just as well, since it makes a hell of a din when working!
Interesting times but the missus is weary of a dismantled NAS all over the dining table!
 

PerryM

Dabbler
Joined
Aug 8, 2013
Messages
32
I think I'm buggered! It gets to about the same position (75%) as when the broken drive failed terminally, and then goes into a continuous stream of missing data. I guess all I can do is wait until it is finished and then try and rescue any files remaining and then rebuild from scratch, maybe with RaidZ2 this time.
You live and learn!
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
You can run ddrescue on the 75% drive and hopefully recover nearly all of it to a new drive, and then let it resilver from that image.
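
As a hedged illustration of the ddrescue suggestion above, the sketch below only prints a two-pass plan and executes nothing. It assumes GNU ddrescue is available on whatever machine the drives are attached to; the device names, the mapfile path and the helper `ddrescue_plan` are hypothetical placeholders. The target drive would be overwritten in full, so the device names must be verified before running anything like this for real.

```python
import shlex


def ddrescue_plan(source: str, target: str, mapfile: str) -> list[list[str]]:
    """Return two GNU ddrescue invocations as argument lists (nothing is run)."""
    return [
        # Pass 1: copy everything that reads cleanly and skip the slow
        # scraping phase (-n); -f is needed to overwrite a device.
        ["ddrescue", "-f", "-n", source, target, mapfile],
        # Pass 2: return to the bad areas recorded in the mapfile and
        # retry them up to 3 times (-r3).
        ["ddrescue", "-f", "-r3", source, target, mapfile],
    ]


# HYPOTHETICAL names: failing source disk, new target disk, mapfile location
for argv in ddrescue_plan("/dev/ada3", "/dev/ada5", "/root/ada3.map"):
    print(shlex.join(argv))
```

Reusing the same mapfile for both passes is what lets ddrescue resume after an interruption and concentrate the second pass on the sectors that failed the first time, which matters with a drive that may stop spinning at any moment.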
 

PerryM

Dabbler
Joined
Aug 8, 2013
Messages
32
It has ceased to rotate!
It is frustrating that I have 4 working drives with all the data, and a new drive that was 75% resilvered. It is just that they were not in the right place at the right time!!
I will be interested to see what state it is in when it has finally finished (if?) the current resilvering - it has reached 84% but is now crawling through. I still have no idea what the 'phantom' drive in the list is - it has a 19 digit code.
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
This would be a good time to make sure you have backed up your most critical data.

For any critical data that is lost, there is one other desperate act. As you say, you have/had all the data, just not in the right place/time. If you know that one drive has not fully resilvered, but you still have the prior/earlier version of that same drive, you can (as a last resort that may itself be destructive) image some of the old drive over the new drive that didn't finish a resilver. In theory, it might allow you to recover enough additional data to make it worth the destruction it may also create. In order for it to work, you have to be certain not to overwrite the critical metadata that identifies the new drive as 'new'.
 

PerryM

Dabbler
Joined
Aug 8, 2013
Messages
32
Just to tidy this up. The resilver finished with a 'degraded' status, no surprise there! I did manage to rescue some of my films, about 95% were available with the rest showing an I/O error. The main problem was finding somewhere to temporarily store them!
I have rebuilt the system with new disks, so we will keep our fingers crossed for better luck with them.
Thanks again for the support.
 

philhu

Patron
Joined
May 17, 2016
Messages
258
If you have a spare drive port, you can replace a drive by adding a new drive without offlining the old drive first.

How do you accomplish this? I have a drive failing soon, and have 13 extra slots/bays and a spare drive.
How would one do this?
 

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874