Replaced drive

Status
Not open for further replies.

dgraybill

Dabbler
Joined
Oct 22, 2012
Messages
18
I had a drive in a ZFS pool that was giving me a crap load of write errors and AHCI errors, enough that it was causing FreeNAS to lock up. So I stuck a new drive in and hit Replace. It took forever and crashed several times. This is the output I get now:

Code:
[root@freenas] ~# zpool status
  pool: Storage
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
  scan: resilvered 186G in 241h35m with 348669 errors on Wed Nov  7 14:39:23 2012
config:

        NAME                                            STATE     READ WRITE CKSUM
        Storage                                         ONLINE       0     0     0
          replacing-0                                   ONLINE       0     0     0
            gptid/6d3fa90d-1b23-11e2-b307-00241d29191e  ONLINE       0     0     4
            gptid/9254b0ca-2141-11e2-b3d2-00241d29191e  ONLINE       0     0     0
          gptid/71b35dc9-1b23-11e2-b307-00241d29191e    ONLINE       0     0     0
          gptid/59fa73fd-1bb5-11e2-a43a-00241d29191e    ONLINE       0     0     0

errors: 348669 data errors, use '-v' for a list
[root@freenas] ~#


But it won't let me delete the old drive from the pool in the GUI, like I did the last time I replaced a drive. Thoughts?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Have you run 'zpool status -v' like it says to get the list of data errors? It appears that you may have up to 348669 corrupted files on your server...
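For reference, a minimal sketch of saving that list to a file, since with this many errors it will scroll far past the console buffer. The pool name Storage comes from the output above; the function name save_error_list and the output path are made up for illustration, and it assumes an sh-compatible shell rather than the default csh root shell.

```shell
# Save the verbose error list for a pool to a file.
# $1 = pool name (e.g. Storage), $2 = output file (any writable path).
save_error_list() {
    pool="$1"
    out="$2"
    # 'zpool status -v' prints the normal status plus the list of affected files.
    zpool status -v "$pool" > "$out"
}
```

Something like `save_error_list Storage /tmp/zpool-errors.txt` would then leave the full list in a file you can page through.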

Also, I assume this error is the same one from your previous thread. You are running a RAID-0 (no redundancy), so a failure of any disk has catastrophic consequences for your data. It certainly appears that you are seeing the "rewards" of using RAID-0 in a multi-drive situation.

I could be mistaken, but I think your best option is to back up the data that is still on the server, delete and recreate the zpool with redundancy, then copy the data back to the server. There's a reason why there is a warning in bold in the manual regarding RAID-0.
 

dgraybill

Dabbler
Joined
Oct 22, 2012
Messages
18
These are 2 different situations. In my other thread my OS drive failed and I couldn't recreate the pool. That problem has since been rectified. The issue I speak of here is one drive that is slowly failing. I have had this issue before with an identical drive of the same age, and I replaced it just as I have here, but that time I was able to delete the old drive after the fact and everything was fine until this drive began acting up. I don't much care about the files that are corrupted, but I do care about the other files that aren't. My issue now is that I can access all the files on the NAS, but very slowly, because the dying drive is still in there and keeps giving AHCI errors. Then after a few thousand faults the kernel panics and locks up. My data is fine if I can get that bad drive out of there.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Did the resilvering ever complete without errors? I'm not too read-up on how resilvering works with unrecoverable errors. In this case I think it's possible that the old drive is still there because it couldn't do a 100% resilver. You also said that the system kept crashing. It's possible that the crashing (and I assume rebooting) has kept the resilvering from ever completing.

In any case, I'd say that you are in a pickle. If you can't successfully complete a resilvering because of the crashing I'm not sure you'll ever be able to remove the old drive.


That's why I said the best course of action is likely to delete and recreate the zpool. I'd add redundancy to save your data next time. Redundancy doesn't work as a backup, but it sure does save you some serious time and effort having to use those backups.

You mentioned AHCI errors. I've only seen those with a bad SATA cable. Any chance you could replace your SATA cable?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
If you are sure the resilvering completed (it looks like it may have completed as much as it will ever complete on Nov 7th), then I'd just do a shutdown, physically remove the failing drive, then boot up. If you have access to your zpool, then you can detach the removed disk from the GUI. If you can't access the data after you physically remove the failing drive, then the resilvering didn't complete and it looks like you're stuck with the bad drive since you have no redundancy.
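If it helps to work out which disk to detach, here is a rough sketch that pulls the first device under the replacing vdev out of zpool status output. It assumes the disk being replaced is listed first and the new disk second, which matches the output earlier in this thread (the first entry is the one carrying the checksum errors). The function name is hypothetical, and it needs an sh-compatible shell rather than csh.

```shell
# Read 'zpool status' output on stdin and print the first device
# listed under a "replacing-N" vdev.
# Assumption: the first child is the old disk, the second is the new one.
old_disk_in_replacing() {
    awk '
        # The line right after the replacing-N line is the first child.
        in_repl { print $1; exit }
        $1 ~ /^replacing-[0-9]+$/ { in_repl = 1 }
    '
}
```

Piping `zpool status Storage` into it should print the old disk's gptid, which is the device you would pass to `zpool detach Storage <device>` from the command line. Double-check the name by hand before detaching anything.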
 

dgraybill

Dabbler
Joined
Oct 22, 2012
Messages
18
And FYI, I don't have redundancy because the info is not important and my drives are different sizes. It's just a huge pain to get all the data back, so I'll try all avenues before starting over.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
The link to picasaweb is only to an empty album.

Redundancy for RAID isn't a backup. It's only meant to avoid the hassle of recovering from backups in the event of a failure. I'm glad you understand that. I really hate having to say things like "your setup cost you all of your data" and they reply with "but.. but.. but I have no backups".

I'd try a new cable, only because it's cheap and it would suck to use that cable in a year or two and have another drive "fail" because you reused the same cable.
 

dgraybill

Dabbler
Joined
Oct 22, 2012
Messages
18
Hmm, the picture loads for me. Something like AHCHI0 timeout on slot x port 0. The slot changes all the time; the port is always 0. There are some other errors too. I'll try a new cable tonight. Thanks.
 