URGENT! One of my disks has failed, and not sure how to replace!

Status
Not open for further replies.

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,403
I'm still not clear what happened here ... is this a disk failure?
It may be disk failures.

What could be the cause? Have we lost everything? How to prevent in future???

I'm kinda losing faith in this whole ZFS/FreeNAS thing ... I don't see what we did wrong ... have we just toasted 8TB of data??
How long were you running with a failed disk? A raidz1 array only protects against a single disk failure. Hopefully, you have some backups as you very well may need them.

You can send a prayer to the ZFS gods and hope you are running into a bug with the early ZFS version. In which case booting up 8.3.0-BETA1 and replacing the disk & scrubbing might help.
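
To make the single-failure limit concrete, here is a simplified sketch of single-parity reconstruction. It is a toy XOR model, not how raidz1 actually lays data out on disk, and the block contents and the `xor_blocks` helper are made up for illustration.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Four data blocks plus one parity block: a toy stand-in for a 5-disk single-parity stripe.
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(data)

# One block lost: XOR of the survivors and the parity gives the missing block back.
one_lost = xor_blocks([data[0], data[2], data[3], parity])
print(one_lost == data[1])

# Two blocks lost: the same XOR only yields the two missing blocks mixed together,
# so neither of them can be recovered on its own.
two_lost = xor_blocks([data[0], data[3], parity])
print(two_lost == data[1], two_lost == data[2])
```

With one block missing the survivors are enough; with two missing there is simply not enough information left, which is why a second failure (or a second disk dropping out) during a degraded period is so dangerous.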
 

danzg

Contributor
Joined
Jun 18, 2011
Messages
105
We put in a new drive, in addition to the five already in there.

When I go to console and view disks, it just says "loading" forever; I can't get to the "Replace" option!

Here's the output from those commands; the "bad" disk is now "UNAVAIL" ...

Is it possible the disk got dislodged or something? It's in a 5-disk hot-swap enclosure.

Code:
[root@Wheelhouse NAS] ~# zpool status -v
  pool: raid-5x3
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

	NAME                      STATE     READ WRITE CKSUM
	raid-5x3                  DEGRADED     0     0     0
	  raidz1                  DEGRADED     0     0     0
	    ada0p2                ONLINE       0     0     0
	    10739480653363274060  UNAVAIL      0     0     0  was /dev/ada1p2
	    ada2p2                ONLINE       0     0     0
	    ada3p2                ONLINE       0     0     0
	    ada1p2                ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        raid-5x3/alpha:<0x0>
        /mnt/raid-5x3/alpha/staff/Sound FX jw/Sound FX - scary horror/11 DR-EerieAct3-Waterphone..aif
        /mnt/raid-5x3/alpha/staff/Wheelhouse Shoots/ROCKY_THE_MUSICAL/ SHOOTS/WESTPORT/Cannon-CARD-B/CONTENTS/CLIPS001/AA0876/AA087601.SIF

... then it lists 2,860 files and "raid-5x3/alpha:<....>" entries ...

[root@Wheelhouse NAS] ~# camcontrol devlist
<ST3000DM001-9YN166 CC4C>          at scbus4 target 0 lun 0 (ada0,pass0)
<WDC WD30EZRX-00MMMB0 80.00A80>    at scbus4 target 1 lun 0 (aprobe1,pass6,ada4)
<WDC WD30EZRX-00MMMB0 80.00A80>    at scbus5 target 0 lun 0 (ada1,pass1)
<ST3000DM001-9YN166 CC4C>          at scbus5 target 1 lun 0 (ada2,pass2)
<ASUS DRW-24B1ST   a 1.04>         at scbus6 target 0 lun 0 (cd0,pass3)
<Hitachi HDS5C3030ALA630 MEAOA580>  at scbus7 target 0 lun 0 (ada3,pass4)
< USB Flash Memory 1.00>           at scbus8 target 0 lun 0 (da0,pass5)

[root@Wheelhouse NAS] ~# gpart show
=>     63  7831467  da0  MBR  (3.7G)
       63  1930257    1  freebsd  [active]  (943M)
  1930320       63       - free -  (32K)
  1930383  1930257    2  freebsd  (943M)
  3860640     3024    3  freebsd  (1.5M)
  3863664    41328    4  freebsd  (20M)
  3904992  3926538       - free -  (1.9G)

=>      0  1930257  da0s1  BSD  (943M)
        0       16         - free -  (8.0K)
       16  1930241      1  !0  (943M)

=>        34  5860533101  ada0  GPT  (2.7T)
          34          94        - free -  (47K)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  5856338703     2  freebsd-zfs  (2.7T)

=>        34  5860533101  ada1  GPT  (2.7T)
          34          94        - free -  (47K)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  5856338703     2  freebsd-zfs  (2.7T)

=>        34  5860533101  ada2  GPT  (2.7T)
          34          94        - free -  (47K)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  5856338703     2  freebsd-zfs  (2.7T)

=>        34  5860533101  ada3  GPT  (2.7T)
          34          94        - free -  (47K)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  5856338703     2  freebsd-zfs  (2.7T)

=>        34  5860533101  ada4  GPT  (2.7T)
          34          94        - free -  (47K)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  5856338703     2  freebsd-zfs  (2.7T)

[root@Wheelhouse NAS] ~# glabel status
                                      Name  Status  Components
                             ufs/FreeNASs3     N/A  da0s3
                             ufs/FreeNASs4     N/A  da0s4
                            ufs/FreeNASs1a     N/A  da0s1a
gptid/446dd91d-8f15-11e1-a14c-f46d049aaeca     N/A  ada4p1
gptid/447999cb-8f15-11e1-a14c-f46d049aaeca     N/A  ada4p2
 

danzg

Contributor
Joined
Jun 18, 2011
Messages
105
Arg, now I'm thoroughly confused.

Seems the new drive wasn't connected properly. So we fixed it and rebooted.

Now console shows green light alert.

But when I go to "View All Volumes", it just says "Loading..."

And now:

glabel status
Code:
          Name  Status  Components
 ufs/FreeNASs3     N/A  da0s3
 ufs/FreeNASs4     N/A  da0s4
ufs/FreeNASs1a     N/A  da0s1a


camcontrol devlist:
Code:
<WDC WD30EZRX-00MMMB0 80.00A80>    at scbus0 target 0 lun 0 (ada0,pass0)
<ST3000DM001-9YN166 CC4C>          at scbus4 target 0 lun 0 (ada1,pass1)
<WDC WD30EZRX-00MMMB0 80.00A80>    at scbus4 target 1 lun 0 (ada2,pass2)
<WDC WD30EZRX-00MMMB0 80.00A80>    at scbus5 target 0 lun 0 (ada3,pass3)
<ST3000DM001-9YN166 CC4C>          at scbus5 target 1 lun 0 (ada4,pass4)
<ASUS DRW-24B1ST   a 1.04>         at scbus6 target 0 lun 0 (cd0,pass5)
<Hitachi HDS5C3030ALA630 MEAOA580>  at scbus7 target 0 lun 0 (ada5,pass6)
< USB Flash Memory 1.00>           at scbus8 target 0 lun 0 (da0,pass7)


gpart show
Code:
=>     63  7831467  da0  MBR  (3.7G)
       63  1930257    1  freebsd  [active]  (943M)
  1930320       63       - free -  (32K)
  1930383  1930257    2  freebsd  (943M)
  3860640     3024    3  freebsd  (1.5M)
  3863664    41328    4  freebsd  (20M)
  3904992  3926538       - free -  (1.9G)

=>      0  1930257  da0s1  BSD  (943M)
        0       16         - free -  (8.0K)
       16  1930241      1  !0  (943M)

=>        34  5860533101  ada1  GPT  (2.7T)
          34          94        - free -  (47K)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  5856338703     2  freebsd-zfs  (2.7T)

=>        34  5860533101  ada2  GPT  (2.7T)
          34          94        - free -  (47K)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  5856338703     2  freebsd-zfs  (2.7T)

=>        34  5860533101  ada3  GPT  (2.7T)
          34          94        - free -  (47K)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  5856338703     2  freebsd-zfs  (2.7T)

=>        34  5860533101  ada4  GPT  (2.7T)
          34          94        - free -  (47K)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  5856338703     2  freebsd-zfs  (2.7T)

=>        34  5860533101  ada5  GPT  (2.7T)
          34          94        - free -  (47K)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  5856338703     2  freebsd-zfs  (2.7T)


zpool status
Code:
  pool: raid-5x3
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

	NAME        STATE     READ WRITE CKSUM
	raid-5x3    ONLINE       0     0     0
	  raidz1    ONLINE       0     0     0
	    ada1p2  ONLINE       0     0     0
	    ada2p2  ONLINE       0     0     2
	    ada4p2  ONLINE       0     0     0
	    ada5p2  ONLINE       0     0     0
	    ada3p2  ONLINE       0     0     0

errors: 7607009 data errors, use '-v' for a list


Waiting for output of zpool status -v ... it takes forever...


WHAT THE HECK IS GOING ON??
 

danzg

Contributor
Joined
Jun 18, 2011
Messages
105
So zpool status -v shows same as above, plus the long list of files.

And Console>View Disks now shows 5 disks online, healthy.

Should I scrub?

Is there hope for our data?
 

danzg

Contributor
Joined
Jun 18, 2011
Messages
105
We have 56,796 files on the NAS; zpool status reports 2,640 files with errors.

Should I delete the 2,640 files? Are they for sure corrupted?

Are the remaining files "good" ?

What do all those "raid-5x3/alpha:<0xf500>" entries mean?

HELP!
 

danzg

Contributor
Joined
Jun 18, 2011
Messages
105
I clicked the 'scrub' button ... now zpool status shows "resilver in progress .... 900h to go" ... which is like a month...

Is this a waste of time?

Why does it say resilver? I thought scrub and resilver were different things?

And why does 'view disks' show only 5 drives? Is it automatically using the new one we put in? Why doesn't the old "bad" one appear??
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
You're screwed, at some point you had 2 disks fail, bye bye data.

Use your backup.

William knows his stuff. If he tells you to use your backup.. you should use your backup. ;)
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Personally, I'd run diagnostics on all of them and try to figure it out. I'm not sure at this point there is an easy way to identify the bad disks.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
How do I do that? What diagnostics?

Oh my... Go to the hard drive manufacturer's website and download whatever diagnostic CD they have and use it.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Sorry, I thought you meant there was some way to do it from within FreeNAS.

Thank god! I was about to say that if you don't know how to run diagnostics you shouldn't be playing with FreeNAS. LOL.
 

danzg

Contributor
Joined
Jun 18, 2011
Messages
105
We rechecked all the connections on the drives, and now it's resilvering, but much faster.
Instead of 50,000 hours, it's saying 13 hours and consistently going down...

But I'm confused why I don't see the 6th disk in the console & zpool.

camcontrol devlist gives:
Code:
<WDC WD30EZRX-00MMMB0 80.00A80>    at scbus0 target 0 lun 0 (ada0,pass0)
<ST3000DM001-9YN166 CC4C>          at scbus4 target 0 lun 0 (ada1,pass1)
<WDC WD30EZRX-00MMMB0 80.00A80>    at scbus4 target 1 lun 0 (ada2,pass2)
<WDC WD30EZRX-00MMMB0 80.00A80>    at scbus5 target 0 lun 0 (ada3,pass3)
<ST3000DM001-9YN166 CC4C>          at scbus5 target 1 lun 0 (ada4,pass4)
<ASUS DRW-24B1ST   a 1.04>         at scbus6 target 0 lun 0 (cd0,pass5)
<Hitachi HDS5C3030ALA630 MEAOA580>  at scbus7 target 0 lun 0 (ada5,pass6)
< USB Flash Memory 1.00>           at scbus8 target 0 lun 0 (pass7,da0)


I believe the bold items are 2TB disks.

Yet zpool status, gpart show and the console only show 5 disks:
Code:
 pool: raid-5x3
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: resilver in progress for 1h30m, 9.94% done, 13h40m to go
config:

	NAME        STATE     READ WRITE CKSUM
	raid-5x3    ONLINE       0     0     0
	  raidz1    ONLINE       0     0     0
	    ada1p2  ONLINE       0     0     0
	    ada2p2  ONLINE       0     0     0  121M resilvered
	    ada4p2  ONLINE       0     0     0
	    ada5p2  ONLINE       0     0     0  213G resilvered
	    ada3p2  ONLINE       0     0     0

errors: 7607009 data errors, use '-v' for a list


Why does ada0 not show up?

EDIT: Oh, I see now, ada0 appears if I "add a volume." So I guess it's just not part of the pool.

Does the fact that it's properly resilvering now mean there's a chance we could get our data back?

Should I replace one of the disks in the pool with the new drive? How would I tell which disk it was that gave me "FAULTED" a couple of days ago? Back then zpool gave me:
Code:
     NAME                      STATE     READ WRITE CKSUM
     raid-5x3                  DEGRADED     0     0 7.29M
       raidz1                  DEGRADED     0     0 14.7M
         ada0p2                ONLINE       0     0     0
         10739480653363274060  FAULTED      0     0     0  was /dev/ada1p2
         ada2p2                ONLINE       0     0     0
         ada3p2                ONLINE       0     0     3  254M resilvered
         ada1p2                ONLINE       0     0     0


I was going to scrub as soon as the resilver is done.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
You'd have to know which drive was ada1 :(

In any case, if you have a backup I wouldn't try to reuse any of the data in your zpool. There's no point in doing a scrub after the resilver unless you are trying to prove that the resilvered data is reliable. A scrub and a resilver do the same kind of work (read the data, verify checksums, repair from redundancy); it's generally called a resilver when a drive has been replaced or has come back after dropping out.

You could try looking at SMART data for all of the drives. If one is failing SMART tests it's pretty reasonable to assume that's the bad drive. Personally I keep a printed spreadsheet that maps each drive's serial number to its dev name, and I stick a label on each disk with the dev name it matches so I can always track the drives.

If I were in your shoes and I was going to use the backup data I'd check SMART tests for any failures, then delete and recreate the zpool from scratch and recopy the data. If checking the SATA connections didn't fix the problem, any bad disks will definitely give you errors. You will also know your data is good since you are using a backup. That's just what I would do ;)
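
As a starting point for that kind of drive map, here is a rough sketch that turns `camcontrol devlist` output (in the format posted earlier in this thread) into a device-name table. The `parse_devlist` function and the regular expression are my own illustration, not a FreeNAS tool. Serial numbers are not in this output and would have to come from somewhere else, for example the "Serial Number" line that `smartctl -i` prints for each drive.

```python
import re

# Matches lines such as:
# <ST3000DM001-9YN166 CC4C>   at scbus4 target 0 lun 0 (ada0,pass0)
LINE = re.compile(
    r"^<(?P<model>[^>]*)>\s+at\s+(?P<bus>scbus\d+)\s+target\s+(?P<target>\d+)"
    r"\s+lun\s+\d+\s+\((?P<devs>[^)]*)\)"
)

def parse_devlist(text):
    """Map each device name (ada0, da0, cd0, ...) to its model string and bus position."""
    devices = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue
        # The parentheses hold several names; skip the pass-through and probe entries.
        names = [d for d in match.group("devs").split(",")
                 if not d.startswith(("pass", "aprobe"))]
        for name in names:
            devices[name] = {
                "model": match.group("model").strip(),  # includes the firmware revision
                "bus": match.group("bus"),
                "target": int(match.group("target")),
            }
    return devices

SAMPLE = """\
<WDC WD30EZRX-00MMMB0 80.00A80>    at scbus0 target 0 lun 0 (ada0,pass0)
<ST3000DM001-9YN166 CC4C>          at scbus4 target 0 lun 0 (ada1,pass1)
< USB Flash Memory 1.00>           at scbus8 target 0 lun 0 (pass7,da0)
"""

for name, info in sorted(parse_devlist(SAMPLE).items()):
    print(name, info["bus"], info["target"], info["model"])
```

Keep in mind the adaN names can shift when drives are added or recabled (as they did between the outputs in this thread), so the bus/target position or the serial number is a safer thing to write on the label than the dev name alone.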
 

danzg

Contributor
Joined
Jun 18, 2011
Messages
105
Well, we don't have a backup for ALL the data ... so we need to figure out which of those files are corrupt, and which are usable.

FWIW, I've randomly checked several of the files that were previously reported as corrupt, and they now seem to be OK.
(Meaning I was able to copy them and play them; most of our data is video files.)

What I'd like to do is COPY everything for which we do not have a backup, and which is not corrupt, to another machine, and then upgrade this one to RAIDZ2...
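
One way to plan that copy is to take the "Permanent errors" list from `zpool status -v` and split our file list into "listed as damaged" and "not listed". Below is a rough sketch of that idea. The function names and the sample paths are made up for illustration, it assumes the output covers a single pool, and it would mis-handle file names that begin or end with whitespace.

```python
def parse_permanent_errors(status_text):
    """Return (paths, object_refs) from the errors section of `zpool status -v` output."""
    paths, object_refs = set(), []
    in_errors = False
    for line in status_text.splitlines():
        if line.startswith("errors: Permanent errors"):
            in_errors = True
            continue
        if not in_errors:
            continue
        entry = line.strip()
        if not entry:
            continue
        if entry.startswith("/"):
            paths.add(entry)            # a regular file path
        else:
            object_refs.append(entry)   # e.g. "raid-5x3/alpha:<0x0>", no path shown
    return paths, object_refs

def split_candidates(all_files, damaged_paths):
    """Split a file list into files not listed as damaged and files that are."""
    clean = [f for f in all_files if f not in damaged_paths]
    suspect = [f for f in all_files if f in damaged_paths]
    return clean, suspect

# Hypothetical, shortened sample in the same shape as the real output.
SAMPLE_STATUS = """\
errors: Permanent errors have been detected in the following files:

        raid-5x3/alpha:<0x0>
        /mnt/raid-5x3/alpha/staff/a.aif
        /mnt/raid-5x3/alpha/staff/b.SIF
"""

damaged, refs = parse_permanent_errors(SAMPLE_STATUS)
clean, suspect = split_candidates(
    ["/mnt/raid-5x3/alpha/staff/a.aif", "/mnt/raid-5x3/alpha/staff/c.mov"], damaged
)
print(clean, suspect, refs)
```

Since the error list on this pool has already changed once (files reported as corrupt now open fine), I'd treat it as a hint and regenerate it after the resilver and scrub finish, not as the final word on which files are good.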
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I'd have expected few, if any, files to still be good. But you never know. It's odd that everything was corrupted and non-functional and now it seems to be good enough to recover your data.

Did you watch the video file beginning to end? If not, the file may look good but still have corrupted parts.
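
Short of watching every video, you can at least read each file from the first byte to the last. When ZFS hits a block that fails its checksum and can't be repaired, the read fails with an I/O error instead of returning bad data, so a full read is a reasonable smoke test. A minimal sketch follows; `reads_cleanly` is a hypothetical helper, and the demo file is a throwaway stand-in for a real video file.

```python
import os
import tempfile

def reads_cleanly(path, chunk_size=1024 * 1024):
    """Read a file end to end; return (ok, bytes_read). ok is False on any OSError."""
    total = 0
    try:
        with open(path, "rb") as handle:
            while True:
                chunk = handle.read(chunk_size)
                if not chunk:
                    return True, total
                total += len(chunk)
    except OSError:
        # Covers I/O errors on damaged blocks as well as missing or unreadable files.
        return False, total

# Demo on a temporary file a little over 3 MiB in size.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * (3 * 1024 * 1024 + 5))
demo_path = tmp.name
print(reads_cleanly(demo_path))
os.remove(demo_path)
```

A clean read only says every block passed its checksum at read time. It says nothing about whether the content was already wrong when it was written.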
 

danzg

Contributor
Joined
Jun 18, 2011
Messages
105
I'm thinking maybe what happened is that 2 drives became dislodged. I think the hotswap bay we have is poor quality.
But then, they DID appear connected, just faulted ... I don't know.

The resilver completed, in only 3.5 hours!

Now zpool status says
Code:
  pool: raid-5x3
 state: ONLINE
 scrub: resilver completed after 3h31m with 0 errors on Fri Aug 17 21:46:12 2012
config:

	NAME        STATE     READ WRITE CKSUM
	raid-5x3    ONLINE       0     0     0
	  raidz1    ONLINE       0     0     0
	    ada1p2  ONLINE       0     0     0
	    ada2p2  ONLINE       0     0     0  236G resilvered
	    ada4p2  ONLINE       0     0     0
	    ada5p2  ONLINE       0     0     0  252G resilvered
	    ada3p2  ONLINE       0     0     0

errors: No known data errors


Does this mean the data is recovered?? "No known errors" sounds promising!

I've initiated a scrub...

Code:
  pool: raid-5x3
 state: ONLINE
 scrub: scrub in progress for 0h4m, 0.62% done, 11h22m to go
config:

	NAME        STATE     READ WRITE CKSUM
	raid-5x3    ONLINE       0     0     0
	  raidz1    ONLINE       0     0     0
	    ada1p2  ONLINE       0     0     0
	    ada2p2  ONLINE       0     0     0
	    ada4p2  ONLINE       0     0     0
	    ada5p2  ONLINE       0     0     0
	    ada3p2  ONLINE       0     0     0

errors: No known data errors
 

danzg

Contributor
Joined
Jun 18, 2011
Messages
105
How can I make sure I get an email alert whenever there's an issue? Disk fault, anything?

All I get are nightly reports.

Couldn't find this info anywhere.

Do I need to run some sort of script?
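
In case a script is the answer, here is the kind of check I have in mind: feed it the text of `zpool status` and have it list anything that looks wrong. This is only a sketch. `pool_problems` is my own made-up function, the samples are condensed from output posted earlier in this thread, and actually running the command on a schedule and mailing the result is left out.

```python
def pool_problems(status_text):
    """Return a list of human-readable problems found in `zpool status` output."""
    problems = []
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "state:" and len(fields) > 1:
            if fields[1] != "ONLINE":
                problems.append("pool state is " + fields[1])
        elif fields[0] == "errors:":
            in_config = False
            if line.strip() != "errors: No known data errors":
                problems.append(line.strip())
        elif fields[:2] == ["NAME", "STATE"]:
            in_config = True  # device rows follow this header
        elif in_config and len(fields) >= 5:
            name, state, counters = fields[0], fields[1], fields[2:5]
            # Counters can be abbreviated (e.g. "7.29M"), so compare as text.
            if state != "ONLINE" or any(c != "0" for c in counters):
                problems.append(f"{name}: {state} read/write/cksum={'/'.join(counters)}")
    return problems

HEALTHY = """\
  pool: raid-5x3
 state: ONLINE
config:

    NAME        STATE     READ WRITE CKSUM
    raid-5x3    ONLINE       0     0     0
      raidz1    ONLINE       0     0     0
        ada1p2  ONLINE       0     0     0

errors: No known data errors
"""

DEGRADED = """\
  pool: raid-5x3
 state: DEGRADED
config:

    NAME                      STATE     READ WRITE CKSUM
    raid-5x3                  DEGRADED     0     0 7.29M
      raidz1                  DEGRADED     0     0 14.7M
        ada0p2                ONLINE       0     0     0
        10739480653363274060  FAULTED      0     0     0  was /dev/ada1p2
        ada3p2                ONLINE       0     0     3  254M resilvered
"""

for problem in pool_problems(DEGRADED):
    print(problem)
```

If the list comes back non-empty, that would be the body of the alert email. A nonzero CKSUM count on a disk that still says ONLINE is exactly the kind of early warning I'd want to hear about.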
 