URGENT! One of my disks has failed, and not sure how to replace!

paleoN · Aug 13, 2012

danzg said:
I'm still not clear what happened here ... is this a disk failure?

It may be disk failures.

danzg said:
What could be the cause? Have we lost everything? How to prevent in future???

I'm kinda losing faith in this whole ZFS/FreeNas thing ... I don't see what we did wrong ... have we just toasted 8TB of data??

How long were you running with a failed disk? A raidz1 array only protects against a single disk failure. Hopefully, you have some backups as you very well may need them.

You can send a prayer to the ZFS gods and hope you are running into a bug with the early ZFS version, in which case booting up 8.3.0-BETA1, replacing the disk, and scrubbing might help.

danzg · Aug 14, 2012

We put in a new drive, in addition to the five already in there.

When I go to console and view disks, it just says "loading" forever. I can't get to the "Replace" option!

Here's the output from those commands; the "bad" disk is now "UNAVAIL" ...

Is it possible the disk got dislodged or something? It's in a 5-disk hot-swap enclosure.

Code:

[root@Wheelhouse NAS] ~# zpool status -v
  pool: raid-5x3
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

	NAME                      STATE     READ WRITE CKSUM
	raid-5x3                  DEGRADED     0     0     0
	  raidz1                  DEGRADED     0     0     0
	    ada0p2                ONLINE       0     0     0
	    10739480653363274060  UNAVAIL      0     0     0  was /dev/ada1p2
	    ada2p2                ONLINE       0     0     0
	    ada3p2                ONLINE       0     0     0
	    ada1p2                ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        raid-5x3/alpha:<0x0>
        /mnt/raid-5x3/alpha/staff/Sound FX jw/Sound FX - scary horror/11 DR-EerieAct3-Waterphone..aif
        /mnt/raid-5x3/alpha/staff/Wheelhouse Shoots/ROCKY_THE_MUSICAL/ SHOOTS/WESTPORT/Cannon-CARD-B/CONTENTS/CLIPS001/AA0876/AA087601.SIF

... then it lists 2,860 files and "raid-5x3/alpha:<....>" entries ...

[root@Wheelhouse NAS] ~# camcontrol devlist
<ST3000DM001-9YN166 CC4C>          at scbus4 target 0 lun 0 (ada0,pass0)
<WDC WD30EZRX-00MMMB0 80.00A80>    at scbus4 target 1 lun 0 (aprobe1,pass6,ada4)
<WDC WD30EZRX-00MMMB0 80.00A80>    at scbus5 target 0 lun 0 (ada1,pass1)
<ST3000DM001-9YN166 CC4C>          at scbus5 target 1 lun 0 (ada2,pass2)
<ASUS DRW-24B1ST   a 1.04>         at scbus6 target 0 lun 0 (cd0,pass3)
<Hitachi HDS5C3030ALA630 MEAOA580>  at scbus7 target 0 lun 0 (ada3,pass4)
< USB Flash Memory 1.00>           at scbus8 target 0 lun 0 (da0,pass5)

[root@Wheelhouse NAS] ~# gpart show
=>     63  7831467  da0  MBR  (3.7G)
       63  1930257    1  freebsd  [active]  (943M)
  1930320       63       - free -  (32K)
  1930383  1930257    2  freebsd  (943M)
  3860640     3024    3  freebsd  (1.5M)
  3863664    41328    4  freebsd  (20M)
  3904992  3926538       - free -  (1.9G)

=>      0  1930257  da0s1  BSD  (943M)
        0       16         - free -  (8.0K)
       16  1930241      1  !0  (943M)

=>        34  5860533101  ada0  GPT  (2.7T)
          34          94        - free -  (47K)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  5856338703     2  freebsd-zfs  (2.7T)

=>        34  5860533101  ada1  GPT  (2.7T)
          34          94        - free -  (47K)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  5856338703     2  freebsd-zfs  (2.7T)

=>        34  5860533101  ada2  GPT  (2.7T)
          34          94        - free -  (47K)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  5856338703     2  freebsd-zfs  (2.7T)

=>        34  5860533101  ada3  GPT  (2.7T)
          34          94        - free -  (47K)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  5856338703     2  freebsd-zfs  (2.7T)

=>        34  5860533101  ada4  GPT  (2.7T)
          34          94        - free -  (47K)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  5856338703     2  freebsd-zfs  (2.7T)

[root@Wheelhouse NAS] ~# glabel status
                                      Name  Status  Components
                             ufs/FreeNASs3     N/A  da0s3
                             ufs/FreeNASs4     N/A  da0s4
                            ufs/FreeNASs1a     N/A  da0s1a
gptid/446dd91d-8f15-11e1-a14c-f46d049aaeca     N/A  ada4p1
gptid/447999cb-8f15-11e1-a14c-f46d049aaeca     N/A  ada4p2

danzg · Aug 14, 2012

Arg, now I'm thoroughly confused.

Seems the new drive wasn't connected properly. So we fixed it and rebooted.

Now console shows green light alert.

But when I go to "View All Volumes", it just says "Loading..."

And now:

glabel status

Code:

          Name  Status  Components
 ufs/FreeNASs3     N/A  da0s3
 ufs/FreeNASs4     N/A  da0s4
ufs/FreeNASs1a     N/A  da0s1a

camcontrol devlist:

Code:

<WDC WD30EZRX-00MMMB0 80.00A80>    at scbus0 target 0 lun 0 (ada0,pass0)
<ST3000DM001-9YN166 CC4C>          at scbus4 target 0 lun 0 (ada1,pass1)
<WDC WD30EZRX-00MMMB0 80.00A80>    at scbus4 target 1 lun 0 (ada2,pass2)
<WDC WD30EZRX-00MMMB0 80.00A80>    at scbus5 target 0 lun 0 (ada3,pass3)
<ST3000DM001-9YN166 CC4C>          at scbus5 target 1 lun 0 (ada4,pass4)
<ASUS DRW-24B1ST   a 1.04>         at scbus6 target 0 lun 0 (cd0,pass5)
<Hitachi HDS5C3030ALA630 MEAOA580>  at scbus7 target 0 lun 0 (ada5,pass6)
< USB Flash Memory 1.00>           at scbus8 target 0 lun 0 (da0,pass7)

gpart show

Code:

=>     63  7831467  da0  MBR  (3.7G)
       63  1930257    1  freebsd  [active]  (943M)
  1930320       63       - free -  (32K)
  1930383  1930257    2  freebsd  (943M)
  3860640     3024    3  freebsd  (1.5M)
  3863664    41328    4  freebsd  (20M)
  3904992  3926538       - free -  (1.9G)

=>      0  1930257  da0s1  BSD  (943M)
        0       16         - free -  (8.0K)
       16  1930241      1  !0  (943M)

=>        34  5860533101  ada1  GPT  (2.7T)
          34          94        - free -  (47K)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  5856338703     2  freebsd-zfs  (2.7T)

=>        34  5860533101  ada2  GPT  (2.7T)
          34          94        - free -  (47K)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  5856338703     2  freebsd-zfs  (2.7T)

=>        34  5860533101  ada3  GPT  (2.7T)
          34          94        - free -  (47K)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  5856338703     2  freebsd-zfs  (2.7T)

=>        34  5860533101  ada4  GPT  (2.7T)
          34          94        - free -  (47K)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  5856338703     2  freebsd-zfs  (2.7T)

=>        34  5860533101  ada5  GPT  (2.7T)
          34          94        - free -  (47K)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  5856338703     2  freebsd-zfs  (2.7T)

zpool status

Code:

  pool: raid-5x3
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

	NAME        STATE     READ WRITE CKSUM
	raid-5x3    ONLINE       0     0     0
	  raidz1    ONLINE       0     0     0
	    ada1p2  ONLINE       0     0     0
	    ada2p2  ONLINE       0     0     2
	    ada4p2  ONLINE       0     0     0
	    ada5p2  ONLINE       0     0     0
	    ada3p2  ONLINE       0     0     0

errors: 7607009 data errors, use '-v' for a list

Waiting for output of zpool status -v ... it takes forever...

WHAT THE HECK IS GOING ON??

William Grzybowski · Aug 14, 2012

You're screwed, at some point you had 2 disks fail, bye bye data.

Use your backup.

danzg · Aug 14, 2012

So zpool status -v shows same as above, plus the long list of files.

And Console>View Disks now shows 5 disks online, healthy.

Should I scrub?

Is there hope for our data?

danzg · Aug 14, 2012

We have 56,796 files on the NAS; zpool status reports 2,640 files with errors.

Should I delete the 2,640 files? Are they for sure corrupted?

Are the remaining files "good" ?

What do all those "raid-5x3/alpha:<0xf500>" entries mean?

HELP!

danzg · Aug 14, 2012

I clicked the 'scrub' button ... now zpool status shows "resilver in progress .... 900h to go" ... which is like a month...

Is this a waste of time?

Why does it say resilver? I thought scrub and resilver were different things?

And why does 'view disks' show only 5 drives? Is it automatically using the new one we put in? Why doesn't the old "bad" one appear??

cyberjock · Aug 14, 2012

William Grzybowski said:
You're screwed, at some point you had 2 disks fail, bye bye data.

Use your backup.

William knows his stuff. If he tells you to use your backup.. you should use your backup. ;)

danzg · Aug 14, 2012

William Grzybowski said:
You're screwed, at some point you had 2 disks fail, bye bye data.

Use your backup.

Which 2 disks? How would I know which 2 to replace?

This is unbelievably confusing.

cyberjock · Aug 14, 2012

Personally, I'd run diagnostics on all of them and try to figure it out. I'm not sure at this point there is an easy way to identify the bad disks.

danzg · Aug 14, 2012

noobsauce80 said:
Personally, I'd run diagnostics on all of them and try to figure it out. I'm not sure at this point there is an easy way to identify the bad disks.

How do I do that? What diagnostics?
thanks

cyberjock · Aug 14, 2012

danzg said:
How do I do that? What diagnostics?

Oh my... Go to the hard drive manufacturer's website and download whatever diagnostic CD they have and use it.

danzg · Aug 14, 2012

noobsauce80 said:
Oh my... Go to the hard drive manufacturer's website and download whatever diagnostic CD they have and use it.

Sorry, I thought you meant there was some way to do it from within FreeNAS.

cyberjock · Aug 14, 2012

danzg said:
Sorry, I thought you meant there was some way to do it from within FreeNAS.

Thank god! I was about to say that if you don't know how to run diagnostics you shouldn't be playing with FreeNAS. LOL.

danzg · Aug 17, 2012

We rechecked all the connections on the drives, and now it's resilvering, but much faster.
Instead of 50,000 hours, it's saying 13 hours and consistently going down...

But I'm confused why I don't see the 6th disk in the console & zpool.

camcontrol devlist gives:

Code:

<WDC WD30EZRX-00MMMB0 80.00A80>    at scbus0 target 0 lun 0 (ada0,pass0)
<ST3000DM001-9YN166 CC4C>          at scbus4 target 0 lun 0 (ada1,pass1)
<WDC WD30EZRX-00MMMB0 80.00A80>    at scbus4 target 1 lun 0 (ada2,pass2)
<WDC WD30EZRX-00MMMB0 80.00A80>    at scbus5 target 0 lun 0 (ada3,pass3)
<ST3000DM001-9YN166 CC4C>          at scbus5 target 1 lun 0 (ada4,pass4)
<ASUS DRW-24B1ST   a 1.04>         at scbus6 target 0 lun 0 (cd0,pass5)
<Hitachi HDS5C3030ALA630 MEAOA580>  at scbus7 target 0 lun 0 (ada5,pass6)
< USB Flash Memory 1.00>           at scbus8 target 0 lun 0 (pass7,da0)

I believe the bold items are 2TB disks.

Yet zpool status, gpart show and the console only show 5 disks:

Code:

 pool: raid-5x3
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: resilver in progress for 1h30m, 9.94% done, 13h40m to go
config:

	NAME        STATE     READ WRITE CKSUM
	raid-5x3    ONLINE       0     0     0
	  raidz1    ONLINE       0     0     0
	    ada1p2  ONLINE       0     0     0
	    ada2p2  ONLINE       0     0     0  121M resilvered
	    ada4p2  ONLINE       0     0     0
	    ada5p2  ONLINE       0     0     0  213G resilvered
	    ada3p2  ONLINE       0     0     0

errors: 7607009 data errors, use '-v' for a list

Why does ada0 not show up?

EDIT: Oh, I see now, ada0 appears if I "add a volume." So I guess it's just not part of the pool.

Does the fact that it's properly resilvering now mean there's a chance we could get our data back?

Should I replace one of the disks in the pool with the new drive? How would I tell which disk it was that gave me "FAULTED" a couple days ago? That was when zpool gave me:

Code:

     NAME                      STATE     READ WRITE CKSUM
     raid-5x3                  DEGRADED     0     0 7.29M
       raidz1                  DEGRADED     0     0 14.7M
         ada0p2                ONLINE       0     0     0
         10739480653363274060  FAULTED      0     0     0  was /dev/ada1p2
         ada2p2                ONLINE       0     0     0
         ada3p2                ONLINE       0     0     3  254M resilvered
         ada1p2                ONLINE       0     0     0

I was going to scrub as soon as the resilver is done.

cyberjock · Aug 17, 2012

You'd have to know which drive was ada1 :(

In any case, if you have a backup I wouldn't try to reuse any of the data in your zpool. There's no point in doing a scrub after the resilver unless you are trying to prove that the resilvered data is reliable. A scrub does essentially the same work as a resilver; you are generally resilvering when a drive has been replaced.

You could try looking at SMART data for all of the drives. If one is failing SMART tests it's pretty reasonable to assume that's the bad drive. Personally I keep a printed spreadsheet that maps each drive's serial number to its dev name, and I stick a label on each disk with the dev name it matches so I can always track the drives.

If I were in your shoes and I was going to use the backup data, I'd check SMART tests for any failures, then delete and recreate the zpool from scratch and recopy the data. If rechecking the SATA connections didn't fix the problem, any bad disks will definitely give you errors. You will also know your data is good since you are using a backup. That's just what I would do ;)
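The drive-tracking table described above can be started from the `camcontrol devlist` output already posted in this thread. What follows is only a rough sketch in Python: the function name `parse_devlist` and the dictionary layout are made up for illustration, and the serial numbers themselves are not in this output, so they would still have to be added by hand from the drive labels or SMART data.

```python
import re

# One line of `camcontrol devlist`, as pasted earlier in the thread, e.g.
# <ST3000DM001-9YN166 CC4C>  at scbus4 target 0 lun 0 (ada1,pass1)
DEVLIST_LINE = re.compile(
    r"^<(?P<ident>[^>]*)>\s+at (?P<bus>scbus\d+) target (?P<target>\d+) "
    r"lun (?P<lun>\d+) \((?P<names>[^)]*)\)$"
)

# Sample text copied from the thread (three of the listed devices).
SAMPLE_DEVLIST = """\
<WDC WD30EZRX-00MMMB0 80.00A80>    at scbus0 target 0 lun 0 (ada0,pass0)
<ASUS DRW-24B1ST   a 1.04>         at scbus6 target 0 lun 0 (cd0,pass5)
<Hitachi HDS5C3030ALA630 MEAOA580>  at scbus7 target 0 lun 0 (ada5,pass6)
"""


def parse_devlist(text):
    """Map each adaN disk to its identification string and bus position.

    Hypothetical helper: non-disk devices (cd0, the USB stick da0) are skipped.
    """
    drives = {}
    for line in text.splitlines():
        match = DEVLIST_LINE.match(line.strip())
        if not match:
            continue
        names = match.group("names").split(",")
        # The disk name can appear anywhere in the list, e.g. (aprobe1,pass6,ada4).
        disk = next((n for n in names if n.startswith("ada")), None)
        if disk is None:
            continue
        drives[disk] = {
            "ident": match.group("ident").strip(),  # model plus firmware revision
            "bus": match.group("bus"),
            "target": int(match.group("target")),
        }
    return drives


if __name__ == "__main__":
    for name, info in sorted(parse_devlist(SAMPLE_DEVLIST).items()):
        print(name, info["bus"], info["target"], info["ident"])
```

Saving this table each time the hardware changes would show when a dev name such as ada1 has moved to a different physical drive, which is what happened between the two listings above.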

danzg · Aug 17, 2012

Well, we don't have a backup for ALL the data ... so we need to figure out which of those files are corrupt, and which are usable.

FWIW, I've randomly checked several of the files that were previously reported as corrupt, and they now seem to be OK.
(Meaning I was able to copy them and play them; most of our data is video files.)

What I'd like to do is COPY everything for which we do not have a backup, and which is not corrupt, to another machine, and then upgrade this one to RAIDZ2...
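One way to approach the "copy only what is not corrupt" idea is to take the file paths that `zpool status -v` lists under its `errors:` line and leave those out of the copy. The following is a minimal sketch, assuming the output has the same layout as the listing posted earlier in this thread; `corrupt_paths` and `files_not_flagged` are hypothetical helper names, and entries such as `raid-5x3/alpha:<0x0>` are skipped because they do not name a file path.

```python
# Sample text copied from the `zpool status -v` listing earlier in the thread.
SAMPLE_STATUS = """\
errors: Permanent errors have been detected in the following files:

        raid-5x3/alpha:<0x0>
        /mnt/raid-5x3/alpha/staff/Sound FX jw/Sound FX - scary horror/11 DR-EerieAct3-Waterphone..aif
"""


def corrupt_paths(status_v_output):
    """Collect the absolute file paths listed after the 'errors:' line."""
    paths = set()
    in_errors = False
    for line in status_v_output.splitlines():
        if line.startswith("errors:"):
            in_errors = True
            continue
        if in_errors:
            entry = line.strip()
            # Only entries that look like mounted file paths are kept.
            if entry.startswith("/"):
                paths.add(entry)
    return paths


def files_not_flagged(all_files, flagged):
    """Return the files that zpool did not report, preserving order."""
    return [path for path in all_files if path not in flagged]


if __name__ == "__main__":
    flagged = corrupt_paths(SAMPLE_STATUS)
    print(len(flagged), "file(s) flagged by zpool")
```

A file missing from the error list is only "not reported", not proven good, so anything copied this way would still be worth spot-checking.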

cyberjock · Aug 17, 2012

I'd have expected few, if any, files to still be good. But you never know. It's odd that everything was corrupted and non-functional and now it seems to be good enough to recover your data.

Did you watch the video file beginning to end? If not, the file may look good while parts of it are corrupted.
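Short of watching every video, a cheaper check is to read each file from the first byte to the last and see whether the read fails partway through. This is a rough sketch in Python, resting on the assumption that a block ZFS cannot repair surfaces as a read error to the application; the helper name `read_fully` is made up for illustration. A clean read only shows the file is readable end to end, not that its contents are what they were originally.

```python
def read_fully(path, chunk_size=1024 * 1024):
    """Read a file start to finish.

    Returns (ok, bytes_read, error_text). ok is False if the file could
    not be opened or a read failed before the end of the file.
    """
    total = 0
    try:
        with open(path, "rb") as handle:
            while True:
                chunk = handle.read(chunk_size)
                if not chunk:
                    break  # reached end of file without an error
                total += len(chunk)
    except OSError as exc:
        return False, total, str(exc)
    return True, total, ""


if __name__ == "__main__":
    import sys

    for name in sys.argv[1:]:
        ok, size, error = read_fully(name)
        print("OK " if ok else "BAD", size, name, error)
```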

danzg · Aug 17, 2012

I'm thinking maybe what happened is that 2 drives became dislodged. I think the hotswap bay we have is poor quality.
But then, they DID appear connected, just faulted ... I don't know.

The resilver completed, in only 3.5 hours!

Now zpool status says

Code:

  pool: raid-5x3
 state: ONLINE
 scrub: resilver completed after 3h31m with 0 errors on Fri Aug 17 21:46:12 2012
config:

	NAME        STATE     READ WRITE CKSUM
	raid-5x3    ONLINE       0     0     0
	  raidz1    ONLINE       0     0     0
	    ada1p2  ONLINE       0     0     0
	    ada2p2  ONLINE       0     0     0  236G resilvered
	    ada4p2  ONLINE       0     0     0
	    ada5p2  ONLINE       0     0     0  252G resilvered
	    ada3p2  ONLINE       0     0     0

errors: No known data errors

Does this mean the data is recovered?? "No known errors" sounds promising!

I've initiated a scrub...

Code:

  pool: raid-5x3
 state: ONLINE
 scrub: scrub in progress for 0h4m, 0.62% done, 11h22m to go
config:

	NAME        STATE     READ WRITE CKSUM
	raid-5x3    ONLINE       0     0     0
	  raidz1    ONLINE       0     0     0
	    ada1p2  ONLINE       0     0     0
	    ada2p2  ONLINE       0     0     0
	    ada4p2  ONLINE       0     0     0
	    ada5p2  ONLINE       0     0     0
	    ada3p2  ONLINE       0     0     0

errors: No known data errors

danzg · Aug 17, 2012

How can I make sure I get an email alert whenever there's an issue? Disk fault, anything?

All I get are nightly reports.

Couldn't find this info anywhere.

Do I need to run some sort of script?
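For what it's worth, a small script is one possible route. `zpool status -x` prints `all pools are healthy` when nothing is wrong, so a periodic job could compare the output against that line and raise an alert on anything else. Below is a minimal sketch in Python; the function names are hypothetical, and how the alert is actually mailed (cron's own mail, an SMTP call, or something FreeNAS provides) is left out as it depends on the setup.

```python
import subprocess

# Message printed by `zpool status -x` when every pool is fine.
HEALTHY_MESSAGE = "all pools are healthy"


def pools_need_attention(status_output):
    """True unless the output is exactly the healthy message.

    Empty output (for example, the command failed) also counts as a problem.
    """
    return status_output.strip() != HEALTHY_MESSAGE


def check_pools():
    """Run `zpool status -x` and return (needs_attention, raw_output)."""
    result = subprocess.run(
        ["zpool", "status", "-x"],
        capture_output=True,
        text=True,
        check=False,
    )
    return pools_need_attention(result.stdout), result.stdout


if __name__ == "__main__":
    alert, output = check_pools()
    if alert:
        # A scheduled job would hand this text to whatever sends the mail.
        print("zpool reports a problem:")
        print(output)
```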
