Corrupted data! Help

Status
Not open for further replies.

mooreaa

Dabbler
Joined
Aug 25, 2012
Messages
11
Hi guys,

FreeNAS reported that one of my drives was bad, so I replaced the bad drive with a new one and started the replacement process.

I checked it today and it reported:
Code:
[root@server112437] ~# zpool status
  pool: zpool
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: resilver completed after 7h30m with 3060271 errors on Thu Dec 27 06:46:08 2012
config:

        NAME                        STATE     READ WRITE CKSUM
        zpool                       DEGRADED     0     0 2.92M
          raidz1                    DEGRADED     0     0 5.95M
            da0p2                   ONLINE       6  767K     0
            da1p2                   ONLINE       0     0     0  55.9M resilvered
            replacing               DEGRADED     0     0     0
              16937584563870420491  UNAVAIL      0     0     0  was /dev/da2/old
              da2                   ONLINE       0     0     0  800G resilvered
          raidz1                    ONLINE       0     0     0
            ada1p2                  ONLINE       0     0     0
            ada2p2                  ONLINE       0     0     0
            ada3p2                  ONLINE       0     0     0

errors: 3060271 data errors, use '-v' for a list



First off, I want to remove 16937584563870420491 from the list, but I can't do it from the web interface. It says removed, but the entry keeps popping back up.

Next, what's the deal with the corruption? I thought the whole point of this kind of system was to avoid data corruption... I'm really worried about the reliability of this setup.

I ran ZFS on OpenSolaris for about 3 years and never touched the machine, and it was rock solid. Since moving to FreeNAS I've had this drive fail and now data corruption. I'm not sure what I need to do to move forward and get this system stable again.
 

mooreaa

Dabbler
Joined
Aug 25, 2012
Messages
11
Hmm, I rebooted and now my previous problem has compounded:

Code:
  pool: zpool
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME                        STATE     READ WRITE CKSUM
        zpool                       DEGRADED     0     0     0
          raidz1                    DEGRADED     0     0     1
            da0p2                   ONLINE       0     0     0
            da1p2                   ONLINE       0     0     0
            replacing               UNAVAIL      0     0     0  insufficient replicas
              16937584563870420491  UNAVAIL      0     0     0  was /dev/da2/old
              6752917804641773929   UNAVAIL      0     0     0  was /dev/da2
          raidz1                    ONLINE       0     0     0
            ada1p2                  ONLINE       0     0     0
            ada2p2                  ONLINE       0     0     0
            ada3p2                  ONLINE       0     0     0

errors: 3060271 data errors, use '-v' for a list


I have two spare drives that I can put in... I'm just worried about losing data or getting more corruption.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
What's the deal with all of these people with RAIDZ1s losing data recently? It seems the sun, moon, and stars have aligned for 2-3 people in the last 3-5 days...
 

ProtoSD

MVP
Joined
Jul 1, 2011
Messages
3,348
I'm not offering to try and solve this, but if anyone is going to try they'll probably want to know what version of FreeNAS you're using as well as the output of these commands:

gpart show

glabel status

camcontrol devlist
 

mooreaa

Dabbler
Joined
Aug 25, 2012
Messages
11
Hey, thanks for the reply. Here's the info.

I'm running FreeNAS 8.2

FreeBSD server112437 8.2-RELEASE-p9 FreeBSD 8.2-RELEASE-p9 #0: Thu Jul 19 12:39:10 PDT 2012 root@build.ixsystems.com:/build/home/jpaetzel/8.2.0/os-base/amd64/build/home/jpaetzel/8.2.0/FreeBSD/src/sys/FREENAS.amd64 amd64


gpart show output

Code:
=>        34  1953525101  da0  GPT  (932G)
          34          94       - free -  (47K)
         128     4194304    1  freebsd-swap  (2.0G)
     4194432  1949330703    2  freebsd-zfs  (930G)

=>        34  1953525101  da1  GPT  (932G)
          34          94       - free -  (47K)
         128     4194304    1  freebsd-swap  (2.0G)
     4194432  1949330703    2  freebsd-zfs  (930G)

=>        34  1953525101  da2  GPT  (932G) [CORRUPT]
          34          94       - free -  (47K)
         128     4194304    1  freebsd-swap  (2.0G)
     4194432  1949330703    2  freebsd-zfs  (930G)

=>       63  976773105  ada0  MBR  (466G)
         63    1930257     1  freebsd  [active]  (943M)
    1930320         63        - free -  (32K)
    1930383    1930257     2  freebsd  (943M)
    3860640       3024     3  freebsd  (1.5M)
    3863664      41328     4  freebsd  (20M)
    3904992  972868176        - free -  (464G)

=>        34  5860533101  ada1  GPT  (2.7T)
          34          94        - free -  (47K)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  5856338703     2  freebsd-zfs  (2.7T)

=>        34  5860533101  ada2  GPT  (2.7T)
          34          94        - free -  (47K)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  5856338703     2  freebsd-zfs  (2.7T)

=>        34  5860533101  ada3  GPT  (2.7T)
          34          94        - free -  (47K)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  5856338703     2  freebsd-zfs  (2.7T)

=>      0  1930257  ada0s1  BSD  (943M)
        0       16          - free -  (8.0K)
       16  1930241       1  !0  (943M)



glabel status
Code:
          Name  Status  Components
 ufs/FreeNASs3     N/A  ada0s3
 ufs/FreeNASs4     N/A  ada0s4
ufs/FreeNASs1a     N/A  ada0s1a


camcontrol devlist

Code:
<ATA ST31000340AS SD15>            at scbus0 target 5 lun 0 (pass0,da0)
<ST3500413AS HP64>                 at scbus4 target 0 lun 0 (pass1,ada0)
<Slimtype DVD A  DS8A2S 6P5D>      at scbus5 target 0 lun 0 (pass2,cd0)
<ST3000DM001-1CH166 CC43>          at scbus6 target 0 lun 0 (pass3,ada1)
<ST3000DM001-1CH166 CC43>          at scbus7 target 0 lun 0 (pass4,ada2)
<ST3000DM001-1CH166 CC43>          at scbus8 target 0 lun 0 (pass5,ada3)
<ATA ST31000340AS SD15>            at scbus10 target 6 lun 0 (pass6,da1)
<ATA ST31000340AS SD1A>            at scbus12 target 9 lun 0 (pass7,da2)
<ATA ST31500341AS CC1H>            at scbus12 target 10 lun 0 (pass8,da3)


I tried to add /dev/da3 as a spare to see if it would try to use that to recover... So stuck :/
 

mooreaa

Dabbler
Joined
Aug 25, 2012
Messages
11
Ok so I looked through the corrupt list (output from zpool status -v) and most of the files pertain to really old backups that I can afford to lose. So no big deal there.

Side problem: the zpool status -v process hangs at the end. The output of ps -ax shows it in state D+, so it is waiting on some kind of I/O. I tried to kill the process, but there isn't any way to do it. I have to reboot.

Anyhow, for my zpool raidz1, the group previously consisted of da0p2, da1p2 and da2, so da0p2 and da1p2 should have all the parity needed to rebuild da2. I pulled the da2 disk and checked it, and it doesn't appear to have any problems. I have that drive back in the machine after wiping it (0'd it out), and it's still showing up as /dev/da2. I have also added another drive, /dev/da3, and as mentioned above added it as a spare to see if that would help recover the degraded state of the system, but no such luck.

Where to go from here?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
For one thing, you chose to have a RAIDZ1. That means your data is still safe after 1 disk failure, but no more than that. In your case you had a disk failure and resilvering started. However, based on your first post, it looks like drive da0 is having problems too. So you have 1 bad disk and a second failing disk. When all is said and done, whatever parts of da0 are corrupted will end up as corrupted files. This is why I never ever trust Mirrors or RAIDZ1, ever.

This is like resilvering 1 disk while another drive has a single bad sector. That bad sector WILL turn into a corrupt file since you have no other parity data to correct the error. In your case, it looks like da0 is having serious problems since it lists 767K write errors.
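To illustrate the single-parity point above, here is a minimal Python sketch. It is a simplification: it uses plain XOR parity over two toy data blocks and does not model the actual RAIDZ1 on-disk layout or ZFS checksums. The block contents and the `xor_blocks` helper are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Two data blocks plus one parity block (toy stand-in for a 3-disk group).
d0 = b"backup-A"
d1 = b"backup-B"
parity = xor_blocks([d0, d1])

# Case 1: the disk holding d1 is gone, the other members are healthy.
rebuilt_d1 = xor_blocks([d0, parity])

# Case 2: the disk holding d1 is gone AND the disk holding d0 returns bad data.
damaged_d0 = b"backup-\x00"
wrong_d1 = xor_blocks([damaged_d0, parity])

print(rebuilt_d1 == d1)  # rebuild succeeds with one failure
print(wrong_d1 == d1)    # rebuild is wrong with a second fault
```

With one member missing, parity can only rebuild it if every surviving member returns good data. ZFS checksums each block, so a bad reconstruction like the second case is reported as a data error instead of going unnoticed.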

Additionally, since you now appear to have potential issues with da2, my advice is to back up anything important on your zpool RIGHT NOW. Only after you've finished backing up should you attempt to do more. What more, you ask? I'm not sure. I'd probably first test the new disk that you were trying to resilver. It may be bad. If it's good, then try resilvering again. But I'm not sure your resilvering will complete, or that you will be satisfied with the end result, if da0 is as broken as it looks in the first post. You may not even be able to finish the resilvering process with da0 in the condition it is in.

I'm not sure how you came to the conclusion that da2 had failed and replaced it, but it looks like da0 is also going bad. Maybe you wrongly identified da2 as the faulty disk when it should have been da0?
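One way to double-check which members are reporting errors is to scan the config table from zpool status for non-zero READ/WRITE/CKSUM counters. A rough Python sketch follows, using part of the table from the first post as sample input. The `devices_with_errors` helper and the regular expression are illustrative assumptions, not part of any ZFS tooling.

```python
import re

# Config rows copied from the first post's `zpool status` output.
SAMPLE = """\
        NAME                        STATE     READ WRITE CKSUM
        zpool                       DEGRADED     0     0 2.92M
          raidz1                    DEGRADED     0     0 5.95M
            da0p2                   ONLINE       6  767K     0
            da1p2                   ONLINE       0     0     0  55.9M resilvered
            replacing               DEGRADED     0     0     0
              16937584563870420491  UNAVAIL      0     0     0  was /dev/da2/old
              da2                   ONLINE       0     0     0  800G resilvered
"""

# name, state, then the three counters; trailing notes are ignored.
ROW = re.compile(r"^\s*(\S+)\s+([A-Z]+)\s+(\S+)\s+(\S+)\s+(\S+)")


def devices_with_errors(status_text):
    """Return (name, state, read, write, cksum) for rows with any non-zero counter."""
    flagged = []
    for line in status_text.splitlines():
        match = ROW.match(line)
        if not match or match.group(1) == "NAME":
            continue  # skip the header and anything that is not a config row
        name, state, read, write, cksum = match.groups()
        if (read, write, cksum) != ("0", "0", "0"):
            flagged.append((name, state, read, write, cksum))
    return flagged


for row in devices_with_errors(SAMPLE):
    print(row)
```

On this sample the flagged rows are the pool, the first raidz1 group, and da0p2. Of those, da0p2 is the only individual disk.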

BACK UP YOUR DATA NOW IF IT ISN'T BACKED UP ALREADY.
 

ProtoSD

MVP
Joined
Jul 1, 2011
Messages
3,348
"I pulled the da2 disk and checked it and there doesn't appear to have any problems. I have that drive back in the machine after wiping (0'd it out)"

Yikes, you may have just ruined your chance of recovery. NEVER zero out a disk that you think you won't need until AFTER the pool is healthy again.

I'm not in a situation where I can really spend the proper time to help you, but I would be patient and not do anything else until you decide there's nothing worth recovering, or someone else like PaleoN wants to try and help. I think zeroing that disk could have really hurt your chance for recovery.
 

mooreaa

Dabbler
Joined
Aug 25, 2012
Messages
11
Well, before I started the replace, it said that da2 was UNAVAIL and that the system was degraded. So I swapped the drive out and started the replacement process. It resilvered and said it was done, but the new disk was still listed under replacing, and the old disk's ID label was still in the list as well...

After a reboot at some point in the evening, the system came up again and listed both entries under replacing by ID, both as UNAVAIL.
 

mooreaa

Dabbler
Joined
Aug 25, 2012
Messages
11
Ugh... and now it somehow started a (maybe scheduled) scrub? It says "scrub: resilver in progress" and now my backup speed has dropped nearly to 0!
 

Stephens

Patron
Joined
Jun 19, 2012
Messages
496
Stop the scrub. (I think it's "zpool scrub -s pool"... look it up). In fact, I'd stop using the system except to back up data or apply fixes folks here give you. Zeroing the disk was not the best approach.
 

mooreaa

Dabbler
Joined
Aug 25, 2012
Messages
11
Ok, back at the machine now and the backups are chugging along faster. Looks like the scrubbing finished and it is reporting no errors again.

Code:
zpool status
  pool: zpool
 state: DEGRADED
 scrub: resilver completed after 5h23m with 0 errors on Sun Dec 30 05:23:18 2012
config:

        NAME                        STATE     READ WRITE CKSUM
        zpool                       DEGRADED     0     0     0
          raidz1                    DEGRADED     0     0     1
            da0p2                   ONLINE       0     0     0  334M resilvered
            da1p2                   ONLINE       0     0     0
            replacing               UNAVAIL      0     0     0  insufficient replicas
              16937584563870420491  UNAVAIL      0     0     0  was /dev/da2/old
              6752917804641773929   UNAVAIL      0     0     0  was /dev/da2
          raidz1                    ONLINE       0     0     0
            ada1p2                  ONLINE       0     0     0
            ada2p2                  ONLINE       0     0     0
            ada3p2                  ONLINE       0     0     0
        spares
          da3                       AVAIL

errors: No known data errors


I'm not sure how or where the 334M resilvered from since it's in a degraded state.

I will continue to back up everything. Pulled about 30% off so far... I will get everything backed up before attempting to make any fixes.

Would appreciate any thoughts on the "replacing UNAVAIL insufficient replicas" issue. Not sure how I'm going to get around that.
 

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,403
"I ran ZFS back with open solaris for about 3 years and never touched the machine and it was rock solid. Since moving to FreeNas I've had this drive fail and now data corruption."
Shocking: aging and ignored drives are starting to fail. Not that paying attention would have stopped the drives from failing, but you likely would have caught it earlier.

"I'm not sure how or where the 334M resilvered from since its in a degraded state."
How exactly do you think it resilvers a failed drive when it's in a degraded state?

"I will continue to backup everything. Pulled about 30% off so far... I will get everything backedup before attempting to make any fixes."
If you are still writing to the array, I would stop until you are backed up, particularly with the previous errors on da0. Reading isn't safe either, depending on how bad off the drive is. Personally, I would make a block copy of da0, using ddrescue, to one of your spare drives before it decides to fail on you and you lose the pool.
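For anyone unfamiliar with the idea behind a rescue-style block copy, here is a rough Python sketch of the concept only. It is not how ddrescue itself is implemented (the real tool also retries, splits bad areas and keeps a map file). The `rescue_copy` function and the `FlakyDisk` test double are hypothetical names invented for this example.

```python
import io


def rescue_copy(src, dst, size, chunk=4096):
    """Copy `size` bytes from src to dst chunk by chunk.

    Chunks that cannot be read are zero-filled in dst and recorded,
    so one bad area does not abort the whole copy.
    """
    bad_regions = []
    for offset in range(0, size, chunk):
        length = min(chunk, size - offset)
        try:
            src.seek(offset)
            data = src.read(length)
        except OSError:
            data = b""
        if len(data) < length:
            bad_regions.append((offset, length))
            data = data.ljust(length, b"\x00")
        dst.seek(offset)
        dst.write(data)
    return bad_regions


class FlakyDisk(io.BytesIO):
    """Hypothetical stand-in for a failing disk: reads at bad offsets raise OSError."""

    def __init__(self, data, bad_offsets):
        super().__init__(data)
        self.bad_offsets = set(bad_offsets)

    def read(self, n=-1):
        if self.tell() in self.bad_offsets:
            raise OSError("simulated read error")
        return super().read(n)


# Simulate a 24-byte "disk" whose middle 8-byte chunk is unreadable.
source = FlakyDisk(b"A" * 8 + b"B" * 8 + b"C" * 8, bad_offsets={8})
target = io.BytesIO()
bad_regions = rescue_copy(source, target, size=24, chunk=8)
print(bad_regions)
```

The point of copying this way is that everything still readable gets onto a healthy drive first, and the list of bad regions tells you what was lost.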

"Would appreciate any thoughts on the "replacing UNAVAIL insufficent replicas" issues. Not sure how I'm going to get around that."
I don't think I followed everything you did with this. Which is the old da2, which is the other da2, and where is the zeroed old da2 currently?
 