SOLVED Damaged data but pool ONLINE?!

Status
Not open for further replies.

Alex9779

Dabbler
Joined
Nov 23, 2012
Messages
10
Hi community,

Yesterday I replaced a defective hard drive that had some corrupt sectors in my pool. Resilvering then started, but even before that the pool status already showed 5 permanent errors, all in <metadata>:. After the resilver completed I scrubbed the pool, which repaired 2 errors but none of the 5 above.
I tracked the problem down to a specific dataset. When I try to delete that dataset or specific snapshots in it, my kernel crashes with "Fatal trap 12" and so on...
Maybe another device in my pool is defective, and I think I know which one because of its SMART stats, although SMART does not report any defective sectors.
I ran another short scrub and somehow 3 of the 5 permanent errors vanished.
Here is the current output of "zpool status -v Storage":
Code:
pool: Storage
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
  scan: scrub canceled on Sun Mar 17 17:19:50 2013
config:

	NAME                                            STATE     READ WRITE CKSUM
	Storage                                         ONLINE       0     0     0
	  raidz1-0                                      ONLINE       0     0     0
	    gptid/076c01a1-6501-11e1-b9bd-f46d04d898d3  ONLINE       0     0     0
	    gptid/24d575f3-8e96-11e2-91f9-f46d04d898d3  ONLINE       0     0     0
	    gptid/0870c326-6501-11e1-b9bd-f46d04d898d3  ONLINE       0     0     0
	    gptid/08ddeb79-6501-11e1-b9bd-f46d04d898d3  ONLINE       0     0     0
	    gptid/094e2ef1-6501-11e1-b9bd-f46d04d898d3  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x226e4e>
        <metadata>:<0x2267be>


What can I do now? I am not able to delete the defective dataset; the kernel always crashes.
Should I also replace the possibly defective device, resilver, and then try again? Maybe it crashes because it cannot access specific areas of that disk.

Please help. The Oracle documentation is not very helpful here: I can't find anything about this situation, only about the case where the pool reports defective metadata AND is not importable, but mine is importable...

Regards,
Alexander
 

Alex9779

Dabbler
Joined
Nov 23, 2012
Messages
10
OK, now my status looks like this:
Code:
 pool: Storage
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
  scan: scrub in progress since Sun Mar 17 17:45:22 2013
        293G scanned out of 5.66T at 288M/s, 5h25m to go
        0 repaired, 5.04% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        Storage                                         ONLINE       0     0     5
          raidz1-0                                      ONLINE       0     0    30
            gptid/076c01a1-6501-11e1-b9bd-f46d04d898d3  ONLINE       0     0     0
            gptid/24d575f3-8e96-11e2-91f9-f46d04d898d3  ONLINE       0     0     0
            gptid/0870c326-6501-11e1-b9bd-f46d04d898d3  ONLINE       0     0     0
            gptid/08ddeb79-6501-11e1-b9bd-f46d04d898d3  ONLINE       0     0     0
            gptid/094e2ef1-6501-11e1-b9bd-f46d04d898d3  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x226e4e>
        <metadata>:<0x2267be>


I am currently running a scrub, since I can't take the probably defective device offline. After it finishes I will replace that device and then see how it goes...
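
The config table in the status output above can also be checked mechanically. The following is only a minimal sketch, assuming the column layout shown in this thread (NAME, STATE, READ, WRITE, CKSUM); the names `parse_config` and `rows_with_errors` are made up for illustration, and the counters are kept as plain strings because real output may abbreviate large values.

```python
# Sketch only: parses the "config:" table of zpool status text that has
# already been captured as a string. SAMPLE is a shortened copy of the
# output posted above.
SAMPLE = """\
config:

        NAME                                            STATE     READ WRITE CKSUM
        Storage                                         ONLINE       0     0     5
          raidz1-0                                      ONLINE       0     0    30
            gptid/076c01a1-6501-11e1-b9bd-f46d04d898d3  ONLINE       0     0     0
            gptid/24d575f3-8e96-11e2-91f9-f46d04d898d3  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:
"""


def parse_config(status_text):
    """Return one dict per row of the config table (hypothetical helper)."""
    rows = []
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if not in_table:
            # The table starts right after the NAME/STATE header row.
            if fields[:2] == ["NAME", "STATE"]:
                in_table = True
            continue
        if len(fields) < 5:
            # A blank (or too short) line ends the table.
            break
        name, state, read, write, cksum = fields[:5]
        rows.append({
            # Leading whitespace reflects nesting: pool, vdev, disk.
            "indent": len(line) - len(line.lstrip()),
            "name": name,
            "state": state,
            "read": read,
            "write": write,
            "cksum": cksum,
        })
    return rows


def rows_with_errors(rows):
    """Keep only rows where at least one counter is not "0"."""
    return [r for r in rows
            if any(r[key] != "0" for key in ("read", "write", "cksum"))]


if __name__ == "__main__":
    for row in rows_with_errors(parse_config(SAMPLE)):
        print(row["name"], row["read"], row["write"], row["cksum"])
```

On this sample only the pool and the raidz1-0 rows have non-zero counters and none of the individual disks do, so the counters alone do not single out one drive.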
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
It looks like you have metadata corruption (metadata is part of the file system) because you had only one drive of redundancy. With one disk already bad, any bad sectors on any of the other disks will lead to corruption. Hard drives keep getting bigger, but their read reliability is not increasing at the same rate, so it is becoming more and more likely that something on a disk won't be readable (which would require more redundancy to fix).

In your case, it's file system corruption, which is pretty much the crappiest corruption you could get. I'm not sure how I would fix this aside from destroying and recreating the zpool from scratch. I don't think scrubs will fix it, because the zpool has no way of knowing how to fix it.
 

Alex9779

Dabbler
Joined
Nov 23, 2012
Messages
10
Yeah, I figured that out, but I know the problem is in one specific dataset, which holds just some testing data that is not very important. When I try to delete this dataset and all its snapshots, the kernel crashes. I don't know why yet. Maybe the metadata errors themselves are not the problem, since they should not matter on deletion, and it is rather the errors on the drive that cause the crash while trying to remove the damaged dataset.

I know that it is a specific dataset because I was able to delete all snapshots of all other datasets. Only the snapshots of this one dataset are left, and they cause a kernel crash when I try to remove them.

I hope I will have a chance once the current scrub is over and I am able to take the drive offline. Maybe I can remove the dataset in the degraded state, because the pool should still be functional but will not access the failed drive during deletion. Then I can replace it with a new drive...

- - - Updated - - -

And maybe I made a mistake when replacing the first drive... I replaced the drive which SMART reported to have two damaged sectors. But the pool status had already flagged the other drive as having checksum errors. If I now look at the SMART stats of both drives, the replaced one looks better than the other one: it is good except for the two bad sectors, which are working again after some deeper checks with some utilities, whereas the other one seems to have servo and positioning problems but no bad sectors...
 

Alex9779

Dabbler
Joined
Nov 23, 2012
Messages
10
Hmm, strange... This is the status of the pool right after the scrub:
Code:
pool: Storage
 state: ONLINE
  scan: scrub repaired 0 in 6h30m with 0 errors on Mon Mar 18 00:15:26 2013
config:

        NAME                                            STATE     READ WRITE CKSUM
        Storage                                         ONLINE       0     0     5
          raidz1-0                                      ONLINE       0     0    30
            gptid/076c01a1-6501-11e1-b9bd-f46d04d898d3  ONLINE       0     0     0
            gptid/24d575f3-8e96-11e2-91f9-f46d04d898d3  ONLINE       0     0     0
            gptid/0870c326-6501-11e1-b9bd-f46d04d898d3  ONLINE       0     0     0
            gptid/08ddeb79-6501-11e1-b9bd-f46d04d898d3  ONLINE       0     0     0
            gptid/094e2ef1-6501-11e1-b9bd-f46d04d898d3  ONLINE       0     0     0

errors: No known data errors


The permanent errors are gone.

I was able to take the drive offline now, so the pool is degraded. Then I tried to destroy the dataset which caused the kernel crashes previously. It took a very long time to complete, but it is gone now...
So I am going to replace the drive with another new one and then let's see how the resilvering runs...
 

Alex9779

Dabbler
Joined
Nov 23, 2012
Messages
10
Well, the resilvering went fine. Still no errors, no checksum errors or anything else...

I think I was lucky that only this unimportant dataset was hit by those errors, and that I could figure out which one it was because I was able to delete all other snapshots without any problems.
This is a home NAS, so there are no problems with users, and I use the snapshots only because I can, not because I really need them.

Still, I am unsatisfied that I didn't find a way to figure out, from the error messages alone, which specific data is damaged. It's nice that you get told which files are damaged, but when metadata is damaged, as in my case, I found nothing on how to figure out what that metadata belongs to. The Oracle documentation only has some nice sentences mentioning that you should move damaged data away so that you can restore it to the original place. That's fine for real files, but what about metadata? In my case I really was lucky to figure it out. I don't have enough capacity at home to save all the data on the NAS externally, then destroy the whole pool and recreate it. Sensitive data is backed up, yes, but there is other data which is not vital that I still don't want to lose because of the inability to figure out the real problem...
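
For anyone who wants to at least track these entries between scrubs, the error list can be pulled out of the status text. This is a rough sketch, assuming entries look either like a plain file path or like the `<metadata>:<0x226e4e>` form shown earlier in this thread; `parse_errors` is a made-up helper name. It only extracts the object IDs and does not answer the open question of which dataset a metadata object belongs to.

```python
import re

# Shortened copy of the "errors:" section posted earlier in this thread.
SAMPLE = """\
errors: Permanent errors have been detected in the following files:

        <metadata>:<0x226e4e>
        <metadata>:<0x2267be>
"""

# Matches entries of the form "<owner>:<0xHEX>", e.g. "<metadata>:<0x226e4e>".
OBJECT_ENTRY = re.compile(r"^(?P<owner>.+):<0x(?P<obj>[0-9a-fA-F]+)>$")


def parse_errors(status_text):
    """Return the entries listed after the "errors:" line (hypothetical helper)."""
    entries = []
    in_errors = False
    for line in status_text.splitlines():
        if line.startswith("errors:"):
            in_errors = True
            continue
        entry = line.strip()
        if not in_errors or not entry:
            continue
        match = OBJECT_ENTRY.match(entry)
        if match:
            # Object reference: keep the owner text and the numeric ID.
            entries.append({
                "kind": "object",
                "owner": match["owner"],
                "object_id": int(match["obj"], 16),
            })
        else:
            # Anything else is treated as an ordinary file path.
            entries.append({"kind": "path", "path": entry})
    return entries


if __name__ == "__main__":
    for item in parse_errors(SAMPLE):
        print(item)
```

Saving the parsed list after each scrub and comparing it with the previous one would show which entries vanished, as happened here between scrubs.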
 