(ZFS) status is UNKNOWN (data corruption)

ktr · Dec 13, 2013

Hi - I recently started using FreeNAS (about 3 months ago), and when I logged in today I noticed this error:

Code:

WARNING: The volume ryans_hds (ZFS) status is UNKNOWN: One or more devices has experienced an error resulting in data corruption. Applications may be affected.Restore the file in question if possible. Otherwise restore the entire pool from backup.

After searching a bit, I came across http://forums.freenas.org/threads/zfs-status-unknown.16104/, but it didn't seem to help me much. I ran zpool status -x and got:

Code:

[root@freenas ~]# zpool status -x
  pool: ryans_hds
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 13h35m with 1227 errors on Sun Dec  8 13:35:57 2013
config:

        NAME                                            STATE     READ WRITE CKSUM
        ryans_hds                                       DEGRADED     0     0     0
          mirror-0                                      DEGRADED     0     0     0
            gptid/ecc7beeb-0172-11e3-b2a9-d43d7eb33430  ONLINE       0     0     0
            15311977958075100947                        UNAVAIL      0     0     0  was /dev/gptid/ed3a1e65-0172-11e3-b2a9-d43d7eb33430

errors: 1227 data errors, use '-v' for a list

So it appears to me that something happened to one of my devices and it is now "unavailable"? Is there a way to fix this error? The drives are also only ~ 3mos old (I bought them specifically to build this box).

Also, regarding the actual files that were corrupted ... how do I go about trying to restore the "file in question"? Should I just do

Code:

zfs rollback <file in question>

?

Sorry if these are basic questions - thanks in advance!

joeschmuck · Dec 13, 2013

Sorry, I cannot give advice on how to treat your pool, but back up any files you want to retain, assuming you can access them.

Next, set up email on FreeNAS to send you daily status messages. The scrub happened on 8 Dec and it's now 13 Dec; you should know about serious problems a bit quicker.

Do you have your system on a UPS? I'm curious as to why you have this type of failure. Is it just a hard drive failure?

At the SSH shell, could you report back the output of the following commands (assuming your drives are ada0 and ada1)? We're looking for a drive failure issue:
smartctl -A /dev/ada0
smartctl -A /dev/ada1

Dusan · Dec 13, 2013

Can you also please post the output of these commands (in CODE tags):
camcontrol devlist
gpart show

ktr · Dec 13, 2013

I will get the output of these commands tonight when I get home (don't have access from work). @joeschmuck: I am not using a UPS - is that a huge mistake? Thanks!

joeschmuck · Dec 13, 2013

Yes, you truly need a UPS that has a USB status cable. For one, this will keep your NAS from corrupting its data if you have a power failure, and it will automatically shut down the NAS during a prolonged power outage. Of course, the alternative is no UPS, and you could have all kinds of data corruption and even mess up your boot flash drive image. It sucks recovering from this kind of failure because it can take anywhere from hours to days. Also, my first thought was that a power glitch might have caused your problem, but then again your problem may have nothing to do with power.

One other thought is the SATA cables. You might need to replace the one going to the failed drive. Many folks have SATA cable issues, and since your system is new, that could be it. Of course, let's see what the commands we asked for tell us.

ktr · Dec 13, 2013

Ok, back again. Here are the results:

Code:

[root@freenas ~]# smartctl /dev/ada0                                        
smartctl 6.1 2013-03-16 r3800 [FreeBSD 9.1-STABLE amd64] (local build)       
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org 
                                                                             
ATA device successfully opened
 
[root@freenas ~]# smartctl /dev/ada1                                         
smartctl 6.1 2013-03-16 r3800 [FreeBSD 9.1-STABLE amd64] (local build)       
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org 
                                                                             
ATA device successfully opened
 
[root@freenas ~]# camcontrol devlist
<WDC WD20EFRX-68AX9N0 80.00A80> at scbus0 target 0 lun 0 (ada0,pass0)
<HITACHI HTS723216L9SA60 FC2ZC50B> at scbus4 target 1 lun 0 (ada1,pass1)
<SanDisk Cruzer Fit 2.01> at scbus6 target 0 lun 0 (pass2,da0)
 
[root@freenas ~]# gpart show
=>     63  7821249  da0  MBR  (3.7G)
       63  1930257    1  freebsd  [active]  (942M)
  1930320       63       - free -  (31k)
  1930383  1930257    2  freebsd  (942M)
  3860640     3024    3  freebsd  (1.5M)
  3863664    41328    4  freebsd  (20M)
  3904992  3916320       - free -  (1.9G)

=>      0  1930257  da0s1  BSD  (942M)
        0       16         - free -  (8.0k)
       16  1930241      1  !0  (942M)

=>        34  3907029101  ada0  GPT  (1.8T)
          34          94        - free -  (47k)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  3902834696     2  freebsd-zfs  (1.8T)
  3907029128           7        - free -  (3.5k)

=>       63  312581745  ada1  MBR  (149G)
         63  301417137     1  ntfs  [active]  (143G)
  301417200       1296        - free -  (648k)
  301418496   11159552     2  !18  (5.3G)
  312578048       3760        - free -  (1.9M)

Regarding a power issue, we do have outages every now and then - somewhat of a rural area (not rural rural, but more so than a lot of places).

So, any ideas given the output above? Also, does this mean that my disk is hosed or just that the "scrub" failed (and assuming it just means it failed, am I able to recover)? Thanks for all your time so far - I'm ordering an uninterruptible power supply this weekend as a result!

cyberjock · Dec 13, 2013

Your smartctl commands are wrong.. you forgot the "-a". What you provided is useless. :P

DrKK · Dec 13, 2013

Dude.

The second drive listed is a windows drive. It's got an NTFS partition, and a MBR. So I have no idea WTF you're doing.

ktr · Dec 13, 2013

Sorry, the NTFS drive is unrelated to this issue ... it's from a laptop that I needed to get the data off of. It's been a long day :/ Let's try again:

Code:

[root@freenas ~]# smartctl -A /dev/ada0
smartctl 6.1 2013-03-16 r3800 [FreeBSD 9.1-STABLE amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
 
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   001   001   051    Pre-fail  Always   FAILING_NOW 48769
  3 Spin_Up_Time            0x0027   182   177   021    Pre-fail  Always       -       3891
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       18
  5 Reallocated_Sector_Ct   0x0033   179   179   140    Pre-fail  Always       -       639
  7 Seek_Error_Rate         0x002e   200   190   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       2598
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       18
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       14
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       3
194 Temperature_Celsius     0x0022   119   115   000    Old_age   Always       -       28
196 Reallocated_Event_Count 0x0032   187   187   000    Old_age   Always       -       13
197 Current_Pending_Sector  0x0032   197   197   000    Old_age   Always       -       1180
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

cyberjock · Dec 13, 2013

And... ada0 is failing! A Current_Pending_Sector count of 1180 is a bad thing.

So copy your data off of ada0 and hope you get the important stuff. If you have backups, you should be fine, but you will need to use them, because you will have to recreate your pool from scratch.

DrKK · Dec 13, 2013

Just to add to what cyberjock said:

The correct number of Current Pending Sectors is 0. Any number above 0 is what is known as "very bad".

You can see your number is 1180. This means your drive is "hosed" and cannot be salvaged for use. Get what you can off of it, and get a new drive.
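
If you would rather not eyeball the whole table, you can pull out just the raw value for one attribute. This is only a rough sketch: `smart_raw_value` is a made-up helper name (not part of smartmontools), it assumes the ten-column attribute layout that `smartctl -A` printed above, and the usage line assumes the drive is ada0.

```shell
# Sketch: print the RAW_VALUE column for one SMART attribute ID, reading
# `smartctl -A` output from stdin. smart_raw_value is a hypothetical helper.
smart_raw_value() {
    # $1 = attribute ID, e.g. 197 for Current_Pending_Sector.
    # Assumes RAW_VALUE is the 10th whitespace-separated column, as in the
    # table posted in this thread.
    awk -v id="$1" '$1 == id { print $10 }'
}

# Typical use on the live system (device name is an assumption):
#   smartctl -A /dev/ada0 | smart_raw_value 197
```

Anything other than 0 for attribute 197 is the warning sign described above.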

cyberjock · Dec 13, 2013

Also, you failed to set up emailing in FreeNAS, failed to set up the SMART service and turn it on, or both. If you had, you'd have gotten warnings LONG before you got the streetlight warning like you did.

joeschmuck · Dec 13, 2013

And you should also do a "smartctl -A /dev/ada1" so we can see if you have another drive about to go, hopefully not but you never know.

Dusan · Dec 13, 2013

I don't care much about the ada1 smartctl output, as that is an unrelated NTFS drive. What I'd like to know is what happened to the other drive from the mirror. From the information provided so far, I see we have a mirror with one drive missing, and the second drive has unrecoverable errors resulting in corrupted files. However, if the other drive just disconnected due to faulty power or a bad SATA cable, it can still hold a good copy of the data. @ktr, you need to figure out what happened to that other drive. The OS does not see it (camcontrol).
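
To keep an eye on whether the missing member comes back while you swap cables and ports, something like the following could help. It is just a sketch under assumptions: `unavail_members` is a hypothetical helper name (not a ZFS command), and it assumes the config table layout shown in the `zpool status` output earlier in the thread, where the first column is the device name or GUID and the second is its state.

```shell
# Sketch: list pool members that `zpool status` reports as UNAVAIL.
# Reads `zpool status` output from stdin. unavail_members is a hypothetical helper.
unavail_members() {
    # First column = device name or GUID, second column = state.
    awk '$2 == "UNAVAIL" { print $1 }'
}

# Typical use (pool name taken from this thread):
#   zpool status ryans_hds | unavail_members
```

If this prints nothing after reconnecting the drive, the pool no longer reports any member as UNAVAIL.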

cyberjock · Dec 14, 2013

Eh.. If the other drive was disconnected you should make sure that the currently installed 1/2 of the mirror is disconnected. Basically treat each disk like a separate pool that should never be connected at the same time.

joeschmuck · Dec 14, 2013

Yea, I didn't catch that the NTFS was now the ada1 drive and that the other drive must be disconnected since it doesn't show up. My bad.

ktr · Dec 14, 2013

I can't seem to get the other disk to show up in FreeNAS ... tried switching SATA cables, switching power input, changing which port I plug it into ... it never shows up. Does that mean the 2nd disk is dead? I'm wondering how this all could have come about - I literally just built everything 3 months ago and thought I was generally pretty safe by having the two disks mirror each other. Now it seems that one is bad / going bad and I can't even see the other one. I'm also a little nervous as we have all our family videos on the one remaining disk :(

Any suggestions would be most welcome. Thanks again for everything so far.

joeschmuck · Dec 14, 2013

There is a thing called infant mortality, where things die well before their designed end of life. Most of the time these things die rather quickly, but for some reason yours hung in there a bit longer. Of course, we are making an assumption that the drives are the cause and not the motherboard or possibly the power supply. To verify your drives are the issue, you could place them into another computer and test them out with disk testing software. I would disconnect any drives from that machine that you want to keep intact; it's easy to delete a drive by accident. This is just something you could try if you want. I know I would, just to ensure I get all the facts before filling out an RMA.
