RAIDZ2 - Permanent errors have been detected

Status
Not open for further replies.

jcsston

Cadet
Joined
Jan 16, 2013
Messages
6
I have a backup server running FreeNAS-9.2.1.3-RELEASE-x64 (dc0c46b),
that is reporting "Permanent errors have been detected in the following files" during a zpool scrub.
This is with RAIDZ2 and ECC memory.

How can I fix this error? Is destroying and re-creating the pool the only choice?

Code:
[root@bisoffsite] ~# zpool status -v backup
  pool: backup
state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub in progress since Sat Aug  9 15:13:09 2014
        5.10T scanned out of 16.7T at 35.1M/s, 96h16m to go
        0 repaired, 30.55% done
config:

        NAME                                                STATE     READ WRITE CKSUM
        backup                                              ONLINE       0     0     3
          raidz2-0                                          ONLINE       0     0     6
            gptid/83030910-d43e-11e2-9993-001e4f206cc6.eli  ONLINE       0     0     0
            gptid/83616d59-d43e-11e2-9993-001e4f206cc6.eli  ONLINE       0     0     0
            gptid/83ac6c71-d43e-11e2-9993-001e4f206cc6.eli  ONLINE       0     0     0
            gptid/83ff8106-d43e-11e2-9993-001e4f206cc6.eli  ONLINE       0     0     0
            gptid/84331ce3-d43e-11e2-9993-001e4f206cc6.eli  ONLINE       0     0     0
            gptid/846d5967-d43e-11e2-9993-001e4f206cc6.eli  ONLINE       0     0     0
          raidz2-1                                          ONLINE       0     0     0
            gptid/320e7bcf-d52f-11e3-8662-002219099607.eli  ONLINE       0     0     0
            gptid/0566b96c-255f-11e3-bb68-002219099607.eli  ONLINE       0     0     0
            gptid/05f0b4c0-255f-11e3-bb68-002219099607.eli  ONLINE       0     0     0
            gptid/0663f4bd-255f-11e3-bb68-002219099607.eli  ONLINE       0     0     0
            gptid/06ca9e5f-255f-11e3-bb68-002219099607.eli  ONLINE       0     0     0
            gptid/0729c039-255f-11e3-bb68-002219099607.eli  ONLINE       0     0     0
        spares
          gptid/f65ab5f6-e1e6-11e3-8662-002219099607.eli    AVAIL

errors: Permanent errors have been detected in the following files:

        backup:<0x39abd35>
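
The config table above can be scanned mechanically for rows with non-zero counters. This is a minimal sketch, assuming the standard five-column layout (NAME STATE READ WRITE CKSUM) shown in the output; the function name nonzero_errors is made up for illustration. Fed the status output above, it would report only backup and raidz2-0, with no individual disk flagged.

```shell
#!/bin/sh
# Print every row of the "config:" table in `zpool status` output whose
# READ, WRITE or CKSUM counter is non-zero.
# nonzero_errors is a hypothetical helper name, not a ZFS command.
nonzero_errors() {
    awk '
        /^config:/ { in_cfg = 1; next }   # the table starts after this line
        /^errors:/ { in_cfg = 0 }         # the table ends here
        in_cfg && NF == 5 && $1 != "NAME" {
            # columns: NAME STATE READ WRITE CKSUM
            if ($3 != "0" || $4 != "0" || $5 != "0")
                print $1, "READ=" $3, "WRITE=" $4, "CKSUM=" $5
        }
    '
}

# On a live system (assumption: the pool is named "backup"):
#   zpool status backup | nonzero_errors
```

Spare rows have only two columns, so they are skipped by the NF == 5 check.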
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Well, destroying and recreating your pool is the easiest.

Unless you know what metadata is attached to 0x39abd35 it'll be hard to isolate the file and delete it. :(

Considering you are running a RAIDZ2 and shouldn't have much of a chance of corruption, you have clearly done something wrong with your server. So let's start with the basics.

Post your full hardware, FreeNAS version, and your schedule for SMART short and long tests and scrubs.
 

jcsston

Cadet
Joined
Jan 16, 2013
Messages
6
I'm running FreeNAS-9.2.1.3-RELEASE-x64 (dc0c46b)
The server is a Dell PowerEdge 2950
CPU: Xeon CPU E5405 @ 2.00GHz (4 cores)
RAM: 8GB DDR2 ECC
RAID: PERC 6/i Integrated + PERC 5/E Adapter with Dell PowerVault MD1000 (external 15 bay drive enclosure)
HDD: 12 x 2TB WDC WD20EARS SATA + 1 x 2TB WDC WD2000F9YZ SATA (hotspare)

All the hard drives are set up as individual RAID-0 volumes on the PERC 6/i and PERC 5/E cards.

I have the default 35-day ZFS scrub schedule set up. SMART doesn't work through the RAID controllers.
 

Whattteva

Wizard
Joined
Mar 5, 2013
Messages
1,824
jcsston said:
I have the default 35-day ZFS scrub schedule set up. SMART doesn't work through the RAID controllers.

I'm no expert on ZFS or FreeNAS, but I'm pretty sure that not being able to run SMART tests is a big red flag and one of the major risk factors for losing your pool.
 

david kennedy

Explorer
Joined
Dec 19, 2013
Messages
98
Whattteva said:
I'm no expert on ZFS or FreeNAS, but I'm pretty sure that not being able to run SMART tests is a big red flag and one of the major risk factors for losing your pool.

What makes you think that? Perhaps the SMART functionality simply doesn't work over the controller card?
Probably due to Dell using a custom "administrator tool" to manage the card under Windows?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Umm.. Dell PERC controllers should NOT be used. Exactly what Whattteva said is true. If you can't do SMART you have major issues you need to deal with *now*. Not when you are having problems. At that point it'll be too late and you might lose all of your data!

This is one of the main reasons why the FreeNAS manual says that you should NOT use hardware RAID with ZFS. It's a recipe for disaster in the long term.

So go read our stickies on recommended hardware and get something else ordered. Also keep in mind that you *will* have to destroy your pool and rebuild it, because you created all those RAID0 arrays that won't work with the new controller.

Your problem is *directly* attributable to you choosing to use the RAID controllers when you shouldn't be. Fix it or a few checksum errors are going to be the least of your worries.
 

Whattteva

Wizard
Joined
Mar 5, 2013
Messages
1,824
david kennedy said:
What makes you think that? Perhaps the SMART functionality simply doesn't work over the controller card?
Probably due to Dell using a custom "administrator tool" to manage the card under Windows?

It's one of the precautions in the stickies. Here's the exact quote from the "So you want some hardware suggestions" thread:
"Random RAID controllers that are operating in RAID mode (showing virtual or logical devices to FreeNAS) are a very bad idea. If your controller costs more than a few hundred dollars, it may not be a good choice for FreeNAS."

EDIT: Dang, Cyberjock beat me to it :confused:
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
If the controller doesn't support SMART, then not being able to run SMART means very little.

Let the scrub complete. The permanent file errors may disappear if they are no longer relevant. Or zpool clear backup may do it (it also resets the error counts).

But that doesn't explain why you have this problem. Since you are using a RAID controller, is it possible that it is doing write caching and you had a crash or power loss? It seems strange that you have top-level checksum errors without anything going on with the individual drives. That implies a problem with your server hardware, I would think.
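
The sequence suggested here (check the status once the scrub is done, clear the counters, check again) can be sketched as below. The run wrapper, the DRY_RUN switch and the recheck_pool name are assumptions added for illustration; only the zpool status and zpool clear subcommands come from this thread. Whether the permanent-error entry actually goes away after zpool clear is disputed in the replies that follow, so treat this as something to try, not a guaranteed fix.

```shell
#!/bin/sh
# Sketch only: POOL, run, DRY_RUN and recheck_pool are illustrative names.
POOL=backup

# Print the command instead of executing it when DRY_RUN is non-empty.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

recheck_pool() {
    run zpool status -v "$POOL"   # check that the scrub has finished
    run zpool clear "$POOL"       # reset the READ/WRITE/CKSUM counters
    run zpool status -v "$POOL"   # see whether the error entry is still listed
}

# Preview without touching the pool:
#   DRY_RUN=1; recheck_pool
```

Setting DRY_RUN only prints the three commands, which makes it easy to review them before running anything against a production pool.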
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
zpool clear won't remove the file errors. The errors are there until the offending blocks are released.

It is very possible he has multiple disks failing, but he has no way of proving that because he can't do SMART tests or SMART monitoring. That's why RAID controllers = ZFS fail.
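
For comparison, this is roughly what SMART checks look like once the disks are exposed directly to the OS by a plain HBA. It is a hedged sketch: the device names (/dev/da0, /dev/da1) are assumptions and will differ per system, and the run wrapper, DRY_RUN switch and smart_check name are made up for illustration. It does not apply to disks hidden behind PERC RAID-0 volumes as in this setup.

```shell
#!/bin/sh
# Sketch only: device names and helper names are assumptions.

# Print the command instead of executing it when DRY_RUN is non-empty.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

smart_check() {
    for dev in "$@"; do
        run smartctl -H "$dev"         # overall health assessment
        run smartctl -t short "$dev"   # queue a short self-test
    done
}

# Example (hypothetical device names):
#   smart_check /dev/da0 /dev/da1
# Results of the self-test can be read later with:
#   smartctl -l selftest /dev/da0
```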
 

jcsston

Cadet
Joined
Jan 16, 2013
Messages
6
There was a power failure over the weekend at the data center where the server is hosted. I started a manual ZFS scrub once the power came back online on Saturday.
The PERC cards do check the SMART information and do run Patrol Reads in the background automatically.

I'll look into a SAS controller on the list mentioned that would work with our Dell MD1000 disk enclosures.


Thanks,
Jory
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Yeah, sorry, but plenty of users have lost data because of PERCs. They do their own SMART thing, but it's not sufficient. Not sure why, as I don't have one of those cards to see what exactly it is monitoring. It could be a big farce and not really be useful at all (which is my guess, considering how many users have lost data because of those PERC cards).
 

jcsston

Cadet
Joined
Jan 16, 2013
Messages
6
I found that people have had success with the LSI SAS3801E card and MD1000 enclosures. I'll also get an SFF-8470 to SFF-8088 cable to connect the newer SAS connector to the MD1000.

Would using zfs send / receive to back up and restore the pool clear up this error, or would I need to do a file-level backup and restore?

Thanks,
Jory
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I would do a file-level copy, because the corruption may follow the send/receive. And since the send/receive wouldn't include the parity data, the error would appear to be corrected even though it hasn't been.

My recommendation would be rsync that bad boy to your temp server, but when moving back you could do ZFS send/receive so you can get higher speeds.
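
The two-stage move described above could be planned along these lines. This sketch only prints the commands so they can be reviewed first. Every name in it is a hypothetical placeholder except the pool name backup and the host name bisoffsite, which appear in the first post: tempserver, the /mnt paths, temp/backup, backup/restored and the snapshot name migrate are all assumptions, and it presumes the temporary server also runs ZFS.

```shell
#!/bin/sh
# Prints (does not run) a two-stage migration plan.
# All host, path, dataset and snapshot names are hypothetical placeholders.
SRC_PATH=/mnt/backup/
TMP_HOST=tempserver
TMP_PATH=/mnt/temp/backup/
TMP_DATASET=temp/backup
NEW_HOST=bisoffsite
NEW_DATASET=backup/restored
SNAP=migrate

plan_migration() {
    # Stage 1: file-level copy off the damaged pool to the temp server.
    echo "rsync -a ${SRC_PATH} root@${TMP_HOST}:${TMP_PATH}"
    # Stage 2 (run on the temp server after the pool has been rebuilt):
    # snapshot the copy and stream it back with send/receive.
    echo "zfs snapshot -r ${TMP_DATASET}@${SNAP}"
    echo "zfs send -R ${TMP_DATASET}@${SNAP} | ssh root@${NEW_HOST} zfs receive ${NEW_DATASET}"
}

# Review the plan:
#   plan_migration
```

Printing the plan instead of executing it is a deliberate choice here: the second stage must not run until the pool has been recreated on the new controller.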
 

david kennedy

Explorer
Joined
Dec 19, 2013
Messages
98
Whattteva said:
It's one of the precautions on the stickies. Here's the exact quote taken out of the "So you want some hardware suggestions" thread:

EDIT: Dang, Cyberjock beat me to it :confused:

So do Oracle's ZS3 systems support and utilize SMART?

First, I agree that the choice of a RAID card was poor. All ZFS documentation (both from Oracle/Sun and FreeNAS) says NOT to use them.

My point (which maybe wasn't communicated clearly) was that the inability to run the SMART tests didn't cause the issue. At best, SMART might have helped by warning of a potential problem if the issue were due to a disk failure. The disk failure scenario is possible but not likely, as it is a raidz2 pool.

More likely the combination of a hardware RAID card and a power failure caused this, in which case what value would the SMART test have added?

Before anyone else points to "the stickies"...

Always use recommended hardware and NEVER use a RAID card with ZFS. ZFS requires DIRECT access to the disks. If you look at any of the Sun/Oracle boxes, they ship with standard SATA ports and NO RAID controllers.
 

Whattteva

Wizard
Joined
Mar 5, 2013
Messages
1,824
david kennedy said:
My point (and maybe wasn't communicated clearly) was that the inability to run the SMART tests didn't cause the issue. Best case it may have helped if the issue was due to a disk failure by warning of a potential issue. The disk failure scenario is possible but not likely as it is a raidz2 pool.

I suppose I was also not being entirely clear. It was never my intention to state that not being able to run the SMART test caused the issue. Saying that is the sole issue is just silly, because SMART is just a set of diagnostic tools after all. My point was that not being able to run it was a huge red flag, because that obviously indicates that the system does not have direct access to the HDD, which IS one of the major risk factors when it comes to things going unexpectedly wrong.
 

esamett

Patron
Joined
May 28, 2011
Messages
345

rs225

Guru
Joined
Jun 28, 2014
Messages
878
What I mean about the scrub and permanent file errors is that permanent file errors will linger in the listing (with hexadecimal junk) even after you delete the file, snapshots, etc. But after a scrub, they will finally be removed from the list if they are actually gone.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
No, that list of file errors is the list that ZFS knows of. If/when those errors are corrected they will instantly be removed. In your case, if it was some .txt file that was corrupt and you deleted it then that text file would instantly not be listed anymore.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,175

esamett

Patron
Joined
May 28, 2011
Messages
345
"Strong in the Force am I, but not that strong." - Yoda

Not familiar with EFI.
 