Data corruption questions

Status
Not open for further replies.

jdratlif

Dabbler
Joined
Oct 1, 2012
Messages
25
I have been running FreeNAS for about a year, and I haven't really had any problems with it.

Recently, I needed to re-partition a drive on another system that had a lot of data on it, so I copied all the data over to FreeNAS and erased the drive. Now that I am trying to put it back, it tells me a large number of the files that I copied are corrupted and I can't copy them back.

I ran a scrub, and it detected "permanent errors" in a number of files. The scrub found only checksum errors in the files, no read/write errors. Running a long SMART test on all three of my FreeNAS drives shows no issues.

What happened here? Is there any way to fix these files? What did I do wrong that FreeNAS just silently destroyed tons of my data, even though the hardware appears to be functional?

FreeNAS x64 8.3.1

Code:
zpool status -v | less
  pool: Giuseppe
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
  scan: scrub repaired 111M in 3h38m with 116 errors on Fri Sep 27 02:26:22 2013
config:
 
        NAME        STATE     READ WRITE CKSUM
        Giuseppe    ONLINE       0     0   214
          raidz1-0  ONLINE       0     0   428
            ada0    ONLINE       0     0   607
            ada1    ONLINE       0     0   581
            ada2    ONLINE       0     0   598
 
errors: Permanent errors have been detected in the following files:


I ran the SMART tests manually by doing smartctl -t long /dev/ada0, and likewise for ada1 and ada2.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I don't know.. what hardware are you using?

Have you tried doing a RAM test recently? If not you can do one using www.memtest.org. Just boot from it and let it run. 3 full passes without errors is good RAM. This usually takes many hours, so I typically start it and go to bed.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
And no, your data is trashed. Hopefully you have backups, since RAIDZ1 is not a backup (as you have just seen firsthand). In fact, RAIDZ1 is not considered reliable by today's standards (link in my sig if you want to see why).

But clearly something is wrong. ZFS is identifying the errors, but is unable to correct for them. Now you just need to figure out what you did wrong. Did you do scrubs regularly? Did you use ECC RAM? Did you virtualize FreeNAS?
 

jdratlif

Dabbler
Joined
Oct 1, 2012
Messages
25
AMD A4 3400 CPU
8 GB RAM (http://www.newegg.com/Product/Product.aspx?Item=N82E16820231550) -- if you need special ram to be ECC, then that's probably not it

Scrubs were run every 35 days, which FreeNAS decided on
ECC RAM? I don't know. Probably not.
Virtualize - no, it's iron.

I will run the memtest tonight. The files aren't critical, so there is no backup of them. It's just an annoyance.

Is there something I could have done before I deleted the original files to ensure they copied correctly?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
jdratlif said:
Is there something I could have done before I deleted the original files to ensure they copied correctly?

Not sure. The cause of the problems needs to be identified first. Are all of the files showing as corrupted new files, or are old files affected too?

Scrubs should be run more frequently on consumer-grade disks. This probably isn't your problem, but it is something you should read up on in the manual and forums and adjust accordingly. I recommend 15 days.

As for RAM, ECC is "required" for ZFS to perform all of its functions properly. Without ECC RAM, if your RAM goes bad you will trash your zpool (yours isn't ECC RAM). Honestly, this looks like what has happened, but until you run a test there's no way to know for sure. There are plenty of posts in the forums discussing this, so I won't hash it out again here. :)
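
On the earlier question of checking a copy before deleting the originals: one generic approach is to hash every file on both sides and compare the results. The sketch below is only an illustration, not something from this thread; the function names are hypothetical and SHA-256 is an arbitrary choice. Note that on a machine with faulty RAM the check itself can give wrong answers, and a read-back may be served from cache rather than from disk, so a clean result is evidence, not proof.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_trees(source_root, copy_root):
    """Return (missing, mismatched) relative paths found in the copy."""
    source = build_manifest(source_root)
    copy = build_manifest(copy_root)
    missing = sorted(set(source) - set(copy))
    mismatched = sorted(
        name for name in source if name in copy and source[name] != copy[name]
    )
    return missing, mismatched
```

Calling compare_trees on the original directory and the mounted copy returns two lists; both should be empty before the originals are erased.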
 

jdratlif

Dabbler
Joined
Oct 1, 2012
Messages
25
Only new files have been corrupted, and it wasn't everything I copied. I will change the scrub timer to 15 days.

Should I be considering switching my volumes to UFS, or investing in ECC RAM?

I read a post recently that said FreeNAS demands 6 GB RAM just for the system and an additional 1 GB for each TB of space, where I thought it was just 1 GB per TB of space. I have 8 GB RAM, and 3x2 TB HDD in a RAIDZ1.

Should I buy another hard drive and make it RAIDZ2? Do I need more RAM?
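
For reference, the two rules of thumb mentioned above work out as follows for this pool. This is only a sketch of the arithmetic: the helper name is hypothetical, and counting raw capacity (3 x 2 TB) rather than usable capacity is an assumption, since the post being recalled does not say which is meant.

```python
def ram_rule_of_thumb_gb(pool_tb, base_gb=0, per_tb_gb=1):
    """Suggested RAM in GB: a fixed base plus a per-terabyte allowance."""
    return base_gb + per_tb_gb * pool_tb


# Three 2 TB drives, counted as raw capacity (assumption).
raw_tb = 3 * 2

simple_rule = ram_rule_of_thumb_gb(raw_tb)           # 1 GB per TB only
with_base = ram_rule_of_thumb_gb(raw_tb, base_gb=6)  # 6 GB base + 1 GB per TB

print(simple_rule, with_base)
```

Under these assumptions the simple rule fits within the 8 GB installed, while the rule with a 6 GB base does not.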
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
All of your questions are personal choices (with their own risks [data loss] and rewards [not spending money on more hardware]). I'd never run less than 8GB of RAM, and I'd try to have at least 12GB for a pool your size. I'd recommend RAIDZ2 over RAIDZ1 any day, and I'd never do ZFS without ECC RAM.

I'd wait to see what the RAM test says before you start looking at doing things. One problem at a time.
 

jdratlif

Dabbler
Joined
Oct 1, 2012
Messages
25
Thank you for your assistance. I really hadn't considered the RAM before. I didn't need to run the test all night, because memtest+ found 14000 errors in less than a minute. It's clearly my RAM.

If you were in my position and had money to spend, would you move to a RAIDZ2 solution with 12 GB ECC RAM and stick with ZFS?

Thanks.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
jdratlif said:
Thank you for your assistance. I really hadn't considered the RAM before. I didn't need to run the test all night, because memtest+ found 14000 errors in less than a minute. It's clearly my RAM.

If you were in my position and had money to spend, would you move to a RAIDZ2 solution with 12 GB ECC RAM and stick with ZFS?

Thanks.

Yep.

Now here's the really crappy part. There's no way for you to ensure that any of the data on your pool is good. ZFS is unable to deal properly with RAM errors, since it was designed with the expectation that you will use ECC. Scrubs, which are supposed to help prevent corruption, instead add their own, because they "fix" data that wasn't corrupted and corrupt it in the process. And all of those errors that you have are probably the tip of the iceberg. Many more errors are probably present without being reported. :(

So now you need to go file by file and figure out what is and isn't corrupt. Depending on when your RAM went bad, you might be lucky and have good backups. But more than likely you will find that your backups are now trashed. That's why there are heavy-handed warnings in the forum stickies that you should never use ZFS without ECC RAM.

So unless you plan to open and watch every movie beginning to end, read every text document, and open every picture, you aren't going to know what is and isn't good.

Good luck.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Oh, and I would do your recommendations. RAIDZ2, 12GB of RAM (usually I'd just get 2x8GB sticks of ECC RAM), and continue to use ZFS.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Yes, and the other thing here is that this problem isn't "solved" by switching to UFS; UFS will merrily corrupt data on bad hardware too. ZFS was just competent enough to notice that such corruption had happened.
 

jdratlif

Dabbler
Joined
Oct 1, 2012
Messages
25
Okay, so just to make sure I understand.

The scrub process, given the faulty memory, probably destroyed things on the zpool that it claimed to correct. The entire zpool is untrustworthy at best.

In this particular case where the new data is definitely bad, UFS would not have been any better, because it would have used the memory as well, and I just wouldn't have noticed the problem so quickly.

There's no solution for corrupt memory, except to not use it, and if I really want to protect myself in the future, I should invest in ECC RAM. RAIDZ2 is a much more effective data protection plan, as the process of rebuilding an array after a drive failure carries a statistically significant risk of a second drive failing, thereby destroying the array entirely.

I think this explains some odd errors I had recently on my Mac, and some problems I had with my iPod, and I'm even remembering some odd kernel panics on FreeNAS now. Sigh. The signs were all in front of me... Oh well, I'll recover eventually.

Is there anything else you would recommend I do to help avoid disaster in the future?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
jdratlif said:
Okay, so just to make sure I understand.

The scrub process, given the faulty memory, probably destroyed things on the zpool that it claimed to correct. The entire zpool is untrustworthy at best.

No, if only "new" files are corrupt, the corruption probably happened while copying data to the server, which could imply that it isn't necessarily RAM.

jdratlif said:
In this particular case where the new data is definitely bad, UFS would not have been any better, because it would have used the memory as well, and I just wouldn't have noticed the problem so quickly.

There's no solution for corrupt memory, except to not use it, and if I really want to protect myself in the future, I should invest in ECC RAM. RAIDZ2 is a much more effective data protection plan, as the process of rebuilding an array after a drive failure carries a statistically significant risk of a second drive failing, thereby destroying the array entirely.

I think this explains some odd errors I had recently on my Mac, and some problems I had with my iPod, and I'm even remembering some odd kernel panics on FreeNAS now. Sigh. The signs were all in front of me... Oh well, I'll recover eventually.

Is there anything else you would recommend I do to help avoid disaster in the future?

The rest of this is correct. The recommendation is to curse and swear and cry a little, and then to get more paranoid.

One thing to note about ZFS is that a scrub can actively do further damage on non-ECC hardware with memory problems, whereas UFS won't actively try to "correct" bad data.

You'll see some of us chanting "ECC" "ECC" and then the inevitable pushback from people who think that means "but not for a home user" or "server hardware is big and noisy" or whatever. Typically, I haven't found good trustworthy server-class hardware supporting ECC to be substantially more expensive than prosumer-grade gear. But both are more trustworthy than the $50 board some people will pick up at the local computer shop that is proudly made by "Made in China." And lots of the prosumer-grade gear still has tradeoffs like Realtek ethernets. So if you buy new? See the thread in "Hardware" about suggested hardware. The stuff in there should work well for most server-style uses, not just FreeNAS.

Around here, we've had good luck over the years bench-testing hardware for a month or three prior to deployment, which typically involves stress testing in various ways (most of which are not-too-nice). This is probably even more important if you're relying on every bit of your eight gigabytes of memory (that's sixty-four billion bits) to be error-free. And ECC faults should be treated as a sign to replace the memory, not used as a safety net and relied upon to correct future problems.
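
The bit count for eight gigabytes is easy to check. A quick sketch (the helper name is hypothetical): the "sixty-four billion" figure corresponds to decimal gigabytes, while RAM is normally sized in binary units, which gives a slightly larger number. Either way, that is a lot of bits that all have to be right.

```python
def bits_in_gigabytes(gigabytes, binary=True):
    """Number of bits in the given number of gigabytes.

    binary=True uses 2**30 bytes per gigabyte (how RAM is usually sized);
    binary=False uses 10**9 bytes per gigabyte.
    """
    bytes_per_gb = 2**30 if binary else 10**9
    return gigabytes * bytes_per_gb * 8


decimal_bits = bits_in_gigabytes(8, binary=False)  # 64,000,000,000
binary_bits = bits_in_gigabytes(8, binary=True)    # 68,719,476,736
print(f"{decimal_bits:,} {binary_bits:,}")
```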

Expect disks to fail at the worst possible time. Like when you lose one, the second will fail the next day. Use at least RAIDZ2 and maybe keep a spare handy? Us true paranoids have RAIDZ3, plus a hot spare, PLUS a cold spare that's burned in and ready for insertion.

Be proactive with scrubs and SMART testing.

And if you do all that, fate says that what'll happen is that nothing will go wrong and in five years you'll feel stupid for having taken the extra steps. But when that outcome strikes, remember: fate victimizes the unprepared.

Oh lord I took my cynical pills early tonight.
 

jdratlif

Dabbler
Joined
Oct 1, 2012
Messages
25
I was wrong when I said only new files were corrupt. Only new files were on the list of permanent data errors that the scrub gave me. However, I have seen some weird things with some of my files lately that I hadn't paid much attention to, but that now make perfect sense. These are old files, in some cases weeks old.

This is the first time I've had files that I could not copy from FreeNAS. It's the first time FreeNAS figured out there was massive data loss, but it is clearly not the first time I've had data corruption. I have a disc image that Mac tells me has a bad checksum, and when I ignore the checksum and try to use it, it fails; but that same disc image that I burned to a DVD two years ago works fine in the same application.

I wish I had seen the forum recommendations about ECC memory last year. It doesn't appear that my current box will do ECC memory. I'm not sure I can justify a new NAS right now. If I could, do you like the iXsystems FreeNAS Mini? It doesn't say whether it uses ECC RAM or what kind of NIC it comes with.
 