Data Corruption

Status
Not open for further replies.

sotwizz

Cadet
Joined
Oct 6, 2011
Messages
9
My FreeNAS server had been working great for months, and I recently added a bunch of files to it. Yesterday I logged in to the web admin and was greeted with a flashing yellow alert because the system was degraded: two of my hard drives were unavailable. I wasn't immediately concerned; a raidz2 should be able to withstand two missing drives. My first action was to reboot the server. Once it came back up, only one drive was unavailable, but the yellow alert message had become more concerning:

"One or more devices has experienced an error resulting in data corruption. Applications may be affected.Restore the file in question if possible. Otherwise restore the entire pool from backup."

I went to the shell, and `zpool status -v` told me there was one data error in a particular file. I did a quick search online to see what I should do about the damaged file. The common answer I found recommended deleting the file and then scrubbing the volume. I thought that was kind of annoying but not a big deal; I could replace that file. So I deleted the file and began a scrub on the volume. Once the scrub was complete, `zpool status` reported millions of data errors! I was in disbelief that there was so much corruption; I thought ZFS was supposed to prevent data corruption. I shut down the server and opened the case just to check whether there was anything obvious going on with one of the drives, like a loose cable. I didn't find anything wrong, so I booted it up again, and for some reason it began resilvering immediately. So that's where I'm at now; after the resilver, here's `zpool status`:

Code:
  pool: zfs
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: resilvered 5.29G in 2h2m with 858569 errors on Fri Nov  1 18:32:41 2013
config:

        NAME                                            STATE     READ WRITE CKSUM
        zfs                                             DEGRADED     0     0  848K
          raidz2-0                                      DEGRADED     0     0 1.68M
            gptid/0599af5b-069a-11e1-bcea-1c6f655cfcd3  ONLINE       0     0     0
            14508403152296357094                        UNAVAIL      0     0     0  was /dev/gptid/066f9a2d-069a-11e1-bcea-1c6f655cfcd3
            gptid/071c9544-069a-11e1-bcea-1c6f655cfcd3  ONLINE       0     0  134K
            gptid/07cba34b-069a-11e1-bcea-1c6f655cfcd3  ONLINE       0     0     2
            gptid/087bcc8a-069a-11e1-bcea-1c6f655cfcd3  ONLINE       0     0     0
            gptid/09565946-069a-11e1-bcea-1c6f655cfcd3  ONLINE       0     0     0

errors: 858569 data errors, use '-v' for a list


It's nice to see that there are now fewer than a million data errors, but `zpool status -v` now lists pretty much every file stored on the server. I can see directory listings of all the files, but I'm unable to copy or open any of them.
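For reference, here's roughly the sequence of commands I ran from the shell. The file path below is just a placeholder for the one damaged file that `zpool status -v` originally reported:

Code:
# List pool health plus the files with unrecoverable errors
zpool status -v zfs

# Delete the one damaged file it reported (placeholder path)
rm "/mnt/zfs/<path-to-damaged-file>"

# Scrub the pool to re-verify every block against its checksums
zpool scrub zfs

# Check scrub/resilver progress and the running error counts
zpool status zfs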

Is there anything I can do at this point to fix the data, or are my files just gone? Is there anything I did wrong that made the situation worse? What can I do in the future to prevent this from happening again? I know hard drives die occasionally; I thought ZFS would be more reliable, yet my FreeNAS server ended up losing data before any of my other computers.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Well, considering you provided no hardware list, it's hard to tell you what is wrong. But I'll sum up a few things I'm guessing you did wrong:

1. Didn't set up SMART monitoring. If you had, you'd have gotten an email the moment a disk went offline, instead of logging in to find 2 disks already gone. More than likely this one thing alone would have saved your data. (Sketched below.)
2. You ran out of redundancy. RAIDZ2 tolerates 2 failed disks; you had 2 disks fail, and at that point URE (unrecoverable read error) rates make it quite possible to lose data on both hardware RAID and ZFS. The way to protect your data is to monitor your disks (see item 1) and to proactively replace failing components before redundancy runs out (also sketched below).
3. Hopefully you used a server-grade motherboard, CPU, and RAM. But if I were a betting man, I'd guess that when you post your hardware list it will be desktop parts. That gets ugly, because ECC RAM is virtually a requirement if you value your pool(s).
4. ZFS is more trustworthy than hardware RAID, as long as you do your homework. If you didn't do your homework, or didn't do it well, you're going to be shell-shocked at the mistakes you've probably made without knowing they were mistakes (such as not using ECC RAM).
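To sketch items 1 and 2 at the command line: device names, gptids, and the email address below are placeholders, and on FreeNAS you'd normally configure the monitoring through the web UI's S.M.A.R.T. service rather than editing smartd.conf by hand.

Code:
# One-off health report for a single disk
smartctl -a /dev/ada0

# Kick off a long self-test right now
smartctl -t long /dev/ada0

# Example smartd.conf entry: monitor everything, email on trouble,
# short self-test daily at 02:00, long self-test Saturdays at 03:00
/dev/ada0 -a -m you@example.com -s (S/../.././02|L/../../6/03)

# Item 2: proactively swap out a disk that's throwing errors
# *before* the pool is out of redundancy
zpool replace zfs gptid/<failing-disk> gptid/<new-disk>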

I talked to a guy for about 30 minutes yesterday. He had the usual story that most people here fall under: he'd been in IT for years, it was a Friday afternoon, and he thought it would be a good time to move to FreeNAS. So he upgraded and then proceeded to do just about everything wrong that he possibly could. Off the top of my head:

1. Went with a RAIDZ1, then 2 mirrored vdevs, and wondered where his missing disk space went (see the example below).
2. Didn't use ECC RAM.
3. Created and tried to set up jails, which didn't go so well (he almost lost his entire movie collection because of his misunderstanding of jails).
4. Didn't set up SMART monitoring or testing.
5. Had no clue what scrubs were.
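To illustrate item 1 with a hypothetical pool (disk names made up): zpool will actually warn you about mismatched replication levels, and the "missing" space is just redundancy overhead.

Code:
# A 3-disk raidz1 vdev plus a 2-disk mirror vdev in one pool;
# zpool refuses mismatched replication levels unless you force it
zpool create -f tank raidz1 da0 da1 da2 mirror da3 da4

# With five 1TB disks you end up with roughly 3TB usable:
# raidz1 of 3 disks ~ 2 disks of data, mirror of 2 disks ~ 1 disk
zpool list tank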
 

gpsguy

Active Member
Joined
Jan 22, 2012
Messages
4,472
You forgot #5 - Didn't have any backups
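For anyone reading along, here's a minimal sketch of a ZFS-native backup using snapshots and zfs send/receive. The dataset and host names are made up, and FreeNAS wraps the same mechanism in its periodic snapshot and replication tasks:

Code:
# Take a read-only, point-in-time snapshot of a dataset
zfs snapshot zfs/data@backup-2013-11-01

# Replicate that snapshot to a pool on another machine
zfs send zfs/data@backup-2013-11-01 | ssh backuphost zfs receive -F backup/data

# Later, send only what changed since the previous snapshot
zfs send -i zfs/data@backup-2013-11-01 zfs/data@backup-2013-11-08 | ssh backuphost zfs receive backup/data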

 