My FreeNAS server had been working great for months, and I recently added a bunch of files to it. Yesterday I logged in to the web admin and was greeted with a flashing yellow alert because the system was degraded: two of my hard drives were unavailable. I wasn't immediately concerned; a raidz2 should be able to withstand two missing drives. My first action was to reboot the server. Once it came back up, only one drive was unavailable, but the yellow alert message had become more concerning:
"One or more devices has experienced an error resulting in data corruption. Applications may be affected. Restore the file in question if possible. Otherwise restore the entire pool from backup."
I went to the shell and `zpool status -v` told me there was one data error in a particular file. I did a quick search online to see what I should do about my damaged file. The common answers I found recommended deleting the file and then scrubbing the volume. I thought that was kind of annoying but not a big deal; I can replace that file. So I deleted the file and began a scrub on the volume. Once the scrub was complete, `zpool status` reported millions of data errors! I was kind of in disbelief that there was so much data corruption; I thought ZFS was supposed to prevent data corruption. I shut down the server and opened the case just to check if there was anything obvious going on with one of the drives, like a loose cable or something. I didn't find anything wrong, and I booted it up again. For some reason it began resilvering immediately. So that's where I'm at now; after resilvering, here's `zpool status`:
Code:
  pool: zfs
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: resilvered 5.29G in 2h2m with 858569 errors on Fri Nov 1 18:32:41 2013
config:

        NAME                                            STATE     READ WRITE CKSUM
        zfs                                             DEGRADED     0     0  848K
          raidz2-0                                      DEGRADED     0     0 1.68M
            gptid/0599af5b-069a-11e1-bcea-1c6f655cfcd3  ONLINE       0     0     0
            14508403152296357094                        UNAVAIL      0     0     0  was /dev/gptid/066f9a2d-069a-11e1-bcea-1c6f655cfcd3
            gptid/071c9544-069a-11e1-bcea-1c6f655cfcd3  ONLINE       0     0  134K
            gptid/07cba34b-069a-11e1-bcea-1c6f655cfcd3  ONLINE       0     0     2
            gptid/087bcc8a-069a-11e1-bcea-1c6f655cfcd3  ONLINE       0     0     0
            gptid/09565946-069a-11e1-bcea-1c6f655cfcd3  ONLINE       0     0     0

errors: 858569 data errors, use '-v' for a list
It's nice to see that there are now fewer than a million data errors, but `zpool status -v` now lists pretty much every file stored on the server. I can see directory listings of all the files but am unable to copy or open any of them.
Is there anything I can do at this point to fix the data, or are my files just gone? Is there anything I did wrong that made the situation worse? What can I do in the future to prevent this from happening again? I know hard drives die occasionally; I thought ZFS would be more reliable, but my FreeNAS server ended up losing data before any of my other computers did.