Thank you for the very quick response. As I said, the non-ECC memory was a calculated risk based on being given a free motherboard/CPU/memory combo. I am well aware of the benefits of the enterprise-grade infrastructure I use daily for work and would definitely have gone with the recommended equipment if I were buying everything new. In fact, replacing the motherboard/CPU/memory is not out of the question for me if that is the only true solution. For a home system, I was willing to accept the risk of cosmic rays corrupting massive amounts of data during a scrub, or individual files during normal operation. Hitting this issue after three weeks of uptime, versus seven years and multiple drive swaps/expansions on an Ubuntu box running md/ext, is surprising.
Let me put your comment in context. Are you *sure* you didn't have any problems when you were using md/ext? I mean, ext has no way to actually identify corruption. So unless you went checking every byte of all of your data you really wouldn't know for sure, right?
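To put "checking every byte" in perspective: on md/ext the closest you can get is rolling your own checksum manifest and re-verifying it periodically, something like the sketch below (the paths here are just examples):

```sh
# Record a SHA-256 for every file under the data directory
find /srv/data -type f -print0 | xargs -0 sha256sum > /root/manifest.sha256

# Months later: re-read every byte and compare it against the manifest
sha256sum --quiet -c /root/manifest.sha256
```

And even that only tells you that *something* changed since the manifest was written, not which copy is good. ZFS does the equivalent check on every single read, automatically.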
That's the catch with ZFS (and the reason I went to ZFS): you can't really compare it to the alternatives, because the alternatives have no way of telling you how they're doing. So your comparison, while I understand what you mean, doesn't really hold up as a fair one. You could have had corruption happening at regular intervals and you might never have known. In fact, some of the studies that make ZFS look so good argue that corruption is far more widespread than we realize and that we probably wouldn't accept the reality of it. I still don't entirely buy it, but I have seen some of it first-hand. ;)
I did not expect one or two corrupted files to damage the entire file system to the point that it can only be repaired by destroying and recreating the pool, or a simple command like "zpool status -v" to cause system hangs or reboots! So far, memtest has found no issues after completing one pass; I will let it run a few more. If it were just a few corrupted files, I would easily chalk it up to non-ECC memory and acknowledge that everyone was right and I underestimated how often cosmic rays would hit my data. What concerns me is that the only recommended recovery procedure is a full rebuild, and that a fairly innocuous command such as zpool status -v takes the system down.
If the ZFS metadata gets corrupted, then the whole house of cards comes crashing down. Yes, it's entirely possible for a zpool status to crash the box. There are dozens and dozens of users who couldn't even mount their pool because it was corrupted; they couldn't access their data at all anymore. If a crashing zpool status is the worst problem you're having, count yourself lucky and get your data off the pool. ;)
I took the chance with non-ECC memory and got corrupt data. If I need to spend the money to correct that, well, then stupidity has a price. Today that price looks like $500+, plus the cost of a second set of backup disks if I don't want to rely entirely on a RAID0 stripe to hold my data locally for a few days.
This is the harsh reality. It's also why we're so adamant about the whole "go ECC and server-grade or go home". The cost of climbing out of the hole you can end up in (never mind the emotional cost of the lost data) makes it far cheaper, and faster, in the long run to do it right the first time.
I am less concerned with the source of the original corruption; I would like some assistance troubleshooting why I can't delete the files and move on, and why zpool status is crashing. ECC memory might have prevented the current corruption (assuming it was caused by my non-ECC memory), but it doesn't explain zpool status -v crashing or why I can't clear the corrupted data from the existing pool.
Actually, it might. If your metadata is corrupted beyond what the redundancy can repair, then ZFS has to process whatever garbage it reads, and that garbage often causes system crashes. Unfortunately, you'll find very few people here willing to troubleshoot this issue, because experience says it's almost certainly hardware. Even *if* someone wanted to dig deeper, they'd only be able to show what is wrong, not necessarily correct it, and it would take many, many hours to diagnose.
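For what it's worth, when the damage is confined to regular files rather than metadata, the textbook cleanup is short; the fact that the very first step crashes your box is exactly what points at deeper damage (the pool name "tank" and the file path are placeholders):

```sh
# List the files the checksums have flagged (the step that's crashing for you)
zpool status -v tank

# If only regular files are listed, delete or restore them from backup...
rm /mnt/tank/path/to/corrupt-file

# ...then reset the error counters and scrub to re-verify the whole pool
zpool clear tank
zpool scrub tank
```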
Is the answer really as simple as "You didn't use ECC memory, so there is nothing we can do for you"? With ECC memory, could a power outage put me in a similar situation? How about a FreeNAS/ZFS bug? I do have a UPS with a shutdown timer, but I'm still nervous about having no recourse besides wiping the file system to fix two corrupt files that I don't even care about. I am not trying to be difficult; I am really just trying to understand the recovery options for when things go badly. If wipe/restore is the only real answer, then I need to plan for a second local backup in addition to RAIDZ2 with local and offsite backups, which is something I hadn't anticipated for a home NAS. Would snapshots have mitigated the need to rebuild the file system?
Is the answer that simple? Well, you should *always* have a backup, because things can and do go wrong. If your SAS controller starts writing random garbage to all of the disks, it's possible to trash a zpool in an unrecoverable way. ZFS does not negate the need for a backup; it only tells you whether the data on disk is correct or not. As a rule, a power outage shouldn't put you in the same place, but on non-server-grade hardware that often turns out not to be true.
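That's also why pool-to-pool replication is cheap insurance. A rough sketch of keeping a second pool in sync (pool, dataset, and snapshot names are placeholders):

```sh
# First full copy: snapshot the dataset and send it to the backup pool
zfs snapshot tank/data@rep1
zfs send tank/data@rep1 | zfs receive backup/data

# Afterwards, incremental sends only move the blocks changed since rep1
zfs snapshot tank/data@rep2
zfs send -i tank/data@rep1 tank/data@rep2 | zfs receive backup/data
```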
I'd be willing to bet money the only solution is to wipe and restore. :(
Snapshots wouldn't have mitigated this disaster. All they do is give you points in time you can roll back to. Once a pool is corrupted, the best way to deal with it is to remove the corruption: if it's just a file, you can simply delete it, but if it's metadata then you may have no recourse except a nuke and repave.
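To make that concrete: a snapshot just pins old blocks inside the same pool, so it lives and dies with the pool's metadata (the dataset and snapshot names are placeholders):

```sh
# Take a point-in-time snapshot of a dataset
zfs snapshot tank/data@before-cleanup

# Roll the dataset back to that point in time
zfs rollback tank/data@before-cleanup
```

The live data and the snapshot sit under the same metadata tree on the same disks, so if the pool's metadata is trashed, the snapshot goes down with it. Only a copy on a different pool (or a different box) protects against that.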