[SOLVED] Fatal Trap 12, possibly due to ECC memory, further issue...


eretron

Dabbler
Joined
Dec 11, 2013
Messages
27
Hi all,

Let me just say off the bat that I've read everything I could find on this issue. While much of it was helpful in understanding the reasons behind the crash and what can generally be done to prevent or solve related problems, it did not solve my issue. Let me start by listing my build:

FreeNAS 9.2.1.8
Supermicro X8STE
6x 4GB ECC DDR3 RAM
6x 2TB WD Red, set up as RAIDZ2

We have established with almost absolute certainty that the issue came up because of a faulty memory module, supported by the fact that the BIOS and MemTest only recognized 16 of the 24GB of RAM. Further slot testing narrowed it down to a single faulty module, so we removed that pair of modules and were left with 16GB. We rebooted, to no effect; the error still appears at the point of mounting the pool during boot.

We tried importing the pool using various commands suggested on the forums, including the -f flag (force), -F (recovery mode), and -X (extreme rewind). Each returned either fatal trap 9 or 12, depending on the command. We tried the same thing with the new FreeNAS 9.2.1.9 on another stick, where we would first be notified that the pool may be in use by another system (the old one), and forcing the import again ended in the fatal traps.
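
For reference, the invocations we tried were roughly these (pool name illustrative):
Code:
zpool import -f tank        # force import despite the "in use by another system" warning
zpool import -f -F tank     # recovery mode: roll back the last few transactions
zpool import -f -F -X tank  # extreme rewind; a last resort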

We read what was said here, and the main suggestion we have seen is to create another pool and migrate the data over to it. Since our data is of enormous importance to us and to the well-being of our business, and since we see no way of accessing the data so as to migrate it, we were wondering whether anyone has a solution for this?

Are we missing anything in this process? Hopefully it's something simple and stupid that makes us look like complete idiots, but idiots with their data safe and sound...

Thanks.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I'm not sure how important the data is to you, but if you are willing to pay for the chance of recovery, you can always call iXsystems and do a one-off 3-hour support ticket. Do note that you can't expect immediate turnaround, as they'll have to send you a contract to sign and return, etc etc etc.

Other than that, I don't have much advice. Recovery isn't something you can really do effectively, as there are too many logical stops and detours on the way to a working pool. I will say that since you already tried -f and -X, there is, in my opinion, an even smaller chance of getting your data back. Those tools shouldn't be thrown around willy-nilly; they can actually take a recoverable pool and make it unrecoverable.

If the data is worth paying for you can call iXsystems and see if they will help.
 

eretron

Dabbler
Joined
Dec 11, 2013
Messages
27
Thanks, cyberjock, for being ever-present and providing constructive feedback. :)

Okay, we made some progress.

First, we used
Code:
zdb -ul <vdev>

to get the list of uberblocks. Then, using the txg number from one of those uberblocks, we successfully imported the pool with
Code:
zpool import -N -o readonly=on -f -F -T <txg> <pool>
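
In case it helps anyone, the txg value passed to -T comes from that zdb output; a quick way to scan the candidates is something like this (device path illustrative):
Code:
zdb -ul /dev/gptid/xxxx | grep -E 'Uberblock|txg|timestamp'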


The problem we are facing now is that the pool isn't mounted. When we try
Code:
zfs mount poolname

we get a "failed to create mountpoint" error.

The zfs list command shows the correct dataset list with all the details. Any ideas?
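
For completeness, these are the read-only checks we used while investigating (pool name illustrative):
Code:
zfs get mountpoint,mounted poolname   # confirm where ZFS wants to mount the pool
ls -ld /mnt                           # check permissions on the parent directory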

PS - we're reserving the pay-to-solve options as a last resort, naturally. In the meantime, we are being careful to use non-invasive, read-only commands.
 

eretron

Dabbler
Joined
Dec 11, 2013
Messages
27
Good news, people.

We have managed to import the pool.

The solution to the mounting problem was the permissions on the /mnt folder.

Code:
chmod -R 777 /mnt


and voilà.

We will now try to back up all the data via SFTP, and then destroy the current pool and make a new one.
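
If anyone is in the same spot, pulling the data with rsync over SSH is one option (host and paths illustrative; plain SFTP works too):
Code:
rsync -avh --progress root@freenas:/mnt/poolname/ /backup/poolname/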

I hope someone will find this useful.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Sorry I didn't respond sooner. Been too busy to read the forums the last few days. Just want to say a few things:

1. Thank goodness you got your data. Woohoo!
2. -F, -f, -T and -X are VERY invasive. So putting those in the same post where you say "we are being careful in using non-invasive read-only commands" doesn't mean much. Even if you mount a pool read-only, the ZFS service will make changes to the pool if it finds corruption. This cannot be blocked under any circumstances. The only thing the read-only parameter does is prevent the user from writing to the pool. This "not well-known but very important fact" has actually been useful in saving a few pools in my day. :)
 

sfcredfox

Patron
Joined
Aug 26, 2014
Messages
340
I'd like to ask a (possibly) dumb question on this topic:

Since he said he was using ECC memory, shouldn't this type of situation have been prevented by the fact that it was ECC?
I thought that was the whole point? Or does ECC only cover correcting flipped bits, and not a total module failure while important data was in it?

By the way, did you have any ideas on how/why the permissions issue crept in here?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I'd like to ask a (possibly) dumb question on this topic:

Since he said he was using ECC memory, shouldn't this type of situation have been prevented by the fact that it was ECC?
I thought that was the whole point? Or does ECC only cover correcting flipped bits, and not a total module failure while important data was in it?

By the way, did you have any ideas on how/why the permissions issue crept in here?

Yes and no. If it were a bad memory module failing ECC, the system should have been halted when the memory controller detected an uncorrectable error. On the other hand, if the "bad" memory module was actually good and the memory controller on the motherboard is flaky or failed, then it's anyone's guess, based on what fails and how it affects the system. For example, if the memory controller fails in a way that the "halt the CPU" message can't be sent to the CPU, then it would obviously never halt. There are too many variables, and unless he was able to identify that the memory module was in fact failed, and the only thing that failed (and that failure didn't somehow impact the system in some other way, directly or indirectly), there's no way to prove much of anything with 100% certainty. It could be that the memory channel that module was on is bad because the memory controller is bad. The OP doesn't provide enough information.

There's also no proof that he doesn't actually have something like 5 failing disks in his 6-disk RAIDZ2. We don't know that, and we're assuming he's diligent and has set up good SMART tests, scrubs, etc. But we don't actually know that with 100% certainty. If I had a problem with my pool, I'd be able to immediately prove that I have good SMART tests, regular scrubs, etc., because I know better. But "Mr. Random User" more often than not seems to screw this up, and he finds out he made a mistake when his pool won't mount anymore.

Then there's always the possibility of other faults like a crappy power supply, bad choice of SATA/SAS controller, etc etc etc.

Just too many variables that may or may not affect the outcome to know anything for certain. The one thing I can tell you, though, is that this is why people do backups. ;)
 

eretron

Dabbler
Joined
Dec 11, 2013
Messages
27
I am sure that the memory was to blame, since one module out of six was totally dead after the problem occurred.

The system had regular scrubs (every 15 days) and SMART short and long tests on all disks. Everything was clean and error-free.
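
(For anyone wanting to verify the same on their own box, something like this shows the last scrub result and the per-disk self-test history; pool and device names illustrative.)
Code:
zpool status -v tank             # result of the last scrub
smartctl -l selftest /dev/ada0   # SMART self-test log for one disk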

The system is now back up and running with no errors whatsoever, with a brand-new 24GB of ECC RAM.

Another new addition is a second FreeNAS box set up for daily rsync backups... Just in case...
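
As a sketch, the nightly job on the backup box looks roughly like this (hosts, paths, and schedule illustrative; FreeNAS can also configure this as a GUI rsync task):
Code:
# crontab entry on the backup box: pull the pool every night at 03:00
0 3 * * * rsync -az --delete root@primary:/mnt/tank/ /mnt/backup/tank/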
 
Joined
Mar 6, 2014
Messages
686
Since our data is of enormous importance to us and the well-being of our business...
Also new stuff is another FreeNAS box made for rsync daily backups... Just in case...
You almost learned the hardest way.....
Glad you got things fixed ;)
 