[SOLVED] Fatal Trap 12, possibly due to ECC memory, further issue...


eretron

Dabbler
Joined
Dec 11, 2013
Messages
27
Hi all,

Let me just say off the bat that I've read everything I could find on this issue. While much of it was helpful in understanding the reasons behind the crash and what can generally be done to prevent or solve related problems, it did not solve my issue. Let me start by listing my build:

FreeNAS 9.2.1.8
Supermicro X8STE
6x 4GB ECC DDR3 RAM
6x 2TB WD Red, set up as RAIDZ2

We have established with almost absolute certainty that the issue came up because of a faulty memory module, supported by the fact that the BIOS and MemTest only recognized 16 of the 24GB of RAM. Further slot testing narrowed it down to a single faulty module, so we removed that pair of modules and were left with 16GB. We rebooted, to no effect; the error still appears at the point of mounting the pool during boot.

We tried importing the pool using various commands suggested on the forums, including the -f flag (force), -F (recovery mode), and -X (extreme rewind). Each returned either fatal trap 9 or 12, depending on the command. We tried the same thing with the new FreeNAS 9.2.1.9 on another stick, where we would first be notified that the pool may be in use by another system (the old one), and forcing the import again ended in the fatal traps.
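
For reference, the invocations we tried were roughly these (pool name illustrative):
Code:
zpool import -f tank        # force import despite the "in use by another system" warning
zpool import -f -F tank     # recovery mode: roll back the last few transactions
zpool import -f -F -X tank  # extreme rewind; a last resort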

We read what was said here, and the main suggestion we have seen is to create another pool and migrate the data over to it. Since our data is of enormous importance to us and to the well-being of our business, and since we see no way of accessing the data so as to migrate it, we were wondering whether anyone has a solution for this?

Are we missing anything in this process? Hopefully it's something simple and stupid that makes us look like complete idiots, but idiots with their data safe and sound...

Thanks.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I'm not sure how important the data is to you, but if you are willing to pay for the chance of recovery, you can always call iXsystems and do a one-off 3-hour support ticket. Do note that you can't expect immediate turnaround, as they'll have to send you a contract to sign and return, etc etc etc.

Other than that, I don't have much advice. Recovery isn't something you can really do effectively, as there are too many logical stops and detours on the way to a working pool. I will say that since you already tried -f and -X, there is, in my opinion, an even smaller chance of getting your data back. Those tools shouldn't be thrown around willy-nilly; they can actually take a recoverable pool and make it unrecoverable.

If the data is worth paying for you can call iXsystems and see if they will help.
 

eretron

Dabbler
Joined
Dec 11, 2013
Messages
27
Thanks, cyberjock, for being ever-present and providing constructive feedback. :)

Okay, we made some progress.

First, we used
Code:
zdb -ul <vdev>

to get the list of uberblocks. Then, using the txg number from one of those uberblocks, we successfully imported the pool with
Code:
zpool import -N -o readonly=on -f -F -T <txg> <pool>
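
In case it helps anyone, the txg value passed to -T comes from that zdb output; a quick way to scan the candidates is something like this (device path illustrative):
Code:
zdb -ul /dev/gptid/xxxx | grep -E 'Uberblock|txg|timestamp'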


The problem we are facing now is that the pool isn't mounted. When we try
Code:
zfs mount poolname

we get a "failed to create mountpoint" error.

The zfs list command shows the correct dataset list with all the details. Any ideas?
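
For completeness, these are the read-only checks we used while investigating (pool name illustrative):
Code:
zfs get mountpoint,mounted poolname   # confirm where ZFS wants to mount the pool
ls -ld /mnt                           # check permissions on the parent directory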

PS - we're reserving the pay-to-solve options as a last resort, naturally. In the meantime, we are being careful to use non-invasive, read-only commands.
 

eretron

Dabbler
Joined
Dec 11, 2013
Messages
27
Good news, people.

We have managed to import the pool.

The solution to the mounting problem was the permissions on the /mnt folder.

Code:
chmod -R 777 /mnt


and voilà.

We will now try to back up all the data via SFTP, and then destroy the current pool and make a new one.
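
If anyone is in the same spot, pulling the data with rsync over SSH is one option (host and paths illustrative; plain SFTP works too):
Code:
rsync -avh --progress root@freenas:/mnt/poolname/ /backup/poolname/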

I hope someone will find this useful.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Sorry I didn't respond sooner. Been too busy to read the forums the last few days. Just want to say a few things:

1. Thank goodness you got your data. Woohoo!
2. -F, -f, -T and -X are VERY invasive. So putting those in the same post where you say "we are being careful in using non-invasive read-only commands" doesn't mean much. Even if you mount a pool read-only, the ZFS service will make changes to the pool if it finds corruption. This cannot be blocked under any circumstances. The only thing the read-only parameter does is prevent the user from writing to the pool. This "not well-known but very important fact" has actually been useful in saving a few pools in my day. :)
 

sfcredfox

Patron
Joined
Aug 26, 2014
Messages
340
I'd like to ask a (possibly) dumb question on this topic:

Since he said he was using ECC memory, shouldn't this type of situation have been prevented by the fact that it was ECC?
I thought that was the whole point? Or does ECC only cover correcting flipped bits, and not a total module failure while important data was in it?

By the way, did you have any ideas on how/why the permissions issue crept in here?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I'd like to ask a (possibly) dumb question on this topic:

Since he said he was using ECC memory, shouldn't this type of situation have been prevented by the fact that it was ECC?
I thought that was the whole point? Or does ECC only cover correcting flipped bits, and not a total module failure while important data was in it?

By the way, did you have any ideas on how/why the permissions issue crept in here?

Yes and no. If it were a bad memory module failing ECC, the system should have been halted when the memory controller detected an uncorrectable error. On the other hand, if the "bad" memory module was actually good and the memory controller on the motherboard is flaky or failed, then it's anyone's guess, based on what fails and how it affects the system. For example, if the memory controller fails in a way that the "halt the CPU" message can't be sent to the CPU, then it would obviously never halt. There are too many variables, and unless he was able to identify that the memory module was in fact failed, and the only thing that failed (and that failure didn't somehow impact the system in some other way, directly or indirectly), there's no way to prove much of anything with 100% certainty. It could be that the memory channel that module was on is bad because the memory controller is bad. The OP doesn't provide enough information.

There's also no proof that he doesn't actually have something like 5 failing disks in his 6-disk RAIDZ2. We don't know that, and we're assuming he's diligent and has set up good SMART tests, scrubs, etc. But we don't actually know that with 100% certainty. If I had a problem with my pool, I'd be able to immediately prove that I have good SMART tests, regular scrubs, etc., because I know better. But "Mr. Random User" more often than not seems to screw this up, and he finds out he made a mistake when his pool won't mount anymore.

Then there's always the possibility of other faults like a crappy power supply, bad choice of SATA/SAS controller, etc etc etc.

Just too many variables that may or may not affect the outcome to know anything for certain. The one thing I can tell you, though, is that this is why people do backups. ;)
 

eretron

Dabbler
Joined
Dec 11, 2013
Messages
27
I am sure that the memory was to blame, since one module out of six was totally dead after the problem occurred.

The system had regular scrubs (every 15 days) and SMART short and long tests on all disks. Everything was clean and error-free.
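
(For anyone wanting to verify the same on their own box, something like this shows the last scrub result and the per-disk self-test history; pool and device names illustrative.)
Code:
zpool status -v tank             # result of the last scrub
smartctl -l selftest /dev/ada0   # SMART self-test log for one disk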

The system is now back up and running with no errors whatsoever, with a brand-new 24GB of ECC RAM.

Another new addition is a second FreeNAS box set up for daily rsync backups... Just in case...
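
As a sketch, the nightly job on the backup box looks roughly like this (hosts, paths, and schedule illustrative; FreeNAS can also configure this as a GUI rsync task):
Code:
# crontab entry on the backup box: pull the pool every night at 03:00
0 3 * * * rsync -az --delete root@primary:/mnt/tank/ /mnt/backup/tank/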
 
Joined
Mar 6, 2014
Messages
686
Since our data is of enormous importance to us and the well-being of our business...
Also new stuff is another FreeNAS box made for rsync daily backups... Just in case...
You almost learned the hardest way.....
Glad you got things fixed ;)
 