Crashing on pool replication

Status
Not open for further replies.

brumnas

Dabbler
Joined
Oct 4, 2015
Messages
33
Hello, my FreeNAS 9.10.1 (latest updates applied about a month ago) has started crashing badly. I suspect the replication task, which had never caused problems before (pool1 -> pool2, both local). After I rebooted the box 2 days ago, this happened:
1. I triggered a reboot manually, because I had a "strange" feeling (2 jails were not getting network)
2. After the reboot, I unlocked pool1 - ok
3. I unlocked pool2 - crash within a few seconds
4. While the box was rebooting, my Supermicro MB beeped loudly, complaining "No RAM found"
5. I powered down, powered up, and rebooted
6. While rebooting, it crashed again
7. It rebooted and came up ok, both pools locked
8. I kept it running for some time, no crash
9. I unlocked pool1, no crash
10. I booted UEFI MemTest v5 and let it run all 4 passes: no errors, 64GB ok
11. I updated my MB's firmware to the very latest (Supermicro X11SSL-CF)
12. I booted FreeNAS, no crash
13. I unlocked pool1, no crash
14. I unlocked pool2: CRASH within about 5 seconds - I guess the replication task was triggered
15. I booted up, unlocked pool1, no crash
16. I deleted the replication task
17. I unlocked pool2: no crash => I guess the problem came from replicating pool1->pool2

I've checked /data/crash - all the logs/dumps are there - and I could send them to someone who can use them if needed. From my side, I can only interpret it like this: trap 12, "supervisor read data, page not present" => HW/RAM/drivers/firmware.
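For anyone who wants to pull the same two facts (trap number and fault code) out of a panic message, here is a minimal Python sketch. It assumes the usual FreeBSD panic text layout ("Fatal trap N: ..." followed by a "fault code = ..." line); the function name `summarize_panic` and the sample text are made up for illustration and are not taken from the dump in /data/crash.

```python
import re

# Assumed layout of a FreeBSD kernel panic message; adjust if your dump differs.
TRAP_RE = re.compile(r"Fatal trap (\d+): (.+)")
FAULT_CODE_RE = re.compile(r"fault code\s*=\s*(.+)")


def summarize_panic(text):
    """Return (trap_number, trap_description, fault_code), or None if no trap line exists."""
    trap = TRAP_RE.search(text)
    if trap is None:
        return None
    code = FAULT_CODE_RE.search(text)
    fault_code = code.group(1).strip() if code else None
    return int(trap.group(1)), trap.group(2).strip(), fault_code


# Illustrative sample only, NOT the poster's real crash output.
SAMPLE = (
    "Fatal trap 12: page fault while in kernel mode\n"
    "fault code              = supervisor read data, page not present\n"
)

if __name__ == "__main__":
    print(summarize_panic(SAMPLE))
```

This only extracts what the panic text says. It cannot tell whether the cause is hardware, a driver, or firmware.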

The MB has nice active UPC power and is always at normal temperature; all the sensors were in the normal range after the crash.

What I don't understand:
- The problem seems to be related to the replication task, i.e. a SW error
- BUT: the POST after the reboot told me "no RAM" - how can this happen?!
- If the RAM were defective, I would expect some MemTest errors

The MB log entries:
<CODE>
13,System Event,08/24/2016 20:52:33 Wed,Watchdog 2,,Assertion: Watchdog 2| Event = Timer interrupt
14,System Event,08/24/2016 20:52:34 Wed,Watchdog 2,,Assertion: Watchdog 2| Event = Hard Reset</CODE>
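The two entries above can be parsed to see how close together they are. This is a minimal Python sketch. The column meanings (id, type, timestamp, sensor, empty field, description) are guessed from the two rows shown, and `parse_events` is a hypothetical helper, not part of any Supermicro or FreeNAS tool.

```python
import csv
from datetime import datetime

# The two rows exactly as posted above.
LOG_LINES = [
    "13,System Event,08/24/2016 20:52:33 Wed,Watchdog 2,,Assertion: Watchdog 2| Event = Timer interrupt",
    "14,System Event,08/24/2016 20:52:34 Wed,Watchdog 2,,Assertion: Watchdog 2| Event = Hard Reset",
]


def parse_events(lines):
    """Parse comma-separated event log rows into dicts with a parsed timestamp."""
    events = []
    for row in csv.reader(lines):
        if len(row) < 6:
            continue  # skip rows that do not match the assumed 6-column layout
        # Drop the trailing weekday name ("Wed") before parsing the timestamp.
        stamp = datetime.strptime(row[2].rsplit(" ", 1)[0], "%m/%d/%Y %H:%M:%S")
        events.append({
            "id": int(row[0]),
            "type": row[1],
            "time": stamp,
            "sensor": row[3],
            "description": row[5],
        })
    return events


if __name__ == "__main__":
    events = parse_events(LOG_LINES)
    gap = events[1]["time"] - events[0]["time"]
    print(f"{gap.total_seconds():.0f} s between timer interrupt and hard reset")
```

For these two rows the gap is one second. That would fit a watchdog timer expiring and then resetting a hung system, but that is one reading of the log, not something the log states.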

It looks like an OS bug to me, although I can't explain how a crash caused by an OS bug could make the POST fail - can it? Is there some kind of HW magic going on in the pool replication that can confuse the MB so badly that it can't find the RAM?

Can someone use my inputs for bug analysis? Do you need anything more from the logs/dumps?

The only thing I haven't done so far is stress-testing the system with some USB-booted tool. I'll do that as the very last thing I can think of at the moment. The situation is: I have a really bad feeling about my box, which had been running very stably for a few months.

Thank you,
Andrej
 

brumnas

Dabbler
Joined
Oct 4, 2015
Messages
33
I wouldn't call it a solution, but a workaround ;-) - I've deleted the automatic snapshot replication, after making sure the problem was NOT pool2 itself (I let it run for a few days - no problems).

As long as nobody tells me it's fixed, I will neither trust nor set up automatic replication.

The strangest part for me is still: how can a low-level SW problem cause the mainboard to fail the POST RAM test after the SW crashed the OS :-o??? I needed a power cycle to make things behave normally.

I asked the real hardware guys who sold me the RAM chips, and one of them confirmed that it's possible. But "how" - I didn't find out. But if I think about it: my MacBook Air sometimes freezes the internal camera in an ugly way: the LED stays lit and the camera stops working, and no reset helps, I need to power down. So, from this point of view, it's possible for a HW component to go crazy and need a power down. But a Supermicro server board?!..

I'm a SW architect/engineer and will add this to my troubleshooting bag as a new HW "could happen"..

If anybody cares, I can send the crash dump logs.
 