Hello, my FreeNAS 9.10.1 (latest updates applied about a month ago) has started crashing badly. I suspect the replication task (pool1 -> pool2, both local), which had never caused problems before. After I rebooted the box 2 days ago, this happened:
1. I triggered a reboot manually, as I had a "strange" feeling (2 jails not getting network)
2. After reboot, I unlocked pool1 - ok
3. I unlocked pool2 - crash in a few seconds
4. While the box was rebooting, my Supermicro MB beeped loudly, complaining "No RAM found"
5. I powered it down and up again and rebooted
6. While rebooting, it crashed again
7. It rebooted and came up OK, both pools locked
8. I kept it running for some time, no crash
9. I unlocked pool1, no crash
10. I booted UEFI MemTest v5 and let it run all 4 passes: no errors, 64GB ok
11. I updated my MB's firmware to the very latest version (Supermicro SM X11SSL-CF)
12. I booted FreeNAS, no crash
13. I unlocked pool1, no crash
14. I unlocked pool2: CRASH within about 5 seconds - I guess the replication task was triggered
15. I booted up, unlocked pool1, no crash
16. I deleted the replication task
17. I unlocked pool2: no crash => I guess the problem came from replicating pool1->pool2
I've checked /data/crash - all the logs/dumps are there - and I can send them to anyone who can use them. From my side, I can only interpret it like this: trap 12, "supervisor read data, page not present" => HW/RAM/drivers/firmware.
The MB has good active UPC power and always runs at normal temperature; all the sensors were in the normal range after the crash.
What I don't understand:
- The problem seems to be related to the replication task, i.e. a SW error
- BUT: the POST during the reboot reported "no RAM" - how can this happen?!
- If the RAM were defective, I would expect some MemTest errors
The MB log entries:
<CODE>
13,System Event,08/24/2016 20:52:33 Wed,Watchdog 2,,Assertion: Watchdog 2| Event = Timer interrupt
14,System Event,08/24/2016 20:52:34 Wed,Watchdog 2,,Assertion: Watchdog 2| Event = Hard Reset</CODE>
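In case it helps anyone compare timestamps, here is a minimal Python sketch that parses the two entries above. The column order (id, type, timestamp, sensor, empty field, description) and the MM/DD/YYYY date format are my assumptions based on these two lines only, and `parse_events` is just a name I made up for illustration.

```python
import csv
from datetime import datetime

# The two entries copied from the MB event log above.
LOG = """\
13,System Event,08/24/2016 20:52:33 Wed,Watchdog 2,,Assertion: Watchdog 2| Event = Timer interrupt
14,System Event,08/24/2016 20:52:34 Wed,Watchdog 2,,Assertion: Watchdog 2| Event = Hard Reset
"""

def parse_events(text):
    """Return (id, timestamp, sensor, event) tuples from the CSV-style log."""
    events = []
    for row in csv.reader(text.splitlines()):
        if len(row) < 6:
            continue  # skip blank or malformed lines
        # Assumed columns: id, type, timestamp, sensor, (empty), description
        date_part = row[2].rsplit(" ", 1)[0]  # drop the trailing weekday name
        stamp = datetime.strptime(date_part, "%m/%d/%Y %H:%M:%S")
        event = row[5].split("Event =", 1)[-1].strip()
        events.append((int(row[0]), stamp, row[3], event))
    return events

events = parse_events(LOG)
for ev_id, stamp, sensor, event in events:
    print(ev_id, stamp.isoformat(), sensor, event)

# Time between the watchdog timer interrupt and the hard reset, in seconds.
gap = (events[1][1] - events[0][1]).total_seconds()
print("seconds between events:", gap)
```

For these two lines, the hard reset is logged one second after the watchdog timer interrupt.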
It looks like an OS bug to me, although I can't explain how the POST could fail because of a crash caused by an OS-level bug - can it? Is there some kind of HW magic going on in the pool replication that can confuse the MB so badly that it can't find the RAM?
Can someone use my inputs for bug analysis? Do you need something more from the logs/dumps?
The only thing I haven't done so far is stress-testing the system with some USB-booted tool. I'll do that as the last thing I can think of at the moment. The situation is: I have a really bad feeling about my box, which had been running very stably for a few months.
Thank you,
Andrej