Help with diagnosing/next steps on crashes/reboots

Steve Kehlet · Jan 22, 2014

Hello, my new FreeNAS system is crashing every other day or so while I'm rsync'ing data onto it. I've tried to do my due diligence (googling, searching through this forum) but could now use some help interpreting the crash logs I have and/or suggestions on what to try next. We've run Memtest86+ several times and it hasn't found any problems.

FreeNAS 9.2.0 RELEASE
CPU: Xeon E3-1230V2
Mobo: ASUS P9D-X
~~RAM: 2x Kingston 16GB (2 x 8GB) 240-Pin DDR3 SDRAM DDR3 1333 (PC3 10600) ECC Registered Server Memory Model KVR1333D3D4R9SK2/16G~~
(Updated) RAM: Kingston Technology ValueRAM 32GB Kit of 4 (4 x 8 GB) DDR3 1600MHz PC3 12800 ECC CL11 DIMM with TS Server Workstation Memory KVR16E11K4/32
HBA: LSI SAS 9207-8i
HDs: 6x WD SE WD4000F9YZ

The full dmesg output is available here.
After the first reboot I created a 'syslog' ZFS dataset, and then found this in /var/log/messages after the next reboot:

Code:

Jan 19 17:38:35 freenas kernel: Fatal trap 9: general protection fault while in kernel mode
Jan 19 17:38:35 freenas kernel: cpuid = 6; apic id = 06
Jan 19 17:38:35 freenas kernel: instruction pointer    = 0x20:0xffffffff8089d150
Jan 19 17:38:35 freenas kernel: stack pointer          = 0x28:0xffffff88a9c1e1a0
Jan 19 17:38:35 freenas kernel: frame pointer          = 0x28:0xffffff88a9c1e1d0
Jan 19 17:38:35 freenas kernel: code segment            = base 0x0, limit 0xfffff, type 0x1b
Jan 19 17:38:35 freenas kernel: = DPL 0, pres 1, long 1, def32 0, gran 1
Jan 19 17:38:35 freenas kernel: processor eflags        = interrupt enabled, resume, IOPL = 0
Jan 19 17:38:35 freenas kernel: current process        = 5185 (rsync)
Jan 19 17:38:35 freenas kernel: Copyright (c) 1992-2013 The FreeBSD Project.
Jan 19 17:38:35 freenas kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994

After the most recent reboot there was nothing in /var/log/messages but I found the following in /data/crash/textdump.tar.last:

Code:

db:0:kdb.enter.default>  bt
Tracing pid 12 tid 100057 td 0xfffffe0009649000
igb_txeof() at igb_txeof+0x141/frame 0xffffff885af379e0
igb_msix_que() at igb_msix_que+0x9f/frame 0xffffff885af37a20
intr_event_execute_handlers() at intr_event_execute_handlers+0xfd/frame 0xffffff885af37a50
ithread_loop() at ithread_loop+0x9a/frame 0xffffff885af37aa0
fork_exit() at fork_exit+0x11f/frame 0xffffff885af37af0
fork_trampoline() at fork_trampoline+0xe/frame 0xffffff885af37af0
--- trap 0, rip = 0, rsp = 0xffffff885af37bb0, rbp = 0 ---

and

Code:

Fatal trap 9: general protection fault while in kernel mode
cpuid = 1; apic id = 01
instruction pointer    = 0x20:0xffffffff804a6871
stack pointer          = 0x28:0xffffff885af37980
frame pointer          = 0x28:0xffffff885af379e0
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process        = 12 (irq266: igb0:que 1)

The full textdump.tar.last is here. I see the backtrace there points at the ethernet interface (igb0). I don't know whether that's significant in itself (kernel bug?), or whether igb0 is simply what happened to be running when something else went wrong (e.g. a hardware problem).
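One way to compare the two trap reports above side by side is to pull out the process and instruction pointer from each. The following is only a minimal sketch: `parse_traps` and `SAMPLE` are hypothetical names introduced here, and the sample text is trimmed from the excerpts quoted in this post. If the crashes land in unrelated processes at unrelated addresses, that is at least consistent with a hardware fault rather than a single driver bug, though it does not prove it.

```python
import re

# Start of a FreeBSD trap report, with or without a syslog prefix in front.
TRAP_RE = re.compile(r"Fatal trap (\d+): (.+)$")
# "key = value" lines inside the report that are worth comparing.
FIELD_RE = re.compile(r"(instruction pointer|current process)\s*=\s*(.+)$")

# Trimmed from the two excerpts quoted in this post.
SAMPLE = """\
Jan 19 17:38:35 freenas kernel: Fatal trap 9: general protection fault while in kernel mode
Jan 19 17:38:35 freenas kernel: instruction pointer    = 0x20:0xffffffff8089d150
Jan 19 17:38:35 freenas kernel: current process        = 5185 (rsync)
Fatal trap 9: general protection fault while in kernel mode
instruction pointer    = 0x20:0xffffffff804a6871
current process        = 12 (irq266: igb0:que 1)
"""


def parse_traps(text):
    """Return one dict per 'Fatal trap' report found in the given log text."""
    traps = []
    current = None
    for line in text.splitlines():
        match = TRAP_RE.search(line)
        if match:
            # A new report begins; later field lines attach to it.
            current = {
                "trap": int(match.group(1)),
                "description": match.group(2).strip(),
            }
            traps.append(current)
            continue
        if current is None:
            continue  # ignore anything before the first trap header
        match = FIELD_RE.search(line)
        if match:
            key = match.group(1).replace(" ", "_")
            # Keep the first value seen for each field within a report.
            current.setdefault(key, match.group(2).strip())
    return traps


if __name__ == "__main__":
    for trap in parse_traps(SAMPLE):
        print(trap["trap"], trap.get("current_process"), trap.get("instruction_pointer"))
```

To run it against real logs, read /var/log/messages or the relevant file extracted from the textdump into a string and pass that to `parse_traps` instead of `SAMPLE`.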

I'm considering pulling the 4 DIMMs and trying them one at a time to try to isolate the crashes to a particular DIMM.

Any better suggestions? Can I provide any more information to be helpful? Thanks so much.

Steve Kehlet · Feb 3, 2014

Just wanted to follow up on this in case it's useful for future googlers: it was RAM after all. After isolating the problem to one or possibly two bad DIMMs, we replaced them and the system is now stable.

cyberjock · Feb 3, 2014

Well:

1. You were using registered DIMMs in a system that should be using unregistered DIMMs only. I'm surprised you didn't permanently damage anything.
2. (And this scares the sh*t out of me.) ECC should have either corrected the errors in RAM (if correctable, and it would have reported them) or the system should have halted. Guess the money spent on ASUS really wasn't money well spent if their board wasn't properly identifying and correcting errors (or halting, as applicable). Just a big fat fail for ASUS.

Steve Kehlet · Feb 4, 2014

Nice catch, thanks for noticing and saying something. I asked the guy who picked the cpu/ram/mobo and assembled the box about what you said, and it turns out that memory in fact didn't work--beep codes at POST. He then returned it for something else compatible. I wasn't even aware this had happened.
