Spontaneous reboots under iSCSI load

Status
Not open for further replies.

Will Dormann

Explorer
Joined
Feb 10, 2015
Messages
61
Hi folks,

This one is driving me nuts. I'm testing an iSCSI setup to provide VM storage to vSphere. However, under load the FreeNAS system spontaneously reboots, and I can't see any evidence of a kernel panic in /var/crash (the only file there is minfree).

First I ran the extended Dell diagnostics, then memtest86; everything passed. Eventually I took the FreeNAS USB sticks and moved them to a known-good server. Same result: the system reboots without leaving any evidence.

Which in my mind leaves a couple of possibilities:
  1. One of the cards that carried over to the known-good server doesn't play well with FreeBSD and/or iSCSI
  2. Something is wrong with the FreeNAS installation / configuration
As for option 1, the relevant hardware that has been present in both servers is:

mpr0@pci0:6:0:0: class=0x010700 card=0x30a01000 chip=0x00971000 rev=0x02 hdr=0x00
vendor = 'LSI Logic / Symbios Logic'
device = 'SAS3008 PCI-Express Fusion-MPT SAS-3'
class = mass storage
subclass = SAS
ix1@pci0:35:0:1: class=0x020000 card=0x00038086 chip=0x10fb8086 rev=0x01 hdr=0x00
vendor = 'Intel Corporation'
device = '82599EB 10-Gigabit SFI/SFP+ Network Connection'
class = network
subclass = ethernet


Both systems are AMD-based multi-core (one server has 16 cores, the other 64) and have 256GB ECC RAM. FreeNAS version:
FreeBSD <HOSTNAME> 9.3-RELEASE-p29 FreeBSD 9.3-RELEASE-p29 #0 r288272+dc0354b: Sun Nov 1 18:57:19 PST 2015 root@build3.ixsystems.com:/tank/home/stable-builds/FN/objs/os-base/amd64/tank/home/stable-builds/FN/FreeBSD/src/sys/FREENAS.amd64 amd64

The questions I have at this point are:
  1. Am I looking in the wrong place for evidence of a kernel panic or other cause of a reboot?
  2. Does FreeNAS automatically reboot in case of a kernel panic? I get the impression that the answer is "no" based on:
    [root@freenas] ~# sysctl debug.debugger_on_panic
    debug.debugger_on_panic: 1
  3. Is there anything obviously incompatible with the hardware described above?
  4. Are there any other tips for where I should go from here?
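
For reference, my understanding of how crash-dump capture works on stock FreeBSD (I'm not sure how much of this FreeNAS manages itself, so treat it as a sketch rather than gospel):

    [root@freenas] ~# sysctl debug.debugger_on_panic   # 1 = drop into the debugger on panic
    [root@freenas] ~# swapinfo                         # a dump needs a swap/dump device large enough to hold it
    # In stock FreeBSD these rc.conf knobs control dump capture; FreeNAS
    # generates its own rc.conf, so this part is illustrative only:
    #   dumpdev="AUTO"        # write kernel dumps to the swap partition on panic
    #   dumpdir="/var/crash"  # where savecore(8) extracts them on the next boot

If savecore never runs (or there's no dump device), /var/crash stays empty apart from minfree, which would match what I'm seeing.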
Thanks!
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Is there anything obviously incompatible with the hardware described above?

Incompatible, no, but that SAS3 controller is still in relative infancy as far as driver support versus the more mature SAS2 chips.

Are you able to replicate this on-demand (eg: set it up, throw massive load at it, and it will crash) or is it more intermittent than that?
 

Will Dormann

Explorer
Joined
Feb 10, 2015
Messages
61
Initially I thought it was only intermittent, but it seems there's a chance I can replicate it on demand. I also have a SAS2 card that I can swap in to see if that makes a difference.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Since you're currently in an intermittent-crash scenario I'd consider swapping it. Make sure your backups are good first, just in case something decides to get stupid.

Are you using any SAS2 gear other than drives (eg: backplanes, expanders, etc) that might be causing issues? @depasseg is using some SAS3 gear I believe and if I recall, he had an issue where SAS3 HBA + SAS2 expander resulted in drives deciding to disappear at random. That still wouldn't explain a reboot though.
 

Will Dormann

Explorer
Joined
Feb 10, 2015
Messages
61
Backups are good, and physically separate from the current system. I haven't seen any drives disappear. I have a SAS3 enclosure and a SAS2 enclosure. The option that I might try is the SAS2 card, which would mean that both enclosures will be at SAS2 speed.

Alternatively, I could run the SAS3 enclosure off of the SAS3 card, and the SAS2 enclosure off of the SAS2 card. But I'm sort of looking to minimize variables at this point!
 

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874
Backups are good, and physically separate from the current system. I haven't seen any drives disappear. I have a SAS3 enclosure and a SAS2 enclosure. The option that I might try is the SAS2 card, which would mean that both enclosures will be at SAS2 speed.

Alternatively, I could run the SAS3 enclosure off of the SAS3 card, and the SAS2 enclosure off of the SAS2 card. But I'm sort of looking to minimize variables at this point!
Yep, I'm sticking with SAS 2 enclosure on a SAS 2 controller and SAS 3 on SAS 3. I haven't tried it since the recent update to P20 and P10 though so maybe those issues no longer exist.
 

Will Dormann

Explorer
Joined
Feb 10, 2015
Messages
61
I've switched to running both arrays off of my older SAS2 card. It ran under heavy load for about two days before it finally panicked. I guess a panic is more useful than a spontaneous reboot, but having a file server panic under load isn't really acceptable. The particular panic is:

    Fatal trap 12: page fault while in kernel mode
    ...
    Stopped at xpt_freeze_devq+0xd

Screenshot:
RB9UFfP.jpg
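
When it drops to the debugger like this, my plan for next time (assuming this is the ddb prompt, which the debugger_on_panic=1 output earlier suggests, and that a dump device is configured) is to grab more detail before rebooting:

    db> bt            # print the kernel backtrace
    db> call doadump  # write a crash dump to the dump device
    db> reset         # reboot the machine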



Any ideas?
 

Will Dormann

Explorer
Joined
Feb 10, 2015
Messages
61
Is it possible that one or more of the prior spontaneous reboots corrupted a zpool in such a way that the zpool reports that it's fine, but it's really corrupt in a way that could cause a kernel panic? If that's possible, it makes me a bit uneasy about this system. But if that's the case, I suppose I could just restore from backup. That is, assuming the corruption wouldn't have propagated through a zfs send | zfs recv backup!
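
In the meantime, the pool-integrity checks I know of (a sketch; "tank" is a placeholder for the actual pool name, and my understanding is that a scrub re-verifies every block checksum, but I'd welcome corrections):

    [root@freenas] ~# zpool status -v tank   # current health, plus any files with known errors
    [root@freenas] ~# zpool scrub tank       # re-read and checksum-verify every block in the pool
    [root@freenas] ~# zpool status -v tank   # re-check once the scrub completes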
 

Will Dormann

Explorer
Joined
Feb 10, 2015
Messages
61
While I've not yet got it completely nailed down, I've got evidence that it was a corrupt zpool. One of the troubleshooting steps that I performed was to reinstall a clean version of FreeNAS. And while booting the FreeNAS CD, I got the exact same panic!
ZLOVuOr.jpg


It happened 100% of the time, right after the SAS devices were detected. By unplugging the external arrays, I was able to boot into FreeNAS successfully, and from there I was able to delete the zpools. Once I did that, FreeNAS would come up fine on reboot. Right now I've recreated the SAS2 array, restored it from backup successfully, and it survives reboots as well. I plan on leaving the SAS3 array unplugged/unconfigured until I'm confident in the stability of the original array.

It would seem that the above is probably the worst possible way that a zpool can fail, but that's what the evidence is pointing to at the moment.
 

Will Dormann

Explorer
Joined
Feb 10, 2015
Messages
61
It looks like the "corrupt zpool" theory was a red herring. Simply having the SAS3 array daisy-chained to the original SAS2 array causes the above panic, even with no zpool configured on the SAS3 array; all of the drives in the SAS3 array were wiped.

The current theory is that, as depasseg mentioned, mixing and matching SAS3 and SAS2 equipment is a bad idea. The original configuration had both the SAS2 and SAS3 arrays hung off of the SAS3 card, which was unstable under load. The next configuration had both arrays hung off of the SAS2 card, which produced the 100% reproducible panic outlined above.

So for now, I've got the SAS2 array on a SAS2 card, and the SAS3 array on a SAS3 card.
 

Will Dormann

Explorer
Joined
Feb 10, 2015
Messages
61
Apparently this is a known bug that is outlined in the thread Spontaneous system crashes with iSCSI LUNs. o_O

For the record, the panic that was causing the reboots mentioned in this thread is:

Dump header from device /dev/dumpdev
Architecture: amd64
Architecture Version: 1
Dump Length: 122368B (0 MB)
Blocksize: 512
Dumptime: Thu Dec 3 12:51:21 2015
Hostname: <HOSTNAME>
Magic: FreeBSD Text Dump
Version String: FreeBSD 9.3-RELEASE-p28 #0 r288272+a23e16d: Wed Nov 4 00:20:46 PST 2015
root@build3.ixsystems.com:/tank/home/stable-builds/FN/objs/os-base/amd64/tank/home/stable-builds/FN/FreeBSD/src/sys
Panic String: ctl_check_for_blockage: Invalid serialization value 1868980075 for 2 => 14
Dump Parity: 3789621530
Bounds: 0
Dump Status: good
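
(The header above came from a textdump. In case it helps anyone else, this is roughly how I pulled it apart; the .0 suffix is savecore's usual numbering, so adjust to match what's in your crash directory:)

    [root@freenas] ~# ls /var/crash
    [root@freenas] ~# cat /var/crash/info.0                # the dump header shown above
    [root@freenas] ~# tar -xvf /var/crash/textdump.tar.0   # textdumps are saved as tar archives
    [root@freenas] ~# cat panic.txt msgbuf.txt             # panic string and kernel message buffer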

Ignore the other red herrings above, such as the suspected zpool corruption; the panic screenshots were actually the result of unsupported daisy-chaining of JBODs. There was nothing wrong with any of the zpools.
 