Failed vMotions, dropped iSCSI connections, possible system reboot

Status
Not open for further replies.

zeroluck

Dabbler
Joined
Feb 12, 2015
Messages
43
System Landscape:

4 VMware hosts: ESXI-01, 02, 03, A
Separate iSCSI subnets for MPIO

2 FreeNAS boxes: Storage1, Storage2 - storage targets for VMware. I am not virtualizing FreeNAS; it holds backup/bulk storage volumes that are attached to virtual machines.

Storage1:

FreeNAS-9.3-STABLE-201509282017
  • Norco 4224 case
  • AP-RRP4ATX6808 Redundant 800 watt hot swap ATX power supply
  • Supermicro ATX DDR4 LGA 2011 Motherboards X10SRH-CLN4F-O
  • Intel Xeon E5-1620 v3 Haswell-EP 3.5GHz 4 x 256KB L2 Cache 10MB L3 Cache LGA 2011-3 140W BX80644E51620V3 Server Processor
  • Crucial 64GB Kit (16GBx4) DDR4 2133 (PC4-2133) DR x4 ECC Registered 288-Pin Server Memory CT4K16G4RFD4213 / CT4C16G4RFD4213
  • 2x Samsung 850 Pro 128GB 2.5-Inch SATA III Internal SSD BOOT
  • 2x Samsung 850 Pro 256GB 2.5-Inch SATA III Internal SSD L2ARC
  • 24x WD RED 3TB
  • 4X onboard Intel NIC
    • 2X MPIO iSCSI on separate subnets
    • 2X lagg0
Code:
ifconfig
em0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
    description: connected to ISCSI-SW2 (1/0/16)
    options=4219b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,WOL_MAGIC,VLAN_HWTSO>
    ether 00:15:17:fd:0d:2a
    inet 10.0.3.29 netmask 0xffffff00 broadcast 10.0.3.255
    nd6 options=9<PERFORMNUD,IFDISABLED>
    media: Ethernet autoselect (1000baseT <full-duplex>)
    status: active
em1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
    description: connected to ISCSI-SW1 (1/0/16)
    options=4019b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,VLAN_HWTSO>
    ether 00:15:17:fd:0d:2b
    inet 10.0.4.29 netmask 0xffffff00 broadcast 10.0.4.255
    nd6 options=9<PERFORMNUD,IFDISABLED>
    media: Ethernet autoselect (1000baseT <full-duplex>)
    status: active
bce0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    description: connected to ServerStack (1/0/8)
    options=c01bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,VLAN_HWTSO,LINKSTATE>
    ether 78:2b:cb:0a:4f:9a
    nd6 options=9<PERFORMNUD,IFDISABLED>
    media: Ethernet autoselect (1000baseT <full-duplex>)
    status: active
bce1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    description: connected to ServerStack (2/0/40)
    options=c01bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,VLAN_HWTSO,LINKSTATE>
    ether 78:2b:cb:0a:4f:9a
    nd6 options=9<PERFORMNUD,IFDISABLED>
    media: Ethernet autoselect (1000baseT <full-duplex>)
    status: active
ipfw0: flags=8801<UP,SIMPLEX,MULTICAST> metric 0 mtu 65536
    nd6 options=9<PERFORMNUD,IFDISABLED>
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
    options=600003<RXCSUM,TXCSUM,RXCSUM_IPV6,TXCSUM_IPV6>
    inet6 ::1 prefixlen 128
    inet6 fe80::1%lo0 prefixlen 64 scopeid 0x8
    inet 127.0.0.1 netmask 0xff000000
    nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    options=c01bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,VLAN_HWTSO,LINKSTATE>
    ether 78:2b:cb:0a:4f:9a
    inet 10.0.50.45 netmask 0xfffffe00 broadcast 10.0.51.255
    nd6 options=9<PERFORMNUD,IFDISABLED>
    media: Ethernet autoselect
    status: active
    laggproto lacp lagghash l2,l3,l4
    laggport: bce1 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
    laggport: bce0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>


Storage2:

FreeNAS-9.3-STABLE-201601181840

Server:
  • System: Dell PowerEdge r310
  • Intel(R) Xeon(R) CPU X3450 @ 2.67GHz
  • 72GB ECC Registered PC3-8500 DDR3 (Soon to be 56GB)
  • 2X SanDisk Micro Fit 64GB ultra fast boot USB mirrored
  • LSI SAS 9201-16e - Flashed to IT mode
  • SuperMicro 16 Bay JBOD chassis
  • 12x WD RED 3TB (2 in PowerEdge bays, 6 in Habey Enclosure)
  • 6x Barracuda XT 3T
  • 2x Barracuda 7200.11 3TB
  • 2 Port Intel NIC for MPIO iSCSI on separate subnets
  • 2 Port onboard NIC, server stack lagg0
Enclosure:
  • SuperMicro 16 bay SAS Enclosure SC936A-R1200B
  • SFF-8088 Cables 1M x4
  • 2ft SFF-8087 Cables >2ft x4
  • 8087-8088 adapter x2
  • SuperMicro CSE-PTJBOD-CS3 JBOD chassis controller w/IPMI
Code:
ifconfig
em0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
    description: connected to ISCSI-SW1 (1/0/11)
    options=4219b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,WOL_MAGIC,VLAN_HWTSO>
    ether 00:15:17:75:8f:50
    inet 10.0.3.25 netmask 0xffffff00 broadcast 10.0.3.255
    nd6 options=9<PERFORMNUD,IFDISABLED>
    media: Ethernet autoselect (1000baseT <full-duplex>)
    status: active
em1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
    description: connected to ISCSI-SW2 (1/0/11)
    options=4019b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,VLAN_HWTSO>
    ether 00:15:17:75:8f:51
    inet 10.0.4.25 netmask 0xffffff00 broadcast 10.0.4.255
    nd6 options=9<PERFORMNUD,IFDISABLED>
    media: Ethernet autoselect (1000baseT <full-duplex>)
    status: active
bge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    description: connected to ServerStack (1/0/26)
    options=c019b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,VLAN_HWTSO,LINKSTATE>
    ether 90:b1:1c:10:5b:cc
    nd6 options=9<PERFORMNUD,IFDISABLED>
    media: Ethernet autoselect (1000baseT <full-duplex>)
    status: active
bge1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    description: connected to ServerStack (2/0/26)
    options=c019b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,VLAN_HWTSO,LINKSTATE>
    ether 90:b1:1c:10:5b:cc
    nd6 options=9<PERFORMNUD,IFDISABLED>
    media: Ethernet autoselect (1000baseT <full-duplex>)
    status: active
ipfw0: flags=8801<UP,SIMPLEX,MULTICAST> metric 0 mtu 65536
    nd6 options=9<PERFORMNUD,IFDISABLED>
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
    options=600003<RXCSUM,TXCSUM,RXCSUM_IPV6,TXCSUM_IPV6>
    inet6 ::1 prefixlen 128
    inet6 fe80::1%lo0 prefixlen 64 scopeid 0x8
    inet 127.0.0.1 netmask 0xff000000
    nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    options=c019b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,VLAN_HWTSO,LINKSTATE>
    ether 90:b1:1c:10:5b:cc
    inet 10.0.50.43 netmask 0xfffffe00 broadcast 10.0.51.255
    nd6 options=9<PERFORMNUD,IFDISABLED>
    media: Ethernet autoselect
    status: active
    laggproto lacp lagghash l2,l3,l4
    laggport: bge1 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
    laggport: bge0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
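
Both ifconfig dumps above show the iSCSI interfaces (em0/em1) at mtu 9000 and the lagg0 members at mtu 1500. As a minimal sketch for comparing the two boxes, the Python below pulls the interface name, MTU and IPv4 addresses out of saved ifconfig output. The function name `summarize_ifconfig` and the shortened `SAMPLE` text are my own illustrations, not anything FreeNAS ships.

```python
import re

# Matches an interface header line such as
# "em0: flags=8843<UP,...> metric 0 mtu 9000"
HEADER = re.compile(r"^(\S+): flags=\S+ .*\bmtu (\d+)")
# Matches an IPv4 line such as "    inet 10.0.3.25 netmask 0xffffff00 ..."
# (the trailing space keeps "inet6" lines from matching)
INET = re.compile(r"^\s+inet (\d+\.\d+\.\d+\.\d+) ")


def summarize_ifconfig(text):
    """Return {interface: {"mtu": int, "inet": [addresses]}} from ifconfig output."""
    summary = {}
    current = None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            current = header.group(1)
            summary[current] = {"mtu": int(header.group(2)), "inet": []}
            continue
        inet = INET.match(line)
        if inet and current is not None:
            summary[current]["inet"].append(inet.group(1))
    return summary


# Shortened excerpt of the Storage2 output above, used only as sample input.
SAMPLE = """\
em0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
    inet 10.0.3.25 netmask 0xffffff00 broadcast 10.0.3.255
lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    inet 10.0.50.43 netmask 0xfffffe00 broadcast 10.0.51.255
"""

for name, info in summarize_ifconfig(SAMPLE).items():
    print(name, info["mtu"], info["inet"])
```

Running it against the full output of each box makes it easy to spot an interface whose MTU differs from its peer on the same subnet. It only reads what ifconfig reports, so it says nothing about the MTU configured on the switches or the ESXi hosts.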


Problem description:

Large vMotion operations between the FreeNAS arrays fail. Smaller transfers (less than 200 GB) succeed. The 3 TB to 14 TB transfers I have tried all fail partway through, in the middle of the night. Error from vSphere (screenshot attached):

Code:
Remote connection failure
Failed to establish transport connection (9): Virtual machine config file does not exist..
Unable to load configuration file '/vmfs/volumes/55831b79-badbf7d0-b07e-90b11c1caa0c/xProtect/xProtect.vmx'.
Disconnected from virtual machine.


The daily security run output from Storage1 and Storage2 has come back, and both show a networking issue: the boxes were unable to reach the iSCSI interfaces of ESXi-03, the host where the virtual machine being migrated lives.

Storage1 security run output:

Code:
storage1.brewerscience.com kernel log messages:
mps0: SAS Address for SATA device = d2644339d9cac06d
mps0: SAS Address from SATA device = d2644339d9cac06d
mps1: SAS Address for SATA device = d2644438d8cbbe71
mps1: SAS Address from SATA device = d2644438d8cbbe71
mps1: SAS Address for SATA device = d2644334dcc9bd74
mps1: SAS Address from SATA device = d2644334dcc9bd74
mps0: SAS Address for SATA device = d2644337d9c3ba75
mps0: SAS Address from SATA device = d2644337d9c3ba75
mps1: SAS Address for SATA device = d2644334dec9ba71
mps1: SAS Address from SATA device = d2644334dec9ba71
mps2: SAS Address for SATA device = d2625d44fbc4da92
mps1: SAS Address for SATA device = d2644334dbc5b96f
mps2: SAS Address from SATA device = d2625d44fbc4da92
mps1: SAS Address from SATA device = d2644334dbc5b96f
mps1: SAS Address for SATA device = d2644537ddcbb875
mps1: SAS Address from SATA device = d2644537ddcbb875
mps1: SAS Address for SATA device = d2644534dbc6ba74
mps2: SAS Address for SATA device = d2645148ede2e08f
mps1: SAS Address from SATA device = d2644534dbc6ba74
mps2: SAS Address from SATA device = d2645148ede2e08f
mps1: SAS Address for SATA device = d2625d3cf8e7b791
mps1: SAS Address from SATA device = d2625d3cf8e7b791
mps1: SAS Address for SATA device = d269634e02bcdf89
mps2: SAS Address for SATA device = d2625d33d8e5d26f
mps1: SAS Address from SATA device = d269634e02bcdf89
mps2: SAS Address from SATA device = d2625d33d8e5d26f
SMP: AP CPU #6 Launched!
SMP: AP CPU #3 Launched!
SMP: AP CPU #7 Launched!
Timecounter "TSC-low" frequency 1750034246 Hz quality 1000
vboxdrv: fAsync=0 offMin=0x2cd offMax=0x1016
pid 3990 (sssd_be), uid 0: exited on signal 6 (core dumped)
pid 4027 (sssd_be), uid 0: exited on signal 6 (core dumped)
WARNING: 10.0.4.51 (iqn.1998-01.com.vmware:esxi-03-263b589c): no ping reply (NOP-Out) after 5 seconds; dropping connection
WARNING: 10.0.3.51 (iqn.1998-01.com.vmware:esxi-03-263b589c): no ping reply (NOP-Out) after 5 seconds; dropping connection
pid 4437 (sssd_be), uid 0: exited on signal 6 (core dumped)
pid 5121 (sssd_be), uid 0: exited on signal 6 (core dumped)
pid 3451 (syslog-ng), uid 0: exited on signal 6 (core dumped)
-- End of security output --


Notice the "no ping reply (NOP-Out) after 5 seconds; dropping connection" warnings.
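
For anyone who wants to tally these warnings over a longer log, here is a minimal sketch in Python. It assumes the warning text has exactly the shape shown in the security run output above; the function name `count_dropped_connections` is hypothetical, and `SAMPLE` is just three lines copied from that output.

```python
import re
from collections import Counter

# Pattern for the iSCSI target warning shown in the security run output.
NOP_TIMEOUT = re.compile(
    r"WARNING: (?P<ip>\d+\.\d+\.\d+\.\d+) \((?P<iqn>[^)]+)\): "
    r"no ping reply \(NOP-Out\) after (?P<secs>\d+) seconds; dropping connection"
)


def count_dropped_connections(log_text):
    """Count dropped iSCSI connections per (initiator IP, initiator IQN) pair."""
    drops = Counter()
    for line in log_text.splitlines():
        match = NOP_TIMEOUT.search(line)
        if match:
            drops[(match.group("ip"), match.group("iqn"))] += 1
    return drops


# Lines copied from the Storage1 output above, used only as sample input.
SAMPLE = """\
pid 4027 (sssd_be), uid 0: exited on signal 6 (core dumped)
WARNING: 10.0.4.51 (iqn.1998-01.com.vmware:esxi-03-263b589c): no ping reply (NOP-Out) after 5 seconds; dropping connection
WARNING: 10.0.3.51 (iqn.1998-01.com.vmware:esxi-03-263b589c): no ping reply (NOP-Out) after 5 seconds; dropping connection
"""

for (ip, iqn), count in sorted(count_dropped_connections(SAMPLE).items()):
    print(ip, iqn, count)
```

Grouping by initiator IP shows whether one MPIO path or both are dropping, and whether a single host accounts for all of the drops.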

Storage2 security run output:

Code:
storage2.brewerscience.com kernel log messages:
WARNING: 10.0.4.51 (iqn.1998-01.com.vmware:esxi-03-263b589c): no ping reply (NOP-Out) after 5 seconds; dropping connection
-- End of security output --


Looking at the logs of Storage1, it appears as if the box rebooted itself! /var/log/messages here: http://pastebin.com/hYVkqJDa

Has anyone seen anything like this before? I see the ctl_datamove errors, but those are hours before the reboot. Nagios monitoring reports that Storage1 went offline at exactly 11:59PM CST, and the dmesg log corroborates that: boot messages start at Jan 27 00:04:52.

Is there another logfile I can look in to see what was happening right before the machine rebooted? Anywhere else I should be looking? Guidance on debug logging would be great too: if I knew what to enable, I could reproduce the problem and see if the logs contain more.
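
As a rough way to see what was logged right before each reboot, the Python sketch below looks for the kernel copyright banner in a saved copy of /var/log/messages and returns the lines just ahead of it. It assumes the boot banner contains "Copyright (c)" and "The FreeBSD Project". The function name `lines_before_each_boot` is hypothetical, and the sample lines are placeholders rather than real log content (only the Jan 27 00:04:52 timestamp comes from this post).

```python
# Assumption: the kernel prints a copyright banner containing these strings at boot.
BANNER_PARTS = ("Copyright (c)", "The FreeBSD Project")


def lines_before_each_boot(log_lines, context=5):
    """For every boot banner found, return the `context` lines logged just before it."""
    results = []
    for index, line in enumerate(log_lines):
        if all(part in line for part in BANNER_PARTS):
            start = max(0, index - context)
            results.append(log_lines[start:index])
    return results


# Placeholder log lines for illustration only; they are not taken from the real log.
SAMPLE = [
    "Jan 26 23:57:00 storage1 placeholder: earlier message",
    "Jan 26 23:58:00 storage1 placeholder: last message before the outage",
    "Jan 27 00:04:52 storage1 kernel: Copyright (c) 1992-2014 The FreeBSD Project.",
    "Jan 27 00:04:52 storage1 kernel: placeholder boot message",
]

for block in lines_before_each_boot(SAMPLE, context=2):
    print("--- lines before a boot ---")
    for entry in block:
        print(entry)
```

If the box panicked, the last lines before the banner may show nothing unusual, because the kernel can go down before syslog writes anything. An empty or unremarkable result does not rule out a crash.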
 

Attachments

  • Screen Shot 2016-01-27 at 7.50.27 AM.png

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Can you get a debug file from your system and post it? It may contain data that is sensitive if this is business related. If so, please PM me and include the debug file and I'll let you know what I see.
 

zeroluck

Dabbler
Joined
Feb 12, 2015
Messages
43
Can you get a debug file from your system and post it? It may contain data that is sensitive if this is business related. If so, please PM me and include the debug file and I'll let you know what I see.
PM'd. I am okay with relevant log information being posted in this thread to help others troubleshoot in the future, so long as there's nothing business related in it. I think it's unlikely that there's anything sensitive in the logs, but since I don't know everything that is in the debug file, I've kept the whole thing private.
 

zambanini

Patron
Joined
Sep 11, 2013
Messages
479
we had that in the forum about two weeks ago. remove the l2arc and see if it still happens.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
You had a kernel panic (actually, you've had several panics) due to an iSCSI bug that we fixed months ago.

invalid serialization value -2126892144

If you update to the latest version your system shouldn't crash anymore. :)
 

zeroluck

Dabbler
Joined
Feb 12, 2015
Messages
43
we had that in the forum about two weeks ago. remove the l2arc and see if it still happens.
Can you link the forum thread in here? I searched but can't seem to find it; I'm probably searching for the wrong terms.

You had a kernel panic (actually, you've had several panics) due to an iSCSI bug that we fixed months ago.

invalid serialization value -2126892144

If you update to the latest version your system shouldn't crash anymore. :)

I'll give that a shot tonight. One of the reasons I am doing the vMotion is to be able to reboot and update this thing and make my volume bigger.
 

zeroluck

Dabbler
Joined
Feb 12, 2015
Messages
43
Updating storage1 to the latest version of FreeNAS resolved this issue for me! Thanks for your help!
 