System Landscape:
4 VMware hosts: ESXi-01, -02, -03, and -A
Separate iSCSI subnets for MPIO
2 FreeNAS boxes: Storage1, Storage2 - storage targets for VMware -- I am not virtualizing FreeNAS; the boxes hold backup/bulk storage volumes that are attached to the virtual machines.
Storage1:
FreeNAS-9.3-STABLE-201509282017
- Norco 4224 case
- AP-RRP4ATX6808 Redundant 800 watt hot swap ATX power supply
- Supermicro ATX DDR4 LGA 2011 Motherboard X10SRH-CLN4F-O
- Intel Xeon E5-1620 v3 Haswell-EP 3.5GHz 4 x 256KB L2 Cache 10MB L3 Cache LGA 2011-3 140W BX80644E51620V3 Server Processor
- Crucial 64GB Kit (16GBx4) DDR4 2133 (PC4-2133) DR x4 ECC Registered 288-Pin Server Memory CT4K16G4RFD4213 / CT4C16G4RFD4213
- 2x Samsung 850 Pro 128GB 2.5-Inch SATA III Internal SSD BOOT
- 2x Samsung 850 Pro 256GB 2.5-Inch SATA III Internal SSD L2ARC
- 24x WD RED 3TB
- 4X onboard Intel NIC
- 2X MPIO iSCSI on separate subnets
- 2X lagg0
Code:
ifconfig
em0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
        description: connected to ISCSI-SW2 (1/0/16)
        options=4219b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,WOL_MAGIC,VLAN_HWTSO>
        ether 00:15:17:fd:0d:2a
        inet 10.0.3.29 netmask 0xffffff00 broadcast 10.0.3.255
        nd6 options=9<PERFORMNUD,IFDISABLED>
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
em1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
        description: connected to ISCSI-SW1 (1/0/16)
        options=4019b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,VLAN_HWTSO>
        ether 00:15:17:fd:0d:2b
        inet 10.0.4.29 netmask 0xffffff00 broadcast 10.0.4.255
        nd6 options=9<PERFORMNUD,IFDISABLED>
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
bce0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        description: connected to ServerStack (1/0/8)
        options=c01bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,VLAN_HWTSO,LINKSTATE>
        ether 78:2b:cb:0a:4f:9a
        nd6 options=9<PERFORMNUD,IFDISABLED>
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
bce1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        description: connected to ServerStack (2/0/40)
        options=c01bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,VLAN_HWTSO,LINKSTATE>
        ether 78:2b:cb:0a:4f:9a
        nd6 options=9<PERFORMNUD,IFDISABLED>
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
ipfw0: flags=8801<UP,SIMPLEX,MULTICAST> metric 0 mtu 65536
        nd6 options=9<PERFORMNUD,IFDISABLED>
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
        options=600003<RXCSUM,TXCSUM,RXCSUM_IPV6,TXCSUM_IPV6>
        inet6 ::1 prefixlen 128
        inet6 fe80::1%lo0 prefixlen 64 scopeid 0x8
        inet 127.0.0.1 netmask 0xff000000
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=c01bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,VLAN_HWTSO,LINKSTATE>
        ether 78:2b:cb:0a:4f:9a
        inet 10.0.50.45 netmask 0xfffffe00 broadcast 10.0.51.255
        nd6 options=9<PERFORMNUD,IFDISABLED>
        media: Ethernet autoselect
        status: active
        laggproto lacp lagghash l2,l3,l4
        laggport: bce1 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
        laggport: bce0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
Storage2:
FreeNAS-9.3-STABLE-201601181840
Server:
- System: Dell PowerEdge r310
- Intel(R) Xeon(R) CPU X3450 @ 2.67GHz
- 72GB ECC Registered PC3-8500 DDR3 (Soon to be 56GB)
- 2X SanDisk Micro Fit 64GB USB boot drives, mirrored
- LSI SAS 9201-16e - Flashed to IT mode
- SuperMicro 16 Bay JBOD chassis
- 12x WD RED 3TB (2 in PowerEdge bays, 6 in Habey Enclosure)
- 6x Barracuda XT 3TB
- 2x Barracuda 7200.11 3TB
- 2 Port Intel NIC for MPIO iSCSI on separate subnets
- 2 Port onboard NIC, server stack lagg0
- SuperMicro 16 bay SAS Enclosure SC936A-R1200B
- SFF-8088 Cables 1M x4
- SFF-8087 Cables 2ft x4
- 8087-8088 adapter x2
- SuperMicro CSE-PTJBOD-CS3 JBOD chassis controller w/IPMI
Code:
ifconfig
em0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
        description: connected to ISCSI-SW1 (1/0/11)
        options=4219b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,WOL_MAGIC,VLAN_HWTSO>
        ether 00:15:17:75:8f:50
        inet 10.0.3.25 netmask 0xffffff00 broadcast 10.0.3.255
        nd6 options=9<PERFORMNUD,IFDISABLED>
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
em1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
        description: connected to ISCSI-SW2 (1/0/11)
        options=4019b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,VLAN_HWTSO>
        ether 00:15:17:75:8f:51
        inet 10.0.4.25 netmask 0xffffff00 broadcast 10.0.4.255
        nd6 options=9<PERFORMNUD,IFDISABLED>
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
bge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        description: connected to ServerStack (1/0/26)
        options=c019b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,VLAN_HWTSO,LINKSTATE>
        ether 90:b1:1c:10:5b:cc
        nd6 options=9<PERFORMNUD,IFDISABLED>
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
bge1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        description: connected to ServerStack (2/0/26)
        options=c019b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,VLAN_HWTSO,LINKSTATE>
        ether 90:b1:1c:10:5b:cc
        nd6 options=9<PERFORMNUD,IFDISABLED>
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
ipfw0: flags=8801<UP,SIMPLEX,MULTICAST> metric 0 mtu 65536
        nd6 options=9<PERFORMNUD,IFDISABLED>
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
        options=600003<RXCSUM,TXCSUM,RXCSUM_IPV6,TXCSUM_IPV6>
        inet6 ::1 prefixlen 128
        inet6 fe80::1%lo0 prefixlen 64 scopeid 0x8
        inet 127.0.0.1 netmask 0xff000000
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=c019b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,VLAN_HWTSO,LINKSTATE>
        ether 90:b1:1c:10:5b:cc
        inet 10.0.50.43 netmask 0xfffffe00 broadcast 10.0.51.255
        nd6 options=9<PERFORMNUD,IFDISABLED>
        media: Ethernet autoselect
        status: active
        laggproto lacp lagghash l2,l3,l4
        laggport: bge1 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
        laggport: bge0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
Problem description:
Large vMotion operations between the FreeNAS arrays fail, while smaller transfers (less than 200 GB) succeed. I have tried 3-14 TB transfers, and they fail in the middle of the night. Error from vSphere (screenshot attached):
Code:
Remote connection failure
Failed to establish transport connection (9): Virtual machine config file does not exist..
Unable to load configuration file '/vmfs/volumes/55831b79-badbf7d0-b07e-90b11c1caa0c/xProtect/xProtect.vmx'.
Disconnected from virtual machine.
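In case it helps anyone reproduce this, here is roughly what I check on ESXi-03's shell right after a failure (standard ESXi log locations; the grep patterns are just my own):
Code:
# On ESXi-03, shortly after a failed migration:
grep -i xprotect /var/log/hostd.log           # task-level view of the vMotion for this VM
grep -iE 'iscsi|nmp' /var/log/vmkernel.log    # storage path and session events
tail -n 200 /var/log/vmkwarning.log           # warnings only, quicker to scan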
The daily security run output from Storage1 and Storage2 has come back, and it shows a networking issue: both boxes were unable to reach the iSCSI interfaces of ESXi-03, the host where the virtual machine being migrated lives.
Storage1 security run output:
Code:
storage1.brewerscience.com kernel log messages:
mps0: SAS Address for SATA device = d2644339d9cac06d
mps0: SAS Address from SATA device = d2644339d9cac06d
mps1: SAS Address for SATA device = d2644438d8cbbe71
mps1: SAS Address from SATA device = d2644438d8cbbe71
mps1: SAS Address for SATA device = d2644334dcc9bd74
mps1: SAS Address from SATA device = d2644334dcc9bd74
mps0: SAS Address for SATA device = d2644337d9c3ba75
mps0: SAS Address from SATA device = d2644337d9c3ba75
mps1: SAS Address for SATA device = d2644334dec9ba71
mps1: SAS Address from SATA device = d2644334dec9ba71
mps2: SAS Address for SATA device = d2625d44fbc4da92
mps1: SAS Address for SATA device = d2644334dbc5b96f
mps2: SAS Address from SATA device = d2625d44fbc4da92
mps1: SAS Address from SATA device = d2644334dbc5b96f
mps1: SAS Address for SATA device = d2644537ddcbb875
mps1: SAS Address from SATA device = d2644537ddcbb875
mps1: SAS Address for SATA device = d2644534dbc6ba74
mps2: SAS Address for SATA device = d2645148ede2e08f
mps1: SAS Address from SATA device = d2644534dbc6ba74
mps2: SAS Address from SATA device = d2645148ede2e08f
mps1: SAS Address for SATA device = d2625d3cf8e7b791
mps1: SAS Address from SATA device = d2625d3cf8e7b791
mps1: SAS Address for SATA device = d269634e02bcdf89
mps2: SAS Address for SATA device = d2625d33d8e5d26f
mps1: SAS Address from SATA device = d269634e02bcdf89
mps2: SAS Address from SATA device = d2625d33d8e5d26f
SMP: AP CPU #6 Launched!
SMP: AP CPU #3 Launched!
SMP: AP CPU #7 Launched!
Timecounter "TSC-low" frequency 1750034246 Hz quality 1000
vboxdrv: fAsync=0 offMin=0x2cd offMax=0x1016
pid 3990 (sssd_be), uid 0: exited on signal 6 (core dumped)
pid 4027 (sssd_be), uid 0: exited on signal 6 (core dumped)
WARNING: 10.0.4.51 (iqn.1998-01.com.vmware:esxi-03-263b589c): no ping reply (NOP-Out) after 5 seconds; dropping connection
WARNING: 10.0.3.51 (iqn.1998-01.com.vmware:esxi-03-263b589c): no ping reply (NOP-Out) after 5 seconds; dropping connection
pid 4437 (sssd_be), uid 0: exited on signal 6 (core dumped)
pid 5121 (sssd_be), uid 0: exited on signal 6 (core dumped)
pid 3451 (syslog-ng), uid 0: exited on signal 6 (core dumped)

-- End of security output --
Notice the "no ping reply (NOP-Out) ... dropping connection" errors.
Storage2 security run output:
Code:
storage2.brewerscience.com kernel log messages:
WARNING: 10.0.4.51 (iqn.1998-01.com.vmware:esxi-03-263b589c): no ping reply (NOP-Out) after 5 seconds; dropping connection

-- End of security output --
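Since both heads are timing out NOP-Out pings to ESXi-03, my plan is to verify the jumbo-frame path end to end and look at the target-side ping timeout. A rough sketch of those checks (the vmk numbers are placeholders for your actual iSCSI vmkernel ports, and I'm assuming FreeNAS exposes stock FreeBSD CTL tunables):
Code:
# From ESXi-03: an 8972-byte payload with don't-fragment set exercises the full 9000-byte MTU path
vmkping -d -s 8972 -I vmk1 10.0.3.29    # Storage1 em0 (vmk1/vmk2 are placeholders)
vmkping -d -s 8972 -I vmk2 10.0.4.29    # Storage1 em1
vmkping -d -s 8972 -I vmk1 10.0.3.25    # Storage2 em0
vmkping -d -s 8972 -I vmk2 10.0.4.25    # Storage2 em1

# On each FreeNAS head: the 5-second NOP-Out limit behind the warnings above
sysctl kern.cam.ctl.iscsi.ping_timeout  # assumes stock CTL tunables; default is 5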
Looking at the logs on Storage1, it appears the box rebooted itself! /var/log/messages is here: http://pastebin.com/hYVkqJDa
Has anyone seen anything like this before? I see the ctl_datamove errors, but those occur hours before the reboot. Nagios reports that Storage1 went offline at exactly 11:59 PM CST, and the dmesg log corroborates this: boot messages start at Jan 27 00:04:52.
Is there another log file I can check to see what was happening right before the machine rebooted? Anywhere else I should be looking? Guidance on debug logging would also be great; if I knew what to enable, I could reproduce the problem and see whether the logs capture more detail.
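For reference, here is where I plan to dig next, in case anyone has better ideas (I believe FreeNAS 9.3 saves kernel crash dumps under /data/crash, but treat these paths as assumptions):
Code:
last reboot | head                     # reboot history as recorded in wtmp
ls -l /data/crash                      # savecore output (stock FreeBSD uses /var/crash)
bzcat /var/log/messages.0.bz2 | less   # rotated logs, in case the window of interest rolled over
sysctl debug.minidump                  # 1 = kernel minidumps enabled for the next panic
grep dumpdev /etc/rc.conf              # confirm a dump device is configured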