System Landscape:
4 VMware hosts: ESXi-01, -02, -03, and -A
Separate iSCSI subnets for MPIO
2 FreeNAS boxes: Storage1, Storage2 - storage targets for VMware -- I am not virtualizing FreeNAS; the boxes hold backup/bulk storage volumes that are attached to the virtual machines.
Storage1:
FreeNAS-9.3-STABLE-201509282017
- Norco 4224 case
- AP-RRP4ATX6808 Redundant 800 watt hot swap ATX power supply
- Supermicro ATX DDR4 LGA 2011 Motherboard X10SRH-CLN4F-O
- Intel Xeon E5-1620 v3 Haswell-EP 3.5GHz 4 x 256KB L2 Cache 10MB L3 Cache LGA 2011-3 140W BX80644E51620V3 Server Processor
- Crucial 64GB Kit (16GBx4) DDR4 2133 (PC4-2133) DR x4 ECC Registered 288-Pin Server Memory CT4K16G4RFD4213 / CT4C16G4RFD4213
- 2x Samsung 850 Pro 128GB 2.5-Inch SATA III Internal SSD BOOT
- 2x Samsung 850 Pro 256GB 2.5-Inch SATA III Internal SSD L2ARC
- 24x WD RED 3TB
- 4X onboard Intel NIC
- 2X MPIO iSCSI on separate subnets
- 2X lagg0
Code:
ifconfig
em0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
        description: connected to ISCSI-SW2 (1/0/16)
        options=4219b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,WOL_MAGIC,VLAN_HWTSO>
        ether 00:15:17:fd:0d:2a
        inet 10.0.3.29 netmask 0xffffff00 broadcast 10.0.3.255
        nd6 options=9<PERFORMNUD,IFDISABLED>
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
em1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
        description: connected to ISCSI-SW1 (1/0/16)
        options=4019b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,VLAN_HWTSO>
        ether 00:15:17:fd:0d:2b
        inet 10.0.4.29 netmask 0xffffff00 broadcast 10.0.4.255
        nd6 options=9<PERFORMNUD,IFDISABLED>
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
bce0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        description: connected to ServerStack (1/0/8)
        options=c01bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,VLAN_HWTSO,LINKSTATE>
        ether 78:2b:cb:0a:4f:9a
        nd6 options=9<PERFORMNUD,IFDISABLED>
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
bce1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        description: connected to ServerStack (2/0/40)
        options=c01bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,VLAN_HWTSO,LINKSTATE>
        ether 78:2b:cb:0a:4f:9a
        nd6 options=9<PERFORMNUD,IFDISABLED>
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
ipfw0: flags=8801<UP,SIMPLEX,MULTICAST> metric 0 mtu 65536
        nd6 options=9<PERFORMNUD,IFDISABLED>
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
        options=600003<RXCSUM,TXCSUM,RXCSUM_IPV6,TXCSUM_IPV6>
        inet6 ::1 prefixlen 128
        inet6 fe80::1%lo0 prefixlen 64 scopeid 0x8
        inet 127.0.0.1 netmask 0xff000000
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=c01bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,VLAN_HWTSO,LINKSTATE>
        ether 78:2b:cb:0a:4f:9a
        inet 10.0.50.45 netmask 0xfffffe00 broadcast 10.0.51.255
        nd6 options=9<PERFORMNUD,IFDISABLED>
        media: Ethernet autoselect
        status: active
        laggproto lacp lagghash l2,l3,l4
        laggport: bce1 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
        laggport: bce0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
Storage2:
FreeNAS-9.3-STABLE-201601181840
Server:
- System: Dell PowerEdge r310
- Intel(R) Xeon(R) CPU X3450 @ 2.67GHz
- 72GB ECC Registered PC3-8500 DDR3 (Soon to be 56GB)
- 2X SanDisk Micro Fit 64GB USB boot drives, mirrored
- LSI SAS 9201-16e - Flashed to IT mode
- SuperMicro 16 Bay JBOD chassis
- 12x WD RED 3TB (2 in PowerEdge bays, 6 in Habey Enclosure)
- 6x Barracuda XT 3TB
- 2x Barracuda 7200.11 3TB
- 2 Port Intel NIC for MPIO iSCSI on separate subnets
- 2 Port onboard NIC, server stack lagg0
- SuperMicro 16 bay SAS Enclosure SC936A-R1200B
- SFF-8088 Cables 1M x4
- SFF-8087 Cables 2ft x4
- 8087-8088 adapter x2
- SuperMicro CSE-PTJBOD-CS3 JBOD chassis controller w/IPMI
Code:
ifconfig
em0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
        description: connected to ISCSI-SW1 (1/0/11)
        options=4219b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,WOL_MAGIC,VLAN_HWTSO>
        ether 00:15:17:75:8f:50
        inet 10.0.3.25 netmask 0xffffff00 broadcast 10.0.3.255
        nd6 options=9<PERFORMNUD,IFDISABLED>
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
em1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
        description: connected to ISCSI-SW2 (1/0/11)
        options=4019b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,VLAN_HWTSO>
        ether 00:15:17:75:8f:51
        inet 10.0.4.25 netmask 0xffffff00 broadcast 10.0.4.255
        nd6 options=9<PERFORMNUD,IFDISABLED>
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
bge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        description: connected to ServerStack (1/0/26)
        options=c019b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,VLAN_HWTSO,LINKSTATE>
        ether 90:b1:1c:10:5b:cc
        nd6 options=9<PERFORMNUD,IFDISABLED>
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
bge1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        description: connected to ServerStack (2/0/26)
        options=c019b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,VLAN_HWTSO,LINKSTATE>
        ether 90:b1:1c:10:5b:cc
        nd6 options=9<PERFORMNUD,IFDISABLED>
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
ipfw0: flags=8801<UP,SIMPLEX,MULTICAST> metric 0 mtu 65536
        nd6 options=9<PERFORMNUD,IFDISABLED>
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
        options=600003<RXCSUM,TXCSUM,RXCSUM_IPV6,TXCSUM_IPV6>
        inet6 ::1 prefixlen 128
        inet6 fe80::1%lo0 prefixlen 64 scopeid 0x8
        inet 127.0.0.1 netmask 0xff000000
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=c019b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,VLAN_HWTSO,LINKSTATE>
        ether 90:b1:1c:10:5b:cc
        inet 10.0.50.43 netmask 0xfffffe00 broadcast 10.0.51.255
        nd6 options=9<PERFORMNUD,IFDISABLED>
        media: Ethernet autoselect
        status: active
        laggproto lacp lagghash l2,l3,l4
        laggport: bge1 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
        laggport: bge0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
Problem description:
Large vMotion operations between the FreeNAS arrays fail, while smaller transfers (less than 200 GB) succeed. I have tried 3-14 TB transfers, and they fail in the middle of the night. Error from vSphere (screenshot attached):
Code:
Remote connection failure
Failed to establish transport connection (9): Virtual machine config file does not exist..
Unable to load configuration file '/vmfs/volumes/55831b79-badbf7d0-b07e-90b11c1caa0c/xProtect/xProtect.vmx'.
Disconnected from virtual machine.
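In case it helps anyone reproduce this, here is roughly what I check on ESXi-03's shell right after a failure (standard ESXi log locations; the grep patterns are just my own):
Code:
# On ESXi-03, shortly after a failed migration:
grep -i xprotect /var/log/hostd.log           # task-level view of the vMotion for this VM
grep -iE 'iscsi|nmp' /var/log/vmkernel.log    # storage path and session events
tail -n 200 /var/log/vmkwarning.log           # warnings only, quicker to scan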
The daily security run output from Storage1 and Storage2 has come back, and it shows a networking issue: both boxes were unable to reach the iSCSI interfaces of ESXi-03, the host where the virtual machine being migrated lives.
Storage1 security run output:
Code:
storage1.brewerscience.com kernel log messages:
mps0: SAS Address for SATA device = d2644339d9cac06d
mps0: SAS Address from SATA device = d2644339d9cac06d
mps1: SAS Address for SATA device = d2644438d8cbbe71
mps1: SAS Address from SATA device = d2644438d8cbbe71
mps1: SAS Address for SATA device = d2644334dcc9bd74
mps1: SAS Address from SATA device = d2644334dcc9bd74
mps0: SAS Address for SATA device = d2644337d9c3ba75
mps0: SAS Address from SATA device = d2644337d9c3ba75
mps1: SAS Address for SATA device = d2644334dec9ba71
mps1: SAS Address from SATA device = d2644334dec9ba71
mps2: SAS Address for SATA device = d2625d44fbc4da92
mps1: SAS Address for SATA device = d2644334dbc5b96f
mps2: SAS Address from SATA device = d2625d44fbc4da92
mps1: SAS Address from SATA device = d2644334dbc5b96f
mps1: SAS Address for SATA device = d2644537ddcbb875
mps1: SAS Address from SATA device = d2644537ddcbb875
mps1: SAS Address for SATA device = d2644534dbc6ba74
mps2: SAS Address for SATA device = d2645148ede2e08f
mps1: SAS Address from SATA device = d2644534dbc6ba74
mps2: SAS Address from SATA device = d2645148ede2e08f
mps1: SAS Address for SATA device = d2625d3cf8e7b791
mps1: SAS Address from SATA device = d2625d3cf8e7b791
mps1: SAS Address for SATA device = d269634e02bcdf89
mps2: SAS Address for SATA device = d2625d33d8e5d26f
mps1: SAS Address from SATA device = d269634e02bcdf89
mps2: SAS Address from SATA device = d2625d33d8e5d26f
SMP: AP CPU #6 Launched!
SMP: AP CPU #3 Launched!
SMP: AP CPU #7 Launched!
Timecounter "TSC-low" frequency 1750034246 Hz quality 1000
vboxdrv: fAsync=0 offMin=0x2cd offMax=0x1016
pid 3990 (sssd_be), uid 0: exited on signal 6 (core dumped)
pid 4027 (sssd_be), uid 0: exited on signal 6 (core dumped)
WARNING: 10.0.4.51 (iqn.1998-01.com.vmware:esxi-03-263b589c): no ping reply (NOP-Out) after 5 seconds; dropping connection
WARNING: 10.0.3.51 (iqn.1998-01.com.vmware:esxi-03-263b589c): no ping reply (NOP-Out) after 5 seconds; dropping connection
pid 4437 (sssd_be), uid 0: exited on signal 6 (core dumped)
pid 5121 (sssd_be), uid 0: exited on signal 6 (core dumped)
pid 3451 (syslog-ng), uid 0: exited on signal 6 (core dumped)

-- End of security output --
Notice the "no ping reply (NOP-Out) ... dropping connection" errors.
Storage2 security run output:
Code:
storage2.brewerscience.com kernel log messages:
WARNING: 10.0.4.51 (iqn.1998-01.com.vmware:esxi-03-263b589c): no ping reply (NOP-Out) after 5 seconds; dropping connection

-- End of security output --
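Since both heads are timing out NOP-Out pings to ESXi-03, my plan is to verify the jumbo-frame path end to end and look at the target-side ping timeout. A rough sketch of those checks (the vmk numbers are placeholders for your actual iSCSI vmkernel ports, and I'm assuming FreeNAS exposes stock FreeBSD CTL tunables):
Code:
# From ESXi-03: an 8972-byte payload with don't-fragment set exercises the full 9000-byte MTU path
vmkping -d -s 8972 -I vmk1 10.0.3.29    # Storage1 em0 (vmk1/vmk2 are placeholders)
vmkping -d -s 8972 -I vmk2 10.0.4.29    # Storage1 em1
vmkping -d -s 8972 -I vmk1 10.0.3.25    # Storage2 em0
vmkping -d -s 8972 -I vmk2 10.0.4.25    # Storage2 em1

# On each FreeNAS head: the 5-second NOP-Out limit behind the warnings above
sysctl kern.cam.ctl.iscsi.ping_timeout  # assumes stock CTL tunables; default is 5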
Looking at the logs on Storage1, it appears the box rebooted itself! /var/log/messages is here: http://pastebin.com/hYVkqJDa
Has anyone seen anything like this before? I see the ctl_datamove errors, but those occur hours before the reboot. Nagios reports that Storage1 went offline at exactly 11:59 PM CST, and the dmesg log corroborates this: boot messages start at Jan 27 00:04:52.
Is there another log file I can check to see what was happening right before the machine rebooted? Anywhere else I should be looking? Guidance on debug logging would also be great; if I knew what to enable, I could reproduce the problem and see whether the logs capture more detail.
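For reference, here is where I plan to dig next, in case anyone has better ideas (I believe FreeNAS 9.3 saves kernel crash dumps under /data/crash, but treat these paths as assumptions):
Code:
last reboot | head                     # reboot history as recorded in wtmp
ls -l /data/crash                      # savecore output (stock FreeBSD uses /var/crash)
bzcat /var/log/messages.0.bz2 | less   # rotated logs, in case the window of interest rolled over
sysctl debug.minidump                  # 1 = kernel minidumps enabled for the next panic
grep dumpdev /etc/rc.conf              # confirm a dump device is configured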