Unable to add a 2nd NIC - System loses network access completely

markw7811

Cadet
Joined
Aug 1, 2021
Messages
8
I have an older FreeNAS 11 VM on ESXi 6.7u3. It has worked great for years now.

I've added a new vSwitch and port group in ESXi dedicated to storage. It will have no physical NIC bound to it, which should allow me to get more than 1Gbps of throughput through this new network.

When I add a second NIC to the FreeNAS VM and boot up, I have no networking at all. I've tried resetting both vmx0 (the existing NIC) and vmx1, as well as setting a static IP on the new NIC, and no matter what I do, I can't regain access to the existing network to manage the FreeNAS system. In case it's helpful: automatic configuration on the existing NIC fails, so I have to answer No to "Reset network configuration" and enter the info statically.

If I remove the 2nd NIC everything is fine again.

The ARP table shows nothing other than a local address. I can ping the local IPs but nothing beyond the VM.

[Screenshot: ifconfig output (ifconfig1.png)]


[Screenshot: VM network adapter configuration (vmconf.png)]


Any thoughts?

Thanks
Mark
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,740
FreeBSD on ESXi is prone to shuffle the order of NICs in surprising ways. Possibly vmx0 is now the new interface and vmx1 the old one. Just try - or look up the MAC addresses in the ESXi interface and compare to the ifconfig output.
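On the FreeBSD side, something like this will show the MACs to compare (assuming the interfaces are still named vmx0 and vmx1 as in your screenshot):

Code:
# the "ether" line is the MAC address of each interface
ifconfig vmx0 | grep ether
ifconfig vmx1 | grep ether

Then check the MAC listed for each network adapter in the VM's settings in the ESXi UI.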
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
ESXi PCI device enumeration is broken in strange ways; make sure the MAC addresses match between FreeNAS and the ESXi UI.

Also,

I've added a new vSwitch and port group in ESXi dedicated to storage. It will have no physical NIC bound to it, which should allow me to get more than 1Gbps of throughput through this new network.

this is a fundamental misunderstanding of how things work. You can get more than 1Gbps throughput on any vSwitch with the E1000, E1000E, or VMXNET3 drivers regardless of physical uplinks. The fact that the E1000 is a "1Gbps ethernet" device doesn't mean anything, and as long as your VM and host can twiddle the bits to emulate the hardware, it'll go faster if the CPU is capable.
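If you want to convince yourself, a quick iperf3 run between two VMs on the same vSwitch should blow past 1Gbps with no physical uplink involved at all (iperf3 is available from pkg on FreeBSD; the IP below is just a placeholder):

Code:
# server side, on the FreeNAS VM
iperf3 -s

# client side, on another VM attached to the same vSwitch
iperf3 -c 192.168.100.10 -t 30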

VMXNET3 is both the best and the absolute worst choice. It strips off a bunch of layers of pseudo-hardware bit-twiddling and is supposed to just move data around, so in theory it goes as fast as possible. However, bugs on both the FreeBSD side and the VMware side make it hazardous, especially in environments where you vMotion things around. I just finished evicting the idiotic VMXNET3 once again from our FreeBSD infrastructure hosts at a data center where we were experimenting with it, after a round of on-site migrations where the network ended up dying with huge numbers of VMXNET3 packet truncations. Well, screw it. This is, I think, the third or fourth time over the years that we've tried to use VMware's VMXNET stuff, and it has ended badly every time. We always seem to end up back on E1000, solid as a rock...
 

markw7811

Cadet
Joined
Aug 1, 2021
Messages
8
Wow, sure enough, I can see it in the screenshots... the MACs got reversed, sigh. Huge thanks! It may have taken me a while to realize that on my own, lol.

Is changing this trivial? Nothing within FreeNAS is going to care or get broken as far as listeners etc.?

Re. the network: currently, when I copy from the ESXi host (SSH'd in, using vmkfstools) I only get about 100MB/s, and I can tell the network is pegged because all my other NFS VMs start seeing super high latency. I did some searching and found some forum posts indicating that if you have a 1Gbps NIC bound to the vSwitch in use, you may still be limited to the speed of that NIC, so I was going to set up this new VM-only network to see if I get better results.
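For reference, the copy is just a plain vmkfstools clone run from the ESXi shell, roughly like this (datastore and VM names are placeholders for my actual ones):

Code:
# clone a VMDK from the local datastore to the FreeNAS NFS datastore, thin provisioned
vmkfstools -i /vmfs/volumes/local-datastore/somevm/somevm.vmdk \
    /vmfs/volumes/freenas-nfs/somevm/somevm.vmdk -d thin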
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,740
Just configure your IP setup the other way round in TrueNAS and you will be fine. You may need to change interface assignments if you run jails/plugins/VMs, but all sharing services bind to IP address, not interface.
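If in doubt, you can see exactly what binds to which address from a shell on the FreeNAS box:

Code:
# list IPv4 listening sockets and the addresses they are bound to
sockstat -4l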

As for the perceived speed problem - I'm no help for your specific issue at the moment, but keep in mind that the reported speeds of virtual interfaces like "1 Gbit/s" are just strings of text compiled into the driver, because the long-outdated hardware being emulated used to have that particular speed. An E1000 interface in a VM connected to a vSwitch connected to a 40 Gbit/s physical interface will perfectly well reach that speed if the ESXi host's hardware, the ESXi system, and the guest OS are up to it. The messages you see are really just "oh, I found an Intel bla bla bla 1G card" reported by the driver. Not an assessment of actual speed.

People frequently get confused by that.

HTH,
Patrick
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
An E1000 interface in a VM connected to a vSwitch connected to a 40 Gbit/s physical interface will perfectly well reach that speed if the ESXi host's hardware, the ESXi system, and the guest OS are up to it.

But in practice, E1000 will only maybe hit 3-4Gbps at peak, and that only if you tie a cinder block to each leg and toss the client out of a spacecraft from 20K miles up, whereas VMXNET3, when it's working, can probably sustain 5Gbps++ in lots of conditions. Damn shame VMXNET3 doesn't work as well as it ought to...
 

markw7811

Cadet
Joined
Aug 1, 2021
Messages
8
I thought that was the case re. the virtual interface, but I've generally had good luck with VMXNET3 and bad luck with E1000, haha.

As for the speeds, I'm only judging them off the Network graph in the Reporting section of FreeNAS.

This is during a copy of data from the ESXi server the FreeNAS VM is running on to the NFS share where I host other VMs, so it should all be local ESXi -> VM traffic.

[Screenshot: FreeNAS Reporting network graph during the copy]
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,740
But in practice, E1000 will only maybe hit 3-4Gbps at peak, ...
Thanks for some real world figures. What's keeping it from going faster? Interrupts and context switches saturated? I always wonder why VMware does not adopt VirtIO like everyone else, btw ...
 

markw7811

Cadet
Joined
Aug 1, 2021
Messages
8
My SLOG failed randomly tonight; luckily I had already migrated the data over to my new disks via snapshot/replication (thanks YouTube, and thanks FreeNAS for making that so amazingly simple and fast). Anyway...

So I think VM-to-VM traffic may have been fast, but in this case I was trying to copy files between volumes/datastores on the ESXi host itself. To do this, the host uses the VMkernel adapter, which was connected to a vSwitch with the 1Gb physical interface. The fact that the graph from the old config shows such a perfect plateau at ~100MB/s definitely implies the physical NIC limitation coming into play somehow.
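For anyone checking their own setup, the VMkernel adapter to vSwitch/uplink mapping is easy to confirm from the ESXi shell:

Code:
# list VMkernel adapters and the port groups they are attached to
esxcli network ip interface list

# list standard vSwitches along with their uplinks and port groups
esxcli network vswitch standard list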

I created a second VMkernel adapter connected to a new port group on a new vSwitch, and then added a VM port group to the same switch. I did not provision a physical NIC onto this vSwitch, and this is the result on my new VMXNET3 adapter on my FreeNAS VM - much better!

[Screenshot: FreeNAS network graph on the new VMXNET3 adapter]


Interesting spike at the end of the copy; not sure what that was about, but I'll take it... I suspect that's after it finished reading from the HDD spindles and blasted the last bits out to the SSD from memory, or something. I never saw this on the previous network setup.
[Screenshot: network graph showing the spike at the end of the copy]


I need a break from this; I've replaced and reconfigured all the disks in my little home server this weekend. But if I get a chance, I may do some netcat tests to take the disks out of the equation.
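For the record, the kind of test I have in mind is roughly this (IP and port are placeholders; note the dd block-size suffix is lowercase on FreeBSD and uppercase on Linux):

Code:
# on the FreeNAS VM: listen and discard whatever comes in
nc -l 5001 > /dev/null

# on the other VM/host: push ~10GB of zeroes across the network, no disks involved
dd if=/dev/zero bs=1m count=10000 | nc 192.168.100.10 5001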
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
Thanks for some real world figures. What's keeping it from going faster? Interrupts and context switches saturated?

I don't actually know the EXACT answer here, but.... you're generally on the right track.

One of the many things I've expected out of FreeBSD over the years is to act as a router. Those of you who are thinking "uh huh me too", no, I don't mean a NAT gateway like the pfSense or OpenWRT you use as customer-side of a residential connection, but an actual internetwork router.

Back in the day, routers like the Cisco 4700M were an integral part of the network, but they were software-based routers, meaning that they were limited in PPS (packets per second) and Mbps (megabits per second), and they were quite expensive. In comparison, KA9Q NOS on a PC with a floppy disk made a credible low-end router for modest use, and later FreeBSD added the ability to do more complex things like interior gateway protocols (IGPs) such as RIP and OSPF.

Software-based routers like the 4700M or a PC are inherently at a disadvantage compared to silicon-based routers that have specialized ASICs to move the packets, because each packet needs to be fondled by the CPU. I have some of the very earliest silicon routers, the Netstar (Ascend) GRF's, and their ability to route any-to-any at full wire speed on all ports was unparalleled at the time. However, it was limited to 150K routes, in silicon, and also 100Mbps, even if it did have 32 ports of 100Mbps. :smile: And the damn things cost like $80K. In 1997 dollars.

On the other hand, it was easy to see years ago that CPUs were going to become very competitive with silicon over time. I made some statements at the time that I think caused some NANOG'er heads to explode, but I was ultimately correct, in that we're now seeing products like the Ubiquiti EdgeRouter Infinity and the Mikrotik CCR1072, which are basically high-core-count CPU-based devices that get packet forwarding rates into the 100Mpps rate arena on 8x10GbE.

The problem is that it takes a LOT of CPU to get there, and the hardware also needs to be able to work optimally with the host system. We no longer have serious issues with stuff like PCI (not PCIe) bus contention, but when push comes to shove, the design of lots of bits of FreeBSD isn't optimized for the task. Luigi Rizzo and others have spent years improving this, and he had an impressive demo of 185Kpps packet handling in the mid-2000s, but this is really where the PC hardware gets stressed out, and you need to have everything lined up just right. To get there on modern hardware, you need multiple queues and a variety of other things intended to spread the load across multiple cores, which were not a consideration when Intel designed the E1000 hardware. I *suspect* that the vast majority of the answer to your question is related to this.

For those of us building routers, or servers too really, there is a lot of value in looking at BSD router performance tuning, a lot of which has recently been summarized at

https://wiki.freebsd.org/Networking/10GbE/Router

The primary difference for a FreeNAS/TrueNAS server is that unlike a router, LRO/TSO are good for a server.
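As a quick sanity check on the server side, you can see what the interface actually has enabled and whether interrupts are being spread across queues (vmx0 assumed; on FreeNAS, changes made with ifconfig don't persist, so the interface's Options field in the GUI is the place to make them stick):

Code:
# look for TSO4/TSO6/LRO in the "options=" line
ifconfig vmx0

# enable TSO and LRO if they are off (-tso / -lro turns them back off)
ifconfig vmx0 tso lro

# per-queue interrupt counters give a rough idea of how load spreads across cores
vmstat -i | grep vmx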

So I feel like the correct answer to your question is that CPU speeds have plateaued -- we've had 3GHz for about twenty years, and okay, fine, we're around 5GHz now, and to get "faster" computers we've gone to multiple cores and GPUs -- and that the E1000 isn't designed to take true advantage of multiple cores, because Intel expected the E1000 to be physically limited to 1GbE, and their CPUs at the time could cope with that. Additionally, there is a lot of superfluous bit-twiddling going on that doesn't really need to happen, none of which would be there in a virtio setup, so you waste a bunch of potential.

My best guess is that if you could double CPU core speed, you could probably get a lot more out of E1000, but as I said, we've been generally plateaued on CPU speeds for about twenty years, especially on server CPUs, where higher core counts have kept the GHz-per-core somewhat lower.

I hope this has been particularly enlightening for an answer that is essentially "I don't really know". :smile: ;-)

I always wonder why VMware does not adopt VirtIO like everyone else, btw ...

The ten-virtual-NIC-per-VM limit is really annoying, too.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,740
products like the Ubiquiti EdgeRouter Infinity and the Mikrotik CCR1072, which are basically high-core-count CPU-based devices that get packet forwarding rates into the 100Mpps rate arena on 8x10GbE.
And a single-threaded BGP process that goes out for lunch whenever a peer toggles and takes the entire interactive administration environment with it ... that's why we stopped bothering with these products. We're moving towards good ol' ASIC-based layer 2 (still Cisco) and FreeBSD for layer 3.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
Single-threaded BGP processes have been a thing basically forever, and of course that is strongly related to the CPU GHz thing. I hear the CCR1072 is particularly atrocious, though, since there's no significant CPU GHz to begin with. (eyeroll)

The Infinity was very promising, except that ZebOS has a bunch of bugs. You would think that peering with a small IX (internet exchange) and receiving maybe 10K routes would be no problem. Yet it sucks.

Code:
top - 13:13:56 up 32 days, 47 min,  1 user,  load average: 2.13, 2.07, 2.01
Tasks: 241 total,   2 running, 238 sleeping,   0 stopped,   1 zombie
%Cpu(s):  6.3 us,  0.1 sy,  0.0 ni, 93.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 16476880 total, 15285804 free,   964304 used,   226772 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 15362556 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
21813 root      20   0  367644 364088   4440 R 100.0  2.2  14023:43 bgpd
 8664 jgreco    20   0    6992   3176   2656 R   0.7  0.0   0:00.22 top


Ow. It works for a few hours before bgpd locks up, at which point we're off the air, of course. And this one is a very basic config, where the router owns the ASN and is the only router handling transit and peering. Such a basic config is probably fine if you don't need redundancy and your entire network lives in a single rack somewhere, but hell, even that doesn't work.

In SOL's more complex network, I assigned the Infinity what I thought would be a simple task, which was to simply handle peering with a single IX. It needs to be able to hook up to the two onsite route reflectors, hook up to the two IX route servers, and do literally nothing difficult. That's a total of 8 BGP sessions, including both v4 and v6. But nooooooo. It crashes. Unless I filter out all routes from the IX in the Infinity, and just announce routes to the IX. That works fine.

I'm half tempted to see if it is possible to install FRR on the damn thing. I don't really need the UI or CLI support.

Anyways, it is very disappointing. Ubiquiti was very exciting five to ten years ago, and their EdgeRouter Lite was particularly amazing for its ability to do 1Mpps back in ... 2013? It has taken forever for some stuff like more recent OpenVPN versions to show up in the firmware, firmware updates have gotten very slow, and UBNT continues to promise features like VRF or USG DPI on mirror ports, etc. Devices like the UniFi Security Gateway have me asking "what exactly is 'security' here, it's a NAT", although the shallow DPI web UI is nice. Devices like the UDM have me going "huh?", and more recently the switch from UNMS to UISP seems to signal that they're not really interested in supporting hard engineering like true security gateway features (hello to Sophos, etc.) or BGP or any of that. I realize that by licensing ZebOS they thought they were outsourcing some of that, but Zebra's support sucked for many years, causing the Quagga fork, which did well for a while and then languished a bit, ultimately resulting in FRR... did no one at Ubiquiti bother to look at the history before licensing ZebOS?

Sorry, I have no good place to vent about this kind of stuff. There are still some of us smaller networks out here who would like to have a competent full-table v4+v6 multi-peer router with 4 to 8 10Gbps ports that doesn't take 4U+ and 1000W+ and cost five figures.

And of course this is all mostly offtopic, except that it does relate to the VMware issue, since some of us would love to be able to do higher speed routed networks in a virtualized environment.
 