Network errors with Hyper-V Switch based on Mellanox ConnectX-3

tony95

Contributor
Joined
Jan 2, 2021
Messages
117
I have two TrueNAS servers running as Hyper-V guests, one on a Windows 10 host and the other on a Server 2019 host, and I am seeing the same issue on both. TrueNAS is based on FreeBSD 12.2. I have narrowed it down to a networking issue inside the VM. It doesn't occur anywhere else on my network or in a Win10 guest VM on the same host, and it is not a disk problem, because it happens when writing to a single SSD as well as to a Z2 array. Both systems use a virtual switch bound to a Mellanox ConnectX-3 10G adapter.

Everything boots up and runs fine, but I get random network errors. Every 5-30 minutes or so I get "an unexpected network error has occurred" in Beyond Compare and/or qBittorrent while writing to the TrueNAS server. In Beyond Compare, one file errors, the rest of the copy continues, and I just have to resend that one file. I can copy several terabytes and maybe a handful of files will fail. In qBittorrent, a few downloads error every 5-30 minutes, which usually ends with all of the downloads failing several times.

I have tried # sysctl net.link.ether.inet.max_age=60, but it didn't seem to help. I am wired in with 10G fiber. I have also tried disabling VMQ and enabling MAC address spoofing in Hyper-V, but neither made a difference that I can tell. Any idea what the problem is? I would really like to be able to use this server reliably. Please help.
 

tony95

Contributor
Joined
Jan 2, 2021
Messages
117
I changed the NIC in the virtual switch from the Mellanox ConnectX-3 to the built-in Realtek gigabit adapter and the problem persists. I also upgraded the RAM to 32GB and set the CPU reserve to 100%, and still have the same issue. If I can't get any help, I will have to look at moving to VMware.
 

matthew3658

Cadet
Joined
Jan 27, 2021
Messages
3
Hey Tony, I currently run TrueNAS in Hyper-V. Can you send a picture of the error and the topology of your vSwitch setup? The way I have it running, the vSwitch has its own dedicated NIC and is set up as an external switch. Thank you!
 

tony95

Contributor
Joined
Jan 2, 2021
Messages
117
matthew3658 said: Hey Tony, I currently run TrueNAS in Hyper-V. Can you send a picture of the error and the topology of your vSwitch setup? The way I have it running, the vSwitch has its own dedicated NIC and is set up as an external switch. Thank you!
Yes, I have tried setting it up as a dedicated switch. You just turn off "Allow management operating system to share this network adapter", and that is all I need to do, correct? I will send you screenshots of the switch settings.
 

tony95

Contributor
Joined
Jan 2, 2021
Messages
117
vswitch1.JPG
vswitch2.JPG
vswitch3.JPG
vswitch4.JPG
vswitch5.JPG
 

tony95

Contributor
Joined
Jan 2, 2021
Messages
117
Everything is connected through a UniFi Switch 16 XG, except for a couple of things on dumb Netgear gigabit switches. I should probably test without the 16 XG, because I had to enable flow control on it to get my 1G connections to work properly. That shouldn't create network errors, but you never know. The error is just "unexpected network error has occurred" in either Beyond Compare or qBittorrent. I am sure it is happening elsewhere too, but these applications are good about catching the error.
 

tony95

Contributor
Joined
Jan 2, 2021
Messages
117
The only thing I have changed on the switch is enabling flow control. There is a topology there, but I don't know anything about it.

unifi.JPG
 

matthew3658

Cadet
Joined
Jan 27, 2021
Messages
3
Here are the current settings on my switch, which seem to give me the best results. Your problem may have something to do with NDIS Capture being turned on and trying to comb through that much data, along with flow control. I would try turning off Flow Control and NDIS Capture, as both are attempting to scan all of that traffic, which may allow an initial transfer but then bug out. Let me know your results once you test it.

Switch Config:
1611848160129.png


Host VM Settings:
1611848339748.png


1611848355646.png
 

matthew3658

Cadet
Joined
Jan 27, 2021
Messages
3
tony95 said: Yes, I have tried setting it up as a dedicated switch. You just turn off "Allow management operating system to share this network adapter", and that is all I need to do, correct? I will send you screenshots of the switch settings.
Yes, just make sure you select the correct network adapter from the drop-down list.
 

tony95

Contributor
Joined
Jan 2, 2021
Messages
117
I can't turn off flow control because my 1G performance goes to pot. The NDIS Capture extension was showing a problem when I looked at it, so I disabled it. I don't have the VMM DHCPv4 extension; hopefully I don't need it. I am testing now, but it can be difficult to tell whether the problem is resolved because it pops up almost randomly. Sometimes I get errors after a few minutes, sometimes it takes 30 or more. Running several low-speed transfers across the network does seem to make it happen faster, so that is what I am doing. About 20 minutes in right now and no errors yet. If we make it an hour that would be a good sign, but I probably won't know for sure until it has run a day or more.
 

tony95

Contributor
Joined
Jan 2, 2021
Messages
117
Failed again at about 35 minutes, which is pretty common after a reboot. I am going to try again from a different PC to the TrueNAS VM.
 

tony95

Contributor
Joined
Jan 2, 2021
Messages
117
I have 7 PCs reading and writing small bits of data to the NAS, but it has mainly been one VM running qBittorrent that fails consistently. I have had other errors when copying files with Beyond Compare, but those seem to be much less common. Even when the qBittorrent client fails, all of the other PCs seem to continue (mostly reading) normally. I think this is because the problem is mostly with writing, not reading, but I am not 100% sure. I have done some tests copying a game folder with many small files from other PCs to the NAS and was getting some errors, which is why I have been focused on the NAS and not the source, but another test at this point makes sense.
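
For anyone trying to reproduce this in a repeatable way, here is a minimal sketch in Python of the "many small files" write test described above. It is only an illustration: the function name `write_small_files`, the `probe_` file naming, and the default count and size are assumptions of mine, not part of Beyond Compare, qBittorrent, or TrueNAS. It writes small files to a target directory, reads each one back, and records how many seconds into the run each failure happened.

```python
import os
import tempfile
import time
from pathlib import Path


def write_small_files(target_dir, count=200, size=4096):
    """Write `count` small files into target_dir and return a list of failures.

    Each failure is recorded as (seconds_since_start, file_name, error_text).
    """
    target = Path(target_dir)
    payload = os.urandom(size)
    failures = []
    start = time.monotonic()
    for i in range(count):
        path = target / f"probe_{i:05d}.bin"
        try:
            with open(path, "wb") as fh:
                fh.write(payload)
                fh.flush()
                os.fsync(fh.fileno())
            # Read the file back and compare it with what was written.
            if path.read_bytes() != payload:
                raise OSError("read-back mismatch")
        except OSError as exc:
            elapsed = round(time.monotonic() - start, 1)
            failures.append((elapsed, path.name, str(exc)))
    return failures


if __name__ == "__main__":
    # A temporary local directory stands in for the share here (assumption);
    # point target_dir at the mapped TrueNAS share to test the real path.
    with tempfile.TemporaryDirectory() as tmp:
        errors = write_small_files(tmp, count=50)
        print(f"{len(errors)} failed writes")
        for elapsed, name, message in errors:
            print(f"{elapsed:>8}s  {name}  {message}")
```

Running it in a loop from more than one client should show whether the failures cluster around the same elapsed time. Note that the read-back may be served from the client's cache, so it is a weaker check than the write itself.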
 

tony95

Contributor
Joined
Jan 2, 2021
Messages
117
Failed again running on a completely different PC sending to the NAS, again after about 35-40 minutes. Could this be some sort of buffer overflow issue or something like that? It runs fine for 30 minutes, then errors. With no load on the network it seems to run longer.

qberr.JPG
 

tony95

Contributor
Joined
Jan 2, 2021
Messages
117
I tried disabling spanning tree on the switch, but no luck. So it isn't the array, it isn't the source PC, and it isn't the NIC. The next move will be to take the 10G switch out of the network.
If the network error persists, there are two other options: use an SSD for everyday writes and only use the NAS for bulk storage, or move the drives to a physical box. If I move to a physical box I suppose I can set up a Windows VM to run my programs, but at this point I am not confident the issue will not persist. A third option would be VMware, I guess.
 

tony95

Contributor
Joined
Jan 2, 2021
Messages
117
I installed on bare metal on a completely different PC and still have the same problem. So we can rule out Hyper-V at this point.
 

RegularJoe

Patron
Joined
Aug 19, 2013
Messages
330
I suggest a few things:

  1. Post the Windows error message, and if that message has a code, include it as text so it is searchable by our overlords at Google.
  2. If this is from a Windows guest, try from a Mac, Linux or UNIX machine.
  3. Try a different physical/virtual switch, as some are so crappy that you can't "see" the interface counters like you should:
https://www.truenas.com/community/t...to-windows-but-not-the-other-way-round.60334/

Here is one of my older enterprise switches that cost me like $99:

Hardware is Gigabit Ethernet, address is 2893.fefc.ee99 (bia 2893.fefc.ee99)
MTU 1500 bytes, BW 1000000 Kbit/sec, DLY 10 usec,
reliability 255/255, txload 1/255, rxload 1/255
Encapsulation ARPA, loopback not set
Keepalive not set
Full-duplex, 1000Mb/s, link type is auto, media type is 10/100/1000BaseTX
input flow-control is off, output flow-control is unsupported
ARP type: ARPA, ARP Timeout 04:00:00
Last input 00:00:05, output 00:00:00, output hang never
Last clearing of "show interface" counters never
Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 1
Queueing strategy: fifo
Output queue: 0/40 (size/max)
5 minute input rate 0 bits/sec, 0 packets/sec
5 minute output rate 2000 bits/sec, 2 packets/sec
2196319663 packets input, 475176914645 bytes, 0 no buffer
Received 24152749 broadcasts (17850585 multicasts)
0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
0 watchdog, 17850585 multicast, 0 pause input
0 input packets with dribble condition detected
8939040668 packets output, 12904713120894 bytes, 0 underruns
0 output errors, 0 collisions, 1 interface resets
1 unknown protocol drops
0 babbles, 0 late collision, 0 deferred
0 lost carrier, 0 no carrier, 0 pause output
0 output buffer failures, 0 output buffers swapped out
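
If you can get counters like these out of your switch, checking them for anything non-zero can be scripted. Below is a rough sketch in Python, assuming the text looks like the "show interface" output pasted above. The function name `nonzero_counters` and the list of counter names are illustrative choices of mine, so adjust them to whatever your switch actually prints.

```python
import re

# Counter names to flag when non-zero. This list is an assumption based on
# the output above, not an exhaustive or vendor-defined set.
ERROR_COUNTERS = [
    "input errors", "CRC", "frame", "overrun", "ignored",
    "output errors", "collisions", "late collision",
    "pause input", "pause output", "output buffer failures",
    "runts", "giants", "no buffer",
]


def nonzero_counters(show_interface_text, names=ERROR_COUNTERS):
    """Return {counter_name: value} for each listed counter that is non-zero."""
    found = {}
    for name in names:
        # Match "<number> <name>" as a whole phrase, e.g. "0 input errors".
        pattern = rf"(?<![\w.])(\d+) {re.escape(name)}\b"
        match = re.search(pattern, show_interface_text)
        if match and int(match.group(1)) > 0:
            found[name] = int(match.group(1))
    return found
```

Paste the interface output into a string and pass it in. An empty result means none of the listed counters has incremented, which points away from the physical link and switch port.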
 

tony95

Contributor
Joined
Jan 2, 2021
Messages
117
I removed my 10G switch and it is still a problem. Some say it might be Ryzen. I disabled the global C-state setting, but I can't find any other C-states to disable in the BIOS.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Have you tried using PCI pass-through for the NIC to the VM, instead of using the vSwitch?
 

RegularJoe

Patron
Joined
Aug 19, 2013
Messages
330
Is SR-IOV enabled in your BIOS? IOMMU and VT-d might be a little different between AMD and Intel systems. Are you using a real 10/40 gig switch or daisy-chaining the Mellanox adapters?
 

tony95

Contributor
Joined
Jan 2, 2021
Messages
117
matthew3658 said: Yes, just make sure you select the correct network adapter from the drop-down list.

This seems to have been an SMB bug in 12.0-U1.1. I updated to U2 today and the issue seems to be resolved. I am surprised more people did not pinpoint the issue.
 