
High Speed Networking Tuning to maximize your 10G, 25G, 40G networks

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
jgreco submitted a new resource:

High Speed Networking Tuning - Make your 10G, 25G, 40G networking faster

Both FreeBSD and Linux come by default highly optimized for classic 1Gbps ethernet. This is by far the most commonly deployed networking for both clients and servers, and a lot of research has been done to tune performance especially for local area networks. The default settings are optimized to be efficient for both small and large servers, but because memory is often limited on smaller servers, some tunables that could improve performance for higher speeds (10Gbps and above) are not...

Read more about this resource...
 

smdftw

Cadet
Joined
Jul 2, 2022
Messages
1
When I try to set these tunables in SCALE I get the "Sysctl 'xxx' does not exist in kernel" message.

Are these tunables named differently in SCALE, or are they only applicable to CORE?
 
Joined
Dec 29, 2014
Messages
1,135
I am curious about the "Do not try to use copper 10GBase-T" comment in the resource (which was very helpful). I am sure that there are scars associated with that comment. I am starting to see more of my customers wanting to use 10GBase-T and I would be interested in hearing about your experiences.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I am sure that there are scars associated with that comment.

Actually not; I own a number of bits of gear with 10GBase-T ports, including a very nice X9DR7-TF+ board and some Dell PowerConnect 8024F switches. These were basically incidental acquisitions where I did not deliberately seek them out, and I generally use the ports as conventional 1G copper ports.

The most immediate arguments against 10GBase-T are:

1) that it consumes more power than an equivalent SFP+ or DAC setup. Which might seem like *shrug*, except that once you get to the point of burning several extra watts per port on a 48-port switch, it becomes a meaningful ongoing operational expense. Newer estimates are two to five watts per port, whereas SFP+ is about 0.5-0.7. It is worth noting that in a data center environment, if you burn five extra watts in equipment, there is usually about a five to ten watt cost to provide cooling as well. The electrical costs for 10GBase-T add up; there's a rough sketch of the math below.

2) that it experiences higher latency than an SFP+ or DAC setup. SFP+ is typically about 300 nanoseconds. 10GBase-T, on the other hand, uses PHY block encoding, so there is a 3 microsecond step (perhaps 2.5 us more than SFP+). This shows up as additional latency in 10GBase-T based networks, which is undesirable, especially in the context of this thread, which is about maximizing performance. I'm sure someone will point out that it isn't a major hit. True, but it's there regardless.

3) that people argue for 10GBase-T because they've already got copper physical plant. The problem is that this is generally a stupid argument. Unless you installed Cat7 years ago, your copper plant is unlikely to be suitable for carrying 10GBase-T at distance, and that janky old 5e or 6 needs to be replaced. Today's kids did not live through the trauma of the 90's, where we went from Cat3 to Cat5 to Cat5e as ethernet evolved from 10 to 100 to 1000Mbps. Replacing physical plant is not trivial, and making 10GBase-T run at 100 meters from the switch is very rough; Cat6 won't cut it (only 55m), you need Cat6A or Cat7 and an installer with testing/certification gear, because all four pairs have to be virtually perfect, and a problem with any one pair can render the connection dead.

By way of comparison, fiber is very friendly. OM4 can take 10G hundreds of meters very efficiently. It's easy to work with, inexpensive to stock various lengths of patch, and you can get it in milspec variants that are resistant to damage. You can run it well past the standards-specced maximum length in many cases.

On the flip side, 10GBase-T has the advantage of being a familiar sort of plug and play that generally doesn't require extra bits like SFP+ modules and may be easier to work with inside a rack.
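
To put a rough number on point 1: a back-of-the-envelope sketch, where the per-port delta, the cooling factor, and the power price are all just illustrative assumptions, not measurements:

Code:
# 48 ports, ~4 W extra per 10GBase-T port, ~1:1 cooling overhead, $0.12/kWh
# (all assumed figures for illustration)
echo "48 * 4 * 2 * 24 * 365 / 1000 * 0.12" | bc -l
# ~= 403, i.e. on the order of $400/year for one fully populated switch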

I am starting to see more of my customers wanting to use 10GBase-T

I think the big driver for many people is the familiarity thing I just mentioned; they can wrap their heads around 10GBase-T because at the superficial level it feels very much like 1GbE. There's a lot of FUD that has slowly percolated into the industry over the last few decades about fiber and fiber installers, because terminating fiber in the field is specialist work that requires expensive gear and supplies. However, these days you can often cheat and get factory-terminated prebuilt assemblies that can avoid the field termination work. Very easy to work with.
 
Joined
Dec 29, 2014
Messages
1,135
Thanks for the additional info. I have seen that in my own limited experience in my home office. The 10GBase-T liked most of the cables in a 10M trunk bundle (nice thing from fs.com), but not all of them. Perhaps it was the coupler in the no-punch panel, I don't know. Yes I can punch when needed, but it isn't one of my favorite leisure activities. All that was part of trying to consolidate some of the stuff in my home office to reduce the background noise, but I am straying too far off topic. Thanks again for the excellent clarification.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Yes I can punch when needed, but it isn't one of my favorite leisure activities.

Actually, this is where things are more likely to go off the rails. Category cable is all about twisted-pair characteristics, and, if you'll allow me a little inaccuracy for the sake of a layman's explanation, there are RF components to the issue as well as physical factors:

Category 3 cable, which any monkey should be able to terminate, operates at 16 MHz, handling 10Mbps, and due to the long twist length, it was very common to find people untwisting excessive amounts of cable and just punching it down. This continued at least into the Cat5 100Mbps era, and, among other things, both 10 and 100Mbps only used two of the four pairs, which made sloppiness somewhat more forgiving, since screwing up a pair still left you operable in many cases.

[Attached image: badtermination.jpg]

However, with 1GbE, better signal processing led to the use of all four pairs, AND all four pairs being used simultaneously in both directions. This is most of how we got to 1GbE without a significant increase in bandwidth; Cat5e was only 100MHz-350MHz depending on the era. Crosstalk (RF interference between pairs) and delay skew (differences in transmission time due to differing lengths of the pairs) became significant issues, though, and it therefore became necessary for installers to up their game on the quality of field terminations. You had to bring the twist almost all the way up to the terminals, and also make sure that you weren't causing pair lengths to differ, or shortening one conductor of a pair more than another. Messing this up would cause weird problems and failures.

With 10GbE, we have once again boosted the signaling bandwidth, to 500 MHz, and moved to a much more complicated encoding strategy that includes 16 discrete signal levels. This means that it is even more sensitive to field termination errors, and you really need a perfectionist-grade punch technique followed by a thorough cable test/certification to get this working reliably.
 

Tony-1971

Contributor
Joined
Oct 1, 2016
Messages
147
I was checking the values in my Core system, and there is this error:
Code:
root@freenas-sm[~]# sysctl net.inet.tcp.recvbuf_inc
sysctl: unknown oid 'net.inet.tcp.recvbuf_inc'

root@freenas-sm[~]# sysctl net.inet.tcp | grep -i recv
net.inet.tcp.recvspace: 524288
net.inet.tcp.recvbuf_max: 16777216
net.inet.tcp.recvbuf_auto: 1

The other OIDs are present.
Is it missing because
cc_cubic_load="YES"
cc_dctcp_load="YES"
are not present?

Best Regards,
Antonio
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
No, it looks like they rejiggered it in FreeBSD 13 to not be needed. That's what I get for doing my testing on TrueNAS CORE 12 I guess.

See


It's still fine, the _max value is really the more important bit.
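
If you want to sanity-check your own CORE box, the buffer limits that do still exist can be read like so (exact values will vary by release and by whatever tunables you have set):

Code:
sysctl net.inet.tcp.recvbuf_max net.inet.tcp.sendbuf_max
sysctl net.inet.tcp.recvbuf_auto net.inet.tcp.sendbuf_auto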
 
Joined
Dec 29, 2014
Messages
1,135

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Those are loadable tunables, not sysctl tunables.

I may have misunderstood the question. @Tony-1971 I thought you were asking if the _inc OID was missing because the modules weren't loaded. You can check what CC modules are loaded and which is active:

Code:
net.inet.tcp.cc.available: newreno
net.inet.tcp.cc.algorithm: newreno
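
If cubic (or dctcp) isn't listed under "available", one way to pull it in by hand for testing is something like this; the cc_cubic_load/cc_dctcp_load loader lines are what make it stick across reboots:

Code:
kldload cc_cubic
sysctl net.inet.tcp.cc.available         # cubic should now be listed
sysctl net.inet.tcp.cc.algorithm=cubic   # make it the active algorithm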
 
Joined
Dec 29, 2014
Messages
1,135
I think I need to tune something else. It looks like my box is still using newreno even though the other methods are available.
Code:
freenas2% kldstat | grep cc
 7    1 0xffffffff82e76000     1e90 cc_cubic.ko
 8    1 0xffffffff82e78000     2850 cc_dctcp.ko
freenas2% sysctl -a | grep '\.\cc\.'
net.inet.tcp.cc.dctcp.slowstart: 0
net.inet.tcp.cc.dctcp.shift_g: 4
net.inet.tcp.cc.dctcp.alpha: 1024
net.inet.tcp.cc.newreno.beta_ecn: 80
net.inet.tcp.cc.newreno.beta: 50
net.inet.tcp.cc.abe_frlossreduce: 0
net.inet.tcp.cc.abe: 0
net.inet.tcp.cc.available: newreno, cubic, dctcp
net.inet.tcp.cc.algorithm: newreno
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Do you have a sysctl that sets

net.inet.tcp.cc.algorithm=cubic

Maybe I didn't make that clear in my resource... mmm
 
Joined
Dec 29, 2014
Messages
1,135
Maybe I didn't make that clear in my resource... mmm
It is referenced in there, but it wasn't as obvious as the other things.
 

specter9mm

Dabbler
Joined
Apr 21, 2023
Messages
13
When I try to set these tunables in SCALE I get the "Sysctl 'xxx' does not exist in kernel" message.

Are these tunables named differently in SCALE, or are they only applicable to CORE?
Most of the kernel options there are specific to the BSD-based TrueNAS CORE. There are similar options for the Debian Linux-based TrueNAS SCALE, but they require some research.
 

Mlovelace

Guru
Joined
Aug 19, 2014
Messages
1,111
Most of the kernel options there are specific to the BSD-based TrueNAS CORE. There are similar options for the Debian Linux-based TrueNAS SCALE, but they require some research.
On the 10G systems that I have switched to SCALE, these are the network sysctls I have implemented, FWIW. Make sure you add the modprobe tcp_dctcp to a post-init command.

Code:
net.core.netdev_max_backlog = 300000
net.core.optmem_max = 268435456
net.core.rmem_default = 212992
net.core.rmem_max = 134217728
net.core.somaxconn = 8192
net.core.wmem_default = 212992
net.core.wmem_max = 134217728
net.ipv4.tcp_congestion_control = dctcp
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_sack = 0
net.ipv4.tcp_wmem = 4096 65536 134217728
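
For what it's worth, the modprobe piece could live in a small post-init script along these lines (the script itself is just a sketch, not the official mechanism):

Code:
#!/bin/sh
# Sketch of a post-init script: load the DCTCP module, then select it.
modprobe tcp_dctcp
sysctl -w net.ipv4.tcp_congestion_control=dctcp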
 

Mlovelace

Guru
Joined
Aug 19, 2014
Messages
1,111
I think I need to tune something else. It looks like my box is still using newreno even though the other methods are available.
Code:
freenas2% kldstat | grep cc
 7    1 0xffffffff82e76000     1e90 cc_cubic.ko
 8    1 0xffffffff82e78000     2850 cc_dctcp.ko
freenas2% sysctl -a | grep '\.\cc\.'
net.inet.tcp.cc.dctcp.slowstart: 0
net.inet.tcp.cc.dctcp.shift_g: 4
net.inet.tcp.cc.dctcp.alpha: 1024
net.inet.tcp.cc.newreno.beta_ecn: 80
net.inet.tcp.cc.newreno.beta: 50
net.inet.tcp.cc.abe_frlossreduce: 0
net.inet.tcp.cc.abe: 0
net.inet.tcp.cc.available: newreno, cubic, dctcp
net.inet.tcp.cc.algorithm: newreno
In my environment on TrueNAS CORE systems, I found the htcp congestion control algorithm to yield the best results. You could give it a try and see if you have a similar experience. You'll need to load the htcp kernel module before switching to htcp:

Code:
kldload cc_htcp

sysctl net.inet.tcp.cc.algorithm=htcp


If that works out for you, add the following tunables:

Variable                     Value   Type     Enabled
cc_htcp_load                 YES     LOADER   yes
net.inet.tcp.cc.algorithm    htcp    SYSCTL   yes
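
After a reboot, a quick sanity check that both tunables took effect might look like this:

Code:
kldstat | grep cc_htcp               # confirms the LOADER tunable loaded the module
sysctl net.inet.tcp.cc.algorithm     # should report htcp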
 

specter9mm

Dabbler
Joined
Apr 21, 2023
Messages
13
Alright, I'm curious what else, if anything, I can do. I am topping out at 7.1Gbps. Attached are screenshots of my sysctl settings. The systems have identical processors, motherboards, memory, and SFP cards. In each instance the first connection is over my switch, which has plenty of capacity to handle multiple simultaneous 10Gbps streams. The second in each screenshot is a direct server-to-server interface connection. The sysctl options are the same on both, which I have also attached a screenshot of. MTU on the server-to-server link is set to 9000; through the switch it is 1500, as nothing else on my network is jumbo packet. I just can't seem to get close to 10Gbps.
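
For reference, the numbers come from plain iperf3 runs between the two boxes, something along these lines (the address is just a placeholder for the other server):

Code:
iperf3 -s                          # on the receiving server
iperf3 -c 192.168.10.2 -t 30       # on the sending server, single stream, 30 seconds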
 

Attachments

  • Screenshot 2024-02-22 221428.png
  • Screenshot 2024-02-22 221105.png
  • Screenshot 2024-02-22 220856.png

Mlovelace

Guru
Joined
Aug 19, 2014
Messages
1,111
Alright, I'm curious what else, if anything, I can do. I am topping out at 7.1Gbps. Attached are screenshots of my sysctl settings. The systems have identical processors, motherboards, memory, and SFP cards. In each instance the first connection is over my switch, which has plenty of capacity to handle multiple simultaneous 10Gbps streams. The second in each screenshot is a direct server-to-server interface connection. The sysctl options are the same on both, which I have also attached a screenshot of. MTU on the server-to-server link is set to 9000; through the switch it is 1500, as nothing else on my network is jumbo packet. I just can't seem to get close to 10Gbps.
The retransmits on that iperf screenshot (Retr 7149) show you've got some issues going on. The Retr column stands for retransmitted TCP packets and indicates the number of TCP packets that had to be sent again. The lower the value in Retr, the better; an optimal value would be 0, meaning that no matter how many TCP packets have been sent, not a single one had to be resent. A value greater than zero indicates packet loss, which might arise from network congestion (too much traffic) or from corruption, perhaps due to hardware issues.
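
If it were me, I'd start by checking the interface error counters on both boxes (and on the switch ports), and then see whether parallel streams behave any differently. A rough sketch; the interface names and the address are placeholders:

Code:
# On SCALE / Linux:
ip -s link show dev eth0                   # RX/TX errors and drops
ethtool -S eth0 | grep -i -e err -e drop   # driver-level counters
# On CORE / FreeBSD:
netstat -i                                 # Ierrs / Oerrs columns
# Then compare against a parallel-stream run:
iperf3 -c 192.168.10.2 -P 4 -t 30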
 

specter9mm

Dabbler
Joined
Apr 21, 2023
Messages
13
The retransmits on that iperf screenshot (Retr 7149) show you've got some issues going on. The Retr column stands for retransmitted TCP packets and indicates the number of TCP packets that had to be sent again. The lower the value in Retr, the better; an optimal value would be 0, meaning that no matter how many TCP packets have been sent, not a single one had to be resent. A value greater than zero indicates packet loss, which might arise from network congestion (too much traffic) or from corruption, perhaps due to hardware issues.
Yep, that's why I went direct interface to interface; the retransmits only happen through the switch. The only other difference is the GBICs: one machine has Finisar and one machine has Cisco GBICs. The fiber cable is only 0.5 meter, and I have tried swapping it with other 0.5 meter cables I have. The only other thing I can think of is to try Cisco-to-Cisco and Finisar-to-Finisar GBICs.
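
Before swapping modules around, I suppose I could also read the transceivers' own diagnostics, assuming the drivers expose them; the interface names here are placeholders:

Code:
# On SCALE / Linux: dump the SFP+ module's DOM readings (if the module supports it)
ethtool -m eth0
# On CORE / FreeBSD: some drivers (e.g. ixgbe) print module info with -v
ifconfig -v ix0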
 