Transfer above 50MB/s causes network failure on both NAS and client

Hamberglar

Cadet
Joined
Nov 7, 2022
Messages
6
I'm consistently facing an issue where reading or writing around 60 MB/s of data from my NAS causes both the NAS and the client to lose all network connectivity for a short time, until the SMB share is closed because of that connectivity loss.

The network itself is, as far as I can tell, fine during this time. However, from the client computer, you lose connectivity to everything, including the gateway.

The NAS and client are stable until you reach about 50-60 MiB/s, according to the readout on the dashboard, at which point a ping test to any local network resource (such as the gateway or the NAS itself) goes from the typical <1 ms to about 25 ms.

I've tested and encountered this issue with both SMB and FTP, so it's not a protocol issue as far as I can tell.

I did a bit of research and determined that the Realtek NIC I was using is not supported, and switched to an explicitly supported Intel chipset (i210). The issue persists. I have disabled the onboard NIC and all other non-essential onboard functions like RGB and sound. Memtest86 passed. All temps are ice cold. Brand new machine, brand new install. It's been an issue since day 1. The client machine doesn't seem to be part of the equation, because I haven't found a computer that doesn't experience this problem with my NAS, and I've tried several.

When the crash happens, nothing is written to /var/log/messages or any other log as far as I can tell. I've attached a ping test that shows the severity of the issue.
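For anyone wanting to reproduce that measurement, a loop like the sketch below timestamps each reply and flags the spikes while a transfer is running. The gateway address and the 20 ms threshold are just examples; substitute your own.

```shell
#!/bin/sh
# Continuously ping the gateway and flag round trips above a threshold.
# 192.168.1.1 and the 20 ms threshold are examples; substitute your own.
ping 192.168.1.1 | while read -r line; do
    ms=$(echo "$line" | sed -n 's/.*time=\([0-9.]*\).*/\1/p')
    [ -n "$ms" ] || continue        # skip lines without an RTT
    flag=""
    # awk handles the decimal comparison; plain sh only compares integers.
    if [ "$(awk -v ms="$ms" 'BEGIN { print (ms > 20) ? 1 : 0 }')" = "1" ]; then
        flag="  <-- spike"
    fi
    printf '%s  %s ms%s\n' "$(date +%T)" "$ms" "$flag"
done
```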

TrueNAS-13.0-U3
AMD Ryzen 5 5600G
ASRock A520M-ITX/AC
Intel i210
Samsung 860 evo (truenas system drive)
3x 8TB MaxDigitalData 7200RPM
1x 16GB Intel Optane (cache)
16GB non-ecc memory

I found a few other threads that mentioned similar issues but the "solution" seemed to always be switching to a supported NIC, which I've done.

Unfortunately, I can't think of a convenient way to do any testing to see if this is also an issue with a different OS. These drives are all I have at the moment.

I'd be happy to share any additional information; I'm a TrueNAS novice.
 

Attachments

  • cmd_E0lnmju8i6.png
    677.9 KB · Views: 87

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Conspicuously missing is any description of your network or networking hardware. Got a cheap ethernet switch in there? Trying to traverse a residential CPE device ("router")? Got a Realtek on the client side? You basically have to do some legwork here to show a stable network. A NAS is going to be very unpleasant to use if you do not have a rock solid network.

You'll need to be able to show that you're able to run iperf3 in each direction without hiccup. This is usually a good starting point for network debug.
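A minimal run, assuming the NAS answers at 192.168.1.50 (substitute your own address), looks something like this:

```shell
# On the NAS, start a listener (TrueNAS CORE ships with iperf3):
iperf3 -s

# On the client, test client -> NAS, then the reverse direction (-R):
# 192.168.1.50 is a placeholder for the NAS address.
iperf3 -c 192.168.1.50 -t 30
iperf3 -c 192.168.1.50 -t 30 -R
```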
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
I'm consistently facing an issue where reading or writing around 60 MB/s of data from my NAS causes both the NAS and the client to lose all network connectivity [...] I did a bit of research and determined that the realtek NIC I was using is not supported and switched to an explicitly supported intel chipset (i210). The issue persists.
Good research. What is the i210 AIC and are you certain it is genuine?
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
It almost sounds like you have a half-duplex Ethernet link in there somewhere.
 

Hamberglar

Cadet
Joined
Nov 7, 2022
Messages
6
Conspicuously missing is any description of your network or networking hardware. Got a cheap ethernet switch in there? Trying to traverse a residential CPE device ("router")? Got a Realtek on the client side? You basically have to do some legwork here to show a stable network. A NAS is going to be very unpleasant to use if you do not have a rock solid network.

You'll need to be able to show that you're able to run iperf3 in each direction without hiccup. This is usually a good starting point for network debug.

My switch is a netgear gs748T. The client is a realtek, but if having to use a specific brand of NICs on every device in my network is a requirement, then Truenas is not going to work for me.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The client is a realtek, but if having to use a specific brand of NICs on every device in my network is a requirement, then Truenas is not going to work for me.

You don't need to use a specific brand of ethernet. However, if you pick crappy network components, the weakest link in the chain rule absolutely applies. You might be right in that TrueNAS is not going to work for you, but neither is anything else. If your network sucks, your network sucks. If your server-side ethernet controller sucks, your experience will be poor. If your switch sucks, your experience will be poor. If your client-side ethernet controller sucks, your experience will be poor. And these are, unfortunately, additive in nature.

So to circle around to what I said previously, using iperf3 to test your network is a really good idea. You should be able to get about 930-940Mbps through your network, sustained, on a good network. That's probably more like 700-800Mbps with Realtek. If that's fine for you, then that's fine. But if the network actually stalls out and stops working, that is probably not going to work out well for you.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
What brand is this, exactly?

Presumably this


I believe it's some white label rebranded WD drive. The large cache size is suspicious; probably SMR. White label drives are a big flashing danger sign, as they are typically lower quality than the WD drives.
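If you want to see what's actually behind the label, the drive's identify block usually gives it away; a quick check from the TrueNAS shell (the device name here is a placeholder):

```shell
# Print model, serial, firmware, and capacity; rebadged drives often
# still report the original OEM model string here.
# /dev/ada1 is a placeholder; use the names from `camcontrol devlist`.
smartctl -i /dev/ada1
```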
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Presumably this


I believe it's some white label rebranded WD drive. The large cache size is suspicious; probably SMR. White label drives are a big flashing danger sign, as they are typically lower quality than the WD drives.

The official WD Purple Pro shows as 256MB cache and CMR - but I'm suspicious of the white-label rebranded HDDs as well.

Being as this is an AMD Ryzen system, were the typical steps of disabling the C6 deep sleep state and the "Cool N Quiet" power management done? I'm not sure how pressing that is under TN13, but it likely still applies.
 

Hamberglar

Cadet
Joined
Nov 7, 2022
Messages
6
You don't need to use a specific brand of ethernet. However, if you pick crappy network components, the weakest link in the chain rule absolutely applies. You might be right in that TrueNAS is not going to work for you, but neither is anything else. If your network sucks, your network sucks. If your server-side ethernet controller sucks, your experience will be poor. If your switch sucks, your experience will be poor. If your client-side ethernet controller sucks, your experience will be poor. And these are, unfortunately, additive in nature.

So to circle around to what I said previously, using iperf3 to test your network is a really good idea. You should be able to get about 930-940Mbps through your network, sustained, on a good network. That's probably more like 700-800Mbps with Realtek. If that's fine for you, then that's fine. But if the network actually stalls out and stops working, that is probably not going to work out well for you.

Tbh, I've never really understood why iperf does the things that it does. I get about 650-ish Mbps from iperf, but my real-world transfer speeds from the internet are faster than the numbers reported by a 10-stream iperf test. So color me biased, but I don't really put much stock in those results, because they're simply not correct.
What brand is this, exactly?
POS white label. I'm aware. I just wanted more than 10TB of storage as cheap as I could get. Might be SMR, not quite sure. But that's a matter of performance, and performance doesn't really concern me right now. I just want it to stop crashing my network, and I really doubt the drive is causing a network interface crash on a device it's not even installed in.
The official WD Purple Pro shows as 256MB cache and CMR - but I'm suspicious of the white-label rebranded HDDs as well.

Being as this is an AMD Ryzen system, were the typical steps of disabling the C6 deep sleep state and the "Cool N Quiet" power management done? I'm not sure how pressing that is under TN13, but it likely still applies.
I don't think I did anything like that. Got any documentation? I'll do some poking around in the meantime.
 

Attachments

  • cmd_0fGf8BFdUI.png
    47.2 KB · Views: 81
  • firefox_JAAN0mlPpW.png
    79.6 KB · Views: 80

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Tbh, I've never really understood why iperf does the things that it does. I get about 650-ish mbps from iperf, but my real world transfer speeds from the internet are faster than the numbers described by a 10-stream iperf test. So color me biased, but I don't really put much stock in those results because they're just simply not correct.

It is correct, it's just that you aren't really understanding what it is doing. Your not understanding the test is different than the test being "not correct." I'll see if I can explain.

iperf3 is testing the network chipset at both ends, along with components along the way, which would include stuff like LACP bundles and ethernet switch performance.

Your real world transfer speeds "from the internet" are largely irrelevant, as they are presumably transmitted on the sending side by a competent ethernet chipset in a server buried in a data center somewhere. What happens is that, most of the time, for test points that aren't too far away, and don't have significant network problems such as packet loss, these packets are firehosed at you, and because commercial internet gear is optimized for the task, these basically firehose out your CPE's ethernet port (cable/DSL modem, etc) at peak speed, with no pause between packets. This gets you great speed as reported "from the internet". Even the cheapest ethernet chipsets try very hard to receive these packets and deliver them to the OS, because a failure to do so would result in a retransmission of a packet (or packets). Further, this is a single stream of packets. This is easier to deal with.

However, when you have a local area network, and you have a crappy ethernet chipset doing both the transmitting and receiving, things are different. In particular, with multiple packet streams, there is contention for the transmit channel, and a cheap-arse non-optimized non-server chipset may struggle with the complicated task of keeping the traffic flowing smoothly as it multiplexes many packet streams onto the single TX channel. Ethernet manufacturers such as Intel, who design high performance ethernet chipsets, have specialists who are optimizing both the silicon on the card and the driver in the operating system, making optimizations such as interrupt coalescing, TCP offload, multiple queues, DMA access, etc., all operate at peak performance. I can *guarantee* you that Realtek isn't heavily invested in these sorts of optimizations. So what you get with the cheap stuff is instead of your network being a racetrack on which you drive high performance (Intel, Chelsio, Solarflare, etc) race cars, it turns into more of a gravel road where you are driving old jalopies (Realtek, Atheros, etc).

The thing that iperf3 is designed to do is to try to maximize the traffic being transmitted over the network, especially in a model that represents a heavily utilized server situation, which typically consists of dozens or even thousands of parallel flows. This is how those of us who do this professionally manage to maximize server performance. We're not interested in the fact that Realtek parts test poorly when used as network components -- we expect that. iperf3 isn't expected to be the equivalent of speedtest.net, it's intended to place the networking subsystem under stress, just like memtest86 for memory, CPUburn for CPU's, or solnet-array-test for your disk array.

I get about 650-ish mbps from iperf,

Pretty close to what I expected:

That's probably more like 700-800Mbps with Realtek.

And your 650 makes perfect sense if both ends were Realtek. Did I guess that correctly? And if you keep increasing the number of streams with the Realtek, you will get more degraded performance.
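To see that degradation directly, you can sweep the stream count; this is a sketch, with 192.168.1.50 standing in for the server address:

```shell
# Repeat the test with an increasing number of parallel streams (-P).
# A weak NIC typically holds up at -P 1 and falls apart as -P grows.
for n in 1 4 8 16; do
    echo "== ${n} parallel stream(s) =="
    iperf3 -c 192.168.1.50 -t 10 -P "$n" | tail -n 4
done
```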

My switch is a netgear gs748T. The client is a realtek, but if having to use a specific brand of NICs on every device in my network is a requirement, then Truenas is not going to work for me.

It isn't TrueNAS that's at fault here. It's the cheapo ethernet chipset, and quite possibly the low end switch. If you aren't performance oriented, it may not matter. But if it matters to you, your only real option is to start chucking the crummy gear into the trash. I would start with the ethernet card in the server, followed by the ethernet card in the client, followed by the Netgear. That's the order which I would expect to make the most difference.

(*) Obligatory disclosure: we're a Netgear Powershift Solutions Partner over here, but that doesn't require me to lie about the product performance.
 

Hamberglar

Cadet
Joined
Nov 7, 2022
Messages
6
It is correct, it's just that you aren't really understanding what it is doing. [...] It isn't TrueNAS that's at fault here. It's the cheapo ethernet chipset, and quite possibly the low end switch. If you aren't performance oriented, it may not matter.
Okay. So, we've already ruled all that out. I'm not using Realtek at either end, and I've completely eliminated the switch during testing, and neither the performance nor the issue changes. Also, it's a really nice switch; I literally have no clue what you're talking about. It's a 48-port PoE managed switch. It costs like $500 brand new.

If we can't get past this, I guess I'll just give up on TrueNAS and work with a different platform.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I'm not using realtek

You just finished saying you were using Realtek on the client side. What are you using, if not?

Also, it's a really nice switch, I literally have no clue what you're talking about. It's a 48 port poe managed switch.
My switch is a netgear gs748T.

Presumably you meant GS748TP, because the -T isn't a PoE switch. It also isn't a managed switch, it's a ProSafe Smart Switch, which means it only has a web UI, not a telnet or ssh interface.

A Dell PowerSwitch N3248P-ON is a managed PoE 48 port switch that's "really nice". Of course one expects "really nice" for almost $15,000. But you can get used gear like a Dell PowerConnect 5548P for about $500. Also managed, PoE, 48 port, "really nice". Both are likely to have better performance than the Netgear.

If we can't get past this, I guess I'll just give up on truenas and work with a different platform.

Okay. Good luck with that. The stuff I'm talking about applies to both FreeBSD and Linux. Those are your two major OS choices for NAS platforms. And whatever is causing you issues isn't likely to magically go away. 1GbE is a 20-year proven technology at this point, and most of the software kinks have been worked out for years. That suggests that you've got some hardware problem. But you're not being super helpful. I can only do so much from remote. At the end of the day, you've got to provide the clues.
 

Hamberglar

Cadet
Joined
Nov 7, 2022
Messages
6
Wow. Followup to that last post of mine. Big correction. And I guess I owe you an apology.

I guess I never actually tested real-world stability when transferring from a client with an Intel NIC to my server's Intel NIC (via SMB). In my defense, I basically had to steal another computer with an Intel NIC, because everything I own happens to be Realtek (I'm what you call a prosumer and swing very team red, so nothing I own has anything Intel built into it).

The speeds are the same as Realtek to Intel (or Realtek to Realtek, tbh), but ho-ho-holy shit, the stability is way better. Where I was seeing latency from the Realtek client to the Intel server of around 25 ms (to sometimes over 100 ms) during a large transfer, I'm now seeing between 1 ms and 15 ms, give or take. God damn, what do they even do at Realtek all day? Because making their product better ain't it, as you said.

And I don't know why it never occurred to me to just try doing the same transfer to an SMB share on a different computer before, but I just did. Same issue with the crashing, so it is the Realtek card causing the problem, and it just happens to be an issue with every device I own except for this one Lenovo ThinkCentre. Well, shit. What do I do now? I guess start by buying another Intel NIC for my main PC and just not buying a motherboard with a Realtek NIC ever again.

Well, thanks for teaching me something new that has somehow never come up in my decade-long career in small business IT. Consider me humbled.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
And I guess I owe you an apology.

None needed or expected. This stuff can be *hard*. I attempt to educate by pushing as much information as I can out there. I'm usually willing to chase problems as long as I can see a potential path to resolution, and as long as I'm getting cooperation. It's free community support. I'm not an employee. I just enjoy the challenge.

I'm what you call a prosumer and swing very team red so nothing I own has intel anything built into it

That's not a fatal flaw. The bits that are problematic for you, though, are that the days of revenue producing gigabit ethernet chipsets are fifteen to twenty years behind us. There are a host of ethernet companies that no longer exist, whose IP went up in smoke, eaten on the low end by the Realtek class companies, and who couldn't compete with the incumbency of Intel on its march through server dominance from ... what, 2005-2020? Years are debatable but the point is valid anyways.

but ho-ho-holy shit the stability is way better.

You're almost certainly losing the occasional packet. That's the usual "stability" issue. You might be able to narrow it down to sending or receiving side, but that will just be frustrating or even rage inducing.
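One way to narrow it down is a UDP run, since iperf3 then reports lost datagrams separately for each direction; the address and bitrate below are placeholders:

```shell
# UDP near line rate; the end-of-test summary shows Lost/Total datagrams.
iperf3 -u -b 900M -c 192.168.1.50 -t 10      # client -> server loss
iperf3 -u -b 900M -c 192.168.1.50 -t 10 -R   # server -> client loss
```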

Where I was seeing latency from realtek client to intel server around 25ms (to sometimes over 100ms) during a large transfer, I'm now seeing between 1ms and 15ms, give or take. God damn, what do they even do at Realtek all day? Because making their product better ain't it, as you said.

I'm totally fine with trash talking Realtek, but I will note that some of their problems may not be their own. There is a thriving business over in Shenzhen for knock-off parts that have probably only been tested for compatibility with Windows drivers. The PC market is a race to the bottom. This is what gave us crap like the Capacitor Plague. Everybody trying to find a way to sell for a buck cheaper.

And I don't know why it never occurred to me to just try doing the same transfer to an smb share on a different computer before, but I just did. Same issue with the crashing, so it is the realtek card causing the problem, and it just happens to be an issue with every device I own except for this one Lenovo Thinkcentre. Well, shit. What do I do now? I guess start by buying another intel nic for my main pc and just not buying a motherboard with a realtek nic ever again.

Sorry to hear it. I'm glad you had the resources to discover this yourself. We've had lots of users come in over the years with "Realtek problems" and it always seems like substituting in an Intel Desktop CT ethernet card is the typical $35 fix, but I always feel a little bad providing that remediation advice.

Well, thanks for teaching me something new that has somehow never come up in my decade long career in small business IT. Consider me humbled.

Not a problem. It's a funny thing, doing this sort of stuff on a forum. You kinda threw me above with the iperf-vs-speedtest stuff. It can be difficult to convey these topics in a text format, and I kinda got the sinking feeling that this was going nowhere. Glad it ended up productive. Also, stick around. There are a number of very clueful people here who have expertise in various areas of PC's, networking, etc. I'm an old UNIX guy; don't talk to me about SMB or AD. But you want to talk about hardware or UNIX fundamentals? Let's chat. :smile: You can learn a lot of cool stuff. If you want.
 