Repeatable dramatic drop in network throughput after about a day of uptime

saspus

Dabbler
Joined
Mar 1, 2022
Messages
10
I'm running TrueNAS-12.0-U8 on a P8Z68 board. After a while, usually about a day, network throughput drops dramatically (to about 1-2 Mbps), as reported by iperf3. Moreover, ssh-ing into the box and typing something is laggy, so something is definitely bonkers with the network.
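To pin down when the drop begins, the iperf3 measurement could be logged periodically rather than checked by hand. A minimal sketch, assuming the test is a TCP run started with `iperf3 -c <host> --json` and that the report carries the usual `end.sum_received.bits_per_second` field; the 10 Mbit/s floor is an arbitrary illustrative threshold, not something measured here.

```python
import json


def received_mbps(iperf3_json: str) -> float:
    """Return receiver-side throughput in Mbit/s from `iperf3 --json` output."""
    report = json.loads(iperf3_json)
    # For TCP tests iperf3 summarises the run under end.sum_received.
    bps = report["end"]["sum_received"]["bits_per_second"]
    return bps / 1e6


def is_degraded(mbps: float, floor_mbps: float = 10.0) -> bool:
    """True when throughput is below the (hypothetical) floor."""
    return mbps < floor_mbps
```

Run from cron on a client every few minutes and appended to a file with a timestamp, this would show whether the throughput falls off a cliff or decays gradually.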

This is consistently happening after about 12-24 hours of uptime. What I have observed so far:

1. This happens with the built-in Intel LAN adapter (Asus P8Z68 v-pro/gen3 board).
2. This happens with an HP dual-port NIC (NC360T), connected with a single port.
3. This happens with the NC360T in LACP.
4. This happens with a similar IBM card. At this point I'm convinced it has nothing to do with the NIC per se.
5. It does not matter whether the system was idle all that time or transferring massive amounts of files.
6. Replugging the LAN cable does not help it recover.
7. Resetting the network configuration via the console either does not help or leaves the network in a bad state (complete loss of connectivity).
8. Rebooting the whole NAS fixes it, for about a day.
9. There is nothing useful in /var/log/messages when that happens: it was working fine this morning, and wasn't after 11:20:


Code:
truenas% sudo tail /var/log/messages
Mar  1 00:00:00 truenas newsyslog[55520]: logfile turned over due to size>200K
Mar  1 00:00:00 truenas syslog-ng[1028]: Configuration reload request received, reloading configuration;
Mar  1 00:00:00 truenas syslog-ng[1028]: Configuration reload finished;
Mar  1 11:20:19 truenas kernel: Limiting open port RST response from 289 to 200 packets/sec
Mar  1 11:21:19 truenas kernel[1028]: Last message 'Limiting open port R' repeated 1 times, suppressed by syslog-ng on truenas.local


Is "Limiting open port RST response" relevant here? Reading this forum, it appears to be the result of someone knocking on a closed port. For lack of other ideas I can research in that direction, but it does not look like a likely culprit: it would be a bug if it were possible to bring down the NAS just by knocking on a closed port.
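For context, this message comes from FreeBSD's rate limiting of RST/ICMP replies (the cap is the `net.inet.icmp.icmplim` sysctl, 200 by default): the kernel saw 289 packets aimed at closed ports in one second and answered only 200 of them. To check whether such bursts line up with the slowdowns, the log can be scanned for them. A rough sketch, assuming the syslog line format shown above; the function name is made up for illustration.

```python
import re

# Matches lines such as:
# "Mar  1 11:20:19 truenas kernel: Limiting open port RST response from 289 to 200 packets/sec"
RST_LIMIT = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel: "
    r"Limiting open port RST response from (?P<seen>\d+) to (?P<cap>\d+) packets/sec"
)


def rst_bursts(lines):
    """Return (timestamp, packets_seen, cap) for each rate-limit message."""
    bursts = []
    for line in lines:
        m = RST_LIMIT.match(line)
        if m:
            bursts.append((m.group("stamp"), int(m.group("seen")), int(m.group("cap"))))
    return bursts
```

Feeding it the lines of /var/log/messages gives a list of burst times to compare against when the throughput actually dropped; a single burst barely over the cap, as in the excerpt above, would argue against it being the cause.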

Questions:
1. What other avenues do you guys suggest I can explore to further triage it? I cannot reproduce it on-demand, but it happens on its own within a day.
2. I have the system in that state now. What OS state can I look at to see what's going on?
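As a starting point for question 2, a snapshot of a few standard FreeBSD counters could be taken while the system is in the bad state and again after a reboot, then compared. A minimal sketch: the command list is only a suggestion (interface errors, mbuf usage, interrupt rates, CPU frequency and temperature), and `dev.cpu.0.temperature` is assumed to be present, which requires the coretemp(4) driver to be loaded.

```python
import subprocess

# Suggested commands only; adjust to taste.
COMMANDS = [
    ["netstat", "-i"],                    # per-interface packet/error counters
    ["netstat", "-m"],                    # mbuf usage and denied requests
    ["vmstat", "-i"],                     # interrupt counts and rates
    ["sysctl", "dev.cpu.0.freq"],         # current CPU frequency
    ["sysctl", "dev.cpu.0.temperature"],  # assumes coretemp(4) is loaded
]


def snapshot(commands=COMMANDS, timeout=10):
    """Run each command and return {command line: output or error note}."""
    results = {}
    for argv in commands:
        key = " ".join(argv)
        try:
            proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Command missing or hung: record that instead of aborting.
            results[key] = f"unavailable: {exc}"
            continue
        if proc.returncode == 0:
            results[key] = proc.stdout
        else:
            results[key] = "failed: " + proc.stderr.strip()
    return results
```

Printing the result of `snapshot()` to a dated file in both the good and the bad state makes differences (rising input errors, a CPU frequency stuck low, a high temperature) easy to spot with diff.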

This appears to be a FreeBSD system issue at this point, as opposed to a TrueNAS-specific one (storage is not involved). I only found this vaguely relevant thread, with no outcome: https://www.truenas.com/community/threads/weird-networking-problems-after-60-days-of-uptime.38175/

Any ideas are welcome!
 

saspus

Dabbler
Joined
Mar 1, 2022
Messages
10
Ok, to close this cliffhanger, this is what happened:

In desperation I booted into a memory test environment (memtest86), and while running the test I noticed unusually high CPU temperatures, pushing 95C. The heatsink was pretty cold, so I thought the reading was inaccurate, but then decided to check the thermal paste. Well, 10-year-old crusted crap, that's what it was. I replaced the paste with fresh MX-4, reassembled the thing, re-ran memtest, and the temperature did not exceed 65C this time.

Booted TrueNAS back up and haven't seen the original issue for over 2 weeks now, including with the built-in LAN adapter.

Could this be related (given that the CPU load rarely goes above 5%, according to the TrueNAS monitoring)? Maybe. Maybe there were spikes triggering throttling that the rest of the system did not like. Not sure what's up with that. The fact is, my issue has since disappeared, and the NAS is rock solid (minus occasional mDNS concussions, but that's a separate story).
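One way to test the throttling theory after the fact would be to log core temperatures from the running system instead of relying on the load graph. A minimal sketch, assuming the coretemp(4) driver is loaded so that `sysctl dev.cpu` prints lines of the form `dev.cpu.N.temperature: 45.0C`; the 90C limit is an arbitrary illustrative value, not a documented throttle point for this CPU.

```python
import re

# Expected line shape (assumption): "dev.cpu.0.temperature: 45.0C"
TEMP_LINE = re.compile(r"^dev\.cpu\.(\d+)\.temperature:\s*(-?\d+(?:\.\d+)?)C\s*$")


def parse_core_temps(sysctl_output: str) -> dict:
    """Map core number to temperature in Celsius from `sysctl dev.cpu` output."""
    temps = {}
    for line in sysctl_output.splitlines():
        m = TEMP_LINE.match(line.strip())
        if m:
            temps[int(m.group(1))] = float(m.group(2))
    return temps


def hot_cores(temps: dict, limit_c: float = 90.0) -> list:
    """Return the cores at or above the (hypothetical) limit, sorted."""
    return sorted(core for core, t in temps.items() if t >= limit_c)
```

Sampled once a minute alongside `dev.cpu.0.freq`, this would show whether short temperature spikes coincide with the frequency dropping, even when the average load stays around 5%.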
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Thermal throttling is a thing... if the CPU was getting too hot, some of that excess heat may have been transferred to other components via the board (the most sensitive would probably be the ones with less airflow over them), causing the system or the individual components themselves to throttle performance to cope with it.
 