SOLVED Complete network failure after a few hours of uptime

toxikat

Dabbler
Joined
Nov 3, 2022
Messages
27
Hello community, I'm stumped with an issue, hoping someone can provide some advice or shed some light.

Problem description:

The symptom is that the NAS randomly loses all network connectivity, both on the local network and to the internet. The NAS runs fine for hours to days, then suddenly experiences this issue. A reboot restores all connectivity (until the next outage).

To clarify, the nas is not shut off. It continues logging other messages.

My local network consists of a router (a proprietary Telus Wi-Fi Hub, possibly made by Arcadyan), to which the NAS is connected directly via Ethernet on LAN 1 (Intel), and my PC, which is also connected directly to the router via Ethernet. My PC does not experience any network issues.

The NAS is inaccessible via the web GUI, and also via SSH and SMB (Windows File Explorer) from my PC. The jails are also offline, and their logs show similar network outages.

In /var/log/messages, I see a message related to my DDNS saying that an external known good domain cannot be resolved:
Code:
Nov  3 01:23:50 truenas 1 2022-11-03T01:23:50.755089-07:00 truenas.local inadyn 1494 - - Failed resolving hostname domains.google.com: Name does not resolve
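Since the NAS keeps logging while the network is down, these inadyn failures can serve as rough timestamps for when each outage started. A minimal sketch, assuming the log lines look like the one above; `resolve_failures` is a made-up helper name, not a TrueNAS tool:

```shell
#!/bin/sh
# Read syslog lines on standard input and print the month, day and time
# of every inadyn "Failed resolving hostname" entry.
# resolve_failures is a hypothetical helper name.
resolve_failures() {
    grep 'inadyn' | grep 'Failed resolving hostname' | awk '{ print $1, $2, $3 }'
}
```

It could be run as `resolve_failures < /var/log/messages` after a reboot. The first timestamp in each cluster of failures is only an approximation of when connectivity was lost, since inadyn checks on its own schedule.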

State of my Nas:

Version: TrueNAS-13.0-U2
4gyr6rsyyrx91.png


rbm50w20zrx91.png


lkz9op21zrx91.png

adxhp1egzrx91.png


What I've tried (and hasn't worked)

I've set my router to assign a static local address for my Nas. Note that it used to be on 192.168.1.69. I've switched it to 192.168.1.250 for testing purposes.

uuij6056yrx91.png

The frequency increased when I set up my openvpn server, so I disabled that and rolled back the config. Frequency still remains on the "hours to days" timescale.

I've tried switching over to the other lan port on the motherboard, but it won't connect to network at all. I suspect it's either disabled or doesn't have good drivers. I'm unable to verify the reason due to not having a monitor to plug into the Nas at this moment.

I've tried assigning my nas (via router ui) to 192.168.1.250 to help avoid any dhcp conflicts (as recommended by a member of the Truenas community discord).



Potential paths forward

In case it is the intel lan that is bad, I plan on getting a network card and trying that out.

Otherwise, I'm hoping the community can make suggestions to help diagnose and fix this issue.


Relevant hardware



CPU: Xeon E-2286M
Mobo: (not listed)
Mobo lan 1: Intel GbE (10/100/1000 Mbit)
Mobo lan 2: Killer E2400
Ram: Samsung 16x2 UDIMM ECC
SSD Cache: WD SN570
OS SSD: Samsung SM951 M.2 256GB
CPU Cooler: Be-Quiet tower
PSU: Seasonic Focus GX-650
Case: Phanteks Enthoo Pro
HDD: 8x WD Easystore 12 TB
GPU: NVIDIA M2000
 

toxikat

Dabbler
Joined
Nov 3, 2022
Messages
27
I thought it was best to have more exposure. I'll remove my post from reddit for now.
 

toxikat

Dabbler
Joined
Nov 3, 2022
Messages
27
I received a comment on reddit from MisterBazz:
This is why I never recommend treating your NAS like a hypervisor or container host.

Disable/shutdown ALL of your jails. See if that fixes the issue. Then, you'll know it was a jail (or multiple).

If you are setting a static IP on your TrueNAS server (which you should be), then you don't need a static reservation in your router's DHCP pool. Just exclude that address from the rest of the pool.

Also, make sure no other device on your network is using that IP (1.250).

I might have to disable all my jails to see. If it were to be one jail, it would certainly narrow it down. But I'm confused about how a jail can cause network outage for the main system?

Regarding the DHCP pool, I did have it reserved via router so nothing else should've been able to take it. Unless there's reason to believe otherwise I think I'd rule that out for now.
 

Volts

Patron
Joined
May 3, 2021
Messages
210
Ahh, sorry - I overlooked your comment before:

I've tried switching over to the other lan port on the motherboard, but it won't connect to network at all. I suspect it's either disabled or doesn't have good drivers. I'm unable to verify the reason due to not having a monitor to plug into the Nas at this moment.

As a general rule Intel Ethernet is the safe bet and a good place to start.

I have no sense of how well Atheros-based "Killer" Ethernet works on FreeBSD/TrueNAS, only that Intel Ethernet is very mature.
 

WN1X

Explorer
Joined
Dec 2, 2019
Messages
77
What does ifconfig alc0 show once the network goes out?
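For anyone following along, the interesting parts of that output are usually the flags, the IPv4 address, the negotiated media and the link status. A rough sketch of a filter that keeps just those lines, assuming FreeBSD-style `ifconfig` output; `link_summary` is a made-up name:

```shell
#!/bin/sh
# Read `ifconfig IFACE` output on standard input and keep only the lines
# that describe link state. link_summary is a hypothetical helper name.
link_summary() {
    awk '
        /flags=/                            { print $1, $2 }      # interface name and flags
        $1 == "inet"                        { print "inet", $2 }  # IPv4 address only
        $1 == "media:" || $1 == "status:"   { $1 = $1; print }    # strip leading whitespace
    '
}
```

Usage would be `ifconfig alc0 | link_summary`. A `status:` line reading anything other than `active` during an outage would point at the link itself rather than at addressing.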
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
If it were to be one jail, it would certainly narrow it down. But I'm confused about how a jail can cause network outage for the main system?

Quite easily. When two IPv4 devices are fighting over an IP address, you get lots of ARP activity, and neither one ends up working properly all of the time. If your NAT gateway, for example, temporarily learns the jail MAC as owner of the IP address in question, your NAS loses its ability to receive traffic from the (entire) Internet. But it likely changes seconds or minutes later. It ends up being a weird combination of which device (jail or host) is able to convince the various endpoints on your local network of what MAC owns the IP in question, so there is neither consistency nor deterministic behaviour.
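One way to look for that kind of fight is to watch, from another Unix-like machine on the same LAN, which MAC address its ARP table holds for the NAS IP over time. A minimal sketch, assuming `arp -an` prints lines of the form `? (IP) at MAC on IFACE ...`; the helper names `mac_for_ip` and `distinct_macs` are made up for illustration:

```shell
#!/bin/sh
# mac_for_ip IP: read `arp -an` output on standard input and print the MAC
# currently recorded for IP. Incomplete entries are skipped.
mac_for_ip() {
    awk -v ip="($1)" '$2 == ip && $3 == "at" && $4 != "(incomplete)" { print $4 }'
}

# distinct_macs: read one MAC per line and print how many different ones appear.
distinct_macs() {
    sort -u | grep -c .
}
```

Running `arp -an | mac_for_ip 192.168.1.250 >> nas_macs.txt` every minute or so and then `distinct_macs < nas_macs.txt` gives a count. More than one MAC for the same IP (without a deliberate NIC change) would be consistent with two devices claiming the address.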

alc0 is active.

And generally speaking the Atheros ethernet chipsets are crap. Our laptops have them and the only upside I can think of is "at least they're not wifi".

If you are setting a static IP on your TrueNAS server (which you should be),

And this is also an excellent recommendation. I'm sure I make it at least monthly to someone here.
 

toxikat

Dabbler
Joined
Nov 3, 2022
Messages
27
What does ifconfig alc0 show once the network goes out?
I'm currently unable to connect a monitor to the nas so I cannot run commands on it (I hard shutdown/reboot every time, though I suspect I can probably schedule a reboot every, say, 3 hours to mitigate this sort of.)

Once I have a suitable setup I'll be able to run this and post here.
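As an alternative to a blind reboot every few hours, a small watchdog could reboot only after several failed connectivity probes in a row. This is a sketch under assumptions, not something from TrueNAS itself: `probe_and_count`, the state file and the threshold are all made up for illustration.

```shell
#!/bin/sh
# probe_and_count STATEFILE THRESHOLD PROBE...
# Runs PROBE. On success the failure counter in STATEFILE is reset and "ok"
# is printed. On failure the counter is incremented and either "fail N" or,
# once N reaches THRESHOLD, "reboot" is printed.
probe_and_count() {
    statefile=$1
    threshold=$2
    shift 2
    if "$@" >/dev/null 2>&1; then
        echo 0 > "$statefile"
        echo ok
        return 0
    fi
    count=$(cat "$statefile" 2>/dev/null)
    count=${count:-0}                # missing or empty state file counts as 0
    count=$((count + 1))
    echo "$count" > "$statefile"
    if [ "$count" -ge "$threshold" ]; then
        echo reboot
    else
        echo "fail $count"
    fi
    return 1
}
```

A cron script could call something like `probe_and_count /tmp/gw.state 3 ping -c 1 ROUTER_IP` (with the router's address filled in) once a minute and run `shutdown -r now` only when the output is `reboot`. This masks the problem rather than fixing it, so it is only a stopgap while diagnosing.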

And this is also an excellent recommendation. I'm sure I make it at least monthly to someone here.
I just read this guide and it seems like I already have a static ip (192.168.1.69). Did you mean something else?
 

toxikat

Dabbler
Joined
Nov 3, 2022
Messages
27
I think I might set up a command to log the output of `ifconfig alc0` every minute. That way I should be able to scoop it out after a reboot.

I set up the cron job below and will check its output the next time the network drops.

Code:
echo $(date -u) $(ifconfig alc0) > temp$(date +%s).txt
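A possible variation on the one-liner above: it creates a new file every minute, and the unquoted `$(ifconfig alc0)` collapses the output onto one line. In a plain crontab entry an unescaped `%` is also treated specially, though how the TrueNAS cron UI handles that is not something this sketch assumes. Putting the logic in a script avoids both issues. `log_iface` and the `IFCONFIG` override are hypothetical names:

```shell
#!/bin/sh
# log_iface IFACE LOGFILE: append a UTC timestamp header followed by the
# full, multi-line interface state to LOGFILE.
log_iface() {
    iface=$1
    logfile=$2
    {
        printf '=== %s ===\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')"
        # IFCONFIG may be overridden for testing; it defaults to the real tool.
        "${IFCONFIG:-ifconfig}" "$iface" 2>&1
    } >> "$logfile"
}
```

Saved as a script whose last line is a call such as `log_iface alc0 /path/on/pool/alc0.log` (path is a placeholder), it can be run from cron every minute and read back after a reboot. The log should live on persistent storage so that it survives the hard reset.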
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Did you mean something else?

No. I'm just confirming that it's good advice.

I think I might set up a command to log the output of `ifconfig alc0` every minute. That way I should be able to scoop it out after a reboot.

That may be useless. The failure modes with these crappy ethernet chipsets are often a failure under load, or when buffers fill, or when the ethernet cable isn't quite 100%, or any of a bunch of other random dumbness. I usually beat on Realtek for being complete crap, but it really extends to Atheros and some others as well. A lot of these are really designed for Windows, and are designed to be *cheap*, so that a laptop or desktop manufacturer can tick off a checkbox on their feature list. They often don't publish documentation on how the chipset works, and then the FreeBSD and Linux folks have to reverse engineer a driver to make them work.
 

toxikat

Dabbler
Joined
Nov 3, 2022
Messages
27
That may be useless. The failure modes with these crappy ethernet chipsets are often a failure under load, or when buffers fill, or when the ethernet cable isn't quite 100%, or any of a bunch of other random dumbness. I usually beat on Realtek for being complete crap, but it really extends to Atheros and some others as well. A lot of these are really designed for Windows, and are designed to be *cheap*, so that a laptop or desktop manufacturer can tick off a checkbox on their feature list. They often don't publish documentation on how the chipset works, and then the FreeBSD and Linux folks have to reverse engineer a driver to make them work.

Sure, but I'm using the Intel lan. Does that still apply? If so, can you recommend a good network card?
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
And what about listing a default gateway?
 

toxikat

Dabbler
Joined
Nov 3, 2022
Messages
27
As Volts noted above, the Atheros ethernet is active. What makes you think you're using the Intel LAN? Because its state is down.
Ah that's a good point, for some reason I assumed it was the case. The motherboard documentation doesn't actually specify which lan port is which controller, so I guess it's the other one.
And what about listing a default gateway?
What about it? I believe my default gateway is already set as 192.168.1.69, but how would I set it otherwise?
 

toxikat

Dabbler
Joined
Nov 3, 2022
Messages
27
I have attached a monitor and got my intel lan to work (em0). I will be monitoring the system for the next while to report any crashes. No crashes since I switched over about 6 hours ago.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
What about it? I believe my default gateway is already set as 192.168.1.69, but how would I set it otherwise?
Your Global Configuration screenshot did not show an entry for Default Gateway.
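Whether the system actually has a default route can also be checked from a shell, independent of what the GUI shows. A small sketch, assuming `netstat -rn` lists the default route with `default` (FreeBSD) or `0.0.0.0` (Linux) in the first column; `default_gateway` is a made-up helper name:

```shell
#!/bin/sh
# Read `netstat -rn` output on standard input and print the gateway of the
# first default route found. Prints nothing if there is no default route.
default_gateway() {
    awk '$1 == "default" || $1 == "0.0.0.0" { print $2; exit }'
}
```

Usage would be `netstat -rn | default_gateway`. Empty output means no default route is installed. On a typical home LAN the printed address would be the router, not the NAS's own IP.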
 

toxikat

Dabbler
Joined
Nov 3, 2022
Messages
27

toxikat

Dabbler
Joined
Nov 3, 2022
Messages
27
Ever since changing my physical port to the intel one, I haven't had related network issues. Will be resolving this thread for now. Thanks for the help, everyone.
 