BUG: Unable to get networking to stay stable in 23.10.1

33Fraise33

Dabbler
Joined
Dec 24, 2023
Messages
10
Hello,

I recently installed TrueNAS SCALE, starting directly on 23.10.1. I seem to have quite a few network issues, though.

I recently had to reset TrueNAS completely to get rid of some network changes that could not be removed. After the reset I was able to configure my LAG again with VLANs on top of it. I then wanted to make another change, and it seems I have run into the same issue again.

My specs:
Asrock Rack X570D4U
3x MX500 2TB
1x Samsung 960 Evo boot drive
1x AMD Ryzen 3700x
1x 16GB DDR4 RAM (temporary until my 64GB arrives)

The ASRock board has two interfaces, which I would like to combine with LACP. On top of that I have VLANs running. It might also be of interest that my LACP bond runs at MTU 9200 and the VLANs at MTU 9000.
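As a side note on the MTU values: on Linux a VLAN cannot use a larger MTU than its parent device, so 9000 on top of 9200 is a consistent combination. A minimal sketch of that sanity check follows; the function name mtu_problems is hypothetical, and the VLAN names are only examples borrowed from the log output later in this thread.

```python
def mtu_problems(parent_mtu, vlan_mtus):
    """Return a message for every VLAN whose MTU exceeds the parent's MTU."""
    return [
        f"{name}: MTU {mtu} exceeds parent MTU {parent_mtu}"
        for name, mtu in vlan_mtus.items()
        if mtu > parent_mtu
    ]


# Values from the post: bond at 9200, VLANs at 9000 (VLAN names are examples).
print(mtu_problems(9200, {"vlan33": 9000, "vlan50": 9000}))
```

An empty list means the MTU layout itself is not the cause of the problem.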

The terminal shows the issue: I am unable to alter enp38s0 (after applying the changes it reverts to its previous state), and I can no longer create a bond (it was there before the last reboot).
I rebooted after those issues and they persist.

I cannot file a bug report from the web interface, as I am unable to connect the system to the internet at the moment.
 

33Fraise33

Dabbler
Joined
Dec 24, 2023
Messages
10
To add to this: there is an interface visible in the Linux CLI which I created earlier, but it no longer shows up in the network editor. Interestingly, I created this VLAN on top of the bond, but it now shows up on top of one of the members.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
The error appears to be saying that you've already configured one of those member interfaces. Delete the configuration from any interface you want to be a member of the LAGG, create that interface (bond0 is fine for its name), and assign the two physical interfaces to it. Then configure it appropriately--set an IP address (as an alias), make sure gateway/DNS are set correctly, etc.
 

33Fraise33

Dabbler
Joined
Dec 24, 2023
Messages
10
As you can see in the list, no interface configuration is visible. I only have my physical interfaces in the list, but in the Linux CLI there clearly is a VLAN that does not show up in the TrueNAS network config.
 

gozfly

Dabbler
Joined
Dec 24, 2023
Messages
10
Hello @33Fraise33
We have exactly the same problem with our bond0 interface.
Apparently the bond0 interface is renamed when the kernel initializes, causing it to lose its LACP configuration and connectivity.

Debug output from our case:
[2023/12/24 02:47:54] (INFO) middlewared.setup():203 - Interface 'bond0' is now 'enp5s0f0' (matched by link address '00:0a:f7:xx:xx:xx')
[2023/12/24 02:47:54] (INFO) middlewared.rename():112 - Renaming interface 'bond0' to 'enp5s0f0'
[2023/12/24 02:47:54] (INFO) middlewared.commit():118 - Renaming hardware interface 'bond0' to 'enp5s0f0'
[2023/12/24 02:47:54] (INFO) middlewared.commit():125 - Renaming interface configuration 'bond0' to 'enp5s0f0'

After TrueNAS restarts and the interface has been renamed, trying to create the interface again as in your video returns the error "UNIQUE constraint failed: network_lagginterfacemembers.lagg_physnic". The bond0 interface was only renamed: it still has the same lag_ports, so as far as the database is concerned those ports remain part of the original link aggregation.
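The constraint failure described above can be reproduced in isolation. The following is a minimal sketch, not the actual TrueNAS schema: only the table name network_lagginterfacemembers and the column lagg_physnic come from the error message. The lagg_group column, the in-memory database and the helper add_member are assumptions made for illustration.

```python
import sqlite3

# In-memory database; the real TrueNAS configuration database is not touched.
conn = sqlite3.connect(":memory:")

# Hypothetical, simplified schema. Only the table name and the UNIQUE column
# lagg_physnic are taken from the error message quoted in this thread.
conn.execute(
    """
    CREATE TABLE network_lagginterfacemembers (
        id INTEGER PRIMARY KEY,
        lagg_group TEXT NOT NULL,
        lagg_physnic TEXT NOT NULL UNIQUE
    )
    """
)


def add_member(group, physnic):
    """Try to attach a physical NIC to a LAGG; return 'ok' or the SQLite error text."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO network_lagginterfacemembers (lagg_group, lagg_physnic) "
                "VALUES (?, ?)",
                (group, physnic),
            )
        return "ok"
    except sqlite3.IntegrityError as exc:
        return str(exc)


# The original bond claimed the port; the membership row survives the rename.
first = add_member("bond0 (since renamed)", "enp5s0f0")
# Creating a fresh bond with the same port then collides with the stale row.
second = add_member("bond0", "enp5s0f0")

print(first)
print(second)
```

In this model the second insert can only succeed once the stale membership row is gone, which matches the observation that the ports still count as part of the original link aggregation.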

We are still investigating how to solve it but these are some clues we have found.
https://www.truenas.com/community/threads/problems-with-lacp-truenas-scale-22-02-0.99759/
See in particular the comment from the user @bgermain1689

You can also see our post on the related topic.
https://www.truenas.com/community/threads/bonding-interface-renamed-after-reboot.115154/

We tried the following command to tell the kernel not to rename the interface, without any success; the interface is still renamed.
midclt call system.advanced.update '{"kernel_extra_options": "bonding.max_bonds=0"}'

We hope this bug is resolved soon
 

33Fraise33

Dabbler
Joined
Dec 24, 2023
Messages
10
Interesting, so it seems to be a long-standing bug then?
Is there any way to reset the network config without having to fully reset TrueNAS itself?

So, as I understand it, if I had used bond1 instead of bond0 it should have been fine?
 

33Fraise33

Dabbler
Joined
Dec 24, 2023
Messages
10
I looked through journalctl and dmesg and I don't seem to have the "middlewared" entries you mention above. I also don't immediately see an error log that indicates what the issue might be, which is even scarier. Imagine running this in a datacenter on a multihomed setup and suddenly completely losing the storage because you run LACP.
 

gozfly

Dabbler
Joined
Dec 24, 2023
Messages
10
For that same reason we have stopped deploying TrueNAS in production; even imagining that scenario is troubling.

It turns out that the logs persist even after you reset the configuration and recover connectivity.
The logs shown above were obtained from the UI: System > Advanced > Save Debug
Specifically, the log is in the file ./ixdiagnose/artifacts/logs/middlewared.log

It seems more users are reporting the same problem with the interfaces after the update.
 

33Fraise33

Dabbler
Joined
Dec 24, 2023
Messages
10
Interesting, you were right:


Code:
[2023/12/24 13:32:27] (INFO) middlewared.setup():203 - Interface 'vlan33' is now 'enp38s0' (matched by link address 'MAC HERE')
[2023/12/24 13:32:27] (INFO) middlewared.rename():112 - Renaming interface 'vlan33' to 'enp38s0'
[2023/12/24 13:32:27] (INFO) middlewared.setup():203 - Interface 'bond0' is now 'enp38s0' (matched by link address 'MAC HERE')
[2023/12/24 13:32:27] (INFO) middlewared.rename():112 - Renaming interface 'bond0' to 'enp38s0'
[2023/12/24 13:32:27] (INFO) middlewared.commit():118 - Renaming hardware interface 'vlan33' to 'enp38s0'
[2023/12/24 13:32:27] (INFO) middlewared.commit():118 - Renaming hardware interface 'bond0' to 'enp38s0'
[2023/12/24 13:32:27] (INFO) middlewared.commit():125 - Renaming interface configuration 'bond0' to 'enp38s0'
[2023/12/24 13:32:27] (INFO) middlewared.commit():125 - Renaming interface configuration 'vlan33' to 'enp38s0'
[2023/12/24 13:32:27] (INFO) middlewared.commit():155 - Changing VLAN 'vlan50' parent NIC from 'bond0' to 'enp38s0'
[2023/12/24 13:32:27] (INFO) middlewared.commit():155 - Changing VLAN 'vlan333' parent NIC from 'bond0' to 'enp38s0'
[2023/12/24 13:32:27] (INFO) middlewared.commit():155 - Changing VLAN 'vlan33' parent NIC from 'bond0' to 'enp38s0'
[2023/12/24 13:32:27] (INFO) middlewared.commit():161 - Changing VM NIC device 2 from 'bond0' to 'enp38s0'
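For anyone who wants to check their own middlewared.log for the same pattern, here is a rough sketch that pulls the rename events out of lines in the format shown above. The regular expression is derived only from this excerpt, and the list of prefixes used to spot virtual interfaces (bond, vlan, br) is an assumption based on the names used in this thread.

```python
import re

# Pattern for the "Renaming interface" lines as they appear in the excerpt above.
RENAME_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \(INFO\) middlewared\.rename\(\):\d+ - "
    r"Renaming interface '(?P<old>[^']+)' to '(?P<new>[^']+)'$"
)


def find_renames(lines):
    """Return (timestamp, old_name, new_name) for every rename line."""
    renames = []
    for line in lines:
        match = RENAME_RE.match(line.strip())
        if match:
            renames.append((match["ts"], match["old"], match["new"]))
    return renames


def suspicious_renames(renames, virtual_prefixes=("bond", "vlan", "br")):
    """Keep renames whose old name looks like a virtual interface (assumed prefixes)."""
    return [r for r in renames if r[1].startswith(virtual_prefixes)]


sample = [
    "[2023/12/24 13:32:27] (INFO) middlewared.setup():203 - Interface 'vlan33' is now 'enp38s0' (matched by link address 'MAC HERE')",
    "[2023/12/24 13:32:27] (INFO) middlewared.rename():112 - Renaming interface 'vlan33' to 'enp38s0'",
    "[2023/12/24 13:32:27] (INFO) middlewared.rename():112 - Renaming interface 'bond0' to 'enp38s0'",
]

for ts, old, new in suspicious_renames(find_renames(sample)):
    print(f"{ts}: {old} -> {new}")
```

To run it against a real log, pass the open file object of the extracted middlewared.log to find_renames instead of the sample list.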
 

33Fraise33

Dabbler
Joined
Dec 24, 2023
Messages
10
It is not bad practice to give a virtual interface (bridge or VLAN) the MAC address of a physical interface linked to it; most network vendors do this. The issue seems to be that TrueNAS thinks such an interface is a duplicate and tries to rename or remove it by matching on the MAC address.
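The suspected failure mode can be illustrated with a toy model. This is not the middleware's code, only a sketch of why matching purely on MAC address goes wrong when a bond and its VLANs carry the MAC of a member NIC. The function names, the second NIC name enp39s0, the MAC addresses and the prefix list are all made up for the example.

```python
def match_by_mac(stored, live):
    """Map each stored interface name to the live NIC that has the same MAC.

    stored: {configured_name: mac} as remembered in a configuration
    live:   {physical_name: mac} as reported by the hardware at boot
    Returns {configured_name: physical_name} for names that would be renamed.
    """
    mac_to_live = {mac: name for name, mac in live.items()}
    renames = {}
    for name, mac in stored.items():
        target = mac_to_live.get(mac)
        if target is not None and target != name:
            renames[name] = target
    return renames


def match_physical_only(stored, live, virtual_prefixes=("bond", "vlan", "br")):
    """Same matching, but skip names that look like virtual interfaces."""
    physical = {
        name: mac
        for name, mac in stored.items()
        if not name.startswith(virtual_prefixes)
    }
    return match_by_mac(physical, live)


# Placeholder MACs. The bond and the VLAN carry the MAC of the first member NIC.
stored = {
    "enp38s0": "02:00:00:00:00:01",
    "enp39s0": "02:00:00:00:00:02",
    "bond0": "02:00:00:00:00:01",
    "vlan33": "02:00:00:00:00:01",
}
live = {"enp38s0": "02:00:00:00:00:01", "enp39s0": "02:00:00:00:00:02"}

print(match_by_mac(stored, live))         # bond0 and vlan33 both map to enp38s0
print(match_physical_only(stored, live))  # nothing to rename
```

Whether the real fix filters by interface type or by something else is not known from this thread; the second function only shows that excluding virtual interfaces from the MAC match is enough to avoid the renames in this toy case.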
 

33Fraise33

Dabbler
Joined
Dec 24, 2023
Messages
10
I am not sure what I am doing wrong, but I am encountering bug after bug. I have now set up my network as follows:
- physical => 2 VLANs => 2 bridges (brxx for mgmt, bryy for VMs), each with a VLAN attached as member => DHCP on brxx for TrueNAS (initially 172.16.33.29, which I changed to 172.16.33.3 through a static lease)
- Apply, reboot, all fine; it took the latter lease
- Changing the description of bryy makes the DHCP lease on brxx disappear, and brxx falls back to the address it initially had (172.16.33.29). My DHCP server did not hand out that address, so it must have been remembered somewhere.
- After a reboot it seems to take DHCP correctly again.
 

gozfly

Dabbler
Joined
Dec 24, 2023
Messages
10
I have new findings that temporarily work around the bonding problem.
Open the file /usr/lib/python3/dist-packages/middlewared/plugins/interface/link_address.py
and comment out only line 237, then save the changes.

Then restart the appliance several times, and you will see that the problem is temporarily resolved.
link_address.png
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
33Fraise33 wrote: "imagine running this in a datacenter on a multihomed setup and suddenly completely losing the storage because you run LACP."

You have misused the word "multihomed", which refers to a machine that is present on multiple layer 3 networks. LACP is a layer 2 thing. LACP is also not advisable for datacenter storage for the common uses; you are much better off with discrete networks attached to separate switches, so that a switch firmware crash or update doesn't take out your storage. This is one of the reasons that redundant switching needs separate layer 3 networks.
 

Brad T

Dabbler
Joined
Nov 23, 2016
Messages
15
Does anyone have a solution for this issue?
(I tried the bond1 name trick and changing the xmit_hash_policy from LAYER2+3 to LAYER3+4, but I still get the sqlite3 UNIQUE constraint error.)

I see multiple Jira tickets closed without any hint on how to fix the problem. Here's my ticket: NAS-125968

Is the workaround for now to abandon LACP / aggregation and downgrade to a single interface?
 

33Fraise33

Dabbler
Joined
Dec 24, 2023
Messages
10
They changed the priority for the bug to highest and are working on a hotfix which hopefully should be released soon: https://ixsystems.atlassian.net/browse/NAS-125932?focusedCommentId=233358

23.10.1.1
 

stillka

Explorer
Joined
Nov 15, 2014
Messages
55
Hello,

is there any way to debug a boot issue?

In my case, x-netif.service fails to start after updating from 23.10.0.1 to 23.10.1 (or the latest 23.10.1.1).
I have no special configuration: one adapter with a static IP:
01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller (rev 04)

Because the failure of x-netif.service to start creates an infinite loop, I can't access the console to check or collect log files...

Thanks.
 