[SOLVED] TrueNAS SCALE 22.12.3.1 Dropping Packets

tjarosz

Cadet
Joined: Mar 30, 2023
Messages: 7
I just upgraded to TrueNAS SCALE 22.12.3.1. After the upgrade completed successfully, I am seeing strange network instability. A constant ping to the TrueNAS web interface shows about 23% packet loss, while constant pings to other servers on the same network/subnet/rack show a 100% success rate. Even a constant ping to the TrueNAS server's Baseboard Management Controller (BMC) shows 100% success. Only the TrueNAS interface is dropping packets. The main interface runs on a 4x25Gbps bond to dual-redundant TOR switches, with an MLAG across the bonded ports.

[attached screenshot]
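Since the main interface is a bond, one quick sanity check is the member state reported by the Linux bonding driver. Here's a minimal Python sketch, assuming the bond is exposed as /proc/net/bonding/bond0 (the name may differ on your system), that prints each member's MII status, speed, and link-failure count:

```python
# Minimal sketch: print the state of each LACP bond member as reported
# by the Linux bonding driver. Assumes the bond is named bond0 -- adjust
# for your system (see /proc/net/bonding/ for the actual name).
from pathlib import Path

BOND = "bond0"  # assumption: your bond interface may be named differently

def bond_member_summary(bond: str = BOND) -> None:
    text = Path(f"/proc/net/bonding/{bond}").read_text()
    member = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Slave Interface:"):
            member = line.split(":", 1)[1].strip()
        elif member and line.startswith(("MII Status:", "Speed:",
                                         "Link Failure Count:")):
            print(f"{member}: {line}")

if __name__ == "__main__":
    bond_member_summary()
```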


I was seeing several other errors/alerts after the upgrade, but I gave the system time to settle down before getting worried. For example, I was seeing failed jobs like the one in this screencap:

[attached screenshot]


There were also several alerts mentioning something like an "Assert" / "Processor Detected" error, which I stupidly dismissed without taking a screenshot.

I'll also mention that the web interface seems very sluggish loading pages. Some pages (such as Virtualization) seem to crash the controller: after I click on the page, I get a "Waiting for Active TrueNAS controller to come up..." message and then land back at the login page. After I log in again, it goes to the page I originally requested. The controller also crashes when I try to import a certificate, and after the crash and re-login the certificate is not present on the Certificates page.

[attached screenshot]


This is running on an enterprise-grade Gigabyte server: AMD EPYC 7313P 16-core, 256GB RAM, 2x SATA SSD (boot/OS), 4x NVMe SSD (2x L2ARC on the main data pool and a 2x RAID 1 data pool for VMs), 16x 18TB Seagate Exos (15-wide RAIDZ3 with a hot spare), and 2x 4-port 25Gbps Broadcom NICs.
 

morganL

Captain Morgan
Administrator, Moderator, iXsystems
Joined: Mar 10, 2018
Messages: 2,694
What did you update from?

Is the problem isolated to 1 port or 1 NIC?
 

tjarosz

Cadet
Joined: Mar 30, 2023
Messages: 7
I updated from 22.12.2.

As I said above, the server has 2x 4-port 25Gbps NICs. The main TrueNAS access IP is on the 4-port 25Gbps bond (the first two ports from each NIC). The other NIC ports are on separate "storage" subnets to take advantage of SMB multi-channel. I only updated to 22.12.3.1 because of the multi-channel support. I already had these storage subnets provisioned and configured before the update so my Windows cluster nodes could use SMB multi-channel between themselves. It's great that 22.12.3 supports SMB multi-channel; I don't think I was aware that previous versions lacked it.

I initiated a constant ping from one of my cluster nodes on each subnet it shares with the TrueNAS box (see the sketch after this list for a way to automate the same check):
10.10.10.0/24 = Main network access (First 2 ports from each NIC) = 17% loss
10.10.75.0/24 = Storage Network 75 (NIC 1 Port 3) = 0% loss (lost 1 packet over 5 minutes)
10.10.76.0/24 = Storage Network 76 (NIC 2 Port 3) = 0% loss (0 packets lost)
10.10.77.0/24 = Storage Network 77 (NIC 1 Port 4) = 0% loss (0 packets lost)
10.10.78.0/24 = Storage Network 78 (NIC 2 Port 4) = 0% loss (0 packets lost)
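
For completeness, here's a minimal Python sketch of that per-subnet check. The target addresses are placeholders (the actual TrueNAS IPs on each subnet aren't shown in this thread); it just shells out to the standard iputils ping and pulls the packet-loss summary line:

```python
# Minimal sketch of the per-subnet ping test above.
# The addresses below are placeholders -- substitute the TrueNAS IP
# on each of your subnets. Relies on the standard Linux iputils `ping`.
import subprocess

TARGETS = {
    "10.10.10.0/24 main (bond)": "10.10.10.1",   # placeholder address
    "10.10.75.0/24 storage 75":  "10.10.75.1",   # placeholder address
    "10.10.76.0/24 storage 76":  "10.10.76.1",   # placeholder address
    "10.10.77.0/24 storage 77":  "10.10.77.1",   # placeholder address
    "10.10.78.0/24 storage 78":  "10.10.78.1",   # placeholder address
}

def loss_summary(host: str, count: int = 20) -> str:
    """Run ping and return its 'packet loss' summary line."""
    out = subprocess.run(["ping", "-c", str(count), host],
                         capture_output=True, text=True).stdout
    return next((line for line in out.splitlines() if "packet loss" in line),
                "no reply")

for label, host in TARGETS.items():
    print(f"{label}: {loss_summary(host)}")
```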
 

morganL

Captain Morgan
Administrator, Moderator, iXsystems
Joined: Mar 10, 2018
Messages: 2,694
Could you disable one port at a time... to confirm it's not a hardware issue?
No signs of packet loss at the Ethernet level?
Switches are not reporting packet drops?

17% is close to one quarter of the 4-port bond...
If you can eliminate any hardware/cable issues, then please report a bug.
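
If it helps, a minimal Python sketch for the Ethernet-level check, reading the standard Linux sysfs counters; the interface names are placeholders, so substitute your actual bond members:

```python
# Minimal sketch: dump the kernel error/drop counters for each bond
# member from sysfs. The interface names are placeholders -- list the
# real ones with `ip link` and substitute them here.
from pathlib import Path

MEMBERS = ["enp1s0f0", "enp1s0f1", "enp2s0f0", "enp2s0f1"]  # placeholders
COUNTERS = ["rx_errors", "rx_dropped", "tx_errors", "tx_dropped"]

for iface in MEMBERS:
    stats_dir = Path(f"/sys/class/net/{iface}/statistics")
    counts = {name: int((stats_dir / name).read_text()) for name in COUNTERS}
    print(iface, counts)
```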
 

tjarosz

Cadet
Joined: Mar 30, 2023
Messages: 7
I had some time to check the server today. I saw in the network reporting that one port of the 4-port LAG had almost no activity on it: mostly TX and minimal RX, which is weird since the three other ports show a higher level of RX activity. I went back to the rack, re-seated a few of the SFP28 optics on both the switch and the NIC and, behold, all four ports are showing activity now. I'm sorry, I only noticed the dropped packets after the update, but it's possible the issue was present before. I'll keep an eye on it today in case this isn't the fix. Thanks for the push to review the hardware.
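
In case it's useful to anyone who finds this later, here's a minimal Python sketch of the kind of per-port check that surfaced this: it samples the RX/TX byte counters from sysfs for each bond member and prints the deltas, so an idle or TX-only member stands out. The interface names are placeholders; use your actual bond member names.

```python
# Minimal sketch: sample RX/TX byte counters for each bond member and
# print the deltas over a short interval, so a member with little or no
# RX traffic stands out. Interface names are placeholders.
import time
from pathlib import Path

MEMBERS = ["enp1s0f0", "enp1s0f1", "enp2s0f0", "enp2s0f1"]  # placeholders
INTERVAL = 10  # seconds between samples

def read_counter(iface: str, counter: str) -> int:
    return int(Path(f"/sys/class/net/{iface}/statistics/{counter}").read_text())

before = {i: (read_counter(i, "rx_bytes"), read_counter(i, "tx_bytes"))
          for i in MEMBERS}
time.sleep(INTERVAL)
for i in MEMBERS:
    rx = read_counter(i, "rx_bytes") - before[i][0]
    tx = read_counter(i, "tx_bytes") - before[i][1]
    print(f"{i}: rx {rx} bytes, tx {tx} bytes over {INTERVAL}s")
```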
 