Port Flapping under load from NFS Mount (TrueNAS-12.0-U2.1)

halopla1

Cadet
Joined
Mar 15, 2021
Messages
4
Hi,

I recently built a TrueNAS system purely to provide NFS storage to a Proxmox cluster.
During my testing, I noticed that when performing storage migrations from Ceph inside Proxmox over to the NFS mount (a write-intensive operation), TrueNAS starts to report port flapping. The interesting aspect is that I have separate physical uplinks to different storage networks that do not connect to each other, which is why I suspect the issue is system-specific rather than a problem on the switch side. The issue occurs on both the Intel 1GbE uplinks and the Mellanox 40GbE uplinks.


Code:
Mar 16 08:10:53 srvstor1 igb0: Interface stopped DISTRIBUTING, possible flapping
Mar 16 08:10:53 srvstor1 mlxen1: Interface stopped DISTRIBUTING, possible flapping
Mar 16 08:10:53 srvstor1 mlxen3: Interface stopped DISTRIBUTING, possible flapping
Mar 16 08:10:54 srvstor1 mlxen2: Interface stopped DISTRIBUTING, possible flapping
Mar 16 08:10:54 srvstor1 kernel: lagg1: link state changed to DOWN
Mar 16 08:10:54 srvstor1 kernel: vlan202: link state changed to DOWN
Mar 16 08:10:55 srvstor1 mlxen0: Interface stopped DISTRIBUTING, possible flapping
Mar 16 08:10:55 srvstor1 kernel: lagg2: link state changed to DOWN
Mar 16 08:10:55 srvstor1 kernel: vlan207: link state changed to DOWN
Mar 16 08:11:09 srvstor1 kernel: lagg2: link state changed to UP
Mar 16 08:11:09 srvstor1 kernel: lagg1: link state changed to UP
Mar 16 08:11:09 srvstor1 kernel: vlan207: link state changed to UP
Mar 16 08:11:09 srvstor1 kernel: vlan202: link state changed to UP
Mar 16 08:12:41 srvstor1 1 2021-03-16T08:12:41.884359+09:00 srvstor1.mujin.co.jp collectd 4837 - - Traceback (most recent call last):
  File "/usr/local/lib/collectd_pyplugins/disktemp.py", line 62, in read
    with Client() as c:
  File "/usr/local/lib/python3.8/site-packages/middlewared/client/client.py", line 281, in __init__
    self._ws.connect()
  File "/usr/local/lib/python3.8/site-packages/middlewared/client/client.py", line 124, in connect
    rv = super(WSClient, self).connect()
  File "/usr/local/lib/python3.8/site-packages/ws4py/client/__init__.py", line 223, in connect
    bytes = self.sock.recv(128)
socket.timeout: timed out
Mar 16 08:15:46 srvstor1 igb0: Interface stopped DISTRIBUTING, possible flapping
Mar 16 08:15:46 srvstor1 mlxen1: Interface stopped DISTRIBUTING, possible flapping
Mar 16 08:15:46 srvstor1 mlxen0: Interface stopped DISTRIBUTING, possible flapping
Mar 16 08:15:46 srvstor1 mlxen3: Interface stopped DISTRIBUTING, possible flapping
Mar 16 08:15:46 srvstor1 kernel: lagg2: link state changed to DOWN
Mar 16 08:15:46 srvstor1 mlxen2: Interface stopped DISTRIBUTING, possible flapping
Mar 16 08:15:46 srvstor1 kernel: vlan207: link state changed to DOWN
Mar 16 08:15:46 srvstor1 kernel: lagg1: link state changed to DOWN
Mar 16 08:15:46 srvstor1 kernel: vlan202: link state changed to DOWN
Mar 16 08:15:47 srvstor1 kernel: lagg1: link state changed to UP
Mar 16 08:15:47 srvstor1 kernel: vlan202: link state changed to UP
Mar 16 08:15:47 srvstor1 kernel: lagg2: link state changed to UP
Mar 16 08:15:47 srvstor1 kernel: vlan207: link state changed to UP
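
In case it helps, this is roughly how I have been watching the LACP member state while a migration runs (just a rough sketch; lagg1/lagg2 are the names from my config):

Code:
# Watch the laggport flags (ACTIVE/COLLECTING/DISTRIBUTING) on both laggs
# while the storage migration is running.
while true; do
    date
    ifconfig lagg1 | grep laggport
    ifconfig lagg2 | grep laggport
    sleep 1
done

# Check whether interface error/drop counters climb under load.
netstat -idn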


I see the Python core dump issues are already being tracked via:

The disktemp.py tracebacks appear at the same time as the flapping, during heavy writes.
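
To check whether the middleware itself becomes unresponsive during the heavy writes (which would explain the socket.timeout in disktemp.py), I am planning to time a trivial middleware call while a migration is running, something along these lines:

Code:
# If this also stalls during the migration, the collectd traceback is likely
# a symptom of the middleware hanging under load rather than a network issue.
time midclt call system.info > /dev/null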


System Specs:
Version: TrueNAS-12.0-U2.1
Chassis: Dell PowerEdge R730xd + Dell MD1400 JBOD
CPU: E5-2660 v3
Memory: 256 GB ECC DDR4
Controller:
- PERC H800 for JBOD
- PERC H710 for OS
- Part No. P31H2 for NVMe Read/Write Cache

Network:
- 2x Mellanox ConnectX-3 Pro Dual 40GbE (trunking with jumbo frames)
- 1x Intel LOM i350 + x510

Physical Topology:

[Attached image: NAS-topology.png]


I saw some previous threads describing similar cases on older versions of FreeNAS, but they are not quite the same as the issue I have observed. Has anyone else come across this, or can anyone offer some advice?

If there is any information I can collect to help narrow down the issue, I would be happy to provide it.

Thank you very much.
 

halopla1

Cadet
Joined
Mar 15, 2021
Messages
4
Thanks for the heads up. I will take this part into consideration:
The H800 is a direct pass-through card and I can see all the serial numbers of the devices with smartctl. Are there any concerns with the H800 as a SAS HBA?
The H710 is only used for a RAID 1 hardware mirror of the OS drives.
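
For reference, this is roughly how I confirmed the individual drives behind the H800 are visible (device paths are just examples from this box):

Code:
# Print identity/serial info for every da device exposed through the H800.
for d in /dev/da*; do
    echo "== $d =="
    smartctl -i "$d" | grep -iE 'serial|model|product'
done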
 