Issues with Mellanox Connect-X 3 cards

Knogle

Dabbler
Joined
Jan 25, 2014
Messages
28
Hey friends.
Recently I upgraded my home lab and installed Mellanox Connect-X 3 Dual 40Gbps QSFP cards in all of my systems.
My TrueNAS system runs on a dedicated machine and is connected to my virtualization server through 2x 40Gbps links with LACP enabled.
All virtual machines on the virtualization server run from iSCSI shares hosted on the TrueNAS machine over this network connection.
Some of my machines are mission critical.
Unfortunately, I am experiencing issues on the TrueNAS side. After 14 days of uptime, the NIC failed during the night.
My issue seems similar to this one: https://www.truenas.com/community/threads/melanox-connectx-3.73634/ but there is no solution there yet.

Are there any suggestions for mitigating these issues?

Thanks in advance!

Jan 11 00:13:07 truenas kernel: pid 13537 (httpd), jid 0, uid 0: exited on signal 11
Jan 11 00:37:45 truenas kernel: pid 13827 (httpd), jid 0, uid 0: exited on signal 11
Jan 11 01:39:13 truenas MCA: Bank 15, Status 0x9c2030000000011b
Jan 11 01:39:13 truenas MCA: Global Cap 0x0000000000000117, Status 0x0000000000000000
Jan 11 01:39:13 truenas MCA: Vendor "AuthenticAMD", ID 0x800f82, APIC ID 0
Jan 11 01:39:13 truenas MCA: CPU 0 COR GCACHE LG RD error
Jan 11 01:39:13 truenas MCA: Address 0x40000000a8d0a00
Jan 11 01:39:13 truenas MCA: Misc 0xd01b0fff01000000
Jan 11 01:39:19 truenas mlx4_core0: Internal error detected:
Jan 11 01:39:19 truenas mlx4_core0: buf[00]: 00180c40
Jan 11 01:39:19 truenas mlx4_core0: buf[01]: 00000000
Jan 11 01:39:19 truenas mlx4_core0: buf[02]: 202a1388
Jan 11 01:39:19 truenas mlx4_core0: buf[03]: 00000000
Jan 11 01:39:19 truenas mlx4_core0: buf[04]: 00180c40
Jan 11 01:39:19 truenas mlx4_core0: buf[05]: 0021c500
Jan 11 01:39:19 truenas mlx4_core0: buf[06]: 00000001
Jan 11 01:39:19 truenas mlx4_core0: buf[07]: 00200630
Jan 11 01:39:19 truenas mlx4_core0: buf[08]: 00000000
Jan 11 01:39:19 truenas mlx4_core0: buf[09]: 00000000
Jan 11 01:39:19 truenas mlx4_core0: buf[0a]: 000101f5
Jan 11 01:39:19 truenas mlx4_core0: buf[0b]: 00000043
Jan 11 01:39:19 truenas mlx4_core0: buf[0c]: 00000000
Jan 11 01:39:19 truenas mlx4_core0: buf[0d]: 00000000
Jan 11 01:39:19 truenas mlx4_core0: buf[0e]: 00000000
Jan 11 01:39:19 truenas mlx4_core0: buf[0f]: 00000000
Jan 11 01:39:19 truenas mlx4_core0: device is going to be reset
Jan 11 01:39:20 truenas mlx4_core0: device was reset successfully
Jan 11 01:39:20 truenas kernel: mlx4_en mlx4_core0: Internal error detected, restarting device
Jan 11 01:39:20 truenas kernel[1896]: Last message 'mlx4_en mlx4_core0: ' repeated 1 times, suppressed by syslog-ng on truenas.lan
Jan 11 01:39:20 truenas mlx4_core0: command 0x49 failed: fw status = 0x1
Jan 11 01:39:21 truenas kernel: lagg0: link state changed to DOWN
Jan 11 01:39:21 truenas kernel: mlx4_en: mlxen1: Failed activating Rx CQ
Jan 11 01:39:21 truenas kernel[1896]: Last message 'mlx4_en: mlxen1: Fai' repeated 1 times, suppressed by syslog-ng on truenas.lan
Jan 11 01:39:21 truenas kernel: mlxen1: link state changed to DOWN
Jan 11 01:39:23 truenas WARNING: 192.168.7.10 (iqn.2005-10.org.freenas.ctl): connection error; dropping connection
Jan 11 01:39:23 truenas WARNING: 192.168.7.10 (iqn.1993-08.org.debian:01:17166b15916c): connection error; dropping connection
Jan 11 01:39:23 truenas WARNING: 192.168.7.10 (iqn.2005-10.org.freenas.ctl): connection error; dropping connection
Jan 11 01:39:23 truenas WARNING: 192.168.7.10 (iqn.1993-08.org.debian:01:17166b15916c): connection error; dropping connection
Jan 11 01:39:28 truenas mlx4_core0: Unable to determine PCI device chain minimum BW
Jan 11 01:39:28 truenas kernel: mlx4_en mlx4_core0: Activating port:1
Jan 11 01:39:28 truenas kernel: mlxen0: Ethernet address: 24:be:05:c4:4a:21
Jan 11 01:39:28 truenas kernel: mlx4_en: mlx4_core0: Port 1: Using 16 TX rings
Jan 11 01:39:28 truenas kernel[1896]: Last message 'mlx4_en: mlx4_core0:' repeated 1 times, suppressed by syslog-ng on truenas.lan
Jan 11 01:39:28 truenas kernel: mlxen0: link state changed to DOWN
Jan 11 01:39:28 truenas kernel: mlx4_en: mlx4_core0: Port 1: Using 16 RX rings
Jan 11 01:39:28 truenas kernel[1896]: Last message 'mlx4_en: mlx4_core0:' repeated 1 times, suppressed by syslog-ng on truenas.lan
Jan 11 01:39:28 truenas kernel: mlx4_en: mlxen0: Using 16 TX rings
Jan 11 01:39:28 truenas kernel[1896]: Last message 'mlx4_en: mlxen0: Usi' repeated 1 times, suppressed by syslog-ng on truenas.lan
Jan 11 01:39:28 truenas kernel: mlx4_en: mlxen0: Using 16 RX rings
Jan 11 01:39:28 truenas kernel[1896]: Last message 'mlx4_en: mlxen0: Usi' repeated 1 times, suppressed by syslog-ng on truenas.lan
Jan 11 01:39:28 truenas kernel: mlx4_en: mlxen0: Initializing port
Jan 11 01:39:28 truenas kernel[1896]: Last message 'mlx4_en: mlxen0: Ini' repeated 1 times, suppressed by syslog-ng on truenas.lan
Jan 11 01:39:28 truenas kernel: mlx4_en mlx4_core0: Activating port:2
Jan 11 01:39:28 truenas kernel: mlxen1: Ethernet address: 24:be:05:c4:4a:22
Jan 11 01:39:28 truenas kernel: mlx4_en: mlx4_core0: Port 2: Using 16 TX rings
Jan 11 01:39:28 truenas kernel[1896]: Last message 'mlx4_en: mlx4_core0:' repeated 1 times, suppressed by syslog-ng on truenas.lan
Jan 11 01:39:28 truenas kernel: mlxen1: link state changed to DOWN
Jan 11 01:39:28 truenas kernel: mlx4_en: mlx4_core0: Port 2: Using 16 RX rings
Jan 11 01:39:28 truenas kernel[1896]: Last message 'mlx4_en: mlx4_core0:' repeated 1 times, suppressed by syslog-ng on truenas.lan
Jan 11 01:39:28 truenas kernel: mlx4_en: mlxen1: Using 16 TX rings
Jan 11 01:39:28 truenas kernel[1896]: Last message 'mlx4_en: mlxen1: Usi' repeated 1 times, suppressed by syslog-ng on truenas.lan
Jan 11 01:39:28 truenas kernel: mlx4_en: mlxen1: Using 16 RX rings
Jan 11 01:39:28 truenas kernel[1896]: Last message 'mlx4_en: mlxen1: Usi' repeated 1 times, suppressed by syslog-ng on truenas.lan
Jan 11 01:39:28 truenas kernel: mlx4_en: mlxen1: Initializing port
Jan 11 01:39:28 truenas kernel[1896]: Last message 'mlx4_en: mlxen1: Ini' repeated 1 times, suppressed by syslog-ng on truenas.lan
Jan 11 01:39:28 truenas mlx4_core0: mlx4_restart_one was ended, ret=0
Jan 11 01:39:31 truenas kernel: mlx4_en: mlxen0: Link Up
Jan 11 01:39:31 truenas kernel: mlxen0: link state changed to UP
Jan 11 01:39:31 truenas kernel: mlx4_en: mlxen1: Link Up
Jan 11 01:39:31 truenas kernel: mlxen1: link state changed to UP
Jan 11 03:45:42 truenas kernel: pid 14383 (httpd), jid 0, uid 0: exited on signal 11
Jan 11 04:32:00 truenas kernel: pid 17312 (httpd), jid 0, uid 0: exited on signal 11
Jan 11 04:32:09 truenas kernel: pid 17977 (httpd), jid 0, uid 0: exited on signal 11
Jan 11 04:32:13 truenas kernel: pid 17979 (httpd), jid 0, uid 0: exited on signal 11
Jan 11 07:18:41 truenas kernel: pid 17981 (httpd), jid 0, uid 0: exited on signal 11
Jan 11 08:14:53 truenas kernel: pid 17982 (httpd), jid 0, uid 0: exited on signal 11
Jan 11 08:20:42 truenas kernel: pid 21198 (httpd), jid 0, uid 0: exited on signal 11
Jan 11 08:23:05 truenas kernel: pid 21326 (httpd), jid 0, uid 0: exited on signal 11
Jan 11 08:23:14 truenas kernel: pid 21290 (httpd), jid 0, uid 0: exited on signal 11
Jan 11 10:44:36 truenas kernel: mlx4_en: mlxen0: Link Down
Jan 11 10:44:36 truenas kernel: mlx4_en: mlxen1: Link Down
Jan 11 10:44:36 truenas kernel: mlxen0: link state changed to DOWN
Jan 11 10:44:36 truenas kernel: mlxen1: link state changed to DOWN
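For anyone trying to read a log like this, a rough sketch of how the interesting events could be pulled out into a timeline is below. It is only an illustration, not a diagnostic tool: the function names, the `PATTERNS` table and the placeholder `year` argument are made up for this example (the log lines carry no year), and the sample lines are copied from the log above. It assumes one syslog entry per line. Note that it only measures how close in time the MCA report and the mlx4 internal error were; it cannot tell whether one caused the other.

```python
import re
from datetime import datetime

# Message patterns of interest, based on the messages seen in the log above.
PATTERNS = {
    "mca": re.compile(r"\bMCA: CPU \d+ "),
    "nic_error": re.compile(r"mlx4_core\d+: Internal error detected"),
    "nic_reset": re.compile(r"mlx4_core\d+: device was reset successfully"),
    "link": re.compile(r"(\w+): link state changed to (UP|DOWN)"),
}

# "Mon DD HH:MM:SS host message" as in the entries above.
LINE_RE = re.compile(r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) (\S+) (.*)$")


def parse_events(lines, year=2000):
    """Return (timestamp, kind, message) tuples for matching log lines.

    The year is a placeholder (assumption): syslog lines here have none.
    """
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        stamp, _host, msg = m.groups()
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        for kind, pattern in PATTERNS.items():
            if pattern.search(msg):
                events.append((when, kind, msg))
                break
    return events


def seconds_from_mca_to_nic_error(events):
    """Seconds between an MCA report and the next NIC internal error, if any."""
    last_mca = None
    for when, kind, _msg in events:
        if kind == "mca":
            last_mca = when
        elif kind == "nic_error" and last_mca is not None:
            return (when - last_mca).total_seconds()
    return None


SAMPLE_LOG = """\
Jan 11 01:39:13 truenas MCA: CPU 0 COR GCACHE LG RD error
Jan 11 01:39:19 truenas mlx4_core0: Internal error detected:
Jan 11 01:39:20 truenas mlx4_core0: device was reset successfully
Jan 11 01:39:21 truenas kernel: lagg0: link state changed to DOWN
Jan 11 01:39:31 truenas kernel: mlxen0: link state changed to UP
Jan 11 10:44:36 truenas kernel: mlxen0: link state changed to DOWN
"""

events = parse_events(SAMPLE_LOG.splitlines())
for when, kind, msg in events:
    print(when.strftime("%b %d %H:%M:%S"), f"{kind:<10}", msg)
print("MCA -> NIC error gap (s):", seconds_from_mca_to_nic_error(events))
```

To run it against a real log, something like `parse_events(open(path))` should work, where `path` is wherever your system keeps its messages file.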
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Mellanox Connect-X 3 Dual 40Gbps QSFP cards

The results with these cards are somewhat varied. There may be driver issues.

It's suggested to use the Intel or Chelsio cards. It's basically been that way for many years. Some discussion of the topic is in the 10 Gig Networking Primer; this is still relevant to you with 40G, and you should probably pay careful attention to the recommendations there. Both Intel and Chelsio have authored their own drivers, and iXsystems sells or has sold the Chelsio cards as its preferred card in the TrueNAS hardware product.
 

Jessep

Patron
Joined
Aug 19, 2018
Messages
379
Did you update NIC firmware?
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
I picked up a Connect-X3 EN thinking it would simply be an updated version of the -X2 and the drivers would be well tested, etc... Couldn't get it to play in ESXi at all; it needs a firmware update. The Mellanox web site likes to select the firmware bundle based on the OS you're browsing with, not the one you intend to use. At which point I contemplated the amount of coffee required vs my sanity early on a Saturday, so the card got moved and the other system was restored to service. It drops into Windows 10 just fine. One of these days when I feel like tearing apart half my home lab it might get moved to the intended machine. YMMV.
 

Knogle

Dabbler
Joined
Jan 25, 2014
Messages
28
Hey, thanks a lot to all of you guys. I've already tried upgrading the firmware, but the issue persists.
So I have moved back to 10Gbps Intel gear with X520 SFP+ cards, and it has now been running fine for almost 3 weeks without any outage. I am also now able to achieve constant, continuous throughput across the entire network.
Thanks a lot to all of you.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Well, it's an unhappy resolution, but not unexpected. Sorry the Mellanox didn't work out.
 