Dropped connections with Chelsio T580-SO-CR

Status
Not open for further replies.

friolator

Explorer
Joined
Jun 9, 2016
Messages
80
We've got a 40GbE Chelsio T580-SO-CR card installed in our FreeNAS box, and have just begun serious testing with it today. It runs for less than a minute before the connection drops. Here's what the 40Gb switch's log says:

Sep 22 14:49:16 192.168.1.209 INFO web server: System log cleared by user admin.
Sep 22 14:49:37 192.168.1.209 ALERT stg: STG 1, topology change detected
Sep 22 14:49:47 192.168.1.209 NOTICE link: link down on port 61
Sep 22 14:50:03 192.168.1.209 NOTICE link: link up on port 61
Sep 22 14:50:33 192.168.1.209 ALERT stg: STG 1, topology change detected
Sep 22 14:51:33 192.168.1.209 NOTICE link: link down on port 61
Sep 22 14:51:49 192.168.1.209 NOTICE link: link up on port 61
Sep 22 14:52:19 192.168.1.209 ALERT stg: STG 1, topology change detected
Sep 22 14:59:28 192.168.1.209 NOTICE link: link down on port 61
Sep 22 14:59:29 192.168.1.209 WARNING system: 1m QDAC removed at port 61

[At this point, I moved the cable to port 49 to see if the problem was with the port itself]

Sep 22 14:59:37 192.168.1.209 NOTICE system: 1m QDAC inserted at port 49 is Accepted
Sep 22 14:59:52 192.168.1.209 NOTICE link: link up on port 49
Sep 22 15:00:22 192.168.1.209 ALERT stg: STG 1, topology change detected
Sep 22 15:16:04 192.168.1.209 NOTICE link: link down on port 49
Sep 22 15:16:21 192.168.1.209 NOTICE link: link up on port 49
Sep 22 15:49:30 192.168.1.209 ALERT stg: STG 1, topology change detected
Sep 22 16:01:08 192.168.1.209 NOTICE link: link down on port 49

[Here I decided to test both another port *and* another cable, just to rule that out.]

Sep 22 16:01:20 192.168.1.209 WARNING system: 1m QDAC removed at port 49
Sep 22 16:01:31 192.168.1.209 NOTICE system: 2m QDAC inserted at port 9 is Accepted
Sep 22 16:01:46 192.168.1.209 NOTICE link: link up on port 9
Sep 22 16:02:16 192.168.1.209 ALERT stg: STG 1, topology change detected
Sep 22 16:02:39 192.168.1.209 NOTICE link: link down on port 9
Sep 22 16:02:55 192.168.1.209 NOTICE link: link up on port 9
Sep 22 16:03:24 192.168.1.209 ALERT stg: STG 1, topology change detected

The dropped connections are only happening when actual data is flowing through the switch. We're not seeing this on the Windows workstation (Mellanox ConnectX3-VPI card) that we're doing the tests with. Our tests involve reading/writing test files on a shared volume on the FreeNAS box, so both the Windows and FreeNAS boxes are handling the same amount of data on the network.

If the switch is left alone for a while, you don't get any warnings. But every time you try to move data, you only get about 1 minute (or usually less) before the connection craps out.

Any suggestions on where we should start with this?
 

Mlovelace

Guru
Joined
Aug 19, 2014
Messages
1,111
friolator said:
Any suggestions on where we should start with this?
What are you seeing on the FreeNAS side of the connection? What you posted looks like the output from the switch.
 

friolator

Explorer
Joined
Jun 9, 2016
Messages
80
That is the switch log output.

I can't find similar logs on the FreeNAS side. Where should I be looking?
 

friolator

Explorer
Joined
Jun 9, 2016
Messages
80
I'm seeing a lot of:

Sep 22 08:46:07 freenas kernel: cxl0: link state changed to UP
Sep 22 08:46:07 freenas kernel: cxl0: link state changed to UP
Sep 22 08:53:45 freenas kernel: cxl0: link state changed to DOWN
Sep 22 08:53:45 freenas kernel: cxl0: link state changed to DOWN

In /var/log/messages -- that doesn't really tell me much more than the switch log.

Again, I think I've ruled out the switch, since the other machine on it is fine, and we've tried different ports and different cables.
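For what it's worth, a throwaway script along these lines (a minimal sketch, assuming the exact "link state changed" wording FreeBSD logs above) can tally how often the link is flapping over a longer stretch of /var/log/messages:

```python
import re

# Count "link state changed" events for an interface in /var/log/messages-style
# text. "cxl0" is the Chelsio interface name from the log above.
def count_link_flaps(log_text, ifname="cxl0"):
    pattern = re.compile(rf"{re.escape(ifname)}: link state changed to (UP|DOWN)")
    events = [m.group(1) for m in pattern.finditer(log_text)]
    return {"up": events.count("UP"), "down": events.count("DOWN")}

sample = """\
Sep 22 08:46:07 freenas kernel: cxl0: link state changed to UP
Sep 22 08:46:07 freenas kernel: cxl0: link state changed to UP
Sep 22 08:53:45 freenas kernel: cxl0: link state changed to DOWN
Sep 22 08:53:45 freenas kernel: cxl0: link state changed to DOWN
"""
print(count_link_flaps(sample))  # {'up': 2, 'down': 2}
```

Correlating the flap timestamps against periods of actual data transfer would confirm the load-triggered pattern described above.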
 

friolator

Explorer
Joined
Jun 9, 2016
Messages
80
And a quick update. We're doing some performance testing right now to see what our read/write speeds are like. As part of that, I'm trying out different layouts for the volumes. After deleting the last test (a 4-drive stripe) and making an 8-drive stripe, we don't seem to be having this problem. I've been running the disk testing tool we use on a continuous loop for about 15 minutes now. Nothing I changed was related to the underlying network settings, though I did create a new SMB/CIFS share.

UPDATE to the update: It dropped the connection again. Here's the full log output from the time it dropped the connection. The previous log entry was 30 minutes earlier.

Sep 22 10:59:04 freenas kernel: cxl0: link state changed to DOWN
Sep 22 10:59:04 freenas kernel: cxl0: link state changed to DOWN
Sep 22 10:59:05 freenas mDNSResponder: mDNSPlatformSendUDP got error 50 (Network is down) sending packet to 224.0.0.251 on interface 10.0.0.2/cxl0/1
Sep 22 10:59:05 freenas mDNSResponder: mDNSPlatformSendUDP got error 50 (Network is down) sending packet to 224.0.0.251 on interface 10.0.0.2/cxl0/1
Sep 22 10:59:05 freenas mDNSResponder: mDNSPlatformSendUDP got error 50 (Network is down) sending packet to 224.0.0.251 on interface 10.0.0.2/cxl0/1
Sep 22 10:59:05 freenas mDNSResponder: mDNSPlatformSendUDP got error 50 (Network is down) sending packet to 224.0.0.251 on interface 10.0.0.2/cxl0/1
Sep 22 10:59:06 freenas mDNSResponder: mDNSPlatformSendUDP got error 50 (Network is down) sending packet to 224.0.0.251 on interface 10.0.0.2/cxl0/1
Sep 22 10:59:08 freenas mDNSResponder: mDNSPlatformSendUDP got error 50 (Network is down) sending packet to 224.0.0.251 on interface 10.0.0.2/cxl0/1
Sep 22 10:59:13 freenas mDNSResponder: mDNSPlatformSendUDP got error 50 (Network is down) sending packet to 224.0.0.251 on interface 10.0.0.2/cxl0/1
Sep 22 10:59:20 freenas devd: Executing '/etc/rc.d/dhclient quietstart cxl0'
Sep 22 10:59:20 freenas kernel: cxl0: link state changed to UP
Sep 22 10:59:20 freenas kernel: cxl0: link state changed to UP
 

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874
Does the same issue happen when doing a long iperf test? I'm wondering about heat, power and maybe a tunable?
 

friolator

Explorer
Joined
Jun 9, 2016
Messages
80
Does the same issue happen when doing a long iperf test? I'm wondering about heat, power and maybe a tunable?

Haven't tried a long iperf test yet. I'll set that up now. I'm not running any tunables. It's a pretty vanilla setup.

I'd be really surprised if heat is an issue here. CPU usage hasn't gone above 20% during these tests, with the exception of one very brief spike to 40%. This is rack mounted in an air conditioned server room, and the box it's in isn't blowing out air any hotter than the surrounding machines.
 

JustinClift

Patron
Joined
Apr 24, 2016
Messages
287
Hmmm, if you swap one of the Mellanox cards into the FreeNAS box (instead of the Chelsio card) do connections keep dropping?
 

friolator

Explorer
Joined
Jun 9, 2016
Messages
80
I could try that, but I'd prefer to avoid it if I can. I'm alone in the office this afternoon, and that case is heavier than I'd like to be pulling out by myself.

My current plan is to run a long iperf test and, depending on how that goes, maybe switch to the other interface on the NIC.
 

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874
I ran into an issue on a Marvell NIC that needed a tunable when running a sustained load in FreeBSD; otherwise it would just stop working.

As for the heat, I was wondering about the NIC itself. Does the heatsink look OK? The airflow sounds fine.
 

friolator

Explorer
Joined
Jun 9, 2016
Messages
80
Been running iperf for about 8 minutes with no issues. I'm letting it go for another 20 and will report back here.

In terms of the tunables, that's a bit over my head. Also, since those are based on 10GbE and I'm running a 40Gb network, wouldn't the settings be different?
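For reference, the 10GbE threads mostly mean a handful of FreeBSD TCP buffer sysctls. A commonly cited starting point looks something like the fragment below (added as sysctl-type Tunables in the FreeNAS UI). The exact values vary from guide to guide and are assumptions to test here, not known-good settings for this hardware:

```
# Common FreeBSD TCP buffer tunables cited in FreeNAS 10GbE tuning threads.
# Values are illustrative starting points, not verified for this setup.
kern.ipc.maxsockbuf=16777216
net.inet.tcp.recvbuf_max=16777216
net.inet.tcp.sendbuf_max=16777216
net.inet.tcp.recvspace=4194304
net.inet.tcp.sendspace=2097152
```

At 40GbE you would likely scale the buffer ceilings up further to cover the larger bandwidth-delay product, which is exactly the question being asked.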
 

JustinClift

Patron
Joined
Apr 24, 2016
Messages
287
Yeah, but they're probably a good place to start, i.e. closer to what you want than the defaults.

Note - I'm not super FreeNAS experienced either. Just been mucking around with things, and have managed to figure enough bits out to make things work for me.

Literally anyone else's FreeNAS advice here is probably better than mine. :D
 

friolator

Explorer
Joined
Jun 9, 2016
Messages
80
Looks like iperf crapped out for a bit then recovered.

[root@freenas] ~# iperf -c 10.0.0.4 -i 10 -t 1200
------------------------------------------------------------
Client connecting to 10.0.0.4, TCP port 5001
TCP window size: 32.5 KByte (default)
------------------------------------------------------------
[ 3] local 10.0.0.2 port 61581 connected with 10.0.0.4 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 16.2 GBytes 13.9 Gbits/sec
[ 3] 10.0-20.0 sec 16.1 GBytes 13.8 Gbits/sec
[ 3] 20.0-30.0 sec 16.1 GBytes 13.8 Gbits/sec
[ 3] 30.0-40.0 sec 16.4 GBytes 14.1 Gbits/sec
[ 3] 40.0-50.0 sec 16.4 GBytes 14.1 Gbits/sec
[ 3] 50.0-60.0 sec 16.5 GBytes 14.1 Gbits/sec
[ 3] 60.0-70.0 sec 16.1 GBytes 13.8 Gbits/sec
[ 3] 70.0-80.0 sec 16.1 GBytes 13.8 Gbits/sec
[ 3] 80.0-90.0 sec 15.8 GBytes 13.5 Gbits/sec
[ 3] 90.0-100.0 sec 16.1 GBytes 13.8 Gbits/sec
[ 3] 100.0-110.0 sec 16.5 GBytes 14.1 Gbits/sec
[ 3] 110.0-120.0 sec 15.7 GBytes 13.5 Gbits/sec
[ 3] 120.0-130.0 sec 16.4 GBytes 14.1 Gbits/sec
[ 3] 130.0-140.0 sec 16.4 GBytes 14.1 Gbits/sec
[ 3] 140.0-150.0 sec 16.4 GBytes 14.1 Gbits/sec
[ 3] 150.0-160.0 sec 16.1 GBytes 13.8 Gbits/sec
[ 3] 160.0-170.0 sec 16.1 GBytes 13.8 Gbits/sec
[ 3] 170.0-180.0 sec 16.1 GBytes 13.8 Gbits/sec
[ 3] 180.0-190.0 sec 16.4 GBytes 14.1 Gbits/sec
[ 3] 190.0-200.0 sec 16.4 GBytes 14.1 Gbits/sec
[ 3] 200.0-210.0 sec 16.1 GBytes 13.8 Gbits/sec
[ 3] 210.0-220.0 sec 16.4 GBytes 14.1 Gbits/sec
[ 3] 220.0-230.0 sec 16.1 GBytes 13.8 Gbits/sec
[ 3] 230.0-240.0 sec 16.1 GBytes 13.8 Gbits/sec
[ 3] 240.0-250.0 sec 16.2 GBytes 13.9 Gbits/sec
[ 3] 250.0-260.0 sec 16.9 GBytes 14.5 Gbits/sec
[ 3] 260.0-270.0 sec 16.3 GBytes 14.0 Gbits/sec
[ 3] 270.0-280.0 sec 16.4 GBytes 14.1 Gbits/sec
[ 3] 280.0-290.0 sec 16.4 GBytes 14.1 Gbits/sec
[ 3] 290.0-300.0 sec 16.5 GBytes 14.1 Gbits/sec
[ 3] 300.0-310.0 sec 15.8 GBytes 13.6 Gbits/sec
[ 3] 310.0-320.0 sec 15.4 GBytes 13.2 Gbits/sec
[ 3] 320.0-330.0 sec 16.4 GBytes 14.1 Gbits/sec
[ 3] 330.0-340.0 sec 16.4 GBytes 14.1 Gbits/sec
[ 3] 340.0-350.0 sec 16.4 GBytes 14.1 Gbits/sec
[ 3] 350.0-360.0 sec 16.1 GBytes 13.8 Gbits/sec
[ 3] 360.0-370.0 sec 16.4 GBytes 14.1 Gbits/sec
[ 3] 370.0-380.0 sec 16.4 GBytes 14.1 Gbits/sec
[ 3] 380.0-390.0 sec 15.7 GBytes 13.5 Gbits/sec
[ 3] 390.0-400.0 sec 16.4 GBytes 14.1 Gbits/sec
[ 3] 400.0-410.0 sec 16.4 GBytes 14.1 Gbits/sec
[ 3] 410.0-420.0 sec 16.4 GBytes 14.1 Gbits/sec
[ 3] 420.0-430.0 sec 16.4 GBytes 14.1 Gbits/sec
[ 3] 430.0-440.0 sec 16.4 GBytes 14.1 Gbits/sec
[ 3] 440.0-450.0 sec 16.4 GBytes 14.1 Gbits/sec
[ 3] 450.0-460.0 sec 15.7 GBytes 13.5 Gbits/sec
[ 3] 460.0-470.0 sec 16.5 GBytes 14.1 Gbits/sec
[ 3] 470.0-480.0 sec 16.1 GBytes 13.8 Gbits/sec
[ 3] 480.0-490.0 sec 15.7 GBytes 13.5 Gbits/sec
[ 3] 490.0-500.0 sec 16.5 GBytes 14.1 Gbits/sec
[ 3] 500.0-510.0 sec 16.0 GBytes 13.8 Gbits/sec
[ 3] 510.0-520.0 sec 16.4 GBytes 14.1 Gbits/sec
[ 3] 520.0-530.0 sec 16.4 GBytes 14.1 Gbits/sec
[ 3] 530.0-540.0 sec 16.0 GBytes 13.8 Gbits/sec
[ 3] 540.0-550.0 sec 16.4 GBytes 14.1 Gbits/sec
[ 3] 550.0-560.0 sec 16.1 GBytes 13.8 Gbits/sec
[ 3] 560.0-570.0 sec 16.4 GBytes 14.1 Gbits/sec
[ 3] 570.0-580.0 sec 16.4 GBytes 14.1 Gbits/sec
[ 3] 580.0-590.0 sec 13.3 GBytes 11.4 Gbits/sec
[ 3] 590.0-600.0 sec 0.00 Bytes 0.00 bits/sec
[ 3] 600.0-610.0 sec 0.00 Bytes 0.00 bits/sec
[ 3] 610.0-620.0 sec 0.00 Bytes 0.00 bits/sec
[ 3] 620.0-630.0 sec 0.00 Bytes 0.00 bits/sec
[ 3] 630.0-640.0 sec 4.69 GBytes 4.03 Gbits/sec
[ 3] 640.0-650.0 sec 15.7 GBytes 13.5 Gbits/sec
[ 3] 650.0-660.0 sec 15.7 GBytes 13.5 Gbits/sec
[ 3] 660.0-670.0 sec 16.0 GBytes 13.8 Gbits/sec
[ 3] 670.0-680.0 sec 16.4 GBytes 14.1 Gbits/sec
[ 3] 680.0-690.0 sec 16.4 GBytes 14.1 Gbits/sec
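The stall is easy to pick out programmatically too. Here's a small sketch that pulls the zero-throughput windows out of iperf's per-interval output (assuming the classic iperf2 line format shown above):

```python
import re

# Scan iperf's per-interval (-i) output and flag windows where throughput
# dropped to zero, i.e. the stretch where the link went down.
def stalled_intervals(iperf_output):
    interval = re.compile(
        r"([\d.]+)-([\d.]+)\s+sec\s+[\d.]+\s+\w*Bytes\s+([\d.]+)\s+\w*bits/sec"
    )
    stalls = []
    for line in iperf_output.splitlines():
        m = interval.search(line)
        if m and float(m.group(3)) == 0.0:
            stalls.append((float(m.group(1)), float(m.group(2))))
    return stalls

sample = """\
[ 3] 580.0-590.0 sec  13.3 GBytes  11.4 Gbits/sec
[ 3] 590.0-600.0 sec  0.00 Bytes  0.00 bits/sec
[ 3] 600.0-610.0 sec  0.00 Bytes  0.00 bits/sec
[ 3] 630.0-640.0 sec  4.69 GBytes  4.03 Gbits/sec
"""
print(stalled_intervals(sample))  # [(590.0, 600.0), (600.0, 610.0)]
```

Run against the full log above, this would report the roughly 40-second dead window between the 590- and 630-second marks, which lines up with the link-down/link-up pair in the switch log.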
 

friolator

Explorer
Joined
Jun 9, 2016
Messages
80
Of course, I probably will have to do some tuning here, because these speeds are only about a third of line rate, well below what I'd expect. Not that I expect the disk performance to keep up with the full 40GbE bandwidth, but it doesn't seem right that it's this low.

Also, I added in all those tunables from the thread linked to above, but there's no improvement in performance. Running another long iperf test to see if they helped with the dropped connections though.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Maybe the card and switch don't like each other over direct-attach? That seems to be a surprisingly frequent thing.
 