bad cksum packets

Code:
17:41:46.043811 IP (tos 0x0, ttl 64, id 22891, offset 0, flags [DF], proto TCP (6), length 40, bad cksum 0 (->cae7)!)
	10.10.0.5.445 > 10.10.2.101.49925: Flags [.], cksum 0x1698 (incorrect -> 0xd8a4), seq 1435, ack 1882, win 2052, length 0
17:41:46.044102 IP (tos 0x2,ECT(0), ttl 64, id 22895, offset 0, flags [DF], proto TCP (6), length 168, bad cksum 0 (->ca61)!)
	10.10.0.5.445 > 10.10.2.101.49925: Flags [P.], cksum 0x1718 (incorrect -> 0x1fbc), seq 1435:1563, ack 1882, win 2053, length 128
SMB-over-TCP packet:(raw data or continuation?)

17:41:46.044152 IP (tos 0x0, ttl 64, id 22897, offset 0, flags [DF], proto TCP (6), length 88, bad cksum 0 (->cab1)!)
	10.10.0.5.3260 > 10.10.2.101.49924: Flags [P.], cksum 0x16c8 (incorrect -> 0x3472), seq 3312:3360, ack 294129, win 65535, length 48
17:41:46.044575 IP (tos 0x0, ttl 128, id 8990, offset 0, flags [DF], proto TCP (6), length 1500)
	10.10.2.101.49924 > 10.10.0.5.3260: Flags [.], cksum 0xe49f (correct), seq 294129:295589, ack 3360, win 63664, length 1460


So I started having some strange performance issues with the pool. Running tests, everything looks fine, but I noticed these bad-checksum packets while checking traffic with tcpdump. Looking around, I found that I should disable TCP checksum offloading, and since I have a 2x10G trunk, I did it on lagg0 like this:

ifconfig lagg0 -rxcsum -txcsum

aaand the server crashed instantly. So my question is: should I do this on ix0 and ix1 instead, and how can I check whether offloading is actually enabled? There is no ethtool on BSD.
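
For reference, here is roughly how the offload flags can be checked and set per interface on FreeBSD instead of on the lagg. This is only a sketch, and it assumes ix0/ix1 are the lagg members as above:

Code:
# Show what is currently enabled (look for RXCSUM/TXCSUM in the options line)
ifconfig ix0 | grep -i options
ifconfig ix1 | grep -i options

# Show everything the NIC is capable of, not just what is enabled
ifconfig -m ix0

# Disable checksum offload on the members rather than on lagg0
# (untried here, since the lagg0 attempt crashed the box)
ifconfig ix0 -rxcsum -txcsum
ifconfig ix1 -rxcsum -txcsum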
 

Ericloewe (Server Wrangler, Moderator)
Can you explain your train of thought a bit better? It seems like a weird conclusion that TCP checksum offloading is causing the problem.
 
Well, it was an assumption :) A wrong one, indeed.
Disabling TCP checksum offloading removed the errors in tcpdump, but did not help with the issue.
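
Worth noting: with TX checksum offload enabled, tcpdump sees outgoing packets before the NIC has filled in the checksum, so "bad cksum" on locally generated traffic is usually cosmetic. As an alternative to disabling offload, tcpdump can simply be told not to verify checksums; the interface and host filter below are only examples:

Code:
# -K / --dont-verify-checksums: skip checksum verification, useful when
# the NIC computes checksums in hardware after the capture point
tcpdump -K -v -n -i lagg0 host 10.10.2.101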
 
zpool status is OK:
Code:
  pool: ssdstorage
 state: ONLINE
  scan: scrub repaired 0 in 0h14m with 0 errors on Sun Apr 30 00:14:17 2017
config:

        NAME                                            STATE     READ WRITE CKSUM
        ssdstorage                                      ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/4072ba54-fb57-11e6-9ee9-001b213f34ff  ONLINE       0     0     0
            gptid/40a8e379-fb57-11e6-9ee9-001b213f34ff  ONLINE       0     0     0
            gptid/40e21ef1-fb57-11e6-9ee9-001b213f34ff  ONLINE       0     0     0
            gptid/411a79ae-fb57-11e6-9ee9-001b213f34ff  ONLINE       0     0     0
            gptid/4155c3c3-fb57-11e6-9ee9-001b213f34ff  ONLINE       0     0     0
            gptid/4190b674-fb57-11e6-9ee9-001b213f34ff  ONLINE       0     0     0
            gptid/41ca40c3-fb57-11e6-9ee9-001b213f34ff  ONLINE       0     0     0
            gptid/4203d003-fb57-11e6-9ee9-001b213f34ff  ONLINE       0     0     0



By bad I mean low ops/s and high busy time:

Code:
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    1    355      0      0    0.0    120    586    0.1   32.9| da0
    1    373      0      0    0.0    120    567    0.1   33.2| da1
    0    355      0      0    0.0    121    590    0.1   32.5| da2
    0    380      0      0    0.0    121    567    0.1   33.2| da3
    0    342      0      0    0.0    115    567    0.1   31.0| da4
    0    354      0      0    0.0    115    551    0.1   32.9| da5
    1    349      0      0    0.0    114    567    0.1   32.6| da6
    1    384      0      0    0.0    114    539    0.1   31.0| da7


I can't get you data from when the disks are at 95% busy, but the queue is 5-10 and ops/s are around 1500.
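
If it would help, something like the following could capture the numbers the next time the disks peak; this is just a sketch, and the log path and interval are arbitrary:

Code:
# One non-interactive snapshot of the physical disks (batch mode)
gstat -p -b

# Or log a timestamped snapshot every second while the problem is happening
while true; do date; gstat -p -b; sleep 1; done >> /tmp/gstat.log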

These disks can handle a lot more. Tests done:

Code:
dd if=/dev/zero of=/mnt/ssdstorage/test/temp.dat bs=2048k count=100k
214748364800 bytes transferred in 116.567978 secs (1842258643 bytes/sec)

dd if=/mnt/ssdstorage/test/temp.dat of=/dev/null bs=2048k count=100k
214748364800 bytes transferred in 93.723714 secs (2291291669 bytes/sec)
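
One caveat on these dd numbers: with compression enabled on the dataset (lz4 is the FreeNAS default), a stream of zeroes compresses to almost nothing, so the throughput above can overstate what the disks really deliver. A rough way to retest with incompressible data, with paths and sizes as examples only and keeping in mind that the read may be served from ARC:

Code:
# Build a ~20 GiB file of incompressible data once (slow, CPU-bound)
dd if=/dev/random of=/mnt/ssdstorage/test/rand.dat bs=2048k count=10k

# Reuse it for read tests
dd if=/mnt/ssdstorage/test/rand.dat of=/dev/null bs=2048k count=10k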


Something started plaguing the pool, and the disks max out their busy % very easily. We are looking into it and have almost identified the cause, but a few things need to be checked before we are sure. It is probably unrelated to FreeNAS.
 