SOLVED iSCSI drops connection

Status
Not open for further replies.

massmux

Dabbler
Joined
Mar 11, 2018
Messages
41
Hello,
consider two FreeNAS servers sharing iSCSI zvol targets to some XenServers. In this environment everything seems smooth. But when I add another FreeNAS server and start using new iSCSI shares with other XenServer machines, I get the following errors on one of the existing FreeNAS servers, causing the connection to drop randomly and making the shares unreachable for minutes:

Code:
WARNING: 1.2.3.4 (iqn.2005-11.ch.mydom.ctl:97665f7s): no ping reply (NOP-Out) after 5 seconds; dropping connection
WARNING: 1.2.3.4 (iqn.2005-11.ch.mydom.ctl:97665f7s): no ping reply (NOP-Out) after 5 seconds; dropping connection
WARNING: 1.2.3.4 (iqn.2005-11.ch.mydom.ctl:97665f7s): no ping reply (NOP-Out) after 5 seconds; dropping connection

Does anyone have suggestions?
 
Last edited:

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
Triple check your IQNs, MACs, and IPs to make sure there are no duplicates, and if you have multiple interfaces connected to the same L2 network, make sure IP forwarding is disabled...

Does it come back up if you disable iSCSI on the third SAN? What errors are you getting on the xen hosts?

Thinking a bit more: if it were an STP, MAC, or IP issue, you would see problems with other services on those ports/addresses too.
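
A minimal way to spot-check those items from the CLI (a sketch; assumes stock FreeBSD sysctls on the FreeNAS side and the xe CLI on XenServer, with <host-uuid> as a placeholder):

Code:
# On each FreeNAS box: confirm IP forwarding is disabled (0 = off)
sysctl net.inet.ip.forwarding

# On each XenServer host: dump the initiator IQN so you can compare for duplicates
xe host-list params=uuid,name-label
xe host-param-get uuid=<host-uuid> param-name=iscsi_iqn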
 

massmux

Dabbler
Joined
Mar 11, 2018
Messages
41
Hello kdragon, thank you for the reply. Please see below:


Triple check your IQNs, MACs, and IPs to make sure there are no duplicates, and if you have multiple interfaces connected to the same L2 network, make sure IP forwarding is disabled...
> IP forwarding is enabled on XenServer because it's needed to run the VMs. Is this a problem?

Does it come back up if you disable iSCSI on the third SAN? What errors are you getting on the xen hosts?
> On the XenServer host I have:
Kernel reported iSCSI connection 1:0 error
> Yes, when I disable the third SAN the connection resumes.

Thinking a bit more: if it were an STP, MAC, or IP issue, you would see problems with other services on those ports/addresses too.
> Yes, but in fact the entire network goes down on the XenServer involved.
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
I would do some digging with packet captures on your Xen server. It almost sounds like a routing/switching loop or a broadcast storm.
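
If it helps, a rough starting point for that capture (a sketch; assumes the default iSCSI port 3260 and eth0 as the storage NIC on the Xen side):

Code:
# Capture the iSCSI conversation for later analysis in Wireshark
tcpdump -i eth0 -w /tmp/iscsi.pcap port 3260

# Separately, watch for broadcast/ARP storms on the same interface
tcpdump -i eth0 -n broadcast or arp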
 

massmux

Dabbler
Joined
Mar 11, 2018
Messages
41
Following the idea of IQN names: is it possible that I need to check
"Xen initiator compat mode" in the extent configuration on FreeNAS, to be sure the names are OK for XenServer?
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
Following the idea of IQN names: is it possible that I need to check
"Xen initiator compat mode" in the extent configuration on FreeNAS, to be sure the names are OK for XenServer?
I don't have any experience with that setting and the FreeNAS documentation offers no hints on what this setting does.
 

massmux

Dabbler
Joined
Mar 11, 2018
Messages
41
I was adding more info, which is why I changed the post's position. In any case, OK.
The first two servers have 64 GB of RAM while the third has 32 GB, and I was asking whether this could be an issue with iSCSI causing problems on the network.
 

massmux

Dabbler
Joined
Mar 11, 2018
Messages
41
I found a suggestion to set the following parameter:

kern.cam.ctl.iscsi.ping_timeout=0

Could this resolve the issue, in your opinion?
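
For reference, a sketch of how that tunable would be applied (note that 0 disables the NOP-Out watchdog entirely, so it hides the symptom rather than curing the cause; assuming the sysctl is runtime-writable):

Code:
# Apply immediately (does not survive a reboot)
sysctl kern.cam.ctl.iscsi.ping_timeout=0

# For persistence, add it under System -> Tunables in the FreeNAS GUI
# (Variable: kern.cam.ctl.iscsi.ping_timeout, Value: 0, Type: sysctl)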
 

massmux

Dabbler
Joined
Mar 11, 2018
Messages
41
Now the error has evolved into this:
Code:
Mar 29 13:02:16 stor ctld[70127]: 1.2.3.4 (iqn.2015-11.ch.xxx.host101:sdf349d85): error returned from CTL iSCSI handoff request: cfiscsi_ioctl_handoff: icl_conn_handoff failed with error 54; dropping connection
Mar 29 13:02:16 stor ctld[1869]: child process 70127 terminated with exit status 1
Mar 29 13:02:16 stor WARNING: icl_conn_start: disabling TCP_NODELAY failed with error 54

Does anyone have a suggestion?
 
Last edited by a moderator:

massmux

Dabbler
Joined
Mar 11, 2018
Messages
41
For anyone who runs into the same issue: after many tests I found that the problem arises when the I/O request rate is much higher than what the network bandwidth can deliver. If you host something on the initiator side that requests more data per unit of time than the network can carry, the FreeNAS side will drop the connection because the PING (NOP-Out) reply doesn't come back in time. This causes the connection to go down intermittently.
Check your network carefully in this case, because it's also possible that anomalous, potentially malicious, traffic exists somewhere. A memcached server, for example, can generate that kind of traffic. Either way, the network is the thing to check carefully.
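
One way to confirm this from the FreeNAS side is to watch interface throughput while the initiators are busy; if the counters sit at the NIC's line rate when the drops occur, the link is saturated (a sketch using FreeBSD's built-in tools; substitute your interface name):

Code:
# Live per-interface throughput, refreshed every second
systat -ifstat 1

# Per-second counters for a specific NIC
netstat -I igb0 -w 1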
 

James S

Explorer
Joined
Apr 14, 2014
Messages
91
For anyone who runs into the same issue: after many tests I found that the problem arises when the I/O request rate is much higher than what the network bandwidth can deliver. If you host something on the initiator side that requests more data per unit of time than the network can carry, the FreeNAS side will drop the connection because the PING (NOP-Out) reply doesn't come back in time. This causes the connection to go down intermittently.
Check your network carefully in this case, because it's also possible that anomalous, potentially malicious, traffic exists somewhere. A memcached server, for example, can generate that kind of traffic. Either way, the network is the thing to check carefully.
Thanks for the suggestion. I have the same issue, i.e. a ping timeout on my NICs. I'm running FreeNAS 11.1-U5 with 2 direct connections to my VMware (6.0) host. The setup follows John Keen's post: https://johnkeen.tech/freenas-11-iscsi-esxi-6-5-lab-setup/#comment-41
My timeouts (on both NICs) occurred in the middle of the night when the system was virtually asleep. That puts me at variance with "the I/O request rate is much higher than what the network bandwidth can deliver".
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
While not the only possible answer, this is a classic symptom of a transaction group blocking on write (see bug #1531). The typical remediation would be to reduce the size of a transaction group to something that the pool can actually write within the required timeframe. It is very important not to use RAIDZ.
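
On a current FreeNAS, the dirty_data sysctls are the relevant knobs for that; a hedged sketch (the 1GB value is purely illustrative, not a recommendation):

Code:
# Cap outstanding dirty data at 1GB instead of the default min(10% of RAM, 4GB)
sysctl vfs.zfs.dirty_data_max=1073741824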

It's also a good idea to make sure you're not doing something dumb like jumbo frames, which, if you have the wrong hardware, can cause TCP session hangs.
 

James S

Explorer
Joined
Apr 14, 2014
Messages
91
Thanks for the suggestion and the link to the bug (it seems the issues you report in depth are not resolved in later versions of FreeNAS?)
Yes - I'm doing "something dumb like jumbo frames"... but I am confused (I'm not a dedicated IT specialist) about why I should get timeouts on a system that is pretty much at idle. Reports of timeouts come in at 3 or 4 am, and a review of the reporting (in the GUI) shows the system practically idle. Under load, where surely I'd expect problems with bigger frame sizes, I'm not getting these timeouts.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
While not the only possible answer, this is a classic symptom of a transaction group blocking on write (see bug #1531). The typical remediation would be to reduce the size of a transaction group to something that the pool can actually write within the required timeframe. It is very important not to use RAIDZ.

The new OpenZFS write throttle makes a lot of the old tunables (the write_limit family, mostly) useless; they've been replaced by the dirty_data family. Defaults will close a txg at 64MB of size, and allow a max of the smaller of "10% of RAM or 4GB" to be outstanding at a time, with the throttle starting to kick in at 60%. Since the cutover to the "new write throttle" implementation you've really got to work hard to make it straight-up block writes, but a combination of fast network + slow vdevs can still do it.
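
For the curious, those defaults are visible (and adjustable) as sysctls on FreeNAS; a sketch, with the names as they appear on FreeBSD:

Code:
# Max outstanding dirty data: the smaller of 10% of RAM or 4GB
sysctl vfs.zfs.dirty_data_max
sysctl vfs.zfs.dirty_data_max_max

# A txg is closed once this much dirty data accumulates (64MB default)
sysctl vfs.zfs.dirty_data_sync

# The write throttle starts to kick in at this percentage of dirty_data_max (60 default)
sysctl vfs.zfs.delay_min_dirty_percent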

And yes, RAIDZ is very important to avoid in virtualization, and doubly so when using block storage (iSCSI) so if the OP can deliver a spec dump, that would help us sort it.

It's also a good idea to make sure you're not doing something dumb like jumbo frames, which, if you have the wrong hardware, can cause TCP session hangs.

Thanks for the suggestion and the link to the bug (it seems the issues you report in depth are not resolved in later versions of FreeNAS?)
Yes - I'm doing "something dumb like jumbo frames"... but I am confused (I'm not a dedicated IT specialist) about why I should get timeouts on a system that is pretty much at idle. Reports of timeouts come in at 3 or 4 am, and a review of the reporting (in the GUI) shows the system practically idle. Under load, where surely I'd expect problems with bigger frame sizes, I'm not getting these timeouts.

Step one in troubleshooting would be to disable those jumbo frames. If you have a firmware bug, mismatched max MTU (9000 vs 9216, maybe?) or one piece of hardware in the mix doesn't support them (or it isn't enabled) then you'll be dropping packets all over the place.
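
A quick way to smoke-test the jumbo path end to end is a don't-fragment ping at the full payload size (9000-byte MTU minus 28 bytes of IP/ICMP header = 8972). A sketch; 10.0.0.1 is the FreeNAS igb0 address from the config below, and 10.0.0.2 is a placeholder for the peer on that subnet:

Code:
# From FreeNAS (FreeBSD): must succeed unfragmented if the whole path handles MTU 9000
ping -D -s 8972 10.0.0.2

# From the ESXi side, the equivalent is
vmkping -d -s 8972 10.0.0.1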

Can you post a full hardware spec (CPU, RAM, HBA, drive type/count, network config) and a zpool status in CODE tags?
 

James S

Explorer
Joined
Apr 14, 2014
Messages
91
Can you post a full hardware spec
Code:
root@freenas:~ # zpool status
  pool: NAS2
 state: ONLINE
  scan: scrub repaired 0 in 0 days 02:37:43 with 0 errors on Sun Sep 23 02:37:44 2018
config:

        NAME                                            STATE     READ WRITE CKSUM
        NAS2                                            ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/01dd6f7b-ccf5-11e3-8e0e-60a44ce69c93  ONLINE       0     0     0
            gptid/024dced5-ccf5-11e3-8e0e-60a44ce69c93  ONLINE       0     0     0
            gptid/02bcb705-ccf5-11e3-8e0e-60a44ce69c93  ONLINE       0     0     0
            gptid/0329aca0-ccf5-11e3-8e0e-60a44ce69c93  ONLINE       0     0     0

errors: No known data errors

  pool: NAS2_data
 state: ONLINE
  scan: none requested
config:

        NAME                                            STATE     READ WRITE CKSUM
        NAS2_data                                       ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/025b0bdd-c146-11e8-9d80-ac1f6b2542fe  ONLINE       0     0     0
            gptid/03993d48-c146-11e8-9d80-ac1f6b2542fe  ONLINE       0     0     0
            gptid/04da7480-c146-11e8-9d80-ac1f6b2542fe  ONLINE       0     0     0
            gptid/063a9ab5-c146-11e8-9d80-ac1f6b2542fe  ONLINE       0     0     0

errors: No known data errors

  pool: VM-Datastore
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:14:39 with 0 errors on Sun Oct 14 00:14:40 2018
config:

        NAME                                            STATE     READ WRITE CKSUM
        VM-Datastore                                    ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/f8b9d75d-aecb-11e8-8fb0-ac1f6b2542fe  ONLINE       0     0     0
            gptid/f9b84327-aecb-11e8-8fb0-ac1f6b2542fe  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            gptid/fbc1273f-aecb-11e8-8fb0-ac1f6b2542fe  ONLINE       0     0     0
            gptid/fcc066d2-aecb-11e8-8fb0-ac1f6b2542fe  ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:00:12 with 0 errors on Wed Oct 24 03:45:12 2018
config:

        NAME          STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          ada5p2      ONLINE       0     0     0

errors: No known data errors

CPU - AMD Opteron 4122 x 2
RAM - 32729MB
Network config
Code:
root@freenas:~ # ifconfig -a
igb0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
		options=6403bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
		ether ac:1f:6b:25:42:fe
		hwaddr ac:1f:6b:25:42:fe
		inet 10.0.0.1 netmask 0xffffff00 broadcast 10.0.0.255
		nd6 options=9<PERFORMNUD,IFDISABLED>
		media: Ethernet autoselect (1000baseT <full-duplex>)
		status: active
igb1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
		options=6403bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
		ether ac:1f:6b:25:42:ff
		hwaddr ac:1f:6b:25:42:ff
		inet 10.0.1.1 netmask 0xffffff00 broadcast 10.0.1.255
		nd6 options=9<PERFORMNUD,IFDISABLED>
		media: Ethernet autoselect (1000baseT <full-duplex>)
		status: active
em0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
		options=209b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,WOL_MAGIC>
		ether 10:bf:48:4c:99:25
		hwaddr 10:bf:48:4c:99:25
		nd6 options=9<PERFORMNUD,IFDISABLED>
		media: Ethernet autoselect (1000baseT <full-duplex>)
		status: active
em1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
		options=209b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,WOL_MAGIC>
		ether 10:bf:48:4c:99:25
		hwaddr 10:bf:48:4c:93:9b
		nd6 options=9<PERFORMNUD,IFDISABLED>
		media: Ethernet autoselect (1000baseT <full-duplex>)
		status: active
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
		options=600003<RXCSUM,TXCSUM,RXCSUM_IPV6,TXCSUM_IPV6>
		inet6 ::1 prefixlen 128
		inet6 fe80::1%lo0 prefixlen 64 scopeid 0x5
		inet 127.0.0.1 netmask 0xff000000
		nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
		groups: lo
lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
		options=209b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,WOL_MAGIC>
		ether 10:bf:48:4c:99:25
		inet 192.168.2.80 netmask 0xffffff00 broadcast 192.168.2.255
		nd6 options=9<PERFORMNUD,IFDISABLED>
		media: Ethernet autoselect
		status: active
		groups: lagg
		laggproto lacp lagghash l2,l3,l4
		laggport: em0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
		laggport: em1 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>

HBA controller - Asus Pike 2108 (LSI chipset?) - configured as JBOD
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The new OpenZFS write throttle makes a lot of the old tunables (the write_limit family, mostly) useless; they've been replaced by the dirty_data family. Defaults will close a txg at 64MB of size, and allow a max of the smaller of "10% of RAM or 4GB" to be outstanding at a time, with the throttle starting to kick in at 60%. Since the cutover to the "new write throttle" implementation you've really got to work hard to make it straight-up block writes, but a combination of fast network + slow vdevs can still do it.

While the old tunables aren't useful/don't exist today, the problem definitely still exists. If you "sneak up" on the current system, you can still cause it to choke if it hasn't adjusted to the general IOPS capacity of the pool. Your "fast network + slow vdevs" is one specific example of this, but the problem is more generalized. If you have a slow-ish pool and the system hasn't yet "noticed" that because you've never plateaued at max write, you still get that massive first freeze if you stun it with a flood of writes. I used to talk about 1531 and a write_shift value of 3, where you could have 4GB write buffer sizing on a 32GB box, but this would be crazy-big for a host with only four disks (as an example) -- assuming mirrors and 200MB/s write speed, that's 10 seconds best case to flush 4GB. Experimentally, some variation on this is still an issue with the new write throttle. You can pretty reliably cause it to happen on a newly restarted system. I leave that as an exercise for the reader.
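
Spelling out that back-of-envelope number (one reading of the figures above: four disks as two mirror vdevs, 200MB/s per vdev):

Code:
# 4096MB of dirty data / (2 vdevs x 200 MB/s) ~= 10 seconds, best case, to flush
echo $(( 4096 / (2 * 200) ))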

However, there's a more insidious issue, one that's more difficult to fix: when you reach a state where the pool has a lot of fragmentation, ZFS has a hard time learning the pool characteristics based on the write stream, because it really comes down to how many seeks are incurred during write operations. In such a case, the system "learns" that the pool is slow-ish, but this is a classic case of "past performance is no guarantee of future results." If you free a large amount of space on that highly fragmented pool, the system once again learns that "it's fast," and you can cause it to be stunned again when it eventually has to relearn that hard lesson.

Overall, it's significantly better, yes, but to truly fix it, you'd actually need to understand when seeks were going to be experienced in the write stream, and then throttle based on that. That would mean that the only unexpected write delays would be due to disk errors (etc).
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
While the old tunables aren't useful/don't exist today, the problem definitely still exists. If you "sneak up" on the current system, you can still cause it to choke if it hasn't adjusted to the general IOPS capacity of the pool. Your "fast network + slow vdevs" is one specific example of this, but the problem is more generalized. If you have a slow-ish pool and the system hasn't yet "noticed" that because you've never plateaued at max write, you still get that massive first freeze if you stun it with a flood of writes. I used to talk about 1531 and a write_shift value of 3, where you could have 4GB write buffer sizing on a 32GB box, but this would be crazy-big for a host with only four disks (as an example) -- assuming mirrors and 200MB/s write speed, that's 10 seconds best case to flush 4GB. Experimentally, some variation on this is still an issue with the new write throttle. You can pretty reliably cause it to happen on a newly restarted system. I leave that as an exercise for the reader.

However, there's a more insidious issue, one that's more difficult to fix: when you reach a state where the pool has a lot of fragmentation, ZFS has a hard time learning the pool characteristics based on the write stream, because it really comes down to how many seeks are incurred during write operations. In such a case, the system "learns" that the pool is slow-ish, but this is a classic case of "past performance is no guarantee of future results." If you free a large amount of space on that highly fragmented pool, the system once again learns that "it's fast," and you can cause it to be stunned again when it eventually has to relearn that hard lesson.

Overall, it's significantly better, yes, but to truly fix it, you'd actually need to understand when seeks were going to be experienced in the write stream, and then throttle based on that. That would mean that the only unexpected write delays would be due to disk errors (etc).
Thank you for the technical response! I love reading this kind of thing. Also, can we all just have 8TB SSDs and skip the whole head-seek thing? :D
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
CPU - AMD Opteron 4122 x 2
RAM - 32729MB
HBA controller - Asus Pike 2108 (LSI chipset?) - configured as JBOD

Data pool = 4 drives in single RAIDZ2
Data pool 2 = 4 drives in single RAIDZ1
VM pool = 4 drives in 2-way mirrors

Trimmed the specs and zpool status so I don't post a wall of text here. If all of your interfaces are going to the same switch, did you make sure that mtu 9000 is set on both of the 10.x.x.x subnets?

Not related to the ping dropouts since they weren't under load, but you're using sync=standard so I'm betting the 2x1Gbps link could overwhelm those 4 drives in mirrors if there's any kind of fragmentation in play.
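
If it's useful, both suspects can be checked in a few seconds (a sketch; pool/dataset names taken from the zpool status above):

Code:
# Free-space fragmentation per pool
zpool list -o name,size,alloc,free,frag,cap

# Current sync setting on the VM datastore
zfs get sync VM-Datastore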
 
Last edited:

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Warning, here be sidebars and technical OpenZFS dragons.

While the old tunables aren't useful/don't exist today, the problem definitely still exists. If you "sneak up" on the current system, you can still cause it to choke if it hasn't adjusted to the general IOPS capacity of the pool. Your "fast network + slow vdevs" is one specific example of this, but the problem is more generalized. If you have a slow-ish pool and the system hasn't yet "noticed" that because you've never plateaued at max write, you still get that massive first freeze if you stun it with a flood of writes. I used to talk about 1531 and a write_shift value of 3, where you could have 4GB write buffer sizing on a 32GB box, but this would be crazy-big for a host with only four disks (as an example) -- assuming mirrors and 200MB/s write speed, that's 10 seconds best case to flush 4GB. Experimentally, some variation on this is still an issue with the new write throttle. You can pretty reliably cause it to happen on a newly restarted system. I leave that as an exercise for the reader.

The problem of "network can write faster than vdevs" is ultimately one that's only truly solvable with hardware. I heard the price of NAND flash is supposed to crash in 2019 ... ;)

The new OpenZFS throttle doesn't try to "benchmark your vdevs" to figure out how fast your pool is - you get a set of knobs with the dirty_data tunables, and some relatively sane defaults. With low write pressure, the amount of async I/Os (bulk dirty data flushes) queued to vdevs themselves is low; as you increase the amount of outstanding dirty data in the system, more I/O is queued up until you hit the point at which the throttle starts to apply; then it scales up rapidly towards a maximum delay value (currently 100ms) but this behavior is consistent regardless of pool, boot time, data ingested - there's no "learning" or "benchmarking" aspect, if you want it to behave differently (throttle sooner or later, more aggressively or gradually, or allow for slower or faster vdevs) you need to fiddle with those knobs.
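
For completeness, the shape of that throttle curve has its own knobs; a sketch of inspecting them (worth reading before you fiddle):

Code:
# How steeply the delay ramps once past the kick-in point
sysctl vfs.zfs.delay_scale

# The kick-in point itself, as a percentage of dirty_data_max
sysctl vfs.zfs.delay_min_dirty_percent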

However, there's a more insidious issue, one that's more difficult to fix: when you reach a state where the pool has a lot of fragmentation, ZFS has a hard time learning the pool characteristics based on the write stream, because it really comes down to how many seeks are incurred during write operations. In such a case, the system "learns" that the pool is slow-ish, but this is a classic case of "past performance is no guarantee of future results." If you free a large amount of space on that highly fragmented pool, the system once again learns that "it's fast," and you can cause it to be stunned again when it eventually has to relearn that hard lesson.

Overall, it's significantly better, yes, but to truly fix it, you'd actually need to understand when seeks were going to be experienced in the write stream, and then throttle based on that. That would mean that the only unexpected write delays would be due to disk errors (etc).

Absolutely yes, free space fragmentation is a factor in vdev performance - but again, ZFS won't "know" that the vdevs are slow due to fragmentation, it will just throttle more quickly, and you'll need to tune the knobs to smooth out the throttle curve; or again, "throw hardware at it." Until block pointer rewrite exists - and judging by the difficulty, this will require divine intervention - the only cure for lousy performance from fragmentation is prevention. Don't take that "only use 50% of your pool" lightly if you want to retain speed.

SSDs introduce their own set of gremlins of course with the necessary garbage collection and spotty TRIM support, but that's a diatribe for later.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Thank you for the technical response! I love reading this kind of thing. Also, can we all just have 8TB SSDs and skip the whole head-seek thing? :D

Well, not that technical and I took some liberties with facts/reality/details to try to give the big picture.

It is correct to say that the situation is much improved, but further improvement is possible. That would, however, require significant effort which may not be worth the investment of precious developer time. For people who would be impacted by this issue, the problem can be remediated mostly-acceptably by tweaking the existing tunables to reduce the transaction group size, using mirrors, and keeping large amounts of free space available on the pool, which helps the system avoid needing to do large numbers of seeks during writes.

SSD's have their own performance boggles, by the way. In the context of this discussion, where we're discussing an iSCSI timeout of five seconds, you might be able to cause pain with large transaction groups, by running the SSD's out of free pages, which causes write performance to tank.

https://www.seagate.com/tech-insights/lies-damn-lies-and-ssd-benchmark-master-ti/

While Seagate might have a bit of an axe to grind with SSD's, this is nevertheless fairly decent reporting. It shows how an SSD that is spec'd to write 275MBytes/sec actually hits a steady state more like 25MBytes/sec under constant pressure. This is the more interesting number, and ideally it's what you should design for if you've got a heavy write environment.
 