Massive CPU usage on concurrent access (iSCSI)

Status
Not open for further replies.

Jason Keller

Explorer
Joined
Apr 2, 2015
Messages
61
Hi folks! Been a while since I was last on the forums, and I have another stumper. Hopefully @jgreco and @mav (and all of you!) have some ideas, as it's quite strange.

So I've got a pool on a storage head (HP DL380 G6, 2x E5649, 96GB memory, 16x Intel 320 series 120GB SSDs, 2x LSI 9207-8i, Chelsio T520-CR). Basic settings on the pool (LZ4 on, no dedup, no sync, no atime). I built a slightly overprovisioned ZVOL (1TB) and exported it via iSCSI, multipathed on the Chelsio card to the two 10Gbit BNT switches in our BladeCenter H chassis. The blades (2x E5-2670, 96GB memory, Broadcom 57711) use software iSCSI in vSphere 6.5 across both fabrics with IOPS=1 round-robin multipathing. The LUN is consumed as a VMFS6 datastore.
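For reference, the round-robin policy on the ESXi side is applied roughly like this, per LUN on each host (naa.XXXX is a placeholder, not my real device ID):

Code:
esxcli storage nmp device set --device=naa.XXXX --psp=VMW_PSP_RR                            # set the round-robin PSP
esxcli storage nmp psp roundrobin deviceconfig set --device=naa.XXXX --type=iops --iops=1   # rotate paths every I/O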

Some interesting bits: jumbo frames and TOE have been tried on and off with minimal impact (TOE seems to make things steadier).

So here we get to the meat of the issue...when I'm running IOmeter in one VM on one blade, I'm hitting 980MB/s at 120,000 IOPS with an average 260us latency for a 100% random 8k read run. CPU sits around 17%, and I'm only hitting about 4.5Gbit/s per 10Gbit lane.

So let's add another VM/blade into the equation! I fire up IOmeter in another VM (on a different blade) and start that one up as well. My latencies double, and my IOPS and bandwidth get cut in half between the VMs. Ready for the worst part? CPU spikes up to 75% and stays there for the duration.

And this is where I'm puzzled: did something change in FreeNAS? I hadn't run my IOmeter instances in quite a while, and now suddenly this is happening. I've read the posts in the 10Gbit thread about only being able to get to ~4 Gigabit, but that was VM to VM; this is bare-metal FreeNAS to ESXi. My previous setup had two ESXi heads direct-attached to a similar storage head configuration with one lane each, and I was able to get 120k IOPS per host, around 970MB/s throughput, and 350us latency per 8k read using Chelsio's hardware-offload iSCSI in ESXi with both hosts hitting the storage simultaneously (yes, I was scraping a quarter million IOPS in 8k reads!). CPU never lifted above 40% or so (granted, this was late 2015 or so).

I've tried 9.10.2 U3 as well as 11-RC, with no change in behavior. :confused: If anyone has any ideas I'm all ears.
 

mav@

iXsystems
Joined
Sep 29, 2011
Messages
1,428
IIRC, in my tests 120K IOPS was the peak for a single target LUN, since each LUN is mostly handled by one CPU core. That may explain why your IOPS split in half. The high CPU usage is, I'd guess, the result of a very deep request queue that has to be sorted at a high rate, plus heavy lock contention. You could try creating separate LUNs/datastores for different VMs and retrying the test.
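For example, a second sparse ZVOL can be created from the shell roughly like this (the pool/zvol names are hypothetical; in FreeNAS you would normally create the zvol in the UI and then add a matching extent and LUN mapping):

Code:
# Hypothetical names; create a second thin zvol to back a second datastore
zfs create -s -V 500G -o volblocksize=16k tank/iscsi-vm2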
 

Jason Keller

Explorer
Joined
Apr 2, 2015
Messages
61

Thanks @mav@; unfortunately, creating an additional ZVOL/extent/datastore made zero difference with the VMs split between them. Do I need to define a whole second iSCSI target to place the additional LUN on, or can they be on the same target?

If I go to 64k reads, CPU hovers at about 19% and I almost max out the wire (strangely, one lane seems to lag behind the other). Latency per 64k read is about 2ms.

[Attached screenshot: upload_2017-5-6_21-30-46.png]
 

mav@

iXsystems
Joined
Sep 29, 2011
Messages
1,428
A second target is not related to the LUN bottleneck. But having one would make VMware open a second iSCSI/TCP connection, which is another possible congestion point to avoid.
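To illustrate, the resulting /etc/ctl.conf would then carry two targets, each with its own LUN, roughly like this (FreeNAS generates this file from the UI configuration; the names below are made up):

Code:
portal-group pg1 {
    discovery-auth-group no-authentication
    listen 10.0.0.10:3260
}

target iqn.2005-10.org.freenas.ctl:ds1 {
    auth-group no-authentication
    portal-group pg1
    lun 0 {
        path /dev/zvol/tank/iscsi-vm1
    }
}

target iqn.2005-10.org.freenas.ctl:ds2 {
    auth-group no-authentication
    portal-group pg1
    lun 0 {
        path /dev/zvol/tank/iscsi-vm2
    }
}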
 

Jason Keller

Explorer
Joined
Apr 2, 2015
Messages
61
Would any other offload help alleviate this? I remember you mentioned a forthcoming iSCSI offload in another thread? Also, I've added 'toe' to the interface options (toe mtu 9000) for additional testing; do I need a reboot for this to take effect?
 

mav@

iXsystems
Joined
Sep 29, 2011
Messages
1,428
TOE offload may indeed help with TCP congestion with little risk, so you may try it. Rebooting is probably the easiest way, but it should be enough to enable the toe flag on the interface with `ifconfig cxgbe0 toe`, then check its presence with `ifconfig cxgbe0`, after which you need to restart ctld with `service ctld restart` (or the iSCSI service in the UI) and check that it uses TOE sockets instead of regular TCP with `netstat -an | grep 3260`.

Full iSCSI offload is a special case of the above TOE; it may indeed give some more performance, but it has more potential side effects, and it is possibly not polished enough yet in 9.10.2. It got some more polishing for 11.0, but we still don't enable it by default.
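Put together, the sequence looks roughly like this (using cxl0 here, which is the interface name FreeBSD gives the ports of a T5 card like the T520-CR):

Code:
ifconfig cxl0 toe            # enable the TOE flag on the interface
ifconfig cxl0 | grep -i toe  # confirm TOE shows up in the interface capabilities
service ctld restart         # restart ctld so new connections use TOE sockets
netstat -an | grep 3260      # iSCSI connections should now show as TOE rather than plain tcp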
 

Jason Keller

Explorer
Joined
Apr 2, 2015
Messages
61

Spoiler for the rubbernecking crowd: just reboot after safely detaching, as @mav@ says above. Otherwise you will see the fastest PDL (permanent device loss) of a datastore you've ever seen in ESXi (oh, and on 9.10.2-U3 it crashed the OS too).
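For anyone following along, the "safely detaching" part on the ESXi side goes roughly like this (the datastore label and device ID below are placeholders, not my real ones):

Code:
esxcli storage filesystem unmount --volume-label=iscsi-ds1     # unmount the VMFS datastore
esxcli storage core device set --state=off --device=naa.XXXX   # detach the backing iSCSI device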

@mav@, this isn't normal, is it?

[Attached screenshot: upload_2017-5-6_23-3-52.png]
 

mav@

iXsystems
Joined
Sep 29, 2011
Messages
1,428
It is hardly normal. I don't remember this specific panic, but update_clip_table() is a function in Chelsio's TOE offload code. I'd retry it on 11.0.
 

Jason Keller

Explorer
Joined
Apr 2, 2015
Messages
61

Is there any quick way to revert that change during boot so I can get it started again? That would make it easier for me to roll up to 11 to test again.
 

mav@

iXsystems
Joined
Sep 29, 2011
Messages
1,428
The easiest way would be to boot the previous boot environment, if it was created not too long ago. Removing the NIC, or disabling it with a loader tunable, might also do the trick.
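If no recent boot environment is available, something like this at the loader prompt might work, assuming the cxgbe(4) tunable hw.cxgbe.toecaps_allowed applies to your driver version:

Code:
# At the boot menu, escape to the loader prompt, then:
set hw.cxgbe.toecaps_allowed=0   # keep the driver from enabling TOE capabilities
boot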
 

Jason Keller

Explorer
Joined
Apr 2, 2015
Messages
61
I got it back up by disabling cxl0 and cxl1 via kernel parameters at boot to undo the TOE change. I then upgraded to 11-RC, and some interesting results abound: CPU dropped by about 6% (mainly in interrupts) and 8K writes increased a bit; however, larger read and write strides became very unpredictable in IOmeter even with just a single host hitting the LUN (sometimes 6,000/sec, sometimes tens of thousands per second), with lower CPU usage. You can see the effect here...

[Attached screenshot: upload_2017-5-7_10-26-2.png]


Otherwise, it hardly impacted performance at all (certainly not enough to justify the side effects), and it doesn't alleviate the main issue.

Any other ideas?
 

Jason Keller

Explorer
Joined
Apr 2, 2015
Messages
61
Just tried NFS, and while it seems to hit the CPU less, wow, the performance is terrible compared to iSCSI unless you're doing 64k reads. With 12 NFS servers at 8k, not only is it bad (about 40k read IOPS, about 2.3Gbit/s or so), it also takes the same 50% performance cut when the second host hits it with IOmeter. 64k reads can flood the wire, and at that point I don't see any performance falloff from the second host hitting it (but CPU is about 50%).

@mav@, is there any way forward for increasing performance with two or more nodes hitting pool storage via iSCSI that I haven't tried yet?
 

mav@

iXsystems
Joined
Sep 29, 2011
Messages
1,428
What block/record size are you using for your ZVOLs/datasets? Too small a size increases CPU overhead on large reads, and especially on writes, since more allocation work is required.

What is your pool topology? I hope it is mirrors, not some RAIDZ?

Also, "IOPS=1 RR multipathing" can confuse ZFS prefetching, which may also cause performance regressions. If you have many initiator hosts, let them distribute across the links more naturally and statically.
 

Jason Keller

Explorer
Joined
Apr 2, 2015
Messages
61

Using 16k blocks on the ZVOL (default), definitely mirrors ;)

I locked each test host onto one path (one lane each) and it made no difference in performance or CPU usage on the head. I also tried my direct-wire environment with all Chelsio cards, and now it exhibits the same behavior I see above, whereas earlier, on 9.3, that setup had no performance issues. Is something drastically different between 9.3 and 9.10 in ctld?
 

mav@

iXsystems
Joined
Sep 29, 2011
Messages
1,428
There are changes all the time, and when it comes to 10G+ at small block sizes, there are no minor changes. Between 9.3 and 9.10 the FreeBSD base changed from 9.3-STABLE to 10.3-STABLE. Do you still see the issue with 11.0-RC, which is based on FreeBSD 11.0-STABLE?
 

mav@

iXsystems
Joined
Sep 29, 2011
Messages
1,428
It may need deeper investigation, which we don't really provide on the forums for free, but to quickly diagnose the CPU load in case it is something obvious, you could provide the output of `top -SHIz` and `pmcstat -n 10000000 -TS unhalted-cycles`.
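Roughly like this; pmcstat needs the hwpmc(4) kernel module if it is not already loaded:

Code:
kldload hwpmc                             # sampling driver used by pmcstat
top -SHIz                                 # per-thread view including kernel threads, hiding idle
pmcstat -n 10000000 -TS unhalted-cycles   # top-style profile of where the CPU cycles go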
 

Jason Keller

Explorer
Joined
Apr 2, 2015
Messages
61
From our very newest storage head (christened today), with dual ten-core Intel(R) Xeon(R) E5-2648L v2 CPUs and 96GB of memory (verified to exhibit the exact same behavior, running 9.10.2-U3 like the others)...

Code:
top -SHIz

last pid: 16051;  load averages:  9.28,  2.29,  0.84																									 up 0+06:35:21  20:12:49
1597 processes:63 running, 1437 sleeping, 97 waiting
CPU:  0.0% user,  0.0% nice, 45.7% system,  7.1% interrupt, 47.2% idle
Mem: 97M Active, 595M Inact, 12G Wired, 4796K Cache, 81G Free
ARC: 9389M Total, 1927M MFU, 7314M MRU, 100K Anon, 139M Header, 9593K Other
Swap:

  PID USERNAME   PRI NICE   SIZE	RES STATE   C   TIME	WCPU COMMAND
   12 root	   -92	-	 0K  1616K CPU2	2  11:46  91.80% intr{irq267: bxe0:fp0}
	0 root		 8	-	 0K 20656K CPU36  36  11:34  90.19% kernel{cblk0 taskq_7}
	0 root		 8	-	 0K 20656K -	  37  11:34  90.19% kernel{cblk0 taskq_10}
	0 root		 8	-	 0K 20656K CPU10  10  11:35  90.09% kernel{cblk0 taskq_11}
	0 root		 8	-	 0K 20656K CPU25  25  11:34  90.09% kernel{cblk0 taskq_2}
	0 root		 8	-	 0K 20656K CPU18  18  11:34  89.89% kernel{cblk0 taskq_5}
	0 root		 8	-	 0K 20656K CPU34  34  11:35  89.79% kernel{cblk0 taskq_3}
	0 root		 8	-	 0K 20656K CPU32  32  11:34  89.70% kernel{cblk0 taskq_1}
	0 root		 8	-	 0K 20656K -	  16  11:34  89.45% kernel{cblk0 taskq_4}
   12 root	   -92	-	 0K  1616K CPU3	3  11:49  88.67% intr{irq273: bxe1:fp0}
	0 root		 8	-	 0K 20656K CPU4	4  11:34  87.70% kernel{cblk0 taskq_13}
	0 root		 8	-	 0K 20656K CPU39  39  11:34  87.06% kernel{cblk0 taskq_6}
	0 root		 8	-	 0K 20656K -	  17  11:34  84.08% kernel{cblk0 taskq_0}
	0 root		 8	-	 0K 20656K CPU30  30  11:34  83.98% kernel{cblk0 taskq_8}
	0 root		 8	-	 0K 20656K CPU28  28  11:35  83.15% kernel{cblk0 taskq_12}
	0 root		 8	-	 0K 20656K CPU9	9  11:34  82.67% kernel{cblk0 taskq_9}
	6 root		-8	-	 0K   176K CPU4	4   9:45  71.39% zfskern{dbuf_evict_threa}
	5 root	   -16	-	 0K   192K CPU26  26   9:59  71.29% ctl{work0}
	0 root	   -16	-	 0K 20656K icl_rx 17   7:06  53.17% kernel{cfiscsirx}
	0 root	   -16	-	 0K 20656K icl_tx 12   6:56  52.78% kernel{cfiscsitx}
   12 root	   -92	-	 0K  1616K RUN	 1   7:05  51.86% intr{irq266: bxe0:fp0}
   12 root	   -92	-	 0K  1616K CPU1	1   6:49  51.07% intr{irq271: bxe1:fp0}
	0 root	   -92	-	 0K 20656K CPU20  20   6:50  50.68% kernel{cfiscsitx}
	0 root	   -16	-	 0K 20656K icl_rx 20   7:01  50.20% kernel{cfiscsirx}
	5 root	   -16	-	 0K   192K CPU25  25   4:17  32.67% ctl{work5}
	0 root	   -16	-	 0K 20656K CPU20  20   4:26  32.18% kernel{cfiscsirx}
	5 root	   -16	-	 0K   192K -	  32   4:20  32.18% ctl{work8}
	0 root	   -16	-	 0K 20656K CPU33  33   4:29  31.69% kernel{cfiscsitx}
	0 root	   -16	-	 0K 20656K CPU31  31   4:25  30.27% kernel{cfiscsitx}
	0 root	   -16	-	 0K 20656K icl_rx  6   4:15  27.98% kernel{cfiscsirx}
	5 root	   -16	-	 0K   192K -	  15   3:10  24.27% ctl{work6}
	5 root	   -16	-	 0K   192K -	  14   3:12  23.68% ctl{work7}
	0 root	   -92	-	 0K 20656K -	  10   0:16   1.27% kernel{bxe0_fp2_tq}
12354 root		20	0 30164K  8412K CPU23  23   1:09   1.17% top
	0 root	   -92	-	 0K 20656K -	   8   0:14   0.98% kernel{bxe1_fp3_tq}
	0 root	   -92	-	 0K 20656K -	   7   0:05   0.20% kernel{bxe1_fp1_tq}
	6 root		-8	-	 0K   176K tx->tx 21   0:15   0.10% zfskern{txg_thread_enter}
	0 root	   -92	-	 0K 20656K -	  29   0:06   0.10% kernel{bxe0_fp1_tq}


pmcstat -n 10000000 -TS unhalted-cycles

PMC: [CPU_CLK_UNHALTED_CORE] Samples: 23680 (100.0%) , 0 unresolved

%SAMP IMAGE	  FUNCTION			 CALLERS
28.9 kernel	 lz4_decompress	   arc_decompress
  6.0 kernel	 __rw_wlock_hard	  tcp_usr_rcvd:2.8 in_pcblookup_hash:1.7 tcp_usr_send:1.4
  5.4 kernel	 __mtx_lock_sleep	 ctl_work_thread:2.5 ctl_serseq_done:2.4
  3.9 kernel	 cpu_search_highest   cpu_search_highest
  3.1 kernel	 memcpy			   icl_pdu_append_data
  1.8 kernel	 _sx_xlock_hard	   buf_hash_find
  1.5 kernel	 ctl_work_thread	  fork_exit
  1.5 kernel	 uma_zalloc_arg
  1.4 kernel	 cpu_idle			 sched_idletd
  1.3 kernel	 sched_switch		 mi_switch
  1.2 kernel	 ctl_check_ooa		ctl_work_thread
  1.2 kernel	 bzero
  1.1 kernel	 cpu_search_lowest	cpu_search_lowest
  1.1 kernel	 tcp_output		   tcp_usr_send
  1.0 kernel	 bcopy
  0.9 kernel	 __rw_rlock
  0.8 kernel	 bxe_tx_encap		 bxe_tx_mq_start_locked
  0.8 kernel	 thread_lock_flags_
  0.8 kernel	 _mtx_lock_spin_cooki
  0.8 kernel	 uma_zfree_arg
  0.8 kernel	 cpu_switch		   mi_switch
  0.7 kernel	 dbuf_create		  dbuf_hold_impl
  0.7 kernel	 sleepq_lock
  0.7 kernel	 mb_free_ext
  0.6 kernel	 tdq_notify		   sched_add
  0.6 kernel	 sched_idletd		 fork_exit
  0.6 kernel	 uma_find_refcnt	  mb_ctor_clust
  0.6 kernel	 buf_hash_find		arc_read
  0.6 kernel	 tcp_do_segment	   tcp_input
  0.5 kernel	 ctl_serseq_done	  ctl_be_block_dispatch_zvol
 

Jason Keller

Explorer
Joined
Apr 2, 2015
Messages
61
@mav@, is there any chance that compressed ARC has something to do with this? I've been trying to disable it for testing via loader tunable...

vfs.zfs.compressed_arc_enabled=0

However, after boot this is always 1. I've checked that the line is present in /boot/loader.conf, but to no avail.
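For reference, what I have and how I'm checking it (assuming the sysctl name is correct for this release):

Code:
grep compressed_arc /boot/loader.conf    # the tunable line is present
sysctl vfs.zfs.compressed_arc_enabled    # still reports 1 after a reboot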
 

mav@

iXsystems
Joined
Sep 29, 2011
Messages
1,428
I'd say it's possible, if your test involves reading from ARC, since compressed ARC keeps only a small part of the data uncompressed. Are you trying this tunable on FreeNAS 9.10 or 11? I haven't tested it, but I'm afraid it indeed may not work on 9.10; it seems to be a bug caused by a difference in the sysctl code between FreeBSD 10.x and 11.x.

For the ARC read test you could either read/write zeroes, which compress into almost nothing and so "uncompress" much faster, or, conversely, disable compression for the dataset, since compressed ARC does not compress data by itself; it just keeps data in ARC the same way it is stored on disk.
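For example (the dataset/zvol name is hypothetical):

Code:
# Either let ARC hold the test data uncompressed by disabling compression on it...
zfs set compression=off tank/iscsi-vm1
# ...or keep compression on but run the test against all-zero data, which LZ4 "decompresses" almost for free.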
 