iSCSI tests suggest a problem with read performance

Status
Not open for further replies.

bestboy

Contributor
Joined
Jun 8, 2014
Messages
198
I have been doing some tests for using iSCSI without a bare metal hypervisor.

What I'm trying to do:
I'm trying to use my FreeNAS as a small "home SAN". My goal is to convert my desktops into "thin clients" which have a small and fast SSD for the OS and apps, but no more HDDs for data. Instead of HDDs the desktops rely entirely on iSCSI block storage as secondary drives.

Acceptance criteria:
Using 2 NICs I want to achieve about 180 MB/s throughput for sequential reads and writes, to match a fast desktop HDD (2 × 1 GbE tops out at roughly 235 MB/s of TCP payload, so 180 MB/s leaves some headroom).
Compared to physical HDDs I also expect fast random writes (courtesy of the asynchronous behavior of iSCSI and ZFS) and more or less slow random reads (courtesy of sluggish vdev seek times plus network latency).

The problem:
Well, my problem is entirely with the sequential read performance. While sequential write performance is very good and as expected, it seems that I'm stuck hitting a performance barrier for reads. I can hardly exceed 60 MB/s even with multiple concurrent connections.

What I tested:
I tested the new CTL iSCSI target in FreeNAS 9.3 as well as 9.2.1.8 against 3 different iSCSI initiators.

The test setup:
The server:
  • Hardware:
    • Intel Xeon 1230 v3,
    • Supermicro X10-something-something with 2 onboard Intel 1GbE NICs
    • 16 GB RAM
    • IBM M1015 HBA
    • 3 mirror vdevs with WD REDs (writes: ~320 MB/s, reads: ~460 MB/s)
  • Software:
    • FreeNAS 9.2.1.8 and 9.3
    • CTL iSCSI target in both cases
    • ZVOL extents with 4 KB block size (created roughly as sketched after the client list below)
    • LZ4 compression enabled
The client:
  • Hardware:
    • Intel Xeon 1231 v3
    • 8 GB RAM
    • 256 GB Samsung SSD 850 Pro
    • 2 Intel 1GbE NICs (1 onboard, 1 PCIe)
  • Software:
    • Windows Server 2008 R2 SP1, Windows iSCSI Initiator w/ MPIO
    • Gnomebuntu Linux 14.10, open-iscsi w/ multipath-tools for MPIO
    • Windows 7 SP1, Windows iSCSI Initiator
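
For reference, the ZVOL extents on the server were created with settings equivalent to this (a sketch; the pool and ZVOL names are placeholders, the 60 GB size matches the Linux test target shown further down):
Code:
# 60 GB ZVOL with 4 KB volblocksize and LZ4 compression (pool/dataset names are placeholders)
zfs create -V 60G -o volblocksize=4K -o compression=lz4 tank/iscsi/desktop-data
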
Test 1: Windows Server 2008 R2 SP1 with FreeNAS 9.2.1.8 and FreeNAS 9.3
First I wanted to test MPIO performance under Windows. Since Windows 7 does not support MPIO, I used Windows Server 2008 R2 for this test. I set up MPIO with 2 connections to a single target with round robin load balancing.

The resulting random read and random write performance was good.
The sequential write performance was excellent with about 200 MB/s and traffic spread evenly over both NICs.
But unfortunately the sequential read performance was just bad:

Traffic was load balanced over both NICs, so neither of them had much to do. Throughput never exceeded 60 MB/s, no matter whether I used a synthetic benchmark or just copied a big video file off the iSCSI drive.

When I first encountered this phenomenon with 9.2.1.8 I assumed something was wrong with the still experimental CTL. So I decided to give the brand new 9.3 a spin, because I read that lots of improvements went into CTL with 9.3. Setting up a new server with 9.3 was a piece of cake, but unfortunately I could not overcome the 60 MB/s read barrier with it. While IOPS, random reads and random writes improved, sequential reads remained equally poor.

At that point I assumed that there must be something wrong with the Microsoft iSCSI Initiator. So I decided to give a recent Linux a try and kicked the Windows Server in the bin.

Test 2: Gnomebuntu 14.10 with FreeNAS 9.3
I set up open-iscsi with MPIO and round robin load balancing according to this excellent tutorial.
Code:
root@Mirakulu:~# multipath -ll
   cygnus-san-test (36589cfc000000fafb2155bc92e60c776) dm-0 FreeBSD,iSCSI Disk
   size=60G features='0' hwhandler='0' wp=rw
   `-+- policy='round-robin 0' prio=1 status=active
    |- 6:0:0:0 sdb 8:16 active ready running
    `- 7:0:0:0 sdc 8:32 active ready running
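
For anyone who wants to reproduce the client side: the tutorial boils down to roughly the following (a sketch; the portal IPs are the ones from my iperf tests further down, the target IQN is a placeholder):
Code:
# discover the target on both portals
iscsiadm -m discovery -t sendtargets -p 172.22.22.11
iscsiadm -m discovery -t sendtargets -p 172.23.23.11

# log in to the same target once per path
iscsiadm -m node -T <target-iqn> -p 172.22.22.11 --login
iscsiadm -m node -T <target-iqn> -p 172.23.23.11 --login

# multipath-tools then groups the resulting block devices (sdb, sdc) into the dm-0 device shown above
multipath -ll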


The results were not much different from the Windows Server 2008 results, though:
Code:
root@Mirakulu:/mnt# dd if=/dev/urandom of=testfile.dat bs=2M count=10000
10000+0 records in
10000+0 records out
20971520000 bytes (21 GB) copied, 1541,49 s, 13,6 MB/s

root@Mirakulu:/mnt# cd cygnus-san-test/

root@Mirakulu:/mnt/cygnus-san-test# dd if=../testfile.dat of=testfile.dat bs=2M count=10000
10000+0 records in
10000+0 records out
20971520000 bytes (21 GB) copied, 109,907 s, 191 MB/s

root@Mirakulu:/mnt/cygnus-san-test# dd if=testfile.dat of=/dev/null bs=2M count=10000
10000+0 records in
10000+0 records out
20971520000 bytes (21 GB) copied, 237,857 s, 88,2 MB/s


Sequential writes with FreeNAS 9.3 were as good as ever, but sequential reads, even though slightly improved, remained bad and far below my acceptance criterion of about 180 MB/s.
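
A note for anyone repeating the read test: part of it can be served from the client's page cache, so it is worth flushing the Linux caches between runs, along these lines:
Code:
# flush dirty pages, then drop page cache, dentries and inodes on the client
sync
echo 3 > /proc/sys/vm/drop_caches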

After the test on Linux it was rather clear that the problem was not the iSCSI initiator. The problem seemed to be with FreeNAS and/or CTL.
myself said:
What's the point of using MPIO when I can only read 60 MB/s?! I don't need fancy MPIO for that. I can have 60 MB/s slowness with a single NIC, too :/
Disappointed, I put Windows 7 back on my desktop and the USB stick with FreeNAS 9.2.1.8 back into the server.

Test 3: Windows 7 with FreeNAS 9.2.1.8
So I decided to conduct one last test. In order to take CTL's multipath handling out of the equation, I connected each of the 2 NICs to its own independent target. I then created a dynamic stripe in Windows (software RAID 0) spanning both targets.
I was a little surprised to see that this "multipath RAID" worked okay, but the read results were nevertheless the same. With this pseudo-MPIO I hit the 60 MB/s read barrier just as with real MPIO or even a single-connection target.

Questions:
  • What determines the sequential read performance of CTL in theory? Thread contention, kernel memory, IO subsystem latencies...?
  • Why is the CTL read throughput so low compared to other "file shares" accessing the pool? I don't believe it's the ZFS pool, as CIFS and FTP work very well and seem limited only by the 1 GbE connection.
  • Is there anything that can be done from the user side in terms of configuration? I'd gladly trade some IOPS for throughput.
  • What is your experience with CTL? Can you get adequate sequential reads?
Any comments, suggestions, remarks and rants are welcome.

/bb
 

zambanini

Patron
Joined
Sep 11, 2013
Messages
479
Nice test setup and documentation.

I get over 220 MB/s sequential r/w with MPIO and two NICs, but my setup is a little bit bigger (24 SAS disks as striped mirrors, 96 GB RAM).

Did you test your network speed with iperf? (both directions)

Test with a bigger record size. Also take a look at the CLI output of zpool iostat -v 1 while you test.

Did you also test another iSCSI target, like your Windows server, or Nexenta / napp-it?
 

mav@

iXsystems
iXsystems
Joined
Sep 29, 2011
Messages
1,428
I can second the above comment about the 4K ZVOL block size. While 4K is perfect for avoiding read-modify-write patterns, it can increase CPU overhead on sequential operations and data fragmentation over time. Which is more important in your case depends on your specific workload.

But I think that your MPIO problem may be a result of the ZFS prefetcher being confused by request reordering over the different paths. Have you measured single-connection performance before setting up MPIO? What was the linear read speed over a single connection?

Can you set up an experiment where all ZVOL content fits into the ARC after several test runs? That would take the ZFS prefetcher and the whole disk subsystem out of the equation. Does it saturate both MPIO links in that case?
 

sfcredfox

Patron
Joined
Aug 26, 2014
Messages
340
OP,

Unless I missed it, were all your tests using iSCSI over the network? Did you actually test your disk subsystem first?

Something like:
Code:
dd if=/dev/zero of=/mnt/poolname/test.file bs=1048576
dd if=/mnt/poolname/test.file of=/dev/null bs=1048576

dd if=/dev/zero of=/mnt/poolname/test.file bs=2M count=50000
dd if=/mnt/poolname/test.file of=/dev/null bs=2M count=50000

(with compression disabled)

Maybe you already ruled out the disks and know the pool is capable of performing as needed?
That's the first test I think of in complex scenarios, then iperf as Zam mentioned, and only then involving protocols and initiators.
 

bestboy

Contributor
Joined
Jun 8, 2014
Messages
198
Did you test your network speed with iperf? (both directions)
I did this some time ago and network throughput was OK. I can also get line speed with FTP and even CIFS :)
Still, I made a new measurement just to be sure:
Code:
[root@Cygnus] ~# iperf -c 172.22.22.11 -t 60
------------------------------------------------------------
Client connecting to 172.22.22.11, TCP port 5001
TCP window size: 65.0 KByte (default)
------------------------------------------------------------
[  3] local 172.22.22.22 port 31485 connected with 172.22.22.11 port 5001
[ ID] Interval  Transfer  Bandwidth
[  3]  0.0-60.0 sec  6.57 GBytes  941 Mbits/sec

[root@Cygnus] ~# iperf -c 172.23.23.11 -t 60
------------------------------------------------------------
Client connecting to 172.23.23.11, TCP port 5001
TCP window size: 65.0 KByte (default)
------------------------------------------------------------
[  3] local 172.23.23.22 port 28661 connected with 172.23.23.11 port 5001
[ ID] Interval  Transfer  Bandwidth
[  3]  0.0-60.0 sec  6.56 GBytes  939 Mbits/sec

C:\Program Files (x86)\iperf-2.0.5-3-win32>iperf -c 172.22.22.22 -w 65k -t 60
------------------------------------------------------------
Client connecting to 172.22.22.22, TCP port 5001
TCP window size: 65.0 KByte
------------------------------------------------------------
[  3] local 172.22.22.11 port 1065 connected with 172.22.22.22 port 5001
[ ID] Interval  Transfer  Bandwidth
[  3]  0.0-60.0 sec  6.52 GBytes  934 Mbits/sec

C:\Program Files (x86)\iperf-2.0.5-3-win32>iperf -c 172.23.23.22 -w 65k -t 60
------------------------------------------------------------
Client connecting to 172.23.23.22, TCP port 5001
TCP window size: 65.0 KByte
------------------------------------------------------------
[  3] local 172.23.23.11 port 1092 connected with 172.23.23.22 port 5001
[ ID] Interval  Transfer  Bandwidth
[  3]  0.0-60.0 sec  6.57 GBytes  941 Mbits/sec


Test with a bigger record size.
Bigger record sizes as in the 8 KB default, or do you mean really big record sizes, à la >64 KB?
Also take a look at the CLI output of zpool iostat -v 1 while you test.
Ok, will do.
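I assume something along these lines on the FreeNAS console while the benchmark runs (the pool name is a placeholder):
Code:
# per-vdev IO statistics, refreshed every second
zpool iostat -v <pool> 1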
Did you also test another iSCSI target, like your Windows server, or Nexenta / napp-it?
Nope, I have not tested any other iSCSI target. I'm also a bit reluctant to hand my precious FreeNAS pool over to Nexenta or napp-it. Who knows what they would do to it :p
 

bestboy

Contributor
Joined
Jun 8, 2014
Messages
198
I can second the above comment about the 4K ZVOL block size. While 4K is perfect for avoiding read-modify-write patterns, it can increase CPU overhead on sequential operations and data fragmentation over time. Which is more important in your case depends on your specific workload.
I chose the block size more or less based on the data that will be put on the disk. The iSCSI disks in my scenario will either hold media content like videos and music, or applications and games. I planned on an 8 KB block size for media disks and a 4 KB block size for application disks; I picked 4 KB for the application disks because I read that 4 KB is the most common IO size for a system, so I tried to match that.
But I think that your MPIO problem may be a result of the ZFS prefetcher being confused by request reordering over the different paths. Have you measured single-connection performance before setting up MPIO? What was the linear read speed over a single connection?
No, I did not really pay much attention to the single connection target. I was very focused on getting both NICs to work, but I'll do such a test and report back.
Can you set up an experiment where all ZVOL content fits into the ARC after several test runs? That would take the ZFS prefetcher and the whole disk subsystem out of the equation. Does it saturate both MPIO links in that case?
No, most of the content was not cached. I'll try to set up another test in which I make sure the data is coming from the ARC.
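To verify that in the next run I plan to watch the ARC counters on the FreeNAS box before and after re-reading the test file, roughly like this (a sketch using the stock FreeBSD ZFS sysctls):
Code:
# ARC size and hit/miss counters on the FreeNAS server
sysctl kstat.zfs.misc.arcstats.size
sysctl kstat.zfs.misc.arcstats.hits
sysctl kstat.zfs.misc.arcstats.misses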
 

bestboy

Contributor
Joined
Jun 8, 2014
Messages
198
I'm back with a new set of tests.

Test 4: Block size comparison on Windows 7 with FreeNAS 9.2.1.8
With this test I wanted to check the influence of different block sizes on the read performance.
Therefore I created 6 ZVOLs with different block sizes: 4 KB, 8 KB, 16 KB, 32 KB, 64 KB and 128 KB.
I then ran sequential read tests via Iometer against each one of them. In Iometer I created 6 "runs" of 30 seconds each with different IO sizes (4 KB, 16 KB, 32 KB, 64 KB, 128 KB and 256 KB) at a queue depth of 1. The iobw.tst test file was about 5 GB in order to fit into the ARC. The resulting "seq.read.icf" file is attached.
All read tests were executed twice in a row: the first run to prime the ARC and the second to measure ARC reads.

Here are the results:
[Six Iometer result charts, one per ZVOL block size]

Conclusion:
  • Average response time increases with IO size.
  • IOPS decrease with IO size.
  • Throughput is directly related to the number of IOPS (IOPS peak == throughput peak).
  • Average response time does not depend on the ZVOL block size.
  • Throughput seems to be independent of the ZVOL block size.
After this small excursion I decided to use a block size of 16 KB and go on with another test.

Test 5: ARC read comparison on Windows 7 with FreeNAS 9.2.1.8
With this test I wanted to check if there is a difference in throughput between reading from the pool and reading from the ARC.
In order to check that, I simply copy 2 files from the iSCSI drive to my SSD.
The first file is 14 GB, does not fit into the ARC and is read entirely from storage.
The second file is 7 GB, fits into the ARC and is read entirely from cache.




Conclusion:
  • There is a 20 MB/s difference in throughput.
  • I can get good throughput when reading from ARC.
  • The system does not seem to be stressed in either scenario.
  • When reading from the pool, the load is only moderate and distributed evenly among the 3 vdevs.
 

Attachments

  • seq.read.icf.zip
    1.4 KB

mav@

iXsystems
iXsystems
Joined
Sep 29, 2011
Messages
1,428
Did you run Test 5 with a single link or with MPIO? Because 108 MB/s is perfect for a single link (close to the ~117 MB/s of payload a 1 GbE link can carry), but not for MPIO.
 

bestboy

Contributor
Joined
Jun 8, 2014
Messages
198
Yes, tests 4 and 5 use single-link connections. Windows 7 does not support MPIO (only MC/S, which does not seem to be supported by CTL).
 

bestboy

Contributor
Joined
Jun 8, 2014
Messages
198
Here is how it looks when I read the file from test 5 via Samba:


Differences:
  • more bandwidth is used from the pool (80 MB/s with iSCSI vs. 120 MB/s with Samba)
  • CPU is more stressed with Samba
  • Pool is more stressed with iSCSI
    • queue length: 3 vs. 0
    • disk busy time: 38% vs. 22%
    • disk reads: 380 vs. 160 per second
    • arc reads: 16000 vs. 3600 per 5 seconds
If I had to interpret the difference, I would guess that Samba is using variable block sizes, mostly big 128 KB blocks, while iSCSI constantly uses small 16 KB blocks. (*)
I should probably repeat test 4 and check the influence of block sizes when actually reading from the pool... (**)

(*) Is there a way to create a histogram for the block sizes ZFS actually uses?
(**) Is there a command to manually clear the ARC like say "zfs flush arc"? I would consider it very useful for testing.
 