Low Read Performance On Multipath iSCSI With Ubuntu Initiator


Ariel_

Cadet
I compiled the complete information in the attached text file; formatting this post within the forum's limitations gave me a headache. Alternative links to the text file:

Greetings,

This is my first FreeNAS and iSCSI build, and I have a low READ performance
problem using an Ubuntu client/initiator over a multipathed iSCSI setup.

For starters these are the numbers, from dd on initiator:
Code:
    WRITE

        root@stolab002:~# echo 3 > /proc/sys/vm/drop_caches
        root@stolab002:~# dd if=/dev/zero of=/var/tmp/foo/zero.raw bs=1M count=40K
        40960+0 records in
        40960+0 records out
        42949672960 bytes (43 GB) copied, 77.6016 s, 553 MB/s

    READ

        root@stolab002:~# echo 3 > /proc/sys/vm/drop_caches
        root@stolab002:~# dd if=/var/tmp/foo/zero.raw of=/dev/null bs=1M
        40960+0 records in
        40960+0 records out
        42949672960 bytes (43 GB) copied, 333.679 s, 129 MB/s


The numbers, 553 MB/s for write and 129 MB/s for read, seem counterintuitive, as I
expected the write to be slower. The write operation manages to saturate the data
connection, while the read only uses about a quarter of the expected bandwidth.

My guess is that on reads the iSCSI traffic is limited to the bandwidth of a
single link, and that this limit is spread across all the interfaces, which only
gives a bit of extra speed over one link.
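
For reference, the Linux-side knobs that control how requests are spread across paths look roughly like this; this is a sketch, not my exact config, and the vendor/product strings are placeholders whose real values come from `multipath -ll`:
Code:
    # /etc/multipath.conf on the initiator (relevant bits only)
    defaults {
        user_friendly_names yes
    }
    devices {
        device {
            vendor                "FreeBSD"         # placeholder, check `multipath -ll`
            product               "iSCSI Disk"      # placeholder, check `multipath -ll`
            path_grouping_policy  multibus          # keep all four paths active in one group
            path_selector         "round-robin 0"
            rr_min_io_rq          1                 # requests per path before switching (rr_min_io on older kernels)
        }
    }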


I also tried running some bonnie++ benchmarks and got a rather similar result:
Code:
    Version  1.97       ------Sequential Output------ --Sequential Input- --Random-
    Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
    Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
    stolab002.lab.jkt 63G  1820  98 473788  28 98977   6 +++++ +++ 161032   6  1050  35
    Latency              5156us     307ms     195ms    1872us    2679us    2079us
    Version  1.97       ------Sequential Create------ --------Random Create--------
    stolab002.lab.jkt.cps -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
                  files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                     16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
    Latency               387us     846us     225us     401us      27us     320us
    1.97,1.97,stolab002.lab.jkt,1,1418175159,63G,,1820,98,473788,28,98977,6,+++++,+++,161032,6,1050,35,16,,,,,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,5156us,307ms,195ms,1872us,2679us,2079us,387us,846us,225us,401us,27us,320us


The read performance is still far below the write performance.

Any insight from anyone would be appreciated, thank you.



Due to the length of my post, please read the attached text file or one of these links:
Thank you
 

Attachments

  • Low Read Performance On Multipath iSCSI With Ubuntu Initiator.txt
    308.3 KB

mav@

iXsystems
This benchmark raises a lot of questions. First, you've written that you used only one disk for the test. How would you expect the NAS to reach speeds above those of its single disk? :) I would start by configuring your ZFS pool the way you plan to run it in production.

Then I noticed another issue: you are using /dev/zero to generate the data. FreeNAS has LZ4 compression enabled by default, so unless you disabled it, it will compress all of your data down to practically nothing. That may be how you got such a high write speed. You should either use more random test data or disable compression for the duration of the test.
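
For example, something along these lines on the FreeNAS side; the zvol name here is only a placeholder for yours:
Code:
    # Disable compression just for the duration of the test, then restore it.
    # Note: the property only affects newly written blocks, so rewrite the
    # test file after changing it.
    zfs set compression=off tank/iscsi-zvol
    zfs get compression tank/iscsi-zvol     # verify it took effect
    # ...run the write/read tests from the initiator...
    zfs set compression=lz4 tank/iscsi-zvol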

None of the above explains the read speed, though. I don't know how the round-robin policy is implemented in the Linux multipath driver. There is a chance that it chooses links with some additional affinity, similar to what VMware does. Is its round-robin policy completely random, or does it have optimizations that send consecutive reads through the same link? If the latter is true, that would explain it...

... and if not, here is another guess: your multipath configuration may reorder requests arriving over different paths, which confuses the ZFS read prefetcher. We've seen similar 3-4x slowdowns before 9.3-RELEASE, when requests could get reordered even over a single link. We fixed that for the single-connection case and got proper read speeds, but I am not sure how we could prevent reordering across multiple links. Can you try creating a file small enough to fit in the FreeNAS cache and reading it several times (still flushing the initiator cache)? If this guess is right, then on the second and following reads you should get full line rate, since all the data will already be in the cache and prefetch is not needed.
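
Something like this on the initiator would do it; the size and path are just an example, anything comfortably below your ARC size works:
Code:
    # Write a file small enough to stay in the FreeNAS ARC, then read it back
    # several times, flushing only the initiator's page cache between runs.
    dd if=/dev/urandom of=/var/tmp/foo/small.bin bs=1M count=4096
    for i in 1 2 3; do
        echo 3 > /proc/sys/vm/drop_caches
        dd if=/var/tmp/foo/small.bin of=/dev/null bs=1M
    done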
 

sfcredfox

Patron
Another observation: I'm on a phone today, but I thought I saw your interfaces all on the same subnet. While that's fine for some platforms, many experts assert that FreeBSD does not like it; they suggest using a separate subnet for each path. There are other posts on here with all the explanations as to why.

My multipathing between FreeNAS and ESX servers works flawlessly using that IP addressing technique. Maybe it will help, and if not, it at least rules out something people will otherwise highlight as a possible cause and keep you from finding the real one.
 

cyberjock

Inactive Account
Yeah, you have made a whole lot of mistakes in your setup. Your network isn't set up properly, you don't seem to grasp the fundamentals of benchmarking, and you should have recognized that compression invalidates your testing outright.

To be frank, because of all the mistakes you've made, your values are meaningless. You haven't provided any numbers that actually mean something and would validate your "low read performance" accusation.
 

sfcredfox

Patron
Yeah, you have made a whole lot of mistakes in your setup. Your network isn't set up properly, you don't seem to grasp the fundamentals of benchmarking, and you should have recognized that compression invalidates your testing outright.
Now, read that condescending reply in the voice of Jeff Albertson (Comic Book Guy) from The Simpsons. **Edit** "Worst...benchmark...ever!" It gets a lot funnier. Since you were apparently supposed to already know everything about everything, you shouldn't have needed to post in this forum.

However, the comment is true. Once you disable compression and re-run your tests (if you haven't already), I bet you'll see a much lower and more accurate number for writes. Search the forum for some posts by jgreco; he has some really good ones about burn-in. He describes testing your individual disks (if that's what you were trying to do) and then testing your pool in the same manner, so you can compare the two, establish baseline disk performance, and validate your intended disk configuration. A rough sketch of the pool-level part is below.
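
Something like this, run on the FreeNAS box itself so the network and iSCSI are out of the picture; the pool/dataset names are placeholders:
Code:
    # Local pool test, with compression off so the zeroes are really written.
    # Use a test size larger than RAM so the read-back isn't served from ARC.
    zfs set compression=off tank/testds
    dd if=/dev/zero of=/mnt/tank/testds/testfile bs=1M count=40960   # ~40 GiB write
    dd if=/mnt/tank/testds/testfile of=/dev/null bs=1M               # read it back
    zfs set compression=lz4 tank/testds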

You'll find a ton of posts about iSCSI - example:
https://forums.freenas.org/index.php?threads/another-person-banging-his-head-against-iscsi.23317/

They'll tell you to use mirrors if performance is really important, add loads of memory, and consider SLOG/L2ARC, but you need to advance your education in the product quite a bit before playing with those last two.

One thing I would keep in mind: it's typically not the product (FreeNAS) that's the issue, it's the hardware you used and how you configured it. People get really butt-hurt if you imply that FreeNAS is somehow at fault rather than your own choices of hardware and implementation.
 

Ariel_

Cadet
Thank you for the replies,

I was just about to use my SAMSUNG SSD as storage and one of the CORSAIR
Neutrons as L2ARC, until I got this result. I figured that even if I go all
the way SSD and manage to saturate my links, it would be a waste of expensive
hardware if reads stay slower than writes, since my expected workload will
have mostly (continuous) reads.

I'm still on a one-spinner stripe + one SSD L2ARC setup; I haven't got enough
spare hardware yet (probably not until the new year) to emulate a proper disk
configuration for real usage (i.e. RAID10-style striped mirrored vdevs).

I did try disabling compression on the zvol the first time I tried FreeNAS;
then I tried it with compression and got that 553 MB/s, with iftop showing up
to ~120 MB/s on all four interfaces. I used /dev/zero, aware of the
compression, just to see whether my links got saturated, which they were on
writes. I expected the same in the other direction; apparently not, my fault.
This is me trying to understand why that isn't the case: something between
iSCSI, multipath, the two OSes, and my own mistakes must be causing it.


So I have disabled it again, and this is what I got trying to write a bunch of
zeroes through multipathed iSCSI:
Code:
    root@stolab002:~# echo 3 > /proc/sys/vm/drop_caches
    root@stolab002:~# dd if=/dev/zero of=/var/tmp/foo/zero.raw bs=1M count=40K
    40960+0 records in
    40960+0 records out
    42949672960 bytes (43 GB) copied, 288.272 s, 149 MB/s

It bottlenecks on the disk, just as expected. This is what I got after trying
to read zero.raw back:
Code:
    root@stolab002:~# echo 3 > /proc/sys/vm/drop_caches
    root@stolab002:~# dd if=/var/tmp/foo/zero.raw of=/dev/null bs=1M
    40960+0 records in
    40960+0 records out
    42949672960 bytes (43 GB) copied, 422.036 s, 102 MB/s

As you can see, the read is still a tad slower than the write.

As per your suggestion, I generated a file called RANDOM.raw by pulling from
/dev/urandom; its size is 5.7G (6118137856 bytes), and I drop the OS caches
every time before reading it through iSCSI.
Code:
    root@stolab002:~# md5sum RANDOM.raw
    c8e059b86541ffbabb950d246c525694  RANDOM.raw

    root@stolab002:~# hd -n 256 RANDOM.raw
    00000000  82 8a 69 73 08 29 4d 59  99 46 58 26 a4 67 df 01  |..is.)MY.FX&.g..|
    00000010  61 97 0e f8 db eb 57 3e  a0 e3 28 27 a5 7f cb ab  |a.....W>..('....|
    00000020  f0 16 d8 2a 4b 75 16 68  bc 17 73 e4 55 ab 2a 8f  |...*Ku.h..s.U.*.|
    00000030  aa c4 c4 9d 21 ea 1c 23  2b 3f d1 5a 8c 35 12 9d  |....!..#+?.Z.5..|
    00000040  6f b6 a7 30 16 b4 1b df  cc 5e c0 d2 58 8a 5a 62  |o..0.....^..X.Zb|
    00000050  87 dc 1b e5 31 ec 8e 7b  8c 5b 93 57 fc a3 9d 8d  |....1..{.[.W....|
    00000060  7b fa 35 80 1c 9a 09 2c  a4 aa 83 99 bb 61 1c 21  |{.5....,.....a.!|
    00000070  57 4d 09 f2 ac d9 fa a6  0d 30 e6 61 a2 55 5b ec  |WM.......0.a.U[.|
    00000080  a6 8b e8 88 61 0c 08 d1  0d b7 57 0e 3a 2e aa f1  |....a.....W.:...|
    00000090  84 86 6c ae ea 5a 54 dc  92 ca ff 04 37 23 06 02  |..l..ZT.....7#..|
    000000a0  c5 ad 33 28 45 5f ac 8d  5e f9 57 aa f9 0e bb 79  |..3(E_..^.W....y|
    000000b0  62 dc e6 c8 ea 61 28 14  7f 86 ae d0 69 88 3d 5d  |b....a(.....i.=]|
    000000c0  11 37 f7 b5 41 39 7f 7a  41 68 bb 80 a1 ca c0 98  |.7..A9.zAh......|
    000000d0  7c d2 33 0b 92 ab b5 4e  81 fc 78 c7 a8 c7 b0 39  ||.3....N..x....9|
    000000e0  c0 9b b5 05 a0 8f 4e 4a  d1 55 f9 d2 05 56 23 9d  |......NJ.U...V#.|
    000000f0  06 d0 4c 4a 27 d6 e8 d4  4e ea 84 22 7f 99 f3 45  |..LJ'...N.."...E|
    00000100
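
For reference, RANDOM.raw was generated roughly like this (the exact count is from memory, so treat it as a sketch):
Code:
    # ~5.7 GiB of incompressible data pulled from /dev/urandom
    dd if=/dev/urandom of=RANDOM.raw bs=1M count=5835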

A couple of writes using RANDOM.raw go like this:
Code:
    root@stolab002:~# rm /var/tmp/foo/RANDOM.raw
    root@stolab002:~# echo 3 > /proc/sys/vm/drop_caches
    root@stolab002:~# dd if=RANDOM.raw of=/var/tmp/foo/RANDOM.raw bs=1M
    5834+1 records in
    5834+1 records out
    6118137856 bytes (6.1 GB) copied, 14.698 s, 416 MB/s

My / partition is on a CORSAIR Neutron SSD (in a Linux RAID-1), so the source
should be fast, but what the hell, 416 MB/s? Is there some ZFS/iSCSI/*BSD
caching going on?

This is the output of mount, just to see if I'm still writing over iSCSI:
Code:
    root@stolab002:~# mount
    /dev/md0 on / type ext4 (rw,noatime,discard,user_xattr,errors=remount-ro)
    proc on /proc type proc (rw,noexec,nosuid,nodev)
    sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
    none on /sys/fs/cgroup type tmpfs (rw)
    none on /sys/fs/fuse/connections type fusectl (rw)
    none on /sys/kernel/debug type debugfs (rw)
    none on /sys/kernel/security type securityfs (rw)
    none on /sys/firmware/efi/efivars type efivarfs (rw)
    udev on /dev type devtmpfs (rw,mode=0755)
    devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=0620)
    tmpfs on /run type tmpfs (rw,noexec,nosuid,size=10%,mode=0755)
    none on /run/lock type tmpfs (rw,noexec,nosuid,nodev,size=5242880)
    none on /run/shm type tmpfs (rw,nosuid,nodev)
    none on /run/user type tmpfs (rw,noexec,nosuid,nodev,size=104857600,mode=0755)
    none on /sys/fs/pstore type pstore (rw)
    /dev/sdb1 on /boot/efi type vfat (rw)
    rpc_pipefs on /run/rpc_pipefs type rpc_pipefs (rw)
    systemd on /sys/fs/cgroup/systemd type cgroup (rw,noexec,nosuid,nodev,none,name=systemd)
    /dev/mapper/iSCSITestDisk-part1 on /var/tmp/foo type ext2 (rw)

This is the tail of the `arcstat.py 1` output during the write:
Code:
        time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
    11:17:19     2     0      0     0    0     0    0     0    0    23G   23G
    11:17:20     8     0      0     0    0     0    0     0    0    23G   23G
    11:17:21    13     1      7     1    7     0    0     0    0    23G   23G
    11:17:22   13K     3      0     3    0     0    0     0    0    23G   23G
    11:17:23   22K   443      1   443    1     0    0   442    2    23G   23G
    11:17:24   12K   356      2   356    2     0    0   351    2    24G   24G
    11:17:25   16K   461      2   459    2     2   25   453    2    24G   24G
    11:17:27   19K   533      2   524    2     9   27   524    2    24G   23G
    11:17:28   18K   537      2   524    2    13   38   522    2    23G   23G
    11:17:29   23K   514      2   514    2     0    0   514    2    23G   23G
    11:17:30   29K   496      1   496    1     0    0   496    1    23G   23G
    11:17:31   34K   453      1   453    1     0    0   453    1    23G   23G
    11:17:32     0     0      0     0    0     0    0     0    0    23G   23G
    11:17:33   21K   288      1   288    1     0    0   288    1    23G   23G
    11:17:34   16K   217      1   217    1     0    0   217    1    23G   23G
    11:17:36   11K   121      1   121    1     0    0   121    1    23G   23G
    11:17:37  6.6K    61      0    61    0     0    0    60    0    23G   23G
    11:17:38    27     0      0     0    0     0    0     0    0    23G   23G
    11:17:39    61     0      0     0    0     0    0     0    0    23G   23G
    11:17:40     0     0      0     0    0     0    0     0    0    23G   23G
    11:17:41   11K   258      2   258    2     0    0   258    2    23G   23G
        time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
    11:17:42   28K   352      1   352    1     0    0   352    1    23G   23G
    11:17:43  1.2K    16      1    16    1     0    0    16    1    23G   23G
    11:17:44   26K   251      0   251    0     0    0   251    0    23G   23G
    11:17:45     0     0      0     0    0     0    0     0    0    23G   23G
    11:17:47   12K   136      1   136    1     0    0   136    1    23G   23G
    11:17:48     0     0      0     0    0     0    0     0    0    23G   23G
    11:17:49   30K   321      1   321    1     0    0   321    1    23G   23G
    11:17:50     0     0      0     0    0     0    0     0    0    23G   23G
    11:17:51     0     0      0     0    0     0    0     0    0    23G   23G

I read it back a few times and basically got this:
Code:
    root@stolab002:~# for x in {1..3}; do echo 3 > /proc/sys/vm/drop_caches; dd if=/var/tmp/foo/RANDOM.raw of=/dev/null bs=1M; done
    5834+1 records in
    5834+1 records out
    6118137856 bytes (6.1 GB) copied, 38.1842 s, 160 MB/s
    5834+1 records in
    5834+1 records out
    6118137856 bytes (6.1 GB) copied, 38.0737 s, 161 MB/s
    5834+1 records in
    5834+1 records out
    6118137856 bytes (6.1 GB) copied, 38.1503 s, 160 MB/s

This is the tail of the `arcstat.py 1` output during the reads:
Code:
        time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
    11:44:09   21K     0      0     0    0     0    0     0    0    23G   23G
    11:44:10   21K     0      0     0    0     0    0     0    0    23G   23G
    11:44:11   21K     0      0     0    0     0    0     0    0    23G   23G
    11:44:12   21K     0      0     0    0     0    0     0    0    23G   23G
    11:44:13   21K     0      0     0    0     0    0     0    0    23G   23G
    11:44:15   21K     0      0     0    0     0    0     0    0    23G   23G
    11:44:16   21K     0      0     0    0     0    0     0    0    23G   23G
    11:44:17   21K     0      0     0    0     0    0     0    0    23G   23G
    11:44:18   21K     0      0     0    0     0    0     0    0    23G   23G
    11:44:19   21K     0      0     0    0     0    0     0    0    23G   23G
    11:44:20   21K     0      0     0    0     0    0     0    0    23G   23G
    11:44:21   21K     0      0     0    0     0    0     0    0    23G   23G
    11:44:22   21K     0      0     0    0     0    0     0    0    23G   23G
    11:44:23   21K     0      0     0    0     0    0     0    0    23G   23G
    11:44:25   21K     0      0     0    0     0    0     0    0    23G   23G
    11:44:26   21K     0      0     0    0     0    0     0    0    23G   23G
    11:44:27   21K     0      0     0    0     0    0     0    0    23G   23G
    11:44:28   21K     0      0     0    0     0    0     0    0    23G   23G
    11:44:29   21K     0      0     0    0     0    0     0    0    23G   23G
    11:44:30   21K     0      0     0    0     0    0     0    0    23G   23G
    11:44:31   21K     0      0     0    0     0    0     0    0    23G   23G
        time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
    11:44:32  2.2K     0      0     0    0     0    0     0    0    23G   23G
    11:44:33     0     0      0     0    0     0    0     0    0    23G   23G
    11:44:34     0     0      0     0    0     0    0     0    0    23G   23G
    11:44:36     0     0      0     0    0     0    0     0    0    23G   23G
    11:44:37     0     0      0     0    0     0    0     0    0    23G   23G
    11:44:38     0     0      0     0    0     0    0     0    0    23G   23G
    11:44:39     0     0      0     0    0     0    0     0    0    23G   23G
    11:44:40     0     0      0     0    0     0    0     0    0    23G   23G
    11:44:41     0     0      0     0    0     0    0     0    0    23G   23G
    11:44:42     0     0      0     0    0     0    0     0    0    23G   23G
    11:44:43     0     0      0     0    0     0    0     0    0    23G   23G

I'm new to ZFS, but I thought it would exhaust the ARC first and then start
using L2ARC? I have L2ARC on a SAMSUNG SSD anyway (that should be fast enough,
right?), but since the file is 5.7 GB against a 23 GB ARC with a 0% miss rate,
RANDOM.raw should reside entirely in ARC, right? Am I wrong?
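
If it helps, I can also diff the raw ARC/L2ARC counters on the FreeNAS side around a single read pass, something like this:
Code:
    # Snapshot these before and after one read; the deltas show whether the
    # read was served from ARC, from L2ARC, or went to disk.
    sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses \
           kstat.zfs.misc.arcstats.l2_hits kstat.zfs.misc.arcstats.l2_misses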

My takeaway is that continuous reads are still not as fast as continuous
writes, even though I have enough ARC *and* L2ARC. What I'm also still not
sure about is how I can get 416 MB/s of continuous writes of random data to an
uncompressed zvol.
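
I plan to double-check on the FreeNAS side while the write runs, roughly like this (the pool/zvol names are placeholders for mine):
Code:
    # Confirm the zvol settings and watch how much actually reaches the disks;
    # if `zpool iostat` shows far less than 416 MB/s, the rest is sitting in
    # ZFS's in-RAM write buffering until the next transaction group flush.
    zfs get compression,compressratio,sync tank/iscsi-zvol
    zpool iostat -v tank 1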

From these numbers, backing up big files (e.g. pg_basebackup dumps) looks
great, but reading them back can be slow.

I also tried bonnie++, which I think is a more accurate benchmark than dd-ing
bytes back and forth:
Code:
    root@stolab002:~# echo 3 > /proc/sys/vm/drop_caches
    root@stolab002:~# bonnie++ -d /var/tmp/foo -u root
    Using uid:0, gid:0.
    Writing a byte at a time...done
    Writing intelligently...done
    Rewriting...done
    Reading a byte at a time...done
    Reading intelligently...done
    start 'em...done...done...done...done...done...
    Create files in sequential order...done.
    Stat files in sequential order...done.
    Delete files in sequential order...done.
    Create files in random order...done.
    Stat files in random order...done.
    Delete files in random order...done.
    Version  1.97       ------Sequential Output------ --Sequential Input- --Random-
    Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
    Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
    stolab002.lab.jkt 63G  1789  98 65869   4 47182   3  4557  65 128130   4 486.3  17
    Latency             16177us    4567ms    9921ms     144ms    2019ms     141ms
    Version  1.97       ------Sequential Create------ --------Random Create--------
    stolab002.lab.jkt   -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
                  files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                     16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
    Latency               232us     917us     240us      67us      55us      24us
    1.97,1.97,stolab002.lab.jkt.cpssoft,1,1418274410,63G,,1789,98,65869,4,47182,3,4557,65,128130,4,486.3,17,16,,,,,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,16177us,4567ms,9921ms,144ms,2019ms,141ms,232us,917us,240us,67us,55us,24us


At least it looks 'normal', in that write is much slower than read, and the
read is near the WD Red's spec. The random-seek number also seems right for a
spinning disk. :D

Anyway, I forgot to include arc_summary output the last time, so here it is:
Code:
    System Memory:
   
            0.08%   25.86   MiB Active,     1.11%   349.39  MiB Inact
            93.26%  28.79   GiB Wired,      0.05%   17.34   MiB Cache
            5.49%   1.70    GiB Free,       0.00%   916.00  KiB Gap
   
            Real Installed:                         32.00   GiB
            Real Available:                 99.45%  31.82   GiB
            Real Managed:                   97.00%  30.87   GiB
   
            Logical Total:                          32.00   GiB
            Logical Used:                   93.58%  29.95   GiB
            Logical Free:                   6.42%   2.05    GiB
   
    Kernel Memory:                                  1.04    GiB
            Data:                           97.78%  1.01    GiB
            Text:                           2.22%   23.59   MiB
   
    Kernel Memory Map:                              28.00   GiB
            Size:                           72.85%  20.39   GiB
            Free:                           27.15%  7.60    GiB
                                                                    Page:  1
    ------------------------------------------------------------------------
   
    ARC Summary: (HEALTHY)
            Storage pool Version:                   5000
            Filesystem Version:                     5
            Memory Throttle Count:                  0
   
    ARC Misc:
            Deleted:                                39.35m
            Recycle Misses:                         6.49m
            Mutex Misses:                           12.81k
            Evict Skips:                            12.81k
   
    ARC Size:                               79.82%  23.84   GiB
            Target Size: (Adaptive)         79.92%  23.87   GiB
            Min Size (Hard Limit):          12.50%  3.73    GiB
            Max Size (High Water):          8:1     29.87   GiB
   
    ARC Size Breakdown:
            Recently Used Cache Size:       92.61%  22.11   GiB
            Frequently Used Cache Size:     7.39%   1.76    GiB
   
    ARC Hash Breakdown:
            Elements Max:                           15.73m
            Elements Current:               99.94%  15.73m
            Collisions:                             37.18m
            Chain Max:                              16
            Chains:                                 3.68m
                                                                    Page:  2
    ------------------------------------------------------------------------
   
    ARC Total accesses:                                     99.19m
            Cache Hit Ratio:                79.18%  78.55m
            Cache Miss Ratio:               20.82%  20.65m
            Actual Hit Ratio:               76.91%  76.29m
   
            Data Demand Efficiency:         99.25%  39.79m
            Data Prefetch Efficiency:       15.06%  23.67m
   
            CACHE HITS BY CACHE LIST:
              Most Recently Used:           35.25%  27.69m
              Most Frequently Used:         61.88%  48.60m
              Most Recently Used Ghost:     3.11%   2.44m
              Most Frequently Used Ghost:   0.17%   136.33k
   
            CACHE HITS BY DATA TYPE:
              Demand Data:                  50.28%  39.49m
              Prefetch Data:                4.54%   3.56m
              Demand Metadata:              45.16%  35.48m
              Prefetch Metadata:            0.02%   15.29k
   
            CACHE MISSES BY DATA TYPE:
              Demand Data:                  1.44%   297.87k
              Prefetch Data:                97.39%  20.11m
              Demand Metadata:              1.16%   239.08k
              Prefetch Metadata:            0.01%   2.85k
                                                                    Page:  3
    ------------------------------------------------------------------------
   
    L2 ARC Summary: (HEALTHY)
            Passed Headroom:                        793.36k
            Tried Lock Failures:                    123.44k
            IO In Progress:                         0
            Low Memory Aborts:                      39
            Free on Write:                          382.61k
            Writes While Full:                      10.37k
            R/W Clashes:                            565
            Bad Checksums:                          0
            IO Errors:                              0
            SPA Mismatch:                           17.20m
   
    L2 ARC Size: (Adaptive)                         96.83   GiB
            Header Size:                    2.53%   2.45    GiB
   
    L2 ARC Evicts:
            Lock Retries:                           0
            Upon Reading:                           0
   
    L2 ARC Breakdown:                               20.65m
            Hit Ratio:                      0.93%   192.39k
            Miss Ratio:                     99.07%  20.45m
            Feeds:                                  24.49k
   
    L2 ARC Buffer:
            Bytes Scanned:                          1.34    TiB
            Buffer Iterations:                      24.49k
            List Iterations:                        1.50m
            NULL List Iterations:                   195.71k
   
    L2 ARC Writes:
            Writes Sent:                    100.00% 15.00k
                                                                    Page:  4
    ------------------------------------------------------------------------
   
    File-Level Prefetch: (HEALTHY)
    DMU Efficiency:                                 32.47m
            Hit Ratio:                      91.31%  29.65m
            Miss Ratio:                     8.69%   2.82m
   
            Colinear:                               2.82m
              Hit Ratio:                    0.02%   694
              Miss Ratio:                   99.98%  2.82m
   
            Stride:                                 3.10m
              Hit Ratio:                    99.57%  3.09m
              Miss Ratio:                   0.43%   13.47k
   
    DMU Misc:
            Reclaim:                                2.82m
              Successes:                    0.60%   16.98k
              Failures:                     99.40%  2.80m
   
            Streams:                                26.61m
              +Resets:                      0.19%   51.14k
              -Resets:                      99.81%  26.56m
              Bogus:                                0
                                                                    Page:  5
    ------------------------------------------------------------------------
   
    VDEV Cache Summary:                             120.80k
            Hit Ratio:                      45.14%  54.53k
            Miss Ratio:                     50.44%  60.93k
            Delegations:                    4.42%   5.34k
                                                                    Page:  6
    ------------------------------------------------------------------------
   
    ZFS Tunable (sysctl):
            kern.maxusers                           2372
            vm.kmem_size                            33146601472
            vm.kmem_size_scale                      1
            vm.kmem_size_min                        0
            vm.kmem_size_max                        329853485875
            vfs.zfs.l2c_only_size                   87086353920
            vfs.zfs.mfu_ghost_data_lsize            16799485952
            vfs.zfs.mfu_ghost_metadata_lsize        110067712
            vfs.zfs.mfu_ghost_size                  16909553664
            vfs.zfs.mfu_data_lsize                  6137709568
            vfs.zfs.mfu_metadata_lsize              5758976
            vfs.zfs.mfu_size                        6152352768
            vfs.zfs.mru_ghost_data_lsize            3843860480
            vfs.zfs.mru_ghost_metadata_lsize        422701056
            vfs.zfs.mru_ghost_size                  4266561536
            vfs.zfs.mru_data_lsize                  14188542464
            vfs.zfs.mru_metadata_lsize              201785344
            vfs.zfs.mru_size                        15272706560
            vfs.zfs.anon_data_lsize                 0
            vfs.zfs.anon_metadata_lsize             0
            vfs.zfs.anon_size                       32768
            vfs.zfs.l2arc_norw                      1
            vfs.zfs.l2arc_feed_again                1
            vfs.zfs.l2arc_noprefetch                1
            vfs.zfs.l2arc_feed_min_ms               200
            vfs.zfs.l2arc_feed_secs                 1
            vfs.zfs.l2arc_headroom                  2
            vfs.zfs.l2arc_write_boost               8388608
            vfs.zfs.l2arc_write_max                 8388608
            vfs.zfs.arc_meta_limit                  8018214912
            vfs.zfs.arc_meta_used                   5275740552
            vfs.zfs.arc_shrink_shift                5
            vfs.zfs.arc_average_blocksize           8192
            vfs.zfs.arc_min                         4009107456
            vfs.zfs.arc_max                         32072859648
            vfs.zfs.dedup.prefetch                  1
            vfs.zfs.mdcomp_disable                  0
            vfs.zfs.nopwrite_enabled                1
            vfs.zfs.zfetch.array_rd_sz              1048576
            vfs.zfs.zfetch.block_cap                256
            vfs.zfs.zfetch.min_sec_reap             2
            vfs.zfs.zfetch.max_streams              8
            vfs.zfs.prefetch_disable                0
            vfs.zfs.delay_scale                     500000
            vfs.zfs.delay_min_dirty_percent         60
            vfs.zfs.dirty_data_sync                 67108864
            vfs.zfs.dirty_data_max_percent          10
            vfs.zfs.dirty_data_max_max              4294967296
            vfs.zfs.dirty_data_max                  3417086771
            vfs.zfs.free_max_blocks                 131072
            vfs.zfs.no_scrub_prefetch               0
            vfs.zfs.no_scrub_io                     0
            vfs.zfs.resilver_min_time_ms            3000
            vfs.zfs.free_min_time_ms                1000
            vfs.zfs.scan_min_time_ms                1000
            vfs.zfs.scan_idle                       50
            vfs.zfs.scrub_delay                     4
            vfs.zfs.resilver_delay                  2
            vfs.zfs.top_maxinflight                 32
            vfs.zfs.mg_fragmentation_threshold      85
            vfs.zfs.mg_noalloc_threshold            0
            vfs.zfs.condense_pct                    200
            vfs.zfs.metaslab.bias_enabled           1
            vfs.zfs.metaslab.lba_weighting_enabled  1
            vfs.zfs.metaslab.fragmentation_factor_enabled1
            vfs.zfs.metaslab.preload_enabled        1
            vfs.zfs.metaslab.preload_limit          3
            vfs.zfs.metaslab.unload_delay           8
            vfs.zfs.metaslab.load_pct               50
            vfs.zfs.metaslab.min_alloc_size         33554432
            vfs.zfs.metaslab.df_free_pct            4
            vfs.zfs.metaslab.df_alloc_threshold     131072
            vfs.zfs.metaslab.debug_unload           0
            vfs.zfs.metaslab.debug_load             0
            vfs.zfs.metaslab.fragmentation_threshold70
            vfs.zfs.metaslab.gang_bang              16777217
            vfs.zfs.spa_load_verify_data            1
            vfs.zfs.spa_load_verify_metadata        1
            vfs.zfs.spa_load_verify_maxinflight     10000
            vfs.zfs.ccw_retry_interval              300
            vfs.zfs.check_hostid                    1
            vfs.zfs.spa_asize_inflation             24
            vfs.zfs.deadman_enabled                 1
            vfs.zfs.deadman_checktime_ms            5000
            vfs.zfs.deadman_synctime_ms             1000000
            vfs.zfs.recover                         0
            vfs.zfs.space_map_blksz                 32768
            vfs.zfs.trim.max_interval               1
            vfs.zfs.trim.timeout                    30
            vfs.zfs.trim.txg_delay                  32
            vfs.zfs.trim.enabled                    1
            vfs.zfs.txg.timeout                     5
            vfs.zfs.min_auto_ashift                 9
            vfs.zfs.max_auto_ashift                 13
            vfs.zfs.vdev.trim_max_pending           64
            vfs.zfs.vdev.trim_max_bytes             2147483648
            vfs.zfs.vdev.metaslabs_per_vdev         200
            vfs.zfs.vdev.cache.bshift               16
            vfs.zfs.vdev.cache.size                 25165824
            vfs.zfs.vdev.cache.max                  16384
            vfs.zfs.vdev.larger_ashift_minimal      0
            vfs.zfs.vdev.bio_delete_disable         0
            vfs.zfs.vdev.bio_flush_disable          0
            vfs.zfs.vdev.trim_on_init               1
            vfs.zfs.vdev.mirror.non_rotating_seek_inc1
            vfs.zfs.vdev.mirror.non_rotating_inc    0
            vfs.zfs.vdev.mirror.rotating_seek_offset1048576
            vfs.zfs.vdev.mirror.rotating_seek_inc   5
            vfs.zfs.vdev.mirror.rotating_inc        0
            vfs.zfs.vdev.write_gap_limit            4096
            vfs.zfs.vdev.read_gap_limit             32768
            vfs.zfs.vdev.aggregation_limit          131072
            vfs.zfs.vdev.trim_max_active            64
            vfs.zfs.vdev.trim_min_active            1
            vfs.zfs.vdev.scrub_max_active           2
            vfs.zfs.vdev.scrub_min_active           1
            vfs.zfs.vdev.async_write_max_active     10
            vfs.zfs.vdev.async_write_min_active     1
            vfs.zfs.vdev.async_read_max_active      3
            vfs.zfs.vdev.async_read_min_active      1
            vfs.zfs.vdev.sync_write_max_active      10
            vfs.zfs.vdev.sync_write_min_active      10
            vfs.zfs.vdev.sync_read_max_active       10
            vfs.zfs.vdev.sync_read_min_active       10
            vfs.zfs.vdev.max_active                 1000
            vfs.zfs.vdev.async_write_active_max_dirty_percent60
            vfs.zfs.vdev.async_write_active_min_dirty_percent30
            vfs.zfs.snapshot_list_prefetch          0
            vfs.zfs.version.ioctl                   4
            vfs.zfs.version.zpl                     5
            vfs.zfs.version.spa                     5000
            vfs.zfs.version.acl                     1
            vfs.zfs.debug                           0
            vfs.zfs.super_owner                     0
            vfs.zfs.cache_flush_disable             0
            vfs.zfs.zil_replay_disable              0
            vfs.zfs.sync_pass_rewrite               2
            vfs.zfs.sync_pass_dont_compress         5
            vfs.zfs.sync_pass_deferred_free         2
            vfs.zfs.zio.use_uma                     1
            vfs.zfs.vol.unmap_enabled               1
            vfs.zfs.vol.mode                        2
                                                                    Page:  7
    ------------------------------------------------------------------------



So, what am I doing wrong?
 

Ariel_

Cadet
Another observation: I'm on a phone today, but I thought I saw your interfaces all on the same subnet. While that's fine for some platforms, many experts assert that FreeBSD does not like it; they suggest using a separate subnet for each path. There are other posts on here with all the explanations as to why.

My multipathing between FreeNAS and ESX servers works flawlessly using that IP addressing technique. Maybe it will help, and if not, it at least rules out something people will otherwise highlight as a possible cause and keep you from finding the real one.

Well, on the FreeNAS server I set up the 4 interfaces like this:
  • igb0: 10.10.121.1/24
  • igb1: 10.10.122.1/24
  • igb2: 10.10.123.1/24
  • igb3: 10.10.124.1/24
While on the Ubuntu initiator the 4 interfaces are set up as:
  • p4p1: 10.10.121.2/24
  • p4p2: 10.10.122.2/24
  • p4p3: 10.10.123.2/24
  • p4p4: 10.10.124.2/24
Each interface pair (e.g. igb0 and p4p1) is in its own subnet (with the last octet .1 for the server and .2 for the initiator), and each pair/path is also separated by untagged VLANs on the switch. But I'm not sure about the "on the same subnet" part, care to elaborate?
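
For completeness, the initiator sessions are bound one per interface with open-iscsi ifaces, roughly along these lines; this is reconstructed from memory, so treat it as a sketch, and the target IQN is omitted:
Code:
    # One open-iscsi iface per NIC, then discover and log in through each one
    # (shown for the first pair only; the other three are analogous).
    iscsiadm -m iface -I iface_p4p1 --op=new
    iscsiadm -m iface -I iface_p4p1 --op=update -n iface.net_ifacename -v p4p1
    iscsiadm -m discovery -t sendtargets -p 10.10.121.1 -I iface_p4p1
    iscsiadm -m node -p 10.10.121.1 -I iface_p4p1 --login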
 

sfcredfox

Patron
Well, on the FreeNAS server I set up the 4 interfaces like this:
  • igb0: 10.10.121.1/24
  • igb1: 10.10.122.1/24
  • igb2: 10.10.123.1/24
  • igb3: 10.10.124.1/24
While on the Ubuntu initiator the 4 interfaces are set up as:
  • p4p1: 10.10.121.2/24
  • p4p2: 10.10.122.2/24
  • p4p3: 10.10.123.2/24
  • p4p4: 10.10.124.2/24
Each interface pair (e.g. igb0 and p4p1) is in its own subnet (with the last octet .1 for the server and .2 for the initiator), and each pair/path is also separated by untagged VLANs on the switch. But I'm not sure about the "on the same subnet" part, care to elaborate?
Indeed, disregard. I re-read your post and you are correct. I thought for some reason you wrote x.x.x.121-124. Well done.

As for L2ARC, I've read in a few places that it's counterproductive to use one until you have around 64GB of RAM or more.
https://forums.freenas.org/index.php?threads/another-person-banging-his-head-against-iscsi.23317/
https://forums.freenas.org/index.ph...-esxi-5-5-slow-performance.24404/#post-150731

I don't personally use L2ARC because of this, so I'll leave the L2ARC support to someone else.

Anyone more qualified: I'm wondering if his read performance is actually unexpected for a pool with a single disk?
 

cyberjock

Inactive Account
It is very counterproductive to use an L2ARC with less than about 64GB of RAM. Ask the hotshots that bought $200 SSDs for their system with less than 32GB of RAM and saw performance decrease. Yes, it actually went down. :P I've got a system that maxes out at 32GB of RAM, and despite having hardware that I could put in the system as a very effective L2ARC I have deliberately made the decision to leave it out. Why? Because I don't have enough RAM to make it a net gain in performance.

You also touched on the subject briefly when you asked what is cached and where. This is yet another side of ZFS (and caches in general) where benchmarking is undermined by the caches. If you aren't an expert on where the caches are, how much they cache, how they function internally, etc., you'll lose the game before you start. I'm not saying that benchmarking is pointless, but again, unless you are a pro you'll be unable to get numbers that actually mean what you think they mean. Add to that the hybrid nature of L2ARC and SLOGs and it's basically impossible to benchmark ZFS without being a professional in ZFS.
 

sfcredfox

Patron
If you aren't an expert on where the caches are, how much they cache, how they function internally, etc., you'll lose the game before you start. I'm not saying that benchmarking is pointless, but again, unless you are a pro you'll be unable to get numbers that actually mean what you think they mean. Add to that the hybrid nature of L2ARC and SLOGs and it's basically impossible to benchmark ZFS without being a professional in ZFS.
OK, it's hard... And now that we've spent plenty of time explaining how hard it is, maybe something useful, like some pointers for understanding it. :)

Here's a link that was helpful:
https://forums.freenas.org/index.php?threads/notes-on-performance-benchmarks-and-cache.981/
It will help you get some numbers that are useful by taking into account some of those caches.

I know there's a ton more.

OP, cruise into the performance section and check out the stickies. jgreco posts tons of extremely useful material. His posts will also tell you 'stuff is hard', but they explain how to do things too. Hope this helps.
 

cyberjock

Inactive Account
That link is from 2011, and things have changed somewhat since then. There is no "guide" to benchmarking ZFS because it would easily be a 300+ page book sold on Amazon for $100. It's not trivial to even think about writing such things.
 

sfcredfox

Patron
That link is from 2011, and things have changed somewhat since then. There is no "guide" to benchmarking ZFS because it would easily be a 300+ page book sold on Amazon for $100. It's not trivial to even think about writing such things.
There's enough smart people around here and enough articles, I'm sure we can get home users in the ball park without the sub-atomic dissertation. :)

They just need to know simple things, like turning off compression when you're writing zeros. I don't think they're trying to tune a three-rack system with a couple hundred disks for an enterprise like your pro-support customers.
 

cyberjock

Inactive Account
There's enough smart people around here and enough articles, I'm sure we can get home users in the ball park without the sub-atomic dissertation. :)

They just need to know simple things, like turning off compression when you're writing zeros. I don't think they're trying to tune a three-rack system with a couple hundred disks for an enterprise like your pro-support customers.

That's what the dd is for. It's about the simplest you can get without having to get a dissertation.

There's a reason why I don't try to benchmark pools and I definitely don't try to explain how to make it work. People will take those numbers that are ill-conceived to begin with and then try to make the argument that FreeNAS sucks because they did the same benchmark on NTFS or ext4 and got higher numbers.

Reality check: if you're buying hardware that we recommend and you are using 1Gb LAN, your bottleneck is almost certainly going to be your LAN. PERIOD. Benchmarking be damned because your bottleneck isn't the pool.

If you're going to argue that you want a document for "home users that gets in the ball park without the sub-atomic dissertation" then we did that. Buy the hardware we recommend and your Gb LAN will be saturated. No benchmarking needed.
 

sfcredfox

Patron
If you're going to argue that you want a document for "home users that gets in the ball park without the sub-atomic dissertation" then we did that.
I see your point; that's usually why I include links to things already written. I'm always looking for links to send people to instead of having to explain these things over and over, and I can see how you'd get tired of it.
 

cyberjock

Inactive Account
There is a certain amount of "tired of it" involved with my posts. The real problem is that people don't try to search. Then you add in all the BS like "I know I bought the wrong hardware and don't even meet the minimum requirements, but clearly I can tune this box with 4Gb of RAM to do 10Gb speeds, right?" and then the ID10T errors, user errors, and flat out bugs that exist. It's a nightmare.

Most people aren't remotely appreciative of how complex ZFS is. I talk the talk and walk the walk, but you get some of the other guys like the ones I met at the Meet FreeBSD conference in November and I looked like a damn-lame noob. The rabbit hole is *so* deep and you're left with two options: don't look deeper and let the fact that this stuff works keep you happy, or try to jump down the rabbit hole (you'd better be a C programmer if you plan to do this).

If you want to look deeper, feel free. Don't expect forum support with this if you really want to go that deep.

If you want to keep it shallow, then feel free to do that too. Our stickies pretty much assume you want the shallow approach. If you follow the stickies and build based off the stickies alone, you'll get a very workable and very well-performing build.

But people cry that they don't want the shallow approach, yet they aren't programmers and aren't willing to pay the extra cost for appropriate hardware. So they do everything wrong, want a free degree in ZFS knowledge, and want it yesterday. Yeah, a forum, any forum, won't really cater to that crowd. ;) I ignore those people as soon as I figure out who they are. I know I can't provide that, I don't want to provide that, and I've got better things to do than argue with them over it. :)
 