Slow reads from PCIe L2ARC


woyteck (Dabbler)
I have a problem with my L2ARC.

It seems to have reads capped at ~175MB/s.
Why am I bothered? Because it's a 2TB Intel P3520, a PCIe-based NVMe drive.

When testing reads directly from /dev/nvd0, I easily and consistently get around 600MB/s:

# dd if=/dev/nvd0 of=/dev/null bs=128k count=100000
100000+0 records in
100000+0 records out
13107200000 bytes transferred in 21.553133 secs (608134336 bytes/sec)


However, I noticed that with a smaller block size, reads from this SSD are slower:
# dd if=/dev/nvd0 of=/dev/null bs=4k count=10000
10000+0 records in
10000+0 records out
40960000 bytes transferred in 0.169981 secs (240968499 bytes/sec)
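
(The same comparison can also be run as a quick, read-only sweep over block sizes - Bourne shell syntax:)

# for bs in 4k 8k 64k 128k 1m; do echo bs=$bs; dd if=/dev/nvd0 of=/dev/null bs=$bs count=10000; done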


Is there a configuration change that would give this drive better throughput?
The problem is that when data is read from ARC, I get line speed on the 10Gbps interface (about 1.1GB/s via NFS), but when data is read from L2ARC I only get 250-300MB/s, and even that is only because it also partially reads from the spinning disks.

I've done some tests on a lightly used server; I ran everything 3 times and I'm giving the middle result of each run.
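Between runs the cache properties were toggled with zfs set, roughly like this (tank/test is just an example name for the dataset mounted at /test):

# zfs set primarycache=all tank/test
# zfs set secondarycache=all tank/test
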
When data is in ARC and the filesystem is set to primarycache=all and secondarycache=all:
# echo 3 > /proc/sys/vm/drop_caches;dd if=/test/randomfile.bin of=/dev/null bs=1M
9129+1 records in
9129+1 records out
9573093376 bytes (9.6 GB, 8.9 GiB) copied, 10.6273 s, 901 MB/s


When the filesystem is set to primarycache=none and secondarycache=none:
# echo 3 > /proc/sys/vm/drop_caches;dd if=/test/randomfile.bin of=/dev/null bs=1M
9129+1 records in
9129+1 records out
9573093376 bytes (9.6 GB, 8.9 GiB) copied, 44.2833 s, 216 MB/s


When the filesystem is set to primarycache=metadata and secondarycache=none, and the metadata is in ARC:
# echo 3 > /proc/sys/vm/drop_caches;dd if=/test/randomfile.bin of=/dev/null bs=1M
9129+1 records in
9129+1 records out
9573093376 bytes (9.6 GB, 8.9 GiB) copied, 25.4434 s, 376 MB/s


When the filesystem is set to primarycache=none and secondarycache=metadata, and the metadata is in L2ARC:
# echo 3 > /proc/sys/vm/drop_caches;dd if=/test/randomfile.bin of=/dev/null bs=1M
9129+1 records in
9129+1 records out
9573093376 bytes (9.6 GB, 8.9 GiB) copied, 53.1836 s, 180 MB/s


When the filesystem is set to primarycache=none and secondarycache=all, and both metadata and data are in L2ARC:

# echo 3 > /proc/sys/vm/drop_caches;dd if=/test/randomfile.bin of=/dev/null bs=1M
9129+1 records in
9129+1 records out
9573093376 bytes (9.6 GB, 8.9 GiB) copied, 53.8619 s, 178 MB/s


Also, the maximum number of read operations I've ever seen on this cache device is 1.52K.
I've seen 3K on the spinning disks, so I'm wondering why that is:

                                               capacity     operations    bandwidth
                                              alloc   free   read  write   read  write
cache                                             -      -      -      -      -      -
  gptid/a3ee0c74-46e1-11e8-86ce-ac1f6b0c24d0   141G  1.68T  1.52K      0   195M      0
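
(These numbers come from zpool iostat; for anyone following along, something like the below prints the per-device cache stats every second - tank is just an example pool name:)

# zpool iostat -v tank 1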



I found this as well:
vfs.zfs.arc_average_blocksize: 8192

Would changing it improve access speed to the L2ARC?


Hardware:
Supermicro 36-bay server
Dual Xeon Scalable, 8-core/2.1GHz
128GB RAM
12x 10TB Hitachi Disks in RAIDZ2
LSI 9300 HBA 12Gbps
200GB Hitachi SAS as SLOG
2TB Intel P3520 as cache (the one we are talking about).
 

HoneyBadger (Administrator, Moderator, iXsystems)
What version of FreeNAS are you running? I believe there were some bugs in early 9.10 releases that would cause reduced performance with large L2ARC devices.

Is there a configuration change that would give this drive better throughput?

While I'm not familiar with this specific device, the benchmarks I've seen in a quick bit of research seem to indicate that it has poor small-block read performance. If your workload is all small reads (e.g. VM disks) then that could be a partial cause.

From HotHardware's review, you can see that the 4K read throughput on the P3520 is very low. I realize that "ATTO inside of Windows" isn't the same as "L2ARC reads in FreeBSD" but it may suggest that the device itself can't deliver the needed bandwidth.

[Attached image: atto2.png - ATTO benchmark results for the Intel P3520 from HotHardware's review]


With regards to your tunable, it's possible. From the FreeBSD 10.1-RELEASE release notes:

The default zfs(8) ARC hash table size has been increased, and a new loader(8) tunable, vfs.zfs.arc_average_blocksize, has been added. Previously, the hash table could be too small, which would lead to long hash chains and limit performance for cached reads. The vfs.zfs.arc_average_blocksize tunable allows overriding the default block size. The previous default was 65536, and default of the new loader(8) tunable is 8192. [r269846]

Adjusting this may have an impact on overall read performance, though, as it will affect your ARC as well as your L2ARC.
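
If you want to experiment with it, it's a loader(8) tunable, so it has to be set at boot - either as a "Loader" type entry under System -> Tunables in the FreeNAS UI (if I remember the UI right), or directly in /boot/loader.conf followed by a reboot. A sketch with an example value:

# /boot/loader.conf - example value only, takes effect on next boot
vfs.zfs.arc_average_blocksize="65536"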
 

woyteck (Dabbler)
Thanks for the graph; it explains the lower speeds I'm seeing quite well.

The system I have now is FreeNAS-11.1-U5

I have already changed vfs.zfs.arc_average_blocksize to 65536, but the highest throughput I've seen was about 190MB/s.
I will try even higher values today and report back.
 

woyteck (Dabbler)
So, I have been doing further tests and have come to a disturbing conclusion.


As it went:
  • I have tested vfs.zfs.arc_average_blocksize at 8k, 64k, 256k and 512k (with reboots), with no noticeable change in behavior.
  • I noticed that although I managed to write to the cache very quickly (even 900MB/s, using large vfs.zfs.l2arc_write_boost and vfs.zfs.l2arc_write_max), I was only ever able to read back ~185MB/s on average (rarely above 190MB/s).
  • It was doing about 1.5k read IOPS, so I thought there might be an IOPS limit somewhere in the code dealing with the cache.
  • To check this theory, I changed the recordsize on my test filesystem to 1M (see the sketch just after this list). Reads served from L2ARC on this filesystem are still limited to 185MB/s, but now at only ~185 read IOPS.
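
For reference, the recordsize change is a plain property set on the dataset (tank/test again being just an example name); recordsize only applies to files written after the change, so the test file has to be rewritten for it to take effect:

# zfs set recordsize=1M tank/test
# cp /test/randomfile.bin /test/randomfile.tmp && mv /test/randomfile.tmp /test/randomfile.bin
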
I then researched the idea that the code dealing with NVMe may be broken somewhere and found this thread:
https://lists.freebsd.org/pipermail/svn-src-head/2016-March/083406.html
(Do read the entire thread - it's very interesting.) However, it seems that this change is not present here, so there is no nvme_max_optimal_sectorsize. :(

However, I think the most important part of the thread was this:
Code:
-/*
- * Intel NVMe controllers have a slow path for I/Os that span a 128KB
- * stripe boundary but ZFS limits ashift, which is derived from
- * d_stripesize, to 13 (8KB) so we limit the stripesize reported to
- * geom(8) to 4KB by default.
- *
- * This may result in a small number of additional I/Os to require
- * splitting in nvme(4), however the NVMe I/O path is very efficient
- * so these additional I/Os will cause very minimal (if any) difference
- * in performance or CPU utilisation.
- */
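
As a side note, the stripe size the device actually reports to GEOM (the 4KB mentioned in that comment) can be checked with diskinfo:

# diskinfo -v /dev/nvd0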


However, the quoted comment pointed me towards checking how the actual nvme(4) driver behaves with these Intel cards, so I've done some tests.

# nvmecontrol perftest -n 16 -o read -s 512 -t 30 nvme0ns1
Threads: 16 Size: 512 READ Time: 30 IO/s: 289317 MB/s: 141
# nvmecontrol perftest -n 16 -o read -s 4096 -t 30 nvme0ns1
Threads: 16 Size: 4096 READ Time: 30 IO/s: 297781 MB/s: 1163
# nvmecontrol perftest -n 1 -o read -s 4096 -t 30 nvme0ns1
Threads: 1 Size: 4096 READ Time: 30 IO/s: 26887 MB/s: 105
# nvmecontrol perftest -n 1 -o read -s 65536 -t 30 nvme0ns1
Threads: 1 Size: 65536 READ Time: 30 IO/s: 3755 MB/s: 234
# nvmecontrol perftest -n 16 -o read -s 65536 -t 30 nvme0ns1
Threads: 16 Size: 65536 READ Time: 30 IO/s: 42393 MB/s: 2649
# nvmecontrol perftest -n 16 -o read -s 65536 -t 10 nvme0ns1
Threads: 16 Size: 65536 READ Time: 10 IO/s: 36372 MB/s: 2273
# nvmecontrol perftest -n 16 -o read -s 4096 -t 10 nvme0ns1
Threads: 16 Size: 4096 READ Time: 10 IO/s: 306383 MB/s: 1196
# nvmecontrol perftest -n 8 -o read -s 4096 -t 10 nvme0ns1
Threads: 8 Size: 4096 READ Time: 10 IO/s: 327323 MB/s: 1278
# nvmecontrol perftest -n 1 -o read -s 8192 -t 10 nvme0ns1
Threads: 1 Size: 8192 READ Time: 10 IO/s: 23705 MB/s: 185


Then it struck me - 185MB/s - that's what I keep seeing.

So I did some more testing:


# nvmecontrol perftest -n 1 -o read -s 8192 -t 10 nvme0ns1
Threads: 1 Size: 8192 READ Time: 10 IO/s: 22810 MB/s: 178
# nvmecontrol perftest -n 2 -o read -s 8192 -t 10 nvme0ns1
Threads: 2 Size: 8192 READ Time: 10 IO/s: 133169 MB/s: 1040


Wait a minute! A single thread gets 178MB/s, but two threads get six times as much throughput?!
So, I'm checking further:


# nvmecontrol perftest -n 1 -o read -s 8192 -t 10 nvme0ns1
Threads: 1 Size: 8192 READ Time: 10 IO/s: 22761 MB/s: 177
# nvmecontrol perftest -n 2 -o read -s 8192 -t 10 nvme0ns1
Threads: 2 Size: 8192 READ Time: 10 IO/s: 129932 MB/s: 1015
# nvmecontrol perftest -n 4 -o read -s 8192 -t 10 nvme0ns1
Threads: 4 Size: 8192 READ Time: 10 IO/s: 235470 MB/s: 1839
# nvmecontrol perftest -n 8 -o read -s 8192 -t 10 nvme0ns1
Threads: 8 Size: 8192 READ Time: 10 IO/s: 232484 MB/s: 1816

Looks like I hit an IOPS limit on the device. Time to check 64k reads:

# nvmecontrol perftest -n 1 -o read -s 65536 -t 10 nvme0ns1
Threads: 1 Size: 65536 READ Time: 10 IO/s: 3664 MB/s: 229
# nvmecontrol perftest -n 2 -o read -s 65536 -t 10 nvme0ns1
Threads: 2 Size: 65536 READ Time: 10 IO/s: 35995 MB/s: 2249
# nvmecontrol perftest -n 4 -o read -s 65536 -t 10 nvme0ns1
Threads: 4 Size: 65536 READ Time: 10 IO/s: 48641 MB/s: 3040
# nvmecontrol perftest -n 8 -o read -s 65536 -t 10 nvme0ns1
Threads: 8 Size: 65536 READ Time: 10 IO/s: 47784 MB/s: 2986


Wow, multi-threading to the rescue: nearly ten times more throughput with two threads on 64k chunks!

I have also checked it for 128k:

# nvmecontrol perftest -n 1 -o read -s 131072 -t 10 nvme0ns1
Threads: 1 Size: 131072 READ Time: 10 IO/s: 2276 MB/s: 284
# nvmecontrol perftest -n 2 -o read -s 131072 -t 10 nvme0ns1
Threads: 2 Size: 131072 READ Time: 10 IO/s: 20692 MB/s: 2586
# nvmecontrol perftest -n 4 -o read -s 131072 -t 10 nvme0ns1
Threads: 4 Size: 131072 READ Time: 10 IO/s: 25473 MB/s: 3184
# nvmecontrol perftest -n 8 -o read -s 131072 -t 10 nvme0ns1
Threads: 8 Size: 131072 READ Time: 10 IO/s: 23149 MB/s: 2893


Same behavior observed.
And for 256k, but here is something I expected:

# nvmecontrol perftest -n 1 -o read -s 262144 -t 1 nvme0ns1
Threads: 1 Size: 262144 READ Time: 1 IO/s: 1 MB/s: 0


Zero. Can't read it this way.

Now, I've done some dd reads from my Intel SSD, just to see how it behaves via the nvd(4) driver.


# dd if=/dev/nvd0 of=/dev/null bs=8192 count=10000
10000+0 records in
10000+0 records out
81920000 bytes transferred in 0.171515 secs (477624808 bytes/sec)
# dd if=/dev/nvd0 of=/dev/null bs=64k count=10000
10000+0 records in
10000+0 records out
655360000 bytes transferred in 1.162961 secs (563527174 bytes/sec)
# dd if=/dev/nvd0 of=/dev/null bs=128k count=10000
10000+0 records in
10000+0 records out
1310720000 bytes transferred in 2.283765 secs (573929503 bytes/sec)
# dd if=/dev/nvd0 of=/dev/null bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 18.330676 secs (572033472 bytes/sec)


For some reason it hits a ceiling of ~600MB/s. Not perfect, but I would be far more satisfied with that than with the 185MB/s I'm actually getting.

I have done one more test, based on the presumption that if my recordsize is 128k, the actual reads from L2ARC might be chopped into smaller chunks, as there may be some overhead for metadata.

So I changed the recordsize to 64k, re-ran the test that fills up the L2ARC, and then continued with my read test.
At moments I saw the throughput from the cache reach 287MB/s:

                                               capacity     operations    bandwidth
                                              alloc   free   read  write   read  write
cache                                             -      -      -      -      -      -
  gptid/a3ee0c74-46e1-11e8-86ce-ac1f6b0c24d0   471G  1.36T  4.48K     14   287M  13.1M
--------------------------------------------  -----  -----  -----  -----  -----  -----


This roughly aligns with the single-threaded 128k reads from the testing above.


The question is: is the L2ARC read process single-threaded?
How can we check? And is it possible to make it use more than one thread (minimum 2, preferably 4-8)?

That would have helped SO much in my case here.
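
(If anyone wants to dig into this, one rough way to see how many kernel threads are completing L2ARC reads might be a DTrace one-liner like the one below, run while reading a cached file. This assumes the fbt provider exposes the l2arc_read_done function, which I haven't verified on FreeNAS:)

# dtrace -n 'fbt::l2arc_read_done:entry { @threads[tid] = count(); }'

If a single thread ID dominates the counts, that would back up the single-threaded theory.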

Also, what next? Should I raise a bug/feature request with FreeNAS, or maybe with FreeBSD?
 

HoneyBadger (Administrator, Moderator, iXsystems)
That's some excellent sleuthing and replication instructions. I think this would probably be a feature request at the OpenZFS level, if it's truly the case that the L2ARC reads are single-threaded.

Since it looks like you're still in the testing phase and this isn't production hardware yet, is it possible to create a pool with ashift=13 (8KB) as I believe that SSD uses 8KB NAND pages?
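
As far as I know there's no ashift flag on zpool create in FreeBSD 11; the usual way is to raise the minimum auto-detected ashift with a sysctl before creating the test pool. A minimal sketch (nvd1 is just a placeholder for whatever spare device you end up using):

# sysctl vfs.zfs.min_auto_ashift=13
# zpool create testpool nvd1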
 

woyteck (Dabbler)
I was thinking about raising a bug with FreeNAS, but as you mentioned, it's probably more for OpenZFS people.

Unfortunately it's already in early production, which caught us out a bit. However, I have spare disk slots in this machine and will be able to test ashift=13 at some point.
 

woyteck (Dabbler)
I have raised it with FreeNAS for now, but I think I'll also file a FreeBSD bug report, as I am unable to find anywhere to report OpenZFS-related bugs - maybe someone knows of a place?

Also, just to strengthen the case for this:

I also have another server with Intel SSD L2ARC cache - INTEL SSD DC P4500
(Supermicro chassis, 2x E5 CPUs, 16x 10TB SAS disks, 128GB RAM)

And just out of curiosity I tried the same nvmecontrol tests on it.
Here are the results:

# nvmecontrol perftest -n 1 -o read -s 8192 -t 10 nvme0ns1
Threads: 1 Size: 8192 READ Time: 10 IO/s: 28842 MB/s: 225
# nvmecontrol perftest -n 2 -o read -s 8192 -t 10 nvme0ns1
Threads: 2 Size: 8192 READ Time: 10 IO/s: 161266 MB/s: 1259
# nvmecontrol perftest -n 4 -o read -s 8192 -t 10 nvme0ns1
Threads: 4 Size: 8192 READ Time: 10 IO/s: 323923 MB/s: 2530

A six-times speed-up from adding a second thread...


# nvmecontrol perftest -n 1 -o read -s 65536 -t 10 nvme0ns1
Threads: 1 Size: 65536 READ Time: 10 IO/s: 1564 MB/s: 97
# nvmecontrol perftest -n 2 -o read -s 65536 -t 10 nvme0ns1
Threads: 2 Size: 65536 READ Time: 10 IO/s: 40204 MB/s: 2512
# nvmecontrol perftest -n 4 -o read -s 65536 -t 10 nvme0ns1
Threads: 4 Size: 65536 READ Time: 10 IO/s: 50010 MB/s: 3125

A 25x increase in speed!

I do not have access to any other NVMe hardware, but maybe someone is able to run similar tests?
 