Slow reads from PCIe L2ARC


woyteck (Dabbler)
I have a problem with my L2ARC.

It seems to have reads capped at ~175MB/s.
Why am I bothered? Because it's a 2TB Intel P3520, a PCIe-based NVMe drive.

When testing reads directly from /dev/nvd0, I easily and consistently get around 600MB/s:

# dd if=/dev/nvd0 of=/dev/null bs=128k count=100000
100000+0 records in
100000+0 records out
13107200000 bytes transferred in 21.553133 secs (608134336 bytes/sec)


However, I noticed that with a smaller block size, reads from this SSD are slower:
# dd if=/dev/nvd0 of=/dev/null bs=4k count=10000
10000+0 records in
10000+0 records out
40960000 bytes transferred in 0.169981 secs (240968499 bytes/sec)
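
(The same comparison can also be run as a quick, read-only sweep over block sizes - Bourne shell syntax:)

# for bs in 4k 8k 64k 128k 1m; do echo bs=$bs; dd if=/dev/nvd0 of=/dev/null bs=$bs count=10000; done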


Is there a configuration change that would give this drive better throughput?
The problem is that when data is read from ARC, I get line speed on the 10Gbps interface (about 1.1GB/s via NFS), but when data is read from L2ARC I only get 250-300MB/s, and even that is only because it also partially reads from the spinning disks.

I've done some tests on a lightly used server; I ran everything 3 times and I'm giving the middle result of each run.
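Between runs the cache properties were toggled with zfs set, roughly like this (tank/test is just an example name for the dataset mounted at /test):

# zfs set primarycache=all tank/test
# zfs set secondarycache=all tank/test
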
When data is in ARC and the filesystem is set to primarycache=all and secondarycache=all:
# echo 3 > /proc/sys/vm/drop_caches;dd if=/test/randomfile.bin of=/dev/null bs=1M
9129+1 records in
9129+1 records out
9573093376 bytes (9.6 GB, 8.9 GiB) copied, 10.6273 s, 901 MB/s


When the filesystem is set to primarycache=none and secondarycache=none:
# echo 3 > /proc/sys/vm/drop_caches;dd if=/test/randomfile.bin of=/dev/null bs=1M
9129+1 records in
9129+1 records out
9573093376 bytes (9.6 GB, 8.9 GiB) copied, 44.2833 s, 216 MB/s


When the filesystem is set to primarycache=metadata and secondarycache=none, and the metadata is in ARC:
# echo 3 > /proc/sys/vm/drop_caches;dd if=/test/randomfile.bin of=/dev/null bs=1M
9129+1 records in
9129+1 records out
9573093376 bytes (9.6 GB, 8.9 GiB) copied, 25.4434 s, 376 MB/s


When the filesystem is set to primarycache=none and secondarycache=metadata, and the metadata is in L2ARC:
# echo 3 > /proc/sys/vm/drop_caches;dd if=/test/randomfile.bin of=/dev/null bs=1M
9129+1 records in
9129+1 records out
9573093376 bytes (9.6 GB, 8.9 GiB) copied, 53.1836 s, 180 MB/s


When the filesystem is set to primarycache=none and secondarycache=all, and both metadata and data are in L2ARC:

# echo 3 > /proc/sys/vm/drop_caches;dd if=/test/randomfile.bin of=/dev/null bs=1M
9129+1 records in
9129+1 records out
9573093376 bytes (9.6 GB, 8.9 GiB) copied, 53.8619 s, 178 MB/s


Also, the maximum number of read operations I've ever seen on this cache device is 1.52K.
I've seen 3K on the spinning disks, so I'm wondering why that is:

                                               capacity     operations    bandwidth
                                              alloc   free   read  write   read  write
cache                                             -      -      -      -      -      -
  gptid/a3ee0c74-46e1-11e8-86ce-ac1f6b0c24d0   141G  1.68T  1.52K      0   195M      0
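
(These numbers come from zpool iostat; for anyone following along, something like the below prints the per-device cache stats every second - tank is just an example pool name:)

# zpool iostat -v tank 1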



I found this as well:
vfs.zfs.arc_average_blocksize: 8192

Would changing it improve access speed to the L2ARC?


Hardware:
Supermicro 36-bay server
Dual Xeon Scalable, 8-core/2.1GHz
128GB RAM
12x 10TB Hitachi Disks in RAIDZ2
LSI 9300 HBA 12Gbps
200GB Hitachi SAS as SLOG
2TB Intel P3520 as cache (the one we are talking about).
 

HoneyBadger (Administrator, Moderator, iXsystems)
What version of FreeNAS are you running? I believe there were some bugs in early 9.10 releases that would cause reduced performance with large L2ARC devices.

Is there a configuration change that would give this drive better throughput?

While I'm not familiar with this specific device, the benchmarks I've seen in a quick bit of research seem to indicate that it has poor small-block read performance. If your workload is all small reads (e.g. VM disks) then that could be a partial cause.

From HotHardware's review, you can see that the 4K read throughput on the P3520 is very low. I realize that "ATTO inside of Windows" isn't the same as "L2ARC reads in FreeBSD" but it may suggest that the device itself can't deliver the needed bandwidth.

[Attached image: atto2.png - ATTO benchmark results for the Intel P3520 from HotHardware's review]


With regards to your tunable, it's possible. From the FreeBSD 10.1-RELEASE release notes:

The default zfs(8) ARC hash table size has been increased, and a new loader(8) tunable, vfs.zfs.arc_average_blocksize, has been added. Previously, the hash table could be too small, which would lead to long hash chains and limit performance for cached reads. The vfs.zfs.arc_average_blocksize tunable allows overriding the default block size. The previous default was 65536, and default of the new loader(8) tunable is 8192. [r269846]

Adjusting this may have an impact on overall read performance, though, as it will affect your ARC as well as your L2ARC.
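
If you want to experiment with it, it's a loader(8) tunable, so it has to be set at boot - either as a "Loader" type entry under System -> Tunables in the FreeNAS UI (if I remember the UI right), or directly in /boot/loader.conf followed by a reboot. A sketch with an example value:

# /boot/loader.conf - example value only, takes effect on next boot
vfs.zfs.arc_average_blocksize="65536"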
 

woyteck (Dabbler)
Thanks for the graph; it explains the lower speeds I'm seeing quite well.

The system I have now is FreeNAS-11.1-U5

I have already changed vfs.zfs.arc_average_blocksize to 65536, but the highest throughput I've seen was about 190MB/s.
I will try even higher values today and report back.
 

woyteck (Dabbler)
So, I have been doing further tests and have come to a disturbing conclusion.


As it went:
  • I have tested vfs.zfs.arc_average_blocksize at 8k, 64k, 256k and 512k (with reboots), with no noticeable change in behavior.
  • I noticed that although I managed to write to the cache very quickly (even 900MB/s, using large vfs.zfs.l2arc_write_boost and vfs.zfs.l2arc_write_max), I was only ever able to read back ~185MB/s on average (rarely above 190MB/s).
  • It was doing about 1.5k read IOPS, so I thought there might be an IOPS limit somewhere in the code dealing with the cache.
  • To check this theory, I changed the recordsize on my test filesystem to 1M (see the sketch just after this list). Reads served from L2ARC on this filesystem are still limited to 185MB/s, but now at only ~185 read IOPS.
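
For reference, the recordsize change is a plain property set on the dataset (tank/test again being just an example name); recordsize only applies to files written after the change, so the test file has to be rewritten for it to take effect:

# zfs set recordsize=1M tank/test
# cp /test/randomfile.bin /test/randomfile.tmp && mv /test/randomfile.tmp /test/randomfile.bin
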
I then researched the idea that the code dealing with NVMe may be broken somewhere and found this thread:
https://lists.freebsd.org/pipermail/svn-src-head/2016-March/083406.html
(Do read the entire thread - it's very interesting.) However, it seems that this change is not present here, so there is no nvme_max_optimal_sectorsize. :(

However, I think the most important part of the thread was this:
Code:
-/*
- * Intel NVMe controllers have a slow path for I/Os that span a 128KB
- * stripe boundary but ZFS limits ashift, which is derived from
- * d_stripesize, to 13 (8KB) so we limit the stripesize reported to
- * geom(8) to 4KB by default.
- *
- * This may result in a small number of additional I/Os to require
- * splitting in nvme(4), however the NVMe I/O path is very efficient
- * so these additional I/Os will cause very minimal (if any) difference
- * in performance or CPU utilisation.
- */
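
As a side note, the stripe size the device actually reports to GEOM (the 4KB mentioned in that comment) can be checked with diskinfo:

# diskinfo -v /dev/nvd0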


However, the quoted comment pointed me towards checking how the actual nvme(4) driver behaves with these Intel cards, so I've done some tests.

# nvmecontrol perftest -n 16 -o read -s 512 -t 30 nvme0ns1
Threads: 16 Size: 512 READ Time: 30 IO/s: 289317 MB/s: 141
# nvmecontrol perftest -n 16 -o read -s 4096 -t 30 nvme0ns1
Threads: 16 Size: 4096 READ Time: 30 IO/s: 297781 MB/s: 1163
# nvmecontrol perftest -n 1 -o read -s 4096 -t 30 nvme0ns1
Threads: 1 Size: 4096 READ Time: 30 IO/s: 26887 MB/s: 105
# nvmecontrol perftest -n 1 -o read -s 65536 -t 30 nvme0ns1
Threads: 1 Size: 65536 READ Time: 30 IO/s: 3755 MB/s: 234
# nvmecontrol perftest -n 16 -o read -s 65536 -t 30 nvme0ns1
Threads: 16 Size: 65536 READ Time: 30 IO/s: 42393 MB/s: 2649
# nvmecontrol perftest -n 16 -o read -s 65536 -t 10 nvme0ns1
Threads: 16 Size: 65536 READ Time: 10 IO/s: 36372 MB/s: 2273
# nvmecontrol perftest -n 16 -o read -s 4096 -t 10 nvme0ns1
Threads: 16 Size: 4096 READ Time: 10 IO/s: 306383 MB/s: 1196
# nvmecontrol perftest -n 8 -o read -s 4096 -t 10 nvme0ns1
Threads: 8 Size: 4096 READ Time: 10 IO/s: 327323 MB/s: 1278
# nvmecontrol perftest -n 1 -o read -s 8192 -t 10 nvme0ns1
Threads: 1 Size: 8192 READ Time: 10 IO/s: 23705 MB/s: 185


Then it struck me - 185MB/s - that's what I keep seeing.

So I did some more testing:


# nvmecontrol perftest -n 1 -o read -s 8192 -t 10 nvme0ns1
Threads: 1 Size: 8192 READ Time: 10 IO/s: 22810 MB/s: 178
# nvmecontrol perftest -n 2 -o read -s 8192 -t 10 nvme0ns1
Threads: 2 Size: 8192 READ Time: 10 IO/s: 133169 MB/s: 1040


Wait a minute! A single thread gets 178MB/s, but two threads get six times as much throughput?!
So, I'm checking further:


# nvmecontrol perftest -n 1 -o read -s 8192 -t 10 nvme0ns1
Threads: 1 Size: 8192 READ Time: 10 IO/s: 22761 MB/s: 177
# nvmecontrol perftest -n 2 -o read -s 8192 -t 10 nvme0ns1
Threads: 2 Size: 8192 READ Time: 10 IO/s: 129932 MB/s: 1015
# nvmecontrol perftest -n 4 -o read -s 8192 -t 10 nvme0ns1
Threads: 4 Size: 8192 READ Time: 10 IO/s: 235470 MB/s: 1839
# nvmecontrol perftest -n 8 -o read -s 8192 -t 10 nvme0ns1
Threads: 8 Size: 8192 READ Time: 10 IO/s: 232484 MB/s: 1816

Looks like I hit an IOPS limit on the device. Time to check 64k reads:

# nvmecontrol perftest -n 1 -o read -s 65536 -t 10 nvme0ns1
Threads: 1 Size: 65536 READ Time: 10 IO/s: 3664 MB/s: 229
# nvmecontrol perftest -n 2 -o read -s 65536 -t 10 nvme0ns1
Threads: 2 Size: 65536 READ Time: 10 IO/s: 35995 MB/s: 2249
# nvmecontrol perftest -n 4 -o read -s 65536 -t 10 nvme0ns1
Threads: 4 Size: 65536 READ Time: 10 IO/s: 48641 MB/s: 3040
# nvmecontrol perftest -n 8 -o read -s 65536 -t 10 nvme0ns1
Threads: 8 Size: 65536 READ Time: 10 IO/s: 47784 MB/s: 2986


Wow, multi-threading to the rescue: nearly ten times more throughput with two threads on 64k chunks!

I have also checked it for 128k:

# nvmecontrol perftest -n 1 -o read -s 131072 -t 10 nvme0ns1
Threads: 1 Size: 131072 READ Time: 10 IO/s: 2276 MB/s: 284
# nvmecontrol perftest -n 2 -o read -s 131072 -t 10 nvme0ns1
Threads: 2 Size: 131072 READ Time: 10 IO/s: 20692 MB/s: 2586
# nvmecontrol perftest -n 4 -o read -s 131072 -t 10 nvme0ns1
Threads: 4 Size: 131072 READ Time: 10 IO/s: 25473 MB/s: 3184
# nvmecontrol perftest -n 8 -o read -s 131072 -t 10 nvme0ns1
Threads: 8 Size: 131072 READ Time: 10 IO/s: 23149 MB/s: 2893


Same behavior observed.
And for 256k, but here is something I expected:

# nvmecontrol perftest -n 1 -o read -s 262144 -t 1 nvme0ns1
Threads: 1 Size: 262144 READ Time: 1 IO/s: 1 MB/s: 0


Zero. Can't read it this way.

Now, I've done some dd reads from my Intel SSD, just to see how it behaves via the nvd(4) driver.


# dd if=/dev/nvd0 of=/dev/null bs=8192 count=10000
10000+0 records in
10000+0 records out
81920000 bytes transferred in 0.171515 secs (477624808 bytes/sec)
# dd if=/dev/nvd0 of=/dev/null bs=64k count=10000
10000+0 records in
10000+0 records out
655360000 bytes transferred in 1.162961 secs (563527174 bytes/sec)
# dd if=/dev/nvd0 of=/dev/null bs=128k count=10000
10000+0 records in
10000+0 records out
1310720000 bytes transferred in 2.283765 secs (573929503 bytes/sec)
# dd if=/dev/nvd0 of=/dev/null bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 18.330676 secs (572033472 bytes/sec)


For some reason it hits a ceiling of ~600MB/s. Not perfect, but I would be far more satisfied with that than with the 185MB/s I'm actually getting.

I have done one more test, based on the presumption that if my recordsize is 128k, the actual reads from L2ARC might be chopped into smaller chunks, as there may be some overhead for metadata.

So I changed the recordsize to 64k, re-ran the test that fills up the L2ARC, and then continued with my read test.
At moments I saw the throughput from the cache reach 287MB/s:

                                               capacity     operations    bandwidth
                                              alloc   free   read  write   read  write
cache                                             -      -      -      -      -      -
  gptid/a3ee0c74-46e1-11e8-86ce-ac1f6b0c24d0   471G  1.36T  4.48K     14   287M  13.1M
--------------------------------------------  -----  -----  -----  -----  -----  -----


This roughly aligns with the single-threaded 128k reads from the testing above.


The question is: is the L2ARC read process single-threaded?
How can we check? And is it possible to make it use more than one thread (minimum 2, preferably 4-8)?

That would have helped SO much in my case here.
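
(If anyone wants to dig into this, one rough way to see how many kernel threads are completing L2ARC reads might be a DTrace one-liner like the one below, run while reading a cached file. This assumes the fbt provider exposes the l2arc_read_done function, which I haven't verified on FreeNAS:)

# dtrace -n 'fbt::l2arc_read_done:entry { @threads[tid] = count(); }'

If a single thread ID dominates the counts, that would back up the single-threaded theory.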

Also, what next? Should I raise a bug/feature request with FreeNAS, or maybe with FreeBSD?
 

HoneyBadger (Administrator, Moderator, iXsystems)
That's some excellent sleuthing and replication instructions. I think this would probably be a feature request at the OpenZFS level, if it's truly the case that the L2ARC reads are single-threaded.

Since it looks like you're still in the testing phase and this isn't production hardware yet, is it possible to create a pool with ashift=13 (8KB) as I believe that SSD uses 8KB NAND pages?
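
As far as I know there's no ashift flag on zpool create in FreeBSD 11; the usual way is to raise the minimum auto-detected ashift with a sysctl before creating the test pool. A minimal sketch (nvd1 is just a placeholder for whatever spare device you end up using):

# sysctl vfs.zfs.min_auto_ashift=13
# zpool create testpool nvd1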
 

woyteck (Dabbler)
I was thinking about raising a bug with FreeNAS, but as you mentioned, it's probably more for OpenZFS people.

Unfortunately it's already in early production, which caught us out a bit. However, I have spare disk slots in this machine and will be able to test ashift=13 at some point.
 

woyteck (Dabbler)
I have raised it with FreeNAS for now, but I think I'll also file a FreeBSD bug report, as I am unable to find anywhere to report OpenZFS-related bugs - maybe someone knows of a place?

Also, just to strengthen the case for this:

I also have another server with Intel SSD L2ARC cache - INTEL SSD DC P4500
(Supermicro chassis, 2x E5 CPUs, 16x 10TB SAS disks, 128GB RAM)

And just out of curiosity I tried the same nvmecontrol tests on it.
Here are the results:

# nvmecontrol perftest -n 1 -o read -s 8192 -t 10 nvme0ns1
Threads: 1 Size: 8192 READ Time: 10 IO/s: 28842 MB/s: 225
# nvmecontrol perftest -n 2 -o read -s 8192 -t 10 nvme0ns1
Threads: 2 Size: 8192 READ Time: 10 IO/s: 161266 MB/s: 1259
# nvmecontrol perftest -n 4 -o read -s 8192 -t 10 nvme0ns1
Threads: 4 Size: 8192 READ Time: 10 IO/s: 323923 MB/s: 2530

A six-times speed-up from adding a second thread...


# nvmecontrol perftest -n 1 -o read -s 65536 -t 10 nvme0ns1
Threads: 1 Size: 65536 READ Time: 10 IO/s: 1564 MB/s: 97
# nvmecontrol perftest -n 2 -o read -s 65536 -t 10 nvme0ns1
Threads: 2 Size: 65536 READ Time: 10 IO/s: 40204 MB/s: 2512
# nvmecontrol perftest -n 4 -o read -s 65536 -t 10 nvme0ns1
Threads: 4 Size: 65536 READ Time: 10 IO/s: 50010 MB/s: 3125

A 25x increase in speed!

I do not have access to any other NVMe hardware, but maybe someone is able to run similar tests?
 