Slow sequential reads on mirrored vdevs

aheadley

Cadet
Joined
Mar 22, 2019
Messages
6
I have a 6x2x3TB mirrored zpool (on FreeNAS 11.2-U2) that can't seem to get above ~600MiB/s on sequential reads. The individual disks are capable of at least 150MiB/s each, and I'm trying to reach 1GiB/s sequential reads, which this setup should be more than capable of. During a scrub, ZFS will read from all the disks at near max speed, just not during normal sequential reads.

The pool is on an R710 (2x E5620 @ 2.40GHz, 60GB RAM) with an LSI 9200-8e connected to a Lenovo SA120 with SATA drives, in case that matters. ashift is 12 for all vdevs, and I've tested with 8K, 128K, and 1M recordsizes; 128K seems to do the best, but there isn't much of a difference.
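Roughly how the recordsize was varied per test dataset, as a sketch rather than the exact commands used (compression is disabled here only so a /dev/zero test file isn't compressed away, and the test file has to be rewritten after the change because existing blocks keep their old recordsize):
Code:
# set the recordsize and disable compression on the test dataset
$ sudo zfs set recordsize=128K tank0/perf-test
$ sudo zfs set compression=off tank0/perf-test
# rewrite the 10GiB test file so it actually uses the new recordsize
$ sudo dd if=/dev/zero of=/mnt/tank0/perf-test/test.file bs=1M count=10240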

Does anyone have any idea what could be limiting me here? It feels suspiciously like I'm running into some global tunable limit, but I have no idea what it could be.

example raw disk read (~180MB/s):
Code:
$ sudo dd if=/dev/da2p2 of=/dev/null bs=128k count=80k
81920+0 records in
81920+0 records out
10737418240 bytes transferred in 58.288367 secs (184212028 bytes/sec)


example zvol read (~510MB/s):
Code:
$ sudo dd if=/dev/zvol/tank0/perf-test/test-zvol of=/dev/null bs=128k count=80k
81920+0 records in
81920+0 records out
10737418240 bytes transferred in 20.745362 secs (517581627 bytes/sec)


example file read (~600MB/s):
Code:
$ sudo dd if=/mnt/tank0/perf-test/test.file of=/dev/null bs=128k count=80k
81920+0 records in
81920+0 records out
10737418240 bytes transferred in 17.677490 secs (607406268 bytes/sec)


zpool layout:
Code:
$ zpool list -v tank0
NAME                                     SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank0                                   16.3T  2.56T  13.8T        -         -     0%    15%  1.00x  ONLINE  /mnt
  mirror                                2.72T   445G  2.28T        -         -     0%    15%
    gptid/e4e7e6b6-4a53-11e9-83d1-842b2b066720      -      -      -        -         -      -      -
    gptid/05e02b8c-4a54-11e9-83d1-842b2b066720      -      -      -        -         -      -      -
  mirror                                2.72T   446G  2.28T        -         -     0%    16%
    gptid/0b979830-4a54-11e9-83d1-842b2b066720      -      -      -        -         -      -      -
    gptid/0fb474a8-4a54-11e9-83d1-842b2b066720      -      -      -        -         -      -      -
  mirror                                2.72T   448G  2.28T        -         -     0%    16%
    gptid/13c3954d-4a54-11e9-83d1-842b2b066720      -      -      -        -         -      -      -
    gptid/17febc9f-4a54-11e9-83d1-842b2b066720      -      -      -        -         -      -      -
  mirror                                2.72T   451G  2.28T        -         -     0%    16%
    gptid/1c04408e-4a54-11e9-83d1-842b2b066720      -      -      -        -         -      -      -
    gptid/20648952-4a54-11e9-83d1-842b2b066720      -      -      -        -         -      -      -
  mirror                                2.72T   407G  2.32T        -         -     0%    14%
    gptid/249ad9d2-4a54-11e9-83d1-842b2b066720      -      -      -        -         -      -      -
    gptid/2fdfb369-4a54-11e9-83d1-842b2b066720      -      -      -        -         -      -      -
  mirror                                2.72T   420G  2.31T        -         -     0%    15%
    gptid/3fd5c7be-4a54-11e9-83d1-842b2b066720      -      -      -        -         -      -      -
    gptid/477aed6d-4a54-11e9-83d1-842b2b066720      -      -      -        -         -      -      -
log                                         -      -      -         -      -      -
  mirror                                9.50G   180K  9.50G        -         -     0%     0%
    gptid/68b6b008-fbd1-11e6-aa96-782bcb779bf8      -      -      -        -         -      -      -
    gptid/68605033-fbd1-11e6-aa96-782bcb779bf8      -      -      -        -         -      -      -


zpool iostat during reads:
Code:
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank0       2.56T  13.8T  33.4K      0   531M      0


iostat during reads:
Code:
device       r/s     w/s     kr/s     kw/s  ms/r  ms/w  ms/o  ms/t qlen  %b
da2         2028       0  59182.1      0.0     0     0     0     0    0  29
da3         1903       0  61723.3      0.0     0     0     0     0    0  28
da4         1636       0  44180.6      0.0     0     0     0     0    0  25
da5         1726       0  44053.0      0.0     0     0     0     0    0  15
da6         1913       0  57406.3      0.0     0     0     0     0    2  17
da7         1542       0  44386.6      0.0     0     0     0     0    2  18
da8         1869       0  49341.3      0.0     0     0     0     0    0  21
da9         1766       0  53491.6      0.0     0     0     0     0    0  32
da10        1905       0  60173.1      0.0     0     0     0     0    0  28
da11        1064       0  45770.0      0.0     2     0     0     2    0  60
da12        1223       0  46780.6      0.0     0     0     0     0    3  33
da13        1716       0  51019.1      0.0     0     0     0     0    0  26 


iostat during raw simultaneous dd from all drives:
Code:
device       r/s     w/s     kr/s     kw/s  ms/r  ms/w  ms/o  ms/t qlen  %b 
da2         1125       0 144088.9      0.0     0     0     0     0    1  86
da3         1125       0 144088.9      0.0     0     0     0     0    1  89
da4         1500       0 192118.5      0.0     0     0     0     0    1 105
da5         1313       0 168103.7      0.0     0     0     0     0    1 100
da6         1313       0 168103.7      0.0     0     0     0     0    1  99
da7         1313       0 168103.7      0.0     0     0     0     0    1 100
da8         1313       0 168103.7      0.0     0     0     0     0    1  91
da9         1313       0 168103.7      0.0     0     0     0     0    1  95
da10        1500       0 192118.5      0.0     0     0     0     0    1 109
da11         938       0 120074.1      0.0     1     0     0     1    0  99
da12        1125       0 144088.9      0.0     0     0     0     0    1 105
da13        1125       0 144088.9      0.0     0     0     0     0    1  87 
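
For what it's worth, the simultaneous raw reads above were done with something along these lines (a sketch; same data partitions as the single-disk test):
Code:
# read 10GiB from the data partition of every pool disk at once, bypassing ZFS
$ for d in da2 da3 da4 da5 da6 da7 da8 da9 da10 da11 da12 da13; do
>   sudo dd if=/dev/${d}p2 of=/dev/null bs=1M count=10240 &
> done; wait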


CPU utilization/load during tests:
Code:
Avg:  0.5% sy:  8.3% ni:  0.0% hi:  0.2% si:  0.0% wa:  0.0%
Load average: 2.36 1.58 1.13
 

aheadley

Cadet
Joined
Mar 22, 2019
Messages
6
Bump, would really love to get this solved. Increasing vfs.zfs.zfetch.max_distance to 32MB has raised the top speed a bit, to ~700MB/s, but that's still well below what this setup should be capable of.
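In case anyone wants to try the same tunable, it can be set live with sysctl (a sketch; the value is in bytes, and to persist across reboots it goes in as a sysctl-type entry under System -> Tunables):
Code:
# raise the per-stream prefetch distance to 32MB (value in bytes)
$ sudo sysctl vfs.zfs.zfetch.max_distance=33554432
# confirm the new value
$ sysctl vfs.zfs.zfetch.max_distance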
 

MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
I went through this as well when I had my 8x8TB disks in mirrors. Each one got over 200MBps, so I figured I should easily hit 1GBps. In reality I hit the same limit you did: somewhere between 500-700MBps.

I think what I discovered (and hopefully someone wiser than me can correct me if I'm wrong) is that when people say your total streaming bandwidth with ZFS is MBps * number of disks, what they really mean is the total potential. So for me, 8 disks at 200MBps gives 1600MBps. The problem is that for a single-threaded application, or one application requesting access to a large file, the requests bounce from one disk to the other within a mirror vdev. ZFS can't break a request into its constituent parts and read part of a stripe from one disk in a vdev and part from the other (it can only do that across vdevs). You can see this in action if you run gstat -p and watch the disk activity: only half the disks are actually in use at any given time during a large read.
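If anyone wants to watch this themselves, something like the following shows the per-disk activity while a single large read is running (gstat -p limits the output to physical providers):
Code:
# watch per-disk %busy and bandwidth, refreshing every second
$ gstat -p -I 1s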

However, when accessing two large files at the same time, all disks become active; I was able to get about 500MBps from each transfer and just about saturate my 10Gb connection via SMB.
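A quick way to reproduce the two-reader case is just two dd processes against different large files (paths here are hypothetical):
Code:
# two concurrent sequential readers; watch gstat -p in another terminal
$ dd if=/mnt/tank0/dataset1/big-file-1 of=/dev/null bs=1M &
$ dd if=/mnt/tank0/dataset2/big-file-2 of=/dev/null bs=1M &
$ wait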

So for a single transfer, my theoretical limit was effectively cut in half, to about 800MBps. Add some fragmentation, enough drive usage that the slower inner sectors were in play, occasional access to the disks by other processes, and other overhead I'm not smart enough to name, and in practice I was back down to about 500-700MBps for single-process transfers.

Since I have an SSD pool for VMs and high IO applications, I switched the 8TB drives to a RAIDZ2 as they are mostly used to archive large files. My transfer speeds went up a bit, into the 700-900MBps range for both read and write.

You have more disks than me, but they appear to be a bit slower. While my single drives got about 240MBps in dd testing (and more like 170MBps in practice), yours get about 184MBps. It's probably safe to say that in practice you will max out at 100-150MBps each, so with 6 vdevs that's 600-900MBps. Seems like that's what you are experiencing.
 
Last edited:

MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
Also, I did discover that SSDs in mirrors do not seem to have this problem. Large read transfers from my 4 SSDs in mirrors max them out and I get about 1.8GBps. I'm not sure if this is because of the significantly reduced latency of SSDs. Perhaps the lack of delay effectively lets them use their full bandwidth when round-robining: each request completes so quickly that the requesting process never has to wait a meaningful amount of time for a disk to return data before issuing the next request. Would love it if someone else could shed some light.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
I have a 6x2x3TB mirrored zpool
I have tried this with sixteen disks (eight mirrors) and still was not able to get much faster.
The pool is on a r710
Someone suggested that the problem I was having might be a PCIe bottleneck. Can you confirm whether your controller or the slot on the board is PCIe 2.0?
 

aheadley

Cadet
Joined
Mar 22, 2019
Messages
6
I think what I discovered (and hopefully someone wiser than me can correct me if I'm wrong) is that when people say your total streaming bandwidth with ZFS is MBps * number of disks, what they really mean is the total potential. So for me, 8 disks at 200MBps gives 1600MBps. The problem is that for a single-threaded application, or one application requesting access to a large file, the requests bounce from one disk to the other within a mirror vdev. ZFS can't break a request into its constituent parts and read part of a stripe from one disk in a vdev and part from the other (it can only do that across vdevs). You can see this in action if you run gstat -p and watch the disk activity: only half the disks are actually in use at any given time during a large read

I believe that used to be true in older ZFS versions, but it is supposed to be smarter about distributing reads across mirrors now: http://open-zfs.org/wiki/Features#Improve_N-way_mirror_read_performance

However, when accessing two large files at the same time, all disks become active; I was able to get about 500MBps from each transfer and just about saturate my 10Gb connection via SMB

I do not see this behavior: when reading from two different test files on different datasets (in the same pool), the read bandwidth seems to be split between them rather than giving me more total throughput.

So for a single transfer, my theoretical limit was effectively cut in half, to about 800MBps. Add some fragmentation, enough drive usage that the slower inner sectors were in play, occasional access to the disks by other processes, and other overhead I'm not smart enough to name, and in practice I was back down to about 500-700MBps for single-process transfers

Unfortunately, I'm not even getting that. With 6 vdevs and a conservative 150MB/s per disk I should be seeing at least 900MB/s for a straight no-frills sequential read on these test datasets with no fragmentation, and it's falling short by around 200MB/s.
 
Last edited:

aheadley

Cadet
Joined
Mar 22, 2019
Messages
6
Also, I did discover that SSDs in mirrors do not seem to have this problem. Large read transfers from my 4 SSDs in mirrors max them out and I get about 1.8GBps

I just tested with a 3x2x120GB SSD pool and was able to get ~1GB/s, but these are kind of crappy SSDs and I believe they are limited to SATA2 by the disk shelf they are in (an MD1220), so I'm reluctant to draw any conclusions from that.

My end goal is to eventually build a 3x2x1TB SSD pool with a 900p as SLOG/L2ARC, but I don't want to move forward on that until I figure out why this simple test case is underperforming.
 
Last edited:

aheadley

Cadet
Joined
Mar 22, 2019
Messages
6
Someone suggested that the problem I was having might be a PCIe bottleneck. Can you confirm whether your controller or the slot on the board is PCIe 2.0?

The HBA is definitely PCIe 2.0, and dmidecode reports that the slots are as well, though I'm not sure how to check what speed they are actually linked at. The HBA is an x8 card and I believe it's plugged into an x8 slot, though it might be x4 wired as x8 (can't check until I get home tonight). Even if it were an x4 slot, I would still expect it not to be the problem, as that should give ~2GB/s of bandwidth (modulo SAS/SATA overhead).
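One way to check the negotiated link on FreeBSD is pciconf; the PCI-Express capability line reports current versus maximum width and speed (the mps0 name assumes the 9200-8e shows up under the mps(4) driver):
Code:
# list the HBA with its capabilities; look for the PCI-Express line,
# which shows something like "link x8(x8) speed 5.0(5.0)"
$ pciconf -lvc | grep -A 12 '^mps0'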

I have tried this with sixteen disks (eight mirrors) and still was not able to get much faster.

That is not very encouraging :(
 

MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
The results here: https://calomel.org/zfs_raid_speed_capacity.html would seem to indicate that performance does not scale much by adding disks, or at least not linearly with ZFS. For them, going from 6 mirrors to 12 actually decreased performance, and they stayed around 900MBps. Seems there's some built-in limit or overhead with spinning disks.
 

MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
Yes, though that doesn't really make any sense.

I agree. I can't seem to figure out what the performance limitations really are for ZFS. The consensus seems to be that if you need high performance, ZFS isn't necessarily the way to go, as it's built more for data safety. Or get SSDs.
 

TimoJ

Dabbler
Joined
Jul 28, 2018
Messages
39
My system also has this problem. Individual disks seem to read only 30-40MB/s, yet during a resilver they were all at 100-150MB/s. Earlier I had a 7x6TB plus 7x8TB raidz2 system, and I changed to dual 8x8TB raidz2; read speed changed only a little, still 500-700MB/s.
Write speeds seem to be much better, 1000MB/s all the time. Using SMB, a Windows machine, and M.2 disks.
 