Mirror Pool Performance


MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
I have the following setup for a mirror pool of spinning drives:

4X ST8000NM0055 drives
4X ST2000NM0033 drives
All drives are configured in one big pool connected to the SAS controller, and I verified they are all registering at 6Gbps. I started with 8X of the 2TB drives and I'm in the process of replacing them with the 8TB drives one at a time.

Supermicro X11SSH-CTF
i3-7320 CPU
2X Crucial CR16G4WFD824 16GB ECC
Seasonic Focus 650W 80 Plus Gold
FreeNAS 11.1-U3

The pool looks like this (after writing a 100GB test file):
Code:
NAME									 SIZE  ALLOC   FREE  EXPANDSZ   FRAG	CAP  DEDUP  HEALTH  ALTROOT
MirrorPool							  18.1T  8.06T  10.1T		 -	 5%	44%  1.00x  ONLINE  /mnt
  mirror								7.25T  2.80T  4.45T		 -	 2%	38%
	gptid/d1dc71a4-18b2-11e8-99f6-ac1f6b1a8ee0	  -	  -	  -		 -	  -	  -
	gptid/d2b64d10-18b2-11e8-99f6-ac1f6b1a8ee0	  -	  -	  -		 -	  -	  -
  mirror								7.27T  1.83T  5.43T		 -	 4%	25%
	gptid/30148048-2a48-11e8-960e-ac1f6b1a8ee0	  -	  -	  -		 -	  -	  -
	gptid/44832cbb-2ec0-11e8-82c3-ac1f6b1a8ee0	  -	  -	  -		 -	  -	  -
  mirror								1.81T  1.73T  82.1G		 -	23%	95%
	gptid/aaf52166-1dd3-11e8-b73e-ac1f6b1a8ee0	  -	  -	  -		 -	  -	  -
	gptid/86c7d519-1e7f-11e8-b73e-ac1f6b1a8ee0	  -	  -	  -		 -	  -	  -
  mirror								1.81T  1.69T   121G		 -	 6%	93%
	gptid/6ccdcc92-1d8f-11e8-92a7-ac1f6b1a8ee0	  -	  -	  -		 -	  -	  -
	gptid/6d78058b-1d8f-11e8-92a7-ac1f6b1a8ee0	  -	  -	  -		 -	  -	  -


Speed tests:
Code:
root@nas:/mnt/MirrorPool/Test # dd if=/dev/zero of=test.dat bs=2048k count=50000
50000+0 records in
50000+0 records out
104857600000 bytes transferred in 181.127446 secs (578916129 bytes/sec)
root@nas:/mnt/MirrorPool/Test # dd of=/dev/null if=test.dat bs=2048k count=50000
50000+0 records in
50000+0 records out
104857600000 bytes transferred in 182.765586 secs (573727265 bytes/sec)


No other activity taking place during the time of testing.

570MBps read and write is certainly not bad, and I know that 2 of my mirror sets are pretty full and fragmented. But I'm wondering whether I should be able to get higher throughput. The 2TB drives can do between 140-170MBps and the 8TB drives can do about 200-230MBps, so even at 150MBps each for reads I would be over 1GBps. During the test, especially the read portion, CPU usage stays below 20% and drive utilization hovers around 70%. When resilvering the 8TB disks I get a solid 220MBps read and write through 2GB of data, so I know the drives are capable of that much.
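For reference, this is roughly how I've been watching the disks while the dd runs - nothing fancy, just the stock tools (the path is my test dataset from above):
Code:
# session 1: the write test
dd if=/dev/zero of=/mnt/MirrorPool/Test/test.dat bs=2048k count=50000

# session 2: watch throughput per vdev and per disk
zpool iostat -v MirrorPool 1   # bandwidth broken down by vdev
gstat -p -I 1s                 # per-disk ops/s, KBps, queue depth, %busy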

I do have 10Gbps networking set up and I store pretty large files, so while this is slightly academic, the speeds are relevant to my usage.

For comparison, on the SATA connectors, I have a 4 drive SSD mirror pool:
Code:
root@nas:~ # cd /mnt/SSDPool/test/
root@nas:/mnt/SSDPool/test # dd if=/dev/zero of=test.dat bs=2048k count=50000
50000+0 records in
50000+0 records out
104857600000 bytes transferred in 109.355935 secs (958865198 bytes/sec)
root@nas:/mnt/SSDPool/test # dd of=/dev/null if=test.dat bs=2048k count=50000
50000+0 records in
50000+0 records out
104857600000 bytes transferred in 55.585854 secs (1886408017 bytes/sec)

Much better, which to me indicates it's not CPU bound, and I'm assuming the SAS connectors are just as fast.

With the spinning drives, are the slower 2TB drives holding me back because the 8TB drives have to wait on those stripes to be read/written? I previously had those drives in a Synology getting 900MBps off of them in a RAID5 config, which is where my expectations come from. What affects how FreeNAS scales performance as more drives are added? Am I perhaps configuring something wrong, or not tuning something that I should?
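In case it matters, these are the settings I've been checking on the pool and the test dataset (standard zfs commands; the property list is just the ones I figured could affect a sequential test):
Code:
zfs get recordsize,compression,atime,sync,primarycache MirrorPool
zfs get recordsize,compression,atime,sync,primarycache MirrorPool/Test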

Thanks for any insight!
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
So you are replacing the 2TB drives with 8TB drives one at a time and resilvering? Keep in mind that a vdev, and to some extent the pool, will only be as fast as its slowest drive.
I previously had those drives in a Synology getting 900MBps off of them in a RAID5
Keep in mind that (depending on your Synology setup) a RAID 5 will be faster than the equivalent RAIDZ1, as ZFS is doing a lot of extra work to keep that data safe.

One thing people forget (especially in storage) is that even with an underutilized CPU you may have a compute bottleneck of sorts. ZFS has a number of steps to go through to write to disk, and each step adds latency. If each step (or even just some of them) has to complete in sequence, this will manifest as lower throughput. The same thing applies to networking and is the reason bridging interfaces is so slow. It's not just the CPU but the number of steps that adds overhead.

On a side note, I have heard of some people gaining performance by disabling hyperthreading on their CPUs, though it also decreased performance for some.
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
Some basic tuning tips: disable atime, enable lz4 compression (on by default), and make sure all devices in a vdev (preferably the whole pool) have the same block size. Oh, and use all SSDs... :D
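Something like this, assuming a pool named MirrorPool (adjust names to taste); block size (ashift) is baked in when a vdev is created, so you can only check it, not change it:
Code:
zfs set atime=off MirrorPool          # stop access-time updates on every read
zfs set compression=lz4 MirrorPool    # basically free, and on by default in FreeNAS
# check the ashift the vdevs were created with (FreeNAS keeps its cachefile here, I believe):
zdb -U /data/zfs/zpool.cache -C MirrorPool | grep ashift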
 

MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
Thanks kdragon!

Yes, I am replacing the 2TB drives and resilvering. I'm aware that a vdev will be constrained by its slowest drive, but I wasn't sure how that effect spreads out across the pool. Part of my curiosity is that even if all the drives were only functioning as fast as the 2TB drives allow, the result is still much slower than I would expect.

I'm also assuming that since I'm not using any RAIDZ level, I shouldn't be experiencing whatever overhead that creates. Would the extra steps you are referring to (and I'm curious what those are) only apply to spinning disks? As you can see, the SSD drives I have in mirrors scale exactly as I would expect.

atime was already disabled and lz4 compression is on for the pool, although it's off on the dataset where I tested, to make the results more accurate. I have not altered the block size, which is set to inherit.
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
I don't know enough about how/where on disk ZFS stores its block checksums, but that could cause some slowdown too. If pending writes are not full sectors then you need to read what's on disk to update checksums before writing. By "writing", I'm not 100% sure whether that happens as it's writing to memory (TXGs) or when the TXG is committed to disk... I don't know, I'm a little out of my depth here.
I would love to see the CPU usage with a striped mirror on SSD and on HDD to compare CPU per MBps, and striped mirrors with 1, 2, and 3 vdevs to see if it scales linearly up to 100% CPU or if there are diminishing returns (that would be my guess, for a number of reasons).
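If anyone ever wants to run that comparison, here's a rough sketch of what I mean - completely hypothetical disk names, it destroys everything on them, and it's untested, so treat it as an outline only:
Code:
#!/bin/sh
# Build a striped mirror with 1, 2, then 3 vdevs from spare disks and run the
# same dd test on each layout, watching CPU in another session (top -SHz).
# WARNING: wipes the listed disks. da1..da6 are placeholders.
set -- da1 da2 da3 da4 da5 da6
VDEVS=""
N=0
while [ $# -ge 2 ]; do
    VDEVS="$VDEVS mirror $1 $2"
    shift 2
    N=$((N + 1))
    zpool destroy testpool 2>/dev/null
    zpool create -f -m /mnt/testpool testpool $VDEVS
    zfs set compression=off testpool
    zfs set atime=off testpool
    echo "=== $N mirror vdev(s) ==="
    dd if=/dev/zero of=/mnt/testpool/test.dat bs=2048k count=25000
done
zpool destroy testpool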
 

MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
I ended up wiping out the pool and recreating it using only the 8TB drives. I'm still getting some unexpected results.
Code:
root@nas:/mnt/MirrorPool/test # dd if=/dev/zero of=test.dat bs=2048k count=50000
50000+0 records in
50000+0 records out
104857600000 bytes transferred in 252.159046 secs (415839136 bytes/sec)
root@nas:/mnt/MirrorPool/test # dd of=/dev/null if=test.dat bs=2048k count=50000
50000+0 records in
50000+0 records out
104857600000 bytes transferred in 213.621053 secs (490857988 bytes/sec)

It seems to alternate reads between the disks in the vdevs: one moment it's 100% from one disk, the next it's 100% from another. When reading from mirrored vdevs, aren't all drives supposed to be read from at once?
Code:
 L(q)  ops/s	r/s   kBps   ms/r	w/s   kBps   ms/w   %busy Name
	0	  2	  2	 12	2.7	  0	  0	0.0	0.5| da0
	0	  0	  0	  0	0.0	  0	  0	0.0	0.0| da2
	0	  0	  0	  0	0.0	  0	  0	0.0	0.0| da3
	0	  0	  0	  0	0.0	  0	  0	0.0	0.0| da4
	0	  1	  1	  4	0.1	  0	  0	0.0	0.0| da5
	0	  0	  0	  0	0.0	  0	  0	0.0	0.0| da6
   12   1787   1787 228749   11.3	  0	  0	0.0  100.0| da7
	0	  1	  1	  3	0.9	  0	  0	0.0	0.1| da8
	0	  0	  0	  0	0.0	  0	  0	0.0	0.0| ada0
	0	  0	  0	  0	0.0	  0	  0	0.0	0.0| ada1
	0	  0	  0	  0	0.0	  0	  0	0.0	0.0| ada2
	0	  0	  0	  0	0.0	  0	  0	0.0	0.0| ada3
   19   1761   1761 225428	8.5	  0	  0	0.0   89.2| da1

For comparison, here are the same disks in a stripe:
Code:
root@nas:/mnt/test # dd if=/dev/zero of=test.dat bs=2048k count=50000
50000+0 records in
50000+0 records out
104857600000 bytes transferred in 119.960849 secs (874098519 bytes/sec)
root@nas:/mnt/test # dd of=/dev/null if=test.dat bs=2048k count=50000
50000+0 records in
50000+0 records out
104857600000 bytes transferred in 142.411224 secs (736301514 bytes/sec)

To me this shows there is no problem with the bandwidth of the controller or the disks. Although I wonder why reading would be slower than writing here.
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
It seems to alternate reads between the disks in the vdevs: one moment it's 100% from one disk, the next it's 100% from another.
Again, this is a bit beyond me at this point, but depending on how/where checksums and metadata are stored and read, it may be reading checksums and data separately. Those are some solid numbers. Consider the following: in a striped mirror of 4 disks you can read or write to two disks at a time. You mentioned the following about your 8TB drives:
the 8TB drives can do about 200-230MBps.
So that seems in line even with the seemingly odd reporting of disk activity. That said, with a queue depth of 12-19 your drives should show a busy time near 100%, and they do (at least in that snapshot).
Although I wonder why reading would be slower than writing here.
This still looks solid. (4 - 1 for parity) x 230 = 690MBps; factor in the ARC, the fact that the final TXG (or two) may not be committed to disk yet, and whether you left LZ4 on, and that's about the max you would expect, give or take a few percent. This is actually pretty darn good considering the parity penalties. ZFS seems to do a good job of mitigating that, though this may not be the case with small IO.
To me this shows there is no problem with the bandwidth of the controller or the disks
Agreed. The mixed vdevs seem to be the issue here. At this point I can't speculate further.
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
Just want to add this here, as I had a misunderstanding of how parity works in ZFS. This may be useful in understanding the performance characteristics when using RAID-Z1/2:
RAID-Z is a data/parity scheme like RAID-5, but it uses dynamic stripe width: every block is its own RAID-Z stripe, regardless of block size. This means that every RAID-Z write is a full-stripe write. Combined with the copy-on-write transactional semantics of ZFS, this completely eliminates the RAID "write hole". RAID-Z is also faster than conventional RAID because it never has to do a read-modify-write.
Jeff Bonwick's Blog - RAID-Z
 

MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
I was under the impression that with 4 disks in mirrored vdevs you would get the write bandwidth of 2 disks and the read bandwidth of 4 disks. That does appear to happen in my SSD test.

Also, you mention the overhead of parity calculation with RAID-Z, but my understanding is that with mirrors there is no parity. The data is replicated across the drives instead of parity being created for redundancy, which is what should allow twice the theoretical read bandwidth compared to write bandwidth - because the data can be read from both drives at once.
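Back-of-the-envelope, and assuming the ~230MBps per 8TB drive figure is roughly right, these are the ceilings I'd expect for the current 2-vdev mirror layout (ignoring all ZFS overhead):
Code:
#!/bin/sh
# rough theoretical ceilings for 2 mirror vdevs of ~230MBps drives, sequential IO only
PER_DISK=230
MIRROR_VDEVS=2
echo "write ceiling: $((MIRROR_VDEVS * PER_DISK)) MBps"       # each mirror writes the same data to both disks
echo "read ceiling:  $((MIRROR_VDEVS * 2 * PER_DISK)) MBps"   # if both halves of each mirror could stream in parallel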

Also, again all tests were done with compression turned off to avoid distortion of results (otherwise I get like 4GBps writing 100GB of zeroes).
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
and the read bandwidth of 4 disks.
I must have been low on coffee. That makes sense, to an extent. Keep in mind we still have the extra (small) IO of reading checksums and verifying the blocks as we read them, so it's still not quite as efficient as traditional RAID 10, but it's far more robust.
I don't see a way to monitor metadata/checksum IO performance to see if that accounts for the reduced performance of the spinners, but if you are still seeing high (>4) queue depths and near 100% busy time, that is about all you are going to get.

The SSDs (as we know) are great at small random IO, the kind of IO that reading checksums and metadata produces, so there would be little impact from mixing that with the actual data IO.
HDDs are terrible at small random IO, and when you mix that with your actual data blocks it may cause a noticeable performance reduction.

Someone let me know if I'm wrong in this. I'm not an expert. This is just my understanding of ZFS and storage.
 

MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
After looking more closely at the pages in this thread: https://forums.freenas.org/index.php?threads/notes-on-performance-benchmarks-and-cache.981/ it seems that many of the tests have results lower than what I would expect for the hardware. For example, even 8 Raptor drives, or pools with 24+ disks, seem to max out between 500-700MBps. Is there something about ZFS that I'm not understanding that makes it inherently slow? kdragon, I'm not expecting a response from you on this, as I appreciate you've already answered my questions to the best of your ability. But if anyone is willing to share resources, articles, or explanations, that would be great. All my research has turned up very little as to why this would be the case. The tuning options I could find seem to be specific to 10GbE networking, not disk performance itself.
 

MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
Looks like the lack of speed scaling from adding more disks is normal: https://calomel.org/zfs_raid_speed_capacity.html You can see (at least with mirrors) that read speeds nearly double when going from 1 drive to a 2-disk mirror, but going from a 2-disk mirror to 2X 2-disk mirrors (4 disks) only increases reads from 488MBps to 644MBps - far less than double.

On this page: https://constantin.glez.de/2010/06/04/a-closer-look-zfs-vdevs-and-performance/ mirror vdev access is described as round-robin:

"When reading from mirrored vdevs, ZFS will read blocks off the mirror's individual disks in a round-robin fashion, thereby increasing both IOPS and bandwidth performance: You'll get the combined aggregate IOPS and bandwidth performance of all disks.

But from the point of view of a single application that issues a single write or read IO, then waits until it is complete, the time it will take will be the same for a mirrored vdev as for a single disk: Eventually it has to travel down one disk's path, there's no shortcut for that."

Also from here: https://groups.google.com/a/zfsonlinux.org/forum/#!topic/zfs-discuss/AQQOxHC9Bj0

"When you have fewer disks, there is a greater chance of operations being linear on each disk even after round-robin rotation. Say you have 2 disks and want to read a large file. You'll read block 1 from disk 1, block 2 from disk 2, block 3 from disk 1, etc. But blocks 1 and 3 have a pretty good chance of having been written in adjecent locations on disk (unless the disk was getting very full). So block 3 will be getting prefetched by disk following the read of block 1, while block 2 was being read.

As you add disks, you are reducing the number of sequential operations on each disk, so you are getting closer to the 15MB/s/disk figure. The IOPS capacity will go up linearly, but the throughput on sequential transfers will not."

I had thought that somehow, when reading from a mirrored vdev, ZFS was able to read different parts of a single file from different disks in parallel. Apparently, as I've seen, that is not the case, and now I'm wondering if that explains the lack of scalable performance, at least with mirrors reading single large files. I'm not sure why the SSDs aren't displaying this behavior. Maybe SSDs are fast enough to complete a read from one drive, return a block (or part of one), and move on to the next quickly enough that it looks like they are reading from both at the same time? I can see how the round-robin behavior would still increase overall speed for multiple simultaneous read requests on the spinning drives, though.
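If I get around to it, I might test that theory by running two readers at once and watching gstat - something like this (file names are just examples, and the files need to be bigger than ARC or freshly imported so the reads actually hit the disks):
Code:
# write two large test files first (compression off on this dataset)
dd if=/dev/zero of=/mnt/MirrorPool/test/a.dat bs=2048k count=25000
dd if=/dev/zero of=/mnt/MirrorPool/test/b.dat bs=2048k count=25000

# read both back at the same time and watch gstat -p in another session
# to see whether both disks of each mirror stay busy simultaneously
dd of=/dev/null if=/mnt/MirrorPool/test/a.dat bs=2048k &
dd of=/dev/null if=/mnt/MirrorPool/test/b.dat bs=2048k &
wait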

Hopefully I've kind of figured this out and it's helpful to someone else with the same questions, although if anyone can correct or validate my findings that would be great.
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
Jeff Bonwick's Blog - RAID-Z
Looking at how parity is distributed, this round-robin behavior and the non-linear sequential read scaling make sense. Good information you found!
I had thought that somehow, when reading from a mirrored vdev, ZFS was able to read different parts of a single file from different disks in parallel.
This part is largely up to the application and how reads are implemented. Basically all operations in ZFS are at the block level, and it doesn't care what "file" you are reading, just which blocks the application requests.

Perhaps there is some ZFS prefetch tuning that can be done to mitigate this?
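I haven't tried this myself, but these are the stock FreeBSD prefetch knobs I'm aware of, in case someone wants to experiment (read up on them before changing anything):
Code:
# current prefetch settings on FreeBSD/FreeNAS 11
sysctl vfs.zfs.prefetch_disable       # 0 = prefetch enabled
sysctl vfs.zfs.zfetch.max_distance    # max bytes the prefetcher reads ahead per stream (8MB default, I think)

# example: try a 32MB prefetch distance for a test
sysctl vfs.zfs.zfetch.max_distance=33554432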
At any rate, I'm mostly writing this for my own benefit.
 
Last edited by a moderator:

Juan Manuel Palacios

Contributor
Joined
May 29, 2017
Messages
146
@mgittelman Hey! Here are my numbers for a 2-way mirror pool with a single 4TB vdev:

https://forums.freenas.org/index.ph...e-benchmarks-and-cache.981/page-7#post-450419
https://forums.freenas.org/index.ph...e-benchmarks-and-cache.981/page-7#post-450502

In summary, I'm getting 144.15 MB/s write & 183.57 MB/s read speeds directly on my test filesystem, and 45.96 MB/s write & 66.02 MB/s read over CIFS. Not having followed very closely the details of all the tests you performed, do you think the large difference with respect to your initial 570 MB/s read & write numbers is because I have one single vdev, and you have four? I'd have thought that, but then you referenced some articles that explain the lack of speed scaling with a growing number of vdevs, and that's where I got a little lost...
 

MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
@Juan Manuel Palacios to me those read numbers still don't quite make sense. If the manufacturer says 200MBps and your write speed is 144MBps, that's not bad. But then why is the read only 183MBps? If it's supposed to read from both disks in a mirror vdev, I would think it would be at least 300MBps. What I posted above about the access being round-robin is the only thing I can think of, and you should in fact be able to see it while running a disk test with gstat -p.
 