Creating a Set of Benchmarks

Status
Not open for further replies.

bollar

Patron
Joined
Oct 28, 2012
Messages
411
I've decided to redo my production server to take advantage of 4K blocks and RAIDz3. Since I'm taking it offline and allowing failover to the backup server, this seems like a good time to try a series of performance tests that others might find helpful as they are building and configuring their systems.

The questions I'm trying to answer are raw performance related:

512/4K sectors
mirror/striped mirror/parity
RAIDz1/z2/z3
Number of drives in a Vdev (the n^2-parity question)

I really don't want to try to get into protocol related performance -- just how the ZFS file system performs on a fairly high-end system.

What test methodology would you suggest? I have seen several in the forum and I'd prefer to use one that makes the tests useful to others who are trying to compare. Some options I've found:

IOZONE as suggested in ZFS and 4K. I'm willing to do a couple of these, but since they take so long, I can realistically only run them for the potential best- and worst-case scenarios.

A 100GB dd transfer as suggested here: Slow performance, 36MB/s, E350 w/ 8GB RAM

A variation on dd that adds an extra 'sync' in an effort to improve accuracy: Weird raidz1 and raidz2 performance with 4 drives - any explanation?

These dd tests are quite fast for me to execute, so I'm inclined to use them. However, if you have any suggestions that would improve the usefulness of these tests, please feel free to share.
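For anyone following along, here is a minimal sketch of the dd write/read pattern those threads describe. The file path and sizes are illustrative placeholders; the actual tests in this thread use bs=2048k count=50k (a 100GB file) against a dataset with compression disabled.

```shell
#!/bin/sh
# Minimal sketch of the dd-based write/read test discussed above.
# Path and sizes are illustrative; scale the count up so the file
# comfortably exceeds RAM, or ARC will serve the read test from memory.
TESTFILE=/tmp/zfs_bench_tmp.dat

# Sequential write test: /dev/zero as the source. Zeros compress
# trivially, so disable dataset compression first or the numbers are inflated.
dd if=/dev/zero of="$TESTFILE" bs=2048k count=16 2>&1 | tail -1

# Flush outstanding writes (best effort) so the read test is less likely
# to be served purely from cache.
sync

# Sequential read test: stream the file back to /dev/null.
dd if="$TESTFILE" of=/dev/null bs=2048k 2>&1 | tail -1

# Clean up the benchmark file.
rm -f "$TESTFILE"
```

The extra `sync` between the write and read passes is the variation the third linked thread suggests for better accuracy.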
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
While a dd test does remove the protocol-related overhead, for a system with 2 or 4 cores you are talking about real-world networking performance roughly 50% or 33% below the dd numbers, respectively. Your CPU plays a HUGE part in dd performance -- more than any other component, as long as you don't have another bottleneck.

When I had an empty 8-drive RAID6 zpool, dd tests were 5-10% faster using a 4K-sector zpool on drives that had 512-byte sectors.

To be completely honest, I don't even know where to point you. There are so many possible limitations and so many factors that affect overall performance.

I think a dd test is about the best you're going to get for raw ZFS performance. For instance, my main server gets something like 450MB/sec on a dd test with 2 vdevs of 8 drives each in RAIDZ2 (note: I didn't follow the n^2-parity rule of thumb because we already had these drives). Since I have 2x 1Gb Intel NICs, even maxing out both would yield only 266MB/sec. So technically, my CPU is far more powerful than needed for any expected load. But if you throw in a scrub WHILE using the system, that can affect performance too. I typically don't do heavy loading when I know a scrub is going on.

The iozone test is probably more thorough because it tests the entire disk surface (at least, that's my understanding of it). dd will give you a performance indicator of what you have right this minute. Of course, the zpool is empty, so the file is likely to be created at the beginning of the disks, where performance is roughly 50% faster than at the end of the disk.

I would say that having a dd result that is 200%+ above your maximum possible network bandwidth is optimal. This allows for delays due to seek times and hopefully still yields the highest network performance.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
@noobsauce80
I thought that in a ZFS RAIDZ1 or RAIDZ2, the data was striped across the drives in random locations rather than filling up the front first. I may have been misled in the past, so if you know, could you tell me: does the front of the drive really get populated first, or is placement random throughout the entire drive? I do agree that normally a drive is filled from the outer edge toward the center, but I'm only questioning it with respect to ZFS.

Hey bollar,
Feel free to send me that other CPU and RAM you are removing, I'm sure I could put it to good use ;)

-Mark
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Actually, it goes from the outside in. You have a "longer" track on the outside, hence more sectors on the outside, hence faster speed. My understanding is that ZFS is semi-random: it tries to choose smart places, but still loads more toward the beginning than the end. If someone has a deeper understanding or a correction, I'm definitely all ears. It seems that technical info on ZFS is hard to come by. I assume that is partly because Oracle somewhat "owns" ZFS now and they are only interested in things that make them big money.

Edit: Obviously, if you assume the worst case (that ZFS is completely stupid and loads beginning to end), then the dd test would not necessarily be a good indicator of future performance, so my suggestion of 200%+ headroom would be a good starting point. If ZFS isn't completely stupid (which I think we agree is the case), then dd would be a better indicator of future performance.

Also, I understand that ZFS makes smart decisions where it can. If you copy a 4GB file to ZFS, it knows the file will be 4GB in size and will find a good place for it. On the other hand, if you use the dd command, the file "grows" to its final size and ZFS has no clue how big it will get, so it makes a guess that might not be the best placement.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
My understanding is that ZFS is semi-random. It tries to choose smart places, but still loads more towards the beginning than the end. If someone has a deeper understanding or a correction I'm definitely all ears. It seems that trying to get technical info on ZFS is.. hard to come by.

No, it's not hard to come by, it's just hard to find the relevant bits when you need it.

Ironically I just posted a link to a description of ZFS block allocation strategies in the other busy thread this morning. Since I'm either lazy or busy (or both!) I'll just leave you with that for now.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Actually, it goes from the outside in.
I knew that, my wife must have conked me on the head because I'd never say that of sound mind.
 

bollar

Patron
Joined
Oct 28, 2012
Messages
411
Thanks for your thoughts. The machine will have 32GB RAM, a 4-core Xeon E5-2609 2.4GHz (LGA 2011), and dual LSI 9207-8i HBAs. If I really come across a CPU or RAM constraint, I can put in another CPU and double the RAM for a couple of tests. I'm taking them out to run the failover server, since they seemed basically unused in a production environment.

My control test of the current array (2 CPUs, 64GB RAM, RAIDz2, two 8-drive Vdevs, one with 2TB/512-byte drives and one with 3TB/4K drives) yielded a write of 1,219MB/sec and a read of 2,850MB/sec.

Code:
[root@freenas] /mnt/bollar/test# dd if=/dev/zero of=tmp.dat bs=2048k count=50k
51200+0 records in
51200+0 records out
107374182400 bytes transferred in 83.981580 secs (1278544442 bytes/sec)
[root@freenas] /mnt/bollar/test# dd of=/dev/zero if=tmp.dat bs=2048k count=50k
51200+0 records in
51200+0 records out
107374182400 bytes transferred in 35.932821 secs (2988192411 bytes/sec)
[root@freenas] /mnt/bollar/test#
 

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,403
I guess you missed the [thread=981]performance sticky[/thread] in the Storage forum? It's the dd test joeschmuck used in the one thread. The dd tests make a poor benchmark, unless your workload is a single, streaming, large block size writer/reader, but a good baseline to have.

My control test of the current array (2 CPU, 64G RAM, RAiDz2, 2-8 drive Vdevs, one 2TB/512 drives and one 3TB/4K drives) yielded write of 1,219MB/sec & read of 2,850MB/sec.
Do you have compression enabled on the dataset?

Never mind, I just noticed the 64GB of RAM. The file is only 100G. If you are trying to hit the disks, up the file size or reduce the available memory. See [post=43638]jpaetzel's post[/post] to do so without having to physically remove it.
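To make the advice above concrete, here is a sketch of checking and disabling compression before benchmarking, plus the loader tunable approach for capping ARC instead of physically pulling RAM. The dataset name `tank/test` and the 4G cap are illustrative placeholders, not values from this thread.

```shell
# Check whether compression is enabled on the dataset under test
# ("tank/test" is a placeholder for your actual pool/dataset).
zfs get compression tank/test

# Disable it for the duration of the benchmark, so that the zeros
# from /dev/zero actually hit the disks instead of compressing away.
zfs set compression=off tank/test

# Alternatively, cap ARC rather than removing RAM: add a tunable to
# /boot/loader.conf and reboot. The 4G value below is just an example.
#   vfs.zfs.arc_max="4G"
```

Remember to re-enable compression on the dataset when the benchmarking is done.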
 

bollar

Patron
Joined
Oct 28, 2012
Messages
411
I guess you missed the [thread=981]performance sticky[/thread] in the Storage forum? It's the dd test joeschmuck used in the one thread. The dd tests make a poor benchmark, unless your workload is a single, streaming, large block size writer/reader, but a good baseline to have.

Do you have compression enabled on the dataset?

Never mind, I just noticed the 64GB of RAM. The file is only 100G. If you are trying to hit the disks, up the file size or reduce the available memory. See [post=43638]jpaetzel's post[/post] to do so without having to physically remove it.

Ah, compression is on -- thanks for that tip! I had read the performance sticky, but must have zoned out by the time I got to the last page.

I need to think about what to do with RAM -- probably I'll do a couple of tests to see, but one of the questions I want to answer is whether RAM and CPU can mitigate the potential performance issues caused by multiple parity, irregular sector sizes, etc. I am a little surprised that I haven't found similar A/B testing in my searches -- maybe I don't know the right keywords.

The dd test without compression yielded a write of 320MB/sec and a read of 366MB/sec.
Code:
[root@freenas] /mnt/bollar/temp# dd if=/dev/zero of=tmp.dat bs=2048k count=50k
51200+0 records in
51200+0 records out
107374182400 bytes transferred in 320.022525 secs (335520703 bytes/sec)
[root@freenas] /mnt/bollar/temp# dd of=/dev/zero if=tmp.dat bs=2048k count=50k
51200+0 records in
51200+0 records out
107374182400 bytes transferred in 279.758637 secs (383810071 bytes/sec)
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Bollar, do you have compression enabled for the zpool? If you do then that will make your performance seem amazingly high since zeros compress so well.
 

bollar

Patron
Joined
Oct 28, 2012
Messages
411
So, it turns out that ZFSGuru has a benchmarking component that generates charts like these:

[Attached: two ZFSGuru benchmark charts]


Basically, the charts show that there's a slight performance drop from using a sub-optimal number of drives with 4K sectors, but the difference is probably not significant enough for most of us to worry about.

See this thread for more info: Testing ZFS RAID-Z performance with 4K sector drives
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
And someday you realize that benchmarks don't tell the whole picture daniel-son. But the data is interesting nonetheless.
 

Stephens

Patron
Joined
Jun 19, 2012
Messages
496
And someday you realize that benchmarks don't tell the whole picture daniel-son. But the data is interesting nonetheless.

Daniel-san

I think you're referring to generic benchmarks, as opposed to benchmarks designed to mirror and quantify real-world usage scenarios.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I'm actually referring to the fact that each person's (or business's) workload is unique. If you know you are writing very small quantities of data, you want to adjust your stripe size accordingly. Streaming HD movies is a significantly different load from running a database. Even choosing a database benchmark because you intend to run a database is not an indicator of what your performance will be. The best benchmark is to actually use the system in the exact fashion you will be using it.

With regard to the sub-optimal performance of odd drive counts at 4K sector size: if you are writing a very large file, the ZFS write cache, in conjunction with the hard drive cache (and a RAID controller's onboard write cache, if enabled), may hide the fact that you are writing off-size data, because the writes get combined into clean, evenly sized sequential writes. So what may theoretically be 1GB of mis-sized writes may turn into all properly sized writes except for the last 128KB or less, thanks to caching. In that case the performance penalty may not be noticeable in the big scheme of things.

So I've never subscribed to the idea that performance is killed just because someone is using 4K sectors with an off-aligned vdev. My first FreeNAS server has 2 vdevs of 8 drives in RAIDZ2, and the performance is, in my opinion, pretty amazing. I've had no performance issues at all despite zero tweaking except for CIFS, and almost all (if not all) of my tweaks are now Samba default settings.

Benchmarks are getting more and more irrelevant as technology becomes more complex. Despite being a power user, all of my machines are 1st-gen i7s, because they all have SSDs. Yes, the benchmarks say my computers are slow by today's standards. If I had been using regular hard drives, I would certainly have upgraded by now. But adding an SSD results in some incredible increases in system efficiency and responsiveness. I know people with far superior systems to mine that are not using SSDs, and they wonder why my 2-3 year old CPU is crushing them.

Benchmarks shouldn't be taken at face value. They give some theoretical data based on the theoretical loading of the benchmark, but they are no indicator at all of how your machine will work for your purpose.

I know someone who is constantly buying faster SSDs because he wants his machine to be faster. He has lots of benchmarks showing how his new SATA3 SSD is 50% faster than his SATA2 SSD, and he has no problem making it very clear that my "aging" Intel G2 SSD is "slow" and I should upgrade. He's completely failing to realize what makes SSDs so wonderful: it's all about those microsecond seek times, and much less about saturating a SATA3 channel with sequential writes. Buying that new SATA3 drive does not promise a savings in seek time equivalent to going from an HDD (3-7ms) to an SSD (sub-0.1ms). I've saved a lot of money by keeping my G2s because they work great for me. I'm not copying 30GB of data on my desktop all day long, so having super-high write rates means nothing to me. All he wants to do is tell me how he can copy and paste 25GB of data from his SSD to his SSD in 45 seconds. My "slow" Intel SSD takes quite a bit longer, but for my loading it doesn't matter one single bit. Benchmarks can be damned.

Understanding what a benchmark does, how it does it, what assumptions it makes, and how it correlates (or fails to correlate) with your setup is just as important as the results of the benchmark.
 

bollar

Patron
Joined
Oct 28, 2012
Messages
411
My preference is to tune a system layer-by-layer and in my case, this means understanding the configurations that should be fastest reading and writing to the drives before I try to move to the next step and evaluate how this applies to my iSCSI and AFP workloads.

I thought there was a lot of interesting information on this guy's system presented in the graph.

- RAID1+0 writes are slow. RAID1+0 read performance doesn't significantly beat RAIDz1 until there are 10 RAID1+0 drives
- RAIDz1 read performance is basically identical to RAID0 up to four drives plus parity.
- RAIDz1 write performance penalty isn't IMO particularly significant compared to RAID0.
- You can get some great read performance with RAID0 and RAID1+0 if you have enough drives. Even with the Core2Duo used in this test.
- Striping Vdevs didn't give as much of a write boost as I would have expected.

From this chart and the other charts on that thread, there are probably some conclusions that could be drawn:

- On modest systems, a five-drive RAIDz1 (or six-drive RAIDz2) gives good performance and is worth considering over smaller array sizes.
- There are diminishing returns on RAIDz1 performance after five drives.
- RAID1+0 is not always faster than RAIDz1 and you'll want to understand the implications before you make that decision. I think this is important, because conventional wisdom suggests mirroring/striping is faster than parity.
- 4K blocks are not a big deal for most of us -- at least until drives come out that only support 4K blocks.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
- There are diminishing returns on RAIDz1 performance after five drives.
- RAID1+0 is not always faster than RAIDz1 and you'll want to understand the implications before you make that decision. I think this is important, because conventional wisdom suggests mirroring/striping is faster than parity.
- 4K blocks are not a big deal for most of us -- at least until drives come out that only support 4K blocks.

My problem with the benchmarks he gave with the Core 2 Duo is that he could have been CPU-bottlenecked, hitting the CPU limit before he could see the performance difference between the two. If he was CPU-bottlenecked, that, in conjunction with the disk caching, could have hidden the true performance hit. I'm not sure it's fair to draw those conclusions from such low-powered hardware unless you are actually using hardware that old.

Also, I saw a motherboard somewhere that used 2x PCIe lanes for 6 SATA3 ports. I got a laugh because a single SATA3 port could theoretically max out those PCIe lanes. In the manufacturer's forum, lots of people were complaining: some had bought two top-of-the-line SSDs, and when they started copying from one drive to the other (or one to the same), they had a hard time getting above 250MB/sec. But if you looked in the manual, the block diagram made it pretty clear what the bottleneck was. Eventually the manufacturer agreed that it was the 2x PCIe lanes. A lot of people cried foul, but how can you be upset when the manufacturer included that information in the manual? I don't know about anyone else, but I always look at the block diagram when I'm building a system for a specific purpose, to make sure I'm not trying to do anything that will be motherboard-bottlenecked.

What I'd really like to see is some case studies with high-powered hardware as well as low-powered hardware (perhaps the same CPUs underclocked?) so the results can be compared.
 

bollar

Patron
Joined
Oct 28, 2012
Messages
411
My problem with the benchmarks he gave with the Core 2 Duo is that he could have been CPU bottlenecked.

Yes. Like I said, there are lots of tests in that thread and most of them are for modest systems that will be constrained by some combination of factors. However, the guy whose test I copied above eventually upgraded his CPU, bus, HBA and RAM and had significant performance gains (of course), eventually getting to the max bandwidth of his HBA:

[Attached: two updated benchmark charts]


Still, I'm surprised at his RAID 1+0 results: they're slower than parity until you get out to fourteen drives. RAIDz1 read performance still matches RAID 0 performance up to five drives, and only then starts to drop off.

Aside from the fact that these benchmarks apparently use dd and have the limitations we already know about, the other caution I would add is that those tests are now two years old and used much older versions of ZFS on FreeBSD.

Also, what I haven't found yet is anyone who has done random read/write tests on a high-end system -- only on modest systems. RAID 0 and RAID 1+0 have better performance on random tests than any RAIDz, but the performance was quite bad (<100MB/sec) on all of them.
 

bollar

Patron
Joined
Oct 28, 2012
Messages
411
Very interesting article from a former Sun guy on how ZFS performance works. As a bonus, he answers a reader comment about the N^2-parity question:

... there are at least two things that ZFS does to deal with alignment of blocks to RAID-sizes. First, blocks are aggregated in the ARC buffer before they are written to disk. So in normal operation, your 4k transactions may well end up being grouped with transactions from many other threads, then written together as a continuous stream of blocks across any RAID-Z size. Second, ZFS supports variable stripe sizes. So for certain cases, even if the RAID-Z size is 5+p (as an example), it may decide to still write in a 4+p pattern. This is because in ZFS, each RAID-Z block is translated into its own stripe at variable sizes. So, the number of disks you give to ZFS for a RAID-Z setup is only an upper bound.

So, assuming a reasonable amount of traffic, there's no need to consider block sizes when choosing RAID-Z stripe widths. (And if the traffic is so scarce that only occasionally there are writes and all of these writes are just single 4k blocks, there won't be any significant impacts in performance anyway.)

A Closer Look at ZFS, Vdevs and Performance
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Hey, what that guy said about the ARC "fixing" write performance for zpools that are odd-sized disks is what I thought was going on :P

He also mentions why NFS or iSCSI doesn't necessarily get the same performance as a file copy.

My favorite part of the whole page:

In practice, you'll almost always see better numbers, because of many possible reasons:

You were lucky because the disk head happened to have been near the position it needed to be.
You were lucky and your app uses large IOs that were split up into multiple smaller IOs by the system which could be handled in parallel.
You were lucky and your app uses some portion of asynchronous IO operations so the system could take advantage of caching and other optimizations that rely on async IO.
You were lucky and your app's performance is more dependent on disk bandwidth than latency.
You were lucky and your app has a bottleneck elsewhere.
You're benchmarking your ZFS pool in a way that has nothing to do with real-world performance.

Emphasis is mine. :D

It does make one thing clear, and now I realize it: CoW file systems may never be good long-term performers for boot drives (even local ones).
 

bollar

Patron
Joined
Oct 28, 2012
Messages
411
Hey, what that guy said about the ARC "fixing" write performance for zpools that are odd-sized disks is what I thought was going on :P

He also mentions why NFS or iSCSI doesn't necessarily get the same performance as a file copy.
Well, now you can say so with authority! :P

In his RAID-Greed article, he has almost convinced me to use eight 2-disk mirrors.
 