Pool performance scaling at 1J QD1

Rand

Guru
Joined
Dec 30, 2013
Messages
906
Hello,

I am still trying to build a FreeNAS box which will ultimately serve as a datastore for an ESXi cluster. Since earlier attempts did not yield the desired/expected performance, this time I am taking it slow and trying to establish solid performance at each layer. Unfortunately I hit a snag right at the beginning - pool performance.

Now I realize my use case is special since I am looking to get good write performance at 1 job, QD1, and that's rarely benchmarked.
[Actually I assume that use case is the most common one in homelabs, but most home users don't have my expectation level nor the will to throw a ton of hardware at it. Enterprise vendors never look into this or provide numbers for it, since it's not the typical business use case.]

The question I am looking to get answered is based on the "general knowledge" (plus articles like https://www.ixsystems.com/blog/zfs-pool-performance-1/) that a pool's (write) performance increases with the number of vdevs in the pool. So if a single vdev is capable of 300 MB/s [streaming writes at, let's say, 128K], a second vdev [stripe, second mirror pair or second raidz vdev] should increase that to [theoretically] 600 MB/s. Given some overhead, the expectation I'd have would be maybe ~500 MB/s.

So how would one expect this to scale? Being realistic, I'd assume diminishing returns with each new vdev due to increased overhead, so that at some point it's not worth adding further vdevs - the exact number probably depends on the type of drive being used.

(Of course it will also depend on the other hardware in the box, especially CPU single-thread performance, drive attachment options [SATA, SAS, NVMe], etc.)

It's not scaling well in my tests.

Looking forward to hearing feedback; I will provide test results later (after we've come to a common understanding of expectations ;))


edit: Changed thread title since we didn't do expectation management at all ;)
 
Last edited:

jro

iXsystems
iXsystems
Joined
Jul 16, 2018
Messages
80
Heya! For the sake of those stumbling on this conversation, I wanted to copy over my response from your Jira ticket (found here):

As I mentioned in the article, there are many factors that go into determining how a pool will perform in production; vdev performance and scaling are just one of those factors.

To answer your first question: as far as pure pool performance based on vdev configuration is concerned, sequential writes should scale even with QD1/J1. This is because all writes are aggregated in memory into a transaction group (txg) before being flushed to disk. Default ZFS tuning causes the txg to be flushed every five seconds or when the txg hits a certain size (whichever comes first). If you can manage to fill the txg with QD1/J1 at the same rate as with a deeper queue depth and more concurrent jobs, sequential write performance of the pool should be the same. Filling the txg with QD1/J1 is the challenge... there are a lot of other factors in there that can cause the sort of diminishing returns you're seeing. You'll have to do some research on ZFS tunables to see what might be appropriate in your use case (but tread carefully, messing with tunables can hinder performance too).
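For reference, the txg-related knobs on FreeBSD/FreeNAS are exposed as sysctls. A minimal sketch (names are from stock FreeBSD ZFS; check the values on your own build first, and treat any change as an experiment to be measured, not a recommendation):

sysctl vfs.zfs.txg.timeout      # seconds before an open txg is forced out (default 5)
sysctl vfs.zfs.dirty_data_max   # cap on dirty (not yet written) data, in bytes
sysctl vfs.zfs.txg.timeout=10   # example change only - benchmark before and after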

iXsystems has a dedicated system performance testing team that works very closely with our development team to remove scaling bottlenecks as they're discovered. If you aren't already on the latest version of FreeNAS, I would recommend updating and trying your tests again. v11.3 might bring some performance improvements as well. The answer to your second question is obviously extremely complex and it's exactly what our performance team works to answer.

To address your follow-up, we have seen better scaling in our lab, but we typically run tests on iSCSI rather than with fio because it's more representative of the production workloads our users run. Obviously, we also test NFS and SMB, but iSCSI tends to be the highest performing of the three. On our iSCSI tests, performance with J1 does seem to get "stuck" right around where your workload plateaus. We use external load generator hardware to run multiple I/O threads; in this way, pool performance continues to scale nicely up to ~16-32 threads (aka, J16-32). Beyond that, adding more I/O threads doesn't seem to increase overall performance on current builds.

Sequential tests do tend to exhibit diminishing returns before random I/O tests because they can be thread-bound and CPUs can only handle so many billions of operations per second. Some of those operations can be offloaded to the NIC, but you still run into a wall pretty quickly. This is a phenomenon the industry noticed a while ago, and people have started to move towards RDMA via iWARP and now RoCE. I think that's certainly a potential future direction for FreeNAS, but protocol support for RDMA is either very new (NFS and iSCSI/iSER) or basically non-existent (Samba/SMB-Direct). Random I/O will benefit from RDMA as well, but we can already get pretty solid random performance over traditional Ethernet.

In the last year or so, we've been putting a lot of resources into performance testing and we plan to continue to do so. There are a lot of moving parts in these systems (the sharing protocol, the network stack, and OpenZFS itself) but we've made great progress recently. I expect that progress to continue, if not accelerate, so keep checking for updates and read the release notes! :)
 

Rand

Guru
Joined
Dec 30, 2013
Messages
906
Cheers - thanks for picking this up :)

I've used local fio testing to establish a baseline of pool capability without the overhead/interference of the network or higher-level protocols. Of course that will become relevant before I am done, so it's very good to hear that you've made progress and are looking into advanced techniques - very much looking forward to seeing those in action :)
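For context, the baseline runs are plain local fio jobs against a dataset on the pool, roughly along these lines (path, size and runtime here are placeholders rather than my exact job file):

fio --name=seqwrite-qd1 --directory=/mnt/testpool/fio \
    --rw=write --bs=128k --ioengine=psync --iodepth=1 --numjobs=1 \
    --size=16g --runtime=60 --time_based --group_reporting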

Since my final setup will be providing storage for ESXi, streaming writes will not be the workload to optimize for; so it's good to hear that scaling on random I/O might be better. I'll update this thread with some random I/O test results and would very much appreciate it if you could have a look at those numbers then and let me know if they are in the same ballpark as yours :)
 

jro

iXsystems
iXsystems
Joined
Jul 16, 2018
Messages
80
For random I/O, you'll want to make sure you're using mirrors/RAID10 and a fast SLOG device. The 280GB Optane 900p AIC is a great option.
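As a sketch of that layout (device names are placeholders for your own disks, with the Optane as the log device):

zpool create tank mirror da0 da1 mirror da2 da3 mirror da4 da5 log nvd0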
 

Rand

Guru
Joined
Dec 30, 2013
Messages
906
Still on async stripes to get the theoretical maximums :)
Have been playing around with some really nice SLOGs too; some trouble there as well, but the top pick seems to be working, so no biggie.
 

Rand

Guru
Joined
Dec 30, 2013
Messages
906
So here is a first random write result ... same disks, stripe, sync disabled, 128K

Edit2: This chart is incorrect - after a thorough wipe of the drives the inconsistency went away, see next post for updated chart
[attached chart: 1574634418911.png - initial random write scaling results, superseded by next post]


Will rerun to see if that is consistent across different block sizes/disk types, but it's still not what I'd expect from adding more drives... ;)


Edit:
So I have been observing gstat during a set of test runs and I can clearly see that with 2 drives they are 100% busy on async writes. With more drives the utilization per disk becomes lower [85% (4), 66% (6) ...] - the question is why? I have not seen WCPU peak much over 50% at any time; fio reported max_user_cpu of 44% with 7% sys CPU on the same run, so in line with the top values ...
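(For reference, I'm grabbing the gstat samples in batch mode so they can be appended to a log, roughly like this - -p limits it to physical providers, -b prints one sample per call:)

while true; do gstat -b -p -I 1s; done >> gstat_run.log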



Some different disks: Micron DC630 3.2 TB -
[attached chart: 1574635704427.png - random write results with the Micron DC630 drives]


They have seen way less testing than the SS300s, so maybe I need to do a full wipe on these to get rid of the variations... will need to find a CLI command for that ;)
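Probably something along the lines of a plain zero-fill per drive would do (device name is a placeholder; a proper secure erase would be cleaner for SSDs if the tooling is at hand):

dd if=/dev/zero of=/dev/da0 bs=1m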
 
Last edited:

Rand

Guru
Joined
Dec 30, 2013
Messages
906
Here is an updated chart with freshly wiped drives (less inconsistency)

[attached chart: 1574753116033.png - random write scaling with freshly wiped drives: sync disabled, sync=always, sync=always + SLOG]


This one shows sync disabled, sync=always, and sync=always with SLOG - as we can see, there is basically no scaling at all. This is fio with 100% random writes.

@jro Not to be difficult, but that is definitely not what one would expect from multiple vdevs (let alone hope for). While I do appreciate that this is an uncommon setup, I am quite disappointed, since "common knowledge" and your article suggested that this should do better.
It might be helpful for other people to point out that concurrent access is needed in order to make use of multiple vdevs - see the fio sketch below for what that looks like in practice.
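For comparison, a rough sketch of a higher-concurrency random write job (path, size and job count are placeholders, not my exact job file):

fio --name=randwrite-j16 --directory=/mnt/testpool/fio \
    --rw=randwrite --bs=128k --ioengine=posixaio --iodepth=8 --numjobs=16 \
    --size=4g --runtime=60 --time_based --group_reporting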
 

jro

iXsystems
iXsystems
Joined
Jul 16, 2018
Messages
80
I can bring this up at our performance testing meeting next week, but without being able to dive into your hardware and the configuration used, it's really hard to say what's causing the bottleneck. Were you watching CPU utilization? Sometimes stuff like this gets core-bound.
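A quick way to check is the per-CPU view in top while a run is going, e.g.:

top -SHP

If one core (or a handful of fio/zfs threads) sits pinned near 100% while the others idle, that's a good hint the workload is core-bound rather than disk-bound.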
 

Rand

Guru
Joined
Dec 30, 2013
Messages
906
Well, it's not as if I don't have that information ;)
Please find attached a CSV sheet (as rar) with everything fio dumps out for a whole lot of tests (including CPU stats); the runs above are included.
If you want it, I also have a complete log of all tests in that sheet, including regular gstat, iostat and ps aux calls - but that's too big to attach here ;)

If you need any other info (config dump), I am happy to provide that as well - or run any tests you want; it's a test box at this point in time.
 

Attachments

  • report_fio_948.rar
    742.6 KB · Views: 258