Benchmarking ZFS performance on TrueNAS

nemesis1782 · Mar 8, 2021

Hi All,

I would like to make to switch to ZFS and TrueNAS. I'm in the process of gaining knowledge, aqcuiring hardware and asking question. (Thread about that: https://www.truenas.com/community/t...-first-treunas-setup.91526/page-2#post-634858)

Now from what I've seen there are many conflicting stories especially on performance and benchmarking it is a bit of mixed bag because of the way ZFS works.

What I'm thinking is writing something to do a proper benchmarking tool for TrueNAS / FreeBSD to "benchmark" a ZFS pool. First of because of the way ZFS works a benchmark would be time consuming and would put your hardware through it's pases for a long time! So I'm looking to make a "cheap" array to test with. SInce a 6x1TB disk vdev will probably scale in performance and behavior compared to a 6x8TB enterprise disks vdev.

How will I be testing (this approach will probably not work when using dedup and compression):
Note: Sample means sample and log.
- Have a second disk array of fast read storage with static files of differing sizes

Archiving performance:
- Test A1: Archiving large chunks of data, single source
--> Start with a clean zPool containing 1 vDEV
--> Start writing a single stream of Large chunks of data until full and sample the throughput (every 10 seconds or so), include important zPool and vDEV statistics for each sample, still need to see what I'll need to be logging
--> Test read at certain intervals
- Test A2: Archiving small chunks of data, single source
--> Start with a clean zPool containing 1 vDEV
--> Start writing a single stream of small chunks of data until full and sample the throughput (every 10 seconds or so), include important zPool and vDEV statistics for each sample, still need to see what I'll need to be logging
--> Test read at certain intervals
- Test A3: Archiving large chunks of data, multi source
--> Start with a clean zPool containing 1 vDEV
--> Start writing multiple (10x) streams of Large chunks of data simulataniously until full and sample the throughput, total and per stream (every 10 seconds or so), include important zPool and vDEV statistics for each sample, still need to see what I'll need to be logging
--> Test read at certain intervals
- Test A4: Archiving small chunks of data, multi source
--> Start with a clean zPool containing 1 vDEV
--> Start writing multiple (10x) stream of small chunks of data until full and sample the throughput (every 10 seconds or so), include important zPool and vDEV statistics for each sample, still need to see what I'll need to be logging
--> Test read at certain intervals
- Test A5: Archiving varying chunks of data, multi source
--> Start with a clean zPool containing 1 vDEV
--> Start writing multiple (10x) stream of varying sizes of chunks of data until full and sample the throughput (every 10 seconds or so), include important zPool and vDEV statistics for each sample, still need to see what I'll need to be logging
--> Test read at certain intervals

Ever changing data disk. For instance download destination:
I will add this later, since I really need to get to work now.

Block devices:
I will add this later, since I really need to get to work now.

I am a experienced Software engineer with C oriented languages. Maily C++(Linux), C#(Windows and Linux). However do not yet have any experience developing for FreeBSD or TrueNAS.

Any input would be greatly appreciated!

sretalla · Mar 8, 2021

nemesis1782 said:
until full

ZFS can be considered full at 80% of pool capacity. There's no need to test to 100% (which will kill the pool).

nemesis1782 said:
still need to see what I'll need to be logging

zpool iostat -v
zpool list -v (interesting to see data distribution amongst VDEVs,

nemesis1782 · Mar 8, 2021

sretalla said:
ZFS can be considered full at 80% of pool capacity.

Thnx. For the reply that is what everyone keeps telling me. I'll make it a configurable, however I will be verifying that claim.

sretalla said:
There's no need to test to 100% (which will kill the pool).

That still flaber gasts me :P

sretalla · Mar 8, 2021

nemesis1782 said:
Thnx. For the reply that is what everyone keeps telling me. I'll make it a configurable, however I will be verifying that claim.

That still flaber gasts me :P

ZFS is a COW filesystem. It needs to allocate new data into unused blocks and does so in transaction groups that involve contiguous blocks wherever possible. As a pool fills, fewer such contiguous blocks will remain, so operations become less efficient and more difficult to perform. (I find this article a good explanation: https://arstechnica.com/information...01-understanding-zfs-storage-and-performance/)

A pool will not die until it is actually full (100%), but there's nothing stopping you from putting data on a 99.9% full pool to tip it over the edge.

For good IOPS/block storage performance, keeping below 50% full is a recommendation.

If you feel that it's necessary to "verify" any of it, go ahead.

nemesis1782 · Mar 8, 2021

As a start. I'm not trying to annoy you or saying you do not know what you are talking about. I am however noticing a lot missing factual and measurable information on the topic. I do very much understand that my knowledge on the subject is limited, one of the goals is to increase this.

sretalla said:
ZFS is a COW filesystem. It needs to allocate new data into unused blocks and does so in transaction groups that involve contiguous blocks wherever possible. As a pool fills, fewer such contiguous blocks will remain, so operations become less efficient and more difficult to perform. (I find this article a good explanation: https://arstechnica.com/information...01-understanding-zfs-storage-and-performance/)

Well, yes I understand how ZFS works to some degree and it's performance degradation is reliant on many factors. Nice article btw I'll give that a read tonight. However from what I understand at this point 80% is well wrong. It depends on the size of your files, the amount of changes and a lot of other factors. For instance mixing small and large files in heavily a changing data set can actually make that more like 40%. That is the nature of fragmentation. There are however many other factors as well that contribute to the performance of the system.

Can you list them and their exact impact for me?

sretalla said:
A pool will not die until it is actually full (100%), but there's nothing stopping you from putting data on a 99.9% full pool to tip it over the edge.

I'm sorry but that is just sloppy/bad design. You should not be able to kill it with normal usage! I hope that we can at least on that much ;)

sretalla said:
For good IOPS/block storage performance, keeping below 50% full is a recommendation.

Well here you basicly say ok. We have the 80% but.... Now I think there are some more buts hidden in a number of places. So why not find out and know instead of just guessing and throwing the dice.

sretalla said:
If you feel that it's necessary to "verify" any of it, go ahead.

To this point I've just seen guestimates. No hard numbers no clear explanations other than this is how it works. I understand that but what is the ACTUAL impact. From what I see I'm not the only one that has these questions.

Further more just going with the flow and hoping for the best is not the approach I would like to choose for such a sizable investment.

sretalla · Mar 8, 2021

nemesis1782 said:
Can you list them and their exact impact for me?

I'm not going to do that.
People (not limited to this: https://www.oracle.com/technical-re...cture/admin-sto-recommended-zfs-settings.html) have already done it.

nemesis1782 said:
I'm sorry but that is just sloppy/bad design. You should not be able to kill it with normal usage! I hope that we can at least on that much ;)

So you're calling the same fault on every car ever made... not one I'm aware of that can't be driven off a cliff using the accelerator and steering wheel normally.

nemesis1782 said:
Now I think there are some more buts hidden in a number of places. So why not find out and know instead of just guessing and throwing the dice.

They are known and documented in this forum in addition to elsewhere.

nemesis1782 said:
Further more just going with the flow and hoping for the best is not the approach I would like to choose for such a sizable investment.

Fair point, as I said, test/prove it if you want to.

nemesis1782 · Mar 8, 2021

sretalla said:
I'm not going to do that.
People (not limited to this: https://www.oracle.com/technical-re...cture/admin-sto-recommended-zfs-settings.html) have already done it.

Lol. So yeah sorry for the confusion. Of course I do not ask you to regurgitate the information for me. This seems to be only for vsphere related applications though.

sretalla said:
So you're calling the same fault on every car ever made... not one I'm aware of that can't be driven off a cliff using the accelerator and steering wheel normally.

Again no. I've commented on this in the other topic as well. It would be true if every file system would die once at 100% usage, but this is not the case mow is it.

If you want to make a analogy to a car. It'd be filling car with 100% gas and it being bricked. Or maybe emptying the tank of a car and it being bricked.

Another analogy would be reaching maximum speed and the car being bricked.

Your analogy would be more akin to throwing your storage server of a cliff while connected with very long connection cords. Now in that case the thing not surviving is no longer a design flaw.

sretalla said:
They are known and documented in this forum in addition to elsewhere.

To start of a forum is not a source of documentation. Although it can be a way to acquire information.

The number of questions and incorrect assumptions throughout this forum and the internet seem to contradict your statement. If you want I can compile a list for you. Just let me know how large you want the sample size.

As for a example of documented, I would still not call this well documented: https://www.synology.com/en-global/products/performance
Another example of documentation, less pretty but of higher quality (also with a different intention): https://docs.splunk.com/Documentation/Splunk/8.1.2/Capacity/Referencehardware
Another example: https://docs.influxdata.com/influxdb/v1.8/guides/hardware_sizing/

As a closing remark:
Like it or not ZFS and thus TrueNAS seems to have some MAJOR disadvantages. I do recognize that it also has MAJOR advantages. However putting in the sand and ignoring the con's doesn't help any one. Now I am not saying my conclusion are correct. However I still haven't seen any tangible proof that your claims are correct, nor do I say they're incorrect. Except for the claim that using a file system and filling it to the bring causing a collapse of spacetime being normal and expected behavior.

nemesis1782 · Mar 8, 2021

@sretalla: Just wanted to add that I went through a lot of your posts around the forum and they're mostly helpful and do definitely indicate you know what you're talking about at least from a usage standpoint.

sretalla · Mar 8, 2021

So is filling a disk to 100% a user or a system behavior? Is designing a system in a way that will automatically result in a disk becoming full a system error?

nemesis1782 said:
every file system would die once at 100% usage

Many do. Others are helped by protection features of the OS they are running on.

nemesis1782 said:
filling car with 100% gas and it being bricked

If bricked = having gas all down your leg, then sure it is. And the car doesn't stop that either (it's the gas pump... much the same as the user shouldn't fill the disk past the fill limit...).

nemesis1782 said:
To start of a forum is not a source of documentation.

... Unless you count the documentation and resources sections...

nemesis1782 said:
Like it or not ZFS and thus TrueNAS seems to have some MAJOR disadvantages.

I don't think anyone is trying to deny them. You've been invited to test and confirm them for yourself.

nemesis1782 said:
However putting in the sand and ignoring the con's doesn't help any one.

Nobody here is doing that. If anything, we're trying to help people to see them and avoid the consequences.

nemesis1782 said:
Except for the claim that using a file system and filling it to the bring causing a collapse of spacetime being normal and expected behavior.

It's a feature of the filesystem and thus can/should be avoided in designing to employ it.

Perhaps you feel this exchange is adversarial... I'm just responding to your points, no malice here.

nemesis1782 · Mar 8, 2021

sretalla said:
So is filling a disk to 100% a user or a system behavior? Is designing a system in a way that will automatically result in a disk becoming full a system error?

Many do. Others are helped by protection features of the OS they are running on.

If bricked = having gas all down your leg, then sure it is. And the car doesn't stop that either (it's the gas pump... much the same as the user shouldn't fill the disk past the fill limit...).

Ok wait. Think we might have a bit of miss communication here.

What I interpreted is 100% full zPool is dead zPool. Remove the zPool and try again kind of dead. What I think you mean 100% full will mean no more writing and possible side effects due to a system not being able to persist it's data anymore which is well, yeah duh.

So let's say I fill a zPool for 100% will the zPool actually fail? Or will it allow me to remove say 90% of it's data and start using it again? (Fragmentation issues will in that case still be probable of course)

As for what I meant with bricked, see the top definition here: https://www.urbandictionary.com/define.php?term=bricked

sretalla said:
... Unless you count the documentation and resources sections...

Wouldn't really count that as part of the forum. Did not say/mean there is no documentation. I said a forum is not documentation. I also stated that the information I'm looking for is not in the documentation and that for many it is not clearly explained.

sretalla said:
I don't think anyone is trying to deny them. You've been invited to test and confirm them for yourself.

Not really what I meant. Many here hail ZFS as being the pinnacle of file system/storage manager. For some use cases I would agree, but there are many and I say many caveats. These caveats are not well defined and or explained. Many wave them away saying well if you want a Ferrari best not show up with 500 Euro to a show room. But often that is not the point since the need a minivan... No matter how much you spend on a Ferrari it'll never be a good minivan (assuming and hoping Ferrari never made a minivan :P )

sretalla said:
Nobody here is doing that. If anything, we're trying to help people to see them and avoid the consequences.

Well, let's split this one up.
1. Yes you and some others are definitely trying to be helpful, as with any forum not everyone is of course.
2. I do notice a fair of defensive posturing. In the way of o no he didn't just say that. Resulting in a (and I exaggerate) are you stupid or what? Which might in some cases may be deserved but is almost never fruitful.
3. In my opinion these consequence as well the what, how and why are often vague and not very well supported in facts or hard numbers. This of course is partly due to the difficulty of the topic and most here are probably (just) users of the system/tecnology.

sretalla said:
It's a feature of the filesystem and thus can/should be avoided in designing to employ it.

If the file system itself gets damaged or is unusable after being utilized for 100% then there is a issue in the design, period. Especially since the claim of ZFS is it's high level redundancy and data security. Now by damage/unusable I mean:
- A loss of data up to the 100%, of course any write above 100% is clearly not going to be written and gives a IO error to the system that ordered the write. Also the last writes before reaching 100% may and will probably be partial data, chances are low a file would fit exactly.
- Unable to free up data by removing files/blocks and after freeing up data being able to use said data again

sretalla said:
Perhaps you feel this exchange is adversarial... I'm just responding to your points, no malice here.

Hehe, no I was afraid you thought I was being adversarial which isn't my intention. So if that's not the case then it's great :)

At this point I have a lot of things to read up on. Will do this first to get a better handle on things so we can converse on more even ground.

nemesis1782 · Mar 8, 2021

As for the documentation I'll go through it all. However Just from the first page What is TrueNAS I already have quite a few comments. It does paint an accurate picture and seems to be mostly marketing padded with technical terms.

Ok gotten a bit deeper into the documentation and indeed answer a few questions I've had (many which I haven't asked yet). It'll take a few days to read through all of it though.

sretalla · Mar 8, 2021

OK, so here's an opportunity to take it forward a bit

nemesis1782 said:
- Unable to free up data by removing files/blocks and after freeing up data being able to use said data again

Yes, this is the case.

Due to being Copy On Write, ZFS needs space in order to delete files too, so if you fill it to 100% (or even quite close to it), you will possibly be unable to free any space to continue operation.

You can still read data from what's there.

sretalla · Mar 8, 2021

nemesis1782 said:
No matter how much you spend on a Ferrari it'll never be a good minivan

Agreed (although they did make something that looks a bit like one... beside the point... https://images.app.goo.gl/41SobJgkkap2fMVb8)

My reference was to the demand for high performance (IOPS if we're talking block storage for VMs or iSCSI in the most common use cases) and the complaint that it doesn't come at rock-bottom prices (you need a lot of "wasted" disk to get to it while remaining safe with the data since Mirrors are the way to deliver IOPS and RAIDZ is what most people have pictured in their minds as how they will afford a Ferrari, giving up only 10-20%, not 50% to redundancy).

sretalla · Mar 8, 2021

I think this might also be a good place for you to look: https://openzfs.github.io/openzfs-docs/Performance and Tuning/Workload Tuning.html

The OpenZFS project itself talking about how to do it right for the right task.

HoneyBadger · Mar 8, 2021

I've tried to figure out how to phrase this a few times, so pardon me if the language is a little tortured here, and please don't take anything as condescension or critical. Think of it more as a "word salad" I've tossed together while watching progress bars go.

The "archival workflow" is fairly well understood and behaves nicely in ZFS. Dumping big files in, and deleting them rarely or never, tends to work great. Even when you're filling the pool up close to the maximum ("up to but not exactly 100%") capacity, the fill pattern still leaves you with a lot of contiguous free space, and deleting files in large chunks results in a large amount of space being freed at once.
Benchmarking this workflow is easy, but probably will be already well-understood.

As soon as you go to smaller granularity, or even worse doing "update-in-place" of files or block devices, the nice contiguous free space gets covered with a finely chopped mixed of those in-place files, as well as ruining the ability to sequentially read from the underlying vdevs. If you write a 1GB file, and then start updating random blocks of 1M in the middle of it, you'll end up with things out of order and have to seek around to read. Do that with ten 1GB files and it gets even worse. Do it with 100 50G .vmdk's worth of data on block storage - and you've basically asked your drives to deliver I/O at random across a 5T span of data.

Now, this "steady state" can absolutely be benchmarked, but the question is "what is the value of that benchmark?"

It will definitely tell us something we already know, that being "spinning disks suck at random I/O." But the question I would ask it "do you really need to hit that 5T span of data at performance-level-X, or do you realistically only need to hit 500G of it that fast?" Because that's where something like a huge ARC (with compression) and L2ARC devices start to come into play. With more hits to your RAM and SSD, and suddenly your spindles have more free time to deliver the I/O requests that miss the cache. Maybe it's a VM datastore or NFS export, you're backing your VMs up nightly or weekly, and you will hit all of that 5T span, you don't care too much that it takes a while as long as it finishes inside your backup window, but it can't tank the rest of your running VM performance.

The only true benchmark is you (or someone with the same workflow) actually using the storage. You can definitely make observations and extrapolations from someone else's experience, but it's difficult to try to "boil it down" to just a single number, graph, or report sheet. Bandwidth, latency, IOPS, the size of the working set, all of this will have to be taken into account. But at the same time it's important to have objective metrics, because what's "fast enough" for one person might be "intolerable" for another.

I'll see if I can manage to get something more coherent into text to help you out with some workflow and benchmark ideas, but I'd suggest checking the ground already trod by others with tools like VDbench, HCIbench, or diskspd for simulating "real world" setups in a scalable and programmatic scenario.

[/Ramble]

nemesis1782 · Mar 9, 2021

sretalla said:
Agreed (although they did make something that looks a bit like one... beside the point... https://images.app.goo.gl/41SobJgkkap2fMVb8)

That thing always makes me shiver.

sretalla said:
My reference was to the demand for high performance (IOPS if we're talking block storage for VMs or iSCSI in the most common use cases) and the complaint that it doesn't come at rock-bottom prices (you need a lot of "wasted" disk to get to it while remaining safe with the data since Mirrors are the way to deliver IOPS and RAIDZ is what most people have pictured in their minds as how they will afford a Ferrari, giving up only 10-20%, not 50% to redundancy).

I agree with the essence and do not get me wrong I appreciate the reference! Not sure yet if I agree yet on ZFS indeed being a good/best option for this.

However you've given me a lot to read and it'll sometime for me to absorb it all! Once I have done so I hope you might consider taking this conversion up again. Hopefully we'll be on more equal footing at that point.

nemesis1782 · Mar 9, 2021

sretalla said:
OK, so here's an opportunity to take it forward a bit

Yes, this is the case.

Due to being Copy On Write, ZFS needs space in order to delete files too, so if you fill it to 100% (or even quite close to it), you will possibly be unable to free any space to continue operation.

You can still read data from what's there.

Hmm. Yeah I do understand the issue. However this is only the case when not using a Fusion pool I imagine/guess. Still feel this is a design oversight.

While writing this there is probably a recovery scenario by adding another vDev to the pool. Which would free up space to make room for the metadata modifications.

Another thing is that CoW creates new metadata on update. Why would it create new metadata on delete? Does this mean ZFS keeps a block availability map of some sorts. In which case you could indeed not update it and thus not finish the delete transaction.

nemesis1782 · Mar 9, 2021

sretalla said:
I think this might also be a good place for you to look: https://openzfs.github.io/openzfs-docs/Performance and Tuning/Workload Tuning.html

The OpenZFS project itself talking about how to do it right for the right task.

Ohh nice more reading material thnx!

@HoneyBadger: I'll get back to you on that my brain is to fragmented after a day of meetings to process what you wrote. I appreciate the well thought through and formulated reply!

HoneyBadger · Mar 9, 2021

nemesis1782 said:
Another thing is that CoW creates new metadata on update. Why would it create new metadata on delete? Does this mean ZFS keeps a block availability map of some sorts. In which case you could indeed not update it and thus not finish the delete transaction.

Indeed it does, via "metaslabs" and "space maps" - the atomic nature also means that the delete has to succeed and be valid on-disk first before the previous space is freed. If you still have even a handful of MB remaining though, you can try to delete a file, and as long as there's enough free space in the pool to write the necessary metadata+spacemap updates, you can slowly claw your way back.

If you're beyond that point, you're having to get fancy - null-byteing or truncating files directly, trying to shrink a zvol (if you have one) or burning through refreservation, try to kill snapshots - and as you already identified, add another vdev (ideally in a redundant manner!)

But it's definitely a case of "an ounce of prevention is worth a pound of cure." Once you reach a certain threshold on any filesystem, you need to have various degrees of alarm bells and klaxons screaming at you to do something from "look at this" to "drop everything and do something right now" because client systems have universally negative reactions to getting ENOSPC in response to a write.

sretalla · Mar 10, 2021

nemesis1782 said:
While writing this there is probably a recovery scenario by adding another vDev to the pool. Which would free up space to make room for the metadata modifications.

I think if you're really full (only a few bytes left) it's not even possible to add a VDEV, since all pool members need to know about new VDEVs (requiring writes for that to happen).

nemesis1782 said:
However this is only the case when not using a Fusion pool I imagine/guess. Still feel this is a design oversight.

I'm not sure that I see how using a fusion pool saves you (although I do agree that you would typically have a much larger metadata VDEV than required which wouldn't necessarily have other data being written to it... I'm not 100% sure that data won't overflow from the data VDEVs to the metadata one when there's not enough space... metadata will certainly overflow to the data VDEVs if required, but I'm not clear if there's some kind of prevention mechanism for it going the other way or if it's just "any port in a storm" coded in).

Important Announcement for the TrueNAS Community.

Benchmarking ZFS performance on TrueNAS

Contributor

Powered by Neutrality

Contributor

Powered by Neutrality

Contributor

Powered by Neutrality

Contributor

Contributor

Powered by Neutrality

Contributor

Contributor

Powered by Neutrality

Powered by Neutrality

Powered by Neutrality

actually does care

Contributor

Contributor

Contributor

actually does care

Powered by Neutrality

Similar threads