Sync writes vs SLOG

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I’m not thinking that. According to FreeNAS, you can use 50% for iSCSI and 80% for NFS without any performance degradation. And you say FreeNAS/iXsystems is wrong… OK, I'll have to read your materials then.

I think you're confused and maybe misunderstood something somewhere.

80% is commonly cited as the fill percentage for normal ZFS pools used for general file storage. This is the point at which performance degradation starts to become more noticeable. It isn't a fixed number; some people think it's closer to 90%, etc.

Back in 2012-2013 I was already talking about this problem and was considered a bit heretical or insane by people unwilling to try to understand it, and my rule of thumb at the time was actually 60%, not 50%. See, for example:


I later lowered it to 40-50%. So you might want to listen to what I'm saying, because I'm PROBABLY the indirect source for the 50% you heard. I eventually prevailed upon Dru to put more accurate real-world estimates of eventual steady-state ZFS performance into the FreeNAS manual, and iX engineers have been referred to some of my discussions over the years because I've gone into some detail. You can find numerous past discussions by searching for the word "contiguous" by "jgreco" in the search box. Many of the results discuss this at varying levels of complexity. This is a tough concept to wrap your head around, and I am committed to trying to help promote understanding of the issue.

So let me summarize for you.

This is not a protocol issue, though things like volblocksize are somewhat related. The general problem applies to both NFS and iSCSI; the specifics for each are slightly different.

Because ZFS tends to write the blocks of a transaction group to contiguous LBAs on disk, even when those blocks would normally be considered random writes to random locations, it tends to write faster: it isn't ACTUALLY seeking for writes that would require seeks on UFS or EXT3.

The problem is that this causes severe fragmentation over time, and the question is where the merry-go-round ride ends. Fortunately our friends over at Delphix did a study of this, which is part of what convinced me I was wrong about the 60% and that it needed to be lower.

[Image: delphix-small.png (Delphix study: steady-state random-write throughput vs. pool capacity)]


This table shows performance for a pool that is written to randomly until it achieves a stable throughput. This represents the likely behaviour of your pool after extensive use and random overwrites. It's the level you can probably RELY on your pool's performance never dropping below.

What this is saying is that at 10% pool capacity, your pool will be about 6x faster for writes than at 50% capacity. The hard data presented here caused me to rethink the 60% advice I had been giving for years, because it's pretty clear that at 50% you're already into the "wow, that sucks" area of the graph.

The problem is that most people don't do the hard work to figure out where things end up. They get a new pool that's empty, they run some benchmarks on it, and they think, holy bitpools Batman, this is amazingly fast. What you need to understand is how performance will evolve over time and what factors impact this. At 50% you may actually be fine, because most block storage usage does not cause maximal fragmentation, so YOUR steady-state graph may not end up at the pessimistic low level. If you run VMs that do not frequently overwrite all their data, you will end up on a somewhat better curve, or one that approaches the pessimistic level over a much longer period of time: years or even decades.
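To make the steady-state intuition concrete, here is a toy Monte Carlo sketch (my own illustration, not how the ZFS allocator actually works): randomly mark a fraction of blocks as allocated, then measure the average contiguous free run left over. Longer free runs mean more data written before a seek.

```python
import random

def mean_free_run(fill: float, n_blocks: int = 500_000) -> float:
    """Average contiguous free-run length in a randomly filled bitmap."""
    run_lengths, run = [], 0
    for _ in range(n_blocks):
        if random.random() < fill:   # this block is allocated
            if run:
                run_lengths.append(run)
            run = 0
        else:                        # this block is free; extend the run
            run += 1
    if run:
        run_lengths.append(run)
    return sum(run_lengths) / max(len(run_lengths), 1)

for fill in (0.10, 0.25, 0.50, 0.75, 0.90):
    print(f"{fill:.0%} full: mean contiguous free run ~ "
          f"{mean_free_run(fill):.1f} blocks")
```

In this toy model the mean free run scales like 1/(fill fraction): roughly 10 blocks at 10% full versus 2 at 50% full, a 5x difference that lands in the same ballpark as the ~6x in the Delphix data. The real allocator is far smarter, but the shape of the decline is the point.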
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,949
So:
512GB RAM is excellent. A large working set in RAM will make reads nice and snappy.
26 disks in Z2 - ouch. The IOPS on that are not going to be good, and guess what: you need IOPS for VMware-type loads. I have no idea how effective the flash cache will be.
The Xeon 2650 is not the fastest CPU around (depending on the version; some are slower) but it will be OK for this. Don't expect it to shine at SMB.
40Gb networking: great.
RMS-200 8G, presumably mirrored - did you get these second-hand? The current model is the RMS-300 8/16G. Either way, these are technically an issue. A SLOG holds by default up to 5 seconds of write data, which at 10Gb/s is 6.25GB; at 40Gb/s that is a theoretical 25GB of SLOG data - on an 8GB card? Most of the SLOG sizing I have seen is that 6.25GB per 10Gb/s of network speed, and you are very short of that. You probably won't reach the full 25GB, but I suggest you are likely to bounce off the 8GB limit on a regular basis (and I have no idea what happens then); rough numbers are sketched below.
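To put rough numbers on the IOPS and SLOG points above, here is a back-of-envelope sketch in Python (my own illustration; the ~250 write IOPS per spinning disk is a hypothetical figure, and 5 seconds is the classic transaction-group window):

```python
def slog_size_gb(link_gbps: float, txg_seconds: float = 5.0) -> float:
    """Worst-case sync data in flight: line rate times the txg window."""
    return link_gbps / 8 * txg_seconds      # Gb/s -> GB/s, times seconds

def pool_write_iops(n_vdevs: int, per_disk_iops: int = 250) -> int:
    """Rule of thumb: a RAIDZ vdev or mirror delivers roughly the
    small-block write IOPS of a single member disk."""
    return n_vdevs * per_disk_iops

print(f"SLOG needed at 10 Gb/s: ~{slog_size_gb(10):.2f} GB")   # ~6.25 GB
print(f"SLOG needed at 40 Gb/s: ~{slog_size_gb(40):.2f} GB")   # ~25 GB
print(f"1 x 26-wide Z2 vdev:    ~{pool_write_iops(1)} write IOPS")
print(f"13 x 2-way mirrors:     ~{pool_write_iops(13)} write IOPS")
```

The same 26 disks arranged as 13 mirrors buy you roughly 13x the write IOPS of a single wide Z2 vdev, at the cost of capacity.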

I have a fairly light VMware environment with 10Gb NICs and a 20GB SLOG for each pool. This appears to work well.
 

remonv76

Dabbler
Joined
Dec 27, 2014
Messages
49
So:
512GB RAM is excellent. A large working set in RAM will make reads nice and snappy.
26 disks in Z2 - ouch. The IOPS on that are not going to be good, and guess what: you need IOPS for VMware-type loads. I have no idea how effective the flash cache will be.
The Xeon 2650 is not the fastest CPU around (depending on the version; some are slower) but it will be OK for this. Don't expect it to shine at SMB.
40Gb networking: great.
RMS-200 8G, presumably mirrored - did you get these second-hand? The current model is the RMS-300 8/16G. Either way, these are technically an issue. A SLOG holds by default up to 5 seconds of write data, which at 10Gb/s is 6.25GB; at 40Gb/s that is a theoretical 25GB of SLOG data - on an 8GB card? Most of the SLOG sizing I have seen is that 6.25GB per 10Gb/s of network speed, and you are very short of that. You probably won't reach the full 25GB, but I suggest you are likely to bounce off the 8GB limit on a regular basis (and I have no idea what happens then); rough numbers are sketched below.

I have a fairly light VMware environment with 10Gb NICs and a 20GB SLOG for each pool. This appears to work well.
Well, I don't think the RMS-200 is that much of an issue, as almost all of our writes are 4k to 32k, mostly database transactions. So we will probably not saturate the 40Gb; even 10Gbps is hard to fill with such small transactions. From what I have read, when one transaction group (4GB) is full and ZFS starts filling the second, it flushes the first from memory to the pool, then drops that transaction group from memory and the SLOG. FreeNAS will pause/throttle IO during this process until it catches up.
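A crude way to see when that throttle kicks in, under simplifying assumptions (one transaction group syncing while the next fills, a fixed 4GB dirty-data cap, steady rates; the real OpenZFS write throttle injects delays gradually rather than hard-stopping):

```python
def seconds_until_throttle(ingest_gb_s: float, flush_gb_s: float,
                           dirty_cap_gb: float = 4.0) -> float:
    """How long a burst can run faster than the pool can flush before
    the dirty-data cap fills and incoming writes get throttled."""
    if ingest_gb_s <= flush_gb_s:
        return float("inf")          # the pool keeps up; no throttling
    return dirty_cap_gb / (ingest_gb_s - flush_gb_s)

# Hypothetical numbers: ~5 GB/s of 40GbE ingest against a pool that
# can flush 1.5 GB/s; the 4 GB cap absorbs about a second of burst.
print(f"{seconds_until_throttle(5.0, 1.5):.2f} s of burst headroom")
# After that, the sustained client rate equals the pool flush rate.
```

With small database transactions that rarely approach line rate, the cap may never fill, which supports the view that an 8GB device can be workable for this workload.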

The flash cache is only good for read IO, not for writes. We will have a closer look at a mirror configuration and also test the stress on the array during a rebuild. We have 10-20 VMs running as a test setup, so we can stress them while rebuilding and see what happens.

Wow @jgreco, thank you for this incredible explanation. We have to take a closer look at our workload; as long as we stay out of the "wow, that sucks" area, we are good. We mainly treated the FreeNAS information as authoritative, and it is hard to find real-world data, because mostly it is used as a hobby solution. I think every storage system drops in performance over time, but how ZFS performs over time makes me wonder how to keep its performance stable. From your information, I now understand that it doesn't matter whether we use NFS or iSCSI: because we use block storage, performance degradation will occur in both.

So as a rule of thumb, x% of free space will assure a certain level of performance. We have plenty of time to play with this, and as said above we have a couple of VMs running to experiment with. We want to leave everything as default as possible before making any changes; this way we get a perfect zero-level benchmark to start with. I'm going to read the information you wrote.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
We mainly treated the FreeNAS information as authoritative, and it is hard to find real-world data, because mostly it is used as a hobby solution.

You think it is a "hobby solution"? No. FreeNAS is released as free-to-use, but it is the testing and bug-shakeout branch of TrueNAS, which is very much NOT a "hobby solution." As the world has transitioned away from small-to-midsize servers with direct-attached RAID5 arrays, storage is evolving towards high-speed flash solutions, which ZFS is not exactly targeted at, and also high-capacity storage solutions, at which ZFS is essentially unparalleled. Most "hobby solutions" are not selling 20-petabyte systems to customers.

The thing is, in the enterprise environment you have an iXsystems engineer working on your storage with you, helping you provision and configure it correctly, analyzing your environment, and making sure things work out for you. These customers are not usually participating on the forums here; if they have a problem, iXsystems fixes it for them.

There are those of us here on the forums who do this stuff professionally as well. My boss is a cheapskate arse who did the cost analysis of conventional NetApp/EMC/EqualLogic/etc solutions and found them to be 10-20x the cost of parts; he makes me run FreeNAS but I get back at that old green bastard by wasting time providing support on the forums. It's okay to tattle on me, he knows what I think of him. He gets cheap storage and I get the pleasure of chatting with clueful people on these forums, because rolling your own storage is not for idiots. I am very happy to help people become educated in the things that they really need to know in order to be successful.

That said, there are also a number of hardcore and amazing hobbyists on these forums, so while yes, there are lots of hobbyist users out there doing silly things with more questions than answers, there are also several major vendors of ZFS-based enterprise storage solutions, and iXsystems is clearly one of the leaders of that pack.
 

remonv76

Dabbler
Joined
Dec 27, 2014
Messages
49
Huh? No, I mean the talks on this forum. Most people on this forum use it as a hobby solution. It is hard to find real business-grade deployments with high-end datacenter hardware.
 

remonv76

Dabbler
Joined
Dec 27, 2014
Messages
49
You think it is a "hobby solution"? No. FreeNAS is released as free-to-use, but it is the testing and bug-shakeout branch of TrueNAS, which is very much NOT a "hobby solution." As the world has transitioned away from small-to-midsize servers with direct-attached RAID5 arrays, storage is evolving towards high-speed flash solutions, which ZFS is not exactly targeted at, and also high-capacity storage solutions, at which ZFS is essentially unparalleled. Most "hobby solutions" are not selling 20-petabyte systems to customers.

The thing is, in the enterprise environment you have an iXsystems engineer working on your storage with you, helping you provision and configure it correctly, analyzing your environment, and making sure things work out for you. These customers are not usually participating on the forums here; if they have a problem, iXsystems fixes it for them.

There are those of us here on the forums who do this stuff professionally as well. My boss is a cheapskate arse who did the cost analysis of conventional NetApp/EMC/EqualLogic/etc solutions and found them to be 10-20x the cost of parts; he makes me run FreeNAS but I get back at that old green bastard by wasting time providing support on the forums. It's okay to tattle on me, he knows what I think of him. He gets cheap storage and I get the pleasure of chatting with clueful people on these forums, because rolling your own storage is not for idiots. I am very happy to help people become educated in the things that they really need to know in order to be successful.

That said, there are also a number of hardcore and amazing hobbyists on these forums, so while yes, there are lots of hobbyist users out there doing silly things with more questions than answers, there are also several major vendors of ZFS-based enterprise storage solutions, and iXsystems is clearly one of the leaders of that pack.
Hahahaha… well, for small companies like mine (2 people) this is the way to go for now. We checked NetApp/Dell/EMC as a valid solution, but that begins at 25k for one unit. Now we have two for less than that.
Well, you definitely deserve the title of Resident Grinch! ;-)
 
Joined
Oct 22, 2019
Messages
3,641
There are those of us here on the forums who do this stuff professionally as well. My boss is a cheapskate arse who did the cost analysis of conventional NetApp/EMC/EqualLogic/etc solutions and found them to be 10-20x the cost of parts; he makes me run FreeNAS but I get back at that old green bastard by wasting time providing support on the forums. It's okay to tattle on me, he knows what I think of him. He gets cheap storage and I get the pleasure of chatting with clueful people on these forums, because rolling your own storage is not for idiots. I am very happy to help people become educated in the things that they really need to know in order to be successful.
You had me for a moment. I even had to do a double-take. :tongue:
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Huh? No, I mean the talks on this forum. Most people on this forum use it as a hobby solution. It is hard to find real business-grade deployments with high-end datacenter hardware.

It's not that hard. Pay attention to the Resources section, especially the ones with high ratings, and listen to the high-volume posters. Generally, misinformation isn't allowed to survive long here. If you post problems that require higher-end skills, there will usually be some comments from that crowd.

Anyways, the big benefit here is that unlike a lot of the vendor options, iXsystems is in the business of selling hardware, and as such they've tried to make the appliance a lot friendlier than some of the vendor options where ongoing employment of the support engineers seems to be the goal of the product design. That means you have a huge head start toward being able to "self-serve".
 

remonv76

Dabbler
Joined
Dec 27, 2014
Messages
49
So we did some fine-tuning and created a new pool consisting of mirror vdevs. The base performance, without any LOG or L2ARC device, is acceptable. There is not much difference at the low block sizes, but overall performance is better.
[Image: 20220506_windows_MB_sync-always_nfs_SAS-MIRROR_zonder_cache-ROUTEROS.PNG (benchmark of the SAS mirror pool over NFS with sync=always, no cache devices)]
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
There is not much difference at the low block sizes
I don't think that tool (or any tool) is likely to be a great test of small blocks across a network.

The amount of overhead needed to transfer, prepare, and write out 512 bytes is enormous in comparison to the payload, and it takes almost the same amount of time as processing much more data.

I think what you're seeing there is the minimum time to transfer a packet/write a block, just divided by different small numbers to give MB/s.

The numbers from about 32KB and up are maybe starting to make sense, but I would only really trust the 128K (default ZFS recordsize) numbers, and indeed you see it holds more or less flat from there as the "real" number (IMO).
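To illustrate, a rough throughput model (a sketch only; the 200 microsecond per-operation overhead is a made-up figure standing in for network round trips, protocol handling, and sync latency):

```python
def apparent_mb_s(block_bytes: int,
                  per_op_overhead_s: float = 200e-6,   # hypothetical 200 us/op
                  wire_bytes_per_s: float = 1.25e9) -> float:  # ~10 Gb/s
    """MB/s a benchmark reports when every op pays a fixed overhead."""
    op_time = per_op_overhead_s + block_bytes / wire_bytes_per_s
    return block_bytes / op_time / 1e6

for size in (512, 4096, 32768, 131072, 1048576):
    print(f"{size:>8} B blocks: {apparent_mb_s(size):8.1f} MB/s")
```

Below about 32KB the reported MB/s is nearly proportional to block size, because the fixed overhead dominates; that is exactly the divide-by-small-numbers effect described above.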
 

remonv76

Dabbler
Joined
Dec 27, 2014
Messages
49
I don't think that tool (or any tool) is likely to be a great test of small blocks across a network.

The amount of overhead needed to transfer, prepare, and write out 512 bytes is enormous in comparison to the payload, and it takes almost the same amount of time as processing much more data.

I think what you're seeing there is the minimum time to transfer a packet/write a block, just divided by different small numbers to give MB/s.

The numbers from about 32KB and up are maybe starting to make sense, but I would only really trust the 128K (default ZFS recordsize) numbers, and indeed you see it holds more or less flat from there as the "real" number (IMO).
Well, database transactions are around 4k, so we are definitely watching those. We also do a lot of development, and small scripts start from 32k and up. So that's what you want to look at: what is your workload, how many transactions, etc.

I think we are going to stick with the mirrors. We had some bad luck in the past, but hopefully that won't happen again.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Well, database transactions are around 4k, so we are definitely watching those. We also do a lot of development, and small scripts start from 32k and up. So that's what you want to look at: what is your workload, how many transactions, etc.
Not sure if you already saw this, but it seems it might be worthwhile for you: https://openzfs.github.io/openzfs-docs/Performance and Tuning/Workload Tuning.html#database-workloads

I think we are going to stick with the mirrors. We had some bad luck in the past, but hopefully that won't happen again.
Did you have a spare in the pool on that occasion?
 

remonv76

Dabbler
Joined
Dec 27, 2014
Messages
49
Not sure if you already saw this, but it seems it might be worthwhile for you: https://openzfs.github.io/openzfs-docs/Performance and Tuning/Workload Tuning.html#database-workloads


Did you have a spare in the pool on that occasion?
Yes, but we had a triple-disk failure during the rebuild and lost the complete pool. Luckily we still had backups.

Yes, I have seen that, but every scenario is different. We use VMware infra, so we are bound to storage blocks, not dataset performance. This gives a completely different workload profile, and we have to focus on vdisk performance.
For instance, setting skip-innodb_doublewrite in my.cnf does absolutely nothing in our setup; database transactions are not written to a dataset, but into storage blocks.
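One knob that does carry over to zvol-backed storage is volblocksize relative to the database page size. A quick sketch of the read-modify-write amplification when a small aligned write lands inside a larger zvol block (illustrative only; it ignores compression, caching, and RAIDZ padding):

```python
import math

def write_amplification(io_bytes: int, volblocksize: int) -> float:
    """Bytes ZFS rewrites per byte the guest writes: touching any part
    of a block costs a rewrite of the whole volblocksize block."""
    blocks_touched = math.ceil(io_bytes / volblocksize)
    return blocks_touched * volblocksize / io_bytes

# Hypothetical 4k guest writes against a few volblocksize choices.
for vbs in (4096, 8192, 16384, 131072):
    print(f"volblocksize={vbs // 1024:>3}K: "
          f"{write_amplification(4096, vbs):5.1f}x write amplification")
```

Smaller volblocksize cuts the amplification for 4k transactions, but it also reduces compression efficiency and increases metadata overhead, so it is a trade-off to benchmark rather than a free win.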
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
The base performance, without any LOG or L2ARC device, is acceptable.

If that kind of small-block performance is considered acceptable then you'll be really happy once you add the LOG devices.
 

remonv76

Dabbler
Joined
Dec 27, 2014
Messages
49
If that kind of small-block performance is considered acceptable then you'll be really happy once you add the LOG devices.
Hahaha, no, it is acceptable for SAS performance, but yes, we definitely need a SLOG. But first a nice clean baseline performance measurement, stressed and unstressed, to see where we stand. Then we are going to attach the RMS-200s and DCT983 as SLOG and L2ARC and run the performance tests again.
 