Proper sizing of SLOG and transaction group

sfcredfox · Jul 13, 2015

Experts,

I'd like to have you review my understanding/planning for SLOG size and clarify transaction group sizing:

System:
X8DTN+ / 2x Intel 5520 2.4Ghz / 72GB RAM ECC
IBM M1015
Pool1 -> 20x 146GB 3G 10K SAS (used for VMware datastore, SYNC heavy)
Pool2 -> 8x 2TB 7200 SATA2 (media content, not likely heavy)
2x Intel Pro 1000 PT/MT dual port cards (4 gigabit ports for iSCSI)
edit: Internal Intel Pro 1000 dual port (for management and CIFS/SMB share traffic)

SLOG Sizing
To setup my SLOG, I found these posts to underprovision my SSD for SLOG:
https://forums.freenas.org/index.php?threads/how-to-add-an-slog.16766/
https://forums.freenas.org/index.php?threads/how-to-partition-zil-ssd-drive-to-underprovision.11824/

Based on guidance, I should plan enough size for two transaction groups.
Having 72GB of RAM, 1/8 of it could be used for a transaction group.
This means my transaction groups should be 9GBs? (72GB*.125=9)

So if I want to hold two of them, that would be a total of 18GB, meaning if I create the partition for 20GB, it should hold two txgs?

-Or-

Should it be planned based on time? Should I plan for each txg being closed after the 5 seconds elapses?
With four ports for iSCSI (4gbps), that could be roughly 500 megabytes per second times 5? (500MBpsx5=2.5GBs of data)

This would mean I need to store only about 10GB of data?

My first thought is just to go with the larger number since 20Gb is still nothing, I just want to know if I understand this correctly so I don't plan on doing something stupid for my SYNC writes.

Transaction Group Sizing
https://forums.freenas.org/index.ph...nto-slog-zil-with-zfs-on-freenas.13633/page-5

I have seen some posts where people are talking about the transaction group (txg) getting too large on systems with large quantities of RAM for your underlying storage to commit it fast enough before the second txg closes, causing the system to pause IO.

I calculated above (hopefully correctly) my transaction groups could be as large as 9GB (72GB*.125)

I don't know how having two separate pools fits into this, maybe the transaction group holds data for both?
This add complication because the amount of data for each pool would vary based on use, and amount of data destine for the slower pool etc, etc, etc.

If one of the pools writes about 500MBps, it would take 18 seconds to write 9GB roughly? Seek, random vs sequential, fragmentation, etc, etc dependent?

Question out of this
Many times, people say don't tune anything because the results are insignificant and it complicates your life. Is this a case where I should actually consider limiting the txg side down so it won't get any bigger than the amount of time it would take to burn the data to down spindles?

If the answer is actually yes, would I need to limit the txg to about 2.5GB (500MBps x 5seconds)?

sfcredfox · Jul 13, 2015

Follow up thoughts:

After thinking about this more, could I even get the transaction group to get up to 9GBs in 5 seconds?

If all the IO is coming from the 5 active gigabit ports, that would only total about 625MBps from all of them. That's only 3.125 GBs of data in 5 seconds (625MBps*5 seconds txg time limit).

So if the system will only be getting external IO up to ~3GB every 5 seconds, couldn't I get away with a smaller SLOG? Say around 8GB?

I still think that might be a problem for the underlying pool if it can only handle 2.5GB (500MBps * 5 seconds). So, should I still consider limiting the max txg size down to 2.5GB? Theoretically, couldn't I still get more writes in from the 5 ports than the pool can handle? This of course would have to mean all those requests were writes, and large smooth sequential, which is probably not a realistic work load.

Robert Trevellyan · Jul 13, 2015

The one thing I can contribute here is that dedicated SLOG devices are one per pool, and that partitioning a single SSD into two doesn't count.

depasseg · Jul 13, 2015

And to add on, I think your sizing calculations are too fine grained. In my pool, I bought the smallest/fastest SSD I could find and is was still way overkill. And I added the whole thing as the SLOG. I figure I would let the OS overwrite the whole drive, ovs under-provision the disk. The end result is the same. I'm going to have a stream of data written all across the drive - both scenarios adequately distribute the writes across all of the SSD.

sfcredfox · Jul 13, 2015

Robert Trevellyan said:
The one thing I can contribute here is that dedicated SLOG devices are one per pool, and that partitioning a single SSD into two doesn't count.

Certainly, make sense. Only one of my pools will likely benefit from a *SLOG, so I will be leaving the other one alone to use default in pool ZIL. I read that forcing the two pools to compete for bandwidth on the device would be silly.

depasseg said:
And to add on, I think your sizing calculations are too fine grained. In my pool, I bought the smallest/fastest SSD I could find and is was still way overkill. And I added the whole thing as the SLOG. I figure I would let the OS overwrite the whole drive, ovs under-provision the disk. The end result is the same. I'm going to have a stream of data written all across the drive - both scenarios adequately distribute the writes across all of the SSD.

I guess that makes sense. I guess you're saying the SSD should still be capable of leveling itself with the partition across the entire drive?

Does the OS mark the data it no longer needs? The SSD wont be writing over data, but the OS will choose to do so. I guess you're saying just let the drive fill up and allow the OS to write over the old data, so it's OS leveling versus SSD leveling?

depasseg · Jul 13, 2015

My understanding is that the OS (or specifically ZFS) will do wear leveling if given more room than needed. Although I'm not sure how to test that out.

Robert Trevellyan · Jul 13, 2015

sfcredfox said:
Does the OS mark the data it no longer needs?

Do you mean, "Does ZFS on FreeBSD issue TRIM commands?" If so, a quick online search suggests, "Not until FreeBSD 10."

depasseg · Jul 13, 2015

TRIM is one method, but most modern SSD's also have a built in garbage collection routine that's pretty effective. And the Enterprise SSD's are already under(over?)-provisioned (a 128GB SSD could have an additional ~10-20% of space to help with it's own garbage collection and wear leveling functions).

Spearfoot · Jul 13, 2015

My understanding is that best practice is to over-provision SSDs you plan on using as a SLOG device, leaving only 4-8GB (or whatever size you've calculated) visible to the operating system. This gives the SSD's controller even more space to use for garbage collection/wear leveling etc., as @depasseg pointed out.

This article gives step-by-step instructions for doing it right:

https://www.thomas-krenn.com/en/wiki/SSD_Over-provisioning_using_hdparm

And of course, the consensus here on the forum seems to be that an Intel DC S3700 is the ideal SSD for this purpose.

sfcredfox · Jul 16, 2015

Looks like there are two methods for provisioning the SLOG, under and over. I guess I'll have to keep researching and weighing people's performance.

Can anyone speak to the Transaction Group size with large quantities of system RAM?

depasseg · Jul 16, 2015

sfcredfox said:
Can anyone speak to the Transaction Group size with large quantities of system RAM?

Is there something specific you are looking for? I thought the TXG size was related to the network speed (specifically the max amount of possible writes per interval of time), not the size of the RAM.

sfcredfox · Jul 16, 2015

depasseg said:
Is there something specific you are looking for? I thought the TXG size was related to the network speed (specifically the max amount of possible writes per interval of time), not the size of the RAM.

So you're saying my ladder thought of basing it off the network IO was the more appropriate way of doing it?

From above:
"If all the IO is coming from the 5 active gigabit ports, that would only total about 625MBps from all of them. That's only 3.125 GBs of data in 5 seconds (625MBps*5 seconds txg time limit).

So if the system will only be getting external IO up to ~3GB every 5 seconds, couldn't I get away with a smaller SLOG? Say around 8GB?

I still think that might be a problem for the underlying pool if it can only handle 2.5GB (500MBps * 5 seconds). So, should I still consider limiting the max txg size down to 2.5GB? Theoretically, couldn't I still get more writes in from the 5 ports than the pool can handle? This of course would have to mean all those requests were writes, and large smooth sequential, which is probably not a realistic work load."

It sounds like this is the more correct way to determine a TXG?
In your opinion, is my basic calculation correct?
Can the transaction group size from these 5 links over power the underlying storage if it was running at a realistic heavy random workload for a bunch of VMs?

Spearfoot · Jul 16, 2015

Judging from the article below (by iXsystems sales engineer Marty Godsey), a .625 GB SLOG would be sufficient for most folks, who typically only have gigabit networks. I plan on getting a 100GB Intel DC S3700 and over-provisioning it to 8GB for my SLOG device. Which is definitely overkill... even 4GB would probably be overkill. But I say 'go big or go home!' ;)

ZFS will take data written to the ZIL and write it to your pool every 5 seconds. Here is some simple throughput math using a 1Gb connection. The maximum throughput, ignoring overheads and assuming one direction, would be .125 Gigabytes per second. With 5 seconds between SLOG flushes and using a 1Gbit link with 100% synchronous writes, the most you will see written to your SLOG is 5 x .125 GB = .625 GB.

This shows that you don’t need that much space for a SLOG and can use a smaller SSD. If you have a write-intensive application that requires multiple 1Gb Ethernet connects or a 10Gb, you can increase the size proportionally.

So bringing it home—when choosing an SSD for a SLOG device, don’t worry about space. Choose an SSD device that has extremely low latency, a high write IOPS, and is reliable. iXsystems did so for TrueNAS and the FreeNAS Mini and so should you.

http://www.ixsystems.com/whats-new/why-zil-size-matters-or-doesnt/

depasseg · Jul 16, 2015

What are your plans for each pool setup? I'm assuming you will have your 20 disk vmware pool setup at striped mirrors and not filled past 50% (recommendation for iscsi). This will give you 10 x ~170MBps.
And I'm guessing the media pool will be RAIDZ2 and won't have a heavy sync write workload.

I really think you are getting into the weeds to solve a problem that doesn't exist. I would not worry about modifying the txg size and associated thresholds (like times and timeouts). There is more to tuning it that a straight bandwidth calc. And as for installing 2 SLOGs, I have to ask - Why? Why does your media pool need a SLOG? Are you planning on having heavy sync writes there?

What device are you planning to use for your SLOG? You aren't going to be able to buy something as small as 8GB, in fact it will be much, much larger. So you can always start with provisioning 8GB (or 12, or 16GB) and adjust the overprovisioning in the future if needed.

depasseg · Jul 16, 2015

Oh, and do you expect to do a heavy/continuous write stream? Or will it be more of a bursty traffic profile?

sfcredfox · Jul 16, 2015

Spearfoot said:
Judging from the article below (by iXsystems sales engineer Marty Godsey), a .625 GB SLOG would be sufficient for most folks, who typically only have gigabit networks. I plan on getting a 100GB Intel DC S3700 and over-provisioning it to 8GB for my SLOG device. Which is definitely overkill... even 4GB would probably be overkill. But I say 'go big or go home!' ;)
http://www.ixsystems.com/whats-new/why-zil-size-matters-or-doesnt/

Running the math in this way gives me 2.5GB. (.125perlink*4links*5sec)
Thanks for posting that article

sfcredfox · Jul 16, 2015

depasseg said:
What are your plans for each pool setup? I'm assuming you will have your 20 disk vmware pool setup at striped mirrors and not filled past 50% (recommendation for iscsi). This will give you 10 x ~170MBps.
And I'm guessing the media pool will be RAIDZ2 and won't have a heavy sync write workload.

I really think you are getting into the weeds to solve a problem that doesn't exist. I would not worry about modifying the txg size and associated thresholds (like times and timeouts). There is more to tuning it that a straight bandwidth calc. And as for installing 2 SLOGs, I have to ask - Why? Why does your media pool need a SLOG? Are you planning on having heavy sync writes there?

What device are you planning to use for your SLOG? You aren't going to be able to buy something as small as 8GB, in fact it will be much, much larger. So you can always start with provisioning 8GB (or 12, or 16GB) and adjust the overprovisioning in the future if needed.

The 20 disk pool is for the VMs, but not exactly the perferred setup. It's Z1 in groups of 5. I'd get much better performance doing it the best way (mirrors), but space is an issue right now. I'll likely build a new pool in the future with a new enclosure and some different disks (these old 3G 2.5" disks are very slow ~30MB tops)
The SLOG is only for the VM pool, it's set conservatively with SYNC=always. I am not at all concerned about it having enough space, I just wanted to get the best answers on how much it was going to use without having to set it up first. Between the two of you and the others, I feel good about that now. I still haven't seem enough to know if under vs. over is best. I'll just go one way and see perhaps. Under provisioning was pretty easy to setup and add.

I am not planning two SLOGS, the media pool on the Z2 will likely have no SYNC writes.

My question about having the transaction group size in question is because 1/8 of the system's RAM is 9GB. I was curious about the bandwidth because I at first thought two transaction groups could be up to 18GB, but that doesn't seem to make sense because my 5 gigabit links could only product enough IO for 3.125GB of data according to the IX provided formula.

The poorly configured Z1 (20 disk) pool can do only about ~500MB per second sustained. (500MBps * 5 seconds = 2.5GB)

It's the difference between the links being capable (realistic?) of producing 3.125GB of IO and the one pool only being able to write 2.5GB per second.

I am basically hearing that it's unlikely this will become a problem. Not all OI is going to one pool, not all of it is writes, those links won't run at 100% except when forcing an unrealistic load on the system.

depasseg said:
Oh, and do you expect to do a heavy/continuous write stream? Or will it be more of a bursty traffic profile?

This pool runs a VM datastore with every type of VM you could have (DC's, Exchange, SharePoint, file servers, streaming media, etc.) for the purpose of testing and determining how this system would perform under real conditions (other than user count, I don't have traffic generators).

Servers booting, doing virus scans, and backups are when the system gets taxed.

*edit: I basically asked this question because I didn't want to put the system under load and have it start halting IO every so often if the pools can't keep up with the transaction groups because it had so much RAM.

I'm guessing there won't be a problem at all, I just didn't want to transfer the VM disk pool off the old system (signature) and onto this one thinking 'More RAM, always better, no issues, etc. If that was a bad idea or a bad design (other than the Z1 vs mirrors), I think you would have caught it by now. I'm looking forward to seeing how the new system will do.

Spearfoot · Jul 16, 2015

I still haven't seem enough to know if under vs. over is best.

Is there such a thing as 'under-provisioning'? When it comes to SSDs, my understanding is that there is only 'over-provisioning', i.e., allocating a larger portion of the disk as free space that the built-in controller can then use for wear leveling and so forth. Perhaps the confusion lies in the fact that 'over-provisioning' results in 'under-sizing' the amount of space available to the operating system?

Or I could just be completely off-base! Anyway, here is a good explanation of over-provisioning:

http://www.samsung.com/global/busin.../SSD/global/html/whitepaper/whitepaper05.html

sfcredfox · Jul 16, 2015

I guess I can't figure out how that's different from this:
https://forums.freenas.org/index.php?threads/how-to-add-an-slog.16766/

Spearfoot · Jul 16, 2015

Yeah, the FreeNAS GUI only lets you work with partitions/full disks when adding them to a pool. It doesn't provide a means of over-provisioning the device. You'll have to do that beforehand, under a different OS or on a different machine. In my case, I booted Ubuntu from a USB stick on my FreeNAS system and followed the steps at the Thomas-Krenn link I gave above:

https://www.thomas-krenn.com/en/wiki/SSD_Over-provisioning_using_hdparm

SSD manufacturers already 'over-provision' their drives to an extent, providing extra space above and beyond the partitioned capacity of the drive. Over-provisioning simply takes that a step further by increasing this extra capacity. Of course, this necessarily means that the partitioned space seen by the operating system is smaller.

For example, a 100GB SSD already has a few additional gigabytes of extra capacity available 'behind the scenes'. By reducing the partitioned space from 100GB to 8GB, we're providing an extra 92GB for the SSD's controller to work with. For a ZIL SLOG this is great! 8GB (or 4GB or whatever) is all you'll actually need, plus you gain the benefits of extra performance and longevity.

Important Announcement for the TrueNAS Community.

Proper sizing of SLOG and transaction group

Patron

Patron

Pony Wrangler

FreeNAS Replicant

Patron

FreeNAS Replicant

Pony Wrangler

FreeNAS Replicant

He of the long foot

Patron

FreeNAS Replicant

Patron

He of the long foot

FreeNAS Replicant

FreeNAS Replicant

Patron

Patron

He of the long foot

Patron

He of the long foot

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Proper sizing of SLOG and transaction group"

Similar threads