Finding suitable ZIL and L2ARC drives


leonroy

Explorer
Joined
Jun 15, 2012
Messages
77
We have a small ESXi cluster in our office running on a FreeNAS array with sync disabled...

I want to put a ZIL and L2ARC cache on the box and was wondering about purchasing two Samsung 840 Pro drives and partitioning both so that p1 is mirrored for the ZIL and p2 is striped for the L2ARC.

Does that seem feasible?

The server has 16GB of RAM and 16x 1TB SATA 7200rpm disks connected to an M1015 controller.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
For SLOG (ZIL) use, you want a device with supercapacitor or equivalent in order to assure that the device is able to actually commit to flash the things that have been sent to it. Intel's 320 line is the cheapest suitable thing I've been able to find for that, and, bonus, it uses an array of capacitors instead of a supercap. However, it is MLC. So you underprovision and let wear leveling do its thing.

Mirrored SLOG is unnecessary with ZFS v28. If the SLOG device fails, the system will revert to using the on-disk ZIL.

Splitting an SSD for combined SLOG and L2ARC use is not supported under FreeNAS, though you can probably hatchet it in from the command line. Still, since the cost per gig of an appropriate SLOG device is much higher than the cost per gig of an appropriate L2ARC device, the smart move is to purchase a small SLOG device and a much larger inexpensive L2ARC.

The 840 would probably be a fine drive for L2ARC. Be aware that the data structures to support L2ARC are stored in ARC, so ultimately it is suggested that if your server can handle up to 32GB, you are better off upgrading RAM to 32GB and then adding L2ARC on top of that. Otherwise, the addition of a large L2ARC device will actually dramatically decrease the amount of ARC available. On a 16GB system, by default, that's only going to be about 9GB to start with. Each entry in L2ARC eats 200 bytes of ARC. A fairly conservative suggestion is that L2ARC can be about 10x the size of ARC, and aggressively perhaps as much as 20x the size of ARC. Many factors play into how successful any given configuration is, though.

You can probably attach a 120GB SSD for L2ARC without a problem with your existing 16GB of RAM.
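
If you want a quick look at how much ARC you actually have to work with before committing to an L2ARC, something like this from the FreeNAS shell will show it (these are the usual FreeBSD sysctl names, so treat this as a sketch and verify them on your own build):

Code:
# Rough look at current ARC size and limits.
sysctl kstat.zfs.misc.arcstats.size    # bytes of ARC currently in use
sysctl vfs.zfs.arc_max                 # ceiling the ARC is allowed to grow to
sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses   # hit/miss counters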
 

pbucher

Contributor
Joined
Oct 15, 2012
Messages
180
For ARC you'd probably be better off adding RAM than messing around with an L2ARC, especially for ESXi. For the SLOG you will need something very high performance, because even with sync turned on your disk array can probably clock a fairly high IOPS rate, I'm guessing. For my own 16-disk array I tried a STEC ZeusIOPS SAS drive and found I lost performance; I had to use a ZeusRAM to actually get fairly decent write performance. To get something close to sync=disabled you are going to need something like a Fusion-io card, which is what iXsystems puts in the TrueNAS boxes.

The other thing about SSDs is that they quote really great IOPS numbers, but those are always based on 4k block writes, which isn't what you're getting when a VM writes out a few bytes to a file. Also, while you don't need a large SSD for a SLOG (8GB is plenty), almost all SSDs' performance increases with size. The reason is that larger drives have more NAND flash chips that can write in parallel; small drive = fewer chips = less throughput. Yet the performance review you read will always be for the 480GB model or something like that.
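
If you want to put a number on what sync is actually costing you before spending money on hardware, a rough before/after comparison can be done from the FreeNAS shell plus a client-side dd. The dataset name tank/vm below is just a placeholder, and remember that disabling sync, even briefly, throws away the protection you are trying to buy:

Code:
zfs get sync tank/vm              # note the current setting (placeholder dataset name)
# From an NFS client, time a large sequential write, e.g.:
#   dd if=/dev/zero of=/mnt/vm/testfile bs=1M count=2048
zfs set sync=disabled tank/vm     # TEMPORARILY disable sync (unsafe for real VM data)
# ...repeat the same dd from the client and compare throughput...
zfs set sync=standard tank/vm     # restore the default when done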
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
So I'll chime in... I'll agree with what the folks here have said: increase your RAM to the maximum extent possible, as you will get the best results there first. Also, you should fully understand how a ZIL and L2ARC work to see whether they would benefit the users of your system; it just depends on how that system is used.

One other thing that's not related, but if this is critical office data, are you running ECC RAM? Just a thought. I'm hearing some folks are having data loss/corruption just because they had a RAM failure that ECC RAM would have prevented.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I'm with pbucher. I'd add more RAM before considering an L2ARC. An L2ARC uses up RAM to keep an index of what it contains, so if your system is already having performance issues because of insufficient RAM, you can actually make your server run slower by adding an L2ARC. In my presentation I make the comment that money that could be spent on RAM or an L2ARC is almost always better spent maxing out RAM first.

You can figure out how much RAM will be used with the formula 180 bytes * (number of blocks in your L2ARC). Block size is based on your load type: NFS is 128k, iSCSI is 4k, and I'm not sure about CIFS. Assuming you use iSCSI, a 120GB L2ARC would use about 5GB of RAM just to index the L2ARC.

Remember that the L2ARC is the second-level ARC. You want the L1ARC (referred to as just "the ARC") to be as big as you can reasonably afford before going to the L2ARC. Considering you have only 16GB of RAM right now, giving 5GB of RAM away to manage the L2ARC would probably be detrimental to your performance.
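
Just to restate the arithmetic in that 120GB example from the shell (same numbers as above, nothing new):

Code:
# 120GB L2ARC indexed at 4k blocks, 180 bytes of ARC per block:
echo $(( 120 * 1024 * 1024 * 1024 / 4096 ))       # ~31.5 million blocks
echo $(( 31457280 * 180 / 1024 / 1024 / 1024 ))   # ~5 (GB of ARC eaten by the index)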

I could be wrong about this, but I thought someone in the forums had done some testing using SSDs as a ZIL and L2ARC at the same time, and it didn't quite work out as planned because the ZIL traffic and L2ARC traffic competed with each other, degrading their overall contribution to the server. I'll see if I can find the link if I get a chance.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
I did some testing with no benefits when I did the Intel LAN card testing, but my setup was a typical home system, not a fancy 16-drive system, so unfortunately my results would not truly be a good reflection of what might be achieved. I did find that more RAM achieved better results than the ZIL and L2ARC.

And I do recall other testing and, like cyberjock said, not favorable results.

@leonroy, what are you hoping to achieve by adding these items? What are your current benchmarks, and are you going to see a true return on investment?
 

leonroy

Explorer
Joined
Jun 15, 2012
Messages
77
Thanks for all the replies.

Didn't realize that the L2ARC took such a toll on the ARC. A lot to consider when speccing a ZFS box. Seems a real balancing act.

I have 4x4GB ECC Unbuffered and the board only has four memory slots, so that's a bit of a bummer. The ZIL I purchased originally is an Intel 320 40GB SSD. The problem with it is that the writes are only 40MB/s. As a result I'm seeing NFS writes of 30MB/s with sync and 95MB/s with async. Should've read the specs instead of the reviews, which only look at the much faster 80GB+ models which have more parallelisation across the memory chips.

I'm hoping to build a box which serves VMs to vSphere over NFS and is able to saturate a dual Gigabit aggregated pipe.

My feeling is that my next steps should be:
1. Get a faster ZIL - the Intel S3700 100GB SSD looks like a good bet with 200MB/s writes. The 200GB model, however, offers 365MB/s writes, but I'm guessing the 100GB model will be fast enough to saturate two Gigabit links?

2. Upgrade memory to 32GB.

3. Get an L2ARC if performance is still lacking.

That sound about right?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Honestly, I'd try the upgrade to 32GB first, then the ZIL, then the L2ARC.

The ZIL doesn't need record-breaking MB/sec. It needs high transaction rates, which SSDs generally provide. If you try to write an 8GB file to your server you won't have very many sync writes, as the sync command most often comes at the end of a file write. The ZIL also only caches writes below a certain size to prevent overusing the ZIL.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
You're really getting 40MB/sec out of a 40GB 320? Wow, I'm impressed. Ours manages 20-25MB/sec. I've been punishing a poor little N36L here with ESXi dumping backups over NFS at it. With sync=disabled, it was hovering maybe around 400Mbit/s, now around 190Mbit/sec with the 40GB 320 SLOG device. Given that the system is not that zippy to begin with, I could probably believe that the 320 could go somewhat faster, but 40 out of its rated 45MB/sec is pretty good.

Still, gigabit is ~125MB/sec, and dual is ~250MB/sec, so the S3700 100GB SSD would appear to fail even the most basic math check if the bar required is "potential to allow saturation".
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
If you try to write an 8GB file to your server you won't have very many sync writes, as the sync command most often comes at the end of a file write. The ZIL also only caches writes below a certain size to prevent overusing the ZIL.

For NFS with ESXi, it'll happily honor ESXi's sync write requests and basically everything will run through the SLOG. It's a messy business.

Oh, I forgot to mention, make sure to UNDERPROVISION YOUR DEVICE. You only need a gig or maybe two, and leaving a huge amount of the device available and unpartitioned gives the SSD's wear leveling a lot to work with.
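
If you want to see just how little of the log device is actually in play, the per-vdev view will show it; tank is a placeholder pool name:

Code:
# Per-vdev capacity and activity every 5 seconds; the log vdev's alloc
# column rarely climbs past a gig or two even under heavy sync writes.
zpool iostat -v tank 5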
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I guess I should have clarified that normal NFS writes would only have a sync at the end of the file writing operation. ESXi does things a little differently to protect your data.
 

leonroy

Explorer
Joined
Jun 15, 2012
Messages
77
You're really getting 40MB/sec out of a 40GB 320?
Nope, I meant that 40MB/s is the specced write speed. Real world I'm only getting 30MB/s, or more like 26MB/s, so near enough the numbers you're seeing, unfortunately!


Still, gigabit is ~125MB/sec, and dual is ~250MB/sec, so the S3700 100GB SSD would appear to fail even the most basic math check if the bar required is "potential to allow saturation".

Kinda. Real world I don't think I've ever seen an NFS link giving more than 110MB/s in a best-case scenario with large contiguous file copies. You're probably right though, a 100GB SSD with only a 200MB/s write speed is likely to be the bottleneck. That 200GB SSD is just so darn expensive though, and it will only add a marginal improvement unless I go Fibre Channel or quad Gigabit links.

Oh, I forgot to mention, make sure to UNDERPROVISION YOUR DEVICE. You only need a gig or maybe two, and leaving a huge amount of the device available and unpartitioned gives the SSD's wear leveling a lot to work with.

So in the FreeNAS terminal I should manually create a tiny partition and leave the rest of the disk unused and then add that as log or cache?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
For a log, yes. For a cache, you'll just want to select the whole device from the GUI options.
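
A minimal sketch of the manual log setup from the FreeNAS shell, assuming the SSD shows up as da6 and the pool is called tank (both placeholders; check device names with camcontrol devlist before touching anything):

Code:
gpart create -s gpt da6                             # fresh GPT on the SSD
gpart add -t freebsd-zfs -a 4k -s 8G -l slog0 da6   # small partition, rest left unallocated
zpool add tank log gpt/slog0                        # attach the partition as the SLOG
# For the L2ARC SSD, add the whole device through the GUI
# (the CLI equivalent would be: zpool add tank cache da7)

Keep in mind that vdevs added from the command line may not be reflected cleanly in the FreeNAS GUI's volume manager, so the usual caveats about mixing CLI and GUI management apply.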
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Nope, I meant that 40MB/s is the specced write speed. Real world I'm only getting 30MB/s, or more like 26MB/s, so near enough the numbers you're seeing.

It's actually 45MB/sec. But I'm usually skeptical of device speed claims; real-world use is almost always worse than the best-case lab benchmark, of course. The way you had said it made it sound like you were actually getting 40 out of the device, but I see that it could be read either way.

Kinda. Real world I don't think I've ever seen an NFS link giving more than 110MB/s in a best case scenario with large contiguous file copies. You're probably right though, a 100GB SSD with only a 200MB/s write speed is likely to be the bottle neck.

But you also have to allow for the fact that the 200MB/sec write speed is "with a tailwind", "in a lab", "under optimum conditions", etc. If you really want the assurance of wire speed, you have to assume that the 200MB/sec score on paper will turn out to be 100 or maybe 150MB/sec in practice.

That 200GB SSD is just so darn expensive though and will only add a marginal improvement unless I go Fiberchannel or Quad Gigabit links.

So the question is, do you actually write so much stuff that being able to support wire-speed writes on dual gigabit is mandatory? Generally speaking, a heavy write load in a VMware environment is a good thing to avoid if possible.

For example, on the previously-mentioned N36L/320 SSD, while it'd be NICE to write faster, the best-case sync=disabled is only about twice as fast, and since the backups are running daily, and they're completing in six to eight hours, this isn't an operational problem.

So in the FreeNAS terminal I should manually create a tiny partition and leave the rest of the disk unused and then add that as log or cache?

For log, yes. The GUI will handle L2ARC just fine, I believe.
 

leonroy

Explorer
Joined
Jun 15, 2012
Messages
77
Reading up on sizing a ZIL, is it accurate to size it at 1/2 the amount of RAM?

So on my 16GB server a ZIL of 8GB would be adequate, and on a 32GB server 16GB would be fine?

Is there any need to add 1GB for overhead? i.e. a 9GB ZIL on a 16GB system?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
No. The purpose of the SLOG is to allow the system to avoid forcing writes out to the in-pool ZIL "right now" (aka sync). Instead it gets pushed out to the SLOG "right now". ZFS still commits the writes to the pool in both cases, but does so as part of the normal transaction group writing process. The interaction is necessary to allow ZFS to use larger block sizes (up to 128K).

So ZFS doesn't let you get ahead of yourself too much. If you have a transaction group being flushed to disk, at most, optimistically, you can be gathering up contents for the next transaction group. If ZFS gets to the end of the txg while the previous one is still being flushed to disk, I/O pauses until the previous one is done flushing. Otherwise you could wind up with a crazy amount of data queued to write to a slow pool.

This gives us some guidance as to SLOG size: the ZIL may need to hold up to two transaction groups, so for a SLOG device, that's the minimum size to avoid problems.

The next question: what the hell's the size of a transaction group? Great question. I went through a lot of problems relating to unrealistic ZFS auto-sizing of transaction groups a while back, and the answer is that this is controlled by vfs.zfs.write_limit_max, which by default is set to 1/8th system memory. That means a device that is 1/4th the size of your system's memory is just large enough, and I'm sure someone had a good reason to generalize that up to 1/2 the size of your system's memory.
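
On the FreeBSD releases FreeNAS was built on at the time you can sanity-check those numbers yourself; the sysctl names below belong to the old write-throttle code mentioned above and may not exist on newer builds:

Code:
sysctl vfs.zfs.txg.timeout         # seconds between txg flushes (default 5)
sysctl vfs.zfs.write_limit_max     # max txg size, defaulting to 1/8 of RAM
# 16GB of RAM: 16384/8 = 2048MB per txg, two txgs -> ~4GB is already a generous SLOG
echo $(( 2 * 16384 / 8 ))          # 4096 (MB)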

BUT! I'm not convinced that a random divisor of memory size is a good estimate as to the I/O capacity of your system. In particular, I had a fast system with 32GB of memory but four relatively slow (50MB/sec) drives. ZFS was trying to ram 4GB transaction groups out to that every five seconds. Can you say "FAIL!"

Interactive response is affected heavily by the speed of the pool and the unreasonableness of the workload, and setting a reasonable, non-melting-the-drives-to-slag transaction group size based on various factors is a mandatory bit of tuning around here now. As a result, transaction groups are typically much smaller than that advice would suggest.

From a practical point of view, since gigabit ethernet is ~125MB/sec, and two links are ~250MB/sec, even if you leave the transaction group flush at five seconds, only ~1250MB could build up in a txg. So again it pays to understand the problem from a few different directions.
 

leonroy

Explorer
Joined
Jun 15, 2012
Messages
77
After adding an Intel S3700 100GB, partitioned as an 8GB ZIL, I'm only seeing writes hitting 40MB/s with sync enabled over NFS. With sync disabled they're closer to 90MB/s.

Am I missing something?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Yes, you are missing the reduced latency inherent in omitting the ZIL... doing it "correctly" is a lot more challenging than saying "ah fsck it all".

That's what this is all about; it isn't easy. You can't have fast, cheap, AND correct, sadly.
 

leonroy

Explorer
Joined
Jun 15, 2012
Messages
77
40MB/s is good enough for the small vSphere cluster I'm running, but out of curiosity, what am I missing here hardware-wise? If I wanted to saturate a single Gigabit link, what sort of hardware changes would I need to make?

System spec is:
  • Xeon 1220L v2
  • 16GB DDR ECC Unbuffered
  • 2x LSI 9211-8i HBAs
  • ZIL - Intel S3700 100GB SSD
  • L2ARC - 256GB Samsung 840 Pro

Thanks for all the patience and explanation so far!
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
It is most certainly correct; almost ALL SSDs come "pre-underprovisioned," so like what's your point?

Mine is that since only a gig or two is meaningful to begin with, leaving more sectors marked as unallocated potentially gives the controller more options to pick from, which lets it pick a free block and make it available more quickly. Since a sync write is unlikely to be exactly the drive's preferred size for fastest I/O, the next best option is to make sure there's a much larger supply of "available" flash blocks, so that you have a better chance of avoiding the need to read and modify an existing block.
 