
SLOG benchmarking and finding the best SLOG

DaveFL

Explorer
Joined
Dec 4, 2014
Messages
68
What block size is used for servicing VMs over NFS? Trying to understand what size is most important.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
What block size is used for servicing VMs over NFS? Trying to understand what size is most important.
It's important to understand that recordsize in ZFS is a maximum. Provided that you're using thin provisioning on your NFS datastores, writes should roughly match the actual in-guest write sizes rather than being padded out to the full recordsize. Check whether the application in question has a preferred NTFS allocation/block size; you might be able to eke out some gains by using a matching recordsize all the way down.
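As a quick sketch (the dataset name and the 8K value are placeholders - pick whatever matches your guest workload):

  # Set a smaller recordsize on the NFS datastore dataset so ZFS blocks
  # line up with typical 4K/8K in-guest I/O (placeholder dataset name).
  zfs set recordsize=8K tank/vm-nfs
  # Verify; existing data keeps its old block size until rewritten.
  zfs get recordsize tank/vm-nfs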

Take a look at this blog from Tintri (yes, I know they didn't come to a good end) about the data they collected from their NFS user base:


The short answer is that, in general, small-block I/O still dominates - mostly 4K and 8K. VMware does move data in 64K chunks for Storage vMotion, though.

Edit: Replaced Tintri link with an archive.org one
 
Last edited:

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,828
... and now for something slightly different.

As part of TrueNAS, special VDEVs will come into play, potentially speeding things up quite a bit for small files, metadata, etc. I wonder what type of SSD to purchase for this sort of VDEV and thought I'd start here.

In particular, I'm considering using three 1.6TB DC S3610s for the special VDEV, as I have plenty of SATA slots left over, these drives are MLC, have huge DWPDs compared to consumer drives, and are relatively price-competitive on eBay. Am I barking up the wrong tree or does this hardware sound like a good idea for the intended purpose?
 

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
... and now for something slightly different.

As part of TrueNAS, special VDEVs will come into play, potentially speeding things up quite a bit for small files, metadata, etc. I wonder what type of SSD to purchase for this sort of VDEV and thought I'd start here.

In particular, I'm considering using three 1.6TB DC S3610s for the special VDEV, as I have plenty of SATA slots left over, these drives are MLC, have huge DWPDs compared to consumer drives, and are relatively price-competitive on eBay. Am I barking up the wrong tree or does this hardware sound like a good idea for the intended purpose?
First off, check how much metadata you are likely to have to store. Reckon it's quite small for main metadata, and not huge for dedup if you use that (most people don't). There's a thread about it somewhere recently, because I asked that question too.

So I'd question what your reason/need for 1.6TB devices is. If it's that they are cheap and future-proof, fine. But you won't need that size. I've got 45TB of data deduped down to 12.9TB, and the entire metadata and dedup tables are only about 150-200 GB max (I haven't checked exactly).

Second, why 3? Is that because of a 3-way mirror, or RAIDZ? What vdev structure? If you're thinking of RAIDZ, don't. Don't RAIDZ metadata; you want speedy access and very fast reads/writes/rebuilds, not parity checks. Do it as a plain 2- or 3-way mirror.
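For illustration, adding a special vdev as a 3-way mirror is a one-liner (pool and disk names below are placeholders, and the small-block cutoff is optional and just an example value):

  # Attach a 3-way mirrored special vdev to the pool (placeholder names).
  zpool add tank special mirror ada3 ada4 ada5
  # Optionally let small file blocks land on the special vdev too;
  # 64K is only an example cutoff - tune it to your data.
  zfs set special_small_blocks=64K tank
  zpool status tank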

Third, the P3610 was a mid-range one for write latency. Do you need sync writes (SLOG/ZIL)? The P3700 was the low-latency one in that family, not the P3610. Just be aware.

Last, can you afford Optane (900p/905p)? If so, go for a mirrored pair of those, a hundred times over, compared even to Intel's P36xx/P37xx. I've added a resource on SSD/Optane and special vdevs; it's got some useful data too. And another on dedup machine setup that also covers what good special vdevs can do, and last, this thread (https://www.ixsystems.com/community...s-and-is-l2arc-less-useful.86086/#post-600233). They overlap a lot, but still, go read them.
 
Last edited:

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,828
Thank you for your reply!
My dataset is largely dormant, has no VMs, no dedup, etc. It's just a Z3 on 8 HDDs supplemented by a P4801X SLOG and an M.2 1TB L2ARC SSD.

My hope was to mirror the 3 SSDs in the special VDEV to improve small file performance yet maintain the kind of redundancy that a Z3 array calls for.

Or do you think the 1TB EVO 860 L2ARC should be put out to pasture, with the metadata going onto the special VDEV instead?

Running multiple (even smaller) Optanes as a metadata cache gets expensive very quickly!

As for SLOGs, I only have one here, a 100GB P4801x, which has been serving me very well.
 

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
The L2ARC benefits basically depend on how much RAM you have and whether you have RAM pressure. You don't say what's on the pool or how you use it, or how full it is, or how much RAM you have, or the size of your Z3 HDDs. We can say that the L2ARC SSD is independent of the rest. Also, to me it seems a bit strange that you say it's "largely dormant" but also feel a need to "improve small file performance". Dormant systems tend not to have performance issues; that's why they're called dormant?

Those would be my questions, before any comment.
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,828
The purpose of this NAS is to act as a (currently single-VDEV) pool for bit-rot-free storage. There are a lot of pictures, movies, iTunes files, documents, and similar stuff stored on it. Most of the data is written once... i.e. it doesn't change much. However, writing small files is relatively slow, so I thought the special VDEV might come to my rescue.

One of the biggest stressors / workloads for the NAS is when I make regular backups, and there the presence of a "hot" L2ARC made a huge difference. Other "regular" file transfers include Time Machine sparsebundles getting updated fairly regularly or the iTunes media folder expanding as new content is added. Time Machine is probably the biggest source of daily write activity, besides snapshots.

I am considering the inclusion of metadata in the special VDEV simply because OSX Mojave behaves badly when it comes to browsing network volumes - unlike older versions of OSX, it no longer gives a visual clue that it is still waiting for directory data - you simply have to wait and see whether something pops up or not. (I consider this one of the biggest failures of the current-generation OSX UI, but whatever.)

The L2ARC in my NAS is set to metadata = only. Thus, once it is hot, browsing is speedy and reliable. However, to achieve "hot" status, one generally has to pass over a directory about 3 times. I wish L2ARC data were preserved between reboots; this feature may become available in the future, but I'm not holding my breath. The designers at iX are likely more interested in getting special VDEVs to work, since they can, among other features, behave like a persistent L2ARC.
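For anyone reading along, that's just the per-dataset secondarycache property (the pool name below is a placeholder):

  # Cache only metadata from this pool in L2ARC, not file data.
  zfs set secondarycache=metadata tank
  zfs get secondarycache tank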

The crucial difference is that a failure in the L2ARC doesn't affect the pool, whereas a special VDEV failure will take down the pool. Hence my desire to use a 3-wide mirror of enterprise-grade SSDs for the special VDEV. So in closing, I'd like to improve the read/write performance of the NAS for small files in particular, and thought that giving the 40TB pool about 1.6TB for small files would help speed things up a lot when it comes to rsyncs and similar operations.
 
Last edited:

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
That's plenty of info to be going on with. Let me do some thinking... A lot of the access will be metadata. So just a special vdev will help a lot with that; L2ARC will help too, and you have it, so you may as well keep it. Let's take this a step at a time.

First, L2ARC, as that's easy. You have the 1TB M.2 and that's a good choice of device, ideal for routine L2ARC. If it runs a bit slow under load, that's fine. If it dies, even though it isn't redundant, it doesn't hurt anything. It's probably fast enough to read from, but not the cutting-edge fast you'd want for your core pool disks anyway. It's helping and not hurting? May as well leave it be. No reason to change its use, and you don't have a better use for it.

Some useful tunables to speed up metadata loading are below (a sketch of how to apply them follows the list):
  • vfs.zfs.dedup.prefetch=1 You don't use dedup, but someone reading this might
  • vfs.zfs.metaslab.debug_load=1 Force early free-space map loading
  • vfs.zfs.metaslab.debug_unload=1 and prevent it unloading
  • vfs.zfs.l2arc.write_boost=3000000000 don't need to protect our SSD, let it load data as fast as it wants at warmup (3 GB/sec)
  • vfs.zfs.l2arc.write_max=3000000000 and again in ongoing use (3 GB/sec) (loader or RC??)
  • vfs.zfs.l2arc.feed_again=1 Speed up L2ARC load
  • vfs.zfs.arc.meta_min=SOME_NUMBER_OF_BYTES (loader) prevents metadata up to this size being kicked out of RAM
  • vfs.zfs.arc.min_prescient_prefetch_ms=6000 Data loaded because we know we will use it soon; keep it in RAM for a bit longer
  • A lot more, but it depends on your exact hardware, so I won't go into it; those are kinda general ones that should be safe for a lot of uses. A lot won't be relevant with special vdevs, such as metadata block size tweaks. Some people will point to max async I/Os per device, but there are many others, and so much depends on your pool, hardware and use.
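As flagged above, here's a rough sketch of applying these from the shell. Tunable names are copied verbatim from the list and vary between FreeBSD/TrueNAS versions; the ones marked (loader) need to go in as loader tunables (Tunables UI / loader.conf) and take effect at boot, and every value is just the example from the list:

  # Runtime sysctls for L2ARC and metaslab behaviour (example values).
  sysctl vfs.zfs.l2arc.write_boost=3000000000
  sysctl vfs.zfs.l2arc.write_max=3000000000
  sysctl vfs.zfs.l2arc.feed_again=1
  sysctl vfs.zfs.metaslab.debug_load=1
  sysctl vfs.zfs.metaslab.debug_unload=1
  sysctl vfs.zfs.arc.min_prescient_prefetch_ms=6000
  # Loader tunable; 16 GiB here is purely an illustrative figure.
  # vfs.zfs.arc.meta_min=17179869184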
I'm quickly coming to the conclusion that for almost anything you do, special vdevs will be a huge performance win. Even if it's just backups and scrubs. So take that as read and move on...

The short answer is yes, it sounds good to me. You can tweak it in various ways - different model SSDs, some work on tunables, maybe partition the SSDs just in case, to reserve 20GB for SLOG/ZIL if it's ever needed - but given your description, these seem to be the takeaway points:

  • 3-way mirror, check. Sounds good.
  • Pretty decent loss-protected datacentre-quality SSDs, if not top flight. Sounds good.
  • 1.6 TB raw space after mirroring, plenty of space for metadata and a lot of small files. As you're using both, monitor very carefully and be prepared to set a low size limit, and cut down on small-file use when you get to, say, 40% full. There's no easy way to selectively move larger small files off the vdev once done, or to move migrated metadata back onto the special vdev once it's been moved off, if you have to alter things.
    An alternative would be to manually partition the 1.6TBs into, say, 300 GB (for metadata) and 1.25 TB (for small files); that way your small files never squeeze out space needed for future metadata. The cost is a bit of wasted space, since you aren't using the ZFS feature of storing both in one vdev. Not as much as you'd think, because you can fairly accurately figure your metadata sizes. It also means metadata and small files can be independently migrated over time, if needed.
  • Build your new pool from scratch, and send-recv the data, if there's any way on earth you can. That way you get all data optimally placed using the latest ZFS stuff. But it's tricky, as your 8 HDDs are Z3, so it's not easy to do. I'm not sure how you'd migrate existing data in situ, and rewriting existing files may not touch all metadata.
    If you have the nerve, and your backup is reliable, what I'd do would be to back up one last time, test the data is good! then destroy and rebuild the pool under 12-beta. Then send the data back, using -x options to ensure that the original stuff doesn't overwrite your new pool settings. If you need the commands I used for that, say so (a rough sketch of the idea follows).
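To give a flavour (not my exact commands, just a sketch with placeholder pool/dataset/snapshot names; check the -x / -o flags of zfs receive on your version first):

  # One last recursive snapshot, replicated to the backup pool.
  zfs snapshot -r oldpool/data@migrate
  zfs send -R oldpool/data@migrate | zfs receive -u backuppool/data
  # ...destroy and rebuild the pool with its special vdev, then send the data
  # back; -x stops received properties overriding the new pool's local
  # settings (recordsize, compression, etc.).
  zfs send -R backuppool/data@migrate | zfs receive -u -x recordsize -x compression newpool/data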
 
Last edited:

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,828
I really appreciate the insights. One thing I was also considering is to put the older system images I have here into sparsebundle files to eliminate the huge number of small files that they contain before I undertake the swap you suggest. Does that seem like a good idea?

I figured it would help a lot with rsync as well, as the bands are large contiguous files rather than all the small stuff that makes up a 68K Mac and the similar stuff stored here.
 

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
I really appreciate the insights. One thing I was also considering is to put the older system images I have here into sparsebundle files to eliminate the huge number of small files that they contain before I undertake the swap you suggest. Does that seem like a good idea?

I figured it would help a lot with rsync as well, as the bands are large contiguous files rather than all the small stuff that makes up a 68K Mac and the similar stuff stored here.
I can't comment on that as written, not knowing enough about sparse bundles. But my server stores a ton of system images, and every last one of them is packed one way or another into single files, by the originating (imaging) software.

A simple solution would be to tar-gzip (.tgz) them locally - that is, have a system image inside a directory, tgz the directory, verify/test its contents! then delete the directory and keep the tgz file. Same net effect, and would I do that? Very yes.
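Something like this, assuming one plain directory per image (the names are placeholders):

  # Pack one image directory into a single compressed archive.
  tar -czf old-mac-image.tgz old-mac-image/
  # Sanity-check that the archive lists cleanly before deleting the originals.
  tar -tzf old-mac-image.tgz > /dev/null && rm -rf old-mac-image/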
 

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
I added the following to a previous message, but you might not see it. Partitioning your 1.6s into 0.3 + 1.25s also means that metadata and small files not only can't crowd each other out, but can be independently migrated to larger devices over time, if needed.

So for example, say your small files grew. In a year's time you could buy 3 x 512 GB SSDs for metadata only, zpool attach them to the metadata vdev and let them resilver, then zpool detach all 3 original 300 GB partitions. Then, next stage, you zpool detach a partitioned 1.6 from the small-files vdev, unpartition it, and zpool attach it back. Repeat x 3, and when the last of the 3 is attached back, autoexpand will kick in and the small-files vdev will occupy the entire 3 x 1.6 mirror, not just 1.25 of it.
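Roughly, for the metadata leg of that (a sketch only; pool, partition and new-disk names are placeholders, and autoexpand needs to be on for the growth to happen):

  # Let vdevs grow once all members are bigger.
  zpool set autoexpand=on tank
  # Attach the three new 512 GB SSDs alongside an existing 300 GB partition...
  zpool attach tank nvd1p1 nvd4
  zpool attach tank nvd1p1 nvd5
  zpool attach tank nvd1p1 nvd6
  # ...wait for the resilver to finish (zpool status), then drop the old partitions.
  zpool detach tank nvd1p1
  zpool detach tank nvd2p1
  zpool detach tank nvd3p1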

But bear in mind, you can only grow vdevs that way, not shrink them, and with Z3 in the pool you can't remove a top-level SSD vdev, only grow it or (by growing onto new devices) migrate it bigger. You can't zpool remove a top-level vdev with Z3 in the pool - that's a ZFS limitation - so whatever you do, if you evaluate your needs and you need more SSD space, switching mirror disks to slightly larger mirror disks is the only option, apart from destroy and rebuild.

For that reason I'd keep the original partitioning to under 250 GB for metadata. It's probably plenty, and it means you can later migrate metadata to 3x 256 GB rather than 3x 512 GB, which will save future money. To do this, you could partition them all something like 250 GB metadata (nvd1p1, nvd2p1 and nvd3p1) + 900 GB small files (nvd1p2, nvd2p2 and nvd3p2) + roughly 425 GB freebsd-swap on each SSD (nvd1p3, nvd2p3 and nvd3p3).

Then, as time goes on, you can see how both vdevs fill, and you can easily, one SSD at a time, detach, repartition to say 400+900+275 swap (if you need more metadata space) or 250+1100+225 swap (if you need more small-file space), reattach the SSD, and let ZFS autoexpand the required vdev selectively that way. Later on you can repeat and expand again, and then again later, using more of the swap space you held back, until the SSDs are fully enough used (70-80%?) and the 1.6s can't hold both metadata and small files, at which point you either get more SSDs or let some small data sit in the main pool. That way you don't reserve sizes that your pool doesn't really need, waste space, and force an avoidable extra or larger purchase down the line.
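For the starting layout, the partitioning per SSD would look roughly like this (a sketch; da1 stands in for one of the 1.6TB drives, the labels are made up, and the sizes follow the 250 + 900 + remainder split above):

  # GPT-partition one SSD: metadata slice, small-files slice, held-back swap.
  gpart create -s gpt da1
  gpart add -t freebsd-zfs -s 250G -l meta1 da1
  gpart add -t freebsd-zfs -s 900G -l small1 da1
  gpart add -t freebsd-swap -l swap1 da1
  gpart show da1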
 
Last edited:

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,828
Those are great suggestions, and I will follow them. I may come back to you on the commands you mentioned, but it may be better if I do so in a separate thread, as I have hijacked this one enough (sorry, @HoneyBadger!). Thank you again for all your suggestions! The drives are on order, and I will slowly prepare my disk array here for the transition.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
Based on previous experience I would wager on special vdevs needing some pretty hefty endurance values if you're going to be doing metadata/ddt on them. I'd bet on needing 3 DWPD as a minimum with a preference for more.

Optane is obviously the best option as @Stilez mentioned from both endurance as well as the ability to deliver consistent read latency under a write workload - but if that's too pricey, look for SSDs that get a good grade on the VMware vSAN HCL as "hybrid cache tier" devices, since that's a similar mixed-RW workload.

Regarding tunables - I don't think your L2ARC scan rate will be enough to push 3GB/s there, but you're welcome to try. I'd be a little more conservative still and go with 300MB/s - something tells me L2ARC will fill more than fast enough at those speeds.
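Using the same tunable names as above (assuming they exist on your build), the conservative version would be roughly:

  # ~300 MiB/s fill rate instead of the effectively unthrottled 3 GB/s.
  sysctl vfs.zfs.l2arc.write_max=314572800
  sysctl vfs.zfs.l2arc.write_boost=314572800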

And I've got to remake the thread with some modern data anyways, so no worries if it gets a touch cluttered.
 

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
Regarding tunables - I don't think your L2ARC scan rate will be enough to push 3GB/s there, but you're welcome to try. I'd be a little more conservative still and go with 300MB/s - something tells me L2ARC will fill more than fast enough at those speeds.
3 GB/sec is kinda shorthand for "don't throttle it, and if the SSD doesn't live forever, so be it". It's not a statement of expected data transfer values ;-)
 

ornias

Wizard
Joined
Mar 6, 2020
Messages
1,458
@Constantin "I wish L2ARC data were preserved between reboots and this feature may become available in the future, but I'm not holding my breath"

It has already been merged into OpenZFS and will be included in 2.0 :)
 

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
@Constantin "I wish L2ARC data were preserved between reboots and this feature may become available in the future, but I'm not holding my breath"

It has already been merged into OpenZFS and will be included in 2.0 :)
The one I'm waiting for, now that OpenZFS 2 is close, is log dedup (I think that's what it's called?). I get mixed vibes that it's wanted commercially and paid for as development, and that it's wanted but not being worked on much. Any idea what the actual situation is?
 

ornias

Wizard
Joined
Mar 6, 2020
Messages
1,458
The one I'm waiting for, now that OpenZFS 2 is close, is log dedup (I think that's what it's called?). I get mixed vibes that it's wanted commercially and paid for as development, and that it's wanted but not being worked on much. Any idea what the actual situation is?

I can't find any reference to it on GitHub - do you have a link to what you are talking about?
It doesn't seem to be part of OpenZFS 2.0, nor does it seem to be under development at the moment.

Is it possible you are confusing some sort of proof-of-concept with actual development? There are a LOT of PoCs being done with ZFS, and the time from PoC to an actual PR is about a year; the time from PoC to merge is about 3-5 years.
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,828
Based on previous experience I would wager on special vdevs needing some pretty hefty endurance values if you're going to be doing metadata/ddt on them. I'd bet on needing 3 DWPD as a minimum with a preference for more.

Optane is obviously the best option...
I hear you, and theoretically I could put in a bifurcated / switched NVMe riser card and Optane away, but that stuff is not cheap. I use it in my SLOG and it does a great job there.

The good news is that the S3610 just makes your cutoff at 3 DWPD. The S3710 series does 10, though at a 25% premium over the S3610, and the largest S3710 available is 1.2TB vs. 1.6TB for the S3610.

eBay is filled with these kinds of drives now, and at $200 a 1.6TB enterprise-grade drive in a 3-way mirror seems like an OK risk/reward proposition. I may even buy a 4th to have a cold, proven spare here.
 

ornias

Wizard
Joined
Mar 6, 2020
Messages
1,458
I hear you, and theoretically I could put in a bifurcated / switched NVMe riser card and Optane away, but that stuff is not cheap. I use it in my SLOG and it does a great job there.
Nice thing about Optane:
You can use it as both SLOG and L2ARC without issue...
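In practice that usually means partitioning a single Optane, something like this sketch (nvd0, the labels and the 20G figure are placeholders):

  # Carve a small SLOG partition and give the rest to L2ARC.
  gpart create -s gpt nvd0
  gpart add -t freebsd-zfs -s 20G -l slog0 nvd0
  gpart add -t freebsd-zfs -l cache0 nvd0
  zpool add tank log gpt/slog0
  zpool add tank cache gpt/cache0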
 