The path to success for block storage

jgreco · Dec 31, 2019

It seems like I haven't written a sticky for a while, but just in the last week I've had to cover this topic several times.

ZFS does two different things very well. One is storage of large sequentially-written files, such as archives, logs, or data files, where the file does not have the middle bits modified after creation. This is optimal for RAIDZ. It's what most people come to ZFS for, and what a vast majority of the information out there is about. The other is storage of small, randomly written and randomly read data. This includes such things as database storage, virtual machine disk (ESXi VMDK, etc) storage, and other uses where lots of updates are made within the data. This is optimal for mirrors. The remainder of this document is generally aimed at VM storage and iSCSI, but is still valid for database storage, NFS datastores, etc.

1) Recognize the biggest underlying fact: ZFS is a copy-on-write filesystem.

With a bare hard disk, if you issue a write command to LBA 5678, that specific LBA on the HDD is written, and will be right after LBA 5677 and right before LBA 5679. However, with ZFS, when you write to a virtual disk's LBA 5678, ZFS allocates a new location for that new block, writes it, and frees the old. This means that your system which might have previously had LBA's 5677, 5678, 5679 as sequential data on the ZFS pool will now have 5678 in a different spot. If you try to do a read of the "sequential" LBA's 5677, 5678, 5679 from the VM, there will be a seek in the middle. This is generally referred to as fragmentation. This property would seem to suck, but it brings with it the ability to do a variety of cool things, including snapshots.

You need to pay particular attention to fragmentation as a design issue.
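
As a rough illustration of the remapping described above, here is a toy Python model. It is a sketch under simplifying assumptions, not how ZFS actually allocates: the class name CowVolume, the append-only allocator, and the one-slot-per-LBA layout are all made up for the example.

```python
class CowVolume:
    """Toy copy-on-write volume mapping logical blocks to physical slots."""

    def __init__(self, n_blocks):
        # Initial sequential fill: logical block i lands in physical slot i.
        self.mapping = {lba: lba for lba in range(n_blocks)}
        # Hypothetical allocator: new blocks are always appended at the end.
        self.next_free = n_blocks

    def overwrite(self, lba):
        # Copy-on-write: the new data goes to a fresh slot, never in place.
        self.mapping[lba] = self.next_free
        self.next_free += 1

    def discontinuities(self, start, count):
        # Count places where logically adjacent blocks are not physically adjacent.
        locs = [self.mapping[lba] for lba in range(start, start + count)]
        return sum(1 for a, b in zip(locs, locs[1:]) if b != a + 1)


vol = CowVolume(10000)
print(vol.discontinuities(5677, 3))  # contiguous before any overwrite
vol.overwrite(5678)
print(vol.discontinuities(5677, 3))  # 5678 now lives elsewhere
```

After a single overwrite, the three "sequential" LBAs are no longer physically adjacent, so reading them back means a jump out to the relocated block and a jump back.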

2) You need to use mirrors for performance.

ZFS generally does not do well with block storage on RAIDZ. RAIDZ is optimized towards variable length ZFS blocks. Unlike "normal" RAID, RAIDZ computes parity and stores it with the ZFS block, and on a RAIDZ3 where you store a single 4K sector, you get three parity sectors stored with it (4x space amplification)! While there are optimizations you can do to make it suck less, the fact is that a RAIDZ vdev tends to adopt the IOPS characteristics of the slowest component member. This is partly because of what Avi calls "seek binding", because multiple disks have to participate in a single operation because the data is spread across the disks. Your ten drive RAIDZ2 vdev may end up about as fast as a single drive, which is fine for archival storage, but not good for storing lots of active VM's on.

By way of comparison, a two-way mirror vdev can be servicing two different operations (clients reading) simultaneously, a three-way mirror vdev can even be servicing three different operations. There is massive parallelism available with mirrors.
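
To make the RAIDZ space amplification mentioned above concrete, here is a small Python sketch. It is a simplified model, and the function names are hypothetical. The rounding of each allocation up to a multiple of (parity + 1) sectors is my assumption about how RAIDZ pads allocations, so treat the exact numbers as approximate.

```python
import math


def raidz_sectors(data_sectors, width, parity):
    """Approximate sectors consumed by one ZFS block on a RAIDZ vdev."""
    # Each stripe row holds up to (width - parity) data sectors plus parity.
    rows = math.ceil(data_sectors / (width - parity))
    total = data_sectors + parity * rows
    # Assumed padding rule: round up to a multiple of (parity + 1) sectors.
    return math.ceil(total / (parity + 1)) * (parity + 1)


def amplification(data_sectors, width, parity):
    """Ratio of sectors consumed to sectors of actual data."""
    return raidz_sectors(data_sectors, width, parity) / data_sectors


# One 4K sector on a ten-drive RAIDZ3: one data sector plus three parity sectors.
print(amplification(1, 10, 3))
# A 128K block (32 sectors of 4K) on a ten-drive RAIDZ2, for comparison.
print(amplification(32, 10, 2))
```

The small-block case is the one that hurts block storage: the single-sector write costs four sectors, while the large block pays only about 1.3x in this model.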

Additional reading:
Some differences between RAIDZ and mirrors, and why we use mirrors for block storage

3) Plan to use lots of vdevs.

For most VM or database applications, you have lots of things wanting to do lots of I/O. While hard disks are much larger than they were 25 years ago (16TB vs 1GB), their ability to sustain random I/O is virtually the same (approximately 100-200 random IOPS). With the advent of virtualization, hard drive IOPS are getting shared between VM's, creating an effective reduction in HDD IOPS per VM over what you'd get from a physical workload. ZFS can help with the read workload through ARC and L2ARC caching, but for writes, it always goes to the pool. Using more vdevs increases the available pool IOPS.

Most virtualization designs set a target level of IOPS per VM. It helps to recognize that a single HDD vdev only has maybe 200-300 mixed IOPS available, so if you are planning on 50 IOPS for each VM, and you want 40 VM's, you need probably at least 8 vdevs to be in the ballpark, 10 would be better.
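
The arithmetic behind that estimate can be sketched in a few lines of Python. The function name is hypothetical, and the per-vdev figures are just the rough 200-300 mixed IOPS range quoted above, not measurements.

```python
import math


def vdevs_needed(vm_count, iops_per_vm, iops_per_vdev):
    """Minimum vdev count to cover the aggregate IOPS target."""
    return math.ceil(vm_count * iops_per_vm / iops_per_vdev)


# 40 VMs at 50 IOPS each is 2000 IOPS in aggregate.
for per_vdev in (200, 250, 300):
    print(per_vdev, "IOPS per vdev ->", vdevs_needed(40, 50, per_vdev), "vdevs")
```

At the pessimistic end of the range this lands on 10 vdevs, and at a middling 250 IOPS per vdev it lands on 8, which is where the "at least 8, 10 would be better" figure comes from.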

4) ZFS write speeds are closely coupled to easily finding large contiguous ranges of free space.

ZFS writes are a complex topic. One of the biggest factors in ZFS write speeds is the ability of the system to find large contiguous runs of free space. This ties in to fragmentation as well. In general, ZFS will tend to write a transaction group to disk as a large sequential write if it can find the free space to do so. It doesn't matter if the files being written are for sequential or random data! Because of this, ZFS seems to be amazingly fast at writes especially on a relatively empty pool. You can write a large sequential file to the pool and it goes fast. You can rewrite random data blocks and it also goes fast -- MUCH faster than if it were seeking around!

But there's a dark side to this. If you are writing on a fullish fragmented pool, all writes will be slow. You can be writing what you think is a large sequential file, and ZFS will be having to scrounge together little bits of space here and there due to fragmentation, and it will be slow.

Prior analysis of this suggests that this effect becomes very significant at around 50% pool occupancy. This isn't to say that every pool that is 50% full will be very slow, but that over time, a pool with 50% occupancy will tend to stabilize at a steady state with relatively poor write performance in the long run.

5) Because of this, a 12TB 5400RPM drive is a lot more valuable to most pools than a 6TB 7200RPM drive.

By the time you are seeking heavily enough for you to be concerned about the RPM of the drive, you have already dropped from being able to write at 150-200MBytes/sec (sequential) to the drive down to just a few MBytes/sec (random). A 7200RPM drive going at even 10MBytes/sec (200 48KByte random writes per second) is nowhere near as fast as a 5400RPM drive writing sequentially.
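
A quick back-of-the-envelope check of those numbers, as a Python sketch. The function name is hypothetical, and 150 MBytes/sec is simply the low end of the sequential range quoted above.

```python
def random_write_mbytes_per_sec(iops, io_kbytes):
    """Throughput of small random writes, using 1 MByte = 1000 KBytes."""
    return iops * io_kbytes / 1000


random_rate = random_write_mbytes_per_sec(200, 48)  # 7200RPM drive, seeking
sequential_rate = 150                               # 5400RPM drive, sequential

print(random_rate)                    # roughly 10 MBytes/sec
print(sequential_rate / random_rate)  # how many times faster sequential is
```

Even with generous assumptions for the faster-spinning drive, the sequential writer comes out more than 15 times ahead, which is why free space matters more than RPM.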

Buy 5400/5900RPM drives much larger than you'd otherwise think you need if you want fast write speeds. I think 7200RPM drives are for chumps.

6) Keep the pool occupancy rate low.

This ties in with the write speed strategy. ZFS needs to be able to easily find large amounts of contiguous free space. Our friends at Delphix did an analysis of steady state performance of ZFS write speeds vs pool occupancy on a single-drive pool and came up with this graph:

A 10%-full pool is going to fly for writes. By the time you get up to 50%, the steady state performance is already pretty bad. Not everyone is going to get there... if you have a lot of content that is never rewritten, your fragmentation rates may be much better because you haven't rewritten as much stuff.

Particularly noteworthy: The pool at 10% full is around 6x faster than the pool at 50%.

But what about reads? We've spent all this time talking about writes and free space. ZFS rewards you with better write speeds if you give it gobs of free space. Reads still suffer from fragmentation and seeks!

This is true. ZFS really only has one mechanism to cope with read fragmentation: the ARC (and L2ARC). So these next bits are somewhat simpler.

7) It is best to have at a bare minimum 64GB RAM to do block storage.

Especially with iSCSI, block storage tends to do poorly on ZFS unless there is a lot of cache. While there is no one-size-fits-all rule, doing anything more than trite VM storage seems to go poorly with less than 64GB RAM.

8) Ideally you want to cache the working set

The working set is a term used to describe "active data" on the pool -- data that is being accessed. For example, on most UNIX systems, the disk blocks for /bin/sh are frequently read, but the disk blocks for the manual page for phttpget(8) are probably not ever accessed. It would be valuable to have the disk blocks for /bin/sh in ARC but not phttpget's man page in ARC. How exactly you wish to define the working set is a good question. Blocks read within a 5 minute period? 60 minute? Once a day? This doesn't have a "correct" answer, but it isn't unusual for the working set of a VM to be in the range of 1/50th to 1/20th of the on-disk size of the VM.

By caching the working set, you free the pool to focus on reading the occasional thing not in the working set, and to focus on writes. A ZFS system with the entire working set cached will show almost no pool read activity.

A lot of the working set isn't frequently accessed. It's fine for that to be covered by L2ARC. You want to size your ARC to cover the frequently accessed stuff plus enough space for the L2ARC indirect pointers.
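
For a first guess at how much ARC plus L2ARC that implies, the 1/50th to 1/20th rule of thumb above can be turned into a small Python sketch. The function name and the example of 40 VMs at 100GB each are hypothetical, and your real working set may well fall outside this range.

```python
def working_set_range_gb(on_disk_gb):
    """Rough low and high working-set estimates from total on-disk VM size."""
    return on_disk_gb / 50, on_disk_gb / 20


# Hypothetical example: 40 VMs at 100GB on disk each.
low, high = working_set_range_gb(40 * 100)
print(low, high)
```

For that example the estimate is somewhere between 80GB and 200GB of combined cache, with the hottest part of it wanting to live in ARC.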

Additional reading: Why iSCSI often requires more resources for the same result

9) VM storage is an exercise in parallelism and fragmentation

Don't bother doing conventional benchmarks for your VM storage pool. A good VM storage pool is designed to be doing many operations in parallel, a thing that many benchmarks suck at. It is far better to run benchmarks designed for heavy parallelism from multiple VM's in your production setup, and don't just run them once when the pool is empty, but rather let it get fragmented and then see how it is.

10) Don't misunderstand the ZIL.

The ZFS Intent Log is not a write cache. We call a Separate LOG device the "SLOG" for a reason. It isn't a cache.

No amount of SLOG will make up for a crappy RAIDZ pool design. (Again! ZIL/SLOG Is Not A Cache!)

The fastest sustained write speed your pool will EVER be capable of is when you turn off sync writes. That's it. No more.

Adding sync writes (whether ZIL or SLOG) will ALWAYS slow down your pool compared to the non-sync write speed.

We use sync writes on VM data to ensure that a VM remains consistent if the filer panics or loses power. If this is not a concern for you, feel free to disable sync VM writes, and things will go faster!

Additional reading: Some insights into SLOG/ZIL with ZFS on FreeNAS

11) Write speeds greater than read speeds?

When ZFS is "writing" to the pool, it is actually creating a transaction group in RAM to commit to disk later. If ZFS is "reading" from the pool and the data is not in ARC/L2ARC, it actually needs to go out to a HDD to pull the data in. If your read speeds are slower than your write speeds, it just means that the data being read wasn't in cache. If you expected that data to be in the working set, perhaps your working set is too small.

12) Make sure your drives aren't SMR.

Shingled magnetic recording drives are generally unsuitable for FreeNAS and ZFS, but are particularly horrible for block storage due to the need to be rewriting small blocks in the middle of random tracks. If you use SMR for block storage, expect it to suck. Sorry. No gentle way to say it.

I'm probably not done with this but felt I needed to bang out some bits. If you reply please do not be shocked if I trim your reply and steal your idea, remember, I'm a grinch.

Mirfster · Feb 4, 2020

jgreco said:
I think 7200RPM drives are for chumps.

Oof, ya hit me right in the feels.

jgreco · Feb 4, 2020

Mirfster said:
Oof, ya hit me right in the feels.

jgreco said:
I think 7200RPM drives are for chumps.

Okay, fine, maybe slightly overstated, but basically you only get a small percentage speed increase in random I/O capacity on a 7200RPM drive compared to a 5400RPM drive. This tends to be more meaningful on a conventional filesystem where fragmentation isn't as big a deal. Given the price differential between a 5400RPM drive and a 7200RPM drive (typically 1.5-2x more expensive historically, in my experience) I'd rather spend that extra money on a larger drive.

A CoW filesystem will always veer towards greater fragmentation and ZFS simply doesn't have any tools to FIX fragmentation.

You can, obviously, mitigate some of the damage from fragmentation by making sure there's large contiguous runs of free space, so my preference is to spend dollars on larger hard disks rather than "faster" hard disks, and keep the larger hard disks emptier.

On the read side of things, really all you can do is throw lots of L2ARC at it.

If you've burned through all those things and it's still too slow, then faster drives are all that's left, but I don't think you'll see a huge speed boost.

woods · Feb 6, 2020

interesting read!

Do you have a similar piece of writing on single-threaded storage systems? Like a video NAS for a single client? Would you also recommend using mirrors in this case, especially if maximising read speed is the priority?

jgreco · Feb 6, 2020

No, most of that is easier to figure out.

Mirrors tend towards being faster for concurrent accesses. RAIDZ is inherently more oriented towards single-threaded access.

woods · Feb 13, 2020

Maybe a thing to add is that sometimes, slow speeds can be caused by disks with some bad sectors. As a newbie, it was quite confusing to me to see extremely slow writes in comparison to my reads, with FreeNAS not taking the disk offline and showing the pool as healthy. I only figured it out by testing and then replacing that drive. It fixed the issue immediately.

otpi · Jun 5, 2020

Thanks for the write up!

It begs the question: Does ZFS ever do any housekeeping? Like defragging the disks?

Does this mean a "full" (fragmented) pool benefits more from adding SSD cache (L2ARC), and SLOG for those writes?

garm · Jun 5, 2020

otpi said:
Does ZFS ever do any housekeeping? Like defragging the disks?

No

otpi said:
Does this mean a "full" (fragmented) pool benefits more from adding SSD cache (L2ARC), and SLOG for those writes?

ZFS tracks free space fragmentation. Once you reach levels that hit performance, your pool is overutilized and your upgrade plan needs to kick in.

jgreco · Jun 5, 2020

otpi said:
Thanks for the write up!

It begs the question: Does ZFS ever do any housekeeping? Like defragging the disks?

It can't.

The general nature of a CoW filesystem means that overwrites always result in fragmentation.

The massive feature set of ZFS (clones, snapshots, etc.) means that there could be dozens or even thousands of references to a given data block. To move that one data block, you would need to update all of the metadata that refers to it, which could result in thousands of metadata updates just to move a file data block. And then rewriting those metadata blocks is also a problem...

Does this mean a "full" (fragmented) pool benefits more from adding SSD cache (L2ARC), and SLOG for those writes?

Yes for L2ARC, and SLOG is *not* a cache, so no to that. I have discussed this many times. The only thing you can really do to improve the write situation is to have lots of free space available on the pool.

otpi · Jun 8, 2020

jgreco said:
The massive feature set of ZFS (clones, snapshots, etc.) means that there could be dozens or even thousands of references to a given data block. To move that one data block, you would need to update all of the metadata that refers to it, which could result in thousands of metadata updates just to move a file data block. And then rewriting those metadata blocks is also a problem...

Makes sense.

For my typical home NAS use I don't expect this to ever become a problem. So this is strictly hypothetical, out of curiosity: What happens when a disk is resilvered? Does it also rebuild the fragmentation, or is it on a file level such that large fragmented files would be sequentially written on the new disk?

thepixelgeek · Jun 16, 2020

I discovered I have SMR drives. Now what?

jgreco · Jun 16, 2020

thepixelgeek said:
I discovered I have SMR drives. Now what?

1) Replace them

or

2) Pray you never need to resilver - a possible remediation would be to use a CMR drive for any replacement drive

or

3) Wait a really long time when you resilver

I know these are not the answers you want to hear. I am trying to present you with realistic options though. In theory the SMR drives are actually pretty good at read-mostly storage, so for anything that is archival in nature or read-mostly, you could possibly be okay.

jgreco · Jun 16, 2020

otpi said:
Makes sense.

For my typical home NAS use I don't expect this to ever become a problem. So this is strictly hypothetical, out of curiosity: What happens when a disk is resilvered? Does it also rebuild the fragmentation, or is it on a file level such that large fragmented files would be sequentially written on the new disk?

Fragmentation refers to the manner in which blocks are occupied on the disk (or free blocks are available, or however you wish to describe it).

RAIDZ works "differently" than RAID5 in that parity locations are not precomputed, but a block of data (like a 1MB block) is laid down contiguously across the disks, and the repaired sectors need to be in the same spots, so, yes, more or less, resilvering does nothing to help with fragmentation.

f00b4r00 · Jun 28, 2020

noob question: is fragmentation a matter at pool or dataset level? i.e. if I set tight quotas on some datasets but the pool on which these datasets live has ample free space, will filling the quota-restricted datasets expose the write slowdown problem, or am I "safe"?

Ericloewe · Jun 28, 2020

Datasets are merely a consumer of what is, in effect, an abstract storage layer. Hence the zpool and zfs commands being separate.
For performance, the pool layer matters, not whether the dataset is full according to its quota.

Well, there is one catch: IIRC, standard behavior is to rigidly enforce quotas, which takes a while and slows down writes if the dataset is near full. There was talk of a new option to have fuzzy enforcement of quotas, where you don't lose speed, but quotas may be slightly exceeded.

Sir.Robin · Jul 13, 2020

Sooo... my 8x2TB pool is about 50% utilized. Is there any point in going down to a 6x 4TB pool when upgrading from 2 to 4TB drives?
I'm thinking performance wise. Not counting on using the extra space.

kspare · Jul 13, 2020

One thing we do to combat fragmentation is to simply migrate all our VMs to another server. That's really the only way to deal with it?

jgreco · Jul 13, 2020

kspare said:
One thing we do to combat fragmentation is to simply migrate all our VMs to another server. That's really the only way to deal with it?

Yes, the historical answer for ZFS fragmentation, database users, block storage users, etc., is to migrate stuff back and forth in order to force defragmentation. Databases tend to be highly random access anyways, but block storage for VM's can definitely involve rewritten sequential segments (think OS updates etc), so this is most apparent to folks using ZFS for VM storage.

Rewriting is not the only answer. If you keep gobs of free space on the pool along with gobs of ARC and L2ARC, you can get to a point where the working set is all in your ARC/L2ARC and then the write speeds to the pool are mostly what's being impacted by fragmentation. So if you have 20TB of pool storage, 10TB of which is used, 1TB of which is working set, if you have 1TB of L2ARC, your working set will tend to live out of the L2ARC, but your writes will get slower over time as fragmentation increases (~50% utilization). If you increase the pool storage to 100TB, your writes should remain relatively fast as well, because ZFS isn't having to zip all over the place to allocate small runs of blocks. However, lots of people are understandably resistant to the idea of needing 200TB of disk space (since you need to use mirrors) plus 1TB of SSD L2ARC, to get 100TB of pool storage that gets you 10TB of very fast storage.
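
The sizing in that example works out as follows, as a Python sketch. The function name is hypothetical, and the 10% occupancy target is just the figure implied by the example, not a hard rule.

```python
def pool_plan(used_tb, occupancy_pct, mirror_width=2):
    """Return (pool_tb, raw_disk_tb) needed to hold used_tb at a target occupancy."""
    pool_tb = used_tb * 100 / occupancy_pct
    # Mirrors store every block mirror_width times.
    raw_disk_tb = pool_tb * mirror_width
    return pool_tb, raw_disk_tb


print(pool_plan(10, 50))  # the 20TB pool that slows down over time
print(pool_plan(10, 10))  # the 100TB pool that stays fast
```

With two-way mirrors, holding 10TB at 10% occupancy means a 100TB pool built from 200TB of raw disk, which is the uncomfortable number in the paragraph above.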

Don't shoot the messenger. I'm just answering your question with the other way to deal with it -- throw lots of resources at it and the problem almost vanishes. This is compsci. You're making tradeoffs to leverage relatively cheap HDD space to mitigate the issue of very low HDD seek speeds.

We're all waiting for the point at which SSD makes more sense. Depending on what numbers you use and what endurance you need, we might almost be there. Two 10TB HDD's give you ~1-5TB of VM space, 1 if you want good speed, at a cost of around $400. Four 1TB SSD's give you ~1-1.5TB of VM space, at a cost of around $500, if you don't mind consumer endurance levels. So it's close.
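
Putting those rough 2020 figures side by side as cost per TB of usable VM space, in a short Python sketch. The function name is hypothetical, and the prices and capacities are only the approximate numbers quoted above.

```python
def cost_per_tb(cost_usd, usable_tb_low, usable_tb_high):
    """Return (best case, worst case) dollars per TB of usable VM space."""
    return cost_usd / usable_tb_high, cost_usd / usable_tb_low


hdd = cost_per_tb(400, 1, 5)    # two 10TB HDDs, mirrored
ssd = cost_per_tb(500, 1, 1.5)  # four 1TB consumer SSDs

print("HDD: $%.0f to $%.0f per TB" % hdd)
print("SSD: $%.0f to $%.0f per TB" % ssd)
```

If you want the "good speed" end of the HDD range (about 1TB usable), HDD comes out at roughly $400 per TB against $333 to $500 per TB for SSD, hence "close".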

kspare · Jul 13, 2020

This is true. We also run 256GB of RAM, a mirrored P3700 SLOG, and a 1.6TB P3700 L2ARC. We're now experimenting with mirrored meta drives and it looks really good.

When we migrate off, it's usually quarterly. This is to do general maintenance of the server as well, i.e. firmware, software updates, etc.

jgreco · Jul 13, 2020

kspare said:
When we migrate off, it's usually quarterly. This is to do general maintenance of the server as well, i.e. firmware, software updates, etc.

Yeah, that's the other issue, it's really hard to do upgrades without going through an evacuation cycle. This is one of the places where HDD will continue to win out for some time yet, because you *have* to have the extra space lying around, unlike with SSD where you might be tempted to crowd in VM's.
