
The path to success for block storage

TxAggieEngineer

Dabbler
Joined
Apr 25, 2023
Messages
16
Thanks for a very interesting post! I'm curious about a couple of the recommendations. I've been working with iXsystems on an X10 configuration that will be 100% iSCSI, supporting ~15 VM's in a production environment. The X10 has a max of 32GB of RAM, and this configuration will have one pool with three 2-disk vdevs. The performance specs I have been provided are 1900 IOPS at 600MB/s (614,400KB/s) at 80% capacity. This seems far higher than what should be possible in this configuration based on what I've read, especially with only 32GB RAM vs. the 64GB recommendation, so I'm wondering how there could be such a significant delta. I realize it's been four years since the original post so has TrueNAS and/or ZFS handling of iSCSI improved significantly during that time?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I realize it's been four years since the original post so has TrueNAS and/or ZFS handling of iSCSI improved significantly during that time?

No. We just disagree. iXsystems has made a few comments on my recommendations over the years; these are recommendations where *I* feel confident in telling some random Internet user a statistic or suggestion that should hold true under reasonable conditions. iXsystems, on the other hand, enters into customer relationships and may be contractually obligated to make good on what they say. I provide no such guarantee but I do not want my reputation in the industry tarnished. I might overshoot a bit, as a result. On the other hand, iXsystems has previously indicated that 16GB will "work" with iSCSI (and yes it will) but my take on this is that it doesn't seem to work out well for the average deployment.

I recommend 64GB as a floor for all but the most trite iSCSI setups. I've outlined my reasoning in "Why iSCSI often requires more resources for the same result" and I believe it to be reasonably sound, but it isn't true in all cases (which explains the carefully placed word "often" in the title). A secondary reason is that the experience here on the forums has generally been that L2ARC is more useful as you give it more memory to gather up MFU statistics; most people trying L2ARC with only 16 or 32GB of RAM see much less benefit. This presumably has something to do with the average working set size of a VM -- some blocks of a VM are rarely accessed, and ideally you want enough ARC to allow intelligent eviction of infrequently used blocks to the L2ARC, which then gives you some additional IOPS.

The performance specs I have been provided are 1900 IOPS at 600MB/s (614,400KB/s) at 80% capacity.

In reading between the lines, I feel like they're selling you on three reasonably fast HDD-based vdevs. That assumption shapes the rest of this, so I'd just like to state it up front. 1900 IOPS across six devices breaks down to about 320 IOPS per drive, clearly HDD territory, and 600MB/s breaks down to 100MBytes/sec per drive, which may be overly optimistic. Hard drive seeks are performance killers, so it helps to understand that a highly pessimistic calculation of 200 IOPS per drive doing 4K I/O, assuming seeks after most of them, only nets you about 1MByte/sec. That is terrible worst-case performance, but it is worth understanding that this is where the low end sits. Now pick your chin up off the floor. :smile:
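If it helps, here's that arithmetic as a trivial Python sketch -- the inputs are just the claimed specs from the quote above; nothing here measures real hardware:

```python
# Claimed specs from the quote, divided across the six member drives.
claimed_iops = 1900
claimed_mb_s = 600
drives = 6                       # three 2-disk mirror vdevs

print(f"{claimed_iops / drives:.0f} IOPS per drive")        # ~317
print(f"{claimed_mb_s / drives:.0f} MBytes/sec per drive")  # 100

# Highly pessimistic floor: 200 IOPS of 4K I/O, a seek after most of them.
print(f"{200 * 4096 / 1e6:.1f} MBytes/sec worst case")      # ~0.8
```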

So the thing that's a problem here is that ZFS is a copy-on-write filesystem, and as the filesystem ages, you get increased fragmentation of the data on the pool. As that happens, the cost both to read and to write it increases. When a ZFS pool is only 10% full, ZFS has no problem finding contiguous regions of space to write to, and in many cases even random writes may end up having sequential space allocated to them, meaning they write very fast because there's no seek. That's awesome, and many people report "SSD-like" performance on ZFS HDD pools that are fresh and mostly empty.

The problem is that as the pool fills, and as the pool fragments, or, worse, BOTH, the inevitable cycles of writing blocks and then freeing them eventually fragment the space. Eventually you reach something called a "steady state", where the performance degradation levels off at an average that isn't getting any worse, and the pool will continue to perform at about that level from there on unless some variables change. The main variables are pool percentage full and number of rewrites. For this discussion, I typically ignore number of rewrites because we assume it will eventually get there. We want to know how bad it is REASONABLY LIKELY to get.

So, some years ago our friends at Delphix ran a test to demonstrate this. Let's consider their resulting graph. This shows a pool being held at 10%, 25%, 50%, 75%, and 95% full by writing data to that capacity. Once reached, the tester begins removing data and then writing additional data. This test continues until the pool's measured KB/sec reaches a plateau (the "steady state").

[Graph: Delphix steady-state write throughput vs. pool occupancy at 10%, 25%, 50%, 75%, and 95% full (delphix-small.png)]


Their test was only with a single disk, but the compsci is generally valid regardless. If you have a pool that is only 10% full, it will go much faster because most reads and writes can be fulfilled with minimal seek activity. 90% of the disk is empty and ZFS will favor the large empty areas for new writes. As the older blocks are removed, they contribute towards new empty areas, and as long as you're writing into an empty area, the write will be fast and the reads of that will be fast too.

As the pool fill percentage is increased, performance drops. Under the hood, the allocator has to work harder: with less free space it spends more time hunting through metaslabs for usable runs, and it eventually falls back from fast "first fit" allocation to a slower "best fit" strategy, trading write speed for tighter packing of whatever space remains.

So here's what I have to say: I could believe that iXsystems will sell you an iSCSI system with the claimed parameters. You didn't go into particulars, but if you were trying to buy 20TB of iSCSI space at the claimed specs, I could build something that would probably do that job using 6x 20TB HDD's (giving you a 60TB pool), and then if you look at the ~33% mark on the graph, I would find it entirely plausible that you would get a reasonable amount of performance out of it. I don't know the actual performance characteristics of the drive Delphix used to generate this, but I can tell you that this is how ZFS gets its reputation for speed. Avoiding excessive seeks (remember the pessimistic 1MByte/sec from above?) is a realistic trick that may be able to deliver what you're looking for. iXsystems has an easier time working these things from real numbers since they have access to the hardware that they're selling you. But I can still explain what you need to know in order to apply reasonable skepticism when kicking the tires on the thing.
 
Joined
Sep 3, 2023
Messages
2
What about mixing SSDs and HDDs in the same pool in mirrors?
I have a NetApp 4246 shelf with 24 x 3TB SAS drives and an EMC shelf with 11 x 800GB MU SSDs.
Should I make two separate pools, or 2-way mirrors with 2 spares on each type of drive?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Well, you'd get HDD performance at SSD prices, which I'll take the liberty of assuming is contrary to your goals.
 

Ebedorian

Dabbler
Joined
May 5, 2023
Messages
10
It seems like I haven't written a sticky for a while, but just in the last week I've had to cover this topic several times.

ZFS does two different things very well. One is storage of large sequentially-written files, such as archives, logs, or data files, where the file does not have the middle bits modified after creation. This is optimal for RAIDZ. It's what most people come to ZFS for, and what a vast majority of the information out there is about. The other is storage of small, randomly written and randomly read data. This includes such things as database storage, virtual machine disk (ESXi VMDK, etc) storage, and other uses where lots of updates are made within the data. This is optimal for mirrors. The remainder of this document is generally aimed at VM storage and iSCSI, but is still valid for database storage, NFS datastores, etc.

1) Recognize the biggest underlying fact: ZFS is a copy-on-write filesystem.

With a bare hard disk, if you issue a write command to LBA 5678, that specific LBA on the HDD is written, and will be right after LBA 5677 and right before LBA 5679. However, with ZFS, when you write to a virtual disk's LBA 5678, ZFS allocates a new location for that new block, writes it, and frees the old. This means that your system which might have previously had LBA's 5677, 5678, 5679 as sequential data on the ZFS pool will now have 5678 in a different spot. If you try to do a read of the "sequential" LBA's 5677, 5678, 5679 from the VM, there will be a seek in the middle. This is generally referred to as fragmentation. This property would seem to suck, but it brings with it the ability to do a variety of cool things, including snapshots.
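If a toy model helps, here's the remapping idea in a few lines of Python -- this is nothing like the real allocator, just the concept:

```python
# Toy copy-on-write map: virtual LBA -> physical location.
block_map = {lba: lba for lba in range(8)}   # initially laid out sequentially
next_free = 8                                # next unallocated physical spot

def cow_write(lba):
    """Copy-on-write: allocate a fresh spot, abandon the old one."""
    global next_free
    block_map[lba] = next_free
    next_free += 1

cow_write(6)   # the VM rewrites the middle of a "sequential" run
print([block_map[lba] for lba in range(8)])
# [0, 1, 2, 3, 4, 5, 8, 7] -- reading virtual LBAs 5..7 now seeks mid-stream
```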

You need to pay particular attention to fragmentation as a design issue.

2) You need to use mirrors for performance.

ZFS generally does not do well with block storage on RAIDZ. RAIDZ is optimized towards variable length ZFS blocks. Unlike "normal" RAID, RAIDZ computes parity and stores it with the ZFS block, and on a RAIDZ3 where you store a single 4K sector, you get three parity sectors stored with it (4x space amplification)! While there are optimizations you can do to make it suck less, the fact is that a RAIDZ vdev tends to adopt the IOPS characteristics of the slowest component member. This is partly because of what Avi calls "seek binding", because multiple disks have to participate in a single operation because the data is spread across the disks. Your ten drive RAIDZ2 vdev may end up about as fast as a single drive, which is fine for archival storage, but not good for storing lots of active VM's on.
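For the curious, the space amplification can be sketched in Python. This follows the simplified allocation model from Matt Ahrens' RAIDZ capacity write-up (parity sectors per row, then padding to a multiple of parity+1); the real code has more wrinkles:

```python
import math

def raidz_sectors(data_sectors, width, parity):
    """Approximate sectors consumed by one block on a RAIDZ vdev."""
    rows = math.ceil(data_sectors / (width - parity))
    total = data_sectors + parity * rows      # data plus per-row parity
    return total + (-total % (parity + 1))    # pad to multiple of parity+1

# One 4K sector (a single data sector) on a 10-wide vdev:
for p in (1, 2, 3):
    print(f"RAIDZ{p}: {raidz_sectors(1, 10, p)}x space for a 4K block")
# RAIDZ1: 2x, RAIDZ2: 3x, RAIDZ3: 4x
```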

By way of comparison, a two-way mirror vdev can be servicing two different operations (clients reading) simultaneously, a three-way mirror vdev can even be servicing three different operations. There is massive parallelism available with mirrors.

Additional reading:
Some differences between RAIDZ and mirrors, and why we use mirrors for block storage

3) Plan to use lots of vdevs.

For most VM or database applications, you have lots of things wanting to do lots of I/O. While hard disks are much larger than they were 25 years ago (16TB vs 1GB), their ability to sustain random I/O is virtually the same (approximately 100-200 random IOPS). With the advent of virtualization, hard drive IOPS are getting shared between VM's, creating an effective reduction in HDD IOPS per VM over what you'd get from a physical workload. ZFS can help with the read workload through ARC and L2ARC caching, but for writes, it always goes to the pool. Using more vdevs increases the available pool IOPS.

Most virtualization designs set a target level of IOPS per VM. It helps to recognize that a single HDD vdev only has maybe 200-300 mixed IOPS available, so if you are planning on 50 IOPS for each VM, and you want 40 VM's, you probably need at least 8 vdevs to be in the ballpark; 10 would be better.
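As a back-of-the-envelope sketch (the 200-300 mixed IOPS per HDD mirror vdev is an assumption; plug in your own number):

```python
import math

def vdevs_needed(vms, iops_per_vm, iops_per_vdev):
    """Rough count of HDD mirror vdevs for a target IOPS budget."""
    return math.ceil(vms * iops_per_vm / iops_per_vdev)

print(vdevs_needed(40, 50, 250))   # 8 vdevs if you assume 250 IOPS/vdev
print(vdevs_needed(40, 50, 200))   # 10 vdevs at a more cautious 200
```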

4) ZFS write speeds are closely coupled to easily finding large contiguous ranges of free space.

ZFS writes are a complex topic. One of the biggest factors in ZFS write speeds is the ability of the system to find large contiguous runs of free space. This ties in to fragmentation as well. In general, ZFS will tend to write a transaction group to disk as a large sequential write if it can find the free space to do so. It doesn't matter if the files being written are for sequential or random data! Because of this, ZFS seems to be amazingly fast at writes especially on a relatively empty pool. You can write a large sequential file to the pool and it goes fast. You can rewrite random data blocks and it also goes fast -- MUCH faster than if it were seeking around!
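To make that concrete, here's the idea in miniature (purely illustrative; the real allocator is far more involved):

```python
# Dirty blocks accumulate in RAM as part of a transaction group (txg).
# At commit time, if the allocator finds one big contiguous free run,
# even randomly-addressed writes land sequentially on disk.
dirty = {9123: b"...", 77: b"...", 5410: b"..."}   # scattered virtual LBAs
free_run_start = 100_000                           # big contiguous free region

layout = {lba: free_run_start + i
          for i, lba in enumerate(sorted(dirty))}
print(layout)   # {77: 100000, 5410: 100001, 9123: 100002} -- no seeks
```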

But there's a dark side to this. If you are writing on a fullish, fragmented pool, all writes will be slow. You can be writing what you think is a large sequential file, and ZFS will have to scrounge together little bits of space here and there due to fragmentation, and it will be slow.

Prior analysis suggests that this effect becomes very significant at around 50% occupancy. This isn't to say that every pool that is 50% full will be very slow, but that over time, a pool with 50% occupancy will tend to stabilize at a steady state with relatively poor write performance in the long run.

5) Because of this, a 12TB 5400RPM drive is a lot more valuable to most pools than a 6TB 7200RPM drive.

By the time you are seeking heavily enough for you to be concerned about the RPM of the drive, you have already dropped from being able to write at 150-200MBytes/sec (sequential) to the drive down to just a few MBytes/sec (random). A 7200RPM drive going at even 10MBytes/sec (200 48KByte random writes per second) is nowhere near as fast as a 5400RPM drive writing sequentially.
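Spelled out in a couple of lines (the 150MBytes/sec sequential figure is an assumption for a large modern drive):

```python
seq_5400_mb_s = 150                      # big 5400RPM drive, sequential writes
rand_7200_mb_s = 200 * 48 * 1024 / 1e6   # 200 random 48K writes/sec, seeking
print(f"5400RPM sequential: {seq_5400_mb_s} MBytes/sec")
print(f"7200RPM random:     {rand_7200_mb_s:.1f} MBytes/sec")  # ~9.8
```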

Buy 5400/5900RPM drives much larger than you'd otherwise think you need if you want fast write speeds. I think 7200RPM drives are for chumps.

6) Keep the pool occupancy rate low.

This ties in with the write speed strategy. ZFS needs to be able to easily find large amounts of contiguous free space. Our friends at Delphix did an analysis of steady state performance of ZFS write speeds vs pool occupancy on a single-drive pool and came up with this graph:

[Graph: Delphix steady-state write throughput vs. pool occupancy (delphix-small.png)]

A 10%-full pool is going to fly for writes. By the time you get up to 50%, the steady state performance is already pretty bad. Not everyone is going to get there... if you have a lot of content that is never rewritten, your fragmentation rates may be much better because you haven't rewritten as much stuff.

Particularly noteworthy: The pool at 10% full is around 6x faster than the pool at 50%.

But what about reads? We've spent all this time talking about writes and free space. ZFS rewards you with better write speeds if you give it gobs of free space. Reads still suffer from fragmentation and seeks!

This is true. ZFS really only has one mechanism to cope with read fragmentation: the ARC (and L2ARC). So these next bits are somewhat simpler.

7) It is best to have, at a bare minimum, 64GB RAM to do block storage.

Especially with iSCSI, block storage tends to do poorly on ZFS unless there is a lot of cache. While there is no one-size-fits-all rule, doing anything more than trite VM storage seems to go poorly with less than 64GB RAM.

8) Ideally you want to cache the working set

The working set is a term used to describe "active data" on the pool -- data that is being accessed. For example, on most UNIX systems, the disk blocks for /bin/sh are frequently read, but the disk blocks for the manual page for phttpget(8) are probably not ever accessed. It would be valuable to have the disk blocks for /bin/sh in ARC but not phttpget's man page in ARC. How exactly you wish to define the working set is a good question. Blocks read within a 5 minute period? 60 minute? Once a day? This doesn't have a "correct" answer, but it isn't unusual for the working set of a VM to be in the range of 1/50th to 1/20th of the on-disk size of the VM.

By caching the working set, you free the pool to focus on reading the occasional thing not in the working set, and to focus on writes. A ZFS system with the entire working set cached will show almost no pool read activity.

A lot of the working set isn't frequently accessed. It's fine for that to be covered by L2ARC. You want to size your ARC to cover the frequently accessed stuff plus enough space for the L2ARC indirect pointers.
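A hypothetical sizing sketch, just to show the shape of the calculation. The 1/50th-1/20th working set ratio is from above; the 16K volblocksize and the roughly 100 bytes of ARC consumed per L2ARC block are assumptions you should check against your own system:

```python
def cache_sizing(vm_disk_gb, recordsize=16 * 1024, l2_hdr_bytes=100):
    """Estimate working-set size and the ARC cost of indexing an L2ARC."""
    ws_lo = vm_disk_gb / 50              # low working-set estimate
    ws_hi = vm_disk_gb / 20              # high working-set estimate
    l2_blocks = ws_hi * 2**30 / recordsize
    index_gb = l2_blocks * l2_hdr_bytes / 2**30
    return ws_lo, ws_hi, index_gb

lo, hi, idx = cache_sizing(10 * 1024)    # 10TB of VM images
print(f"working set: {lo:.0f}-{hi:.0f} GB")                  # 205-512 GB
print(f"ARC spent indexing that much L2ARC: ~{idx:.1f} GB")  # ~3.1 GB
```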

Additional reading: Why iSCSI often requires more resources for the same result

9) VM storage is an exercise in parallelism and fragmentation

Don't bother doing conventional benchmarks for your VM storage pool. A good VM storage pool is designed to be doing many operations in parallel, a thing that many benchmarks suck at. It is far better to run benchmarks designed for heavy parallelism from multiple VM's in your production setup, and don't just run them once when the pool is empty, but rather let it get fragmented and then see how it is.

10) Don't misunderstand the ZIL.

The ZFS Intent Log is not a write cache. We call a Separate LOG device the "SLOG" for a reason. It isn't a cache.

No amount of SLOG will make up for a crappy RAIDZ pool design. (Again! ZIL/SLOG Is Not A Cache!)

The fastest sustained write speed your pool will EVER be capable of is when you turn off sync writes. That's it. No more.

Adding sync writes (whether ZIL or SLOG) will ALWAYS slow down your pool compared to the non-sync write speed.

We use sync writes on VM data to ensure that a VM remains consistent if the filer panics or loses power. If this is not a concern for you, feel free to disable sync VM writes, and things will go faster!
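You can feel the difference with nothing but a filesystem and Python. fsync-per-write is the moral equivalent of sync=always, and letting the OS batch things up is the moral equivalent of sync=disabled; expect an enormous gap on most disks:

```python
import os
import time

def write_4k_blocks(path, count, sync):
    """Write `count` 4K blocks, optionally forcing each to stable storage."""
    buf = os.urandom(4096)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    start = time.perf_counter()
    for _ in range(count):
        os.write(fd, buf)
        if sync:
            os.fsync(fd)   # wait for the disk, like a ZIL commit
    os.close(fd)
    return count / (time.perf_counter() - start)

print(f"async: {write_4k_blocks('/tmp/t', 2000, False):.0f} writes/sec")
print(f"sync:  {write_4k_blocks('/tmp/t', 2000, True):.0f} writes/sec")
```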

Additional reading: Some insights into SLOG/ZIL with ZFS on FreeNAS

11) Write speeds greater than read speeds?

When ZFS is "writing" to the pool, it is actually creating a transaction group in RAM to commit to disk later. If ZFS is "reading" from the pool and the data is not in ARC/L2ARC, it actually needs to go out to a HDD to pull the data in. If your read speeds are slower than your write speeds, it just means that the data being read wasn't in cache. If you expected that data to be in the working set, perhaps your working set is too small.

12) Make sure your drives aren't SMR.

Shingled magnetic recording drives are generally unsuitable for FreeNAS and ZFS, but are particularly horrible for block storage due to the need to be rewriting small blocks in the middle of random tracks. If you use SMR for block storage, expect it to suck. Sorry. No gentle way to say it.

I'm probably not done with this but felt I needed to bang out some bits. If you reply please do not be shocked if I trim your reply and steal your idea, remember, I'm a grinch.

This is the single most important thread I have read on this forum since I've been on here after having set up my TrueNAS SCALE server. Reading this has made my experience make a whole lot more sense. #lightbulbmoment
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
This is the single most important thread I have read on this forum since I've been on here after having set up my TrueNAS SCALE server. Reading this has made my experience make a whole lot more sense. #lightbulbmoment

Awesome. :smile: Every now and then I like to think someone has one of those #lightbulbmoments, and with a little luck it may make you a ZFS convert forever, even if some of the implications (such as relatively large resource consumption) seem to suck at first.

Good luck and happy stashing.
 

gdreade

Dabbler
Joined
Mar 11, 2015
Messages
34
@jgreco : Looks like the formatting of the last paragraph in section 13, "Use an appropriate network" is off; the markdown for the URL (and the URL) is showing, rather than just a formatted link.
 

faktorqm

Dabbler
Joined
Jan 18, 2019
Messages
25
It kinda depends.

It's totally possible to engineer "sleepy VM's" that avoid unnecessary I/O, especially writes, simply by doing things like disabling atime and batching updates into more disciplined writes.

I typically run hypervisor storage in RAID1, generally with something like a 9271-8i controller with CacheVault, and this works pretty well. Back around 2014-2015, I could put three or five "consumer grade" SSD's (two mirrored or four mirrored, plus a spare) in a system, often for less than a single nonredundant "enterprise grade" SSD. So this works out to a "know your workload" game.

Anyways I bought a bunch of Intel 535's for our datacenter use back on Black Friday 2015 for $140-$180/each, which were rated for 40GB/day(?) on a 5 year warranty. I knew our workload was higher than that, but maybe only 80-120GB/day(?), and with SSD prices in rapid decline, I figured I'd burn them out in a year or three and replace them as they fell, getting larger ones. Well SSD pricing trends didn't quite work out that way, but the idea is still sound.

Hi there, happy new year! I'm sorry for replying to a 4-year-old post, but I'm trying to design a new NAS system and have a question about RAID controllers with cache. My goal is to provide iSCSI storage to a VMware 8 hypervisor through a dual 10GbE network adapter. I have six 1TB SATA 3 SSDs. This is for a home lab, not work. My plan was to mirror them, forming one pool with 3 vdevs, each vdev with 2 disks mirrored. The comment caught my attention because the general recommendation is to use IT-mode firmware for TrueNAS, not RAID mode, for non-iSCSI pools. I also understand that the controller cache is pointless in this IT-mode scenario as there is nothing to buffer. For pools exclusively related to iSCSI block sharing, do you recommend a RAID controller with IR firmware, using RAID1 to make the mirrors? Maybe I'm asking a newbie question, but after several hours reading all the recommended posts it's not clear to me yet. Thank you! Regards!
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
For pools exclusively related to iSCSI block sharing, do you recommend a RAID controller with IR firmware, using RAID1 to make the mirrors?
Using any form of "hardware RAID" is strongly discouraged for ZFS regardless of the usage scenario. Never use a RAID controller. There is no benefit and you will likely lose data. ZFS was designed specifically to replace RAID.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Using any form of "hardware RAID" is strongly discouraged for ZFS regardless of the usage scenario. Never use a RAID controller. There is no benefit and you will likely lose data. ZFS was designed specifically to replace RAID.

I have an entire resource on the topic.


The comment caught my attention because the general recommendation is to use IT-mode firmware for TrueNAS, not RAID mode, for non-iSCSI pools.

You never want a "smart" controller in between ZFS and your storage. ZFS *is* the smart controller. The best options are all LSI, and LSI "IT" firmware HBA's are known to reliably do the correct thing: just store and fetch blocks.

I also understand that the controller cache is pointless in this IT-mode scenario as there is nothing to buffer

Well, there's a huge crapton to buffer. Much more than the capacity of any RAID controller. ZFS will happily ARC a terabyte of data for you if you have the RAM. None of that is going to be meaningfully buffered in a RAID controller's cache, and the RAID controller will also tend to hide the disk status from ZFS. That's very bad.

For pools exclusively related to iSCSI block sharing, do you recommend a RAID controller with IR firmware, using RAID1 to make the mirrors?

No. ZFS is still your storage controller and you have to play by the ZFS rules. If you have a RAID card and one of your disks has a silently corrupted block, the RAID controller won't know and it will pass that block up to the OS. ZFS gets the corrupt block, sees the bad checksum, but then CANNOT do anything about it because you've foolishly hidden the redundant disk behind a RAID controller, so ZFS cannot say "give me the OTHER block that holds this same data". Then you lose data, because ZFS has no second copy to recover from. If you had used an HBA and exposed both disks to ZFS, ZFS is not only happy to fetch the data from the other disk, but it will also REPAIR the data on the corrupted disk. This is stuff that card-level RAID controllers just don't do.
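If it helps, here's the difference as a toy model (nothing like the real code paths, just the logic):

```python
import hashlib

def mirror_read(copies, good_sum):
    """ZFS-style read: verify against the block-pointer checksum and
    self-heal any mirror copy that doesn't match."""
    for data in copies:
        if hashlib.sha256(data).digest() == good_sum:
            for i in range(len(copies)):
                copies[i] = data          # repair the corrupted sibling
            return data
    raise IOError("all copies failed checksum")

good = b"block contents"
copies = [b"silent bitrot", good]         # disk 0 corrupted, disk 1 fine
print(mirror_read(copies, hashlib.sha256(good).digest()))
# A hardware RAID1 would have returned whichever copy it happened to read;
# it has no checksum to compare, and ZFS would only ever see that one copy.
```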

Maybe I'm doing a newbie question but after several hours reading all the recommended posts it's not clear to me yet. Thank you! Regards!

Questions are fine. Basically all the questions that are likely to be asked have already been asked and answered; many of those answers are in the Resource section (which is a bit of a jungle, admittedly). Don't be offended if we point you at an answer that's already been extensively discussed and you'll end up fine if you follow those links. It's just that endless repetition of the same answers is tedious, but we all are also aware that ZFS is wicked complex when you first look into it, so the questions aren't a bad thing.
 