
Why iSCSI often requires more resources for the same result

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
iSCSI is a SAN protocol. NFS, CIFS, etc., are NAS protocols.

For a NAS protocol, the client sends a command to the filer, such as "open this file", or "read ten blocks", or "remove this file." On the filer, the local NAS protocol daemon translates this into UNIX file syscalls, and passes it off to the filesystem.

For a SAN protocol, the client itself runs the filesystem and passes requests over the network to read or write particular blocks. On the filer, the SAN protocol daemon translates these into operations within a single file (or zvol), making the changes the client requests.

At first glance, these two things seem to be very similar, but in practice they may not be, and the differences can hurt you.

Consider the case where a NAS protocol requests a file to be created, written, closed, and then deleted. The client asks the filer to perform each of those operations, and the requests are quick and efficient. Creating the file is a single operation from the client's point of view; the filer sorts it out, reads the disk, figures out where to allocate space, and handles all the on-disk updates. Writing, closing, and deleting the file are equally straightforward operations, viewed from the client's side.

On the other hand, for SAN, the same steps are more complex. Since the filesystem code is located on the client, a request to open a file means that the filesystem code has to access the root directory, traverse the directory structure, access the free block list, and then write an update to the target directory, and also write file metadata - each of which is an individual read/write operation that may require blocks that are (from the filer's point of view) randomly scattered around. The SAN has no context to understand that this block is directory metadata and that block is file data - in other words, what the relative value of things is. Writing, closing, and deleting the file also result in a larger number of operations.
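
To make that concrete, here is a toy sketch - not a real protocol trace, and the file name, block numbers, and operation counts are all made up - of roughly what the filer sees in each case:

```python
# Toy illustration: what the filer sees when a client creates, writes,
# closes, and deletes one small file over NAS vs. SAN.

nas_requests = [                      # NAS: one request per high-level operation
    "CREATE /data/report.txt",
    "WRITE  /data/report.txt, 8 KiB",
    "CLOSE  /data/report.txt",
    "REMOVE /data/report.txt",
]

san_requests = [                      # SAN: the client's own filesystem generates block I/O
    "READ  block 0          # superblock / root directory",
    "READ  block 1182       # directory traversal",
    "READ  block 40961      # free-space map",
    "WRITE block 40961      # allocate blocks",
    "WRITE block 1182       # new directory entry",
    "WRITE block 52300      # inode / file metadata",
    "WRITE block 52301      # file data",
    "WRITE block 40961      # free the blocks again on delete",
    "WRITE block 1182       # remove the directory entry",
    "WRITE block 52300      # clear the metadata",
]

print(f"NAS: {len(nas_requests)} semantically meaningful requests")
print(f"SAN: {len(san_requests)} anonymous block reads/writes, scattered on disk")
```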

But those are just immediate differences. There are also more complex side effects.

For example, consider that ZFS has its gorgeous ARC. So, using a NAS protocol, you open a file, write it, and close it. That file is probably stored in ARC. Now you delete that file. It is instantly removed from the ARC, freeing that space for other, more useful things.

That doesn't happen with SAN. Since the SAN abstraction moves the filesystem layer to the client, all ZFS sees are requests to read and write blocks. It has no idea that certain blocks are directories and metadata. It may have no idea that certain blocks have been freed by the client's filesystem. As a result, it is quite possible for useless data to be sitting around in ARC because there's no effective way for ZFS to understand that "these blocks aren't relevant" or "these blocks are most useful." ZFS will make its best guess based on how frequently the data is accessed, but in order for that to work, it needs to be keeping track of a lot more blocks in ARC. It does not have the advantage of its own internal flags as to which blocks might be metadata or the context in which a block is being retrieved (sequential file read, etc) or which ones are no longer needed. By giving it a much larger ARC, it can successfully sample access frequency on a much larger number of blocks, which gives it the ability to arrive at a good level of performance despite the lack of specific insight.
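
As a rough illustration of why sheer cache size helps, the toy simulation below uses a plain LRU cache (the real ARC is adaptive and considerably smarter) with ten frequently re-read blocks - think filesystem metadata the target cannot identify as such - mixed into a stream of one-off data blocks. The hot blocks only stay resident, and therefore only ever get recognized as hot, once the cache is large enough to outlast the stream:

```python
from collections import OrderedDict

class LRUCache:
    """Deliberately crude LRU cache; the real ARC blends recency and frequency."""
    def __init__(self, size):
        self.size = size
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def access(self, block):
        if block in self.entries:
            self.hits += 1
            self.entries.move_to_end(block)
        else:
            self.misses += 1
            self.entries[block] = True
            if len(self.entries) > self.size:
                self.entries.popitem(last=False)   # evict the least recently used block

def run(cache_blocks, rounds=50):
    cache = LRUCache(cache_blocks)
    next_data_block = 1_000_000
    for _ in range(rounds):
        for hot in range(10):          # 10 "hot" blocks (metadata, unbeknownst to the target)
            cache.access(hot)
        for _ in range(200):           # 200 one-off data blocks per round
            cache.access(next_data_block)
            next_data_block += 1
    return cache.hits / (cache.hits + cache.misses)

for size in (64, 256, 1024):
    print(f"cache of {size:>4} blocks: hit rate {run(size):5.1%}")
```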

Further, most NAS file updates write entire files, which is easily managed by the block allocation strategies of ZFS ... it'll try to find a nice contiguous set of blocks for the file. However, a SAN virtual disk is stored as a single ZFS object (file or zvol), and updates within it result in fragmentation. Fragmentation is combated through the use of caching, but caching is more of a remediation than a true solution. A CoW filesystem will always tend towards heavy fragmentation when updating mid-file blocks.
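
A tiny model of copy-on-write allocation shows the trend; the numbers are arbitrary and it ignores everything ZFS actually does to batch and place writes, but the effect on a virtual disk that receives random mid-file updates is the point:

```python
import random

file_blocks = list(range(1000))         # the virtual disk starts out contiguous: blocks 0..999
next_free = 1000                        # next unallocated block on the pool

random.seed(1)
for _ in range(500):                    # 500 random mid-file overwrites from the client
    offset = random.randrange(len(file_blocks))
    file_blocks[offset] = next_free     # copy-on-write: the new data lands somewhere else
    next_free += 1

contiguous = sum(1 for a, b in zip(file_blocks, file_blocks[1:]) if b == a + 1)
print(f"adjacent block pairs still contiguous: {contiguous} of {len(file_blocks) - 1}")
```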

The end result is that you usually need more resources, especially ARC and L2ARC, in order to have good performance with iSCSI when compared to NAS protocols like CIFS and NFS.
 

Tywin

Contributor
Joined
Sep 19, 2014
Messages
163
That doesn't happen with SAN. Since the SAN abstraction moves the filesystem layer to the client, all ZFS sees are requests to read and write blocks. It has no idea that certain blocks are directories and metadata. It may have no idea that certain blocks have been freed by the client's filesystem. As a result, it is quite possible for useless data to be sitting around in ARC because there's no effective way for ZFS to understand that "these blocks aren't relevant" or "these blocks are most useful." ZFS will make its best guess based on how frequently the data is accessed, but in order for that to work, it needs to be keeping track of a lot more blocks in ARC.

This is very similar to what happened with SSDs; the abstractions assumed in one era ceased to be applicable in another, and there was a period where SSDs couldn't reach their full potential. Enter the TRIM command, which basically coupled the layers together by letting the file system tell the SSD which blocks it doesn't care about anymore.

Based on a cursory Google search, it looks like iSCSI provides similar functionality through the "UNMAP" command, although it seems this is not implemented by FreeNAS, and client support is apparently not great either. There are, however, some clients that do support it. If this is something that keeps coming up, perhaps a push to support UNMAP in FreeNAS is in order.
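
For illustration only, here is a rough sketch of the idea such a command communicates - the client telling the target which blocks it no longer cares about - rather than of how FreeNAS or any real target implements it:

```python
class ToyTarget:
    """Toy model of a block target's bookkeeping; not how FreeNAS implements anything."""
    def __init__(self):
        self.allocated = set()          # blocks the target believes hold live data
        self.cache = set()              # blocks currently cached (stand-in for ARC)

    def write(self, block):
        self.allocated.add(block)
        self.cache.add(block)

    def unmap(self, blocks):
        """The client says: these blocks no longer hold anything I care about."""
        freed = set(blocks)
        self.allocated -= freed         # space can be reclaimed
        self.cache -= freed             # and there is no point caching freed data

target = ToyTarget()
for block in range(100):
    target.write(block)

# The client's filesystem deletes a file that occupied blocks 10..59. Without an
# UNMAP-style hint the target never hears about it; with one, it can react:
target.unmap(range(10, 60))
print(f"live blocks: {len(target.allocated)}, cached blocks: {len(target.cache)}")
```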
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
Actually, UNMAP is supported in FreeNAS, but this is only a portion of an overall difficult problem. UNMAP doesn't really solve big issues such as fragmentation and identification of metadata blocks. The techniques ZFS has for coping with these issues are generally workable if you're willing to throw resources at it. However, people who don't blink at spending $1K on a quality RAID controller often pale at the idea of putting 64GB+ of RAM into their filer.

What people want -> old repurposed 486 with 32MB RAM and a dozen cheap SATA disks in RAIDZ2

What people need -> E5-1637v3 with 128GB RAM and a dozen decent SATA disks, mirrored
 

Tywin

Contributor
Joined
Sep 19, 2014
Messages
163
Actually, UNMAP is supported in FreeNAS, but this is only a portion of an overall difficult problem. UNMAP doesn't really solve big issues such as fragmentation and identification of metadata blocks.

Of course not; UNMAP solves the specific problem it was designed for: informing the underlying system which blocks are no longer of immediate interest. As for the other issues you mention, presumably you mention metadata blocks because they are accessed more frequently, e.g. in traversing the file system. Since ZFS's ARC is designed to keep frequently accessed blocks cached, this basically amounts to "have enough ARC". Hence coming full circle to the point of your post: that iSCSI requires appropriate resources to have the same performance as other protocols.

Edit: I see the reference to UNMAP in the 9.3 docs. It didn't show up in that cursory Google search.

The techniques ZFS has for coping with these issues are generally workable if you're willing to throw resources at it. However, people who don't blink at spending $1K on a quality RAID controller often pale at the idea of putting 64GB+ of RAM into their filer.

What people want -> old repurposed 486 with 32MB RAM and a dozen cheap SATA disks in RAIDZ2

What people need -> E5-1637v3 with 128GB RAM and a dozen decent SATA disks, mirrored

This argument is kind of specious though, since you were discussing different resource requirements for SAN vs NAS protocols for the same performance level. I put to you that a 486 with 32 MiB of RAM and a dozen cheap SATA disks in RAIDZ2 running NFS will a) not work and b) even if it did, not perform as well as the proposed E5 system running iSCSI.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
Of course not; UNMAP solves the specific problem it was designed for: informing the underlying system which blocks are no longer of immediate interest. As for the other issues you mention, presumably you mention metadata blocks because they are accessed more frequently, e.g. in traversing the file system. Since ZFS's ARC is designed to keep frequently accessed blocks cached, this basically amounts to "have enough ARC". Hence coming full circle to the point of your post: that iSCSI requires appropriate resources to have the same performance as other protocols.

Right, but "appropriate resources" translates to "larger resources" because "have enough ARC" means "have ARC sufficient to speculatively store blocks because we have no idea whether these are meta or data and therefore need to give them a lot more time in residence to sort out whether they're important or not."

This argument is kind of specious though, since you were discussing different resource requirements for SAN vs NAS protocols for the same performance level. I put to you that a 486 with 32 MiB of RAM and a dozen cheap SATA disks in RAIDZ2 running NFS will a) not work and b) even if it did, not perform as well as the proposed E5 system running iSCSI.

Well, literally, that's true, but I was making a figurative point. People come in expecting magic and their hopes are dashed by the cold light of reality.
 

Tywin

Contributor
Joined
Sep 19, 2014
Messages
163
Right, but "appropriate resources" translates to "larger resources" because "have enough ARC" means "have ARC sufficient to speculatively store blocks because we have no idea whether these are meta or data and therefore need to give them a lot more time in residence to sort out whether they're important or not."

Never disagreed with that, don't know where you got the idea that I did.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
I didn't say you disagreed with it. It was mostly aimed at the audience, to reset unrealistic expectations of what "appropriate resources" might mean.

I have repeatedly said that it is totally possible to get awesome performance with iSCSI by throwing resources at the problem, but that the resources necessary might be horrific especially to newcomers.

That seems particularly appropriate in this thread...
 

yis

Dabbler
Joined
Jan 24, 2015
Messages
30
Thank you all for this discussion, it has been very helpful and eye-opening. The reason I went with 12GB of memory is that it's what the guys recommended, but reading more and more about FreeNAS requirements it seems that I need to change my original plan. I might as well add a fourth drive and increase the memory to 32GB....

till my next paycheck :)
 

yis

Dabbler
Joined
Jan 24, 2015
Messages
30
Ah, and also getting rid of RAIDZ1 and using mirrored vdevs instead... will update later.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
Memory is easily added later. The pool change will probably have a rather large impact; worry about that now.
 

yis

Dabbler
Joined
Jan 24, 2015
Messages
30
Memory is easily added later. The pool change will probably have a rather large impact; worry about that now.
Yeah, got that part.. working on it now.. had to move the files I copied yesterday so I don't lose them..
Good thing this came up now before I got my ESX up and running :)
 

GeoffK

Dabbler
Joined
Apr 29, 2015
Messages
29
Yeah, got that part.. working on it now.. had to move the files I copied yesterday so I don't lose them..
Good thing this came up now before I got my ESX up and running :)

What are your expectations here? How many VMs are you planning on running, and what kind of load do you want?
 

wreedps

Patron
Joined
Jul 22, 2015
Messages
225
I didn't say you disagreed with it. It was mostly aimed at the audience, to reset unrealistic expectations of what "appropriate resources" might mean.

I have repeatedly said that it is totally possible to get awesome performance with iSCSI by throwing resources at the problem, but that the resources necessary might be horrific especially to newcomers.

That seems particularly appropriate in this thread...

I have plans for a monster iSCSI FreeNAS. I have about 700-800GB of registered ECC DDR3 in 16GB modules at home right now. I am thinking I want a box with 384GB or so. If I can run 52 VMs on 24GB of RAM and 3 mirrored SATA vdevs, I should be able to run 500-1000 VMs with 384GB+.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
Depending on your VM's, that might be totally possible. It isn't the count of VM's that is significant in any way. It's the amount of traffic they're pushing, and, especially, the amount of traffic they're writing, which forms the basis for dimensioning of a ZFS system. Reads can be handled by throwing RAM and L2ARC at it, but fundamentally a pool has certain IOPS write limits depending on various stuff. You can throw 384GB of RAM and gobs of L2ARC at the read problem, but the only way to make it work for writes is to minimize what you're doing and limit it to intelligent things ("make buildworld" on a FreeBSD box being the typical example of "how stupid are you?").
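
To put hedged numbers on the write side: a common rule of thumb is that each vdev, mirror or RAIDZ, delivers roughly the steady-state random-write IOPS of a single member disk. The per-disk figure below is an assumption for ordinary 7200 rpm SATA drives, so treat the output as back-of-envelope only:

```python
IOPS_PER_DISK = 200                     # assumed random-write IOPS for a 7200 rpm SATA disk

def pool_write_iops(disks, disks_per_vdev):
    """Rule of thumb: each vdev contributes about one member disk's worth of write IOPS."""
    vdevs = disks // disks_per_vdev
    return vdevs * IOPS_PER_DISK

print("12 disks as 6 x 2-way mirrors: ", pool_write_iops(12, 2), "write IOPS")
print("12 disks as 2 x 6-disk RAIDZ2: ", pool_write_iops(12, 6), "write IOPS")
print("12 disks as 1 x 12-disk RAIDZ2:", pool_write_iops(12, 12), "write IOPS")
```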
 

zambanini

Patron
Joined
Sep 11, 2013
Messages
479
Just a note regarding so many virtual machines: take care to use a well-timed boot order, otherwise your monster will have trouble with the random I/O. After a FreeNAS restart, your L2ARC is also empty (there is some code for a persistent L2ARC; no clue if it will be available in FreeNAS soon).

Edit, regarding the persistent L2ARC, to make the post complete:
https://reviews.csiden.org/r/267/ - it will be part of OpenZFS; code review is almost done.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
Persistent L2ARC should be coming somewhere down the road, yes. Between this and ARC compression which will reduce the header size for L2ARC, there's a lot of room for read-side performance improvements.

The real interesting part of trying to decide what's likely to work and what isn't is that it is difficult to characterize what would be the most useful things to have in ARC; the working set isn't necessarily the set of sectors that are read during boot, for example. If your average uptime for a VM is 100 days and you've got 1000 VM's, the only way that the sectors required on boot are likely to get into L2ARC is if you're running dedup and they're common between VM's. Otherwise, the frequency with which a VM requests that sector is not likely to land it in L2ARC, since it stands little chance of being in ARC long enough to be accessed twice.
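
Some rough arithmetic makes the frequency argument concrete; all of the numbers below are assumptions chosen purely for illustration:

```python
vms = 1000
avg_uptime_days = 100                   # each VM reboots roughly once every 100 days
boot_reads = 1                          # a boot-only sector is read about once per boot

# How often one VM's boot-only sector gets read:
boot_sector_reads_per_day = boot_reads / avg_uptime_days
print(f"boot-only sector, one VM:       ~{boot_sector_reads_per_day:.2f} reads/day")

# Compare with a block some application inside the VM touches every five minutes:
hot_block_reads_per_day = 24 * 60 / 5
print(f"hot data block:                 ~{hot_block_reads_per_day:.0f} reads/day")

# With dedup, the same boot sector may be shared by every VM, so the shared copy
# sees the combined access rate and stands a far better chance of being cached:
print(f"deduped boot sector, {vms} VMs: ~{vms * boot_sector_reads_per_day:.0f} reads/day")
```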
 

Jimmy Tran

Dabbler
Joined
Dec 27, 2015
Messages
33
I wish I had read this article a few months ago. This was very helpful. There are two reasons why I chose iSCSI:

1. I don't have a SLOG, so my NFS writes were either 7MB/s or 30MB/s; I can't remember which.
2. iSCSI supports VAAI primitives. When cloning disks, taking snapshots, etc., I wanted the tasks to be offloaded to FreeNAS, but I don't think it does that. Can someone confirm?
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
Am I understanding this right: the additional ARC memory needed for iSCSI block storage is also needed when NFS is used for block storage, vmdk files or the like? Just like iSCSI, a VM using storage over NFS may delete a file inside the filesystem contained in the vmdk, and FreeNAS has no way of knowing that happened, so it keeps the block(s) in ARC.

Different behavior, presumably, for outright deleting an entire VM on an NFS share - in that case, the vmdk file disappears and FreeNAS knows about it. There the ARC could be freed, while with iSCSI FreeNAS would remain (blissfully) unaware.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
Yes, correct, more or less. iSCSI is the worst of all. With NFS, as you note, FreeNAS will know that the vmdk has been freed, which adds to the pool's free space list, eliminates the ARC entries, etc. Those are generally beneficial. Removing a large VM and having that happen on NFS can lead to more contiguous free space being available, and therefore better performance.

Of course, most datastore operations on both NFS and iSCSI are NOT vmdk removal, so in the general case, you still have the problem that ZFS has no clue about the significance of changes being made within the vmdk. If you are able to get hints such as SCSI UNMAP, of course, that's a little bit different, but ZFS still doesn't know how to differentiate between file data and metadata.

Giving ZFS more RAM (ARC) means that ZFS can cache more of the vmdk data. When that happens, ZFS has a better chance of identifying metadata or other highly-accessed data within the vmdk. It might not know what it is, but it knows it is more valuable to cache it. It seems to me that to support a VM on a vmdk you need about twice as much RAM on FreeNAS to get results similar to running the same tasks directly on the FreeNAS filer (not as a VM). This is highly workload-dependent, of course.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
For VMware specifically, the secret sauce for success I've found is to use VMFS6 (automatic and asynchronous space reclamation) with thin disks (so that in-guest TRIM gets translated successfully to UNMAP) and sparse ZVOLs for iSCSI. This way, you keep as much free space as possible at the ZFS level - remember, writes to contiguous space are fast - and assuming your guests have TRIM/UNMAP support, they can also pass those commands down to free blocks as they go, which does have the effect of ejecting them from ARC.
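
For the ZFS half of that recipe, a minimal sketch follows; the pool and dataset names are made up, and the volblocksize shown is only a commonly used starting point rather than a recommendation, so adjust for your own pool and workload. The VMFS6/thin-disk half happens on the VMware side and isn't shown:

```python
import subprocess

pool = "tank"                           # hypothetical pool name
zvol = f"{pool}/vm-datastore1"          # hypothetical zvol name

# Create a sparse (thin-provisioned) zvol to back an iSCSI extent; point the
# extent at it afterwards in the FreeNAS UI.
subprocess.run(
    ["zfs", "create",
     "-s",                              # sparse: don't reserve the full volsize up front
     "-V", "2T",                        # logical size presented to the initiator
     "-o", "volblocksize=16K",          # assumed starting point; tune for the workload
     zvol],
    check=True,
)
```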

NFS, unless things have improved, is still a bit dodgy at in-guest space reclamation. Last I checked, it still had to be done manually by writing in-guest zeroes and then svMotioning the disk to cause VMware to note "oh hey, there isn't actually data here" - some vendors have plugins for this, but I admit I need to retest with the latest VMware and FreeNAS software here.

A word of warning though for thin/sparse; don't fall into the trap of oversubscribing your storage right out of the gate (either at the VMFS or the ZFS level). Compression will absolutely save you some space, but make sure you understand your workload and data first, and absolutely set up several levels of early warnings if you do decide to oversubscribe. Just because FreeNAS supports the VAAI Thin Provisioning Stun primitive doesn't mean you ever want to see it used.
 