Poor iSCSI performance on FreeNAS 8.3

Status
Not open for further replies.

g.lloyd

Cadet
Joined
Nov 23, 2012
Messages
3
I am looking at various software iSCSI solutions, and I am getting very poor performance over iSCSI using FreeNAS 8.3.

I am using a Dell 2950:
8 cores, 32 GB RAM, 4 gigabit ports on different VLANs for the iSCSI traffic, and 15 x 15K 146 GB SAS drives with a zpool config of 3 RAIDZ vdevs (5 drives each).

I cannot get above 60 MB/s writes and 100 MB/s reads using a Linux initiator (Ubuntu 12.04 Server).

I have tested other iSCSI targets, such as Windows Server 2012 and Ubuntu 12.04 with targetcli on hardware RAID, and I am getting over 4x the write performance and 2x the read performance in iozone benchmarks over iSCSI.


Has anyone got any tweaks for iSCSI on FreeNAS 8.3?

I would really like to use the product, but at the moment it just isn't cutting it over iSCSI.

Thanks
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
iSCSI doesn't work too well on a COW file system.
 

William Grzybowski

Wizard
iXsystems
Joined
May 27, 2011
Messages
1,754
I am looking at various software iSCSI solutions, and I am getting very poor performance over iSCSI using FreeNAS 8.3.

I am using a Dell 2950:
8 cores, 32 GB RAM, 4 gigabit ports on different VLANs for the iSCSI traffic, and 15 x 15K 146 GB SAS drives with a zpool config of 3 RAIDZ vdevs (5 drives each).

I cannot get above 60 MB/s writes and 100 MB/s reads using a Linux initiator (Ubuntu 12.04 Server).

I have tested other iSCSI targets, such as Windows Server 2012 and Ubuntu 12.04 with targetcli on hardware RAID, and I am getting over 4x the write performance and 2x the read performance in iozone benchmarks over iSCSI.


Has anyone got any tweaks for iSCSI on FreeNAS 8.3?

I would really like to use the product, but at the moment it just isn't cutting it over iSCSI.

Thanks

How can you be so sure the bottleneck is the iSCSI target? Have you run iozone tests? Did you isolate network performance (iperf)?
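
(For anyone who wants to do that isolation, something along these lines works; the IP address, mount point, and sizes are only placeholders:)

# On the FreeNAS box (target side):
iperf -s

# On the Ubuntu initiator, measure the raw network path first:
iperf -c 192.168.10.5 -t 30

# Then benchmark the mounted iSCSI LUN itself, bypassing the client page cache:
iozone -I -i 0 -i 1 -s 4g -r 128k -f /mnt/iscsi-test/iozone.tmp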

iSCSI doesn't work too well on a COW file system.

Where did you get that from?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I got that from two other people on the forum. I forget their names, but it does make sense if you think about how iSCSI works and how COW works.

There's one user, whose name I forget, who has been trying to explain to people that ZFS and iSCSI can have problems when put together.
 

William Grzybowski

Wizard
iXsystems
Joined
May 27, 2011
Messages
1,754
Please explain. I don't see any correlation between iSCSI and COW in this scenario; COW acts at another level of the filesystem. We have been able to get 800MB/s from iSCSI.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
COW is, very generally speaking, probably a poor design decision for an iSCSI storage platform based on standard hard drives.

I'll note that there are various things that can be done to mitigate these problems; I'm not here to discuss those, as I already know about them. I've spent a lot of time looking at the various problems FreeNAS users have with ZFS and iSCSI. It is certainly possible to get a system that performs reasonably well, but there are some significant caveats. You can look at the results of all the time I spent screwing around with txg buffer sizing in bug 1531, for example; you can imagine that I have similar experience messing with this.

With specific respect to COW and the point I've made in the past: an iSCSI device is a virtual hard drive. An initiator sees it as a linear contiguous array of blocks, and generally speaking, we've spent several decades writing filesystems that treat them as such, and optimize to handle them as such.

But let's ignore that for a minute and investigate the underlying issue. You create a 1TB file for iSCSI (approximately 2 billion 512-byte blocks). You mount it on a client and then you read it with dd on the raw disk device. William gets his 800MB/sec. Great. You proved your point.

Now you give me the client. I go in and write one million randomly selected blocks. What you're going to find is that the blocks in the ZFS extent file are no longer generally contiguous, and as a result of additional seek delays, you get something less than 800MB/sec doing that same read test.

So. Here is another troubling data point. ZFS would love to allocate those new blocks "nearby" in order to reduce seek time, yes? Of course it would. However, if you investigate the users who come to the forum, there is a fairly common expectation that they have a NN GB ZFS volume and that this should mean that they can have an NN GB iSCSI extent. This is foolish, of course, let's agree on that point right away. Anyone who has spent any time with ZFS knows that ZFS really needs to be kept at less than 80% or performance degrades horribly (in part due to what we're talking about!). However, experimentation suggests that this issue is substantially worse with iSCSI; in order to reliably allocate blocks sufficiently nearby to minimize the issue and allow ZFS a plentiful bucket of blocks to work with, it appears to me that a better number might be more like 60%, or even 50% capacity, which is a lot of space to hold in reserve.
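
(One way people enforce that kind of headroom is to park a reservation on an otherwise empty dataset, so the extent can never fill the pool past the chosen point; the pool name and sizes below are made up for illustration:)

# Assuming a pool called "tank" with roughly 1 TiB usable, hold back ~40%
# so the iSCSI extent can never push the pool much past 60% full.
zfs create tank/headroom
zfs set reservation=400G tank/headroom
zfs get reservation tank/headroom    # confirm the space is being held back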

But even so, eventually there's a large amount of fragmentation that has to be dealt with as an iSCSI extent file ages, the number of random writes increases, etc. This can hurt performance.

So one of my common talking points is about making VM's that are designed for the virtual environment, rather than just assuming that they have their own physical resources. Pretty much every UNIX system likes to do trite writes, stuff like atime updates, which add approximately zero value. If that filesystem is mounted on an iSCSI disk and underlying it on the server is ZFS, I can cause a bunch of writes just by READING files (causing their atimes to all be updated) and suddenly the contiguous blocks that made up my filesystem are no longer contiguous. Nice! What about when someone rebuilds ports? Or does a makeworld? ("HORROR!") It isn't the single reallocation one time that's a problem. It's the cumulative effect that this has on an extent file over the period of months or years.
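
(As a concrete example of that guest-side design: mounting the VM's filesystems with noatime stops plain reads from turning into writes against the extent. A typical line in a Linux guest's /etc/fstab might look like this; the device and filesystem are just an example:)

# /etc/fstab inside the guest VM; noatime keeps reads from dirtying inode metadata
/dev/sda1   /   ext4   defaults,noatime   0   1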

Anyways, COW brings some interesting capabilities to the table, no doubt, but in many cases those capabilities are not needed. I've been leaning towards implementing iSCSI on top of UFS-based file extents, which is of course limiting in several ways. However, the performance characteristics don't degrade in the manner ZFS-based file extents do over time, and the dynamics are trivial to understand, unlike ZFS, which has been a relative nightmare of design decisions that are not-quite-right for iSCSI uses.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
There's the man with the answers! I figured I'd wait for his response since he could explain it better than I.
 

William Grzybowski

Wizard
iXsystems
Joined
May 27, 2011
Messages
1,754
Well, generally, yes, that's a good point and I have to agree.

However, that would only explain why you'd lose performance over time: reallocating blocks, higher seek times, etc.
Let's suppose, just for a second, that you're not using that feature: no snapshots whatsoever. Would COW still play this game? I don't think it would, because you would no longer need to keep track of a data block with a COW flag; you don't need to know that at some point in time you had something different, so you won't need to reallocate the block. Unless I am missing something.

You might say that then you lose the most important feature of ZFS, and that might be right, but you still get block checksums, data prefetch, etc.

Even with simpler filesystems you might face this sort of problem over time: the space is usually not allocated up front, the file is sparse, and depending on how you fill it the blocks might be allocated in different regions.

Back to ZFS, there are tons of ways to mitigate this issue, like L2ARC and larger transaction groups.
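
(Both of those can be experimented with from the shell. The device name and the tunable value below are only illustrative, and the txg knob in particular should be tested before being relied on:)

# Add an SSD as an L2ARC (read cache) device to an existing pool named "tank":
zpool add tank cache da4

# Let transaction groups accumulate a little longer before being flushed:
sysctl vfs.zfs.txg.timeout=10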

Anyway, personally I would make that tradeoff any time if necessary, and I don't think the roughly 400% overhead the user is seeing is due to ZFS COW; there must be something else going on, and there are certainly tunings available.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The problems of a copy-on-write filesystem and fragmentation as the filesystem ages for this sort of file application are known issues. Copy-on-write and locality are natural enemies. For a case where you weren't ever going to do linear reads of contiguous blocks, this is pretty much a nonissue; for everything else, "it depends."

It's fortunate that we live in this wonderful age of cheap fast storage; did you see some of those Black Friday SSD deals? Think I saw a 256GB SSD somewhere for $140, and a 512GB for $299. So the real answer might be to just implement iSCSI on SSD and not care, which is still not-quite-a-full-solution but it's hell-fast and lots of fun.

I should also note that the impact that COW has on the average user absolutely PALES compared to the other issues I've investigated; for the most part, unless you're running a busy NNTP or IMAP server or large hosting environment with millions of small files, you're unlikely to rack up so much fragmentation on an iSCSI file extent that it becomes more than a slight drag. That doesn't mean the problem doesn't exist, or that in ten years it won't be miserably slow, just that it probably isn't a killer for small-scale users. Simple fix: stop iSCSI, copy the file, erase the original, rename, restart iSCSI. Obviously not a fix for a production system, hahaha, but "approved for home use!"
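
(Roughly what that looks like from the shell; the extent path is made up, and the assumption that FreeNAS 8.3's target service is istgt is mine, so on a real box you would normally stop and start the target from the GUI:)

service istgt stop                                     # take the target offline
cp /mnt/tank/extents/vm0 /mnt/tank/extents/vm0.new     # rewrites the file sequentially (needs free space for a full second copy)
mv /mnt/tank/extents/vm0.new /mnt/tank/extents/vm0     # replace the fragmented original
service istgt start                                    # bring the target back up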

Also, an SSD that is sufficiently large to cope with the working set of an iSCSI extent file (remember, most hard drives see the vast majority of their work on a very small subset of blocks!) is likely to re-balance the books significantly when considering an issue like what we're discussing here. It will not be a complete fix, I'd guess. I think it's perfectly possible to architect a storage system based on ZFS and a virtual machine environment that is designed to work within the constraints.

No magic bullets, it all comes back to needing to intimately understand what you're doing and designing accordingly.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
However, that would only explain why you'd lose performance over time: reallocating blocks, higher seek times, etc.
Let's suppose, just for a second, that you're not using that feature: no snapshots whatsoever. Would COW still play this game? I don't think it would, because you would no longer need to keep track of a data block with a COW flag; you don't need to know that at some point in time you had something different, so you won't need to reallocate the block. Unless I am missing something.

I always understood CoW to simply mean that you will NEVER overwrite an existing block in place with new data. Hypothetical situation:

You have a file that consumes 3 blocks, in order. You read the file and then save the file, but block 2 changes. Block 2 does not overwrite the same place; instead it is written to an empty block (we'll say block 4). Then the pointers in the file system are updated to show that the file is now read in order 1, 4, then 3.

In a non-COW system your file would still be read in order 1, 2, then 3.

The advantage is this: in a non-COW system, if you lose power while block 2 is being rewritten, you now have a corrupt file and no good copy exists. With CoW you can play back the transaction log, see that block 4 was in the process of being written, and based on the status of that write the system can undo the file system changes and restore the file to its original state.

If this is true, then an iSCSI drive will become horribly screwed up with seek times and other inefficiencies. I totally agree that this is likely not the case with the OP, but it is something to consider for long-term use. I've actually thought a lot about this lately, and it really makes me wonder whether a "defrag" tool is more necessary for ZFS than people think once a storage system has been in use for years.

Here's a YouTube video I found. Skip ahead to 4:55 and he explains CoW. https://www.youtube.com/watch?v=gthel59G56c&feature=related
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Would COW still play this game? I don't think it would, because you would no longer need to keep track of a data block with a COW flag; you don't need to know that at some point in time you had something different, so you won't need to reallocate the block. Unless I am missing something.

Well, one of us has a fundamental misunderstanding about how ZFS works.

https://blogs.oracle.com/roch/entry/the_dynamics_of_zfs

ZFS never overwrites live data on-disk

http://www.bsdcan.org/2012/schedule/attachments/203_ZFSOptimizedForBlock_BSDCAN2012.pptx

ZFS never overwrites a currently allocated block

The important concept is that it writes new data to the disk in a free block. This means that atomicity is guaranteed throughout the update. It doesn't matter that the old block will be marked as free in a matter of nanoseconds - the new data is stored in a new location. I believe that the indirect block is also updated in the same manner, but quite frankly I haven't looked at the code lately.

Do you disagree with this? If so, cite please.

Anyways, getting back to this:

Even with simpler filesystems you might face this sort of problem over time: the space is usually not allocated up front, the file is sparse, and depending on how you fill it the blocks might be allocated in different regions.

It's stupid to generate a sparse file for such an application. Inexperienced users may well do this, but it is an error. Anyone who does this professionally is likely to create a non-sparse data file.

I don't think the roughly 400% overhead the user is seeing is due to ZFS COW

I agree with that, but the question you asked noobsauce80 was

iSCSI doesn't work too well on a COW file system.
Where did you get that from?

I think that's a perfectly defensible statement. The instant you put something like iSCSI on ZFS, you're fighting some of the design assumptions of ZFS. Can you make it work well? Yes, probably, eventually. However, it is clear to me that it is worth weighing the various benefits and factors. FreeNAS with UFS and iSCSI is pretty compelling too, and if what you're looking to do is just to have redundant storage, FreeNAS with UFS on a 1GB Atom and a pair of SSD's may actually be faster and a shade less expensive than FreeNAS with ZFS on a 4GB (or 8GB) Atom and a pair of SSD's, due to the additional overhead of ZFS. (SSD's in example chosen specifically to nullify any substantial seek issues.)

ZFS wins for snapshot support. ZFS wins for large amounts of storage (in any number of ways!). ZFS wins for RAIDZ2 (or Z3!) data protection.

ZFS loses for out-of-the-box responsiveness under stress (1531). ZFS loses for seeks on spinning rust due to COW (this thread). Etc.

As with everything, you have to add up the pluses and minuses for your intended application.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
This is an amazing thread. I'm enjoying all of the links and info from this thread. Gonna save a bookmark because I'm sure I'll be referring someone to this someday.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
If this is true, then an iSCSI drive will become horribly screwed up with seek times and other inefficiencies. I totally agree that this is likely not the case with the OP, but it is something to consider for long-term use. I've actually thought a lot about this lately, and it really makes me wonder whether a "defrag" tool is more necessary for ZFS than people think once a storage system has been in use for years.

You seem to have a good enough understanding of what's going on here that I'll point out a few more things to you. Hope they're interesting.

1) A traditional UNIX environment involves writing to files and then letting them sit. Rewrites of data in the middle of a file are the exception rather than the rule; in general, a file is much more likely to be appended than overwritten. Writing out a bunch of files at the same time suggests they're related in some way, and luckily the allocator will tend to allocate blocks in the same regions. So you can think of this as "files written at the same time have a bias towards being grouped together."

1a) What causes severe fragmentation on a DOS box, for example, is that people will fill the thing to darn near 100%. They'll also run the defrag tool, which (at least in DOS days) will actually create a solid block of no-free-space disk, meaning that any changes are more likely to incur a long seek. FFS avoids this by holding a 5-10% reservation, and ZFS guides point out that you shouldn't fill a ZFS pool more than maybe 80%. Both techniques work to try to provide some number of "nearby" free blocks.

1b) An allocator faces many challenges; trying to allocate a replacement block that shares some locality with the rest of the file is only one challenge. You can read a bit about the ZFS details.

1c) The long and short of it is that you probably shouldn't panic. The overall effect of the allocation strategy for typical files tends to dampen fragmentation effects, and the less you fill a ZFS pool, the better it does. It is probably worst for large files with rewrites. Oh, look, there's iSCSI. :(

2) While ZFS may tend to fragment larger files that are being rewritten, which was part of the point of this thread, there are similar but harder-to-understand effects at play that mean fragmentation may not hurt you quite as much as you're thinking. The one that's easiest to grasp is that stuff written at the same time is likely to be temporally related, so you're introducing some sort of affinity even while breaking up your contiguous set of blocks. That's actually a variation on 1) above. Obviously that goes straight to hell if you do something totally psycho like just randomly writing random blocks.

2a) Don't be shocked when William's "we got 800MB!" degrades over time. A single disk that can push 150MB/sec can be brought to its knees by doing just dozens of transactions per second on random data, requiring seeks. However, the net effect of 2) combined with readahead and multiple spindles probably makes the end result less painful than it could be.

3) What ZFS really needs is a maintenance thread. ZFS is well-situated *because* of its design to have a worker running around, using spare IOPS (similar to scrubbing) to identify things that could be done to optimize the file system. Unfortunately, there are lots of things ZFS needs that it isn't likely to get anytime soon. For example, there is no compelling reason that it should not be possible to add a disk to a RAIDZ{,2,3} and have the vdev space expand (or increase data protection or whatever). As a software-defined system, ZFS is in the ideal situation to implement such features.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
They are interesting! Thanks for the info! I love reading about the nitty gritty details.
 

g.lloyd

Cadet
Joined
Nov 23, 2012
Messages
3
How can you be so sure the bottleneck is the iSCSI target? Have you run iozone tests? Did you isolate network performance (iperf)?

I measured the performance with iozone. The target is the same machine that I tested the other solutions on before installing FreeNAS.

We have purchased a couple of Intel 520 SSDs for caching to see what happens. With Ubuntu and ZFS we were able to get better performance by setting sync=off on the zvol presented to the iSCSI initiator. It doesn't seem to do much on FreeNAS.

If not, it looks like Linux with LVM backstores for us.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I think you mean "sync=disabled"? I think you are making a grave mistake...

From http://milek.blogspot.com/2010/05/zfs-synchronous-vs-asynchronous-io.html

The options and semantics for the zfs sync property:

sync=standard
This is the default option. Synchronous file system transactions (fsync, O_DSYNC, O_SYNC, etc.) are written out (to the intent log) and then secondly all devices written are flushed to ensure the data is stable (not cached by device controllers).

sync=always
For the ultra-cautious, every file system transaction is written and flushed to stable storage by a system call return. This obviously has a big performance penalty.

sync=disabled
Synchronous requests are disabled. File system transactions only commit to stable storage on the next DMU transaction group commit, which can be many seconds. This option gives the highest performance. However, it is very dangerous as ZFS is ignoring the synchronous transaction demands of applications such as databases or NFS. Setting sync=disabled on the currently active root or /var file system may result in out-of-spec behavior, application data loss and increased vulnerability to replay attacks. This option does *NOT* affect ZFS on-disk consistency. Administrators should only use this when these risks are understood.

Notice the "very dangerous". I can't emphasize enough how bad an idea this is. You are basically telling FreeNAS/FreeBSD to tell your iSCSI machine that a write is complete when it is not. If your server crashes, loses power, etc., you risk corrupting the file system and/or data. Some of the high-end hardware RAID controllers have a similar feature (both of my hardware RAID controllers do). They also tell you in the manual not to enable the write cache without a battery backup for the controller. Not a battery backup for the computer; for the controller. It's a battery pack that plugs directly into the RAID controller. Some controllers won't even let you enable it without a battery backup installed. I've known several people who said "eh, I've got a UPS and the system never crashes, I'm fine." I have yet to personally meet anyone who did that and did NOT eventually have significant file system corruption or loss of data for that stupidity. But I will tell you that I can get amazing performance gains when enabling the write cache (gee, and you enabled it and liked the results too!). One machine couldn't even recognize the file system anymore and thought the partition needed formatting.

I have personally seen corruption just from doing a reboot because data was "synced" to the write cache on the controller during shutdown but the delay from going to the controller to the disk was longer than the remaining shutdown time for the OS. The system rebooted and the data that was supposed to be written was lost. Oops. NTFS is now corrupt. Run a chkdsk and hope you still have files. We lost about 20% of our data on that server because of a single shutdown.

ZFS also has no fsck; it has scrubs instead. If you manage to corrupt the file system beyond what parity and checksums can repair, you cannot run an fsck. There is no fsck because ZFS is supposed to be resistant to corruption. There's a webpage discussing the fact that ZFS may need an fsck more than was thought.

I'm just going to warn you.. you are treading on extremely thin ice. If I were working with you I'd give my boss a letter in writing requiring his signature that I have made him completely aware of the consequences of disabling sync writes and that I will not be held responsible in any way, shape, or form for the consequences of those actions. There is just no way I would ever trust a system with sync=disabled with any of my data. I won't even do it on my home servers despite having good backups.
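
(If anyone reading this has already flipped that property, checking it and putting it back to the safe default is straightforward; the dataset name below is just an example:)

zfs get sync tank/iscsi-zvol       # show the current value and where it was set
zfs inherit sync tank/iscsi-zvol   # drop the local override, falling back to the inherited default (sync=standard unless a parent changed it)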
 

g.lloyd

Cadet
Joined
Nov 23, 2012
Messages
3
Sorry, I was trying to recall the sync setting from memory.
Thanks for the advice. I don't intend to run production with sync=disabled; I am just trying to get my head around FreeNAS being poor as an iSCSI target compared to some other targets I have tested (local benchmark speeds are more than good enough).

We are attempting to build our own SAN using commodity hardware; at the moment we are just seeing what is out there rather than throwing a big wad of cash at EMC, NetApp, etc.


I have been reading a few posts where people claim that round-robin MPIO is poor on FreeNAS.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Ah. Thank god. I've seen so many servers get trashed because idiots enable features and say "hey.. I got a massive performance increase" without knowing the consequences. Yes, some of it was related to gov't stuff! I take the stance "if I can't explain in detail what that feature does then I shouldn't be changing it".

I actually DID do a letter to my supervisor when I was in the Navy, because we did just this. We were stupid and ran an Exchange server on a RAID5. Rule #1 with RAID5/6: don't run a transactional database (like Exchange) on it. We did anyway (before my time). When we had 3000 email accounts it got VERY slow. How do you fix it without spending money? Tweak this little harmless "write cache" setting. Poof. Server ran great.. for a week. Then the server lost power due to a basewide outage and the database was corrupt beyond recognition. No backups, because performance was too low while doing backups. So we had to recreate the server.. from scratch. I told my boss where to stick it and told him I wasn't going to work weekends and overtime to fix HIS screwup. Gave him a copy of the document signed by him. Oddly, he got "fired" less than 2 weeks later for "unknown" reasons. The commanding officer of the station started asking big questions when his email stopped working... and I was more than happy to throw my boss (who was an incompetent asshole) under the bus. I politely told him he'd have to talk to my supervisor. Supervisor sent him back to me.. so I spilled the beans.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
g.lloyd, you could try putting ZFS and your iSCSI on an SSD. That would get rid of the whole issue with ZFS and iSCSI. Another option is to use UFS. Honestly, I'm not sure why more people don't choose to use iSCSI on SSD with ZFS. It seems like that's a perfect way to utilize the latest advancements in technology.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
As I just pointed out in another thread, the old guidance to use faster spinning rust probably makes no sense now that SSD's are falling into the price realm of SAS disks.

I've also pointed out that lots of people come from the old SAN mentality of having one big storage device that has to fit all needs. Often a busy SAN array has only a small zone of activity that's really busy and the rest of the data is pretty idle. From a read point of view, that's an awesome target for a little L2ARC addition - but you could also take the dataset in question and shove it onto a separate pool on a mirrored pair of SSD's.
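
(A sketch of that second option, with made-up device and pool names: build a small all-SSD mirror alongside the main pool, carve the hot iSCSI zvol out of it, and optionally keep a spare SSD as L2ARC for the big pool.)

zpool create ssdpool mirror ada4 ada5    # dedicated mirrored SSD pool
zfs create -V 200G ssdpool/hot-extent    # zvol for the busy iSCSI LUN
zpool add tank cache ada6                # optional: spare SSD as L2ARC for the spinning-rust pool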
 