Why is my NFS write performance this bad?

SirHaxalot · Dec 23, 2012

I'm trying to understand how ZFS synchronous writes work. As I first understood it, adding a SSD ZIL would enable the OS to have an fast write cache, but instead my write performance goes from bad to worse when enabling the ZIL, and it only seems to affect NFS writes. My thought process was the the ZIL would enable the OS to return to synchronous writes as soon as it has written the data to the ZIL device, but my guess is that the ZIL at the moment only adds another step to the process and it still has to wait for the data to be written to the physical disks before returning.

I have 3x 2TB 7200rpm drives in a RAID-Z configuration for the main data and 2x60GB SSDs split up with a 6GB mirrored log partition and the rest assigned as L2ARC cache (no RAID). When I write to a NFS datastore synchronously, I get a write performance of ~10MB/s with sync=standard and not more than ~30MB/s with sync=disabled. Asynchronous writes are better but still bad, 30MB/s with sync=standard and 50MB/s with sync=disabled.

Random 4k writes perform a little better but there is still only an improvement of a few hundred IOPS when enabling ZIL, which is much worse than what I'm expecting.

At the same time, there's no issue at all when I'm using SMB shared. Then I can easily reach 90MB/s writes, which is what I'm expecting with the current setup (nearly hitting the 1Gbit limit).

bollar · Dec 23, 2012

This seems a common problem. I don't use NFS, so I can't comment on it, but I noticed some recommendations on changing /boot/loader.conf to include the vfs.zfs.cache_flush_disable=1 command.

Look at this link for more details on why ZFS and NFS need tuning to work together: http://christopher-technicalmusings.blogspot.com/2010/09/zfs-and-nfs-performance-with-zil.html

SirHaxalot · Dec 23, 2012

I tried to set the zfs:zfs_nocacheflush variable to 1, and lost a few more MB/s write performance. I assume that it's the same thing but I can't really test that now as it would require me to reboot the box..

cyberjock · Dec 23, 2012

Can you post system hardware, FreeNAS version, network card, etc?

ZIL solves some problems.. very well. But it can create other problems so if you don't know you shouldn't use it.

SirHaxalot · Dec 25, 2012

The system has the following specifications:

Intel Pentium ("Sandy Bridge", LGA1155) Dual-core @ 2,8Ghz
8GB RAM
FreeNAS 8.3 (ZFS v28!)
2x 60GB OCZ Agility SSDs for ZIL and cache
3x 2TB Segate Barracude 7200rpm drives for primary storage

As for the SSD ZIL, I got it when I first installed the system and discovered that NFS write performance was even worse than now at only 5MB/s, and then I got recommendations from everywhere to get a SSD as write cache when dealing with NFS. I assume that you aren't suggesting that everyone that knows ZFS says absolutely not to do (disabling ZIL altogheter)

bollar · Dec 25, 2012

I think it's worth testing to see the impact on NFS performance with-and-without a ZIL and with-and-without an SSD SLOG (ZIL). If the performance really is significantly better with write caching off, then you need to consider how important your data is. If you can do without a few transactions in the event of a power-loss or other crash, then I'd seriously consider it. I'd probably feel better if I did snapshots every minute, just to make myself feel better about having the file system in a known state if I need to rollback.

bollar · Dec 25, 2012

Here's a good article on optimizing ZFS and NFS on Sun storage systems. Basically the recommendations are similar for what I found with iSCSI:

When configuring the Series 7000 Unified Storage System appliance, the following factors have the greatest influence on the overall NFS performance of the system:
The choice of the ZFS pool RAID level
Number of disks configured in the ZFS pool(s)
Provision of Write-optimized SSDs (logzillas)
Provision of Read-optimized SSDs (readzillas)
Matching the ZFS pool blocksize to the client workload I/O size
Size of (DRAM) Memory
Number/speed of CPUs

In general, the biggest causes of NFS performance problems on the Series 7000 appliance when configuring/sizing the system are:
The 'wrong' choice of RAID level
No Log SSDs ... or too few
Not enough disks configured in each pool

The choice of the ZFS pool RAID level
This is the most important decision when configuring the system for performance.

Choosing a RAID level:
Double Parity RAID is the default 'Data Profile' type on the BUI storage configuration screen - > this is NOT a good choice for Random workloads.
If tuning for performance, always choose 'Mirrored'.
For random and/or latency-sensitive workloads:
use a mirrored pool and configure sufficient Read SSDS and Log SSDs in it.
budget for 'disk IOPS + 30%' for cache warmup.
RAIDZ2/RAID3 provide great useable capacity for archives and filestores, but don't use these RAID levels for random workloads.

You might also want to note that they say that NFS almost always writes asynchronously unless instructed otherwise by opening a file with the O_DSYNC option.

http://oradbastuff.blogspot.com/2012/05/how-to-tune-performance-in-oracle-sun.html

cyberjock · Dec 25, 2012

SirHaxalot said:
As for the SSD ZIL, I got it when I first installed the system and discovered that NFS write performance was even worse than now at only 5MB/s, and then I got recommendations from everywhere to get a SSD as write cache when dealing with NFS. I assume that you aren't suggesting that everyone that knows ZFS says absolutely not to do (disabling ZIL altogheter)

No, what I meant was that you shouldn't add a ZIL hoping for increased performance without understanding it. There is a ZIL in RAM and that can be disabled(that is a big no-no). I'll never recommend anyone disable that. I was referring only to the SSDs. I never recommend people add them unless they give a situation where a ZIL is actually a big benefit. For most everyone, it won't help significantly.

Conversely, if a ZIL SSD were to suddenly go bad data can be lost/corrupted when it is committed to the zpool(if it is not lost completely). A sync write is suppose to ensure that the data is "for sure" stored on permanent storage (zpool) before the next action is completed. With a ZIL drive the data may not be written to the zpool and is instead written to the ZIL drive. If the system fails then the data should be on the ZIL and can still be commited to the zpool the next time the system is booted up. That is one of the reasons why it is always recommended that ZILs be mirrored. L2ARC does not have to be mirrored because if the L2ARC drive fails the zpool will simply re-read any "missing" data from the zpool itself. The only consequence for a failed L2ARC is the performance loss of losing the L2ARC

SirHaxalot · Dec 26, 2012

So, in a nutshell write caching doesn't really exist for synchronous modes, and NFS performance in general is piss poor with ZFS. You can explain all you want that ZIL isn't what I originally thought, but that only means that we come to the same conclustion I've been nearing myself: That ZFS simply handels sync writes in a retarded way.

I guess I just have to look for another solution..

bollar · Dec 26, 2012

Probably not without some potentially significant tuning or reconfiguration. I'd try UFS and see how that works.

cyberjock · Dec 26, 2012

SirHaxalot said:
So, in a nutshell write caching doesn't really exist for synchronous modes, and NFS performance in general is piss poor with ZFS. You can explain all you want that ZIL isn't what I originally thought, but that only means that we come to the same conclustion I've been nearing myself: That ZFS simply handels sync writes in a retarded way.

I guess I just have to look for another solution..

I would hardly say that ZFS handles sync writes in a retarded way. It is more accurate to say that ZFS handles writes for maximum data reliability. Windows handles sync writes in a retarded way, and a loss of power during copying almost always results in a loss of data. Of course people always assume "well, the data was copying and it wasn't a fault of Windows" or other reasoning.

bollar · Dec 26, 2012

noobsauce80 said:
I meant was that you shouldn't add a ZIL hoping for increased performance without understanding it. There is a ZIL in RAM and that can be disabled(that is a big no-no).

Unless I misunderstand what you're saying, the default ZIL is never in RAM, it's stored on disk when ZFS doesn't have time to properly write files to the file system. This is why the data is protected once it's in ZIL, even if there's a crash. There is a period where data is in cache before it's committed to ZIL or file system, but this isn't ZIL.

As you know, one of the ways to make the ZIL faster is to move the ZIL to a faster device - such as an SSD or NVRAM. (SLOG or separate intent log). It really shouldn't make access slower, aside from the scenario where all writes are asynchronous and data is written to slog and then file system without gaining any efficiency.

Additionally, Oracle says that turning the ZIL off won't cause ZFS data corruption and they recommend exploring that option in certain circumstances - but it could cause NFS client data corruption - which sounds like a semantic argument to me: corrupted data is bad regardless.

bollar · Dec 26, 2012

noobsauce80 said:
I would hardly say that ZFS handles sync writes in a retarded way. It is more accurate to say that ZFS handles writes for maximum data reliability. Windows handles sync writes in a retarded way, and a loss of power during copying almost always results in a loss of data. Of course people always assume "well, the data was copying and it wasn't a fault of Windows" or other reasoning.

NFS was developed by Sun in the early 80s, so we can't blame this one on Windows. :D

Fortunately Sun saw the error of their ways and gave us ZFS.

cyberjock · Dec 26, 2012

Ok, maybe I'm not using the right word. The write cache that is normally kept in RAM.

This is a slightly better explanation of what I'm talking about: http://en.wikipedia.org/wiki/ZFS#ZFS_cache:_L2ARC.2C_ZIL

What is more interesting is that the article mentions that the ZIL is virtually never read. Which means that if you have 8GB of RAM and your ZIL drive is bigger than 8GB the rest is "wasted" as it will never be used.

This is even better: http://www.nickebo.net/zfs-zil-l2arc-what-that-about/

I guess instead of calling the RAM write cache a ZIL the better term is transaction groups. :P
In essence, if you want a ZIL to provide a good sized cache you also want alot of RAM. That's somewhat disappointing since alot of people seem to trade the ZIL for more RAM hoping for more performance.

cyberjock · Dec 26, 2012

I think what the OP should do is setup a CIFS share and try copying data to/from the server with CIFS. If he still gets very poor performance then it somewhat rules out the protocol as the problem. However, if the speeds suddenly jump to a good value then NFS may be to blame. Of course, this is still not 100% guaranteed.

I'm also a little hesitant to make the leap that NFS is solely to blame since the OP has put a ZIL and L2ARC on the same physical drive. I'm not sure if that is a wise choice performance wise since even when you are doing writes you are likely doing some reading. If the same drive is having to commit ZIL writes and L2ARC writes to cache data that may also hurt performance.

NFS was designed back when computers had a fraction of the performance of today's computers. We're almost implying that NFS overhead is related to the poor performance which I think is flawed logic.

Additionally, the OCZ Agility is not exactly a high performing drive by today's standards. OCZ only claims 135MB/sec write speeds which his RAIDZ can almost certainly outperform for async writes. A faster ZIL drive that is also not the same drive as the L2ARC may help.

Of course, I still subscribe to the notion that unless you know you need a ZIL its not likely to give you a major performance boost. We are also seeing that concept played out by the OP.

I see ZIL as an option only when every single other possible option is exhausted and we're seeing test data showing that the problem IS sync writes and nothing else. For all we know the issue could be CPU bound somehow. There's far too many variables at the present time without the OP providing massive amounts of data on zpool data unless we happen to come across a smoking gun with a few commands.

cyberjock · Dec 26, 2012

Here's an excerpt from this thread:

Any sync writes under 64KB in size are written to the ZIL first, then they become async writes that get written out to the pool along with the next transaction group (5-30 seconds later, depending on pool settings).

Any sync writes over 64 KB in size are written to the pool directly as part of an immediate transaction group (ASAP).

Sequential or streaming don't matter. It's just size that counts.

Interesting considering that the FreeNAS manual(I believe that is where I read this) mentions the ZIL is good for databases only. In that situation I'd expect that you'd have tons of writes that likely never exceed 4k since each write to the database is probably written one field at a time.

Stephens · Dec 26, 2012

noobsauce80 said:
In that situation I'd expect that you'd have tons of writes that likely never exceed 4k since each write to the database is probably written one field at a time.

Generally speaking, the lowest granularity for I/O I've seen is a record. (Database ---> Table --> Record ---> Field [ex... "Last_Name"])

bollar · Dec 26, 2012

In the first message, the OP says:

At the same time, there's no issue at all when I'm using SMB shared. Then I can easily reach 90MB/s writes, which is what I'm expecting with the current setup (nearly hitting the 1Gbit limit).

The "problem" with ZFS and NFS is that ZFS honors the o_sync requests it receives from NFS. This requires the data to be written to non-volatile storage before it can be confirmed back to NFS. The faster ZFS can complete that write, be that file system, ZIL or SLOG, the faster it can send the confirmation.

Some unnamed file systems apparently say they honor the request, but they apparently don't, and immediately send a confirmation which makes their performance look better.

Great blog on the subject: http://constantin.glez.de/blog/2010/07/solaris-zfs-synchronous-writes-and-zil-explained

FWIW, I did turn NFS on with an OpenIndiana 151a7 system and got a write speed of 12Mbps. This system can do AFP and SMB at the theoretical maximum and iSCSI at ~85Mbps. Unfortunately, I don't have time to experiment with the settings to see how much that can be improved.

bollar · Dec 26, 2012

noobsauce80 said:
What is more interesting is that the article mentions that the ZIL is virtually never read.

Even if it's never read, the important thing is that ZFS can confirm that a write is completed as soon as data is written to some sort of non-volatile storage. If the client is waiting for that confirmation, ZFS can still hold the data to be written in RAM and write to the file system at a later time (let's say 5-30 seconds), so long as it has quickly written the data to the ZIL or SLOG. For the applications that require this confirmation before they will accept the next record (like some databases), it's a huge performance improvement.

Otherwise, if one is on a budget, then RAM is probably a better use of $ than SLOG.

cyberjock · Dec 26, 2012

Stephens said:
Generally speaking, the lowest granularity for I/O I've seen is a record. (Database ---> Table --> Record ---> Field [ex... "Last_Name"])

Exactly. Generally I've seen databases enter intormation one record at a time(or sometimes one field at a time if you look the code very closely). Generally speaking I wouldn't expect that the average record(or field) is more than 64k. When you start dealing with large amounts of data like that a small number of records can turn into a huge database.

I remember reading somewhere that ZILs are extremely useful for databases but little else. Although some places say ZIL can often help NFS it seems to usually provide very little benefit. CIFS seems to benefit even less still with a ZIL.

Which goes back to what I have thought for quite some time(and said early on in this thread) that if you don't KNOW that a ZIL is useful for you it almost certainly isn't. Saying "gee.. I want a write cache and a ZIL is a write cache" isn't knowing. It's trying to make some very bad assumptions and spending money on hardware that won't see a significant performance increase(perfect example.. look at the OP). I say KNOW and I put it in caps for emphasis but I'm sure there's boatloads of people out there with a ZIL anyway. As far as I'm concerned if you are even in a forum asking questions you have nowhere near the knowledge, experience, and understanding to make an informed decisions if you need a ZIL let alone make the large leap to claim it will give you a major performance increase. Of course, thanks to Windows, a lot of people think that throwing more hardware at a problem always helps they and ultimately are in shock when the performance increase is minimal.

My guess is that people that live and breath for database administration know exactly how a ZIL works and knows exactly how to take advantage of it(perhaps by tweaking their writes to be less than 64k all of the time. Also, because of how nasty copy-on-write can be for databases(see iSCSI issues for the same exact problem) my guess is that database administrators don't use ZFS as much as we'd think or only use it as a backup server.

You know how I caused a 3 fold increase in zpool write speed recently? I disabled the on-card write cache on my RAID controller. I went from about 250MB/sec to over 800MB/sec. Very odd indeed because in the Windows/Linux world you can see amazing performance increases by upgrading a RAID controller cache.

Frankly, if you are a Windows/Linux guy learning ZFS the first rule of ZFS is to forget everything you think you know about file systems, performance optimizations, etc. and just shut up and listen to the FreeNAS manual and do a lot of Google searching and reading. I had to and despite it sucking I've somehow avoided all of the problems people have in this forum regularly.

Important Announcement for the TrueNAS Community.

Why is my NFS write performance this bad?

Dabbler

Patron

Dabbler

Inactive Account

Dabbler

Patron

Patron

Inactive Account

Dabbler

Patron

Inactive Account

Patron

Patron

Inactive Account

Inactive Account

Inactive Account

Patron

Patron

Patron

Inactive Account

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Why is my NFS write performance this bad?"

Similar threads