SMB share: Writes are GBE fast. Reads are only GBE fast if cached. Why?

scurrier · Apr 1, 2014

I'm seeing some behavior from my FreeNAS box that I believe is strange. Basically, writes to my shares are always GBE fast, reads are only GBE fast if cached.

Problem description:

FreeNAS has two shares via smb/cifs, hosted on two different RAIDZ2 arrays:
-A 6-disk RAIDZ2 of modern Seagate 4TB NAS drives.
-A 4-disk RAIDZ2 of oldish Seagate 1.5TB consumer drives.
Sequential writes to the shares can ALWAYS max out my GBE, hovering around 109-112 MB/s. This is an example 20 gig file transfer, to show you how solid it is. When it finishes, hard drive activity stops immediately. The drives do not have to catch up to the RAM cache or anything. Picture of the write transfer:
When I first sequentially read a file from a share, it hovers around 90-100 MB/s. If I cancel the transfer mid-way through...(shortly after this point:
)
...and then read the same file again, it will peg around 110 MB/s until it gets to the point where I cancelled before. Then it will slow back down to the 90-100 MB/s speeds I was originally getting. You can see this effect here:

Question:
So it seems that without caching, the share's sequential reads are slow and can't max out GBE. Why is it so slow without caching?

More information:
I have extensively tested the read speeds of these arrays and I know that they are much faster than GBE is. This is all sequential, no excuses.

FTP can max out the connection:

I have no sysctls or tunables. atime is turned off on this dataset.

Here's my FreeNAS hardware:
Intel i3-4130
Corsair 2x8GB RAM (CT2KIT102472BD160B)
Supermicro X10SL7-F (onboard IT-flashed LSI 2308 for 8 SAS / onboard Intel for 6 SATA)
Supermicro 933T-R760B, 3U 15-bay hotswap case

My network is all GBE, cables are at least CAT 5E, and there is only one switch between these machines. I am transferring to/from a windows 8 machine's SSD using a Marvell Yukon NIC, but the phenomenon is the same on other machines in my house.

I could definitely live with these speeds but I just hate a nagging problem that I don't understand. Appreciate anyone's input as to what else I should look into.

scurrier · Apr 1, 2014

Forgot to mention, I am on FreeNAS-9.2.1.3-RELEASE-x64 (dc0c46b). I have powerd disabled.

The CIFS "server minimum protocol is set to "-----" and the "server maximum protocol" is set to "SMB3." There is an "SMB3_00" option even lower on the list, but I have not changed away from the default. I have disabled "hostnames lookups" since the tooltip describes it as "expensive" although I'm not using allow/deny lists so I'm pretty sure that's not doing anything.

cyberjock · Apr 2, 2014

Well, for the most part, once you hit 90MB/sec or so, even a small out of latency between getting the data off of your disks and getting it to your desktop can hurt throughput. 90MB/sec is roughly 80% of theoretical throughput, which is what most engineering-types would call maximum speed.

So I don't really see a problem at all, and I don't see anything to "fix". More RAM may help, but I wouldn't necessarily expect it to give you a constant 110MB/sec all the time.

scurrier · Apr 2, 2014

Always appreciate your input, cyberjock. I can understand the point that there is no serious problem here.

For me, it's just frustrating that the performance is there, just not in this certain (important) situation, and for seemingly no good reason (that I understand). I have proved that the hardware is capable, but there seems to be some kind of unnecessary inefficiency happening in this certain case. Why should a non-cached read get any lower % theoretical throughput than a cached read?

I have literally zero knowledge of the inner-workings of samba and only a surface knowledge of ZFS. So I can only venture a guess. Based on the evidence that only samba reads from disk are slow, I am guessing that there is some samba tuning/buffering issue here. When reads are cached in ZFS, they are effectively buffered for samba even though samba is not doing the buffering. So it's covering up an issue with samba not making a big enough buffer. That's my uneducated guess.

Is there any way to force samba to a bigger buffer? A quick google tells me I need to investigate samba socket options.

Maybe I need to try a UFS share and see how well that works. I have a creeping suspicion that it will work better. Maybe samba somehow gauges how fast the file system is and self-tunes based on that, and ZFS's caching is fooling the self-tune into selecting the wrong tuning for non-cached reads. Wouldn't it be disappointing if a relatively unsophisticated UFS share could be more effectively utilized by samba than ZFS?

Someone, please tell me why I am wrong.

cyberjock · Apr 2, 2014

scurrier said:
For me, it's just frustrating that the performance is there, just not in this certain (important) situation, and for seemingly no good reason (that I understand). I have proved that the hardware is capable, but there seems to be some kind of unnecessary inefficiency happening in this certain case. Why should a non-cached read get any lower % theoretical throughput than a cached read?

Because your memory subsystem is a few orders of magnitude faster in both latency and throughput than your disk subsystem.

Going UFS is crazy stupid. UFS is basically discontinued in FreeNAS. It has no real cache at all, has limited RAID formats it supports, has no protection from data corruption that makes people want ZFS, need I say more?

sockets aren't going to solve your problem. I told you what the problem was 2 paragraphs up. The inefficiency you are thinking exists is called "moving parts don't operate at the speed of light". I'll tell you what the manual says if you want ZFS to have higher performance.. add more RAM. That's generally the best and fastest way to add performance to your pool. More RAM = more cache for ZFS = more speed.

Dennis.kulmosen · Apr 2, 2014

Remember the rule of thumb.
1GB ram per TB raw storage.
You got 30TB of raw capacity but "only" 16GB of ram in your system. So adding another 16GB would improve your overall performance on the system.
Just my 2 cent. ;-)

Sent from my iPad using Tapatalk

Rand · Apr 3, 2014

Well there where several threads now where ppl said that their writes are faster than their reads (me included). And while I agree that this is a luxury problem it is weird that a usually faster operation (read) is slower than the usually slower one (write).
So I think the question *what could cause this* is valid - and while more Ram is always a good way to speed thing up its not the answer here b/c its not slow per se but only slower than expected when RAM is not being used.

And i dont think the 1GB Ram : 1 TB storage applies to empty storage anyway (performance wise);)

eraser · Apr 3, 2014

Dennis.stjernholmco said:
Remember the rule of thumb.
1GB ram per TB raw storage.

I've been wondering about this for a while.... Do you know where does this rule of thumb comes from? Surely having 20 TB of large movie files stored does not really require 20+ GB of RAM to serve them up to a single client...

eraser · Apr 3, 2014

Rand said:
Well there where several threads now where ppl said that their writes are faster than their reads (me included).

In the last thread I remember reading, the problem was resolved by disabling the existing LAG groups. Are you doing any NIC teaming on either end?

gpsguy · Apr 3, 2014

From the manual (and numerous posts on the forum) -

"For systems with large disk capacity (greater than 8 TB), a general rule of thumb is 1 GB of RAM for every 1 TB of storage.

If you plan to use your server for home use, you can often ignore the thumbrule of 1 GB of RAM for
every 1 TB of storage. If performance is inadequate you should consider adding more RAM as a first
remedy. The sweet spot for most users in home/small business is 16GB of RAM."

eraser said:
Do you know where does this rule of thumb comes from?

eraser · Apr 3, 2014

My first impression is that writes are fast because ZFS write caching (txg groups) effectively turns all writes into sequential writes. Consumer disks are pretty good at sequential writes.

For reads, worst case scenario is that you end up with random read IOs for everything.

Could that explain the behavior that you see?

scurrier · Apr 3, 2014

eraser said:
Could that explain the behavior that you see?

Good thinking, but my reads are definitely sequential because I transferred everything sequentially, I'm basically the only user of these volumes right now.

scurrier · Apr 3, 2014

cyberjock said:
Because your memory subsystem is a few orders of magnitude faster in both latency and throughput than your disk subsystem.

OK. But my disks are multiple times faster than my network. Seems like that should be fast enough if the buffer between Samba and ZFS is being managed well. If Samba is starving for data to send, shouldn't it be requesting data off the disk that is "further ahead" of the transfer.

cyberjock said:
Going UFS is crazy stupid. UFS is basically discontinued in FreeNAS. It has no real cache at all, has limited RAID formats it supports, has no protection from data corruption that makes people want ZFS, need I say more?

I meant just to test my crazy self-tuning theory. I tried testing it today but could not make a UFS stripe out of any of my spare disks that would go above 140 MB/s, which left me thinking that I cannot trust the results. But when I did a test from UFS cache, I was seeing a solid 96 MB/s. I am discarding this result because I don't know enough for it to tell me anything. Although it might be suggesting that this is a Samba issue because Samba is not even serving files out of cache at max GBE (if you assume max is 110 MB/s). I'll admit I'm working on a lot of assumptions here which is why I'm not very sure of anything until I find a smoking gun.

cyberjock said:
Well, for the most part, once you hit 90MB/sec or so, even a small out of latency between getting the data off of your disks and getting it to your desktop can hurt throughput. 90MB/sec is roughly 80% of theoretical throughput, which is what most engineering-types would call maximum speed.

Couldn't this latency you speak of be resolved with adequate buffer management? Why isn't that happening? Why are writes more network-efficient than reads?

cyberjock said:
So I don't really see a problem at all, and I don't see anything to "fix". More RAM may help, but I wouldn't necessarily expect it to give you a constant 110MB/sec all the time.

I tried to eliminate the inadequate RAM variable by detaching all volumes except my RAIDZ1-0 (RAID 10) of 4x 1.5 TB drives (which are the older drives). I did some more testing tonight and found that my read speeds can range from 65 to 100 MB/s. Here's an example. I just sequentially wrote this file yesterday to a volume with sequential free space.

I'm getting 205 MB/s in a dd test. That's with bs=1M and after blowing out the ZFS cache with a 20+ GB file. On my dumpy, 4 year old drives. My large volume is faster (6x new drives) and still shows similar behavior. Something's not right.

Dennis.stjernholmco said:
Remember the rule of thumb.
1GB ram per TB raw storage.
You got 30TB of raw capacity but "only" 16GB of ram in your system. So adding another 16GB would improve your overall performance on the system.
Just my 2 cent. ;-)

Agree that this could be a cause in some cases. But see above about how I eliminated this possibility and still am getting poor results.

eraser said:
In the last thread I remember reading, the problem was resolved by disabling the existing LAG groups. Are you doing any NIC teaming on either end?

I know this isn't addressed to me but I am only using one NIC, no LAG (LAGG?) has been used, ever.

gpsguy said:
From the manual (and numerous posts on the forum) -

"For systems with large disk capacity (greater than 8 TB), a general rule of thumb is 1 GB of RAM for every 1 TB of storage.

See above.

::::::::::::
So yeah. I could be convinced that this is the best I can get, if given more details from people who understand the inner workings of samba and ZFS. But as it stands, I am not convinced.

Does anyone have any suggestions of things to prod and test? Is there some way I can PROVE that I should not expect better?

cyberjock · Apr 4, 2014

eraser said:
My first impression is that writes are fast because ZFS write caching (txg groups) effectively turns all writes into sequential writes. Consumer disks are pretty good at sequential writes.

This is wrong. It does NOT make all writes sequential. Not one bit.

scurrier said:
Good thinking, but my reads are definitely sequential because I transferred everything sequentially, I'm basically the only user of these volumes right now.

That is a bad assumption to make. Very bad. If you actually want to know how your files are stored, there's a tool to examine the pool's metadata... check out zdb. ZFS uses it's own algorithm to fill your pool, and your assumption basically ignored all the processing ZFS uses to decide where to put the data.

scurrier said:
OK. But my disks are multiple times faster than my network. Seems like that should be fast enough if the buffer between Samba and ZFS is being managed well. If Samba is starving for data to send, shouldn't it be requesting data off the disk that is "further ahead" of the transfer.

That depends.. does Samba have enough RAM to be doing things like that? Cause ZFS is supposed to be your cache.. NOT Samba. You should ask the Samba guys, but I'm 99% sure the answer to your question is "no".

scurrier said:
Couldn't this latency you speak of be resolved with adequate buffer management? Why isn't that happening? Why are writes more network-efficient than reads?

If you have invented a way that buffers can negate the fact that moving parts can now perform at the speed of light, you are about to be bigger than the entire IT industry and you should sell your product.

I don't both trying to compare writes to reads unless there's a major disparity. I'm talking 200MB/sec write and 25MB/sec reads. Even with that disparity, scientific testing and observation would be recommended before you assume something is wrong. Both reads and writes are affected by far far far too many things to try to make assumptions like "my reads should be faster than writes". One such example I'm sure you are aware of is file fragmentation. And being that no defrag tool exists for ZFS, you can probably understand that a heavily fragmented file will perform poorly. I consider this whole writes versus reads to be pretty laughable because you really aren't applying any scientific theory to the reasoning for being upset. We've had one or two people in IRC lately "complaining" about this, and I didn't bother posting a reply. It wasn't worth my time. And don't get me started on compression(which is defaulted to enabled in 9.2.1+). Compression alone can turn everything you're used to seeing upside down. Compression can hurt or help depending on various factors like compressibility of the data, CPU usage during both the writes and the reads, ARC performance at the time of the write and read, etc. So yeah.. not a single shred of scientific evidence suggesting "something is wrong" sticks out and bites me.

And lastly.. have you ever thought there could be a performance bottleneck with your desktop? We don't just recommend Intel on your servers. Putting them on your desktops matters too. I broke down back in 2009 or 2010 and bought all Intel NICs for my house because I was pissed at my windows server performing poorly. I tried registry edits all up and down the internet. I had dismised the whole "buy intel" thing because I just couldn't believe it. Finally bought like 8 Intel NICs and dropped them into my machines. Guess what? Every machine I own has an Intel NIC now. Speeds almost doubled for me.

scurrier said:
I tried to eliminate the inadequate RAM variable by detaching all volumes except my RAIDZ1-0 (RAID 10) of 4x 1.5 TB drives (which are the older drives). I did some more testing tonight and found that my read speeds can range from 65 to 100 MB/s. Here's an example. I just sequentially wrote this file yesterday to a volume with sequential free space.

Right.. and you verified this sequential write by using ZDB to examine the actual transactions, right? I'm thinking "no". Because unless you are actually verifying it, you're just making assumptions.. and we all know what assuming does. ;)

scurrier said:
I'm getting 205 MB/s in a dd test. That's with bs=1M and after blowing out the ZFS cache with a 20+ GB file. On my dumpy, 4 year old drives. My large volume is faster (6x new drives) and still shows similar behavior. Something's not right.

From my personal observations of people that use ZFS, you should expect network transfers to never exceed more than about 1/2 of what a dd test does. So in your case I'd expect a best-case scenario of about 100MB/sec. Sure, it's possible to do better, but it's not common and shouldn't be expected. Considering that RAID10 really is using 2 vdevs, and each vdev is basically reading from 1 source, and each source is about 100MB/sec, I'd consider 205MB/sec to be an expected result.

scurrier said:
Does anyone have any suggestions of things to prod and test? Is there some way I can PROVE that I should not expect better?

Sure, read the forums for a few months on ZFS and Samba.. then start running various tests. But I'll tell you, from my experience, I don't see any reason to think you should get higher speeds.

I told you back in my first post what you should do if you care this much.. get more RAM. Not sure why you've ignored that comment, but everyone and their mother knows that more RAM with ZFS = more performance.

eraser · Apr 4, 2014

cyberjock said:
eraser said:

My first impression is that writes are fast because ZFS write caching (txg groups) effectively turns all writes into sequential writes. Consumer disks are pretty good at sequential writes.

Click to expand...

This is wrong. It does NOT make all writes sequential. Not one bit.

Sorry about that. I mis-remembered a blog post that I had recently read. I went back and found the quote I was thinking of when I wrote my reply yesterday:

"The good news here is that ZFS automatically turns random writes into sequential writes through the magic of copy-on-write. One less class of performance problems to take care of."

http://constantin.glez.de/blog/2010...ove-oracle-solaris-zfs-filesystem-performance

scurrier · Apr 7, 2014

Without digging into all of cyberjock's concerns yet, I did a quick test. My intent was to do a quick test that had a possibility of eliminating disk performance as a root cause. I changed my 6 disk RAIDZ2 to a 6 disk stripe. I turned off compression and then wrote my test files to it, in a manner that appears sequential (but which may not be per cyberjock). They again wrote at around 110 MB/s via CIFS and would only read at around 90 MB/s. So I feel like I can rule out disk performance as the root cause in this case. Something else is happening, there is an inefficiency somewhere.

Out of curiosity I went to test on my other computer which has an SSD. That machine had some high peak reads over the network when the transfer first started. Maybe related to bursts coming from the NIC buffer which never bursted so high over the network. But then it settled down to around the same as the original machine.

I'm not sure if I want to give up on this or keep after it. Maybe I will buy an intel NIC for the clients just so that I can sleep at night having eliminated another possible root cause.

Side note: I'm currently running iostat on this volume before testing further since I don't have much time right now. It's absolutely blasting through the test. Paper calcs for the sequential throughput of this volume are 960 MB/s! (assuming 160 MB/s initial performance for these drives)

cyberjock · Apr 7, 2014

If you can do 110MB/sec with writes, then your reads aren't being limited by your NIC.. you already proved your NIC can do more. Now, I always recommend Intel NICs because the others behave erratically(to say the least). I wouldn't necessarily buy one expecting a performance increase as I consider 90MB/sec+ to be excellent network speeds. But, at $25 you could "throw the money away" and not miss it.

If this is for home use I'd say you are a little overzealous with your expectations. You're spending a lot of time trying to fix what isn't broken. That's okay though, I have no doubt you are learning a lot by doing what you are doing. And that can never be a bad thing.

If you did a dd test of your 6 disk stripe. Say, dd if=dev/zero of=/mnt/poolname/testfile bs=1M count=100k I bet you'll get over 400MB/sec. So you've obviously proved that the pool has characteristics of exceeding Gb LAN by a long shot at that point. At that point I'd also direct you to reconsider:

1. More RAM will always make the L2ARC and pre-fetch more efficient. Now whether that will be a major performance improvement(think 3000% like I have personally seen) or minor(5% or so) remains to be seen. There's no good way to know until you add more RAM. I personally wouldn't add more RAM unless you plan to use jails that will need a lot of RAM and you want to keep the performance you currently have. Once you have >8GB of RAM I tell people you can either add more RAM proactively or add more RAM as a corrective action. Which philosophy you choose is up to you and how fat your wallet is.

2. Intel NICs in the server and clients is always a major plus. When I decide to go shopping for my next laptop, it WILL come with an Intel NIC. Period. That's one of my criteria for vetting out the bad laptops from the good. It wasn't really possible when I bought my last laptop in March 2010. But, more and more companies are going with Intel NICs, and I'll be going with it properly. Getting 40MB/sec max from my laptop over the LAN just doesn't put a smile on my face.

scurrier · Apr 8, 2014

I tried the dd command that you recommended on a 20GB file except I excluded the count option, so it read the whole file. Made sure the cache was clear by rebooting. Got 800 MB/s from the 6 disk stripe. These are the 4tb seagate NAS drives for the record.

Still not sure how much more time I'm going to sink in this. Haven't had time to consider it!

Rand · Apr 9, 2014

Well if you choose to follow up i'd be interested in the result;)

Important Announcement for the TrueNAS Community.

SMB share: Writes are GBE fast. Reads are only GBE fast if cached. Why?

Patron

Patron

Inactive Account

Patron

Inactive Account

Explorer

Guru

Contributor

Contributor

Active Member

Contributor

Patron

Patron

Attachments

Inactive Account

Contributor

Patron

Inactive Account

Patron

Guru

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "SMB share: Writes are GBE fast. Reads are only GBE fast if cached. Why?"

Similar threads