Not able to saturate 10Gb connection

ChrisReeve

Explorer
Joined
Feb 21, 2019
Messages
91
Good morning

I have the following server:

MB: Supermicro X9SRL-F
CPU: E5-2650 v2 (8C/16T)
RAM: 64GB 1600MHz DDR3
HBAs: 2x LSI 9211-8i flashed to IT-mode
Drives: 10x10TB WD Red (white label, shucked from WD External drives) in one RAIDZ2 pool
NIC: Intel X540-T2 2x10Gb RJ45

Other relevant HW:
Networking:
Switch: Netgear SX10

Desktop specs:
CPU: 9600K
RAM: 16GB 3200MHz
SSD: Intel 660p 1TB
NIC: Asus XG-C100C 1x10Gb RJ45

Performance is decent. CrystalDiskMark reports the following:
performance_CDM.PNG


This is obviously boosted by ARC, as 678MB/s sequential write is more than the theoretical maximum write speed for 10 drives in RAIDZ2. Read speeds are as expected, more or less saturating my 10Gb connection, minus overhead.

But copying large files to my Intel 660p 1TB drive sees sustained speeds closer to 400-600MB/s, while I would expect closer to 1GB/s. My Intel 660p is about 40% full and should sustain at least 1GB/s for the first several GBs. Read speeds also suffer if I choose "cold" files, by which I mean files that are obviously not cached - files that haven't been accessed in several months. If I transfer a file to my server and then immediately copy it back to my SSD, speeds are closer to 800-900MB/s.

Are there any obvious reasons why I don't see sustained sequential read speeds closer to 1GB/s? Is there any optimization I might be able to do?
 

ChrisReeve

Explorer
Joined
Feb 21, 2019
Messages
91
Update: I looked into jumbo frames. I haven't tried them yet, but I am wondering if they might not be such a good idea. My switch supports them, and I can easily set the MTU on both my server and main desktop. But what about the rest of the devices on the network, mainly an Apple TV 4K and my LG TV, which connect to the NAS for local Plex access?

I could enable a VLAN on my switch for the devices that have jumbo frames enabled, but then I believe my TV and Apple TV won't be able to reach my Plex server locally?
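If I do end up testing it, my understanding is that it boils down to setting MTU 9000 on both NICs and the switch and then verifying with a don't-fragment ping. A minimal sketch - interface name and IP addresses below are just placeholders for my setup:
Code:
# On the FreeNAS box (replace ix1 with whichever X540 port is in use;
# to make it permanent, put "mtu 9000" in the interface Options field in the GUI)
ifconfig ix1 mtu 9000

# Verify end-to-end: 8972 bytes of payload = 9000 minus 28 bytes of IP/ICMP headers
# From Windows (jumbo frames are also set in the NIC driver's advanced properties):
ping -f -l 8972 192.168.1.10
# From the FreeNAS shell back towards the desktop:
ping -D -s 8972 192.168.1.20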
 

ChrisReeve

Explorer
Joined
Feb 21, 2019
Messages
91
Update 2: I am considering using a 1TB Adata SX8200 Pro NVMe as a cache drive. But is this really required? My usage consists mostly of sustained sequential reads/writes.

I can also upgrade to 128GB RAM for less money than the SSD. I believe this should be enough to cover most of my needs. And as far as I can tell, there isn't a good way to implement a sequential write cache: an L2ARC only helps reads, and a SLOG only helps synchronous writes - and honestly, I don't know whether a sustained write from one client (think transferring 500GB of video files over 10GbE) is synchronous at all.

My main goal is to be able to offload 500GB+ at sustained 1GB/s+ speeds to the server, but I don't know how to achieve this.
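Before spending money I could probably check whether my SMB writes are even synchronous, and how much ARC I actually use, from the FreeNAS shell. A rough sketch, with "tank" standing in for my pool name:
Code:
# sync=standard means SMB copies are treated as async writes,
# so a SLOG would not speed them up
zfs get sync tank

# Current ARC size in bytes, to compare against installed RAM
sysctl kstat.zfs.misc.arcstats.size

# Per-vdev throughput while a large copy is running
zpool iostat -v tank 1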
 

AVB

Contributor
Joined
Apr 29, 2012
Messages
174
More RAM to start with. Unless something has changed in the last month, the rule of thumb is 1GB of RAM for every TB of storage, so you are a little behind. I was a little behind myself (64GB/72TB) and added another 32GB of RAM, which did increase performance about 5-10% when dealing with large files of 20-40GB. The rest of your questions are past my experience level, but I'm sure somebody here will have an answer for you.
 

ChrisReeve

Explorer
Joined
Feb 21, 2019
Messages
91
More RAM to start with. Unless something has changed in the last month, the rule of thumb is 1GB of RAM for every TB of storage, so you are a little behind. I was a little behind myself (64GB/72TB) and added another 32GB of RAM, which did increase performance about 5-10% when dealing with large files of 20-40GB. The rest of your questions are past my experience level, but I'm sure somebody here will have an answer for you.
I am planning on upgrading to 128GB, but yes, a lot has changed since 1GB/TB was the norm/requirement, based on my research. It depends on the setup, whether you are using deduplication, etc.

There is some discussion on the topic here: https://linustechtips.com/main/topic/738402-zfs-memory-requirements/

Edit: Which I have also experienced myself. I have 100TB raw space on 64GB RAM, but 16GB is dedicated to VMs, so I have effectively been running 100TB raw space on 48GB RAM with no issues whatsoever. Performance is great. Still not saturating my entire 10Gb connection, but still very respectable.
Performance CDM 100TB Server.PNG
 

Rand

Guru
Joined
Dec 30, 2013
Messages
906
So first of all,
if you can reach 1028 MB/s on Q32 streaming write then you are pretty close to reaching 10G speed, so the title is misleading ;)

Then, you are discussing multiple issues:
a.) your 660p is not able to reach expected performance levels
b.) your primary goal to constantly saturate 1GbE to transfer 500 GB of data
c.) unexpectedly high transfer speed on Q32 write

Given that b) is your main goal, let's start with that.
Running a 2 GiB test with Q32 or Q8/T8 is not a realistic test for a simple copy job.
1. You need to determine what kind of files you are copying (think a single 500GB file (= one stream) vs. 500,000 1MB files) - results will be vastly different
2. Then, assuming you will run a single copy job only, you need to look at QD1 and a block size depending on your file size (let's assume sequential here)
3. You need to use a larger sample set to rule out the effect of your ARC (memory), since most of the 2GB can probably be cached before being written to disk (-> c)
4. Now, in order to gauge the array's ability to saturate 1GbE, let's have a look at the theoretical capabilities

N-wide RAIDZ, parity level p:

  • Read IOPS: Read IOPS of single drive
  • Write IOPS: Write IOPS of single drive
  • Streaming read speed: (N – p) * Streaming read speed of single drive
  • Streaming write speed: (N – p) * Streaming write speed of single drive
  • Storage space efficiency: (N – p)/N
  • Fault tolerance: 1 disk per vdev for Z1, 2 for Z2, 3 for Z3 [p]
So in your case it's (10-2) = 8 * single drive performance (provided they are large files = streaming writes; otherwise we're looking at rather bad single-disk IOPS).
So with a 125 MB/s target you need about 15 MB/s per disk - that should be doable.

You should verify whether that's working out (a real test with a file, or a QD1 run with a large sample set in CDM, or something like AJA System Test).
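A quick and dirty local check could be a dd run directly on the pool (sizes and paths below are just examples; note that with lz4 enabled, writing /dev/zero gives fantasy numbers, so use a dataset with compression off or read back a real large file):
Code:
# Sequential write, ~100GiB in 1MiB blocks, on a dataset with compression=off
dd if=/dev/zero of=/mnt/tank/ddtest.bin bs=1M count=100000

# Sequential read of the same file (file should be much larger than RAM,
# otherwise ARC serves most of it)
dd if=/mnt/tank/ddtest.bin of=/dev/null bs=1M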
 

ChrisReeve

Explorer
Joined
Feb 21, 2019
Messages
91
So first of all,
if you can reach 1028 MB/s on Q32 streaming write then you are pretty close to reaching 10G speed, so the title is misleading ;)
Haha, I kind of agree, but I was thinking mainly of write speeds. I am exploring ways to get close to 10Gbps on both streaming reads and writes. And to answer your question, my test files were mainly single files in the 10-50GB range (for my 500GB transfer).

Also, I didn't know that both streaming write and streaming read speeds equal (N-p) * single-drive read/write speed. That should give a theoretical maximum for the pool of (10-2) * 200MB/s = 1600MB/s, am I correct? (According to my testing, single-drive write speed drops from 220MB/s to 180MB/s going from an empty drive to a half-filled one.) Let's drop this to 150MB/s per disk; that still gives me 1200MB/s theoretical for this specific workload, ignoring all other bottlenecks.

Then you are discussing multiple issues
a.) your 660p is not able to reach expected performance levels
Well, yes and no. It performs as expected for the product, just not satisfactorily for me. I should have done some more research on it before I bought it. I have already bought an Adata SX8200 Pro 1TB, which should be able to sustain 1GB/s for at least 600GB (copying from the server to the SSD) according to Tom's Hardware's review of the unit.

b.) your primary goal to constantly saturate 1GbE to transfer 500 GB of data
10GbE*

Given that b) is your main goal, let's start with that.
Running a 2 GiB test with Q32 or Q8/T8 is not a realistic test for a simple copy job.
1. You need to determine what kind of files you are copying (think a single 500GB file (= one stream) vs. 500,000 1MB files) - results will be vastly different
2. Then, assuming you will run a single copy job only, you need to look at QD1 and a block size depending on your file size (let's assume sequential here)

3. You need to use a larger sample set to rule out the effect of your ARC (memory), since most of the 2GB can probably be cached before being written to disk (-> c)
Thank you, I will test later. I already did a quick test in CrystalDiskMark with the maximum file size (32GB) and 9 samples, and got very similar results (900-1000MB/s reads and 600-700MB/s writes on Q32T1), but I don't have the screenshot here. Considering my server has 48GB RAM available to FreeNAS, this might still be boosted by ARC.

4. Now, in order to gauge the array's ability to saturate 1GbE, let's have a look at the theoretical capabilities
So in your case it's (10-2) = 8 * single drive performance (provided they are large files = streaming writes; otherwise we're looking at rather bad single-disk IOPS).
So with a 125 MB/s target you need about 15 MB/s per disk - that should be doable.

You should verify whether that's working out (a real test with a file, or a QD1 run with a large sample set in CDM, or something like AJA System Test).
Again, the goal is 10Gb, but I think 150MB/s per disk is doable for a single stream? Like I mentioned above, my sample files are 10-50GB each. During my first real-world transfer, I saw write speeds to the server close to 500MB/s over the entire 500GB.

For reads though, I seem to have some sort of performance "issue" when pulling a file that I know isn't in ARC. The first transfer of a 23GB test file from the server to my SSD starts off at 200-300MB/s. On subsequent runs this increases to a burst of 800-900MB/s until my SSD's SLC cache is full and the speed drops. This is kind of weird to me. I would expect that for streaming reads I should see speeds far beyond 300MB/s directly from the pool, and not only for cached data.
 

Rand

Guru
Joined
Dec 30, 2013
Messages
906
Ah right, I misread your initial post re primary goal (read 1GbE, not 1 GB/s)

So I don't have real answers unfortunately; for some reason it's very hard to reach 10GbE speeds consistently with ZFS-based systems. The systems I see hitting that are usually capable of much more (theoretically).
Now, have you applied the 10G tuning tips floating around? Those primarily increase network buffers and such, but maybe they help.
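The usual suspects are sysctls along these lines (values are only examples from the commonly shared 10G tuning threads - add them as Tunables in the GUI and test, no guarantee they help in your case):
Code:
# Larger socket buffers / TCP windows (example values, not gospel)
kern.ipc.maxsockbuf=16777216
net.inet.tcp.sendbuf_max=16777216
net.inet.tcp.recvbuf_max=16777216
net.inet.tcp.sendspace=262144
net.inet.tcp.recvspace=262144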

Also, it's always good to measure the actual pool performance (usually with dd here, but you can also use fio, e.g.
Code:
fio --name=seqwrite --direct=1 --refill_buffers --norandommap --randrepeat=0 --group_reporting --ioengine=posixaio --size=100G --bs=1M --iodepth=1 --numjobs=1 --rw=write --filename=/mnt/tank/fio_1.out
)

That gives you a benchmark to work against; of course over the network there are always losses due to latency as well.
How is the CPU looking? You will most likely be doing a single-threaded copy (so a single core), unless newer SMB versions do multithreading for single files now (no idea how much that has improved).
Also, what's the source disk? The SX8200? Or is that going into the target box too?
 

ChrisReeve

Explorer
Joined
Feb 21, 2019
Messages
91
So I don't have real answers unfortunately; for some reason it's very hard to reach 10GbE speeds consistently with ZFS-based systems. The systems I see hitting that are usually capable of much more (theoretically).
Now, have you applied the 10G tuning tips floating around? Those primarily increase network buffers and such, but maybe they help.

Also, it's always good to measure the actual pool performance (usually with dd here, but you can also use fio, e.g.
Code:
fio --name=seqwrite --direct=1 --refill_buffers --norandommap --randrepeat=0 --group_reporting --ioengine=posixaio --size=100G --bs=1M --iodepth=1 --numjobs=1 --rw=write --filename=/mnt/tank/fio_1.out
)

I haven't, but I will definitely test this later today.

Also, what's the source disk? The SX8200? Or is that going into the target box too?
I'm not sure if I follow. When I test server write performance, the source disk is the 660p (to be replaced by the SX8200). When I test server read performance, the target disk is also my 660p (again, to be replaced by the SX8200). But I don't run those tests simultaneously, and I must stress that they are merely standard file copy tests in Windows 10. Not exactly perfect, but it shows real-world performance, I guess.
 

Rand

Guru
Joined
Dec 30, 2013
Messages
906
OK, so the 660p/SX8200 are the drives in your (desktop) PC, and you're transferring to the 10-disk array on a remote box (10G, SMB(?), 2.6GHz remote, 3.7GHz locally).
Sorry, I shouldn't do this between other things without reading properly :/
 

ChrisReeve

Explorer
Joined
Feb 21, 2019
Messages
91
OK, so the 660p/SX8200 are the drives in your (desktop) PC, and you're transferring to the 10-disk array on a remote box (10G, SMB(?), 2.6GHz remote, 3.7GHz locally).
Sorry, I shouldn't do this between other things without reading properly :/
No problem! You have by far been the most helpful user in here, and I really appreciate it! Specs for everything below:

The specs are in my first post, but I have copied them in below, with some additional information:

HW Specs:
Server:
MB: Supermicro X9SRL-F
CPU: E5-2650 v2 (8C/16T, 2.60GHz base, 3.40GHz boost)
RAM: 64GB 1600MHz DDR3
HBAs: 2x LSI 9211-8i flashed to IT-mode
Drives: 10x10TB WD Red (white label, shucked from WD External drives) in one RAIDZ2 pool
NIC: Intel X540-T2 2x10Gb RJ45

Desktop
CPU: Intel i5-9600K (overclocked to 5.0GHz on all cores)
RAM: 16GB 3200MHz
SSD: Intel 660p 1TB
NIC: Asus XG-C100C 1x10Gb RJ45

Switch: Netgear SX10

The setup is as follows: the server is running FreeNAS 11.2-U7 with a more or less default config. I have set a static IP and a few other things, but I haven't done any optimization. My main pool is a single encrypted RAIDZ2 vdev with default compression enabled and deduplication disabled, consisting of 10x10TB WD100EMAZ. Single-drive performance is around 200-220MB/s, dropping towards 180MB/s at 50% filled capacity, which is about where I am today. I run one VM (Ubuntu 19), which has been allocated 4 cores (out of 8) and 16GB RAM. This leaves 48GB RAM available to FreeNAS/ARC. The VM is only running a TeamSpeak server and some game servers, which are only rarely in use, so CPU usage is usually very low.

The shared drive is set up as SMB with default settings.

The Netgear SX10 switch is also set up in a more or less default way. No optimization enabled, except for prioritizing packets from my desktop and NAS over other devices on the network.

My desktop is running the newest version of Windows 10. No network optimization done here either. As stated above, it is running an Asus XG-C100C.

If there is anything else you need to know, please let me know!
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
It would be worth it to monitor the system under load to see if you're hitting maximum utilization on a CPU core or anything like that.
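Something like this from an SSH session while a copy is running is usually enough; both are stock FreeBSD tools, and any specific thresholds mentioned here are just rough guidance:
Code:
# Per-thread CPU usage; watch whether a single smbd thread pins one core
top -SH

# Per-disk activity; %busy pegged near 100 on the data disks suggests the pool
# itself is the limit rather than CPU or network
gstat -p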

CrystalDiskMark implies CIFS, and for CIFS you've really got a less-than-ideal CPU. CIFS performs best on high-frequency CPUs, and you've got a single socket. The E5-16xx v2 family is generally the correct CPU family for that board, although it does accept the much more expensive E5-26xx v2's. Most of the 26xx series offers lower per-core speeds, but does have a few upsides like better cache and the ability to use LRDIMMs.

The E5-1650 v2 (3.5GHz/3.9GHz turbo) or E5-2643 v2 are among the best CPUs you can get in the X9 generation for NAS use.


I am planning on upgrading to 128GB, but yes, a lot has changed since 1GB/TB was the norm/requirement, based on my research. It depends on the setup, whether you are using deduplication, etc.

It really hasn't changed that much. The requirements are softer as you get out to larger values of RAM/pool, but that's really only once you get out into the 64-128GB range. ZFS still requires a lot of RAM for metadata, and write speeds can drop precipitously if you do not have RAM sized appropriately to the workload. Almost no one should be using dedupe, which has an entirely different set of RAM sizing rules.
 

Rand

Guru
Joined
Dec 30, 2013
Messages
906
2.6 GHz on the 2650 is not too bad, but of course higher single-thread performance is better in this case. Keep an eye on single-core utilization during a transfer.

Also, the network optimizations I mentioned include LRO/TSO, which offload things like checksum calculation to the NIC, freeing up the CPU as well - give it a try. You can also increase buffers and queues on the Windows box, although that's usually not necessary any more nowadays.
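On the FreeNAS side you can check and toggle those offloads directly on the interface, e.g. (ix0 here is just an example name):
Code:
# Show current options - look for RXCSUM, TXCSUM, TSO4, LRO in the options line
ifconfig ix0

# Enable offloads (prefix with "-", e.g. -lro, to turn one off again)
ifconfig ix0 rxcsum txcsum tso4 lro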

Another thing to verify is that single-threaded iperf performance is fine - you have already shown that multithreaded is OK, but this is another use case...
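E.g. with iperf3, assuming it is available on both ends (the server IP below is just a placeholder):
Code:
# On the FreeNAS box:
iperf3 -s

# On the Windows desktop - single stream first, then 4 parallel streams for comparison:
iperf3 -c 192.168.1.10 -t 30
iperf3 -c 192.168.1.10 -t 30 -P 4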
 

ChrisReeve

Explorer
Joined
Feb 21, 2019
Messages
91
It would be worth it to monitor the system under load to see if you're hitting maximum utilization on a CPU core or anything like that.

CrystalDiskMark implies CIFS, and for CIFS you've really got a less-than-ideal CPU. CIFS performs best on high-frequency CPUs, and you've got a single socket. The E5-16xx v2 family is generally the correct CPU family for that board, although it does accept the much more expensive E5-26xx v2's. Most of the 26xx series offers lower per-core speeds, but does have a few upsides like better cache and the ability to use LRDIMMs.

The E5-1650 v2 (3.5GHz/3.9GHz turbo) or E5-2643 v2 are among the best CPUs you can get in the X9 generation for NAS use.




It really hasn't changed that much. The requirements are softer as you get out to larger values of RAM/pool, but that's really only once you get out into the 64-128GB range. ZFS still requires a lot of RAM for metadata, and write speeds can drop precipitously if you do not have RAM sized appropriately to the workload. Almost no one should be using dedupe, which has an entirely different set of RAM sizing rules.
Thank you. I wish someone had told me this earlier. One of the reasons for buying the 2650 v2 was to be able to upgrade to a dual-socket motherboard at a later stage, which I now have done. I bought another 2650 v2 (before reading your post) and a Supermicro X9DRL-IF, and am still waiting to receive both items. I might have been able to get away with 2x E5-2643 v2 (12C/24T total), as I am only (initially) planning on running three VMs: FreeNAS, 1x Ubuntu, and 1x Windows 10.

On another note: after doing some quick testing, I tried the other port on my NIC, ix0 instead of ix1, with a static IP configured. To my surprise, I saw transfer speeds exceeding 1GB/s for the first 5-10 seconds before it bottlenecked somewhere. I then switched back to DHCP (I got a new router, a Netgear XR500, and can't decide whether it's better to use DHCP for everything and just use the "reserve IP" function in DumaOS to pin IPs to specific MAC addresses, or to set a static IP on every client), but after switching to DHCP, speeds "dropped" down to 600-700MB/s. This might have been a fluke, but those were the only things I changed.
 

ChrisReeve

Explorer
Joined
Feb 21, 2019
Messages
91
So, another update. I have received my Intel DC P3700 and installed it. First, I changed the sector size to 4kB to increase performance and tested it in my computer. Performance was as expected, with reads in excess of 2.5GB/s and writes well beyond 1GB/s. Then I installed it in my server, set it up as a ZFS pool and made an SMB test folder. Speeds were more or less identical to my main HDD pool (I only tested file sizes up to 40-50GB). I then followed Stux's guide to set it up with three partitions - swap, SLOG and L2ARC - with sizes of 32GB, 20GB, and approximately 250GB for the L2ARC. The weird thing is that this significantly decreased performance, down to around 100-150MB/s for both reads and writes. I removed the SLOG and tested again, with the same poor performance. I then removed the L2ARC, still with the same performance. I then removed the swap partition and rebooted, and performance returned to normal. I don't know why.

I then re-added the SLOG and L2ARC (20GB and 256GB respectively) with success. Performance on a single 25GB file and a 20GB folder with approx. 3000 files was identical to the performance without SLOG and L2ARC. Not surprising, as both are below my RAM size, and since none of the writes are synchronous.
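For anyone following along, attaching existing partitions as SLOG and L2ARC from the shell is essentially the following (partition, device and pool names are placeholders for my setup, and the GUI can do the same):
Code:
# Show the partition layout on the P3700 (nvd0 here - yours may differ)
gpart show nvd0

# Attach the 20GB partition as SLOG and the ~256GB partition as L2ARC
zpool add tank log nvd0p2
zpool add tank cache nvd0p3

# Verify they show up under "logs" and "cache"
zpool status tank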

The big question is: why did performance degrade so much when I added swap to the P3700?
 

ChrisReeve

Explorer
Joined
Feb 21, 2019
Messages
91
New update here.

I have installed a new NVMe in my desktop (an Adata SX8200 Pro 1TB). I have done some file streaming (just copying to and from the server), and have the following results:

I transferred 13 large video files (approx. 20GB per file on average), totaling 269,144MB, in 440 seconds FROM my SSD, over the network, TO my RAIDZ2 pool. That gives an average of 612MB/s. Not bad!

My problem is read speeds from my server. Transferring the same file several times (to force it into ARC) sees speeds climbing towards 900MB/s. Very good. But read speeds on "cold" files on the server vary from 150MB/s to 300MB/s. This is speculation, but I believe they vary based on where the files are stored. Files I stored initially (when the pool was almost empty, so on the outer part of the platters) see speeds around 250-350MB/s, while files I stored at approx. 40% filled capacity see consistently slower speeds, around 140-200MB/s.
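I guess the next step is to read one of these cold files locally on the server to take SMB and the network out of the equation, and to check how full/fragmented the pool thinks it is. Something like this, with the path and pool name being examples only:
Code:
# Local sequential read of a known "cold" file, straight off the pool
dd if=/mnt/tank/media/old_movie.mkv of=/dev/null bs=1M

# How full and how fragmented the pool is
zpool get capacity,fragmentation tank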

My streaming/sequential read speeds should be much higher than this, shouldn't they?
 

soft_reset

Dabbler
Joined
Dec 19, 2019
Messages
11
I don't have a solution, but a very similar problem. I get very high writes with 7 drives in Z2, and then reads only slightly better than a single drive...

It's weird, and even with Windows software RAID on the same drives I got better read results than this. With a hardware RAID controller I got close to 700MB/s sustained read over a 1TB sequential read test, while the same drives give me 120-180MB/s read in Z2, and 400+ MB/s write after the ARC runs out.

At first I thought it was SMB being SMB, but I get the same results when using dd locally.
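To illustrate what I mean by testing locally with dd, the comparison is roughly along these lines (device and path names are placeholders and will differ per system):
Code:
# Raw sequential read from a single member disk, bypassing ZFS entirely
dd if=/dev/da0 of=/dev/null bs=1M count=20000

# Sequential read of a large file from the Z2 pool
dd if=/mnt/tank/testfile.bin of=/dev/null bs=1M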
Here is my reddit post about the problem:

I got horrible read results even from my NVMe SSD (700MB/s, while it easily saturated 20Gbit/s as a Windows share), but I haven't tested that one locally yet.
 

soft_reset

Dabbler
Joined
Dec 19, 2019
Messages
11
Sorry, when I said "hardware RAID", I meant that I tested a hardware RAID card + Windows on the same drives to make sure they work properly. My actual FreeNAS config is in my reddit post and is running from an LSI SAS2008 HBA:

6x4TB WD Red Raid-Z2
NVMe SSD (rated ~3400MB/s read / ~2500MB/s write) used as a test share
Boot SSD connected to a motherboard SATA port
LSI-SAS2008 (Fujitsu branded, flashed to IT-Mode)
32GB DDR3
Xeon E3 1241v3, 4C/8T Haswell
Supermicro X10SLL-F

Mellanox ConnectX3 dual port 10Gbps (in PCIe 2.0 x4, physical x8 slot)
 