Slow NFS read performance over 10 GbE

pyakex

Dabbler
Joined
Aug 14, 2016
Messages
16
Hardware:
Server: TrueNAS-12.0-U1.1:
  • Supermicro X11SPH-NCTPF
  • Intel Xeon Scalable Silver 4210R (10C/20T 2.4/2.3 GHz)
  • 128 GB RAM
  • Storage:
    • Data: 6x Seagate Exos X16 ST16000NM002G 16 TB SAS (RAIDZ2)
    • Log: 1x Intel Optane SSD 900P 280 GB PCI express
    • Spare: 1x Seagate Exos X16 ST16000NM002G 16 TB SAS
    • Metadata: 3x INTEL DC S4610 480GB SATA (mirror)
  • Hard disk controllers:
    • SAS: Onboard 3008
    • SATA: C622
  • Network cards
    • Onboard X722
Client: CentOS 7 3.10.0-1160.11.1.el7.x86_64
  • Dell PowerEdge T20
  • Xeon E3-1225 v3 3.2 GHz
  • 16 GB RAM
  • Mellanox ConnectX-3
Setup:
  • Pool: RAIDZ2, encrypted (AES-256-GCM), compressed (lz4)
  • Network: MTU 9000
  • Number of NFS servers: 20
  • TrueNAS tunables:
    • kern.ipc.maxsockbuf 8388608
    • net.inet.ip.intr_queue_maxlen 2048
    • net.inet.tcp.delayed_ack 0
    • net.inet.tcp.mssdflt 1448
    • net.inet.tcp.recvbuf_inc 524288
    • net.inet.tcp.recvbuf_max 16777216
    • net.inet.tcp.recvspace 524288
    • net.inet.tcp.sendbuf_inc 16384
    • net.inet.tcp.sendbuf_max 16777216
    • net.inet.tcp.sendspace 524288
    • net.route.netisr_maxqlen 2048
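(For anyone reproducing: these are sysctl-type tunables, so the active values can be verified from the TrueNAS shell, e.g.:)
# sysctl kern.ipc.maxsockbuf net.inet.tcp.sendbuf_max net.inet.tcp.recvbuf_max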

Problem description:
Slow sequential read over NFS. I have created a 200 GB file with dd if=/dev/urandom of=test bs=1M count=200k. When I read it locally on the TrueNAS host I get around 850 MB/s:
# dd if=test of=/dev/null bs=1M
214748364800 bytes transferred in 249.877236 secs (859415480 bytes/sec)

However, when I do the same with the file mounted on the client over NFS (default mount opts), I only get around 290 MB/s:
# dd if=test of=/dev/null bs=1M
214748364800 bytes (215 GB) copied, 740,928 s, 290 MB/s

The highest CPU load during the test is around 7-9% on the TrueNAS dashboard, and 20-25% WCPU for nfsd in top.

Problem solving steps:
I have tested the TCP network performance between the machines using iperf3 and see no issues:
# iperf3 -c [TrueNAS IP]
...
[ ID] Interval        Transfer     Bandwidth       Retr
[  4] 0.00-10.00 sec  11.5 GBytes  9.89 Gbits/sec  0    sender
[  4] 0.00-10.00 sec  11.5 GBytes  9.88 Gbits/sec       receiver

# iperf3 -c [TrueNAS IP] -R
...
[ ID] Interval        Transfer     Bandwidth       Retr
[  4] 0.00-10.00 sec  11.5 GBytes  9.90 Gbits/sec  0    sender
[  4] 0.00-10.00 sec  11.5 GBytes  9.90 Gbits/sec       receiver
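(For completeness, a parallel-stream run like the one below would rule out a single-TCP-stream limit; since a single stream already reaches ~9.9 Gbit/s, I don't expect it to change the picture:)
# iperf3 -c [TrueNAS IP] -P 4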

I have tried different settings for the pool (no encryption, no metadata vdev) as well as the default values for the tunables, the number of NFS servers, and the MTU, but I don't see any significant difference in performance.
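(Between runs I flush the client page cache so that repeated NFS reads don't come from local RAM:)
# echo 3 > /proc/sys/vm/drop_caches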

Does anyone have any advice on how to solve this? Or is this performance expected?
 
Joined
Jul 2, 2019
Messages
648
NOTE: I'm no expert in this, so this may not help...

Just looking over your config. The server SAS controller and drives are 12 Gbit/s, so I don't think the problem is there. I think a couple of pieces of additional information would help:
  • Are your 10 GbE connections between the client and the server direct, or do they go through a switch? If they go through a switch, I expect the switch can support 10 GbE between the two ports (it probably can, but you could check).
  • On the client end, are the controller and drive capable of supporting 12 Gbit/s?
 
Joined
Jul 2, 2019
Messages
648
Another thought: see this post: Special Allocation Class vdev & ESXi. @HoneyBadger notes that:

"Before we dig too deep, understand that RAIDZ2 with block storage runs relatively poorly. See "The path to success for block storage" for some details on that, including the thread linked therein about the mirrors vs. RAIDZ performance characteristics. I understand you're trying to run a pool for multiple uses here (media + backups are fine for RAIDZ) but just be aware that this might be where the buck stops if you're chasing VMFS performance."

There may be some thoughts there...
 

pyakex

Dabbler
Joined
Aug 14, 2016
Messages
16
NOTE: I'm no expert in this, so this may not help...

Just looking over your config. The server SAS controller and drives are 12 Gbit/s, so I don't think the problem is there. I think a couple of pieces of additional information would help:
  • Are your 10 GbE connections between the client and the server direct, or do they go through a switch? If they go through a switch, I expect the switch can support 10 GbE between the two ports (it probably can, but you could check).
  • On the client end, are the controller and drive capable of supporting 12 Gbit/s?

Thank you for your replies! The connections are through a switch, but I have tried two different switches (Dell X4012, Ubiquiti XG 16) that both support 10 GbE, and neither has any other significant load. Also, I seem to be able to push close to 10 Gbit/s via iperf3 (see first post), so I wouldn't expect the network to be the problem.

Regarding the controller performance on the client: it is a single SATA drive, so the theoretical max would be 6 Gbit/s. However, I write the data to /dev/null, so I don't think the drive comes into play here. I am not sure what the limiting factor on /dev/null is.
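(As a quick sanity check of the /dev/null path itself, something like the following should show its ceiling; on hardware of this class it typically runs at several GB/s, well above 10 GbE, so I doubt /dev/null is the bottleneck:)
# dd if=/dev/zero of=/dev/null bs=1M count=100k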

Another thought: see this post: Special Allocation Class vdev & ESXi. @HoneyBadger notes that:

"Before we dig too deep, understand that RAIDZ2 with block storage runs relatively poorly. See "The path to success for block storage" for some details on that, including the thread linked therein about the mirrors vs. RAIDZ performance characteristics. I understand you're trying to run a pool for multiple uses here (media + backups are fine for RAIDZ) but just be aware that this might be where the buck stops if you're chasing VMFS performance."

There may be some thoughts there...

I seem to have more performance available from the RAIDZ2, since I get a lot more when reading locally on the TrueNAS server (see first post), but I might be missing something...
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Slow sequential read over NFS. [...] Does anyone have any advice on how to solve this? Or is this performance expected?
You have done a great job of testing things individually. What's your performance if you remove all the tunables?
 

pyakex

Dabbler
Joined
Aug 14, 2016
Messages
16
You have done a great job of testing things individually. What's your performance if you remove all the tunables?

Thank you for your reply!
I removed the tunables, rebooted the server for good measure, and flushed the NFS cache on the client, and got very similar results (282 MB/s):
# dd if=test of=/dev/null bs=1M
214748364800 bytes (215 GB) copied, 760,263 s, 282 MB/s

I also reran the same test on the server to make sure nothing had changed there. It was slightly slower this time (782 MB/s), but still much faster than over NFS:
# dd if=test of=/dev/null bs=1M
214748364800 bytes transferred in 274.384381 secs (782655209 bytes/sec)
 
Joined
Jul 2, 2019
Messages
648
I wonder, could it be PCI bus contention?
 

pyakex

Dabbler
Joined
Aug 14, 2016
Messages
16
I wonder, could it be PCI bus contention?
Would you have any good method for testing this?

I tried running dd on the server while simultaneously running iperf3 on the client, receiving data from the server, but neither seemed to be affected by the other:
dd on TrueNAS (779 MB/s):
# dd if=test of=/dev/null bs=1M
214748364800 bytes transferred in 275.390656 secs (779795394 bytes/sec)

iperf3 on the client, receiving from TrueNAS:
# iperf3 -c grid.data.gp3.asymptotic.ai -t 300 -R
...
[ ID] Interval         Transfer    Bandwidth
[  4] 0.00-284.69 sec  0.00 Bytes  0.00 bits/sec   sender
[  4] 0.00-284.69 sec  328 GBytes  9.89 Gbits/sec  receiver

I also kept an eye on the number of interrupts while running this, using vmstat -i -w 1, which showed approximately 37k interrupts/second.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Would you have any good method for testing this?

I tried running dd on the server while simultaneously running iperf3 on the client, receiving data from the server, but neither seemed to be affected by the other:
dd on TrueNAS (779 MB/s):
# dd if=test of=/dev/null bs=1M
214748364800 bytes transferred in 275.390656 secs (779795394 bytes/sec)

iperf3 on the client, receiving from TrueNAS:
# iperf3 -c grid.data.gp3.asymptotic.ai -t 300 -R
...
[ ID] Interval         Transfer    Bandwidth
[  4] 0.00-284.69 sec  0.00 Bytes  0.00 bits/sec   sender
[  4] 0.00-284.69 sec  328 GBytes  9.89 Gbits/sec  receiver

I also kept an eye on the number of interrupts while running this, using vmstat -i -w 1, which showed approximately 37k interrupts/second.
Have you read through the 10gig thread? I'm on mobile right now so it's kinda hard to link but just do a quick search. I think there is sometimes some tuning needed to get full speed with some network drivers. I would also suggest disabling jumbo frames until the end. That will usually get you the last 10% but it's not the big ticket performance improver.
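(On the CentOS side, disabling jumbo frames would be something like the following, with <iface> as a placeholder for the 10 GbE interface; match the MTU on the TrueNAS interface too, then re-test:)
# ip link set dev <iface> mtu 1500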
 

pyakex

Dabbler
Joined
Aug 14, 2016
Messages
16
Have you read through the 10gig thread? I'm on mobile right now so it's kinda hard to link but just do a quick search. I think there is sometimes some tuning needed to get full speed with some network drivers. I would also suggest disabling jumbo frames until the end. That will usually get you the last 10% but it's not the big ticket performance improver.
Are you referring to this thread? https://www.truenas.com/community/threads/10-gig-networking-primer.25749/
I've been searching fairly extensively, but all the concrete tips seem to relate to the tunables and the number of NFS servers, which I have already tried modifying. Do you know of any more specific things I should look for?

However, I have found some people having issues with the X722 NIC, but also some who had success with it. I ordered a Chelsio T520-CR, which I can try if I don't find a solution before it arrives.
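One client-side thing I can still try is forcing larger NFS transfer sizes instead of the defaults; a sketch, with the export path as a placeholder:
# mount -t nfs -o vers=3,rsize=1048576,wsize=1048576 [TrueNAS IP]:/mnt/tank /mnt/tank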
 

Rand

Guru
Joined
Dec 30, 2013
Messages
906
You might want to check whether it's protocol-dependent or not. The same test with an SMB mount should tell.
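(On the CentOS client that could look something like this; the share name and options are placeholders:)
# mount -t cifs //[TrueNAS IP]/tank /mnt/smb -o guest,vers=3.0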
 

pyakex

Dabbler
Joined
Aug 14, 2016
Messages
16
You might want to check whether it's protocol-dependent or not. The same test with an SMB mount should tell.
Thank you, good idea!

I ran the test via SMB and found it to be slightly faster than NFS, but not by much (315 MB/s):
# dd if=test of=/dev/null bs=1M
214748364800 bytes (215 GB) copied, 681,811 s, 315 MB/s
 

Rand

Guru
Joined
Dec 30, 2013
Messages
906
So it seems to be the network in general and not NFS in particular...
What's your sync option on the dataset?

Unfortunately, it's quite normal to lose a lot of performance over the network, and it's difficult to say what amount is usual and what's uncommon.
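Checking it is a one-liner on the server (dataset name illustrative):
# zfs get sync tank/dataset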
 

pyakex

Dabbler
Joined
Aug 14, 2016
Messages
16
So it seems to be the network in general and not NFS in particular...
What's your sync option on the dataset?

Unfortunately, it's quite normal to lose a lot of performance over the network, and it's difficult to say what amount is usual and what's uncommon.
I haven't touched the sync settings so they are "standard".

I would expect some loss, but this is almost a 70% reduction, which feels like a bit too much... Also, it is easy to find other people (with at least broadly similar hardware) who reach far higher bandwidth over NFS.
 

Rand

Guru
Joined
Dec 30, 2013
Messages
906
dd is usually a bad test since it's single-threaded and not necessarily a realistic scenario; using fio is always recommended.
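A sketch of what that could look like for a sequential read against the NFS mount (the directory, size, and job count are placeholders; libaio assumes the Linux client):
# fio --name=seqread --rw=read --bs=1M --size=8g --directory=/mnt/tank \
      --numjobs=4 --iodepth=16 --ioengine=libaio --direct=1 --group_reporting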

But I get why a quick dd is a good indicator, so I ran the same test for you in my environment (to see whether my loss percentage is the same, more, or less than yours).

Here are the results:

Locally
dd if=test of=/dev/null bs=1M
204800+0 records in
204800+0 records out
214748364800 bytes transferred in 51.593842 secs (4162286736 bytes/sec)


Remote
dd if=test of=/dev/null bs=1M
204800+0 records in
204800+0 records out
214748364800 bytes (215 GB, 200 GiB) copied, 216.044 s, 994 MB/s

Given that I have a 6-disk Z2 array, it's safe to assume the 200 GB were actually cached in memory, considering the speed.
However, note that the local read speed was 4.2 times the remote speed, i.e. remote achieved only about 23% of local.

In comparison, your remote speed holds up better than mine, with a factor of only 2.8 (36% of local)...


As long as there is no RDMA-based access, local will always be significantly faster than networked, unfortunately.
 

pyakex

Dabbler
Joined
Aug 14, 2016
Messages
16
dd is usually a bad test since it's single-threaded and not necessarily a realistic scenario; using fio is always recommended.

But I get why a quick dd is a good indicator, so I ran the same test for you in my environment (to see whether my loss percentage is the same, more, or less than yours).

Here are the results:

Locally
dd if=test of=/dev/null bs=1M
204800+0 records in
204800+0 records out
214748364800 bytes transferred in 51.593842 secs (4162286736 bytes/sec)


Remote
dd if=test of=/dev/null bs=1M
204800+0 records in
204800+0 records out
214748364800 bytes (215 GB, 200 GiB) copied, 216.044 s, 994 MB/s

Given that I have a 6-disk Z2 array, it's safe to assume the 200 GB were actually cached in memory, considering the speed.
However, note that the local read speed was 4.2 times the remote speed, i.e. remote achieved only about 23% of local.

In comparison, your remote speed holds up better than mine, with a factor of only 2.8 (36% of local)...


As long as there is no RDMA-based access, local will always be significantly faster than networked, unfortunately.
Thank you so much for your effort in testing this! How much RAM do you have in your system? Is this on a 10 Gbit/s network? Which version of FreeNAS/TrueNAS are you running?

I have made some progress! I tried running FreeNAS 11.3-U5 instead of TrueNAS-12.0-U1.1 with the same setup (except the metadata vdev) and got better results (557 MB/s).

# dd if=test of=/dev/null bs=1M
214282797056 bytes (214 GB) copied, 384,825694 s, 557 MB/s
214748364800 bytes (215 GB) copied, 385,635 s, 557 MB/s

I also tried with a smaller file (25 GB) that should fit in the ARC on the server. I flushed the local NFS cache on the client so as not to read from local RAM:
# echo 3 > /proc/sys/vm/drop_caches
# dd if=test_25G of=/dev/null bs=1M
26843545600 bytes (27 GB) copied, 21,9576 s, 1,2 GB/s
Here I get 1.2 GB/s.
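(On the TrueNAS side, ARC behaviour during a run can be watched with e.g. the following counters, to confirm where reads are served from:)
# sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses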

I also discovered that if I run dd on the 200 GB file on the server at the same time as on the client, I get good speeds (985 MB/s):
Server:
# dd if=test of=/dev/null bs=1M
214748364800 bytes transferred in 220.053926 secs (975889724 bytes/sec)

Client:
# dd if=test of=/dev/null bs=1M
214748364800 bytes (215 GB) copied, 218,083 s, 985 MB/s

So I'm not sure what's going on here. Will reading over NFS affect the caching differently from reading locally?

I will go back to TrueNAS again and see if it behaves similarly.
 

Rand

Guru
Joined
Dec 30, 2013
Messages
906
I am running 12U1.1 on the same board as you, with a Gold 5122 and currently 640G of memory. It's 10G, but on an extra Mellanox CX3 NIC, since I am debugging an issue with the X722 chipset (link issues, not performance).

Probably it's caching the data when you read it once, so the second process does not have to read from disk...
Try the same with two different files to validate.
 

pyakex

Dabbler
Joined
Aug 14, 2016
Messages
16
I am running 12U1.1 on the same board as you, with a Gold 5122 and currently 640G of memory. It's 10G, but on an extra Mellanox CX3 NIC, since I am debugging an issue with the X722 chipset (link issues, not performance).

Probably it's caching the data when you read it once, so the second process does not have to read from disk...
Try the same with two different files to validate.
Then the 200 GB file will probably be fully cached in your case.

As you suspected, that seems to be the case. Reading locally from a different file made the transfer to the client even slower (still on FreeNAS).

So I guess this can be narrowed down to something related to caching when transferring over the network, since:
a) Network performance itself seems OK (iperf3, NFS on small files in the ARC)
b) Local reads seem OK (dd locally on FreeNAS)

Is it expected that reading non-cached files via NFS is far worse than reading non-cached files locally?

I'm also still unclear on why I seem to get noticeably worse performance on TrueNAS compared to FreeNAS.
 

Rand

Guru
Joined
Dec 30, 2013
Messages
906
It's actually not caching but networked I/O that is slow, so yes, anything remote is slower than the same operation locally.

Now, why it's so much slower on TrueNAS CORE vs FreeNAS is an interesting question - I'd recommend opening a ticket about that.
They can tell you why, whether it's to be expected, and what could be done.
I am sure this is of interest to a wide variety of users...
 
Joined
Jul 2, 2019
Messages
648
Thought: Has the driver changed for the NIC?
--Edit--
Thought 2: Could there be additional tunables between FN and TN?
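(One way to check: the X722 uses the ixl(4) driver on FreeBSD, so something like this on both releases should show the driver version:)
# dmesg | grep -i ixl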
 