Replication bottleneck on 10GbE


helloha

Contributor
Joined
Jul 6, 2014
Messages
109
I finally got data replication to my secondary server working, but now I notice that the speed hovers at around 200 MB/s, while my pool can reach speeds of up to 700 MB/s over my 10GbE connection.

I am running low-power CPUs in my setup, and when I run top I see this:

Screen Shot 2017-04-20 at 19.48.31.png


Would the bottleneck be the CPU? I disabled encryption completely and stream compression is set to OFF (since it's mainly incompressible data).

THX!
K.
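(For reference, a raw TCP throughput test between the two machines can rule out the link itself, assuming iperf3 is available on both; `10.0.0.2` below is just a placeholder for the secondary server's address:)

```shell
# On the secondary (receiving) server, start a listener:
iperf3 -s

# On the primary server, run a 30-second test against it:
iperf3 -c 10.0.0.2 -t 30
# A result near 9-10 Gb/s means the link is fine and the bottleneck is elsewhere.
```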
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
How low-power? That CPU certainly looks maxed-out, but it may be waiting for memory or something.
 

helloha

Contributor
Joined
Jul 6, 2014
Messages
109

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
What speed is the RAM running at? It could conceivably be the bottleneck.
 

helloha

Contributor
Joined
Jul 6, 2014
Messages
109
It's running at 1333 MHz. Could this be the issue? Because internally I can max out my pools. I even did stripe tests with 14 disks and could reach speeds of up to 2400 MB/s.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
No, Avoton with dual-channel at 1600MHz can do at least 5Gb/s with SMB, so triple-channel 1333MHz makes a RAM bottleneck unlikely (yes, it's a simplistic comparison, but Atom is a simpler core than Westmere and has less cache).

However, I am somewhat concerned that it's running so fast, since the CPU is only rated for 1066MHz and the microcode might not like that, even if the memory controller works fine (my i7-930 ran fine for years at 1600MHz). Wouldn't be causing your trouble, though.
 

helloha

Contributor
Joined
Jul 6, 2014
Messages
109
I could try experimenting with the E5540s that I have around, but that seems only marginally faster. Is ssh multithreaded? It doesn't seem like it.

But I don't feel like breaking my back again by moving 80-pound servers around... I'll see what I can do, and if not, incremental backups don't take that long anyway...
 

helloha

Contributor
Joined
Jul 6, 2014
Messages
109
Can anyone tell me what happens when you interrupt the data replication? Does it completely start over the next time it runs, or does it work incrementally and continue where it left off?
 

nojohnny101

Wizard
Joined
Dec 3, 2015
Messages
1,477
To my knowledge, replication has to start all over. This is one advantage that rsync has over replication: if rsync is interrupted, it can pick right back up where it left off.
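For what it's worth, ZFS versions with resumable send/receive (OpenZFS from around the FreeBSD 11 era) can continue an interrupted stream, although the built-in replication task may not make use of it. A sketch with hypothetical pool/dataset/host names (`tank/data`, `backup/data`, `backuphost`):

```shell
# Start the receive with -s so an interrupted stream leaves a resume token
# on the destination instead of being discarded:
zfs send tank/data@snap1 | ssh backuphost zfs receive -s backup/data

# After an interruption, read the token from the destination dataset...
token=$(ssh backuphost zfs get -H -o value receive_resume_token backup/data)

# ...and restart the send from where it stopped:
zfs send -t "$token" | ssh backuphost zfs receive -s backup/data
```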
 

helloha

Contributor
Joined
Jul 6, 2014
Messages
109
When I'm sending the snapshot, the ssh command is taking 75% of the CPU. Would it be logical to assume that this also impacts transfer speeds when I use other protocols to pull data off the server simultaneously?

My snapshot takes about 8-10 h to complete at 200 MB/s (11TB). When making copies to a USB SSD that can write 350-400 MB/s, I seem to hit 100 MB/s.
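Since ssh is single-threaded, the cipher can make a real difference on a weak core. One rough way to check is to push a stream of zeros through ssh with different ciphers and compare the rate dd reports (`backuphost` is a placeholder for the secondary server):

```shell
# Compare ssh throughput per cipher; on FreeBSD, dd prints the achieved
# transfer rate on stderr when it finishes (bs=1m is FreeBSD dd syntax).
for c in aes128-ctr aes128-gcm@openssh.com chacha20-poly1305@openssh.com; do
  echo "cipher: $c"
  dd if=/dev/zero bs=1m count=2048 | ssh -c "$c" backuphost 'cat > /dev/null'
done
```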

I was wondering if upgrading the CPU would make much difference. I also have additional CPUs, so I could populate the second socket to see what happens, but I don't know if that would change much either.

Code:
last pid: 32115;  load averages:  2.91,  3.02,  3.05    up 0+06:49:11  18:50:32
52 processes:  5 running, 47 sleeping
CPU:  7.0% user,  0.0% nice, 23.2% system,  3.7% interrupt, 66.1% idle
Mem: 686M Active, 38G Inact, 7231M Wired, 646M Cache, 242M Free
ARC: 5526M Total, 4242M MFU, 996M MRU, 33M Anon, 38M Header, 217M Other
Swap: 44G Total, 452M Used, 44G Free, 1% Inuse, 4K In, 1628K Out

  PID USERNAME   THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
 4922 root         1  94    0 56752K  4812K CPU7    7 283:02  74.27% ssh
 4920 root         1  47    0 12340K  2716K CPU1    1 142:35  37.60% dd
 4919 root         1  52    0 12340K  2720K RUN     6 161:57  36.28% dd
26750 root         1  33    0   146M 15232K CPU0    0  12:37  20.36% afpd
 4917 root         2  27    0 48620K  2368K pipewr  3  54:41  13.57% zfs
 4921 root         1  27    0  9256K  1756K select  2  50:48  12.89% pipewatcher
23911 root         1  22    0   142M  8576K select  3   3:53   3.96% afpd
 2107 root         1 -52   r0  6304K  2272K nanslp  4   1:57   0.20% watchdogd
11839 root         1  20    0 65332K  9752K select  2   3:50   0.10% cnid_dbd
 3619 root         1  52    0   261M 17220K select  0   0:41   0.10% python2.7

Also, I'm not too sure how to read the top output. Is the CPU completely loaded? Because under ZFS Reporting it seems to be only at about 40% load.

zfsgraph.png
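One thing to keep in mind when reading that output: top's idle figure is averaged over all cores, so a single-threaded process like ssh can be pegged on one core while the machine as a whole looks mostly idle. Rough arithmetic, assuming this box has 8 logical CPUs (the `C` column above goes up to 7):

```shell
# ssh's WCPU in top is a percentage of ONE core; divide by the core count
# to get its share of total CPU capacity.
cores=8       # assumption: 8 logical CPUs
ssh_wcpu=74   # ssh's WCPU from top, percent of a single core
echo "$(( ssh_wcpu / cores ))% of total CPU"   # prints "9% of total CPU"
```

So ssh consuming ~74% of one core only shows up as ~9% in the aggregate, which is why the summary line still reports 66% idle even though the replication pipeline is effectively CPU-bound on that one core.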
 