TrueNAS replication performance not what's expected

firesyde424

Contributor
Joined
Mar 5, 2019
Messages
155
I am making a second copy of a large 240TB database via TrueNAS replication, and the performance is significantly lower than what we saw in testing. The system specs are below:
  • Source system
    • Dell PowerEdge R650 - TrueNAS Core 13.0 U1
      • 2 x Xeon Gold 5315Y CPUs, 16 cores total @ 3.2GHz
      • 256GB DDR4 RDIMM @ 2933 MT/s
      • Intel dual port 10GbE, used for management
      • Broadcom dual port 25GbE - used for storage connection
      • Dell HBA355e 4-port 12Gb SAS external HBA - Dell's rebrand of the HBA 9500-16e
      • Pool config - Yes I know it's a bizarre zpool config, there's a long story behind why it's set up this way
        • SLOG: 2 x 1.92TB Dell P5500 Gen 4 NVMe U.2 SSD - Mirror
        • 10 vdevs
          • 5 x 18TB Ultrastar HC550 12Gb SAS HDD in RAIDZ3
          • 10 x 18TB Ultrastar HC550 12Gb SAS HDD as hot spares
  • Destination system
    • TrueNAS Enterprise M60 HA - TrueNAS Enterprise 12.0 U8.1
      • 2 x Xeon Gold 6226R CPUs, 32 cores total @ 2.9GHz
      • 768GB DDR4 RDIMM @ 2933 MT/s
      • Intel dual port 10GbE, used for management
      • Broadcom dual port 25GbE - used for storage connection
      • Not sure what iXsystems uses for the external HBAs in the M60, but it has three four-port 12Gb SAS cards per controller
      • Pool config
        • SLOG: 1 x 32GB Micron NVDIMM
        • 2 vdevs
          • 16 x 15.36TB Samsung PM1633a 12Gb SAS SSD - RAIDZ2

In testing, I was able to replicate snapshots at around 1.8GB/sec, but in practice with the actual data, it's only replicating at between 400 and 550MB/sec. I'm puzzled by the disparity in throughput between testing and production. I've verified that the connection is using the 25GbE NICs. The zpools themselves are encrypted, but I specifically set the replication connection not to use encryption to help speed things along. The only thing I can see is that on the source server, two threads are almost always pegged at 100% utilization, with the other threads idle or nearly idle. Is the replication service single threaded?
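For what it's worth, a zfs send stream is largely a single pipeline per dataset, so one or two pegged threads on the source would be consistent with the send side (or the transport) being the bottleneck rather than the disks or network. A minimal sketch for isolating the raw send speed from the network and the receiver, assuming hypothetical names tank/db for the dataset and snap1 for the snapshot:

```shell
#!/bin/sh
# Sketch: measure raw send throughput with no network or receive side.
# "tank/db@snap1" is a hypothetical snapshot name - substitute your own.
SNAP="tank/db@snap1"

if command -v zfs >/dev/null 2>&1; then
    # pv prints the live streaming rate; discarding to /dev/null keeps
    # the test local. If this also tops out around 400-550MB/sec, the
    # limit is the send pipeline or source pool, not the network.
    zfs send -v "$SNAP" | pv > /dev/null
else
    echo "zfs not found - run this on the TrueNAS source system"
fi

# While replication runs, per-thread CPU usage can be watched with
# FreeBSD top in threads mode:  top -H -P
```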
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
I am certain you thought about this already. But still: What are the differences between the environments and/or the approach taken?
 

firesyde424

Contributor
Joined
Mar 5, 2019
Messages
155
I am certain you thought about this already. But still: What are the differences between the environments and/or the approach taken?
The only difference is that the source system has actual data. It's been shipped out to the customer's site and back. Different locations in our rack, but otherwise everything is the same.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
Do you still have the test system? If so, to exclude hardware issues/differences, could you transfer data there and see whether this changes anything?

If the test did not have real data on it, how did you run the reference performance test? Could some caching, due to "static" data, have been in play?

Edit: Do different rack locations mean different LAN cables and/or ports on the switch?
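One way to take ARC caching out of a re-test is to temporarily restrict the source dataset's primary cache to metadata, forcing data reads to come from disk. A sketch, assuming a hypothetical dataset name tank/db:

```shell
#!/bin/sh
# Sketch: rule ARC out of a replication benchmark.
# "tank/db" is a hypothetical dataset name - substitute your own.
DATASET="tank/db"

if command -v zfs >/dev/null 2>&1; then
    zfs set primarycache=metadata "$DATASET"  # cache metadata only, not data
    # ... re-run the replication or send test here ...
    zfs inherit primarycache "$DATASET"       # restore the inherited default
else
    echo "zfs not found - run this on the TrueNAS system"
fi
```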
 

firesyde424

Contributor
Joined
Mar 5, 2019
Messages
155
The two devices listed in the OP were the test systems. Prior to sending the first one to the customer, these systems were tested to ensure that specific timelines could be met once the first system was returned to us by the customer. For other reasons, that timeline was blown out completely.

The cables are the same, but the ports are different from those used in testing. I did investigate the possibility that the difference in ports might be causing an issue, but that was dispelled by a quick iperf test, which showed that the systems were able to run at approximately 20Gb/sec: roughly 25Gb/sec minus the existing replication traffic and a little overhead.
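The arithmetic behind that estimate can be sanity-checked; a quick sketch, using the observed 550MB/sec replication rate from earlier in the thread:

```shell
#!/bin/sh
# Sanity check: iperf headroom should be roughly line rate minus the
# bandwidth the ongoing replication already consumes.
LINE_GBPS=25    # 25GbE link
REPL_MBS=550    # observed replication rate, MB/s

# MB/s -> Gb/s: multiply by 8 bits/byte, divide by 1000
REPL_GBPS=$(awk -v m="$REPL_MBS" 'BEGIN { printf "%.1f", m * 8 / 1000 }')
HEADROOM=$(awk -v l="$LINE_GBPS" -v r="$REPL_GBPS" 'BEGIN { printf "%.1f", l - r }')

echo "replication uses ~${REPL_GBPS} Gb/s, leaving ~${HEADROOM} Gb/s for iperf"
```

The ~20.6Gb/sec of headroom this predicts lines up with the ~20Gb/sec iperf measured, so the NICs and switch ports look healthy.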
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
What about the data used in the original testing vs. in production?
 