80TB transfer over NFS drops to almost zero throughput

moxxa

Cadet
Joined
Nov 7, 2023
Messages
4
Source NAS
  • TrueNAS-13.0-U5.3
  • Host: Fujitsu PRIMERGY RX25030 M5
  • CPU: Dual Intel(R) Xeon(R) Gold 6234 CPU @ 3.30GHz
  • RAM: 192GB
  • Boot Disk: M.2-SSD
  • Pool Disk: 60x Western Digital Ultrastar 16TB SAS
  • JBOD Shelf: Western Digital 4U60 G2
  • SAS Card: 2x Broadcom MegaRAID 9480-8i8e SAS
  • Network: 2x X710-DA2 2x10Gb SFP

Destination NAS
  • TrueNAS-13.0-U5.3
  • Host: Fujitsu PRIMERGY RX25030 M5
  • CPU: Dual Intel(R) Xeon(R) Gold 6234 CPU @ 3.30GHz
  • RAM: 192GB
  • Boot Disk: M.2-SSD
  • Pool Disk: 90x Western Digital Ultrastar 16TB SAS
  • JBOD Shelf: Western Digital 4U102 G2
  • SAS Card: 2x Broadcom MegaRAID 9480-8i8e SAS
  • Network: 2x X710-DA2 2x10Gb SFP


Hi all, second ever post here - so thank you for hearing me out.

I have built the above two monsters for one of our datacenter product lines, and they are working great. The destination NAS triggers a snapshot and replication task on the Source and copies over into our second datacenter, making full use of our 20Gbps pipe. It is all pure backup data, nothing else. No VMs, no apps running. Just pure raw backup storage.

Just today I have had to transfer around 80-90TB of data to the Source NAS over NFS. It starts at 7.8Gb/s and sustains that for 4-5 hours before dropping to next to nothing for 1-2 hours, then picks back up for another 2 hours at 7.8Gb/s - rinse and repeat.
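For a rough sense of scale, here is a back-of-the-envelope sketch of how long a transfer this size should take if the peak rate held the whole time. It assumes the 7.8 figure is gigabits per second (matching the 20Gbps pipe and the Gbps figures used later in the thread) and decimal terabytes; the function names are just illustrative.

```python
def transfer_hours(size_tb: float, rate_gbit_s: float) -> float:
    """Hours to move size_tb terabytes (10**12 bytes each) at rate_gbit_s gigabits/s."""
    size_bits = size_tb * 1e12 * 8
    return size_bits / (rate_gbit_s * 1e9) / 3600


def gbit_to_gbyte(rate_gbit_s: float) -> float:
    """Convert a rate in gigabits/s to gigabytes/s."""
    return rate_gbit_s / 8


if __name__ == "__main__":
    # Assumption: the sustained peak is 7.8 Gb/s, not 7.8 GB/s.
    for size in (80, 90):
        print(f"{size} TB at 7.8 Gb/s: {transfer_hours(size, 7.8):.1f} h")
    # A 20 Gb/s link cannot carry more than this many gigabytes per second.
    print(f"20 Gb/s link ceiling: {gbit_to_gbyte(20):.2f} GB/s")
```

So at full speed the whole job is roughly a day of transfer time, which is why the 1-2 hour stalls matter so much.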

If I restart the transfer, I get peak speeds again for a few hours, and I get the same if I stop and restart the NFS service on the source NAS.

Network-wise, there is no other traffic going through the TOR switching, and if I do test transfers from other machines DC to DC I get 7-8Gb/s, all while the NASes are stalled. So I'm happy it isn't a network infrastructure issue.


I have searched around the forums and can see posts about NFS performance from back in the FreeNAS days. In one of them, during troubleshooting, the chap found that if he logged into the CLI and ran any command whatsoever, performance picked back up. I just tried it - and slap me down with a wet trout! After running an ls command, transfer speed pops back up to 7.8-8Gb/s.

Thread here: https://www.truenas.com/community/threads/nfs-dies-under-load.14346/


I don't think this thread ever concluded what the issue was, and I guess this won't be too big of an issue once I get the 80TB across, as we won't be using NFS often in this way. Just wondering if anyone else has come across this, or has any ideas on what may be the cause / fix?

All the best.

Mox
 

MrGuvernment

Patron
Joined
Jun 15, 2017
Messages
268
Wouldn't be any type of power saving kicking in on the drives or something, or a temp issue throttling anything?

Buffers or something getting backed up, either in the NICs or in I/O to the drives?

Can you monitor the ARC to see what the cache is doing?

You say about 80TB. Are these mainly large file sets, or a mix of large and tiny files causing massive seek reads and writes?

Since it kicks back in when you go into the CLI and do something, it's not likely to be any switching gear or routing issue...
*scratches head...*
 

moxxa

Cadet
Joined
Nov 7, 2023
Messages
4
Morning Mr,

Thanks for the reply - really appreciate it.

I guess it is relative really. The small files are hundreds of GB each, with the large files being multiple TB each. So in comparison to each other, yes, there are large and small file sets. There are some 3,800 files in the transfer - so you might be onto something.

ARC Size remains constant: 110GB - 111GB
Hit Ratio fluctuates between 99.91 and 96.74 but is up there constantly.
ARC Requests: 24,000 with hits @23,000
Metadata Demand: 1.4m with 79 misses
Prefetch Metadata: Misses seem to spike during the lulls in transfer speed.
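To check whether those prefetch metadata misses really line up with the lulls, one option is to snapshot the ARC counters twice and compare the deltas rather than eyeballing the graphs. Below is a minimal sketch that parses `name: value` text of the kind `sysctl kstat.zfs.misc.arcstats` prints on FreeBSD-based TrueNAS. The counter names used (`hits`, `misses`, `prefetch_metadata_misses`) are what I would expect to see there, so treat them as an assumption and check them against your own output; the sample numbers are made up for illustration.

```python
def parse_arcstats(text: str) -> dict[str, int]:
    """Parse 'kstat.zfs.misc.arcstats.<name>: <value>' lines into {name: value}."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'name: value' line
        key = name.strip().rsplit(".", 1)[-1]  # keep only the last dotted component
        value = value.strip()
        if value.isdigit():
            stats[key] = int(value)
    return stats


def interval_report(before: dict[str, int], after: dict[str, int]) -> dict[str, float]:
    """Hit ratio and prefetch metadata misses between two counter snapshots."""
    hits = after["hits"] - before["hits"]
    misses = after["misses"] - before["misses"]
    total = hits + misses
    return {
        "hit_ratio_pct": 100.0 * hits / total if total else 0.0,
        "prefetch_metadata_misses": (
            after["prefetch_metadata_misses"] - before["prefetch_metadata_misses"]
        ),
    }


# Hypothetical snapshots taken some seconds apart (values invented for the example).
SAMPLE_BEFORE = """\
kstat.zfs.misc.arcstats.hits: 1000000
kstat.zfs.misc.arcstats.misses: 10000
kstat.zfs.misc.arcstats.prefetch_metadata_misses: 500
"""
SAMPLE_AFTER = """\
kstat.zfs.misc.arcstats.hits: 1023000
kstat.zfs.misc.arcstats.misses: 11000
kstat.zfs.misc.arcstats.prefetch_metadata_misses: 900
"""

if __name__ == "__main__":
    report = interval_report(parse_arcstats(SAMPLE_BEFORE), parse_arcstats(SAMPLE_AFTER))
    print(report)
```

Running this on a timer during a fast period and again during a lull would show whether the miss count per interval actually changes.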

Just tapped up my good buddy ChatGPT to help me understand what ARC is and what the effect of missed ARC requests for prefetch metadata would be. Sounds like this could be it.


Just so I can understand the inner workings a bit better - what would cause the transfer to suddenly pick up again after a few hours? Is it just down to the characteristics of the next file in the transfer?
 

MrGuvernment

Patron
Joined
Jun 15, 2017
Messages
268
Ya, the drop and start up again is the head scratcher... why would suddenly doing something in the CLI trigger a process to continue...

Curious, are you able to do a transfer of just the files that are hundreds of gigs in size, leaving out the TB-sized files, just to see? And then try to move some of the TB files and see if the same happens? Might be a useless exercise, but my brain always goes to process of elimination.

Or are there any logs that could be tracked to show when the slowdown / stop happens and what file it was on? Possibly a read lock or something, and it times out trying to copy it, but when you come into the CLI, magic happens?
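If nothing useful turns up in the logs, the stall windows could be pinned down from periodic byte-counter samples and then matched against whichever file was in flight at the time. Here is a minimal sketch, assuming you already collect `(timestamp, cumulative_bytes)` pairs from somewhere (an interface counter, for example); how the samples are collected is left out, and the sample data and threshold are invented for illustration.

```python
def find_stalls(samples, threshold_bits_s):
    """Return (start, end) time windows where throughput was below the threshold.

    samples: list of (unix_time_s, cumulative_bytes) tuples in ascending time order.
    Consecutive slow intervals are merged into a single window.
    """
    stalls = []
    start = None
    end = None
    for (t0, b0), (t1, b1) in zip(samples, samples[1:]):
        if t1 <= t0:
            continue  # skip duplicate or out-of-order samples
        rate_bits_s = (b1 - b0) * 8 / (t1 - t0)
        if rate_bits_s < threshold_bits_s:
            if start is None:
                start = t0  # a new slow window begins
            end = t1
        elif start is not None:
            stalls.append((start, end))  # throughput recovered, close the window
            start = None
    if start is not None:
        stalls.append((start, end))  # still stalled at the last sample
    return stalls


# Hypothetical one-minute samples: about 7.8 Gb/s, then two minutes at about
# 150 Mb/s, then back to about 7.8 Gb/s.
SAMPLES = [
    (0, 0),
    (60, 58_500_000_000),
    (120, 59_625_000_000),
    (180, 60_750_000_000),
    (240, 119_250_000_000),
]

if __name__ == "__main__":
    # Treat anything under 1 Gb/s as a stall.
    print(find_stalls(SAMPLES, 1e9))
```

With timestamps for each stall in hand, it becomes much easier to line them up against file sizes or anything else that was logged at the same moment.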
 

moxxa

Cadet
Joined
Nov 7, 2023
Messages
4
It's a good one, isn't it!

Sounds like we are of a similar mindset with testing. The smaller, hundreds-of-GB file sets are flying through; it seems to lag on the larger TB+ file sets, as though, as you originally said, some kind of buffer is filling up. I may have been mistaken about the CLI bit. At the time I was convinced: every time I did it, the transfer speeds would rocket. However, I haven't been able to re-create it since.

I am beginning to wonder if I was just really lucky / unlucky, depending on the way you look at it, in that I was sending null CLI commands just as a TB+ file set completed and a GB file set started, giving the impression that this kicked something back into life.

Now that I am aware of this speed drop, I have been watching the snapshot transfer process between the two systems, and I am seeing the same behavior. Smaller snapshots are flying; larger TB+ snapshots start well but drop off from 7.8Gbps to 150Mbps until completion.

Feels buffer-related to me, but I have to admit I am no expert in this area.

I turned off (disabled) sync for the target pool just in case, but still the same.

I suppose my issue now is that initially I figured it wouldn't be a problem once this one-off transfer was completed, but now I realize these snapshot transfers are also affected.

I will see if I can get approval on a reboot for some time this week and see what happens.

She's a good-un eh!
 