Poor 10G and SMB Performance on All Flash Array

OldBadger

Cadet
Joined
Feb 23, 2022
Messages
8
I've just moved from an old Thecus NAS (running Rockstor) with 8 x 4TB NL-SAS drives to a Dell R740XD with 8 x 3.84TB SAS SSDs running TrueNAS. Each SSD is rated >12Gbps sequential read/write.

Although I really like the general utility of TrueNAS, I'm struggling to make it perform. The performance should be awesome with this hardware, and yet I'm struggling to get anywhere near the performance of the old spinning drives.

I have the following problems:

Networking
  1. 10Gb networking on TrueNAS Core is slow using iperf (2 and 3). Working through every link I can find on the forum, I can't break 2Gb/s OOTB, and with tuning I can manage 3.9Gb/s UDP and 1.2Gb/s TCP using iperf3 (rough test commands are sketched after this list).
  2. The old Thecus NAS with a 10G network card installed gets > 8Gb/s OOTB.
  3. If I spin up a TrueNAS SCALE VM on exactly the same network and test with iperf, I get 8Gb/s OOTB immediately.
  4. Using a second 10G NIC as a storage network (NFS) back to the hypervisor, I get wire speed on sequential reads in Windows 10 VMs that use it.
  5. Turning on jumbo frames to test from my machine, I can get 8Gb/s on all links, but I can't leave jumbo enabled because many different devices use the shares; jumbo is only practical on the private storage network.
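For reference, the kind of iperf3 runs behind those numbers looks roughly like this (the IP address is a placeholder, not my actual layout):

Code:
# On the TrueNAS box (server side)
iperf3 -s

# From a client: single TCP stream for 30 seconds
iperf3 -c 192.168.1.10 -t 30

# Four parallel TCP streams - if this scales but a single stream
# doesn't, the limit is usually per-stream tuning, not the NIC/switch
iperf3 -c 192.168.1.10 -t 30 -P 4

# UDP at a 9Gb/s offered rate, to check raw throughput and loss
iperf3 -c 192.168.1.10 -t 30 -u -b 9G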
Samba
  1. Performance when copying large files is just _awful_ on TN. It's constrained by the network to start with, at around 220MB/s, but after a while the copy drops to a few MB/s. Check out the two pictures below: the top is the new box, the bottom is the old (apples and oranges, I know - ZFS vs Btrfs). Source files are coming from a PCIe 4.0 NVMe drive.
  2. [Screenshot: large-file SMB copy speed graphs - new TrueNAS box (top) vs old Thecus (bottom)]
  3. When the speed drops, there appears to be nothing much going on at all on the TN side - all graphs show no activity.
Local performance:

Single dd command (1M block size to a 1M-recordsize dataset, compression off, sync on) = 424MB/s, which is disappointing given the hardware, and strange given that CrystalDiskMark from a Windows 11 VM over NFS gets 1000MB/s.
16 parallel dd commands = 162MB/s per thread, so about 2592MB/s total (rough commands sketched below).
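For reference, the tests were along these lines (paths and sizes are illustrative, not the exact commands; compression was off on the test dataset, so /dev/zero is fine as a source):

Code:
# Single sequential writer: 1M blocks, 16GiB total
dd if=/dev/zero of=/mnt/Data/test/ddfile bs=1M count=16384

# 16 parallel writers, each to its own file
for i in $(seq 1 16); do
  dd if=/dev/zero of=/mnt/Data/test/ddfile.$i bs=1M count=4096 &
done
wait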

I don't know if this is a red herring, but I experimented with dedup on a dataset, and practically killed the box at one point trying to move some large files :oops:. I removed the dataset involved and rebuilt it, however, I've not rebuilt the pool or anything. I'm trying to remember if that was the point everything went bad.

I have about 100 tabs open trying to work through all the tuning options, but nothing seems to be making a difference, and I'm hoping someone who knows more can point me in a good direction.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I don't know if this is a red herring, but I experimented with dedup on a dataset, and practically killed the box at one point trying to move some large files :oops:. I removed the dataset involved and rebuilt it, however, I've not rebuilt the pool or anything. I'm trying to remember if that was the point everything went bad.

Dedup can only be expunged by destroying the pool and recreating.
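Put differently, flipping the property off only changes behaviour for new writes. A minimal sketch, with pool/dataset names as placeholders:

Code:
# Only affects blocks written from this point on
zfs set dedup=off tank/dataset

# Blocks already written through the DDT keep their table entries until
# those blocks (or the pool) are destroyed; this shows what's left over
zpool status -D tank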

Also,

Dell R740XD

you haven't really described your hardware but these often come with PERC RAID controllers that are not really compatible. Please see


to see if this includes you, as unexpectedly poor speeds are a common side effect.
 

OldBadger

Cadet
Joined
Feb 23, 2022
Messages
8
Thanks for the quick reply! This is a custom-built R740XD, so it has an HBA - I already put the hardware and switching details in my signature :smile:

I found this very useful article... https://forums.servethehome.com/ind...-card-to-hba330-12gbps-hba-it-firmware.25498/

and used the J7TNV model to ensure all was good in advance, as it comes ready for something like this.

Sounds like I need to sync up with my backup NAS, destroy the pool and rebuild. Sooooo pleased I experimented with dedup...thought I was immune to dedup performance woes with high performance SSDs. Pride before a fall!
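For the sync-then-rebuild step, I'm planning the usual snapshot-and-replicate approach, roughly like this (pool, dataset, and host names are placeholders):

Code:
# Recursive snapshot of the pool before the rebuild
zfs snapshot -r Data@pre-rebuild

# Replicate everything to the backup NAS over SSH
zfs send -R Data@pre-rebuild | ssh backupnas zfs receive -F Backup/Data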

Any hints on how I can debug the poor network performance without jumbo frames?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
and used the J7TNV model to ensure all was good in advance, as it comes ready for something like this.

So I keep hearing slightly different stories on this, and I'm the kind who prefers proof and then only on stuff I've worked with myself, just due to the number of variables. Is this showing up under the MPR driver, or MRSAS?

Sooooo pleased I experimented with dedup...thought I was immune to dedup performance woes with high performance SSDs.

@Stilez has an excellent tale of woes with dedup.


Any hints on how I can debug the poor network performance without jumbo frames?

Debugging hypervisor problems is never fun. Make sure you've reserved sufficient CPU, make sure you're following best practices for ESXi and your switchgear. As tempting as some of the fancy features on the Force10 switches like cut-thru switching are, some of them don't actually help in the ways you might expect. You're actually best off following the guidance for virtualization I posted in these forums years ago,


which has you trying the configuration on bare metal. This is generally instructive, but because you've got Dell's crap-grade Broadcom ethernets, that's not going to work well unless you happen to have a decent card like an Intel X710 or Chelsio 520 laying around that you can slot in. Once it works well on bare metal, then you understand what the actual platform potential is, and you can try to shoot for 50%-75% of that under virtualization.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Hello fellow Badger. A couple of general questions here about the configuration and the VMware integration.

You're running 2x16-core pCPUs and you've assigned 32 vCPUs to a single machine - vSphere is probably having a rough time with CSTP and allocation of resources, since it's trying to prioritize physical over logical cores for scheduling, and we're spanning NUMA nodes. I'd start by chopping that down to something like 8 vCPUs.

Is the HBA passed through to the TrueNAS VM via PCIe passthrough/VT-d/IOMMU?

Re: dedup misadventures - can you run the command zpool status -D yourpoolname and post the output inside of "Code" tags, or attached as a .txt? That will tell us for sure if you've got leftover DDT records.
 

OldBadger

Cadet
Joined
Feb 23, 2022
Messages
8
A fellow badger! Mine is a nickname from my daughter, due to my black and white hair...

Is the HBA passed through to the TrueNAS VM via PCIe passthrough/VT-d/IOMMU?
Yes it is ...
[Screenshots: ESXi PCI device passthrough configuration showing the HBA mapped to the TrueNAS VM]

You're running 2x16-core pCPUs and you've assigned 32 vCPUs to a single machine - vSphere is probably having a rough time with CSTP and allocation of resources, since it's trying to prioritize physical over logical cores for scheduling, and we're spanning NUMA nodes. I'd start by chopping that down to something like 8 vCPUs.
I did this to assign half the CPU power of the machine over to TrueNAS. At high data rates with compression, the VM was comfortably using all the CPU power thrown at it. Maybe I got this wrong, but each socket has 16C/32HT, so I have 32C/64HT to assign. I assumed each vCPU is a HT.
[Screenshot: VM CPU allocation]

Re: dedup misadventures - can you run the command zpool status -D yourpoolname and post the output inside of "Code" tags, or attached as a .txt? That will tell us for sure if you've got leftover DDT records.
Code:
  pool: Data
 state: ONLINE
config:

        NAME                                            STATE     READ WRITE CKSUM
        Data                                            ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/712a789e-8e8d-11ec-af4f-000c29751e28  ONLINE       0     0     0
            gptid/71246293-8e8d-11ec-af4f-000c29751e28  ONLINE       0     0     0
            gptid/7132d710-8e8d-11ec-af4f-000c29751e28  ONLINE       0     0     0
            gptid/713b3569-8e8d-11ec-af4f-000c29751e28  ONLINE       0     0     0
            gptid/71539837-8e8d-11ec-af4f-000c29751e28  ONLINE       0     0     0
            gptid/716db2ac-8e8d-11ec-af4f-000c29751e28  ONLINE       0     0     0
            gptid/717132ca-8e8d-11ec-af4f-000c29751e28  ONLINE       0     0     0
            gptid/7174a594-8e8d-11ec-af4f-000c29751e28  ONLINE       0     0     0

errors: No known data errors

 dedup: no DDT entries

So I keep hearing slightly different stories on this, and I'm the kind who prefers proof and then only on stuff I've worked with myself, just due to the number of variables. Is this showing up under the MPR driver, or MRSAS?
I guess this means MPR...

Code:
mpr0: Found device <c01<SspTarg,Direct>,End Device> <12.0Gbps> handle<0x000a> enclosureHandle<0x0002> slot 0
mpr0: At enclosure level 1 and connector name ( )
mpr0: Found device <c01<SspTarg,Direct>,End Device> <12.0Gbps> handle<0x000b> enclosureHandle<0x0002> slot 1
mpr0: At enclosure level 1 and connector name ( )
mpr0: Found device <c01<SspTarg,Direct>,End Device> <12.0Gbps> handle<0x000c> enclosureHandle<0x0002> slot 2
mpr0: At enclosure level 1 and connector name ( )
mpr0: Found device <c01<SspTarg,Direct>,End Device> <12.0Gbps> handle<0x000d> enclosureHandle<0x0002> slot 3
mpr0: At enclosure level 1 and connector name ( )
mpr0: Found device <c01<SspTarg,Direct>,End Device> <12.0Gbps> handle<0x000e> enclosureHandle<0x0002> slot 4
mpr0: At enclosure level 1 and connector name ( )
mpr0: Found device <c01<SspTarg,Direct>,End Device> <12.0Gbps> handle<0x000f> enclosureHandle<0x0002> slot 5
mpr0: At enclosure level 1 and connector name ( )
mpr0: Found device <c01<SspTarg,Direct>,End Device> <12.0Gbps> handle<0x0010> enclosureHandle<0x0002> slot 6
mpr0: At enclosure level 1 and connector name ( )
mpr0: Found device <c01<SspTarg,Direct>,End Device> <12.0Gbps> handle<0x0011> enclosureHandle<0x0002> slot 7
mpr0: At enclosure level 1 and connector name ( )
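For anyone checking the same thing, one quick way to see which driver claimed the controller is something like:

Code:
# Lines from the mpr(4) or mrsas(4) driver in the boot messages
dmesg | egrep -i '^(mpr|mrsas)'

# pciconf shows the device and which driver instance attached to it
pciconf -lv | grep -B 3 -i 'SAS3'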
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
A fellow badger! Mine is a nickname from my daughter, due to my black and white hair...

Take heart that you've at least got hair left to go white.

I did this to assign half the CPU power of the machine over to TrueNAS. At high data rates with compression, the VM was comfortably using all the CPU power thrown at it. Maybe I got this wrong, but each socket has 16C/32HT, so I have 32C/64HT to assign. I assumed each vCPU is a HT.
[Screenshot: VM CPU allocation]

You're right in that you have 64 threads, but the vSphere scheduler is going to prioritize putting vCPU threads onto full pCPU cores (rather than the "half-core" provided by HT) so it'll be targeting pCPU0, pCPU2, etc. So this makes that 32-vCPU machine end up spanning both physical sockets - you'll have a bunch of bottlenecking on the UPI (cross-processor) link, especially if you have the HBA330 mapped to physical socket 0 PCIe lanes and there's a write process on a thread housed on physical socket 1, which is reading from network/RAM that's attached to physical socket 0 ... back and forth across the UPI link we go, with every round-trip adding latency and hurting performance.

Hence my thought of chopping the vCPU count down and letting it live on a single NUMA node. Bump it up later if you see bottlenecking and actual 100% CPU usage under heavy workloads but definitely start smaller and scale up. With a system full of 12G SAS drives you can definitely make use of more resources than the average bear, but you're starting out a bit high IMO. Plus, this way you actually have some CPU resources left for other non-TrueNAS workloads. If you assigned 32 vCPUs to that VM, everything else (including the vSphere vmkernel, which likes to run on pCPU0/pCPU1) will be battling for thread time with it. That could be causing the network switching to choke up as well if vmkernel and drivers have to jockey for time to run on physical cores, especially when we're talking about multiple 10Gbps network links.
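If you want to see whether that's actually happening, esxtop on the host is one way to check (column names are from memory and can vary a little between ESXi versions):

Code:
# From an SSH session on the ESXi host
esxtop
# 'c' = CPU view: a high %CSTP value on the TrueNAS VM means its vCPUs
#       are stuck waiting to be co-scheduled - a classic oversized-VM symptom
# 'm' = memory view, then 'f' to add the NUMA fields: an N%L figure well
#       below 100 means the VM is pulling memory from the remote NUMA node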

dedup: no DDT entries

That's the magical line I wanted to see - you successfully killed off any records that referenced deduplication.
 

OldBadger

Cadet
Joined
Feb 23, 2022
Messages
8
Debugging hypervisor problems is never fun.
I'm not sure the problem is at the hypervisor level - as I mentioned, I was able to spin up a TN SCALE VM and got good speed OOTB on the network. I need to finish my backup sync (it got a bit dated when I switched from Rockstor to TN), rebuild the pool, and try SCALE now that we actually have a full release as of today.
Once it works well on bare metal, then you understand what the actual platform potential is, and you can try to shoot for 50%-75% of that under virtualization.
I did actually do this, and made a record of the local performance stats. Most of the results I got were within a small margin of each other.

Hence my thought of chopping the vCPU count down and letting it live on a single NUMA node
Reduced to 12 vCPUs, as multiple rsync jobs were pushing 100% on 8.

I tried copying two files at once via SMB, and the slowdown to a couple of MB/s happened much sooner (within a few seconds). Honestly, I think rebuilding the pool at this stage is a sensible thing to do, but I'd really like to know for sure why this happened.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I'm not sure

Well, the thing I'm saying is "get sure."

Virtualization is making a house of cards resting on top of another house of cards. Interactions are not always obvious or simplistic. Every virtualization admin has horror stories.
 

OldBadger

Cadet
Joined
Feb 23, 2022
Messages
8
Well, the thing I'm saying is "get sure."
I hear you - that's why I tried the TN SCALE VM using the exact same network layout, and it behaved very differently from TN Core. I need to test SCALE with the HBA combination (exhaustively) next.
Incidentally, the network where jumbo frames behave fine is on the Broadcom; the card having issues is an Intel X540. I should add that to my signature.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
So I keep hearing slightly different stories on this, and I'm the kind who prefers proof and then only on stuff I've worked with myself, just due to the number of variables. Is this showing up under the MPR driver, or MRSAS?
MPR, certainly. Give me a few minutes and I'll spin up FreeBSD on an idle Dell R6515 and confirm the HBA330 Mini as well.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
HBA330 Mini, mpr driver:

[Screenshot: FreeBSD boot output on the R6515 showing the HBA330 Mini attached by the mpr driver]

Apologies for the screenshot, had to take it via VNC to a workstation local to the server, running the HTML5 iKVM.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
MPR should be 100% solid. Make sure you're using firmware 16.00.12.00 available here:


I have no immediate ideas on the network issue.
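To confirm what the card is actually running from the TrueNAS side, something along these lines should do it (controller index 0 assumed):

Code:
# The mpr driver reports the firmware version in the boot messages
dmesg | grep -i mpr | grep -i firmware

# On CORE (FreeBSD) the driver also exposes it as a sysctl
sysctl dev.mpr.0.firmware_version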
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
The PCI form-factor card may still need to be crossflashed to take stock LSI firmware, though. The mini card needs Dell firmware, unfortunately.

Dell's firmware is something like 16.17.00.00, which has me wondering just how closely they're tracking upstream fixes from LSI before adding their junk moderately-useful features.
 

Jessep

Patron
Joined
Aug 19, 2018
Messages
379
The PCI form-factor card may still need to be crossflashed to take stock LSI firmware, though. The mini card needs Dell firmware, unfortunately.

Dell's firmware is something like 16.17.00.00, which has me wondering just how closely they're tracking upstream fixes from LSI before adding their junk moderately-useful features.

Code:
Important Information
This release is urgent only for Linux operating systems. There are no fixes or changes for Windows or VMWare operating systems.

Downgrading to earlier firmware versions must be done via Dell Update Package (DUP) run on the server operating system. Downgrading of HBA firmware will be blocked on newer versions of Lifecycle Controller. If your boot device is a T10-PI enabled drive and you downgrade, you could possibly render your system unbootable.

Firmware Package Version 16.17.01.00 includes:
Core Firmware Version : 16.00.11.00
BIOS version : 8.37.02.00
UEFI Driver Version :18.00.03.00
 

OldBadger

Cadet
Joined
Feb 23, 2022
Messages
8
Already on latest...
[Screenshot: installed HBA330 firmware version]
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Ah, very interesting, I hadn't actually dug that far. So they're still using 16.00.11.
 

OldBadger

Cadet
Joined
Feb 23, 2022
Messages
8
Well - I just started the TN SCALE VM, passed the HBA through, and imported the pool, and everything came up perfectly.

I get 1GB/s on SMB copies and I can't "break" the ZFS cache (i.e. it's flushing writes out of cache just as fast as the copy can bring them in). With sync=always, I'm getting between 600MB/s and 850MB/s.
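(The sync toggle is just the standard dataset property, in case anyone wants to repeat the test; the dataset name is a placeholder:)

Code:
# Force all writes on the share's dataset to be synchronous for the test
zfs set sync=always Data/share

# Revert to default behaviour afterwards
zfs set sync=standard Data/share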

I think I know the future direction of this build - now to do some soak testing.
 

OldBadger

Cadet
Joined
Feb 23, 2022
Messages
8
How many cores did you give the SCALE VM?
12, and it looks to have handled the load pretty well so far.

NFS reads are about twice as fast on Scale (1947MB/s), but the writes are much slower (260MB/s), so I have some tuning to do on the NFS side.
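First stop on the tuning list is the server side; a couple of quick checks on SCALE (Linux), assuming nfsstat is available:

Code:
# Number of kernel NFS server threads currently running
# (set via the NFS service settings in the SCALE UI)
cat /proc/fs/nfsd/threads

# Per-operation server statistics, handy for spotting retransmits
nfsstat -s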
 