Poor 10G and SMB Performance on All Flash Array

OldBadger · Feb 23, 2022

I've just moved from an old Thecus NAS (running Rockstor) with 8*4TB NL SAS drives to a Dell R740XD with 8 * 3.84TB SAS SSD drives running TrueNAS. Each SSD is rated > 12Gbps sequential read/write.

Although I really like the general utility of TrueNAS, I'm struggling to make it good. The performance should be awesome, and yet, I'm struggling to get near the performance of the spinning drives.

I have the following problems:

Networking

10GB networking on the TrueNAS Core is slow using iperf(2/3). Looking through all the links I can on the forum, I can't break 2Gb/s OOTB, and with tuning, can manage 3.9Gb/s UDP and 1.2Gb/s TCP using iperf3.
Using the old Thecus NAS with a 10G network card in, I get > 8Gb/s OOTB.
Spin up a TrueNAS Scale VM with exactly the same network and test iperf, and I get 8Gb/s OOTB immediately.
Using a 2nd 10G NIC for a storage network (NFS) back to the hypervisor, and I get wire speed on the sequential read in Windows 10 VMs using it.
Turn on jumbo to test from my machine, and I can get 8Gb/s on all links, but I can't enable jumbo because many devices use the shares. Jumbo is only usable for the private storage network.

Samba

Performance of copying large files is just _awful_ on TN. It's obviously constrained by the network to start with at around 220MB/s, but after a while copying it drops to a few MB/s. Check out the 2 pictures. The top is the new, the bottom is the old (apples and oranges I know ZFS vs BTRFS). Source files coming from NVME PCI4 .
When the speed drops, there appears to be nothing much going on at all on the TN. All graphs show no activity.

Local performance:

Single dd command (1M BS to 1M RS, no compression, sync) = 424MB/s (disappointing given the hardware, and strange that crystal disk mark from a Windows 11VM over NFS gets 1000MB/s).
16 parallel dd commands = 162MB/s PER thread, so 2592MB/s total.

I don't know if this is a red herring, but I experimented with dedup on a dataset, and practically killed the box at one point trying to move some large files

. I removed the dataset involved and rebuilt it, however, I've not rebuilt the pool or anything. I'm trying to remember if that was the point everything went bad.

I have about 100 tabs open trying to work through all the tuning options, but nothing seems to be making a difference, and I'm hoping someone who knows more can point me in a good direction.

jgreco · Feb 23, 2022

OldBadger said:
I don't know if this is a red herring, but I experimented with dedup on a dataset, and practically killed the box at one point trying to move some large files . I removed the dataset involved and rebuilt it, however, I've not rebuilt the pool or anything. I'm trying to remember if that was the point everything went bad.

Dedup can only be expunged by destroying the pool and recreating.

Also,

OldBadger said:
Dell R740XD

you haven't really described your hardware but these often come with PERC RAID controllers that are not really compatible. Please see

What's all the noise about HBAs, and why can't I use a RAID controller?

1) An HBA is a Host Bus Adapter. This is a controller that allows SAS and SATA devices to be attached to, and communicate directly with, a server. RAID controllers typically aggregate several disks into a Virtual Disk abstraction of some sort...

www.truenas.com

to see if this includes you, as unexpectedly poor speeds are a common side effect.

OldBadger · Feb 23, 2022

Thanks for the quick reply! This is a custom built R740XD so it has an HBA - I already put the hardware and switching details in my signature

I found this very useful article... https://forums.servethehome.com/ind...-card-to-hba330-12gbps-hba-it-firmware.25498/

and used the J7TNV model to ensure all was good in advance, as it comes all ready for something this.

Sounds like I need to sync up with my backup NAS, destroy the pool and rebuild. Sooooo pleased I experimented with dedup...thought I was immune to dedup performance woes with high performance SSDs. Pride before a fall!

Any hints on how I can debug the poor network performance without jumbo frames?

jgreco · Feb 23, 2022

OldBadger said:
and used the J7TNV model to ensure all was good in advance, as it comes all ready for something this.

So I keep hearing slightly different stories on this, and I'm the kind who prefers proof and then only on stuff I've worked with myself, just due to the number of variables. Is this showing up under the MPR driver, or MRSAS?

OldBadger said:
Sooooo pleased I experimented with dedup...thought I was immune to dedup performance woes with high performance SSDs.

@Stilez has an excellent tale of woes with dedup.

My experiments in building a home server capable of handling fast + consistent deduplication

AIM: To help people looking at deduplication on TrueNAS 12+, what I've found on the way making it work on mine. On sustained mixed loads, such as 50GB+ file copies and multiple transfers, using TrueNAS 12 with a deduped pool and default config...

www.truenas.com

OldBadger said:
Any hints on how I can debug the poor network performance without jumbo frames?

Debugging hypervisor problems is never fun. Make sure you've reserved sufficient CPU, make sure you're following best practices for ESXi and your switchgear. As tempting as some of the fancy features on the Force10 switches like cut-thru switching are, some of them don't actually help in the ways you might expect. You're actually best off following the guidance for virtualization I posted in these forums years ago,

"Absolutely must virtualize FreeNAS!" ... a guide to not completely losing your data.

[---- 2018/02/27: This is still as relevant as ever. As PCIe-Passthru has matured, fewer problems are reported. I've updated some specific things known to be problematic ----] [---- 2014/12/24: Note, there is another post discussing how to deploy a small FreeNAS VM instance for basic file...

www.truenas.com

which has you trying the configuration on bare metal. This is generally instructive, but because you've got Dell's crap-grade Broadcom ethernets, that's not going to work well unless you happen to have a decent card like an Intel X710 or Chelsio 520 laying around that you can slot in. Once it works well on bare metal, then you understand what the actual platform potential is, and you can try to shoot for 50%-75% of that under virtualization.

HoneyBadger · Feb 23, 2022

Hello fellow Badger. A couple general questions here about the configuration and the VMware integration.

You're running 2x16-core pCPUs and you've assigned 32 vCPUs to a single machine - vSphere is probably having a rough time with CSTP and allocation of resources, since it's trying to prioritize physical over logical cores for scheduling, and we're spanning NUMA nodes. I'd start by chopping that down to something like 8 vCPUs.

Is the HBA passed through to the TrueNAS VM via PCIe passthrough/VT-d/IOMMU?

Re: dedup misadventures - can you run the command zpool status -D yourpoolname and post the output inside of "Code" tags, or attached as a .txt? That will tell us for sure if you've got leftover DDT records.

OldBadger · Feb 23, 2022

A fellow badger! Mine is nickname from my daughter due to having black and white hair...

HoneyBadger said:
Is the HBA passed through to the TrueNAS VM via PCIe passthrough/VT-d/IOMMU?

Yes it is ...

HoneyBadger said:
You're running 2x16-core pCPUs and you've assigned 32 vCPUs to a single machine - vSphere is probably having a rough time with CSTP and allocation of resources, since it's trying to prioritize physical over logical cores for scheduling, and we're spanning NUMA nodes. I'd start by chopping that down to something like 8 vCPUs.

I did this to assign half the CPU power of the machine over to TrueNAS. At high data rates with compression, the VM was comfortably using all the CPU power thrown at it. Maybe I got this wrong, but each socket has 16C/32HT, so I have 32C/64HT to assign. I assumed each vCPU is a HT.

HoneyBadger said:
Re: dedup misadventures - can you run the command zpool status -D yourpoolname and post the output inside of "Code" tags, or attached as a .txt? That will tell us for sure if you've got leftover DDT records.


  pool: Data
 state: ONLINE
config:
        NAME                                            STATE     READ WRITE CKSUM
        Data                                            ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/712a789e-8e8d-11ec-af4f-000c29751e28  ONLINE       0     0     0
            gptid/71246293-8e8d-11ec-af4f-000c29751e28  ONLINE       0     0     0
            gptid/7132d710-8e8d-11ec-af4f-000c29751e28  ONLINE       0     0     0
            gptid/713b3569-8e8d-11ec-af4f-000c29751e28  ONLINE       0     0     0
            gptid/71539837-8e8d-11ec-af4f-000c29751e28  ONLINE       0     0     0
            gptid/716db2ac-8e8d-11ec-af4f-000c29751e28  ONLINE       0     0     0
            gptid/717132ca-8e8d-11ec-af4f-000c29751e28  ONLINE       0     0     0
            gptid/7174a594-8e8d-11ec-af4f-000c29751e28  ONLINE       0     0     0

errors: No known data errors

 dedup: no DDT entries

jgreco said:
So I keep hearing slightly different stories on this, and I'm the kind who prefers proof and then only on stuff I've worked with myself, just due to the number of variables. Is this showing up under the MPR driver, or MRSAS?

I guess this means MPR...

mpr0: Found device <c01<SspTarg,Direct>,End Device> <12.0Gbps> handle<0x000a> enclosureHandle<0x0002> slot 0
mpr0: At enclosure level 1 and connector name (    )
mpr0: Found device <c01<SspTarg,Direct>,End Device> <12.0Gbps> handle<0x000b> enclosureHandle<0x0002> slot 1
mpr0: At enclosure level 1 and connector name (    )
mpr0: Found device <c01<SspTarg,Direct>,End Device> <12.0Gbps> handle<0x000c> enclosureHandle<0x0002> slot 2
mpr0: At enclosure level 1 and connector name (    )
mpr0: Found device <c01<SspTarg,Direct>,End Device> <12.0Gbps> handle<0x000d> enclosureHandle<0x0002> slot 3
mpr0: At enclosure level 1 and connector name (    )
mpr0: Found device <c01<SspTarg,Direct>,End Device> <12.0Gbps> handle<0x000e> enclosureHandle<0x0002> slot 4
mpr0: At enclosure level 1 and connector name (    )
mpr0: Found device <c01<SspTarg,Direct>,End Device> <12.0Gbps> handle<0x000f> enclosureHandle<0x0002> slot 5
mpr0: At enclosure level 1 and connector name (    )
mpr0: Found device <c01<SspTarg,Direct>,End Device> <12.0Gbps> handle<0x0010> enclosureHandle<0x0002> slot 6
mpr0: At enclosure level 1 and connector name (    )
mpr0: Found device <c01<SspTarg,Direct>,End Device> <12.0Gbps> handle<0x0011> enclosureHandle<0x0002> slot 7
mpr0: At enclosure level 1 and connector name (    )

HoneyBadger · Feb 23, 2022

OldBadger said:
A fellow badger! Mine is nickname from my daughter due to having black and white hair...

Take heart that you've at least got hair left to go white.

OldBadger said:
I did this to assign half the CPU power of the machine over to TrueNAS. At high data rates with compression, the VM was comfortably using all the CPU power thrown at it. Maybe I got this wrong, but each socket has 16C/32HT, so I have 32C/64HT to assign. I assumed each vCPU is a HT.
View attachment 53436

You're right in that you have 64 threads, but the vSphere scheduler is going to prioritize putting vCPU threads onto full pCPU cores (rather than the "half-core" provided by HT) so it'll be targeting pCPU0, pCPU2, etc. So this makes that 32-vCPU machine end up spanning both physical sockets - you'll have a bunch of bottlenecking on the UPI (cross-processor) link, especially if you have the HBA330 mapped to physical socket 0 PCIe lanes and there's a write process on a thread housed on physical socket 1, which is reading from network/RAM that's attached to physical socket 0 ... back and forth across the UPI link we go, with every round-trip adding latency and hurting performance.

Hence my thought of chopping the vCPU count down and letting it live on a single NUMA node. Bump it up later if you see bottlenecking and actual 100% CPU usage under heavy workloads but definitely start smaller and scale up. With a system full of 12G SAS drives you can definitely make use of more resources than the average bear, but you're starting out a bit high IMO. Plus, this way you actually have some CPU resources left for other non-TrueNAS workloads. If you assigned 32 vCPUs to that VM, everything else (including the vSphere vmkernel, which likes to run on pCPU0/pCPU1) will be battling for thread time with it. That could be causing the network switching to choke up as well if vmkernel and drivers have to jockey for time to run on physical cores, especially when we're talking about multiple 10Gbps network links.

OldBadger said:
dedup: no DDT entries

That's the magical line I wanted to see, you successfully killed off any records that referenced deduplication.

OldBadger · Feb 23, 2022

jgreco said:
Debugging hypervisor problems is never fun.

I'm not sure the problem is hypervisor level - as I mention, I was able to spin up a TN Scale VM, and got good speed OOTB on the network. I need to finish my backup sync (got a bit dated when I switched from Rockstor to TN), rebuild the pool and try in Scale now we actually have a full release today.

jgreco said:
Once it works well on bare metal, then you understand what the actual platform potential is, and you can try to shoot for 50%-75% of that under virtualization.

I did actually do this. And made a record of the local performance stats. Most of the results I got were within a small tolerance margin of each other.

HoneyBadger said:
Hence my thought of chopping the vCPU count down and letting it live on a single NUMA node

Reduced down to 12vCPU as multiple rsync jobs were pushing to 100% on 8.

I tried copying 2 files together via SMB, and the slow down to a couple of MB/s was much quicker (in a few seconds). Honestly I think rebuilding the pool at this stage is a sensible thing to do, but I'd really like to know for sure why this happened.

jgreco · Feb 23, 2022

OldBadger said:
I'm not sure

Well, the thing I'm saying is "get sure."

Virtualization is making a house of cards resting on top of another house of cards. Interactions are not always obvious or simplistic. Every virtualization admin has horror stories.

OldBadger · Feb 23, 2022

jgreco said:
Well, the thing I'm saying is "get sure."

I hear you - that's why I tried the TBScale VM using the exact same network layout, and it behaved very differently to TN Core. I have to test Scale with the HBA combination (exhaustively) next.
Incidentally, the network with the jumbos behaving fine is the broadcom, the network card having issues in an Intel X540. I should add that to my signature.

Ericloewe · Feb 23, 2022

jgreco said:
So I keep hearing slightly different stories on this, and I'm the kind who prefers proof and then only on stuff I've worked with myself, just due to the number of variables. Is this showing up under the MPR driver, or MRSAS?

MPR, certainly. Give me a few minutes and I'll spin up FreeBSD on an idle Dell R6515 and confirm the HBA330 Mini as well.

Ericloewe · Feb 23, 2022

HBA330 Mini, mpr driver:

Apologies for the screenshot, had to take it via VNC to a workstation local to the server, running the HTML5 iKVM.

jgreco · Feb 23, 2022

MPR should be 100% solid. Make sure you're using firmware 16.00.12.00 available here:

LSI 9300-xx Firmware Update

JoshDW19 submitted a new resource: LSI 9300-xx Firmware Update - Fixes a controller reset issue in LSI 9300-xx HBA cards Hey Community, If you are using an LSI 9300 HBA with FreeNAS or the soon-to-be TrueNAS CORE, you may experience some performance issues causing the controller to reset...

www.truenas.com

I have no immediate ideas on the network issue.

Ericloewe · Feb 23, 2022

The PCI form-factor card may still need to be crossflashed to take stock LSI firmware, though. The mini card needs Dell firmware, unfortunately.

Dell's firmware is something like 16.17.00.00, which has me wondering just how closely they're tracking upstream fixes from LSI before adding their ~~junk~~ moderately-useful features.

Jessep · Feb 23, 2022

Ericloewe said:
The PCI form-factor card may still need to be crossflashed to take stock LSI firmware, though. The mini card needs Dell firmware, unfortunately.

Dell's firmware is something like 16.17.00.00, which has me wondering just how closely they're tracking upstream fixes from LSI before adding their ~~junk~~ moderately-useful features.

Dell HBA330 Adapter firmware version 16.17.01.00 | Driver Details | Dell Australia

Code:

Additional details Show All | Hide All
 Important Information
This release is urgent only for Linux operating systems. There are no fixes or changes for Windows or VMWare operating systems.

Downgrading to earlier firmware versions must be done via Dell Update Package (DUP) run on the server operating system. Downgrading of HBA firmware will be blocked on newer versions of Lifecycle Controller. If your boot device is a T10-PI enabled drive and you downgrade, you could possibly render your system unbootable.

Firmware Package Version 16.17.01.00 includes:
Core Firmware Version : 16.00.11.00
BIOS version : 8.37.02.00
UEFI Driver Version :18.00.03.00

OldBadger · Feb 24, 2022

Already on latest...

Ericloewe · Feb 24, 2022

Ah, very interesting, I hadn't actually dug that far. So they're still using 16.00.11.

OldBadger · Feb 24, 2022

Well - I just started the TN Scale VM and passed the HBA through and imported the pool, and everything came up perfectly.

I get 1GB/s on SMB copies and I can't "break" the ZFS cache (e.g. it's writing it out of cache just as fast as it can copy it in). With sync=always, I'm getting between 600MB/s and 850MB/s.

I think I know the future direction of this build - now to do some soak testing.

HoneyBadger · Feb 24, 2022

Odd. Might just be some weird edge case in the FREEBASE mpr driver or an issue with the virtualization.

How many cores did you give the SCALE VM?

OldBadger · Feb 24, 2022

HoneyBadger said:
How many cores did you give the SCALE VM?

12, and it looks to have handled the load pretty well so far.

NFS reads are about twice as fast on Scale (1947MB/s), but the writes are much slower (260MB/s), so I have some tuning to do on the NFS side.

Important Announcement for the TrueNAS Community.

Poor 10G and SMB Performance on All Flash Array

Cadet

Resident Grinch

Cadet

Resident Grinch

actually does care

Cadet

actually does care

Cadet

Resident Grinch

Cadet

Server Wrangler

Server Wrangler

Resident Grinch

Server Wrangler

Patron

Cadet

Server Wrangler

Cadet

actually does care

Cadet

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Poor 10G and SMB Performance on All Flash Array"

Similar threads