Slow NFS Performance

FireWire

Cadet
Joined
Aug 10, 2023
Messages
4
Hello,

Over the past day or so we have been working on moving a handful of fairly large virtual machines from one Proxmox node to a separate cluster. All of the virtual machines were running off a TrueNAS SCALE NFS share (specs below) and needed to be moved to a Ceph cluster running on the Proxmox cluster.

The first couple of virtual machines moved over quickly, at about 200MB/s, but once we got about 500GB into the transfer everything slowed down to about 10MB/s. I figured maybe the ZFS cache had filled up, but no, it's sitting at 62GB of the 112GB usable (see the arcstat output below). Maybe disk I/O was saturated? Nope, all disks were only at 125KB/s for reads. I ran a couple of dd commands to test writing and reading to the pool and got 1.2GB/s write and 533MB/s read. iperf3 inbound and outbound from all nodes to the TrueNAS system sat at 1Gb/s (there's a hop between the two switches that's limited to 1Gb). I honestly have no idea what else to test, or why this is happening at all. Any help would be appreciated.
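The tests were roughly of this shape (paths and sizes here are illustrative, not the exact commands):

Code:
# write and read tests with dd
dd if=/dev/zero of=testWrite bs=1M count=10240
dd if=testWrite of=/dev/null bs=1M
# network check from a Proxmox node to the TrueNAS box, both directions
iperf3 -c truenas2
iperf3 -c truenas2 -R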

TrueNAS SCALE Version: TrueNAS-SCALE-22.12.0
Server: Dell R740XD
CPU: 2x Xeon Gold 6134
RAM: 128GB
Boot Array: 2x 400GB Intel SSD
HDD: 24x 2.4TB Seagate SAS disks
Network: 6x RJ45 1GbE

Pool
------
Data VDEVs: RAIDZ2; 20 wide
Dedup VDEVs: 1x mirror; 2 wide
Spare VDEVs: 2

Arcstat
---------
admin@truenas2[~]# arcstat
time read miss miss% dmis dm% pmis pm% mmis mm% size c avail
13:30:17 0 0 0 0 0 0 0 0 0 62G 62G 49G
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Hi @FireWire

Welcome to the forums. There's a bit to unpack here, so pardon the length of the comment.

For clarification, you're migrating these machines from your TrueNAS machine to a Ceph cluster on Proxmox?

Can you provide a model number for the Seagate drives? I don't want to assume the model here, but I could hazard a guess that it's the Exos 10E2400 family (ST2400MM0129) hybrid drives, with embedded eMLC NAND.

A 20-wide RAIDZ2 is definitely a wider vdev than is generally recommended, especially for a high-IOPS use case such as VMs over NFS. Generally we'd recommend mirrors for HDDs in this use case, even hybrid ones.

You're using deduplication, which can have significant performance impacts - and if the same Seagate HDDs are being used for the dedup vdev, that is likely the major throttle point, since the deletes (as the VMs migrate off) generate a flood of updates to your deduplication tables. If you're able to open an SSH session to the TrueNAS VM, have a look at the output of gstat -dp and identify your dedup devices. If they're getting hammered with writes and are 100% busy, this is likely the chokepoint.

VM workloads also tend to demand synchronous (or "safe") writes, especially over NFS. This is also a case where a separate log device or "SLOG" could pay dividends - but that would matter during a workload where you're writing to (or running VMs on) the TrueNAS machine.
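As a quick sanity check, you can confirm whether the dataset backing the NFS share is honoring sync requests (the dataset name below is a placeholder for yours):

Code:
# "standard" (the default) honors client sync requests; "always" and "disabled" override it
zfs get sync YourPool/YourDataset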
 

FireWire

Cadet
Joined
Aug 10, 2023
Messages
4
Hey HoneyBadger,

The VMs were running on a standalone Proxmox server, separate from the Proxmox cluster, and the TrueNAS NFS share was used to provide shared file storage between the old standalone server and the new cluster. This was done because Proxmox has yet to implement a way to migrate between clusters.

All drives in the RAIDZ2 pool, including the dedup vdev, are model number DL2400MM0159, while the boot mirror drives are both INTEL_SSDSC2BA400G4R.

Unfortunately, it looks like gstat isn't installed by default. Is iostat a good alternative? I've seen it mentioned across the forums; however, even with virtually no load on the system, its output appears to be the same as under load.

admin@truenas2[~]# sudo gstat -dp
[sudo] password for admin:
sudo: gstat: command not found

Thanks for the tip on the wide vdev; if I ever get a chance to recreate the pool, I'll make sure to keep that in mind. Unfortunately, the data stored on this box is mission-critical and it's unlikely we'll ever be able to reconfigure it. It does look like some data got lost to the void, so backup time! Woo!

I guess I should specify that TrueNAS is running natively on the R740; I know it tends to be picky about running in a VM.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
All drives in the RAIDZ2 pool, including the dedup vdev, are model number DL2400MM0159, while the boot mirror drives are both INTEL_SSDSC2BA400G4R.
That looks like a Dell model number for the ST2400MM0159, so it's the hybrid drive. HDDs aren't suited for use as deduplication devices due to the random I/O patterns of the table updates. Your boot devices (Intel DC S3710) are actually far better suited for this duty, as they're SSDs with inline power-loss protection, giving orders of magnitude faster response times for random I/O.

The knock-on question here though would be "is deduplication actually doing anything for you?" Can you share the output of zpool status -D YourPoolName in CODE tags?

Unfortunately, it looks like gstat isn't installed by default. Is iostat a good alternative? I've seen it mentioned across the forums; however, even with virtually no load on the system, its output appears to be the same as under load.

My fault - I keep slipping back into the FreeBSD default world. iostat will do the trick, but you need to run it in a looping manner, such as sudo iostat 5. Look for your dedup vdevs by their sdX identifiers and see if they're being barraged with small I/O (the -x parameter helps here as well, since it shows more fields, including average read/write request size).
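For example (the sdX names will come from your own system):

Code:
# extended device stats, refreshed every 5 seconds
sudo iostat -x 5
# on the dedup devices, look for %util near 100 combined with small average request sizes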

What's the drive controller/HBA you're using? The 14th-gen Dell servers do have the option of the HBA330 and HBA350i, which are the ideal out-of-the-box options - other HBAs can also work, but hardware RAID cards like the PERC H730 should be avoided.
 

FireWire

Cadet
Joined
Aug 10, 2023
Messages
4
Code:
root@truenas2[~]# zpool status -D Datastore
  pool: Datastore
 state: ONLINE
  scan: scrub repaired 0B in 09:16:51 with 0 errors on Sun Aug  6 09:16:53 2023
config:

        NAME                                      STATE     READ WRITE CKSUM
        Datastore                                 ONLINE       0     0     0
          raidz2-0                                ONLINE       0     0     0
            626510d6-5350-4194-8fd9-b27198fbc434  ONLINE       0     0     0
            bc90d7af-770b-4a95-aedf-b84dde5bbeaa  ONLINE       0     0     0
            77235ec4-5e38-4a86-9ffe-f65456a04484  ONLINE       0     0     0
            a2f6a62a-2d1e-4b31-b1d5-dde9a014cc5f  ONLINE       0     0     0
            4959c23a-c770-4dd6-a8d7-a0e2334e9fce  ONLINE       0     0     0
            11181f8a-95c3-4510-aa0c-f7c61f94e916  ONLINE       0     0     0
            89766e2a-084b-46fd-8f0d-b20b1e2af81f  ONLINE       0     0     0
            19c96154-5d72-4ba9-a153-11665ab0c38c  ONLINE       0     0     0
            b0357687-51aa-4c61-976e-4c5ef4d7ddb5  ONLINE       0     0     0
            125d1fd2-663a-4c3b-821e-a2596e4d6952  ONLINE       0     0     0
            c814a7b8-315f-40cf-9214-6da0ca00eaa9  ONLINE       0     0     0
            92f8faa4-6811-4add-a3b7-a68509fe07cb  ONLINE       0     0     0
            2d74ae82-a55b-4b9a-abe4-1dce3fe8cb64  ONLINE       0     0     0
            554431b8-ebd5-4a5d-9c60-24ecacb15d2c  ONLINE       0     0     0
            6856f1ca-1c97-4a96-9813-55169ee11966  ONLINE       0     0     0
            8aba7e22-0798-4762-ad32-69811a8555c0  ONLINE       0     0     0
            93a461f6-9e24-4d8a-88df-58278f3668bd  ONLINE       0     0     0
            79bb1ee9-86ac-4608-bb0d-a7357fb9adc8  ONLINE       0     0     0
            2835909e-88bf-4b15-9fae-13e04e6d4ea0  ONLINE       0     0     0
            6b5a9fac-3347-4262-b84e-a722d4fcc6c0  ONLINE       0     0     0
        dedup
          mirror-1                                ONLINE       0     0     0
            30544a02-ec42-4460-ad8a-7230858d1477  ONLINE       0     0     0
            0046d3d8-9748-467c-a3b6-1948c0319b50  ONLINE       0     0     0
        spares
          e40eefc6-9924-490e-933a-ad4ee5d7e682    AVAIL   
          a2fab547-633c-422f-b225-300c28e17059    AVAIL   

errors: No known data errors

 dedup: DDT entries 91977, size 493B on disk, 159B in core

bucket              allocated                       referenced         
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    70.3K   5.71G   3.58G   4.03G    70.3K   5.71G   3.58G   4.03G
     2    18.7K   1.79G   1.14G   1.24G    41.1K   4.03G   2.49G   2.71G
     4      755   85.5M   46.0M   49.7M    3.42K    393M    206M    225M
     8       77   6.53M   5.40M   5.72M      684   54.3M   44.8M   47.9M
    16        4     17K      9K   42.7K       91    346K    186K    970K
    32        4      2K      2K   42.7K      159   79.5K   79.5K   1.66M
 Total    89.8K   7.58G   4.77G   5.33G     116K   10.2G   6.32G   7.02G


Here's a Pastebin of iostat output with various dd commands: https://pastebin.com/ycZFuP8p

Drives sdc and sdd are the two disks being used for dedup. I did notice that sdk, which is a drive in the RAIDZ2, was showing high usage during the urandom write tests. A short SMART test on it passed; I've started a long test and will let you know the results once it's finished.
In terms of the HBA, lspci is showing an LSI SAS3008.
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
Hello FireWire. The issue here is that the pool topology you've chosen is inherently going to cause you pain and suffering. Going wider than ~9 drives in a vdev is asking for trouble, not only from an IOPS standpoint (because you only have one vdev) but also from a data-safety standpoint.

Couple that with the use of deduplication? It's going to be a snail and it probably won't work as you expect.

I cannot stress enough how much I feel you should not go into production with a system configured like this... Here are some resources to help you understand.



 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Code:
 dedup: DDT entries 91977, size 493B on disk, 159B in core

bucket              allocated                       referenced         
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    70.3K   5.71G   3.58G   4.03G    70.3K   5.71G   3.58G   4.03G
     2    18.7K   1.79G   1.14G   1.24G    41.1K   4.03G   2.49G   2.71G
     4      755   85.5M   46.0M   49.7M    3.42K    393M    206M    225M
     8       77   6.53M   5.40M   5.72M      684   54.3M   44.8M   47.9M
    16        4     17K      9K   42.7K       91    346K    186K    970K
    32        4      2K      2K   42.7K      159   79.5K   79.5K   1.66M
 Total    89.8K   7.58G   4.77G   5.33G     116K   10.2G   6.32G   7.02G

The good news is that dedup doesn't seem to have been used extensively thus far - with only 91977 entries x 159 bytes, that's just around 14MB of footprint for your ~10G of data. The downside is that because your data vdev is RAIDZ2, you can't just remove the dedup vdev. I'd recommend disabling dedup on any active datasets and not using it on any new ones.
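Turning it off is a one-liner per dataset (the dataset name below is a placeholder); note that it only affects new writes, so existing table entries stick around until those blocks are freed:

Code:
zfs set dedup=off Datastore/vm-share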

With only 10GB of data on this server, I'd strongly suggest moving everything off of it, pausing any further ingest, and reconfiguring the pool. Probably the same 20 data disks, but in a 10x 2-way mirror setup, two more disks as spares, and two high-performance SSDs for SLOG devices - your Intel S3710s might be acceptable short-term, but you'll likely want to investigate something with more horsepower.
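On TrueNAS you'd build this through the UI, but in zpool terms the layout would look roughly like this (the sdX names are placeholders for your actual disks):

Code:
# 10x 2-way mirrors plus 2 hot spares; device names are placeholders
zpool create Datastore \
  mirror sda sdb  mirror sdc sdd  mirror sde sdf  mirror sdg sdh \
  mirror sdi sdj  mirror sdk sdl  mirror sdm sdn  mirror sdo sdp \
  mirror sdq sdr  mirror sds sdt \
  spare sdu sdv
# optional SLOG mirror on fast SSDs:
# zpool add Datastore log mirror sdw sdx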

Your dd write tests also appear to be using an outfile of /root/testWrite which will be a path residing on your boot SSD, so it isn't really testing the pool. You'll need to point to a file under /mnt/Datastore or similar.
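Something along these lines would exercise the pool itself (the path assumes the default SCALE mountpoint for your pool):

Code:
# write ~10GB to the pool; conv=fsync makes dd flush before reporting a speed
dd if=/dev/zero of=/mnt/Datastore/testWrite bs=1M count=10240 conv=fsync
# if compression is enabled, /dev/zero compresses away to nothing - /dev/urandom avoids that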

lspci showing a SAS3008 is good - it's likely the HBA330 then.
 
Joined
Jun 15, 2022
Messages
674
I'll chime in here: @HoneyBadger and @NickF are right in their observations. In summary, the configuration is not well-suited to running most types of VM.
 

FireWire

Cadet
Joined
Aug 10, 2023
Messages
4
Your dd write tests also appear to be using an outfile of /root/testWrite which will be a path residing on your boot SSD, so it isn't really testing the pool. You'll need to point to a file under /mnt/Datastore or similar.
I probably should've specified that the dd tests were run from an LXC container on the Proxmox cluster, using an NFS share attached to the dataset. Sorry.

We have managed to get everything moved off one of the machines and are currently working on re-configuring it. Thank you all for the suggestions.

I've spent the last couple of weeks testing various settings and doing research, and decided that 2x 10-wide RAIDZ2 vdevs with two hot spares and a 2-way mirror for dedup would probably be best. Any suggestions to improve storage efficiency, reliability, or speed?

As for the lack of SLOG devices: we have decided to use async writes. Yes, we are familiar with the risks, and we have multiple tested battery backups, redundant power supplies, and generators on site. We made this decision mostly because the SSDs might not be fast enough to support the speeds we require. It isn't final, and we will be doing a good bit of testing with and without SLOG devices before anything is put into production.
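For reference, the setting we're using amounts to this on the dataset backing the share (dataset name is a placeholder):

Code:
# tell ZFS to ignore sync requests from NFS clients
# (risks losing a few seconds of in-flight writes on a power loss or crash)
zfs set sync=disabled Datastore/vm-share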
 