Very slow R/W speeds

TheUsD

Contributor
Joined
May 17, 2013
Messages
116
Is this represented by the traffic spike at around 1402h-1404h in the interface traffic in your second attached image?
Yes, you are correct.
What operation did you start at 1347h? It spiked up rapidly to 600Mbps and then dropped down rapidly to first about 350Mbps and then 200Mbps. Was this "before pool recreation" using RAIDZ and the latter was the same migration with a stripe pool?
I started a migration (back to RAIDZ; I never tried stripe, only RAIDZ and mirror) and was watching speeds when they dropped. I stopped the operation (after speeds fizzled out) because I forgot to turn off sync like you suggested.

In my experience, and since my home operation is small, NFS is a better choice as it is much more reliable than iSCSI (more specifically, than VMFS on iSCSI), since I do not have a dedicated backup power source like a generator. VMFS is too fragile for my liking. At my 9-5 job, though, my Compellents are using iSCSI with VMFS datastores.

2nd Edit: I forgot to mention that the last two tests (at 1347h and 1402h-1404h) were with the SSDs in RAIDZ. I stopped testing with the 8TBs because I no longer feel this is a disk issue; it seems more like a networking issue.

At this point I have migrated all storage off the TrueNAS, so whatever tests y'all want me to run are fine. There will be no data loss or issues.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Your network iperf tests are good, and your local disk speed tests through dd to datasets with compression off are good - so I think what's happening here is just a classic sync-write bottleneck.

I'd like to see what happens if you configure your 4x SAS HDDs into a stripe, make a dataset with sync=disabled and try cloning/migrating a test VM to them. I suspect you'll have significantly higher write performance doing that.
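If it helps, here's a minimal sketch of that setup from the shell. The pool, disk, and dataset names are all placeholders for your system, and remember a stripe has no redundancy, so only put throwaway test data on it:

```shell
# Build a temporary 4-disk stripe and a dataset with sync writes disabled.
# da0-da3 and "scratch" are hypothetical names; destroy the pool when done.
zpool create scratch da0 da1 da2 da3     # stripe: no raidz/mirror keyword
zfs create scratch/vmtest
zfs set sync=disabled scratch/vmtest     # take the ZIL out of the picture
zfs set compression=off scratch/vmtest   # keep benchmark numbers honest
zfs get sync,compression scratch/vmtest  # verify the settings took
```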

Let's leverage an old script from Adam Leventhal. Open an SSH session and use vi or nano to create a file called dirty.d, then paste this text into it. You could also create it on a client system and copy it to your pool, but you'll then need to point dtrace at that same pool directory.

Code:
txg-syncing
{
        this->dp = (dsl_pool_t *)arg0;
}

txg-syncing
/this->dp->dp_spa->spa_name == $$1/
{
        printf("%4dMB of %4dMB used", this->dp->dp_dirty_total / 1024 / 1024,
            `zfs_dirty_data_max / 1024 / 1024);
}


Then from a shell do dtrace -s dirty.d YourPool and wait. You'll see a bunch of lines that look like the following:

Code:
dtrace: script 'dirty.d' matched 2 probes
CPU     ID                    FUNCTION:NAME
  4  56342                 none:txg-syncing   62MB of 4096MB used
  4  56342                 none:txg-syncing   64MB of 4096MB used
  5  56342                 none:txg-syncing   64MB of 4096MB used


Start up a big copy and watch the numbers in the "X MB of Y MB used" area. If it looks like you're using more than 60% of your outstanding dirty data limit, you're getting write-throttled. Based on your "transfer rate drops," what I expect is that X will rapidly approach Y and then hover a few dozen MB below it.
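As a convenience (not part of the original script), the dtrace output can be piped through a small awk filter that computes the percentage and flags lines past the default 60% throttle point. Save it as, say, throttle-check.sh; the field positions assume the exact "62MB of 4096MB used" format shown above:

```shell
#!/bin/sh
# Hypothetical helper. Usage: dtrace -s dirty.d YourPool | sh throttle-check.sh
# Flags any txg whose dirty data exceeds 60% of zfs_dirty_data_max.
awk '/txg-syncing/ {
    used = $(NF-3); max = $(NF-1)             # e.g. "62MB" and "4096MB"
    sub(/MB/, "", used); sub(/MB/, "", max)   # strip the MB suffixes
    pct = 100 * used / max
    flag = (pct >= 60) ? "  <-- throttled" : ""
    printf "%sMB of %sMB (%.1f%%)%s\n", used, max, pct, flag
}'
```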

Regarding NFS vs iSCSI and VMFS stability, check out the resource "Why is ESXi NFS so slow, and why is iSCSI faster?" here by @jgreco

https://www.truenas.com/community/r...-esxi-nfs-so-slow-and-why-is-iscsi-faster.40/

The "out of the box" settings have NFS being "slow and safe" with iSCSI being "fast and dangerous" - but notably you can make either one achieve a desired end-state (including "relatively fast but still safe") with some tuning to your sync setting and presence of a proper SLOG device.
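For reference, the sync knob lives at the dataset (or zvol) level; a sketch with placeholder names, noting that sync=disabled trades safety for speed while an SLOG plus sync=always is the "fast but still safe" combination:

```shell
zfs set sync=standard Pool-01/nfs-share   # default: honor client sync requests
zfs set sync=always   Pool-01/iscsi-zvol  # treat every write as sync (wants an SLOG)
zfs set sync=disabled Pool-01/scratch     # fastest, loses in-flight writes on power cut
zfs get -r sync Pool-01                   # review current settings across the pool
```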
 

TheUsD

Contributor
Joined
May 17, 2013
Messages
116
Code:
txg-syncing
{
        this->dp = (dsl_pool_t *)arg0;
}

txg-syncing
/this->dp->dp_spa->spa_name == $$1/
{
        printf("%4dMB of %4dMB used", this->dp->dp_dirty_total / 1024 / 1024,
            `zfs_dirty_data_max / 1024 / 1024);
}


Then from a shell do dtrace -s dirty.d YourPool and wait. You'll see a bunch of lines that look like the following:

Code:
dtrace: script 'dirty.d' matched 2 probes
CPU     ID                    FUNCTION:NAME
  4  56342                 none:txg-syncing   62MB of 4096MB used
  4  56342                 none:txg-syncing   64MB of 4096MB used
  5  56342                 none:txg-syncing   64MB of 4096MB used

Not having much luck with this code. Just showing you what I have done and its output; I'm not very familiar with the FreeBSD commands.

dirty.d file
1613189996674.png

Order of operations and output.
1613189985112.png


I'm assuming there needs to be some sort of value for $$1, but I'm not sure what's needed. The pool?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I'm assuming there needs to be some sort of value for $$1, but I'm not sure what's needed. The pool?
Yes, you need to supply the pool name as the variable, so assuming you've named your pool Test-01 your command would be

dtrace -s /mnt/Test-01/sheet-01/dirty.d Test-01
 

TheUsD

Contributor
Joined
May 17, 2013
Messages
116
Yes, you need to supply the pool name as the variable, so assuming you've named your pool Test-01 your command would be

dtrace -s /mnt/Test-01/sheet-01/dirty.d Test-01

Figured it was something like that, lol. Thanks.

Here are the results. In the second screenshot I highlighted where I left off in the log from the first.
1613232659831.png

1613232674289.png

513MB was the highest peak, which is roughly 15-16% if we're rounding up.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
So the other test you can run is a dd test over your NFS/SMB mount, whichever you are using.

write:
dd if=/dev/zero of=/mnt/freenas/dataset/nocompression bs=1M count=1000

read:
dd if=/mnt/freenas/dataset/nocompression of=/dev/null bs=1M

Probably just a gig is reasonable; you can write more if you feel the test needs more time.
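A sketch of that round trip from a client, assuming the share is mounted at /mnt/freenas/dataset (a placeholder path) and compression is off on the target dataset, since otherwise /dev/zero's zeros compress away and inflate the write number:

```shell
# Write test: push 1000 x 1MiB of zeros over the mount; dd reports MB/s when done.
dd if=/dev/zero of=/mnt/freenas/dataset/nocompression bs=1M count=1000
# Read test (ideally after a remount or cache flush so it isn't served from RAM):
dd if=/mnt/freenas/dataset/nocompression of=/dev/null bs=1M
rm /mnt/freenas/dataset/nocompression   # clean up the test file
```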
 

TheUsD

Contributor
Joined
May 17, 2013
Messages
116
So the other test you can run is a dd test over your NFS/SMB mount, whichever you are using.

I ran these and didn't see any numbers outside of what I have already been posting. I forgot to record them, but I do remember them not telling me much.

Just for giggles, I took the 4x500GB SSDs and made them a striped (1.8TB) SLOG for the 4x8TB drives (probably fourfold overkill, I know). In System > Advanced > Storage, I set the swap to 8GiB and the log to 800GiB. TBH, I'm not really sure what that was for, but it felt like it tells the system how much SLOG it's allowed to use (it was set to 2GiB prior). I cloned just over 3TB in two twelve-hour sessions to the 21TB pool with RAIDZ1, compression, and sync on.
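On the SLOG sizing, a rough back-of-the-envelope (assuming a 10Gbps link and the default 5-second transaction group timeout, both my assumptions, with a 3x safety factor) shows why 800GiB is far more than the ZIL can ever use; it only has to absorb a few txgs' worth of in-flight sync writes:

```shell
# Rough SLOG sizing: link throughput x txg timeout x safety factor.
link_gbps=10      # assumed network speed
txg_seconds=5     # default zfs_txg_timeout
safety=3          # headroom for back-to-back txgs
slog_gib=$(( link_gbps * txg_seconds * safety / 8 ))   # /8: Gbit -> GiB (approx)
echo "~${slog_gib} GiB of SLOG covers a ${link_gbps}Gbps link"
```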

Not sure if that is a good standard transfer rate or not. At this point I'm just screwing around to see what happens when I do what.

Still looking for advice but this is where I am at.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Figured it was something like that, lol. Thanks.

Here are the results. In the second screenshot I highlighted where I left off in the log from the first.

513MB was the highest peak, which is roughly 15-16% if we're rounding up.

You're nowhere near the write throttle (the default kicks in at 60%), so there goes that theory. I'm honestly a bit puzzled. Is there sufficient cooling on your HBA?
 

TheUsD

Contributor
Joined
May 17, 2013
Messages
116
Is there sufficient cooling on your HBA?

It currently sits in a basement where it's about 60F, installed in a rack-mount case with three 80mm fans pulling heat off the HDDs; that airflow passes over the HBA, CPU, and other hardware. Those fans haven't even increased speed due to temps, and the HBA heatsink is only lukewarm to the touch.
The CPU is staying between 32C and 33C, and I haven't seen it go over 5% utilization.
I've been pushing data over to the TrueNAS and keeping an eye on the 1.8TB of SLOG, and I noticed the SSDs are writing data at the same speed as the spinning disks, which has been around 13-16MB/s.
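A couple of stock FreeBSD/TrueNAS commands make that kind of observation easier (the pool name is a placeholder):

```shell
zpool iostat -v Pool-01 1   # per-vdev read/write ops and bandwidth, 1s samples;
                            # the log vdev gets its own line in the output
gstat -p                    # per-physical-disk busy % and MB/s on FreeBSD
```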

Here are some pics of all the status at once.
case.jpg
case top.jpg

1613523682864.png

1613523711961.png

1613523759063.png

1613523769228.png

1613523777119.png
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
I ran these and didn't see any numbers outside of what I have already been posting. I forgot to record them, but I do remember them not telling me much.

Just for giggles, I took the 4x500GB SSDs and made them a striped (1.8TB) SLOG for the 4x8TB drives (probably fourfold overkill, I know). In System > Advanced > Storage, I set the swap to 8GiB and the log to 800GiB. TBH, I'm not really sure what that was for, but it felt like it tells the system how much SLOG it's allowed to use (it was set to 2GiB prior). I cloned just over 3TB in two twelve-hour sessions to the 21TB pool with RAIDZ1, compression, and sync on.

Not sure if that is a good standard transfer rate or not. At this point I'm just screwing around to see what happens when I do what.

Still looking for advice but this is where I am at.
If your dd speeds are maxing out your network, then it's possible the client is what's transferring the data slowly. It could also be the workload: if you are transferring lots of smaller files, that is going to be harder to write to the system.
 

TheUsD

Contributor
Joined
May 17, 2013
Messages
116
An iperf test between the source and destination. Server: ReadyNAS (destination); Client: TrueNAS (source)
1613528604589.png

I've got some data transfers in progress, so I was not expecting full speeds. I've tried VMs, large CAB files, video files, small files, fat files, skinny files, and files with polka dots. Though speeds are slow, they're also consistent :confused:
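For anyone reproducing this, a typical iperf3 invocation for this kind of check (addresses are placeholders; -P adds parallel streams, which helps saturate faster links):

```shell
# On the destination (server side):
iperf3 -s
# On the source (client side), a 30-second test with 4 parallel streams:
iperf3 -c 192.168.22.22 -t 30 -P 4
# Add -R to reverse direction and test the other way without swapping roles.
```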
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Your network speeds through iperf are fine, which seems to rule out things like bad network drivers, cables or SFPs being janky or wrong wavelength.

Your raw disk and pool speeds are fine, which seems to rule out your disks being bad and pool layout, and we even took sync writes out of the picture by disabling it at the dataset level.

But when everything is being asked to play nice together it just falls down. This is a headscratcher for sure.

Have you tried remote access via protocols other than NFS (eg: SMB or iSCSI) to see if the problem still manifests there?
 

TheUsD

Contributor
Joined
May 17, 2013
Messages
116
Your network speeds through iperf are fine, which seems to rule out things like bad network drivers, cables or SFPs being janky or wrong wavelength.

Your raw disk and pool speeds are fine, which seems to rule out your disks being bad and pool layout, and we even took sync writes out of the picture by disabling it at the dataset level.

But when everything is being asked to play nice together it just falls down. This is a headscratcher for sure.

Have you tried remote access via protocols other than NFS (eg: SMB or iSCSI) to see if the problem still manifests there?


Don't judge me, but I haven't figured out SMB on TrueNAS yet, lol. I'm working on it though.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Don't judge me, but I haven't figured out SMB on TrueNAS yet, lol. I'm working on it though.

No judgement. Just trying to see if we can isolate the performance issue; if it doesn't appear when using SMB or iSCSI then it might be something in the NFS configuration (server or client side) but if remote access is still slow then we continue chasing gremlins.
 

TheUsD

Contributor
Joined
May 17, 2013
Messages
116
if it doesn't appear when using SMB
Well, duck a duck... it did a solid 85-90MB/s with SMB. It doesn't make a lot of sense to me, because everything that was using NFS is designed to run NFS: a ReadyNAS 214 and three ESXi 6.7 servers with vSphere 6.7. The two ESXi servers that were hosting the VMs have two 10/100/1000 NICs dedicated to the VMs.
Not saying anything is at fault, just cornfused.

I created an SMB share on a VM (which is on a datastore that is connected via NFS) and I could upload at about 60-80MB/s, so there's that.

I guess we can consider this closed? idk. Still open to doing more tests if you want to dig deeper.
 

TheUsD

Contributor
Joined
May 17, 2013
Messages
116
Also, I do appreciate you, @HoneyBadger and @SweetAndLow, for your willingness to help me with this. I've got some learning to do, but overall TrueNAS seems like a solid choice for my private clients. (I own a small MSP that focuses on free IT labor for SMBs and animal hospitals that need IT but don't have a budget for it. Getting their network infrastructure secure and a proper DR in place is the main objective.)

Looking forward to learning more about this product. Thanks.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Well, duck a duck... it did a solid 85-90MB/s with SMB. It doesn't make a lot of sense to me, because everything that was using NFS is designed to run NFS: a ReadyNAS 214 and three ESXi 6.7 servers with vSphere 6.7. The two ESXi servers that were hosting the VMs have two 10/100/1000 NICs dedicated to the VMs.
Not saying anything is at fault, just cornfused.

I created an SMB share on a VM (which is on a datastore that is connected via NFS) and I could upload at about 60-80MB/s, so there's that.

Uploading to an SMB share on a VM that's running on an NFS export from FreeNAS made it go fast? It's almost as if something in the default NFS settings on the clients makes the file shipping go slowly, but ESXi/vSphere is able to write to it just fine when acting as a man-in-the-middle. I don't use NFS at home, but I might need to set something up and poke around to see if I can reproduce this.

(I own a small MSP company that focus on free IT labor to the SMBs and animal hospitals that need IT but do not have a budget for IT. Getting their network infrastructure secure and a proper DR is the main objective).

That's a wonderfully wholesome way to donate your time and knowledge. Happy to help in any way I can.
 

TheUsD

Contributor
Joined
May 17, 2013
Messages
116
It's almost as if there's something in the default NFS settings on the clients that makes the file shipping go slowly, but ESXi/vSphere is able to write to it just fine when it's acting as a man-in-the-middle.
I took the 4-SSD SLOG out of the vdev for the 8TB spinning disks and made those SSDs their own separate pool, to be used as the boot drives for the VMs' OSes.
I'm migrating an OSSIM VM that is about 500GB, and these are the speeds. Still not wrapping my brain around where the bottleneck is...
ada5 is an SSD; da3 is a spinner.
1613670074780.png
 

TheUsD

Contributor
Joined
May 17, 2013
Messages
116
I have three NICs:
Address for GUI access
192.168.22.2/28 em0
Addresses for the NFS storage:
192.168.22.22/28 igb0
192.168.22.42/28 igb1

ESXi Servers: 192.168.20.0/24 network
192.168.20.10 (contains VMs)
192.168.20.11 (contains VMs)
192.168.20.12 (vsphere only hosted on this)

Went into vSphere and mounted 2 new datastores.
Datastore 1: Type: NFS 4.1; Folder: datasheet from Pool-01; Servers: both .22 and .42
Datastore 2: Type: NFS 4.1; Folder: datasheet from Pool-02; Servers: both .22 and .42
Started a VM migration of a VM disk only.

I noticed that traffic was leaving and entering on all three NICs (shown here)
1614197285131.png

In an attempt to fix this, I set static routes in TrueNAS for each NIC to see if I could force specific traffic over specific NICs:

Destination:______Gateway:
192.168.20.0/24 192.168.22.22
192.168.20.0/24 192.168.22.42
192.168.25.0/24 192.168.22.2 (192.168.25.0/24 is my subnet for workstations)

Ran a VM migration and traffic was still moving across all three NICs. I then deleted those routes from TrueNAS and set the same static routes on the router; traffic was still routed across all three NICs. I then wondered if I was just doing some terrible networking and decided to run one NFS share on igb0 and the other NFS share on igb1.
Traffic was still routing on all three NICs.

Removed the NFS share from igb1 and remounted it on igb0, then ran a VM migration: traffic now only on igb0, but speeds still 11-13MiB/s in both directions.
Did the same but with both NFS shares on igb1, ran a VM migration: speeds still 11-13MiB/s in both directions.
One last time, but now with the NFS shares on em0: speeds increased to 55-65MiB/s in both directions.

All this troubleshooting, and I'm pretty sure it's just a bad dual-port SFP card. How frustrating.
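For future reference, FreeBSD has a few built-ins that show which NIC a given destination will actually use, which can save a lot of this guesswork (addresses taken from the post):

```shell
route -n get 192.168.20.10   # prints the interface chosen for that host
netstat -rn -f inet          # dump the full IPv4 routing table
netstat -I igb0 -w 1         # per-interface packet counters, sampled every 1s
```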
 

Joined
Dec 29, 2014
Messages
1,135
If I interpret that correctly, it means all your storage traffic is transiting your router at layer 3. What kind of router is it, and are you SURE it is up to doing that? Just because something has a gigabit interface doesn't mean it is capable of filling that interface on a consistent basis.

I have a dedicated storage network with only the FreeNAS and one NIC from each of my ESXi hosts on it, so all the storage traffic between FreeNAS and ESXi is layer-2-only from the perspective of the switch. I would be really curious to know if you got better throughput with your ESXi host(s) and FreeNAS on the same IP network.
 