Very Slow Pool performance. 49MB/s Write on 2x RAIDZ1 with 4x 4TB N300s per Vdev - Please help!

djkay2637

Cadet
Joined
Sep 21, 2022
Messages
6
Thanks for reading my post! To start with, here's my setup. :smile:

My system
SuperMicro with 2x Xeon(R) CPU E5-2640 v3 and 96GB RAM running ESXi 7.0.
VM with 8 cores, 32GB RAM (locked), SAS2008 PCI-Express Fusion-MPT SAS-2 HBA card passed through
8x Toshiba N300 4TB, 2x RAIDZ1 vdevs with 4x HDDs per vdev. 75% capacity used.

My issue
It has been running for over a year, but write performance is now very slow. I have run the following test, and it appears the pool itself is the bottleneck. The pool is set to standard sync and LZ4 compression.

Code:
fio --name=test --size=5g --rw=write --ioengine=posixaio --direct=1 --bs=1m


test: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=posixaio, iodepth=1
fio-3.27
Starting 1 process
Jobs: 1 (f=1): [W(1)][98.2%][eta 00m:02s]
test: (groupid=0, jobs=1): err= 0: pid=55519: Wed Sep 21 10:32:10 2022
  write: IOPS=46, BW=46.8MiB/s (49.1MB/s)(5120MiB/109424msec); 0 zone resets
    slat (nsec): min=14626, max=80450, avg=32511.15, stdev=6826.22
    clat (usec): min=271, max=4594.8k, avg=21338.39, stdev=289557.43
     lat (usec): min=290, max=4594.9k, avg=21370.90, stdev=289557.10
    clat percentiles (usec):
     |  1.00th=[    281],  5.00th=[    293], 10.00th=[    306],
     | 20.00th=[    330], 30.00th=[    338], 40.00th=[    347],
     | 50.00th=[    359], 60.00th=[    367], 70.00th=[    375],
     | 80.00th=[    396], 90.00th=[    408], 95.00th=[    424],
     | 99.00th=[    449], 99.50th=[3170894], 99.90th=[4278191],
     | 99.95th=[4529849], 99.99th=[4596958]
   bw (  KiB/s): min=53067, max=405928, per=100.00%, avg=321392.74, stdev=92992.36, samples=31
   iops        : min=   51, max=  396, avg=313.39, stdev=90.92, samples=31
  lat (usec)   : 500=99.47%
  lat (msec)   : >=2000=0.53%
  cpu          : usr=0.11%, sys=0.07%, ctx=5122, majf=0, minf=1
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,5120,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1


Run status group 0 (all jobs):
  WRITE: bw=46.8MiB/s (49.1MB/s), 46.8MiB/s-46.8MiB/s (49.1MB/s-49.1MB/s), io=5120MiB (5369MB), run=109424-109424msec
root@truenas[~]#



I have basic-to-intermediate knowledge at best, am really keen to learn more, and am looking for some wisdom from the community. Are there any extra troubleshooting tips or next steps?

Many thanks,
Dan.
 

c77dk

Patron
Joined
Nov 27, 2019
Messages
467
I've got no experience running TrueNAS through ESXi, but those IOPS look quite low to my eye.
I assume you haven't enabled deduplication (a sure performance killer)? When you run the test, how busy are the disks? That can be seen using gstat. Also, which version of TrueNAS are you using?
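For example, from a shell (the -p flag limits gstat to physical providers, and -I sets the refresh interval):

Code:
gstat -p -I 1s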
 

djkay2637

Cadet
Joined
Sep 21, 2022
Messages
6
Thank you for your reply. I ran gstat whilst running another test, and I can see one disk is getting battered but the rest are not. Dedup is off too.


Code:
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
  1     42      0      0    0.0     42  36508   23.0   97.4| da0
    0      2      0      0    0.0      1     19    0.4    0.8| da1
    0      0      0      0    0.0      0      0    0.0    0.0| da2
    0      0      0      0    0.0      0      0    0.0    0.0| da3
    0      2      0      0    0.0      1     23    0.2    0.3| da4
    0      0      0      0    0.0      0      0    0.0    0.0| da5
    0      0      0      0    0.0      0      0    0.0    0.0| da6
    0      0      0      0    0.0      0      0    0.0    0.0| da7
    0      2      0      0    0.0      1     23    0.2    0.3| da8
    0      2      0      0    0.0      1     23    0.2    0.5| da9
    0      0      0      0    0.0      0      0    0.0    0.0| cd0
    0      0      0      0    0.0      0      0    0.0    0.0| da0p1
    1     42      0      0    0.0     42  36508   23.0   97.4| da0p2
    0      0      0      0    0.0      0      0    0.0    0.0| gptid/539dd184-97ea-11eb-96e7-000c2923f789
    0      0      0      0    0.0      0      0    0.0    0.0| da1p1
    0      2      0      0    0.0      1     19    0.4    0.8| da1p2
    0      0      0      0    0.0      0      0    0.0    0.0| da3p1
    0      0      0      0    0.0      0      0    0.0    0.0| da3p2
    0      0      0      0    0.0      0      0    0.0    0.0| da4p1
    0      2      0      0    0.0      1     23    0.2    0.3| da4p2
    0      0      0      0    0.0      0      0    0.0    0.0| da5p1
    0      0      0      0    0.0      0      0    0.0    0.0| da5p2
    0      0      0      0    0.0      0      0    0.0    0.0| da6p1
    0      0      0      0    0.0      0      0    0.0    0.0| da6p2
    0      0      0      0    0.0      0      0    0.0    0.0| da7p1
    0      0      0      0    0.0      0      0    0.0    0.0| da7p2
    0      0      0      0    0.0      0      0    0.0    0.0| da8p1
    0      2      0      0    0.0      1     23    0.2    0.3| da8p2
    0      0      0      0    0.0      0      0    0.0    0.0| da9p1
    0      2      0      0    0.0      1     23    0.2    0.5| da9p2
    0      0      0      0    0.0      0      0    0.0    0.0| mirror/swap1
    0      0      0      0    0.0      0      0    0.0    0.0| iso9660/TRUENAS
    0      0      0      0    0.0      0      0    0.0    0.0| mirror/swap0.eli
    0      2      0      0    0.0      1     19    0.4    0.8| gptid/389e6c83-1c7b-11ec-a764-000c2923f789
    0      0      0      0    0.0      0      0    0.0    0.0| gptid/b8c17be1-1590-11ec-84b4-000c2923f789
    0      2      0      0    0.0      1     23    0.2    0.3| gptid/3f4a02b0-1c7b-11ec-a764-000c2923f789
    0      0      0      0    0.0      0      0    0.0    0.0| mirror/swap3
    0      0      0      0    0.0      0      0    0.0    0.0| gptid/ce2fb270-1590-11ec-84b4-000c2923f789
    0      0      0      0    0.0      0      0    0.0    0.0| mirror/swap2
 

djkay2637

Cadet
Joined
Sep 21, 2022
Messages
6
da0 is my boot drive for TrueNAS. I am using TrueNAS-12.0-U7. I was using the command
Code:
fio --name=test --size=5g --rw=write --ioengine=posixaio --direct=1 --bs=1m
to run the test; however, to be honest, I don't understand what all the parameters are doing.
 

djkay2637

Cadet
Joined
Sep 21, 2022
Messages
6
[attachment: gstat.JPG]


This is what gstat looks like when transferring a 16GB file from one network share to another (on the same vdev). It is moving at 10MB/s.
 

Tony-1971

Contributor
Joined
Oct 1, 2016
Messages
147
Hello,
Before running the fio command you must change directory first - not /root, but somewhere under /mnt, so the test hits the pool rather than the boot device.
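For example (substitute your own pool and dataset names):

Code:
cd /mnt/<pool>/<dataset>
fio --name=test --size=5g --rw=write --ioengine=posixaio --direct=1 --bs=1m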
Best Regards,
Antonio
 

djkay2637

Cadet
Joined
Sep 21, 2022
Messages
6
Thank you for your replies so far.

Code:
root@truenas[/mnt/NewPoolToshNas]# cd NFSShare
root@truenas[/mnt/NewPoolToshNas/NFSShare]# ls
root@truenas[/mnt/NewPoolToshNas/NFSShare]# fio --name=test --size=5g --rw=write --ioengine=posixaio --direct=1 --bs=1m
test: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=posixaio, iodepth=1
fio-3.27
Starting 1 process
test: Laying out IO file (1 file / 5120MiB)
Jobs: 1 (f=1): [W(1)][100.0%][w=386MiB/s][w=386 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=59886: Wed Sep 21 17:10:01 2022
  write: IOPS=658, BW=659MiB/s (691MB/s)(5120MiB/7770msec); 0 zone resets
    slat (usec): min=10, max=1564, avg=55.12, stdev=80.76
    clat (nsec): min=1247, max=32859k, avg=1460864.32, stdev=1680683.91
     lat (usec): min=253, max=32885, avg=1515.99, stdev=1662.84
    clat percentiles (nsec):
     |  1.00th=[    1432],  5.00th=[    1800], 10.00th=[  158720],
     | 20.00th=[  254976], 30.00th=[  259072], 40.00th=[  276480],
     | 50.00th=[  618496], 60.00th=[ 2113536], 70.00th=[ 2506752],
     | 80.00th=[ 2637824], 90.00th=[ 2899968], 95.00th=[ 3031040],
     | 99.00th=[ 3457024], 99.50th=[11075584], 99.90th=[21889024],
     | 99.95th=[22151168], 99.99th=[32899072]
   bw (  KiB/s): min=331776, max=3433051, per=100.00%, avg=685124.60, stdev=833843.77, samples=15
   iops        : min=  324, max= 3352, avg=668.73, stdev=814.24, samples=15
  lat (usec)   : 2=5.59%, 4=1.50%, 10=0.14%, 20=0.16%, 50=0.51%
  lat (usec)   : 100=0.70%, 250=4.55%, 500=35.20%, 750=2.03%, 1000=0.06%
  lat (msec)   : 2=6.27%, 4=42.40%, 10=0.10%, 20=0.64%, 50=0.16%
  cpu          : usr=1.72%, sys=0.60%, ctx=5863, majf=0, minf=1
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,5120,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=659MiB/s (691MB/s), 659MiB/s-659MiB/s (691MB/s-691MB/s), io=5120MiB (5369MB), run=7770-7770msec


These results are much better and show the speed I would expect from this pool. Now my journey continues: finding out why file transfers from one network share to another run at 10MB/s.

To add further detail to my setup: the NFS share tested above is one of my ESXi host's datastores. Inside this datastore is the virtual disk for one of my VMs. This is all inside the same physical server, so the traffic is not traversing the rest of my network.

What tests would you do next?

Thank you all for reading and commenting so far.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
Your fio command is benchmarking 1MB sequential writes - if you're storing large media files, that's an accurate reflection of the performance, but since you're using the TrueNAS storage as a "loopback" datastore for your ESXi host, the real workload is drastically different. You're subject to a couple of different bottlenecks: the first is that NFS uses synchronous writes by default, and the second is that RAIDZ performs poorly for block or block-like workloads (including VM storage).
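Since you mentioned not understanding the parameters, here's a quick breakdown of that command (standard fio options):

Code:
# --name=test          job name; fio also names its data file after it (test.0.0)
# --size=5g            total amount of data to write (5 GiB)
# --rw=write           sequential writes ("randwrite" would be random)
# --ioengine=posixaio  use POSIX asynchronous I/O
# --direct=1           use non-buffered I/O, bypassing the cache where supported
# --bs=1m              1 MiB block size (iodepth defaults to 1)
fio --name=test --size=5g --rw=write --ioengine=posixaio --direct=1 --bs=1m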


While the first is relatively easy to correct by either adding a high-performance SLOG device (or disabling sync, if you're willing to risk data loss in a power outage!), the latter unfortunately doesn't have an easy fix, as vdev geometry can't be changed after creation.
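If you want to experiment, sync is a per-dataset ZFS property, while the SLOG is added at the pool level - a rough sketch using the pool/dataset names from your output (daX is a placeholder for your SSD's device node):

Code:
# Check the current sync setting on the NFS dataset
zfs get sync NewPoolToshNas/NFSShare
# Disable sync writes for testing - unsafe if power is lost mid-write!
zfs set sync=disabled NewPoolToshNas/NFSShare
# Return to the default behaviour afterwards
zfs set sync=standard NewPoolToshNas/NFSShare
# Or attach a dedicated SLOG device to the pool
zpool add NewPoolToshNas log daX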

Your pool being 75% full will have a negative impact on performance as well.

(You should probably look for and delete the 5GB test file you accidentally created on your boot volume as well.)
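fio names its data file <jobname>.<jobnumber>.<filenumber>, so assuming that first run really was started from /root, something like this should find and remove it:

Code:
ls -lh /root/test.*    # should show a ~5GiB file, e.g. test.0.0
rm /root/test.0.0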

SuperMicro with 2x Xeon(R) CPU E5-2640 v3 and 96GB RAM running ESXi 7.0.
VM with 8 cores, 32GB RAM (locked), SAS2008 PCI-Express Fusion-MPT SAS-2 HBA card passed through
8x Toshiba N300 4TB, 2x RAIDZ1 vdevs with 4x HDDs per vdev. 75% capacity used.

Some quick thoughts on your hardware and VM configuration, though these are likely not as relevant as your underlying storage configuration (sync/RAIDZ):

Your pCPUs have 6 cores each (12 threads) but you've configured the TrueNAS VM with 8 vCPUs - ESXi will prioritize physical cores first and has likely given you a 2-socket vNUMA configuration, which means that at any given point your I/O has a 50% chance of having to cross the QPI link between sockets to hit your HBA. Disregard - I was thinking of the earlier 2640s. Unless you're hitting 100% CPU across all threads simultaneously, consider reducing the vCPU allocation; I'd suggest 4 vCPUs personally.

Related to the above, consider also using the VM advanced settings to pin it to the same NUMA node where the PCIe slot containing the HBA is:
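For example, under VM Options > Advanced > Configuration Parameters (numa.nodeAffinity is a standard ESXi setting; node 0 below is only an assumption - use whichever node actually hosts the HBA's slot):

Code:
numa.nodeAffinity = "0"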

 

djkay2637

Cadet
Joined
Sep 21, 2022
Messages
6
Thank you HoneyBadger for spending the time to compose such a comprehensive response. The E5-2640 v3s I have are 8 cores and 16 threads each. My thinking was to give it an entire CPU's worth of oomph, just because I can, I guess. I will lower the core count and see how it goes. I do have a decent spare SSD sitting unutilised, so I will try that as a SLOG. My rationale for using NFS was the ability to access the NFSShare across different OSes, but having learned a bit more about everything, I will perhaps transition to an iSCSI datastore in ESXi and migrate the VMs to it. Do you think this could help?

Again, I really appreciate you spending the time.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
Thank you HoneyBadger for spending the time to compose such a comprehensive response. The E5-2640 v3s I have are 8 cores and 16 threads each.

Apologies, you're right - I was thinking of the original Sandy Bridge 2640, which is a hex-core. I'd still suggest dropping the core count and pinning the cores to the same physical CPU as the PCIe slot with the HBA, though.

My thinking was to give it an entire CPU's worth of oomph, just because I can, I guess. I will lower the core count and see how it goes.

Most systems are limited by network or storage far before they are CPU limited. I imagine 4 vCPUs will be fine.

I do have a decent spare SSD sitting unutilised, so I will try that as a SLOG.

What make and model is it? SLOG usage causes heavy write workloads, and it can burn through the endurance rating of consumer SSDs very quickly.
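If in doubt, smartctl from the TrueNAS shell will report the model along with its SMART/wear attributes (daX being a placeholder for the SSD's device node):

Code:
smartctl -a /dev/daX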

My rationale for using NFS was the ability to access the NFSShare across different OSes, but having learned a bit more about everything, I will perhaps transition to an iSCSI datastore in ESXi and migrate the VMs to it. Do you think this could help?

No - the underlying storage will still be subject to the same limitations with respect to the RAIDZ layout and the dependence on sync writes. iSCSI may benchmark faster, but that's because it defaults to "unsafe" asynchronous writes - if you're willing to accept the risk of data loss on a power outage, you can achieve the same result by setting sync=disabled on your ESXi NFS exports. Or, if the performance with your SLOG SSD is acceptable, you can keep sync=standard with NFS.
 