Very Slow Pool performance. 49MB/s Write on 2x RAIDZ1 with 4x 4TB N300s per Vdev - Please help!

djkay2637

Cadet
Joined
Sep 21, 2022
Messages
6
Thanks for reading my post! To start with, here's my setup. :smile:

My system
SuperMicro with 2x Xeon(R) CPU E5-2640 v3 and 96GB RAM running ESXi 7.0.
VM with 8 cores, 32GB RAM (locked), SAS2008 PCI-Express Fusion-MPT SAS-2 HBA card passed through
8x Toshiba N300 4TB, 2x RAIDZ1 vdevs with 4x HDDs per vdev. 75% capacity used.

My issue
It has been running for over a year, but write performance is now very slow. I have run the following test, and it appears the pool itself is the bottleneck. The pool is set to standard sync and LZ4 compression.

Code:
fio --name=test --size=5g --rw=write --ioengine=posixaio --direct=1 --bs=1m


test: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=posixaio, iodepth=1
fio-3.27
Starting 1 process
Jobs: 1 (f=1): [W(1)][98.2%][eta 00m:02s]
test: (groupid=0, jobs=1): err= 0: pid=55519: Wed Sep 21 10:32:10 2022
  write: IOPS=46, BW=46.8MiB/s (49.1MB/s)(5120MiB/109424msec); 0 zone resets
    slat (nsec): min=14626, max=80450, avg=32511.15, stdev=6826.22
    clat (usec): min=271, max=4594.8k, avg=21338.39, stdev=289557.43
     lat (usec): min=290, max=4594.9k, avg=21370.90, stdev=289557.10
    clat percentiles (usec):
     |  1.00th=[    281],  5.00th=[    293], 10.00th=[    306],
     | 20.00th=[    330], 30.00th=[    338], 40.00th=[    347],
     | 50.00th=[    359], 60.00th=[    367], 70.00th=[    375],
     | 80.00th=[    396], 90.00th=[    408], 95.00th=[    424],
     | 99.00th=[    449], 99.50th=[3170894], 99.90th=[4278191],
     | 99.95th=[4529849], 99.99th=[4596958]
   bw (  KiB/s): min=53067, max=405928, per=100.00%, avg=321392.74, stdev=92992.36, samples=31
   iops        : min=   51, max=  396, avg=313.39, stdev=90.92, samples=31
  lat (usec)   : 500=99.47%
  lat (msec)   : >=2000=0.53%
  cpu          : usr=0.11%, sys=0.07%, ctx=5122, majf=0, minf=1
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,5120,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1


Run status group 0 (all jobs):
  WRITE: bw=46.8MiB/s (49.1MB/s), 46.8MiB/s-46.8MiB/s (49.1MB/s-49.1MB/s), io=5120MiB (5369MB), run=109424-109424msec
root@truenas[~]#



I have basic-to-intermediate knowledge at best, am really keen to learn more, and am looking for some wisdom from the community. Are there any extra troubleshooting tips or next steps?

Many thanks,
Dan.
 

c77dk

Patron
Joined
Nov 27, 2019
Messages
467
I've got no experience running TrueNAS through ESXi, but those IOPS look quite low to my eye.
I assume you haven't enabled deduplication (a sure performance killer)? When you run the test, how busy are the disks? That can be seen using gstat. Also, which version of TrueNAS are you using?
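For example, from a shell (the -p flag limits gstat to physical providers, and -I sets the refresh interval):

Code:
gstat -p -I 1s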
 

djkay2637

Cadet
Joined
Sep 21, 2022
Messages
6
Thank you for your reply. I ran gstat whilst running another test, and I can see one disk is getting battered but the rest are not. Dedup is off too.


Code:
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
  1     42      0      0    0.0     42  36508   23.0   97.4| da0
    0      2      0      0    0.0      1     19    0.4    0.8| da1
    0      0      0      0    0.0      0      0    0.0    0.0| da2
    0      0      0      0    0.0      0      0    0.0    0.0| da3
    0      2      0      0    0.0      1     23    0.2    0.3| da4
    0      0      0      0    0.0      0      0    0.0    0.0| da5
    0      0      0      0    0.0      0      0    0.0    0.0| da6
    0      0      0      0    0.0      0      0    0.0    0.0| da7
    0      2      0      0    0.0      1     23    0.2    0.3| da8
    0      2      0      0    0.0      1     23    0.2    0.5| da9
    0      0      0      0    0.0      0      0    0.0    0.0| cd0
    0      0      0      0    0.0      0      0    0.0    0.0| da0p1
    1     42      0      0    0.0     42  36508   23.0   97.4| da0p2
    0      0      0      0    0.0      0      0    0.0    0.0| gptid/539dd184-97ea-11eb-96e7-000c2923f789
    0      0      0      0    0.0      0      0    0.0    0.0| da1p1
    0      2      0      0    0.0      1     19    0.4    0.8| da1p2
    0      0      0      0    0.0      0      0    0.0    0.0| da3p1
    0      0      0      0    0.0      0      0    0.0    0.0| da3p2
    0      0      0      0    0.0      0      0    0.0    0.0| da4p1
    0      2      0      0    0.0      1     23    0.2    0.3| da4p2
    0      0      0      0    0.0      0      0    0.0    0.0| da5p1
    0      0      0      0    0.0      0      0    0.0    0.0| da5p2
    0      0      0      0    0.0      0      0    0.0    0.0| da6p1
    0      0      0      0    0.0      0      0    0.0    0.0| da6p2
    0      0      0      0    0.0      0      0    0.0    0.0| da7p1
    0      0      0      0    0.0      0      0    0.0    0.0| da7p2
    0      0      0      0    0.0      0      0    0.0    0.0| da8p1
    0      2      0      0    0.0      1     23    0.2    0.3| da8p2
    0      0      0      0    0.0      0      0    0.0    0.0| da9p1
    0      2      0      0    0.0      1     23    0.2    0.5| da9p2
    0      0      0      0    0.0      0      0    0.0    0.0| mirror/swap1
    0      0      0      0    0.0      0      0    0.0    0.0| iso9660/TRUENAS
    0      0      0      0    0.0      0      0    0.0    0.0| mirror/swap0.eli
    0      2      0      0    0.0      1     19    0.4    0.8| gptid/389e6c83-1c7b-11ec-a764-000c2923f789
    0      0      0      0    0.0      0      0    0.0    0.0| gptid/b8c17be1-1590-11ec-84b4-000c2923f789
    0      2      0      0    0.0      1     23    0.2    0.3| gptid/3f4a02b0-1c7b-11ec-a764-000c2923f789
    0      0      0      0    0.0      0      0    0.0    0.0| mirror/swap3
    0      0      0      0    0.0      0      0    0.0    0.0| gptid/ce2fb270-1590-11ec-84b4-000c2923f789
    0      0      0      0    0.0      0      0    0.0    0.0| mirror/swap2
 

djkay2637

Cadet
Joined
Sep 21, 2022
Messages
6
da0 is my boot drive for TrueNAS. I am using TrueNAS-12.0-U7. I was using the command
Code:
fio --name=test --size=5g --rw=write --ioengine=posixaio --direct=1 --bs=1m
to run the test; however, to be honest, I don't understand what all the parameters are doing.
 

djkay2637

Cadet
Joined
Sep 21, 2022
Messages
6
[attachment: gstat.JPG]


This is what gstat looks like when transferring a 16GB file from one network share to another (on the same vdev). It is moving at 10MB/s.
 

Tony-1971

Contributor
Joined
Oct 1, 2016
Messages
147
Hello,
Before running the fio command you must change directory first - not /root, but somewhere under /mnt, so the test hits the pool rather than the boot device.
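For example (substitute your own pool and dataset names):

Code:
cd /mnt/<pool>/<dataset>
fio --name=test --size=5g --rw=write --ioengine=posixaio --direct=1 --bs=1m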
Best Regards,
Antonio
 

djkay2637

Cadet
Joined
Sep 21, 2022
Messages
6
Thank you for your replies so far.

Code:
root@truenas[/mnt/NewPoolToshNas]# cd NFSShare
root@truenas[/mnt/NewPoolToshNas/NFSShare]# ls
root@truenas[/mnt/NewPoolToshNas/NFSShare]# fio --name=test --size=5g --rw=write --ioengine=posixaio --direct=1 --bs=1m
test: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=posixaio, iodepth=1
fio-3.27
Starting 1 process
test: Laying out IO file (1 file / 5120MiB)
Jobs: 1 (f=1): [W(1)][100.0%][w=386MiB/s][w=386 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=59886: Wed Sep 21 17:10:01 2022
  write: IOPS=658, BW=659MiB/s (691MB/s)(5120MiB/7770msec); 0 zone resets
    slat (usec): min=10, max=1564, avg=55.12, stdev=80.76
    clat (nsec): min=1247, max=32859k, avg=1460864.32, stdev=1680683.91
     lat (usec): min=253, max=32885, avg=1515.99, stdev=1662.84
    clat percentiles (nsec):
     |  1.00th=[    1432],  5.00th=[    1800], 10.00th=[  158720],
     | 20.00th=[  254976], 30.00th=[  259072], 40.00th=[  276480],
     | 50.00th=[  618496], 60.00th=[ 2113536], 70.00th=[ 2506752],
     | 80.00th=[ 2637824], 90.00th=[ 2899968], 95.00th=[ 3031040],
     | 99.00th=[ 3457024], 99.50th=[11075584], 99.90th=[21889024],
     | 99.95th=[22151168], 99.99th=[32899072]
   bw (  KiB/s): min=331776, max=3433051, per=100.00%, avg=685124.60, stdev=833843.77, samples=15
   iops        : min=  324, max= 3352, avg=668.73, stdev=814.24, samples=15
  lat (usec)   : 2=5.59%, 4=1.50%, 10=0.14%, 20=0.16%, 50=0.51%
  lat (usec)   : 100=0.70%, 250=4.55%, 500=35.20%, 750=2.03%, 1000=0.06%
  lat (msec)   : 2=6.27%, 4=42.40%, 10=0.10%, 20=0.64%, 50=0.16%
  cpu          : usr=1.72%, sys=0.60%, ctx=5863, majf=0, minf=1
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,5120,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=659MiB/s (691MB/s), 659MiB/s-659MiB/s (691MB/s-691MB/s), io=5120MiB (5369MB), run=7770-7770msec


These results are much better and show the speed I would expect from this pool. Now my journey continues: finding out why file transfers from one network share to another run at 10MB/s.

To add further detail to my setup: the NFS share tested above is one of my ESXi host's datastores. Inside this datastore is the virtual disk for one of my VMs. This is all inside the same physical server, so the traffic is not traversing the rest of my network.

What tests would you do next?

Thank you all for reading and commenting so far.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
Your fio command is benchmarking 1MB sequential writes - if you're storing large media files, that's an accurate reflection of the performance, but since you're using the TrueNAS storage as a "loopback" datastore for your ESXi host, the real workload is drastically different. You're subject to a couple of different bottlenecks: the first is that NFS uses synchronous writes by default, and the second is that RAIDZ performs poorly for block or block-like workloads (including VM storage).
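Since you mentioned not understanding the parameters, here's a quick breakdown of that command (standard fio options):

Code:
# --name=test          job name; fio also names its data file after it (test.0.0)
# --size=5g            total amount of data to write (5 GiB)
# --rw=write           sequential writes ("randwrite" would be random)
# --ioengine=posixaio  use POSIX asynchronous I/O
# --direct=1           use non-buffered I/O, bypassing the cache where supported
# --bs=1m              1 MiB block size (iodepth defaults to 1)
fio --name=test --size=5g --rw=write --ioengine=posixaio --direct=1 --bs=1m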


While the first is relatively easy to correct by either adding a high-performance SLOG device (or disabling sync, if you're willing to risk data loss in a power outage!), the latter unfortunately doesn't have an easy fix, as vdev geometry can't be changed after creation.
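If you want to experiment, sync is a per-dataset ZFS property, while the SLOG is added at the pool level - a rough sketch using the pool/dataset names from your output (daX is a placeholder for your SSD's device node):

Code:
# Check the current sync setting on the NFS dataset
zfs get sync NewPoolToshNas/NFSShare
# Disable sync writes for testing - unsafe if power is lost mid-write!
zfs set sync=disabled NewPoolToshNas/NFSShare
# Return to the default behaviour afterwards
zfs set sync=standard NewPoolToshNas/NFSShare
# Or attach a dedicated SLOG device to the pool
zpool add NewPoolToshNas log daX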

Your pool being 75% full will have a negative impact on performance as well.

(You should probably look for and delete the 5GB test file you accidentally created on your boot volume as well.)
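fio names its data file <jobname>.<jobnumber>.<filenumber>, so assuming that first run really was started from /root, something like this should find and remove it:

Code:
ls -lh /root/test.*    # should show a ~5GiB file, e.g. test.0.0
rm /root/test.0.0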

SuperMicro with 2x Xeon(R) CPU E5-2640 v3 and 96GB RAM running ESXi 7.0.
VM with 8 cores, 32GB RAM (locked), SAS2008 PCI-Express Fusion-MPT SAS-2 HBA card passed through
8x Toshiba N300 4TB, 2x RAIDZ1 vdevs with 4x HDDs per vdev. 75% capacity used.

Some quick thoughts on your hardware and VM configuration, though these are likely not as relevant as your underlying storage configuration (sync/RAIDZ):

Your pCPUs have 6 cores each (12 threads) but you've configured the TrueNAS VM with 8 vCPUs - ESXi will prioritize physical cores first and has likely given you a 2-socket vNUMA configuration, which means that at any given point your I/O has a 50% chance of having to cross the QPI link between sockets to hit your HBA. Disregard - I was thinking of the earlier 2640s. Unless you're hitting 100% CPU across all threads simultaneously, consider reducing the vCPU allocation; I'd suggest 4 vCPUs personally.

Related to the above, consider also using the VM advanced settings to pin it to the same NUMA node where the PCIe slot containing the HBA is:
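For example, under VM Options > Advanced > Configuration Parameters (numa.nodeAffinity is a standard ESXi setting; node 0 below is only an assumption - use whichever node actually hosts the HBA's slot):

Code:
numa.nodeAffinity = "0"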

 

djkay2637

Cadet
Joined
Sep 21, 2022
Messages
6
Thank you HoneyBadger for spending the time to compose such a comprehensive response. The E5-2640 v3s I have are 8 cores and 16 threads each. My thinking was to give it an entire CPU's worth of oomph, just because I can, I guess. I will lower the core count and see how it goes. I do have a decent spare SSD sitting unutilised, so I will try that as a SLOG. My rationale for using NFS was the ability to access the NFSShare across different OSes, but having learned a bit more about everything, I will perhaps transition to an iSCSI datastore in ESXi and migrate the VMs to it. Do you think this could help?

Again, I really appreciate you spending the time.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
Thank you HoneyBadger for spending the time to compose such a comprehensive response. The E5-2640 v3s I have are 8 cores and 16 threads each.

Apologies, you're right - I was thinking of the original Sandy Bridge 2640, which is a hex-core. I'd still suggest dropping the core count and pinning the cores to the same physical CPU as the PCIe slot with the HBA, though.

My thinking was to give it an entire CPU's worth of oomph, just because I can, I guess. I will lower the core count and see how it goes.

Most systems are limited by network or storage far before they are CPU limited. I imagine 4 vCPUs will be fine.

I do have a decent spare SSD sitting unutilised, so I will try that as a SLOG.

What make and model is it? SLOG usage causes heavy write workloads, and it can burn through the endurance rating of consumer SSDs very quickly.
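If in doubt, smartctl from the TrueNAS shell will report the model along with its SMART/wear attributes (daX being a placeholder for the SSD's device node):

Code:
smartctl -a /dev/daX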

My rationale for using NFS was the ability to access the NFSShare across different OSes, but having learned a bit more about everything, I will perhaps transition to an iSCSI datastore in ESXi and migrate the VMs to it. Do you think this could help?

No - the underlying storage will still be subject to the same limitations with respect to the RAIDZ layout and the dependence on sync writes. iSCSI may benchmark faster, but that's because it defaults to "unsafe" asynchronous writes - if you're willing to accept the risk of data loss on a power outage, you can achieve the same result by setting sync=disabled on your ESXi NFS exports. Or, if the performance with your SLOG SSD is acceptable, you can keep sync=standard with NFS.
 