Help needed - Low performance on NVMe pool

potzkin

Dabbler
Joined
Dec 15, 2019
Messages
16
Hello,

I'm trying to build fast NVMe storage.

I have this server:

AMD EPYC 7302P

with 5x Micron 9300 PRO 15TB U.2 NVMe - 3.5 GB/s

Samsung DDR4-2933 16GB x4

Making a RAIDZ or stripe pool leaves the disks working relatively slowly compared to what they are capable of:

- sync disabled
- compression disabled
- record size: 1M
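
(For reference, a rough sketch of how those dataset settings could be applied from the shell - "pool1" is assumed as the pool/dataset name used in the tests below; on FreeNAS these are normally set through the GUI.)

Code:
# assumed pool/dataset name - adjust to match your system
zfs set sync=disabled pool1
zfs set compression=off pool1
zfs set recordsize=1M pool1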

Code:
dd if=/dev/zero of=/mnt/pool1/test bs=1M count=50000

pool1                                   46.3G  27.7T      0  19.1K      0  3.25G
 gptid/b48c1ff8-1f31-11ea-bb92-b42e99a45c0d  23.1G  13.9T      0  9.59K      0  1.63G
 gptid/b4c09563-1f31-11ea-bb92-b42e99a45c0d  23.1G  13.9T      0  9.49K      0  1.62G
--------------------------------------  -----  -----  -----  -----  -----  -----


Same test on only one NVMe:

Code:
dd if=/dev/zero of=/mnt/pool2/test bs=1M count=50000

pool2                                   26.1G  13.8T      0  3.27K      0  3.10G
  gptid/aae97065-1f35-11ea-bb92-b42e99a45c0d  26.1G  13.8T      0  3.27K      0  3.10G


If I create a larger pool, each disk works even slower:

Code:
pool1                                   64.6G  55.4T      0  38.1K      0  3.44G
  raidz1                                64.6G  55.4T      0  38.1K      0  3.44G
    gptid/8418f539-1f3a-11ea-a940-b42e99a45c0d      -      -      0  9.99K      0  1.17G
    gptid/84530d77-1f3a-11ea-a940-b42e99a45c0d      -      -      0  9.99K      0  1.17G
    gptid/8494e236-1f3a-11ea-a940-b42e99a45c0d      -      -      0  9.85K      0  1.17G
    gptid/84cef370-1f3a-11ea-a940-b42e99a45c0d      -      -      0  9.92K      0  1.17G

root@freenas[~]# dd if=/dev/zero of=/mnt/pool1/test bs=1M count=50000
50000+0 records in
50000+0 records out
52428800000 bytes transferred in 14.801266 secs (3542183421 bytes/sec) = 3.5GB...

Can anyone point to what the issue could be, or what I could try changing to get maximum performance?
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
You're using RAIDZ1. Try a configuration with multiple vdevs. ZFS performance scales with vdevs, not just disks.
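
For illustration, a stripe of multiple mirror vdevs could be created from the shell roughly like this (device names are placeholders - on FreeNAS you would normally build the pool through the GUI):

Code:
# hypothetical NVMe device names; two mirror vdevs striped together
zpool create fastpool mirror nvd0 nvd1 mirror nvd2 nvd3
# a plain stripe of independent vdevs (no redundancy) would be:
# zpool create fastpool nvd0 nvd1 nvd2 nvd3 nvd4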
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
If you don't care about fault tolerance, you could just make a stripe of independent vdevs.
 

potzkin

Dabbler
Joined
Dec 15, 2019
Messages
16
With a stripe of all 5 disks:

Code:
pool1                                   46.6G  69.3T      0  23.6K  4.00K  3.09G
  gptid/db932bdb-1f45-11ea-8162-b42e99a45c0d  9.34G  13.9T      0  4.66K      0   634M
  gptid/dbce3e39-1f45-11ea-8162-b42e99a45c0d  9.35G  13.9T      0  4.86K  4.00K   643M
  gptid/dc029b8b-1f45-11ea-8162-b42e99a45c0d  9.31G  13.9T      0  4.83K      0   632M
  gptid/dc3cc62a-1f45-11ea-8162-b42e99a45c0d  9.29G  13.9T      0  4.49K      0   627M
  gptid/dc7c6dc5-1f45-11ea-8162-b42e99a45c0d  9.34G  13.9T      0  4.73K      0   632M
 

potzkin

Dabbler
Joined
Dec 15, 2019
Messages
16
Made 2 mirror vdevs and striped them - again not passing the 1.2G barrier on those disks... each disk should hit 3G....

Code:
pool                                    40.1G  27.7T      0  29.4K      0  2.09G
  mirror                                20.1G  13.9T      0  14.7K      0  1.04G
    gptid/6b20ee47-1f49-11ea-8162-b42e99a45c0d      -      -      0  11.0K      0  1.04G
    gptid/6b5ab09b-1f49-11ea-8162-b42e99a45c0d      -      -      0  10.9K      0  1.04G
  mirror                                20.0G  13.9T      0  14.8K      0  1.05G
    gptid/9dad175d-1f49-11ea-8162-b42e99a45c0d      -      -      0  11.1K      0  1.05G
    gptid/9debf53c-1f49-11ea-8162-b42e99a45c0d      -      -      0  11.1K      0  1.05G
 

Herr_Merlin

Patron
Joined
Oct 25, 2019
Messages
200
That is interesting. That is a build I am looking to do soon™ as well, and I do need those IOPS.
 

potzkin

Dabbler
Joined
Dec 15, 2019
Messages
16
No matter what I try I cannot pass 3GB/s writes... I really need some help here.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
With NVMe you may need to spend some time adjusting the maximum write operations that can be queued against each disk, since the defaults are "rather conservative" shall we say.

Since your writes are async right now (are they going to be in the real world?) start by increasing the value of the sysctl vfs.zfs.vdev.async_write_max_active - the default is 10. Given that you're on NVMe I would start right off the bat by doubling it to 20 and seeing what the results are.
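
As a rough sketch, checking and bumping that sysctl from the shell would look like this (20 is just the suggested starting point above; root required):

Code:
# show the current value (default 10)
sysctl vfs.zfs.vdev.async_write_max_active
# raise it for testing; add a tunable/sysctl.conf entry to make it persistent
sysctl vfs.zfs.vdev.async_write_max_active=20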
 

potzkin

Dabbler
Joined
Dec 15, 2019
Messages
16
Hi, thanks for the input.

This is what sysctl.conf looks like:

Code:
kern.metadelay=3
kern.dirdelay=4
kern.filedelay=5
kern.coredump=1
kern.sugid_coredump=1
vfs.timestamp_precision=300
net.link.lagg.lacp.default_strict_mode=0
vfs.zfs.vdev.async_write_max_active=3
vfs.zfs.vdev.sync_write_max_active=30
# Force minimal 4KB ashift for new top level ZFS VDEVs.
vfs.zfs.min_auto_ashift=12
vfs.zfs.vdev.trim_max_active=20



I didn't notice any big difference in performance.

Code:
root@freenas[~]# dd if=/dev/zero of=/mnt/pool/test bs=1M count=100000
100000+0 records in
100000+0 records out
104857600000 bytes transferred in 38.431617 secs (2728420211 bytes/sec)
pool                                    27.8G  27.7T      0  26.2K      0  2.81G
  mirror                                13.9G  13.9T      0  13.3K      0  1.40G
    gptid/6f60472f-20bb-11ea-addf-b42e99a45c0d      -      -      0  6.93K      0  1.40G
    gptid/6f9988d8-20bb-11ea-addf-b42e99a45c0d      -      -      0  6.96K      0  1.40G
  mirror                                13.9G  13.9T      0  12.9K      0  1.41G
    gptid/8b8da598-20bb-11ea-addf-b42e99a45c0d      -      -      0  6.81K      0  1.41G
    gptid/8bc4ab30-20bb-11ea-addf-b42e99a45c0d      -      -      0  6.78K      0  1.41G




I didn't see any change if I set vfs.zfs.vdev.sync_write_max_active to 300 or to 10...

More advice would be much appreciated.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
You're changing the wrong attribute. Since your pool/dataset is not set to sync=always, you want vfs.zfs.vdev.async_write_max_active (not "sync_write_max_active"), which for some reason is set to 3 on your system. Push that number up; you can probably keep the sync version at 30, though, since you are all-NVMe and those devices have extremely deep queues.

Did you enable autotune? (If you haven't - don't.)
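
A quick way to sanity-check both the dataset's sync behaviour and the queue-depth tunables currently in effect ("pool" is the pool name from the iostat output above):

Code:
# confirm writes on the pool are not being forced synchronous
zfs get sync pool
# compare the async and sync write queue depths
sysctl vfs.zfs.vdev.async_write_max_active vfs.zfs.vdev.sync_write_max_active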
 

potzkin

Dabbler
Joined
Dec 15, 2019
Messages
16
Hi, I tried the autotune option, but after I saw it wouldn't help I deleted all the config it made.
sysctl.conf:
Code:
kern.metadelay=3
kern.dirdelay=4
kern.filedelay=5
kern.coredump=1
kern.sugid_coredump=1
vfs.timestamp_precision=3
net.link.lagg.lacp.default_strict_mode=0
vfs.zfs.vdev.async_write_max_active=360
# Force minimal 4KB ashift for new top level ZFS VDEVs.
vfs.zfs.min_auto_ashift=12


This is how it looks now. After the change I run:

Code:
service sysctl restart
/etc/rc.d/sysctl restart


Is that enough, or do I need a reboot for it to apply?


I also have "vfs.zfs.vdev.async_write_max_active=360" set as a tunable, but it doesn't change anything in sysctl.conf... (even after a restart).
 

potzkin

Dabbler
Joined
Dec 15, 2019
Messages
16
I'm connected via SSH; when I enter the command it won't accept it:
zsh: command not found: vfs.zfs.vdev.async_write_max_active

I cannot find the working command for this. Anyway, I performed many tests with tunables.
Any more leads?
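
(Side note: that name is a sysctl OID rather than a command on its own, so it has to be passed to the sysctl utility - for example, with the value mentioned above:)

Code:
sysctl vfs.zfs.vdev.async_write_max_active=360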
 

potzkin

Dabbler
Joined
Dec 15, 2019
Messages
16
You are using dd, which is a single-threaded application - basically you're doing the same as I did here (https://www.ixsystems.com/community/threads/pool-performance-scaling-at-1j-qd1.80417/) - with the same problem: there is no scaling of fast drives with additional vdevs from a single process.

If your future mode of operation includes having multiple users or processes, you should test with more threads/a deeper queue, e.g. with fio.

Can you help with the command syntax?
 

Rand

Guru
Joined
Dec 30, 2013
Messages
906
Code:
fio  --direct=1 --refill_buffers --norandommap --randrepeat=0 --group_reporting --ioengine=posixaio --name="a"  --runtime=300 --size=100G --time_based  --bs=64k --iodepth=1 --numjobs=1 --rw=write --filename=/mnt/<pool>/<dataset>/out.fio


This runs a 64K blocksize test with 1 thread, queue depth 1, streaming write, duration 300s

Code:
fio  --direct=1 --refill_buffers --norandommap --randrepeat=0 --group_reporting --ioengine=posixaio --name="a"  --size=100G --bs=64k --iodepth=1 --numjobs=1 --rw=write --filename=/mnt/<pool>/<dataset>/out.fio


This is the same, but the test ends once a 100G test file has been written.

further options:
Code:
#--rw=
    #read       Sequential reads.
    #write        Sequential writes.
    #trim        Sequential trims (Linux block devices and SCSI character devices only).
    #randread        Random reads.
    #randwrite        Random writes.
    #randtrim        Random trims (Linux block devices and SCSI character devices only).
    #rw,readwrite        Sequential mixed reads and writes.
    #randrw        Random mixed reads and writes.
    #trimwrite        Sequential trim+write sequences. Blocks will be trimmed first, then the same blocks will be written to.


To run with more threads, change numjobs; to add more queue depth (stacked requests from a single thread), change iodepth - see the example below.
At some point you will become CPU-bound when scaling up (usually when you reach threads = # of cores, unless the cores are very fast).
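
For example, a variant of the time-based command above with 8 threads at queue depth 16 (thread count and depth are purely illustrative - scale them to match your workload):

Code:
fio --direct=1 --refill_buffers --norandommap --randrepeat=0 --group_reporting --ioengine=posixaio --name="a" --runtime=300 --size=100G --time_based --bs=64k --iodepth=16 --numjobs=8 --rw=write --filename=/mnt/<pool>/<dataset>/out.fio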

But remember, it's of no use whatsoever to reach gigantic numbers in benchmarks if they don't reflect your actual use case. Measure for your needs, not just to see huge numbers. :)


Edit: Fixed double entry, added info re scaling+ comments
 
Last edited: