Help needed - Low performance on NVMe pool

potzkin

Dabbler
Joined
Dec 15, 2019
Messages
16
Hello,

I'm trying to build fast NVMe storage.

I have this server:

AMD EPYC 7302P

with 5x Micron 9300 PRO 15TB U.2 NVMe - 3.5 GB/s

Samsung DDR4-2933 16GB x4

Making a RAIDZ or stripe pool leaves the disks working relatively slowly compared to what they are capable of:

- sync disabled
- compression disabled
- record size: 1M
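
(For reference, a rough sketch of how those dataset settings could be applied from the shell - "pool1" is assumed as the pool/dataset name used in the tests below; on FreeNAS these are normally set through the GUI.)

Code:
# assumed pool/dataset name - adjust to match your system
zfs set sync=disabled pool1
zfs set compression=off pool1
zfs set recordsize=1M pool1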

Code:
dd if=/dev/zero of=/mnt/pool1/test bs=1M count=50000

pool1                                   46.3G  27.7T      0  19.1K      0  3.25G
 gptid/b48c1ff8-1f31-11ea-bb92-b42e99a45c0d  23.1G  13.9T      0  9.59K      0  1.63G
 gptid/b4c09563-1f31-11ea-bb92-b42e99a45c0d  23.1G  13.9T      0  9.49K      0  1.62G
--------------------------------------  -----  -----  -----  -----  -----  -----


Same test on only one NVMe:

Code:
dd if=/dev/zero of=/mnt/pool2/test bs=1M count=50000

pool2                                   26.1G  13.8T      0  3.27K      0  3.10G
  gptid/aae97065-1f35-11ea-bb92-b42e99a45c0d  26.1G  13.8T      0  3.27K      0  3.10G


If I create a larger pool, each disk works even slower:

Code:
pool1                                   64.6G  55.4T      0  38.1K      0  3.44G
  raidz1                                64.6G  55.4T      0  38.1K      0  3.44G
    gptid/8418f539-1f3a-11ea-a940-b42e99a45c0d      -      -      0  9.99K      0  1.17G
    gptid/84530d77-1f3a-11ea-a940-b42e99a45c0d      -      -      0  9.99K      0  1.17G
    gptid/8494e236-1f3a-11ea-a940-b42e99a45c0d      -      -      0  9.85K      0  1.17G
    gptid/84cef370-1f3a-11ea-a940-b42e99a45c0d      -      -      0  9.92K      0  1.17G

root@freenas[~]# dd if=/dev/zero of=/mnt/pool1/test bs=1M count=50000
50000+0 records in
50000+0 records out
52428800000 bytes transferred in 14.801266 secs (3542183421 bytes/sec) = 3.5GB...

Can anyone point to what the issue could be, or what I could try changing to get maximum performance?
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
You're using RAIDZ1. Try a configuration with multiple vdevs. ZFS performance scales with vdevs, not just disks.
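
For illustration, a stripe of multiple mirror vdevs could be created from the shell roughly like this (device names are placeholders - on FreeNAS you would normally build the pool through the GUI):

Code:
# hypothetical NVMe device names; two mirror vdevs striped together
zpool create fastpool mirror nvd0 nvd1 mirror nvd2 nvd3
# a plain stripe of independent vdevs (no redundancy) would be:
# zpool create fastpool nvd0 nvd1 nvd2 nvd3 nvd4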
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
If you don't care about fault tolerance, you could just make a stripe of independent vdevs.
 

potzkin

Dabbler
Joined
Dec 15, 2019
Messages
16
With a stripe of all 5 disks:

Code:
pool1                                   46.6G  69.3T      0  23.6K  4.00K  3.09G
  gptid/db932bdb-1f45-11ea-8162-b42e99a45c0d  9.34G  13.9T      0  4.66K      0   634M
  gptid/dbce3e39-1f45-11ea-8162-b42e99a45c0d  9.35G  13.9T      0  4.86K  4.00K   643M
  gptid/dc029b8b-1f45-11ea-8162-b42e99a45c0d  9.31G  13.9T      0  4.83K      0   632M
  gptid/dc3cc62a-1f45-11ea-8162-b42e99a45c0d  9.29G  13.9T      0  4.49K      0   627M
  gptid/dc7c6dc5-1f45-11ea-8162-b42e99a45c0d  9.34G  13.9T      0  4.73K      0   632M
 

potzkin

Dabbler
Joined
Dec 15, 2019
Messages
16
Made 2 mirror vdevs and striped them - again not passing the 1.2G barrier on those disks... each disk should hit 3G....

Code:
pool                                    40.1G  27.7T      0  29.4K      0  2.09G
  mirror                                20.1G  13.9T      0  14.7K      0  1.04G
    gptid/6b20ee47-1f49-11ea-8162-b42e99a45c0d      -      -      0  11.0K      0  1.04G
    gptid/6b5ab09b-1f49-11ea-8162-b42e99a45c0d      -      -      0  10.9K      0  1.04G
  mirror                                20.0G  13.9T      0  14.8K      0  1.05G
    gptid/9dad175d-1f49-11ea-8162-b42e99a45c0d      -      -      0  11.1K      0  1.05G
    gptid/9debf53c-1f49-11ea-8162-b42e99a45c0d      -      -      0  11.1K      0  1.05G
 

Herr_Merlin

Patron
Joined
Oct 25, 2019
Messages
200
That is interesting. That is a build I am looking to do soon™ as well, and I do need those IOPS.
 

potzkin

Dabbler
Joined
Dec 15, 2019
Messages
16
No matter what I try I cannot pass 3GB/s writes... I really need some help here.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
With NVMe you may need to spend some time adjusting the maximum write operations that can be queued against each disk, since the defaults are "rather conservative" shall we say.

Since your writes are async right now (are they going to be in the real world?) start by increasing the value of the sysctl vfs.zfs.vdev.async_write_max_active - the default is 10. Given that you're on NVMe I would start right off the bat by doubling it to 20 and seeing what the results are.
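
As a rough sketch, checking and bumping that sysctl from the shell would look like this (20 is just the suggested starting point above; root required):

Code:
# show the current value (default 10)
sysctl vfs.zfs.vdev.async_write_max_active
# raise it for testing; add a tunable/sysctl.conf entry to make it persistent
sysctl vfs.zfs.vdev.async_write_max_active=20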
 

potzkin

Dabbler
Joined
Dec 15, 2019
Messages
16
Hi, thanks for the input.

This is what sysctl.conf looks like:

Code:
kern.metadelay=3
kern.dirdelay=4
kern.filedelay=5
kern.coredump=1
kern.sugid_coredump=1
vfs.timestamp_precision=300
net.link.lagg.lacp.default_strict_mode=0
vfs.zfs.vdev.async_write_max_active=3
vfs.zfs.vdev.sync_write_max_active=30
# Force minimal 4KB ashift for new top level ZFS VDEVs.
vfs.zfs.min_auto_ashift=12
vfs.zfs.vdev.trim_max_active=20



I didn't notice any big difference in performance.

Code:
root@freenas[~]# dd if=/dev/zero of=/mnt/pool/test bs=1M count=100000
100000+0 records in
100000+0 records out
104857600000 bytes transferred in 38.431617 secs (2728420211 bytes/sec)
pool                                    27.8G  27.7T      0  26.2K      0  2.81G
  mirror                                13.9G  13.9T      0  13.3K      0  1.40G
    gptid/6f60472f-20bb-11ea-addf-b42e99a45c0d      -      -      0  6.93K      0  1.40G
    gptid/6f9988d8-20bb-11ea-addf-b42e99a45c0d      -      -      0  6.96K      0  1.40G
  mirror                                13.9G  13.9T      0  12.9K      0  1.41G
    gptid/8b8da598-20bb-11ea-addf-b42e99a45c0d      -      -      0  6.81K      0  1.41G
    gptid/8bc4ab30-20bb-11ea-addf-b42e99a45c0d      -      -      0  6.78K      0  1.41G




I didn't see any change if I set vfs.zfs.vdev.sync_write_max_active to 300 or to 10...

More advice would be much appreciated.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
You're changing the wrong attribute. Since your pool/dataset is not set to sync=always, you want vfs.zfs.vdev.async_write_max_active (not "sync_write_max_active"), which for some reason is set to 3 on your system. Push that number up; you can probably keep the sync version at 30, though, since you are all-NVMe and those devices have extremely deep queues.

Did you enable autotune? (If you haven't - don't.)
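
A quick way to sanity-check both the dataset's sync behaviour and the queue-depth tunables currently in effect ("pool" is the pool name from the iostat output above):

Code:
# confirm writes on the pool are not being forced synchronous
zfs get sync pool
# compare the async and sync write queue depths
sysctl vfs.zfs.vdev.async_write_max_active vfs.zfs.vdev.sync_write_max_active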
 

potzkin

Dabbler
Joined
Dec 15, 2019
Messages
16
Hi, I tried the autotune option, but after I saw it wouldn't help I deleted all the config it made.
sysctl.conf:
Code:
kern.metadelay=3
kern.dirdelay=4
kern.filedelay=5
kern.coredump=1
kern.sugid_coredump=1
vfs.timestamp_precision=3
net.link.lagg.lacp.default_strict_mode=0
vfs.zfs.vdev.async_write_max_active=360
# Force minimal 4KB ashift for new top level ZFS VDEVs.
vfs.zfs.min_auto_ashift=12


This is how it looks now. After the change I run:

Code:
service sysctl restart
/etc/rc.d/sysctl restart


Is that enough, or do I need a reboot for it to apply?


I also have "vfs.zfs.vdev.async_write_max_active=360" set as a tunable, but it doesn't change anything in sysctl.conf... (even after a restart).
 

potzkin

Dabbler
Joined
Dec 15, 2019
Messages
16
I'm connected via SSH; when I enter the command it won't accept it:
zsh: command not found: vfs.zfs.vdev.async_write_max_active

I cannot find the working command for this. Anyway, I performed many tests with tunables.
Any more leads?
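
(Side note: that name is a sysctl OID rather than a command on its own, so it has to be passed to the sysctl utility - for example, with the value mentioned above:)

Code:
sysctl vfs.zfs.vdev.async_write_max_active=360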
 

potzkin

Dabbler
Joined
Dec 15, 2019
Messages
16
You are using dd, which is a single-threaded application - basically you're doing the same as I did here (https://www.ixsystems.com/community/threads/pool-performance-scaling-at-1j-qd1.80417/) - with the same problem: there is no scaling of fast drives with additional vdevs from a single process.

If your future mode of operation includes having multiple users or processes, you should test with more threads/a deeper queue, e.g. with fio.

Can you help with the command syntax?
 

Rand

Guru
Joined
Dec 30, 2013
Messages
906
Code:
fio  --direct=1 --refill_buffers --norandommap --randrepeat=0 --group_reporting --ioengine=posixaio --name="a"  --runtime=300 --size=100G --time_based  --bs=64k --iodepth=1 --numjobs=1 --rw=write --filename=/mnt/<pool>/<dataset>/out.fio


This runs a 64K blocksize test with 1 thread, queue depth 1, streaming write, duration 300s

Code:
fio  --direct=1 --refill_buffers --norandommap --randrepeat=0 --group_reporting --ioengine=posixaio --name="a"  --size=100G --bs=64k --iodepth=1 --numjobs=1 --rw=write --filename=/mnt/<pool>/<dataset>/out.fio


This is the same, but the test ends once a 100G test file has been written.

further options:
Code:
#--rw=
    #read       Sequential reads.
    #write        Sequential writes.
    #trim        Sequential trims (Linux block devices and SCSI character devices only).
    #randread        Random reads.
    #randwrite        Random writes.
    #randtrim        Random trims (Linux block devices and SCSI character devices only).
    #rw,readwrite        Sequential mixed reads and writes.
    #randrw        Random mixed reads and writes.
    #trimwrite        Sequential trim+write sequences. Blocks will be trimmed first, then the same blocks will be written to.


To run with more threads, change numjobs; to add more queue depth (stacked requests from a single thread), change iodepth - see the example below.
At some point you will become CPU-bound when scaling up (usually when you reach threads = # of cores, unless the cores are very fast).
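
For example, a variant of the time-based command above with 8 threads at queue depth 16 (thread count and depth are purely illustrative - scale them to match your workload):

Code:
fio --direct=1 --refill_buffers --norandommap --randrepeat=0 --group_reporting --ioengine=posixaio --name="a" --runtime=300 --size=100G --time_based --bs=64k --iodepth=16 --numjobs=8 --rw=write --filename=/mnt/<pool>/<dataset>/out.fio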

But remember, it's of no use whatsoever to reach gigantic numbers in benchmarks if they don't reflect your actual use case. Measure for your needs, not just to see huge numbers. :)


Edit: Fixed double entry, added info re scaling+ comments
 
Last edited: