ZFS array performance terrible when using RAIDZ

Joined
Sep 6, 2022
Messages
5
I'm really confused here. I'm running an old HX5510 server (a Lenovo server optimized for software-defined storage) with a SAS3008 controller in IT mode and enterprise drives (12Gbps SAS, 7200 RPM, 6TB). All drives are recognized nicely, and with ARC enabled I obviously reach incredible performance.

But the moment I set sync=always to check the raw pool performance, I reach at most 5 (!!!) MB/s on a 4+2 RAIDZ. I tried two vdevs of 2+1 instead, but saw only marginal improvement.

These drives do 200 MB/s on sequential workloads, but once I use RAIDZ it's down to almost nothing. Going with mirrors is significantly better, but obviously not an option considering the loss of capacity.

I did test Unraid on the same hardware and achieved 500 MB/s of array performance with XFS, so something is really odd.

I read a forum post from somebody who tested TrueNAS CORE and solved some of the issues, but that is not an option as I'm looking to run containers and VMs too...

Any ideas to get ZFS to perform on SCALE with my hardware config?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Any ideas to get ZFS to perform on SCALE with my hardware config?


TL;DR: Use mirrors, not RAIDZ, for sync-write performance, and use a fast SLOG.

Alternative (since you seem unwilling/unable to change your hardware): sync=disabled and live with the risk. (A UPS and frequent snapshots/replication mitigate that.)
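For reference, sync is a per-dataset property, so it can be disabled only where the risk is acceptable. A sketch (`raid/media` is a placeholder dataset name):

```shell
# Disable sync acknowledgement on one dataset only:
zfs set sync=disabled raid/media    # placeholder dataset name
# Verify the setting took effect:
zfs get sync raid/media
# Mitigate the added risk with frequent snapshots:
zfs snapshot raid/media@hourly-backup
```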
 
Last edited:
Joined
Sep 6, 2022
Messages
5


TL;DR: Use mirrors, not RAIDZ, for sync-write performance, and use a fast SLOG.

Alternative (since you seem unwilling/unable to change your hardware): sync=disabled and live with the risk. (A UPS and frequent snapshots/replication mitigate that.)
Thanks for the feedback.

I checked the Path to Success guide and have a question there. It states that performance may be as bad as a single drive, but that would actually be okay (i.e. a ballpark of 200 MB/s). I'm at 5 MB/s, and I just can't believe this is all ZFS will do.

It's not that I'm not willing to change my setup, but what would be better then? Reading the recommendations, I should be quite OK.

Doing sync=disabled is certainly an option, but with an array performance of 5 MB/s the data will just never get written out, and no UPS will keep this server up for hours to flush it all...

Best regards
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Thanks for the feedback.

I checked the Path to Success guide and have a question there. It states that performance may be as bad as a single drive, but that would actually be okay (i.e. a ballpark of 200 MB/s). I'm at 5 MB/s, and I just can't believe this is all ZFS will do.

It's not that I'm not willing to change my setup, but what would be better then? Reading the recommendations, I should be quite OK.

Doing sync=disabled is certainly an option, but with an array performance of 5 MB/s the data will just never get written out, and no UPS will keep this server up for hours to flush it all...

Best regards
5MB/s will be a function of the test workload...

What software is doing the testing
What access protocol
I/O size vs record size
Queue depth of test software.

You need to specify your tests well before anyone can see whether this is expected or not.

Are you IOPS- or bandwidth-oriented for your use case? What reliability is required? An important database, or photos?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112

My knowledge on these is a little rusty, but those were HCI appliances with 2x SSD and 6x HDD up front. They should be a good match for either TrueNAS CORE or SCALE.

Are you using the embedded 64GB SATADOM as a boot device here, or was that removed?

Regarding the sync question - a significant (sometimes phrased as "massive") performance reduction is expected when running small records on spinning disks, even more so when using RAIDZ. This is less of a "ZFS limitation" and more of a "laws of physics" limitation at smaller records.

If your intent is to run VMs and containers on the TN SCALE machine only, then sync is less critical - you're significantly less likely to have a situation where your storage crashes and your apps don't, because the "compute" and "storage" layers are in the same physical machine.

Can you post some details on the server (you mentioned a 4+2 RAIDZ2 - what's in the other two bays, the SSDs?) and the benchmark you are using to get the aforementioned 5MB/s number?
 
Joined
Sep 6, 2022
Messages
5
5MB/s will be a function of the test workload...

What software is doing the testing
What access protocol
I/O size vs record size
Queue depth of test software.

You need to specify your tests well before anyone can see whether this is expected or not.

Are you IOPS- or bandwidth-oriented for your use case? What reliability is required? An important database, or photos?
I did quite a few tests, but here is the simplest one:

Context: I have another all-flash pool with two NVMe drives in a mirror, so all IOPS-oriented workloads should be placed there anyway. The RAIDZ pool is targeted at high-volume data (primarily home and group drives), accessed directly via NAS shares and Nextcloud, PhotoPrism, or similar tools.

I tested with fio, dd, and network copies from my other NAS... here is an example dd run directly on the TrueNAS box, to look at how the filesystem behaves before any SMB etc. is involved:

I created 4 different datasets to evaluate the impact of sync and compression:

drwx-w---- 2 1000 root 3 Sep 7 01:34 all_off
drwxrwxr-x 2 root root 3 Sep 6 08:27 default
drwxrwxr-x 2 root root 4 Sep 6 08:27 no_compress
drwxrwxr-x 2 root root 3 Sep 6 07:59 no_sync
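A sketch of how such datasets could be created; the property values here are inferred from the dataset names, and the actual settings may differ:

```shell
# Hypothetical reconstruction of the four test datasets on pool "raid";
# sync/compression values are guessed from the names.
zfs create -o sync=standard -o compression=lz4 raid/default
zfs create -o sync=standard -o compression=off raid/no_compress
zfs create -o sync=disabled -o compression=lz4 raid/no_sync
zfs create -o sync=disabled -o compression=off raid/all_off
```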

Then I ran dd with a bandwidth focus, with different setups, on these four datasets:

Skip caching to test array performance:
dd if=/dev/random of=/mnt/raid/all_off/testfile bs=1024000 count=500 oflag=dsync
dd if=/dev/random of=/mnt/raid/default/testfile bs=1024000 count=500 oflag=dsync
dd if=/dev/random of=/mnt/raid/no_compress/testfile bs=1024000 count=500 oflag=dsync
dd if=/dev/random of=/mnt/raid/no_sync/testfile bs=1024000 count=500 oflag=dsync


With caching
dd if=/dev/random of=/mnt/raid/all_off/testfile bs=1024000 count=500
dd if=/dev/random of=/mnt/raid/default/testfile bs=1024000 count=500
dd if=/dev/random of=/mnt/raid/no_compress/testfile bs=1024000 count=500
dd if=/dev/random of=/mnt/raid/no_sync/testfile bs=1024000 count=500

With zeros to eliminate CPU bottleneck
dd if=/dev/zero of=/mnt/raid/all_off/testfile bs=1024000 count=500
dd if=/dev/zero of=/mnt/raid/default/testfile bs=1024000 count=500
dd if=/dev/zero of=/mnt/raid/no_compress/testfile bs=1024000 count=500
dd if=/dev/zero of=/mnt/raid/no_sync/testfile bs=1024000 count=500


this results in the following:

500+0 records in
500+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 1.84614 s, 277 MB/s
500+0 records in
500+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 18.8364 s, 27.2 MB/s
500+0 records in
500+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 18.2582 s, 28.0 MB/s
500+0 records in
500+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 18.1279 s, 28.2 MB/s
500+0 records in
500+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 1.86237 s, 275 MB/s
500+0 records in
500+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 1.87071 s, 274 MB/s
500+0 records in
500+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 1.85611 s, 276 MB/s
500+0 records in
500+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 1.91697 s, 267 MB/s
500+0 records in
500+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 0.289289 s, 1.8 GB/s
500+0 records in
500+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 0.343046 s, 1.5 GB/s
500+0 records in
500+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 0.354501 s, 1.4 GB/s
500+0 records in
500+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 0.355857 s, 1.4 GB/s

Especially the ones marked in red are surprising to me... (PS: it's now 28 MB/s since I added two more drives and now have two vdevs of 3+1 RAIDZ vs. the 5+1 vdev where I only got 5 MB/s in these tests.)

If I now add another 256GB NVMe drive as SLOG, I can solve the issue, but I'm still not able to really understand why ZFS would end up at 27 MB/s in a test like this:

500+0 records in
500+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 1.82277 s, 281 MB/s
500+0 records in
500+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 2.77786 s, 184 MB/s
500+0 records in
500+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 2.37096 s, 216 MB/s
500+0 records in
500+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 2.99992 s, 171 MB/s
500+0 records in
500+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 2.0491 s, 250 MB/s
500+0 records in
500+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 1.89032 s, 271 MB/s
500+0 records in
500+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 2.03548 s, 252 MB/s
500+0 records in
500+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 1.85154 s, 277 MB/s
500+0 records in
500+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 0.33542 s, 1.5 GB/s
500+0 records in
500+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 0.363928 s, 1.4 GB/s
500+0 records in
500+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 0.357674 s, 1.4 GB/s
500+0 records in
500+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 0.331725 s, 1.5 GB/s


I hope this gives some insight :)
 
Joined
Sep 6, 2022
Messages
5
My knowledge on these is a little rusty, but those were HCI appliances with 2x SSD and 6x HDD up front. They should be a good match for both TrueNAS CORE or SCALE.
Correct, but I changed to PCIe cards with NVMe for the flash, as the SSDs were worn out after 5 years, so I have 8 front bays available for HDDs.
Are you using the embedded 64GB SATADOM as a boot device here, or was that removed?
Correct, that is what it's installed on.
Regarding the sync question - a significant (sometimes phrased as "massive") performance reduction is expected when running small records on spinning disks, even more so when using RAIDZ. This is less of a "ZFS limitation" and more of a "laws of physics" limitation at smaller records.
That's clear; I was testing with 1 MB chunks to target a big-file NAS use case.
If your intent is to run VMs and containers on the TN SCALE machine only, then sync is less critical - you're significantly less likely to have a situation where your storage crashes and your apps don't, because the "compute" and "storage" layers are in the same physical machine.
That's clear; I have a separate NVMe pool for this purpose. I'm not running this live; I'm trying out the system and learning, but was just really surprised by what I got out of 6x and now 8x HDDs. I'm not expecting an IOPS monster here, but if my 8-year-old Synology outperforms this server even though I now have enterprise drives, I'm a little confused.
Can you post some details on the server (you mentioned a 4+2 RAIDZ2 - what's in the other two bays, the SSDs?) and the benchmark you are using to get the aforementioned 5MB/s number?
Check my other response; I have listed one of the tests in detail :) Thanks for your time.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I'm still a little puzzled about the purpose of the test, I suppose.

But the moment I have a sync=always to check the raid performance,

Do you plan to hit the disks with a heavy sync-write workload in production? If so, try running something like the below command:

fio --name=testrun --ioengine=posixaio --bs=1M --numjobs=1 --size=10G --iodepth=1 --runtime=60 --time_based --fsync=1 --rw=write

Breaking it down, that's writing 1MB blocks, from a single job, with a sync after each block - if you run this on a dataset with sync=disabled then those syncs will be ignored, but otherwise they'll be honored by ZFS.
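To compare the effect of the dataset settings directly, the same command could be looped over the datasets from the earlier post. A sketch; the `/mnt/raid/*` paths are assumed from that post:

```shell
# Run the identical sync-write test in each dataset and keep only
# the summary bandwidth line; /mnt/raid/* paths assumed from above.
for ds in default no_compress no_sync all_off; do
  echo "== $ds =="
  (cd "/mnt/raid/$ds" && \
   fio --name=testrun --ioengine=posixaio --bs=1M --numjobs=1 \
       --size=10G --iodepth=1 --runtime=60 --time_based \
       --fsync=1 --rw=write | grep 'WRITE:')
done
```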
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
I created 4 different datasets to evaluate the impact of sync and compression:

drwx-w---- 2 1000 root 3 Sep 7 01:34 all_off
drwxrwxr-x 2 root root 3 Sep 6 08:27 default
drwxrwxr-x 2 root root 4 Sep 6 08:27 no_compress
drwxrwxr-x 2 root root 3 Sep 6 07:59 no_sync

Then I ran dd with a bandwidth focus, with different setups, on these four datasets:

Skip caching to test array performance:
dd if=/dev/random of=/mnt/raid/all_off/testfile bs=1024000 count=500 oflag=dsync
dd if=/dev/random of=/mnt/raid/default/testfile bs=1024000 count=500 oflag=dsync
dd if=/dev/random of=/mnt/raid/no_compress/testfile bs=1024000 count=500 oflag=dsync
dd if=/dev/random of=/mnt/raid/no_sync/testfile bs=1024000 count=500 oflag=dsync





this results in the following:

500+0 records in
500+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 1.84614 s, 277 MB/s
500+0 records in
500+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 18.8364 s, 27.2 MB/s
500+0 records in
500+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 18.2582 s, 28.0 MB/s
500+0 records in
500+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 18.1279 s, 28.2 MB/s
500+0 records in
500+0 records out



I hope this gives some insight :)
dd with oflag=dsync is always slow. It requires each write to be committed individually. If you remove the ability to acknowledge a write from RAM or from an SLOG, it's very slow. It's like having a conversation with an astronaut...

https://superuser.com/questions/943952/why-am-i-getting-very-slow-dd-dsync-test-results-in-linux
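A back-of-envelope check using the quoted run (500 x 1 MB writes in 18.8364 s): with one commit per write, throughput is bounded by per-commit latency.

```shell
# Derive per-commit latency and effective bandwidth from the dd run
# above: 500 dsync writes of 1 MB in 18.8364 s.
t=18.8364; n=500; bytes=512000000
awk -v t="$t" -v n="$n" 'BEGIN { printf "%.1f ms per commit\n", 1000*t/n }'
# -> 37.7 ms per commit
awk -v b="$bytes" -v t="$t" 'BEGIN { printf "%.1f MB/s effective\n", b/t/1e6 }'
# -> 27.2 MB/s effective
```

At 7200 RPM a single platter rotation is ~8.3 ms, so tens of milliseconds per synchronous 1 MB commit on RAIDZ (ZIL write plus the record itself, plus parity) is plausible; the bandwidth follows directly from that latency.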
 
Joined
Sep 6, 2022
Messages
5
I'm still a little puzzled about the purpose of the test, I suppose.
The purpose is to understand the array performance, because that is ultimately the potential limiter... The main reason I noticed is that by default an SMB share is configured to force sync, and I could not even saturate a 1Gbit NIC with a big file transfer. Of course I could tune the share setting and disable sync, but that setting should be there for a reason.
Do you plan to hit the disks with a heavy sync-write workload in production? If so, try running something like the below command:

fio --name=testrun --ioengine=posixaio --bs=1M --numjobs=1 --size=10G --iodepth=1 --runtime=60 --time_based --fsync=1 --rw=write

Breaking it down, that's writing 1MB blocks, from a single job, with a sync after each block - if you run this on a dataset with sync=disabled then those syncs will be ignored, but otherwise they'll be honored by ZFS.
This results in 40 MB/s worst case.

Result on the default dataset:

Code:
root@truenas[~]# cd /mnt/raid/default
root@truenas[/mnt/raid/default]# fio --name=testrun --ioengine=posixaio --bs=1M --numjobs=1 --size=10G --iodepth=1 --runtime=60 --time_based --fsync=1 --rw=write
testrun: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=posixaio, iodepth=1
fio-3.25
Starting 1 process
testrun: Laying out IO file (1 file / 10240MiB)
Jobs: 1 (f=1): [W(1)][100.0%][w=43.0MiB/s][w=43 IOPS][eta 00m:00s]
testrun: (groupid=0, jobs=1): err= 0: pid=66116: Thu Sep  8 01:22:01 2022
  write: IOPS=40, BW=40.0MiB/s (41.0MB/s)(2401MiB/60003msec); 0 zone resets
    slat (usec): min=25, max=403, avg=58.06, stdev=11.66
    clat (usec): min=320, max=911, avg=500.34, stdev=132.77
     lat (usec): min=374, max=972, avg=558.40, stdev=133.55
    clat percentiles (usec):
     |  1.00th=[  355],  5.00th=[  388], 10.00th=[  420], 20.00th=[  429],
     | 30.00th=[  433], 40.00th=[  437], 50.00th=[  441], 60.00th=[  441],
     | 70.00th=[  453], 80.00th=[  660], 90.00th=[  766], 95.00th=[  783],
     | 99.00th=[  807], 99.50th=[  824], 99.90th=[  865], 99.95th=[  865],
     | 99.99th=[  914]
   bw (  KiB/s): min=14336, max=53248, per=100.00%, avg=41011.63, stdev=7111.81, samples=119
   iops        : min=   14, max=   52, avg=40.05, stdev= 6.95, samples=119
  lat (usec)   : 500=78.01%, 750=8.70%, 1000=13.29%
  fsync/fdatasync/sync_file_range:
    sync (msec): min=5, max=118, avg=24.41, stdev=12.32
    sync percentiles (msec):
     |  1.00th=[   11],  5.00th=[   14], 10.00th=[   16], 20.00th=[   18],
     | 30.00th=[   19], 40.00th=[   20], 50.00th=[   21], 60.00th=[   23],
     | 70.00th=[   26], 80.00th=[   31], 90.00th=[   36], 95.00th=[   43],
     | 99.00th=[   90], 99.50th=[   95], 99.90th=[  105], 99.95th=[  118],
     | 99.99th=[  120]
  cpu          : usr=0.40%, sys=0.27%, ctx=4804, majf=6, minf=46
  IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,2401,0,2401 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=40.0MiB/s (41.0MB/s), 40.0MiB/s-40.0MiB/s (41.0MB/s-41.0MB/s), io=2401MiB (2518MB), run=60003-60003msec



Result with everything disabled:

Code:
root@truenas[/mnt/raid/all_off]# fio --name=testrun --ioengine=posixaio --bs=1M --numjobs=1 --size=10G --iodepth=1 --runtime=60 --time_based --fsync=1 --rw=write
testrun: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=posixaio, iodepth=1
fio-3.25
Starting 1 process
testrun: Laying out IO file (1 file / 10240MiB)
Jobs: 1 (f=1): [W(1)][100.0%][w=513MiB/s][w=513 IOPS][eta 00m:00s]
testrun: (groupid=0, jobs=1): err= 0: pid=74092: Thu Sep  8 01:25:27 2022
  write: IOPS=463, BW=464MiB/s (486MB/s)(27.2GiB/60002msec); 0 zone resets
    slat (usec): min=18, max=313, avg=51.65, stdev=11.81
    clat (usec): min=312, max=16047, avg=2066.07, stdev=672.52
     lat (usec): min=358, max=16103, avg=2117.71, stdev=673.32
    clat percentiles (usec):
     |  1.00th=[  388],  5.00th=[  660], 10.00th=[  766], 20.00th=[ 1778],
     | 30.00th=[ 1860], 40.00th=[ 1926], 50.00th=[ 1991], 60.00th=[ 2089],
     | 70.00th=[ 2278], 80.00th=[ 2769], 90.00th=[ 2966], 95.00th=[ 3032],
     | 99.00th=[ 3195], 99.50th=[ 3294], 99.90th=[ 3720], 99.95th=[ 4178],
     | 99.99th=[10421]
   bw (  KiB/s): min=301056, max=1445888, per=99.94%, avg=474336.92, stdev=198416.44, samples=119
   iops        : min=  294, max= 1412, avg=463.22, stdev=193.77, samples=119
  lat (usec)   : 500=1.82%, 750=8.04%, 1000=0.55%
  lat (msec)   : 2=40.44%, 4=49.08%, 10=0.04%, 20=0.01%
  fsync/fdatasync/sync_file_range:
    sync (nsec): min=843, max=893402, avg=20543.36, stdev=13375.54
    sync percentiles (usec):
     |  1.00th=[   16],  5.00th=[   17], 10.00th=[   17], 20.00th=[   18],
     | 30.00th=[   18], 40.00th=[   19], 50.00th=[   19], 60.00th=[   20],
     | 70.00th=[   20], 80.00th=[   21], 90.00th=[   24], 95.00th=[   29],
     | 99.00th=[   58], 99.50th=[   68], 99.90th=[  215], 99.95th=[  306],
     | 99.99th=[  515]
  cpu          : usr=3.84%, sys=1.31%, ctx=55671, majf=0, minf=49
  IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,27811,0,27810 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=464MiB/s (486MB/s), 464MiB/s-464MiB/s (486MB/s-486MB/s), io=27.2GiB (29.2GB), run=60002-60002msec
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
The purpose is to understand the array performance, because that is ultimately the potential limiter.

It's a limit if you actually require the safety of synchronous writes. Most general "file server" workloads don't - sync writes are necessary when you are "editing directly on the storage" so to speak, databases or virtual machine images being the most common. If the workflow is "open file, edit file, save file" then the risk is significantly reduced because a network interruption can be resolved by "press the save button again" or "save it locally first, then stream to network in the background" in the workflow.

The main reason I noticed is that by default an SMB share is configured to force sync

SMB is set to use strict sync=yes by default now, but that's different from a dataset itself being set as synchronous. If a client requests synchronous writes (MacOS) then it will see this - otherwise (Windows) it won't.

Can you post the output of zfs get sync? That should show the output for all pools/datasets in your system. I'd be very surprised to see sync=always be a default value.

(fio line) results in 40 MB/s worst case

That seems unusually slow. I ran the same command against a simple 2-disk mirror (sync=standard, compression=lz4) and got 36MiB/s. Even with fsync=1 sequential writes should be scaling better against a RAIDZ setup.

Run status group 0 (all jobs): WRITE: bw=36.2MiB/s (37.9MB/s), 36.2MiB/s-36.2MiB/s (37.9MB/s-37.9MB/s), io=2171MiB (2276MB), run=60053-60053msec
 