TrueNAS 13 way slower than 12

MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
I feel like I should know better than to make another "why is TrueNAS slow" post, but really I can't figure this one out.

I often store large (50+ GB) files on one dataset and then transfer them to another dataset within the same pool. On TrueNAS 12-U8 I was getting about 1.2GBps out of the box on fresh reads, and about 1.5GBps on reads from ARC. Adding only the tunable vfs.zfs.zfetch.max_distance and setting it to 2147483648 increased speeds to about 1.5GBps on fresh reads, and over 2GBps reading from ARC (where I copy the same file again). I tried lower values for that tunable, and performance kept increasing until I hit that number, so I stuck with it. Yes, I know it's way beyond the default.
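
For reference, the shell equivalent of that tunable looks something like this (it's a sysctl, so on TrueNAS it would normally go in as a sysctl-type tunable so it persists across reboots; the value is in bytes):
Code:
# check the current prefetch distance, then raise it (runtime change only)
sysctl vfs.zfs.zfetch.max_distance
sysctl vfs.zfs.zfetch.max_distance=2147483648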

File transfers have been attempted using SMB from Windows 11, where server side copy takes effect, and by using a simple cp command via ssh. I can download via the 10Gbps network at 1GBps (gigabytes, not bits) easily, so there are definitely no network issues. In fact, I can easily transfer 10 50GB files at the same time, and it still saturates the 10Gb link without the disks breaking a sweat. So to me, this is not a "why is my 10Gb network/SMB slow" question. There is no other activity when I'm doing this. Disks and CPU are idle until the transfer is attempted.

On TrueNAS 13-U1.1 I hit about 600MBps on file transfers between datasets with fresh data. Re-transferring after it gets stored in ARC brings speeds to about 1GBps. The tunable makes no difference on or off, and my performance is pretty much cut in half now.

Now before you tell me that 600MBps is great and I should be happy with that performance, especially since I'm reading and writing to the same pool at the same time, let me explain why I think it should be better.

Here are my dd speed tests writing and then reading back a 500GB file:
root@nas:~/Files/Downloads # dd if=/dev/zero of=test.dat bs=1m count=512000
512000+0 records in
512000+0 records out
536870912000 bytes transferred in 94.790463 secs (5663765067 bytes/sec)
root@nas:~/Files/Downloads # dd of=/dev/null if=test.dat bs=1m count=512000
512000+0 records in
512000+0 records out
536870912000 bytes transferred in 133.549504 secs (4020014270 bytes/sec)

Yes, that is most definitely on a dataset with compression turned off, atime off, and all datasets involved are using 1MB block sizes.
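
If anyone wants to double-check that, the relevant properties can be read straight off the datasets (the dataset name below is just a placeholder for the ones involved):
Code:
zfs get compression,recordsize,atime pool/dataset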

Scrubs typically run at around 8GBps. I scrub through about 20TB of data in an hour or so, probably limited by the CPU being too slow.

The pool consists of 16 SATA SSDs connected to a 9400-16i, and another 16 SATA SSDs connected to a second 9400-16i, for a total of 32 SATA SSDs. Each 9400-16i is connected to its own PCIe x8 slot, and the slots do not split bandwidth on the motherboard. The pool is made up of 4 RAIDZ2 vdevs. I have 256GB of memory installed; the CPU is a W-2125.

This is disk usage during a fresh transfer:

[screenshot: disk I/O during a fresh transfer]

Here is the same transfer again when it's already in ARC:

[screenshot: disk I/O during the same transfer from ARC]

CPU usage:

[screenshot: CPU usage during the transfer]


This system has always performed way slower than I thought it should, but I've just lived with it. Now with my performance cut in half just from an update, it's getting silly.

Reverting to 12-U8 immediately brings performance back. Just to be sure, I reverted, then re-upgraded to 13: same exact thing.

Please let me know what I'm missing here.
 

MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
Just to add another data point/test. I have another pool on the same system consisting of 4 SATA SSDs connected to the motherboard SATA ports, in mirror vdevs. If I copy a file from the larger pool to a test dataset on the mirror pool, and then turn around and copy that same file back to a different dataset on the Main pool it started from (at that point it is obviously being read from ARC), the performance problem shows up on every leg of the transfer. These transfers were done using cp via ssh.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Quite a few benchmarks evaluated 13 as faster than 12, so it's very likely a local issue.
It is also usually suggested to use fio instead of dd for performance testing.
Anyway, have fun troubleshooting.
 

MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
Quite a few benchmarks evaluated 13 as faster than 12, so it's very likely a local issue.
It is also usually suggested to use fio instead of dd for performance testing.
Anyway, have fun troubleshooting.
I'd happily run some testing with fio, but I'm having trouble finding the right options for a sequential read/write test. I don't think the results I'm getting, which show 20GBps, are going to be useful. Maybe you could share something specific to clue me in?
 

MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
Ok, this looks correct to me. I watched the data write, so it's not a result of compression:
Code:
 fio --randrepeat=1 --ioengine=posixaio --direct=1 --name=test --filename=test --bs=4M --size=100G --readwrite=write --ramp_time=4
test: (g=0): rw=write, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=posixaio, iodepth=1
fio-3.28
Starting 1 process
test: Laying out IO file (1 file / 102400MiB)
Jobs: 1 (f=1): [W(1)][84.0%][w=4812MiB/s][w=1202 IOPS][eta 00m:04s]
test: (groupid=0, jobs=1): err= 0: pid=50162: Sat Aug  6 23:23:39 2022
  write: IOPS=1240, BW=4961MiB/s (5202MB/s)(80.4GiB/16591msec); 0 zone resets
    slat (usec): min=35, max=3303, avg=196.47, stdev=265.84
    clat (nsec): min=1980, max=8641.5k, avg=606325.60, stdev=548661.85
     lat (usec): min=347, max=9105, avg=802.80, stdev=686.76
    clat percentiles (usec):
     |  1.00th=[    6],  5.00th=[  310], 10.00th=[  314], 20.00th=[  318],
     | 30.00th=[  322], 40.00th=[  326], 50.00th=[  330], 60.00th=[  338],
     | 70.00th=[  367], 80.00th=[  996], 90.00th=[ 1598], 95.00th=[ 1844],
     | 99.00th=[ 2212], 99.50th=[ 2409], 99.90th=[ 3032], 99.95th=[ 3720],
     | 99.99th=[ 5145]
   bw (  MiB/s): min= 4008, max= 5768, per=99.99%, avg=4960.77, stdev=274.08, samples=33
   iops        : min= 1002, max= 1442, avg=1239.82, stdev=68.50, samples=33
  lat (usec)   : 2=0.01%, 4=0.45%, 10=1.59%, 20=0.09%, 50=0.02%
  lat (usec)   : 100=0.05%, 250=0.19%, 500=71.58%, 750=1.72%, 1000=4.35%
  lat (msec)   : 2=17.09%, 4=2.80%, 10=0.04%
  cpu          : usr=17.45%, sys=0.80%, ctx=55536, majf=0, minf=1
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,20578,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=4961MiB/s (5202MB/s), 4961MiB/s-4961MiB/s (5202MB/s-5202MB/s), io=80.4GiB (86.3GB), run=16591-16591msec
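
For completeness, the matching sequential read of that same file would be along these lines (same options with readwrite flipped; the read may be served partly from ARC unless the file is larger than RAM):
Code:
fio --randrepeat=1 --ioengine=posixaio --direct=1 --name=test --filename=test --bs=4M --size=100G --readwrite=read --ramp_time=4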
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
I'd happily run some testing with fio, but I'm having trouble finding the right options for a sequential read/write test. I don't think the results I'm getting, which show 20GBps, are going to be useful. Maybe you could share something specific to clue me in?
I don't actually know how to use it myself, but that seems to be the consensus here.
You probably want the iodepth value to be greater than 1, but besides that...
Sorry for not being able to help.

Edit: take a look at this post (#13), it might be useful.
 

MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
Thanks for linking that post. I will try to learn more about using fio when I get a chance, but copying the last test posted by sretalla I get: WRITE: bw=3798MiB/s (3982MB/s), 3798MiB/s-3798MiB/s (3982MB/s-3982MB/s), io=1500GiB (1611GB), run=404472-404472msec

using:
Code:
fio --name TEST --eta-newline=5s --filename=fio-tempfile.dat --rw=randwrite --size=500g --io_size=1500g --blocksize=10m --iodepth=1 --direct=1 --numjobs=1 --runtime=3600 --group_reporting

Hopefully that's enough to prove the pool itself is fast enough to get the performance I want and the bottleneck is someplace else.
 

MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
Anyone have any ideas on this? Did something in particular change in TrueNAS 13 which might affect how large files are read or written? I know there are some speed optimizations that took place, but it seems they are oriented towards IOPS and iSCSI which is not what I'm looking at here.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
dd is notoriously slow since it reads, then writes, then reads, then writes... It's very single-threaded and latency-sensitive, and the actual bandwidth is choked by the lack of parallelism. We don't use dd to test NAS performance... You've tuned your way around it by using ARC.

So, I'd guess that something has changed in the latency... it could be in ARC or SMB. See if you can find a change there.
If you could do a local test and compare versions, that might eliminate ARC as the culprit. If there is a negative difference, then ARC operation may have changed. That would be worthwhile reporting.

If you use fio with larger queue depth, you'd see the array has much higher bandwidth.
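
Something along these lines would illustrate it; the iodepth and numjobs values here are just examples, not a recommendation:
Code:
fio --name=qdtest --ioengine=posixaio --rw=read --bs=1M --size=20G --iodepth=16 --numjobs=4 --group_reporting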
 

MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
dd is notoriously slow since it reads, then writes, then reads, then writes... It's very single-threaded and latency-sensitive, and the actual bandwidth is choked by the lack of parallelism. We don't use dd to test NAS performance... You've tuned your way around it by using ARC.

So, I'd guess that something has changed in the latency... it could be in ARC or SMB. See if you can find a change there.
If you could do a local test and compare versions, that might eliminate ARC as the culprit. If there is a negative difference, then ARC operation may have changed. That would be worthwhile reporting.

If you use fio with larger queue depth, you'd see the array has much higher bandwidth.
Thanks for your response. Using fio I'm seeing bandwidth over 5GBps, and with dd over 5GBps as well. Those tools may be slow or unoptimized, but 5GBps is fast enough for me to not complain about it. If I'm interpreting those things correctly, my raw disk performance is fine. It's just doing a file copy that lags behind significantly, at less than 20% of the disks' throughput.

While the initial issue was in doing copies/moves via SMB, I quickly started testing locally to rule out SMB or network issues as the cause. I'm not sure how to isolate the issue much further in terms of ARC latency. Are there specific tests, in addition to what I posted above, that you can suggest?
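
If it helps, this is roughly what I have handy for watching the disks while a copy runs (the pool name is a placeholder); happy to capture output from anything more specific:
Code:
gstat -p                   # per-disk busy %, ms/r and ms/w while the copy runs
zpool iostat -vl pool 1    # per-vdev throughput and average latencies at 1s intervals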
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
If you got good performance locally doing file copies (same as 12.0), but slower performance doing SMB, then it's useful to know and document.

Our performance testing team is now looking for these issues... the initial focus on any new release is on stability and functionality.

I'd also recommend looking at parallel file copying software.
 

MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
The issue is that local file copies are slow, slower than via SMB. SMB performance is great; that's why I'm at a loss. I can copy a file more quickly via SMB than I can locally, even if that data is being re-read straight from ARC to disk, or from another local pool made of SSDs.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
SMB has Server side copy... are you using that?

 

MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
Yes, it is for sure doing server side copy. I have no local network activity during copy operations that run at about 600MBps when doing the copies on SMB shares. I'm also testing locally on the box via ssh by running cp /dataset1/file /dataset2/file2 to make sure no network bottlenecks are involved.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Server-side copy is much faster... it moves pointers rather than real data. SMB server-side copy will be faster than actual copies.

Most of the slowness in copying is due to the copy software being non-parallel. If you can see a major difference in read or write performance, it is much easier to diagnose and discuss.
 

MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
Server-side copy is much faster... it moves pointers rather than real data.
Yes, I would think it would be faster. However, in this case it is not. Upgrading to TrueNAS 13 cut its performance in half. Also, I am moving data between datasets, so it should be reading and writing the full contents of the file and not just moving pointers.
 

MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
It seems like I am not adequately communicating the nature of the issue. I am getting slow speeds when copying files locally, using a cp command, between datasets. While this problem initially showed up when doing copies between datasets using SMB server-side copy, I quickly transitioned to local testing to rule out anything related to SMB and the network. I've tried to include all the relevant details I can think of above.

Also to be clear, the only change made was upgrading to TrueNAS 13. If my expectations are unrealistic, they are only unrealistic for TrueNAS 13. Switching back to 12 immediately doubles my performance.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
I wonder if this is due to a faulty build of cp in upstream FreeBSD 13 vs 12. This thread talks about implementation details of cp in FreeBSD between 12 and 13. Just to rule this out, try copying /bin/cp from a 12 system to your 13 system (saving the 13 /bin/cp to /bin/cp.bak beforehand for easy restore) to see if this is indeed the case.

Alternatively, can you see if you see the same performance discrepancy using rsync to copy the 50+ GB files between datasets?
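
Roughly along these lines; the path to the 12-U8 binary and the dataset paths are just placeholders, and the base system may need to be mounted read/write first:
Code:
cp /bin/cp /bin/cp.bak            # keep the 13 binary around for easy restore
cp /path/to/cp-from-12 /bin/cp    # drop in the 12-U8 binary
# rsync comparison between the same datasets
rsync -a --progress /mnt/pool/dataset1/bigfile /mnt/pool/dataset2/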
 

MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
I wonder if this is due to a faulty build of cp in upstream FreeBSD 13 vs 12. This thread talks about implementation details of cp in FreeBSD between 12 and 13. Just to rule this out, try copying /bin/cp from a 12 system to your 13 system (saving the 13 /bin/cp to /bin/cp.bak beforehand for easy restore) to see if this is indeed the case.

Alternatively, can you see if you see the same performance discrepancy using rsync to copy the 50+ GB files between datasets?
Thank you for your suggestions!! We've definitely found something. rsync was useless as it was very slow at 300MBps. However, using the version of cp from TrueNAS 12 in TrueNAS 13 resulted in a 3X increase in transfer speeds from 1GBps to about 3GBps on the pool!! It also doubled CPU usage (which to me is a good thing here.)

This did not improve the end use case of moving files via SMB server side copy. I assume that when I do a copy over SMB, it's not actually calling the cp command, but doing something else. Perhaps whatever that something else is also received some kind of update with the same tweak that messed up transfer speeds?

I looked at the FreeBSD thread you mentioned and frankly am not educated enough to understand the implications, or even to tell, given that the discussion is from over a year ago, whether the "improvements" made there are what ended up in the current version of cp in TrueNAS 13.

In any case, I'm glad to have found something definitive, even though it's not a fix. Where to go from here?
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Where to go from here?

As I understand it, cp was tuned in 13 to have lower CPU, with the result of the lower copy performance you see. You should submit a bug or feature request to upstream FreeBSD, so that the end user has the choice of tuning cp for either lower CPU or greater performance.

As for the server side copy, I'm not sure it was tested against copies between datasets, so this may be an OpenZFS 2.1 (in 13) regression vs. 2.0 (in 12). Try asking around on the OpenZFS developer channels:

 