Slow SMB - Lost - Need direction

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
However, I would have thought the read performance would have been higher than the write performance?

And you thought this, .. why?

Let me rearrange your thinking about these processes a bit. Keep an open mind and forget some of your preconceived notions.

When you are writing, your traffic comes in over the network and is immediately placed in the next ZFS transaction group (TXG) to be written. This is effectively a write cache, and it is in system memory, so it fills as fast as those poor little packets can be pulled off the network and dumped in main memory. When the TXG fills, it moves to a different state, where it flushes out to disk, and a new one opens for current write traffic. This does not imply that you can write forever at unlimited speed; you can only have the current and flushing transaction groups. But your write speeds are effectively limited to the lower of your network speed or your pool speed, with the TXGs acting as a buffer or cache mechanism in between. So as long as you can push traffic at the NAS at high speed, and the pool can write it at high speed, you get really high write speeds.
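
If you want to watch that happen on your own box, here's a rough sketch (this assumes SCALE, i.e. OpenZFS on Linux - the same knobs live under sysctl on CORE - and "tank" is just a placeholder for your pool name):

# how long ZFS waits before forcing a TXG to flush even if it isn't full (seconds)
cat /sys/module/zfs/parameters/zfs_txg_timeout

# roughly how much dirty (not-yet-flushed) write data ZFS will buffer in RAM (bytes)
cat /sys/module/zfs/parameters/zfs_dirty_data_max

# watch the TXG flushes hit the disks while you push a big copy at the NAS
zpool iostat -v tank 1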

On the other hand, let's consider a read. The NAS has no idea that you're about to open /mnt/pool/your/data/file/123, so it has to wait for that request to come in over the network. It hopefully has the metadata for that directory in ARC, but has to do a seek to get the first blocks of data. That request goes down to the HDD, takes some milliseconds to seek, reads the data into main memory, then shovels it out the network at you. It may do a limited amount of speculative prefetching to optimize the next read request, but if your file is 1GB or 1TB, it isn't going to read all of that. It has no idea whether or not you're going to ask for it. So maybe it reads 1MB of your file into cache, and the next few read requests are fulfilled from ARC. Now another read request comes in, and the ARC doesn't have the data, so again, the system has to go out to the pool, and pull the data into main memory, before returning it to you.
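
You can see that lock-step pattern in the ARC counters while a copy is running (again assuming SCALE; CORE exposes the same counters via sysctl):

# demand reads that had to go to disk vs. ones served from ARC,
# plus how much the prefetcher managed to guess correctly
awk '$1 ~ /^(demand_data_hits|demand_data_misses|prefetch_data_hits|prefetch_data_misses)$/' /proc/spl/kstat/zfs/arcstats

# or use the summary tool that ships with OpenZFS
arc_summary

If the demand misses keep climbing while you read a big file, that's the pool being asked for data the prefetcher didn't fetch ahead of time.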

The write process is effectively a fast-as-it-can-go firehose, while the read process is more of a lock-step process because the NAS isn't prescient.
 

vexter0944

Dabbler
Joined
Jan 4, 2023
Messages
34
And you thought this, .. why?

@jgreco
The reason I thought this was from the doc located here:
"Streaming read speeds and read IOPS on a mirrored vdev will be faster than write speeds and IOPS. When reading from a mirrored vdev, the drives can “divide and conquer” the operations, similar to what we saw above in the striped pool. This is because each drive in the mirror has an identical copy of the data. For write operations, all of the drives need to write a copy of the data, so the mirrored vdev will be limited to the streaming write speed and IOPS of a single disk."

Thanks for the explanation - that helps. I had the idea that on read it would attempt to read the data to memory to pre-load/fetch the data (or something along those lines), such that the memory would feed the network at higher speeds than the disks could. A poor preconceived idea, but not being a NAS/disk expert it was my best guess ;)
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
"Streaming read speeds and read IOPS on a mirrored vdev will be faster than write speeds and IOPS. When reading from a mirrored vdev, the drives can “divide and conquer” the operations, similar to what we saw above in the striped pool. This is because each drive in the mirror has an identical copy of the data. For write operations, all of the drives need to write a copy of the data, so the mirrored vdev will be limited to the streaming write speed and IOPS of a single disk."

This is sort-of-true (maybe 80%) and sort-of-BS. If you can convince ZFS to do read-ahead on the mirrored vdev, yes, you do have the opportunity to read from both halves of the mirror. That's true. However, there isn't intelligent or opportunistic read-ahead (where the filesystem is actively analyzing that you are reading a particular file), so you will tend to end up in this lock-step pattern. Additionally, you don't normally get benefits from reading the other side of the mirror unless you're on 10G or faster; a typical HDD is about 2-3Gbps, so a single side of the mirror is sufficient to fill a 1Gbps pipe. However, there is increased parallelism available with the mirror configuration; each vdev can be serving two different requests simultaneously. This is substantially better than RAIDZ, which is optimized towards large-file/single-access workloads. As with everything ZFS, none of this really comes out quite as simple as you'd expect, because other factors such as copy-on-write also throw a wrench into the works.
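
If you want to measure that parallelism rather than take my word for it, here's a quick fio sketch (the directory is a placeholder - point it at a throwaway dataset - and use files bigger than your ARC so you're not just benchmarking RAM):

# one sequential reader - the lock-step case
fio --name=seq1 --directory=/mnt/pool/fio-test --rw=read --bs=1M --size=8G --numjobs=1 --group_reporting

# four readers at once - lets the mirror halves serve different requests in parallel
fio --name=seq4 --directory=/mnt/pool/fio-test --rw=read --bs=1M --size=8G --numjobs=4 --group_reporting

Run it locally on the NAS; over SMB on a 1Gbps client you probably won't see any difference.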

Thanks for the explanation - that helps. I had the idea that on read it would attempt to read the data to memory to pre-load/fetch the data (or something along those lines), such that the memory would feed the network at higher speeds than the disks could. A poor preconceived idea, but not being a NAS/disk expert it was my best guess ;)

Unfortunately you are always going to be limited by the lock-step nature of reads; reads are USUALLY slower than writes because of this. With HDDs, a lot of this ends up being how much misfortune you suffer in the form of seeks, and this is hard to accurately predict because of the CoW nature of ZFS.
 

vexter0944

Dabbler
Joined
Jan 4, 2023
Messages
34
@jgreco - I gotcha, and thanks for taking the time to explain that. To that end, based on my previous posts, is there any room to improve these speeds - any tweaking, tuning, etc. - or am I pretty much tapped out at this point? Don't get me wrong - I'm already 4x to 5x faster than my previous QNAP/1GbE connection setup - but I want to make sure I'm not leaving anything on the table if possible.
 

vexter0944

Dabbler
Joined
Jan 4, 2023
Messages
34
@jgreco @Davvo - just in case, here is a screenshot of a fio test as well. I'm not 100% sure how good, bad, or indifferent it looks ;)

1674860478584.png
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Try this.
 

vexter0944

Dabbler
Joined
Jan 4, 2023
Messages
34
I use 256GB of RAM in my server with 12 disks in RAIDZ2, and the system is using 128GB for ARC by default. If you want performance, RAM is a must.
@Daisuke - wow. I just got 64GB - and that's the max for my motherboard. Would you expect to see better performance (aka closer to the 10GbE mark)? I'm pretty much keeping steady at 5Gb max, with random bursts of 10GbE when the copy first begins.
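
For reference, this is what I've been using to check how much of that RAM actually ends up as ARC (found this for SCALE; I believe CORE has equivalents under sysctl):

# current ARC size and its ceiling, in bytes
awk '$1 == "size" || $1 == "c_max"' /proc/spl/kstat/zfs/arcstats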
 

vexter0944

Dabbler
Joined
Jan 4, 2023
Messages
34
Try this.
@Davvo
I can't find the v3 version for TrueNAS SCALE in that link any longer - any idea where to find it?
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
@Davvo @jgreco - should I work on a rebalance? View attachment 62996
Would be beneficial.

@jgreco - I don't see 3 or 2 out there - screenshot from filezilla
View attachment 63002
I too didn't manage to get it from FileZilla; use the shell and run wget ftp://ftp.sol.net/incoming/solnet-array-test-v3.sh instead.
 

vexter0944

Dabbler
Joined
Jan 4, 2023
Messages
34
This worked :) From the shell in TrueNAS: sudo curl ftp://ftp.sol.net/incoming/solnet-array-test-v3.sh --output solnet-array-test-v3.sh
 

vexter0944

Dabbler
Joined
Jan 4, 2023
Messages
34
I ran this - but it took so long the shell timed out and I lost the results. Is there a way to output to a file?
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222