What performance to expect with RAIDZ2 on 12 x 10TB IronWolf Pro NAS drives?

rich1

Dabbler
Joined
Aug 20, 2021
Messages
18
Specs:
RAIDZ2
12 x Seagate IronWolf Pro 10TB NAS drives, in one big pool
HP DL380e G8, dual 8-core Xeon E5-2450, 14-bay LFF
HP H220 9205-8i 2-port SAS 6Gb/s PCIe 3.0 x8 HBA
Solarflare SFC9100 Flareon 7000 series dual 10Gb NIC
TrueNAS-12.0-U5
Hi, I'm building and testing a new storage server and seeing disappointing read performance: about 300MB/sec.
I believe each drive should deliver about 180MB/sec on sequential reads, so with 12 drives I would expect some multiple of that when reading chunky 10GB files, but I'm only getting about 300MB/sec. This is tested by cat'ing the file on the server itself, not over the network.

I found this, which does highlight potentially slow reads on RAIDZ2, but 300MB/sec is still disappointing.

My problem is that I'm not sure what to expect: what read performance should I see from a 12-drive RAIDZ2 pool?
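For a slightly more controlled measurement than cat, dd with a large block size reports throughput directly when it finishes (the file path below is a placeholder; point it at one of your own large test files):

```shell
# Sequential read benchmark: read a big file and discard the output.
# /mnt/tank/bigfile is a placeholder path; bs=1M issues large sequential requests.
# dd prints bytes transferred and throughput on completion.
dd if=/mnt/tank/bigfile of=/dev/null bs=1M
```

Note that a second run of the same file may be served from ARC rather than disk, so use a file larger than RAM (or a freshly booted system) to measure the disks themselves.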
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
The numbers don't work like you think because the way ZFS stores blocks means that each block requested doesn't necessarily sit across all of the disks in a VDEV/Pool.

The article you quote mentions an 8-wide VDEV/Pool.

I can't recall the link, but if you do some searching, there's an ideal number of disks in a VDEV of each type (RAIDZ1/2/3) which spreads the blocks and parity in a way that makes best use of all disks from a capacity perspective.

I recall it being something like 6 or 9 for RAIDZ2, but don't quote me on that.

Here are some thoughts about that which I was able to find quickly:
https://constantin.glez.de/2010/06/04/a-closer-look-zfs-vdevs-and-performance/ said:
  • Each data block that is handed over to ZFS is split up into its own stripe of multiple disk blocks at the disk level, across the RAID-Z vdev. This is important to keep in mind: Each individual I/O operation at the filesystem level will be mapped to multiple, parallel and smaller I/O operations across members of the RAID-Z vdev.
  • When writing to a RAID-Z vdev, ZFS may choose to use less than the maximum number of data disks. For example, you may be using a 3+2 (5 disks) RAID-Z2 vdev, but ZFS may choose to write a block as 2+2 because it fits better.

and
 

rich1

Dabbler
Joined
Aug 20, 2021
Messages
18
Thanks @sretalla. This doc confirms 3-9 drives per vdev. I'm using 12, so I think it might make sense to switch to 2 x 6 if the theory says it will improve performance. I think I chose 12 because I thought it would be more resilient against drive failure. I'm struggling to understand the distribution of data between vdevs though, and it's quite a bit of effort to rebuild, although I'm happy to do that if it makes sense.
 

rich1

Dabbler
Joined
Aug 20, 2021
Messages
18
zpool iostat is showing reads across all drives while cat'ing the big file, but the bandwidth looks poor:
[Screenshot: zpool iostat output, 2021-10-03]
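For anyone following along, the view above can be reproduced with something like this ('tank' is a placeholder pool name):

```shell
# Per-vdev and per-disk I/O statistics, refreshed every 5 seconds.
# -v expands the output from pool totals down to individual disks,
# which shows whether reads are actually spread across all members.
zpool iostat -v tank 5
```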
 

rich1

Dabbler
Joined
Aug 20, 2021
Messages
18
sretalla said:
The numbers don't work like you think because the way ZFS stores blocks means that each block requested doesn't necessarily sit across all of the disks in a VDEV/Pool. […]

A big thank you, @sretalla. The article you linked justifies trying a complete rebuild as 2 x 6-drive vdevs: "When using more than one vdev, they're always striped". Let's see if I can double the performance...
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The advice about component vdevs ideally being a power-of-two-plus-the-raidz-level doesn't work quite the way it used to with modern compression.

On a very mature RAIDZ3 pool, 11 component devices, I typically see about 200-300MBytes/sec for single consumer read performance.

Basically what this comes down to is your expectations are a bit wack. So let's think about this for a bit.

In a world with no caching, a request for a block of data would come over the network, be handled by the CPU, be issued to the disk, the disk would eventually respond, the CPU would hand it back to the network, and some milliseconds later, your client gets its block of data. This works out to some hundreds of blocks of data per second, which is going to be a very low transfer speed, let's say like 4096 * an optimistic 250 transactions per second, 1MByte per second. This is basically the effect of latency on a lock-step system.

To get beyond that, we need a few things. One, we need to do stuff like speculative read-ahead ("prefetch"), so that data is read from disk so that it is ready when the client wants it. The problem with speculative read-ahead is that there's a good chance you will end up reading data that doesn't end up actually being used, so, systems like ZFS attempt to identify consumer behaviour such as sequential reads, to better decide how to apply a strategy to keep speeds high. The problem is, you are looking at this as "I am a single consumer and these disks are all mine and they should dedicate all their speed to me", while ZFS is written to be a fileserver for a busy UNIX system doing lots of things. ZFS is well situated to handle parallelism in client requests, but it is not particularly optimized to throw maximum resources at a single client for the relatively unusual case where you are the only consumer using the system.

The other thing is that you actually do need the ability for your drives to perform sequential read operations, because any time you hit a seek, the data flow stops. ZFS, being a copy-on-write filesystem, has a tendency to fragment data, so your best model for fast data access on a RAIDZ is to treat it the way you would treat a WORM drive. What seems to happen instead is that many people use their pools for, oh, I dunno, bittorrent targets, creating a lot of fragmentation along the way.

So a few other things to do --

Increase the recordsize to 1M.

Make sure you've tuned for large TCP buffers.
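The recordsize suggestion can be sketched as follows ('tank/media' is a placeholder dataset name; substitute your own):

```shell
# Raise the recordsize on the dataset holding the big files.
# Caveat: this only affects newly written data; existing files keep their
# old recordsize until they are rewritten (e.g. copied to a fresh dataset).
zfs set recordsize=1M tank/media
zfs get recordsize tank/media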
 

rich1

Dabbler
Joined
Aug 20, 2021
Messages
18
jgreco said:
The advice about component vdevs ideally being a power-of-two-plus-the-raidz-level doesn't work quite the way it used to with modern compression. […]
Thanks @jgreco for your response. I've kept the network out of the performance tests so far, on the grounds that if I can't see good performance on a single cat of a file on the server, I'll never see anything better over the network. I've also kept concurrent access out of the tests, again because if single-consumer reads are slow, concurrent access won't be any better. My use case is 300-400 clients concurrently reading from thousands of 2-4GB files over the TrueNAS NFS service via a 10Gb NIC. I've just completed a rebuild replacing the 1 x 12-drive vdev with 2 x 6-drive vdevs, and I still get about 300MB/sec cat'ing files to /dev/null, so no improvement. I will try your recordsize suggestion, which I've seen recommended in several places. My goal is to saturate the 10Gb NIC, so I'd like roughly a 3x improvement from here. I would hope that any read-ahead logic pays off, since I'm typically doing a lot of sequential reads. I'm also prepared to add a decent-sized cache if it helps, say a 1TB SSD (I have 128GB RAM so far); it might be the only way to use the full 10Gb bandwidth, but it's hard to know whether that will pay off without actually trying it.
 

rich1

Dabbler
Joined
Aug 20, 2021
Messages
18
I tried recordsize=1M. Cat'ing to /dev/null on the server itself improved from 300MB/s to 480MB/s.
Not sure where to tune for large TCP buffers.
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
Noatime maybe also, that helps the “lots of tiny files” use case, but not your chunky file use case. Sticking it here anyway in case someone else comes across this thread.
 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
I tried recordsize = 1M - cat'ing to /dev/null on the server itself gave me improvement from 300 MB/s to 480 MB/s
not sure where to tune for large tcp buffers
These network tunables work well for me on my 10G systems. Ignore the hw.sfxge.* settings unless you're using SolarFlare SFN6122F NICs:
[Screenshot: network tunables, 2021-10-03]

I also use jumbo frames; can be a pain to set up, but I found it to be worth the effort on my systems.

Not sure where/how the TCP window size is set on FreeBSD systems, but I get a 4.00MByte default TCP window size when I run iperf. So perhaps that is simply the default on FreeBSD? The trick will be setting a larger TCP window size on Windows clients, assuming you have those.
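To take the disks out of the equation entirely, raw TCP throughput between client and server can be checked with iperf (the address below is a placeholder; -w requests a 4MB window to match the buffer sizes discussed here):

```shell
# On the server: iperf -s -w 4M
# On the client, a 30-second throughput test against the server's address:
iperf -c 192.0.2.10 -w 4M -t 30
```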
 

Mlovelace

Guru
Joined
Aug 19, 2014
Messages
1,111
Not sure where/how the TCP window size is set on FreeBSD systems, but I get a 4.00MByte default TCP window size
You are the one who set the initial TCP window size in your tunables: 'net.inet.tcp.sendspace=4194304' and 'net.inet.tcp.recvspace=4194304'. So based on your post, the window size starts at 4MB and can grow in 64KB increments as needed, up to the maximum buffer size of 16MB.
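As a sketch, the buffer behaviour described above corresponds to sysctl settings along these lines (values taken from the posts in this thread; verify the exact names on your FreeBSD release before relying on them):

```shell
# Initial TCP send/receive buffers of 4MB...
sysctl net.inet.tcp.sendspace=4194304
sysctl net.inet.tcp.recvspace=4194304
# ...auto-sized upward in 64KB increments...
sysctl net.inet.tcp.sendbuf_inc=65536
sysctl net.inet.tcp.recvbuf_inc=65536
# ...to a maximum of 16MB.
sysctl net.inet.tcp.sendbuf_max=16777216
sysctl net.inet.tcp.recvbuf_max=16777216
```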
 

rich1

Dabbler
Joined
Aug 20, 2021
Messages
18
jgreco said:
The advice about component vdevs ideally being a power-of-two-plus-the-raidz-level doesn't work quite the way it used to with modern compression. […]

I buy your point about single-consumer rates versus aggregate, and I'm happy to settle for 480MB/s for single-consumer reads. I guess the follow-up question is: what should I reasonably target for combined MB/s across a largish number of concurrent consumers? The server has a dual-port 10Gb NIC; should I be able to saturate both ports if everything is set up optimally?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I buy your point about single consumer rates versus aggregate - and happy to settle for 480MB/s for single consumer read rate. I guess the follow up question is what should I reasonably be targeting for the combined MB/s for a largish number of concurrent consumers ? the server has a dual port 10gb nic - should I be able to saturate both ports if everything setup optimally ?

PROBABLY not. RAIDZ is really optimized towards archival file storage, and can struggle a bit once you get more than a bit of concurrency going. With RAIDZ, if a consumer is reading a 1MB block off an 11-drive RAIDZ vdev, then all 11 drives need to seek to the appropriate location on the disk, read about two dozen 4K sectors off each one, then free up for other workload. The optimal case is "no other workload", because then the heads are in the correct area of the disk for the next request from this one consumer -- which hopefully is prefetched anyways, but isn't always going to be, because there's a limit to the amount of prefetching that happens.

If you want to run two consumers concurrently, there is guaranteed to be a lot of seeking, which is essentially dictated by the amount of data that is prefetched. Some of this has changed over the years and I'm not interested in fishing around for current details this morning, so I'm just outlining the issue for you here. Let's say there's no prefetching, just the 1MB block read. You have two consumers alternating requests from two different areas of the disk (we can handwave and treat the entire vdev as performing similarly to a single disk). You seek to one place, read 1MB, seek to another, read 1MB, seek back to the first, read 1MB, etc. Looking at that high-level behaviour, it should be clear that seek time plays an integral part in overall performance. In practice, a typical modern HDD can do maybe 200 IOPS, so that works out to 100MBytes/sec for each consumer, or 200MBytes/sec aggregate.

You can effectively double(ish) this by going to two vdevs, but in practice filesystems are chaotic so you aren't likely to see any sort of precise doubling. This becomes more like gambling, where in blackjack, the instantaneous results of any hand are hard to predict, but the house wins in the long term. Adding vdevs increases the amount of I/O capacity.

The ticket to concurrent performance is mirrors, not RAIDZ. RAIDZ is optimized for capacity but not really performance. However, with mirrors, things are a bit different. If you have a single three-way mirror vdev, you can have three separate consumers reading data from different parts of the disk without getting in each others way seek-wise. And you can go four-way or even more. You're paying for it in that you are burning through HDD's and capacity faster than you would be with RAIDZ, of course. I like to suggest three-way mirror vdevs because it is the sweet spot at which you do not lose redundancy if a drive fails.

Beyond that, you really need to have stuff in ARC or L2ARC. L2ARC won't generally cache sequential workloads, but you can look at vfs.zfs.l2arc_noprefetch and related settings.
 

rich1

Dabbler
Joined
Aug 20, 2021
Messages
18
Thank you @jgreco, it all makes sense. I will experiment with an SSD cache and see if I really need mirrors. If it turns out much of my workload is repetitive and served from cache, it would be nice to keep the capacity benefits of RAIDZ. Unfortunately I discovered that the Gen8 doesn't support NVMe, so it will have to be a standard SSD. I'm thinking of a 2TB Samsung 860 Pro.
 

rich1

Dabbler
Joined
Aug 20, 2021
Messages
18
So I got lucky: the 2TB Samsung 860 Pro arrived today and seems compatible with my setup. However, once again I'm not sure whether I should be disappointed or whether my expectations were too high. I've added the SSD as a cache device to my 2 x 6 RAIDZ2 pool. So far it hasn't improved things: large-scale concurrent reads from the pool, including the SSD, top out at about 600MB/sec.

zpool iostat is telling me 99% of the reads are coming from the HDDs, not the SSD, so I'm not really getting any benefit from it yet. However, SSD usage is creeping up as the cache fills at a rate of about 37MB/sec. Maybe when the 2TB SSD is full I will get more cache hits and aggregate read performance will go up, but it doesn't look like it so far.
 

rich1

Dabbler
Joined
Aug 20, 2021
Messages
18
I left the workload running overnight. The SSD cache is now two-thirds full, but the pool is still reading only 1% of the data from the SSD cache, and Grafana says the SSD is no more than 10% active. It's disappointing because the dataset this particular workload reads is about the same size as the SSD, so I was naively hoping that quickly repeated runs of the workload would mostly hit the cache, but that's just not happening at all.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
What is arc_summary.py reporting? If you are running what appears to be sequential read workloads, ZFS may not be that interested in trying to cache it, see discussions of l2arc_noprefetch (etc). If you are running a workload that is the size of your L2ARC, your ARC is probably thrashing and not doing a good job of picking up candidate blocks. It may also be that you should substantially boost l2arc_write_boost and l2arc_write_max.
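A quick way to inspect this (the exact section names in arc_summary output vary by version, so treat the grep pattern as approximate):

```shell
# Pull the cache hit-rate sections out of the ARC report.
arc_summary.py | grep -i -A 2 'hit ratio'
# Show the current L2ARC feed-rate and prefetch-caching tunables.
sysctl vfs.zfs.l2arc_write_max vfs.zfs.l2arc_write_boost vfs.zfs.l2arc_noprefetch
```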
 

rich1

Dabbler
Joined
Aug 20, 2021
Messages
18
arc_summary says:

l2arc_write_boost    8388608
l2arc_write_max      8388608
l2arc_noprefetch     1

Now testing:

l2arc_write_boost    26214400
l2arc_write_max      52428800
l2arc_noprefetch     0

The SSD cache is filling up much more quickly and I can see a step up in reads from the SSD, but I'll need to fill it before I can tell whether it adds value.
It may be that my workload is unrealistic: I'm throwing 384 read threads at the server from the clients, and I might get better throughput with fewer.
I understand that many concurrent sequential reads can look random to the server.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
That's really tepid.

Try running these three commands from the CLI

# sysctl -w vfs.zfs.l2arc_write_boost=134217728
# sysctl -w vfs.zfs.l2arc_write_max=67108864
# sysctl -w vfs.zfs.l2arc_noprefetch=0

Be careful about making aggressive changes, but I'll say that this starting point is not particularly aggressive given the scope of things here. If this seems to help, it is probably okay to double both the write_boost and max values once, see how it improves, and MAYBE once more, but I would get skeptical at that point.
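One caveat worth stating (an assumption about typical FreeBSD/TrueNAS practice; verify for your version): sysctl changes made at the CLI are lost on reboot. On plain FreeBSD you would persist them with entries like these in /etc/sysctl.conf; on TrueNAS the usual equivalent is adding them as tunables of type 'sysctl' in the web UI.

```shell
# /etc/sysctl.conf entries mirroring the CLI commands above
vfs.zfs.l2arc_write_boost=134217728
vfs.zfs.l2arc_write_max=67108864
vfs.zfs.l2arc_noprefetch=0
```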
 