Poor read speeds on multiple different systems

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
You may very well be hitting a disk I/O bottleneck, as @jgreco suggests.

But have you added any network tunables on these systems?

Here are the ones I use with FreeNAS 11.2-U8 (ignore the hw.sfxge.* settings as they're hardware-specific):
[Screenshot: network tunables]
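
For a rough idea of the kind of settings involved, here's an illustrative sketch (example values only, not an exact copy of what's in the screenshot; add them as sysctl-type tunables in the GUI and benchmark before and after):

```sh
# Illustrative 10GbE sysctl tunables (FreeNAS GUI: System > Tunables, type "sysctl").
# Values are examples only; test on your own hardware.
sysctl kern.ipc.maxsockbuf=16777216        # allow larger socket buffers
sysctl net.inet.tcp.recvbuf_max=16777216   # max TCP receive buffer (window) size
sysctl net.inet.tcp.sendbuf_max=16777216   # max TCP send buffer size
sysctl net.inet.tcp.recvspace=262144       # initial TCP receive buffer
sysctl net.inet.tcp.sendspace=262144       # initial TCP send buffer
sysctl net.inet.tcp.mssdflt=1460           # default MSS for a standard 1500-byte MTU
```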
 

kurtc

Dabbler
Joined
Dec 17, 2017
Messages
39
You're getting 60MBytes/sec on iSCSI on RAIDZ2 with two vdevs? That seems like it could be very reasonable, especially once fragmentation takes hold.

https://www.truenas.com/community/threads/the-path-to-success-for-block-storage.81165/

I get about 400-500 megabits per second, it's a 65% full pool (which is where some of the speed loss comes in) with a ZFS-reported 18% fragmentation, with three mirror vdevs. So it's a little different, but given the small amount of available I/O, I think what you're seeing on your hardware is not necessarily bad.

OK, the pool is 9% fragmented. I feel like I should at least be able to get close to 2 hard drives' worth of I/O.
It sounds like I have chosen the wrong storage for my application and probably should stop beating my head against a wall! LOL

So before these systems my world was all RAID cards and shared storage (Nimble (CASL), EMC, etc.). ZFS always seemed like the perfect solution for backup storage, but it sounds like it can't scale up in a reasonable fashion for my use case. When you store a couple hundred terabytes of backups and need to recover a VM, getting it out at 30-60MBps isn't going to meet RTOs. My guess is that ReFS on top of ZFS is causing fragmentation. I wouldn't think 9% is bad, but maybe there is something I can't see. In my limited understanding of filesystems, I feel like ZFS would really benefit from a garbage-collection methodology like some of the log-structured file systems have: something that rewrites data sequentially.

Thanks for the assistance!
 

kurtc

Dabbler
Joined
Dec 17, 2017
Messages
39
You may very well be hitting a disk I/O bottleneck, as @jgreco suggests.

But have you added any network tunables on these systems?

Here are the ones I use with FreeNAS 11.2-U8 (ignore the hw.sfxge.* settings as they're hardware-specific):
[Screenshot: network tunables]

I have not. Which of these do you suggest I start with, or is there a resource I could review on what they do?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
OK, the pool is 9% fragmented. I feel like I should at least be able to get close to 2 hard drives' worth of I/O.

Okay, so, I'm not necessarily saying you CAN'T get more, and I'm not discouraging you from trying, but I do want to point out:

Two hard drives are maybe 250 IOPS each. That's 500 IOPS for the pool. I'm really glad you said this, in fact, because it means that you are that rare bird here on the forums, someone who might be REASONABLE. ;-)

So let me point out that an IOPS, from the angle that matters, is a seek. If you are seeking for every 4K block you get, which is clearly a really bad case, 4096 * 500 is only about 2MBytes/sec. You're beating 2MBytes/sec, so you're definitely not hitting a hugely pessimistic case.
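
To spell that arithmetic out (a rough sketch; the 500 IOPS and the per-I/O sizes are just round numbers):

```sh
# Seek-bound worst case: one seek per 4 KiB block at ~500 seeks/sec for the pool
echo "$(( 4096 * 500 / 1000000 )) MB/s"       # ~2 MB/s
# Same ~500 IOPS, but pulling a larger 128 KiB chunk per seek
echo "$(( 131072 * 500 / 1000000 )) MB/s"     # ~65 MB/s
```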

Tuning may bring out more speed, and you've just been handed some pointers there by someone else. Absolutely do feel free to try that stuff. But to get back to what I was saying, when that runs out of steam:

Mirrors provide much better read performance on iSCSI pools, and you might be able to make your performance substantially better using mirrors. Additionally, if you read the "path to success for block storage" article, one other factor that will decrease your fragmentation is more free space (which should work even for RAIDZ2 somewhat). Depending on how your application is writing data, if it is writing large sequential chunks, having much more free space on the pool is likely to result in it being written to the pool sequentially, which will in turn result in much better read performance on the way out.
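
To illustrate the layout difference, here's a hypothetical 12-disk example with made-up device names (on TrueNAS you would build this in the pool manager, not at the shell):

```sh
# Two 6-wide RAIDZ2 vdevs: good capacity, but only ~2 vdevs' worth of random-read IOPS
zpool create tank \
  raidz2 da0 da1 da2 da3 da4 da5 \
  raidz2 da6 da7 da8 da9 da10 da11

# Six 2-way mirrors: half the usable space, but ~6 vdevs' worth of IOPS,
# and reads can be satisfied from either side of each mirror
zpool create tank \
  mirror da0 da1  mirror da2 da3  mirror da4 da5 \
  mirror da6 da7  mirror da8 da9  mirror da10 da11
```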

However, on the flip side, if your client restores are "single-threaded", there is going to tend to be a cap on how fast a single restore can go, even with mirrors. That said, ZFS shines at parallelism, especially with mirrors, and you might find that while doing a single restore isn't as fast as you'd like, doing multiple parallel restores makes much better use of the available resources.

Just some random thoughts for you to explore.
 

kurtc

Dabbler
Joined
Dec 17, 2017
Messages
39
Okay, so, I'm not necessarily saying you CAN'T get more, and I'm not discouraging you from trying, but I do want to point out:

Two hard drives are maybe 250 IOPS each. That's 500 IOPS for the pool. I'm really glad you said this, in fact, because it means that you are that rare bird here on the forums, someone who might be REASONABLE. ;-)

So let me point out that an IOPS, from the angle that matters, is a seek. If you are seeking for every 4K block you get, which is clearly a really bad case, 4096 * 500 is only about 2MBytes/sec. You're beating 2MBytes/sec, so you're definitely not hitting a hugely pessimistic case.

Tuning may bring out more speed, and you've just been handed some pointers there by someone else. Absolutely do feel free to try that stuff. But to get back to what I was saying, when that runs out of steam:

Mirrors provide much better read performance on iSCSI pools, and you might be able to make your performance substantially better using mirrors. Additionally, if you read the "path to success for block storage" article, one other factor that will decrease your fragmentation is more free space (which should work even for RAIDZ2 somewhat). Depending on how your application is writing data, if it is writing large sequential chunks, having much more free space on the pool is likely to result in it being written to the pool sequentially, which will in turn result in much better read performance on the way out.

However, on the flip side, if your client restores are "single-threaded", there is going to tend to be a cap on how fast a single restore can go, even with mirrors. That said, ZFS shines at parallelism, especially with mirrors, and you might find that while doing a single restore isn't as fast as you'd like, doing multiple parallel restores makes much better use of the available resources.

Just some random thoughts for you to explore.

I did test one system with 12 mirrors and it didn't seem to improve dramatically. When a zvol is shared out as an iSCSI target, does TrueNAS treat each initiator talking to that extent as a single thread?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
That might be the wrong question. iSCSI itself is based on TCP but is capable of handling multiple requests that do not necessarily complete in order (tagged command queueing). Therefore the behaviour is somewhat like a single thread and somewhat not. I don't have a good reference to point you at, I'm sorry. Perhaps @mav@ has something better to say, because I'm not saying anything intelligible here.

I would expect that the 12 mirrored system would have a lot of potential for high performance across a bunch of parallel tasks.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,949
Are you using dedupe?
 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
...snip...
When you store a couple hundred terabytes of backups and need to recover a VM, getting it out at 30-60MBps isn't going to meet RTOs.
...snip...
60MB/s is only 0.48Gbit/s -- and it just seems to me that you ought to be able to beat that with your equipment.

Example: here's a screenshot of network bandwidth during the interval when I back up my desktop PC to my FreeNAS server 'BACON' -- I'm getting > 3.5Gb/s during the read-intensive 'validation' step, corresponding roughly to 'recovering a VM' if I understand your needs correctly. See 'my systems' below for full specs, but in brief: 'BACON' is nothing fancy; it has a 10Gb/s NIC and a single pool made up of 2 x RAIDZ2 vdevs.
[Screenshot: network bandwidth on 'BACON' during the backup]


I recommend making sure your network configuration is squared away before throwing in the towel. You should get iperf results near 10Gb/s using the tunables I posted earlier. I also use jumbo frames, for what that's worth, but even without jumbos you should be getting better than 9Gb/s with iperf, or something's just not right.
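
Something like this, assuming iperf3 is installed on both ends (the host name is a placeholder):

```sh
# On the TrueNAS box:
iperf3 -s

# On the iSCSI initiator / client:
iperf3 -c truenas.example.lan -t 30          # single stream; expect ~9+ Gb/s on a healthy 10GbE path
iperf3 -c truenas.example.lan -t 30 -P 4     # four parallel streams for comparison
iperf3 -c truenas.example.lan -t 30 -R       # reverse direction (server transmits)
```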

Sorry I can't point you to a resource about the tunables; they're just what I've cooked up over the years, reading here on the forum and on other sites such as Calomel.

 

mav@

iXsystems
iXsystems
Joined
Sep 29, 2011
Messages
1,428
I agree with @Spearfoot that before you analyze any other problems, `iperf` must show you the full link bandwidth of 10Gb/s, not 5-6-whatever. Lower numbers may mean packet loss or other problems. And it should preferably be a single TCP stream, because each iSCSI connection is a single TCP stream. In some cases single-connection bandwidth can be limited by single CPU core performance, in which case it is not that bad, but I'd expect that to happen closer to 5-6GB/s, not 5-6Gb/s.

On top of that, the SCSI target in TrueNAS is set to process up to 32 requests per LUN at a time, so aside from the network bottlenecks mentioned, a single initiator is not a single thread, as @jgreco said, unless it really does not provide enough parallel requests. Previously the SCSI target could serialize logically sequential reads to help the ZFS speculative prefetcher be more efficient, but earlier this year I improved the prefetcher and was able to significantly relax that constraint. So make sure you are running the latest TrueNAS version (12.0-U6 as of now).
 

kurtc

Dabbler
Joined
Dec 17, 2017
Messages
39
I agree with @Spearfoot that before you analyze any other problems, `iperf` must show you the full link bandwidth of 10Gb/s, not 5-6-whatever. Lower numbers may mean packet loss or other problems. And it should preferably be a single TCP stream, because each iSCSI connection is a single TCP stream. In some cases single-connection bandwidth can be limited by single CPU core performance, in which case it is not that bad, but I'd expect that to happen closer to 5-6GB/s, not 5-6Gb/s.

On top of that, the SCSI target in TrueNAS is set to process up to 32 requests per LUN at a time, so aside from the network bottlenecks mentioned, a single initiator is not a single thread, as @jgreco said, unless it really does not provide enough parallel requests. Previously the SCSI target could serialize logically sequential reads to help the ZFS speculative prefetcher be more efficient, but earlier this year I improved the prefetcher and was able to significantly relax that constraint. So make sure you are running the latest TrueNAS version (12.0-U6 as of now).

These tests are across 7 different TrueNAS servers and 2 different iSCSI networks. I just reconfigured one of them that has 42 x 4TB drives as 13 x mirror vdevs (3 drives mirrored per vdev). That one has 4 x 10Gb interfaces for iSCSI, and iperf from the host initiator shows 9300Mb/s per path with multiple threads. I think the network is fine. Especially since the behavior is the same on that network, which has 2 independent Arista switch sets, and the other 5 TrueNAS servers run on Nexus and HP switches that iperf fine.

That is good to know about the multiple threads for iSCSI.
 

kurtc

Dabbler
Joined
Dec 17, 2017
Messages
39
As mentioned above, I just reconfigured the pool on one of the TrueNAS servers (12.0-U6) that has 42 x 4TB drives as 13 x mirror vdevs (3 drives mirrored per vdev), leaving 3 as spares. I configured a 20TB zvol with a 64KB block size, shared it out as a device extent (512B), connected it to a Windows host, formatted it as ReFS (64KB), and wrote 8TB to it last night. I am reading all 8TB back today and getting about 200MB per second. Obviously that is much better, but I still don't understand why only 200MBps. If I look at each drive, they are only pushing about 5MBps, ~75 IOPS, and 3ms latency. I can't imagine fragmentation is an issue yet with one large write and one large read.
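
For reference, this is roughly how to watch what the disks are being asked to do from the TrueNAS side while the read runs (the pool name is a placeholder):

```sh
# Per-vdev / per-disk throughput and IOPS, sampled every 5 seconds
zpool iostat -v tank 5

# Request-size and latency histograms (OpenZFS 2.0 on TrueNAS 12), useful for seeing
# whether the reads arrive as many small random I/Os instead of large sequential ones
zpool iostat -r tank 5
zpool iostat -w tank 5

# Per-physical-disk view (busy %, ms per read)
gstat -p
```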
 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
I think the network is fine.
Perhaps so, but have you tried adding tunables? At least for TCP window size? And jumbo frames should definitely help w/ block storage.

I'm just a dog typing on my master's computer, but @mav@ is with iXsystems, and knows whereof he speaks. We both agree that it just makes sense to maximize network throughput.
 

kurtc

Dabbler
Joined
Dec 17, 2017
Messages
39
Perhaps so, but have you tried adding tunables? At least for TCP window size? And jumbo frames should definitely help w/ block storage.

I'm just a dog typing on my master's computer, but @mav@ is with iXsystems, and knows whereof he speaks. We both agree that it just makes sense to maximize network throughput.

I would be glad to try the tunables. Should I try TCP window size first and then just test the various other ones you posted?

Also, 9K jumbos are configured and tested end-to-end.

Thanks everyone!
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Well, you ARE starting out with what is essentially a very complicated configuration using protocols known for being sensitive to latency, and you have several latency factors in there. Quite frankly, it's unreasonable to assume that everything is just going to automatically optimize itself for a complicated workload, or even be ABLE to perform the way we all WISH it magically would.

What happens on your nice new mirror pool if you just share out a large chunk of space with NFS or SMB, and copy DIRECTLY from the Windows box?

This may point towards additional helpful network tweaking, or help to isolate other fundamentals. If you cannot get basic NFS and SMB to go screaming fast, I can pretty much guarantee that your MUCH more challenging iSCSI/ZVOL/ReFS stack is going to suck. So it is easier to find your bottlenecks with the easier configuration first.
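
As a sketch of how that baseline test might look (pool and dataset names are placeholders; create the SMB or NFS share itself through the GUI):

```sh
# On the TrueNAS box: throwaway dataset with compression off, so /dev/zero data isn't "free"
zfs create -o compression=off tank/speedtest

# Write a test file comfortably larger than RAM so the ARC can't hide the disks
dd if=/dev/zero of=/mnt/tank/speedtest/testfile bs=1M count=131072   # ~128 GiB

# Local sequential-read baseline: no network, no iSCSI, no ReFS in the path
dd if=/mnt/tank/speedtest/testfile of=/dev/null bs=1M
```

Then share tank/speedtest over SMB or NFS from the GUI and copy the same file down to the Windows box. If that copy isn't fast, the iSCSI/zvol/ReFS stack won't be either.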
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I would be glad to try the tunables. Should I try TCP window size first and then just test the various other ones you posted?

Also, 9K jumbos are configured and tested end-to-end.

Thanks everyone!

Jumbo frames can actually turn out to be a performance hit on a lot of hardware. They exercise unusual code paths and can hit different kernel buffer limits than normal packet processing.
 