A TrueNAS and iXsystems Questions & Answers Interview [with elaboration request on RDMA answer]

Rand

Guru
Joined
Dec 30, 2013
Messages
906
Hey guys,

I just found this https://nascompares.com/news/truenas-ixsystems-nas-qa-answering-your-questions-in-2022/
There are some good answers for new users who are looking into NAS systems and are considering TrueNAS as an option. I like it.

I actually stumbled across this while searching for news about TrueNAS SCALE and RDMA support, and was a bit surprised by the answer that @morganL gave to that question...

With TrueNAS Scale, will RDMA/RoCEv2 be supported?


RDMA is a very useful technology for accessing data in RAM on another system. For accessing data on HDDs and Flash, there is only a minor benefit. TrueNAS SCALE will support RDMA in a future release based on customer/community demand.

Now I am no expert on RDMA (or rather RoCEv2, the implementation we would most likely look at), but from what I gather, for example from this FreeBSD presentation from 2018 (https://papers.freebsd.org/2018/bsdcan/shwartsman-roce_as_a_performance_accelerator/), I would think that having (and I quote) "Local like [NVMe-over-Fabrics] performance" would be something that flash arrays would benefit from as well?

I mean, I basically see support for RoCE on a lot of other storage vendors' 'Pro' lists, and I see it at work in our IBM all-flash arrays, just not here?
I know it's still on the roadmap (https://jira.ixsystems.com/browse/NAS-106190, it got pushed back to the next release), but I really am confused as to why you only expect minor benefits from this?

Thanks & Cheers
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Thanks @Rand

The presentation you referenced from freebsd.org was a sales pitch from Mellanox with only highly tuned lab data.

For example: they were using 512B write IOPS.
There was no ZFS or data protection included.
There was no evidence that HDDs or SSDs were even part of the test.

I'd recommend finding some real-world data on what the benefit is... there is certainly less CPU and TCP overhead, but it's likely to be highly workload- and access-pattern-specific. Adding cores is getting cheaper. I don't know the answer for all use cases, but for most I suspect my answer is correct.

In the meantime, you will find that TrueNAS 12.0 includes a significant reduction in memory copies in the stack... which does increase bandwidth and IOPS. So it's fair to say the focus in TrueNAS is on general performance with ZFS and real media. We prioritize the work based on current customer needs.

In a previous life, I introduced the first storage system to hit 1 million IOPS... but it had no useful compression or snapshots, and it needed Fibre Channel. It was not software-defined and flexible like TrueNAS. TrueNAS has significantly reduced the cost of getting high performance with those features (we hit 1.2 million IOPS with the TrueNAS M60), but it cannot also be the Ferrari built for breaking records.

So, if RDMA is needed, please upvote the Jira tickets or contact our sales team with your needs. We want those real-world problems.

Cheers

Morgan
 

Rand

Guru
Joined
Dec 30, 2013
Messages
906
Hi Morgan,

thanks for the explanation.
I wanted to quickly whip up a comparison of RDMA vs. non-RDMA traffic (Chelsio T6s vs. Mellanox ConnectX-5s), but that didn't work out so well, as I have not been able to get any proper numbers from the SSD pool.

However, maybe I can simply explain the need I have (and I am certain many others have too), and given your experience maybe you can tell me whether or not my hope would be substantiated by RDMA enablement :)

The main problem I (and many others I know) have had for years is not that the (various) systems perform badly, but that it is impossible to get the actual performance out to the consumer side.

For example, the "non-performing" system I mentioned above runs 12 mirrored, dual-pathed SAS3 SSDs, so it is perfectly capable of writing several GB/s. The problem is that I cannot get that performance to my consumers (ESXi boxes in my case); for some reason it was limited to about 800MB/s. (I think there was an issue, but that's not the point here; even if I got >1GB/s it would still be far below the 4.5 GB/s I see locally [of course depending on the actual test].)

Everybody I've been talking to over the years always said "It's the increased latency caused by networking", and that's perfectly understandable.
This issue does not only impact SAS or NVMe pools; you have the same problem on small HDD pools as well - even a 6-drive RAIDZ2 is slower remotely than locally (of course).

Up until now I was under the impression (or had the hope) that by using RDMA (be it RoCE or iWARP, with RoCE promising lower latency and iWARP a simpler setup) I would be able to mitigate the latency penalty of networking to at least a certain degree, so any setup would enjoy at least some remote speedup from this.

Your comment, however, seems to indicate that my hope for what RDMA could do might not be realistic?

Thanks & cheers
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Hi Morgan,

thanks for the explanation.
I wanted to quickly whip up a comparison of RDMA vs. non-RDMA traffic (Chelsio T6s vs. Mellanox ConnectX-5s), but that didn't work out so well, as I have not been able to get any proper numbers from the SSD pool.

However, maybe I can simply explain the need I have (and I am certain many others have too), and given your experience maybe you can tell me whether or not my hope would be substantiated by RDMA enablement :)

The main problem I (and many others I know) have had for years is not that the (various) systems perform badly, but that it is impossible to get the actual performance out to the consumer side.
Not impossible, but not necessarily simple.

For example, the "non-performing" system I mentioned above runs 12 mirrored, dual-pathed SAS3 SSDs, so it is perfectly capable of writing several GB/s. The problem is that I cannot get that performance to my consumers (ESXi boxes in my case); for some reason it was limited to about 800MB/s. (I think there was an issue, but that's not the point here; even if I got >1GB/s it would still be far below the 4.5 GB/s I see locally [of course depending on the actual test].)

With 12 mirrored SAS SSDs, we can get 3-6 GB/s and over 60K IOPS out of them... depending on I/O size and mix. We also use an NVDIMM as SLOG. This would be a TrueNAS M40 platform... only 10 cores and 192GB RAM.

We test with 12 client machines (25GbE) using iSCSI (100GbE on the NAS). How are you testing?
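Purely as an illustration (not necessarily our exact job file; /dev/sdX is just a placeholder for the iSCSI LUN as seen on a client), each client runs a parallel large-block read along these lines, and the results are aggregated across the clients:

fio --name=iscsi-read --filename=/dev/sdX --rw=read --bs=128k \
    --iodepth=32 --numjobs=4 --direct=1 --ioengine=libaio \
    --runtime=60 --time_based --group_reporting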



Everybody I've been talking to over the years always said "It's the increased latency caused by networking", and that's perfectly understandable.
This issue does not only impact SAS or NVMe pools; you have the same problem on small HDD pools as well - even a 6-drive RAIDZ2 is slower remotely than locally (of course).
Agreed, networking does slow down single-client performance. Are you focused on single-client issues?
The key to more bandwidth on single clients is having more queue depth and larger I/Os.
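As a purely illustrative comparison (again, /dev/sdX is a placeholder for the LUN on the client), a shallow small-block run versus a deep large-block run shows the difference quickly:

# queue depth 1, 4K blocks: dominated by per-request network latency
fio --name=shallow --filename=/dev/sdX --rw=read --bs=4k --iodepth=1 --direct=1 --ioengine=libaio --runtime=30 --time_based
# queue depth 32, 1M blocks: keeps the link and the pool busy
fio --name=deep --filename=/dev/sdX --rw=read --bs=1M --iodepth=32 --direct=1 --ioengine=libaio --runtime=30 --time_based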

Up until now I was under the impression (or had the hope) that by using RDMA (be it RoCE or iWARP, with RoCE promising lower latency and iWARP a simpler setup) I would be able to mitigate the latency penalty of networking to at least a certain degree, so any setup would enjoy at least some remote speedup from this.

Yes, a certain amount of speedup is possible, but I don't think it's that dramatic unless a lot of the data is in cache/ARC.
The question is how much CPU time TCP is taking and how much latency it is adding.
It's typically not a dominant issue.

One test is to deliberately read data from the ARC... access a smaller 4GB volume. If that is fast enough, then TCP is not the issue.
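For example (illustrative only; /mnt/share/testfile stands for any file on the share that fits comfortably in the NAS's RAM):

# create/read a 4GB test file; this run also warms the ARC on the NAS
fio --name=arc-warm --filename=/mnt/share/testfile --size=4g --rw=read --bs=1M --direct=1 --ioengine=libaio
# repeat reads are now served from memory, so any shortfall is protocol/network, not disks
fio --name=arc-read --filename=/mnt/share/testfile --size=4g --rw=read --bs=1M --iodepth=16 --direct=1 --ioengine=libaio --runtime=30 --time_based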

Your comment, however, seems to indicate that my hope for what RDMA could do might not be realistic?

I'd first look at the performance testing model you are using. You should be able to get more performance, but it may depend on how you test and tune the system.

Thanks & cheers

It's an interesting topic. It would be cool if someone else in the community could provide some real-world experience... even if it's with a non-TrueNAS system.

Morgan
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
One probably unrelated comment: 10GBase-T Ethernet has higher latency than 10Gbps over fiber. It is not a "killer" of performance, but something to be aware of. It would be worse if both the NAS and the clients used 10GBase-T to a 10GBase-T switch, meaning twice the latency. Again, not terrible, but something to understand.
 

efeu

Cadet
Joined
Jan 24, 2019
Messages
5
I was building my own active/passive ZFS iSER target a while ago. I really would like to see iSER on TrueNAS SCALE, to migrate to.

JFYI
With iSER I'm getting around 1 million 4K IOPS and 7GiB/s of maximum transfer from a single host to a single iSER target (the host I'm testing from is limited by PCIe 3.0 x8). Using iSCSI I get 6GiB/s and around 600K 4K IOPS.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
I was building my own active/passive ZFS iSER target a while ago. I really would like to see iSER on TrueNAS SCALE, to migrate to.

JFYI
With iSER I'm getting around 1 million 4K IOPS and 7GiB/s of maximum transfer from a single host to a single iSER target (the host I'm testing from is limited by PCIe 3.0 x8). Using iSCSI I get 6GiB/s and around 600K 4K IOPS.

What was the back-end storage for those tests?

Were you testing with TrueNAS SCALE or TrueNAS 12.0 or TrueNAS 13.0?
 

efeu

Cadet
Joined
Jan 24, 2019
Messages
5
No, I set this up a few years ago using Ubuntu as the target. Nothing should have changed for doing it in a similar way on Debian today.

SuperStorage 2029P-DN2R24L dual-node in-a-box (each node with 2x Xeon Silver 4108, 192GB RAM, 2x Mellanox ConnectX-3 Pro 56G dual-port)
10x Samsung SSD PM1725
Kernel 4.15.0-50-generic, SCST 3.4, in-box kernel Mellanox driver
as target

Some "old" Sandy Bridge-E Testlab Xeon with a ConnectX4 DualPort 56G

Those numbers (1M IOPS, 7GiB/s) came from some simple fio read tests from the test-lab initiator; it was actually reading from the ARC to see the overhead, and there was for sure much less CPU usage compared to plain iSCSI. And actually those numbers were "limited" by the low-clocked CPU and "only" PCIe 3.0 x8 on the initiator side.

I made a write-up of my test-lab installation and some specific parallel 3-host ESXi in-VM benchmarks:

For "production" I split up those 10 disks in two 5disk RAIDZ and have them running and beeing exported one from each node, while the other node is the "standby". So I have both nodes active beeing the "failover" for the other one. So actual I do not have an acitve-passive setup I have an active-passive passive-active setup.

So, since you switched from FreeBSD to Debian, there should not be a lot of work needed to get an iSER target running on TrueNAS SCALE.

Install the rdma-core package from the repo and enable RDMA/iSER by loading these modules:
ib_isert
rdma_cm
knem

Then enable an iSER target portal on the iSCSI target (I don't know which one you use in TrueNAS, TGT, LIO or SCST, but all three support iSER).

That's it.

Probably anyone with TrueNAS could set that up in a few minutes from the shell. (I can't because I don't have the test environment right now.)

Ensure your network is lossless, either with Global Pause for Layer 2 or ECN+PFC for Layer 3, but I guess for a "simple test" that would not be needed.
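Roughly, on a Debian-based target the steps above boil down to something like this (the interface name and PFC priority are just example values; adjust to your NICs and the iSCSI target you actually use):

apt install rdma-core          # userspace RDMA stack
modprobe ib_isert              # iSER target support
modprobe rdma_cm               # RDMA connection manager
modprobe knem                  # as listed above; optional, if available
# then add an iSER-enabled portal in your iSCSI target (TGT, LIO or SCST)

# lossless Ethernet, one of:
ethtool -A eth0 rx on tx on                # Global Pause (Layer 2)
mlnx_qos -i eth0 --pfc 0,0,0,1,0,0,0,0     # PFC on priority 3 (Mellanox NICs)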
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
@efeu Thanks for the clarification.

That's about a 15-30% performance increase if the back-end storage is DRAM.

With a mixed read/write workload to SSDs, we expect the performance differences to be much smaller. Most of the community and our customers are interested in that all-flash performance. If the data is in DRAM, it usually stays within the server itself, not the storage.

We're always looking for the most significant performance impacts and generally make pretty big improvements each release. This is a comparison between 12.0 and 13.0 for an SSD NAS.

[Chart: TrueNAS 12.0 vs. 13.0 performance comparison, SSD NAS]

Feel free to make a suggestion in Jira... we are keen to find users that need it for their business.
 