Cluster Relative SMB Performance

NickF

Guru
Joined
Jun 12, 2014
Messages
763
Hi all,

With SMB clustering having been out for about a year now, and with the first minor updates to Bluefin (22.12.1) and TrueCommand (2.3.1) both documenting improvements in this area, I've decided now is a good time to start playing around with clustering. I switched to SCALE in 2021, and I am putting my money where my mouth is.

I've put together a cluster of 3 small "TinyMiniMicro" nodes, each with 16GB of RAM and 2x 512GB NVMe drives (the computer chosen has two PCIe Gen4 x4 NVMe slots) to do this testing. I think this represents a pretty good "worst case scenario" for what people might try to do to stretch relatively low-end gear. If CORE is scale-up with tons of disks in disk shelves, SCALE is scale-out with multiple computers. Folks are undoubtedly going to try to do this on the cheap, and the TinyMiniMicro form factor is, I think, the perfect low-end starting point for understanding what performance might look like. I have done testing with the onboard gigabit Ethernet as well as with a 5 Gigabit Ethernet USB adapter, connected to a multigigabit-capable switch.

Anyway, for reference, here is the SMB performance between my desktop (10 Gigabit Ethernet) and my "Production" homelab server (40 Gigabit Ethernet). I am using CrystalDiskMark just so that I have a reasonable baseline for comparing against other things, as the entire point of this exercise in futility is for me to understand the relative performance of things and share my findings.
Crystal test prod smb.jpg


FIO looks like this:
Code:
C:\Users\nickf>fio --bs=128k --direct=1 --directory=z\:fio --gtod_reduce=1 --iodepth=32 --group_reporting --name=randrw --numjobs=12 --ramp_time=10 --runtime=60 --rw=randrw --size=256M --time_based
fio: this platform does not support process shared mutexes, forcing use of threads. Use the 'thread' option to get rid of this warning.
randrw: (g=0): rw=randrw, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=windowsaio, iodepth=32
...
fio-3.32
Starting 12 threads
randrw: Laying out IO file (1 file / 256MiB)
randrw: Laying out IO file (1 file / 256MiB)
randrw: Laying out IO file (1 file / 256MiB)
randrw: Laying out IO file (1 file / 256MiB)
randrw: Laying out IO file (1 file / 256MiB)
randrw: Laying out IO file (1 file / 256MiB)
randrw: Laying out IO file (1 file / 256MiB)
randrw: Laying out IO file (1 file / 256MiB)
randrw: Laying out IO file (1 file / 256MiB)
randrw: Laying out IO file (1 file / 256MiB)
randrw: Laying out IO file (1 file / 256MiB)
randrw: Laying out IO file (1 file / 256MiB)
Jobs: 12 (f=9): [m(1),/(1),m(4),/(1),m(1),/(1),m(3)][15.5%][r=166KiB/s,w=142KiB/s][r=1,w=1 IOPS][eJobs: 12 (f=12): [m(12)][16.9%][r=196MiB/s,w=205MiB/s][r=1565,w=1638 IOPS][eta 00m:59s]
Jobs: 12 (f=12): [m(12)][100.0%][r=153MiB/s,w=151MiB/s][r=1220,w=1206 IOPS][eta 00m:00s]
randrw: (groupid=0, jobs=12): err= 0: pid=58196: Sun Feb 26 01:07:30 2023
  read: IOPS=1664, BW=208MiB/s (218MB/s)(12.3GiB/60413msec)
   bw (  KiB/s): min=82446, max=399460, per=93.26%, avg=198726.02, stdev=6562.68, samples=627
   iops        : min=  640, max= 3115, avg=1549.01, stdev=51.19, samples=627
  write: IOPS=1667, BW=209MiB/s (219MB/s)(12.3GiB/60413msec); 0 zone resets
   bw (  KiB/s): min=107162, max=398916, per=92.08%, avg=197283.26, stdev=6451.41, samples=627
   iops        : min=  832, max= 3111, avg=1537.82, stdev=50.32, samples=627
  cpu          : usr=0.00%, sys=1.52%, ctx=0, majf=0, minf=0
  IO depths    : 1=0.0%, 2=0.1%, 4=0.1%, 8=0.2%, 16=13.5%, 32=86.3%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=99.5%, 8=0.3%, 16=0.1%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=100571,100752,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=208MiB/s (218MB/s), 208MiB/s-208MiB/s (218MB/s-218MB/s), io=12.3GiB (13.2GB), run=60413-60413msec
  WRITE: bw=209MiB/s (219MB/s), 209MiB/s-209MiB/s (219MB/s-219MB/s), io=12.3GiB (13.3GB), run=60413-60413msec

C:\Users\nickf>

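For anyone who wants to reproduce these runs, here is the same fio invocation with each flag annotated as I understand it; wrap it in a batch file and point --directory at whatever mapped drive you're testing.
Code:
@echo off
REM Annotated version of the fio command used throughout this thread (my own notes on the flags).
REM --bs=128k         : 128 KiB block size per I/O
REM --direct=1        : non-buffered I/O, bypass the client-side cache
REM --directory=...   : target directory on the mapped drive under test (z: here)
REM --gtod_reduce=1   : cut timing-call overhead (drops the detailed latency stats)
REM --iodepth=32      : 32 outstanding I/Os per job
REM --group_reporting : aggregate all jobs into one summary
REM --numjobs=12      : 12 worker threads
REM --ramp_time=10    : ignore the first 10 seconds of I/O
REM --runtime=60 --time_based : run for a fixed 60 seconds
REM --rw=randrw       : mixed random read/write (50/50 by default)
REM --size=256M       : 256 MiB working file per job
fio --bs=128k --direct=1 --directory=z\:fio --gtod_reduce=1 --iodepth=32 --group_reporting --name=randrw --numjobs=12 --ramp_time=10 --runtime=60 --rw=randrw --size=256M --time_based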

Performance of my local Gen 3 U.2 NVME drives on my desktop is as follows:
crystal test local nvme.jpg


FIO looks like this:
Code:
C:\Users\nickf>fio --bs=128k --direct=1 --directory=e\:fio --gtod_reduce=1 --iodepth=32 --group_reporting --name=randrw --numjobs=12 --ramp_time=10 --runtime=60 --rw=randrw --size=256M --time_based
fio: this platform does not support process shared mutexes, forcing use of threads. Use the 'thread' option to get rid of this warning.
randrw: (g=0): rw=randrw, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=windowsaio, iodepth=32
...
fio-3.32
Starting 12 threads
randrw: Laying out IO file (1 file / 256MiB)
randrw: Laying out IO file (1 file / 256MiB)
randrw: Laying out IO file (1 file / 256MiB)
randrw: Laying out IO file (1 file / 256MiB)
randrw: Laying out IO file (1 file / 256MiB)
randrw: Laying out IO file (1 file / 256MiB)
randrw: Laying out IO file (1 file / 256MiB)
randrw: Laying out IO file (1 file / 256MiB)
randrw: Laying out IO file (1 file / 256MiB)
randrw: Laying out IO file (1 file / 256MiB)
randrw: Laying out IO file (1 file / 256MiB)
randrw: Laying out IO file (1 file / 256MiB)
Jobs: 12 (f=12): [m(12)][100.0%][r=6599MiB/s,w=6608MiB/s][r=52.8k,w=52.9k IOPS][eta 00m:00s]
randrw: (groupid=0, jobs=12): err= 0: pid=60024: Sun Feb 26 01:00:21 2023
  read: IOPS=55.4k, BW=6931MiB/s (7267MB/s)(407GiB/60099msec)
   bw (  MiB/s): min= 2294, max=13115, per=99.29%, avg=6881.30, stdev=199.42, samples=840
   iops        : min=18355, max=104920, avg=55047.36, stdev=1595.38, samples=840
  write: IOPS=55.5k, BW=6938MiB/s (7275MB/s)(407GiB/60099msec); 0 zone resets
   bw (  MiB/s): min= 2394, max=13213, per=99.25%, avg=6885.77, stdev=199.50, samples=840
   iops        : min=19152, max=105703, avg=55083.20, stdev=1596.03, samples=840
  cpu          : usr=3.33%, sys=14.71%, ctx=0, majf=0, minf=0
  IO depths    : 1=0.2%, 2=0.6%, 4=2.9%, 8=10.3%, 16=58.5%, 32=27.6%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=96.8%, 8=1.3%, 16=0.7%, 32=1.2%, 64=0.0%, >=64=0.0%
     issued rwts: total=3332003,3335414,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=6931MiB/s (7267MB/s), 4096MiB/s-6931MiB/s (4295MB/s-7267MB/s), io=407GiB (437GB), run=60099-60099msec
  WRITE: bw=6938MiB/s (7275MB/s), 4096MiB/s-6938MiB/s (4295MB/s-7275MB/s), io=407GiB (437GB), run=60099-60099msec

C:\Users\nickf>


Now, with those REFERENCE points out of the way, here are my test results using gigabit ethernet on my SCALE cluster:
Crystal test cluster smb.jpg


For some reason I lost the text output of FIO for this test, but here's a screenshot:
y drive fio.jpg


The same test at 5GbE (remember, this is USB, so I think you can expect 2.5GbE PCIe cards to perform about the same):
Crystal test cluster smb 5g.jpg


and FIO at 5G:
Code:
C:\Users\nickf>fio --bs=128k --direct=1 --directory=y\:fio --gtod_reduce=1 --iodepth=32 --group_reporting --name=randrw --numjobs=12 --ramp_time=10 --runtime=60 --rw=randrw --size=256M --time_based
fio: this platform does not support process shared mutexes, forcing use of threads. Use the 'thread' option to get rid of this warning.
randrw: (g=0): rw=randrw, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=windowsaio, iodepth=32
...
fio-3.32
Starting 12 threads
Jobs: 12 (f=12): [m(12)][100.0%][r=91.9MiB/s,w=99.6MiB/s][r=735,w=796 IOPS][eta 00m:00s]
randrw: (groupid=0, jobs=12): err= 0: pid=13596: Tue Mar 7 23:23:48 2023
  read: IOPS=746, BW=93.7MiB/s (98.3MB/s)(5686MiB/60662msec)
   bw (  KiB/s): min=52492, max=136739, per=100.00%, avg=96116.32, stdev=1247.48, samples=1439
   iops        : min=  408, max= 1064, avg=747.91, stdev= 9.72, samples=1439
  write: IOPS=755, BW=94.8MiB/s (99.4MB/s)(5753MiB/60662msec); 0 zone resets
   bw (  KiB/s): min=59439, max=136208, per=100.00%, avg=97264.78, stdev=1200.72, samples=1439
   iops        : min=  463, max= 1062, avg=756.90, stdev= 9.36, samples=1439
  cpu          : usr=0.00%, sys=0.00%, ctx=0, majf=0, minf=0
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=6.8%, 32=93.2%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=99.9%, 8=0.1%, 16=0.1%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=45295,45842,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=93.7MiB/s (98.3MB/s), 93.7MiB/s-93.7MiB/s (98.3MB/s-98.3MB/s), io=5686MiB (5962MB), run=60662-60662msec
  WRITE: bw=94.8MiB/s (99.4MB/s), 94.8MiB/s-94.8MiB/s (99.4MB/s-99.4MB/s), io=5753MiB (6032MB), run=60662-60662msec


I was going to test the relative performance of a SINGLE one of my test nodes, but it's getting late, and TrueCommand/SCALE may be feature-incomplete in that I can't seem to decommission the cluster and go back to running a single node. This is still listed as experimental, so that's fine. I'll just wipe one of the test nodes and update this thread with those results when I have some time.
1678249691146.png


I hope this helps some folks... and I hope I didn't perturb too many people by going off the beaten path and using a USB NIC; @jgreco is probably screaming inside right now. In any case, my wife is going to have to deal with the electric bill from my big-boy server a bit longer before a move to a hyperconverged, highly available SCALE cluster is ready for prime time.
 


morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Thanks for this test... we do a lot of testing internally as well.

Bottom line is that clusters need network bandwidth.
To cluster NVMe drives, a 100GbE network is needed.

Typically, a 3-node cluster is not faster than a single node; clusters need to do more work to write or read data.
Clusters also have higher latency than single nodes, so they are better suited to larger numbers of clients and to addressing bandwidth or capacity limitations.
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
I think that's understood, but it was worth playing around with and sharing my findings. The fact is that people are going to think that, as with RAIDZ1 or a mirrored array, read performance might be better. However, because of the latency inherent between the nodes and the additional computational overhead involved, that's simply not the case. As for the networking, I concur, which is why I purchased the fastest NICs available for my test platform and shared the performance scaling between the two.

Gigabit is simply not going to be fast enough, and even my multigigabit performance sucks. Removing the drives as a bottleneck by using NVMe has shown just how hard the network stack's bottleneck is hitting. Computationally speaking, these nodes are far faster than a Mini:
1678292565146.png
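One follow-up I still want to do is separate the raw network ceiling from the SMB/Gluster overhead. A quick iperf3 run between the client and a node would show that; iperf3 should be available from the SCALE shell, and there are Windows builds for the client side (the host name below is a placeholder):
Code:
# On one of the SCALE nodes (server side):
iperf3 -s

# On the Windows client: 4 parallel streams for 30 seconds, then the reverse direction:
iperf3 -c truenas-node1 -P 4 -t 30
iperf3 -c truenas-node1 -P 4 -t 30 -R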

The value-add of 25 gigabit over 10 gigabit will really shine here, and I think if people are looking to build a scale-out cluster, that should really be the starting point. Getting less-than-gigabit performance out of a 5GbE interface with NVMe is simply not worth it. People are going to start asking these questions here as this feature set matures and more people dip their toes in. At least now they can find this thread and see what not to do :)
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Greatly appreciate the testing being done, and even more so the sharing of it in a "journey for knowledge" format here so that others can benefit from your experiments as well - both in a "what to do" and a "what not to do" sense, as you say. :wink:
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
Greatly appreciate the testing being done, and even more so the sharing of it in a "journey for knowledge" format here so that others can benefit from your experiments as well - both in a "what to do" and a "what not to do" sense, as you say. :wink:
Right! That's the intention.

For posterity's sake, my expectation was that the gigabit adapter would perform at approximately 1/3 the speed of a single node, and that the 5 Gigabit adapter (which, because of USB limitations, might better be described as 3.2 Gigabit) would scale similarly. In other words, when you are reading or writing to 3 nodes at the same time, you are splitting that pipe up into thirds.

Assuming a single node can do 110MB/s sequential read and write in Crystal (which I think is a fair assumption that I will verify), I expected to get somewhere around 36 MB/s, and I got pretty close to that expectation at 40 MB/s mixed. That works out to about 36% of one node saturating a gigabit link.

I had similarly expected the 5 Gigabit adapter (which should be around 400MB/s on a single node) to come in at about 133MB/s. I came in at 52 MB/s, which is only 13%. If there is any takeaway from my testing, it's that performance scaling between network adapters does not appear to be linear. Obviously, this is one man's test (and using an Aquantia USB NIC!!), so that may not hold, and hopefully I spark some other curious minds to contribute here.
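To spell out the back-of-the-napkin math behind those expectations (the single-node figures are assumptions I still need to verify, not measurements):
Code:
Assumption: the client-facing pipe is split across writes to all 3 nodes, so expect roughly 1/3 of single-node speed.

1 GbE : ~110 MB/s assumed single node / 3 = ~37 MB/s expected  |  measured ~40 MB/s = ~36% of 110 MB/s
5 GbE : ~400 MB/s assumed single node / 3 = ~133 MB/s expected |  measured ~52 MB/s = ~13% of 400 MB/s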
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
If I can just ask here: this isn't something that seems possible with the GUI right now, but it may be possible. Can we dedicate interfaces to inter-node communication and have a different dedicated network interface for client-facing traffic? Obviously, further layers of abstraction with LAGGs would be even more ideal.
At the very least, that model may prove to lessen the performance penalties here.

MY UNDERSTANDING of the traffic flow is like this:
Untitled Diagram.drawio (9).png


All traffic hits any one node at a time, and that node distributes the parity bits to the bricks on the other two nodes. (Maybe not parity, but whatever the Gluster terminology is here.)
So inherently, depending on how big your cluster is, you are dividing the theoretical maximum "pipe size" by 3 for 3 nodes, by 5 for 5 nodes, etc.
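For illustration, assuming the volume TrueCommand created is a straightforward Gluster "replica 3" (I haven't confirmed the exact volume type, and the names below are made up), the layout would look roughly like this:
Code:
# Hypothetical sketch from a node's shell; volume name and brick paths are placeholders.
# In a replica 3 volume every file lives in full on all three bricks, so each client
# write has to be shipped to every node - which is why the pipe effectively divides by 3.
gluster volume create smbvol replica 3 \
    node1:/cluster-pool/brick node2:/cluster-pool/brick node3:/cluster-pool/brick
gluster volume start smbvol
gluster volume info smbvol   # should report "Type: Replicate" and "Number of Bricks: 1 x 3 = 3"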

So if we add a second physical NIC into the mix, wouldn't it remove the network bottleneck? Obviously there will still be a latency penalty and a computational one, but I think that would still be a substantial performance improvement?

Double lines here are a different physical interface:
Untitled Diagram.drawio (10).png


Maybe @morganL @Kris Moore have some thoughts they can share here.

Within the limits of my test platform as it is, if I were to add another 5GbE NIC, I feel like it would make a big difference and maybe even make this a viable lab solution. Even if I used the onboard gigabit interface for client-facing traffic and the 5GbE adapter for inter-node traffic, I would imagine that performance would be better than with the 5GbE adapter alone. I think it would be neat if I could approach gigabit Ethernet speeds via SMB on this platform.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694

Yes, you can have a separate physical or virtual network (VLAN). The physical networks can be on the same or different switches.

The Gluster protocol runs on the back-end network between the nodes.

The SMB access protocol runs on the front-end network.
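If you want to confirm which network the back-end ended up on, the peer and brick addresses are visible from a node's shell with the standard Gluster commands:
Code:
# Run from any node's shell: the peer addresses show the back-end network in use,
# and the brick list shows which hostname/interface each brick is bound to.
gluster peer status
gluster volume info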
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
If I can just ask here: this isn't something that seems possible with the GUI right now, but it may be possible. Can we dedicate interfaces to inter-node communication and have a different dedicated network interface for client-facing traffic? Obviously, further layers of abstraction with LAGGs would be even more ideal.
Yes, it is possible - you have to pre-configure the network interfaces on each node and then select them during the "Create Cluster" process, as that's when the back-end Gluster network is set up.

Using your 1Gbps interface for the "front-end" and the 5Gbps for the "back-end" should improve things overall. There's still the added latency impact of sending the traffic to each of the other cluster members (depending on your redundancy type), but performance should improve over a single-network solution.
 