
The path to success for block storage

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
No, RAIDZ will not work better, and this isn't an unusual use case anyway. RAIDZ carries a lot of baggage in the form of block size selection and parity use.

If you don't care about data integrity other than maybe to notice something amiss, just stripe all the disks.

100gigE isn't likely to be a useful thing for you. The fastest modern disks can sustain an average of about 200 Mbytes/sec of sequential access, which is ballpark 2 Gbits/sec, so you have a theoretical max of roughly 14 * 2 Gbps = 28 Gbps. You won't be seeing that most of the time, but certainly 10GbE would be fine and 40GbE seems justifiable.
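To make that arithmetic explicit, here's a minimal sketch; the 200 MB/s per-disk figure and the 14-disk count come from this thread, and note that 200 MB/s is really ~1.6 Gbit/s, rounded up to "ballpark 2 Gbit/s" above:

```python
# Back-of-envelope: aggregate sequential throughput of 14 spinning disks vs. NIC line rates.
# Per-disk figure (200 MB/s sustained sequential) and disk count are from the discussion above.
DISKS = 14
MB_PER_SEC_PER_DISK = 200

aggregate_mb_s = DISKS * MB_PER_SEC_PER_DISK        # 2800 MB/s
aggregate_gbit_s = aggregate_mb_s * 8 / 1000        # ~22.4 Gbit/s (the post rounds this up to 28)

for nic_gbit in (10, 40, 100):
    limited_by = "NIC" if nic_gbit < aggregate_gbit_s else "disks"
    print(f"{nic_gbit:>3} GbE: usable ~{min(nic_gbit, aggregate_gbit_s):.1f} Gbit/s, limited by {limited_by}")
```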
 

Sirius

Dabbler
Joined
Mar 1, 2018
Messages
41
Hmm, fair enough. I'll stick with mirrors then as I can afford the wasted space for a bit of integrity.

I would have thought 100gigE would take advantage of an Optane-based L2ARC, or even regular ARC, assuming a system with RAM/ARC that's fast enough.

One Optane 900p will do 2500 MB/s, so four together (assuming perfect scaling, which won't happen) would be 10,000 MB/sec. With 8 of those Optanes in a striped L2ARC, that'd be a theoretical 20,000 MB/sec (again assuming perfect scaling), which is above the ~12.5 GB/sec theoretical maximum a 100gigE connection can manage.

Of course, that depends on all that data fitting into L2ARC, but 8 striped 280GB Optanes would be about 2.2TB of L2ARC (assuming no overprovisioning), and there's no way the working set would fill all of that. The system also has 128GB of 1866MHz ECC memory in quad channel, so regular ARC should hold quite a bit of hot data too.
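The same back-of-the-envelope sketch for the L2ARC side, using the device figures quoted above and explicitly assuming the perfect scaling that (as noted) won't actually happen:

```python
# Optimistic L2ARC estimate: striped Optane 900p devices, perfect scaling assumed
# (the post above acknowledges real scaling will be worse).
OPTANE_MB_S = 2500      # per-device sequential read, per the post
OPTANE_GB = 280         # per-device capacity, no overprovisioning
COUNT = 8

aggregate_read_mb_s = COUNT * OPTANE_MB_S       # 20,000 MB/s in the ideal case
l2arc_capacity_tb = COUNT * OPTANE_GB / 1000    # ~2.24 TB
line_rate_100gbe_mb_s = 100_000 / 8             # ~12,500 MB/s theoretical

print(f"Striped L2ARC (ideal): {aggregate_read_mb_s} MB/s across {l2arc_capacity_tb:.2f} TB")
print(f"100GbE theoretical:    {line_rate_100gbe_mb_s:.0f} MB/s")
```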

The only reason for not just going all-flash is that it's simply too far out of my budget - ~14 7200rpm 8TB drives aren't too bad to purchase, but ~14 8TB flash drives (or ~7 if I stripe them, assuming enterprise drives)? I hate to imagine how much those would cost, even off eBay.

As for my current network, I use Mellanox ConnectX-3 cards with the 56gigE firmware tweak applied, so I've seen sequential speeds as high as 5GB/sec. I'm easily breaking past 10gig performance.

Thank you so much for your reply, I appreciate it! :)
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Speaking strictly from a speed point of view, yes, ARC and L2ARC could potentially hit significantly higher than 40GbE speeds.

However, you're also talking about iSCSI here, so there's a practical limit to the number of transactions per second you can manage across the wire on a TCP circuit. It isn't just a matter of "hey, iperf3 says I got fifty gigabits"; all of your traffic has to traverse a complex stack of components and software.

https://www.truenas.com/community/threads/some-insights-into-slog-zil-with-zfs-on-freenas.13633/

I go into latency a bit in the latter half of that post. It's a little bit bull*** for various mitigating reasons, but it's still very much worth contemplating why there are practical limits to how fast a NAS/SAN can go.

If you had multiple iSCSI clients, I'd bet they would do a much better job of utilizing your resources.
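A toy model of the latency point above. The round-trip time and I/O size here are illustrative assumptions, not measurements, but they show why a single synchronous stream can't get anywhere near line rate, and why more clients (or deeper queues) help far more than a faster NIC:

```python
# Toy model: queue-depth-1 iSCSI throughput limited by per-operation round-trip latency.
# Both numbers below are hypothetical, chosen only to illustrate the shape of the problem.
io_size_kb = 16        # e.g. one 16K ZVOL block per operation
round_trip_us = 200    # wire + TCP + iSCSI target + ZFS + back

ops_per_sec = 1_000_000 / round_trip_us             # 5,000 IOPS at QD1
throughput_mb_s = ops_per_sec * io_size_kb / 1024   # ~78 MB/s

print(f"QD1: {ops_per_sec:.0f} IOPS -> {throughput_mb_s:.0f} MB/s per stream,")
print("far below 40/100GbE line rate; concurrency, not link speed, closes the gap")
```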
 

Sirius

Dabbler
Joined
Mar 1, 2018
Messages
41
Interesting article. My current SLOG is 2 x Optane 900p 280GB, although I've considered a P4800X or 905p; then again, write performance isn't as important to me as read performance.

My thought was that if I went with, say, Chelsio T62100-CR cards on both the target and initiator ends, I could use either iSCSI offload or iSER to reduce the overhead between the two systems. I wouldn't really try to run 100gigE without RDMA or offload, as I imagine that'd suck up CPU like crazy - not just on the target but also on the initiator - hence why I thought iSER or iSCSI offload could help.

According to the manual for those cards, they support iSCSI offload on both Windows 10 Pro for Workstations and FreeBSD, and if I were OK with running Linux on the target side, I could use iSER on both ends. I considered Mellanox cards, but they dropped iSER support on Windows, and I don't think they've ever done iSCSI offload.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Noob here. How did you know, and how can we know what to avoid when purchasing drives?

 

medicineman25

Dabbler
Joined
Mar 20, 2021
Messages
29
Far out! I was literally going to purchase half of these brands recently. Thanks for the info :)
 

dffvb

Dabbler
Joined
Apr 14, 2021
Messages
42
Good read - maybe worth updating for differences if one uses flash only?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Good read - maybe worth updating for differences if one uses flash only?
Have a look at post #29 in this thread for a bit of detail on that front re: RAIDZ vs mirrors.

 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I linked the newly-posted Resource to this old thread, to preserve the discussion. XenForo actually made a relevant suggestion!
 

kspare

Guru
Joined
Feb 19, 2015
Messages
508
1. You'll have better results with smaller-width vdevs (e.g. the 2x 3-drive layout) in terms of both performance and possibly also space efficiency. You'll also have better redundancy: with the single 6-drive Z1 you can only tolerate one failure, whereas with the 2x 3-drive layout you can lose one drive in each of the two vdevs. You'll virtually never be bottlenecked by parity calculation speed on any modern processor.

2. The default ZVOL block size in Free/TrueNAS is 16K. Mirrors can handle any size, but on RAIDZ, if you reduce the blocksize/recordsize too far, you end up with poor space utilization as described before (see the sketch after this list). 16K is usually fine; I've been getting good results with 32K, as it allows bigger records to compress where possible.

3. Without a non-volatile write cache I can't see that Sabrent working well (and looking at the SLOG thread, it choked hard at small record sizes) - you can use non-enterprise Optane like the M10 or 900p/905p, but you need to mind the endurance. It depends on how hard you write to your array, but in your scenario (home lab, you can afford downtime?) I'd be tempted to just keep a handful of 16G Optane cards and swap them when they hit the 300TBW mark.
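To illustrate the space-utilization point in item 2, here's a rough sketch of RAIDZ1 allocation for small blocks, assuming 4K sectors (ashift=12) and the usual RAIDZ1 accounting: one parity sector per stripe of up to (width - 1) data sectors, with the total allocation padded to a multiple of (parity level + 1) sectors. Mirrors are a flat 50% for comparison:

```python
import math

def raidz1_alloc_sectors(block_bytes, width, sector=4096):
    """Sectors consumed by one block on a RAIDZ1 vdev, assuming ashift=12:
    one parity sector per stripe of up to (width - 1) data sectors,
    total padded to a multiple of (parity level + 1) = 2 sectors."""
    data = math.ceil(block_bytes / sector)
    parity = math.ceil(data / (width - 1))
    total = data + parity
    return total + (total % 2)          # pad up to an even sector count

for block_kb in (4, 16, 32, 128):
    data_sectors = math.ceil(block_kb * 1024 / 4096)
    for width in (3, 6):
        used = raidz1_alloc_sectors(block_kb * 1024, width)
        print(f"{block_kb:>3}K block, {width}-wide Z1: {used:>3} sectors "
              f"({data_sectors / used:.0%} efficient vs. nominal {(width - 1) / width:.0%})")
    print(f"{block_kb:>3}K block, 2-way mirror: 50% efficient by definition")
```

Small blocks on RAIDZ1 land well below the nominal space efficiency, and at 4K they're no better than mirrors, which is the trap the quoted advice is warning about.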
Are you matching your dataset record size and block size to both be 32K?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Are you matching your dataset record size and block size to both be 32K?

If we're talking ZFS parameters, only one of them matters: recordsize is relevant to NFS/datasets, and volblocksize is relevant to ZVOLs/iSCSI, so a recordsize set at the pool level doesn't affect the volblocksize of ZVOLs created underneath it.

If we're talking about the size of I/Os "inside the VMDKs," then 32K is what I've used for "general/uncategorized" workloads, as I find it gives measurable compression gains over the default 16K. For things like Exchange/SQL, where a 64KB NTFS allocation unit is specified as a "best practice," set a matching size.
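One way to sanity-check the "bigger records compress better" point against your own data is to compress a representative file chunk-by-chunk at the candidate block sizes and compare totals. This is only a rough sketch: the file name is a placeholder, and zlib stands in for ZFS's lz4/zstd purely to show the granularity effect:

```python
import zlib

def compressed_size(path, chunk_size):
    """Total compressed size when the file is compressed chunk-by-chunk,
    mimicking how ZFS compresses each record/block independently."""
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            total += len(zlib.compress(chunk, 1))   # fast compression level
    return total

sample = "sample.vmdk"   # placeholder: any file representative of the workload
for size_kb in (16, 32, 64):
    print(f"{size_kb}K chunks: {compressed_size(sample, size_kb * 1024):,} bytes")
```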
 

kspare

Guru
Joined
Feb 19, 2015
Messages
508
If we're talking ZFS parameters, only one of them matters: recordsize is relevant to NFS/datasets, and volblocksize is relevant to ZVOLs/iSCSI, so a recordsize set at the pool level doesn't affect the volblocksize of ZVOLs created underneath it.

If we're talking about the size of I/Os "inside the VMDKs," then 32K is what I've used for "general/uncategorized" workloads, as I find it gives measurable compression gains over the default 16K. For things like Exchange/SQL, where a 64KB NTFS allocation unit is specified as a "best practice," set a matching size.
That makes sense. I found 32K or 64K worked best for general use, with 128K being too high.

However, with 128K it sure used a lot less CPU!
 

im.thatoneguy

Dabbler
Joined
Nov 4, 2022
Messages
37
Following up here on block sizes. I have an NVMe RAID card which I hope to use for L2ARC (I figure I should take some of the CPU load off wherever possible). When creating the virtual RAID0 disk for L2ARC, should I match its stripe size to volblocksize? Does it matter? It defaults to a 64K stripe size. Does L2ARC care?

I ran fio on /dev/RAIDCARD and got 8GB/s read, 4GB/s write, and >200k IOPS, so in theory it seems to be an excellent L2ARC sink, but its real-world performance is questionable. I'm wondering if setting a 64K volblocksize across the board would help.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
When creating the virtual RAID0 disk for L2ARC
It's suggested that you not use RAID devices anywhere - including for L2ARC. Bear in mind that volblocksize and recordsize are maximum values; smaller records can still be written, so the read-modify-write overhead of a 4K block on a 64K-stripe RAID device would be significant.
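To put a rough number on that concern, here's a minimal sketch under the assumption (as suggested above) that a sub-stripe write gets serviced by reading and rewriting the whole 64K stripe; whether a given RAID controller actually behaves that way is firmware-dependent:

```python
# Worst-case write amplification if a sub-stripe write triggers a full-stripe
# read-modify-write. That behaviour is an assumption here, not a measured fact.
STRIPE_KB = 64

for io_kb in (4, 8, 16, 32, 64):
    amp = STRIPE_KB / io_kb if io_kb < STRIPE_KB else 1
    print(f"{io_kb:>2}K write on a {STRIPE_KB}K stripe: up to {amp:.0f}x amplification "
          "(plus the preceding read)")
```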

What's the device in question?
 

im.thatoneguy

Dabbler
Joined
Nov 4, 2022
Messages
37
What's the device in question?
Supermicro AOC-SLG3-2H8M2-O

Unfortunately, it doesn't allow raw access to the disks the way a PCIe bifurcation device would. But it's the only NVMe board that's officially supported by Supermicro. Maybe I need to just go the Amazon route and buy a generic PCIe x8 -> 2x M.2 x4 board.

The use case is almost exclusively files in the 2MB-80GB range, though, so there shouldn't be many 4K reads (writes to ARC).
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
What do you mean? The 2H8M2 is a SAS3408-based board; it isn't a "bifurcation device" but rather (I'm pretty certain) an IR-mode RAID controller, which is theoretically acceptable for use with TrueNAS. I can't guarantee it, as I haven't seen one.

But it's the only NVMe board that's officially supported by Supermicro.

That's ridiculous. The AOC-SLG3-2M2 is a bifurcation paddleboard that is functionally equivalent to (just more expensive than) any of those "Amazon" cards; I have dozens of them running just fine in X9 systems that support bifurcation.

If you have a board that does not support bifurcation, Supermicro also offers other products based on a PLX switch, such as the AOC-SHG3-4M2P, which supports 4x NVMe M.2 in a PCIe x8 slot. I have one; it works wonders.
 

im.thatoneguy

Dabbler
Joined
Nov 4, 2022
Messages
37
The 2H8M2 is a SAS3408-based board; it isn't a "bifurcation device"
it doesn't allow raw access to the disks the way a PCIe bifurcation device would.


That's ridiculous. The AOC-SLG3-2M2 is a bifurcation paddleboard
It's not on the supported AOC list for my motherboard or system, or on the SLG3-2M2 hardware list. Someone on Amazon said it threw PCIe errors in the IPMI on their unsupported SM motherboard, and they eventually replaced it to stop getting error alerts. The last time I tried a ConnectX NIC that wasn't exactly on SM's supported/tested list, I had nothing but issues; Supermicro tried to debug it but eventually shrugged and said "weird, but it's not a supported model, so there's nothing more we can do." If I'm going with unsupported hardware that might glitch out, I figure I might as well spend 1/4 as much so I can try 4 different cards and see if any of them are better than the others.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
It's not on the supported AOC list for my motherboard or system, or on the SLG3-2M2 hardware list. Someone on Amazon said it threw PCIe errors in the IPMI on their unsupported SM motherboard, and they eventually replaced it to stop getting error alerts. The last time I tried a ConnectX NIC that wasn't exactly on SM's supported/tested list, I had nothing but issues; Supermicro tried to debug it but eventually shrugged and said "weird, but it's not a supported model, so there's nothing more we can do." If I'm going with unsupported hardware that might glitch out, I figure I might as well spend 1/4 as much so I can try 4 different cards and see if any of them are better than the others.

The 2M2 is a bunch of copper traces on a PCB and a few components for power. I could easily imagine an Amazon customer incapable of spelling "bifurcation," much less understanding what "bifurcation support required" means. I've used them on a number of X9 and X10 boards, and was rather enraged to find that bifurcation support didn't exist on an X9DR7-TF+. The thing that most sucks about them is that you have to figure out a little bit of BIOS configuration fu, which is annoying to me since Supermicro risers have little ROMs on them to help the mainboard correctly configure the PCIe slots.

X being incompatible with Y is just a hazard of the business, and I don't really blame Supermicro for bailing on the ConnectX. I spend a certain amount of time (and dollars) identifying cheap parts that can be made to work together, but not all of them are a success. The main upsides of the 2M2 are just that it has a really sweet PCIe retention bracket and SMBus support. Otherwise, the AliExpress cheapies being resold on Amazon are just about the same.
 