vpools on 2 enclosures

StorageCurious

Explorer
Joined
Sep 28, 2022
Messages
60
Hi,

(new here - hi!)

I am planning on testing out TrueNAS for my needs, via a PowerEdge 730XD (768GB RAM, with a non-RAID HBA330) that I just bought for this (and it will either stay on as TrueNAS Core or be repurposed after my tests, so I had nothing to lose here). I've already been trying it on a PC I had lying around, but let's just say it's too limited to try any real-life scenarios.

It is a 24 x 2.5" bay device, which is good enough for testing but I would have liked 3.5" drives for capacity. But what I bought was all I could find (at the right price) rapidly. I figured I might as well use an external enclosure, for example a Dell MD1400 and connect it with a LSI 9300 card that has an external SAS port. This way I would get enough LFF bays for data and plenty of SFF bays for "other needs".

If I go that way, I will want to use all 12 LFF bays for spinning disks (main storage, probably RAIDZ2, although a RAID10 equivalent is possible), while I would use the PowerEdge SFF bays for "non-data helper vdevs" like SLOG/metadata (and of course the TrueNAS Core boot mirror) and possibly a small separate SSD or 15k-drive storage pool. Most bays would be left with blanks, at least initially.

I truly don't know if any of this will be needed, but it might, and two small Optanes for SLOG plus a couple of SSDs for metadata seem like a small price to pay if it makes things noticeably faster.

I am questioning my external SAS enclosure plan. Here are my specific questions:

1) If I were to try to add performance to the LFF pool by adding a SLOG or a metadata vdev (or L2ARC, although with my RAM it's unlikely to be really needed), those extra "helper vdevs" would be on the server itself, while the storage vdevs would be in the enclosure. This would mean I would have a pool with critical disks in both devices. Is there any reason this isn't a good idea in terms of reliability?

2) ...specifically, what happens if the SAS cable is disconnected (or the enclosure is turned off, or loses power before the server does in a power failure) and the TrueNAS server sees only the metadata vdev but loses sight of the storage vdevs? Do I lose the pool, or does the pool "disconnect/go missing" until the enclosure is connected back in?

I picture that little SAS cable and feel it's carrying a lot of weight here.
(If this goes into production I will have a separate system for replication and external cloud storage for backup, but that's not relevant at this point in my queries.)

Or am I worrying too much, and whatever I can imagine happening is just as likely to happen within a single integrated enclosure?

As a side note - what is a good rackmount SAS enclosure recommendation these days for 12 LFF bays with TrueNAS? The MD1400 looked like it could play the part, but that's mostly because I don't know what I am looking for.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
Please describe your use-case in as much detail as possible.

Also, I highly recommend checking the resources from the "Recommended readings" in my signature.
 

StorageCurious

Explorer
Joined
Sep 28, 2022
Messages
60
Please describe your use-case in as much detail as possible.
Trying to understand a hypothetical, which is using an external enclosure so that I have only one TrueNAS instance with a mix of SFF and LFF bays.

But if you mean the eventual use-case, then it would be 80TB of VM virtual disks.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
What are your performance requirements? A volume of 80 TB suggests that 5k IOPS might be a bit tight, but that is just a guess.

Edit: The forum rules have some pretty good points on how to describe what you want to achieve, i.e. the use-case.
 

StorageCurious

Explorer
Joined
Sep 28, 2022
Messages
60
Chris,

I feel we are hijacking my own thread, which was meant to discuss the implications of a pool being split across enclosures via a SAS cable, to instead discuss performance optimization of my use-case. That is to my benefit and appreciated, but it wasn't my most pressing concern.

But if you're so inclined - in terms of performance requirements, I don't quite have numbers. I do know:

- A few of these virtual machines have a total of 60TB of files and are used as a Windows file server or Apple file repository; they may become a dataset shared over SMB via TrueNAS if I test the AFP/SMB aspects and like them - I'm just hoping AD permissions are managed somewhat similarly (security groups, etc.). At worst they'll stay as a big consolidated Windows file server.
Those are infrequently accessed, almost archive-level for the bulk of it.

- Of the other 20TB, about 5TB are actively used (SQL, mail server, etc.), while the other 15TB are VMs that are either turned on and built for a particular purpose (and infrequently used), or turned off and only powered on when needed.

Where did you get 5k IOPs?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I am planning on testing out TrueNAS for my needs, via a PowerEdge 730XD (768GB RAM, with a non-RAID HBA330) that I just bought for this ... 24 x 2.5" bay

Welcome. That's one heck of a "learner system" - it's got more heft behind it than several production builds I've seen. I'll answer your direct questions first and then dig into things a bit more.

1) If I were to try to add performance to the LFF pool by adding a SLOG or a metadata vdev (or L2ARC, although with my RAM it's unlikely to be really needed), those extra "helper vdevs" would be on the server itself, while the storage vdevs would be in the enclosure. This would mean I would have a pool with critical disks in both devices. Is there any reason this isn't a good idea in terms of reliability?

2) ...specifically, what happens if the SAS cable is disconnected (or the enclosure is turned off, or loses power before the server does in a power failure) and the TrueNAS server sees only the metadata vdev but loses sight of the storage vdevs? Do I lose the pool, or does the pool "disconnect/go missing" until the enclosure is connected back in?

The general approach of using JBODs/external enclosures always brings with it some risk of interruption, whether due to the SAS cable failure (which can be mitigated with redundant cabling) or the enclosure power failing (which can be mitigated with redundant power feeds and UPS's) - but things like the enclosure backplane failing are harder to account for. You could even use two external enclosures, two separate HBAs in your "head unit", and configure the pool so that each mirror vdev uses a disk from each enclosure; in that world, you'd theoretically be able to survive a single enclosure failing as each disk would have a mirrored partner in the second enclosure. To scale beyond two enclosures, you then have to make a new pool with the "mirror across JBODs" setup built each time. It would be arduous but doable.
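As a rough sketch of that "mirror across JBODs" layout (device names below are hypothetical - in practice you'd build this from the TrueNAS UI, which references partitions by gptid, but the vdev geometry is the point):

# Assume da0-da5 sit in enclosure A and da6-da11 in enclosure B.
# Every mirror pairs one disk from each enclosure, so either JBOD
# can drop without the pool losing all redundancy.
zpool create tank \
  mirror da0 da6 \
  mirror da1 da7 \
  mirror da2 da8 \
  mirror da3 da9 \
  mirror da4 da10 \
  mirror da5 da11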

Specifically, what should be expected is that all I/O to the pool will stop, and you'll very likely have your server issue a kernel panic if the enclosure isn't recovered within a minute or two (the ZFS deadman timer will trigger, the default behavior is "wait" but it can only "wait" for so long before it will throw a fit) - the question then is "what happens to your data?"

In the case of datasets or zvols that are set to use asynchronous writes (the default for SMB/AFP, and iSCSI) roughly the last few seconds of writes prior to the interruption could be lost permanently, as they were only logged into main RAM. Datasets or zvols that are set to use synchronous writes, on the other hand (which is where your SLOG devices come in) would have those writes safely stored either on the in-pool ZFS Intent Log (ZIL) or on the separate LOG device (SLOG) - and prior to the pool coming back online (either within that short window of recovery without reboot, or after a panic, reboot, and re-import) ZFS will note that there are incomplete transactions in the intent log, and synchronize them to the pool.
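To put commands to that explanation, a minimal sketch (pool, dataset, and device names are hypothetical):

# Force synchronous write semantics on a dataset or zvol (default is "standard")
zfs set sync=always tank/vmstore

# Attach a mirrored SLOG so the intent log lives on fast, power-loss-protected flash
zpool add tank log mirror nvd0 nvd1

# After an ungraceful interruption and re-import, ZFS replays any outstanding
# intent-log records automatically; zpool status shows the pool's health
zpool status tank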

Now, back to your overall system setup.

An R730XD SFF is an incredibly solid base to build a performance system on, and you've already got the correct HBA330 storage controller and "gobs of RAM" - those 24 bays in the head unit are begging to be filled with SSDs and used as a high-performance pool of mirrors. 10K/15K SAS disks have fallen out of favor recently, but are still a potential option if budgetary constraints demand. This would be my preferred option for housing that ~5TB or so of "active VMs" - mirrored SSD or SAS10/15Ks, with a chunk of Optane for an SLOG device. An interesting module you could potentially explore is the U.2 NVMe front bay option that gives you a combination of SAS and NVMe protocols, which would be helpful for SLOGs as you wouldn't have to sacrifice as many internal PCIe slots.

With regards to the LFF expansion for the "file share" capacity, using it for the "less used" VMs as well still means I'm going to lean towards mirrors here. See the resource below that was written on this:


Were this my system personally, I'd construct as follows:

I'd use two disks for mirrored boot (either in the front bays or the rear-mounted 2x2.5" cage, if it's available)

For the "fast pool" if budget allows, I'll use 8x 1.92T SSDs to make a pool with roughly 7.68T of speedy flash, and put my heaviest-hitting VMs (Exchange, SQL, etc) there. If I can't swing SSDs, I could consider something like 16x 1.2T HGST SAS10K refurb drives for not much money, have a 9.6T mirror setup, but never assign more than half of that (4.8T) - and I'll definitely want some L2ARC to support this workload. The amount of primary RAM and L2ARC can definitely mitigate how much of a performance hit you take in going to spinning disks - especially if you get a good compression ratio.

I'll try to buy the 4x U.2 NVMe bay. This gives bays 21 through 24 access to high-speed PCIe x4 - and then again, based on budget, I'll gun for the Optane DC P4800X as the "premium" option, or the budget option of the Optane P4801X or DC P3700 for SLOG devices. Perhaps the premium Optane for the fast pool, and the budget Optane or P3700 for the LFF based pool. If I can't find the NVMe bay, then my preferred SAS SLOG is the Ultrastar DC SS530 - the "Write Intensive" or "WUSTM" model specifically - but I might still consider it worthwhile to sacrifice an internal PCIe slot or two for the Optane drives.

For the LFF pool, I'm going mirrors. 16TB seems to be the sweet spot for dollars-per-TB, either in the Seagate Exos or the WD Ultrastar HC550 series - and I'll buy the SAS style so that I can run a pair of cables and hopefully have redundancy against cable failure. The challenge here is balancing the cost, capacity and IOPS - it's of course less expensive to run fewer, larger drives - but each spindle can only drive a fixed number of IOPS.
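Purely as an illustration of that LFF layout (hypothetical names; with dual SAS cabling the drives would ideally be addressed through their GEOM multipath devices rather than raw daX nodes):

# Six mirrored pairs of 16TB SAS drives, each presented via multipath
zpool create lff \
  mirror multipath/disk1 multipath/disk2 \
  mirror multipath/disk3 multipath/disk4 \
  mirror multipath/disk5 multipath/disk6 \
  mirror multipath/disk7 multipath/disk8 \
  mirror multipath/disk9 multipath/disk10 \
  mirror multipath/disk11 multipath/disk12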
 

StorageCurious

Explorer
Joined
Sep 28, 2022
Messages
60
Thanks HoneyBadger.

The general approach of using JBODs/external enclosures always brings with it some risk of interruption...
...Datasets or zvols that are set to use synchronous writes, on the other hand (which is where your SLOG devices come in) would have those writes safely stored
I was intending to go with sync writes for safety, so I'm glad my initial instincts (and understanding of ZFS) were right. My main worry was a SAS cable getting disconnected during "non-critical" maintenance. I would likely make a production system "redundant" in terms of SAS cabling. But this would probably be one of those "oops" moments where we're working in the server room anyway and can fix it quickly - as long as no pool/data is lost, I can live with this.

Thank you for that - looking forward to pulling a cable as a test on fake data!

About the rest of your suggestions:

- I don't have the U.2 NVMe parts yet, but the plan is to get the necessary items to get the 4 NVMe drives working in that server at the cost of only one PCIe slot. It also makes the NVMe drives hot-swappable, as opposed to having them as PCIe add-in cards, which is always a plus.

- I will probably end up mirroring a bunch of 2.5" drives (10k or 15k possibly) for the VMs, and use a fairly big SSD (3.84TB?) for L2ARC on the enclosure (unless my initial tests show 768GB of RAM to be plenty - I have a feeling my working set is very modest and would fit on this large SSD). This would be cheaper, since I only have to buy one single SSD (a mirror isn't needed for L2ARC) as opposed to filling the entire pool with them. SLOG on 2 NVMe U.2 drives, and possibly a mirrored SSD vdev for metadata. All things to be tested out in real-life scenarios. Who knows, on my first go at it I might get a 90% hit rate on ARC alone.

I do realize the writes won't be helped by the ARC/L2ARC, but it might be one of those compromises I need to live with.

- The file share would hopefully be served from TrueNAS directly instead of from a VM - if that happens, wouldn't RAIDZ2 be better than mirroring (cost-wise) with limited performance impact? My understanding is that all those SLOG/mirroring considerations apply mostly to block storage, not so much to SMB datasets hosted straight on TrueNAS.

My plan is to actually swing the VMs onto it, without L2ARC/metadata vdev/SLOG, check the ZFS stats to get a first idea, then add a big cheap SSD (and burn through it, I know) as L2ARC to get a second set of data points. All the while running a few real-life tests to compare.

Welcome. That's one heck of a "learner system" - it's got more heft behind it than several production builds I've seen. I'll answer your direct questions first and then dig into things a bit more.

Either it becomes a production system, or it replaces one of our weaker VMware hosts. Either way it's not wasted.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
I feel we are hijacking my own thread, which was meant to discuss the implications of a pool being split across enclosures via a SAS cable, to instead discuss performance optimization of my use-case. That is to my benefit and appreciated, but it wasn't my most pressing concern.
Well, in a former life I worked as a consultant, and for me one of the most important aspects of the job was to point out things that might be relevant in addition to what was explicitly asked for. Somehow I have preserved that attitude. I am well aware that this wasn't what you asked for. But some of your wording suggested to me that the performance requirements might not be pinned down as precisely as they could be. In hindsight, I may have been wrong here :wink:. Anyway, thanks for the positive comment, there have already been very different ones.
But if you're so inclined - in terms of performance requirements, I don't quite have numbers. I do know:

- A few of these virtual machines have a total of 60TB of files and are used as a Windows file server or Apple file repository; they may become a dataset shared over SMB via TrueNAS if I test the AFP/SMB aspects and like them - I'm just hoping AD permissions are managed somewhat similarly (security groups, etc.). At worst they'll stay as a big consolidated Windows file server.
Those are infrequently accessed, almost archive-level for the bulk of it.
That indeed sounds completely negligible, given the overall power of your system.
- Of the other 20TB, about 5TB are actively used (SQL, mail server, etc.), while the other 15TB are VMs that are either turned on and built for a particular purpose (and infrequently used), or turned off and only powered on when needed.
I guess that is the piece of information that had been missing for my understanding. In my world, 80 TB of VM storage typically means 500+ VMs, each with an I/O demand roughly comparable to an RDBMS server at maximum load. Of course, that is extreme, and I had not expected something in that order of magnitude. But even 10 VMs with high IOPS demand can be tricky.

Thanks for the interesting discussion and good luck!
Where did you get 5k IOPs?
That was a completely arbitrary number, simply to trigger a discussion :smile:
 

StorageCurious

Explorer
Joined
Sep 28, 2022
Messages
60
thanks for the positive comment, there have already been very different ones.

I don't doubt it.

Where did you get 5k IOPs?
That was a completely arbitrary number, simply to trigger a discussion :smile:
Well you had me googling "5k IOPs" for a bit, thinking it was a meaningful constant in the world of ZFS ;-)
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Thanks HoneyBadger.


I was intending to go with sync writes for safety, so I'm glad my initial instincts (and understanding of ZFS) were right. My main worry was a SAS cable getting disconnected during "non-critical" maintenance. I would likely make a production system "redundant" in terms of SAS cabling. But this would probably be one of those "oops" moments where we're working in the server room anyway and can fix it quickly - as long as no pool/data is lost, I can live with this.

Thank you for that - looking forward to pulling a cable as a test on fake data!

It looks like you've done your reading. Remember that enforcing sync writes will cause your SLOG device to become a potential bottleneck for write speeds, so only set it where it's truly necessary.

And definitely make sure it's fake/test data before pulling the plug. If your HBA(s) and backplane are set up to use either redundant paths or a wide-port approach for the SAS link, then you'll likely see a momentary hiccup as it switches to "use secondary SAS path" or "renegotiate from x8 to x4 lanes", but then it will continue on.
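One way to sanity-check the redundant cabling before and during that pull test - a sketch, assuming TrueNAS CORE (FreeBSD) with GEOM multipath in use; output will vary with your hardware:

# Show each multipath device and which path is currently active
gmultipath status

# Watch the system log while pulling a cable; you want to see a path
# failover message rather than the disks disappearing outright
tail -f /var/log/messages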

About the rest of your suggestions:

- I don't have the U.2 NVMe parts yet, but the plan is to get the necessary items to get the 4 NVMe drives working in that server at the cost of only one PCIe slot. It also makes the NVMe drives hot-swappable, as opposed to having them as PCIe add-in cards, which is always a plus.

It's worth the PCIe slot savings, but NVMe hotplug support is to my knowledge still a little touch-and-go on FreeBSD 13.x - you'll definitely want to dig a bit deeper into this as it might be a case where you need to manually send power commands to the device before unexpectedly removing it.

- I will probably end up mirroring a bunch of 2.5" drives (10k or 15k possibly) for the VMs, and use a fairly big SSD (3.84TB?) for L2ARC on the enclosure (unless my initial tests show 768GB of RAM to be plenty - I have a feeling my working set is very modest and would fit on this large SSD). This would be cheaper, since I only have to buy one single SSD (a mirror isn't needed for L2ARC) as opposed to filling the entire pool with them. SLOG on 2 NVMe U.2 drives, and possibly a mirrored SSD vdev for metadata. All things to be tested out in real-life scenarios. Who knows, on my first go at it I might get a 90% hit rate on ARC alone.

The ultimate question is "can your workload's needs be met by a pool of 10K/15K SAS drives?" The relatively massive ARC will certainly do a good job of helping your read performance. L2ARC is also useful but isn't quite as "intelligent" as ARC - it's a ring buffer that gets (slowly) filled from the ARC's eviction candidates, so it isn't as simple as hoping that the most recently used 3.84T will be served from there. The fact that you have a plan to test these scenarios bodes well, though.
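When you get to that testing, the cache behaviour is easy to watch with the standard OpenZFS tooling that ships with TrueNAS CORE (pool name below is hypothetical; on older versions the first tool may be named arc_summary.py):

# ARC and L2ARC sizes, hit ratios, and eviction statistics
arc_summary

# Per-vdev I/O every 5 seconds, including how much the cache device is serving
zpool iostat -v tank 5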

I do realize the writes won't be helped by the ARC/L2ARC, but it might be one of those compromises I need to live with.

Writes will be aggregated by ZFS - with the amount of RAM you have, you could potentially extend the amount of data allowed to be "pending write to pool" but ultimately your sustained write speed comes down to your pool devices. Keeping a large amount of free space to allow for contiguous writes will help.
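If you want to see how much dirty data OpenZFS will buffer before throttling writers, the relevant tunables are exposed via sysctl on FreeBSD-based TrueNAS CORE (read them first; treat any change as an experiment rather than a recommendation):

# Current cap on dirty (not-yet-written) data, in bytes
sysctl vfs.zfs.dirty_data_max

# Upper bound that the automatic sizing is allowed to reach
sysctl vfs.zfs.dirty_data_max_max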

- The file share would hopefully be served from TrueNAS directly instead of from a VM - if that happens, wouldn't RAIDZ2 be better than mirroring (cost-wise) with limited performance impact? My understanding is that all those SLOG/mirroring considerations apply mostly to block storage, not so much to SMB datasets hosted straight on TrueNAS.

Fileshares are better done from TrueNAS directly, as you can get better results from things like the prefetcher knowing the details of "this is a file being accessed" as opposed to "these random blocks in a ZVOL are being accessed."

If the files are large and accessed in a generally sequential manner by a small number of clients, then RAIDZ2 can theoretically provide better throughput. But the more the I/O pattern trends towards random - whether as a result of dozens of client machines, smaller files, or random access within large files - the better mirrors tend to do in comparison. If you plan to run VMs from it, those are also heavily biased towards small, random I/O.

My plan is to actually swing the VMs onto it, without L2ARC/metadata vdev/SLOG, check the ZFS stats to get a first idea, then add a big cheap SSD (and burn through it, I know) as L2ARC to get a second set of data points. All the while running a few real-life tests to compare.

You'll definitely want to use an SLOG device, and if using iSCSI, set your ZVOLs to sync=always for data safety. VMware does sync writes to NFS exports by default, but expects the storage array to have a non-volatile write cache when iSCSI is used.

Either it becomes a production system, or it replaces one of our weaker VMware hosts. Either way it's not wasted.

Are you planning to use iSCSI or NFS to connect from your VMware hosts? There's a little bit of tuning and some settings that are best for each of them.
 

StorageCurious

Explorer
Joined
Sep 28, 2022
Messages
60
It's worth the PCIe slot savings, but NVMe hotplug support is to my knowledge still a little touch-and-go on FreeBSD 13.x
I'm not the type to assume anything, but I did here, and I'm glad you mentioned this. It's probably better to use an M.2 adapter for SLOG instead, to avoid hot-swapping the drive out of sheer habit. Something like a quad M.2 adapter from a known brand. I don't really know much about NVMe in a business setting; I just have a pair in my gaming PC.

Are you planning to use iSCSI or NFS to connect from your VMware hosts? There's a little bit of tuning and some settings that are best for each of them.
iSCSI - I'm already using iSCSI on some older RAID hardware I inherited. My initial tests will be easy because I can just Storage vMotion my machines from my existing storage onto the TrueNAS test machine (once I am confident of the basic setup, so I won't lose data), run a few tests, switch them back, destroy/enhance/recreate the test zpool, rinse and repeat.

The more I think about this, the more it sounds like an SSD zpool is indeed the way to go, at least for the high-performance part. I presume when this is done you tell the L2ARC not to cache the SSD pool, so it focuses on the other, more needy pool(s). Or does the L2ARC help the SSD pool too, just by taking some load off it?*

*These feel like theoretical questions, I don't think my use-case necessitates that level of tweaking. But tweaking is so much fun!

It looks like you've done your reading

I have, which doesn't mean I understood everything. But I tried not to come at this completely clueless. And this is a rabbit hole, so there is much interesting reading available.

Thanks a lot for your answers, they certainly help me think through this, and the reassurance that I'm not completely off the rails does help.
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
L2ARC is not a system-wide device; it is a pool device. If a pool has an L2ARC device, it is used. If a pool has no cache device, it won't use the L2ARC of another pool.
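In CLI terms (pool and device names hypothetical), a cache device is added to, and only ever serves, a single pool:

# Attach an L2ARC device to the HDD pool only
zpool add lff cache ada5

# Cache devices can be removed again at any time without risk to pool data
zpool remove lff ada5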
 

StorageCurious

Explorer
Joined
Sep 28, 2022
Messages
60
L2ARC is not a system-wide device; it is a pool device. If a pool has an L2ARC device, it is used. If a pool has no cache device, it won't use the L2ARC of another pool.
:oops: I knew that! I must have been thinking about a post I found about disabling the ARC for a particular pool. Thanks for the reminder.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I'm not the type to assume anything, but I did here, and I'm glad you mentioned this. It's probably better to use an M.2 adapter for SLOG instead, to avoid hot-swapping the drive out of sheer habit. Something like a quad M.2 adapter from a known brand. I don't really know much about NVMe in a business setting; I just have a pair in my gaming PC.

The R730XD supports PCIe slot bifurcation, so you should be able to use a passive card like the ASUS Hyper M.2 - do note that high performance M.2 drives like the Optane P4801X will often run hot, and don't have the same thermal mass and airflow that U.2 form factor drives get.

NVMe hotplug support appears to have been improved in FreeBSD 13 - there is a note from a user who got manual hotplug working in FreeBSD 12.2, but it requires some foreknowledge about the PCI device IDs and how your system identifies the slot.
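If you ever do need to try the manual route, the general shape of it on FreeBSD looks something like the below - heavily hedged, the device names are placeholders, and you'd want to verify against current FreeBSD documentation before trusting it near a live pool:

# Identify the NVMe controller backing the drive you intend to pull
pciconf -lv | grep -B1 -A3 nvme

# Ask the kernel to detach that controller cleanly before physical removal
devctl detach nvme2

# Re-attach once the replacement drive is seated
devctl attach nvme2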


iSCSI - I'm already using iSCSI on some older RAID hardware I inherited. My initial tests will be easy because I can just Storage vMotion my machines from my existing storage onto the TrueNAS test machine (once I am confident of the basic setup, so I won't lose data), run a few tests, switch them back, destroy/enhance/recreate the test zpool, rinse and repeat.

As mentioned before, you'll want to zfs set sync=always on your iSCSI ZVOLs (or the entire pool as a default) to ensure that your VMware writes are being committed to non-volatile storage.
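For example (dataset paths hypothetical):

# Force sync semantics on a single iSCSI zvol...
zfs set sync=always tank/iscsi/vmware-lun0

# ...or set it on the parent pool/dataset so every zvol beneath it inherits it
zfs set sync=always tank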

Make sure that you have your network set up for MPIO if you plan to use multiple adapters - TrueNAS also doesn't allow multiple network adapters on the same subnet, which some other storage arrays do, so this might require some tweaks to your storage network. There's also a setting to optimize the path-selection policy in VMware so that it properly cycles between your storage paths in a round-robin fashion, rather than defaulting to "use a single path."
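On the VMware side, that path-selection change is made per device with esxcli - the naa identifier below is a placeholder for your actual LUN, so double-check the exact syntax against your ESXi version:

# Switch the LUN to round-robin across all active iSCSI paths
esxcli storage nmp device set --device naa.6589cfc000000xxxxxxxx --psp VMW_PSP_RR

# Optionally rotate paths every I/O instead of the default 1000 IOPS
esxcli storage nmp psp roundrobin deviceconfig set --device naa.6589cfc000000xxxxxxxx --type iops --iops 1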

The more I think about this, the more it sounds like an SSD zpool is indeed the way to go, at least for the high-performance part. I presume when this is done you tell the L2ARC not to cache the SSD pool, so it focuses on the other, more needy pool(s). Or does the L2ARC help the SSD pool too, just by taking some load off it?*

*These feel like theoretical questions, I don't think my use-case necessitates that level of tweaking. But tweaking is so much fun!

As mentioned by @Etorix, the L2ARC is attached at the pool level, so a device attached to the LFF HDD-based pool won't (and can't) be used by the SFF SSD-based pool. You can also adjust the "secondarycache" property of each dataset or zvol that you create in the pool, setting it to "all", "metadata", or "none" to determine what that particular dataset/zvol is allowed to store in L2ARC. So if you have a group of very large files (videos, ISOs, etc.) whose performance you aren't as concerned with, you can zfs set secondarycache=metadata on the dataset that contains them. This prevents ZFS from using the L2ARC to cache the data itself, leaving more space and program-erase cycles free for data that actually benefits from random access, while still allowing it to hold metadata for quick directory listings.
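For instance (dataset names hypothetical):

# Keep only metadata from this bulk dataset in L2ARC
zfs set secondarycache=metadata lff/media

# Confirm what each dataset is allowed to cache
zfs get -r secondarycache lff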

I have, which doesn't mean I understood everything. But I tried not to come at this completely clueless. And this is a rabbit hole, so there is much interesting reading available.

Thanks a lot for your answers, they certainly help me think through this, and the reassurance that I'm not completely off the rails does help.

Glad to lend a hand and answer questions!
 