Risks of running with a RAID controller

mauzilla · Jul 8, 2022

I have read numerous posts on this forum and fully understand the logic behind HBA over RC. I wish I knew this earlier though as I'm new to truenas and have not banked on this curve ball. Logistically it's impossible for me to flash the H710 Mini as it requires the removal of the RAID battery, and I am about 1000km away from the server. I could ask the DC team to perform this, but as they are supposed to look after the hardware after my non-vendor flash, I am left with no choice apart from returning to a Ubuntu OS and normal NFS share or taking the risk and running with a raid controller.

From my understanding the problems are:

1) TrueNas / ZFS cannot access the individual drives, making smart monitoring "impossible"
2) ZFS already performs most of the raid repairs / better raid features than what the card has so why stick with the outdated technology
3) If the controller dies, there will be data loss whereas ZFS is better equipt in handling this process.
4) Performance impact
5) ZFS may "overrun" the raid controller (something like caching)

For our use case, the 2 servers we want to setup with Truenas is 2 R720DX's which is repurposed now as backup storage, so we're not running critical / production data. I also want to test the replication task and run a handful of non-critical vm's on it.

I assume I am not the only person in this position so want to find out if there are others that have still opted running with the RAID controller knowing the drawbacks? There are some of the problems above that I feel can be resolved through IDRAC / custom smartmontools script to scan individual drives with megaraid,x (we already do this on our hypervisors), so is this considered a total "no go" or are there people doing this and what have been your experiences so far?

I am "in it to win it" at the moment so will test on my side as well. As this is our first round with Truenas, I will ensure our next NAS setup will be equip correctly :)

sretalla · Jul 8, 2022

I would start by directing you to the best available explanation already written on this:

What's all the noise about HBA's, and why can't I use a RAID controller?

This is relevant to FreeNAS and TrueNAS CORE. Some parts of it might also be relevant to Scale, but I don't really know how reliable the Linux drivers are. 1) An HBA is a Host Bus Adapter. This is a controller that allows SAS and SATA devices to be attached to, and communicate directly with...

www.truenas.com

Points 4, 5 and 6 are probably most of interest to you.

The only encounters I have had with folks running RAID cards are unfortunately the ones where it all went bad and they are here looking for help to recover (often the same folks in the group that have no backups).

ZFS has its own parity, caching and checksumming, so anything you're doing to mess with that is potentially bad news for your data somewhere down the road.

RAID hardware can do its job in the right context, but mixing it with ZFS is not good for 2 key reasons:

If you let the RAID hardware handle the drives and present a volume to ZFS, you will get checksums in ZFS which will tell you if data gets corrupted, but you won't have a good copy that ZFS can recover from.

If you use some kind of JBOD mode for each of the disks attached to the RAID card and give them to ZFS to manage, then those disks may land in the category of "can't be used in any other system", making recovery difficult... not to mention the potential for cache from the RAID card to tell lies to ZFS about what's really on the disk.

From what I can see, it's a risk that is worth spending a few hundred (on a well supported HBA) to avoid.

HoneyBadger · Jul 8, 2022

mauzilla said:
I could ask the DC team to perform this, but as they are supposed to look after the hardware after my non-vendor flash, I am left with no choice apart from returning to a Ubuntu OS and normal NFS share or taking the risk and running with a raid controller.

If possible, I would suggest you engage the DC team to act as your "remote hands" and replace the H710 Mini with the H310 Mini - which can then be crossflashed into a SAS2008 in IT mode, or purchased pre-flashed from a store such as Art of Server on eBay if you'd rather just have it be a drop-in replacement. The latter should be available for under USD$50, but since you're speaking in "km" I suspect you're outside the USA, so factor in shipping and exchange rates.

If you're just running 12 HDDs for a backup workload, you won't outstrip the performance available on the H310.

mauzilla · Jul 8, 2022

This is new ground for me, we've always kept to standard configs with vendors. Can you perhaps recommend a hba card that would not require any flashing that would work with dell? 12 drives. It seems lsi is promoted but I see different vendors like dell also having a lsi card so would appreciate experience before purchasing

HoneyBadger · Jul 8, 2022

The "Mini Monolithic" slot on Dell motherboards is a proprietary form factor, so you'll only find Dell branded cards there. The only Dell card that works without a flash is the HBA330, but that's intended for the 13G Dell servers (Rx30) and I'm not certain if it will fit in or behave in the 12G R720XD. The Dell OEM firmware for the H310 is severely limited by a queue depth of 25 across the entire card, which is why it needs to be flashed despite having a "factory IT/HBA mode"

The other option if you don't want to crossflash, is to use a genuine LSI card in one of the regular PCIe slots, but you'll have to purchase longer internal SAS cables to reach the slot location (I believe it's a 90-degree left angle at the backplane that you'll need - please verify first before purchase!) as well as having this cost you a PCIe slot.

mauzilla · Jul 19, 2022

HoneyBadger said:
The "Mini Monolithic" slot on Dell motherboards is a proprietary form factor, so you'll only find Dell branded cards there. The only Dell card that works without a flash is the HBA330, but that's intended for the 13G Dell servers (Rx30) and I'm not certain if it will fit in or behave in the 12G R720XD. The Dell OEM firmware for the H310 is severely limited by a queue depth of 25 across the entire card, which is why it needs to be flashed despite having a "factory IT/HBA mode"

The other option if you don't want to crossflash, is to use a genuine LSI card in one of the regular PCIe slots, but you'll have to purchase longer internal SAS cables to reach the slot location (I believe it's a 90-degree left angle at the backplane that you'll need - please verify first before purchase!) as well as having this cost you a PCIe slot.

After reading your comment I decided to take the chance with the IT mode flashing, worked perfectly first time around. I cannot believe that I have wasted so many years with simple NFS setups and local storage where a product like truenas was on the market all along. Thank you for the nudge, saved me a good couple of bucks

HoneyBadger · Jul 19, 2022

mauzilla said:
After reading your comment I decided to take the chance with the IT mode flashing, worked perfectly first time around. I cannot believe that I have wasted so many years with simple NFS setups and local storage where a product like truenas was on the market all along. Thank you for the nudge, saved me a good couple of bucks

Glad it worked out for you. :)

Note that while a backup workload will run quite well on almost any configuration, you mentioned you wanted to

mauzilla said:
run a handful of non-critical vm's on it

Depending on just how non-critical these VMs are, you may or may not need to take some precautions and/or add some additional hardware. See the resource below regarding the behavior of "synchronous writes" and how the storage appliance (TrueNAS in this case) doesn't have the ability to tell what is and isn't critical for a VM guest.

Resource - Sync writes, or: Why is my ESXi NFS so slow, and why is iSCSI faster?

This post is not specific to ESXi, however, ESXi users typically experience more trouble with this topic due to the way VMware works. When an application on a client machine wishes to write something to storage, it issues a write request. On a...

www.truenas.com

If they're truly disposable machines and you're willing to take "roll back to the last backup" as a recovery option, you can disable sync writes on the "disposable VM" dataset (but only that one!) and be fairly well off from a performance standpoint. However, I'd make it abundantly clear in your datastore/export naming convention that the location is NSFP (Not Safe For Production)

mauzilla · Jul 20, 2022

HoneyBadger said:
Glad it worked out for you. :)

Note that while a backup workload will run quite well on almost any configuration, you mentioned you wanted to

Depending on just how non-critical these VMs are, you may or may not need to take some precautions and/or add some additional hardware. See the resource below regarding the behavior of "synchronous writes" and how the storage appliance (TrueNAS in this case) doesn't have the ability to tell what is and isn't critical for a VM guest.

Resource - Sync writes, or: Why is my ESXi NFS so slow, and why is iSCSI faster?

This post is not specific to ESXi, however, ESXi users typically experience more trouble with this topic due to the way VMware works. When an application on a client machine wishes to write something to storage, it issues a write request. On a...

www.truenas.com

If they're truly disposable machines and you're willing to take "roll back to the last backup" as a recovery option, you can disable sync writes on the "disposable VM" dataset (but only that one!) and be fairly well off from a performance standpoint. However, I'd make it abundantly clear in your datastore/export naming convention that the location is NSFP (Not Safe For Production)

Thank you so much for thus insight, something I would have never considered. We obviously want to "dry run" some non-critical VM's but the intention is to move towards NAS storage for hypervisors and move away from local storage due to redundancy. For us we would definitely not want to carry any risks for data (would rather sacrifice performance over stability)

It appears that the SLOG devices is the best direction to go. I would assume then that there is no way to to use the existing "truenas application disks" as SLOG devices and that these need to be completely independant? I ask this because we have truenas installed on 2x 250GB and 2x 500GB SSD's which is a complete waste of space. Could one create partitions on these disks post install or is the only way to get this working the addition of 2 additional SSD's per truenas install?

HoneyBadger · Jul 20, 2022

mauzilla said:
Thank you so much for thus insight, something I would have never considered. We obviously want to "dry run" some non-critical VM's but the intention is to move towards NAS storage for hypervisors and move away from local storage due to redundancy. For us we would definitely not want to carry any risks for data (would rather sacrifice performance over stability)

It appears that the SLOG devices is the best direction to go. I would assume then that there is no way to to use the existing "truenas application disks" as SLOG devices and that these need to be completely independant? I ask this because we have truenas installed on 2x 250GB and 2x 500GB SSD's which is a complete waste of space. Could one create partitions on these disks post install or is the only way to get this working the addition of 2 additional SSD's per truenas install?

NAS storage for live VMs is an almost "polar opposite" workload from a backup target. The design and components that go into the two machines are significantly different, so it may not be as simple as adding a couple components to your backup unit.

The basic version is that while a backup job thrives mostly on high bandwidth and an ability to ingest data from a single writer at a high rate of speed, live VMs thrive on low latency and high IOPS (Input/Output Operations Per Second) while handling small requests from multiple clients as quickly as possible.

Backup jobs work well with RAIDZ2 and asynchronous writing. Live VMs want mirrors and sync writes for safety.

Have a look at "The Path to Success for Block Storage" - while NFS/NAS isn't technically block storage, it behaves identically when being used for storing VM disks, and the same rules apply:

The path to success for block storage

It seems like I haven't written a sticky for awhile, but just in the last week I've had to cover this topic several times. ZFS does two different things very well. One is storage of large sequentially-written files, such as archives, logs, or data files, where the file does not have the middle...

www.truenas.com

With regards to the SLOG disks specifically, the necessary characteristics of a good SLOG device are far different from a boot device or even a regular capacity SSD. It needs to have the ability to accept and confirm writes as quickly as possible, which implies power-loss-protection for in-flight data. There's quite a list being gathered in this thread (which I need to re-write at some point) but the short version is that an Intel DC-series Optane device is your best commonly-available option. More exotic choices like NVRAM and NVDIMM are available as well, but may be difficult to source (NVRAM cards) or require special motherboard considerations (NVDIMM).

In your case, with the TrueNAS device being located remotely in a datacenter and requiring "remote hands" operations though, I might even consider a modern fast SAS device, as this would be hot-swappable in a failure/replacement scenario (NVMe hotplug is improving but nowhere near as mature) - I'm going to plug the Western Digital Ultrastar DC SS530, and you must specifically buy the "Write Intensive" model (model number starts with WUSTM)

SLOG benchmarking and finding the best SLOG

I'd like to take a few minutes to talk about SLOG devices and what makes good ones versus bad ones. I have no doubt that this will be a controversial topic since this is not well understood by many people. In short, there's 3 things that you need for a "great" SLOG: 1. Fast throughput 2...

www.truenas.com

mauzilla · Jul 20, 2022

HoneyBadger said:
NAS storage for live VMs is an almost "polar opposite" workload from a backup target. The design and components that go into the two machines are significantly different, so it may not be as simple as adding a couple components to your backup unit.

The basic version is that while a backup job thrives mostly on high bandwidth and an ability to ingest data from a single writer at a high rate of speed, live VMs thrive on low latency and high IOPS (Input/Output Operations Per Second) while handling small requests from multiple clients as quickly as possible.

Backup jobs work well with RAIDZ2 and asynchronous writing. Live VMs want mirrors and sync writes for safety.

Have a look at "The Path to Success for Block Storage" - while NFS/NAS isn't technically block storage, it behaves identically when being used for storing VM disks, and the same rules apply:

The path to success for block storage

It seems like I haven't written a sticky for awhile, but just in the last week I've had to cover this topic several times. ZFS does two different things very well. One is storage of large sequentially-written files, such as archives, logs, or data files, where the file does not have the middle...

www.truenas.com

With regards to the SLOG disks specifically, the necessary characteristics of a good SLOG device are far different from a boot device or even a regular capacity SSD. It needs to have the ability to accept and confirm writes as quickly as possible, which implies power-loss-protection for in-flight data. There's quite a list being gathered in this thread (which I need to re-write at some point) but the short version is that an Intel DC-series Optane device is your best commonly-available option. More exotic choices like NVRAM and NVDIMM are available as well, but may be difficult to source (NVRAM cards) or require special motherboard considerations (NVDIMM).

In your case, with the TrueNAS device being located remotely in a datacenter and requiring "remote hands" operations though, I might even consider a modern fast SAS device, as this would be hot-swappable in a failure/replacement scenario (NVMe hotplug is improving but nowhere near as mature) - I'm going to plug the Western Digital Ultrastar DC SS530, and you must specifically buy the "Write Intensive" model (model number starts with WUSTM)

SLOG benchmarking and finding the best SLOG

I'd like to take a few minutes to talk about SLOG devices and what makes good ones versus bad ones. I have no doubt that this will be a controversial topic since this is not well understood by many people. In short, there's 3 things that you need for a "great" SLOG: 1. Fast throughput 2...

www.truenas.com

For something that sounded pretty simple, the more I do the more I realize I need to understand more. I was hoping it would have been as simple as just creating a NFS repository but I assume not. I have 10x 6TB drives all added to a single RAIDZ3, which if I understand now is not the most ideal for my use case? Should I considering redoing the truenas install and create different vdevs (maybe a mirror vdev for VM's and then allocate the rest to another vdev for backup purposes)?

ChrisRJ · Jul 20, 2022

mauzilla said:
[..] repurposed now as backup storage, so we're not running critical / production data.

I would argue that a backup is one of the most critical types of data you can have.

Heracles · Jul 20, 2022

I fully agree with @ChrisRJ about how critical backups are.

By definition, when you need your backup, it is because the main instance failed for any reason : ransomware, corruption, error, physical damage, ... Also, it is for you to re-gain access to something you need at the moment, being the evidence that this data is of value. So valuable data needed after the main instance failed, it is of critical importance. So much that you should have at least 2 backups, one Online and one Offline. See my signature about the details and why of that...

HoneyBadger · Jul 20, 2022

mauzilla said:
the more I do the more I realize I need to understand more.

That's a pretty universal constant with technology as a whole, in my experience. "The more I know, the more I need to know more."

mauzilla said:
I was hoping it would have been as simple as just creating a NFS repository but I assume not. I have 10x 6TB drives all added to a single RAIDZ3, which if I understand now is not the most ideal for my use case? Should I considering redoing the truenas install and create different vdevs (maybe a mirror vdev for VM's and then allocate the rest to another vdev for backup purposes)?

What's the full system configuration of your R720XD (aside from using a flashed H710 and 10x6TB drives)? RAM, network connectivity, CPUs?

For the configuration, you'll definitely want to use mirror vdevs, but pool performance is dependant on both the speed of each vdev and the total number of vdevs - so using 5 mirror vdevs (all 10 drives) will be faster than 2 or 3 (4 drives/6 drives) for your VM workloads. How many is necessary depends on the number of VMs you plan to put on the system and their performance profiles - once you reach a given point, building an all-flash system starts to make performance sense instead of throwing more spinning disks at it.

You could certainly start with a "split" configuration (eg: 4 drives in a 2x2-way mirror for VMs, 6 drives in a RAIDZ2 for backup) as a proof-of-concept, but you'll find that you can quickly outstrip the capabilities of spinning disk with even a small number of virtual machines.

mauzilla · Jul 20, 2022

HoneyBadger said:
That's a pretty universal constant with technology as a whole, in my experience. "The more I know, the more I need to know more."

What's the full system configuration of your R720XD (aside from using a flashed H710 and 10x6TB drives)? RAM, network connectivity, CPUs?

For the configuration, you'll definitely want to use mirror vdevs, but pool performance is dependant on both the speed of each vdev and the total number of vdevs - so using 5 mirror vdevs (all 10 drives) will be faster than 2 or 3 (4 drives/6 drives) for your VM workloads. How many is necessary depends on the number of VMs you plan to put on the system and their performance profiles - once you reach a given point, building an all-flash system starts to make performance sense instead of throwing more spinning disks at it.

You could certainly start with a "split" configuration (eg: 4 drives in a 2x2-way mirror for VMs, 6 drives in a RAIDZ2 for backup) as a proof-of-concept, but you'll find that you can quickly outstrip the capabilities of spinning disk with even a small number of virtual machines.

Haha I'm so glad you asked because I would have bombarded you with this info anyways :) I barely slept last night as my head was spinning trying to get the best out of this. I have a bit of a timeline (next Saturday) as the DC team will be installing a new server and firewall, and we intend on decommissioning 1 server which is coming our our offsite office. I would have liked to set that up as a truenas as well so that I can do offsite backups to that one where we can have the 1st replication at the DC and then the rest can be part of of normal weekly snapshots.

Anyways, we have the following 2 servers (identical in specs):

* Dell R720DX (3.5 12 bay)
* 2x SSD's (application SSD's but they are currently installed in the front, only found out yesterday there are 2 slots at the back, if they work we will move the SSD's to the back
* 10x 6TB HGST Enterprise Drives (7200RPM DC Drives) - This may however become 12 drives if I am successul in moving the SSD's to the back enclosure
* H710P Mini flashed to IT mode
* 10GB Fibre network with a secondary CAT6 LAGG consisting of 2x 1GB connections (latter not in use but available in the event of failure)
* Single XEON CPU
* 128GB RAM
* Dual PSU and at the DC we have both on separate breakers so no real loss of sudden power (but never say never)
* Current Truenas configuration is that I have setup a single datapool with all available drives (excluding the app drives), into a RaidZ3 partition giving me effectively 38tb of storage per server

Our purposes are mostly backups, but these are seperated as well:

Server A:
1- File level backups of shared hosting servers. We're talking millions of small files. The backup appliance handles this as incremental backups, and believe they use hard links instead of individual copies of files. Backups run at certain periods of time (6pm - 6am) on various shared hosting servers.
2 - Host a handful of VM's as primary NAS storage. All of the VM's at the DC currently run on SSD with a handful of (mostly nextcloud servers) running on DC 7200 drives. My plan was to move these to the NAS so that they can become shared storage across the pool. Although these Nextcloud servers have a database, we're not talking millions of users so at the end of the day these servers are also mostly used for file sharing. The plan is to then replicate this data pool to Server B for "redundancy" hourly in the event that server A goes offline that we have a way to at least recover to the last hour.

Server B:
1 - VM backups handled by XO (xen orchestra) using incremental backups. These are obviously large files (full export VHD's with incremental snapshots sent either daily or weekly depending on the VM)
2) - A replicated copy of the VMs storage on Server A

My takeaway so far has been that I have likely setup my storage incorrectly. It would appear that my better option would have been to create a pool of 4x 6TB in mirror (giving me a 12TB vdev) and then setting up the remainder of the drives (either 6 or 8 depending on setup) as a RAIDZ2 - I also now understand that I need to consider available free space on the RAIDZ2 / RAIDZ3 as the fuller it gets, the slower the vdev will perform (I stand corrected as it's not experience talking)

Regarding the SLOG, I will be honest in saying that it's a cost factor that we have not considered at all, and the combined cost / income we generate from the handful of nextcloud instances does not validate the pricepoint in increasing performance (it is afterall not a mission critical for performance as these are seldomly used, mostly when new files are stored with the occassional OS update etc). I however don't want to dismiss. There is obviously a limitation regarding additional storage devices, so our only realistic option is to purchase a couple of PCIe nvme controllers and installing 2 NVME's / M2 SSD's to mirror the SLOG. The problem is however the cost penalty, so if we do consider going this route I would like to find a middleground in which we can likely look at consumer grade slog devices. It's worth remembering that we do backup the VM's (both as part of replication and as a export using XO to server B) - Less than ideal

You seem to be a sensai on this topic so thank you so much for the time you have given me so far, it's a brand new world for me having solely relied on local RAID1 / RAID10 storage in the past

ChrisRJ · Jul 20, 2022

mauzilla said:
I have a bit of a timeline (next Saturday) as the DC team will be installing a new server and firewall, and we intend on decommissioning 1 server which is coming our our offsite office.

If I may be so blunt: My recommendation would be to not rush things here. Storage is a non-trivial topic and has a corresponding learning curve. You can absolutely learn what is needed from this forum. But that approach takes time and very likely you are not done when you first think that you know enough. At least that has been my experience with e.g. Linux, Novell NetWare, and TrueNAS.

Of course you can always speed things up by hiring an experienced consultant.

My $0.02 and all the best for your journey with TrueNAS

mauzilla · Jul 21, 2022

ChrisRJ said:
If I may be so blunt: My recommendation would be to not rush things here. Storage is a non-trivial topic and has a corresponding learning curve. You can absolutely learn what is needed from this forum. But that approach takes time and very likely you are not done when you first think that you know enough. At least that has been my experience with e.g. Linux, Novell NetWare, and TrueNAS. Of course you can always speed things up by hiring an experienced consultant. My $0.02 and all the best for your journey with TrueNAS

mauzilla · Jul 21, 2022

ChrisRJ said:
If I may be so blunt: My recommendation would be to not rush things here. Storage is a non-trivial topic and has a corresponding learning curve. You can absolutely learn what is needed from this forum. But that approach takes time and very likely you are not done when you first think that you know enough. At least that has been my experience with e.g. Linux, Novell NetWare, and TrueNAS. Of course you can always speed things up by hiring an experienced consultant. My $0.02 and all the best for your journey with TrueNAS

HoneyBadger · Jul 21, 2022

In general, I'm inclined to give some weight to @ChrisRJ 's comment of "take it slow" - there are a number of concepts in shared/networked storage that have to be treated significantly differently from local, and I would rather see you make a single excursion with the correctly designed and understood configuration rather than several back-and-forths. Iterative improvement is fine for a homelab where the server is 10 feet away, less good when the datacenter is 1000km away.

mauzilla said:
Haha I'm so glad you asked because I would have bombarded you with this info anyways :) I barely slept last night as my head was spinning trying to get the best out of this. I have a bit of a timeline (next Saturday) as the DC team will be installing a new server and firewall, and we intend on decommissioning 1 server which is coming our our offsite office. I would have liked to set that up as a truenas as well so that I can do offsite backups to that one where we can have the 1st replication at the DC and then the rest can be part of of normal weekly snapshots.

Is the DC team removing one of the two R720XD servers below, or are they working on what I assume are separate servers or hypervisor hosts that aren't described below?

mauzilla said:
Anyways, we have the following 2 servers (identical in specs):

* Dell R720DX (3.5 12 bay)
* 2x SSD's (application SSD's but they are currently installed in the front, only found out yesterday there are 2 slots at the back, if they work we will move the SSD's to the back
* 10x 6TB HGST Enterprise Drives (7200RPM DC Drives) - This may however become 12 drives if I am successul in moving the SSD's to the back enclosure
* H710P Mini flashed to IT mode
* 10GB Fibre network with a secondary CAT6 LAGG consisting of 2x 1GB connections (latter not in use but available in the event of failure)
* Single XEON CPU
* 128GB RAM
* Dual PSU and at the DC we have both on separate breakers so no real loss of sudden power (but never say never)
* Current Truenas configuration is that I have setup a single datapool with all available drives (excluding the app drives), into a RaidZ3 partition giving me effectively 38tb of storage per server

For my own understanding - are these the "backup/storage" servers, with there being other hardware being used for hypervisors (guessing XCP-NG?) The description of "application SSDs" has me a bit confused. Odds are good that those SSDs will not be capable of acting as SLOG devices though. Have you got a make and model or part number?

mauzilla said:
Our purposes are mostly backups, but these are seperated as well:

Server A:
1- File level backups of shared hosting servers. We're talking millions of small files. The backup appliance handles this as incremental backups, and believe they use hard links instead of individual copies of files. Backups run at certain periods of time (6pm - 6am) on various shared hosting servers.
2 - Host a handful of VM's as primary NAS storage. All of the VM's at the DC currently run on SSD with a handful of (mostly nextcloud servers) running on DC 7200 drives. My plan was to move these to the NAS so that they can become shared storage across the pool. Although these Nextcloud servers have a database, we're not talking millions of users so at the end of the day these servers are also mostly used for file sharing. The plan is to then replicate this data pool to Server B for "redundancy" hourly in the event that server A goes offline that we have a way to at least recover to the last hour.

For workload #1 this should be okay, although small files don't get perfect space-efficiency on RAIDZ if they end up compressed small enough to need padding data.

Workload #2 though, if your current performance level is "independent local SSD" then you will likely have a rough time switching to "network-attached HDD" - ZFS is definitely capable of some magic but it can't overcome the laws of physics or increase the speed of light.

mauzilla said:
Server B:
1 - VM backups handled by XO (xen orchestra) using incremental backups. These are obviously large files (full export VHD's with incremental snapshots sent either daily or weekly depending on the VM)
2) - A replicated copy of the VMs storage on Server A

This server on the other hand is just receiving backup streams, whether from XO or ZFS replication. But if the plan is to be able to run the replicated copy of these VMs from Server B in an "emergency situation" then you'll need to have the same performance considerations regarding vdev design ("use mirrors") and SLOG presence.

mauzilla said:
My takeaway so far has been that I have likely setup my storage incorrectly. It would appear that my better option would have been to create a pool of 4x 6TB in mirror (giving me a 12TB vdev) and then setting up the remainder of the drives (either 6 or 8 depending on setup) as a RAIDZ2 - I also now understand that I need to consider available free space on the RAIDZ2 / RAIDZ3 as the fuller it gets, the slower the vdev will perform (I stand corrected as it's not experience talking)

As mentioned previously, four spinning disks in mirrors is most likely not going to cut it from a performance perspective, if your VM workloads are being compared against the current state of "running from local SSD or local HDD" - the contention between different guest OSes will cause your I/O patterns to trend towards random access. HDDs are very, very bad at delivering random access I/O. If it's a very small VM workload being put on it, you may be able to get away with this by means of ZFS's ARC and SLOG. You mentioned that they are NextCloud instances (you're hosting NextCloud in the Cloud, so you're NextNextCloud?) - a strategy here might involve putting the VM OS disk, application, and database onto the "faster" VM pool, and creating individual per-tenant/per-VM NFS exports out of the RAIDZ2 pool where the "bulk data" is stored.

Vdevs do slow as they fill - however, this affects random I/O and overwrite-in-place (eg: VM workloads) much more than it does for large sequential files. It might be a consideration for the rsync/hardlink backup of the small files as well. If the workflow is "fill 10% of the disk until 90% full, then delete the oldest 10% and fill again" you're likely to keep large contiguous slabs of space open for writing and avoid free space fragmentation, which is the more accurate cause of "pool slows down as the drive fills."

mauzilla said:
Regarding the SLOG, I will be honest in saying that it's a cost factor that we have not considered at all, and the combined cost / income we generate from the handful of nextcloud instances does not validate the pricepoint in increasing performance (it is afterall not a mission critical for performance as these are seldomly used, mostly when new files are stored with the occassional OS update etc). I however don't want to dismiss. There is obviously a limitation regarding additional storage devices, so our only realistic option is to purchase a couple of PCIe nvme controllers and installing 2 NVME's / M2 SSD's to mirror the SLOG. The problem is however the cost penalty, so if we do consider going this route I would like to find a middleground in which we can likely look at consumer grade slog devices. It's worth remembering that we do backup the VM's (both as part of replication and as a export using XO to server B) - Less than ideal

SLOG presence is usually tied to your willingness to lose data. In this case, you're taking backups every hour and replicating them, so you're at worst likely to lose up to that hour of data, but you also need to take your recovery time into consideration here especially with this being a hosted/customer/income-generating instance. It sounds like you might have SLAs and contractual language to look at here - do you have RPO/RTO agreements with your customers? It's fine to have a one-hour RPO (recovery point objective) but if you have to restore the entire NFS export to "set the clock back an hour" your RTO (recovery time objective) is going to be very high.

I'm not sure where the "Application SSDs" fall into the mix here. There are some less expensive SAS SLOG options but they won't be comparable to an NVMe Optane. But where you lose out with the PCIe NVMe adapter is the ability to easily hotswap these devices (either you personally, or by lighting up the drive bay LED and asking the datacenter remote hands to "please go get the drive marked 'spare SLOG' and put it in here")

mauzilla said:
You seem to be a sensai on this topic so thank you so much for the time you have given me so far, it's a brand new world for me having solely relied on local RAID1 / RAID10 storage in the past

Glad to help out, although I feel as if I'm just assigning you mountains of homework on this one. As you're dealing with "customer data" and a revenue-generating system my advice remains to not rush this. It's better to take the time to prove out the concept locally, ensure it will deliver the performance you want, and then ship a completed (and well-understood) system for installation at your DC.

Important Announcement for the TrueNAS Community.

Risks of running with a RAID controller

Dabbler

Powered by Neutrality

actually does care

Dabbler

actually does care

Dabbler

actually does care

Dabbler

actually does care

Dabbler

Wizard

Wizard

actually does care

Dabbler

Wizard

Dabbler

Dabbler

actually does care

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Risks of running with a RAID controller"

Similar threads