TrueNAS Build for VMs - Looking for input

AshwinRS · Jan 20, 2023

Hey guys, looking to get some input on a new configuration I'm looking to test out for housing block storage to be used for VMs. The TrueNAS won't host the VMs, just the data. The VMs will be mixed usage for VPS hosting.

Hardware
Chassis
SSG-6028R-E1CR24L https://www.supermicro.com/en/products/system/2U/6028/SSG-6028R-E1CR24L.cfm

24x SAS3/SATA3 12GB/s backplane
Upgraded to add 2 more hotswap SSDs that we’re going to use for ZLOG
Upgraded to include 4 NVMe slots (uses 4 of the 24 hotswap slots)
Purchased add-on PCIe card that holds 2 NVMe M.2 drives (possibly use for metadata)
Purchased Connect-3x dual port 40GBE QSFP+ network card
2 x Intel Xeon E5-2697V3
256GB RAM but I’m thinking 512GB RAM (I’ve purchased enough 32GB DDR4 ECC RAM to max it out)

Hard Drives

OS SSDs: 2 x WD Blue 500GB 3D NAND
Spinning Disks: HGST Ultrastar 8TB 12Gb/s SAS HUH728080AL5200 - https://documents.westerndigital.co...astar-sas-series/data-sheet-ultrastar-he8.pdf
ZLOG SSDs: 2 x Intel DC S3610 800 GB 2.5” (5.3 PB of endurance) - https://www.intel.com/content/dam/w.../product-specifications/ssd-dc-s3610-spec.pdf
L2ARC NVMe: 4 x Intel P4610 NVMe

Configuration
Things we need to factor and must verify for optimal performance:

Sector size – apparently firmware on drives (SSDs and Spinning) can sometimes show 4K or 512b sectors but we need to verify the true physical sector size to determine the block size. This is needed to determine the ashift for TrueNAS
- From what I’m reading in the PDFs – 800GB SSDs are 512b and the HGST spinning disks can be 4K/512e
- Smartctl output is needed
- Diskinfo output is needed
- There’s a TrueNAS command that shows what TrueNAS sees but apparently it doesn’t always give the accurate information, I’m still looking for the command. Can somebody direct me to the right command(s) to run? I understand that writing 512 sectors to a 4K physical sector can lead to really bad write performance (would require ZFS to rewrite the record 8 times)
I’m thinking:
- Make sure the onboard raid controller cache is turned off so we don’t risk anything in case of a power failure
- RAID1 for OS
- RAID1 for ZLOG
- 2 x RAID1 for L2ARC on the NVMe in case a drive fails.
  - I know people wouldn’t suggest that because the data still sits on the pool but we want to ensure we don’t have deprecated performance in case of a drive failure.
  - This will stripe across both. Each drive is 660K IOPS and with RAID1 we would double that for reads (can read from 2 drives per each RAID1 volume) so that would put us at 2640K IOPS for read, but even if we only maintain the IOPS of one drive per RAID1 volume, that still puts us at 1320K IOPS which is amazing!
- 3 vdevs of 6 drives with 2 spare .
  - This will stripe across the three VDEVs for better performance
  - 8TB x 3 vdev x 4 drives of data (2 parity per vdev) = 96TB usable
- Metadata is something we need to investigate
  - I’m reading that people can set up a special vdev to host this to improve the performance. Would this require a lot of write endurance? I imagine that it would only change when data changes which I don't anticipate it to change that much as most of the data hosted remains as is, with the exception of DB updates and whatnot
  - I understand this is critical because if we lose the metadata then we lose the pool so we can’t do that
  - Others have noted that you can run a command to store this in RAM. Some people don’t do this is a good idea because it can take up a lot of RAM but we can add more RAM. I haven't tried this before, would we need to manually run the command line on each boot to pull the data from the vdevs into RAM? If so, then there is a downfall here because I imagine this may need to rebuild it in RAM which means degraded performance for a long time until it does.
  - Alternatively, instead of storing it in RAM, we can store it in L2ARC. I haven’t investigated but in theory this would mean it’s at least persistent and wouldn’t degrade the performance in case we rebooted as it would sit in L2ARC. Is this correct?

Any input here would be greatly appreciated!

Thank you!

Arwen · Jan 21, 2023

It is helpful to use the proper ZFS terminology, ZFS does not support RAID-1. (But does support Mirroring, which works similar but is not the same thing.) Next, there is no such thing as a ZLOG. We can guess you meant SLOG.

Terminology and Abbreviations Primer

We realize that new users have a lot to learn when they come to FreeNAS. There's a certain amount of confusion added to discussions when users pick random/approximate terms to describe things. I've spent a lot of time quietly trying to...

www.truenas.com

There is a general ratio between amount of RAM and L2ARC size, (which I don't remember off hand). Using very large L2ARC devices, (or more smaller ones), would not be recommended as the ideal path is to max out RAM first.

There is also a Resource, (see top of any forum page for the Resources link), for block storage:

Resource - The path to success for block storage

ZFS does two different things very well. One is storage of large sequentially-written files, such as archives, logs, or data files, where the file does not have the middle bits modified after creation. This is optimal for RAIDZ. It's what most...

www.truenas.com

As for the rest, I don't know, thus have not commented.

kspare · Jan 21, 2023

why mirror your l2arc? you said you don't want to reduce performance in the event of a failure....but if you ran striped. you'd double your performance. if you lost a drive striped, you would the same or faster than mirror.....

For your data, stick to mirrored vdevs.

I run 4 servers for terminal server storage. housing about 109 servers with 1100..ish users. 1tb ram each, 12 800gb sas ssd for l2arc, I dont run larger than 4tb sata drives. i'm using Intel P3700 nvme for my slog, mirrored, i've been experimenting with meta drives and struggling to find value in them to be honest. ram just works better.

Stick to chelsios 40gb nics.

kspare · Jan 21, 2023

its also worth nothing, that with the large l2arc, once the cache has warmed up, majority of the disk io comes from the l2arc or arc. takes a few days, but eventually the drives are almost idle, it's pretty cool to watch.

Arwen · Jan 21, 2023

I am not sure you can actually mirror Cache / L2ARC devices. In the past, this was NOT possible as Sun decided that this type of vDev was not critical. (The real data would always be available in the data pool.)

After playing with a VM of TrueNAS SCALE, I was not able to find any place in the GUI that would allow creating a Cache / L2ARC as a Mirror vDev. Nor, from the command line would it allow me to attach a Mirror afterwards;

Code:

root@truenas[~]# zpool attach tank 942f0b1b-ec8e-4cdb-9bb6-8c7b26fa6a97 /dev/disk/by-partuuid/743780da-8b35-433b-af71-34e43a93e132
cannot attach /dev/disk/by-partuuid/743780da-8b35-433b-af71-34e43a93e132 to 942f0b1b-ec8e-4cdb-9bb6-8c7b26fa6a97: device is in use as a cache

If someone can prove me wrong, I will not be offended. Better correct information, (which I think I have supplied), than wrong information.

kspare · Jan 21, 2023

Arwen said:
I am not sure you can actually mirror Cache / L2ARC devices. In the past, this was NOT possible as Sun decided that this type of vDev was not critical. (The real data would always be available in the data pool.)

After playing with a VM of TrueNAS SCALE, I was not able to find any place in the GUI that would allow creating a Cache / L2ARC as a Mirror vDev. Nor, from the command line would it allow me to attach a Mirror afterwards;

Code:
root@truenas[~]# zpool attach tank 942f0b1b-ec8e-4cdb-9bb6-8c7b26fa6a97 /dev/disk/by-partuuid/743780da-8b35-433b-af71-34e43a93e132 cannot attach /dev/disk/by-partuuid/743780da-8b35-433b-af71-34e43a93e132 to 942f0b1b-ec8e-4cdb-9bb6-8c7b26fa6a97: device is in use as a cache

If someone can prove me wrong, I will not be offended. Better correct information, (which I think I have supplied), than wrong information.

theres just no need to do this…it’s a read only cache.

yazman · Jan 24, 2023

kspare said:
why mirror your l2arc? you said you don't want to reduce performance in the event of a failure....but if you ran striped. you'd double your performance. if you lost a drive striped, you would the same or faster than mirror.....

For your data, stick to mirrored vdevs.

I run 4 servers for terminal server storage. housing about 109 servers with 1100..ish users. 1tb ram each, 12 800gb sas ssd for l2arc, I dont run larger than 4tb sata drives. i'm using Intel P3700 nvme for my slog, mirrored, i've been experimenting with meta drives and struggling to find value in them to be honest. ram just works better.

Stick to chelsios 40gb nics.

In his case, if an NVMe drive dies then he would be degraded to the speed of the pool, no? The L2ARC stripes across all the drives, so wouldn’t it be reduced to the slowest drive (the pool in this case)? He would still benefit from the other NVMe drives that are operation, for sure, but I believe this performance hit is what the OP is trying to avoid.

When all NVMe drives are operational then you’re right, it’s a waste of NVMe cache. He’s not mirroring the drives because of the risk of data loss, but they don’t want to risk losing the performance. Something I’ve considered myself

yazman · Jan 24, 2023

kspare said:
theres just no need to do this…it’s a read only cache.

Arwen said:
I am not sure you can actually mirror Cache / L2ARC devices. In the past, this was NOT possible as Sun decided that this type of vDev was not critical. (The real data would always be available in the data pool.)

After playing with a VM of TrueNAS SCALE, I was not able to find any place in the GUI that would allow creating a Cache / L2ARC as a Mirror vDev. Nor, from the command line would it allow me to attach a Mirror afterwards;

Code:
root@truenas[~]# zpool attach tank 942f0b1b-ec8e-4cdb-9bb6-8c7b26fa6a97 /dev/disk/by-partuuid/743780da-8b35-433b-af71-34e43a93e132 cannot attach /dev/disk/by-partuuid/743780da-8b35-433b-af71-34e43a93e132 to 942f0b1b-ec8e-4cdb-9bb6-8c7b26fa6a97: device is in use as a cache

If someone can prove me wrong, I will not be offended. Better correct information, (which I think I have supplied), than wrong information.

Raid1 using a raid controller for the NVMe and JBOD / IT Mode for the spinning drives. I don’t believe this is possible in TrueNAS alone.

yazman · Jan 24, 2023

kspare said:
why mirror your l2arc? you said you don't want to reduce performance in the event of a failure....but if you ran striped. you'd double your performance. if you lost a drive striped, you would the same or faster than mirror.....

For your data, stick to mirrored vdevs.

I run 4 servers for terminal server storage. housing about 109 servers with 1100..ish users. 1tb ram each, 12 800gb sas ssd for l2arc, I dont run larger than 4tb sata drives. i'm using Intel P3700 nvme for my slog, mirrored, i've been experimenting with meta drives and struggling to find value in them to be honest. ram just works better.

Stick to chelsios 40gb nics.

Any issues with Mellanox Connectx-3? Just bought those and about to test them.

kspare · Jan 28, 2023

yazman said:
In his case, if an NVMe drive dies then he would be degraded to the speed of the pool, no? The L2ARC stripes across all the drives, so wouldn’t it be reduced to the slowest drive (the pool in this case)? He would still benefit from the other NVMe drives that are operation, for sure, but I believe this performance hit is what the OP is trying to avoid.

When all NVMe drives are operational then you’re right, it’s a waste of NVMe cache. He’s not mirroring the drives because of the risk of data loss, but they don’t want to risk losing the performance. Something I’ve considered myself

If you had 2 l2arc drives, and you mirrored them. You have the speed of one drive. If you stripe, you have the speed of two drives. If one was to fail you’d have the speed of your mirror….what’s the point of mirroring a read only cache? you so t go down if it fails, and you are actually Giving up performance under normal conditions….doesn’t make sense.

kspare · Jan 28, 2023

yazman said:
Any issues with Mellanox Connectx-3? Just bought those and about to test them.

I can’t say..I run chelsios… search the forums..

yazman · Feb 3, 2023

kspare said:
If you had 2 l2arc drives, and you mirrored them. You have the speed of one drive. If you stripe, you have the speed of two drives. If one was to fail you’d have the speed of your mirror….what’s the point of mirroring a read only cache? you so t go down if it fails, and you are actually Giving up performance under normal conditions….doesn’t make sense.

Unless I understand striping wrong then your VMs would still suffer if you had one of the cache drives die. If your object consists of multiple records then it would store those records across multiple drives, correct? I agree that it's not a real risk as in you don't lose your main/pool data, that's still safe, but you still face a performance impact, which is what the OP is asking about.

For example, your VM requests an object that is made of up 12 records, then those records would be on multiple drives because it's striped. So if one of those drives die, then when the VM requests that object, it would get some of them from cache but the remaining objects (that resided on the failed drive) would need to be requested from the main pool and provide degraded performance overall. Would it not?

If I'm wrong here then please let me know because I'm trying to understand how a failed cache drive would not result in degraded performance. I understand that striped drives would provide better performance than mirrored drives for write speed, but if one of them fails then you're bogged down by the weakest link (the main pool in this case).

Also, just to clarify:
RAID1 can read as fast striped disks, not always but with the right settings it can read from both disks.

kspare · Feb 4, 2023

You logic is this….I have 2 disks. If I mirror them, I get the performance of one, it if one disk fails, that read only data won’t be lost.

if you stripe, you have the performance of two disks, if you loose a disk you go to the the performance of a mirror and you loose the read only data.

why would you create a bottle neck right from the start?

yazman · Feb 4, 2023

kspare said:
You logic is this….I have 2 disks. If I mirror them, I get the performance of one, it if one disk fails, that read only data won’t be lost.

if you stripe, you have the performance of two disks, if you loose a disk you go to the the performance of a mirror and you loose the read only data.

why would you create a bottle neck right from the start?

To avoid a bigger bottleneck if a drive fails.

If they don't mirror then they will get more cache (and hence better overall performance) - 4 drives striping your cache. If a drive fails, then anything that requires going to the pool will be going at the speed of the pool.

If they mirror then they will get half the storage for cache - 2 drives striping your cache. If a drive fails, then you still have your full cache and performance is not degraded.

There are pros and cons to each situation.

kspare · Feb 4, 2023

you are not seeing how your logic is flawed.

let’s say each drive gives you 100 iops.

if you mirror them, you get 100 iops.

if you strip you get 200 iops.

in a mirror you get 100 iops either way

in the strip you have 200 iops amd double the cache size.

why would you limit yourself?

you missing out on double the speed and capacity for what? Like I said if you had 2 drives, in a failure situation you reduce to essentially what a mirror would give you….

this is like why they just don’t let you do a mirror in the first place..

I run 12 800gb sas drives in my l2 arc…imagine how much speed and performance I’d loose if I mirrored….my l2 arc has a very high hit ratio. Reads also don’t kill ssd..writes do. That’s why the log drive should be mirrored.

hope that all makes sense.

Important Announcement for the TrueNAS Community.

TrueNAS Build for VMs - Looking for input

AshwinRS

Cadet

Arwen

MVP

Terminology and Abbreviations Primer

Resource - The path to success for block storage

kspare

Guru

kspare

Guru

Arwen

MVP

kspare

Guru

yazman

Cadet

yazman

Cadet

yazman

Cadet

kspare

Guru

kspare

Guru

yazman

Cadet

kspare

Guru

yazman

Cadet

kspare

Guru

Similar threads

Important Announcement for the TrueNAS Community.

TrueNAS Build for VMs - Looking for input

Cadet

MVP

Guru

Guru

MVP

Guru

Cadet

Cadet

Cadet

Guru

Guru

Cadet

Guru

Cadet

Guru

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "TrueNAS Build for VMs - Looking for input"

Similar threads