TrueNAS Build for VMs - Looking for input

AshwinRS

Cadet
Joined
Jan 20, 2023
Messages
1
Hey guys, I'm looking for input on a new configuration I want to test for housing block storage used by VMs. The TrueNAS box won't host the VMs, just their data. The VMs will be mixed-usage VPS hosting.

Hardware
Chassis
SSG-6028R-E1CR24L https://www.supermicro.com/en/products/system/2U/6028/SSG-6028R-E1CR24L.cfm

  • 24x SAS3/SATA3 12Gb/s backplane
  • Upgraded to add 2 more hotswap SSDs that we’re going to use for ZLOG
  • Upgraded to include 4 NVMe slots (uses 4 of the 24 hotswap slots)
  • Purchased add-on PCIe card that holds 2 NVMe M.2 drives (possibly use for metadata)
  • Purchased Mellanox ConnectX-3 dual-port 40GbE QSFP+ network card
  • 2 x Intel Xeon E5-2697V3
  • 256GB RAM but I’m thinking 512GB RAM (I’ve purchased enough 32GB DDR4 ECC RAM to max it out)
Hard Drives
Configuration
Things we need to factor in and verify for optimal performance:

  • Sector size – drive firmware (SSDs and spinning disks) can report 4K or 512B sectors, but we need to verify the true physical sector size to determine the block size. This is needed to determine the ashift for TrueNAS
    • From what I’m reading in the PDFs, the 800GB SSDs are 512B native and the HGST spinning disks can be 4Kn/512e
    • Smartctl output is needed
    • Diskinfo output is needed
    • There’s a TrueNAS command that shows what TrueNAS sees, but apparently it doesn’t always give accurate information. Can somebody direct me to the right command(s) to run? The ones I’ve found so far are sketched after this list. I understand that writing 512B sectors onto a 4K physical sector can lead to really bad write performance (the drive has to do a read-modify-write cycle for every misaligned write)
  • I’m thinking:
    • Make sure the onboard raid controller cache is turned off so we don’t risk anything in case of a power failure
    • RAID1 for OS
    • RAID1 for ZLOG
    • 2 x RAID1 for L2ARC on the NVMe in case a drive fails.
      • I know people wouldn’t suggest that because the data still sits on the pool, but we want to ensure we don’t have degraded performance in case of a drive failure.
      • This will stripe across both mirrors. Each drive is rated at 660K read IOPS, and with RAID1 we can read from both drives in each mirrored pair, so that would put us at 2640K IOPS for reads. Even if we only maintain the IOPS of one drive per mirrored pair, that still puts us at 1320K IOPS, which is amazing!
    • 3 vdevs of 6 drives, plus 2 spares.
      • This will stripe across the three VDEVs for better performance
      • 8TB x 3 vdevs x 4 data drives (2 parity per vdev) = 96TB usable
    • Metadata is something we need to investigate
      • I’m reading that people can set up a special vdev to host this and improve performance. Would this require a lot of write endurance? I imagine it would only change when data changes, and I don’t anticipate much churn since most of the hosted data stays as-is, with the exception of DB updates and the like
      • I understand this vdev is critical: if we lose the metadata, we lose the pool, so we can’t risk that
      • Others have noted that you can run a command to keep this in RAM. Some people don’t think this is a good idea because it can take up a lot of RAM, but we can add more RAM. I haven’t tried this before; would we need to manually run the command on each boot to pull the metadata from the vdevs into RAM? If so, there’s a downside here, because I imagine it would need to rebuild in RAM, which means degraded performance for quite a while until it does.
      • Alternatively, instead of keeping it in RAM, we could keep it in L2ARC. I haven’t investigated, but in theory this would mean it’s at least persistent and wouldn’t degrade performance after a reboot, as it would sit in L2ARC. Is this correct? The knobs I’ve found so far are also sketched below.
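For reference, here's what I'm planning to run to check the reported sector sizes and the resulting ashift (device and pool names below are just examples), please correct me if there's a better way:

Code:
# logical vs physical sector size as reported by the drive firmware
smartctl -i /dev/sda
# SCALE (Linux): logical and physical sector size for every disk
lsblk -o NAME,LOG-SEC,PHY-SEC
# CORE (FreeBSD): sectorsize and stripesize
diskinfo -v /dev/da0
# per-vdev ashift the pool is actually using (zdb may need the pool's cachefile via -U on TrueNAS)
zdb -C tank | grep ashift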
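On the L2ARC side, the knobs I've come across so far (and I may have misread them) are the per-dataset secondarycache property, which controls whether data, metadata, or both are eligible for L2ARC, and persistent L2ARC in OpenZFS 2.0+, which should let the cache survive a reboot. Roughly:

Code:
# what is currently eligible for L2ARC (default "all" = data + metadata)
zfs get secondarycache tank
# example only: restrict L2ARC to metadata for a hypothetical dataset
zfs set secondarycache=metadata tank/vm-zvols
# SCALE (Linux): 1 means the persistent L2ARC is rebuilt on pool import
cat /sys/module/zfs/parameters/l2arc_rebuild_enabled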
Any input here would be greatly appreciated!

Thank you!
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
It is helpful to use the proper ZFS terminology. ZFS does not support RAID-1 (but it does support mirroring, which works similarly but is not the same thing). Next, there is no such thing as a ZLOG; we can guess you meant SLOG.

There is a general ratio between the amount of RAM and L2ARC size (which I don't remember offhand). Using very large L2ARC devices (or many smaller ones) would not be recommended, as the ideal path is to max out RAM first.

There is also a Resource for block storage (see the top of any forum page for the Resources link).

As for the rest, I don't know, thus have not commented.
 

kspare

Guru
Joined
Feb 19, 2015
Messages
508
Why mirror your L2ARC? You said you don't want to reduce performance in the event of a failure... but if you ran striped, you'd double your performance. And if you lost a drive while striped, you'd still be the same speed or faster than a mirror...

For your data, stick to mirrored vdevs.

I run 4 servers for terminal server storage, housing about 109 servers with 1100-ish users. 1TB RAM each, 12 x 800GB SAS SSDs for L2ARC, and I don't run larger than 4TB SATA drives. I'm using Intel P3700 NVMe for my SLOG, mirrored. I've been experimenting with metadata drives and struggling to find value in them, to be honest. RAM just works better.

Stick to Chelsio 40Gb NICs.
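For what it's worth, adding a mirrored SLOG from the shell looks something like this (pool and device names are only examples; the GUI pool manager does the same thing):

Code:
# add a mirrored log (SLOG) vdev to an existing pool
zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1
# confirm the log vdev shows up as a mirror
zpool status tank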
 

kspare

Guru
Joined
Feb 19, 2015
Messages
508
It's also worth noting that with the large L2ARC, once the cache has warmed up, the majority of the disk I/O comes from the L2ARC or ARC. It takes a few days, but eventually the drives are almost idle; it's pretty cool to watch.
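If anyone wants to watch the warm-up themselves, the stock OpenZFS tools show it (pool name is just an example):

Code:
# ARC and L2ARC sizes plus hit ratios
arc_summary
# live ARC stats every 5 seconds
arcstat 5
# per-vdev I/O; the spinning disks go quiet as the cache warms
zpool iostat -v tank 5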
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
I am not sure you can actually mirror Cache / L2ARC devices. In the past, this was NOT possible as Sun decided that this type of vDev was not critical. (The real data would always be available in the data pool.)

After playing with a VM of TrueNAS SCALE, I was not able to find any place in the GUI that would allow creating a Cache / L2ARC as a Mirror vDev. Nor, from the command line would it allow me to attach a Mirror afterwards;
Code:
root@truenas[~]# zpool attach tank 942f0b1b-ec8e-4cdb-9bb6-8c7b26fa6a97 /dev/disk/by-partuuid/743780da-8b35-433b-af71-34e43a93e132
cannot attach /dev/disk/by-partuuid/743780da-8b35-433b-af71-34e43a93e132 to 942f0b1b-ec8e-4cdb-9bb6-8c7b26fa6a97: device is in use as a cache

If someone can prove me wrong, I will not be offended. Better correct information, (which I think I have supplied), than wrong information.
 

kspare

Guru
Joined
Feb 19, 2015
Messages
508
I am not sure you can actually mirror Cache / L2ARC devices. In the past, this was NOT possible as Sun decided that this type of vDev was not critical. (The real data would always be available in the data pool.)

After playing with a VM of TrueNAS SCALE, I was not able to find any place in the GUI that would allow creating a Cache / L2ARC as a Mirror vDev. Nor, from the command line would it allow me to attach a Mirror afterwards;
Code:
root@truenas[~]# zpool attach tank 942f0b1b-ec8e-4cdb-9bb6-8c7b26fa6a97 /dev/disk/by-partuuid/743780da-8b35-433b-af71-34e43a93e132
cannot attach /dev/disk/by-partuuid/743780da-8b35-433b-af71-34e43a93e132 to 942f0b1b-ec8e-4cdb-9bb6-8c7b26fa6a97: device is in use as a cache

If someone can prove me wrong, I will not be offended. Better correct information, (which I think I have supplied), than wrong information.
There's just no need to do this… it's a read-only cache.
 

yazman

Cadet
Joined
Feb 5, 2020
Messages
5
Why mirror your L2ARC? You said you don't want to reduce performance in the event of a failure... but if you ran striped, you'd double your performance. And if you lost a drive while striped, you'd still be the same speed or faster than a mirror...

For your data, stick to mirrored vdevs.

I run 4 servers for terminal server storage, housing about 109 servers with 1100-ish users. 1TB RAM each, 12 x 800GB SAS SSDs for L2ARC, and I don't run larger than 4TB SATA drives. I'm using Intel P3700 NVMe for my SLOG, mirrored. I've been experimenting with metadata drives and struggling to find value in them, to be honest. RAM just works better.

Stick to Chelsio 40Gb NICs.
In his case, if an NVMe drive dies, then he would be degraded to the speed of the pool, no? The L2ARC stripes across all the drives, so wouldn't it be reduced to the slowest drive (the pool in this case)? He would still benefit from the other NVMe drives that are operational, for sure, but I believe this performance hit is what the OP is trying to avoid.

When all NVMe drives are operational, then you're right, it's a waste of NVMe cache. He's not mirroring the drives because of a risk of data loss; he just doesn't want to risk losing the performance. Something I've considered myself.
 

yazman

Cadet
Joined
Feb 5, 2020
Messages
5
There's just no need to do this… it's a read-only cache.
I am not sure you can actually mirror Cache / L2ARC devices. In the past, this was NOT possible as Sun decided that this type of vDev was not critical. (The real data would always be available in the data pool.)

After playing with a VM of TrueNAS SCALE, I was not able to find any place in the GUI that would allow creating a Cache / L2ARC as a Mirror vDev. Nor, from the command line would it allow me to attach a Mirror afterwards;
Code:
root@truenas[~]# zpool attach tank 942f0b1b-ec8e-4cdb-9bb6-8c7b26fa6a97 /dev/disk/by-partuuid/743780da-8b35-433b-af71-34e43a93e132
cannot attach /dev/disk/by-partuuid/743780da-8b35-433b-af71-34e43a93e132 to 942f0b1b-ec8e-4cdb-9bb6-8c7b26fa6a97: device is in use as a cache

If someone can prove me wrong, I will not be offended. Better correct information, (which I think I have supplied), than wrong information.
RAID1 using a RAID controller for the NVMe, and JBOD / IT mode for the spinning drives. I don't believe this is possible in TrueNAS alone.
 

yazman

Cadet
Joined
Feb 5, 2020
Messages
5
Why mirror your L2ARC? You said you don't want to reduce performance in the event of a failure... but if you ran striped, you'd double your performance. And if you lost a drive while striped, you'd still be the same speed or faster than a mirror...

For your data, stick to mirrored vdevs.

I run 4 servers for terminal server storage, housing about 109 servers with 1100-ish users. 1TB RAM each, 12 x 800GB SAS SSDs for L2ARC, and I don't run larger than 4TB SATA drives. I'm using Intel P3700 NVMe for my SLOG, mirrored. I've been experimenting with metadata drives and struggling to find value in them, to be honest. RAM just works better.

Stick to Chelsio 40Gb NICs.
Any issues with Mellanox ConnectX-3? Just bought those and I'm about to test them.
 

kspare

Guru
Joined
Feb 19, 2015
Messages
508
In his case, if an NVMe drive dies, then he would be degraded to the speed of the pool, no? The L2ARC stripes across all the drives, so wouldn't it be reduced to the slowest drive (the pool in this case)? He would still benefit from the other NVMe drives that are operational, for sure, but I believe this performance hit is what the OP is trying to avoid.

When all NVMe drives are operational, then you're right, it's a waste of NVMe cache. He's not mirroring the drives because of a risk of data loss; he just doesn't want to risk losing the performance. Something I've considered myself.
If you had 2 L2ARC drives and you mirrored them, you'd have the speed of one drive. If you stripe, you have the speed of two drives, and if one were to fail you'd have the speed of your mirror… so what's the point of mirroring a read-only cache? You don't go down if it fails, and you are actually giving up performance under normal conditions… it doesn't make sense.
 

kspare

Guru
Joined
Feb 19, 2015
Messages
508
Any issues with Mellanox ConnectX-3? Just bought those and I'm about to test them.
I can't say… I run Chelsio… search the forums.
 

yazman

Cadet
Joined
Feb 5, 2020
Messages
5
If you had 2 L2ARC drives and you mirrored them, you'd have the speed of one drive. If you stripe, you have the speed of two drives, and if one were to fail you'd have the speed of your mirror… so what's the point of mirroring a read-only cache? You don't go down if it fails, and you are actually giving up performance under normal conditions… it doesn't make sense.
Unless I understand striping wrong, your VMs would still suffer if one of the cache drives died. If your object consists of multiple records, then those records are stored across multiple drives, correct? I agree that it's not a real risk in the sense that you don't lose your main/pool data; that's still safe. But you still face a performance impact, which is what the OP is asking about.

For example, if your VM requests an object that is made up of 12 records, those records would be spread across multiple drives because the cache is striped. So if one of those drives dies, then when the VM requests that object, it would get some of the records from cache, but the remaining records (the ones that resided on the failed drive) would need to be fetched from the main pool, giving degraded performance overall. Would it not?

If I'm wrong here, please let me know, because I'm trying to understand how a failed cache drive would not result in degraded performance. I understand that striped drives provide better performance than mirrored drives, but if one of them fails, you're bogged down by the weakest link (the main pool in this case).

Also, just to clarify:
RAID1 can read as fast as striped disks. Not always, but with the right settings it can read from both disks.
 

kspare

Guru
Joined
Feb 19, 2015
Messages
508
Your logic is this… I have 2 disks. If I mirror them, I get the performance of one, but if one disk fails, that read-only data won't be lost.

If you stripe, you have the performance of two disks; if you lose a disk, you drop to the performance of a mirror and you lose the read-only data.

Why would you create a bottleneck right from the start?
 

yazman

Cadet
Joined
Feb 5, 2020
Messages
5
Your logic is this… I have 2 disks. If I mirror them, I get the performance of one, but if one disk fails, that read-only data won't be lost.

If you stripe, you have the performance of two disks; if you lose a disk, you drop to the performance of a mirror and you lose the read-only data.

Why would you create a bottleneck right from the start?

To avoid a bigger bottleneck if a drive fails.

If they don't mirror, then they get more cache (and hence better overall performance): 4 drives striping the cache. If a drive fails, then anything that has to go to the pool will be running at the speed of the pool.

If they mirror, then they get half the storage for cache: 2 drives striping the cache. If a drive fails, then you still have your full cache and performance is not degraded.

There are pros and cons to each situation.
 

kspare

Guru
Joined
Feb 19, 2015
Messages
508
You are not seeing how your logic is flawed.

Let's say each drive gives you 100 IOPS.

If you mirror them, you get 100 IOPS.

If you stripe, you get 200 IOPS.

In a mirror you get 100 IOPS either way.

In the stripe you have 200 IOPS and double the cache size.

Why would you limit yourself?

You're missing out on double the speed and capacity for what? Like I said, if you had 2 drives, in a failure situation you reduce to essentially what a mirror would give you…

This is likely why they just don't let you do a mirror in the first place.

I run 12 x 800GB SAS drives in my L2ARC… imagine how much speed and performance I'd lose if I mirrored… My L2ARC has a very high hit ratio. Reads also don't kill SSDs; writes do. That's why the log drive should be mirrored.

Hope that all makes sense.
 