FreeNAS (Inside ESX Host)

mstang1988

Contributor
Joined
Aug 20, 2012
Messages
102
I'm moving on to my next FreeNAS adventure. I've been a long-time user, but this time I might up my game slightly.

Today I'm running

Motherboard:
X9SRL-F

RAM:
128GB ECC DDR3 (I don't remember the brand/model)

Processor:
Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70GHz

92xx SAS adapter flashed to JBOD mode, passed through to FreeNAS
2 pools of 6x3TB HGST 7200RPM raidz2

Intel 520 or 540 10Gb NICs

SSD for ESX boot drive

Micron p320h 700GB (VM datastore)

This time I'm thinking I'd like to move over most (if not all) of my VMs to the FreeNAS datastore. I still want high I/O performance for VMs because sluggishness annoys me. I'm also reducing the footprint/power consumption of my next rig.

AsRock Rack EPYC3251D4I-2T Mini-ITX


32GBx4 (128GB RAM) DDR4

Optane 905P 380GB

OCuLink 4i to 4x SATA (we will see if FreeNAS recognizes this). If not, I'll swap in my 92xx PCIe adapter for the 6 SATA ports I need

6x 16TB Seagate Exos 7200RPM drives

Possible PCIe-to-M.2 adapter + 1TB 970 EVO Plus

Any thoughts on the best way to set up this rig for VMs? Should I expose the zpool via NFS and utilize the NVMe drives as cache disks? Alternatively, I've thought about doing the same as I currently have and letting ESX make raw use of them.
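
(For anyone weighing the NFS route, here is a minimal sketch of what it looks like end to end; the pool/dataset names and IP are made up. The share itself is normally defined in the FreeNAS GUI under Sharing > Unix (NFS) Shares, and the ESXi side is a stock NFSv3 mount.)

```
# On FreeNAS: a dedicated dataset for VM storage (names are placeholders)
zfs create tank/vmstore
zfs set atime=off tank/vmstore
zfs set compression=lz4 tank/vmstore

# On the ESXi host: mount the export as an NFSv3 datastore
esxcli storage nfs add --host=192.168.1.50 --share=/mnt/tank/vmstore --volume-name=freenas-vmstore
```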

The main benefit I want to get out of a shared datastore is that I can get a vSphere license for myself (at an affordable cost), and I'm considering adding 2 other compute nodes with only NVMe drives. This would allow me to vMotion, etc., on my home cluster.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080

This time I'm thinking I'd like to move over most (if not all) of my VMs to the FreeNAS datastore.
I am a little confused. Are you wanting to run FreeNAS as a VM or do you want FreeNAS to host the storage for the VMs? I can't imagine it would be exactly safe to do both, but I suppose it could work, like a dream inside a dream...

As for the storage, if the virtual (or not) FreeNAS is hosting the file that is the virtual drive for a virtual machine, and you say:
I still want high I/O performance for VM's because sluggishness annoys me.
This means you need lots of disks in mirror vdevs... That is how you get high IOPS from ZFS.
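
(To make the layout difference concrete, a sketch with placeholder device names: the same six disks as one raidz2 vdev versus three 2-way mirrors. As a rough rule, each vdev contributes about one disk's worth of random IOPS, so the mirror layout trades capacity for roughly three times the random I/O.)

```
# Six disks as a single raidz2 vdev: best usable capacity, ~1 vdev of random IOPS
zpool create tank raidz2 da0 da1 da2 da3 da4 da5

# The same six disks as three 2-way mirrors: ~3x the random IOPS, half the raw space usable
zpool create tank mirror da0 da1 mirror da2 da3 mirror da4 da5
```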

Required reading:

There are many other resources in the forum on virtualization with FreeNAS and a lot of folks have done it. You just need to do it right or it will be disappointingly slow or hazardous to your data, or both.
 

mstang1988

Contributor
Joined
Aug 20, 2012
Messages
102
I am a little confused. Are you wanting to run FreeNAS as a VM or do you want FreeNAS to host the storage for the VMs? I can't imagine it would be exactly safe to do both, but I suppose it could work, like a dream inside a dream...
I'll be running ESX bare metal. FreeNAS will be a VM with its boot disk on the base storage or a disk passed through. Additionally, all disks for the zpools will be passed through. This VM will need to stay pinned to that local ESX host due to the passthrough (it can't be vMotioned). I would then share the datastore (NFS) so all other VMs can be on shared storage from FreeNAS, allowing them to vMotion.
As for the storage, if the virtual (or not) FreeNAS is hosting the file that is the virtual drive for a virtual machine, and you say:

This means you need lots of disks in mirror vdevs... That is how you get high IOPS from ZFS.

Required reading:

IOPS are about two things: disk latency and the number of disks available for parallel writes. Yes, more "spindles" help, but so do much faster disks (NVMe/Optane). I'll take a look at the reading, but I'm fairly familiar with ZFS and RAID.

Due to the nature of how ESX consumes datastores and sync I/O, things can get pretty slow. I have 10Gb NICs/infrastructure.
Non-issues; I'm familiar with the dos and don'ts.



There are many other resources in the forum on virtualization with FreeNAS and a lot of folks have done it. You just need to do it right or it will be disappointingly slow or hazardous to your data, or both.
The two alternatives I'm seriously considering are VMware vSAN or just sticking with a raw disk on ESX and skipping the mobility. Each has its pros and cons.

I'd like to hit line rate for sequential reads/writes and boot a Windows 10 VM in, say, 5 seconds (what it currently takes off a PCIe SSD). It looks like I'll have to play with a SLOG (and the data risks that come with it during a hard power-off) vs. mirrored NVMe drives. Clearly 6 spinning drives won't net me the performance I want.
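
(One way to put numbers on that SLOG-versus-mirrored-NVMe question before committing hardware, assuming fio is installed on the FreeNAS box; the path and job parameters below are arbitrary. Forcing a sync after every small write approximates what a VM datastore generates, and the same run can be repeated with the SLOG added and removed.)

```
fio --name=syncwrite --directory=/mnt/tank/vmstore --rw=randwrite --bs=4k \
    --ioengine=posixaio --iodepth=1 --numjobs=1 --size=4g \
    --runtime=60 --time_based --fsync=1
```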

What I don’t have is much practical experience with a fast slog and the buffer it provides or extremely fast drives mirrored.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
The first system is a bit dated CPU-wise and power-consumption-wise. I can get more than double the cores with similar per-core performance.
Is the FreeNAS running slow? The answer is more vdevs. Same system board I am using, if I recall correctly. Should be good for a few more years.
 

mstang1988

Contributor
Joined
Aug 20, 2012
Messages
102
Is the FreeNAS running slow? The answer is more vdevs. Same system board I am using, if I recall correctly. Should be good for a few more years.
Motivation is multifold: I want more CPU for non-FreeNAS tasks, and I want to move my VM datastore to FreeNAS from a native ESX datastore. I also want to reduce the heat, noise, and footprint of the solution and to grow my storage array (the storage could have been done in place). Hence, new build.

Right now I'm running on the rig with:

ESX 7.0, latest update
- FreeNAS VM with 8 vCPUs
- 32GB RAM (64GB total on the host, because 2 DIMMs had ECC errors and I'm returning them)
- Optane M.2 905P SLOG (yes, I know the risks if I lose this SLOG) for a pool of 2 Samsung 970 EVO Plus 1TB drives (VM datastore)
- 10Gb VMXNET3 NIC

I haven't installed my 16TB disks, as I'm waiting to deal with the new RAM and my case requires me to remove the drives to install the RAM. I'll add another Optane as the SLOG for this group of drives.

ASUS x16 -> 4x x4 PCIe adapter (bifurcation)

970 EVO for the local ESX datastore (for the FreeNAS VM to live on), no redundancy. I'll just back up the configuration instead.

I've configured the pool the NFS share is on with sync = always, atime = off, 128K record size, and 8 threads for the NFS server.
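
(For reference, the dataset-level part of that configuration maps onto ordinary ZFS properties; the dataset name below is a placeholder, and the NFS thread count lives under Services > NFS in the GUI.)

```
zfs set sync=always tank/vmstore      # force sync semantics for everything on the dataset
zfs set atime=off tank/vmstore        # skip access-time updates on reads
zfs set recordsize=128K tank/vmstore  # the 128K record size, as above
zfs get sync,atime,recordsize tank/vmstore   # verify
```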

With this configuration ESX is utilizing the NFS datastore, mounted as root. On the datastore I've booted a W10 VM. In that VM I'm getting ~1400MB/s read (1M blocks, 16GB total transfer, 4 threads) from CrystalDiskMark and about 1300MB/s write. Random 4K Q1 is down to about 40MB/s (which isn't surprising). 4K Q1 with 4 threads is closer to 100MB/s, which is pretty good for 4K random.

I first enabled the LRO, TSO, and TXCSUM offloads. I ran into the FreeBSD TSO bug in 12.0, so I had to disable TSO.
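
(For anyone hitting the same bug: TSO can be dropped on the vmxnet3 interface inside the FreeNAS VM like this. The interface name is an assumption, and making the change persistent is normally done with a FreeNAS tunable or post-init script rather than a one-off command.)

```
ifconfig vmx0 -tso                 # clear TSO4/TSO6 on the vmxnet3 interface for this boot
ifconfig vmx0 | grep -i options    # confirm TSO is no longer listed in the options line
```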

The next thing I've run into (and will be starting a dedicated thread on) is that as soon as I enable a jail (and get a bridge), my read speed off the NFS store goes belly-up. This appears related to the bridge. Disabling the plugin and rebooting remediates the problem.

I tried turning off the hardware offloads; same results. I looked at adding a separate unaddressed NIC on the same network and then assigning that to the jails with promiscuous mode, but TrueNAS/BSD didn't like that.

My Windows CIFS performance is around 600MB/s sustained. Not great, but not bad for single-threaded and reading/writing to the same datastore.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
with a fast slog and the buffer it provides

the ... wha? oh dear, that's not right at all.

https://www.truenas.com/community/threads/some-insights-into-slog-zil-with-zfs-on-freenas.13633/

if you want fast, just ditch the slog entirely. slog is *always* slower than simply turning off sync writes.

I want more CPU for non-FreeNAS tasks
- FreeNAS VM with 8 vCPUs
Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70GHz


day's still young but that's crazily contradictory enough that it's likely to win today's "wow that's nuts" award... why would you set vCPU=total threads? it means that freenas can only run when there is literally nothing else runnable by the hypervisor. cut that down to two, maybe three, probably not four vCPU.

This would allow me to vmotion etc on my home cluster.

are you using something like a vmug advantage license? if so, you have storage vmotion, and shared storage is not a requirement. you can migrate both compute and storage at the same time, and if you do it from one hypervisor to another with nvme, it's pretty frickin' fast...
 

mstang1988

Contributor
Joined
Aug 20, 2012
Messages
102
the ... wha? oh dear, that's not right at all.
Yeah, no clue what I was trying to say there. It's not a buffer, agreed. With VMs we have many small sync writes, so a performant SLOG is of great benefit.
https://www.truenas.com/community/threads/some-insights-into-slog-zil-with-zfs-on-freenas.13633/

if you want fast, just ditch the slog entirely. slog is *always* slower than simply turning off sync writes.
Turning off sync writes is a terrible idea when the primary usage of a VM datastore is sync writes. Why the heck would I do that? Hence the benefit of a high-performance SLOG when dealing with small I/O.

day's still young but that's crazily contradictory enough that it's likely to win today's "wow that's nuts" award... why would you set vCPU=total threads? it means that freenas can only run when there is literally nothing else runnable by the hypervisor. cut that down to two, maybe three, probably not four vCPU.
8 vCPUs on my new host, 4 on my old. If you are going to be condescending, at least be right. It's a very common practice to overprovision CPUs. Net-net is to monitor CPU ready time to ensure you are not starving VMs (and I'm not). Based on the way I utilize my VMs, most of the time the host isn't heavily loaded (again, home use).
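
(For anyone following along, CPU ready can be pulled from esxtop's batch mode as well as from the vSphere performance charts; the sample counts and path below are arbitrary.)

```
# On the ESXi host: capture six 10-second samples as CSV, then list the per-VM "% Ready" columns
esxtop -b -d 10 -n 6 > /tmp/esxtop.csv
head -1 /tmp/esxtop.csv | tr ',' '\n' | grep "% Ready"
```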

I'd rather my NAS be able to utilize more of the CPU when running Plex etc.


Although hard for some to believe, FreeNAS/TrueNAS is not my most critical workload on the machine. I don't have CPU reservations set for FreeNAS, so yes, FreeNAS may wait. The most mission-critical workload is my UTM, and I set reservations for that VM.
are you using something like a vmug advantage license? if so, you have storage vmotion, and shared storage is not a requirement. you can migrate both compute and storage at the same time, and if you do it from one hypervisor to another with nvme, it's pretty frickin' fast...
I'm pretty familiar with all of the flavors of vMotion. Ideally with vMotion I create a VMkernel NIC for vMotion and isolate its traffic between hosts. The other net negative is that I then have to buy fast storage for all hosts and likely need to mirror the drives for resiliency. I don't get the benefits of vSphere HA (if I want to utilize it), and I don't get the benefits of TrueNAS and ZFS (offsite sync, etc.). I could utilize VM snapshots (and I do), but they come with negatives as well.
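
(A sketch of the dedicated vMotion VMkernel interface mentioned above; the portgroup is assumed to already exist on a vSwitch, and the addressing is made up.)

```
esxcli network ip interface add --interface-name=vmk1 --portgroup-name=vMotion
esxcli network ip interface ipv4 set --interface-name=vmk1 --ipv4=10.0.99.11 \
    --netmask=255.255.255.0 --type=static
esxcli network ip interface tag add --interface-name=vmk1 --tagname=VMotion
```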

There are other vSphere features that benefit from shared storage.

One of the reasons I utilize ESX on this host is to play with vSphere/ESX. I could easily run truenas on native hardware and my UTM on a real appliance and virtualize my workloads on top of truenas but where's the fun in that?
 

mstang1988

Contributor
Joined
Aug 20, 2012
Messages
102
BTW, this is well written, and thank you for providing it to folks. This is all consistent with my understanding of ZFS and how it works. I suppose I should be more crisp with terminology and the difference between a write cache/buffer and the intent log.

Do realize, though, when you jump on people, that not all folks are day-one learners; some have a long history and experience in the tech industry.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
Turning off sync writes is a terrible idea when the primary usage of a VM datastore is sync writes. Why the heck would I do that? Hence the benefit of a high-performance SLOG when dealing with small I/O.
My understanding is that with a non-redundant SLOG device that does not have power-loss protection, there is no net gain in data safety relative to turning off sync writes. Instead there is the overhead of waiting for the SLOG device to confirm the sync write. And of course a false sense of security. Perhaps @jgreco can chime in and clear this up from his perspective ...
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Turning off sync writes is a terrible idea when the primary usage of a VM datastore is sync writes. Why the heck would I do that? Hence the benefit of a high-performance SLOG when dealing with small I/O.

But the primary usage of a VM datastore is storing VMs, not sync writes. Especially in a situation where your storage is converged, the primary case that SLOG is protecting against, VM corruption, is greatly diminished.

The big problem SLOG is guarding against is a loss of data.

In the data center, when you have your SAN storage in racks "A1-A4" and your hypervisors a few rows over in racks "D4-D5", you want to guard against the risk that the power fails for the PDU in row A, taking down your SAN but leaving your hypervisors up. If this happens, data that the SAN acknowledged back to the hypervisors as having been written to storage, but that didn't actually get written to stable storage, is lost. But the VMs are still running and think they have merrily written their data to disk.

Your home stuff isn't likely to be on two separate circuits with a crapton of complicated failure-prone gear in between. So you mainly need to be concerned about the case where the NAS *crashes*, not where the NAS loses power, because when the NAS loses power, your hypervisors will too.

For the rest of the cases, you need to decide whether the safety belt of SLOG is worth the performance penalty you will pay. Are you processing bank transactions, where the inadvertent repeat of a transaction could cause thousands or millions of dollars in problems? Etc.

8 vCPUs on my new host, 4 on my old. If you are going to be condescending, at least be right.

I don't see anywhere where you specify that this is on your new host, and it wasn't condescending ...

It's a very common practice to overprovision CPUs.

Let me correct that: it's a common rookie mistake to overprovision CPUs. Talk to anyone who manages infrastructure hypervisors and they will tell you that one of their biggest challenges is users asking for more resources than needed, and the classic way to deal with it is to start off by allocating a reasonable number of vCPUs, typically one, and increasing it as demand is demonstrated.

https://www.zdnet.com/article/virtual-cpus-the-overprovisioning-penalty-of-vcpu-to-pcpu-ratios/

https://kb.vmware.com/s/article/1005362

https://ryanmangansitblog.com/2014/...-vcpus-should-a-virtual-machine-be-allocated/

So if you're incorrectly going to tell me not to be condescending about something, at least be right about the issue. :smile: Come on, lighten up. It's fine to be wrong or to have someone try to point out what appears to be a problem. If you truly have cores to waste, then of course you can set a higher vCPU count, but it looked to me like you were running 8 vCPUs on a 4-core/8-thread CPU.

There are other vSphere features that benefit from shared storage.

One of the reasons I utilize ESX on this host is to play with vSphere/ESX. I could easily run truenas on native hardware and my UTM on a real appliance and virtualize my workloads on top of truenas but where's the fun in that?

That's true, but if you are really interested in peak performance, you're generally going to be limited by the network. FreeNAS is pretty heavyweight and is nowhere near as nimble and performant as just doing a combined vMotion and storage vMotion between two hypervisors with NVMe storage.
 

mstang1988

Contributor
Joined
Aug 20, 2012
Messages
102
But the primary usage of a VM datastore is storing VMs, not sync writes. Especially in a situation where your storage is converged, the primary case that SLOG is protecting against, VM corruption, is greatly diminished.

The big problem SLOG is guarding against is a loss of data.

In the data center, when you have your SAN storage in racks "A1-A4" and your hypervisors a few rows over in racks "D4-D5", you want to guard against the risk that the power fails for the PDU in row A, taking down your SAN but leaving your hypervisors up. If this happens, data that the SAN acknowledged back to the hypervisors as having been written to storage, but that didn't actually get written to stable storage, is lost. But the VMs are still running and think they have merrily written their data to disk.

Your home stuff isn't likely to be on two separate circuits with a crapton of complicated failure-prone gear in between. So you mainly need to be concerned about the case where the NAS *crashes*, not where the NAS loses power, because when the NAS loses power, your hypervisors will too.

For the rest of the cases, you need to decide whether the safety belt of SLOG is worth the performance penalty you will pay. Are you processing bank transactions, where the inadvertent repeat of a transaction could cause thousands or millions of dollars in problems? Etc.
OS sync writes to disk are the critical path (VM use case). If we don't honor sync writes to the OS and we lose power, we are in pretty bad shape. That's the main use case for the SLOG vs. just turning sync writes off.

I don't see anywhere where you specify that this is on your new host, and it wasn't condescending ...

Let me correct that: it's a common rookie mistake to overprovision CPUs. Talk to anyone who manages infrastructure hypervisors and they will tell you that one of their biggest challenges is users asking for more resources than needed, and the classic way to deal with it is to start off by allocating a reasonable number of vCPUs, typically one, and increasing it as demand is demonstrated.

https://www.zdnet.com/article/virtual-cpus-the-overprovisioning-penalty-of-vcpu-to-pcpu-ratios/

https://kb.vmware.com/s/article/1005362

https://ryanmangansitblog.com/2014/...-vcpus-should-a-virtual-machine-be-allocated/

So if you're incorrectly going to tell me not to be condescending about something, at least be right about the issue. :smile: Come on, lighten up. It's fine to be wrong or to have someone try to point out what appears to be a problem. If you truly have cores to waste, then of course you can set a higher vCPU count, but it looked to me like you were running 8 vCPUs on a 4-core/8-thread CPU.
Fair enough. Perhaps I didn't state it clearly enough but did mention it.
mstang1988 said:
Hence, new build.
...
That's true, but if you are really interested in peak performance, you're generally going to be limited by the network. FreeNAS is pretty heavyweight and is nowhere near as nimble and performant as just doing a combined vMotion and storage vMotion between two hypervisors with NVMe storage.
As mentioned, I'm not totally interested in peak performance; if I were, I wouldn't be utilizing FreeNAS or VMs. My desire is to run my home infra in a way that allows me to utilize ESX, but with enough performance not to degrade the user experience, while enjoying the benefits FreeNAS provides.

I'm aware that vMotion can hit 100Gb line rates during vMotion/storage vMotion. vMotion itself is pretty lightweight, as you are only moving the RAM footprint (repeatedly for dirtied pages), but once you start having data gravity (storage vMotion) the costs become more expensive. A 200GB VM disk running at line rate is still going to take 200s best case to move.

Regardless, I got my config with FreeNAS running in a state I'm happy with. Given that I can hit line rate with large sequential transfers and ~80MB/s random 4K at queue depth 1, that's enough for now.

For what it's worth, I was formerly a kernel developer, now turned director. I mostly have a background in device drivers (NICs and crypto adapters) and hypervisor development.
 

mstang1988

Contributor
Joined
Aug 20, 2012
Messages
102
My understanding is that with a non-redundant SLOG device that does not have power-loss protection, there is no net gain in data safety relative to turning off sync writes. Instead there is the overhead of waiting for the SLOG device to confirm the sync write. And of course a false sense of security. Perhaps @jgreco can chime in and clear this up from his perspective ...
Any SLOG without power-loss protection provides no data safety, as you aren't providing true sync semantics. Intel's Optane 905P was originally listed as having power-loss protection (it doesn't have a DRAM buffer and writes directly to the memory). Intel doesn't claim it to be power-loss protected, but its big brothers (P4800X/P4801X) are.

The redundancy of the SLOG (mirror) only helps if you both lose power AND lose one SLOG device.
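
(For completeness, adding a mirrored log vdev to an existing pool is non-destructive, and log vdevs can be removed again later; device names below are placeholders.)

```
zpool add tank log mirror nvd0 nvd1   # attach a mirrored SLOG to the existing pool
zpool status tank                     # note the log vdev's name; "zpool remove tank <name>"
                                      # detaches it later without touching the data vdevs
```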
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
OS sync writes to disk are the critical path (VM use case). If we don't honor sync writes to the OS and we lose power, we are in pretty bad shape. That's the main use case for the SLOG vs. just turning sync writes off.

Yes, I've written extensively about that over the years, but as I said, that calculus is very different for a home user than a commercial or enterprise organization. There is a significant cost to sync writes, and you're certainly welcome to pay that price if you feel you must, but it is likely not buying you very much. For a home user, if the power goes out, it probably takes the whole production down, SAN and hypervisors both, and SLOG doesn't help you at ALL in that case. If you have a PSU failure or a kernel panic on the NAS, that's the primary situation where SLOG is beneficial, but you also need to be writing critical data at just the right moment, and the reality is that you are paying a significant performance tax 100% of the time in order to cover what is really an edge case. It may be cheaper and safer just to have good backups, because there are other things that can cause VM corruption or loss that having SLOG doesn't fix.
 

mstang1988

Contributor
Joined
Aug 20, 2012
Messages
102
Yes, I've written extensively about that over the years, but as I said, that calculus is very different for a home user than a commercial or enterprise organization. There is a significant cost to sync writes, and you're certainly welcome to pay that price if you feel you must, but it is likely not buying you very much. For a home user, if the power goes out, it probably takes the whole production down, SAN and hypervisors both, and SLOG doesn't help you at ALL in that case. If you have a PSU failure or a kernel panic on the NAS, that's the primary situation where SLOG is beneficial, but you also need to be writing critical data at just the right moment, and the reality is that you are paying a significant performance tax 100% of the time in order to cover what is really an edge case. It may be cheaper and safer just to have good backups, because there are other things that can cause VM corruption or loss that having SLOG doesn't fix.
For kicks and giggles I went ahead and benchmarked from the Windows 10 VM using CrystalDiskMark 8.0.1 x64, with and without sync writes, against the shared NFS ESX datastore (NFS datastore on FreeNAS). I see some odd behaviors which really surprise me. 16GB, single pass.

Test #1
No SLOG, Sync Disabled, Dual 970 EVO Plus
SEQ1M Q8T4 - READ 1596.18 MB/s :::: WRITE 920.20 MB/s
SEQ128K Q32T4 - READ 1606.07 MB/s :::: WRITE 2188.22 MB/s
RND4K Q32T4 - READ 146.13 MB/s :::: WRITE 48.04 MB/s
RND4K Q1T4 - READ 66.36 MB/s :::: WRITE 43.26 MB/s

Test #2
Optane SLOG, Sync Disabled, Dual 970 EVO Plus
SEQ1M Q8T4 - READ 1655.40 MB/s :::: WRITE 2322.02 MB/s
SEQ128K Q32T4 - READ 1560.87 MB/s :::: WRITE 470.67 MB/s
RND4K Q32T4 - READ 137.89 MB/s :::: WRITE 7.75 MB/s
RND4K Q1T4 - READ 61.01 MB/s :::: WRITE 8.59 MB/s

Test #3
Optane SLOG, Sync Standard, Dual 970 EVO Plus
SEQ1M Q8T4 - READ 1478.88 MB/s :::: WRITE 1605.30 MB/s
SEQ128K Q32T4 - READ 1441.48 MB/s :::: WRITE 1559.04 MB/s
RND4K Q32T4 - READ 147.03 MB/s :::: WRITE 56.31 MB/s
RND4K Q1T4 - READ 67.89 MB/s :::: WRITE 8.21 MB/s

Oddities from the data:
  1. Test #3 was the benchmark I've been running. Oddly, the 4K random numbers have dropped from my baseline. This was after the reconfigurations for Test #1 and Test #2. My 4K Q32T4 numbers were around 140MB/s previously, and the results were consistent; I've probably run it 20-30 times while tuning, so I do think adding/removing the SLOG and changing sync may have made a persistent difference.
  2. Test #1 should have been the fastest on writes across the board. Why is SEQ1M Q8T4 lagging so much?
We can't say much, as I'm benchmarking from a VM on the same host as the FreeNAS VM and it's not repeated testing. 4 vCPUs for Windows, 8 vCPUs for FreeNAS (these are the only VMs running), and it's an 8c/16T host. Ready time maxes at 0.09% on the FreeNAS appliance and 0.5% on the Windows 10 VM. It's unlikely to be CPU-bound (max CPU on the host is 85.83% during execution of the runs).

One other thing to add: this is the only dataset on the drives, with a 128K record size.

I haven't spent much time reading about the internals of transaction group commit. I know there is a timer, but if we are ingesting >2GB/s, a 5s commit ends up being about 10GB. Without knowing how/when the commit happens, I'll discuss the worst case. Worst case, we are ingesting data while we have outstanding writes, so we will likely be chewing up >10GB of RAM. I'll do some reading on how the transaction group commits, but I wonder if there are cases where data ends up on swap, or we back-pressure the ingest because I'm outpacing the NVMe drives.
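
(On the txg question: the relevant knobs show up as sysctls on FreeBSD/FreeNAS, names as of the FreeBSD 12-era ZFS. If memory serves, ZFS applies a write throttle as dirty data approaches the cap rather than spilling to swap, so ingest gets back-pressured before RAM is exhausted.)

```
sysctl vfs.zfs.txg.timeout             # max seconds between forced txg commits (default 5)
sysctl vfs.zfs.dirty_data_max          # ceiling on dirty (uncommitted) data held in RAM, in bytes
sysctl vfs.zfs.dirty_data_max_percent  # the default ceiling expressed as a percentage of RAM
```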
 

jenksdrummer

Patron
Joined
Jun 7, 2011
Messages
250
Benchmarks are kind of pointless when it's a VM running them, as it's subject to the hypervisor's scheduler in all regards.

Also, consider that the test file needs to be over and above what can be cached at any level in the chain, to be certain you're seeing "real" results.

Lastly, multiple passes should be performed, with the slowest and fastest tossed out and the remainder averaged...
 

jenksdrummer

Patron
Joined
Jun 7, 2011
Messages
250
Also of note, I think sync=standard leaves it up to the system sending the data to request synced writes, or to certain shares/file systems on the FreeNAS box... i.e., NFS, I think, is sync-write under Standard, but SMB is not, as an example.
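
(For reference, that behavior is controlled per dataset by the sync property; the dataset name below is a placeholder.)

```
zfs get sync tank/vmstore            # standard (default), always, or disabled
zfs set sync=standard tank/vmstore   # honor sync only when the client/protocol requests it
zfs set sync=always tank/vmstore     # treat every write as synchronous
```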
 