Optimal zvol configuration for Debian VM guests performance

nickt

Contributor
Joined
Feb 27, 2015
Messages
131
Hi all,

I'd appreciate some guidance on the best way to configure zvols and disks for optimal Debian guest performance.

TL;DR: I am forming the view that:
  • LVM + bhyve is a bad mix
  • zvols defined for VM storage should have sync writes enforced, but it hammers performance, so I don't want to do it
It would be great to get some thoughts.

My FreeNAS box has a number of VMs, first deployed in FreeNAS 9.10, using iohyve. These are running Debian guests, and - in the main - host services deployed in Docker containers. They work fantastically well, although at times, disk performance has not been wonderful. For the most part, this hasn't bothered me - per my signature, I have a well spec'd FreeNAS box, and light demand - more than 2 concurrent users of any service at any one time is uncommon. But I've decided it's time to dig into performance.

One of the key things I've noticed is that relatively modest disk activity on one VM can lead to significant iowait impacts on all VMs. What I often see is a big write completes on the guest, but the iowait impact on all VMs persists for some time after the write has finished (10 - 15 seconds). In these cases, iowait can exceed 50%, sometimes even as far as 90% or more, which obviously has nasty impacts on VM performance.

Most of my Debian VMs were set up using iohyve under FN 9.10, which means a zvol was automatically configured. Debian was built using the installer's default partitioning scheme and LVM (I wanted the ability to expand disks in the future). More recently, I have started building a new VM using the FreeNAS 11.2 GUI, which also gets the latest Debian (buster).

My performance testing is a little simplistic, but it seems to be sufficient to reveal big differences. I am doing a large sequential (1 GB) write using dd. Typically:

Code:
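 # writes 1 GiB (1,048,576 x 1 KiB blocks) of pseudo-random data; note /dev/urandom throughput can itself limit this test on some systems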
 ~$ dd bs=1024 count=1048576 </dev/urandom >test.dd2

So I've done this simple test in a bunch of different configurations. Here's what I found:

Code:
 Flavour          Built with        Disk driver          Format             Sync writes  Write speed  iowait impact
 FreeNAS 11.2-U8  -                 -                    zfs                standard     90 MB/s      light
 Debian 8.8       iohyve            ? (iohyve assigned)  zvol / LVM / ext4  standard     12 MB/s      nil
 Debian 9.5       iohyve            ? (iohyve assigned)  zvol / LVM / ext4  standard     40 MB/s      heavy
 Debian 10        FreeNAS 11.2 GUI  virtio               zvol / LVM / ext4  standard     44 MB/s      heavy
 Debian 10        FreeNAS 11.2 GUI  virtio               zvol / ext4        always       10 MB/s      nil
 Debian 10        FreeNAS 11.2 GUI  virtio               zvol / ext4        standard     160 MB/s     light

What stands out for me:
  • Newer Debians are better than older Debians
  • LVM seems to be directly responsible for heavy iowait impacts
  • Sync writes destroy performance
Now I have read that VM storage really should have sync writes enforced. Sync writes protect from data loss following some bad event (power failure, component failure, ...); without them, if a bad event happens, the VM may not come up again. OK. But sync writes smash performance, and it's interesting to me that neither the iohyve tool in FN 9.10 nor the GUI tool in FN 11.2 configures sync writes when it automagically defines a zvol / disk for guest storage.

So I'm wondering how important sync writes really are.

I realise that I could deploy a SLOG device, but a good one isn't cheap. And then I start wondering whether I wouldn't be better off just using cheap SSD storage directly provisioned to the VMs, avoiding the complexity. I also have a UPS, which assures an orderly shutdown if there is a power failure, meaning that there aren't too many reasons data loss could occur in practice.

Lastly, I am surprised by the difference in performance between writes directly in FreeNAS (90 MB/s) vs from the Debian 10 VM best case (160 MB/s). I've repeated these two tests again and again and consistently get this kind of difference.

Looking forward to your thoughts!

Nick
 

KrisBee

Wizard
Joined
Mar 20, 2017
Messages
1,288
There are not really any surprises here. I'd make a couple of casual observations:

1. For testing read/write performance, get to grips with fio, which you can use both on the FreeNAS pool and in any of your VM guests. You need to benchmark random reads/writes, not just sequential. Also use zilstat on the FreeNAS host to monitor sync writes, and zpool iostat -v when a SLOG is used.
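
A minimal sketch of what that might look like (file size, runtime and iodepth are arbitrary placeholders, "tank" stands in for your pool name, and libaio/direct assume you are inside a Linux guest; on the FreeBSD host you'd use posixaio and drop --direct):

Code:
 # random 4k write benchmark from inside the VM, against its own filesystem
 fio --name=randwrite --rw=randwrite --bs=4k --size=1G --iodepth=16 --ioengine=libaio --direct=1 --runtime=60 --time_based --group_reporting

 # meanwhile, on the FreeNAS host: watch sync write activity and per-vdev throughput
 zilstat 1
 zpool iostat -v tank 1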

2. AFAIK, you want in general to maximise the random read/write IOPS in your VMs. A raidz2 pool is a poor choice in this respect compared to, say, a stripe of three mirrors.
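
For illustration only, a stripe of three mirrors is created in one go like this (pool and device names are hypothetical, and this of course destroys anything on those disks):

Code:
 zpool create vmpool mirror da1 da2 mirror da3 da4 mirror da5 da6
 zpool status vmpool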

3. SSDs have much higher IOPS than HDDs. You're right that data centre grade SSDs can be expensive, but those with end-to-end power-loss protection are best. There are other failure modes than a UPS alone can protect against (see https://www.ixsystems.com/community/threads/slog-benchmarking-and-finding-the-best-slog.63521/).

4. If you want to test the potential benefit of a SLOG, see here: https://www.ixsystems.com/community/threads/testing-the-benefits-of-slog-using-a-ram-disk.56561/ You don't need to change the "dirty_data_sync" parameter to do this test. One word of warning: remember to remove the RAM disk SLOG after testing.
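
The gist of that test, assuming your pool is called "tank" and you have RAM to spare, is roughly:

Code:
 # create an 8 GB swap-backed memory disk and attach it as a temporary SLOG
 mdconfig -a -t swap -s 8g -u 1
 zpool add tank log md1
 # ... run the sync-write benchmarks, then remove the RAM disk again
 zpool remove tank md1
 mdconfig -d -u 1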

5. Is LVM really necessary? You could just add another zvol-backed virtual disk device to a VM, or resize an existing zvol and then use a combination of parted and resize2fs for the ext4 filesystems inside the Linux VM.
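
As a rough sketch of that workflow (the zvol name, disk device and partition number are assumptions and will differ on your system):

Code:
 # on the FreeNAS host: grow the zvol backing the VM disk
 zfs set volsize=60G tank/vm/debian10-disk0

 # inside the Debian guest, once the new size is visible (e.g. after a reboot);
 # parted may prompt to confirm if the partition is in use
 parted /dev/vda resizepart 1 100%
 resize2fs /dev/vda1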

I'd benchmark again using fio, then re-test with a temporary RAM device as a SLOG. If a SLOG makes enough of a difference, then something like a second-hand 200GB Intel S3700/S3710/S4610, or Samsung SM863a, may be good enough at reasonable cost, but see the ref below. If the addition of a SLOG to your pool makes little difference, consider changing your pool layout or simply add a separate SSD mirror pool for all your VMs.

useful refs:

 

nickt

Contributor
Joined
Feb 27, 2015
Messages
131
Thanks @KrisBee for all the input - some really useful material that I will work through. I have used fio in the past, but it was so brutal in generating IO load that it brought all of my VMs and some of my FreeNAS services down. But I will have another go to get a bit more scientific about it.

In any case, it seems using LVM was a bad choice. As you say, it's not really necessary, and you make a good point that there are easy ways to increase an ext4 partition anyway. So I'll stop that.

But I'm still not clear on the importance / value of enforcing sync writes on VM zvols. My question is: how heretical is it to leave "sync=standard" on a VM zvol?
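
(For reference, the two options I'm weighing are just the property setting on the zvol - the path here is illustrative:)

Code:
 zfs get sync tank/vm/debian10-disk0
 zfs set sync=always tank/vm/debian10-disk0      # enforce sync on every write
 zfs set sync=standard tank/vm/debian10-disk0    # honour whatever the guest asks for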

I get that a SLOG would (probably / partially) eliminate the performance issue, but I'm struggling to see how the following two systems are any different in terms of actual data safety:
  • System 1
    • Has a UPS to manage an orderly shutdown in a power failure
    • Has a VM with zvol and "sync=standard"
  • System 2
    • Has a UPS to manage an orderly shutdown in a power failure
    • Has a VM with zvol and "sync=always"
    • Has an enterprise class SSD based SLOG (but the SSD has no power loss prevention)
I can't think of any fault scenario where system 2 is any better than system 1. The only way to improve system 2 would be to use an enterprise class SSD with PLP, but then we are getting into seriously crazy horse costs.

The other aspect I'm trying to rationalise is the contrast between the data safety of the VM storage and that of the NAS itself. My most valuable asset on my FreeNAS system is the data stored on native ZFS datasets, which I serve up to my clients using the built-in SMB service. As I understand it, Samba (unlike NFS) does not do sync writes, so surely any argument applied to VM-based storage would be equally relevant to any natively stored data served by SMB?

Appreciate your feedback on this, and apologies for nit picking the detail.

Nick
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
so surely any argument applied to VM based storage would be equally relevant to any natively stored data served by SMB?

With the exception that the VM is writing to block storage, which means ZFS has no way of knowing what got written in there. Was it a file? Was it a critical superblock for the entire file system inside the VM? All looks the same.

You are right that forcing sync writes without enterprise grade hardware gains you little, if you already have a UPS. For your use case, going with sync=standard and restoring a VM from backup if it really does get corrupted sounds fine.

Read this resource, it explains why raidz is so terrible for block storage: https://www.ixsystems.com/community/threads/the-path-to-success-for-block-storage.81165/

A separate pool for your VMs, as a mirror of SSDs, sounds like an inexpensive way to solve your performance woes. How much disk space are we talking? Do the VMs themselves fit into 1TB?
Whatever files the VMs are serving should arguably not be in the block storage, but come from ZFS via an NFS or SMB share to those VMs. The mirror SSDs would just be for the VM block storage.
 

KrisBee

Wizard
Joined
Mar 20, 2017
Messages
1,288
@nickt Yorick has beaten me to it with good advice. I was going to add this, which says much the same:

As my last reply was rather superficial, feel free to nick pick at the detail. The whole subject of zfs pool design optimised for specific workloads is not exactly trivial, and there are plenty of people here with greater in-depth knowledge than myself; there's a lot to learn from the forum.

After re-reading your posts, one concern I have is the extent to which you are creating and storing data within the VMs themselves. Docker is attractive for its ease of use, etc., but AFAIU docker containers are meant to be ephemeral and stateless. Lose a container and you shouldn't lose your data. Hence the stop, remove and re-create life-cycle. Can you say that about your VMs? If you lose a VM, what data do you lose? If each VM is plumbed into your zfs pool via shares between it and the FreeNAS host in order to write data to the underlying pool, rather than storing it within the VM's zvol storage, then that's far better than "virtualising" data within the VM's storage.

I would expect your thoughts about failure modes, benchmarking and sync=standard versus sync=always would change for a VM that's writing most of its data to the pool via (internal) network shares, as opposed to a VM that writes data to its virtual storage.

SMB will use sync writes if the client requests them. An NFS server always runs in sync mode, but NFS clients may use sync or async mode. To summarise the zvol/dataset sync property:

“Sync always and sync standard have the same behaviour for sync writes. The write is acknowledged once the transaction is written to the zil. Which if it’s on a SLOG will be faster.

Sync disabled and sync standard have the same behaviour for async writes, which is that the write is acknowledged as soon as it’s received.

Sync always always writes synchronously, even if the write is async.

Sync disabled always writes asynchronously, even if the write is sync.”

I can't find the thread I had in mind about VMs, UPS and failure modes. But end-to-end PLP is a must for any device used as a SLOG, so perhaps you should revise your question re: 1 versus 2.

This possibly all boils down to: get a couple of SSDs to hold the VMs in a mirror vdev, and iowaits should be much reduced. If your VMs write data to your HDD pool via NFS shares, then a SLOG can improve performance. Using sync=standard versus sync=always for zvols is a matter of risk assessment. If I find that thread, I'll post a link.
 

nickt

Contributor
Joined
Feb 27, 2015
Messages
131
Thanks @Yorick and @KrisBee for the excellent input. I'm getting a better idea of where to go next.

One further question: after doing some more testing with fio inside the VMs, it became clear that there is a performance difference in doing small random writes with --sync=1 or 0. That is: sync writes requested inside the VM appear to be considerably slower than async writes, which would make sense if the underlying storage technology is able to recognise the difference. So is it possible that bhyve + virtio is able to provide correct handling of sync requests from the guest? If so, this would mean that sync=standard on the zvol would be a good choice, and the performance penalty of forcing all writes to sync by setting sync=always would be unnecessary. But most of the discussion I've seen suggests that FreeNAS cannot distinguish between a guest requesting sync and async writes, so I'm not quite sure I understand what I'm seeing.
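
For what it's worth, the comparison I ran inside the guest was along these lines (sizes and job names are arbitrary):

Code:
 # async random writes inside the VM
 fio --name=async-randwrite --rw=randwrite --bs=4k --size=512M --sync=0 --group_reporting
 # identical run, but with every write issued O_SYNC
 fio --name=sync-randwrite --rw=randwrite --bs=4k --size=512M --sync=1 --group_reporting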

Interesting to learn about the poor fit for VM block storage on a raidz pool. My decision to use raidz was thinking of my FreeNAS box only as a NAS appliance back at the beginning of its life. The majority of my use of the NAS is to store / protect raw photo files and to a lesser extent, media files. For this use case, I think raidz is still the optimal choice. Standing up VMs to run docker applications is a more recent thing and - until now - I've never come back to question the architecture.

Agree on the comment about container storage inside the VMs. My working assumption is that the VMs are no more trustworthy than a Raspberry Pi. So all* the config of my containers (docker-compose.yaml and relevant config files) is under source control (on my local gitlab instance), and any critical storage is handled off-VM on native zfs datasets with CIFS mounts inside the VM. Any critical storage that remains on the VM for performance reasons (e.g. mysql databases) is regularly backed up. So losing a VM would be a pain (I have to rebuild it), but not catastrophic (I can rebuild it). It's a home server, so downtime might generate a few complaints, but nothing I can't cope with.
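
(The CIFS mounts themselves are just ordinary fstab entries, something like the following - hostname, share name, mount point and credentials file are obviously specific to my setup:)

Code:
 # /etc/fstab in the Debian guest
 //freenas.local/appdata  /mnt/appdata  cifs  credentials=/root/.smbcred,uid=www-data,gid=www-data,_netdev  0  0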

Interesting side note: CIFS performance on Debian 10 + virtio NIC VM is massively better than on my older iohyve Debian 8.8 + ?? NIC VM. Whereas the Debian 10 VM can easily saturate pool performance over CIFS, the older VM typically can't get better than about 12 MB/s.

So from here, I'm thinking of two steps. First step:
  • Build a new VM based on Debian 10, virtio drivers and no LVM, zvol with sync=standard - this alone will be a big step forward
  • Move docker containers to new VM
  • Rearchitect Nextcloud (which seems to be the service most prone to performance issues):
    • Nextcloud and redis containers on VM
    • Move nextcloud mysql to a new iocage jail with a dedicated (and tuned) dataset - see the tuning sketch after this list
    • Continue to use CIFS to connect nextcloud to file storage on native zfs dataset
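
By "tuned" I mean something like the commonly suggested starting points for an InnoDB data dataset (dataset name is a placeholder, and none of this is gospel):

Code:
 zfs create tank/db/nextcloud-mysql
 zfs set recordsize=16K tank/db/nextcloud-mysql   # match InnoDB's 16 KiB page size
 zfs set atime=off tank/db/nextcloud-mysql
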
Second step:
  • Invest in (consumer grade) SSD storage and configure a new pool for VM storage. Configure with pair of SSDs as mirror. Investigate performance with sync=always, but probably go for sync=standard
  • Move VMs to new SSD pool
  • Consider moving nextcloud mysql from iocage jail to new SSD pool (probably still as a jail)
Now for something really heretical: given that I think of my VMs as expendable, maybe I don't need two SSDs in step 2. Maybe one is good enough... I don't know - just how unreliable are consumer grade SSDs (under light load)?

* OK, not all of my containers are configured this way - I'm lazy - but I'm getting there...
 

nickt

Contributor
Joined
Feb 27, 2015
Messages
131
Hi all - on rereading, I realise my last post read more like a statement than a question. It would be great to get some feedback on a couple of things:
  • Is it possible that a bhyve Debian guest with a virtio disk is able to pass sync write requests to the host zfs zvol storage (with sync=standard) and expect it to be handled correctly?
    • All the research I've done talks about iSCSI, ESXi, VMware and NFS, all of which would suggest the answer is no, but I can't find anything about bhyve + virtio
    • My performance testing from a Debian guest shows a difference between fio --sync=1 and --sync=0, suggesting something is going on
  • Any feedback on my thoughts about where to from here would be awesome
Many thanks
 