TrueNAS-SCALE-22.12.0 | Sparse zvol showing considerably higher allocation than is actually in-use

essinghigh

Dabbler
Joined
Feb 3, 2023
Messages
19
Hi all, I've been searching for information on this for a fair while and am starting to go a little insane trying to wrap my head around the ZFS documentation.

I have a zvol for a Debian VM - mediasrv-35uf3. This is a sparse volume - so should only consume what it's using as far as I am aware. Total size is set at 5TiB.

root@truenas[~]# zfs list | grep media
Big Data/Virtual Machines/mediasrv-35uf3 1.70T 13.8T 1.70T -
root@truenas[~]#

I can see that 1.70T is used in the output of `zfs list`; however, this is not reflected inside the VM - I'm using considerably less.

Filesystem Size Used Avail Use% Mounted on
udev 3.9G 0 3.9G 0% /dev
tmpfs 794M 536K 793M 1% /run
/dev/vda2 5.0T 57G 4.7T 2% /
tmpfs 3.9G 4.0K 3.9G 1% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
/dev/vda1 511M 9.8M 502M 2% /boot/efi
tmpfs 794M 0 794M 0% /run/user/1000

Disk /dev/vda: 5 TiB, 5497558138880 bytes, 10737418240 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 0F233894-9A6A-4245-A7E1-C5FA54CBAC7E

Device Start End Sectors Size Type
/dev/vda1 2048 1050623 1048576 512M EFI System
/dev/vda2 1050624 10735319039 10734268416 5T Linux filesystem
/dev/vda3 10735319040 10737416191 2097152 1G Linux swap

root@mediasrv:~# fstrim / -v
/: 463.3 MiB (485847040 bytes) trimmed

In reality I'm using about 60G, but for the life of me I cannot figure out why this isn't reflected in the sparse volume. I don't have any snapshots or anything else that might be using space. I believe at one point I was using around 2TiB, so is it possible that the zvol expanded to cover that usage and then never reclaimed the space after it was freed inside the VM?

VM Disk Config: Mode=VirtIO , Disk Sector Size=Default
Zvol Config: Size=5.00TiB , Sync=Standard , Compression=lz4 , Deduplication=Off , Read-only=off , Snapdev=hidden
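
For what it's worth, this is roughly how I've been sanity-checking the zvol's accounting on the host (property names are straight from the zfs(8) man page; the dataset path is just my zvol):

Code:
# Space-accounting breakdown for the zvol (USEDSNAP / USEDDS / USEDREFRESERV columns)
zfs list -o space "Big Data/Virtual Machines/mediasrv-35uf3"
# How much data has actually been written vs. what ZFS reports as allocated
zfs get used,logicalused,compressratio "Big Data/Virtual Machines/mediasrv-35uf3"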

Is this normal? Is there a way for me to reclaim this space?

Thanks in advance for any help.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
so should only consume what it's using as far as I am aware

This is a misunderstanding on your part. ZFS has minimal visibility into what is "in use" inside a zvol. At best, ZFS can be notified via unmap/TRIM that a block is no longer in use. But let's say your zvol's block size is 16KB and you write something to the first two 512B virtual sectors: ZFS still allocates 16KB of space, stores your 1KB of data, and life moves on. If you attempt to free or overwrite that data from the client, there are some unexpected things that might happen. One is that if you have taken any snapshots, a new 16KB block is allocated and loaded up with the unaffected sector data from the old 16KB block, meaning you now have two 16KB blocks consumed.
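
If you want to see where the space is actually sitting, a quick sketch against your zvol (substitute your own dataset path as needed):

Code:
# volblocksize is the zvol's allocation unit; writes smaller than this still consume a full block
zfs get volblocksize "Big Data/Virtual Machines/mediasrv-35uf3"
# usedbydataset vs. usedbysnapshots shows whether rewritten blocks are being held by snapshots
zfs get usedbydataset,usedbysnapshots,usedbyrefreservation "Big Data/Virtual Machines/mediasrv-35uf3"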

You haven't given any details about your pool design; did you by any chance use RAIDZ?

 

essinghigh

Dabbler
Joined
Feb 3, 2023
Messages
19
This is a misunderstanding on your part. ZFS has minimal visibility into what is "in use" inside a zvol. At best, ZFS can be notified via unmap/TRIM that a block is no longer in use. But let's say your zvol's block size is 16KB and you write something to the first two 512B virtual sectors: ZFS still allocates 16KB of space, stores your 1KB of data, and life moves on. If you attempt to free or overwrite that data from the client, there are some unexpected things that might happen. One is that if you have taken any snapshots, a new 16KB block is allocated and loaded up with the unaffected sector data from the old 16KB block, meaning you now have two 16KB blocks consumed.

You haven't given any details about your pool design; did you by any chance use RAIDZ?


Hi jgreco,
Thanks for clearing up my misunderstanding. I'm new to ZFS as a whole so am still learning - very appreciative of your explanation.

In terms of pool design, I'm certainly not running what's recommended: two striped mixed-capacity vdevs. I'll be rebuilding with mirrors very soon, as I have two more drives on the way. However, I was hoping to figure out the cause of some of my issues before rebuilding, so that I hopefully don't run into the same thing again.

root@truenas[~]# zpool iostat 'Big Data' -v
capacity operations bandwidth
pool alloc free read write read write
-------------------------------------- ----- ----- ----- ----- ----- -----
Big Data 9.59T 14.0T 92 52 11.8M 11.3M
5eab754d-80a0-11ed-89a2-2cf05ddecc9b 4.77T 7.95T 44 26 5.76M 5.54M
5e8faca9-80a0-11ed-89a2-2cf05ddecc9b 4.82T 6.09T 47 26 6.03M 5.75M
cache - - - - - -
5ddea077-80a0-11ed-89a2-2cf05ddecc9b 361G 105G 257 28 4.33M 2.94M
-------------------------------------- ----- ----- ----- ----- ----- -----
root@truenas[~]#
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Thanks for clearing up my misunderstanding. I'm new to ZFS as a whole so am still learning - very appreciative of your explanation.

The various concepts have a variety of interactions that can be very difficult to understand. I spend a lot of my time trying to make some of the rough stuff more accessible to new users. Don't be intimidated by ZFS -- it IS intimidating -- but experiment and ask questions and you should eventually get there, or perhaps end up in Arkham Asylum. :smile:

zvol usage is particularly complicated because you have to keep in mind what *else* is happening to the *other* data that's been stored.
 

hege

Dabbler
Joined
May 16, 2023
Messages
13
I'm encountering a similar issue on TrueNAS SCALE 22.12.2. I have a sparse zvol (50G) on which I'm running a Debian VM with ext4. The VM's filesystem only uses 8G, yet the zvol is using 30G (24G written). Does the current implementation of sparse zvols + VirtIO QEMU disks support trim properly? I have enabled fstrim in the guest, and it reports that it freed 30G of space, but the zvol usage didn't decrease at all.

Several posts over on the Proxmox forum indicate that virtio-blk doesn't support trim but virtio-scsi does. Unfortunately, I couldn't find any way to switch the VirtIO disk type using the TrueNAS UI or midclt, and the underlying QEMU XML files indeed indicate that the virtio block driver is in use. Does this mean that sparse zvols are actually not properly trimmed for VirtIO VM disks?
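
For anyone else hitting this, one quick way to check whether the virtual disk even advertises discard to the guest is lsblk; all-zero DISC-GRAN/DISC-MAX values generally mean discard requests have nowhere to go:

Code:
# Run inside the VM; non-zero discard granularity/max means the disk accepts TRIM/discard
lsblk --discard /dev/vda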
 

hege

Dabbler
Joined
May 16, 2023
Messages
13
OK, I figured this one out. Based on this post, the qemu driver needs the discard option set. I did a virsh edit on the VM, added the discard option and restarted the VM with virsh, and suddenly fstrim made the sparse zvol shrink. Unfortunately the Truenas middleware will rewrite the XML files, so this is not the right long term solution.

So this seems to be a bug in Truenas Scale - the discard option needs to be set for VM disks backed by sparse zvols.

Code:
<driver name='qemu' type='raw' cache='none' io='threads' discard='unmap'/>
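
To verify it was really working, I compared the zvol's space usage on the host before and after running fstrim in the guest, along these lines (the dataset path here is only a placeholder - substitute your own zvol):

Code:
# On the TrueNAS host; run once before and once after `fstrim -v /` inside the VM
# ("tank/vms/debian-zvol" is a placeholder path)
zfs get -p used,logicalused,referenced tank/vms/debian-zvol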
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Hi @hege - Thanks for the clear instructions to reproduce this. Can you submit this as a bug using the "Report a Bug" link at the top of the forums, with an issue type of "Improvement"? I'll see if I can direct it over to the appropriate group in the development department.
 

hege

Dabbler
Joined
May 16, 2023
Messages
13
Hi @hege - Thanks for the clear instructions to reproduce this. Can you submit this as a bug using the "Report a Bug" link at the top of the forums, with an issue type of "Improvement"? I'll see if I can direct it over to the appropriate group in the development department.

I'd be happy to, but Jira says "User '...' doesn't have the 'Close Issues' permission" when I try to file a new improvement ticket. I just created my Jira account, perhaps it takes a while for permissions to be granted?
https://ixsystems.atlassian.net/sec...lassian.jira.plugins.jim-plugin:bulkCreateCsv
 

Turnspit

Dabbler
Joined
Jun 10, 2023
Messages
16
OK, I figured this one out. Based on this post, the qemu driver needs the discard option set. I did a virsh edit on the VM, added the discard option and restarted the VM with virsh, and suddenly fstrim made the sparse zvol shrink. Unfortunately the Truenas middleware will rewrite the XML files, so this is not the right long term solution.

So this seems to be a bug in Truenas Scale - the discard option needs to be set for VM disks backed by sparse zvols.

Code:
<driver name='qemu' type='raw' cache='none' io='threads' discard='unmap'/>

I'm facing the same issue: a VM at 80GB is already chomping up about 145GB of disk space, which is pretty frustrating.

Could you give some further details on how you managed to edit the XML? To be honest, I have no idea where to look for it.
 

Jadan1213

Dabbler
Joined
Apr 18, 2018
Messages
17
OK, I figured this one out. Based on this post, the qemu driver needs the discard option set. I did a virsh edit on the VM, added the discard option and restarted the VM with virsh, and suddenly fstrim made the sparse zvol shrink. Unfortunately the Truenas middleware will rewrite the XML files, so this is not the right long term solution.

So this seems to be a bug in Truenas Scale - the discard option needs to be set for VM disks backed by sparse zvols.

Code:
<driver name='qemu' type='raw' cache='none' io='threads' discard='unmap'/>

It was a little convoluted having to unmask the libvirt socket and use virsh edit to make the change, but FINALLY my zvol is back down to 17GiB for my Docker VM. I had to zfs send | zfs receive the zvol a couple of times, and each time it grew by about 10-15 GiB and fstrim did NOTHING! So THANK YOU for this! It definitely works.

For something so simple, and so necessary for proper trim/unmap on zvols, you'd think it would already be the default, or that there would at least be a way to enable it in the GUI.
 

Jadan1213

Dabbler
Joined
Apr 18, 2018
Messages
17
I'm facing the same issue: a VM at 80GB is already chomping up about 145GB of disk space, which is pretty frustrating.

Could you give some further details on how you managed to edit the XML? To be honest, I have no idea where to look for it.

First, unmask the libvirt socket:

Code:
systemctl unmask libvirtd.socket
systemctl restart libvirtd.service

Then list your VMs, take note of the name of the one you want to change, and shut it down:

Code:
virsh list
virsh shutdown <vm_name>

Once it has shut down, you can edit the XML:

Code:
virsh edit <vm_name>

It will ask which editor to use; I used nano. Scroll down (if necessary) until you find the entry for your zvol and add the discard='unmap' flag:

Code:
<devices>
  <emulator>/usr/bin/qemu-system-x86_64</emulator>
  <disk type='block' device='disk'>
    <driver name='qemu' type='raw' cache='none' io='threads' discard='unmap'/>
    <source dev='/dev/zvol/nvme01/vhds/Docker01-ivvcg2'/>
    <target dev='vda' bus='virtio'/>
    <boot order='1'/>
    <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
  </disk>

Use CTRL+O to save and CTRL+X to exit nano. Now you can restart the VM:

Code:
virsh start <vm_name>

Then run the trim command from the guest's CLI:

Code:
sudo fstrim -va

It will output some information about the trim, and if you go and check your dataset/zvol storage, it should be what you expect now. Keep in mind, this WILL BREAK the GUI for VM management. To make that part work again, just do the opposite:

Code:
systemctl mask libvirtd.socket
systemctl restart libvirtd.service

On reboot, it's likely you'd have to redo these steps since TrueNAS uses a config database. Enjoy.
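
If you'd rather not go through the interactive editor, the same edit can probably be scripted with virsh dumpxml / virsh define - an untested sketch, assuming your driver line still matches the io='threads' form shown above (and the middleware can of course still overwrite it later):

Code:
# Dump the current definition, add discard='unmap' to the qemu driver line, and re-define the VM
virsh dumpxml <vm_name> > /tmp/vm.xml
sed -i "s|io='threads'/>|io='threads' discard='unmap'/>|" /tmp/vm.xml
virsh define /tmp/vm.xml
virsh start <vm_name>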
 

hege

Dabbler
Joined
May 16, 2023
Messages
13
There is a way to make this persistent by changing the Python code (obviously, only do this if you understand what you're doing, but I have been running it with every minor release over the past few months with no obvious ill effects):
Code:
sudo nano /usr/lib/python3/dist-packages/middlewared/plugins/vm/devices/storage_devices.py

then change this line:
Code:
create_element('driver', name='qemu', type='raw', cache='none', io=iotype.lower()),

to this:
Code:
create_element('driver', name='qemu', type='raw', cache='none', io=iotype.lower(), discard='unmap'),

Then reboot. Afterwards the VM should have this flag set. Obviously the next official upgrade will revert to the old behavior; you can probably automate this change to the middlewared code with a startup script. Please upvote https://ixsystems.atlassian.net/browse/NAS-122018 so the devs consider it for the next major release.
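
After the reboot you can sanity-check that the flag actually made it into the generated domain XML (the libvirt socket has to be unmasked, as described earlier in the thread, for virsh to work):

Code:
# Should now print the driver element with discard='unmap'
virsh dumpxml <vm_name> | grep "driver name='qemu'"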
 

Jadan1213

Dabbler
Joined
Apr 18, 2018
Messages
17
There is a way to make this persistent by changing the Python code (obviously, only do this if you understand what you're doing, but I have been running it with every minor release over the past few months with no obvious ill effects):
Code:
sudo nano /usr/lib/python3/dist-packages/middlewared/plugins/vm/devices/storage_devices.py

then change this line:
Code:
create_element('driver', name='qemu', type='raw', cache='none', io=iotype.lower()),

to this:
Code:
create_element('driver', name='qemu', type='raw', cache='none', io=iotype.lower(), discard='unmap'),

Then reboot. Afterwards the VM should have this flag set. Obviously the next official upgrade will revert to the old behavior; you can probably automate this change to the middlewared code with a startup script. Please upvote https://ixsystems.atlassian.net/browse/NAS-122018 so the devs consider it for the next major release.
Thank you for this - I'll likely apply this tweak. As soon as I figure out how to upvote it, I will!
 

essinghigh

Dabbler
Joined
Feb 3, 2023
Messages
19
There is a way to make this persistent by changing the Python code (obviously, only do this if you understand what you're doing, but I have been running it with every minor release over the past few months with no obvious ill effects):
Code:
sudo nano /usr/lib/python3/dist-packages/middlewared/plugins/vm/devices/storage_devices.py

then change this line:
Code:
create_element('driver', name='qemu', type='raw', cache='none', io=iotype.lower()),

to this:
Code:
create_element('driver', name='qemu', type='raw', cache='none', io=iotype.lower(), discard='unmap'),

Then reboot. Afterwards the VM should have this flag set. Obviously the next official upgrade will revert to the old behavior; you can probably automate this change to the middlewared code with a startup script. Please upvote https://ixsystems.atlassian.net/browse/NAS-122018 so the devs consider it for the next major release.
Thanks, confirming this has worked for me as well. I've added a vote for the issue.
EDIT: I've seen a 150GiB reduction from this on my most storage-heavy VM - great catch!
 
Last edited:

mihies

Dabbler
Joined
Jan 6, 2022
Messages
32
Was this change ever implemented in TrueNAS SCALE? It doesn't look like it from the report on Jira.
 

essinghigh

Dabbler
Joined
Feb 3, 2023
Messages
19
Was this change ever implemented in TrueNAS SCALE? It doesn't look like it from the report on Jira.
As of TrueNAS-23.10.1, no. I still apply the workaround manually on each update.

In the meantime, I have a script saved in /root that reapplies the change when I update:

Code:
sed -i "s/create_element('driver', name='qemu', type='raw', cache='none', io=iotype.lower()),/create_element('driver', name='qemu', type='raw', cache='none', io=iotype.lower(), discard='unmap'),/" /usr/lib/python3/dist-packages/middlewared/plugins/vm/devices/storage_devices.py
midclt call system.reboot
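
If you want to be a little more defensive about re-running it, the sed can be guarded so it only patches (and reboots) when the flag is actually missing - a rough sketch:

Code:
f=/usr/lib/python3/dist-packages/middlewared/plugins/vm/devices/storage_devices.py
# Only patch and reboot if discard='unmap' hasn't been applied to this install yet
if ! grep -q "discard='unmap'" "$f"; then
    sed -i "s/io=iotype.lower()),/io=iotype.lower(), discard='unmap'),/" "$f"
    midclt call system.reboot
fi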
 
Last edited:

essinghigh

Dabbler
Joined
Feb 3, 2023
Messages
19
Hi @hege - Thanks for the clear instructions to reproduce this. Can you submit this as a bug using the "Report a Bug" link at the top of the forums, with an issue type of "Improvement"? I'll see if I can direct it over to the appropriate group in the development department.
Hey HoneyBadger, any luck with getting this looked at? It's a pretty easy fix, so I assumed it would have gone through relatively quickly. There doesn't seem to have been any activity since it was raised back in May last year.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Hey HoneyBadger, any luck with getting this looked at? It's a pretty easy fix, so I assumed it would have gone through relatively quickly. There doesn't seem to have been any activity since it was raised back in May last year.

It's been merged for the Dragonfish beta on https://ixsystems.atlassian.net/browse/NAS-125642 - let me see if I can prod for a backport to Cobia.

Edit: Just for safety, I'd suggest you change the reboot call in your script to midclt call system.reboot, which will ensure a proper shutdown sequence through the TrueNAS middleware.
 