Metadata -- what is it, and what does it do?


Hexland (Contributor, joined Jan 17, 2012, 110 messages)
So, I'm currently experiencing an issue with my FreeNAS box using a huge iSCSI device extent (outlined here -- http://forums.freenas.org/showthread.php?7448-Hanging-on-boot-import)

The crux of the issue seems to be that, in addition to the 2 of my 6 x 2Tb drives dedicated to RAID-Z2 redundancy information, the ZFS Volume I created to hold the iSCSI target also needs at least 100% of its own size for 'metadata'.

For example: I create a non-sparse ZFS Volume of 6Tb on a 7.1Tb pool, leaving 1.1Tb of free space available. The 6Tb volume is then handed off to iSCSI as a block device (so that Windows can attach to it and format it as NTFS).

My understanding was that the Volume would be fixed at 6Tb.

However, Windows gets as far as copying roughly 3.5Tb over to the iSCSI device before I run out of space on the parent pool.

It would appear that the ZFS Volume (while fixed at 6Tb) requires additional space for 'metadata' as the volume fills up. That matches many of the documents I found, which said that 'metadata' is not allocated at volume creation time and is instead allocated on demand.


Now, what I don't understand is (and I've been unable to find any real explanation)...
a) what the hell is metadata
b) why does a block device like an iSCSI extent require metadata

Is this a bug in the volume creation? I found this bug (from a link supplied by paleoN) which smells a bit suspicious -- https://www.illumos.org/issues/437

Can any of you ZFS experts explain the function of metadata to me (in idiot's terms)?

Thanks
 

paleoN (Wizard, joined Apr 22, 2012, 1,403 messages)
Now, what I don't understand is (and I've been unable to find any real explanation)...
a) what the hell is metadata
b) why does a block device like an iSCSI extent require metadata
Skipping (a) for now. As for (b): because it's on a ZFS volume, and ZFS requires metadata ;)

Is this a bug in the volume creation? I found this bug (from a link supplied by paleoN) which smells a bit suspicious -- https://www.illumos.org/issues/437
I'm not sure if you saw bug #430; 437 is considered a dup of it. From the formula in comment #5, size/blocksize*(blocksize+checksum), if it is using 512b blocks then a 4Tb volume would take up 6Tb of space including metadata.
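
Spelling that out with the formula's units (volume size in Gb, block and checksum sizes in bytes, assuming 256 bytes of checksum metadata per block):
Code:
4096/512 * (512 + 256) = 8 * 768 = 6144

So roughly 6Tb on disk for a 4Tb zvol, about 2Tb of it metadata.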

I was curious so I created a 1Tb zvol on my mirror. It should have about 300Gb on it now. When I get in front of it again I will finish filling it and see how much space it's using.
 

Hexland (Contributor)
Thanks paleoN... I might try testing it again tonight (with a smaller volume, created manually with the -b parameter).

I'm curious though... does this mean that every block written to a ZFS volume or dataset is 50% larger than the actual block data? This is in addition to the redundancy information stored in the RAID-Z2 vdevs?

(So a 2Tb drive or volume can only effectively store 1.3Tb of user data)?

That can't possibly be right?
 

paleoN (Wizard)
With FreeNAS-8.2.0-BETA3-x64 and a 1TB zvol I can't confirm your problem. What version of FreeNAS are you currently running?
Code:
# zfs list
NAME            USED  AVAIL  REFER  MOUNTPOINT
tank           1.32T   475G   321G  /mnt/tank
tank/zvolTest  1.01T   475G  1.01T  -

# zfs get all
NAME           PROPERTY              VALUE                  SOURCE
tank           used                  1.32T                  -
tank           available             475G                   -
tank           referenced            321G                   -
tank           compressratio         1.00x                  -
tank           quota                 none                   default
tank           reservation           none                   default
tank           recordsize            128K                   default
tank           readonly              off                    default
tank           copies                1                      default
tank           version               4                      -
tank           refquota              none                   default
tank           refreservation        none                   default
tank           usedbysnapshots       0                      -
tank           usedbydataset         321G                   -
tank           usedbychildren        1.01T                  -
tank           usedbyrefreservation  0                      -
tank/zvolTest  type                  volume                 -
tank/zvolTest  creation              Thu Jun 14  2:22 2012  -
tank/zvolTest  used                  1.01T                  -
tank/zvolTest  available             475G                   -
tank/zvolTest  referenced            1.01T                  -
tank/zvolTest  compressratio         1.00x                  -
tank/zvolTest  reservation           none                   default
tank/zvolTest  volsize               1T                     -
tank/zvolTest  volblocksize          8K                     -
tank/zvolTest  checksum              on                     default
tank/zvolTest  compression           off                    default
tank/zvolTest  readonly              off                    default
tank/zvolTest  shareiscsi            off                    default
tank/zvolTest  copies                1                      default
tank/zvolTest  refreservation        1T                     local
tank/zvolTest  primarycache          all                    default
tank/zvolTest  secondarycache        all                    default
tank/zvolTest  usedbysnapshots       0                      -
tank/zvolTest  usedbydataset         1.01T                  -
tank/zvolTest  usedbychildren        0                      -
tank/zvolTest  usedbyrefreservation  0                      -

zvolTest, a 1Tb zvol with 7Gb free space on it, is taking up 1.01Tb of space on the NAS, which is right according to the above formula using an 8K record size.

I'm curious though... does this mean that every block written to a ZFS volume or dataset is 50% larger than the actual block data? This is in addition to the redundancy information stored in the RAID-Z2 vdevs?
No. It's a well-known, hmmm, design consequence of ZFS that very small record sizes (512b being the smallest) generate massive amounts of metadata. For a record size of 512b and a 4Tb zvol you need an additional 2Tb of space for the metadata. If we go up to just 8K it gets a lot better.
Code:
size/blocksize * (blocksize + checksum)    [volume size in Gb, block and checksum sizes in bytes]

4096/8192 * (8192 + 256) = 0.5 * 8448 = 4224
About 4.125Tb of space including metadata, with a much more reasonable 0.125Tb taken by the metadata. A 128Kb record size, the default for datasets, does even better: 4104GB, or 4.0078125Tb total.
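
If you want to reproduce these numbers yourself, a quick shell loop does it (same back-of-the-envelope estimate of 256 bytes of metadata per block, not exact ZFS accounting):
Code:
# apparent size of a 4096Gb zvol at a few block sizes:
# size_gb * (blocksize + 256) / blocksize
for bs in 512 8192 131072; do
  echo "blocksize ${bs}: $(echo "4096 * ($bs + 256) / $bs" | bc) Gb total"
done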
 

Hexland (Contributor)
I'm running "FreeNAS-8.2.0-BETA4-x64 (r11722)"

Interesting... I appear to be seeing slightly different numbers from you.

I created a 100Gb volume manually from the command line (with an 8K block size)
Code:
zfs create -b 8192 -o compression=off -o copies=1 -o shareiscsi=on -V 100GB ZPOOL01/ZVOL01


'zfs list' initially starts off fine
Code:
NAME             USED  AVAIL  REFER  MOUNTPOINT
ZPOOL01          100G  7.03T   256K  /mnt/ZPOOL01
ZPOOL01/ZVOL01   100G  7.13T   144K  -


But once I'd filled up the volume from Windows over iSCSI, it appears to be using 2x the space:

Code:
NAME             USED  AVAIL  REFER  MOUNTPOINT
ZPOOL01          202G  6.93T   256K  /mnt/ZPOOL01
ZPOOL01/ZVOL01   202G  6.93T   202G  -





So, just for fun... I created a 100Gb file extent (instead of a ZFS volume), and did the same thing... filled all 100Gb from Windows via iSCSI.

'zfs list' reports only 100Gb used...

Code:
NAME      USED  AVAIL  REFER  MOUNTPOINT
ZPOOL01   100G  7.03T   100G  /mnt/ZPOOL01

-rw-r--r--  1 root  wheel  107374182400 Jun 16 11:12 iscsiExtent
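
(The extent file itself was created by the FreeNAS GUI; done by hand, an equivalent non-sparse 100Gb file would be something like this.)
Code:
# non-sparse 100Gb backing file for the iSCSI file extent
dd if=/dev/zero of=/mnt/ZPOOL01/iscsiExtent bs=1m count=102400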





As another test, I created a ZFS Dataset ("Shared") and limited it with both a quota and a reservation of 100Gb. This time, I shared the Dataset via CIFS/SMB and filled it up with files from Windows.

'zfs list' reports only 100Gb used...
Code:
NAME             USED  AVAIL  REFER  MOUNTPOINT
ZPOOL01          100G  7.03T   312K  /mnt/ZPOOL01
ZPOOL01/Shared   100G      0   100G  /mnt/ZPOOL01/Shared
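
(I set that Dataset up through the FreeNAS GUI; the CLI equivalent should be roughly the following.)
Code:
zfs create ZPOOL01/Shared
zfs set quota=100G ZPOOL01/Shared
zfs set reservation=100G ZPOOL01/Shared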





I can only surmise that sharing a ZFS Volume via iSCSI creates additional metadata that is not required via a normal network share or even an iSCSI file extent. If I didn't know any better, I'd say that it'd be something silly like each low-level iSCSI block being treated as a file at the ZFS level, incurring some metadata overhead with each block write. An iSCSI share via a file extent would probably only incur a single file's worth of metadata. That's just pure guesswork though.

I guess I'll either go with a file extent or a straight CIFS share back to my WHS box. Not what I originally intended, but I can't really afford to lose 2x the space when I've already dedicated 2 x 2Tb drives to recovery information via RAID-Z2.
 

paleoN (Wizard)
I can only surmise that sharing a ZFS Volume via iSCSI creates additional metadata that is not required via a normal network share or even an iSCSI file extent.
Your surmise is partly wrong :eek:

If I didn't know any better, I'd say that it'd be something silly like each low-level iSCSI block being treated as a file at the ZFS level, incurring some metadata overhead with each block write. An iSCSI share via a file extent would probably only incur a single file's worth of metadata. That's just pure guesswork though.
Here you are partly right, except that an iSCSI file extent's single file's worth of metadata isn't just one record's worth. With the default/maximum record size of 128Kb, your 100Gb file extent is using 819,200 records, each creating metadata, for an additional 0.1953125Gb of space.

Now, device extents do create more metadata, but that's because they use a record size of 8Kb: 100Gb at an 8K record size is 13,107,200 records, for 3.125Gb of additional space. Do note the space is not double.
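
The arithmetic, again assuming 256 bytes of metadata per record:
Code:
100Gb / 128Kb = 107374182400 / 131072  =    819,200 records;    819,200 * 256b = 0.1953125Gb
100Gb /   8Kb = 107374182400 /   8192  = 13,107,200 records; 13,107,200 * 256b = 3.125Gb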

From FreeNAS-8.2.0-BETA4-x64 (r11722) I created some 11Gb zvols.

  • First one was created normally and used about 11Gb.

    I then set copies=2 on the main volume.

  • I created the second one and had to set copies=1 as it inherited the settings, filled it and it used about 11Gb.

  • For the third I created it from the cli with your command, filled it and it used about 11Gb.

  • The last one I created normally, with copies=2 on the main volume which it inherited, filled it and it used about 22Gb, the same behavior you are seeing.

Are you sure you don't somehow, somewhere, have copies=2 on? You could try creating some small zvols, e.g. 11Gb, and setting the copies property to see if you get any different results.
Code:
zfs set copies=1 ZPOOL01/ZVOL01
zfs set copies=2 ZPOOL01/ZVOL02
zfs set copies=3 ZPOOL01/ZVOL03
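
A quick way to check whether copies is set or inherited anywhere on the pool (pool name taken from your output above):
Code:
zfs get -r copies ZPOOL01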


Hopefully my math is correct above :) I probably should do that metadata post. There are some links I wanted to track down first.
 

Hexland (Contributor)
Definitely not setting the copies parameter (or inheriting it from anywhere).

For now, I've created a bunch of thin-provisioned file extents, and all is working well. Transfer speeds are still well up in the 70-100Mb/s range, so I'm fine with just going with that.

I tried to go with CIFS, but I couldn't get the shares mounted on the WHS server properly and still have WHS use them as ServerFolders (I tried mounting them as hard links so the server would see them as folders and not network drives, but it was too clever for its own good)... so I decided to go with a bunch of thin-provisioned iSCSI devices (using file extents on the FreeNAS side).

I'll double-check stuff when I get home tonight -- I'll try the device extent thing again, and confirm the 'zfs get all' results -- but I'm absolutely positive that it's all default settings.

Could it be a 4K formatting issue perhaps? In my array of 6 x 2Tb drives, I have a single WD-EARS drive, which forced me to format the zpool as 4K.

The zfs list command was definitely reporting 200Gb of space used for the 100Gb zvol.
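
(If it's relevant, the pool's ashift can apparently be checked with zdb; on FreeNAS I believe it needs to be pointed at the cache file, so something like the line below, where the cache path is my assumption.)
Code:
zdb -U /data/zfs/zpool.cache -C ZPOOL01 | grep ashift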
 

paleoN (Wizard)
For now, I've created a bunch of thin-provisioned file extents, and all is working well. Transfer speeds are still well up in the 70-100Mb/s range, so I'm fine with just going with that.
Is that faster than the device extents? Either way, that's great performance and it's working properly. For most pools I've seen under 80% utilization being recommended. Certainly keep it under 90% total, though.

I'll double-check stuff when I get home tonight -- I'll try the device extent thing again, and confirm the 'zfs get all' results -- but I'm absolutely positive that it's all default settings.

Could it be a 4K formatting issue perhaps? In my array of 6 x 2Tb drives, I have a single WD-EARS drive, which forced me to format the zpool as 4K.
From everything you posted it all looked correct, except for using double the space. It was only after setting copies=2, or inheriting it, that I saw the same behavior you are experiencing.

My mirror is 4k formatted, and that wouldn't cause this problem anyway. At least I don't see how.
 

dtrounce (Cadet, joined Jun 21, 2012, 1 message)
This is interesting. I'm experiencing very similar problems with ZFS (though not using FreeNAS).

I'm using ZFSonLinux (latest spl-0.6.0-rc9, pool version 28, fs version 5) on Ubuntu Server 12.04, as I need to run it in Hyper-V and need the Linux Hyper-V integration components (which are not available for FreeNAS) to pass host disks through from Windows Server 2008 R2 to the VM. This is the best way I've found to have a ZFS-backed iSCSI target for NTFS. I need NTFS volumes rather than a simple Samba share on ZFS so that I can run DFSR between Windows Servers.

I have 8x 3TB disks, set up as a raidz pool (using -o ashift=12 for 4K alignment, as these are 4K disks). I then create a 17,892GB non-sparse zvol (with 26MB left over) and share it using iscsitarget (blockio, blocksize=4096). I then format this from Windows as NTFS using 4K sectors.
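
(Roughly what that setup looks like from the CLI; the disk names and target IQN below are placeholders, and the ietd.conf snippet assumes the stock IET config location.)
Code:
# raidz pool across the 8 disks, 4K-aligned
zpool create -o ashift=12 tank raidz /dev/sd[b-i]

# non-sparse zvol to back the NTFS volume
zfs create -V 17892G tank/ntfsvol

# /etc/iet/ietd.conf -- export the zvol as a blockio LUN
Target iqn.2012-06.local.zfsbox:tank.ntfsvol
    Lun 0 Path=/dev/zvol/tank/ntfsvol,Type=blockio,BlockSize=4096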

Copying data to the NTFS volume from Windows then uses space from the zvol at a rate of 1.71x what is actually written to the volume. Thus it is clear that the pool will run out of capacity at 10,460GB or so. This matches almost exactly the 6TB/3.5TB ≈ 1.71 ratio reported by Hexland.

I have no idea why this is the case. It seems that Hexland finally managed to avoid the problem by using file extents rather than zvols as the iscsitarget backing store. So I've created a blank sparse file with "dd if=/dev/zero of=/tank/iscsi bs=1 count=0 seek=17500G" and shared that as fileio with iscsitarget. That uses space from the zpool as expected.

So it's still not clear to me why the zvol is consuming space at 1.71x the data that is written to it over iSCSI. A file extent gets around the problem, but it doesn't seem like it should be necessary.

I'm obviously not using copies = 2 or anything like that.

The 1.71x ratio for the zvol is persistent, regardless of whether I 4K-align the zpool or set the iscsitarget blocksize to 4K.
 