ZFS capacity issue


Heire

Dear all,

Recently I noticed we are getting the warning that our ZFS volume is approaching its capacity.
Our current setup: 10x4TB drives in a RAIDZ-2 configuration with ZFS on top.
Hardware: E3-1230 with 8GB RAM

We mainly want to use FreeNAS as an iSCSI target for our other equipment, which will store a lot of data on it. Because of a limitation on the vendor side we have to create multiple volumes of 1.5TB on the ZFS volume and then map them as iSCSI targets.
So I have configured 16x1.5TB volumes on our ZFS volume, which accounts for 24TB of storage and leaves roughly 8TB free.
Since ZFS needs additional headroom, we factored in an extra 20% of 24TB = 4.8TB of reserved space, which is less than the 8TB still free.
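For reference, each of those volumes corresponds to a zvol created roughly like this (the name is just an example):

zfs create -V 1.5T ZFS001/extent01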

For a long time the 8TB of free space was untouched, but recently I noticed the free space dropping, down to 2TB today, which triggers the message that the ZFS volume is at 91% capacity.

Now I'm afraid the remaining 2TB will not be sufficient, even though we reserved roughly 33% of additional headroom, and that the pool will eventually fill up completely and everything will stop working.
Does anyone have an idea why the ZFS usage keeps growing and what we can do to avoid this?
 

cyberjock

When using iSCSI you shouldn't fill a pool beyond 50%, for performance reasons. The reason your pool is slowly filling is that incoming iSCSI writes are smaller than the block size used when the extent was created. When your iSCSI file was created it used 128KB blocks (which is both fast and space-efficient). But as smaller writes come in, they use smaller blocks (which is slower and less efficient). Because of the inefficiency of smaller blocks you will always need more disk space than you think you need. Depending on your usage patterns it can actually be more than the quantity of data itself. In one test, writing 100GB of 512-byte blocks required 2.1TB. Yes, that's not a typo. You likely aren't doing 100% 512-byte writes, so your case won't be that extreme.
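To put rough numbers on that (back-of-the-envelope arithmetic only, not measurements):

100GB / 128KB = ~819,200 blocks
100GB / 4KB = ~26,214,400 blocks
100GB / 512B = ~209,715,200 blocks

Every block carries its own checksum and block pointer, and on RAIDZ its own parity and padding sectors, so the bookkeeping and parity overhead scale with the block count.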

I will warn you that you can expect your iSCSI extents to perform more and more slowly as the pool fills and fragments your extents to the extreme (especially at 95% full). The performance lost to fragmentation cannot be recovered later, because the 'damage' (the fragmentation) is already done. The only solution at that point is to destroy the pool and restore from backup.

For a pool that size you should have MUCH more than 8GB of RAM too, but that's a whole different discussion.
 

Heire


Regarding the RAM, I am aware that 8GB is on the low side; we will keep that in mind. Nevertheless, the storage is more concerning to me.
Is there a way to see how fragmented the data is across the storage?
[admin@svstorage01] /mnt/ZFS001# zfs get all ZFS001
NAME PROPERTY VALUE SOURCE
ZFS001 type filesystem -
ZFS001 creation Thu May 8 12:17 2014 -
ZFS001 used 31.5T -
ZFS001 available 1.13T -
ZFS001 referenced 329K -
ZFS001 compressratio 1.00x -
ZFS001 mounted yes -
ZFS001 quota none default
ZFS001 reservation none default
ZFS001 recordsize 128K default
ZFS001 mountpoint /mnt/ZFS001 default
ZFS001 sharenfs off default
ZFS001 checksum on default
ZFS001 compression lz4 local
ZFS001 atime on default
ZFS001 devices on default
ZFS001 exec on default
ZFS001 setuid on default
ZFS001 readonly off default
ZFS001 jailed off default
ZFS001 snapdir hidden default
ZFS001 aclmode passthrough local
ZFS001 aclinherit passthrough local
ZFS001 canmount on default
ZFS001 xattr off temporary
ZFS001 copies 1 default
ZFS001 version 5 -
ZFS001 utf8only off -
ZFS001 normalization none -
ZFS001 casesensitivity sensitive -
ZFS001 vscan off default
ZFS001 nbmand off default
ZFS001 sharesmb off default
ZFS001 refquota none default
ZFS001 refreservation none default
ZFS001 primarycache all default
ZFS001 secondarycache all default
ZFS001 usedbysnapshots 0 -
ZFS001 usedbydataset 329K -
ZFS001 usedbychildren 31.5T -
ZFS001 usedbyrefreservation 0 -
ZFS001 logbias latency default
ZFS001 dedup off local
ZFS001 mlslabel -
ZFS001 sync standard default
ZFS001 refcompressratio 1.00x -
ZFS001 written 329K -
ZFS001 logicalused 13.7T -
ZFS001 logicalreferenced 43.5K -

It doesn't make sense to me that we would need to over-provision our storage by more than 50% in a RAIDZ2 setup; in that case it would be better to use hardware RAID and just export it as an iSCSI target. There must be some improvement possible, for example by changing the block size.
Also, 10 disks is indeed not the ideal vdev width in this case, but I wouldn't have expected it to have such an influence.
 

cyberjock

There is no way I know of to identify fragmentation except to examine the ZFS metadata with zdb. Unless you are an elite pro at zdb, you aren't likely to be able to figure it out either. :P
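If you really want to poke at it, something along these lines will walk the pool and print block statistics, including a block size histogram (it can take hours on a pool this size, is best run while the pool is quiet, and on FreeNAS you may need to point zdb at the system's zpool.cache):

zdb -bb ZFS001
zdb -U /data/zfs/zpool.cache -bb ZFS001

Interpreting the output is the hard part, which is rather my point.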

It doesn't make sense to me that we would need to over-provision our storage by more than 50% in a RAIDZ2 setup; in that case it would be better to use hardware RAID and just export it as an iSCSI target. There must be some improvement possible, for example by changing the block size.
Also, 10 disks is indeed not the ideal vdev width in this case, but I wouldn't have expected it to have such an influence.

It makes perfect sense if you understand how ZFS works under the hood. ZFS checksums every block and keeps metadata for it. If you have 1MB worth of 128KB blocks you have far fewer checksums and far less metadata than if you have 1MB worth of 512-byte blocks. This is one of many reasons why we recommend staying below 50%: it allows for some future "expansion" as your data becomes less efficiently stored because of smaller block sizes. Things get even worse with RAIDZ1/Z2/Z3, because every block also incurs its own parity writes, and those aren't necessarily cleanly organized either.
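As a rough illustration, assuming 4K-sector disks (ashift=12) on a RAIDZ2 vdev:

4KB block -> 1 data sector + 2 parity sectors (200% parity overhead)
128KB block -> 32 data sectors, parity spread over several rows (on the order of 30% overhead on a wide vdev)

Pile up millions of small blocks and the 'spare' space disappears fast.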

I'm not sure how you feel about your iSCSI performance, but for a whole host of reasons mirrors are the recommended (and virtually the only truly supported) design for iSCSI. I run iSCSI on my RAIDZ3 for experimentation only, and it's painful to use. It definitely tests my patience when I have to use it.

Keep in mind that if your pool fills to 100% you can expect your iSCSI extent to crash and potentially become unrecoverable. In that case all of the data in the extent may become inaccessible. This has happened to a few people who didn't monitor their disk space and expand in time.
 

Heire


I have been reading up on ZFS (and there is a lot out there), but I hadn't come across the 50% margin for ZFS in combination with iSCSI, only the general 20% margin for ZFS usage.
So in general this looks to be an issue of iSCSI block sizes in combination with ZFS metadata.
Regarding performance, I did a test before putting the system into use with the iSCSI devices and was happy with the result (80-90MB/s), which is enough for us at the moment. Currently we use roughly 70Mbps, so we have plenty of headroom there.
I have looked at the iSCSI configuration and noticed that the default configuration uses a 512-byte logical block size. So it would make more sense to increase this, no?
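For what it's worth, the sector size the disks themselves report can be checked on FreeBSD with something like this (the device name is just an example):

diskinfo -v /dev/da0

which prints the logical sector size and stripe size among other things.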

Since we use the iSCSI storage for storing large amounts of data (big files, mostly), it wouldn't make sense to use smaller block sizes, as this would create a lot of metadata.
What would be the rule of thumb for configuring a more appropriate block size?

Kind Regards
 

cyberjock

Where are you reading that the iSCSI default block size is 512? I don't see anything in the iSCSI settings regarding block size. Even if there were, an iSCSI block size doesn't necessarily correlate to a ZFS block size. The 20% margin isn't completely accurate anymore either: ZFS changes allocation behavior at 95% full with the version of ZFS that FreeNAS 9.2.0+ uses (not sure about earlier versions).

I hate to break it to you, but after being a ZFS aficionado for more than two years, I can tell you there's very little ZFS documentation to be found. If you aren't keen on reading the source code, or don't know someone who does and has the answers, you're going to find yourself in the dark. Things are constantly changing and nobody goes back to update their old blog posts from years ago. Even though I consider myself pretty knowledgeable about ZFS, I've been very disappointed in the lack of good documentation.

As block sizes get bigger your storage becomes more space-efficient, but every small read or write becomes more costly. This is why ZFS dynamically adjusts its block size and you shouldn't change it.
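If you just want to see what your dataset and zvols are currently set to, you can read the properties back (the zvol name here is an example):

zfs get recordsize ZFS001
zfs get volblocksize ZFS001/extent01

recordsize is a per-file maximum for file-based datasets, while volblocksize is fixed for a zvol when it is created.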
 

Heire

Attached is an example of the configuration where you can see the "logical block size" at the iSCSI level. I'm not sure how this maps to the ZFS level, but I would think it has nothing to do with ZFS, since iSCSI is a layer on top of it.
The information I found so far was indeed somewhat outdated, but it gave me a rough idea about ZFS. So ZFS dynamically adjusts its block sizes, but is there a way to know how efficient it is at that?

I'm getting more and more worried with each of your answers :) but if ZFS has such an impact on storage, then I fail to see its benefit, because you would need much more capacity than with plain hardware RAID. In the end the cost of a good RAID card might be less than the extra capacity ZFS requires.
 

Attachments

  • iscsi.jpg

Heire

Hm, I have to correct myself: we actually have 12x4TB drives in RAIDZ2. An important note, by the way, as I know the vdev width has an impact on this.
Is this perhaps the reason why the ZFS usage is growing so much, because the blocks don't line up with the vdev width?
Hence it would be nice to know how to see how ZFS is handling such a setup.
 

Heire

Hmm, I just found your presentation slides on iSCSI and ZFS.
I must admit I already knew most of it from working with FreeNAS, but this is the first time we are using iSCSI as a medium, and I hadn't heard about this kind of impact before. Oh well, that's the learning curve :)

Your slides explain a lot and are very useful; however, now I'm kind of stuck with our setup.
The reason we put 12 disks in RAIDZ2 was that I found some benchmarks on the internet which still indicated good performance with 12 disks, even though it's not perfectly aligned (and the server came pre-installed with 12 disks, so yeah).
Based on your slides, I would say our setup has the following flaws:
- RAM is perhaps too low, but as we mainly use the box for storing data we don't really need fast read access; at least it's 8GB.
- 12 disks is not a perfect fit for the 4K boundaries; I still wonder what the impact is on this particular setup, though.
- iSCSI on ZFS sounds very bad, certainly with a lot of zvols, which is our case. Any idea why this has such an impact? Is it because of fragmentation from the small block sizes, or something else?
- A maximum of 11 disks per vdev; but if we need more capacity, every additional RAIDZ2 vdev costs us two more disks for parity. That doesn't sound flexible and only adds more disks to the setup. So can we still keep the formula of 2^N + 2 above 11 disks?
- Are there any benchmarks or known issues with UFS + iSCSI?

Kind Regards
 

solarisguy

So can we still keep the formula of 2^N + 2 above 11 disks?
That formula describes best practice only as far as one aspect of performance goes. If you keep the number of parity disks at two (2) while increasing the number of data drives, your data security goes down.
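Just expanding the formula for two parity disks (P = 2), the recommended RAIDZ2 widths are:

2^1 + 2 = 4 disks (2 data + 2 parity)
2^2 + 2 = 6 disks (4 data + 2 parity)
2^3 + 2 = 10 disks (8 data + 2 parity)

A 12-disk vdev is not on that list, which is why the alignment question keeps coming up.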

P.S.
If you read the recommendations for the competing ZFS implementation (Solaris), you get a maximum of 8 data disks + parity disk(s) in each RAID-Z*, although one can argue that it really reads as 4 data disks + 2 or 3 parity disks...
 

cyberjock

Heire,

Sorry, but I'll have to keep this somewhat short and sweet. I've spent years of my life reading about ZFS and I can't always provide short or simple answers. For people who really need the right answers for their situation, I tell them to get on Skype or call me on the phone and pay for an hour or two of my time, and I'll talk about whatever they want in as much detail as they want. Everyone else will either have to take my word for it or do significant homework. :(

1. Per the manual (http://doc.freenas.org/index.php/ISCSI), that Logical Block Size represents the physical disk's sector size. If you have 4K disks you could set it to 4096. The default doesn't set a maximum size, it sets the minimum size, and off the top of my head I don't see any reason to change it. If you have a small write you don't need to artificially inflate it with padding.
2. People don't go with ZFS because it's cheaper. They go with ZFS because its data integrity cannot be matched by anything else out there. Hardware RAID doesn't have the data integrity that ZFS offers, so your comparison is somewhat invalid. If you went with ZFS because you thought it was cheaper, you weren't informed or you were lied to. ;)
3. Well, if you aren't following the recommended thumbrules for vdev sizes (2^N + P, and no more than 11 disks), you're suffering from both a performance penalty and a padding penalty. For home users I don't worry about it too much because they usually write larger blocks. But your use case is very much not "home use", and as such you should have taken the time to follow that thumbrule. I'm personally using an 18-disk RAIDZ3 single-vdev pool, and I know there are serious penalties for going that wide. I wanted to know just how much it sucked, and it sucks badly. As a machine for storing backups it would be great, but for any other use I'd never do it again. All that non-alignment you have turns into wasted slack space that you can never reclaim, and it's exacerbated by smaller block sizes, which is exactly what VMs do.
4. You shouldn't ever exceed an 11-disk vdev, period. In your example, if a 10-disk RAIDZ2 won't cover it, you add another 10-disk RAIDZ2. There's a practical limit to how many disks you want "striped" together in a vdev, so going beyond 10 would be penny wise but pound foolish. Does that mean you buy 10 disks now, buy 10 more in a year, and end up with 20% of those 20 disks as redundancy? It sure does. If your IT guy is worth his salt he should know this. And if you are so strapped for cash that you're going to argue you can't afford two more disks for redundancy, then I'd say your data can't possibly be valuable enough to pay to store. The reality is there's loads of data that has very little value, but people won't throw it away. If it's not worth $150x2, throw it away.
5. UFS is a mess. You can't easily do huge volumes with it in FreeNAS, and UFS support is being removed from FreeNAS as we speak. So if you think UFS is your savior, you should consider a different base OS instead.
6. Do not read benchmarks from some blog, decide they're good enough for you, and then expect that performance. 99% of people benchmarking ZFS are idiots and have no clue what they are doing. You cannot benchmark ZFS the same way you benchmark other file systems. Other file systems don't take up 7/8 of your system RAM just for a cache!

ZFS is not a toy. It's extremely complex. Most of the forum users here have no appreciation for just how complex ZFS is; I can't overstate it. If you don't know what you are doing, and/or this is for a business application, you should follow every single thumbrule in the forums, the manual, and my presentation to the letter. If this is for home use you can often break one or more of them, but only certain ones, and only to a certain extent.

Edit: And if you look at that thread jgreco posted in, notice that I didn't post. That was on purpose, because he was clearly hellbent on doing things wrong, not doing the homework, and expecting miracles. You can't do things wrong and somehow expect everything to work out. It won't.
 

jgreco

because he was clearly hellbent on doing things wrong

Yeah well after you spend money on big hardware there's an incentive to make your poor choices work.
 

cyberjock

Yeah well after you spend money on big hardware there's an incentive to make your poor choices work.

But being hellbent on getting it right after you've already done it wrong won't work out well for you or your pocketbook. ;)
 

jgreco

Yeah, well.... having been there, hindsight and all that.

Look at me. I've been screwing around with ZFS since about 2010, trying to figure out what would be the best fit for VM deployment. I still haven't pulled the trigger. In some ways, just doing it and getting it wrong is better than sitting in an indefinite holding pattern forever.

I'm really tempted to do something a little crazy like an Avoton box with 10GbE ports and all-SSD storage, but there are some open questions about what to do for SLOG... and even if I could get 64GB of RAM into it, I wonder if that would be enough. It's also hella expensive to get enough SSD... An E5-based box would definitely be workable, but also a lot more watts. Sigh.
 

Heire

Hey Cyberjock, thanks for the answers and the explanation.

Don't get me wrong, I understand that ZFS is a complex matter. That's exactly why I prefer ZFS over hardware RAID: because of the data integrity it provides. I do understand that it requires some homework, which I did more or less, but there is always a part that you miss :)

Originally we started with 2x RAIDZ2 (6 disks each) according to the thumb rule, but we ran into the storage limits of our 12 disks. We took a shot at a single 12-disk RAIDZ2; while it initially gave enough performance in our benchmark, that of course said nothing about its behavior when approaching full capacity, and most likely we now have a lot of fragmentation causing it to fill up more than expected, given the vdev width.

I have been using FreeNAS for a very long time already, but that is more "home" usage than this case. Nevertheless, shouldn't it be documented somewhere that using a lot of zvols on ZFS requires more storage? Or is it simply a matter of budgeting a 50% margin instead of 20%?
To come back to the stripe and logical block size: can I assume that the logical block size only applies to the virtual disk and is not related to ZFS? That is still not clear to me.

Nevertheless, to continue with our issue, what would the recommendation be now?
- UFS + zvols? But UFS will be removed, so that would be a problem in the future.
- ZFS + zvols? We need a lot of zvols because of vendor restrictions (16 zvols as of today).
What would your disk/pool recommendation and thumb rule be if we want roughly 25TB of net storage?

25TB + 50% margin = ~38TB -> RAIDZ2 -> one 10-disk vdev (32TB) + one 6-disk vdev (16TB) --> 48TB of raw data capacity. Would this be a better choice with ZFS and zvols?
 

cyberjock

Honestly Heire, I'm not sure I want to give more advice. Some anecdotal chatter on a forum isn't enough, in my opinion, to give good advice on how to proceed with your situation. There can be small things you consider unimportant that are actually very important, and I may not be aware of them because of the "forum setting". I'd rather give no advice than bad advice, especially since you can easily drop huge amounts of money on the wrong hardware.

- I consider UFS dead, though. Not even an option in my opinion.
- ZFS should be laid out as mirrored pairs when handling iSCSI I/O. File-based extents are typically faster and better than zvols, so if you were to redo your system I'd use file-based extents. They are also easier to move to a new system in the future if you want.

Normally what I tell people who want iSCSI is to just do something like 10 disks in mirrored pairs. If disk space starts getting low, or they want more performance, add another mirror. Adding mirrors always helps, but at some point you may also have to look at more RAM, L2ARC, and/or ZIL.
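As a rough sketch of that layout (pool and device names are examples; in FreeNAS you would normally do this through the GUI volume manager rather than the command line):

zpool create tank mirror da0 da1 mirror da2 da3 mirror da4 da5 mirror da6 da7 mirror da8 da9
zpool add tank mirror da10 da11

The first command builds a pool of five 2-way mirrors striped together; the second grows it later by adding another mirror vdev.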
 

Heire

Hey Cyberjock,

Well, I appreciate your advice on this matter and I understand your reluctance, but I'm here to learn and to figure out which direction would be best for our storage.
Next week I'll have a spare server I can run some tests on, so I would like to try out some of the ideas you have given so far.

I understand that occasional forum chatter doesn't provide the level of detail you need, but I don't know whether you still want to give me some advice or whether you would rather have more information about our usage. I'm open to anything and would like to learn more about the possibilities, or the direction that would suit us better.
 

cyberjock

If you'll have a spare server definitely play around. Do all the nasty things we tell people not to do if you have time.
 