80% capacity fill rule - How far past that is safe?

jgreco · Jan 14, 2016

I suppose I should point out that pool hackery of these sorts probably voids the FreeNAS "do-via-GUI" warranty-and-warning.

The general idea of the snapshot-to-snapshot thing is that if you've got a highly fragmented pool, the act of rewriting it into another snapshot defrags the data somewhat. The idea is that after a single local disk copy you could then export the new snapshot and clear the old one. I think that'd be ... difficult ... within the GUI framework, but not impossible.

A different technique, where you use zfs send to dump into a local file, and then wipe your dataset, and then restore from the local file, is a little more terrifying (because you're trusting the integrity of the file) but I think could do a better job of defragging. This is not possible to do within the GUI framework, but might be less tinkering with the GUI.

mattlach · Jan 14, 2016

jgreco said:
I suppose I should point out that pool hackery of these sorts probably voids the FreeNAS "do-via-GUI" warranty-and-warning.

Whoah! We have a warranty? :p

jgreco said:
The general idea of the snapshot-to-snapshot thing is that if you've got a highly fragmented pool, the act of rewriting it into another snapshot defrags the data somewhat. The idea is that after a single local disk copy you could then export the new snapshot and clear the old one. I think that'd be ... difficult ... within the GUI framework, but not impossible.

A different technique, where you use zfs send to dump into a local file, and then wipe your dataset, and then restore from the local file, is a little more terrifying (because you're trusting the integrity of the file) but I think could do a better job of defragging. This is not possible to do within the GUI framework, but might be less tinkering with the GUI.

Yeah. How are datasets maintained if you do something like this?

Or do you have to do each one separately and reconfigure them?

I kind of wish there were somewhere I could rent a 128GB server at a reasonable price for a week, dump my volume to it, and dump it back.

I'd do it once every year or two to help with fragmentation, and every time I add a new vdev to my pool in order to rebalance the data across the vdevs for optimum striping performance.

jgreco · Jan 14, 2016

mattlach said:
Whoah! We have a warranty? :p

Yeah, a full money back one. If ever the product should fail to perform, upload the software back and you'll get a full refund.

Yeah. How are datasets maintained if you do something like this?

Carefully, obviously.

Or do you have to do each one separately and reconfigure them?

See, that's the thing. This is a technique ZFS jockeys would use on their Solaris fileservers. It isn't particularly FreeNAS-compatible though.

I kind of wish there were somewhere I could rent a 128GB server at a reasonable price for a week, dump my volume to it, and dump it back.

Yup.

I'd do it once every year or two to help with fragmentation, and every time I add a new vdev to my pool in order to rebalance the data across the vdevs for optimum striping performance.

Rebalancing is mostly a fool's errand. ZFS will take care of it for you in due time.

mattlach · Jan 14, 2016

jgreco said:
Rebalancing is mostly a fool's errand. ZFS will take care of it for you in due time.

Well, correct me if I am wrong but here's how I thought it worked. let's take my system as an example:

I have two 6 disk RAIDz2 vdevs, and I am considering adding another. Currently the pool is about 80% full.

Since my pool is relatively full, if I were to add another vdev, up front it would highly favor the empty vdev at the expense of the other two.

The newly written data would be stored primarily on the new vdev with very little of it on the other two, meaning that when it is read it will heavily load one of the vdevs but very little on the other two.

Similarly, my old data that is already written won't see any benefit from the new vdev.

I mean, if you have a very dynamic load with blocks written and removed all the time, eventually you are right, it would balance out, but mine is a file server. Once written data tends to stay for a long time, and as such it won't balance itself out.

jgreco · Jan 14, 2016

"At the expense of"?

Honestly, a six disk RAIDZ2 vdev should be able to saturate your dual port Intel Pro/1000 all by itself. There's no expense involved.

mattlach · Jan 14, 2016

jgreco said:
"At the expense of"?

Honestly, a six disk RAIDZ2 vdev should be able to saturate your dual port Intel Pro/1000 all by itself. There's no expense involved.

Ahh, my sig is a little outdated. Fixed it.

I have multiple guest operating systems all accessing the same pool over local virtual 10gig adapters, plus a single Brocade BR1020 adapter for a direct link to my workstation for high speed transfers, and a quad port Pro/1000 PT using link aggregation hooked up to my switch.

The VM guests hit the pool hard. The 10 gig adapter is nice to have during massive data dumps, but not really necessary. The quad port adapter is total overkill, as I usually don't have THAT many networked clients hitting it at once. I put it in there because I had it.

Truth is, it's not the sequential speeds that I am the most concerned with. Very little that I do requires more than a ~5-10MB/s read speed. It's when they all hit at once and the drive heads are kicking around like mad, and access latencies go up, I have a potential for problems.

One of the types of data that my FreeNAS server hosts, are all my recorded DVR files from TV, and all my media. Whenever there are latency spikes, it results in video stutter, which is unpleasant, and this gets tricky as I often have as many as three clients pulling data at the same time.

While it technically is a file server, it is in some ways more IOPS dependent than it is sequential transfer dependent.

jgreco · Jan 15, 2016

Well, then, maybe it'd be helpful. For recorded DVR files, this probably shouldn't make a difference. If you're implying that you're running VM's off the pool, you're way too full. If you're merely saying that VM's access the DVR data, not such a big deal.

So what's the "IOPS dependent" part? Because DVR or other media access isn't generally IOPS stressy; it's sequential read, which RAIDZn handles pretty well.

mattlach · Jan 15, 2016

jgreco said:
Well, then, maybe it'd be helpful. For recorded DVR files, this probably shouldn't make a difference. If you're implying that you're running VM's off the pool, you're way too full. If you're merely saying that VM's access the DVR data, not such a big deal.

So what's the "IOPS dependent" part? Because DVR or other media access isn't generally IOPS stressy; it's sequential read, which RAIDZn handles pretty well.

You are correct, no VM runs off the pool. I have separate SSD's for that.

I may be using the wrong terminology, but the part I think is more seek time dependent is when multiple sequential-style operations are going on at the same time.

There may be three clients streaming video files, at the same time as a few more are being recorded, at the same time as crashplan is performing a backup, at the same time my Fiancé's Mac is doing a time machine sync, at the same time I am doing some file operations from my workstation, etc. etc.

I could be wrong, but my understanding is that when you pile tons of sequential operations on top of each other, seek times and thus IOPS start becoming important.

I'd appreciate your thoughts on this.

Mr_N · Jan 15, 2016

on the topic of rebalancing, i'll just be essentially copying a few TB off, deleting it and copying it back on when i add my second vdev in the coming weeks.

jgreco · Jan 15, 2016

Yes, but only if they're actually happening truly concurrently. ZFS is very good about optimizing. For example, if you're sequentially reading a file, ZFS will assume that you will be reading more of the file, and it will not be reading tiny little blocks on demand, but will instead prefetch. Go to your CLI, and type "arc_summary.py" and look under "File Level Prefetch". My filers are usually seeing an ~80-90% hit ratio, except for the filers with a tiny amount of RAM.

In all likelihood, your filer does one thing on the pool for a hundred milliseconds, takes a break, then services something else, takes a break, services a third thing, takes a break, and then maybe goes back to the first thing. For each of those things, ZFS is likely to be predictively prefetching. While it's true that there's I/O going on, and that it could be measured "per second", it isn't the brutally seek heavy sort of thing we usually mean by IOPS.

It's certainly possible to create enough workload where the pool is actually quite busy, but even then, it's not so much IOPS-heavy as it is just ... busy. The big thing that will kill performance is fragmentation and pool full-ness. A fragmented pool, for reads, means that if you're reading data that was laid down nonsequentially, then there are seeks, and yes, that's bad. For writes, it means ZFS struggles to allocate space into little free nooks and crannies. ZFS would *like* to write a transaction group as a long, contiguous string of sectors.

Take a look at http://blog.delphix.com/uday/2013/02/19/78/

There's an interesting question implied by Table 4; that question is "how the f***!!! do you get 1500 IOPS out of a single hard drive". The answer is that ZFS is transforming what appear to be random seeking IOPS into sequential writes, which is easily done at 10% occupancy, because there's so much free space. As the pool gets fuller, the ability to aggregate writes decreases, and eventually works its way down to the 100 IOPS that a hard drive ought to have.

The takeaway here is that ZFS is able to make your pool go FASTER when there are free resources, especially lots of free resources. But it's all sleight of hand.

It's certainly true that ZFS could do better if you rebalance your pool, but I kinda doubt it would have a significant impact in the general case. A 6 disk RAIDZ2 is pretty fast at sequential.

mattlach · Jan 15, 2016

jgreco said:
Yes, but only if they're actually happening truly concurrently. ZFS is very good about optimizing. For example, if you're sequentially reading a file, ZFS will assume that you will be reading more of the file, and it will not be reading tiny little blocks on demand, but will instead prefetch. Go to your CLI, and type "arc_summary.py" and look under "File Level Prefetch". My filers are usually seeing an ~80-90% hit ratio, except for the filers with a tiny amount of RAM.

In all likelihood, your filer does one thing on the pool for a hundred milliseconds, takes a break, then services something else, takes a break, services a third thing, takes a break, and then maybe goes back to the first thing. For each of those things, ZFS is likely to be predictively prefetching. While it's true that there's I/O going on, and that it could be measured "per second", it isn't the brutally seek heavy sort of thing we usually mean by IOPS.

It's certainly possible to create enough workload where the pool is actually quite busy, but even then, it's not so much IOPS-heavy as it is just ... busy. The big thing that will kill performance is fragmentation and pool full-ness. A fragmented pool, for reads, means that if you're reading data that was laid down nonsequentially, then there are seeks, and yes, that's bad. For writes, it means ZFS struggles to allocate space into little free nooks and crannies. ZFS would *like* to write a transaction group as a long, contiguous string of sectors.

Take a look at http://blog.delphix.com/uday/2013/02/19/78/

There's an interesting question implied by Table 4; that question is "how the f***!!! do you get 1500 IOPS out of a single hard drive". The answer is that ZFS is transforming what appear to be random seeking IOPS into sequential writes, which is easily done at 10% occupancy, because there's so much free space. As the pool gets fuller, the ability to aggregate writes decreases, and eventually works its way down to the 100 IOPS that a hard drive ought to have.

The takeaway here is that ZFS is able to make your pool go FASTER when there are free resources, especially lots of free resources. But it's all sleight of hand.

It's certainly true that ZFS could do better if you rebalance your pool, but I kinda doubt it would have a significant impact in the general case. A 6 disk RAIDZ2 is pretty fast at sequential.

Thank you for this post. Very informative. It is rare I learn whis much from one forum post! :)

I did not know ZFS prefetched into the arc like that. As a self taught layman, I've always struggled a bit with the output of that ARC statistics script. Im curious, will it ever prefetch into the L2ARC as well?

I added two striped 128gb SSD's as L2ARC on mine despite knowing that they probably wouldn't help much in my load. I had bought them for my SLOG's, but I didn't do my homework, and wound up wiht pretty bad slog drives. So when I replaced them with better slog drives, I added them as L2ARC more because I had them than anything else.

I figured, sure, I don't have a well defined working set like you would if you were running VM guests or a database, but I have enough RAM to support them, and they might help at the margins.

mattlach · Jan 15, 2016

mattlach said:
Im curious, will it ever prefetch into the L2ARC as well?

This setting seems to suggest you can enable it, but that it defaults to off:

Code:

vfs.zfs.l2arc_noprefetch                1

I wonder why it defaults to disabled. Is it because of limited write cycles of SSD's? If that's th eonly reason I might just enable it, as the 3D NAND in my Samsung 850 Pro's is pretty damned resilient, even if it is only MLC. (Does anyone even use SLC SSD's anymore?)

mattlach said:
I added two striped 128gb SSD's as L2ARC on mine despite knowing that they probably wouldn't help much in my load. I had bought them for my SLOG's, but I didn't do my homework, and wound up wiht pretty bad slog drives. So when I replaced them with better slog drives, I added them as L2ARC more because I had them than anything else.

Code:

L2 ARC Breakdown:                               134.49m
        Hit Ratio:                      0.27%   360.58k
        Miss Ratio:                     99.73%  134.13m
        Feeds:                                  543.76k

Ouch, this is pretty damning.

At least looks like the prefetch is doing its job...

Code:

File-Level Prefetch: (HEALTHY)
DMU Efficiency:                                 350.82m
        Hit Ratio:                      81.26%  285.06m
        Miss Ratio:                     18.74%  65.76m

        Colinear:                               65.76m
          Hit Ratio:                    0.05%   33.11k
          Miss Ratio:                   99.95%  65.73m

        Stride:                                 233.42m
          Hit Ratio:                    99.88%  233.13m
          Miss Ratio:                   0.12%   283.54k

Maybe I can improve the usefullness of the L2ARC by enabling prefetch into it?

mattlach · Jan 15, 2016

Also, on a side note, how highly fragmented can a pool get before we start worrying?

My pool has been running for about a ywear and a half. Until a month or two ago, I was just fuilling it, I had no reason to clear out space, as I had so much of it. Then I hit the 80% warning, so I cleared out a bunch of old stuff I don't need. It reached 88% full before I dealt with it.

Now it looks like this:

Code:

[root@freenas] ~# zpool list zfshome
NAME      SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
zfshome  43.6T  31.1T  12.4T         -    16%    71%  1.00x  ONLINE  /mnt

jgreco · Jan 15, 2016

mattlach said:
I did not know ZFS prefetched into the arc like that. As a self taught layman, I've always struggled a bit with the output of that ARC statistics script. Im curious, will it ever prefetch into the L2ARC as well?

No. The L2ARC contents are based on an analysis of ARC contents that are soon-to-be-evicted from the ARC. Things that seem like they've been useful are eligible to be sent to the L2ARC, but there are various limits.

I added two striped 128gb SSD's as L2ARC on mine despite knowing that they probably wouldn't help much in my load. I had bought them for my SLOG's, but I didn't do my homework, and wound up wiht pretty bad slog drives. So when I replaced them with better slog drives, I added them as L2ARC more because I had them than anything else.

Do be aware that installing L2ARC robs you of ARC; the pointers (think: index) into the L2ARC are stored in the ARC. If you are not seeing good behaviour out of the L2ARC, you're better off using that RAM for the regular ARC.

I figured, sure, I don't have a well defined working set like you would if you were running VM guests or a database, but I have enough RAM to support them, and they might help at the margins.

Go take a look. Just see what the L2ARC breakdown is. arc_summary.py. If you're not seeing a meaningful hit ratio, then maybe ditch it. Also, remember, the place that L2ARC is actually helpful is when your pool is so busy that it is having trouble keeping up with requests. If your pool isn't that busy, then it can be filling requests from the pool, and there's no big advantage to L2ARC.

jgreco · Jan 15, 2016

mattlach said:
This setting seems to suggest you can enable it, but that it defaults to off:

Code:
vfs.zfs.l2arc_noprefetch 1

[...]

Maybe I can improve the usefullness of the L2ARC by enabling prefetch into it?

That's not actually what it's doing. It's flagging prefetched data as ineligible for eviction to L2ARC. If you have lots of data that's being prefetched (and probably more than once) then turning this to zero might be very useful.

Of course you're risking some significant additional wear and tear on your SSD. ZFS disables this because generally it is cheap to prefetch stuff from a pool; you usually want L2ARC to accelerate the stuff that's *hard* to get off your pool (i.e. involves seeks).

jgreco · Jan 15, 2016

mattlach said:
Also, on a side note, how highly fragmented can a pool get before we start worrying?

My pool has been running for about a ywear and a half. Until a month or two ago, I was just fuilling it, I had no reason to clear out space, as I had so much of it. Then I hit the 80% warning, so I cleared out a bunch of old stuff I don't need. It reached 88% full before I dealt with it.

Now it looks like this:

Code:
[root@freenas] ~# zpool list zfshome NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT zfshome 43.6T 31.1T 12.4T - 16% 71% 1.00x ONLINE /mnt

If you've mostly been adding to it, I'm not sure where the 16% frag is coming from. I've got 9% frag on a 40T pool that's 50% full and is theoretically archival in nature but in practice may see more deletes than I'd care for.

Still, you're probably fine. Unless you're actually seeing performance issues, I'd strongly suggest you just add a vdev and be done with it. There's a certain amount of danger at poking around at data that's already safely stored and doing fine. It's like putting a sign on your back that says "Kick Me."

mattlach · Jan 15, 2016

Great insights jgreco. Thank you very much.

Now, if anyone knows where I can get super short term (like a week or two) rentals for 30+ TB external storage or storage servers to assist during the infrequent times I need to juggle around data, I'd love to hear it. I've found a few sites that offer services like these, but their monthly costs are ridiculously high. I could practically build my own with new parts for the prices they charge.

I my data backed up externally via the internet, but it would take so long to restore, that I'd rather just not, unless I absolutely have to :p It's a disaster recovery option only. :p

jgreco · Jan 15, 2016

http://www.bestbuy.com/site/seagate...b-external-usb-3-0-hard-drive-black/2240085.p

x4

Followed by "Oh, these didn't work with my iMac G4, return, please."

gpsguy · Jan 15, 2016

Just build an Amazon Snowball wannabe for data transfer.

Sent from my phone

jgreco · Jan 15, 2016

We're having a snowball fight?

Important Announcement for the TrueNAS Community.

80% capacity fill rule - How far past that is safe?

Resident Grinch

Patron

Resident Grinch

Patron

Resident Grinch

Patron

Resident Grinch

Patron

Patron

Resident Grinch

Patron

Patron

Patron

Resident Grinch

Resident Grinch

Resident Grinch

Patron

Resident Grinch

Active Member

Resident Grinch

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "80% capacity fill rule - How far past that is safe?"

Similar threads