Unexpectedly low storage


TomatoSoup

Dabbler
Joined
Jun 9, 2016
Messages
17
I recently put together a NAS with 8x 4TB drives in RAIDZ2.

However, I'm seeing weird storage behavior. I'd expect to see 24TB available, but instead I see 29TiB (which I understand is due to the difference between powers of two and powers of ten). That makes sense to me.

What I don't understand is how the underlying volume is listing 20TiB. You know, my aggregate of disks is called 'platter', and so it lists Platter, then nested beneath that another Platter, and then beneath that my datasets.

To be completely explicit, here's what I'm looking at. https://imgur.com/8AXBMYB

What's up with this? And, before I fill it up more, is there a different structure I could use? If striping together two 4x 4TB RAIDZ2s would get me the same storage but more IOPS, then I'd like to take steps to move my data before I load too much onto it. After all, there's the old wisdom about total drives minus parity drives equalling a power of two. And I've also heard that that's no longer applicable thanks to compression. And on yet another hand, 1.4TiB/6% does give roughly 24TiB, which is much closer to what I'd expect.

So, what should I do?
 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
Yeah, I've always been a little confused about the various ways available space is shown, and the variance between them.

If you're using snapshots, those can use up space that isn't easy to account for...

Open a shell and run the zfs list command. This will show detailed usage for your pool and subsidiary datasets and makes more sense to me than the GUI presentation you posted:
Code:
zfs list

Also, specify your pool's name ('platter') and the zfs list command will show the grand total of space used and available:
Code:
zfs list platter

Here is the output of these commands for my pool ('tank'):
Code:
[root@boomer] ~# zfs list
NAME  USED  AVAIL  REFER  MOUNTPOINT
tank/archives  66.7G  3.63T  66.4G  /mnt/tank/archives
tank/atm  38.4G  474G  38.4G  /mnt/tank/atm
tank/backups  2.27T  3.63T  1.36T  /mnt/tank/backups
tank/bertrand  272K  3.63T  272K  /mnt/tank/bertrand
tank/devtools  96.9G  3.63T  96.6G  /mnt/tank/devtools
tank/domains  517M  3.63T  457M  /mnt/tank/domains
tank/hardware  13.0G  3.63T  13.0G  /mnt/tank/hardware
tank/homes  11.5G  3.63T  296K  /mnt/tank/homes
tank/homes/jeddak  208K  3.63T  208K  /mnt/tank/homes/jeddak
tank/homes/keith  11.5G  3.63T  10.8G  /mnt/tank/homes/keith
tank/jails  1.31M  3.63T  192K  /mnt/tank/jails
tank/junk  4.52M  3.63T  4.52M  /mnt/tank/junk
tank/music  4.88G  3.63T  4.88G  /mnt/tank/music
tank/ncs  5.53G  3.63T  5.43G  /mnt/tank/ncs
tank/odllc  6.83G  3.63T  6.81G  /mnt/tank/odllc
tank/opsys  369G  3.63T  369G  /mnt/tank/opsys
tank/photo  23.7G  3.63T  23.7G  /mnt/tank/photo
tank/sysadmin  18.6M  3.63T  11.8M  /mnt/tank/sysadmin
tank/syslog  192K  3.63T  192K  /mnt/tank/syslog
tank/systems  41.3G  3.63T  41.2G  /mnt/tank/systems
tank/systools  31.5G  3.63T  31.5G  /mnt/tank/systools
tank/test  4.77M  1024G  4.64M  /mnt/tank/test
tank/video  257G  3.63T  257G  /mnt/tank/video
tank/vm_block  1.04T  4.64T  31.8G  -
tank/vmware_nfs  262G  3.63T  262G  /mnt/tank/vmware_nfs
tank/web  1.62G  3.63T  1.61G  /mnt/tank/web
[root@boomer] ~# zfs list tank
NAME  USED  AVAIL  REFER  MOUNTPOINT
tank  4.51T  3.63T  256K  /mnt/tank
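
One more point worth making: the zpool list command reports raw pool capacity, counting the space that will go to parity, while zfs list reports usable space after parity and reservations, so the two will never show the same numbers. Running both against your pool should make the difference obvious (just a sketch, using your pool name):
Code:
zpool list platter
zfs list platter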
 

TomatoSoup

Dabbler
Joined
Jun 9, 2016
Messages
17
Yeah. Looking at it again, it looks like everything is reporting a consistent percentage. It still just feels like it's a bit less than it should be, considering Windows is only showing that I've used ~1.5TB (I'm still loading this thing up).

Code:
NAME      SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT                                                     
platter    29T  1.97T  27.0T         -     3%     6%  1.00x  ONLINE  /mnt   
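
(As a sanity check on those numbers: 1.97T allocated x 6/8 ≈ 1.48T of actual data if two disks' worth is going to parity, which lines up with the ~1.5TB Windows says I've copied so far. So at least the numbers are internally consistent.)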


And then df -h!

Code:
platter 19T 205K 19T 0% /mnt/platter
platter/backup 19T 205K 19T 0% /mnt/platter/backup
platter/jaildata 19T 205K 19T 0% /mnt/platter/jaildata
platter/media 20T 1.4T 19T 7% /mnt/platter/media
 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
If striping together two 4x 4TB RAIDZ2s would get me the same storage but more IOPS, then I'd like to take steps to move my data before I load too much onto it. After all, there's the old wisdom about total drives minus parity drives equalling a power of two. And I've also heard that that's no longer applicable thanks to compression. And on yet another hand, 1.4TiB/6% does give roughly 24TiB, which is much closer to what I'd expect.

So what should I do?
Your current RAIDZ2 configuration is probably the best solution, giving the most available storage while sacrificing the fewest disks to parity. Two RAIDZ2 vdevs would yield even less available space, and you'd be using half(!) of your 8 drives for parity; you would get more IOPS with the same available space if you just used 4 mirror vdevs instead.

Question: do you have deduplication turned on? If so, that might be a factor, but I've never used it and don't know that much about it.
 

TomatoSoup

Dabbler
Joined
Jun 9, 2016
Messages
17
Dedup is off. I'm considering turning it on for a small (1TB) dataset for VM backups. Probably not worth it, though.

I would have agreed with you, but this seems to be similar to the non-power-of-two problem, so I was just throwing that out there. After all, 2 is a power of 2. I wouldn't mind using half of my drives for parity if it got me better performance with the same storage. As for why not striped mirrors, I'm just not comfortable with my array ever being in a situation where a single drive failure could kill it.

Clearly I'm new to FreeNAS and it's all just a bit spooky to me.
 

TomatoSoup

Dabbler
Joined
Jun 9, 2016
Messages
17
What kind of ashift should I see if I do zdb? I'm seeing 9 right now, which some googling suggests is unusual.

EDIT: Which it is, because I'm wildly incompetent and that was my boot drive. My actual drives both show 12.
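
For anyone who hits the same thing: as far as I can tell, on FreeNAS the data pools aren't in the default cache file that zdb reads, which is why a plain zdb only showed my boot pool. Pointing it at the FreeNAS cache file (treat the exact path as my assumption) is what turned up the real pools:
Code:
zdb -U /data/zfs/zpool.cache | grep ashift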
 

TomatoSoup

Dabbler
Joined
Jun 9, 2016
Messages
17
Okay, last post for now. I'll never claim to totally understand FreeNAS, but here's what I think it's doing.

Code:
Filesystem Size Used Avail Capacity Mounted on
platter 39875407816 409 39875407407 0% /mnt/platter
platter/backup 39875407816 409 39875407407 0% /mnt/platter/backup
platter/jaildata 39875407816 409 39875407407 0% /mnt/platter/jaildata
platter/media 42891307561 3015900154 39875407407 7% /mnt/platter/media


Notice that the Avail is the same for every one. Now, most importantly, notice that Size = Used + Avail. So it's not actually just a 20TiB datastore; it's just reporting that it has that much free, and after you write more to it, it says "Yeah, that much is still free." It's shrugging about the size of the logical filesystem and giving a best effort. That's why there's such a discrepancy. So, basically, the Avail will decrease more slowly than the Used grows. At least, that's what I hope.
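
(If I'm reading that raw df output right, the numbers are in 512-byte blocks, which is the FreeBSD default: 39,875,407,816 blocks x 512 bytes is about 20.4 x 10^12 bytes, or roughly 18.6TiB, which matches the 19T that df -h showed earlier. So the two df views are at least consistent with each other.)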
 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
Dedup is off. I'm considering turning it on for a small (1tb) dataset for VM backups. Probably not worth it, though.
Yes, if you dig around the forum you'll find out that deduplication is a very expensive feature in terms of resources, and really only needed in very special circumstances.
I would have agreed with you, but this seems to be similar to the non-power-of-two problem, so I was just throwing that out there. After all, 2 is a power of 2. I wouldn't mind using half of my drives for parity if it got me better performance with the same storage. As for why not striped mirrors, I'm just not comfortable with my array ever being in a situation where a single drive failure could kill it.

Clearly I'm new to FreeNAS and it's all just a bit spooky to me.
The old 'power of two' recommendation has been superseded by events:

https://forums.freenas.org/index.php?threads/optimal-vdev-size.36467/#post-222745

Quoting the post above:
From http://doc.freenas.org/9.3/freenas_intro.html#zfs-primer:

Some older ZFS documentation recommends that a certain number of disks is needed for each type of RAIDZ in order to achieve optimal performance. On systems using LZ4 compression, which is the default for FreeNAS® 9.2.1 and higher, this is no longer true. See ZFS RAIDZ stripe width, or: How I Learned to Stop Worrying and Love RAIDZ for details.

Note that the reference was written by one of the original ZFS authors and the main OpenZFS maintainer.
The upshot is that you'll do just fine with your 8-disk RAIDZ2 vdev. FWIW, I use a 7-disk RAIDZ2 vdev in my main system and I'm happy with it.

If you really need more IOPS, then mirrors are the way to go. And you're not limited to 2-way mirrors... you could go insanely redundant and use two vdevs, each a 4-way mirror. You'd only have 8TB of space and would be using 3/4 of your disks for redundancy... but you could lose 3 drives in each vdev and still not lose your pool!

A little less over-the-top would be to add a 9th drive and use 3 vdevs, each a 3-way mirror. There's a green fellow around here (@jgreco) who does such things.

But honestly, for most use-cases RAIDZ2 is just fine. It gives the most available storage and is reasonably safe, too, in terms of redundancy.
 

TomatoSoup

Dabbler
Joined
Jun 9, 2016
Messages
17
Certainly. I probably won't bother with it unless it cuts my storage needs for VM backups by 2-3x. And even then, only once I start hurting for storage.

Yeah, I knew that compression defeats the old power-of-two advice. I was just commenting that it looked like that, and wanted someone to sanity check me and go "No, you're right, that's not the problem."

I'm not hurting for IOPS at all. This NAS won't see anything more strenuous than serving media, storing backups, and a MySQL server that'll peak at about a query a second. I only mentioned it because, if I could get similar storage from a more performant structure, I'd love to do that. But it seems like I do have my full storage available, and I think I've somewhat explained the discrepancies.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
It's always important to remember that the number reported as available space with RAIDZ is an *estimate*. Because RAIDZ does not work like RAID5/6, the amount of space occupied by parity is variable and depends on the size of each stored block. In particular, small (especially 4K) blocks will consume parity at a furious rate.
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
I recently put together a NAS with 8x 4tb drives in RAIDZ2

However, I'm seeing weird storage behavior. I'd expect to see 24tb available, but instead I see 29TiB (which I understand is due to the differences between powers of two and powers of ten). That makes sense to me.

What I don't understand is how the underlying volume is listing 20TiB.
You started with a basic misconception. The 29TiB is the raw storage, equivalent to 32TB, without any consideration of filesystem overhead.

The 20TiB is what you get from 6 x 4TB = 24TB, converted to TiB, less estimated filesystem overhead.
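
Roughly: 8 x 4TB = 32 x 10^12 bytes ≈ 29.1TiB, which is the 29T your zpool list shows. The usable data portion is 6 x 4TB = 24 x 10^12 bytes ≈ 21.8TiB, and ZFS then subtracts a small reserve and its own estimate of metadata and allocation overhead, which is how you end up in the neighborhood of 20TiB. Those deductions are an estimate on ZFS's part, so don't expect the arithmetic to land exactly on the reported number.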
 

TomatoSoup

Dabbler
Joined
Jun 9, 2016
Messages
17
Okay, now I'm starting to understand. Can you link me to something that explains why RAIDZ uses variable amounts of parity? I think I'm seeing exactly one fourth being used for parity (2TB allocated, 1.5TB of data), which makes sense for 1/4th of my drives being used for parity.

And would it then make sense for me to change my blocksize?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
RAIDZ parity is included as part of the ZFS block written. If you write a single 4K block and you have a RAIDZ1, you will need to write a 4K sector of data and a 4K sector of parity. If you write a single 4K block and RAIDZ2, you need to write a 4K sector and two 4K sectors of parity. If you write a single 4K block with RAIDZ3, you need to write a 4K sector and 3 4K sectors of parity.

This means that RAIDZ is not good for small block storage. It is much better at large block storage. And ZFS loves its large blocks.

If you have a six-disk RAIDZ2, your ideal block size is some multiple of 16KB, because each full stripe holds four data sectors (16KB) plus two parity sectors.

Complicating this all is that ZFS will also pad in certain cases. For example, in this chart (a 5-drive RAIDZ1), note that ZFS will refuse to allocate three or five sectors and will instead pad to four or six (see the "X"s).

[Attached chart: RAIDZ-small.png, showing sector allocation on a 5-drive RAIDZ1]
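
To put rough numbers on your 8-wide RAIDZ2 with 4K sectors, assuming I've applied the same allocation rules as the chart: a full stripe is six data sectors plus two parity, so the "ideal" block size is a multiple of 24KB. A 128KB block is 32 data sectors; at six data sectors per row that's six rows, so 12 parity sectors, 44 sectors total, and RAIDZ2 pads allocations up to a multiple of three sectors, so 45 sectors (180KB) on disk for 128KB of data. Bigger blocks get you close to the ideal 8:6 ratio; a single 4KB block costs three sectors (one data, two parity).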
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
Forget the ZFS stripe width article for this discussion; it's not that helpful here. If your ashift is less than 12, then it might be possible to go for a smaller block size, but that's totally irrelevant to what I'm showing you here. If you want ashift=9, then just take all the "4KB" above and replace it with "512" and you are left with the same exact issue. The numbers are not relevant. The concept is key.

For the remainder of this discussion let's talk in 4KB sectors.

So the point is, how do you write the smallest possible block, a 4KB block? This is represented in LBA #3 above as a brown band of P0 and D0: one parity sector and one data sector. So to store 4KB of data we're consuming 8KB of raw disk. That's 100% overhead to store that block.

Now look at the salmon starting in LBA #0. We have a 32KB block which has two full stripes, each with four data sectors and a parity sector. This is optimally sized; the overhead is 20% to store this 32KB block. This is kinda what people THINK of when they hear "RAIDZ1," because they're THINKING in their head, "RAID5," which has exactly 20% overhead for the entire array, because (for this array) it would always be four data sectors and one parity sector per stripe. However, RAID5 will in many cases require you to read the existing data off the disk, add the new sectors, recompute parity, and then write back to disk. ZFS never needs to do that. A block's parity is ONLY a function of the data in the block, so ZFS computes the parity and writes it. No reads of existing data.

The downside to this is that for cases like the 4KB block, this is inefficient. But ZFS already has a highly optimized mirror strategy for efficient block storage.

In the end: Use mirrors for block storage (database, VM, etc) or other small/tiny file storage. Use RAIDZ for large contiguous files that are read and written sequentially.
 

TomatoSoup

Dabbler
Joined
Jun 9, 2016
Messages
17
I understand that if I write a small 4KB of data then that's 8KB of disk. But with larger block sizes, suppose I write a 4KB file into a 32KB sector. That consumes the entire sector, right? Whereas if I were to write a 32KB file, then that's 8x 4KB sectors. So then I'd expect it to stripe like 16KB data, 4KB parity, twice over. That's still 20%.

Are parity blocks and data blocks differently sized?

How much storage am I losing right now to unnecessary padding and parity? It looks like less than 1%, seeing as how I have 1.52TB written and 2.01TB allocated.
 

maglin

Patron
Joined
Jun 20, 2015
Messages
299
And still. What should I do?
Not worry about it. It's that simple. If your ashift is different from your drives' native sector size, I believe you'll get a warning about it in zpool status. Just let FreeNAS take care of itself and don't worry about the % free until you start to get over 80% full; that is what you want to pay attention to. Also, because of what is mentioned above, if you have a lot of files that are very small (i.e. 5k-12k files of ~50KB each) they will consume a very large amount of disk space. I have 550MB of files consuming something like 1.2GB of disk space. Seems like a lot, but I have many TiB of space, so I just don't worry about it. If it was cold storage I would just zip them up and then they wouldn't take up so much space.
 

TomatoSoup

Dabbler
Joined
Jun 9, 2016
Messages
17
Thank you for saying "not worry about it". If you go to the hardware recommendation side of the forums you might notice that I'm the guy who, on a whim, went from 4x 2TB RAIDZ1 to 4x 4TB to 8x 4TB RAIDZ2. I have no idea how I'm going to fill this thing up even five years out. I just don't like the thought of losing anything that I paid for; it's the whole difference between expectations and reality.

Thank you for taking the time to explain all this to me.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
I understand that if I write a small 4KB of data then that's 8KB of disk. But with larger block sizes, suppose I write a 4KB file into a 32KB sector. That consumes the entire sector, right?

You can't. There's no such thing as a 32KB sector. They're all 4KB.

I think what you *might* mean is a 32KB block. In that case, also no, because ZFS wouldn't allocate a 32KB block to write 4KB worth of data. This is why ZFS uses variable sized blocks.

Are parity blocks and data blocks differently sized?

No. And for the sake of clarity we should probably call them sectors when referring to the block-like 4KB thingies that get written to disk (which are often referred to as blocks), since ZFS also has its own blocks, which are variable-sized groups of those sectors.

How much storage am I losing right now to unnecessary padding and parity? It looks like less than 1%, seeing as how I have 1.52tb written and 2.01 allocated.

It'll mostly depend on how large your files are. Storing lots of tiny files will waste lots of space. Storing larger files (especially 1MB or larger) will waste as close to no space as is reasonably possible.
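
If you want to keep an eye on it, comparing the pool-level allocation (raw, including parity and padding) against the dataset-level usage gives a rough feel for the overhead; something along these lines, with your pool name substituted in:
Code:
zpool get allocated platter
zfs get used,compressratio platter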
 