ZFS on partitioned disks.

nixgeek · Sep 6, 2015

I am confused I thought there is a major performance loss when using [non-root] ZFS is used on a partitioned.
Things like on-board disk cache, and Disk Features[i.e. NCQ, etc.] are lost/not used in the best way.

If so, why does freenas create 4 partitions per [non-root] ZFS disk?

====
=> 34 2930277101 ada7 GPT (1.4T)
34 94 - free - (47k)
128 4194304 1 freebsd-swap (2.0G)
4194432 2926082696 2 freebsd-zfs (1.4T)
2930277128 7 - free - (3.5k)
====

How does the Freenas implementation of ZFS get around this?

Please advise.

Tim

dlavigne · Sep 8, 2015

Which version of FreeNAS? And which command did you use to get the output?

nixgeek · Sep 8, 2015

dlavigne said:
Which version of FreeNAS? And which command did you use to get the output?

FreeNAS-9.3-STABLE-201504152200

gpart show

dlavigne · Sep 10, 2015

I'm not sure what you mean by non-root. ZFS likes to be installed to disks, not partitions. When you create a pool in FreeNAS, you have to point it to disks (not partitions) and it automatically creates encrypted swap for the pool.

Linkman · Sep 10, 2015

That only looks like two partitions to me, a 2GB swap and a 1.4TB data partition. The other two are tiny, total way less than a MB, and aren't named. Perhaps that is for block alignment purposes?

nixgeek · Sep 11, 2015

dlavigne said:
I'm not sure what you mean by non-root. ZFS likes to be installed to disks, not partitions. When you create a pool in FreeNAS, you have to point it to disks (not partitions) and it automatically creates encrypted swap for the pool.

Non-Root meaning the zfs pool is not for the OS [root/booting]. If there are 2 partitions then ZFS is not truly using the whole disk but a partition.
I am pretty sure the p2 below means a partition. If not I'd expect to see ada0, ada1, etc. When I was running Opensolaris it was using whole
disks not a partition.

# glabel status
gptid/8199c4a2-dc1d-11e4-bec3-000423b47b30 N/A ada0p2
gptid/81f5d73f-dc1d-11e4-bec3-000423b47b30 N/A ada1p2
gptid/827819f9-dc1d-11e4-bec3-000423b47b30 N/A ada2p2
gptid/6fa756e0-1ab5-11e4-b9e4-000423b47b30 N/A ada3p2
gptid/4240eeb8-1ab5-11e4-b9e4-000423b47b30 N/A ada4p2
gptid/70017e76-1ab5-11e4-b9e4-000423b47b30 N/A ada5p2
gptid/42dfa58c-1ab5-11e4-b9e4-000423b47b30 N/A ada6p2
gptid/82e88245-dc1d-11e4-bec3-000423b47b30 N/A ada7p2

The OS has its own disks not part of the above zfs pool.

Linkman · Sep 11, 2015

Because, as I understand it, FreeNAS (and I assume ZFS itself, not just FreeNAS) automatically puts a 2 GB swap space on each HDD, hence the two partitions you see labeled freebsd-swap and freebsd-zfs. ZFS isn't installed on the partition, it's installed on the HDD, and is creating a data space and a swap space itself.

Someone more knowledgeable feel free to correct my terminology, or total misunderstanding ;-)

dlavigne · Sep 11, 2015

And I'm still confused about your point. The storage disks in the pool are totally separate from the OS which is installed to the boot device. ZFS is happy as it's being fed entire disks. ZFS doesn't like being fed partitions.

solarisguy · Sep 11, 2015

@nixgeek, I am assuming that we are not talking legacy Solaris, but Oracle Solaris 11.2 (Solaris for short).

There are multiple issues that might create confusion when one is comparing ZFS in Solaris to ZFS in FreeNAS (FreeBSD).

ZFS in Solaris and FreeBSD gets deployed on SPARC and x86 platforms. However, as far as I know, any SPARC hardware that can be used to run FreeBSD 10.2 is not capable of running Oracle Solaris 11.2. Thus direct comparisons on SPARC platform are difficult.

On x86 platform, ZFS can be used to create a root pool. And yes, there is a difference, Solaris allocates the swap space and dump space inside the root pool. FreeBSD places swap on a separate partition. Swap has to go somewhere...

Beyond that, I cannot see fundamental differences. Let's see how a disk that is not a part of root pool looks like in Solaris.

Code:

Current partition table (original):
Total disk sectors available: 143358287 + 16384 (reserved sectors)

Part      Tag    Flag   First Sector           Size         Last Sector
0        usr    wm             256         68.36GB          143358320
1 unassigned    wm               0             0               0
2 unassigned    wm               0             0               0
3 unassigned    wm               0             0               0
4 unassigned    wm               0             0               0
5 unassigned    wm               0             0               0
6 unassigned    wm               0             0               0
8   reserved    wm       143358321          8.00MB          143374704

The above is from Using Disks in a ZFS Storage Pool in http://docs.oracle.com/cd/E36784_01/html/E36835/gazdp.html . As you can see, the ZFS area neither starts at sector 0 nor ends at the last sector. And by comparison from my FreeNAS

Code:

[root@freenas /]# gpart show ada2
=>        34  7814037101  ada2  GPT  (3.7T)
          34          94        - free -  (47k)
         128  7814037000     1  freebsd-zfs  (3.7T)
  7814037128           7        - free -  (3.5k)

[root@freenas /]#

There is a very good reason for FreeNAS (FreeBSD) freebsd-zfs partition to not start on sector 0, but that was not your question :) (As mentioned above alignment, and 4K issues, etc.)

Also you can read in the current Solaris documentation (the one I referenced above) that a storage device can be a whole disk or an individual slice. No preference is given to using the whole disk, and no disadvantage is given to using a slice.

I can see where a confusion can arise. In Solaris, when manipulating ZFS, one can use names like c0t0d0 or c0t1d0, and Solaris automagically uses c0t0d0s0 or c0t1d0s0. E.g. zpool create mypool mirror c0t0d0 c0t1d0 is a valid, and somewhat preferred, command. However, one has to remember that some magic (partitioning) would be done in background, not the entire disk would be used, and that zpool create mypool mirror c0t0d0s0 c0t1d0s0 is an equivalent command, if the disks are already properly partitioned.

P.S.
The disk from my system has no swap. But swap has to go somewhere, so I have disks with swap... Just not all of my disks have swap.

gork · Sep 16, 2015

I think there is a tendency to take the advice to use whole disks for ZFS much too literally. What is meant by the advice is that ZFS really prefers a 1:1 relationship between a physical disk and each device in the pool. ZFS generally tries to balance IO across pool members equally, so it works best when the devices in the main pool (ie not L2ARC or ZIL) all have roughly equivalent performance and that the behavior of any one device in the pool is not dependent on another.

Unless there is a misalignment of reads to the underlying disk blocks (512B, 4KB, etc.) there is no difference to ZFS whether it has the whole raw disk drive or only some portion such as a partition. There is no performance overhead, and in fact there are advantages to not using the raw disk devices.

The advantages to using a partition instead of a raw disk device:
1) A partition table documents the disk contents and layout and increases compatibility with other OSs.
2) Using a partition smaller than the raw size of the disk gives extra leeway to replace a failed disk with a drive of a different size/geometry/brand.

Regarding the use of the extra space as swap is in my mind a separate question. I personally remain a bit skeptical that this is a good idea and would like to see a well reasoned response as to why this is done. My own concern comes from the fact that I cannot find any definitive answer about how FreeBSD performs if a drive containing active swap fails. As a practical matter I do not think it is a performance concern because 1) FreeBSD apparently balances swap devices and 2) ideally swap is never used. An option to reserve space but leave the partition unused per-device would make more sense to me. If it is the case that swap is "absolutely necessary" then supporting it through zvol (cant do kernel crash dump) or dedicated partitions on faster and more reliable L2ARC devices seems a better approach.

Can anyone explain how disk failure of a swap partition is tolerated in a FreeNAS configuration? What will happen if the system is swapping and I yank a drive?

dlavigne · Sep 17, 2015

As a practical matter I do not think it is a performance concern because 1) FreeBSD apparently balances swap devices and 2) ideally swap is never used. An option to reserve space but leave the partition unused per-device would make more sense to me. If it is the case that swap is "absolutely necessary" then supporting it through zvol (cant do kernel crash dump) or dedicated partitions on faster and more reliable L2ARC devices seems a better approach.

Can anyone explain how disk failure of a swap partition is tolerated in a FreeNAS configuration? What will happen if the system is swapping and I yank a drive?

It is my understanding that swap is only used for operations such as imports, so that it doesn't have any affect on normal operations but bad things happen if it isn't there for an import.

gork · Sep 17, 2015

Thank you for the response.

Swap cannot be used selectively unless the software is enabling it and disabling it only during operations where it might be necessary (which does not seem to be happening.) It is up to the kernel when swap is used. ZFS import (particularly on pools with dedupe) can of course consume considerable memory, perhaps enough to require swap. However the way swap is configured in FreeNAS does present the a small risk of processes crashing (at best) or kernel panic (at worst) should a drive fail while it contains paged data because the FreeNAS swap configuration provides no means of protection against a drive failure other than trying to make such a failure as unlikely as possible.

I set up a test configuration below. In this example there is very low risk because no drives contain paged data at this time. Should some swap become used on /dev/da0p1.eli for example then the system could panic if da0 fails. One problem with swap is that it favors keeping highly inactive pages in swap.

[root@freenas] ~# swapinfo
Device 1K-blocks Used Avail Capacity
/dev/da1p1.eli 1048576 0 1048576 0%
/dev/da2p1.eli 1048576 0 1048576 0%
/dev/da3p1.eli 1048576 0 1048576 0%
/dev/da4p1.eli 1048576 0 1048576 0%
/dev/da5p1.eli 1048576 0 1048576 0%
/dev/da6p1.eli 1048576 0 1048576 0%
/dev/da7p1.eli 1048576 0 1048576 0%
/dev/da8p1.eli 1048576 0 1048576 0%
/dev/da9p1.eli 1048576 0 1048576 0%
/dev/da10p1.eli 1048576 0 1048576 0%
/dev/da11p1.eli 1048576 0 1048576 0%
/dev/da0p1.eli 1048576 0 1048576 0%
Total 12582912 0 12582912 0%

If some very old process that receives no activity (say getty or moused or something) has pages swapped to disk because of memory pressure from some other large process at some point, then these old pages will not have a chance to be removed from swap until they are accessed again. This could be a very long time. It is not at all out of the ordinary to see some small amount of swap usage on any long-running system. Have a look at another example I pulled from a different (non-FreeNAS) FreeBSD system:

[gork@wopr ~]$ uptime
9:31AM up 62 days, 20:33, 5 users, load averages: 0.00, 0.00, 0.00
[gork@wopr ~]$ swapinfo
Device 1K-blocks Used Avail Capacity
/dev/da0s1b.eli 4194304 11440 4182864 0%

In this example a mere 11MB of swap is used. Probably most of this is long idle processes, but the fact remains that the actual current memory usage indicates 13GB cache and 7GB free. So there is still swap used even though there is 20GB of kernel memory available to processes. The swap can be forcibly purged by disabling then re-enabling the swap partition.

[root@wopr /etc]# /etc/rc.d/swap1 restart
[root@wopr /etc]# swapinfo
Device 512-blocks Used Avail Capacity
/dev/da0s1b.eli 8388608 0 8388608 0%
[root@wopr /etc]# uptime
9:42AM up 62 days, 20:44, 5 users, load averages: 0.00, 0.00, 0.00

Now, da0 contains no swapped memory pages and should it fail the system would be at lower risk than before.

Perhaps it would be beneficial to have a cron job monitor for latent paged memory and periodically clear it if safe to do so.

I would be curious to know if other users running production FreeNAS systems see small amounts of swap being used or not. I am still evaluating FreeNAS here and so do not have my own numbers. I do have a Sun (Oracle) ZFS storage appliance though, and it puts swap on a zvol in the mirrored system zpool. I think this is a superior design decision, however I do not believe the FreeBSD implementation is robust enough yet to do the same thing. Moving swap to use the mirror GEOM provider might be an alternative that provides similar benefits.

Linkman · Sep 17, 2015

My impression was that the swap space was for ZFS proper, not the operating system; it's not backing system memory. Is that wrong?

cyberjock · Sep 17, 2015

Swap should NEVER be used in a properly configured FreeNAS system where you have no memory leaks and such.

That being said, Swap usage > 0 basically means "something is not right".

If you are using swap, and you happen to pull a drive using swap, you will often crash the system. That is one reason why the manual says to never remove a disk in a live system without warning the system beforehand. Now you'll probably respond with "but what if the disk just fails.. it's just like a drive that fails" and I would respond with 'yes, but again, swap shouldn't be used so this shouldn't be a problem".

Basically, with swap it's a seatbelt if something "isn't right". There's two options with swap:

1. Have no swap, but if you run out of RAM the system crashes.
2. Have swap, and if you start needing it the system just gets slower, but doesn't crash and you have the opportunity to recover it.

You can also use swap for situations where you bought a 3TB disk, but when you do an RMA the new disk is just a few hundred KB to small so it won't resilver. You simply shrink the swap space temporarily for that disk and then you can resilver.

jgreco · Sep 26, 2015

Linkman said:
My impression was that the swap space was for ZFS proper, not the operating system; it's not backing system memory. Is that wrong?

It is indeed backing system memory. ZFS doesn't have its own swap space, or, rather, it kinda does if we want to think of L2ARC that way, but this isn't that.

DrKK · Sep 26, 2015

jgreco said:
It is indeed backing system memory.

Indeed. Seconded.

jgreco · Sep 26, 2015

DrKK said:
Indeed. Seconded.

This isn't a vote! We're not voting! We don't vote on facts! Hehehe

DrKK · Sep 26, 2015

jgreco said:
This isn't a vote! We're not voting! We don't vote on facts! Hehehe

Well, a lot of people think they know "facts", and tend to ignore one person telling them a "fact" that is contrary to their "fact" that they learned from "their friend who has been an important IT guy for 12 years".

So I just wanted to second it, to spread doubt to people who don't believe it even though you said it.

jgreco · Sep 26, 2015

Oh dear lord... C&C warnng next time, eh?

Linkman · Sep 26, 2015

jgreco said:
It is indeed backing system memory. ZFS doesn't have its own swap space, or, rather, it kinda does if we want to think of L2ARC that way, but this isn't that.

I stand corrected. :)

So is the idea that the more HDD space FreeNAS has, the more RAM it may potentially need, so it allocates some swap on each disc it's given?

Important Announcement for the TrueNAS Community.

ZFS on partitioned disks.

Cadet

dlavigne

Guest

Cadet

dlavigne

Guest

Patron

Cadet

Patron

dlavigne

Guest

Guru

Dabbler

dlavigne

Guest

Dabbler

Patron

Inactive Account

Resident Grinch

FreeNAS Generalissimo

Resident Grinch

FreeNAS Generalissimo

Resident Grinch

Patron

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "ZFS on partitioned disks."

Similar threads