How to combine smaller disks for efficient raidz with different sized disks


clinta

Cadet
Joined
Dec 18, 2013
Messages
7
This tutorial will show you how to use geom concat (gconcat) to concatenate smaller disks so they appear as one larger disk that can participate in a raidz or ZFS mirror with other disks. Steps marked (Encryption) are optional and only necessary if you want an encrypted pool.

For this example we have two 2 TB disks and two 1 TB disks.

If we just create a raidz out of all of these disks, each disk is treated as the size of the smallest disk, so the two 2 TB disks each contribute only 1 TB and roughly 2 TB of raw capacity is wasted.
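For a rough sense of the difference, assuming raidz1 in both layouts: four members capped at 1 TB give about 3 TB of usable space, while two 2 TB disks plus a 2 TB concat of the two 1 TB disks give three 2 TB members and about 4 TB of usable space.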

1. We will start by creating a zpool with all of the disks in the web GUI. In my example it is created as a pool with two mirrors. Your pool may look different; it doesn't really matter, because this is not the pool we will keep. We create it here so that FreeNAS will partition the disks with swap space and add an entry for our zpool to the database that tracks all of our physical disks. When we're done we will have a zpool with this same name but different vdevs, and that's okay, because the FreeNAS database contains only a mapping of physical disks to zpools; the vdev members are all loaded dynamically from the ZFS metadata on the disks.
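Before going further it's worth confirming how the GUI partitioned the disks. A quick check, assuming one of the 1 TB disks is da3 as in this example; you should see a small swap partition (p1) and a large freebsd-zfs data partition (p2), and zpool status should show the GUI-created pool:
Code:
# gpart show da3
# zpool status tank
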

2. (Encryption) Optionally encrypt the disks at this stage.
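If you enable encryption here, the data partitions get geli providers and the rest of this guide refers to them by the .eli suffix. A minimal sanity check, assuming the GUI created and attached the geli providers:
Code:
# geli status
# ls /dev/gptid/*.eli
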

5. Add a pre-init command so the gconcat module is loaded at boot:
Code:
gconcat load
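You can check from the CLI that the module loads cleanly; gconcat status will list nothing until the concat device is labeled in step 8, which is expected:
Code:
# gconcat load
# kldstat | grep geom_concat
# gconcat status
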


6. In the CLI, destroy the zpool you created in the web GUI.
Code:
zpool destroy tank


7. Determine the UUIDs of your partitions so you know which ones need to be combined. I know from creating the pool in the web GUI that my 1 TB disks are da3 and da4, so I'll find the UUIDs of the large data partitions on those disks:
Code:
# gpart list da3 | grep 'Name\|Mediasize\|rawuuid'
1. Name: da3p1
   Mediasize: 2147483648 (2.0G)
   rawuuid: cb966fbf-6b2d-11e3-9a2c-000c296ed231
2. Name: da3p2
   Mediasize: 1071594257920 (998G)
   rawuuid: cb9aef83-6b2d-11e3-9a2c-000c296ed231
1. Name: da3
   Mediasize: 1073741824000 (1T)
# gpart list da4 | grep 'Name\|Mediasize\|rawuuid'
1. Name: da4p1
   Mediasize: 2147483648 (2.0G)
   rawuuid: cbb435c7-6b2d-11e3-9a2c-000c296ed231
2. Name: da4p2
   Mediasize: 1071594257920 (998G)
   rawuuid: cbb8b608-6b2d-11e3-9a2c-000c296ed231
1. Name: da4
   Mediasize: 1073741824000 (1T)


Now I know the two 1 TB partitions I need to concatenate are cb9aef83-6b2d-11e3-9a2c-000c296ed231 and cbb8b608-6b2d-11e3-9a2c-000c296ed231.
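An alternative cross-check, if you prefer to see the gptid providers directly, is glabel, which maps each gptid label to its backing partition:
Code:
# glabel status | grep gptid
# ls /dev/gptid/
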

8. Concatenate the disks with gconcat. Use the label option rather than the init option. The label option writes metadata to the end of each volume so that the concat can be assembled automatically whenever it is detected. If you are using encrypted disks, concatenate the .eli volumes as I do in this example; if you are not, the commands are the same, just without the '.eli'.
Code:
# gconcat label concat1 /dev/gptid/cb9aef83-6b2d-11e3-9a2c-000c296ed231.eli /dev/gptid/cbb8b608-6b2d-11e3-9a2c-000c296ed231.eli


Now you might think you could just use /dev/concat/concat1 as a member of your zpool, but you shouldn't. The reason is that gconcat stores its metadata at the very end of the device, while ZFS stores its metadata at the beginning. This means ZFS will see the first device in your concatenated device and think that it, rather than the concatenated device itself, is the member of your zpool. To get around this we put a GPT partition table on the concatenated device, create a partition, and then add that partition as a member of the zpool.

9. Partition the concat device:
Code:
# gpart create -s gpt /dev/concat/concat1
concat/concat1 created
# gpart add -t freebsd-zfs concat/concat1
concat/concat1p1 added
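It doesn't hurt to verify the new partition before building the pool; gpart show on the concat device should list a single freebsd-zfs partition spanning nearly the whole ~2 TB device:
Code:
# gpart show concat/concat1
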


10. Determine the UUIDs of all of your partitions:

Code:
# gpart list | grep 'Name\|Mediasize\|rawuuid'
1. Name: da0s1
   Mediasize: 988291584 (942M)
2. Name: da0s2
   Mediasize: 988291584 (942M)
3. Name: da0s3
   Mediasize: 1548288 (1.5M)
4. Name: da0s4
   Mediasize: 21159936 (20M)
1. Name: da0
   Mediasize: 4294967296 (4.0G)
1. Name: da0s1a
   Mediasize: 988283392 (942M)
1. Name: da0s1
   Mediasize: 988291584 (942M)
1. Name: da1p1
   Mediasize: 2147483648 (2.0G)
   rawuuid: cb449ef2-6b2d-11e3-9a2c-000c296ed231
2. Name: da1p2
   Mediasize: 2145336081920 (2T)
   rawuuid: cb492a04-6b2d-11e3-9a2c-000c296ed231
1. Name: da1
   Mediasize: 2147483648000 (2T)
1. Name: da2p1
   Mediasize: 2147483648 (2.0G)
   rawuuid: cb610284-6b2d-11e3-9a2c-000c296ed231
2. Name: da2p2
   Mediasize: 2145336081920 (2T)
   rawuuid: cb654af1-6b2d-11e3-9a2c-000c296ed231
1. Name: da2
   Mediasize: 2147483648000 (2T)
1. Name: da3p1
   Mediasize: 2147483648 (2.0G)
   rawuuid: cb966fbf-6b2d-11e3-9a2c-000c296ed231
2. Name: da3p2
   Mediasize: 1071594257920 (998G)
   rawuuid: cb9aef83-6b2d-11e3-9a2c-000c296ed231
1. Name: da3
   Mediasize: 1073741824000 (1T)
1. Name: da4p1
   Mediasize: 2147483648 (2.0G)
   rawuuid: cbb435c7-6b2d-11e3-9a2c-000c296ed231
2. Name: da4p2
   Mediasize: 1071594257920 (998G)
   rawuuid: cbb8b608-6b2d-11e3-9a2c-000c296ed231
1. Name: da4
   Mediasize: 1073741824000 (1T)
1. Name: concat/concat1p1
   Mediasize: 2143188455424 (2T)
   rawuuid: b05227bf-6b2f-11e3-9a2c-000c296ed231
1. Name: concat/concat1
   Mediasize: 2143188500480 (2T)


In my case the data partitions on my two larger disks are cb492a04-6b2d-11e3-9a2c-000c296ed231 and cb654af1-6b2d-11e3-9a2c-000c296ed231, and the UUID of the concat partition is b05227bf-6b2f-11e3-9a2c-000c296ed231. I'm going to create my zpool using UUIDs because that's what FreeNAS does when it creates zpools in the GUI, and I want as much consistency as possible between the ZFS metadata and what FreeNAS expects.

11. Create your zpool. In my case, because I'm using encryption, I add .eli to the partitions on the actual disks, but not to the concat partition, since encryption for that partition is provided at the lower layer by the disk partitions that are concatenated together. You will also likely need -f because the devices won't be exactly the same size; they are probably off by a few kilobytes.

Code:
# zpool create -m /mnt/tank -f tank raidz gptid/cb492a04-6b2d-11e3-9a2c-000c296ed231.eli gptid/cb654af1-6b2d-11e3-9a2c-000c296ed231.eli gptid/b05227bf-6b2f-11e3-9a2c-000c296ed231
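A quick verification, assuming the pool was created as above; the raidz1 vdev should list the two gptid members on the 2 TB disks plus the gptid of the concat partition:
Code:
# zpool status tank
# zpool list tank
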


That's it. You should now be able to check the status of your volume in the web GUI, see the members concat/concat1p1, da1p2, and da2p2, and see the volume listed as healthy.

Reboot to test that your zpool is mounted automatically and shows up healthy, possibly after entering a passphrase if you're using encryption.

In my testing this works most of the time, but occasionally it fails and the web GUI shows "unable to get volume information". Running
Code:
zpool import tank

fixes the issue. I suspect there is a race condition where the automatic zpool import runs before the concat device is available. Personally it works well enough for me, since I very rarely reboot the server.
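If the race bites you regularly, one possible workaround, sketched here and untested, is a post-init script that waits for the concat device to appear and then retries the import; the 60-second timeout and the pool name below are assumptions, not something FreeNAS ships:
Code:
#!/bin/sh
# wait up to 60 seconds for the concat device to appear
n=0
while [ ! -e /dev/concat/concat1 ] && [ "$n" -lt 30 ]; do
    sleep 2
    n=$((n+1))
done
# import the pool only if it is not already imported
zpool list tank >/dev/null 2>&1 || zpool import tank
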

Beware that if you ever need to detach this zpool and auto-import it in the web GUI, your physical disks will no longer be associated with the volume in the database. To fix this you will need to manually edit the FreeNAS database.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
While I'll give you credit for making this work, it's stupid to do this to ZFS on so many levels that I can't believe someone would be this careless with their data while at the same time using ZFS. Those ideas are antitheses of each other.

 

clinta

Cadet
Joined
Dec 18, 2013
Messages
7
I've been running this way on FreeBSD for years; most of this tutorial is just the details of how to make it work with FreeNAS. In terms of just ZFS it's no more dangerous than a standard raidz: if one of the disks in your geom_concat dies, that is one member of your raidz gone until you replace it.

It's not something I'd recommend to any of my customers in a corporate environment (along with everything else in the hacking forum), but it's been working perfectly fine on my home server for years.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I've been running this way on FreeBSD for years; most of this tutorial is just the details of how to make it work with FreeNAS. In terms of just ZFS it's no more dangerous than a standard raidz: if one of the disks in your geom_concat dies, that is one member of your raidz gone until you replace it.

It's not something I'd recommend to any of my customers in a corporate environment (along with everything else in the hacking forum), but it's been working perfectly fine on my home server for years.

You know how many people have said "it's been working fine for years" and come back later crying over their lost data? More than I want to count. I'd say it could be much more dangerous, as you don't truly have single-disk protection from RAIDZ1. You have some messy two-disk "thing" working as a single disk. Frankly, I'd never even trust something like this for home use, as I do like my data. I wouldn't spend lots of money getting appropriate hardware for ZFS and then stupidly try to save a few dollars by concatenating disks with geom.

Don't get me wrong, I'm not going to delete the thread, but this is something I'd never do. Not even as temporary storage space.
 

clinta

Cadet
Joined
Dec 18, 2013
Messages
7
If you were doing it the opposite way, say partitioning each 2 TB disk and making a raidz1 with six 1 TB partitions, then I'd agree you're reducing your ability to sustain a failure, because one physical disk dying takes out two members of the raidz and results in data loss. But having a geom_concat as a member of a raidz does not reduce your ability to survive a disk failure. If either member of your geom_concat dies, your raidz loses the entire geom and operates as if it had sustained a single disk failure.

I understand the general avoidance of anything unconventional, but personally I think you're being overly cautious. This is a good opportunity to issue the standard warning that redundancy is not backup. ZFS is known to have other issues that end up requiring a restore from backup. Personally I was hit by this bug and had to restore from backup: http://blog.simplex-one.com/?p=199, and that bug affects totally standard configurations.

Recognizing that things can go wrong with any storage system and that backups are essential, I don't think running ZFS on top of geom_concat adds any significant risk.

It is definitely good to have this discussion though so that anyone considering this can see the concerns and better understand the risks.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Well, if you have two disks you are increasing the failure rate of that one "disk". If you are using RAIDZ1, for example, and one disk fails, you are at an even higher risk of losing data because that member is built from two disks. In effect, a 4-disk RAIDZ1 where two disks are concatenated into one is really a 3-disk stripe in the event of a failure instead of a 2-disk stripe. It gives a false sense of how close to failure you are.

Not to mention that ZFS expects to be your one and only volume manager and file system. You are breaking that by concatenating disks. I doubt anyone has actually done extensive testing of ZFS in a setup such as the one you are trying to run. Are you really going to make the argument "it's worked for me for a year" as assurance that *other* people should trust their data to this? I sure hope the answer is a "hell no".

ZFS also isn't designed with the expectation that two disks will pretend to be one disk. That's often called "hardware RAID", and if you check the manual and read the forums you'll see it's incredibly stupid to do; you almost can't overstate how stupid it is. We've seen people run ZFS on a hardware RAID5 or RAID6, do a disk replacement, and have billions of errors crop up in ZFS despite the RAID controller never reporting a problem and the RAID rebuilding successfully. So there's something not "quite right" when ZFS loses control. In some cases every remaining disk was in perfect health, but ZFS still had a heart attack and pooped all over the place.

There are two kinds of "cans": those you should and those you shouldn't. I think this one should be classified as "those you shouldn't". It's a "buyer beware".

It's just as I said before: people don't go to ZFS because they want more chances to lose their data. They want the reliability and security provided by ZFS. As soon as you start cutting corners and doing things outside what ZFS was designed to handle (non-ECC RAM, hardware RAID, etc.), things often go from "just fine" to "unmountable and unrecoverable pool" in a single step.
 

clinta

Cadet
Joined
Dec 18, 2013
Messages
7
Good point that you are adding another disk. Your risk level should always be evaluated based on the number of physical disks you have. Take the common recommendation that you shouldn't scale a raidz2 beyond 8 disks; that should apply here too: 8 disks, not 8 geom devices.

I also fully agree that you should not run ZFS on top of hardware RAID. I would also not recommend running it on top of a software RAID like GEOM_RAID0, GEOM_RAID1 or GEOM_RAID3. These all introduce their own stripe sizes, which should be left to ZFS, and running on a hardware RAID hides all SMART data from the OS.

But geom_concat introduces none of this. There is no striping; it just continues one disk at the end of the other.

Also, ZFS being your one and only volume manager is a recommendation from Solaris. The primary reason for that recommendation was that if you gave ZFS anything but raw disks, it would not use the drives' write cache. FreeBSD's implementation of GEOM (which I'm taking advantage of for concat) always uses the write cache. This is why it's possible, and preferred, on FreeBSD to put ZFS on GPT partitions rather than letting ZFS be the one and only volume manager. FreeNAS does exactly this when you create a volume in the web GUI.

When you create a ZFS volume in the FreeNAS web GUI, your zpool members are GPT partitions. All I/O from ZFS goes through the GEOM object exposed in /dev/gptid, so you are already using GEOM objects for ZFS in FreeNAS. I'm just using a different GEOM provider, one that refers to a concatenation of partitions rather than a single partition.
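You can confirm this on any stock FreeNAS system with an imported GUI-created pool; the vdev members reported by zpool status are gptid providers, not raw daX devices:
Code:
# zpool status | grep gptid
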
 