Hello,
I want to "convert" a directory that contains a bunch of data (multiple TBs) into its own dataset. I'd like to do this in a way that has the least possible impact on my replication backups. That is, ideally, after moving the data into its own dataset, those multiple TBs will not need to be re-copied during my next replication backup. I'm not sure this is even possible, but I'd like to float an idea out there and see what others with more experience think of it.
So right now, I have the following datasets:
masterData
masterData/lion
masterData/tiger
And the masterData dataset has a directory in it called "bear" with many TBs of data. Ultimately, I want the contents of the bear directory to belong to its own dataset called masterData/bear. Naturally, bear's data should no longer be stored in the masterData dataset.
I found this thread, which explains a possible strategy:
Code:
# Create a snapshot of masterData
zfs snapshot masterData@mybearsnap

# Clone the snapshot (should be quick since it uses copy-on-write, but is this
# handled smartly in a subsequent recursive replication stream of masterData?)
zfs clone masterData@mybearsnap masterData/newbear

# Remove all directories that are part of masterData but not part of bear
rm -rf /mnt/masterData/newbear/otherDir /mnt/masterData/newbear/otherDir2

# Move the contents of the bear subdirectory into the root of the newbear dataset
mv /mnt/masterData/newbear/bear/* /mnt/masterData/newbear/
mv /mnt/masterData/newbear/bear/.* /mnt/masterData/newbear/  # note: in some shells .* also matches . and ..

# No need for this empty directory
rmdir /mnt/masterData/newbear/bear

# Remove the contents of the existing bear directory on masterData
# (obviously make sure it's not in use first)
rm -rf /mnt/masterData/bear

# Rename the newbear dataset to bear
zfs rename masterData/newbear masterData/bear

# Remove the snapshot. Note: as written this will fail with "snapshot has
# dependent clones" while the clone relationship exists; zfs promote
# masterData/bear would move the snapshot (and the dependency) onto the
# clone rather than eliminate it.
zfs destroy masterData@mybearsnap
On to my questions:
- Does this seem like a workable solution?
- This post mentions inefficiency, which has me concerned. If I understand correctly, since I remove the old snapshot in the last step, this shouldn't leave a bunch of deleted-but-still-referenced files around. Is that understanding correct, or is this inefficient in some way I'm not noticing (i.e., does it waste a bunch of space that will never be recovered, and if so, why)?
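For whatever it's worth, ZFS exposes the space accounting needed to answer this empirically. A sketch of how one might check where the space actually lives before and after the final destroy (dataset names match the example above):

```shell
# Per-dataset space breakdown: space pinned by snapshots (usedbysnapshots)
# vs. space used by the dataset's own current data (usedbydataset)
zfs list -r -o name,used,usedbydataset,usedbysnapshots,refer masterData

# Show each dataset's origin snapshot; anything other than "-" means the
# dataset is a clone that still shares (and pins) blocks with that snapshot
zfs get -r origin masterData
```

If `usedbysnapshots` on masterData stays large after the cleanup steps, that would be the wasted space the post is worried about.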
- As mentioned, I back up my volumes using recursive replication of the masterData dataset to a remote server. My bandwidth is metered on a monthly basis, so I don't want this dataset creation to result in the transfer of the many TBs of data contained in the bear dataset. Given that snapshot clones are COW (see step two), is this accounted for in the replication stream such that no new data will need to be transferred?
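One way to sanity-check this before burning any metered bandwidth: `zfs send` has a dry-run mode that prints an estimated stream size without sending anything. A sketch, where `backup-prev` and `backup-next` are placeholders for whatever snapshot names your replication tasks actually use:

```shell
# Take the next recursive snapshot as usual
zfs snapshot -r masterData@backup-next

# Dry run (-n) with verbose output (-v): print the estimated size of the
# incremental (-i) recursive replication stream (-R) from the previous
# backup snapshot, without actually sending any data
zfs send -R -n -v -i masterData@backup-prev masterData@backup-next
```

If the estimate comes back in the multi-TB range, the answer to the question is no, and the scheme needs rethinking before the real replication runs.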