Organizing Data Sanity Check

onlineforums

Explorer
Joined
Oct 1, 2017
Messages
56
I'm looking for a sanity check and some tips on how best to organize data so that my snapshots don't balloon from "deleted" or "moved" data that is actually still captured in a different snapshot.

I work with other people on projects that are 1 MB to 100 MB in size, consisting of 5 to 50 files. Each project is unique and there can be dozens of "current" projects. After a project changes status (complete, hold, etc.), it is moved out of the "current" folder so it doesn't clutter up the folder of projects we're currently working on.

The current structure, outside of FreeNAS, is as follows:

Code:
CURRENT-PROJECTS
  | - Project01
      | - <files>
  | - Project02
      | - <files>
  | - NON-CURRENT
      | - COMPLETE
          | - Project03
              | - <files>
      | - HOLD
          | - Project04
              | - <files>


Basically, a project MOVES from one status folder to another (or, in the illustration above, subfolder). So for example, when Project03 was completed, it was moved from the "CURRENT-PROJECTS" folder to "CURRENT-PROJECTS/NON-CURRENT/COMPLETE".

My concern is with datasets, snapshots, and replication tasks, and I'm looking for your tips on how to handle them.

Alternatively, I was thinking of separating NON-CURRENT out from the CURRENT-PROJECTS folder/dataset, as follows:
Code:
CURRENT-PROJECTS
  | - Project01
      | - <files>
  | - Project02
      | - <files>

NON-CURRENT
  | - COMPLETE
      | - Project03
          | - <files>
  | - HOLD
      | - Project04
          | - <files>


I'm looking for best practices that let me MOVE a project folder from the "CURRENT-PROJECTS" folder to a NON-CURRENT/<status> folder, whether with nested datasets or not.
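For reference, here's a sketch of how the two candidate layouts might be created. The pool name `tank` and mountpoint `/mnt/tank` are assumptions for illustration only; the actual commands are commented out since they need to run against a real pool:

```shell
# Sketch only — pool name "tank" and mountpoint /mnt/tank are assumptions.
#
# Option A: one dataset, statuses as plain directories, so a move between
# statuses stays a rename within a single filesystem:
#   zfs create tank/projects
#   mkdir -p /mnt/tank/projects/CURRENT-PROJECTS
#   mkdir -p /mnt/tank/projects/NON-CURRENT/COMPLETE
#   mkdir -p /mnt/tank/projects/NON-CURRENT/HOLD
#
# Option B: a child dataset per status. A "move" between child datasets is
# really a copy + delete across filesystems, so snapshots of the source
# dataset keep holding the moved blocks:
#   zfs create tank/projects/CURRENT-PROJECTS
#   zfs create tank/projects/NON-CURRENT
msg="layout sketch only; run against your own pool"
echo "$msg"
```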

My concern is that I don't want snapshots and replication tasks to grow for the moved folders, since the data will already be accounted for in the snapshot/replication of wherever the project was moved to. For example, if all projects across all statuses (current, complete, and hold) total 6 GB, I want to be able to move projects around and have the snapshots taken after the move still total 6 GB. That might be 2 GB in CURRENT-PROJECTS, 2 GB in COMPLETE, and 2 GB in HOLD. If I move 1 GB from CURRENT-PROJECTS to COMPLETE, I want the totality of all the new snapshots to still be only 6 GB rather than 7 GB. In other words, I don't want one snapshot to see the data as deleted, removed, or changed, but rather as moved to a different dataset that references the same data.

I'll be playing around with this in a FreeNAS testbed, but figured I would post it here for some insight and to help guide my testing.

Thank you!
 

anmnz

Patron
Joined
Feb 17, 2018
Messages
286
How will your users access the data? SMB? NFS? Something else?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
If it's all in the same dataset, the mv command will just update the directory entry (the name pointer) to the new location; no data on disk actually changes, so that would avoid the snapshot duplication you're looking to prevent.

This does mean that all of your data needs to be in a single dataset (not even split across children of the same parent dataset).

If you're moving between datasets, you'll need to manage the snapshots as you move the data, in order to remove the snapshots containing the now-unwanted copies (perhaps along with the "wanted" data too, so not great).
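A quick way to convince yourself that mv within one filesystem only rewrites metadata is to compare inode numbers before and after the move. This sketch uses /tmp and standard tools rather than ZFS, but the same principle applies: the inode doesn't change, so no data blocks are copied.

```shell
# Demonstrate that mv within one filesystem is a rename, not a copy:
# the file keeps its inode number, so no data blocks are rewritten.
mkdir -p /tmp/mvdemo/CURRENT-PROJECTS /tmp/mvdemo/NON-CURRENT/COMPLETE
echo "project data" > /tmp/mvdemo/CURRENT-PROJECTS/plan.txt

# GNU stat uses -c %i, BSD stat uses -f %i; try both for portability.
before=$(stat -c %i /tmp/mvdemo/CURRENT-PROJECTS/plan.txt 2>/dev/null \
      || stat -f %i /tmp/mvdemo/CURRENT-PROJECTS/plan.txt)

mv /tmp/mvdemo/CURRENT-PROJECTS/plan.txt /tmp/mvdemo/NON-CURRENT/COMPLETE/plan.txt

after=$(stat -c %i /tmp/mvdemo/NON-CURRENT/COMPLETE/plan.txt 2>/dev/null \
     || stat -f %i /tmp/mvdemo/NON-CURRENT/COMPLETE/plan.txt)

[ "$before" = "$after" ] && echo "same inode: rename only, no blocks copied"
rm -rf /tmp/mvdemo
```

If the move crossed a filesystem (or dataset) boundary instead, mv would fall back to copy + delete and the inode would change.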
 

onlineforums

Explorer
Joined
Oct 1, 2017
Messages
56
If it's all in the same dataset, the mv command will just update the directory entry (the name pointer) to the new location; no data on disk actually changes, so that would avoid the snapshot duplication you're looking to prevent.

This does mean that all of your data needs to be in a single dataset (not even split across children of the same parent dataset).

If you're moving between datasets, you'll need to manage the snapshots as you move the data, in order to remove the snapshots containing the now-unwanted copies (perhaps along with the "wanted" data too, so not great).
Is the mv command what takes place when a Windows user moves a folder via SMB? I suspect that's how it translates over to FreeBSD.

I'm wondering how I can organize this without unnecessarily large snapshots, considering we organize data according to status (current, complete, hold).

Below is the ideal situation, but I'm not sure how to accomplish it with ZFS snapshots without having a lot of duplicate data between dataset snapshots.
10G - CURRENT status
50G - COMPLETE status
20G - HOLD status
Assume that each "project" folder is 1 GB, so the CURRENT dataset has 10 projects.

We organize these into three datasets named according to their status (current, complete, hold).
A user works from within the CURRENT status dataset, creating new files in a project, editing files, etc.
After a project is complete, the user would MOVE the project folder to the COMPLETE dataset. Our sizes now look as follows:
9G - CURRENT status
51G - COMPLETE status
20G - HOLD status
TOTAL: 80G

However, the disk usage would look as follows due to snapshots:
10G - CURRENT status
51G - COMPLETE status
20G - HOLD status
TOTAL: 81G

This isn't ideal because there will end up being a significant amount of data duplicated between dataset snapshots. In the above example, the CURRENT dataset snapshot holds 1 GB of data that is also in the COMPLETE dataset (and its subsequent snapshots).

Suggestions on how to go about this without duplicated data all over the place? My fear with custom zfs scripts set up in cron is that deleting a snapshot (or snapshots) that references that 1 GB project MAY DELETE OTHER DATA that we may need to retrieve.
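For what it's worth, one way I'm thinking of watching this in testing, before deleting anything, is ZFS's per-dataset space accounting plus a dry-run destroy. The dataset and snapshot names below are made up, and the block is guarded so it's a no-op on a machine without ZFS:

```shell
# Dataset/snapshot names (tank/projects, @auto-weekly) are hypothetical.
# Guarded so this is a harmless no-op where zfs isn't installed.
if command -v zfs >/dev/null 2>&1; then
  # usedbysnapshots = space held ONLY by snapshots of this dataset;
  # usedbydataset   = the live data itself.
  zfs list -o name,used,usedbydataset,usedbysnapshots tank/projects || true

  # -n = dry run, -v = verbose: reports what WOULD be freed, destroys nothing.
  zfs destroy -nv tank/projects@auto-weekly || true
  result="inspected"
else
  result="zfs not available; nothing inspected"
fi
echo "$result"
```

Watching `usedbysnapshots` grow on CURRENT-PROJECTS after moving a project out would confirm that the old snapshots are pinning the moved copy.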
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Is the mv command what takes place when a Windows user, via SMB, moves a folder? I suspect that is how it translates over to FreeBSD.
Probably. It may also be that the operation is performed client-side by SMB (a copy followed by a delete)... you should run testing to confirm before you put that into production.

Suggestions on how to go about doing this while not having duplicated data all over the place?
As I already suggested, put the three different areas into a single dataset (as directories, not child datasets) and use the mv operation (subject to your testing of SMB behavior, as I also suggested).

My fear with having some custom zfs scripts setup in cron is that deleting a snapshot(s) that reference that 1G project MAY DELETE OTHER DATA
Destroying a snapshot with dependencies requires an additional switch on the command, so unless your script uses the force and recursive switches by default, it should not be able to break dependent snapshots or clones by destroying anything.
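Roughly, the failure mode looks like this (names invented, and guarded so it's a no-op without ZFS): the plain destroy refuses when something depends on the snapshot, and only the explicit recursive form takes the dependents with it.

```shell
# Hypothetical names; harmless no-op on machines without zfs installed.
if command -v zfs >/dev/null 2>&1; then
  # Without -R, this FAILS if a clone depends on the snapshot, e.g.:
  #   cannot destroy 'tank/projects@snap1': snapshot has dependent clones
  zfs destroy tank/projects@snap1 || true

  # -R destroys the snapshot AND every dependent clone — this is the
  # dangerous switch your cron script should never use by default:
  #   zfs destroy -R tank/projects@snap1
  outcome="attempted"
else
  outcome="zfs not available"
fi
echo "$outcome"
```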
 