DEDUPE bug (or am I doing something wrong)?


NASbox

Guru
Joined
May 8, 2012
Messages
650
Hi All

TLDR;
I'm hoping someone can tell me whether I've found a bug or I'm doing something wrong. I have two directory trees with almost exactly the same data, and I don't seem to be getting any space savings from deduplication. Only 2 of the roughly 4,700 files in each directory tree differ; all other files have exactly the same content and directory stat info (owner/group/size/date/time).
EDIT: Test was run with gzip compression and again with no compression.

BACKGROUND/TEST METHOD
I am trying to set up a system for backing up multiple WordPress sites from a public web host. Since the bulk of the data is WordPress code, which is very compressible, and the majority of the content is common across sites, I would expect deduplication to save considerable space, but it doesn't appear to be working.

Tests run on BUILD: FreeNAS-11.1-RELEASE

I created a dataset TANK/SITEBACKUPS which has dedup enabled and uses gzip compression (gzip compresses much better than the default lz4, and in my application the extra execution time is not a problem; saving disk space is the priority).

Code:
TANK/SITEBACKUPS  compression  gzip  local
TANK/SITEBACKUPS  dedup        on    local
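
For reference, this is roughly how the dataset was configured (a sketch reconstructed from the properties above, not copied from my shell history):

Code:
# Reconstructed from the properties shown above (not from shell history)
zfs create TANK/SITEBACKUPS
zfs set compression=gzip TANK/SITEBACKUPS
zfs set dedup=on TANK/SITEBACKUPS
zfs get compression,dedup TANK/SITEBACKUPS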


I have a script that backs up the WP site and its database. To test deduplication under ideal conditions, I made the script back up the same site/database twice, into these two directories:

/SITEBACKUPS/DUMMY
/SITEBACKUPS/DUMMY2
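
The script itself is basically a mysqldump of the site's database plus an rsync of the document root. Roughly (a simplified sketch; the host, user, and remote paths below are placeholders, only the two .BACKUPDATA file names are real):

Code:
#!/bin/sh
# Simplified sketch of the backup script; host/user/paths are placeholders
DEST=/mnt/TANK/SITEBACKUPS/DUMMY            # DUMMY2 for the second run
ssh backupuser@webhost "mysqldump wp_database" > "$DEST/.BACKUPDATA/sqlbackup.sql"
rsync -az --log-file="$DEST/.BACKUPDATA/sync_copylog.txt" \
    backupuser@webhost:/path/to/wordpress/ "$DEST/"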

Here is the disk usage after both backups have been run:

du -h -d 1 .
238M ./DUMMY2
239M ./DUMMY
477M .

The two copies are identical except for two files. These four files (two in each tree) total about 1,500 KB, and the differences are:

#>diff -rq DUMMY DUMMY2 | less
Files DUMMY/.BACKUPDATA/sqlbackup.sql and DUMMY2/.BACKUPDATA/sqlbackup.sql differ
Files DUMMY/.BACKUPDATA/sync_copylog.txt and DUMMY2/.BACKUPDATA/sync_copylog.txt differ

The only difference in sqlbackup.sql is a single line containing the backup time.
The rsync log file sync_copylog.txt is about 560 KB and contains many differences, because it records the details of each individual rsync run.
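
Just to rule out a mistake in the comparison, the rest of the two trees can be checked byte-for-byte with something like this (illustrative only; it excludes the two .BACKUPDATA files known to differ, and the paths assume the dataset is mounted at /mnt/TANK/SITEBACKUPS):

Code:
# Hash every file in each tree except the two known-different .BACKUPDATA files,
# then diff the sorted hash lists (FreeBSD md5; illustrative sanity check only)
cd /mnt/TANK/SITEBACKUPS
( cd DUMMY  && find . -type f ! -path './.BACKUPDATA/*' -exec md5 {} + | sort ) > /tmp/dummy1.md5
( cd DUMMY2 && find . -type f ! -path './.BACKUPDATA/*' -exec md5 {} + | sort ) > /tmp/dummy2.md5
diff /tmp/dummy1.md5 /tmp/dummy2.md5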


To determine data usage, I used a script that relies on the following zfs list command:
zfs list -t all -o name,used,avail,refer,creation,usedds,usedsnap,compression,compressratio,refcompressratio,lused -r "$DATASET"
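
The wrapper script is trivial; roughly (a sketch, with the dataset passed as the first argument):

Code:
#!/bin/sh
# Sketch of the reporting helper used for the listings below
DATASET="$1"
echo "Recent Snapshots:"
zfs list -t all -o name,used,avail,refer,creation,usedds,usedsnap,compression,compressratio,refcompressratio,lused -r "$DATASET"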

TEST RESULTS

Empty dataset before backing up to ./DUMMY and ./DUMMY2:
Code:
Initial Empty Dataset
-------------------------------------------------------------------------------------------------------------------
Recent Snapshots:
NAME              USED  AVAIL  REFER  CREATION             USEDDS  USEDSNAP  COMPRESS  RATIO  REFRATIO  LUSED
TANK/SITEBACKUPS  205K  14.5T  205K   Mon Apr 9 4:13 2018  205K    0         gzip      1.00x  1.00x     40.5K
-------------------------------------------------------------------------------------------------------------------

After first backup to ./DUMMY:
Code:
After backup to DUMMY
-------------------------------------------------------------------------------------------------------------------
Recent Snapshots:
NAME              USED  AVAIL  REFER  CREATION             USEDDS  USEDSNAP  COMPRESS  RATIO  REFRATIO  LUSED
TANK/SITEBACKUPS  239M  14.5T  239M   Mon Apr 9 4:13 2018  239M    0         gzip      1.25x  1.25x     259M
-------------------------------------------------------------------------------------------------------------------

After second backup to ./DUMMY2:
Code:
After second backup of same site to DUMMY2
-------------------------------------------------------------------------------------------------------------------
Recent Snapshots:
NAME              USED  AVAIL  REFER  CREATION             USEDDS  USEDSNAP  COMPRESS  RATIO  REFRATIO  LUSED
TANK/SITEBACKUPS  478M  14.5T  478M   Mon Apr 9 4:13 2018  478M    0         gzip      1.25x  1.25x     517M
-------------------------------------------------------------------------------------------------------------------

Two copies appear to take double the space; no savings at all.

EDIT:
Just in case deduplication and compression are mutually exclusive, I used the GUI to turn compression off, deleted DUMMY and DUMMY2 with rm -rf, and then recreated them with mkdir.
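
Roughly, the reset was (a sketch; compression was actually changed in the GUI, the zfs set line is just the CLI equivalent, and the path assumes the dataset is mounted at /mnt/TANK/SITEBACKUPS):

Code:
# CLI equivalent of the GUI change, plus resetting the two test directories
zfs set compression=off TANK/SITEBACKUPS
cd /mnt/TANK/SITEBACKUPS
rm -rf DUMMY DUMMY2
mkdir DUMMY DUMMY2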

New Dataset Properties
Code:
TANK/SITEBACKUPS  compression  off  local
TANK/SITEBACKUPS  dedup        on   local


Dataset edited from the FreeNAS GUI, then DUMMY and DUMMY2 deleted and recreated with rm -rf / mkdir.

Empty Dataset before running backup
Code:
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Recent Snapshots:
NAME              USED  AVAIL  REFER  CREATION             USEDDS  USEDSNAP  COMPRESS  RATIO  REFRATIO  LUSED
TANK/SITEBACKUPS  222K  14.5T  222K   Mon Apr 9 4:13 2018  222K    0         off       1.00x  1.00x     44.5K
----------------------------------------------------------------------------------------------------------------------------------------------------------------
#>du -h
512B	./DUMMY2
512B	./DUMMY
1.5K	.

After First Backup Run to DUMMY2
Code:
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Recent Snapshots:
NAME              USED  AVAIL  REFER  CREATION             USEDDS  USEDSNAP  COMPRESS  RATIO  REFRATIO  LUSED
TANK/SITEBACKUPS  293M  14.5T  293M   Mon Apr 9 4:13 2018  293M    0         off       1.00x  1.00x     259M
----------------------------------------------------------------------------------------------------------------------------------------------------------------
#>du -h -d1
292M	./DUMMY2
512B	./DUMMY
292M	.

After Second Backup Run to DUMMY
Code:
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Recent Snapshots:
NAME              USED  AVAIL  REFER  CREATION             USEDDS  USEDSNAP  COMPRESS  RATIO  REFRATIO  LUSED
TANK/SITEBACKUPS  586M  14.5T  586M   Mon Apr 9 4:13 2018  586M    0         off       1.00x  1.00x     517M
----------------------------------------------------------------------------------------------------------------------------------------------------------------
#>du -h -d1
292M	./DUMMY2
292M	./DUMMY
585M	.

Directory Differences (Same as first run - almost identical)
Code:
#>diff -rq DUMMY DUMMY2 | less
Files DUMMY/.BACKUPDATA/sqlbackup.sql and DUMMY2/.BACKUPDATA/sqlbackup.sql differ
Files DUMMY/.BACKUPDATA/sync_copylog.txt and DUMMY2/.BACKUPDATA/sync_copylog.txt differ
 
Last edited:

m0nkey_

MVP
Joined
Oct 27, 2015
Messages
2,739
You should really simulate how much space you will save by having de-duplication switched on.

zdb -U /data/zfs/zpool.cache -S tank

Dedupe is not a magic bullet and doesn't work the way you'd expect. You're better off using compression in most cases, as dedupe is memory hungry (roughly 5 GB of RAM per 1 TB of storage).
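
As a rough illustration of that memory cost (back-of-the-envelope only, using the rule of thumb above; the 4 TB figure is just an example):

Code:
# Rule-of-thumb DDT memory cost: ~5 GB of RAM per 1 TB of deduped data
# e.g. deduping 4 TB of backups -> roughly 4 * 5 GB = 20 GB of RAM for the dedup tables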
 

NASbox

Guru
Joined
May 8, 2012
Messages
650
You should really simulate how much space you will save by having de-duplication switched on.

zdb -U /data/zfs/zpool.cache -S tank

Dedupe is not a magic bullet and doesn't work the way you'd expect. You're better off using compression in most cases, as dedupe is memory hungry (roughly 5 GB of RAM per 1 TB of storage).

Thanks for the reply, maybe you can help me clarify a few things.

Dedupe is not a magic bullet and doesn't work how you'd expect.

That is an understatement! I'm really confused as to how dedupe could get a better opportunity to deduplicate than two completely identical copies of a set of files, identical right down to the metadata! Seriously, what am I missing?

zdb -U /data/zfs/zpool.cache -S tank

I checked out the man page for this command, but I'm not sure from the example above how I should apply it.

Is /data/zfs/zpool.cache the actual parameter I should use, or just an example? I notice that the file does exist:

Code:
-rw-r--r--  1 root  www  6224 Feb 27 01:16 zpool.cache
-rw-r--r--  1 root  www  6224 Feb  2 03:12 zpool.cache.saved

but I have no idea what it is used for or if the zdb command will damage the file or cause other problems.

So would this command zdb -U /data/zfs/zpool.cache -S TANK/SITEBACKUPS be correct?

Is it necessary to use deduplication for the whole pool?
(If yes, that's a deal breaker: the pool is too big, and there aren't enough dupes to make it worthwhile.)

Any hints/assistance would be much appreciated.
 

NASbox

Guru
Joined
May 8, 2012
Messages
650
Has anyone had any success with deduplication? I would think my use case would be ideal, but it seems to be doing absolutely nothing. Am I missing something, or does this feature just not work?
 