Asynchronous Deduplication

Status
Not open for further replies.

kayot

Dabbler
Joined
Nov 29, 2014
Messages
36
So I wanted to use deduplication. But everyone and their uncle is telling me it would be a colossal mistake. I'm not stupid enough to ignore that many people.

When I was first conceptualizing my server build, one question I kept asking was which OS I should use:

Windows Server 2012 R2 or FreeNAS 9.2 (now 9.3)

A feature I liked in Server was the asynchronous deduplication that is run every so often, usually when the system sits idle. Rather than deal with absurd RAM requirements, it just does it every now and then. That's perfect for a home server system that only stores data. It's something that gets turned on and then ignored, since it takes care of itself.

Unfortunately Server sucks almost everywhere else. Sure, its SMB 3.0 is multi-threaded, but that's just nice to have, not a requirement, since FreeNAS can easily saturate a gigabit network on a single thread. Its Storage Spaces is a joke to humanity. If a bird sh|ts on a frog in the woods, the SS will drop a drive for no reason other than %uck you, that's why! Its parity is slower than molasses in the winter time going uphill both ways. Sure, if I had money to blow, I'd just get a SAS expander, set up a RAID6, and be done with it, but that costs money.

Then there is FreeNAS. Synchronous deduplication is instant, but requires a F*c^ ton of RAM. It's absurd. Sure, it can use a solid state drive and that's all nice and dandy, but that's a drive slot I could put another 3TB in. Baseline, I'd need 128GB of ECC RAM to make a 30TB RAIDZ2 array (21TB available) even a remote possibility. Everything else on ZFS is nice. It handles JBOD arrays like a champ (no need to use expanders, just get another M1015 card) and with routine maintenance, data will never go bad.
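For a rough sense of where RAM figures like this come from, here is a back-of-the-envelope sketch. It assumes the commonly quoted rule of thumb of roughly 320 bytes of in-memory dedup table (DDT) per unique block. That figure, the average block sizes, and the function name `ddt_ram_gib` are all assumptions for illustration, not measurements from a real pool.

```python
# Rough DDT (dedup table) RAM estimate.
# ASSUMPTION: ~320 bytes of in-core DDT per unique block (a rule of thumb,
# not an exact value). Average block sizes below are guesses as well.
DDT_BYTES_PER_ENTRY = 320


def ddt_ram_gib(data_tib: float, avg_block_kib: float) -> float:
    """Estimate the DDT size in GiB if every block in the pool were unique."""
    blocks = data_tib * 1024**4 / (avg_block_kib * 1024)
    return blocks * DDT_BYTES_PER_ENTRY / 1024**3


# Treats the "21TB available" as 21 TiB for simplicity.
for avg_block_kib in (128, 64, 32):
    estimate = ddt_ram_gib(21, avg_block_kib)
    print(f"{avg_block_kib:>3} KiB average block: ~{estimate:.1f} GiB of DDT")
```

The estimate is very sensitive to the average block size: halving it doubles the table. The DDT also has to share RAM with everything else ZFS caches, so the practical requirement would be higher than the raw table size.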

So, how do you, the users, feel about this idea?

Have an asynchronous deduplication option.

The only downside I can see with this is fragmentation, so it would have to rewrite files to remove the holes it would make. This is fine since this would be done when the system sits idle. This is an option for home servers.

Addition:
I think they are looking into it already -> https://github.com/zfsonlinux/zfs/issues/1071

It's an older ticket, but it wasn't closed which is always a good thing.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,176
I guess it'd be nice for scenarios with few reads, but of limited usefulness for live data, since the dedup tables would need to be there on read, as well.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
What are you trying to dedupe? Backups or other file-level data? If so, use the traditional solution of something like dupmerge.

This does, of course, make certain assumptions about how careful you are with writing to your data pool (especially that you never modify files, but delete and rewrite them instead). It is extremely effective if leveraged correctly, however.
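To make the file-level approach concrete, here is a minimal sketch of the general idea: find files with identical content and replace the extra copies with hard links. This is not dupmerge's actual code. The function names, the `.dedup-tmp` suffix, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return lists of regular files under root that have identical content."""
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            by_size[os.path.getsize(path)].append(path)
    groups = []
    for size, paths in by_size.items():
        if size == 0 or len(paths) < 2:
            continue  # only hash files that share a size with another file
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups


def hardlink_duplicates(root, dry_run=True):
    """Hard-link duplicates to the first copy; return approximate bytes saved."""
    saved = 0
    for keep, *dupes in find_duplicates(root):
        keep_stat = os.stat(keep)
        for dupe in dupes:
            st = os.stat(dupe)
            if st.st_dev != keep_stat.st_dev:
                continue  # hard links cannot cross file systems
            if st.st_ino == keep_stat.st_ino:
                continue  # already linked to the copy we keep
            saved += st.st_size
            if not dry_run:
                tmp = dupe + ".dedup-tmp"  # hypothetical temporary name
                os.link(keep, tmp)         # create the new link first
                os.replace(tmp, dupe)      # then swap it in atomically
    return saved
```

Once files are linked, they share one inode, so an in-place modification through any name changes all of them. That is why the delete-and-rewrite discipline matters. Running with `dry_run=True` first shows how much space is at stake without touching anything.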
 

kayot

Dabbler
Joined
Nov 29, 2014
Messages
36
@Ericloewe I think asynchronous deduplication only hits files that haven't been deduplicated or have changed since the last deduplication. In Server 2012 R2 you can tell it not to touch certain directories, which would probably be things like torrent download directories where the data is ever-changing.
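As a sketch of what that selection step could look like in a home-grown script, the snippet below picks files modified since the last run and skips excluded directories. It is only an illustration of the idea, not how Windows Server does it internally. The function name `changed_files` and its parameters are made up for this example.

```python
import os


def changed_files(root, last_run, exclude_dirs=()):
    """Yield regular files under root modified after last_run (epoch seconds).

    Directories listed in exclude_dirs are skipped entirely.
    """
    excluded = {os.path.abspath(d) for d in exclude_dirs}
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune excluded directories in place so os.walk never descends into them.
        dirnames[:] = [
            d for d in dirnames
            if os.path.abspath(os.path.join(dirpath, d)) not in excluded
        ]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            if os.path.getmtime(path) > last_run:
                yield path
```

A scheduled job would store the time of each run and pass it in as `last_run` next time, so an idle-time pass only has to look at new or changed files.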

@jgreco Nice, I've been using a bash script that manually went through and did hard links. Duplicates in my system tend to be identical files where I'm organizing torrents out of their downloaded directory into where they are to be stored. It adds up fast if I'm not careful.
 

jkh

Guest
kayot said:
So, how do you, the users, feel about this idea?

Have an asynchronous deduplication option.

I think the OpenZFS team would love to see you implement it! Seriously. Let us know when it's ready to test - we'll put together a test rig and see how well it works under load.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
jkh said:
I think the OpenZFS team would love to see you implement it! Seriously. Let us know when it's ready to test - we'll put together a test rig and see how well it works under load.

I can't even figure out whether you're serious here or not. If you're serious, there's already the fine example of Phil Karn's dupmerge I linked to above. It's good old fashioned known-to-work UNIX tech. But putting something like this into ZFS itself would be ... complicated.
 

jkh

Guest
I'm sort of serious. An "offline dedup", AKA dedup as offered by NetApp, is one of those oft-requested features that are much easier to request than to actually implement. Since any paid ZFS projects are also far more likely to focus on more key enablers like Block Pointer Rewrite (not that there are any such projects I'm familiar with, but I still hope that somebody out there might find an extra million bucks in between the sofa cushions to toss at such a thing), this will almost certainly never, ever happen unless someone with an itch to scratch actually just goes and does it.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
If there was a fundraiser that required a million bucks and was for the OpenZFS team to implement block pointer rewrite, I like to think they'd get 100% funded easily. Is this even a possibility though? Last I read from the ZFS devs was that block pointer rewrite would be virtually impossible to implement today. Is this not true?
 

kayot

Dabbler
Joined
Nov 29, 2014
Messages
36
@cyberjock I wish I knew how to program either of these features. I'd totally do it, though it would be painfully slow since I have no idea how to make a file/pool system. The best I can do is applications to handle annoying tasks.

I'm only half joking when I say that if someone puts together a massive Kickstarter, I would volunteer to go back to college for the education and would write a new file/pool system from the ground up with all the features of ZFS plus the ones that we all want. The challenge would be finding cohorts who would be willing to code for living expenses + tuition.

Back to the dedup issue. I wonder, how possible is it to tap into the file system's checksums and use those? That would save time on the calculations for a program like dupmerge.
 

jkh

Guest
cyberjock said:
If there was a fundraiser that required a million bucks and was for the OpenZFS team to implement block pointer rewrite, I like to think they'd get 100% funded easily. Is this even a possibility though? Last I read from the ZFS devs was that block pointer rewrite would be virtually impossible to implement today. Is this not true?

Nothing is impossible. It would certainly be difficult, however! So much so, I think it will be the great white whale for ZFS.
 

emk2203

Guru
Joined
Nov 11, 2012
Messages
573
jgreco said:
What are you trying to dedupe? Backups or other file-level data? If so, use the traditional solution of something like dupmerge.

This does, of course, make certain assumptions about how careful you are with writing to your data pool (especially that you never modify files, but delete and rewrite them instead). It is extremely effective if leveraged correctly, however.

I use rmlint on other (Linux) machines, which gives more choices than dupmerge and is blazingly fast. But I failed to compile it in a jail (it's not in ports yet). The author told me that he compiled it on FreeBSD 10.1 with clang 3.5, and that the jails use GCC 4.2, which is more than 7 years old and doesn't support the C11 standard.

Any idea how to use a more modern compiler or clang in a jail?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I'm not a compiler guy, but you can't do a "pkg install clang"?

# pkg search clang
clang-devel-3.6.r224537_1
clang33-3.3_9
clang34-3.4.2_3
clang35-3.5.0_3


Looks like clang 3.5 is in pkg-ng. As is GCC >4.2....

# pkg search gcc
amd64-gcc-4.9.1_1
amd64-xtoolchain-gcc-0.1
arm-none-eabi-gcc-4.9.1_2
colorgcc-1.3.2
gcc-4.8.4
gcc-arm-embedded-4.8.20140805
gcc-aux-20141023_1
gcc-ecj-4.5
gcc46-4.6.4_4,1
gcc47-4.7.4_2,1
gcc47-aux-20140612_1
gcc48-4.8.5.s20150108
gcc49-4.9.3.s20150107
gcc5-5.0.s20150111
gccmakedep-1.0.2_1
mingw32-gcc-4.7.2_2,1
msp430-gcc-4.6.3.20120406_3,2
powerpc64-gcc-4.9.1_1
powerpc64-xtoolchain-gcc-0.1
psptoolchain-gcc-stage1-4.6.2_3
psptoolchain-gcc-stage2-4.6.2_3
sparc64-gcc-4.9.1_1
sparc64-xtoolchain-gcc-0.1
tigcc-0.96.b8_3
zpu-gcc-1.0
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
emk2203 said:
I use rmlint on other (Linux) machines, which gives more choices than dupmerge and is blazingly fast. But I failed to compile it in a jail (it's not in ports yet). The author told me that he compiled it on FreeBSD 10.1 with clang 3.5, and that the jails use GCC 4.2, which is more than 7 years old and doesn't support the C11 standard.

Any idea how to use a more modern compiler or clang in a jail?

Jails pretty much use whatever you put in them, whether it's just a single executable that you need to isolate from the rest of the system, all the way on up to a full FreeBSD OS file tree. In that latter case there are minor caveats, but as long as it is compatible with the host kernel, it should be fine.
 

fracai

Guru
Joined
Aug 22, 2012
Messages
1,212
I've used "duff" and have been very happy.

"duff -re0 ./ | xargs -0 rm" will keep one copy of every unique file and delete the others.
 

kayot

Dabbler
Joined
Nov 29, 2014
Messages
36
fslint will hardlink copies. I don't need deduplication since this is what I was really after. Plus I can do stuff to files like add subtitles without worrying about deduplication going nuts.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I used bellybuttonlint. :)
 

kayot

Dabbler
Joined
Nov 29, 2014
Messages
36
I use the Thought Entity. It's like the source code for the universe. Problem is, it only lets me change data. I can't make it from scratch.
 

emk2203

Guru
Joined
Nov 11, 2012
Messages
573
I gave the author of rmlint some feedback, and he got rid of the compile errors on FreeNAS.

If you want a blazing fast asynchronous dedupe, install gcc 6 in a jail (pkg install gcc6) and follow the instructions for compilation on the rmlint website. Substitute
Code:
CC=gcc6 CXX=g++6 scons
for every scons command in the instructions, and rmlint compiles and installs without a problem. And it's worth it.
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,450
emk2203 said:
for every scons command in the instructions, and rmlint compiles and installs without a problem. And it's worth it.

Isn't "scons" some kind of New Orleans-style smoothie?
 