Compression: lz4 vs gzip-9 for Replicated Storage

Status
Not open for further replies.
Joined
Feb 2, 2016
Messages
574
TL;DR: even when performance doesn't matter much and disk space is relatively tight, use lz4: gzip-9 doesn't save enough space to matter (at least with our data set) and is a CPU hog.

We use lz4 compression for our source pools: quick and efficient even though a large portion of our data is not compressible (JPEGs, PDFs). We replicate hourly during the day to an older, less powerful server. A large snapshot is 150 MiB. Most are half that.

Since we're not especially concerned with the performance of the target FreeNAS server, and it has half as many drive slots as our primary, we looked at what using gzip-9 instead of lz4 would do for our space utilization. Even if replication time doubled, saving a meaningful amount of space over lz4 would have made the compression change worth it.

We created new volumes so all replicated data would be compressed using gzip-9 and started the replication stream. Immediately, the CPU on the replication side (dual Xeon E5430) was swamped and we hit 700Mbps instead of the more typical 850Mbps (gigabit NICs). The source FreeNAS server (Xeon E3-1230, 10-gig) saw 20% CPU utilization, so it wasn't even breaking a sweat.

[Attached chart: FreeNAS-Replication-Compression.png]

For our data, the best-case scenario is an 8% savings by going gzip-9 instead of lz4, at a huge CPU cost and a loss of 150Mbps of replication throughput. For our primary data set, the savings were just 2%. Going into the test, we guessed that an additional 10% savings would be the point where we'd switch to gzip.

With more compressible data, gzip-9 might be worth the performance hit. For a 2% savings, I'd rather stick with the recommended lz4 (more heavily tested and deployed, and bug-free, I hope) and keep our two systems' configurations matching.

Finally: I love FreeNAS replication. So easy to set up. So amazingly efficient. So much better than the home-rolled system we had been using.

Cheers,
Matt
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
is a CPU hog
Oh yeah. We had a user with 768GB of RAM and a decently powerful Xeon E5 (IIRC) some time back. His writes actually timed out when using gzip-9.

Also: Thanks for sharing your data; it's always nice to get numbers on edge cases.
 

Sakuru

Guru
Joined
Nov 20, 2015
Messages
527
I have a test syslog server that uses normal gzip (the default level):
Code:
NAME         PROPERTY       VALUE    SOURCE
storage/log  compressratio  17.22x   -
storage/log  written        39.4G    -
storage/log  logicalused    673G     -

The compressratio has been as high as 30x :cool:
 

nojohnny101

Wizard
Joined
Dec 3, 2015
Messages
1,478
Interesting data. I'm considering deploying this but would like a bit more info. Is there documentation on what types of files would benefit from gzip-9 compression? Also, are there access issues that come up with that data? I'm trying to think ahead and plan for what impact gzip-9 compression would have if I need to access that data or replicate it back to PUSH prior to restoring after a failure.
 

Sakuru

Guru
Joined
Nov 20, 2015
Messages
527
Gzip is fairly CPU intensive, so it only makes sense for REALLY compressible data. Also, LZ4 seems to compress about 50% as well as gzip with no noticeable CPU impact. If I set up another syslog server I will use LZ4 because the space savings just aren't worth the CPU overhead to me.
 
Joined
Feb 2, 2016
Messages
574
Decompressing takes far less time than compressing, @nojohnny101. If you don't run into a performance problem writing the data, you likely won't run into a performance problem reading the data.

For example, 445 MB of data compresses to 96 MB using gzip-9 in 33 seconds but decompresses in just three seconds. (For a specific data set that is likely not representative of your own.)

As for what data compresses well, the easiest answer is data that isn't already compressed. High-density formats are usually compressed already: photos, videos and PDFs will see little benefit from filesystem compression.

Low-density and textual data is likely uncompressed and would benefit most from compression. So, word processing documents, log files, databases and code bases are excellent compression targets.

In considering your compression options, keep in mind the rest of your infrastructure. We use striped mirrors for production storage and RAIDz2 for the replicated data, across a 10Gb link. In our case, the bottleneck is just as likely to be the disk subsystem as the compression level on the receiving host. So, knowing you're only as fast as your weakest point, there isn't much downside to turning gzip-9 on.

Our tertiary storage is currently in a separate building a few hundred feet away from the main building in our office complex. While plenty secure against a fire or a burst water line, and on a different power feed, it is still subject to being lost in a hurricane or an asteroid hit. We're considering moving it to one of our remote offices 40 miles away. At that point, we'd be bandwidth-limited, and turning on compression would, potentially, allow us to use our storage more efficiently without impacting replication speed.

So, there's your long, non-answer.

Cheers,
Matt
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
Just to point out the obvious for future readers, there is nothing stopping you from gzip'ing your own files, and then there will be no CPU read or write overhead in the filesystem. And then you might as well use something better than gzip.

Gzip in the filesystem is a bit like dedupe: don't. LZ4 is a great default, and the other choice is 'off.'
 