
I do a ton of file compression. Most of the stuff I am compressing is just code, so I need to use lossless compression.

I wondered if there was anything that offers a better size reduction than 7zip. It doesn't matter how long it takes to compress or decompress; size is all that matters.

Does anyone know how the various tools and compression algorithms available in Linux compare for compressing text? Or is 7zip the best for compressing source code?

8 Answers

lrzip is what you're really looking for, especially if you're compressing source code!

Quoting the README:

This is a compression program optimised for large files. The larger the file and the more memory you have, the better the compression advantage this will provide, especially once the files are larger than 100MB. The advantage can be chosen to be either size (much smaller than bzip2) or speed (much faster than bzip2). [...] The unique feature of lrzip is that it tries to make the most of the available ram in your system at all times for maximum benefit.

lrzip works by first scanning for and removing any long-distance data redundancy with an rzip-based algorithm, then compressing the non-redundant data.

Con Kolivas provides a fantastic example on the Linux Kernel Mailing List, wherein he compresses a 10.3GB tarball of forty Linux kernel releases down to 163.9MB (1.6%), and does so faster than xz. He wasn't even using the most aggressive second-pass algorithm!

I'm sure you'll have great results compressing massive tarballs of source code :)

sudo apt-get install lrzip 

Example (using defaults for the other options):

Ultra compression, dog slow:

lrzip -z file 

For directories, just replace lrzip with lrztar.
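
A sketch of a full round trip (assuming the lrztar/lrzuntar wrappers shipped with the lrzip package, and that lrztar passes the -z option through to lrzip; the directory name is a placeholder):

lrztar -z linux-src/          # tar + lrzip -z in one step; produces linux-src.tar.lrz
lrzuntar linux-src.tar.lrz    # decompresses and extracts the tree again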

  • I can also attest that lrzip works really great for backups of tar/cpio/pax'ed system file trees, because those usually contain lots of long-range redundancies, something that lrzip is really good at compressing. Commented Nov 27, 2014 at 7:11
  • I've tried lrzip and pixz on a 19 GB text file. Both took about half an hour to compress it (on a hexa-core machine), but the lrz file was half the size of the xz file (2.7 vs. 4.4 GB). So, another vote for this answer instead. Commented Jan 20, 2015 at 12:04
  • Feels like Pied Piper! Commented Oct 29, 2016 at 14:02
  • Do you know what the difference would be between lrzip and rzip? rzip looks like it was released in 1998, designed to do best on very large files with long-distance redundancy, so it sounds similar to lrzip. Just wondering if lrzip was derived from rzip? (rzip is from rzip.samba.org) Commented Jan 17, 2017 at 9:56

7zip is more of a compactor (like PKZIP) than a compressor¹. It's available for Linux, but it can only create compressed archives in regular files; it's not able to compress a stream, for instance. Nor is it able to store most Unix file attributes, like ownership, ACLs, extended attributes, hard links...

On Linux, as a compressor, you've got xz that uses the same compression algorithm as 7zip (LZMA2). You can use it to compress tar archives.

As with gzip and bzip2, there's a parallel variant, pixz, that can leverage several processors to speed up compression (xz can also do this natively since version 5.2.0 with the -T option). pixz also supports indexing a compressed tar archive, which means it can extract a single file without having to decompress the archive from the start. A sketch of both follows.
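
A minimal sketch of both (file and member names are placeholders; these are standard xz/pixz invocations, but treat them as illustrative):

tar -cf src.tar src/
xz -9e -T0 -k src.tar                   # produces src.tar.xz; -k keeps the input
pixz src.tar src.tpxz                   # alternative: indexed, parallel xz compression
pixz -x src/main.c < src.tpxz | tar x   # extract a single member without reading the rest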


¹ To compact is to archive + compress (possibly with indexing, possibly with members compressed separately); archiving doesn't imply compression. It is not a DOS thing, but possibly it was a French thing. Googling Usenet archives, I seem to only come across articles of mine, so it could well have been my invention, though I strongly believe it's not.


(Updated answer) If time doesn't matter, use ZPAQ v1.10 (or newer), e.g.:

zpaq pvc/usr/share/doc/zpaq/examples/max.cfg file.zpaq file.tar

(In zpaq 1.x the config file is appended directly to the command letters, hence the lack of a space after pvc; the max.cfg location may vary, check your installed package's file list.)

zpaq actually compressed more than kgb -9 newFileName.kgb yourFileName.tar. kgb is based on the older PAQ6 algorithm, and is very slow...
I tested all the other compressors, like 7zip, lrzip, bzip2, kgb... and zpaq compressed the most!
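
Note that newer zpaq releases (6.x/7.x) changed the command-line syntax; a sketch assuming one of those versions, where -m5 is the slowest, strongest method:

zpaq a file.zpaq file.tar -m5   # add file.tar to the archive with maximum compression
zpaq x file.zpaq                # extract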


If kgb still interests you, though (it was my initial choice in this answer, so I am keeping the information here):
Ubuntu 14.04 has kgb 1.0b4; run sudo apt-get install kgb to install it.

What follows is about a Windows version that you can try to run/compile on Linux, though I did not succeed.
Version 2 beta2 can be found on SourceForge, but no Linux binaries are available. You can try to run it in a console with wine kgb2_console.exe -a7 -m9 (method -a6 -m9 seems to be equivalent to the best method in 1.0b4; -a7 is new in 2 beta2). I had better stability, though, by installing .NET 2.0 with winetricks and running wine "KGB Archiver 2 .net.exe" (I don't much like doing that, so I will stick with the native Linux 1.0b4, which gives almost the same result as 2 beta2).
Anyway, version 2 beta2 seriously deserves a native Linux version too! Maybe something can be accomplished with MinGW, see this, but this command still fails badly: i586-mingw32msvc-g++ kgb2_console.cpp -o kgb. Maybe try to compile it with dmcs (Mono)? See this tip.

  • Does zpaq handle streamed data over pipes? I tried - as the input file and it didn't work. I'd like to use this with tar over pipes / with GNU tar's -I flag. Commented Sep 5 at 5:21

If you're looking for the greatest size reduction regardless of compression speed, LZMA is likely your best option.

When comparing the various compressors, the tradeoff is generally time vs. size. gzip tends to compress and decompress relatively quickly while yielding a good compression ratio. bzip2 is somewhat slower than gzip in both compression and decompression time, but yields even greater compression ratios. LZMA has the longest compression time, but yields the best ratios while also decompressing faster than bzip2.

Sources: http://bashitout.com/2009/08/30/Linux-Compression-Comparison-GZIP-vs-BZIP2-vs-LZMA-vs-ZIP-vs-Compress.html

http://tukaani.org/lzma/benchmarks.html
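
If you want to measure that tradeoff on your own data, a quick sketch (the tarball name is a placeholder; /usr/bin/time is GNU time):

for c in gzip bzip2 xz; do
    /usr/bin/time -f "$c: %e s" $c -9 -c mytree.tar > mytree.tar.$c
done
ls -lS mytree.tar.*    # compare the resulting sizes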

  • I need to disagree on this one! The lossless file compressor providing the greatest reduction factor regardless of compression speed that works on GNU/Linux is probably either zpaq or paq8l. However, they are so slow that they are impractical for most real-world usages. Commented Nov 27, 2014 at 7:15
  • @Franki Cool: sudo apt-get install zpaq. I did some tests; according to the wiki, that app is the newest (2009), but it still loses to kgb (which uses PAQ6), though kgb is MUCH slower... Commented May 3, 2016 at 22:09
  • @Franki Actually, I just found that zpaq pvc/usr/share/doc/zpaq/examples/max.cfg file.zpaq file.tar compresses more than kgb -9. Commented May 10, 2016 at 2:17
  • @Franki How does zpaq or paq8l compare to lrzip? Commented Jan 17, 2020 at 17:43

Zstandard deserves a mention. Even though with default settings it doesn't compress as well as xz, it is much faster at both compression and decompression. When Arch Linux switched from xz to zstd, they reported

~0.8% increase in package size on all of our packages combined, but the decompression time for all packages saw a ~1300% speedup

Today I compressed the same 684M text corpus with xz and zstd. I didn't do any rigorous testing, YMMV, but the differences are so huge it doesn't seem necessary:

  • xz took 9m36s to compress that to 71M, decompressing in 9s
  • zstd (default options) took 6s to compress it to 123M, decompressing in <2s
  • zstd -9 took 42s to compress it to 99M, again decompressing in <2s.
  • zstd -19 is slower than xz at 12m40s, but compresses even better to 70M, and still decompresses in <2s.
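
For reference, a sketch of invocations along these lines (standard zstd flags; --long widens the match window, which can help further on large, redundant inputs):

zstd -19 -T0 --long=27 corpus.txt                    # high compression, all cores; writes corpus.txt.zst
zstd -d --long=27 -o corpus.copy.txt corpus.txt.zst  # a matching --long is needed for large windows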
  • Zstd is fast but not what the question is about: "It doesn't matter how long it takes to compress or decompress; size is all that matters." Commented Jun 9 at 5:10

7zip is not a single compression technology; rather, it supports several different compression methods (see the Wikipedia article on 7z).

A set of tests was performed with different tools, specifically for C source files. I'm not sure which of the tools exist for Linux, if they still exist at all. However, you may note that the best algorithm was PPM with modifications (PPMII, then PPMZ).

If you are interested in the tools, you can browse the site; it's in Russian, but Google Translate may help. There is a big repository of binaries, which you may (or may not) be able to use from Linux with wine, if really needed.

  • Beware of 7zip: the archive format is Windows-centric and doesn't handle Unix specials like symlinks or permissions very well, lacks Unix nanosecond timestamps, and has a very unorthodox, Windows-centric command line that could create vulnerabilities if extensive care isn't taken to sanitize file names. Commented Jul 10, 2024 at 4:20

In case anyone needs it: I recently had to do a comparison between different compression tools, to see if we could improve compression times for our trading logs (binary files).

This is a rough benchmark compressing an 80G binary trading log on a 48-core Dell rack server with 754GB of RAM.

As always, the tradeoff is smaller size vs. longer compression time (and also how many CPUs you can dedicate to the compression job).

pigz, zstd and bzip3 stand out as the most performant.

[benchmark chart: compression time and resulting size for each tool tested]
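
Invocations would look something like the following (a sketch: the file name and thread counts are placeholders, and it assumes bzip3's -j option for parallel jobs):

pigz -9 -p 48 trades.bin      # parallel gzip across 48 cores
zstd -19 -T48 trades.bin      # zstd with 48 worker threads
bzip3 -e -j 48 trades.bin     # bzip3 with 48 parallel jobs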


I recently needed to transfer a patch (< 5k) for wide distribution across my target network, so I tried to figure out the smallest compressed size I could get. My results seem to show that lzma is the best option for compressing smallish text files:

The test file:

$ ls -l 2024-09-04.patch ; file 2024-09-04.patch
-rw-r--r-- 1 samveen users 3868 Sep 4 20:46 2024-09-04.patch
2024-09-04.patch: unified diff output, ASCII text, with very long lines (328)

I tested all 4 of the common compressors available on Linux:

  • gzip
  • bzip2
  • xz
  • lzma (xz --format=lzma in my Debian environment).

$ for C in gzip bzip2 xz lzma; do cat 2024-09-04.patch | $C -9 > 2024-09-04.patch.${C/ip/} ; done

The results, sorted by size descending:

$ ls -lS 2024-09-04.patch.*
-rw-r--r-- 1 samveen users 1816 Sep 4 21:00 2024-09-04.patch.bz2
-rw-r--r-- 1 samveen users 1660 Sep 4 21:00 2024-09-04.patch.xz
-rw-r--r-- 1 samveen users 1616 Sep 4 21:00 2024-09-04.patch.gz
-rw-r--r-- 1 samveen users 1612 Sep 4 21:00 2024-09-04.patch.lzma

As can be seen, lzma gives the best compression results for my patch file.

Notes:

  • This was tested on both an x86_64 laptop and an AWS EC2 Graviton instance, with identical results.
  • My test methodology applied the compressors to an input stream instead of a file. This saved 16 bytes of file/filesystem metadata (such as the original file name) that gzip otherwise adds to the output when the input is a file; see the sketch below.
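
A quick way to see that difference with gzip (a sketch: gzip records the original file name in the header when compressing a named file, but not when reading from stdin):

gzip -9 -c 2024-09-04.patch | wc -c    # named-file input: file name stored in the header
gzip -9 < 2024-09-04.patch | wc -c     # stream input: no file name stored, slightly smaller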
