Precomp: More Compression for your Compressed Files
Posted by Sachin Garg on 28th November 2007
So many of today’s files are already compressed (using old, outdated algorithms) that newer algorithms don’t even get a chance to touch them. Christian Schneider’s Precomp comes to the rescue by undoing the harm.
Straight from his website:
You can use it to achieve better compression on some filetypes (works on files that are compressed with zLib or the Deflate compression method, and on GIF files). Precomp tries to decompress the streams in those files, and if they can be decompressed and “re”-compressed so that they are bit-to-bit-identical to the original stream, the decompressed stream can be used instead of the compressed one.
The result is a .pcf file (PCF = PreCompressedFile) that contains more decompressed data than the original file. Note that this file is larger than the original file, but if you compress it with a compression method stronger than Deflate, the compression is better than before (or use lprepaq to get it precompressed and compressed in one step).
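The claim in the description above (a larger, decompressed file compresses better under a stronger method) is easy to check with a small experiment. The sketch below is not Precomp; it only uses Python's standard zlib and lzma modules, with LZMA standing in for "a compression method stronger than Deflate". The sample data, variable names and sizes are all made up for illustration.

```python
import lzma
import random
import string
import zlib

# Hypothetical sample data: a roughly 80 KB "chapter" of random words, repeated four times.
# The repeats sit further apart than Deflate's 32 KB window, so Deflate cannot see them.
rng = random.Random(0)
vocab = ["".join(rng.choice(string.ascii_lowercase) for _ in range(rng.randint(3, 9)))
         for _ in range(500)]
chapter = " ".join(rng.choice(vocab) for _ in range(12000)).encode("ascii")
payload = chapter * 4

deflated = zlib.compress(payload, 6)    # what the original file would store
expanded = zlib.decompress(deflated)    # what a precompression step would expose

# Stronger compressor applied to the Deflate stream versus the expanded data.
size_on_deflated = len(lzma.compress(deflated))
size_on_expanded = len(lzma.compress(expanded))

print("original payload:  ", len(payload))
print("deflated:          ", len(deflated))
print("lzma over deflated:", size_on_deflated)
print("lzma over expanded:", size_on_expanded)
```

The expanded data is bigger than the Deflate stream, but the stronger compressor can finally see the redundancy that Deflate's output hides, so its result should come out smaller.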
Deflate/zlib is the same compression that is used in the ever-popular Zip format, among many others. The neat thing about Precomp is its ability to detect such compressed streams inside any format (e.g. PDF files have embedded deflated data). Recent versions also support a lot of formats natively, so they can skip the intelligent detection of such data, which speeds things up.
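To get a feel for what detecting such streams inside an arbitrary file involves, here is a minimal sketch in Python. It is an assumption about how detection could work, not Precomp's actual code: it looks for plausible two-byte zlib headers and keeps only the candidates that decompress cleanly to the end of a stream. The function name `find_zlib_streams` and the `min_size` parameter are hypothetical.

```python
import zlib


def find_zlib_streams(blob, min_size=16):
    """Return (offset, compressed_length, data) for each zlib stream found in blob."""
    found = []
    pos = 0
    while pos < len(blob) - 1:
        # A zlib header is two bytes: the low nibble of the first byte is 8
        # (Deflate), and the pair read as a big-endian number is a multiple of 31.
        header_ok = (blob[pos] & 0x0F) == 8 and ((blob[pos] << 8) | blob[pos + 1]) % 31 == 0
        if header_ok:
            d = zlib.decompressobj()
            try:
                data = d.decompress(blob[pos:])
            except zlib.error:
                data = None  # looked like a header, but was not a valid stream
            if data is not None and d.eof and len(data) >= min_size:
                # Everything zlib did not consume is left in unused_data.
                used = len(blob) - pos - len(d.unused_data)
                found.append((pos, used, data))
                pos += used
                continue
        pos += 1
    return found


# Usage: two streams hidden between unrelated bytes (made-up container layout).
first = b"first embedded stream " * 20
second = b"second embedded stream " * 20
blob = b"HDR\x00\x01" + zlib.compress(first) + b"--gap--" + zlib.compress(second, 9) + b"TRAILER"
for offset, length, data in find_zlib_streams(blob):
    print(offset, length, len(data))
```

Scanning like this is slow on large files because every candidate header triggers a decompression attempt, which is presumably why parsing known formats directly pays off.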
But the key words in the description above are ‘bit-to-bit-identical’. That is where most of Christian’s development effort went (and where most of Precomp’s runtime goes too). While it’s trivial to decompress and recreate zlib streams, it’s not that easy to recreate the exact same bitstream, because there are many possible ways to create a valid deflate stream for the same data. Precomp uses a brute-force approach here, trying all combinations of compression settings (taking 40-50x the time of ‘just deflating’ the data).
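The brute-force idea can be sketched in a few lines of Python. This is an illustration under assumptions, not Precomp's implementation: it assumes the "combinations" are zlib's compression level and memory level, keeps the window size fixed at the maximum, and uses a hypothetical function name, `find_deflate_params`.

```python
import itertools
import zlib


def find_deflate_params(original_stream):
    """Search for zlib settings that reproduce original_stream byte for byte.

    Returns (raw_data, (level, mem_level)), or (raw_data, None) if nothing matches.
    """
    raw = zlib.decompress(original_stream)
    # Assumed search space: 9 compression levels x 9 memory levels, window bits fixed at 15.
    for level, mem_level in itertools.product(range(1, 10), range(1, 10)):
        c = zlib.compressobj(level, zlib.DEFLATED, 15, mem_level)
        candidate = c.compress(raw) + c.flush()
        if candidate == original_stream:
            return raw, (level, mem_level)
    return raw, None


# Usage: a stream made with non-default settings (sample data is made up).
data = b"".join(b"line %d of some sample text\n" % i for i in range(2000))
maker = zlib.compressobj(9, zlib.DEFLATED, 15, 5)
stream = maker.compress(data) + maker.flush()

raw, params = find_deflate_params(stream)
print(params)
```

The settings found are not necessarily the ones originally used, since several combinations can yield the same bytes; any combination that reproduces the stream is good enough. A stream written by a different Deflate implementation may match none of them, in which case it has to stay compressed.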
How important this feature is depends on you and how you plan to use it. While it’s nice to have, some may argue that it’s not worth the extra effort as long as the underlying data remains the same (hint: leave your thoughts in the comments below).
Christian is studying mathematics with computer science as a minor subject (you can check out his other projects at his website). He is planning to release Precomp’s source code under the LGPL and, in future versions, to add new recompression types (ASCII85Decode for PDFs, Base64 for MIME) and further improve the speed.
Currently, you can download a command-line tool to test it (there are also versions that include Matt’s excellent PAQ compressors for the actual compression).