The Data Compression News Blog

All about the most recent compression techniques, algorithms, patents, products, tools and events.


Recent Posts

  • Google Snaps Up On2 (10 Comments)

  • Bijective BWT (35 Comments)

    David Scott has written a bijective BWT transform, which brings all the advantages of bijectiveness to BWT-based compressors. Among other things, it makes BWT more suitable for compression-before-encryption and also gives (slightly) better compression.

  • Asymmetric Binary System (172 Comments)

    Jarek Duda’s “Asymmetric Binary System” promises to be an alternative to arithmetic coding, having all of its advantages while being much simpler. Matt has coded a PAQ-based compressor using ABS for back-end encoding. Update: Andrew Polar has written an alternate implementation of ABS.

  • Precomp: More Compression for your Compressed Files (7 Comments)

    So many of today’s files are already compressed (using old, outdated algorithms) that newer algorithms don’t even get a chance to touch them. Christian Schneider’s Precomp comes to the rescue by undoing the harm.

  • On2 Technologies is Hiring (5 Comments)

    There aren’t too many companies working on cutting-edge codecs, and of those few, this one is hiring. Best of luck.

Precomp: More Compression for your Compressed Files

Posted by Sachin Garg on 28th November 2007 | Permanent Link

So many of today’s files are already compressed (using old, outdated algorithms) that newer algorithms don’t even get a chance to touch them. Christian Schneider’s Precomp comes to the rescue by undoing the harm.

Straight from his website:

You can use it to achieve better compression on some filetypes (works on files that are compressed with zLib or the Deflate compression method, and on GIF files). Precomp tries to decompress the streams in those files, and if they can be decompressed and “re”-compressed so that they are bit-to-bit-identical to the original stream, the decompressed stream can be used instead of the compressed one.

The result is a .pcf file (PCF = PreCompressedFile) that contains more decompressed data than the original file. Note that this file is larger than the original file, but if you compress it with a compression method stronger than Deflate, the compression is better than before (or use lprepaq to get it precompressed and compressed in one step).

Deflate/zlib is the same compression method that is used in the ever-popular Zip format, among many others. The neat thing about Precomp is its ability to detect such compressed streams inside any format (e.g. PDF files have embedded deflated data). Recent versions also support many formats natively, so they don’t have to intelligently detect the presence of such data, which speeds things up.
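To make the detection idea concrete, here is a rough Python sketch: scan a file for positions where a zlib-wrapped deflate stream happens to decompress cleanly. This is only an illustration of the principle, not Precomp’s actual detection code, and the function and file names are made up for the example.

    import zlib

    def find_zlib_streams(data: bytes):
        """Toy scan for embedded zlib streams: try to start decompression at
        every byte that looks like a zlib header (0x78 is the usual first
        byte for deflate with a 32K window)."""
        found = []
        for pos in range(len(data) - 2):
            if data[pos] != 0x78:
                continue
            d = zlib.decompressobj()
            try:
                raw = d.decompress(data[pos:])
            except zlib.error:
                continue
            if raw:
                # Bytes of compressed input actually consumed; only exact
                # once the end of the stream has been reached.
                consumed = len(data) - pos - len(d.unused_data)
                found.append((pos, consumed, len(raw)))
        return found

    if __name__ == "__main__":
        with open("input.pdf", "rb") as f:  # placeholder file name
            for offset, length, raw_len in find_zlib_streams(f.read()):
                print(offset, length, raw_len)

Real tools are smarter about which positions they try, and also handle raw (headerless) deflate streams, but the core trick is the same: if it decompresses, remember it.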

But the key phrase in the description above is ‘bit-to-bit-identical’. That is where most of Christian’s development effort went (and where Precomp’s runtime goes too). While it is trivial to decompress a zlib stream and deflate it again, it is not so easy to recreate the exact same bitstream, thanks to the many possible ways of creating a valid deflate stream. Precomp takes a brute-force approach here, trying all combinations (which takes 40-50x the time of ‘just deflating’ the data).
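The brute-force search can be pictured with a small Python sketch like the one below: recompress the decompressed data under every standard zlib parameter combination (compression level, memory level, window size) and keep the one that reproduces the original bytes exactly. This is a simplification under stated assumptions, not Precomp’s real algorithm: Precomp also has to deal with raw deflate streams, flush points and encoder quirks, which is the kind of work behind the slowdown mentioned above.

    import zlib

    def find_matching_parameters(original_stream: bytes):
        """Search for zlib settings that reproduce original_stream bit-for-bit."""
        raw = zlib.decompress(original_stream)
        for level in range(1, 10):                # compression level 1-9
            for mem_level in range(1, 10):        # memory level 1-9
                for window_bits in range(9, 16):  # window size 512 B - 32 KB
                    c = zlib.compressobj(level, zlib.DEFLATED,
                                         window_bits, mem_level)
                    candidate = c.compress(raw) + c.flush()
                    if candidate == original_stream:
                        return level, mem_level, window_bits
        return None  # stream came from an encoder these settings cannot mimic

If a matching parameter set is found, only the decompressed data plus those few parameters need to be stored to rebuild the original stream later.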

How important this feature is depends on you and how you plan to use it. While it is nice to have, some may argue that it is not worth the extra effort as long as the underlying data remains the same (hint: leave your thoughts in the comments below).

Christian is studying mathematics with a minor in computer science (you can check out his other projects at his website). He plans to release Precomp’s source code under the LGPL, add new recompression types (ASCII85Decode for PDFs, Base64 for MIME), and further improve the speed in future versions.

Currently, you can download a command-line tool to test it (there are also versions which include Matt’s excellent PAQ compressors for actual compression).

5 Responses to “Precomp: More Compression for your Compressed Files”

  1. Benbelkacem Says:

    Hello,
    I am also interested in data compression, but I don’t understand how to go about it…
    For example, if I find a new method, how do I share it with the whole world without losing my rights?


  2. Sachin Garg Says:

    I am using Google’s translation of your comment as I don’t understand French.

    The best way to protect your rights would be to get a patent on your method, but it would have to be something really new. Best of luck.


  3. Benbelkacem Says:

    I am sorry that I wrote it in French, but I can write in English.
    Please, can you tell me how? I am really lost in everything there is on the internet.

    thanks


  4. David Says:

    I don’t really care about recovering a particular .zip file.
    As long as the underlying data (the original uncompressed files) can be recovered, one .zip file is about the same as another to me.

    On the other hand …

    I’ve been playing with something distantly related to this — recognizing “compression” in English text (acronyms, contractions, standard patterns for making words plural, other patterns for adding “ing” or “ed” suffixes, etc.).

    So my decompressor has 2 stages:

    First decompress to a “verbose” text analogous to the “.pcf file”, which has text something like “cooky +s”.

    Then “compress” using standard English patterns:
    “The plural of *y is *ies”.

    The “bit-for-bit identical” is important for this application, because I want the result — after the second stage — to be bit-for-bit identical with the original English text.


  5. Sachin Garg Says:

    The exact bit identical reconstruction of zip files matters when the zip compressed (deflated) data is embedded inside another file (like in PNGs or PDFs). For other uses, I agree that it is not very important.

    There has been some work done on text filters like the one you mentioned. One of the tricks I can recall is that they ‘grouped’ frequently used 2-3 character combinations and replaced them with a single ‘unused’ character (e.g. ‘the’ can be replaced by 0x01: ‘the’ occurs very frequently, in ‘the’, ‘them’, ‘their’, ‘there’ etc., while 0x01 never appears in plain text files). A toy sketch of this idea is at the end of this comment. Another trick was optimizing the encoding of new-line characters, along with special handling of the capitalization of the first character after a ‘.’ period.

    You might want to look up these papers. Let me know if you can’t find them and I will try to see if I have them in my old archives.
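
    Here is the promised toy version of that substitution idea, in Python. It is illustrative only and not taken from any particular paper; the chosen sequences and code bytes are arbitrary examples.

        # Replace a few frequent character sequences with byte values that
        # never occur in plain ASCII text, so a general-purpose compressor
        # sees a shorter, more regular input. The sequences and codes here
        # are arbitrary, and the input is assumed not to contain the code
        # bytes themselves.
        SUBSTITUTIONS = {b"the": b"\x01", b"and": b"\x02", b"ing": b"\x03"}

        def encode(text: bytes) -> bytes:
            for seq, code in SUBSTITUTIONS.items():
                text = text.replace(seq, code)
            return text

        def decode(text: bytes) -> bytes:
            for seq, code in SUBSTITUTIONS.items():
                text = text.replace(code, seq)
            return text

        sample = b"the theory of their things"
        assert decode(encode(sample)) == sample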

