The Data Compression News Blog

All about the most recent compression techniques, algorithms, patents, products, tools and events.

Subscribe

Posts: RSS Feed
Comments: RSS Feed

Sponsored Links

Recent Posts

  • Bijective BWT (7 Comments)

    David Scott has written a bijective BWT transform, which brings all the advantages of bijectiveness to BWT based compressors. Among other things, making BWT more suitable for compression-before-encryption and also give (slightly) better compression.

  • Asymmetric Binary System (113 Comments)

    Jarek Duda’s “Asymmetric Binary System” promises to be an alternate to arithmetic coding, having all the advantages, but being much simpler. Matt has coded a PAQ based compressor using ABS for back-end encoding. Update: Andrew Polar has written an alternate implementation of ABS.

  • Precomp: More Compression for your Compressed Files (3 Comments)

    So many of today’s files are already compressed (using old, outdated algorithms) that newer algorithms don’t even get a chance to touch them. Christian Schneider’s Precomp comes to rescue by undoing the harm.

  • On2 Technologies is Hiring

    There aren’t too many companies working on cutting edge codecs, and of those few this one is hiring. Best of luck.

  • China’s AVS Specifications Available (2 Comments)

    Its old news that China has developed their own Advanced Video Standard to avoid high licensing fees. English translation of the standard is now available, along with the IPR policy. Finally something technical that you can get your hands on to feed your appetite.

This is what Google uses for compression

Posted by Sachin Garg on 26th October 2005 | Permanent Link

Given the enormous amount of data they are handling here, I expected to listen/feel/see/read about something more sophisticated.

There is a lot of redundant data in their system (especially through time), so they make heavy use of compression. He went kind of fast and I only followed part of it, so I’m just going to give an overview. Their compression looks for similar values along the rows, columns, and times. They use variations of BMDiff and Zippy. BMDiff gives them high write speeds (~100MB/s) and even faster read speeds (~1000MB/s). Zippy is similar to LZW. It doesn’t compresses as highly as LZW or gzip, but it is much faster. He gave an example of a web crawl they compressed with the system. The crawl contained 2.1B pages and the rows were named in the following form: “com.cnn.www/index.html:http”. The size of the uncompressed web pages was 45.1 TB and the compressed size was 4.2 TB, yielding a compressed size of only 9.2%. The links data compressed to 13.9% and the anchors data compressed to 12.7% the original size.

This text was taken from this description of Google’s Big Table (which was snapped from the following notes by Andrew Hitchcock). Seems like there is a LOT more publically available information on this than what is stated above. I wonder where to look for it.

4 Responses to “This is what Google uses for compression”

  1. Andrew Hitchcock Says:

    You have to watch the video. They just posted the high resolution video. WMP on the Mac really sucks, but I managed to find the location where the compression talk begins. Jump ahead to 46:30 if you just want to hear about compression.

  2. Jim Leonard Says:

    I’ve combed google and usenet for a while and still can’t find any direct references to the “zippy” algorithm — any pointers?

  3. Sachin Garg Says:

    Thanks for link Andrew, very interesting stuff. But I guess you had very well covered the compression part. Couldn’t find anything more. :-)

    Jim, I think zippy is the name they have given to their internal modified version of LZW.

  4. Google, Bigtable, Compression, Zippy and BMDiff « Kevin Burton’s NEW FeedBlog Says:

    [...] I remembered some notes about compression in the original Bigtable paper and decided to dig a bit deeper. [...]

Leave a Reply

XHTML: You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>