This is what Google uses for compression
Posted by Sachin Garg on 26th October 2005
Given the enormous amount of data they are handling, I expected to hear or read about something more sophisticated.
There is a lot of redundant data in their system (especially through time), so they make heavy use of compression. He went kind of fast and I only followed part of it, so I’m just going to give an overview. Their compression looks for similar values along the rows, columns, and times. They use variations of BMDiff and Zippy. BMDiff gives them high write speeds (~100MB/s) and even faster read speeds (~1000MB/s). Zippy is similar to LZW. It doesn’t compress as highly as LZW or gzip, but it is much faster. He gave an example of a web crawl they compressed with the system. The crawl contained 2.1B pages and the rows were named in the following form: “com.cnn.www/index.html:http”. The size of the uncompressed web pages was 45.1 TB and the compressed size was 4.2 TB, yielding a compressed size of only 9.2%. The links data compressed to 13.9% and the anchors data compressed to 12.7% of the original size.
This text was taken from this description of Google’s Big Table (which was snapped from the following notes by Andrew Hitchcock). It seems like there should be a LOT more publicly available information on this than what is stated above. I wonder where to look for it.
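For anyone curious about the row-naming form mentioned in the quoted notes (“com.cnn.www/index.html:http”), here is a minimal Python sketch of how such a key might be built from a URL. The exact rule is my assumption, inferred from that single example: reverse the hostname labels, append the path, then a colon and the scheme. The function name `row_key` and the sample URLs are hypothetical, not anything from Google.

```python
from urllib.parse import urlsplit


def row_key(url: str) -> str:
    """Build a row name in the form seen in the notes.

    Assumed rule (inferred from one example only):
    reversed hostname + path + ":" + scheme.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    # "www.cnn.com" -> "com.cnn.www"
    reversed_host = ".".join(reversed(host.split(".")))
    # Treat a missing path as the root path.
    path = parts.path or "/"
    return f"{reversed_host}{path}:{parts.scheme}"


# Hypothetical sample URLs, for illustration only.
urls = [
    "http://www.cnn.com/index.html",
    "http://example.org/",
    "http://money.cnn.com/index.html",
]

# Sorting by key puts pages from the same domain next to each other.
keys = sorted(row_key(u) for u in urls)
for key in keys:
    print(key)
```

With keys shaped like this, a sorted listing puts pages from the same domain on neighbouring rows, which would presumably help a compressor that looks for similar values along the rows.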
October 27th, 2005 at 1:22 am
You have to watch the video. They just posted the high resolution video. WMP on the Mac really sucks, but I managed to find the location where the compression talk begins. Jump ahead to 46:30 if you just want to hear about compression.
October 27th, 2005 at 6:21 pm
I’ve combed google and usenet for a while and still can’t find any direct references to the “zippy” algorithm — any pointers?
October 31st, 2005 at 12:30 pm
Thanks for the link, Andrew, very interesting stuff. But I guess you had already covered the compression part very well. Couldn’t find anything more. :-)
Jim, I think zippy is the name they have given to their internal modified version of LZW.
October 12th, 2008 at 9:41 pm
[...] I remembered some notes about compression in the original Bigtable paper and decided to dig a bit deeper. [...]