The Data Compression News Blog

All about the most recent compression techniques, algorithms, patents, products, tools and events.

Subscribe

Posts: RSS Feed
Comments: RSS Feed

Sponsored Links

New Large Text Compression Benchmark

Posted by Sachin Garg on 31st May 2006 | Permanent Link

Matt Mahoney (author of the excellent PAQ series of compressors) started a new data compression benchmark, made up of really large text files.

Here is what he said in his comp.compression announcement:

The goal of the benchmark is to solve the problem of natural language text compression, which I believe is equivalent to passing the Turing test for AI. Currently, no algorithm can compress text to 1 bpc, the entropy estimated by Shannon in 1950 based on text prediction experiments by humans. The benchmark is 1GB of text from Wikipedia, about the amount it would take a lifetime to read. The entropy of the resulting model is roughly the same as the long term memory capacity of the human brain by some estimates. The web page explains the rules and rationale for this benchmark.

The goal of most benchmarks is to try to find the “best overall” algorithm. We know that the most general solution to encoding a string x is the shortest program that outputs x. But Kolmogorov proved this is an impossible goal. Instead, the top ranked programs are usually those with lots of specialized models for different formats like exe, pdf, bmp, wav, etc., whatever is included in the benchmark. Every time a new data type is added to a benchmark (e.g. the digits of pi), then compression developers have to add another model (a program that computes pi). I think this distracts from the problem and leads to overly complex systems. Compressing text in a single language is a hard enough problem.

This pretty much sums up everything, so I don’t have much to add, other than that, just like his work on PAQ, its a great thing :-)

Leave a Reply

XHTML: You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>