The Data Compression News Blog

All about the most recent compression techniques, algorithms, patents, products, tools and events.

XML-WRT 3.0

Posted by Sachin Garg on 29th September 2006

Przemyslaw Skibinski recently released version 3.0 of XML-WRT. This version adds internal PPMVC and FastPAQ8 compression.

XML-WRT is a high-performance XML compressor (in fact, it works with any textual file). It transforms XML into a more compressible form and uses zlib (the default), LZMA, PPMVC, or FastPAQ8 as the back-end compressor. The idea is based on XMill, a well-known XML compressor. In addition, XML-WRT creates a semi-dynamic dictionary and replaces frequently used words with shorter codes. Several additional techniques improve the compression ratio:

* word alphabet can consist of start tags (like "<tag>"), URLs, and e-mail addresses
* special model for encoding numbers
* input XML file is split into containers
* there are special containers for dates, time, pages and fractional numbers
* end tags ("</tag>") are replaced with a single char
* end tags + EOL symbols can also be replaced with a single char
* spaceless words model
* very effective methods for preserving white space
* quotes modeling ('="' and '">' are replaced with a single char)
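
As a rough illustration of the dictionary idea described above, here is a minimal Python sketch. It is not XML-WRT's actual code: the thresholds, the one-byte code layout, and the function names are all assumptions made for this example, and "semi-dynamic" is taken here to mean that the dictionary is built from the input itself in a first pass and must be kept alongside the output for decoding.

```python
import re
from collections import Counter

# Hypothetical parameters for this sketch; XML-WRT's real thresholds
# and code layout are not described in the post.
MIN_COUNT = 2      # a word must occur at least this often to be replaced
MIN_LENGTH = 4     # very short words are not worth a dictionary slot
FIRST_CODE = 0x80  # codes are the bytes 0x80..0xFF, so input must be ASCII

WORD = re.compile(rb"[A-Za-z]+")


def build_dictionary(data: bytes) -> list[bytes]:
    """First pass: collect the frequent words, most frequent first."""
    counts = Counter(WORD.findall(data))
    frequent = [
        word
        for word, count in counts.most_common()
        if count >= MIN_COUNT and len(word) >= MIN_LENGTH
    ]
    # Only as many words as there are free one-byte codes.
    return frequent[: 256 - FIRST_CODE]


def encode(data: bytes, dictionary: list[bytes]) -> bytes:
    """Second pass: replace each dictionary word with its one-byte code."""
    if any(byte >= FIRST_CODE for byte in data):
        raise ValueError("this sketch only handles 7-bit ASCII input")
    codes = {word: bytes([FIRST_CODE + i]) for i, word in enumerate(dictionary)}
    return WORD.sub(lambda m: codes.get(m.group(0), m.group(0)), data)


def decode(data: bytes, dictionary: list[bytes]) -> bytes:
    """Expand one-byte codes back into the words they stand for."""
    out = bytearray()
    for byte in data:
        if byte >= FIRST_CODE:
            out += dictionary[byte - FIRST_CODE]
        else:
            out.append(byte)
    return bytes(out)
```

The transformed output would then be handed to a back-end compressor such as zlib. Because element names like "title" or "author" repeat in every record, they are natural candidates for short codes.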

Matt Mahoney compared the results of versions 2.0 and 3.0.

On enwik8 and enwik9 from the Large Text Compression Benchmark (compressed sizes in bytes):

Program and options                     enwik8       enwik9
xml-wrt 2.0 -l6 -b255 -m255 -s -f8  23,199,202  196,914,328
xml-wrt 3.0 -l11 -b255 -m255 -f24   19,663,305  165,274,422
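
To put these figures in perspective, the short Python sketch below computes how much smaller the version 3.0 output is relative to version 2.0. The sizes are copied from the results above; the helper name `relative_saving` is just an illustrative choice.

```python
# Compressed sizes in bytes, copied from the results quoted above.
RESULTS = {
    "enwik8": {"2.0": 23_199_202, "3.0": 19_663_305},
    "enwik9": {"2.0": 196_914_328, "3.0": 165_274_422},
}


def relative_saving(old_size: int, new_size: int) -> float:
    """Return the size reduction of new_size relative to old_size, in percent."""
    return 100.0 * (old_size - new_size) / old_size


savings = {
    name: relative_saving(sizes["2.0"], sizes["3.0"])
    for name, sizes in RESULTS.items()
}

for name, percent in savings.items():
    print(f"{name}: version 3.0 output is {percent:.1f}% smaller than version 2.0")
```

This works out to a reduction of roughly 15% on enwik8 and 16% on enwik9. Note that the two runs use different options (-l6 versus -l11, among others), so the gain reflects the new settings as well as the new version.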

On enwik8, as a preprocessor to ppmonstr:

ppmonstr J -m1700 -o16 = 19,055,092
xml-wrt 2.0 -l0 -w -s -c -b255 -m100 -e2300 | ppmonstr J -m1650 -o64 = 18,625,624
xml-wrt 3.0 -l0 -b255 -m255 -3 -s -e7000 | ppmonstr J -m1650 -o64 = 18,494,374

On enwik9, again as a preprocessor to ppmonstr, the compressed size drops from 150,651,873 to 150,004,636 bytes.