Bijective BWT
Posted by Sachin Garg on 15th January 2008 | Permanent Link
David Scott has written a bijective BWT transform, which brings all the advantages of bijectiveness to BWT based compressors. Among other things, making BWT more suitable for compression-before-encryption and also give (slightly) better compression.
For those who are among the majority who don’t understand bijectiveness and its advantages, I will start with a brief introduction.
Bijectiveness, in data compression context, means that it should be possible to feed any random file to the decompressor and it should generate a valid ‘uncompressed’ file. Decompressors of non-bijective compressors usually crash (or get stuck in infinite loops) if they are made to decompress any data which was not created by their compressor. The bijective property is particularly important for compression done before encryption.
And as bijective compression result in a more complete one-to-one mapping, they should theoretically give better compression ratios when compared to their no-bijective counterparts. Practically however this gain is very small and the added runtime and code complexity makes it debatable if its ‘worth the effort’. But anyway, that doesn’t means that the gain doesn’t exists.
OK, enough about bijectiveness, lets get to David’s ‘Scottified’ BWT.
With BWT, we need to store the last column of rotated and sorted strings, along with a index of original string’s location. With normal BWT, its not possible to reverse the transform without knowing the index. But with David’s BWT, we don’t need to store that extra index. Without the need of index, any stream can be considered a BWTed string and its reverse transform can be computed.
It might sound a little confusing how the transform can still be reversible without that index. Its got something to do with the cycles in BWTed strings which are followed during reverse BWT transform but you will need to go through all the gory details at the pages linked below to understand that.
You can get the code and a brief description of the algorithm at David’s website. You can find a better description of the algorithm in the comp.compression discussion on this, look for posts by Klaus Stengel.
January 16th, 2008 at 3:24 pm
Briefly, a normal BWT is inverted by constructing a linked list and traversing it from the start location, which is transmitted separately. This linked list consists of one cycle. A bijective BWT has multiple cycles, each of which starts with the first element.
January 16th, 2008 at 5:18 pm
Less briefly, Usually a normal BWT consists of one cycle if one uses a special EOF character it’s always one cycle if one does not it will either be one cycle or several identical cycles. Example ABC goes to CAB:2 one cyle and a single index. There is more than one way to define the index I placed on the start character in range of 1 to 3. However the file ABCABCABC goes to in normal BWT to CCCAAABB:index of 4 or 5 or 6.
Actually both examples above are rare. In fact both examples above have same BWTS transform as the normal BWT transform except no index. In fact for every valid BWT transform if you pick the correct index you have a valid BWTS transform that doesn’t need an index.
For a typical file that is not made of identical repeats if you rotate it you get the same BWT transform all that changes is the index. If you use a BWTS at one rotation it matches the BWT at the other rotations it will have multiple cycles not all identical. The resultant transform don’t exist in the BWT output space so there is no reverse BWT regardless of index for the other files. In a typical case if a random file is N bytes long only one rotation of the file is able to be reversed in normal BWT. Sometimes no rotation leads to an inverse sometimes more than one. However in BWTS each has a unique inverse. The fact that its bijective and uses all the output space gives us a shot at someday beating or tying PPM style compressors.
One point even though the BWTS usually has several cycles for the most part when sorted together it still produces long repeated strings. Example I could have a hundred cycles each containing the sub string “the dog is blue” the letters will srill sort together.
Hope This Helps
David A. Scott
P.S. one more point
if you keep doing a BWTS on a file you do not vist all the permutations however if you do it enough you eventually get back to the starting file.