The Data Compression News Blog

All about the most recent compression techniques, algorithms, patents, products, tools and events.

Subscribe

Posts: RSS Feed
Comments: RSS Feed

Recent Posts

  • Google Snaps Up On2 (10 Comments)

  • Bijective BWT (36 Comments)

    David Scott has written a bijective BWT transform, which brings all the advantages of bijectiveness to BWT based compressors. Among other things, making BWT more suitable for compression-before-encryption and also give (slightly) better compression.

  • Asymmetric Binary System (172 Comments)

    Jarek Duda’s “Asymmetric Binary System” promises to be an alternate to arithmetic coding, having all the advantages, but being much simpler. Matt has coded a PAQ based compressor using ABS for back-end encoding. Update: Andrew Polar has written an alternate implementation of ABS.

  • Precomp: More Compression for your Compressed Files (7 Comments)

    So many of today’s files are already compressed (using old, outdated algorithms) that newer algorithms don’t even get a chance to touch them. Christian Schneider’s Precomp comes to rescue by undoing the harm.

  • On2 Technologies is Hiring (9 Comments)

    There aren’t too many companies working on cutting edge codecs, and of those few this one is hiring. Best of luck.

Bijective BWT

Posted by Sachin Garg on 15th January 2008 | Permanent Link

David Scott has written a bijective BWT transform, which brings all the advantages of bijectiveness to BWT based compressors. Among other things, making BWT more suitable for compression-before-encryption and also give (slightly) better compression.

For those who are among the majority who don’t understand bijectiveness and its advantages, I will start with a brief introduction.

Bijectiveness, in data compression context, means that it should be possible to feed any random file to the decompressor and it should generate a valid ‘uncompressed’ file. Decompressors of non-bijective compressors usually crash (or get stuck in infinite loops) if they are made to decompress any data which was not created by their compressor. The bijective property is particularly important for compression done before encryption.

And as bijective compression result in a more complete one-to-one mapping, they should theoretically give better compression ratios when compared to their no-bijective counterparts. Practically however this gain is very small and the added runtime and code complexity makes it debatable if its ‘worth the effort’. But anyway, that doesn’t means that the gain doesn’t exists.

OK, enough about bijectiveness, lets get to David’s ‘Scottified’ BWT.

With BWT, we need to store the last column of rotated and sorted strings, along with a index of original string’s location. With normal BWT, its not possible to reverse the transform without knowing the index. But with David’s BWT, we don’t need to store that extra index. Without the need of index, any stream can be considered a BWTed string and its reverse transform can be computed.

It might sound a little confusing how the transform can still be reversible without that index. Its got something to do with the cycles in BWTed strings which are followed during reverse BWT transform but you will need to go through all the gory details at the pages linked below to understand that.

You can get the code and a brief description of the algorithm at David’s website. You can find a better description of the algorithm in the comp.compression discussion on this, look for posts by Klaus Stengel.

36 Responses to “Bijective BWT”

  1. Matt Mahoney Says:

    Briefly, a normal BWT is inverted by constructing a linked list and traversing it from the start location, which is transmitted separately. This linked list consists of one cycle. A bijective BWT has multiple cycles, each of which starts with the first element.


  2. David A. Scott Says:

    Less briefly, Usually a normal BWT consists of one cycle if one uses a special EOF character it’s always one cycle if one does not it will either be one cycle or several identical cycles. Example ABC goes to CAB:2 one cyle and a single index. There is more than one way to define the index I placed on the start character in range of 1 to 3. However the file ABCABCABC goes to in normal BWT to CCCAAABB:index of 4 or 5 or 6.

    Actually both examples above are rare. In fact both examples above have same BWTS transform as the normal BWT transform except no index. In fact for every valid BWT transform if you pick the correct index you have a valid BWTS transform that doesn’t need an index.

    For a typical file that is not made of identical repeats if you rotate it you get the same BWT transform all that changes is the index. If you use a BWTS at one rotation it matches the BWT at the other rotations it will have multiple cycles not all identical. The resultant transform don’t exist in the BWT output space so there is no reverse BWT regardless of index for the other files. In a typical case if a random file is N bytes long only one rotation of the file is able to be reversed in normal BWT. Sometimes no rotation leads to an inverse sometimes more than one. However in BWTS each has a unique inverse. The fact that its bijective and uses all the output space gives us a shot at someday beating or tying PPM style compressors.

    One point even though the BWTS usually has several cycles for the most part when sorted together it still produces long repeated strings. Example I could have a hundred cycles each containing the sub string “the dog is blue” the letters will srill sort together.

    Hope This Helps
    David A. Scott
    P.S. one more point
    if you keep doing a BWTS on a file you do not vist all the permutations however if you do it enough you eventually get back to the starting file.


  3. David A. Scott Says:

    I haven’t been able to test the code but it looks like Yuta Mori Added Burrows-Wheeler Transform Scottifed. (BWTS) to his openbwt library http://www.geocities.com/zxcb33/openbwt-v1.3.zip

    BWTS as I wrote is not as fast or space efficient as it could be. Not sure if he improved on the space and speed since my code only written to do the job and could be much faster and use far less space, But it is first major library to include my simple bijective BWT chech it out if you wish I know I will over the next few days if I get a chance

    David A. Scott


  4. Sachin Garg Says:

    This is interesting. Yuta has also written one of the fastest suffix sorting algorithms, I wouldn’t be surprised if he would have done wonders to BWTS’ speed.


  5. David A. Scott Says:

    Well he has an executable in it that seems to work but it adds the block size to front of output of course you need the block size when you unBWTS stuff. But I prefer not to make that part of the output. But thats me.

    Even though his executable runs on my machine when I try to build the executable using his library I can do it but the code always seems to fail with an illegal instruction so not really sure how to use it. Any suggestions would be welcome. Maybe the library he supplied is not the same as the one he used in creating his executable or maybe there is some gcc option I am missing. I know of others that have used the library sucsssfully be then maybe its the AMD Turion64 that can’t heck it.
    Take Care


  6. David A. Scott Says:

    IT WORKS!! so far
    I was going crazy I kept trying to use the openbwt.a to resolve the problem
    know I see there are 2 bwt.c one in the “samples” directory and one in the “src”
    and the openbwt.a is not needed

    The trick to get it to work was to use the bwts.c from the samples directory
    and the bwt.c from the src directory.

    Anyway I was going crazy even downloaded
    a new for me C complier BCC32 since a friend
    got it to work with it. It took me a long time
    to realize there are two diferent source of bwt.c
    in the zip file. Anyway back to sleep.


  7. David A. Scott Says:

    There is an article at http://arxiv.org/PS_cache/cs/pdf/0502/0502073v1.pdf that is very close to talking about the indexless BWT transform. It may be worth reading if one is interested in that sort of thing.


  8. Franc Jarnovich Says:

    Hello,

    Here is a new aspect to data compression. Random(or chaotic) data have insight them some ordered ‘islands’ or ‘attractors’. If written above is not true, then chaotic data are NOT ordered. Paradox, isn’t it? Gaussian distributions is almost complete, but Gauss himself have the impact on it , because he was physical part of live universe.

    nice regards

    Franc Jarnovich
    computer programmmer


  9. Franc Jarnovich Says:

    Errata …then chaotic data are NOT ordered…
    correct is …then chaotic data IS ordered…

    Franc Jarnovic
    computer programmmer


  10. Andrew Polar Says:

    David, that is very interesting. I will definitely try your BWT. But before I do that I wanna ask one question:
    Do all BWT sorters produce different results or the same. To my understanding if algorithm is same and alphabet is same why the result is different. I want to try to experiment with some technique close to BZIP and too lazy to write my own BWT. If I take yours how effective is that. Is it possible that another BWT does it better that means provide more smooth result.
    Thanks.


  11. David A. Scott Says:

    Not all BWT’s the same. Some treat the whole file as an endless loop. Others add a unique character at the end that has a special weight either lighter or heavier than the other characters. I have even read about using a weights that are half way. You could even sort in a special order. You could even do a gray code kind of thing that actually seems useful.
    But the current standard way is to add a special character at end and make it heavier than the rest. Then when you done you drop it and then have an index to where that character would have appeared thus the current and original methods need in index thus changing the length. My BWTS methods and there is more than one don’t change the lenght of the file.

    Even when two methods appear the same the INDEX could be to where first or least character is and where the index is placed in file can be different
    some times at front along with maybe block size

    Before my BWTS I did the full file BWT without adding
    any symbols and than bijectively added an index at end of file. This was more efficient then using any common universal numbering system. Since if one starts with a 256 byte file after BWTS Plus an index the file would at most be 257 bytes long but with my method of adding the index bijectively it could be less.

    Hope this helps


  12. David A. Scott Says:

    Sorry I really didn’t anwser you question?
    “How does one measure a “smooth result”?
    I think one could design files that make mine
    look better or worse. The question is which method
    leads to better compresses results and even here it
    will depend heavily on the files actually tested and
    any tuning done.

    Look at this way. With my method every file is either a plain file or file that is the result of a BWT you can treat any file as either. So therefore if it helps some files compress it will not help others compress. It how you roll the dice I’m beating that is helps most useful files.


  13. David A. Scott Says:

    Here is a draft of the paper on the bijective form of BWT

    http://bijective.dogma.net/00yyy.pdf


  14. Maloeran Says:

    Well done David!

    Last time I played with BWT, I just thought that there had be a way to reconstruct the original sequence without that index… Of course, it wasn’t that simple, but you solved it very nicely!

    I believe I’ll be reading the paper and trying to squeeze a few extra bytes out of my compressor, thanks David.


  15. David Scott Says:

    The paper was rejected. As some one commented you can do a BWT using a special EOF symbol so that an index is not necessary.
    I don’t think some of the reviews had a clue what it what it really meant.
    Since most blocks of data regardless if you allow a free EOF or an index still don’t have a reverse BWT.
    The concept was not well understood by the reviewers.
    During the time we started writing this another guy
    who asked to join us started writing us own that also
    had bijective ST transfomrs of various orders. His paper
    was accepted but maybe those running the conferences
    in the Czech Republic have a better understanging of
    just what the concept was anyway here is his paper

    http://arxiv.org/abs/0908.0239

    Have a nice day
    dave

    P.S. as for Gil I have not heard from him since
    Oct of this year. But I emailed him earlier this
    month and no reply.


  16. David Scott Says:

    It figures just when I said I have not heard from Yossi I get an email. So maybe we will do more next year.


  17. alan Says:

    Re: Franc Jarnovic,

    do you get it ? (”honour thy error” - Brian Eno; musician)


  18. alan Says:

    Clue everyone:

    what is “sound”?


  19. Maloeran Says:

    David,

    That’s ridiculous, the paper was rejected on the basis that you can just add an EOF symbol? This is so utterly stupid…

    If you were really to first to implement a index-free bijective BWT, I sincerely hope you manage to get credits for it.


  20. David Scott Says:

    Of course its impossible to tell why it was rejected. I am definitely not part of the old boy network. But when I see that really good papers and people being punished for not bowing to the gods of global warming don’t get published its almost an honor not to get published. I don’t know how good the paper is but I feel confident that the really good innovative papers never make the light of day first time up in todays political strangle hold on science. Fact its getting colder the last few years yet the so called scientists that are peer reviewed are doing there best to hid those facts.
    Buy the way the german guy mentions my name several times so in way I think I do get credit for the bijective BWT even though most idiots think the regular BWT is bijective.

    Sorry alan I don’t get it what are you ranting about.


  21. Kragen Javier Sitaker Says:

    Hey, for people who aren’t that familiar with the BWT, if you want to play with an interactive BWT (the standard version, not Scott’s bijective version), I wrote one a while back in JS at http://canonical.org/~kragen/sw/bwt.html. You can fiddle with changing the BWTed string or saved index and watch the poor algorithm struggle to invert it.

    I wonder if I’ll get around to implementing the Scottified version. It looks like fun!


  22. David Scott Says:

    Yes it looks great and for simple strings when you
    have an index equal to zero your UNBWT is the same
    as UNBWTS so since not using an index for inverse on the bottom part you could always just do the UNBWTS since most of the time UNBWT does not exist. But When it does
    its the same as UNBWTS
    Where you see the difference is if you BWT a string and
    get an index of nonzero from your code.
    When you UNBWTS the resultant string you get something else. That however when BWTed goes to a string with an index that is not zero.
    go ahead do the BWTS and UNBWTS
    David


  23. Vivien Greene Says:

    Blog looks great and reads even better! You share some great opinions and insight here. Always looking for motivating blogs to keep mine going!


  24. Jarek Duda Says:

    Hi all,
    I’m just reading Matt Mahoney’s book and started appreciating BWT - my comment is that it’s much more general transformation than just using standard (..ABCDE..) order among chars as I’ve seen in the book.
    If we are fighting for a few bytes like in BBWT, while focusing on text in a given language, it could help a bit to use some better BWT, optimized for this language:
    - DIFFERENT SYMBOL ORDER, like accordingly to phonetics: for example separating vowels and consonants (like ..AEBCD..) - thanks of it lexicographic order could group contexts a bit better and so the BWT could be less ‘ragged’. It’s cheap - just bijectively remap characters before and after BWT,
    - DIFFERENT ALPHABET: maybe grouping a few symbols into a new one (like ‘AE’), or splitting them (like vowel nr. 1 instead of ‘A’) could also make it a bit more efficient…

    It would require a lot of experiments, but using BWT optimized for a given purpose could cheaply improve ratio a few percents(?)
    Hope it help,
    Jarek


  25. Maloeran Says:

    Jarek, reodering characters to group them by typical context has been done, for text it does usually provide a small gain over plain BWT.

    I don’t think making groups of symbols would be a wise thing to do given the nature of BWT. With its string ordering, it exploits very well the prediction of a byte given all previous ones.

    If you want to look at something fancier and a *lot* more computationally expensive than BWT, have a look at PPM or PAQ algorithms.

    Have fun!


  26. Jarek Says:

    Thanks - it was only a comment to the book - there is a lot about BWT there, but I haven’t seen this cheap generalization of this transformation: looking useful preparation for the main compression …
    About symbol grouping - my thought was that language evolution works with phonetics rather - so maybe it might be more effective to block groups of letters corresponding to phonemes …
    Generally for a supercompressors I would think about some dynamically evolving DMC-like structure for local correlations, ’supervised’ by synchronously evolving NN searching for larger scale relations … but I agree that it’s huge work - I’m focusing on physics now and I’m generally too theoretical for such .. marathons :)
    Have fun!


  27. David Scott Says:

    There are many things BWTS could be used for. I also have a bijective distance coder. If I can get off my lazy ass and actually code the two together it would make a nice bijective compressor especially for short files where it would be a good compressor to use before an encryption pass. I have not written much here lately I thought this thread was dead.


  28. David A. Scott Says:

    I have place the BWTS in a package that I hope does a full bijective BWTS-DC-FIB there may be some bugs use with caution but its a first cut at this kind of compression using DC for me. Its at

    http://bijective.dogma.net/bwtsdcV1.7z

    take care


  29. kickasstorrents Says:

    If you wish for to take a great deal from this piece of writing
    then you have to apply such methods to your won webpage.


  30. dental clinic Says:

    Howdy! This post could not be written any better! Looking through this post reminds me of my previous roommate!
    He always kept preaching about this. I’ll forward this
    information to him. Fairly certain he will have a great
    read. Thank you for sharing!


  31. algo para bajar de peso Says:

    I had to give you a quick thank you for this nice content!


  32. azygos Says:

    I’ve been looking at your blog site for a time, but until recently I decided to post at the very least
    a hello there.


  33. http://www.tripleclicks.com/13742129/ Says:

    When someone writes an post he/she maintains the thought of a user in his/her brain that how a
    user can understand it. So that’s why this article is outstdanding.
    Thanks!


  34. Jerrold Says:

    I really like the caliber of this blog, is really great to browse through the
    articles written right here.


  35. Claribel Says:

    Just wish to say your article is as astounding.
    The clarity in your put up is just great and that i could think you’re an expert in this subject.
    Well along with your permission allow me to seize your RSS
    feed to keep updated with impending post. Thanks 1,000,
    000 and please keep up the enjoyable work.


  36. phentermine Says:

    Fastin is a true “mood promoter” whose stimulant effects are
    rapid, yet exceptionally even throughout each dose. Weight
    loss design includes increased lipolysis (fat loss) accompanied by elevated mood and high-energy levels in combination with a reduced caloric
    intake and increased metabolic expenditure typical of
    an effective overall weight loss plan.We love inspiring our followers and
    fans, whether it’s on facebook, pinterest or twitter, and we especially like to hear.

    It’s Friday again! How has your weight loss journey been going this last week?

    Great, we hope! The side effects of phentermine, like lack of sleep
    and low mood, as well as pushing yourself to exercise.
    As well as setting the goals you want to achieve and planning out how you will approach
    each of the weight loss stages. Since we know that weight loss is a
    journey (we’ve been there, too!), we have developed a series of tools
    to help you reach your goal! Weight loss tools to lose more
    pounds. Lose weight, Celebrate your success! Phentemine.com has helped thousands of people lose weight!
    We offer FREE weight loss tools and supportive online community!
    Get inspired by the success of our users by checking out our 90+ success stories,
    by people just like YOU! Adipex (sometimes known as Adipex-P) is the most popular brand of phentermine diet
    pills, manufactured by Teva Pharmaceuticals USA.
    It is a phentermine 37.5 mg dosage (the strongest) that you can buy
    in capsule or tablet form. Adipex is not cheap phentermine.
    It is generally considered high quality. Phentermine 37.5 mg
    is a very popular dosage of phentermine. Popular dosages include 15mg, 30 mg and 37.5 mg.
    All manufacturers of phentermine diet pills have their own 37.5 mg dosage.
    Phentermine 37.5 mg tablets or capsules are generally cheap and are manufactured by generic pharmaceutical groups.
    Qsymia (previously known as Qnexa) is a new diet pill combo made
    of phentermine and Topiramate. It was approved by
    the FDA in 2012 and recently launched in September 2012.
    Because it uses phentermine as an appetite suppressant, it is considered as a very effective diet pill but cannot be
    prescribed to everybody. Phen Caps are an alternative to the prescription drug, phentermine.

    Phen Caps give similar benefits as phentermine, such as supressing appetite,
    stimulating your metabolism and increasing energy,
    without the side effects. Phen Caps do not contain phentermine, making them available without a
    prescription to anyone.