A Lightweight External Memory BWT Construction Algorithm

Welcome to the home page of bwte: the first tool for the direct construction of the BWT of large files in external memory. Direct construction means that the BWT is built directly, without first computing the suffix array. bwte accesses the disk data only by sequential scans so it takes full advantage of the modern caching architectures. In addition, it stores all files (input, output, and intermediate) in compressed form. As a result, for real world inputs its total space usage is less than the size of the input file uncompressed. Read the full story in this draft (the final version will appear in the proceeding of the LATIN 2010 conference).

The bwte algorithm is part of the bwtdisk library. Using the bwtdisk library you can compute large BWT's from within your application and, more importantly, you can extend the bwte tool adding the support for ad-hoc compression formats.

The source code and the documentation of the bwtdisk library is available under GNU General Public License Version 3. It has been fully tested only under Linux, but it should work on any platform supporting GCC. You are encouraged to download it and report any problem.

Note that bwte/ bwtdisk computes the BWT of the input text T reversed. This is a feature not a bug! With this BWT inversion is easier since now the standard inversion procedure generates the text T from the beginning to end. Also, the backward search procedure of the FM-index and variants becomes a forward search: to search for "pattern" you first search for the occurrences of "p", "pa", "pat", and so on.