A Lightweight Suffix Array and BWT Construction Algorithm


(Update: June 2009) The algorithm described here is no longer the state-of-the-art for Suffix Array and BWT construction. As an alternative I currently recommend libdivsufsort by Yuta Mori, which is fast, lightweight, and easy to use. For a complete review of the currently available algorithms see A taxonomy of suffix array construction algorithms by S. Puglisi, W. Smyth, and A. Turpin. (End Update)

Welcome to the home page of the deep-shallow suffix array and BWT construction algorithm. The construction of the suffix array is equivalent to the problem of lexicographically sorting all the suffixes of the input string. Our algorithm is based on a new approach called deep-shallow sorting: we use a ``shallow'' sorter for the suffixes with a small common prefix, and a ``deep'' sorter for the suffixes with a large common prefix.

The suffix array has many applications. One of the most important is the computation of the Burrows-Wheeler Transform (BWT). The BWT is widely used in data compression and is the basis of bzip2 the current standard for general purpose lossless compression.

All the known algorithms for building the suffix array either require a large amount of space or are inefficient when the input string contains many repeated substrings. Our algorithm was designed to overcome this unpleasant dichotomy. Our algorithm is lightweight in the sense that it uses a very small amount of space---in addition to the space required by the suffix array itself. Moreover, extensive experiments with inputs of size up to 110MB show that our algorithm is fast, even when the input contains many repetitions.

The deep-shallow suffix array construction algorithm has been developed by Giovanni Manzini (Università del Piemonte Orientale) and Paolo Ferragina (Università di Pisa). An extended abstract describing the algorithm and reporting some preliminary results has been published in the proceedings of the of the 10th European Symposium on Algorithms (ESA '02). A more complete journal version appeared on Algorithmica, Vol. 40 (2004), pagg. 33-50.

The source code of the lightweight suffix-sorting/BWT-construction algorithm is available under GNU General Public License Version 2 or Mozilla Public License Version 1.1 (at your choice, see README file). It has been fully tested only under Linux, but it should work on any platform supporting GCC. You are encouraged to download it and report any problem to the authors.

(Update: July 2004) The archive now includes the LCP array construction algorithms described in the paper Two Space Saving Tricks for Linear Time LCP Array Computation Proc. SWAT '04, Springer-Verlag LNCS n. 3111 and in this Technical Report. More documentation on these LCP algorithms will be included soon. (End Update)

The material below this paragraph is made available for research purposes only. Copyright and all rights therein are maintained by the authors or by other copyright holders. It is understood that all persons accessing this information will adhere to the appropriate copyright rules.
Anyone interested in testing his/her own algorithms on our data set is encouraged to download the collection of files used in our experiments. Also available are the source files of the suffix array construction algorithms (by Jesper Larsson, Kunihiko Sadakane and by Julian Seward) tested in our papers. Sorry, at the moment there is no documentation for these algorithms: look at the source code or type the executable name without arguments for a minimal help.


Viewable With Any Browser Valid HTML 3.2!