--------------------

Lecture Notes in Computer Science
Volume 2461, pages 698-710
© Springer Verlag, 2002

Engineering a Lightweight Suffix Array Construction Algorithm

Giovanni Manzini (Università del Piemonte Orientale and IIT-CNR, Pisa)
Paolo Ferragina (Università di Pisa)

Abstract. We consider the problem of computing the suffix array of a text T[1,n]. This problem consists in sorting the suffixes of T in lexicographic order. The suffix array consists of n integers in the range [1,n]. This means that in theory it uses O(nlogn) bits of storage. However, in most applications the size of the text is smaller than 232 and it is customary to store each integer in a four byte word; this yields a total space occupancy of 4n bytes. For what concerns the cost of constructing the suffix array, the theoretically best algorithms run in Theta(n) time. These algorithms work by first building the suffix tree and then obtaining the sorted suffixes via an in-order traversal of the tree. However, suffix tree construction algorithms are both complex and space consuming since they occupy at least 15n bytes of working space. This makes their use impractical even for moderately large texts. For this reason, suffix arrays are usually built using algorithms which run in O(nlog n) time but have a smaller space occupancy. Among these algorithms the current ``leader'' is the qsufsort algorithm by Larsson and Sadakane. qsufsort uses 8n bytes and despite the O(nlog n) worst case bound it is faster than the algorithms based on suffix tree construction. In this paper we describe and extensively test a new lightweight suffix sorting algorithm. Our algorithm uses 5n + cn bytes, where c is a user tunable parameter (in our tests c was at most 0.03). Our algorithm is faster than qsufsort for all files except for the one with the largest LCP. For this file our algorithm is 15% slower than qsufsort on an Athlon machine and 50% slower on a Pentium. We believe that this slowdown on a single file---with an unusually large LCP---is an acceptable price to pay in exchange for the reduced space occupancy offered by our algorithm.

The source code of our lightweight suffix sorting algorithm is available under GPL.

Retrieve full article in PDF (219655 bytes)

Retrieve full article in compressed PostScript (108053 bytes)

--------------------