Lecture Notes in Computer Science

Volume 2461, pages 698-710

© Springer Verlag, 2002

## Engineering a Lightweight Suffix Array Construction Algorithm

###
Giovanni Manzini
(Università del Piemonte Orientale and IIT-CNR, Pisa)

Paolo Ferragina
(Università di Pisa)

**Abstract.**
We consider the problem of computing the **suffix array** of a text
*T[1,n]*. This problem consists in sorting the suffixes of *T* in
lexicographic order.
The suffix array consists of *n* integers in the range *[1,n]*. This
means that in theory it uses *O(n*log*n)* bits of storage. However,
in most applications the size of the text is smaller than 2^{32} and
it is customary to store each integer in a four byte word; this yields a total
space occupancy of *4n* bytes. For what concerns the cost of constructing
the suffix array, the theoretically best algorithms run in *Theta(n)*
time. These algorithms work by first building the suffix tree and then
obtaining the sorted suffixes via an in-order traversal of the tree. However,
suffix tree construction algorithms are both complex and space consuming since
they occupy at least *15n* bytes of working space. This makes their use
impractical even for moderately large texts. For this reason, suffix arrays
are usually built using algorithms which run in *O(n*log* n)* time
but have a smaller space occupancy. Among these algorithms the current
``leader'' is the **qsufsort** algorithm by Larsson and
Sadakane. **qsufsort** uses *8n* bytes and despite the *O(n*log*
n)* worst case bound it is faster than the algorithms based on suffix tree
construction.
In this paper we describe and extensively test a new lightweight suffix
sorting algorithm. Our algorithm uses *5n + cn* bytes, where *c* is
a user tunable parameter (in our tests *c* was at most *0.03*). Our
algorithm is faster than **qsufsort** for all files except for the one with
the largest LCP. For this file our algorithm is 15% slower than
**qsufsort** on an Athlon machine and 50% slower on a Pentium. We believe
that this slowdown on a single file---with an unusually large LCP---is an
acceptable price to pay in exchange for the reduced space occupancy offered by
our algorithm.

The source code of our lightweight suffix sorting algorithm
is available under GPL.

Retrieve full article in PDF (219655 bytes)

Retrieve full article in compressed PostScript (108053 bytes)