IMC-CNR
Technical Report number B4-00-04
To appear in Proc. 12th ACM-SIAM Symposium on Discrete Algorithms (SODA '01).
Abstract.
The size of electronic data is currently growing at a faster rate than
computer memory and disk storage capacities. For this reason
compression appears always as an attractive choice, if not mandatory.
However space overhead is not the only resource to be optimized when
managing large data collections; in fact data turn out to be useful only when
properly indexed to support search operations that efficiently extract
the user-requested information.
Approaches to combine compression and indexing techniques are nowadays
receiving more and more attention. A first step towards the design of
a compressed full-text index achieving guaranteed performance in the
worst case has been recently done in
[Ferragina-Manzini FOCS 2000].
The novelty of that index resides in the careful combination of the
compression algorithm proposed by Burrows and Wheeler with the suffix
array data structure. The index is opportunistic in that,
although no assumption on a particular fixed distribution is made, it
takes advantage of the compressibility of the input data by decreasing
the space occupancy at no significant asymptotic slowdown in the query
performance.
In this paper we present an implementation of this index and perform
an extensive set of experiments on various text collections. These
experiments show that our index is compact (its space occupancy is
close to the one achieved by the best known compressors), it is fast
in counting the number of occurrences of a pattern, and the cost of
their retrieval is reasonable when they are few (i.e. in case of a
selective query).
Retrieve technical report in PDF (225785 bytes)