Proc. 41st IEEE Symposium on Foundations of Computer Science
(FOCS '00), pag. 390-398.
Abstract. In this paper we address the issue of compressing
and indexing data. We devise a data structure whose space occupancy
is a function of the entropy of the underlying data set. We call the
data structure opportunistic since its space occupancy is
decreased when the input is compressible and this space reduction is
achieved at no significant slowdown in the query performance. More
precisely, its space occupancy is optimal in an information-content
sense because a text T[1,u] is stored using
O(Hk(T)) + o(1) bits per input symbol in the worst
case, where Hk(T) is the k-th order empirical
entropy of T (the bound holds for any fixed k). Given
an arbitrary string P[1,p], the opportunistic data structure
allows to search for the occ occurrences of P in
T in O(p + occ (logeps u)) time (for any
fixed eps >0). If data are uncompressible we achieve the best
space bound currently known [Grossi-Vitter, STOC '00]; on compressible
data our solution improves the succinct suffix array of
[Grossi-Vitter, STOC '00] and the classical suffix tree and suffix
array data structures either in space or in query time or both.
We also study our opportunistic data structure in a dynamic setting
and devise a variant achieving effective search and update time
bounds. Finally, we show how to plug our opportunistic data structure
into the Glimpse tool. The result is an indexing tool
which achieves sublinear space and sublinear query time complexity.
The data structure described in this paper has been implemented and extensively tested. The results are reported in [Ferragina Manzini, SODA '01].
Retrieve a preliminary version in PDF (265456 bytes)