Compressed Full-Text Search Engine

^{artwork by Tony Savon}
Full-text index in Minute space

Keywords: Compressed full-text index, Information Retrieval on compressed texts, Pattern-matching on compressed texts, Lossless fingerprint, SGML-XML search engine, Burrows-Wheeler transform, Suffix Array, Suffix Tree.

What is it?

We have devised an algorithm which combines compression and indexing. Our algorithm can be used as a compressor, just like gzip, pkzip, etc. But unlike these compressors our algorithm produces a compressed file that can be used as an information retrieval index. We call this index the FM-index since it is a Full-text index which occupies Minute space. Using the FM-index one can count and locate the occurrences in the original file of a pattern string by looking at only a small portion of the compressed file. The time required for these operations is a few milliseconds even for files of several megabytes (see below for the results of extensive experiments). Since the FM-index is a full-text index one can search any pattern, not necessarily a word or a word-prefix. Furthermore, the FM-index is a sort of self-indexing tool because it encapsulates a compressed version of the original file.

How does it work?

The novelty of the FM-index resides in the careful combination of the Burrows-Wheeler compression algorithm and the suffix array data structure to obtain a sort of compressed suffix array. The resulting index is opportunistic since it takes advantage of the compressibility of the input data for decreasing the space occupancy. This compression is achieved at no significant asymptotic slowdown in the query time with respect to the fastest known indices (i.e. the suffix tree and the suffix array).

In [FOCS '00] it has been proven that for a text T of size n the FM-index uses space proportional to the entropy of T. In mathematical terms, the space occupancy is O(n H_k(T)) + o(n) bits, where H_k(T) is the k-th order empirical entropy of T (the bound holds for any fixed k).

The FM-index supports two basic operations:

Operation count determines the number of occurrences in the text T of an arbitrary pattern P in time proportional to the length of P. Note that the running time of count does not depend on the length of the input text.
Operation locate retrieves the position in T of a pattern occurrence in O(log^e n) time, where e is a positive constant. e must be chosen when we build the index, but it can be arbitrarily small.

In practice both operations decompress just a tiny portion of the compressed file (typically a few kilobytes), and take a few milliseconds even for files of several megabytes.

The structure of the FM-index consists of two parts: a core containing the basic data structure, and an auxiliary block of information which is needed only to support the locate operation. The FM-index allows to trade space occupancy for search time. This means that by increasing the size of the core and/or of the auxiliary information we can speed up the count and locate operations.

Experimental Results

In the following we report the results of some experiments designed to test the performances of the FM-index (the results of a much more extensive set of experiments are reported in [SODA'01]). We ran all the experiments on a machine equipped with a 600Mhz Pentium III, 1 Gb RAM and a 9.1 Gb SCSI hard disk. The operating system was Gnu\Linux Debian 2.2. The table below shows the size and content of the files used in our experiments.

Name	Size	Content
bible	4,047,392	King James Bible
e.coli	4,638,690	DNA sequence
world192	2,473,400	1992 CIA world fact book
cantrbry	2,821,120	Canterbury corpus
jdk13	69,728,899	html and java sources
ap90	67,108,864	SGML-tagged text

The files bible, e.coli, world192 are the large files of the Canterbury Corpus. The file cantrbry is a tar archive containing the small files of the Canterbury Corpus. This tar archive includes both binary files and text files in various formats (plain text, html, c, and lisp code). The file jdk13 consists of the concatenation of the .java and .html files taken from the Jdk 1.3 documentation. The file ap90 consists of the first 64Mb of the concatenation of the files ap90MMDD.txt from the TREC collection. We have also used truncated versions of the file ap90: in the following we write ap90-N to denote the file consisting of the first N Megabytes of ap90.

We have already pointed out that the FM-index allows to trade space occupancy for search time . In the following we report the results of extensive experiments on two instances of the FM-index:

The tiny FM-index which is designed to achieve high compression and support very fast count operations.
The fat FM-index which is designed to support both the count and locate operations, at the cost of a slightly larger space occupancy.

The next table compares the compression ratio and the speed of the FM-index with those of gzip (with option -9 for maximum compression) and bzip2 (version 1.0.1). The compression ratio is given as a percentage (i.e. 25% compression means that the size of the compressed file is one quarter of the original file size). The compression and decompression speeds are given in microseconds per input byte.

File:		bible	e.coli	world	cantrbry	jdk13	ap90-8	ap90-16	ap90-32	ap90
FM-index (tiny)	Compression ratio	21.09	26.92	19.62	24.02	5.87	25.06	23.93	22.96	22.14
	Construction time	2.24	2.19	2.26	2.21	3.43	2.49	2.64	2.81	3.04
	Decompression time	0.45	0.49	0.44	0.38	0.42	0.48	0.50	0.52	0.57
FM-index (fat)	Compression ratio	32.28	33.61	33.23	46.10	17.02	38.69	37.43	36.36	35.49
	Construction time	2.28	2.21	2.33	2.39	3.50	2.59	2.74	2.88	3.10
	Decompression time	0.46	0.51	0.46	0.41	0.43	0.51	0.52	0.55	0.59
bzip2	Compression ratio	20.90	26.97	19.79	20.24	7.03	27.38	27.33	27.32	27.36
	Compression time	1.16	1.28	1.17	0.89	1.52	1.16	1.17	1.16	1.16
	Decompression time	0.39	0.48	0.39	0.31	0.28	0.43	0.43	0.43	0.43
gzip	Compression ratio	29.07	28.00	29.17	26.10	10.79	37.21	37.28	37.31	37.35
	Compression time	1.74	10.48	0.87	5.04	0.39	0.96	0.97	0.97	0.97
	Decompression time	0.07	0.07	0.06	0.06	0.04	0.07	0.07	0.07	0.07

We can see that the tiny FM-index occupies significantly less space than gzip, and compresses better than bzip2 on all files except bible and cantrbry. This compression improvement is paid in terms of speed, the difference being more noticeable for larger files. The fat FM-index achieves a compression ratio similar to gzip.

The next table shows the compression ratio and search time of the FM-index compared with those of the search tools grep and zgrep. Each test consisted in searching 100 randomly chosen English words of length between 4 and 8 (for e.coli we used random DNA sequences of length between 8 and 15). The table reports the average time (in milliseconds) of a single search operation.

File		bible	e.coli	world	cantrbry	jdk13	ap90-8	ap90-16	ap90-32	ap90
FM-index (tiny)	Compr. ratio	21.09	26.92	19.62	24.02	5.87	25.06	23.93	22.96	22.14
FM-index (tiny)	Ave. count time	4.3	12.3	4.7	8.1	3.2	6.4	6.1	5.9	5.6
FM-index (fat)	Compr. ratio	32.28	33.61	33.23	46.10	17.02	38.69	37.43	36.36	35.49
	Ave. count time	1.0	2.3	1.5	2.7	1.3	1.7	1.7	1.7	1.6
	Ave. locate time	7.5	7.6	9.4	7.1	21.7	5.4	5.3	5.5	5.3
zgrep	Compr. ratio	29.07	28.00	29.17	26.10	10.79	37.21	37.28	37.31	37.35
zgrep	Ave. search time	343.8	33,642.6	213.4	516.9	3,824.8	744.1	1,468.8	2,939.0	5,854.7
grep	Compr. ratio	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00
grep	Ave. search time	40.9	96.8	26.0	32.3	497.6	79.3	154.5	302.4	597.2

Note that, unlike the FM-index, the algorithms zgrep and grep fully scan the compressed (zgrep) or uncompressed (grep) file. Hence their search performance grows linearly with the text size independently of the number of retrieved occurrences. We point out also that the significant time gap between zgrep and grep is due to the inefficiency of the Unix pipe used by zgrep.

Summing up

The FM-index is compact (its space occupancy is close to the one achieved by the best known compressors), it is fast in counting the number of pattern occurrences, and the cost of their retrieval is reasonable when they are few (i.e., in case of a selective query). These features make the FM-index interesting in various contexts, for example:

In CD-ROM production the space issue is the primary concern. Hence the compactness of the FM-index may result useful in squeezing in small space both the text and a full-text index for it. This can be done in a flexible way since the FM-index allows to trade space occupancy for search time.
In spell checkers and virus detectors we are only interested in retrieving few occurrences, possibly a single one, from a dictionary of words. The literature about dictionary implementations is huge, and basic tools are tries, hash-tables and ternary search trees (TST). None of them, however, offers any kind of compression. Some preliminary experiments comparing the TST with the FM-index have shown that the former is up to 200 times faster in searching but occupies about 15 times more space (see [SODA '01] for details).
The FM-index can be used as a basic block in more complex indexing tools. In [SODA '01] it is shown how to plug the FM-index into the Glimpse tool thus achieving both sublinear space occupancy and sublinear search-time complexity in the worst case. We point out that inverted lists allow to achieve only the second goal, whereas Glimpse achieves both goals but under some restrictive conditions (click here for details on this subject). We are currently implementing an extended version of this text retrieval system aimed at managing XML and SGML text collections. A very preliminary prototype of this text-retrieval system has been built upon a set of texts from the Italian Literature (©CIBIT).

Demo

By clicking here you can enjoy a simple demo of the FM-index on two large files: 100 MB of AP-news in SGML format, and the Java 1.3 documentation. You can perform a search for an arbitrary pattern string, not necessarily a word or a word-prefix. The key feature you should check is the fast counting of the number of pattern occurrences.

Executables

By clicking here you can dowload a tar file containing an implementation of the FM-index consisting of a software for building the compressed index, and a software for querying it. Please read carefully the Copyright.

References

[FOCS'00] Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In Proc. 41st IEEE Symposium on Foundations of Computer Science (FOCS), Redondo Beach, CA, (USA), 2000. [abstract] [pdf]

[SODA '01] Paolo Ferragina and Giovanni Manzini. An experimental study of an opportunistic index . In Proc. 12th ACM-SIAM Symposium on Discrete Algorithms (SODA), Washington DC, (USA), 2001. [abstract] [pdf]

[DDJ '01] Mark Nelson editor, web page of Dr. Dobbs Journal, Data Compression section, 23th March 2001. [page copy]