artwork by Tony Savon
  Full-text index in Minute space


Keywords: Compressed full-text index, Information Retrieval on compressed texts, Pattern-matching on compressed texts, Lossless fingerprint, SGML-XML search engine, Burrows-Wheeler transform, Suffix Array, Suffix Tree.



What is it?

We have devised an algorithm which combines compression and indexing. Our algorithm can be used as a compressor, just like gzip, pkzip, etc. But unlike these compressors our algorithm produces a compressed file that can be used as an information retrieval index. We call this index the FM-index since it is a Full-text index which occupies Minute space. Using the FM-index one can count and locate the occurrences in the original file of a pattern string by looking at only a small portion of the compressed file. The time required for these operations is a few milliseconds even for files of several megabytes (see below for the results of extensive experiments). Since the FM-index is a full-text index one can search any pattern, not necessarily a word or a word-prefix. Furthermore, the FM-index is a sort of self-indexing tool because it encapsulates a compressed version of the original file.



How does it work?

The novelty of the FM-index resides in the careful combination of the Burrows-Wheeler compression algorithm and the suffix array data structure to obtain a sort of compressed suffix array. The resulting index is opportunistic since it takes advantage of the compressibility of the input data for decreasing the space occupancy. This compression is achieved at no significant asymptotic slowdown in the query time with respect to the fastest known indices (i.e. the suffix tree and the suffix array).

In [FOCS '00] it has been proven that for a text T of size n the FM-index uses space proportional to the entropy of T. In mathematical terms, the space occupancy is O(n Hk(T)) + o(n) bits, where Hk(T) is the k-th order empirical entropy of T (the bound holds for any fixed k).

The FM-index supports two basic operations:

In practice both operations decompress just a tiny portion of the compressed file (typically a few kilobytes), and take a few milliseconds even for files of several megabytes.

The structure of the FM-index consists of two parts: a core containing the basic data structure, and an auxiliary block of information which is needed only to support the locate operation. The FM-index allows to trade space occupancy for search time. This means that by increasing the size of the core and/or of the auxiliary information we can speed up the count and locate operations.

Experimental Results

In the following we report the results of some experiments designed to test the performances of the FM-index (the results of a much more extensive set of experiments are reported in [SODA'01]). We ran all the experiments on a machine equipped with a 600Mhz Pentium III, 1 Gb RAM and a 9.1 Gb SCSI hard disk. The operating system was Gnu\Linux Debian 2.2. The table below shows the size and content of the files used in our experiments.

Name 

Size 

Content 

bible

4,047,392 

King James Bible 

e.coli

4,638,690 

DNA sequence 

world192

2,473,400 

1992 CIA world fact book 

cantrbry

2,821,120 

Canterbury corpus 

jdk13

69,728,899 

html and java sources 

ap90

67,108,864 

SGML-tagged text 



The files bible, e.coli, world192 are the large files of the Canterbury Corpus. The file cantrbry is a tar archive containing the small files of the Canterbury Corpus. This tar archive includes both binary files and text files in various formats (plain text, html, c, and lisp code). The file jdk13 consists of the concatenation of the .java and .html files taken from the Jdk 1.3 documentation. The file ap90 consists of the first 64Mb of the concatenation of the files ap90MMDD.txt from the TREC collection. We have also used truncated versions of the file ap90: in the following we write ap90-N to denote the file consisting of the first N Megabytes of ap90.


We have already pointed out that the FM-index allows to trade space occupancy for search time . In the following we report the results of extensive experiments on two instances of the FM-index:

The next table compares the compression ratio and the speed of the FM-index with those of gzip (with option -9 for maximum compression) and bzip2 (version 1.0.1). The compression ratio is given as a percentage (i.e. 25% compression means that the size of the compressed file is one quarter of the original file size). The compression and decompression speeds are given in microseconds per input byte.


File: 

bible 

e.coli 

world 

cantrbry 

jdk13 

ap90-8 

ap90-16 

ap90-32 

ap90 

FM-index
(tiny)

Compression ratio 

21.09

26.92 

19.62 

24.02 

5.87 

25.06 

23.93 

22.96 

22.14 

Construction time

2.24 

2.19 

2.26 

2.21 

3.43 

2.49 

2.64 

2.81 

3.04 

Decompression time

0.45 

0.49 

0.44 

0.38 

0.42 

0.48 

0.50 

0.52 

0.57 

FM-index
(fat)

Compression ratio

32.28 

33.61 

33.23 

46.10 

17.02 

38.69 

37.43 

36.36 

35.49 

Construction time 

2.28 

2.21

2.33 

2.39 

3.50 

2.59 

2.74 

2.88 

3.10 

Decompression time 

0.46 

0.51 

0.46 

0.41 

0.43 

0.51 

0.52 

0.55 

0.59 

bzip2

Compression ratio

20.90 

26.97 

19.79 

20.24 

7.03 

27.38 

27.33 

27.32 

27.36 

Compression time

1.16 

1.28 

1.17 

0.89 

1.52 

1.16 

1.17 

1.16 

1.16 

Decompression time 

0.39 

0.48 

0.39 

0.31 

0.28 

0.43 

0.43 

0.43 

0.43 

gzip

Compression ratio 

29.07 

28.00 

29.17 

26.10

10.79 

37.21 

37.28 

37.31 

37.35 

Compression time 

1.74 

10.48 

0.87 

5.04 

0.39 

0.96 

0.97 

0.97 

0.97 

Decompression time 

0.07 

0.07 

0.06 

0.06 

0.04 

0.07 

0.07 

0.07 

0.07 

We can see that the tiny FM-index occupies significantly less space than gzip, and compresses better than bzip2 on all files except bible and cantrbry. This compression improvement is paid in terms of speed, the difference being more noticeable for larger files. The fat FM-index achieves a compression ratio similar to gzip.


The next table shows the compression ratio and search time of the FM-index compared with those of the search tools grep and zgrep. Each test consisted in searching 100 randomly chosen English words of length between 4 and 8 (for e.coli we used random DNA sequences of length between 8 and 15). The table reports the average time (in milliseconds) of a single search operation.


File 

bible 

e.coli 

world 

cantrbry 

jdk13 

ap90-8 

ap90-16 

ap90-32 

ap90 

FM-index
(tiny)

Compr. ratio 

21.09 

26.92 

19.62 

24.02 

5.87 

25.06 

23.93 

22.96 

22.14 

Ave. count time

4.3 

12.3 

4.7 

8.1 

3.2 

6.4 

6.1 

5.9 

5.6 

FM-index
(fat)

Compr. ratio 

32.28 

33.61 

33.23 

46.10 

17.02 

38.69 

37.43 

36.36 

35.49 

Ave. count time

1.0 

2.3 

1.5 

2.7 

1.3 

1.7 

1.7 

1.7 

1.6 

Ave. locate time

7.5 

7.6 

9.4 

7.1 

21.7 

5.4 

5.3 

5.5 

5.3 

zgrep

Compr. ratio 

29.07 

28.00 

29.17 

26.10 

10.79 

37.21 

37.28 

37.31 

37.35 

Ave. search time

343.8 

33,642.6 

213.4 

516.9 

3,824.8 

744.1 

1,468.8 

2,939.0 

5,854.7 

grep

Compr. ratio 

100.00 

100.00 

100.00 

100.00 

100.00 

100.00 

100.00 

100.00 

100.00 

Ave. search time

40.9 

96.8 

26.0 

32.3 

497.6 

79.3 

154.5 

302.4 

597.2 

Note that, unlike the FM-index, the algorithms zgrep and grep fully scan the compressed (zgrep) or uncompressed (grep) file. Hence their search performance grows linearly with the text size independently of the number of retrieved occurrences. We point out also that the significant time gap between zgrep and grep is due to the inefficiency of the Unix pipe used by zgrep.

Summing up

The FM-index is compact (its space occupancy is close to the one achieved by the best known compressors), it is fast in counting the number of pattern occurrences, and the cost of their retrieval is reasonable when they are few (i.e., in case of a selective query). These features make the FM-index interesting in various contexts, for example:

Demo

By clicking here you can enjoy a simple demo of the FM-index on two large files: 100 MB of AP-news in SGML format, and the Java 1.3 documentation. You can perform a search for an arbitrary pattern string, not necessarily a word or a word-prefix. The key feature you should check is the fast counting of the number of pattern occurrences.

Executables

By clicking here you can dowload a tar file containing an implementation of the FM-index consisting of a software for building the compressed index, and a software for querying it. Please read carefully the Copyright.

References

[FOCS'00] Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In Proc. 41st IEEE Symposium on Foundations of Computer Science (FOCS), Redondo Beach, CA, (USA), 2000. [abstract] [pdf]

[SODA '01] Paolo Ferragina and Giovanni Manzini. An experimental study of an opportunistic index . In Proc. 12th ACM-SIAM Symposium on Discrete Algorithms (SODA), Washington DC, (USA), 2001. [abstract] [pdf]

[DDJ '01] Mark Nelson editor, web page of Dr. Dobbs Journal, Data Compression section, 23th March 2001. [page copy]


©2000 FM-index by P. Ferragina and G. Manzini