Gap algorithm for BWT/LCP merging

Copyright 2017 Lavinia Egidi, Giovanni Manzini

Installation

gap1, gap2, gap4

Tools for merging the BWTs, and optionally the LCP arrays, of a collection of strings using the Gap algorithm.

The BWT/LCP can be in individual files or concatenated in a pair of .bwt/.lcp files together with a .len file giving the length of each sequence. Lengths in the .len file take 4 bytes (32 bit unsigned int’s little endian), BWT symbols take one byte; each LCP entries take 1, 2, or 4 bytes according to the executable (gapK uses K bytes for each LCP entry). The final merged LCP file uses the same number of bytes per entry.

Note that all LCP files have an extension K.lcp (K=1,2,4) indicating the size of a LCP entry, and executable gapK only works for files with K.lcp extension (this is why there are 3 of them).

Sample usage

gap is only a merging algorithm so it cannot (for now) build BWT/LCP arrays from the plain documents. To provide the BWT/LCP arrays to be merged we use the gsais algorithm.

Assume folder pacbio contains the files d.N d.N.len for N = 0 .. 6. Each file d.N contains 100,000 million sequences; d.N.len contains the lengths of the sequences in 32-bit little endian format (they have indeed size 400,000). To compute the BWT and LCP of the first 100 sequences of each file type:

for j in {0..6}; do gsa/gsais -g2 -Lb pacbio/d."$j" 100; done

The above command generates the files d.N.bwt and d.N.2.lcp containing the multi-string (generalized) BWT and LCP of the four groups of 100 sequences. The extension 2.lcp means that LCP values were written using 2 bytes (we used the option -g2 because all sequences are shorter than 65536).

To build the multi-string BWT of all the 700 sequences type:

gap2 -l -s50 pacbio/d 7

This will create two files d.out.7.bwt and d.out.7.2.lcp containing the merged BWT and LCP arrays.

Checking the result

We can check the result comparing it with the one provided by gsais alone. Type

for j in {0..6}; do gsa/gsais -XL pacbio/d."$j" 100; done
cat pacbio/d.?.100.cat > d700
cat pacbio/d.?.100.len > d700.len

At this point d700 contains the 700 sequences considered above concatenated together, and d700.len contains their lengths. To compute the BWT and LCP with gsais, and compare it with gap output, type

gsa/gsais -g2 -bL d700 700
cmp d700.bwt pacbio/d.out.7.bwt
cmp d700.2.lcp pacbio/d.out.7.2.lcp

Comparing sizes you will notice that both gsais and gap add an extra symbol (with value 0x00) for each sequence, so the size of the multi-string BWT is equal to the sum of the sequences’ sizes plus the number of sequences.

Collections

Name SizeGBs Alpha Max DocLen Ave DocLen Max LCP Ave LCP Download Link
pacbio 6.24 5 40212 9567.43 1055 17.99 .tar.xz
human 7.60 6 103 102.00 102 27.53 .tar.xz
itwiki 4.01 210 553975 4302.84 93537 61.02 .tar.xz
proteins 6.11 26 35991 410.22 25065 100.60 .tar.xz

pacbio

Seven files each containing 100,000 sequences, from a Pacbio Drosofila dataset.

illumina

Four files each containing 20,000,000 sequences, from an Illumina Human dataset

itwiki

Five files containing 200,000 documents from itwiki dump of February 2017; file itwiki-20170201-pages-articles.xml. Documents were obtained splitting at the occurrences of the <page> and </page> tags.

proteins

Four files each containing 4,000,000 sequences, from the Uniprot Knowledge base release 2017_04, file uniprot_trembl.fasta.gz