Copyright 2017 Lavinia Egidi, Giovanni Manzini
make
(creates the executables gap1, gap2, and gap4) cd gsa; make; cd ..
(creates the gsais executable).tar.xz
collectionsTools for merging the BWTs, and optionally the LCP arrays, of a collection of strings using the Gap algorithm.
The BWT/LCP can be in individual files or concatenated in a pair of .bwt/.lcp files together with a .len file giving the length of each sequence. Lengths in the .len file take 4 bytes (32 bit unsigned int’s little endian), BWT symbols take one byte; each LCP entries take 1, 2, or 4 bytes according to the executable (gapK uses K bytes for each LCP entry). The final merged LCP file uses the same number of bytes per entry.
Note that all LCP files have an extension K.lcp (K=1,2,4) indicating the size of a LCP entry, and executable gapK only works for files with K.lcp extension (this is why there are 3 of them).
gap is only a merging algorithm so it cannot (for now) build BWT/LCP arrays from the plain documents. To provide the BWT/LCP arrays to be merged we use the gsais algorithm.
Assume folder pacbio contains the files d.N d.N.len for N = 0 .. 6. Each file d.N contains 100,000 million sequences; d.N.len contains the lengths of the sequences in 32-bit little endian format (they have indeed size 400,000). To compute the BWT and LCP of the first 100 sequences of each file type:
for j in {0..6}; do gsa/gsais -g2 -Lb pacbio/d."$j" 100; done
The above command generates the files d.N.bwt and d.N.2.lcp containing the multi-string (generalized) BWT and LCP of the four groups of 100 sequences. The extension 2.lcp
means that LCP values were written using 2 bytes (we used the option -g2 because all sequences are shorter than 65536).
To build the multi-string BWT of all the 700 sequences type:
gap2 -l -s50 pacbio/d 7
This will create two files d.out.7.bwt and d.out.7.2.lcp containing the merged BWT and LCP arrays.
We can check the result comparing it with the one provided by gsais alone. Type
for j in {0..6}; do gsa/gsais -XL pacbio/d."$j" 100; done
cat pacbio/d.?.100.cat > d700
cat pacbio/d.?.100.len > d700.len
At this point d700 contains the 700 sequences considered above concatenated together, and d700.len contains their lengths. To compute the BWT and LCP with gsais, and compare it with gap output, type
gsa/gsais -g2 -bL d700 700
cmp d700.bwt pacbio/d.out.7.bwt
cmp d700.2.lcp pacbio/d.out.7.2.lcp
Comparing sizes you will notice that both gsais and gap add an extra symbol (with value 0x00) for each sequence, so the size of the multi-string BWT is equal to the sum of the sequences’ sizes plus the number of sequences.
Name | SizeGBs | Alpha | Max DocLen | Ave DocLen | Max LCP | Ave LCP | Download Link |
---|---|---|---|---|---|---|---|
pacbio | 6.24 | 5 | 40212 | 9567.43 | 1055 | 17.99 | .tar.xz |
human | 7.60 | 6 | 103 | 102.00 | 102 | 27.53 | .tar.xz |
itwiki | 4.01 | 210 | 553975 | 4302.84 | 93537 | 61.02 | .tar.xz |
proteins | 6.11 | 26 | 35991 | 410.22 | 25065 | 100.60 | .tar.xz |
Seven files each containing 100,000 sequences, from a Pacbio Drosofila dataset.
Four files each containing 20,000,000 sequences, from an Illumina Human dataset
Five files containing 200,000 documents from itwiki dump of February 2017; file itwiki-20170201-pages-articles.xml
. Documents were obtained splitting at the occurrences of the <page>
and </page>
tags.
Four files each containing 4,000,000 sequences, from the Uniprot Knowledge base release 2017_04, file uniprot_trembl.fasta.gz