DNA compression corpus

A corpus for DNA compression algorithms

This page contains the collection of DNA sequences used for testing the compression algorithms described in "A Simple and Fast DNA compression algorithm", by Giovanni Manzini and Marcella Rastero (Software: Practice and Experience, Vol. 34, 1397-1411). Anyone interested in developing a DNA compression algorithm is encouraged to download these sequences and use them for testing. The DNA compression algorithms described in the paper are available under the GNU GPL. Email me if you are interested.

The corpus consists of DNA sequences belonging to four different organisms: yeast (Saccharomyces cerevisiae), mouse (Mus musculus), arabidopsis (Arabidopsis thaliana), and human. We have selected the largest, the smallest, and the median chromosome of each organism, together with the X and Y chromosomes (which are usually more compressible than the autosomes) and the mitochondrial DNA of yeast. We have also included in the corpus a collection of "historical" sequences that have been used for testing by almost every paper devoted to DNA compression.

Each sequence contains only the characters a, c, g, t. To save space and bandwidth we have stored four bases in one byte and compressed the result with gzip; these compressed files have the extension .gz4. To get the original sequence back you must uncompress the .gz4 file with gunzip and feed the result to the program dnau provided in this archive. For those using *nix, the archive also contains the script gunzip4, which handles the decompression in one step.
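To make the "four bases in one byte" idea concrete, here is a minimal Python sketch of 2-bit packing. It is only an illustration: the mapping a=0, c=1, g=2, t=3, the bit order, and the way the sequence length is passed to unpack are all assumptions of this sketch, and the actual on-disk layout read by dnau may differ. Use dnau (or gunzip4) to decode the real files.

```python
# Illustrative 2-bit packing of a DNA string, four bases per byte.
# ASSUMPTION: the code table and bit order below are hypothetical and are
# not claimed to match the format produced or read by dnau.
CODE = {"a": 0, "c": 1, "g": 2, "t": 3}
BASES = "acgt"


def pack(seq: str) -> bytes:
    """Pack a string over {a,c,g,t} into bytes, first base in the high bits."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for ch in chunk:
            byte = (byte << 2) | CODE[ch]
        # A final chunk shorter than 4 bases is left-aligned and zero-padded.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, n: int) -> str:
    """Recover the first n bases; n must be stored separately from the data."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:n])
```

Because a byte always holds four slots, the packed data alone cannot tell whether trailing zero bits are padding or real "a" bases, which is why this sketch takes the original length as a separate argument.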

That's all. The sequences are in the directories yeast, mouse, athaliana, human, and historical. Note that the md5sums refer to the uncompressed sequences. Note also that some browsers automatically gunzip the files as you download them: in this case you only have to process the downloaded files with dnau.
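The two checks mentioned above can be sketched in Python as follows. The first helper looks at the two gzip magic bytes to tell whether a downloaded file is still gzip-compressed or was already gunzipped by the browser; the second computes the MD5 digest of a file so it can be compared with the published md5sum of the uncompressed sequence. The function names are hypothetical, chosen for this example only.

```python
import hashlib

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def is_gzipped(path: str) -> bool:
    """Return True if the file at path begins with the gzip magic bytes."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()
```

If is_gzipped returns False for a downloaded .gz4 file, the browser has probably already gunzipped it and only the dnau step remains. Run md5_of_file on the final output of dnau, not on the .gz4 file, since the md5sums refer to the uncompressed sequences.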