Compressing and Indexing Stock Market Data

Source code and dataset for the paper A Compact Index for Order-Preserving Pattern Matching, Data compression Conference 2017, pp 72-81 and the Technical Report Compressing and Indexing Stock Market Data by Gianni Decaroli, Travis Gagie, and Giovanni Manzini.

Building an index

Recall window size must be at least 3 and at most 255. Available encoders are: lsk lsd and dlt as described in the technical report.

Testing the search algorithms

To check the correctness of the index search procedure:

The commands make search_scheck and make search_stime automate the task of comparing the smsearch and bruteforce algorithms. See Makefile for details.

Non-strict index

We have developed and tested a version of the algorithm in which no two values are considered equal. This version is called smi in the technical report where the index called smi in the DCC paper is named ssmi (strict smi, sorry for the change of naming). The commands for creating, checking, and testing this non-strict index are:

make index
make index_check
make search_check
make search_time

see again the Makefile for details.

Test files

In the papers we used the following files for the experiments.

Name Values Description
ibm 2,167,147 ibm stock tick data from Sep. 2011 to Feb. 2016 downloaded from http://www.finam.ru/analysis/profile041CA00007/default.asp
prices 31,559,990 daily, hourly, and 5min US stock prices downloaded from http://stooq.com/db/h/
tmax 15,476,000 max daily temperature from 424 US stations over 100 years downloaded from http://cdiac.ornl.gov/ftp/us_recordtemps/sta424/
ecg 20,140,000 22 hours and 23 minutes of ECG data downloaded from http://www.cs.ucr.edu/~eamonn/UCRsuite.html
rwalk 50,000,000 random walk with integer steps in the range [-20,20]
rand 50,000,000 random integers in the range [-20,20]

The above test files are availabe in this archive. All values are stored as 32 bit signed integers in little endian format, hence file sizes are 4 times the number of values.