Compressing and Indexing Stock Market Data

Source code and dataset for the paper A Compact Index for Order-Preserving Pattern Matching, Data compression Conference 2017, pp 72-81 and the Technical Report Compressing and Indexing Stock Market Data by Gianni Decaroli, Travis Gagie, and Giovanni Manzini.

Building an index

download and install the sdsl-lite library
extract the smi.zip file in an empty directory
edit the first two lines of the Makefile with the location of the sdsl-lite lib and include directory. Hint: look at the Make.helper file in the sdsl-lite directory. Change any other compiler option if you know what are you doing.
(recommended) store the test files in a subdirectory (eg data)
to build an index for the file data/hal with block size 32 and window size 8 using the lsk encoder type:
```
make index_strict BASE=data/hal BSIZE=32 WSIZE=8 ENC=lsk
```
This will build the index: a compressed suffix array of the order component data/hal.8.yo.32.csa and the compressed delta component data/hal.lsk.8.32.ez
to check the index by retrieving the indexed text type:
```
make index_scheck BASE=data/hal BSIZE=32 WSIZE=8 ENC=lsk
```
This will build the order/delta components (if not already available) and also decompress them and check that the decompressed file (with extension .ezi) matches the original data/hal file. This command will also compress the input file using gzip and xz but this step can be safely omitted.

Recall window size must be at least 3 and at most 255. Available encoders are: lsk lsd and dlt as described in the technical report.

Testing the search algorithms

To check the correctness of the index search procedure:

Extract 100 patterns of lenght 20 from data/hal
```
 python3 patextract.py data/hal -p 100 -l 20 > halpatterns
```
The values in the pattern file are in plain ascii so you can look at them or create your own by hand
Search with the scan (aka bruteforce) algorithm
```
 make bruteforce.x 
 bruteforce.x -s data/hal halpatterns
```
The output will report the number of matches and the running time:
```
 102 OP matches found. Elapsed 0.058258 seconds 0.58258 millisecXpat
```
If you provide a filename as a third argument the location of the matches will be written (in plain ascii) to that file using one line for each pattern. Eg:
```
 bruteforce.x -s data/hal halpatterns bf.out
```
Search with smsearch using the stock market index:
```
 smsearch.32.x data/hal 8 lsd halpatterns
```
Note that the block size is part of the executable name, while window size and encoder are parameters, together with the name of the pattern file. The name of the original file data/hal is only used to generate the name of the .csa and .dz files forming the index: the actual file data/hal is not used by the search algorithm. The output will be something like
```
 Found 102 Ph2 112 Ph1 507 Elapsed 0.007583 seconds
```
which means 102 occurrences were found, there were 112 candidates after Phase 2 ad 507 after Phase 1. These values are the number of candidates summed over all patterns in the pattern file. The overall running time was 0.007583 seconds (not including the time to load the index and the patterns). If you provide a filename as a fourth argument the location of the matches will be written (in plain ascii) to that file using one line for each pattern. Eg:
```
 smsearch.32.x data/hal 8 lsk halpatterns sm.out
```
You can check that bruteforce’s and smsearch’s output agree:
```
 cmp bf.out sm.out
```

The commands make search_scheck and make search_stime automate the task of comparing the smsearch and bruteforce algorithms. See Makefile for details.

Non-strict index

We have developed and tested a version of the algorithm in which no two values are considered equal. This version is called smi in the technical report where the index called smi in the DCC paper is named ssmi (strict smi, sorry for the change of naming). The commands for creating, checking, and testing this non-strict index are:

make index
make index_check
make search_check
make search_time

see again the Makefile for details.

Test files

In the papers we used the following files for the experiments.

Name	Values	Description
ibm	2,167,147	ibm stock tick data from Sep. 2011 to Feb. 2016 downloaded from http://www.finam.ru/analysis/profile041CA00007/default.asp
prices	31,559,990	daily, hourly, and 5min US stock prices downloaded from http://stooq.com/db/h/
tmax	15,476,000	max daily temperature from 424 US stations over 100 years downloaded from http://cdiac.ornl.gov/ftp/us_recordtemps/sta424/
ecg	20,140,000	22 hours and 23 minutes of ECG data downloaded from http://www.cs.ucr.edu/~eamonn/UCRsuite.html
rwalk	50,000,000	random walk with integer steps in the range [-20,20]
rand	50,000,000	random integers in the range [-20,20]

The above test files are availabe in this archive. All values are stored as 32 bit signed integers in little endian format, hence file sizes are 4 times the number of values.