Source code and dataset for the paper A Compact Index for Order-Preserving Pattern Matching, Data compression Conference 2017, pp 72-81 and the Technical Report Compressing and Indexing Stock Market Data by Gianni Decaroli, Travis Gagie, and Giovanni Manzini.
download and install the sdsl-lite library
extract the smi.zip file in an empty directory
edit the first two lines of the Makefile with the location of the sdsl-lite lib and include directory. Hint: look at the Make.helper file in the sdsl-lite directory. Change any other compiler option if you know what are you doing.
(recommended) store the test files in a subdirectory (eg
to build an index for the file
data/hal with block size 32 and window size 8 using the
lsk encoder type:
make index_strict BASE=data/hal BSIZE=32 WSIZE=8 ENC=lsk
This will build the index: a compressed suffix array of the order component
data/hal.8.yo.32.csa and the compressed delta component
to check the index by retrieving the indexed text type:
make index_scheck BASE=data/hal BSIZE=32 WSIZE=8 ENC=lsk
This will build the order/delta components (if not already available) and also decompress them and check that the decompressed file (with extension
.ezi) matches the original
data/hal file. This command will also compress the input file using
xz but this step can be safely omitted.
Recall window size must be at least 3 and at most 255. Available encoders are:
dlt as described in the technical report.
To check the correctness of the index search procedure:
Extract 100 patterns of lenght 20 from
python3 patextract.py data/hal -p 100 -l 20 > halpatterns
The values in the pattern file are in plain ascii so you can look at them or create your own by hand
Search with the scan (aka bruteforce) algorithm
make bruteforce.x bruteforce.x -s data/hal halpatterns
The output will report the number of matches and the running time:
102 OP matches found. Elapsed 0.058258 seconds 0.58258 millisecXpat
If you provide a filename as a third argument the location of the matches will be written (in plain ascii) to that file using one line for each pattern. Eg:
bruteforce.x -s data/hal halpatterns bf.out
Search with smsearch using the stock market index:
smsearch.32.x data/hal 8 lsd halpatterns
Note that the block size is part of the executable name, while window size and encoder are parameters, together with the name of the pattern file. The name of the original file
data/hal is only used to generate the name of the
.dz files forming the index: the actual file
data/hal is not used by the search algorithm. The output will be something like
Found 102 Ph2 112 Ph1 507 Elapsed 0.007583 seconds
which means 102 occurrences were found, there were 112 candidates after Phase 2 ad 507 after Phase 1. These values are the number of candidates summed over all patterns in the pattern file. The overall running time was 0.007583 seconds (not including the time to load the index and the patterns). If you provide a filename as a fourth argument the location of the matches will be written (in plain ascii) to that file using one line for each pattern. Eg:
smsearch.32.x data/hal 8 lsk halpatterns sm.out
You can check that bruteforce’s and smsearch’s output agree:
cmp bf.out sm.out
make search_scheck and
make search_stime automate the task of comparing the
bruteforce algorithms. See
Makefile for details.
We have developed and tested a version of the algorithm in which no two values are considered equal. This version is called
smi in the technical report where the index called
smi in the DCC paper is named
ssmi (strict smi, sorry for the change of naming). The commands for creating, checking, and testing this non-strict index are:
make index make index_check make search_check make search_time
see again the
Makefile for details.
In the papers we used the following files for the experiments.
|ibm||2,167,147||ibm stock tick data from Sep. 2011 to Feb. 2016 downloaded from http://www.finam.ru/analysis/profile041CA00007/default.asp|
|prices||31,559,990||daily, hourly, and 5min US stock prices downloaded from http://stooq.com/db/h/|
|tmax||15,476,000||max daily temperature from 424 US stations over 100 years downloaded from http://cdiac.ornl.gov/ftp/us_recordtemps/sta424/|
|ecg||20,140,000||22 hours and 23 minutes of ECG data downloaded from http://www.cs.ucr.edu/~eamonn/UCRsuite.html|
|rwalk||50,000,000||random walk with integer steps in the range [-20,20]|
|rand||50,000,000||random integers in the range [-20,20]|
The above test files are availabe in this archive. All values are stored as 32 bit signed integers in little endian format, hence file sizes are 4 times the number of values.