Source code and dataset for the paper A Compact Index for Order-Preserving Pattern Matching, Data compression Conference 2017, pp 72-81 and the Technical Report Compressing and Indexing Stock Market Data by Gianni Decaroli, Travis Gagie, and Giovanni Manzini.
download and install the sdsl-lite library
extract the smi.zip file in an empty directory
edit the first two lines of the Makefile with the location of the sdsl-lite lib and include directory. Hint: look at the Make.helper file in the sdsl-lite directory. Change any other compiler option if you know what are you doing.
(recommended) store the test files in a subdirectory (eg data
)
to build an index for the file data/hal
with block size 32 and window size 8 using the lsk
encoder type:
make index_strict BASE=data/hal BSIZE=32 WSIZE=8 ENC=lsk
This will build the index: a compressed suffix array of the order component data/hal.8.yo.32.csa
and the compressed delta component data/hal.lsk.8.32.ez
to check the index by retrieving the indexed text type:
make index_scheck BASE=data/hal BSIZE=32 WSIZE=8 ENC=lsk
This will build the order/delta components (if not already available) and also decompress them and check that the decompressed file (with extension .ezi
) matches the original data/hal
file. This command will also compress the input file using gzip
and xz
but this step can be safely omitted.
Recall window size must be at least 3 and at most 255. Available encoders are: lsk
lsd
and dlt
as described in the technical report.
To check the correctness of the index search procedure:
Extract 100 patterns of lenght 20 from data/hal
python3 patextract.py data/hal -p 100 -l 20 > halpatterns
The values in the pattern file are in plain ascii so you can look at them or create your own by hand
Search with the scan (aka bruteforce) algorithm
make bruteforce.x
bruteforce.x -s data/hal halpatterns
The output will report the number of matches and the running time:
102 OP matches found. Elapsed 0.058258 seconds 0.58258 millisecXpat
If you provide a filename as a third argument the location of the matches will be written (in plain ascii) to that file using one line for each pattern. Eg:
bruteforce.x -s data/hal halpatterns bf.out
Search with smsearch using the stock market index:
smsearch.32.x data/hal 8 lsd halpatterns
Note that the block size is part of the executable name, while window size and encoder are parameters, together with the name of the pattern file. The name of the original file data/hal
is only used to generate the name of the .csa
and .dz
files forming the index: the actual file data/hal
is not used by the search algorithm. The output will be something like
Found 102 Ph2 112 Ph1 507 Elapsed 0.007583 seconds
which means 102 occurrences were found, there were 112 candidates after Phase 2 ad 507 after Phase 1. These values are the number of candidates summed over all patterns in the pattern file. The overall running time was 0.007583 seconds (not including the time to load the index and the patterns). If you provide a filename as a fourth argument the location of the matches will be written (in plain ascii) to that file using one line for each pattern. Eg:
smsearch.32.x data/hal 8 lsk halpatterns sm.out
You can check that bruteforce’s and smsearch’s output agree:
cmp bf.out sm.out
The commands make search_scheck
and make search_stime
automate the task of comparing the smsearch
and bruteforce
algorithms. See Makefile
for details.
We have developed and tested a version of the algorithm in which no two values are considered equal. This version is called smi
in the technical report where the index called smi
in the DCC paper is named ssmi
(strict smi, sorry for the change of naming). The commands for creating, checking, and testing this non-strict index are:
make index
make index_check
make search_check
make search_time
see again the Makefile
for details.
In the papers we used the following files for the experiments.
Name | Values | Description |
---|---|---|
ibm | 2,167,147 | ibm stock tick data from Sep. 2011 to Feb. 2016 downloaded from http://www.finam.ru/analysis/profile041CA00007/default.asp |
prices | 31,559,990 | daily, hourly, and 5min US stock prices downloaded from http://stooq.com/db/h/ |
tmax | 15,476,000 | max daily temperature from 424 US stations over 100 years downloaded from http://cdiac.ornl.gov/ftp/us_recordtemps/sta424/ |
ecg | 20,140,000 | 22 hours and 23 minutes of ECG data downloaded from http://www.cs.ucr.edu/~eamonn/UCRsuite.html |
rwalk | 50,000,000 | random walk with integer steps in the range [-20,20] |
rand | 50,000,000 | random integers in the range [-20,20] |
The above test files are availabe in this archive. All values are stored as 32 bit signed integers in little endian format, hence file sizes are 4 times the number of values.