quickSpan very fast parallel implementation for gSpan algorithm in data mining with many tunable options for different domains
quickSpan has gained 35x speedup in comparison to Yan's original implementation with multi-threading, even 1-6 times faster with a single thread. It also reduces more than 100 folds memory usage, which make it feasible to be adopted on personal computers.
quickSpan is fast:
- Adopt task parallel programming.
- Incorporate C++11 hastable and hashset.
- Use contiguous memory storage.
- Use partial pruning.
quickSpan is memory efficient:
- Incorporate C++11 emplace_back method.
- Reconstruct a graph with frequent edges and nodes before mining.
- Reuse memory to store temporal data.
quickSpan is correct:
- We ran experiment for
extern/data/Compound_422
andextern/data/Chemical_340
with minimal support from 0.1 to 0.9, and the results are exactly the same as Yan's gSpan-64.
- g++ or icc with C++11 support.
- glog:
sudo apt-get install libgoogle-glog-*
or install from source. - gflags:
sudo apt-get install libgflags-*
or install from source. - cmake:
sudo apt-get install cmake
or install from source. - openmp environment.
mkdir build && cd build
cmake ..
make
Run an example:
./build/gspan -input_file extern/data/gs/Chemical_340 -support 0.2
Arguments help:
./build/gspan -help
Multi-thread support:
Using the option: -threads N to run with N threads.
Alternatively by environment variable: export OMP_NUM_THREADS=
Memory safety limits:
export OCLENI_MALLOC_MODE=1
export OCLENI_MAX_MEMORY=<maximum memory for quickspan to use in KB>
The project features a Dockerfile allowing the construction of a docker image to be used in the unlikely event that the code does not compile on your machine. You need to have docker up and running on your machine to use this approach.
The standard docker command works:
docker build -t quickspan .
Let's assume that you have your data in /path/to/data (and want to write your results there). For testing, you can the full path to the external/data folder of this repository. First start the docker:
docker run -u `id -u`:`id -g` -v /path/to/data:/data -it quickspan
You will be shown a bash prompt. Within that prompt, you can use gspan on your data that you will find in /data. For instance you can type:
cd /data/
gspan -input_file Chemical_340 -pattern -support 0.5 -output_file Chemical_340_subgraph.gs
All experiments were run on a server with two 14-core processors running at 2GHz, allowing up to 56 parallel threads, and with 128 GB of RAM.
The comparison runs over the accompanying data sets sourced from various academic works. The implementations below are:
- gBolt - original gBolt implementation
- GLP - Graph Learning Package/mathlab library
- quickSpan - the implementation here
- quiskSpan1 - implementation here with "-threads 1" for single threading forced
- SFS - the implementation used by the following two papers:
- B. Bringmann, A. Zimmermann, L. D. Raedt, and S. Nijssen. Don't be afraid of simpler patterns. In PKDD, pages 55-66, 2006.
- B. Bringmann and S. Nijssen. What is frequent in a single graph? In PAKDD, pages 858-863, 2008.
In the following table, ⌛ denotes a timeout and 💥 denotes a crash.
Data set name | Support | gBolt | GLP | quickSpan | quickSpan1 | SFS |
---|---|---|---|---|---|---|
BrainNet ADHD | 0.10 | 0.27 | 1.77 | 0.11 | 0.13 | ⌛ |
0.15 | 0.27 | 0.14 | 0.10 | 0.03 | ⌛ | |
0.20 | 0.28 | 0.07 | 0.08 | 0.02 | ⌛ | |
0.50 | 0.13 | 0.07 | 0.07 | 0.02 | ⌛ | |
Hyper/Impulsive | 0.10 | 14.12 | ⌛ | 11.48 | 117.54 | ⌛ |
0.15 | 0.67 | 54.06 | 0.67 | 1.44 | ⌛ | |
0.20 | 0.39 | 2.47 | 0.19 | 0.44 | ⌛ | |
0.50 | 0.19 | 0.11 | 0.07 | 0.06 | ⌛ | |
Gender | 0.10 | 0.34 | 11.16 | 0.16 | 0.52 | ⌛ |
0.15 | 0.30 | 1.35 | 0.10 | 0.12 | ⌛ | |
0.20 | 0.30 | 0.10 | 0.11 | 0.05 | ⌛ | |
0.50 | 0.16 | 0.05 | 0.08 | 0.03 | ⌛ | |
Chemical 340 | 0.10 | 0.37 | 1.06 | 0.18 | 0.21 | 0.23 |
0.15 | 0.34 | 0.68 | 0.14 | 0.12 | 0.14 | |
0.20 | 0.39 | 0.37 | 0.13 | 0.09 | 0.13 | |
0.50 | 0.28 | 0.18 | 0.10 | 0.06 | 0.06 | |
Mutagen | 0.10 | 0.40 | 0.70 | 0.23 | 0.28 | ⌛ |
0.15 | 0.40 | 0.40 | 0.16 | 0.24 | ⌛ | |
0.20 | 0.37 | 0.32 | 0.15 | 0.19 | ⌛ | |
0.50 | 0.22 | 0.13 | 0.12 | 0.15 | ⌛ | |
NCI AIDS | 0.10 | 5.73 | 90.96 | 5.75 | 15.15 | ⌛ |
0.15 | 5.21 | 48.11 | 5.36 | 11.39 | ⌛ | |
0.20 | 4.84 | 34.45 | 4.78 | 9.75 | ⌛ | |
0.50 | 4.19 | 4.33 | 4.19 | 5.16 | ⌛ | |
AIDS Active | 0.10 | 0.59 | 18.62 | 0.36 | 1.20 | 2.53 |
0.15 | 0.45 | 4.70 | 0.28 | 0.58 | 0.96 | |
0.20 | 0.43 | 1.52 | 0.25 | 0.26 | 0.53 | |
0.50 | 0.32 | 0.29 | 0.18 | 0.11 | 0.12 | |
AIDS + Cancer | 0.10 | 3.34 | 57.28 | 3.32 | 8.37 | ⌛ |
0.15 | 3.10 | 29.51 | 3.05 | 6.32 | ⌛ | |
0.20 | 2.80 | 20.66 | 2.86 | 5.26 | ⌛ | |
0.50 | 2.55 | 6.78 | 2.47 | 3.13 | ⌛ | |
Cancer | 0.10 | 19.05 | 363.91 | 20.44 | 89.04 | ⌛ |
0.15 | 15.11 | 170.93 | 15.50 | 51.63 | ⌛ | |
0.20 | 13.60 | 108.72 | 14.04 | 38.19 | ⌛ | |
0.50 | 10.42 | 29.90 | 10.77 | 14.99 | ⌛ | |
Social DBLP | 0.10 | 0.84 | 1.51 | 0.80 | 0.77 | ⌛ |
0.15 | 0.84 | 1.48 | 0.85 | 0.78 | ⌛ | |
0.20 | 0.85 | 1.51 | 0.82 | 0.75 | ⌛ | |
0.50 | 0.82 | 1.50 | 0.89 | 0.75 | ⌛ | |
0.10 | 1.61 | 3.01 | 1.61 | 1.62 | ⌛ | |
0.15 | 1.66 | 3.06 | 1.61 | 1.52 | ⌛ | |
0.20 | 1.63 | 3.06 | 1.60 | 1.49 | ⌛ | |
0.50 | 1.70 | 3.04 | 1.78 | 1.48 | ⌛ | |
Pathological 2 | 1.00 | 💥 | 💥 | 84.54(24.16) | 81.21(23.38) | ⌛ |
Pathological 20 | 0.10 | 💥 | 💥 | 86.51(291.51) | 86.11(289.36) | ⌛ |
0.15 | 💥 | 💥 | 88.98(291.98) | 84.91(288.22) | ⌛ | |
0.20 | 💥 | 💥 | 88.08(293.15) | 85.86(288.75) | ⌛ | |
0.50 | 💥 | 💥 | 87.22(292.51) | 86.58(289.31) | ⌛ | |
1.00 | 💥 | ⌛ | 88.11(292.68) | 84.24(295.06) | ⌛ |