Official code of CALLEE: Recovering Call Graphs for Binaries with Transfer and Contrastive Learning.
For ease of use, we have made some changes to the original implementation in the paper.
Status: We have substituted the doc2vec model with transformers and released a new dataset.
We have decided to deprecate the old dataset since it was collected several years ago on older versions of Firefox and the Linux kernel.
Tested on Ubuntu 18.04 with
- Python3 (python-magic, gensim, numpy, torch, tqdm, capstone)
- IDA Pro 7.6
- CUDA 10.2
NOTE: This is a single-thread demo, consider multiprocessing for production or batch processing
a. Slice target binary with IDA
python3 run-slice.py -i /path/to/binary -o /path/to/slices -n <num_workers> --ida_path /path/to/idat64
The script invokes IDA Pro to analyze the binary and perform slicing for indirect callsites and candidate callees.
b. Tokenize the slices
python3 preprocess.py -i /path/to/slices -o /path/to/tokenized_slices
The script tokenizes assembly instructions of slices.
c. Generate embeddings with doc2vec
python3 store_emb.py -i /path/to/tokenized_slices -o /path/to/embeddings --doc2vec_model /path/to/doc2vec_model
The script transforms slices into embeddings with pretrained doc2vec model.
d. Predict with the Siamese network
python3 pred.py -i /path/to/embeddings
The script outputs scores for each (indirect callsite, candidate callee).
Here is a qemu tcg plugin we've modified to collect indirect calls on x86_64: ibresolver