Synerise at KDD Cup 2021: Predicting papers’ subject areas in a heterogeneous academic graph

Implementation of our solution to the KDD Cup 2021 challenge. The goal of the challenge was to predict the subject areas of papers situated in the heterogeneous graph of the MAG240M-LSC dataset.

Practical Relevance: The volume of scientific publications has been increasing exponentially, doubling every 12 years. Currently, the subject areas of arXiv papers are manually determined by the papers’ authors and arXiv moderators. An accurate automatic predictor of papers’ subject categories not only reduces the significant burden of manual labeling, but can also be used to classify the vast number of non-arXiv papers, thereby allowing better search and organization of academic papers.

Graph: A heterogeneous academic graph is constructed from 121M English-language academic papers extracted from MAG. These papers are written by 122M author entities, who are affiliated with 26K institutes. Among these papers, there are 1.3B citation links captured by MAG. Each paper is associated with its natural-language title, and most papers’ abstracts are also available. We concatenate the title and abstract with a period and pass the result to a RoBERTa sentence encoder [2,3], generating a 768-dimensional vector for each paper node. Among the 121M paper nodes, approximately 1.4M are arXiv papers annotated with 153 arXiv subject areas, e.g., cs.LG (Machine Learning).
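
A hedged sketch of how such a title-plus-abstract embedding could be produced with the sentence-transformers library is shown below; the model name is an assumption for illustration only, not necessarily the encoder used to build MAG240M-LSC, whose 768-dimensional RoBERTa features come precomputed with the dataset.

    # Minimal sketch: embed one paper from its title and abstract.
    # The model name is an assumption; the dataset already ships these features.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-distilroberta-v1")  # a RoBERTa-based sentence encoder with 768-dim output

    title = "An example paper title"
    abstract = "An example abstract describing the paper."
    text = f"{title}. {abstract}"  # title and abstract concatenated by a period

    embedding = model.encode(text)  # numpy array of shape (768,)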

Requirements

  • Python 3.8
  • Install requirements: pip install -r requirements.txt
  • GPU for training
  • SSD drive for fast reading of memmap files
  • 400 GB RAM
  • Download a binary Cleora release, then add execution permission so it can be run. Refer to the Cleora GitHub page for more details about Cleora.

Getting Started

Steps 1-4 can be run simultaneously

  1. Data preparation. The MAG240M-LSC dataset will be automatically downloaded, if it does not already exist, to the path denoted in root.py. This takes a while (several hours to a day) on the first run, so please be patient. After decompression, the dataset size is around 202GB. Edit root.py accordingly if you want to store the dataset on a custom hard drive or in a custom folder. This script creates preprocessed data that is then used during training:

    • data/edges_paper_cites_paper_sorted_by_second_column.npy - numpy array with paper->cites->paper edges sorted by the cited paper. Used for fast retrieval of all papers that cite a selected paper (see the lookup sketch after this step).
    • data/edge_author_paper_sorted_by_paper.npy - numpy array with author->writes->paper edges sorted by paper. Used for fast retrieval of all authors of a paper.
    • data/paper2thesameauthors_papers - pickled dict that maps each paper to all other papers written by the same authors
    • data/edge_author_paper_small - author->paper edges restricted to authors with labelled papers (for faster searching during training)
    python preprocessing.py
    

    Estimated time of preprocessing, without downloading data: 60 minutes
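
    A hedged sketch of the lookup these sorted edge files enable is shown below; the two-column edge layout and the helper function are assumptions for illustration and are not part of the repository.

        import numpy as np

        # Assumed layout: one edge per row, columns = (citing paper, cited paper),
        # with rows sorted by the second column by preprocessing.py.
        edges = np.load("data/edges_paper_cites_paper_sorted_by_second_column.npy", mmap_mode="r")

        def papers_citing(paper_id):
            # Binary search on the sorted "cited paper" column returns all citing papers.
            lo = np.searchsorted(edges[:, 1], paper_id, side="left")
            hi = np.searchsorted(edges[:, 1], paper_id, side="right")
            return edges[lo:hi, 0]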

  2. Compute paper sketches from BERT features using EMDE

    python compute_paper_sketches.py
    

    It creates a memmap file with paper sketches: data/codes_bert_memmap (a reading sketch follows below)

    Estimated time of computing paper sketches: 105 minutes
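
    A hedged sketch of reading the sketch memmap without loading it into RAM; the dtype and sketch width are placeholders and must match what compute_paper_sketches.py actually writes.

        import numpy as np

        NUM_PAPERS = 121_000_000   # placeholder: use the exact paper count from the dataset
        SKETCH_DIM = 1024          # placeholder: the real EMDE sketch width may differ
        DTYPE = np.float16         # placeholder: must match the dtype written by the script

        codes = np.memmap("data/codes_bert_memmap", dtype=DTYPE, mode="r",
                          shape=(NUM_PAPERS, SKETCH_DIM))
        batch = np.asarray(codes[:32])  # only the requested rows are read from the SSD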

  3. Compute institution sketches using Cleora and EMDE

    python compute_institutions_sketches.py
    

    It creates:

    • data/inst_codes.npy - memmap file with institution sketches
    • data/paper2inst - pickled dict that contains all institutions for a given paper
    • data/codes_inst2id - pickled dict that maps an institution to its index in data/inst_codes.npy (a combined lookup sketch follows after this step)

    Estimated time of computing institution sketches: 55 minutes
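
    A hedged sketch of combining the three files above to fetch institution sketches for one paper; the file names come from this README, while the loading details and the dict value types are assumptions.

        import pickle
        import numpy as np

        with open("data/paper2inst", "rb") as f:
            paper2inst = pickle.load(f)      # assumed: paper id -> iterable of institution ids
        with open("data/codes_inst2id", "rb") as f:
            inst2id = pickle.load(f)         # assumed: institution id -> row index in inst_codes.npy

        inst_codes = np.load("data/inst_codes.npy", mmap_mode="r")

        paper_id = 0                         # placeholder paper id
        rows = [inst2id[inst] for inst in paper2inst.get(paper_id, [])]
        paper_inst_sketches = inst_codes[rows]   # sketches of all institutions linked to this paper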

  4. Create adjacency matrix of the graph with paper and author nodes

    python create_graph.py
    

    It creates the data/adj.pt file that represents the sparse adjacency matrix (see the sketch after this step)

    Estimated time: 60 minutes
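
    A hedged sketch, on a toy graph, of the kind of sparse adjacency matrix create_graph.py produces; the exact node ordering, edge weighting, and tensor layout stored in data/adj.pt are assumptions.

        import torch

        num_nodes = 5                         # toy graph: paper and author nodes share one index space
        src = torch.tensor([0, 1, 2, 3])      # edge sources
        dst = torch.tensor([4, 4, 3, 0])      # edge destinations

        indices = torch.stack([src, dst])
        values = torch.ones(src.numel())
        adj = torch.sparse_coo_tensor(indices, values, (num_nodes, num_nodes)).coalesce()

        torch.save(adj, "adj_toy.pt")         # the real script writes data/adj.pt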

  5. Train the model for 2 epochs

    python train.py
    

    The final model was trained with 60 ensembles: python train.py --num-ensembles 60

    Predictions on the test set for each ensemble are saved as data/ensemble_{ensemble_id}

    Training time for two epochs: 40 minutes per ensemble on a Tesla V100 GPU

    Inference time for all test data: 7 minutes

  6. Merge ensemble predictions and save the test submission to the file y_pred_mag240m.npz

    python inference.py
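
    Conceptually, the merge averages per-ensemble class scores and keeps the argmax; the sketch below illustrates that idea under the assumption that each data/ensemble_{ensemble_id} file holds a numpy array of test-set scores (the actual logic lives in inference.py).

        import numpy as np

        NUM_ENSEMBLES = 60
        preds = [np.load(f"data/ensemble_{i}", allow_pickle=True) for i in range(NUM_ENSEMBLES)]

        mean_scores = np.mean(preds, axis=0)                  # average scores over the ensembles
        y_pred = mean_scores.argmax(axis=1).astype(np.int16)  # predicted subject area per test paper

        np.savez_compressed("y_pred_mag240m.npz", y_pred=y_pred)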