Repository for Rephetio-algorithm repurposing on the Wikidata knowledge base.
To run all of the notebooks in this repo, you must first download historical Wikidata dumps and load them into a local Blazegraph instance. These dumps, while generally under 50 GB in file size, expand to upwards of 500 GB once loaded into Blazegraph.
Recent Wikidata entity dumps are available for download from wikimedia.org. Older Wikidata entity dumps can be found at archive.org.
The Wikimedia repository wikidata-query-rdf contains all the software required to process a Wikidata entity dump and load the resulting data into Blazegraph. Please read its getting-started page for help using this software.
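Once the dump is loaded, it can help to confirm the local SPARQL endpoint is serving data before running any notebooks. The snippet below is a minimal sketch, not part of this repo; the endpoint URL is the assumed wikidata-query-rdf default, so adjust it to your Blazegraph configuration.

```python
# Minimal sketch to sanity-check the local Blazegraph load via SPARQL.
# The endpoint URL is an assumed default; change it to match your setup.
import requests

ENDPOINT = "http://localhost:9999/bigdata/namespace/wdq/sparql"  # assumed default

query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"
resp = requests.get(
    ENDPOINT,
    params={"query": query},
    headers={"Accept": "application/sparql-results+json"},
    timeout=60,
)
resp.raise_for_status()

bindings = resp.json()["results"]["bindings"]
print(f"Endpoint reachable; returned {len(bindings)} sample triples.")
```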
An Anaconda environment file, environment.yml, is provided. In addition to this environment, the hetnet_ml repo must be on your Python path for feature extraction to work properly.
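One way to put hetnet_ml on the Python path is to append its clone location at the top of each notebook. This is a minimal sketch; the clone path below is hypothetical, so point it at wherever you checked out hetnet_ml (alternatively, set the PYTHONPATH environment variable).

```python
# Sketch for making hetnet_ml importable from a notebook.
# The clone location below is hypothetical; adjust to your checkout.
import sys
from pathlib import Path

HETNET_ML_PATH = Path.home() / "repos" / "hetnet_ml"  # hypothetical path
sys.path.insert(0, str(HETNET_ML_PATH))
```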
Notebooks in the 1_code folder are numbered and should be run in numerical order.
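If you prefer to execute the whole pipeline non-interactively, a sketch like the following can run the notebooks in sequence. It assumes the notebooks live in 1_code/ and that jupyter nbconvert is available in the active environment; note that the lexicographic sort only matches numerical order while the numbering stays single-digit.

```python
# Sketch: execute the numbered notebooks in order with nbconvert.
# Assumes notebooks are in 1_code/ and `jupyter nbconvert` is installed.
import subprocess
from pathlib import Path

for nb in sorted(Path("1_code").glob("*.ipynb")):
    print(f"Running {nb} ...")
    subprocess.run(
        ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace", str(nb)],
        check=True,
    )
```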
The machine-learning portion of this repo was run on a workstation with 32 cores and 378 GB of RAM.
If your machine has fewer cores, be sure to edit the n_jobs
parameter in 1_code/n_fold_CV_training.py
(line 93) and
1_code/full_dataset_training.py
(line 80).
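A hedged illustration of a portable choice for that value is shown below; the scripts above hard-code n_jobs, so this is only a suggestion for what to put in those assignments, not code from the repo.

```python
# Hypothetical way to pick n_jobs based on the local machine:
# use all available cores minus one, never less than one.
import os

n_jobs = max(1, (os.cpu_count() or 1) - 1)  # leave one core free
print(f"Using n_jobs = {n_jobs}")
```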